How To Drop Duplicates In Pandas

“Deduplicate” literally means to remove duplicate data. In a data set, the whole process of data deduplication is to find and delete duplicate data and finally save only one existing data item.

1. Why We Need Data Deduplication In Data Analytics.

  1. Deleting duplicate data is a common problem in data analysis.
  2. Deduplication not only saves memory space and improves write performance, but also improves the accuracy of the data set, preventing the data set from being affected by duplicate data.

2. Pandas drop_duplicates() Function.

  1. The Pandas DataFrame object provides the function drop_duplicates() for deleting duplicate data.
  2. Below is the drop_duplicates() function’s syntax.
    df.drop_duplicates(subset=['Python','Java','Javascript'],keep='first',inplace=True)
  3. subset: indicates the column name to be deduplicated. The default value is none.
  4. keep: there are three optional parameters: first, last and False. The default value is first, which means that only the first duplicates are retained and other duplicates are deleted, last means that only the last duplicates are retained. False means that all duplicates are deleted.
  5. inplace: boolean parameter. The default value is False, which means that a copy is returned after deleting duplicates. If it is True, it means that duplicates are deleted directly on the original data.
  6. Below is the example, you can see the comments for a detailed explaination.
    def drop_duplicates_example():
        
        # creat the original data source.
        source_data = {
           
           'OS':['Windows', 'Linux', 'MacOS', 'Linux', 'Windows', 'Windows'],
           
           'Coding Language':['Python', 'Java', 'Python', 'Java', 'Python', 'C++'],
           
           'Database':['MySQL', 'MongoDB', 'Oracle', 'MySQL', 'MySQL', 'SQLite']
        }
        
        df = pd.DataFrame(data = source_data)
    
        print('*************** original dataframe object ******************\r')
        print(df)   
        
        
        print('\n *************** The first occurrence of duplicates is retained by default ******************\r')
        df1 = df.drop_duplicates()
        print('df.drop_duplicates()\r', df1)
        
        
        print('\n *************** reset dataframe object index ******************\r')
        print('df1.reset_index(drop=True)\r', df1.reset_index(drop=True))
        
        
        print('\n *************** set keep=False to remove all duplicate data ******************\r')
        df2 = df.drop_duplicates(keep=False)
        print('df.drop_duplicates(keep=False)\r', df2)
        
        
        print('\n *************** drop duplicates according to the specified column label, drop all the duplicate data in OS column ******************\r')
        df3 = df.drop_duplicates(subset=['OS'],keep='first')
        print('df.drop_duplicates(subset=[\'OS\'],keep=\'first\')\r', df3)
        
        
        print('\n *************** drop duplicates according to the multiple specified column labels, drop all the duplicate data in OS & Database column ******************\r')
        df4 = df.drop_duplicates(subset=['OS', 'Database'], keep=False)
        print('df.drop_duplicates(subset=[\'OS\', \'Database\'],keep=\'last\')\r', df4)
    
    if __name__ == '__main__':
        
        drop_duplicates_example()
  7. Below is the above example source code execution result.
    *************** original dataframe object ******************
    
            OS Coding Language Database
    0  Windows          Python    MySQL
    1    Linux            Java  MongoDB
    2    MacOS          Python   Oracle
    3    Linux            Java    MySQL
    4  Windows          Python    MySQL
    5  Windows             C++   SQLite
    
     *************** The first occurrence of duplicates is retained by default ******************
    
    df.drop_duplicates()
             OS Coding Language Database
    0  Windows          Python    MySQL
    1    Linux            Java  MongoDB
    2    MacOS          Python   Oracle
    3    Linux            Java    MySQL
    5  Windows             C++   SQLite
    
     *************** reset dataframe object index ******************
    
    df1.reset_index(drop=True)
             OS Coding Language Database
    0  Windows          Python    MySQL
    1    Linux            Java  MongoDB
    2    MacOS          Python   Oracle
    3    Linux            Java    MySQL
    4  Windows             C++   SQLite
    
     *************** set keep=False to remove all duplicate data ******************
    
    df.drop_duplicates(keep=False)
             OS Coding Language Database
    1    Linux            Java  MongoDB
    2    MacOS          Python   Oracle
    3    Linux            Java    MySQL
    5  Windows             C++   SQLite
    
     *************** drop duplicate according to the specified column label, drop all the duplicate data in OS column ******************
    
    df.drop_duplicates(subset=['OS'],keep='first')
             OS Coding Language Database
    0  Windows          Python    MySQL
    1    Linux            Java  MongoDB
    2    MacOS          Python   Oracle
    
     *************** drop duplicate according to the multiple specified column labels, drop all the duplicate data in OS and Database column ******************
    
    df.drop_duplicates(subset=['OS', 'Database'],keep='last')
             OS Coding Language Database
    1    Linux            Java  MongoDB
    2    MacOS          Python   Oracle
    3    Linux            Java    MySQL
    5  Windows             C++   SQLite

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.