This tutorial explains how to identify and handle duplicate data with pandas.

Packages

This tutorial uses pandas.

Open a Jupyter Notebook and import the following:


import pandas as pd

Creating the data

For this example, we will create a dataframe that contains several duplicated rows.


# Build a 20-row dataframe with repeated rows; 'A' and 'a' count as different values.
df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']
                  })
df

Identify duplicates

Duplicate in all columns

The method duplicated returns a Boolean series indicating whether each row is a duplicate of another row. The parameter keep controls which occurrences are flagged: 'first' (default) marks every occurrence except the first as True, 'last' marks every occurrence except the last as True, and False marks every occurrence of a duplicated row as True.


dups = df.duplicated()
dups

To see the duplicate rows, use the Boolean series dups to select rows from the original dataframe.


df[dups]
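
To make the effect of keep concrete, here is a minimal sketch on a tiny dataframe. The name small and its contents are made up for illustration and are not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical three-row frame: rows 0 and 2 are identical.
small = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': [1, 2, 1]})

first = small.duplicated()             # keep='first' is the default
last = small.duplicated(keep='last')   # the last occurrence is left unflagged
every = small.duplicated(keep=False)   # every occurrence is flagged

print(first.tolist())  # [False, False, True]
print(last.tolist())   # [True, False, False]
print(every.tolist())  # [True, False, True]
```

Selecting with small[every] would show both copies of the repeated row, which can be handy when you want to compare them side by side.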

Duplicate in selected columns

When the parameter subset is passed a list of columns (in this case, A and B), duplicated returns a Boolean series indicating whether each row is a duplicate based on just those columns. Values in the other columns are ignored for the comparison.


dups = df.duplicated(subset=['A', 'B'])
dups

Next, take a look at the duplicates.


df[dups]
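
As a rough illustration of how subset changes the result, the sketch below uses a small made-up dataframe (small and its values are assumptions, not the tutorial's data) in which two rows agree on A and B but differ in C. Combining subset with keep=False and a sort is one way to view whole duplicate groups together.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 share A and B, but C differs.
small = pd.DataFrame({'A': ['x', 'y', 'x', 'y'],
                      'B': ['p', 'q', 'p', 'r'],
                      'C': [1, 2, 3, 4]})

# No row is a duplicate when all columns are compared.
any_full_dups = bool(small.duplicated().any())

# Flag every row that belongs to a duplicate group on A and B.
mask = small.duplicated(subset=['A', 'B'], keep=False)

# True counts as 1, so the sum is the number of flagged rows.
n_in_groups = int(mask.sum())

# Sort so that members of the same group sit next to each other.
groups = small[mask].sort_values(['A', 'B'])
print(groups)
```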

Delete duplicates

Delete only if all columns are duplicated

The method drop_duplicates returns a new dataframe with the duplicate rows removed; the original dataframe is left unchanged. The parameter keep controls which occurrences survive: 'first' (default) keeps the first occurrence and drops the rest, 'last' keeps the last occurrence and drops the rest, and False drops every row that has a duplicate.


dedup_df = df.drop_duplicates()
dedup_df
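
The following sketch shows the three keep options side by side on a small made-up dataframe (the name small is hypothetical). It also shows that the surviving rows keep their original index labels, so resetting the index afterwards is a common, though optional, follow-up step.

```python
import pandas as pd

# Hypothetical three-row frame: rows 0 and 2 are identical.
small = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': [1, 2, 1]})

kept_first = small.drop_duplicates()             # rows 0 and 1 remain
kept_last = small.drop_duplicates(keep='last')   # rows 1 and 2 remain
kept_none = small.drop_duplicates(keep=False)    # only row 1 remains

# The original index labels are preserved, so renumber if needed.
renumbered = kept_last.reset_index(drop=True)

print(kept_last.index.tolist())   # [1, 2]
print(renumbered.index.tolist())  # [0, 1]
```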

Delete only if specified columns are duplicated

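When the parameter subset is passed a list of columns (in this case, A and B), drop_duplicates returns a dataframe with duplicates removed based on just those columns. Rows that match on A and B are treated as duplicates even if their other columns differ, so the choice of keep determines which values in those other columns survive.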


dedup_df = df.drop_duplicates(subset=['A', 'B'])
dedup_df
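
Because rows that match on the subset can still differ elsewhere, it is worth checking which row is retained. The sketch below uses a made-up dataframe (small is a hypothetical name) and also shows that, under these assumptions, filtering with the negated duplicated mask gives the same result as drop_duplicates.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 share A and B, but C differs.
small = pd.DataFrame({'A': ['x', 'y', 'x'],
                      'B': ['p', 'q', 'p'],
                      'C': [1, 2, 3]})

first = small.drop_duplicates(subset=['A', 'B'])               # keeps C == 1
last = small.drop_duplicates(subset=['A', 'B'], keep='last')   # keeps C == 3

# Equivalent selection using the Boolean mask from duplicated.
via_mask = small[~small.duplicated(subset=['A', 'B'])]

print(first['C'].tolist())     # [1, 2]
print(last['C'].tolist())      # [2, 3]
print(via_mask.equals(first))  # True
```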