Handling Duplicate Rows in a Pandas Dataframe

Posted on 27th September 2019

One of the annoying things you have to deal with in a large data set is duplicate rows. But this become very easy and simple if you use Pandas. For those of you who are not familiar with Pandas, it is an open source Python library that provides functions and data structure for data analysis. The data structure in Pandas that is used to store tabular data is called a Data Frame. In this article you will learn how to identify and remove duplicate rows from a Pandas data frame.

To use Pandas library in your Python program, you need to import it first.

import pandas as pd

Now we create a dataframe from a dictionary.

data = {"Name": ["Tom", "Jack", "Ella", "Jeff", "Tom"], 
       "Age": [30, 28, 24, 45, 30"],
       "Height":[165, 160, 160, 170, 165],
       "Weight":[70, 60, 60, 82, 70]
        }

df = pd.DataFrame(data)

print(df.head())

The output will be:

NameAgeHeightWeight
0Tom3016570
1Jack2816060
2Ella2416060
3Jeff4517082
4Tom3016570

Finding Duplicate Rows

In the sample dataframe that we have created, you might have noticed that rows 0 and 4 are exactly the same. You can identify such duplicate rows in a Pandas dataframe by calling the duplicated function. The duplicated function returns a Boolean series with value True indicating a duplicate row.

print(df.duplicated())

Output:

0    False
1    False
2    False
3    False
4     True
dtype: bool

By default the first row in a duplicated set is marked as False and all others marked as True. In the above example rows 0 and 4 are duplicates but only row 4 is marked as True. You can change this behaviour by setting the keep parameter to one of the following values.

  • first: First occurance in the duplicate set is marked as False and others marked True. This is the default
  • last: Last occurance among the duplicates is marked as False and others marked True.
  • False: All duplicates are marked as True.

For example to mark all rows in a duplicate set as True:

print(df.duplicated(keep="last"))

Outputs

0     True
1    False
2    False
3    False
4    False
dtype: bool

When finding duplicates instead of matching all columns in a row, you can specify a subset of columns to be checked. For example, to find rows that have the sane values in height and weight columns:

print(df.duplicated(["Height","Weight"]))

Output

0    False
1    False
2     True
3    False
4     True
dtype: bool

Displaying duplicate rows

To display duplicated rows only, you can filter the dataframe like this:

print(df[df.duplicated(keep=False)])

Output:

  Name Age  Height  Weight
0  Tom  30     165      70
4  Tom  30     165      70

Removing Duplicate Rows

You can remove duplicate rows from a Pandas dataframe using the drop_duplicates function. drop_duplicates function returns a dataframe after removing duplicated rows. By default, the first occurance among the duplicates is retained and others removed. You can change this default behavior by setting the keep parameter. The following values are possible.

  • first : First occurance among the duplicates is retained and others removed.
  • last : Keep the last occurance of the duplicated row and remove others in the set.
  • False : Drop all duplicate rows.

Example

df.drop_duplicates()

Output

   Name Age  Height  Weight
0   Tom  30     165      70
1  Jack  28     160      60
2  Ella  24     160      60
3  Jeff  45     170      82
df.drop_duplicates(keep="last")

Output

   Name Age  Height  Weight
1  Jack  28     160      60
2  Ella  24     160      60
3  Jeff  45     170      82
4   Tom  30     165      70

Same as in duplicated method, you can also specify a subset of columns like below.

df.drop_duplicates(["Height","Weight"])

Output

   Name Age  Height  Weight
0   Tom  30     165      70
1  Jack  28     160      60
3  Jeff  45     170      82

Post a comment

Comments

Nothing yet..be the first to share wisdom.