Removing Duplicates from a DataFrame
Overview
The presence of duplicates in a dataframe can lead to errors and hence wrong insights from the data. Removing duplicates from a dataframe is a crucial step in data cleaning. We can use the pandas library of Python to detect and remove duplicates from a dataframe and then verify the results. It's possible to remove those rows where the values in either all or only the specified columns are duplicated. The initial dataframe remains untouched unless we modify it in place or reassign the result to a variable. After removing duplicates from a dataframe, it's important to reset the dataframe index.
Introduction
The presence of duplicates in a dataframe is one of the most common issues when conducting data analysis in Python. If not handled properly, duplicated rows or values can lead to miscalculations, erroneous statistics, and, as a result, wrong insights from the data. Hence, removing duplicates from a dataframe is a crucial step in data cleaning.
The pandas library of Python offers convenient tools for this purpose. The overall algorithm looks as follows:
- Detecting the duplicates
- Removing the duplicates
- Verifying the results
The last step virtually repeats the first one: we just check again if the dataframe contains any duplicates.
Spotting the Duplicates
To detect whether a dataframe (or certain columns of it) has duplicated rows, pandas provides the duplicated() method. It returns a boolean Series indicating the rows with duplicates. The method has the following syntax: df.duplicated(subset=None, keep='first')
Where:
- subset – a column name provided as a string (or column names provided as a Python list) in which to search for duplicates. By default, the method searches for duplicates across all the columns, i.e., it compares entire rows. Note that if we provide a list of column names, only those rows where all the columns from the list are duplicated will be labeled as duplicates.
- keep – can be set to 'first' (labels the first occurrence of duplicates as False and all the others as True), 'last' (labels the last occurrence of duplicates as False and all the others as True), or False (labels all the occurrences of duplicates as True). The default value is 'first'.
Both parameters are optional.
If we need only the overall number of duplicates based on the provided criteria, rather than a boolean Series, we can chain this method with another pandas method – sum().
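As a minimal sketch of both ideas, using a small made-up dataframe (not the animal-center data from the Examples section):

```python
import pandas as pd

# Small made-up dataframe: the first two rows are identical
df = pd.DataFrame({
    "name": ["Rex", "Rex", "Milo"],
    "age": [2, 2, 3],
})

# Boolean Series: with the default keep='first', only the second
# occurrence of the duplicated row is flagged
mask = df.duplicated()
print(mask.tolist())          # [False, True, False]

# Chaining with sum() gives the overall number of duplicates
print(df.duplicated().sum())  # 1
```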
We'll see how the duplicated() method works – along with all the other methods we're going to consider – in the Examples section.
Removing the Duplicates
After identifying that the issue exists, we need to fix it by removing duplicates from a dataframe. There are two possible scenarios here. We may need to either:
- remove those dataframe entries where the values in all the columns are duplicated
- remove those dataframe entries where the values in the specified columns are duplicated.
In both cases, we can use the drop_duplicates() pandas method. In the second case, we can alternatively use the groupby() method.
Let's explore both scenarios separately.
Using drop_duplicates()
For removing duplicates from a dataframe based on all the columns, we use the drop_duplicates() method of pandas. This method has the following syntax: df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
Where:
- subset – a column name provided as a string (or column names provided as a Python list) in which to search for and then remove duplicates. By default, the method removes a row if all of its values coincide with those of another row. Since in this section we're discussing only removing duplicates from a dataframe based on all the columns, we don't need to pass in this parameter.
- keep – can be set to 'first' (keeps the first occurrence of duplicates and drops all the others), 'last' (keeps the last occurrence of duplicates and drops all the others), or False (drops all the occurrences of duplicates). The default value is 'first'.
- inplace – can be True (modifies the current dataframe) or False (creates a new dataframe and hence has to be reassigned to a new or existing variable). The default value is False. If the inplace parameter is set to True, the method returns None. Otherwise, it returns a dataframe with removed duplicates.
- ignore_index – can be True (the dataframe index is reset to the default 0-based Python index) or False (the index isn't reset). The default value is False. Note that using this parameter is equivalent to chaining the drop_duplicates() method with reset_index(drop=True). To keep an integer-based index consistent after removing duplicates from a dataframe, it's important to reset the index using either of these two approaches.
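A minimal sketch of the keep and inplace parameters, on a small made-up dataframe:

```python
import pandas as pd

# Small made-up dataframe: the first two rows are identical
df = pd.DataFrame({
    "name": ["Rex", "Rex", "Milo"],
    "age": [2, 2, 3],
})

# Default keep='first', inplace=False: a new dataframe is returned,
# and the original one is left untouched
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2

# keep=False drops every occurrence of a duplicated row
print(len(df.drop_duplicates(keep=False)))  # 1

# inplace=True modifies df itself and returns None
df.drop_duplicates(inplace=True)
print(len(df))  # 2
```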
Removing Duplicates Based on Other Columns
For removing duplicates from a dataframe based on the specified columns, we can use the same pandas method, drop_duplicates(). This is where the subset parameter comes into play. We set it to either a string (if we want to deal with duplicates in only one column) or a list of column names (in the case of two or more columns of interest). All the other parameters of the method – keep, inplace, and ignore_index – behave as described in the previous section.
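A quick sketch of the subset parameter, again on made-up data:

```python
import pandas as pd

# Made-up data: a hedgehog and a dog share the name "Lucky"
df = pd.DataFrame({
    "name": ["Lucky", "Lucky", "Milo"],
    "species": ["hedgehog", "dog", "dog"],
})

# subset as a string: deduplicate on the 'name' column only.
# With the default keep='first', the dog "Lucky" is dropped
first_kept = df.drop_duplicates(subset="name")
print(first_kept["species"].tolist())  # ['hedgehog', 'dog']

# With keep='last', the hedgehog is dropped instead
last_kept = df.drop_duplicates(subset="name", keep="last")
print(last_kept["species"].tolist())   # ['dog', 'dog']
```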
Another approach to removing duplicates from a dataframe based on a selection of columns involves chaining the groupby() and first() (or last(), or nth()) pandas methods. The idea is to group the dataframe rows by the columns of interest and then select the first entry (or the last, or the nth) from each group. While the groupby() method has many optional parameters, for our purposes we need to set only the by parameter, passing in the list of columns. In the second chained method – first() or last() – we don't pass in anything (nth() is the exception: it takes the position n of the entry to select).
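This groupby()-based approach might look as follows, on made-up data. Note that groupby() sorts the groups by default, so the row order can differ from a drop_duplicates() result:

```python
import pandas as pd

# Made-up data with a duplicated (species, name) pair
df = pd.DataFrame({
    "species": ["dog", "dog", "cat"],
    "name": ["Lucky", "Lucky", "Milo"],
    "age": [2, 5, 3],
})

# Group by the columns of interest and take the first row of each group;
# species and name become the (multi-level) index of the result
result = df.groupby(by=["species", "name"]).first()
print(result)

# reset_index() turns them back into regular columns
flattened = result.reset_index()
print(flattened)
```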
Examples
Let's put into practice everything we've discussed so far. First, we'll import the pandas library and create a dummy dataframe with the data for a fictitious animal care center:
Code:
Output:
1. Detecting if the dataframe contains any duplicates:
Code:
Output:
2. Marking all the occurrences of duplicates as True:
Code:
Output:
3. Marking all the occurrences of duplicates as True except for the last one:
Code:
Output:
4. Finding the overall number of duplicates:
Code:
Output:
5. Searching for duplicates in only one column:
Code:
Output:
6. Searching for all the occurrences of duplicates in a selection of columns:
Code:
Output:
In the above example, note that the 2-year-old hedgehog Lucky was marked as a duplicate, since there are also two 2-year-old dogs called Lucky.
7. Removing duplicates from a dataframe based on all the columns:
Code:
Output:
Note the following things in the above results:
- The initial dataframe remained untouched since we didn't use inplace=True or reassign the result to a variable.
- The first occurrence of duplicated rows is kept in each case (keep='first' by default).
- Removing duplicates from a dataframe resulted in an inconsistent index. To reset it, we should use either the ignore_index=True parameter or the reset_index(drop=True) method.
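The two ways of resetting the index can be sketched on a small made-up dataframe:

```python
import pandas as pd

# Small made-up dataframe: the first two rows are identical
df = pd.DataFrame({
    "name": ["Rex", "Rex", "Milo"],
    "age": [2, 2, 3],
})

# Way 1: the ignore_index parameter
way1 = df.drop_duplicates(ignore_index=True)

# Way 2: chaining with reset_index(drop=True)
way2 = df.drop_duplicates().reset_index(drop=True)

print(way1.index.tolist())  # [0, 1]
print(way1.equals(way2))    # True
```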
8. Removing duplicates from a dataframe based on all the columns while keeping the last occurrence of duplicates in each case:
Code:
Output:
9. Reassigning the result of applying the drop_duplicates() method to a new variable:
Code:
Output:
10. Modifying the current dataframe in place after removing duplicates from it:
Code:
Output:
Let's restore the initial dataframe to be able to continue with further examples:
Code:
Output:
11. Removing duplicates from a dataframe based on all the columns and restoring the 0-based index – Way 1:
Code:
Output:
12. Removing duplicates from a dataframe based on all the columns and restoring the 0-based index – Way 2:
Code:
Output:
13. Verifying if both ways of restoring the 0-based index are equal:
Code:
Output:
14. Removing duplicates from a dataframe based on a single column:
Code:
Output:
Note that in the original dataframe, apart from the hedgehog Lucky, we also have two dogs called Lucky.
Code:
Output:
Both dogs were removed since they have the same name as the hedgehog, and, by default, the drop_duplicates() method keeps only the first occurrence of duplicates (in our case – the hedgehog).
15. Removing duplicates from a dataframe based on a single column while keeping the last occurrence of duplicates in each case:
Code:
Output:
In this case, the hedgehog was removed, as well as the first occurrence of the dog called Lucky. Instead, the second occurrence of the dog Lucky was preserved.
16. Removing duplicates from a dataframe based on a selection of columns:
Code:
Output:
17. Removing duplicates from a dataframe based on a selection of columns while keeping the last occurrence of duplicates in each case and restoring the 0-based index:
Code:
Output:
18. Removing duplicates from a dataframe based on a selection of columns by chaining the groupby() and first() methods:
Code:
Output:
Note that the columns species and name have now become the dataframe's (multi-level) index. If we want to convert them back to regular columns, we have to add one more method to the above chain – reset_index(level=['species', 'name']):
Code:
Output:
Conclusion
- The presence of duplicates in a dataframe can lead to errors and hence wrong insights from the data.
- Removing duplicates from a dataframe is a crucial step in data cleaning.
- The overall algorithm for removing duplicates from a dataframe using pandas includes detecting the duplicates, removing them, and verifying the results.
- To detect the duplicates in all or the specified columns of a dataframe, we use the duplicated() method.
- To find the overall number of duplicates in a dataframe based on the provided criteria, we chain the duplicated() and sum() methods.
- To remove the dataframe rows where the values in either all or the specified columns are duplicated, we can use the drop_duplicates() method; for the specified-columns case, we can alternatively chain groupby() with first() (or last(), or nth()).
- The initial dataframe remains untouched unless we modify it in place or reassign the result to a variable.
- After removing duplicates from a dataframe, it's important to reset the dataframe index.