Filter a Dataframe Using Common String Methods
Overview
One of the most common data manipulation operations is DataFrame filtering. Filtering a DataFrame means checking its rows against a condition and returning only the rows that satisfy it.
Pandas, a widely used data manipulation library for Python, provides several ways to filter a DataFrame.
Introduction
The DataFrame is one of the most widely used structures for tabular data in Python, and Pandas is the library that provides it. A DataFrame can range in size from a single row to very large volumes of data. As a result, to locate specific data, we need to filter it. This is where DataFrame filtering comes into play.
For example, consider the Titanic dataset. Suppose we need to assess how many European passengers survived. Manually selecting the rows needed for the study would be very time-consuming. Instead, we filter the DataFrame and analyze only the data that meets our requirements.
Handling missing data, data wrangling, and data cleaning are common applications of DataFrame filtering.
Methods to Filter a DataFrame
Pandas offers a wide range of functions for DataFrame filtering. Let us look at some of the most commonly used methods.
For this article, we will be using the Netflix Shows dataset.
- Using isin()
isin() takes a single argument, values, which can be an iterable, a Series, a DataFrame, or a dictionary. Each element is checked against values, and the result is a Boolean mask of the same shape that you can use to filter the DataFrame efficiently.
Example 1 -
Output:
Example 2 -
Output:
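As a minimal sketch of isin(), the snippet below builds a tiny stand-in table rather than loading the real Netflix Shows file. The rows are made up, and the column names (title, type, country) are an assumption about how the dataset is laid out.

```python
import pandas as pd

# Made-up rows; the column names are assumed to match the Netflix Shows dataset.
df = pd.DataFrame({
    "title": ["Dark", "Lupin", "Narcos", "Roma"],
    "type": ["TV Show", "TV Show", "TV Show", "Movie"],
    "country": ["Germany", "France", "United States", "Mexico"],
})

# Series.isin() returns a Boolean mask: True where the value is in the list.
mask = df["country"].isin(["Germany", "France"])
european = df[mask]  # keeps Dark and Lupin
print(european)

# Passing a dict to DataFrame.isin() checks each named column against its own list.
# Columns that are not keys of the dict come back as all False.
by_column = df.isin({"type": ["Movie"], "country": ["Mexico", "France"]})
print(by_column)
```

Negating the mask with ~mask would keep every row whose country is not in the list.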
- Using str.contains()
This method checks whether a substring or regular expression pattern is present in each value of a string column and returns a Boolean Series, which we can then use to select rows.
- For two partial strings: Here we filter the DataFrame for the occurrence of two or more strings. A row is kept if any one of the strings is present.
Example 1 -
Output:
Example 2 -
Output:
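One way to match either of two partial strings is to join them with "|" in a regular expression. The sketch below uses made-up rows and assumes a genre column named listed_in; the case=False and na=False arguments are optional choices for this illustration.

```python
import pandas as pd

# Made-up rows; "listed_in" is assumed to hold comma-separated genres.
df = pd.DataFrame({
    "title": ["Dark", "Lupin", "Narcos", "Roma"],
    "listed_in": [
        "Crime TV Shows, TV Dramas",
        "Crime TV Shows, TV Action & Adventure",
        "TV Dramas",
        None,  # a missing value, to show what na=False does
    ],
})

# "|" means "or" in a regular expression, so a row matches if either word occurs.
# case=False ignores letter case; na=False treats missing values as "no match".
mask = df["listed_in"].str.contains("crime|action", case=False, na=False)
either = df[mask]  # keeps Dark and Lupin
print(either)
```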
- Where both strings are present
In this case, a row is kept only when both of the required strings are present. If only one of the strings (or neither) is present, the result for that row is False; otherwise it is True.
Example 1 -
Output:
Example 2 -
Output:
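A simple way to require both strings is to build one mask per string and combine them with the & operator. This is a sketch on made-up rows, again assuming a listed_in genre column.

```python
import pandas as pd

# Made-up rows; "listed_in" is an assumed column name.
df = pd.DataFrame({
    "title": ["Dark", "Lupin", "Narcos", "Roma"],
    "listed_in": [
        "Crime TV Shows, TV Dramas",
        "Crime TV Shows, TV Action & Adventure",
        "TV Dramas",
        "Dramas, Independent Movies",
    ],
})

# One Boolean mask per required substring.
has_crime = df["listed_in"].str.contains("Crime", na=False)
has_drama = df["listed_in"].str.contains("Dramas", na=False)

# "&" keeps a row only when both masks are True for it.
both = df[has_crime & has_drama]  # keeps Dark only
print(both)
```

Swapping & for | would give the "any of the strings" behaviour instead.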
- For partial string in multiple columns
As the title suggests, this approach looks for a string, pattern, or partial string in any of several columns of the DataFrame.
Example 1 -
Output:
Example 2 -
Output:
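To search several columns at once, one option is to run str.contains() on each column and keep the rows where any column matched. The sketch below uses made-up rows; the column names title and description are assumptions.

```python
import pandas as pd

# Made-up rows; column names are assumed.
df = pd.DataFrame({
    "title": ["Love", "Dark", "Narcos"],
    "description": [
        "A couple navigates dating.",
        "A love story across time.",
        "The rise of a cartel.",
    ],
})

cols = ["title", "description"]

# apply() runs the lambda once per column, giving a DataFrame of Booleans.
matches = df[cols].apply(
    lambda col: col.str.contains("love", case=False, na=False)
)

# any(axis=1) is True for a row if at least one of its columns matched.
found = df[matches.any(axis=1)]  # keeps Love and Dark
print(found)
```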
- Custom apply() function
In many cases, partial string matching with contains is not enough because we need more precise results. For this, we have the apply function, which lets us customize the filter to fit our requirements. We can do this by pairing apply with a lambda function.
Example 1 -
Output:
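As an illustration of a custom filter, the sketch below keeps rows where a genre appears as an exact entry in a comma-separated list, which a plain substring match cannot express. The helper has_genre, the made-up rows, and the listed_in column name are all hypothetical.

```python
import pandas as pd

# Made-up rows; "listed_in" is an assumed column of comma-separated genres.
df = pd.DataFrame({
    "title": ["Dark", "Narcos", "Roma", "Untitled"],
    "listed_in": [
        "Crime TV Shows, TV Dramas",
        "TV Dramas",
        "Dramas, Independent Movies",
        None,
    ],
})


def has_genre(cell, genre):
    """Hypothetical helper: True if `genre` is exactly one of the entries in `cell`."""
    if not isinstance(cell, str):  # missing values never match
        return False
    return genre in [part.strip() for part in cell.split(",")]


# apply() calls the lambda once per value and returns a Boolean Series.
mask = df["listed_in"].apply(lambda cell: has_genre(cell, "Dramas"))
exact = df[mask]  # keeps Roma only; "TV Dramas" is a different entry
print(exact)

# For comparison, a substring match also picks up "TV Dramas".
loose = df[df["listed_in"].str.contains("Dramas", na=False)]
```

Because apply() calls a Python function for every value, it is usually slower than the vectorized str methods, so it is best kept for conditions those methods cannot express.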
- Checking column names for a given sub-string
As the title suggests, we can select the columns whose names contain a given substring. In the same way, we can select rows by their index values.
Example 1 - Filter column names
Output:
Example 2 - Filter by index values
Output:
Example 3 -
Output:
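One way to filter on labels rather than values is DataFrame.filter(), which matches column names (axis=1) or index values (axis=0). The sketch below uses made-up rows, assumed column names, and a made-up string index.

```python
import pandas as pd

# Made-up rows with a string index; column names are assumed.
df = pd.DataFrame(
    {
        "title": ["Dark", "Lupin", "Roma"],
        "date_added": ["2017-12-01", "2021-01-08", "2018-12-14"],
        "release_year": [2017, 2021, 2018],
        "rating": ["TV-MA", "TV-MA", "R"],
    },
    index=["s1", "s2", "s10"],
)

# like= keeps labels that contain the substring.
date_cols = df.filter(like="date", axis=1)  # column: date_added
s1_rows = df.filter(like="s1", axis=0)      # index values: s1 and s10

# regex= keeps labels that match a regular expression.
r_cols = df.filter(regex="^r", axis=1)      # columns: release_year, rating

# The same column selection, written with the str accessor on df.columns.
same_cols = df.loc[:, df.columns.str.contains("date")]
```

Note that filter() only looks at labels; it never inspects the values stored in the cells.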
Other methods of the str accessor, such as str.startswith(), str.endswith(), str.isnumeric(), str.isupper(), and str.islower(), can be used for filtering in the same way.
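A few of these other string methods can be sketched on a small made-up table. The column names are assumptions, and release_year is stored as text here only so that str.isnumeric() has something to check.

```python
import pandas as pd

# Made-up rows; release_year is deliberately stored as strings.
df = pd.DataFrame({
    "title": ["The Crown", "Dark", "THE 100", "Roma"],
    "release_year": ["2016", "2017", "2014", "n/a"],
})

# startswith / endswith are case sensitive.
starts_the = df[df["title"].str.startswith("The")]  # The Crown
ends_a = df[df["title"].str.endswith("a")]          # Roma

# Lowercasing first makes the check case insensitive.
starts_the_ci = df[df["title"].str.lower().str.startswith("the")]  # The Crown, THE 100

# isnumeric() is True only when every character is numeric.
numeric_year = df[df["release_year"].str.isnumeric()]  # drops the "n/a" row

# isupper() is True when all cased characters are uppercase.
shouting = df[df["title"].str.isupper()]  # THE 100
```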
- Using query()
The syntax of the query function is fairly similar to that of a query in a DBMS: the filter condition is written as a string expression. Because this is often more concise and easier to read than the other ways listed, it is frequently used.
Example 1 -
Example 2 -
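As a rough sketch of query(), the snippet below filters a small made-up table with string expressions. The column names and the variable min_year are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows; column names are assumed.
df = pd.DataFrame({
    "title": ["Dark", "Lupin", "Narcos", "Roma"],
    "type": ["TV Show", "TV Show", "TV Show", "Movie"],
    "release_year": [2017, 2021, 2015, 2018],
})

# Conditions are written as one string and combined with "and" / "or".
recent_shows = df.query("type == 'TV Show' and release_year >= 2017")  # Dark, Lupin

# "@" refers to a Python variable defined outside the query string.
min_year = 2016
after_min = df.query("release_year > @min_year")  # Dark, Lupin, Roma

# String methods can be used inside the expression; engine="python" is
# passed explicitly here so the str accessor is evaluated by pandas itself.
has_ar = df.query("title.str.contains('ar')", engine="python")  # Dark, Narcos
print(recent_shows)
```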
Conclusion
- DataFrame filtering is an integral part of data manipulation and data analysis, with many applications. There are various ways to filter a DataFrame, and some of them are specific to strings, partial strings, or Series.
- We use the isin and str.contains methods to check for the occurrence of strings; both return Boolean output.
- To get more specific results, we use the apply function.
- To select particular columns or rows by label, we filter on column names and index values.
- Querying with query() is also a straightforward way to filter a DataFrame.