Advanced Operations on a Pandas Dataframe
Overview
Pandas is the go-to library for machine learning engineers. It provides tools for preparing and transforming datasets so they can be fed to machine learning models. It has tools for reading and writing data, handling missing data, merging datasets, subsetting datasets, and much more.
Introduction
Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language. The Pandas library helps to prepare raw data in the right format for machine learning algorithms. It is used to transform datasets before feeding them to machine learning models.
It provides tools for reading and writing data, intelligent data alignment and integrated handling of missing data, reshaping, slicing, fancy indexing, subsetting, merging and joining of data sets, etc.
Some Important Pandas Functions
Some important functions of Pandas are as follows.
- Melt - This function is useful to convert a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Pivot - This function returns a reshaped DataFrame organized by given index/column values. It reshapes data (produces a “pivot” table) based on column values. It uses unique values from specified index/columns to form axes of the resulting DataFrame.
- Concat - This function performs concatenation operations along a particular axis on Pandas objects. It also allows optional set logic (union or intersection) of the indexes on the other axes.
- Merge - This function merges DataFrame or named Series objects with a database-style join.
- Reset_index - This function resets the index of the DataFrame and uses the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.
- Set_index - This function sets the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length).
- Groupby - This function groups the data according to the categories and applies a function to the categories. It is used to aggregate a large amount of data.
- Replace - This function replaces values given in the DataFrame with the value provided as the parameter.
- Map - This function maps the values of a Series according to an input mapping or function.
- Crosstab - This function computes a simple cross-tabulation of two (or more) factors. By default, it computes a frequency table of the factors and an aggregation function is then performed according to the given parameters.
- pandas any - This function returns False if no element within a Series or along a DataFrame axis is True or equivalent, and True if at least one such element exists.
- pandas agg - This function aggregates using one or more operations over the specified axis.
- pandas copy - This function makes a copy of this object’s indices and data.
- pandas transform - This function calls the input function on self, producing a DataFrame with the same axis shape as self.
- pandas rank - This function computes numerical data ranks (1 through n) along the axis where n is the size of the data.
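To make the first of these concrete, here is a minimal sketch of melt, using a small hypothetical table (the column names are illustrative only):

```python
import pandas as pd

# Hypothetical wide-format table: one row per student, one column per subject
df = pd.DataFrame({
    "student": ["A", "B"],
    "maths": [90, 80],
    "science": [85, 95],
})

# melt "unpivots" the subject columns into (variable, value) pairs,
# keeping "student" as the identifier column
long_df = pd.melt(df, id_vars=["student"],
                  value_vars=["maths", "science"],
                  var_name="variable", value_name="value")
print(long_df)
```

Each (student, subject) pair becomes its own row, so two students with two subjects each produce four rows.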
How to Use the assign() Method in Pandas Dataframe
The assign method in Pandas assigns new columns to a DataFrame. It returns a new object with all original columns in addition to new ones.
Code #1
Explanation
We start by importing the Pandas and NumPy libraries. We are using Jupyter Notebook to write and execute the code.
Code #2
Output:
Explanation
We create a Pandas Dataframe as above.
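The original cell isn't reproduced here; a plausible sketch, using invented student data and the marks_out_of_25 column name that the next explanation refers to, could look like:

```python
import pandas as pd

# Hypothetical student-marks table; only the column name marks_out_of_25
# comes from the walkthrough, the rows are made up for illustration
df = pd.DataFrame({
    "name": ["Asha", "Bala", "Chitra"],
    "marks_out_of_25": [20, 22, 18],
})
print(df)
```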
Code #3
Output:
Explanation
In the above code, a new column (percentage) is created based on the existing column (marks_out_of_25) and added to the dataset using the assign method.
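A sketch of that assign call, assuming the same hypothetical marks data:

```python
import pandas as pd

# Hypothetical data; marks_out_of_25 is the column named in the explanation
df = pd.DataFrame({
    "name": ["Asha", "Bala", "Chitra"],
    "marks_out_of_25": [20, 22, 18],
})

# assign returns a NEW DataFrame with the derived column added;
# the original df is left unchanged
result = df.assign(percentage=lambda d: d["marks_out_of_25"] / 25 * 100)
print(result)
```

Note that assign does not modify df in place, which is why the result must be captured in a variable.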
Performing Conditional Selection on the Dataframe
Let us now see how to perform conditional selection on the dataframe.
Code #1
Explanation
We start by importing the Pandas and NumPy libraries.
Code #2
Output:
Explanation
We declare a variable arr and assign it a value. We convert the list arr into a Pandas DataFrame and assign it to the variable df. We name the columns of df Lemonade_Sales and Temperature_°C. Then df is printed on the screen.
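The cell itself isn't shown; a sketch with invented numbers (only the column names come from the explanation) might be:

```python
import pandas as pd

# Hypothetical sales figures; the column names Lemonade_Sales and
# Temperature_°C are the ones the explanation describes
arr = [[12, 20], [30, 35], [45, 40]]
df = pd.DataFrame(arr, columns=["Lemonade_Sales", "Temperature_°C"])
print(df)
```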
Simple Conditional Solution
Code #3
Output:
Explanation
This is a simple conditional statement. Every instance which satisfies the input condition is selected and displayed.
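A minimal sketch of such a conditional selection, assuming the hypothetical lemonade data from above:

```python
import pandas as pd

arr = [[12, 20], [30, 35], [45, 40]]
df = pd.DataFrame(arr, columns=["Lemonade_Sales", "Temperature_°C"])

# The comparison produces a boolean mask; indexing df with it keeps
# only the rows where the condition is True
hot_days = df[df["Temperature_°C"] > 30]
print(hot_days)
```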
One-line Conditional Solution
Code #4
Output:
Explanation
The condition, as well as the columns to be displayed, are passed in a single line.
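A sketch of the one-line form, again with the hypothetical lemonade data:

```python
import pandas as pd

arr = [[12, 20], [30, 35], [45, 40]]
df = pd.DataFrame(arr, columns=["Lemonade_Sales", "Temperature_°C"])

# Condition and column selection chained in a single expression:
# filter the rows, then pick just the column of interest
sales_on_hot_days = df[df["Temperature_°C"] > 30]["Lemonade_Sales"]
print(sales_on_hot_days)
```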
Code #5
Output:
Explanation
The condition, as well as the columns to be displayed, are passed in a single line.
Multiple Conditional Solutions
Explanation
Two conditions are passed, and the displayed result contains only those instances which satisfy both conditions.
Code #6
Output:
Explanation
Two conditions are passed, and the displayed result contains those instances which satisfy either of the two conditions.
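Both combinations can be sketched with the same hypothetical data; note the element-wise operators & and |, and the parentheses around each condition:

```python
import pandas as pd

arr = [[12, 20], [30, 35], [45, 40]]
df = pd.DataFrame(arr, columns=["Lemonade_Sales", "Temperature_°C"])

# AND: a row is kept only if BOTH conditions hold
both = df[(df["Temperature_°C"] > 30) & (df["Lemonade_Sales"] > 40)]

# OR: a row is kept if EITHER condition holds
either = df[(df["Temperature_°C"] > 30) | (df["Lemonade_Sales"] < 20)]

print(both)
print(either)
```

Plain Python `and`/`or` would raise an error here, because a boolean Series has no single truth value; & and | operate element-wise instead.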
Handling Missing Data
Just like everything else, our data is not perfect. Many instances may have missing values, i.e., null values, for one or more features. What should we do in such cases? There are two options.
- Fillna - This method replaces the null values in a feature with a statistical measure of that feature's non-null values. We can take the mean or median of the non-null values, or any other relevant value, to fill the null values in a dataset. The following examples illustrate how to use the fillna method.
- Dropna - This method removes any instance (row) that contains a null value for any feature.
Code #1
Code #2
Code #3
Output:
Explanation
We convert the dictionary arr into a Pandas DataFrame df and print it.
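A sketch of such a dictionary with deliberately missing entries (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dictionary; np.nan marks the missing values
arr = {
    "A": [1.0, 2.0, np.nan],
    "B": [5.0, np.nan, np.nan],
    "C": [1.0, 2.0, 3.0],
}
df = pd.DataFrame(arr)
print(df)
```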
Code #4
Output:
Explanation
By default, the axis setting is 0. Therefore, the dropna function removes every instance (row) that has a null value for any feature.
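With the hypothetical DataFrame above, row-wise dropna looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan],
    "B": [5.0, np.nan, np.nan],
    "C": [1.0, 2.0, 3.0],
})

# axis=0 (the default): drop every ROW that contains at least one null
cleaned = df.dropna()
print(cleaned)
```

Only the first row survives, because it is the only one with no null in any column.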
Code #5
Output:
Explanation
When we set axis=1, the dropna function removes every feature (column) that has any null value.
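The column-wise variant, with the same hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan],
    "B": [5.0, np.nan, np.nan],
    "C": [1.0, 2.0, 3.0],
})

# axis=1: drop every COLUMN that contains at least one null
cleaned = df.dropna(axis=1)
print(cleaned)
```

Only column C remains, since A and B each contain at least one null.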
Code #6
Output:
Explanation
The fillna function fills the null values with the statistical mean of the non-null values in the feature column.
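A sketch of mean imputation with fillna on the same hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan],
    "B": [5.0, np.nan, np.nan],
    "C": [1.0, 2.0, 3.0],
})

# df.mean() computes each column's mean over its non-null values;
# fillna then substitutes that mean for the nulls, column by column
filled = df.fillna(df.mean())
print(filled)
```

For column A the mean of the non-null values 1.0 and 2.0 is 1.5, so the missing entry becomes 1.5.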
Creating Multi-level Data Frames - Cross-Section
This section will show us how to create multi-level data frames in Python using the Pandas library.
Code #1
Explanation
We start by importing the Pandas and NumPy libraries.
Code #2
Explanation
We declare two dictionaries, first and second, and zip them as a list into the variable top_index. MultiIndex is an array of tuples where each tuple is unique. A MultiIndex is created from an array of tuples (using MultiIndex.from_tuples()).
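The original cell isn't shown; a plausible sketch, assuming first and second are simple sequences of level values (the labels F1/F2 match those used later in this section):

```python
import pandas as pd

# Hypothetical level values: outer group labels and inner row numbers,
# zipped together into a list of unique tuples
first = ["F1", "F1", "F1", "F2", "F2", "F2"]
second = [1, 2, 3, 1, 2, 3]
top_index = list(zip(first, second))

# Build the MultiIndex from the array of tuples
multi = pd.MultiIndex.from_tuples(top_index)
print(multi)
```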
Code #3
Output:
Explanation
The randn function generates random numbers; these are combined with top_index as the row index, converted into a Pandas DataFrame, and assigned to the variable df.
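A sketch of that step, with hypothetical column names X and Y standing in for the originals:

```python
import numpy as np
import pandas as pd

first = ["F1", "F1", "F1", "F2", "F2", "F2"]
second = [1, 2, 3, 1, 2, 3]
top_index = pd.MultiIndex.from_tuples(list(zip(first, second)))

# Six rows of standard-normal random numbers, one per index tuple;
# the column names X and Y are placeholders
df = pd.DataFrame(np.random.randn(6, 2), index=top_index, columns=["X", "Y"])
print(df)
```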
Code #4
Output:
Explanation
The loc function in pandas accesses a group of rows and columns by label(s) or a boolean array. It is used to access F1.
Code #5
Output:
Explanation
The loc function in pandas accesses a group of rows and columns by label(s) or a boolean array. It is used to access F2.
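Both group selections can be sketched together, assuming the hypothetical multi-level frame built above:

```python
import numpy as np
import pandas as pd

# Hypothetical two-level frame; columns X and Y are placeholders
idx = pd.MultiIndex.from_tuples(
    [("F1", 1), ("F1", 2), ("F1", 3), ("F2", 1), ("F2", 2), ("F2", 3)])
df = pd.DataFrame(np.random.randn(6, 2), index=idx, columns=["X", "Y"])

# Indexing by the outer label returns the whole sub-frame for that group
f1 = df.loc["F1"]
f2 = df.loc["F2"]
print(f1)
print(f2)
```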
Code #6
Output:
Explanation
To return row 2 in F2, we write the code as above.
Code #7
Output:
Explanation
To return row 2, column 1 in F1, we write the code as above.
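Chained loc calls drill down level by level; a sketch with the same hypothetical frame (X stands in for the first column):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("F1", 1), ("F1", 2), ("F1", 3), ("F2", 1), ("F2", 2), ("F2", 3)])
df = pd.DataFrame(np.random.randn(6, 2), index=idx, columns=["X", "Y"])

# Outer group first, then the inner row
row2_f2 = df.loc["F2"].loc[2]      # row 2 within group F2 (a Series)
value = df.loc["F1"].loc[2, "X"]   # row 2, first column within group F1
print(row2_f2)
print(value)
```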
Code #8
Output:
Explanation
The output is the Pandas index-names object, a FrozenList containing None values. This means that the indices don't have any names yet.
Code #9
Output:
Explanation
In the above code, we are naming the indices Groups and Num.
Code #10
Output:
Explanation
The xs() function is used to get a cross-section from the Series/DataFrame. The above code outputs the first index of each of the two groups.
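A sketch of that cross-section, assuming the hypothetical frame with its levels named Groups and Num:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("F1", 1), ("F1", 2), ("F1", 3), ("F2", 1), ("F2", 2), ("F2", 3)],
    names=["Groups", "Num"])
df = pd.DataFrame(np.random.randn(6, 2), index=idx, columns=["X", "Y"])

# xs grabs index value 1 from EVERY group in a single call
cross = df.xs(1, level="Num")
print(cross)
```

Doing the same with loc alone would require selecting from each group separately; xs handles the inner level across all groups at once.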
How to Analyse Samples of Data Frames? (using df.groupby().iter() )
When we want to know what our data looks like, we dive down to see the dataset row by row or column by column. This is more difficult to do in Jupyter Notebook than in Excel. This problem is solved by creating a generator with .groupby() and adding the .__iter__().
Then we can use Ctrl+Enter to iterate over different samples from the dataset. We can run the cell as many times as we wish to observe and understand what the data looks like.
Code
Code
Explanation
In the first cell, we create a generator named generator over the dataset df by calling .groupby() followed by .__iter__(). In the second cell, we call __next__() on the generator, which gives us two variables, group_id and grouped_data. The grouped_data variable holds the subset of the dataset for each unique group_id.
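The two cells can be sketched as follows, with a small invented dataset:

```python
import pandas as pd

# Small stand-in dataset with a grouping column
df = pd.DataFrame({
    "group_id": ["a", "a", "b", "b"],
    "value": [1, 2, 3, 4],
})

# Cell 1: build the generator once
generator = df.groupby("group_id").__iter__()

# Cell 2: re-run this cell (Ctrl+Enter) to step through one group at a time
group_id, grouped_data = generator.__next__()
print(group_id)
print(grouped_data)
```

Each re-run of the second cell yields the next (group_id, grouped_data) pair until the groups are exhausted.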
Multi-chaining in Pandas
Multi-chaining is one of the most widely used and important concepts in Pandas. Adding multiple operations within the same line of code is called chaining. This way, we don't have to create new datasets and variables every time we operate on our dataset.
Code #1
Code #2
Explanation
Both the above lines of code give the same result. They perform the same operations on the dataset in the same sequence. However, the second code cell is much more readable and doesn't require extra variables or create new datasets. The second code cell uses multi-chaining to achieve this.
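A side-by-side sketch of the two styles, using an invented sales table:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "city": ["X", "X", "Y", "Y"],
    "sales": [10, 20, 30, 40],
})

# Without chaining: a new variable at every step
grouped = df.groupby("city")
summed = grouped.sum()
result_a = summed.reset_index()

# With chaining: the same pipeline as one readable expression
result_b = df.groupby("city").sum().reset_index()
print(result_b)
```

Both variants produce identical results; the chained form simply avoids the intermediate names.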
Conclusion
- Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.
- Pandas provides tools for reading and writing data, intelligent data alignment and integrated handling of missing data, reshaping, slicing, fancy indexing, subsetting, merging and joining of data sets, etc.
- Adding operations within the same line of code is called multi-chaining.
- The null values for any feature are replaced by the statistical measure of the non-null values in that feature by the fillna method.
- Any instance which contains null values for any feature is removed by the dropna method.
- We can query our dataset by performing conditional selection on the dataset.