Pandas Cheat Sheet

Learn via video courses
Topics Covered

Overview

Pandas library is an open-source library that is built on top of another very useful Python library, i.e., the NumPy library. In the Pandas module, we work with a large set of data in tabular format (in row-column format). So, the Pandas module supports two types of data structures i.e., Series and DataFrame. The Pandas module provides various function that makes the Pandas module so easy and friendly to work with large sets of data. Some of the most widely used functions are:Series(), DataFrame(), mean(), median(), std(), max(), min(), info(), count(), iloc(), describe(), etc.

Pandas Cheat Sheet

Before going into the Pandas cheat sheet, let us first learn about the Pandas module.

Pandas library is an open-source (free to use) library that is built on top of another very useful Python library, i.e., NumPy library. Pandas library is widely used in the field of data science, machine learning, and data analytics as it simplifies data importing and data analysis. The prime reason for the Pandas package's popularity is its easy importing feature and easy data analyzing data feature. Pandas module is quite fast and comes in very handy because of its high performance and productivity.

Pandas Data Structure

As we know that Python supports a wide variety of data types and data structures like integers, floating point numbers, lists, strings, tuples, dictionaries, etc. In the Pandas module, we work with a large set of data in tabular format (in row-column format). So, the Pandas module supports two types of data structures i.e., Series and DataFrame. Let us briefly discuss them.

1. Series

A series is nothing but a column present in the Pandas DataFrame (which can be seen as a table). Technically, a Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). For example, let us see how we can create categorical data when creating a Series. If we want the series to a categorical data, we can specify the dtype (data type) as a category.

The following examples are provided for more clarity.

Output:

  • We can also convert an already created series into category data (Categoricals or category data are a Pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values) by assigning a category as astype in the astype() function.
  • We can also pass the pandas.Categorical object the Series() function to create a categorical series.

2. DataFrame

A DataFrame is a multidimensional table comprising a collection of columns known as a series. In simple terms, we can define a DataFrame as a collection of Series. It is a two-dimensional data structure. In Pandas DataFrames, the data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns.

Example:

Output:

Importing Data in Pandas

For importing the data in the Pandas module, we generally use CSV or Excel files that have the data in row-column format. So, to import the data from an external file, we use the pandas.read_csv() function or the pandas.read_excel() function.

Exporting Data from Pandas

For exporting the data in the Pandas module, we use the DataFrame.to_excel() function.

Not only the CSV files, but we can also even import and export the SQL queries using functions like: pandas.read_sql() and DataFrame.to_sql() functions, respectively.

Viewing and Inspecting

To print the first n records (i.e., view or inspect the first n records), we use By default. The entire DataFrame gets printed.

For example:

We can even use functions like: DataFrame.info() or the DataFrame.describe() function.

Pandas DataFrame.info() function is used to get a concise summary of the dataframe. It comes in really handy when doing exploratory analysis of the data. To get a quick overview of the dataset, we use the DataFrame.info() function. Similarly, the Pandas describe() function is used to view some basic statistical details like percentile, mean, std, etc., of a data frame or a series of numeric values. When this method is applied to a series of strings, it returns a different output. The Pandas describe() returns a statistical summary of the data frame.

Selecting

To get one element from a series, we can simply use the list indexing technique as well as the iloc[] function. Let us learn about the iloc[] function. The iloc[] function in Python is one of the functions defined in the Pandas module that helps us to select a specific row or column from the data set. Using the iloc function in python, we can easily retrieve any particular value from a row or column using index values. Using the iloc[] function in python, we can easily retrieve any particular value from a row or column using index values. We can invoke the iloc[] function in python on the data set to retrieve rows and columns. The iloc function in python takes two optional parameters i.e. row value(s) and column value(s). The iloc function in python uses the parameters provided and returns the data set or series present at the specified position.

Example:

Output:

We can also use the loc attribute. The Pandas DataFrame.loc attribute accesses a group of rows and columns by label(s) or a boolean array in the given DataFrame. Since loc is an attribute so it does not take any input. It returns Scalar, Series, and DataFrame.

Adding / Dropping

To add a new value i.e. a new column, we can use the list indexing technique for example: DataFrame["new_column"] = value. For dropping or removing a column, we have the DataFrame.drop() function. This function takes a lot of parameters. The syntax of this function is:

Let us briefly discuss the parameters:

  • labels: It denotes the index or column labels to drop.
  • axis: It denotes whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
  • index: It denotes the alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
  • columns: It denotes the alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
  • level: It denotes the level for the MultiIndex, the level from which the labels will be removed.
  • inplace: If its values is set to True, then do operation inplace and return None.
  • errors: If it is set to ignore, then suppress the error and only existing labels are dropped.

Combining

For combining two or more DataFrames, we can use the concat() function. For example:

Similarly, if we want to merge the columns of a DataFrame, we can use the merge() function.

Filtering

Filtering means that if we want to extract a certain row with the matching index values, we can use the loc() function. For example:

We can even use the indexing technique that we use to print a segment of a list. For example: DataFrame.iloc[0:3] will print the first, second, and third rows.

Sorting

For sorting the DataFrames according to a certain value according to the column, we have a simple function like: DataFrame.sort_values(column-name).

Aggregating

Aggregate functions are the function that we generally use in the SQL world to perform some mathematical operations like count(), sum(), etc.

We must know that to perform the aggregation, we first need to group the data of certain columns using the function: groupby(column-name). Some of the aggregate functions that we use in the Pandas module are:

  • count(): counts the total occurrences of a certain value.
  • sum(): calculates the sum of the data in a column.
  • mad(): calculates the mean absolute deviation of the data in a column.
  • prod(): calculates the product of all the data in a column.
  • std(): calculates the standard deviation of the data in a column.
  • var(): calculates the variance of the data in a column.

Cleaning

In the Pandas dataFrames and Series, we often encounter various rows and columns having NaN and 0 values. These false and useless values need to be removed as they can cause a lot of computation, and we can get deviated from the actual result due to these faulty numbers. Now to remove these values, we can fill any values in the place of these faulty values. So, to fill a certain value, we can use the function fillna()

Create Test/Fake Data

We can use the NumPy module to generate fake data. We can use the rand() function of the Numpy library to generate some random fake data. For example:

Groupby

The group by aggregate function can be used to partition and group the entire data frame by some column. We can specify the column name in the parameter of the pandas.groupby() function for grouping the specified column data.

For example, if we have the data of customers of a shopping application, the data can have an entry of the same user multiple times as a single user can buy various items. So, in such scenarios, we can use the DataFrame.groupby() aggregate function to group all the products of the same customer. For grouping the customer, we need to pass the column name (here, customer_name is the column name of the DataFrame) as the parameter to the DataFrame.groupby() function.

Let us take an example to see how the transform() function works with the groupby() function.

Output:

In the above example, we are grouping the address and sales data and then transforming it using the transform() function. The transform() function is calculating the sum and then appends the calculated sym to the totalSales column.

Functions

The Pandas module provides various function that makes the Pandas module so easy and friendly to work with large sets of data. Some of the most widely used functions are:

  • DataFrame.mean()
  • DataFrame.median()
  • DataFrame.std()
  • DataFrame.max()
  • DataFrame.min()
  • DataFrame.count()
  • DataFrame.info()
  • DataFrame.iloc()
  • DataFrame.describe()

Plotting

We can even plot the Pandas data frames in the form of histograms and scatter plots. So, we can use functions like:

  • DataFrame.plot.hist(): To plot a histogram.
  • DataFrame.plot.scatter(): To plot a scatter plot.

Conclusion

  • A series is nothing but a column present in the Pandas DataFrame (which can be seen as a table). A DataFrame is a multidimensional table comprising a collection of columns known as a series.
  • To import and export the data from an external file, we use the Pandas.read_csv(), Pandas.read_excel(), DataFrame.to_excel(), DataFrame.to_sql() functions.
  • To get one element from a series, we can simply use the list indexing technique as well as the iloc() function. We can easily retrieve any particular value from a row or column using index values.
  • To add a new value i.e., a new column, we can use the list indexing technique. For dropping or removing a column, we have the DataFrame.drop() function.
  • For combining two or more DataFrames, we can use the concat() function. For filtering a certain row with the matching index values, we can use the loc() function.
  • For sorting the DataFrames according to a certain value according to the column, we have a simple function like: DataFrame.sort_values(column-name).
  • The group by aggregate function can be used to partition and group the entire data frame by some column. We can specify the column name in the parameter of the pandas.groupby() function.