Viewing Data in Pandas
Overview
When working with large datasets, we often need only part of the data, such as particular columns or a specific range of rows. For this purpose, pandas has many methods for viewing specific information from a large dataset.
Introduction
When we need to work with a subset of a large dataset, pandas has various methods to view specific data, like the head() function for the top rows and the tail() function for the bottom rows. There are many more tools, like loc, iloc, between(), etc.
Methods to View the Data in Pandas
We frequently work with huge datasets that contain several attributes when performing data analysis. Not all of these attributes are equally significant, so we often want to use only a specific set of the DataFrame's columns. There are lots of methods for viewing data in pandas. Let's look at how we can use the DataFrame to create views that allow us to pick the columns we want while excluding the rest.
Print the Data
Let's look at the dataset. The dataset is imported from this GitHub link
Code#1:
Output:
Explanation: pandas is imported as pd, and .read_csv() is used to read the CSV file from GitHub. Since .head() returns only 5 rows by default, displaying the first 20 rows of the data requires passing the count, as in .head(20).
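The link to the original dataset is not reproduced here, so the following is only a minimal sketch of the same loading step. It assumes a CSV with a Date column and a Births column, and it reads made-up sample rows from an in-memory string instead of a URL.

```python
import io
import pandas as pd

# Made-up sample rows in the two-column layout (Date, Births) the article describes.
csv_text = """Date,Births
1959-01-01,35
1959-01-02,32
1959-01-03,30
1959-01-04,31
1959-01-05,44
1959-01-06,29
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; with no argument it returns the first 5.
print(df.head(3))
```

To load a real file, the StringIO object can be replaced by the path or URL of the CSV.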
View Data Frame Rows and Columns with shape
In pandas, we have the shape attribute to get the size of the data. It returns a tuple that contains the number of rows and columns present in the DataFrame. Because shape is an attribute and not a method, it is written without parentheses. There is another attribute, ndim, which gives the number of dimensions of the object; for a DataFrame this is always 2. Let's see, with the help of the below examples, how these attributes work.
Code#2:
Output:
Explanation: In the above example, .shape is used to get the shape of the data. As shown in the above example, the dataset has 365 rows and two columns.
Code#3:
Output:
Explanation: ndim is used to get the number of dimensions of the data. It returns 2 here because a DataFrame is 2-dimensional.
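As a small sketch of both attributes, the following uses a made-up four-row DataFrame (the column names follow the article; the values are only illustrative).

```python
import pandas as pd

# Illustrative stand-in for the article's dataset.
df = pd.DataFrame({
    "Date": ["1959-01-01", "1959-01-02", "1959-01-03", "1959-01-04"],
    "Births": [35, 32, 30, 31],
})

# shape is an attribute (no parentheses) holding (rows, columns).
rows, cols = df.shape
print(rows, cols)

# ndim is 2 for a DataFrame and 1 for a single column (a Series).
print(df.ndim)
print(df["Births"].ndim)
```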
Preview the Top and Bottom five rows
Once the data is loaded in Python, you usually want to check whether the desired rows and columns are present in the dataset. For this, pandas has the head() and tail() functions, which show the data from the start and from the end. Let's see the working of these functions with the help of examples.
head()
To get the starting rows of the data, the head() function is used. By default, it returns the top 5 rows. Let's see how this function works with the help of the below examples.
Code#4:
Output:
Explanation: In the above example, the .head() function is used to get the data from the start. By default, it returns the top 5 rows if nothing is passed to the function.
Code#5:
Output:
Explanation: In the above example, as head() is used to get the data from the start, .head(8) gives the first eight rows of the data.
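A minimal sketch of both calls, using a made-up ten-row DataFrame in place of the article's dataset:

```python
import pandas as pd

# Illustrative data: ten consecutive dates with made-up birth counts.
df = pd.DataFrame({
    "Date": [f"1959-01-{day:02d}" for day in range(1, 11)],
    "Births": [35, 32, 30, 31, 44, 29, 45, 43, 38, 27],
})

first_five = df.head()    # default: first 5 rows
first_eight = df.head(8)  # explicit count: first 8 rows

print(first_five)
print(first_eight)
```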
tail()
Code#6:
Output:
Explanation: The .tail() function is used to get the data from the bottom. By default, it returns the last five rows if nothing is passed to the function.
Code#7:
Output:
Explanation: In the above example, .tail(8) gives eight rows from the bottom of the data.
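The same idea can be sketched for tail(), again on a made-up ten-row DataFrame rather than the article's dataset:

```python
import pandas as pd

# Illustrative data: ten consecutive dates with made-up birth counts.
df = pd.DataFrame({
    "Date": [f"1959-01-{day:02d}" for day in range(1, 11)],
    "Births": [35, 32, 30, 31, 44, 29, 45, 43, 38, 27],
})

last_five = df.tail()    # default: last 5 rows
last_three = df.tail(3)  # explicit count: last 3 rows

print(last_five)
print(last_three)
```

Note that the rows keep their original index labels, so the last five rows of this frame are labelled 5 to 9.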
View Data Types of Columns
Most DataFrames hold mixed data types: numbers, strings, dates, and so on. In a CSV file, however, everything is stored as plain characters, with no information about the data type of each column. While importing the data, pandas infers the types: if a column contains only numbers, pandas sets its data type to integer or float, and if it contains strings, the data type of the column is object. The dtypes attribute is used for checking the data type of each column.
Strings are loaded as ‘object’ datatypes.
Code#8:
Output:
Explanation: In the above code example, pandas is imported as pd. The .columns attribute is used to get the column names as shown in the output, and the dtype object means the string data type.
Code#9:
Output:
Explanation: The dtypes attribute is used to get the data type of each column of the data. As shown in the above example, the Date column is of string data type (object is used for the string data type), and the Births column has the int64 data type.
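Here is a small sketch of .columns and .dtypes on a made-up DataFrame. One caveat: depending on the pandas version, a column of strings may be reported as object or as a dedicated string dtype, so the sketch only checks whether a column is numeric.

```python
import pandas as pd

# Illustrative stand-in for the article's dataset.
df = pd.DataFrame({
    "Date": ["1959-01-01", "1959-01-02", "1959-01-03", "1959-01-04"],
    "Births": [35, 32, 30, 31],
})

# columns holds the column labels; dtypes holds one data type per column.
print(df.columns.tolist())
print(df.dtypes)

# Helper checks from pandas.api.types avoid comparing dtype names by hand.
births_is_integer = pd.api.types.is_integer_dtype(df["Births"])
date_is_numeric = pd.api.types.is_numeric_dtype(df["Date"])
print(births_is_integer, date_is_numeric)
```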
To change the data type of the specific column, the astype() function is used.
Code#10:
Output:
Explanation: The .astype() function is used to change the data type of the data. In the above example, the Births column is changed from the integer data type to the string data type.
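A minimal sketch of the conversion, assuming a made-up Births column. Note that astype() returns a new object, so the result has to be assigned if the change should be kept.

```python
import pandas as pd

# Illustrative stand-in for the article's dataset.
df = pd.DataFrame({
    "Date": ["1959-01-01", "1959-01-02", "1959-01-03", "1959-01-04"],
    "Births": [35, 32, 30, 31],
})

# astype returns a converted copy; the original column is left unchanged.
births_as_text = df["Births"].astype(str)
births_as_float = df["Births"].astype("float64")

print(births_as_text.tolist())
print(births_as_float.dtype)
```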
Describe the Data with describe()
To get the numeric information of the data like count, mean, standard deviation, minimum, maximum, and percentile ranges, the describe() function is used.
Code#11:
Output:
Explanation: In the above code example, the .describe() function is used to get the numeric summary of the data: a count of 365 values, a mean of 41, a 25th percentile of 37 births (a quarter of the values are 37 or lower), and so on.
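To see which entries describe() returns, here is a sketch on ten made-up values (not the 365-row dataset from the article):

```python
import pandas as pd

# Ten made-up birth counts.
births = pd.Series([35, 32, 30, 31, 44, 29, 45, 43, 38, 27], name="Births")

# describe() returns a Series indexed by the name of each statistic.
summary = births.describe()
print(summary)

# Individual statistics can be read by label; "50%" is the median.
print(summary["mean"], summary["50%"])
```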
Viewing the Counts using value_counts()
Using the value_counts() function, we can find out how many times each distinct value occurs in a column, that is, how many rows fall into each category. In the dataset used here, for example, the date 1959-01-01 has 35 births.
Syntax: Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
Parameters:
- normalize: It is of bool type and, by default, False. If True, the returned object contains the relative frequencies of the unique values instead of the counts.
- sort: It is of bool type and, by default, True. It sorts the result by frequency.
- ascending: It is of bool type and, by default, False. If True, the result is sorted in ascending order.
- bins: It is an optional parameter of the int type. Instead of counting individual values, it groups them into half-open bins. This is a convenience for pd.cut and works only with numeric data.
- dropna: It is of bool type and, by default, True. It excludes NaN from the counts.
Code#12:
Output:
Explanation: In the above code example, value_counts() counts how many times each value appears in the data.
Code#13:
Output:
Explanation: In the above code example, value_counts() counts the number of times each value appears and, with ascending=True, lists the counts in ascending order.
Code#14:
Output:
Explanation: If normalize is True, the returned object contains the relative frequencies of the unique values instead of the raw counts.
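The three variants above can be sketched on a short made-up series in which some values repeat:

```python
import pandas as pd

# Eight made-up birth counts; 35 occurs three times and 32 twice.
births = pd.Series([35, 32, 35, 31, 44, 32, 35, 43], name="Births")

# Default: counts per distinct value, most frequent first.
counts = births.value_counts()

# ascending=True: least frequent first.
rare_first = births.value_counts(ascending=True)

# normalize=True: shares of the total instead of counts.
shares = births.value_counts(normalize=True)

print(counts)
print(rare_first)
print(shares)
```

The result is indexed by the distinct values themselves, so counts.loc[35] gives the number of rows where Births equals 35.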
Using .loc and .iloc
The .loc indexer is label-based, which means we pass the label (name) of the rows or columns we want to select, written inside square brackets. When slicing with .loc, the last element of the range is included, and .loc also accepts boolean data. The .iloc indexer is position-based, which means data is extracted using integer positions, and the last element of a slice is excluded.
Code#15:
Output:
Explanation: In the above code example, the rows whose Births value is greater than or equal to 55 are filtered using .loc.
By using the & operator, we can filter the DataFrame on two different conditions at once. Each condition must be wrapped in parentheses. Let's see with the help of the below examples:
Code#16:
Output:
Explanation: In the above example, two different conditions are combined using the & operator, i.e. the rows whose Births value is greater than or equal to 55 and less than 70 are filtered using .loc.
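Both filters can be sketched as follows. The thresholds 55 and 70 come from the article, while the data itself is made up for illustration.

```python
import pandas as pd

# Illustrative data: ten consecutive dates with made-up birth counts.
df = pd.DataFrame({
    "Date": [f"1959-01-{day:02d}" for day in range(1, 11)],
    "Births": [35, 56, 30, 61, 44, 73, 55, 43, 68, 27],
})

# One condition: a boolean mask inside .loc[...] keeps the rows where it is True.
at_least_55 = df.loc[df["Births"] >= 55]

# Two conditions: wrap each one in parentheses and combine them with &.
from_55_to_69 = df.loc[(df["Births"] >= 55) & (df["Births"] < 70)]

print(at_least_55)
print(from_55_to_69)
```

The parentheses are needed because & binds more tightly than the comparison operators in Python.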
Code#17:
Output:
Explanation: Rows with labels 100 to 120 are extracted from the dataset. With .loc, the last value (120) is also included.
Code#18:
Output:
Explanation: In the above code example, rows are sliced from position 100 to 120 using .iloc. The row at position 120 is not included.
Code#19:
Output:
Explanation: In this example, data is selected as rows from 150 to 170 and columns from 0 to 1.
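The difference between the two slicing rules is easiest to see side by side. This sketch uses a made-up ten-row DataFrame, so the slice bounds are smaller than in the article's examples.

```python
import pandas as pd

# Illustrative data with the default integer index 0..9.
df = pd.DataFrame({
    "Date": [f"1959-01-{day:02d}" for day in range(1, 11)],
    "Births": [35, 32, 30, 31, 44, 29, 45, 43, 38, 27],
})

# .loc slices by label and includes the end label: rows 2, 3, 4, 5.
by_label = df.loc[2:5]

# .iloc slices by position and excludes the end position: rows 2, 3, 4.
by_position = df.iloc[2:5]

# .iloc with a row slice and a column slice: rows 2..4, first column only.
rows_and_cols = df.iloc[2:5, 0:1]

print(by_label)
print(by_position)
print(rows_and_cols)
```

With the default index, labels and positions coincide, which is why the only visible difference here is whether the end of the slice is included.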
Using .between()
By using the .between() function, data can be filtered to a specific range. By default, both ends of the range are included.
Code#20:
Output:
Explanation: In this example, data between values 30 and 50 is selected from the dataset.
Code#21:
Output:
Explanation: In the above code example, data that is not between 30 and 70 is selected; the ~ operator negates the condition here.
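A short sketch of both filters, using the ranges from the article on made-up data:

```python
import pandas as pd

# Illustrative data: ten consecutive dates with made-up birth counts.
df = pd.DataFrame({
    "Date": [f"1959-01-{day:02d}" for day in range(1, 11)],
    "Births": [35, 56, 30, 61, 44, 73, 50, 43, 68, 27],
})

# between(low, high) returns a boolean mask; both ends are inclusive by default.
in_range = df[df["Births"].between(30, 50)]

# ~ flips the mask, keeping the rows outside the range.
out_of_range = df[~df["Births"].between(30, 70)]

print(in_range)
print(out_of_range)
```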
Viewing Statistical Details in Pandas
There are various methods to get statistical details of the data.
Code#22:
Output:
Explanation: In the above example, the .describe() function gives the statistical details of the data like mean, count, standard deviation, etc.
Code#23:
Output:
Explanation: Here, all the statistical details of the data are changed to integer data type using the astype() function.
Code#24:
Output:
Explanation: By default, .describe() summarizes only the numeric columns. To include every column regardless of its data type, include='all' has to be passed to the .describe() function.
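The three describe() variants above can be sketched on a small made-up DataFrame with one text column and one numeric column:

```python
import pandas as pd

# Illustrative stand-in for the article's dataset.
df = pd.DataFrame({
    "Date": ["1959-01-01", "1959-01-02", "1959-01-03", "1959-01-04"],
    "Births": [35, 32, 30, 32],
})

# Default: only the numeric column is summarized.
numeric_only = df.describe()

# astype(int) truncates every statistic to a whole number.
as_integers = df.describe().astype(int)

# include="all" adds the non-numeric column, with rows such as unique, top and freq.
everything = df.describe(include="all")

print(numeric_only)
print(as_integers)
print(everything)
```

In the include="all" result, statistics that do not apply to a column (for example the mean of Date) are shown as NaN.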
Code#25:
Output:
Explanation: .count() is used to count the non-null values in the data.
Code#26:
Output:
Explanation: To calculate the mean of the data .mean() is used.
Code#27:
Output:
Explanation: The .median() function is used to get the median of the Births column of the data.
Code#28:
Output:
Explanation: To calculate the standard deviation of the Births column of the data .std() function is used.
Code#29:
Output:
Explanation: The .max() function is used to get the maximum value among the Births column of the dataset.
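The individual statistics above can be sketched together on ten made-up values. The second series, which contains a missing entry, is a hypothetical addition to show that count() skips nulls.

```python
import pandas as pd

# Ten made-up birth counts.
births = pd.Series([35, 32, 30, 31, 44, 29, 45, 43, 38, 27], name="Births")

n_values = births.count()      # number of non-null values
mean_births = births.mean()    # arithmetic mean
median_births = births.median()
std_births = births.std()      # sample standard deviation (ddof=1)
max_births = births.max()

print(n_values, mean_births, median_births, std_births, max_births)

# count() ignores missing values: only two of these three entries are counted.
with_gap = pd.Series([35, None, 30])
n_with_gap = with_gap.count()
print(n_with_gap)
```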
Conclusion
- There are various methods for viewing data in pandas.
- The head() function is used for viewing the starting rows of the data, and tail() is used to view the rows at the bottom. By default, each returns five rows.
- Using describe(), we can get a summary of the data, like the mean, standard deviation, percentiles, etc.
- .loc and .iloc are used for getting specific data from the dataset.