Tidyverse Package in R
Overview
In the R programming language, many packages are available to help us handle complex data and create visual representations of the data. Before using any package, we have to install and load the package in the R environment. For each data science project, we might be required to load multiple packages individually, which can be challenging for beginners. To address this issue, the tidyverse package makes it easy to install and load all the core packages with just a single line of code.
The tidyverse package bundles a collection of specialized R packages designed to simplify data tasks, helping both beginners and experienced analysts work faster and more effectively with data in R.
What is Tidyverse?
Before diving into the details of the tidyverse package, let us start by understanding what it is. The tidyverse is a collection of R packages developed by Hadley Wickham and his team. Its primary goal is to make data analysis easier. It follows the "tidy data" principles, where each variable has its own column, each row represents a single observation, and each cell holds a single value, which keeps the data clear and makes specific information quick to access. This approach helps us handle data easily, leading to more reproducible and collaborative research workflows.
Tidyverse Packages in R
Before exploring the packages that tidyverse provides, let's make sure it is installed. We can install the tidyverse package with the command install.packages("tidyverse").
After the package is installed, we can load it into the current R session with the command library(tidyverse).
With the tidyverse package installed and loaded in the R environment, we can now easily access all the packages it provides.
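The install-and-load step described above can be sketched as follows. The requireNamespace() check is an addition of this sketch, not part of the original text; it simply avoids reinstalling the package on every run.

```r
# Install tidyverse only if it is not already available (a one-time step).
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Attach the core tidyverse packages for the current R session.
library(tidyverse)

# List every package that belongs to the tidyverse, core or not.
tidyverse_packages()
```

Loading the package prints a short summary of the attached core packages and of any function names that now mask base R functions.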
Data Wrangling and Transformation
Data wrangling and transformation are essential components of data analysis: they cover cleaning, organizing, and transforming raw data into a structured format suitable for analysis. Here, we will discuss the dplyr, tidyr, stringr, and forcats packages, each designed to handle different data manipulation challenges.
dplyr
dplyr is a widely used R package for data manipulation. It offers a set of functions that simplify common data-wrangling tasks such as filtering, arranging, selecting, mutating, and summarizing data frames. This package is popular among data analysts and researchers because of its simple syntax and its ability to chain functions using the pipe operator (%>%).
To explore the key functions provided by dplyr, we will use the txhousing dataset, which ships with ggplot2 and is therefore available once the tidyverse is loaded.
Here we used the head() function to print the first few rows of a data frame, and by default, it prints the first six rows. It allows us to get a glimpse of the data and the columns present in the txhousing dataset.
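A minimal sketch of that first look at the data, assuming the tidyverse is installed. The dim() and names() calls are extras added here for orientation.

```r
library(tidyverse)

# Print the first six rows of the txhousing dataset (shipped with ggplot2).
head(txhousing)

# Extra orientation: number of rows and columns, and the column names.
dim(txhousing)
names(txhousing)
```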
- filter():
This function extracts specific rows from a data frame based on certain conditions.
Here we filtered the txhousing dataset using the filter() function by selecting only rows where the "city" column is equal to "Dallas". Also, we used the pipe operator (%>%) to chain the operations together. Then, we stored the filtered data in a new data frame called filtered_data and then displayed the first few rows of the filtered data using the head() function.
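The filtering step described above can be sketched like this, using the same object name (filtered_data) that the text mentions:

```r
library(tidyverse)

# Keep only the rows where the city column equals "Dallas".
filtered_data <- txhousing %>%
  filter(city == "Dallas")

# Show the first six rows of the filtered data.
head(filtered_data)
```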
- arrange():
This function sorts the rows of a data frame based on specified variables.
Here we used the arrange() function to sort the data based on a specific column, in this case, "year". Also, we used the desc() function inside the arrange() function to sort the "year" column in descending order. Then, we stored the sorted data in a new data frame called arranged_data and then displayed the first few rows of the new data frame using the head() function.
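A short sketch of the sorting step described above, again reusing the name arranged_data from the text:

```r
library(tidyverse)

# Sort the rows by year, newest first; desc() reverses the default ascending order.
arranged_data <- txhousing %>%
  arrange(desc(year))

# Show the first six rows of the sorted data.
head(arranged_data)
```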
- select():
This function picks specific columns from a data frame and drops the others.
Here, we used the select() function to select only two columns, "city" and "sales", from the original txhousing dataset. Then, we stored the selected data in a new data frame called selected_columns and displayed the first few rows of the new data frame using the head() function.
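The column selection described above might look like this:

```r
library(tidyverse)

# Keep only the city and sales columns.
selected_columns <- txhousing %>%
  select(city, sales)

# Show the first six rows of the reduced data frame.
head(selected_columns)
```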
- mutate():
This function creates new columns or modifies existing columns in a data frame.
Here, we created a new column called "new" in the txhousing dataset using the mutate() function. This new column's values are calculated by dividing the "sales" column by the "listings" column. Then, we stored the mutated data in a new data frame called mutated_data and displayed the first few rows using the head() function.
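A sketch of the mutate() step described above. Rows where sales or listings is missing will simply get NA in the new column.

```r
library(tidyverse)

# Add a column named "new" holding the ratio of sales to listings.
mutated_data <- txhousing %>%
  mutate(new = sales / listings)

# Show the first six rows, including the new column.
head(mutated_data)
```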
- summarise() and group_by():
The summarise() function aggregates data into a concise summary, and the group_by() function groups the data by specific variables so that operations are carried out per group.
Here, we grouped the data by the "city" column using the group_by() function. Then, we used the summarise() function (summarize() is an equivalent spelling) to calculate the mean value of the "sales" column for each city. Finally, we displayed the first six rows of the new data frame called summarized_data using the head() function.
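The grouped summary could be sketched as follows. Two details are assumptions of this sketch rather than something the text specifies: the output column is named mean_sales, and na.rm = TRUE is passed because the sales column contains missing values that would otherwise turn a city's mean into NA.

```r
library(tidyverse)

# One row per city with the average monthly sales.
# mean_sales and na.rm = TRUE are choices made for this sketch.
summarized_data <- txhousing %>%
  group_by(city) %>%
  summarise(mean_sales = mean(sales, na.rm = TRUE))

# Show the first six cities.
head(summarized_data)
```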
tidyr
tidyr is another essential R package that makes it easier to work with data by transforming and reshaping it. This package has different functions that let us change the way data is arranged, like switching between wide and long formats. It also allows us to split or combine data and handle missing data.
Before exploring five key functions of the tidyr package, let's create a simple data frame named "veggies". This dataset records the counts of different vegetables on two days.
With this "veggies" dataset, we will explore and use the five key functions in the tidyr package.
- gather():
This function converts data from a wide format to a long format.
Here we used the gather() function to gather the names of the vegetables into a new column called "veggie", and the corresponding value of the vegetable into a new column called "count".
- spread():
This function performs the opposite operation of the gather() function, i.e., it transforms data from long to wide format.
Here, the spread() function spreads the vegetables into separate columns, with the corresponding counts as their values.
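The round trip between wide and long format can be sketched as below. The contents of veggies (day names, vegetables, and counts) are hypothetical, since the original values are not given; gather_func is the name used later in the text, and spread_func is a name chosen for this sketch. In current tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(), so the equivalent calls are included for comparison.

```r
library(tidyverse)

# Hypothetical data: counts of three vegetables on two days (wide format).
veggies <- data.frame(
  day    = c("Monday", "Tuesday"),
  carrot = c(5, 8),
  tomato = c(3, 6),
  potato = c(7, 2)
)

# Wide to long: vegetable names go into "veggie", their values into "count".
gather_func <- veggies %>%
  gather(key = "veggie", value = "count", -day)
gather_func

# Long to wide: each vegetable becomes its own column again.
spread_func <- gather_func %>%
  spread(key = veggie, value = count)
spread_func

# Equivalent calls with the newer pivot functions.
long_data <- veggies %>%
  pivot_longer(cols = -day, names_to = "veggie", values_to = "count")
wide_data <- long_data %>%
  pivot_wider(names_from = veggie, values_from = count)
```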
- separate():
This function splits a single column into multiple columns based on a separator. Here, we first add a new column, 'color', to the gather_func data frame and then split it into two separate columns by specifying the sep argument as "-" inside the separate() function.
- unite():
This function concatenates multiple columns into a single column. Here, we combine the color and price columns produced by separate() into a single column.
- fill():
This function fills missing values in a column with the previous non-missing value. Here, we first replace the last two values of the count column in the gather_func data frame with NA and then fill them in using the fill() function.
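The separate(), unite(), and fill() steps can be sketched together as follows. All data values are hypothetical: gather_func is rebuilt here so the block runs on its own, and the packed "color-price" strings, the result names (separated, united, with_na, filled), and the "_" separator used by unite() are choices made for this sketch.

```r
library(tidyverse)

# Hypothetical long-format data standing in for gather_func.
gather_func <- data.frame(
  day    = rep(c("Monday", "Tuesday"), times = 3),
  veggie = rep(c("carrot", "tomato", "potato"), each = 2),
  count  = c(5, 8, 3, 6, 7, 2)
)

# Add a 'color' column that packs two values separated by "-".
gather_func$color <- c("orange-10", "orange-10", "red-20",
                       "red-20", "brown-15", "brown-15")

# separate(): split 'color' into 'color' and 'price' at the "-".
separated <- gather_func %>%
  separate(color, into = c("color", "price"), sep = "-")

# unite(): join 'color' and 'price' back into one column.
united <- separated %>%
  unite("color_price", color, price, sep = "_")

# fill(): blank out the last two counts, then carry the last
# non-missing value downward into the gaps.
with_na <- gather_func
with_na$count[5:6] <- NA
filled <- with_na %>%
  fill(count)
filled
```

Note that separate() returns the new price column as character unless convert = TRUE is supplied.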
stringr
stringr is widely used for the manipulation of character strings and text data. It offers a collection of functions that simplify tasks like pattern matching, substring extraction, and text replacement.
Let's create a sample text and then explore five key functions in the stringr package.
- str_detect():
This function checks if a pattern is present in the text.
- str_replace():
This function replaces the first occurrence of a pattern with a replacement string.
- str_sub():
This function extracts substrings based on character position in the text.
- str_split():
This function splits the text into multiple substrings using a delimiter.
- str_count():
This function counts the occurrences of a pattern in the text.
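The five functions above can be tried on a small example. The sentence stored in sample_text is a hypothetical stand-in, since the original sample text is not given.

```r
library(tidyverse)

# Hypothetical sample text for the examples.
sample_text <- "The quick brown fox jumps over the lazy dog"

# Is the pattern "fox" present?
has_fox <- str_detect(sample_text, "fox")

# Replace the first occurrence of "fox" with "cat".
replaced <- str_replace(sample_text, "fox", "cat")

# Extract characters 5 to 9.
piece <- str_sub(sample_text, 5, 9)

# Split on spaces; the result is a list with one character vector.
words <- str_split(sample_text, " ")

# Count how many times the letter "o" appears.
o_count <- str_count(sample_text, "o")

has_fox
replaced
piece
words
o_count
```

To replace every match rather than only the first, stringr also provides str_replace_all().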
forcats
forcats is an essential R package specifically designed to work with categorical (factor) variables. This package offers various functions that help efficiently reorder, expand, and manage factor levels.
Now, let's explore the key functions provided by forcats. For this, we will create a dataset with random values, containing a factor column x and a numeric column y.
- fct_count():
This function takes a factor as input and returns a data frame with two columns: the levels of the factor and their respective counts.
The fct_count() function provides the x entries with their respective counts in the above example.
- fct_reorder():
This function reorders the levels of a factor based on the values of another variable.
In the above example, column x has been reordered based on the median value of y for each level (i.e., median for each factor a to e).
- fct_infreq():
This function reorders the levels of a factor based on their frequency, with the most frequent level first.
With the fct_infreq() function, the levels of x are reordered by how often each letter from a to e occurs in the data, so the counts in column 'n' appear in descending order.
- fct_lump():
This function lumps the least (or most) frequent levels of a factor together into a single level named "Other".
In the example provided, the least frequent levels of the x factor are combined into a single level, "Other", leaving only the two most frequent levels unchanged since we selected n=2.
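The four forcats functions can be sketched on a small data frame. The data here is hypothetical: to keep the results predictable, the letter frequencies in x are fixed (a = 8, b = 3, c = 6, d = 2, e = 1) and only y is random, with a seed set for reproducibility. The names df, by_median, by_freq, and lumped are choices made for this sketch.

```r
library(tidyverse)

set.seed(42)

# Hypothetical data: a factor with known level frequencies and random values.
df <- tibble(
  x = factor(rep(letters[1:5], times = c(8, 3, 6, 2, 1))),
  y = runif(20)
)

# fct_count(): levels (column f) and how often each occurs (column n).
fct_count(df$x)

# fct_reorder(): order the levels of x by the median of y within each level.
by_median <- fct_reorder(df$x, df$y, .fun = median)
levels(by_median)

# fct_infreq(): order the levels by frequency, most frequent first.
by_freq <- fct_infreq(df$x)
fct_count(by_freq)

# fct_lump(): keep the two most frequent levels, lump the rest into "Other".
lumped <- fct_lump(df$x, n = 2)
fct_count(lumped)
```

Recent forcats versions also offer more specific variants such as fct_lump_n(), which states the lumping rule in the function name.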
Data Import and Management
In this section, we'll look at the data import and management packages tibble and readr. These packages play an important role in importing data into R effectively and ensuring that the data is well-structured and ready for analysis.
tibble
tibble provides a data frame class that is an improved and modernized version of R's standard data frame. Its main purpose is to make working with data simpler and more consistent. Tibbles have several advantages over regular data frames: they never convert strings into factors (base R data frames did so by default before R 4.0.0), and when printed they display only the first ten rows and the columns that fit on the screen, preventing overwhelming output.
We can create a regular data frame with the data.frame() function and the equivalent tibble with the tibble() function. We can even create a tibble from an existing object, such as a data frame, with the as_tibble() function.
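A minimal sketch of the three approaches, using hypothetical names and scores as data:

```r
library(tidyverse)

# A regular base R data frame (hypothetical example data).
df <- data.frame(
  name  = c("Asha", "Ben", "Chen"),
  score = c(90, 85, 77)
)
df

# The same data created directly as a tibble.
tb <- tibble(
  name  = c("Asha", "Ben", "Chen"),
  score = c(90, 85, 77)
)
tb

# Convert the existing data frame into a tibble.
tb_from_df <- as_tibble(df)
tb_from_df

# Compare the classes of the two kinds of object.
class(df)
class(tb)
```

Printing a tibble also shows its dimensions and the type of each column, which a plain data frame does not.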
readr
readr is a powerful R package that allows reading structured data quickly and efficiently from various file formats, including CSV, TSV, and fixed-width files. It aims to provide a reliable and consistent approach to importing data into R, automatically detecting column types, and dealing with missing values.
Let's discuss some key functions associated with readr:
- read_csv():
This function reads one of the most common data file formats, comma-separated values (CSV) files.
- read_tsv():
This function is designed to read tab-separated values (TSV) files, which are similar to CSV files, but tabs instead of commas separate the values.
- read_csv2():
This function is specifically used for reading semicolon-separated values, where a comma represents the decimal mark.
- read_delim():
This function allows us to read delimited files, where the delimiter can be customized.
- read_fwf():
This function is used for reading fixed-width files where the data values are arranged in columns with a predefined width for each column.
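Several of these readers can be tried without any external data by writing small temporary files first. The file contents and object names below are hypothetical, and col_types = cols() is passed only to silence the column-type message; read_fwf() is left out to keep the sketch short.

```r
library(tidyverse)

# Comma-separated file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,sales", "Dallas,120", "Austin,95"), csv_path)
csv_data <- read_csv(csv_path, col_types = cols())

# Tab-separated file.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("city\tsales", "Dallas\t120", "Austin\t95"), tsv_path)
tsv_data <- read_tsv(tsv_path, col_types = cols())

# Semicolon-separated file with a comma as the decimal mark.
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("city;price", "Dallas;1,5", "Austin;2,25"), csv2_path)
csv2_data <- read_csv2(csv2_path, col_types = cols())

# File with a custom delimiter.
delim_path <- tempfile(fileext = ".txt")
writeLines(c("city|sales", "Dallas|120", "Austin|95"), delim_path)
delim_data <- read_delim(delim_path, delim = "|", col_types = cols())

csv_data
csv2_data
```

Each reader returns a tibble, with column types guessed from the file contents.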
Functional Programming
Functional programming is a style of writing code that focuses on using functions to perform tasks and avoids changing existing data as much as possible. In this section, we will discuss the next tidyverse package, purrr.
purrr
The purrr package provides an efficient set of functions that help us to work with lists and vectors, making it easier to perform various operations and transformations on data. Here, we'll explore the top key functions of purrr.
Let's first define a vector of numbers to use in the examples.
- map():
This function applies a given function to each element of a list or a vector and returns a new list with the results.
Here we used the map() function to calculate the square of each element in the numbers vector.
- map2():
This function can be used when working with two vectors simultaneously and performing operations on corresponding elements.
Here we used the map2() function to compute the sum of elements from two vectors, n1 and n2.
- pmap():
This function generalizes map2() to any number of vectors, which are passed together as a list and processed in parallel.
Here we used the pmap() function to concatenate the text present in two different vectors.
- reduce():
This function cumulatively applies a binary function to the elements of a list or vector, combining them into a single value.
Here we used the reduce() function to calculate the total sum of the numbers vector.
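The four purrr examples described above can be sketched together. The vector names numbers, n1, and n2 come from the text, but their values are hypothetical, as are the two text vectors (first_words, second_words) and the result names.

```r
library(tidyverse)

# Hypothetical input vectors.
numbers <- c(1, 2, 3, 4, 5)
n1 <- c(1, 2, 3)
n2 <- c(10, 20, 30)
first_words  <- c("data", "tidy")
second_words <- c("science", "verse")

# map(): square each element; the result is a list.
squares <- map(numbers, ~ .x^2)

# map2(): add corresponding elements of two vectors.
sums <- map2(n1, n2, ~ .x + .y)

# pmap(): paste corresponding elements of the vectors in a list.
joined <- pmap(list(first_words, second_words), function(a, b) paste(a, b))

# reduce(): fold the vector into a single total with `+`.
total <- reduce(numbers, `+`)

squares
sums
joined
total
```

When a plain vector is more convenient than a list, typed variants such as map_dbl() and map_chr() return an atomic vector of the stated type.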
Data Visualization and Exploration
Data Visualization and Exploration involves creating graphical representations of data to understand patterns, trends, and insights better.
ggplot2
In R, ggplot2 is a widely used package for data visualization. It follows the grammar of graphics approach, with which users can build a wide variety of plots layer by layer and easily customize them.
Let's create a simple bar chart.
Here, we used the geom_bar() function to create the bar chart using the ChickWeight dataset with the Diet variable on the x-axis. Also, we used the fill aesthetic to color the bars based on the diet categories.
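The bar chart described above can be sketched as follows. The title and axis labels are additions of this sketch.

```r
library(tidyverse)

# Bar chart of the number of observations per diet in ChickWeight,
# with each bar filled according to its diet category.
p <- ggplot(ChickWeight, aes(x = Diet, fill = Diet)) +
  geom_bar() +
  labs(
    title = "Observations per diet",   # label text chosen for this sketch
    x = "Diet",
    y = "Number of observations"
  )

print(p)
```

Because geom_bar() counts rows by default, no y aesthetic is needed; the bar heights are the number of rows for each diet.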
Some More Tidyverse Packages
Several other useful packages are part of the Tidyverse ecosystem, which complement the core Tidyverse functionality and further enhance data manipulation, visualization, and analysis capabilities in R.
- lubridate:
This package simplifies working with dates and times in R. It provides easy functions to parse, manipulate, and format date-time objects. Since tidyverse 2.0.0 it is attached together with the core packages.
- readxl:
While readr focuses on general data import, readxl is specialized for reading Microsoft Excel files (.xls and .xlsx). It allows us to read data from Excel spreadsheets into R data frames efficiently. It is installed with the tidyverse but has to be loaded separately.
- haven:
This package makes it easier to import and export data from other statistical software formats, including SAS, SPSS, and Stata. Like readxl, it is installed with the tidyverse but loaded separately.
- tidymodels:
It is a separate collection of packages, installed on its own, that focuses on machine learning and modeling tasks in a tidy, consistent framework.
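As a small illustration of the first package in this list, here is a sketch of parsing and manipulating a date with lubridate. The dates themselves are arbitrary examples.

```r
library(lubridate)

# Parse dates written in two different orders.
d1 <- ymd("2023-07-15")
d2 <- dmy("15/07/2023")

# Extract components of the date.
year(d1)
month(d1)
day(d1)

# Date arithmetic: move 30 days forward.
later <- d1 + days(30)
later

# Format the date for display.
format(d1, "%d/%m/%Y")
```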
Conclusion
- The tidyverse package offers a collection of popular R packages to facilitate data manipulation, visualization, and analysis, adhering to tidy data principles.
- The data wrangling and transformation packages dplyr and tidyr let us filter, summarize, and reshape data easily.
- The stringr package offers functions for working with character data, whereas the forcats package helps us handle categorical variables (factors) efficiently.
- The data import and management packages tibble and readr provide a modern data frame and fast data import. The purrr package adds functional programming tools for working with lists and vectors.
- ggplot2 is a powerful tool for creating elegant and customizable data visualizations.
- Additional packages, such as lubridate, readxl, haven, tidymodels, and more, are available to extend the tidyverse's capabilities and enhance data analysis workflows.