mutate() Function in R


Overview

The mutate() function in the dplyr package creates new columns in a data frame or modifies existing ones. Each column is defined by an expression, such as an arithmetic operation, a function call, or a condition, that is evaluated against the columns already in the data frame. The result keeps every row of the input and, by default, all of its other columns, so mutate() fits naturally into dplyr data manipulation pipelines.

Syntax

Here's the syntax of the mutate() function in R:
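
The original listing is not reproduced here. As a minimal sketch, the general form is mutate(.data, ...), where each argument after the data is a name = expression pair. The names df, x, and y below are hypothetical and used only for illustration.

```r
library(dplyr)

# General form: mutate(.data, name = expression, ...)
# `df`, `x`, and `y` are hypothetical names, not from the original article.
df <- data.frame(x = 1:3)

# Add a column `y` computed from the existing column `x`
result <- mutate(df, y = x * 2)
print(result)
```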

Parameters

Here are the parameters of the mutate() function in R explained in detail:

  • .data:
    This parameter is the input data that you want to modify. It can be a data frame, a data frame extension such as a tibble, or a lazy data frame (for example, from dbplyr or dtplyr).
  • ...:
    The ellipsis (...) accepts any number of name-value pairs separated by commas. The name is the name of the column in the output, and the value is an expression that computes it. A new name adds a column, an existing name overwrites that column, and a value of NULL removes it. Expressions can use existing columns, including columns created earlier in the same mutate() call.
  • .by, .keep, .before, .after:
    mutate() has no across_cols parameter. To apply one transformation to several columns, call across() inside "...", as described in the section on applying mutate() across multiple columns. The data frame method does accept a few optional arguments: .by groups the data for this one operation (dplyr 1.1.0 and later), .keep controls which columns are retained in the output ("all" by default, or "used", "unused", or "none"), and .before and .after control where new columns are placed.

Return Value

The mutate() function in R returns a modified version of the input data frame or tibble with new columns added or existing columns modified according to the specified transformations. It maintains the original row order and structure of the input data frame, and the modifications are performed based on the expressions provided within the function.
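
As a small sketch of this behavior, the example below shows that mutate() returns a new data frame and leaves its input unchanged unless the result is assigned back. The names original, updated, id, value, and value_sq are hypothetical.

```r
library(dplyr)

# Hypothetical data for illustration
original <- data.frame(id = 1:3, value = c(10, 20, 30))

# mutate() returns a new data frame; `original` itself is not modified
updated <- mutate(original, value_sq = value^2)

names(original)  # "id" "value"
names(updated)   # "id" "value" "value_sq"
```

To keep the change, assign the result to a name, which may be the same name as the input.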

How to Use mutate() in R?

Using the mutate() function in R allows you to modify or add columns to a data frame in a flexible and organized manner. This function is part of the dplyr package, a widely used package for data manipulation. Here's a detailed guide on how to use mutate() effectively:

  1. Installing and Loading Required Packages:

    Before using mutate(), make sure you have the dplyr package installed. You can install it using install.packages("dplyr"). Then, load the package using library(dplyr).

  2. Creating a Sample Data Frame:

    For demonstration purposes, let's create a sample data frame that we'll work with:
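
The original data is not shown here, so the following is a minimal sketch. The column names name, age, and salary come from the later steps; the data frame name df and all values are assumptions.

```r
library(dplyr)

# Hypothetical sample data; only the column names are taken from the text
df <- data.frame(
  name   = c("Alice", "Bob", "Carol"),
  age    = c(25, 32, 41),
  salary = c(50000, 62000, 58000)
)

print(df)
```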

  3. Adding a New Column:

    To add a new column to the data frame, you use the mutate() function. Let's say we want to calculate the annual bonus based on salary. We'll add a new column named "bonus" to the data frame:

    In this example, a new column "bonus" is created by multiplying the "salary" column by 0.1. The resulting data frame includes the original columns "name", "age", "salary", and the newly added "bonus" column.
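
A minimal sketch of this step, assuming the hypothetical sample values below (the names df and df_bonus are also assumptions):

```r
library(dplyr)

# Hypothetical sample data
df <- data.frame(
  name   = c("Alice", "Bob", "Carol"),
  age    = c(25, 32, 41),
  salary = c(50000, 62000, 58000)
)

# Add a `bonus` column equal to 10% of `salary`
df_bonus <- mutate(df, bonus = salary * 0.1)
print(df_bonus)
```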

  4. Modifying Existing Columns:

    You can also modify existing columns using mutate(). Let's say we want to increase the age of each individual by 2 years:

    This code modifies the "age" column by adding 2 to each value. The updated data frame now reflects the changes in the "age" column.
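
As a sketch under the same assumptions (hypothetical data, with df and df_older as illustrative names), reusing an existing column name on the left-hand side overwrites that column in the result:

```r
library(dplyr)

# Hypothetical sample data
df <- data.frame(
  name   = c("Alice", "Bob", "Carol"),
  age    = c(25, 32, 41),
  salary = c(50000, 62000, 58000)
)

# Overwrite `age` with age + 2; no new column is created
df_older <- mutate(df, age = age + 2)
print(df_older)
```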

  5. Applying Functions:

    You can apply functions within mutate() to create or modify columns. For example, let's create a new column "category" based on age groups:

    Here, the ifelse() function assigns "Young" to individuals with age less than 30 and "Adult" to those 30 and above.
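
A minimal sketch of this step, assuming hypothetical ages chosen so that both categories appear (df and df_cat are illustrative names):

```r
library(dplyr)

# Hypothetical sample data
df <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age  = c(25, 32, 41)
)

# ifelse() is vectorised: it tests every row and picks a label for each
df_cat <- mutate(df, category = ifelse(age < 30, "Young", "Adult"))
print(df_cat)
```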

  6. Using across() with mutate():

    The across() function enhances the power of mutate() by allowing you to apply transformations to multiple columns simultaneously. Let's double the "salary" and "bonus" columns:

    This code doubles the values in both the "salary" and "bonus" columns using the across() function.
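
One way to sketch this step, assuming hypothetical data and that bonus was first added as 10% of salary (df and df_doubled are illustrative names):

```r
library(dplyr)

# Hypothetical sample data
df <- data.frame(
  name   = c("Alice", "Bob", "Carol"),
  salary = c(50000, 62000, 58000)
)

df_doubled <- df %>%
  mutate(bonus = salary * 0.1) %>%
  # Without .names, across() overwrites the selected columns in place
  mutate(across(c(salary, bonus), ~ .x * 2))

print(df_doubled)
```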

  7. Renaming New Columns:

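    When across() is used inside mutate(), its .names argument controls how the output columns are named, using a pattern such as "{.col}_months" in which {.col} stands for the original column name. Without .names, across() overwrites the selected columns. For instance, let's add a column that represents age in months: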

    This creates a new column named "age_months" by multiplying the "age" column by 12.
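
A minimal sketch of this step, assuming hypothetical data and the naming pattern "{.col}_months" (the original call is not shown, so the exact pattern is an assumption):

```r
library(dplyr)

# Hypothetical sample data
df <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age  = c(25, 32, 41)
)

# {.col} is replaced by the selected column's name, giving "age_months";
# the original `age` column is kept unchanged
df_months <- mutate(df, across(age, ~ .x * 12, .names = "{.col}_months"))
print(df_months)
```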

  8. Chaining with Other Functions:

    One of the strengths of dplyr is its ability to chain functions. You can use the pipe operator %>% to chain multiple data manipulation steps together. For example:

    This code achieves the same transformations as before but in a more concise and readable way.
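
As a sketch, assuming "the same transformations" refers to the bonus, age, and category steps above, and using hypothetical data (df and df_final are illustrative names):

```r
library(dplyr)

# Hypothetical sample data
df <- data.frame(
  name   = c("Alice", "Bob", "Carol"),
  age    = c(25, 32, 41),
  salary = c(50000, 62000, 58000)
)

df_final <- df %>%
  mutate(bonus = salary * 0.1) %>%   # add a new column
  mutate(age = age + 2) %>%          # overwrite an existing column
  mutate(category = ifelse(age < 30, "Young", "Adult"))  # uses the updated age

print(df_final)
```

Because each step sees the output of the previous one, category here is based on the age after 2 has been added.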

Applying Mutate Across Multiple Columns

Applying the mutate() function across multiple columns in R using the dplyr package can be achieved using the across() function. The across() function enables you to simultaneously apply transformations to a selection of columns within a data frame. This is particularly useful for situations where you want to perform the same operation on multiple columns. Here's how you can use mutate() along with across() to achieve this:

In this example, the mutate() function is used along with across() to double the values of columns x, y, and z. Here's what's happening:

  • The %>% operator (pipe) is used to pass the data frame data into the first function, which is mutate().
  • Inside mutate(), across(c(x, y, z), ...) specifies that the following transformation should be applied across columns x, y, and z.
  • The ~ . * 2 part represents the transformation to double the values in each selected column.
  • The .names = "double_{.col}" argument specifies that new column names should be generated by prepending "double_" to the original column names.
  • The resulting data frame, result, will contain the original columns x, y, and z, along with the new columns double_x, double_y, and double_z containing the doubled values.
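
The call described above can be sketched as follows. The column names, the transformation, and the .names pattern come from the explanation; the values in data are hypothetical.

```r
library(dplyr)

# Hypothetical values; the column names x, y, z follow the explanation
data <- data.frame(x = 1:3, y = 4:6, z = 7:9)

result <- data %>%
  mutate(across(c(x, y, z), ~ . * 2, .names = "double_{.col}"))

print(result)
```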

Conditional Mutations

Conditional mutations in R using the dplyr package allow you to modify data frame columns based on specified conditions. This is helpful when you want to transform or update values in response to specific criteria. The mutate() function, in combination with conditional statements, enables you to achieve these transformations effectively.

Here's how to perform conditional mutations in R:

  1. Sample Data Frame Setup:

    First, create a sample data frame to work with:
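
The original data is not shown here, so this is a minimal sketch. The score column is named in the later steps; the data frame name scores, the name column, and all values are assumptions.

```r
library(dplyr)

# Hypothetical sample data for the conditional examples
scores <- data.frame(
  name  = c("Alice", "Bob", "Carol", "Dave"),
  score = c(85, 62, 91, 70)
)

print(scores)
```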

  2. Basic Conditional Mutation:

    Let's say you want to add a column that categorizes individuals based on their scores as "Pass" or "Fail". You can use the ifelse() function within the mutate() function:

    In this example, the result column is created. If the score is greater than or equal to 70, the corresponding value in the result column will be "Pass"; otherwise, it will be "Fail".
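
A minimal sketch of this step, using the threshold of 70 from the explanation and hypothetical data (scores and scored are illustrative names):

```r
library(dplyr)

# Hypothetical sample data
scores <- data.frame(
  name  = c("Alice", "Bob", "Carol", "Dave"),
  score = c(85, 62, 91, 70)
)

# A score of exactly 70 counts as a pass because the test is >=
scored <- mutate(scores, result = ifelse(score >= 70, "Pass", "Fail"))
print(scored)
```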

  3. Multiple Conditions:

    You can have more complex conditions by using logical operators like & (AND) and | (OR). For instance, let's create a new column indicating whether someone passed with distinction:
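
The criteria for a distinction are not stated here, so the sketch below makes them up: it assumes a hypothetical attendance column and treats a score of at least 85 combined with attendance of at least 90 as a distinction. Only the use of & inside the condition is the point of the example.

```r
library(dplyr)

# Hypothetical data; `attendance` and both thresholds are assumptions
scores <- data.frame(
  name       = c("Alice", "Bob", "Carol", "Dave"),
  score      = c(85, 62, 91, 70),
  attendance = c(95, 98, 80, 92)
)

# Both conditions must hold (&); use | if either one should be enough
with_distinction <- mutate(
  scores,
  distinction = ifelse(score >= 85 & attendance >= 90, "Yes", "No")
)
print(with_distinction)
```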

  4. Combining Mutations with case_when():

    The case_when() function is a more versatile alternative to ifelse() when you have multiple conditions. Suppose you want to categorize individuals into different grade levels based on their scores:

    In this example, the grade column is created based on different score ranges.
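
As a sketch of this step, the grade boundaries below (90, 80, 70) and the letter labels are assumptions, since the original ranges are not shown. The data is hypothetical.

```r
library(dplyr)

# Hypothetical sample data
scores <- data.frame(
  name  = c("Alice", "Bob", "Carol", "Dave"),
  score = c(85, 62, 91, 70)
)

# case_when() checks conditions in order and uses the first one that is TRUE;
# the final TRUE ~ "F" acts as the fallback for everything else
graded <- mutate(
  scores,
  grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    score >= 70 ~ "C",
    TRUE        ~ "F"
  )
)
print(graded)
```

Because the conditions are evaluated in order, they should be listed from most to least specific.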

  5. Updating Existing Columns Conditionally:

    You can also modify existing columns based on conditions. For instance, let's give a bonus to individuals who scored above 80:

    This code increases the score by 10 for individuals who scored above 80.
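
A minimal sketch of this step, using the threshold and increment from the explanation and hypothetical data (scores and boosted are illustrative names):

```r
library(dplyr)

# Hypothetical sample data
scores <- data.frame(
  name  = c("Alice", "Bob", "Carol", "Dave"),
  score = c(85, 62, 91, 70)
)

# Rows that fail the condition keep their original score
boosted <- mutate(scores, score = ifelse(score > 80, score + 10, score))
print(boosted)
```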

  6. Handling Missing Values:

    Conditional mutations can handle missing values too. You can use functions like is.na() to create conditions based on missing data:

    Here, individuals with missing scores are categorized as "Missing".
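
One way to sketch this step is shown below. The "Missing" label comes from the explanation; the data, the status column name, and the "Pass"/"Fail" labels for non-missing scores are assumptions.

```r
library(dplyr)

# Hypothetical data with one missing score
scores <- data.frame(
  name  = c("Alice", "Bob", "Carol", "Dave"),
  score = c(85, NA, 91, 62)
)

# Test is.na() first: comparisons such as NA >= 70 evaluate to NA, not FALSE
checked <- mutate(
  scores,
  status = case_when(
    is.na(score) ~ "Missing",
    score >= 70  ~ "Pass",
    TRUE         ~ "Fail"
  )
)
print(checked)
```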

Examples

Example - 1: Adding a Derived Column

In this example, we'll calculate a derived column representing the BMI (Body Mass Index) of individuals based on their weight and height.

Explanation:

  • We start by creating a data frame called data with columns for name, weight, and height.
  • Using the mutate() function with the pipe operator %>%, we add a new column named bmi to the data frame. This column is calculated by dividing the weight by the square of the height.
  • The resulting data frame, after applying the mutation, contains the original columns (name, weight, height) and the newly added bmi column representing the BMI of each individual.
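
The steps above can be sketched as follows. The column names follow the explanation; the values are hypothetical, and the sketch assumes weight is in kilograms and height is in metres.

```r
library(dplyr)

# Hypothetical values; weight in kg and height in m are assumptions
data <- data.frame(
  name   = c("Alice", "Bob", "Carol"),
  weight = c(70, 85, 60),
  height = c(1.75, 1.80, 1.65)
)

# BMI = weight / height squared
result <- data %>%
  mutate(bmi = weight / height^2)

print(result)
```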

Example - 2: Conditional Mutation with ifelse()

In this example, we'll add a column indicating whether a student passed or failed a course based on their exam scores.

Explanation:

  • We create a data frame named students containing columns for student name, math_score, and english_score.
  • Using the mutate() function along with ifelse(), we add a new column named pass_status. The condition in the ifelse() function checks if both the math_score and english_score are greater than or equal to 70. If true, the student is marked as "Pass"; otherwise, they are marked as "Fail".
  • The output data frame includes the original columns (name, math_score, english_score) and the newly added pass_status column, indicating whether each student passed or failed.
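
The steps above can be sketched as follows. The column names and the threshold of 70 follow the explanation; the values are hypothetical.

```r
library(dplyr)

# Hypothetical values; column names follow the explanation
students <- data.frame(
  name          = c("Alice", "Bob", "Carol"),
  math_score    = c(85, 65, 90),
  english_score = c(78, 80, 60)
)

# A student passes only if both scores are at least 70
students_status <- students %>%
  mutate(pass_status = ifelse(math_score >= 70 & english_score >= 70,
                              "Pass", "Fail"))

print(students_status)
```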

Example - 3: Scaling Values

In this example, we'll normalize the values in a column to be between 0 and 1 using the mutate() function.

Explanation:

  • We have a data frame named scores containing student names and their test_score.
  • By utilizing the mutate() function, we add a new column named normalized_score. This column is created by scaling the test_score values between 0 and 1. The formula used (test_score - min(test_score)) / (max(test_score) - min(test_score)) normalizes the values.
  • The output data frame includes the original columns (student, test_score) and the newly added normalized_score column with scaled values.
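
The steps above can be sketched as follows. The column names and the formula follow the explanation; the values are hypothetical. Note that the formula divides by zero if every test_score is identical.

```r
library(dplyr)

# Hypothetical values; column names follow the explanation
scores <- data.frame(
  student    = c("Alice", "Bob", "Carol"),
  test_score = c(50, 75, 100)
)

# min() and max() are computed over the whole column, then applied per row
normalized <- scores %>%
  mutate(normalized_score = (test_score - min(test_score)) /
                            (max(test_score) - min(test_score)))

print(normalized)
```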

Example - 4: Categorizing Data

In this example, we'll categorize values in a column into different buckets using the mutate() function.

Explanation:

  • We define a data frame called people with name and age columns.
  • Using the mutate() function along with case_when(), we add a new column named age_group. This column categorizes individuals into age groups based on their ages. The case_when() function allows for multiple conditions.
  • The output data frame displays the original columns (name, age) and the newly added age_group column, indicating the respective age groups for each individual.
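
The steps above can be sketched as follows. The column names follow the explanation; the values, the age cut-offs (18 and 65), and the group labels are assumptions, since the original buckets are not shown.

```r
library(dplyr)

# Hypothetical values; cut-offs and labels are assumptions
people <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age  = c(12, 34, 70)
)

grouped <- people %>%
  mutate(age_group = case_when(
    age < 18 ~ "Child",
    age < 65 ~ "Adult",   # reached only when age >= 18
    TRUE     ~ "Senior"
  ))

print(grouped)
```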

Conclusion

  • Data Transformation:
    The mutate() function in R, part of the dplyr package, is a powerful tool for data manipulation and transformation within data frames or tibbles.
  • Column Addition:
    It allows you to add new columns to a data frame based on calculations, operations, or conditions, without altering the original data.
  • Existing Columns:
    Columns that are not named in the call are left unchanged. Assigning to an existing column name overwrites that column in the result, while the input data frame itself is not modified.
  • Conditional Transformations:
    You can perform conditional transformations using functions such as ifelse() and case_when(), combined with logical operators like & and |, creating columns that adapt to specific criteria.
  • Multiple Columns:
    The across() function enhances mutate() by enabling simultaneous transformations across multiple columns, streamlining complex operations.
  • Efficient Pipelines:
    By combining mutate() with the pipe operator %>%, you can create concise and readable data manipulation pipelines, improving code organization and efficiency.
  • Flexibility:
    You can apply arithmetic operations, functions, or custom calculations to create columns that reflect new insights derived from the existing data.