Chapter 4 - Tibbles and dplyr

Author

Government Analysis Function and ONS Data Science Campus

1 Learning Objectives

  • Understand the importance of clean variable names.
  • Be able to clean column names using the janitor package.
  • Understand the use of the pipe operator.
  • Be able to sort data with dplyr’s arrange verb.
  • Be able to select data with dplyr’s select verb.
  • Be able to filter data with dplyr’s filter verb.
  • Be able to transform data with dplyr’s mutate verb.
  • Be able to join datasets together.

2 Packages and Data

Remember, the first steps when starting a new script are:

  • Load in the packages required for the work.
  • Read in datasets required and assign them to a variable in memory.

2.1 Exercise

  1. Load the following packages:
  • tidyverse
  • janitor
  2. Read in the titanic.csv file and assign it to the name “titanic_data”. Remember to assign null values properly (as in Chapter 3) using the “na” parameter.
  • Remember that you are in your R project, which sets the working directory to be inside the Course_content folder.
  3. Have a glimpse of your dataset.

# Load packages

library(tidyverse)
library(janitor)


# Read in titanic.csv and set null values to be specific symbols

titanic_data <- read_csv("Data/titanic.csv", 
                           na = c("*", ".", "", "NULL"))

# Have a peek

glimpse(titanic_data)

As a reminder, in the titanic dataset our columns are:

  • Pclass: Passenger’s class, 1 = 1st (Upper), 2 = 2nd(Middle), 3 = 3rd(Lower)
  • Survived: Survived (1) or died (0)
  • Name: Passenger’s name
  • Sex: Passenger’s sex
  • Age of Passenger: Passenger’s age
  • SibSp: Number of siblings/spouses aboard (excluding the person)
  • Parch: Number of parents/children aboard (excluding the person)
  • Ticket: Ticket number
  • Fare: Fare
  • Cabin: Cabin number
  • Embarked: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

We can see more details in the Data Dictionary.

3 Column Names

In the previous session we stated that every column in a tibble is a variable, and that it is good practice not to have spaces within variable names, as spaces make it harder for us to call on the variables when we need to use them.

When you enter data in Excel, you most often don’t think too much about what you call each column. After all, you just label them once and as long as they are documented, this isn’t given too much thought.

When you are working with variables in R though, you need to type the name of each variable, every time you want to work with it. So, it makes sense to make your column names as simple, but meaningful as possible.

3.1 Returning columns by name

In base R, to call a column by name and return its contents as a single vector (remember, each column in a tibble is a vector) we use the dollar sign $ operator.

You will notice the list of column names will pop up and you can move through them with arrow keys and select the one you want.

3.1.1 Example

Let’s return the column “Pclass”.

# Return Pclass with base R

titanic_data$Pclass

This returns the entire vector (usually up to 1000 entries) so it would be useful to use glimpse() or other inspection functions for a sense check.

# Return Pclass and glimpse

glimpse(titanic_data$Pclass)

This can already prove frustrating, as we need to remember the capital letters, particularly if the autocomplete is slow to appear.

However, it can get worse if spaces are included in the column name.

3.1.2 Example - Returning column with spaces

Let’s take the “name Of Passenger” column and try to return it without the auto-complete.

titanic_data$name Of Passenger

This will throw an error, as spaces are not allowed in names written this way; R treats a space as the end of a name or expression, so it cannot parse the code.

To get around this we enclose column names with spaces in backticks ` ` and you will notice that the autocomplete does the same.

This allows the entire column name to be read as one entity.

# Selecting a column with spaces in the names

glimpse(titanic_data$`name Of Passenger`)

Whilst this works, it is bad practice to use capitalisation and spaces, as it complicates things for us as well as others we collaborate with.

3.1.3 A word of warning

Whilst this is completely fine in Base R, there are unexpected consequences of using this technique, particularly when making changes to a column using assignment.

  • With the tidyverse, when we use its myriad of functions, we can check manipulations on the data without overwriting the underlying variable unless we explicitly ask it to do so with <-.

  • However, with the $ method, we can overwrite a column for example and this will implicitly and permanently change the underlying variable, which we must be careful of.

Should you make a mistake this way, you would have to go back to the original data and read it in again to start from scratch, as recreating data is a nightmare.
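As a cautionary sketch (commented out so it is not run by accident), a single assignment like the one below would silently and permanently overwrite a column in memory, with no preview and no undo:

# Warning - illustration only, this would permanently overwrite the column

# titanic_data$Pclass <- 0 # Every passenger's class replaced with 0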

3.1.4 Accessing column names

We can see the column names by using the “names()” function to print a character vector of the column names.

# Getting the column names using the names function

names(titanic_data)

We will need to do some work on these to remove the use of capitalisation and spaces.

3.2 Cleaning Column Names

3.2.1 The Janitor Package

The janitor package offers many functions used to manipulate data, such as finding duplicates. In this chapter we will use it to clean column names.

The function to use is called “clean_names()” and automatically formats the column names as snake_case, but this can be altered with an additional parameter.

# Clean the column names and overwrite the variable

titanic_data <- clean_names(titanic_data)

# Getting the column names of the dataset

names(titanic_data)
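If snake_case is not the convention you need, clean_names() also has a case parameter for other conventions; a quick sketch (previewing only, not overwriting), assuming the “lower_camel” option:

# Preview an alternative naming convention with the case parameter

names(clean_names(titanic_data, case = "lower_camel"))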

4 The Data Manipulation Package dplyr

This is one of the most powerful packages in the tidyverse, which makes data manipulation simple and code easy to read.

We will look at how to perform the following actions:

  1. arrange/sort
  2. select
  3. filter
  4. mutate
  5. joining data

The aim of the package is to provide a function for each basic verb of data manipulation, which is why these functions are referred to as “verbs” in the documentation.

Each of the verbs has the same structure:

verb(.data, info, ...). Note the full stop in .data, which is syntax that allows us to reference variables from the dataset (and enables auto-completion too!).

The dplyr cheat sheet is incredibly useful as a reference piece.

Before we jump into the verbs, let’s see dplyr’s version of renaming columns!

4.1 Rename

We may wish to remain within the tidyverse when cleaning column names (say, janitor is not available to you), which invites the use of rename().

This allows you to change column names one at a time using the following syntax:

rename(.data, new_name = old_name)

As an example, let’s rename the age_of_passenger column to simply “age”.

# Rename the age_of_passenger column to age

rename(.data = titanic_data,
       age = age_of_passenger)

If we glimpse the data again:

# Take a peek at titanic data

glimpse(titanic_data)

We see that the rename to age has not carried through to the underlying dataset.

This is an excellent feature of the tidyverse, in that if we do not:

  • Directly overwrite the variable
  • Or create a new one using assignment <-

then it just shows us what that process will do, so we can then make the decision on whether it’s what we want.

4.1.1 Renaming multiple columns

One of the best things about dplyr’s functions is that the .data argument clues the function in to the columns themselves, which means we can continually list them, or make changes to them, without wrapping them in a vector with c(), as many other libraries require.

As an example, let’s create a new variable and rename the “of_passenger” columns.

# Rename of_passenger columns and create a new variable

titanic_renamed <- rename(.data = titanic_data,
                       name = name_of_passenger,
                       sex = sex_of_passenger,
                       age = age_of_passenger) # Notice autocompletion

glimpse(titanic_renamed)

We didn’t need to list them in their original order either; the function automatically knows which columns we are referring to in the underlying dataset.

There is much more we can say about the structure of these functions, but we will see this as we proceed through the verbs themselves.

4.2 Arrange

Our data is displayed in the same order as the source data.

We may want to sort our data based on specific columns.

To do so, we use the verb arrange().

4.2.1 Example - Single column sort

From here, we will only create a new variable or overwrite the existing one when we perform a change we wish to keep permanently.

We highly encourage you to do the same, as creating multiple new variables you will never use again will clutter up your environment.

# Sort titanic by fare

arrange(.data = titanic_data,
        fare)

We get a larger output here that does not show us all columns at a glance, so checking that the outcome is what we expect is more difficult.

We should instead use glimpse() to check the output, which needs to be written first in the code.

# Sort by fare and then glimpse

glimpse(arrange(.data = titanic_data,
                fare))

Notice that by default, arrange() sorted the fare column in ascending order.

Small Aside - Functions

Notice that glimpse() is written first but executed last, since R evaluates functions from the inside out.

After all, if glimpse() has no input, we can’t see anything at all!

Keep this in mind as it will become very important later.

4.2.2 Example - Descending Order Sort

To sort a column in descending order, we use the desc() function and the column name as its input.

# Sort titanic in descending order 

glimpse(arrange(.data = titanic_data,
                desc(fare))) 

4.2.3 Example - Multi-Column Sort

We can also sort by multiple columns, but this creates a chain of dependence, in that the first column sort is maintained before the second sort is performed.

This means that the second sort won’t change the order from the first sort.

  • This becomes useful when there are numerous instances of the same value, such as the 0.0000 values in the fare column.
    • Whilst the position of 0.0000 will not change, the values in the adjacent column will be sorted. Some of those paying £0 fare may have been younger than others, after all.

# Multi-column sort

glimpse(arrange(.data = titanic_data,
                fare,
                age_of_passenger))

Notice that we were able to just continually reference columns from the data as if they were arguments to the function, and the verb understands that immediately, just like with rename().

4.2.4 Exercise

Sort the titanic data set by age in descending order, then fare in ascending order.

Glimpse the result.

# Sort by age desc, fare asc

glimpse(arrange(.data = titanic_data,
                desc(age_of_passenger),
                fare))

4.3 Select

Sometimes we will want to work with smaller tibbles that contain just a subset of available columns.

The select() verb is perfect for this, and it takes the arguments:

  • The first is our dataset, made even simpler with “.data”
  • From here we list as many columns by name as we would like to be retained in our selecting process.
    • The order in which we specify them is the order they will be in the smaller tibble.

4.3.1 Example - Single Column

# Selecting data

glimpse(select(.data = titanic_data, 
                name_of_passenger))

Notice that this has returned a tibble with one column, not the underlying vector itself, as $ would have.
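We can confirm this with a quick, purely illustrative check of the classes:

# Compare the two return types

class(select(.data = titanic_data, name_of_passenger)) # A tibble
class(titanic_data$name_of_passenger) # A plain character vector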

4.3.2 Example - Multiple Columns

If we want to select multiple columns, we can continue to list them, separating with commas, as we did with arrange().

# Selecting data

glimpse(select(.data = titanic_data,
                name_of_passenger, 
                age_of_passenger,
                pclass))

4.3.3 Example - Columns in a range

We can select consecutive columns (next to one another) with the syntax we used for indexing in chapter 2, the colon : operator.

# Selecting from passenger class to the age of passenger in order

glimpse(select(.data = titanic_data,
                pclass:age_of_passenger))

Notice that we return the columns from pclass to age_of_passenger in the order of the source data.

4.3.4 Selecting with exclusion

Up to this point we have selected with inclusion, a.k.a we specify the columns we want to include in our smaller tibble.

The real power of select comes from its flexibility, in that we can reduce the amount of code to write by using exclusion instead, a.k.a specifying the columns we don’t want to include.

Say we want to keep 8 of our 11 columns; instead of typing out the names of all 8, we can just exclude the 3 we don’t want.

Example

To exclude, we use the minus sign operator - which signifies “do not select this” or “select, not this”.

# Selecting by excluding columns we don't want

glimpse(select(.data = titanic_data, 
                               -name_of_passenger, 
                               -age_of_passenger, 
                               -pclass))

There is an even more streamlined way to do this, by wrapping the columns in a vector with the c() function and using one minus sign outside it.

You can think of this as expanding brackets in mathematics:

-c(col1, col2) = c(-col1, -col2) = -col1, -col2

# Using a vector for exclusion

glimpse(select(.data = titanic_data, 
                               -c(name_of_passenger, 
                                  age_of_passenger, 
                                  pclass)))

4.3.5 Selecting with Index Position

We can also select the columns using their index position, starting from 1, just like we did with data structures in chapter 2.

Let’s select columns 1 to 4, and also column 7.

# Selecting sequential and out of sequence columns with index position

glimpse(select(.data = titanic_data, 
                               1:4, 
                               7))

Exclusion works similarly here with the minus sign.

Note that whilst this may be required in some cases, it is usually better to be explicit with the exact column names, if possible.

4.3.6 Select Helper Functions

As if select() wasn’t already helpful enough, it even has helper functions that allow us to select on specific patterns, such as a prefix or suffix in a column name.

They are as follows:

  • starts_with(match): Starts with a prefix.
  • ends_with(match): Ends with a suffix.
  • contains(match): Contains a literal string.
  • matches(match): Matches a regular expression.
  • everything(): Selects all variables.
  • last_col(): Selects the last column.

These can be passed instead of column names and alongside selected column names as well.

Examples

Let’s return columns that start with an “s”.

# Selecting columns

glimpse(select(.data = titanic_data, 
               starts_with("s")))

As a second example, let’s return everything.

# Selecting everything

glimpse(select(.data = titanic_data,
               everything()))
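As mentioned above, helpers can also sit alongside explicitly named columns in the same call; a quick sketch:

# Mix a named column with a helper function

glimpse(select(.data = titanic_data,
               pclass,
               contains("passenger")))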

4.3.7 Exercise

  1. Select the second, third and fourth columns from titanic_data, without typing all three.

  2. Select all columns except “fare”, “cabin” and “embarked” from titanic_data. Note that these are consecutive.

  3. Select just the last column from titanic_data using a helper function.

  4. Select the columns that end in the suffix “passenger” using a helper function.

# Select second, third and fourth column

glimpse(select(.data = titanic_data, 
                survived:sex_of_passenger))

# Using exclusion on a range

glimpse(select(.data = titanic_data, 
               -fare:-embarked))

# OR

# glimpse(select(.data = titanic_data,
#                -c(fare, cabin, embarked)))

# OR

# glimpse(select(.data = titanic_data,
#                -fare,
#                -cabin,
#                -embarked))

# Selecting last column only

glimpse(select(.data = titanic_data,
               last_col())) 

# Selecting on a suffix

glimpse(select(.data = titanic_data,
               ends_with("passenger"))) 

4.4 The Pipe Operator

Up until now, we have run verbs one at a time on our dataset, be it to sort or select columns.

The real power of the tidyverse comes from the ability to chain these functions together in a sequence for more complex data manipulation tasks.

However, out of the box, this becomes laborious quickly due to the nature of functions.

4.4.1 Returning to Functions - Composition

As we saw earlier with glimpse(), to apply many functions to a single output, known as composition of functions, we must nest them, with the function applied last being written first, and evaluated from the inside to the outside.

Example

For example, let’s take the sqrt() and round() mathematical functions and apply them together, so that we obtain the square root of a number and hence round it to a specified number of decimal places.

# Compose mathematical functions

round(sqrt(2))

Notice that the functions are evaluated in reverse order (from the inside, out) just as you would with writing this mathematically. You would perform the square root operation first, then round the result.

Should we wish to use an extra parameter, we need to start thinking about the brackets, as they are the opening and closing doors of a function itself.

Thus, the digits parameter needs to go inside the round() brackets, not the sqrt() ones.

# Compose mathematical functions with extra parameters

round(sqrt(2), digits = 2)

Notice that the input to the function round() was the output from the sqrt() function.

4.4.2 Composition of functions in the tidyverse

This becomes tricky to comprehend when working with the tidyverse functions, as we are effectively performing actions in sequence, such as selecting the columns we want and then sorting the result.

As such, the output of one function is the input to another:

# Select passenger columns and fare then arrange by fare

glimpse(arrange(.data = select(.data = titanic_data,
                                ends_with("passenger"),
                                fare),
                desc(fare)))

Notice that the input to arrange was the result of a select operation, so we had to keep track of our brackets and ensure that the desc(fare) was placed within arrange(), not select().

Going forward, we will see more verbs that can be applied on top of this, and it becomes difficult to manage.

This is where the pipe operator comes in.

4.4.3 Introducing the Pipe

The Pipe Operator makes it possible to chain a sequence of functions starting with the base data as an input to the sequence.

This removes the need to type the .data argument each time.

  • Prior to R 4.1, the pipe operator came from a package known as magrittr, which would be loaded alongside the tidyverse, and it took the form %>%.

  • From R 4.1 onwards, a pipe operator comes as standard with base R, as piping has become a standard pattern in data analysis. It takes the form |>, and in RStudio this option must be turned on in the Tools > Global Options menu.

The Code, Editing pane with the native pipe operator tick box.

The shortcut for this operator is CTRL + SHIFT + M and is one you will use a lot from here on.

Example - Single Value

Let’s return to our square root and rounding functions. Let’s apply one, and then both in sequence using our new operator.

  • First, we write the value or dataset we wish to apply functions to, it is at the entrance to the “pipe” if you will.
  • Then, we write the function we wish to use and any additional parameters we may need.
  • The pipe operator passes the input to the function, so we do not need to write it within the brackets.
# Our first pipe - with a single value

2 |> sqrt()

You can read the pipe operator as AND THEN.

It takes the output of one function AND THEN uses that as the input of the next function, and so on.

This means that to chain a sequence of functions, we will use the pipe operator again to pipe our output to our next function as its input.

# Our first sequence of pipes

2 |> sqrt() |> 
     round(digits = 2)

This is quite the shift from what we have done so far, but it is more readable in plain English:

To the value 2, apply the square root function, and then, round that output to 2 decimal places.

Notice that following the second pipe, we insert a new line, which automatically indents the next line by a few spaces. This is optional but makes the sequence easier to read.
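For completeness, the same sequence with the older magrittr pipe (loaded as part of the tidyverse) looks identical apart from the symbol; the two are interchangeable for simple chains like this:

# The same sequence with the magrittr pipe

2 %>% sqrt() %>% 
     round(digits = 2)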

4.4.4 Using the Pipe Operator with the tidyverse

What was done above was an oversimplification of why the pipe was created, in reality it was created to pipe datasets into functions, as opposed to a single value.

Even in the case of datasets, we do not need to pipe if we are only applying a single function such as glimpse().

It is when we are performing a chain of steps that the pipe completely simplifies our code, enhancing readability.

Example

Let’s return to our previous example, where we selected passenger demographics and sorted by fare.

  • First, we write the dataset we want to apply this sequence of steps to.
  • AND THEN pipe that into the select() function, specifying what columns we wish to retain.
  • AND THEN pipe that output to the arrange() function, so that it can be sorted by whatever column(s) we wish.
# Repeat earlier operation with pipes

titanic_data |> 
    select(ends_with("passenger"),
           fare) |> 
    arrange(desc(fare))

This way, the code is written in the order that it executes, as opposed to the reverse order without the pipe operator.

To the titanic dataset, we select columns that end with “passenger” as well as the fare column, and then, sort that in descending order of fare paid.

We could also pipe the output to a glimpse() for easier sense checking.

# Pipeline with glimpse

titanic_data |> 
    select(ends_with("passenger"),
           fare) |> 
    arrange(desc(fare)) |> 
    glimpse()

Variables with tidyverse chains

Notice that we didn’t create a variable in the previous examples.

This is because the assignment operator <- and the pipe operator |> can cause confusion if they are not covered separately.

In front of the previous sequence, known as a pipeline, we would write our variable name and assignment operator, which saves the final output in memory.

# Save pipeline as a variable 

titanic_demographics_sort <- titanic_data |> 
    select(ends_with("passenger"),
           fare) |> 
    arrange(desc(fare))


titanic_demographics_sort |> 
    glimpse()

Be careful not to confuse the meaning of the two operators; this will take some practice to get used to.
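As an illustrative contrast (x here is just a throwaway value):

# <- stores a result in memory; |> passes a result onward

x <- 16 # Assignment - x now holds 16

x |> sqrt() # Piping - prints 4 to the console, x itself is unchanged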

4.4.5 Exercise

Speaking of practice, let’s try out our new tool!

Create a pipeline that:

  • Selects the first five columns of the data
  • Arranges them in ascending order of age.

Assign this output to a new variable and glimpse it.

# Selecting and ordering by age with pipes

titanic_age_sort <- titanic_data |> 
    select(pclass:age_of_passenger) |> 
    arrange(age_of_passenger)

titanic_age_sort |> 
    glimpse()

4.5 Filter

Often, we are only interested in groups of rows that adhere to a specific condition, such as:

  • Passengers that paid over or under a certain fare.
  • Passengers who are in a particular age threshold.
  • Passengers who embarked from a particular port.

and of course, combinations of the above.

The next verb, filter() allows us to subset our rows in this way. To understand this section, we first need to consider conditions.

4.5.1 Conditional Statements

A conditional statement is one that returns TRUE or FALSE dependent on the outcome.

We saw examples of these back in Chapter 1, when we were producing logicals with comparisons. For example:

  • 4 < 5 is a conditional statement that evaluates to TRUE
  • 4 != 4 is a conditional statement that evaluates to FALSE

Logical Operator   Description
<                  Less Than
<=                 Less Than or Equal To
>                  Greater Than
>=                 Greater Than or Equal To
==                 Equal To
!=                 Not Equal To
|                  Or
&                  And
!                  Not
any()              Checks if any value in a logical vector is TRUE
all()              Checks if all values in a logical vector are TRUE
is.na()            Is the value missing (NA)?
between()          Is the value between two numbers (inclusive)?

Note: The “!” allows us to flip or invert an expression. Basically, if an expression returns c(TRUE, TRUE, FALSE), the inverted expression (place ! in front of it) will return c(FALSE, FALSE, TRUE).
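As a quick sketch, here are a few of these operators applied to plain vectors:

# Trying out some of the logical operators

!c(TRUE, TRUE, FALSE)           # FALSE FALSE TRUE - each value flipped
any(c(TRUE, FALSE, FALSE))      # TRUE - at least one value is TRUE
all(c(TRUE, FALSE, FALSE))      # FALSE - not every value is TRUE
between(4, left = 3, right = 5) # TRUE - 4 is between 3 and 5 (inclusive)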

These statements can of course be much more complex than comparing two single numbers, we can apply them across columns with dplyr’s filter() verb!

4.5.2 Single Conditional Filtering

We will first look at filtering by a single condition, which are constructed as follows:

  • We begin with the column to compare with.
  • Next is the logical operator of choice, such as <, >.
  • Last up is the value to compare each entry in the column to, which generates the set of TRUEs and FALSEs.

This is generated in the background, and filter() will keep only the rows that return TRUE from this comparison.

For example, to subset down to only second class passengers, we would write: “pclass == 2”.

Example - Categorical

# Filter to retain only second class passengers

second_class <- titanic_data |> 
    filter(pclass == 2) |> 
    glimpse()

A quick way to check that the filter has worked is to use base R’s unique() function on the column.

We need to use our $ here to reference columns by name as this is not a tidyverse method.

Let’s check the original data first.

# Return unique values from the original data

unique(titanic_data$pclass)

We see that the pclass column has 3 unique levels of 1, 2 and 3.

# Return unique values from the filtered data

unique(second_class$pclass)

We see that in the filtered data, we have just second class passengers, as expected.

Example - Numeric

Let’s filter to passengers who paid above £200.

# Select passengers who paid more than 200

titanic_data |> 
    filter(fare > 200) |> 
    glimpse()

Notice that we didn’t save this as a variable, as this was just an example, one we will not carry forward in our analysis.

However, if you are being asked to answer specific questions, then saving the outputs of your manipulation pipelines as variables is very useful, to prevent overwriting the base data.

4.5.3 Exercise

  1. Use filter to return the row for the passenger named: ‘Birkeland, Mr. Hans Martin Monsen’

  2. Filter for passengers that are male and save it as a variable. Can you count how many there were?

  3. Filter for passengers that are under 18 years of age and save it as a variable. Can you count how many there were?

  4. Extension - Harder: What percentage of passengers in the dataset survived? Remember, to compute a percentage, we must multiply the proportion by 100.

# Filtering to a specific passenger

titanic_data |> 
    filter(name_of_passenger == 
               'Birkeland, Mr. Hans Martin Monsen') |> 
    glimpse()

  2. To display the number of males in the filtered data, we simply need the number of rows of the data!

# Males only

titanic_male <- titanic_data |> 
    filter(sex_of_passenger == 'male')

# Counting the number of males

nrow(titanic_male)

  3. To display the number of passengers below 18 years of age in the filtered data, we simply need the number of rows of the data!

# Underage passengers 

titanic_underage <- titanic_data |> 
    filter(age_of_passenger < 18)


# Counting the number of underage passengers

nrow(titanic_underage)

  4. Firstly, we need to filter the dataset to those who survived, and the percentage would be calculated as:

(number who survived / total number of passengers) * 100

Remember, we can compute the number who survived and the number of passengers, we just need the number of rows in each subsequent tibble.

# Passengers who survived

titanic_survived <- titanic_data |> 
    filter(survived == 1)

# Survival percentage

(nrow(titanic_survived) / nrow(titanic_data)) * 100

So we see that, tragically, only 38% of the passengers aboard the Titanic survived.

4.5.4 Multiple Conditional Filtering

We have thus far filtered on conditions for a single column, but there is no reason we can’t combine multiple conditions across one or more columns to filter all at once.

For example:

  • We want male passengers who survived.
  • We want over 18 passengers who embarked from Southampton.

However, we do need to think about how the conditions relate to each other.

Relationships between conditions

There are two ways we can combine conditions together:

  • AND relationships are given by the & symbol.
    • This implies both/all conditions must be met for a row to evaluate to TRUE before the filter is applied.
  • OR relationships are given by the | symbol.
    • This implies that if any of the conditions can be met (one or more) a given row evaluates to TRUE before the filter is applied.

This does mean that combinations of TRUE and FALSE conditions can lead to different outputs with AND/OR relationships. This is summarised in the table below.

Condition 1   Condition 2   AND Equates to   OR Equates to
TRUE          TRUE          TRUE             TRUE
TRUE          FALSE         FALSE            TRUE
FALSE         TRUE          FALSE            TRUE
FALSE         FALSE         FALSE            FALSE

4.5.5 Examples

For our first example, let’s filter to first class, female passengers.

This is an AND relationship, as they must be first class and female. Thus, any row that does not satisfy both of these conditions will be filtered out in the process.

Remember that AND combinations are very strict, so ensure that the loss of a possibly large amount of data is appropriate for your analysis.

# Filter to first class female passengers

titanic_data |> 
    filter(pclass == 1 & sex_of_passenger == "female") |> 
    glimpse()

For our next example, let’s select passengers who were male OR adults (over 18).

This means that males under 18 will be kept and adult females will also be kept, as only one of the conditions needs to be true to be retained by filter!

# Filter to males or adults

titanic_data |> 
    filter(sex_of_passenger == "male" | age_of_passenger > 18) |> 
    glimpse()

4.5.6 Special Cases of AND/OR

Whilst the above conventions are easy to read from left to right, when constructing much more complex chains of conditions, it becomes laborious quickly.

To alleviate this, dplyr has some special functions that allow us to streamline cases of numerous conditions on the same column.

The between() function

First is the between() function, which behaves like an AND relationship for conditions on numeric columns.

It allows us to specify:

  • A lower bound for the value in the column, a.k.a the smallest value that will be retained.
  • An upper bound for the value in the column, a.k.a the largest value that will be retained.

Both bounds are inclusive. For example, between(x, 3.5, 7.5) will retain values of x greater than or equal to 3.5 and less than or equal to 7.5.

Let’s filter to those that paid between £250 and £500 for their tickets.

# Filter to those that paid between two values

titanic_data |> 
    filter(between(fare, left = 250, right = 500)) |> 
    glimpse()

Which is equivalent to saying:

filter(fare >= 250 & fare <= 500)

Of course, we can combine this with other conditions across columns as well.

# Filter to those that paid between £250 and £500 who were female

titanic_data |> 
    filter(between(fare, left = 250, right = 500) & 
                   sex_of_passenger == "female") |> 
    glimpse()

The %in% operator

Second is the %in% operator, which checks for membership of the column value in a vector of options we provide.

This is similar to an OR relationship, as it allows us to bring through multiple values from a column.

Let’s filter to capture those that embarked from Southampton or Cherbourg.

# Filter to those that embarked from S or C

titanic_data |> 
    filter(embarked %in% c("S", "C")) |> 
    glimpse()

We can check with distinct() whether this accomplished what we were looking for.

# Filter to those that embarked from S or C and check

titanic_data |> 
    filter(embarked %in% c("S", "C")) |> 
    distinct(embarked)

This process would be equivalent to performing:

filter(embarked == "S" | embarked == "C")

and is incredibly useful for larger categorical variables we wish to trim.

As with between(), we can also combine this with other conditions.

# Select passengers embarking from S or C that were male

titanic_data |> 
    filter(embarked  %in% c("S", "C") &
           sex_of_passenger == "male") |> 
    glimpse()

4.5.7 Negating Conditions

We can also use the negation operator ! to reverse the outcome of the condition.

This is useful for cases where exclusion is quicker than typing out conditions for inclusion.

Say for example you have a categorical variable with 9 unique categories.

You are interested in filtering so that only 8 of the categories remain. Instead of chaining OR conditions together, you can write the single condition that would return the unwanted category, and negate it to remove it instead.

Example

Let’s see this in action by negating the statement that selects passengers who embarked from Southampton.

This is the opposite of writing out filters to retain Cherbourg and Queenstown, and is a great time-saving measure for larger datasets.

# Use negation to filter down to passengers embarking from Cherbourg or Queenstown

titanic_data |> 
    filter(!embarked == "S") |> 
    distinct(embarked)

This is, of course, equivalent to using the not equal != comparison in such a simple case:

# Filter for passengers who did not embark from Southampton

titanic_data |> 
    filter(embarked != "S") |> 
    distinct(embarked)

Example - AND/OR Negation

Of course, if you negate and and/or relationships:

  • Not only is each individual condition negated
  • The and/or is reversed to or/and respectively

This comes from logical statements in mathematics, specifically, De Morgan’s laws.

Negating a combined statement negates each individual statement, as well as the logical operator combining them.

Let’s take an example where we required those that embarked from Southampton or paid above £100 in fare.

Mathematically speaking, if we negate this condition, we get:

!(Southampton OR fare > 100) = (!Southampton AND !(fare > 100))

It is recommended to wrap your OR/AND condition in brackets, to ensure the negation is carried throughout.

# Harder example

titanic_data |> 
    filter(!(embarked == "S" | fare > 100)) |> 
    distinct(embarked)

We see that we have indeed negated the selection of Southampton, as expected.

4.5.8 Exercise

  1. Filter so that only second or third class passengers are included in the data.

  2. Filter so that only passengers who travelled alone are included in the data. How many were there?

  3. Extension Exercise - What percentage of passengers who embarked from Cherbourg or Queenstown and paid a fare between £100 and £350 survived?

# Filter for second or third class passengers

titanic_data |> 
    filter(pclass  %in% c(2, 3)) |> 
    glimpse()

  2. For those travelling alone, their sibsp and parch values should both be 0.

# Filtering for passengers travelling alone

lone_passengers <- titanic_data |> 
    filter(sibsp == 0 & parch == 0) |> 
    glimpse()

# How many there are

nrow(lone_passengers)

  3. This is a multi-stage problem, so we must filter for the required conditions (saving the result to a variable), then compute the percentage.

Since we need all of these conditions to be TRUE, we must use & to combine them.

# Filter for those that embarked from C or Q, paid certain fare and survived

cherb_queens_survivors <- titanic_data |> 
    filter(embarked %in% c("C", "Q") &
           between(fare, left = 100, right = 350) &
           survived == 1)

# Compute final percentage

round((nrow(cherb_queens_survivors) / nrow(titanic_data)) * 100, digits = 2)

4.6 Mutate

When cleaning and transforming data, we often want to apply changes at the column level, such as:

  • Converting to lower case, or any other character work
  • Rounding to a specified number of decimal places
  • Converting the data type
  • Combining columns into ratios/proportions, known as calculated columns

To accomplish all of the above, we need our next verb, mutate(). To create a new column, we use the following syntax:

mutate(new_column_name = contents_of_column)

4.6.1 Constant Value Variables

The simplest example would be adding a constant column, which contains the same value all the way down.

Whilst the applications of this are minimal, it is the best first example for understanding mutate.

# Add a constant character column

titanic_data |> 
    mutate(character_col = "two") |> # Column filled with the string "two"
    glimpse() 

As another example, let’s add a numeric constant column.

# Add a constant numeric column

titanic_data |> 
    mutate(numeric_col = 3.14) |> # Column of pi
    glimpse()

4.6.2 Modifying existing variables

The mutate function is incredibly smart, as it allows us to apply functions to an entire column, with each cell transformed in turn.

For example, should we wish to make the name of passenger column lower case, we can use the str_to_lower() function from stringr.

To apply a function to an existing column:

mutate(existing_column = function(existing_column, ...))

# Convert names column to lower case

titanic_data |> 
    mutate(name_of_passenger = str_to_lower(name_of_passenger)) |> 
    glimpse()

Another example might be to round the age column to the nearest whole number, which could allow for conversion to an integer.

# Round the age column to the nearest whole number

titanic_data |> 
    mutate(age_of_passenger = round(age_of_passenger)) |> 
    glimpse()

This has the unexpected consequence of rounding the passenger with an age of 0.1667 (around 2 months) to 0, which makes no sense given the context. As such, it may be better to round to 2 decimal places.

Let’s do the same thing to the fare column as well, since money in GBP (£) is presented to two decimal places.

# Round the fare column to 2 decimal places

titanic_data |> 
    mutate(fare = round(fare, digits = 2)) |> 
    glimpse()

Notice that we have forgotten something, as age still has 4 decimal places.

Exercise

Identify the mistake we have made when converting existing columns and fix this mistake so that we permanently update them.

We did not overwrite and reassign the titanic_data variable; we just observed the changes without permanently making them.

# Round the fare and age columns to 2 decimal places

titanic_data <- titanic_data |> 
    mutate(age_of_passenger = round(age_of_passenger, digits = 2),
           fare = round(fare, digits = 2))

glimpse(titanic_data)

Of course, this does not mean we should be overwriting titanic_data every time, as often we are just previewing a transformation and sense checking it.

4.6.3 Calculated Columns

Let’s see an example of creating a bespoke column by combining existing ones.

Here we will create the family size of each person. To compute this we need:

  • The number of siblings or spouses on board (sibsp)
  • The number of parents or children on board (parch)
  • The person themselves, as they aren’t factored in to sibsp or parch

# Determining family size per passenger

titanic_data <- titanic_data |> 
    mutate(family_size = sibsp + parch + 1) # Add 1 for the person themselves

glimpse(titanic_data)

4.6.4 Conditional Columns

We can also use conditions like when we filtered, to create a logical column of TRUEs and FALSEs.

These are very common and are often known as binary flags, as they denote whether a certain condition is met or not.

This is structured as follows:

mutate(new_col = (condition)) where the condition is “column operator value”, such as “fare < 200”.

# Creating a conditional column on whether passenger is female

titanic_data <- titanic_data |> 
    mutate(is_female = (sex_of_passenger == "female"))

glimpse(titanic_data)

Usually we want binary flags in numeric form, as most models we supply our data to require numeric variables.

This means we need to convert TRUE and FALSE to their numeric representations of 0 and 1.

We can use matching and recoding functions to perform this step, in particular the case_match() function, which takes the following form:

case_match(column, value_to_change ~ new_value, value_to_change ~ new_value)

# Recode is_female to numeric

titanic_data <- titanic_data |> 
    mutate(is_female = case_match(is_female,
                                  TRUE ~ 1, # Convert instances of TRUE to 1
                                  FALSE ~ 0)) # Convert instances of FALSE to 0

glimpse(titanic_data)

You may not have seen this notation with the tilde ~ before; it is known as a formula in R. Here, the left-hand side is the value to match and the right-hand side is the value that replaces it.

This is common convention in R, seen most with linear modelling to define the equation. See this technical article on the tilde for more information.
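As an aside, because TRUE and FALSE coerce to 1 and 0 in R, a direct type conversion achieves the same recoding; a minimal alternative sketch (the new column name is illustrative):

# Alternative - coerce a logical comparison directly (TRUE -> 1, FALSE -> 0)

titanic_data |> 
    mutate(is_female_numeric = as.integer(sex_of_passenger == "female")) |> 
    glimpse()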

4.6.5 Exercise

Create a new column called “fare_dollars”, which converts the fare from GBP (£) to USD ($). The current exchange rate as of January 2024 is:

£1 = $1.27

Ensure that the column is rounded to 2 decimal places.

# Create fare dollars and rounding the output

titanic_data <- titanic_data |> 
    mutate(fare_dollars = round(fare * 1.27, digits = 2))

glimpse(titanic_data)

4.7 Joining Data

Much of the information we need to answer questions of interest is featured across multiple smaller datasets, so we can join or merge them together for one cohesive dataset.

We do require something essential for this to be possible, however: a column in common, such as a unique identifier or reference. For example:

  • NHS numbers allow for data linkage across the NHS.
  • Account numbers allow for data linkage across financial institutions.
  • Emails, phone numbers and social media handles allow for data linkage across social media and online shopping.
  • Addresses allow for spatial data linkage in geospatial applications.

Naming Conventions for Join

The dplyr package has a family of functions for joins.

They take the form type_join where the prefix “type” denotes the type of join itself.

A naming convention we must establish here is that of the tibbles themselves, namely:

  • The larger dataset, the one we wish to join data to, is the left tibble.
  • The smaller dataset, the one we want to join to another, is the right tibble.

4.7.1 Types of Join

  • inner_join - A stricter join where only data common to both tibbles is retained.

  • full_join - All data from both tibbles is retained, matched up where possible.

  • left_join - All data from the left tibble is retained, and only matching rows are merged from the right tibble.

  • right_join - All data from the right tibble is retained, and only matching rows are merged from the left tibble. This is the inverse process of the left join.

Graphically:

Venn diagrams for each of the prior examples, with the included data shaded.

There are also Semi Joins and Anti Joins for filtering, which you can read about in Hadley Wickham’s R for Data Science, Chapter 19.

4.7.2 Examples

Joins are best explored using examples of each type.

We will use 2 very small datasets that come with dplyr to build this up.

The first is band_members, which we must assign to a variable to save in the global environment.

members_data <- band_members

members_data

The second is band_instruments, which seems to have some connection to the prior dataset.

instruments_data <- band_instruments

instruments_data

These datasets have one common column, name, so we can join them to have both band and plays in the same dataset.

Let’s assume that members_data is the left dataset and instruments_data is the right.

Pre-empting what will happen based on the above Venn diagrams:

  • John and Paul appear in both datasets, so will be retained for most common joins.
  • Mick only appears in the members_data, so may disappear with stricter joins.
  • Keith only appears in the instruments_data, so may disappear with stricter joins.

Join Syntax

The join functions take numerous arguments, with the following being required:

  • x - the left tibble
  • y - the right tibble
  • by - the column to join on

and take the form:

type_join(x = left_tibble, y = right_tibble, by = join_col) where we can pipe in the left_tibble.

Inner Join

Starting with the inner join, this will only retain rows common to both datasets, which would see Mick and Keith removed.

# Perform an inner join on band members and instruments data.

members_data |> 
    inner_join(y = instruments_data,
               by = "name")

As expected, we kept John and Paul, and now have another column of information about them!

Full Join

A full join retains all rows across both datasets, but this raises the question of what happens to the entries in the columns where a row isn’t featured, i.e.

  • Mick has no plays entry
  • Keith has no band entry

Simply, they are filled with NA, as they are missing.

# Perform a full join on instruments and band data

members_data |> 
    full_join(y = instruments_data,
              by = "name")

We see that, indeed, Mick has no entry for plays and Keith has no entry in band.

Left Join

A left join retains all rows in the left tibble, and just the matching data in the right. This means Mick will come along with an NA in the plays column, but Keith will be lost entirely.

# Perform left join on band and instruments data

members_data |> 
    left_join(y = instruments_data,
              by = "name")

Right Join

A right join is the opposite of a left join, so we will retain all values in the right tibble, including Keith, but only matching data in the left tibble, which means we will lose Mick.

# Right join on band and instruments data

members_data |> 
    right_join(y = instruments_data,
               by = "name")

4.7.3 Matching columns with different names

A very common problem that occurs when joining is matching columns being named differently.

This can sometimes be easily rectified with a rename(), but there will be cases where modifying the data in this way is not allowed.

This is where the join_by() function from dplyr comes in, providing us a way to specify that two differently named columns are indeed the same.

Example

Here we introduce the band_instruments2 data, which is similar to the former but with one key difference.

# Instruments data

instruments2_data <- band_instruments2

instruments2_data

The name variable is called “artist” here.

Within join_by(), we must set the column names of note equal to each other, using:

join_by(left_col_name == right_col_name)

Let’s join instruments2_data (from band_instruments2) with the members data, using the join_by() function.

# Join columns with names that don't match

members_data |> 
    full_join(y = instruments2_data,
              by = join_by(name == artist))

4.7.4 Binding Datasets

Another common way in which datasets are combined is by binding them, where one dataset is appended on top of the other either row by row or column by column.

With row binding:

  • Columns are matched by name automatically, so the positioning of them does not matter.
  • Missing columns will be filled with NA, so we must be careful of this.

With column binding:

  • Rows are matched by position, so the tibbles we bind absolutely must have the same number of rows, otherwise we will get an error.

Binding Rows

We use the bind_rows() function from dplyr, which will lengthen the dataset by adding more rows.

# Row bind the instruments tibbles

instruments_data |> 
    bind_rows(instruments2_data)

Since instruments_data has the name variable and instruments2_data has the artist variable, these are seen as independent columns, and the missing entries are filled with NAs since they don’t match.

Binding Columns

We use the bind_cols() function from dplyr, which will widen the dataset by adding more variables.

# Column bind instruments 1 and 2

instruments_data |> 
    bind_cols(instruments2_data)

Notice that despite plays being a repeated column name, it is bound as we asked, and differentiated by its position, with the second column being “plays…2” and the fourth being “plays…4”.

Why we bind

When data is collected consistently, it is often in the same form, be that monthly, weekly, annually and so on.

To create a historic time series, data must be sequentially appended unless separate storage is required.

This is often part of an automated process where data is ingested and then appended onto the existing, larger, collected dataset.
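As an illustrative sketch (the monthly tibbles below are invented for demonstration), appending a new period onto a running dataset might look like:

# Hypothetical monthly collections with identical columns

january <- tibble(month = "Jan", passengers = 100)
february <- tibble(month = "Feb", passengers = 120)

# Append the new month onto the running dataset

january |> 
    bind_rows(february)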

4.7.5 Union()

Another very useful function is union() which will stack the data vertically, but keep only the distinct (unique) rows.

This means that any duplicated rows that exist in the current tibble are not brought over in the process.

However, this does NOT remove duplicated values, only entire duplicated rows in the combined data, an important distinction to make.

# Band members and instruments

instruments_data |> 
    union(instruments2_data)

You will see that this causes an error, because union() is very strict about its combination, in that the columns must match exactly.

Let’s create a small tibble with matching columns to test this out.

# Create new entries for instruments data

artist <- c("Mick", "Bono", "Ringo", "John")

plays <- c("guitar", "guitar", "piano", "guitar")

instruments3_data <- tibble(artist = artist, plays = plays)

instruments3_data

Now let’s union this with instruments2_data, whose columns match exactly:

# Union the instruments datasets

instruments2_data |> 
    union(instruments3_data)

Notice that John, whose row would have been duplicated in the combination, appears only once, as intended with union().

5 Summary

We covered a lot of content around dplyr here, but the possibilities are nearly endless with this incredible data manipulation library.

By no means are you meant to remember and recall the above. Instead, what is most important is that you understand the problem you are trying to solve, and can use the resources here (or your own) to then solve it.

Next up, we will look at aggregating data and obtaining summary statistics.