Chapter 2 - Data Structures in R

Author

Government Analysis Function and ONS Data Science Campus

To switch between light and dark modes, use the toggle in the top left

1 Learning Objectives

  • Be familiar with data structures in R.
  • Understand how vectors operate.
  • Be familiar with lists.
  • Be familiar with data frames and tibbles.

2 Data Structures

We rarely work with single data values, we often work with a combination or collection of data.

R organises these in certain structures, and stores this data so that we can manipulate and work with it.

3 Vectors

In the previous chapter we looked at variables, where we said we could assign a name to a value.

Crucially, each of these values (and by extension the variable itself) has an associated datatype.

In the below example, we create a variable with the character data type.

# To assign a variable

my_friends <- "ian"

Suppose you wanted to store more than one value in “my_friends”, e.g “hannah”, “mike”, “almas”.

The following seem like possible solutions to this problem, but unfortunately are not the correct way to store multiple data points.

# This code will throw an error

my_friends <- "ian", "hannah", "mike", "almas"

my_friends <- ("ian", "hannah", "mike", "almas")

You will get the error below,

Error: unexpected ‘,’ in “my_friends <-”ian”,“

as the comma should not be used in this way.

This brings us to the c() function.

3.1 The c() Function

The “c()” function is used for creating a vector in R, which is it’s most simple data structure. The c stands for combine.

#To access the R help documentation

?c()

# or

help(c)

We can use this function to store multiple elements (data points) in a single variable, with each element separated by a comma.

# Creating a vector

my_friends <- c("ian", "hannah", "mike", "almas")

# To display the data

my_friends
[1] "ian"    "hannah" "mike"   "almas" 

Let’s check the datatype of our new vector.

# Check datatype of my_friends

typeof(my_friends)
[1] "character"

We see that this is a character vector, a.k.a, a vector containing character data. This is because each element contained within the vector are character type.

3.2 Vector Definition

A vector is a collection of values that are all of the same type (doubles, characters, etc.).

Since R insists all elements in a vector have to be of the same type we then can have types of vectors. These are:

  • Logical vector - contains only logical values.
  • Numerical vector - contains only numerical values.
  • Character vector - contains only character values.

There are more types of vectors, but for the purpose of our learning these are sufficient.

Visually:

Diagram of a vector as a column (collection) of the same red rectangle.

3.3 Creating Vectors

There are several different ways of creating a vector. You can create a vector by using:

  • The combine function c() that we saw previously.

  • The sequence function seq() to create any sequence of numbers.

  • The colon operator to create a vector of consecutive numbers.

3.3.1 The c() function (combine)

Let’s create a numeric vector with c() this time!

# Creating a vector using the combine function

sample_vector <- c(1, 2, 3, 4, 5)

# To display the created vector

sample_vector  
[1] 1 2 3 4 5

3.3.2 The seq() function (sequence)

A sequence of values has:

  • A starting point
  • An ending point
  • A “step”, a.k.a the value we jump by from number to number.

For example, if we start at 2, end at 10 and step by 2 we will have (2, 4, 6, 8, 10) as our final sequence.

The seq() function takes the arguments:

  • from - the start point
  • to - the end point
  • by - the step value
# Creating a vector using the sequence function,

sequence_vector <- seq(from = 2, to = 6, by = 2)

# To display the created vector

sequence_vector
[1] 2 4 6

Earlier we saw in the console the number [1] next to our output. This represents the position of the element to the right of this number in the vector.

Thus, 2 is element 1 of our vector, 4 is element 2 and so on. This is essential information for later.

3.3.3 The Colon Operator

A range of values (i.e., consecutive values) could be generated as a vector using a colon : in R.

Since this is consecutive, we can’t have broken sequences like seq() allows.

# Creating a vector using a colon

colon_vector <- 6:10

# To display the created vector

colon_vector
[1]  6  7  8  9 10

3.3.4 Exercise

  1. Create two vectors, one with numeric data and one with character data and name them appropriately. For example favourite_movies, lucky_numbers.

  2. Use the c() function to create a new vector with your vectors as inputs. This combines the smaller vectors into a larger vector, further showcasing their versatility.

  3. Find the type of your new vector. Why do you think this is the case?

# Creating a character vector

favourite_movies <- c("Frozen", "The Lion King", "Moana", "The Dark Knight") 

# To display the data

favourite_movies
[1] "Frozen"          "The Lion King"   "Moana"           "The Dark Knight"
# Creating a numeric vector 
  
lucky_numbers <- c(7, 12, 15, 1)  

# To display the data

lucky_numbers
[1]  7 12 15  1
# Creating a vector of vectors

movies_and_numbers <- c(favourite_movies, lucky_numbers)

movies_and_numbers
[1] "Frozen"          "The Lion King"   "Moana"           "The Dark Knight"
[5] "7"               "12"              "15"              "1"              
# Checking the type of vector

typeof(movies_and_numbers)
[1] "character"

We end up with a character vector as opposed to an error for incompatible types (numeric and character).

As vectors are supposed to be of the same data type, R will convert values so that they are all the same datatype. This is called coercion, which works on a complex heirarchy:

  • Any datatype can be converted to a character by wrapping it in quotation marks ” “.
  • Any integer can be converted to a double by adding in the decimal place, i.e. 1L –> 1.0
  • Any logical can be converted to an integer or double as they have numeric representation as 1 and 0 (or 1.0, 0.0) in the background.

However, a character such as “hello” cannot be converted to any other type.

As such, we have:

Logical < Integer < Double < Character and we see that character is the most all encompassing, with logicals being converted from but not to.

3.4 Vectorised Language

Vectors aren’t just containers for homogeneous data (data of the same type).

As R is a vectorised language, this means operations are applied to each element of the vector automatically, without the need to loop through the vector.

This is not a common behaviour among programming languages and is a key advantage of R’s nature as a vectorised language.

This also means that every variable we have created so far have also been vectors with one element, and the operators (such as adding 5 to them) have been vectorised operations.

Let’s see this in practice!

# Add sample and colon vector

vector_addition <- sample_vector + colon_vector


vector_addition
[1]  7  9 11 13 15

We see that each first element is added, each second element is added and so on, this is an element wise operation (element by element).

# Multiplying vectors (element wise)

vector_multiplication <- sample_vector * colon_vector


vector_multiplication
[1]  6 14 24 36 50

We see that each first element is multiplied, each second element is multiplied and so on.

# Multiply vector by a value

sample_vector <- sample_vector * 3


sample_vector
[1]  3  6  9 12 15

We see that each element is multiplied by 3, like expanding a bracket!

3.4.1 Exercise

  1. Try and add “sample_vector” and “sequence_vector”.

  2. Can you figure out what has happened?

# Adding two vectors of different lengths

vector_after_addition <- sample_vector + sequence_vector
Warning in sample_vector + sequence_vector: longer object length is not a
multiple of shorter object length
vector_after_addition  
[1]  5 10 15 14 19
  1. Here we have one vector with 3 elements and one with 5 elements. The number of elements in a vector is referred to as the length of the vector, so sample vector has length 5.

When we try to add vectors of different lengths R gives a warning message as the longer object length is not a multiple of the shorter object length.

As we can see R has recycled the elements in the shorter vector, wrapped them around to reach the length of the longer one, before adding them together.

When applying arithmetic to two vectors their lengths should either be equal, or the length of the longer one a multiple of the length of the shorter one, i.e. adding a vector of length 3 to a length 6.

3.5 Indexing a Vector

Often, we want to select specific elements or ranges of them to work with going forward.

Elements in a vector can be selected using square brackets [ ].

We are going to use indexes, which are the position of each element within the vector.

Note that R indexes start at 1 (not 0 like other languages).

Example - Indexing a single element

Take the favourite_movies vector from the earlier exercise.

# To display the vector

favourite_movies
[1] "Frozen"          "The Lion King"   "Moana"           "The Dark Knight"

If we want to select “The Lion King” from this vector, which is at position 2, we need to use the number 2 in square brackets.

# Selecting the second element in the vector

favourite_movies[2] 
[1] "The Lion King"

Reading this left to right we are:

  • From the favourite_movies vector
  • Select [ ]
  • The element at position 2

Example - Index Sequentially

We can also index multiple elements using the colon operator. Remember that this creates a sequence of consecutive numbers, so allows us to index sequentially.

For example, if we wanted to select from “The Lion King” to “The Dark Knight”, we select from index 2 to index 4, i.e. 2:4.

# Selecting sequentially

favourite_movies[2:4] 
[1] "The Lion King"   "Moana"           "The Dark Knight"

Example - Index out of sequence

We can even index out of sequence, such as obtaining the elements at indexes 1, 3 and 4.

This is interesting as most other programming languages require external packages to be able to do this, whereas R does it out of the box.

To do this, we need to create a vector of values inside the square brackets, using the c() function.

# Selecting first, third and fourth items

favourite_movies[c(1,3,4)] 
[1] "Frozen"          "Moana"           "The Dark Knight"

3.5.1 Exercise

Use the vector below containing days of the week (Monday - Sunday), for the exercises.

# Days of the week vector

days_of_the_week <- c("Monday",
                      "Tuesday",
                      "Wednesday",
                      "Thursday",
                      "Friday",
                      "Saturday",
                      "Sunday")
  1. Select Wednesday only.

  2. Select the week days.

  3. Select Tuesday and Thursday.

# Selecting Wednesday

days_of_the_week[3]
[1] "Wednesday"
# Selecting the week days

days_of_the_week[1:5]
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"   
# Selecting Tuesday and Thursday

days_of_the_week[c(2,4)] 
[1] "Tuesday"  "Thursday"

Run the following lines of code one at a time and try to determine what is happening.

# Find out what the code below does

days_of_the_week[3] <- "Wensday"


# Find out what the code below does

days_of_the_week[c(-6,-7)]

# Find out what the code below does

days_of_the_week[8]
  • The first line of code modifies the element at index position 3 and overwrites it (notice the assignment operator) to something else.

  • Using the minus sign when selecting elements reverses the process, i.e. tells R not to select that value. So here we are saying select all except index positions 6 and 7.

  • Giving an index position outside of the range of vector elements returns NA (Not Available) as there is no 8th value in the vector.

4 Lists

Lists are similar data structure to vectors in that they are an ordered collection of elements.

They differ to vectors because their elements can be of any type, including vectors and lists themselves!

Unlike with vectors, where combining multiple vectors creates one vector, in lists, each element can also be a collection of elements, and no coercion or combination takes place in its creation.

So why do you need lists?

Lists enables you to gather a variety of objects with different contents and lengths under one name in an ordered way.

Visual of a list, with each element being a collection of boxes with different colours.

To create a list we will use the list() function.

# Creating a list

movies_numbers_friends <- list(favourite_movies, lucky_numbers, my_friends)

# To display the list

movies_numbers_friends
[[1]]
[1] "Frozen"          "The Lion King"   "Moana"           "The Dark Knight"

[[2]]
[1]  7 12 15  1

[[3]]
[1] "ian"    "hannah" "mike"   "almas" 

Each of the contents will appear on a new line:

  • Starting with double-square brackets [[1]], this denotes the first element of the list, since it is a container of containers there are 2 levels to select from.
  • Next is the singular square bracket [1] denoting the first element of the numeric vector inside the list.

This is interesting as indexing from a list is a little more complicated.

4.1 Indexing from a list

We can select at two levels from lists, and this differs by what we actually return as a result.

Previously when we indexed from vectors:

  • If we selected one value, we return the value itself as a single data point.
  • If we selected a few values, we return a vector of those chosen values, as it is still a collection after all.

With lists:

  • If we select with double square brackets, [[ ]], we return the collection at position 1 in this list, this could be a vector, another list, whatever the value is.
  • If we select with single square brackets, [ ], we return a smaller list just containing the element we asked for.

Let’s see this in practice

# Index to return the collection at position 1

movies_numbers_friends[[1]]
[1] "Frozen"          "The Lion King"   "Moana"           "The Dark Knight"
# Index to return a smaller/sub-list with element 1 only

movies_numbers_friends[1]
[[1]]
[1] "Frozen"          "The Lion King"   "Moana"           "The Dark Knight"

We can even double-select to select the collection with double brackets, and then a specific element or collection of elements from there.

# Access "The Dark Knight" from the list

movies_numbers_friends[[2]][4]
[1] 1

So we access the second element of the list, which is a character vector, then the 4th element of the vector.

5 Data Frames

Data frames are like the tables we are used to from Excel and other programs, with columns and rows that form a 2 dimensional object.

They are collections of vectors, where:

  • Each vector contains the same data type
  • Subsequent vectors contain different data types (i.e. column 1 could be numeric, column 2 could be character)

As a whole, DataFrames have the following features:

  • Columns are variables.
  • Rows are observations (i.e. an entry for each variable forms an observation/row).
  • They can hold variables of different types.

Visual of a dataframe where each column has a name.

To create one, we can use the data.frame() function on vectors you would like to be your columns, they must be the same length.

Using data.frame()

Within this function, the argument we assign the vectors to becomes the column name, providing us a handy shortcut to doing these tasks separately.

# Create a dataframe with favourite movies and numbers

about_me_df <- data.frame(numbers = lucky_numbers,
                          movies = favourite_movies)

about_me_df
  numbers          movies
1       7          Frozen
2      12   The Lion King
3      15           Moana
4       1 The Dark Knight

You see that this just prints the column names and content in the console. We would need to investigate the data further to obtain other information about it.

When it comes to working with data, these are outdated, and the tidyverse, which we will be introduced to in the next chapter, provide us an excellent upgrade to the data frame, known as the tibble.

We will see this in the next chapter.

6 Summary

We have covered a lot material in R and yet there is still so much more to cover in terms of functionality, as R has so much to offer.

By no means are you expected to remember all the above, what is better is that you understand the problems you want to solve and can then use the references or material provided (or you find yourself) to go about solving it.

Next up, we will introduces packages, importing and exporting data.