Chapter 3 - Importing and Exporting Data
1 Learning Objectives
- Understand the importance of argument order in functions.
- Have an understanding of what packages are.
- Be able to load and install a package.
- Be able to check package versions and R version.
- Be able to import data from multiple formats.
- Be able to inspect loaded data.
- Be able to export data.
- Be able to explore data.
2 Returning to Functions
So far we have seen many functions, such as:
- sqrt()
- round()
- c()
- list()
You should make it a habit to explore the help files when you are using a function for the first time so you know:
- What required arguments there are.
- What optional arguments there are.
- What default arguments there are (some arguments have a value by default so we can exclude them without error).
Recall that they follow the form:
functionName(argument1 = value1, argument2 = value2, and so on)
2.1 How functions work
The seq() function from chapter 2 is the perfect example to reinforce how functions work, as well as common pitfalls.
Let’s take a closer look at the help file for seq().
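You can open the help file yourself from the console; either of the forms below works.
# Help doc for seq()
?seq
# or
help(seq)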
2.1.1 Function help files
Every help file will have a series of sections describing what the function does. It is worth focusing on the description, the usage and especially the arguments first.
Description
For example, in the help file for seq() under Description, it tells us it is a function to “Generate regular sequences”.
Usage
We can see that seq() takes the required arguments:
- from (which is 1 by default)
- to (which is 1 by default)
- by (which is calculated by default)
and the optional arguments:
- length.out
- along.with
Arguments
Here, we can find out what these arguments are:
- from, to: the starting and maximal end values of the sequence.
- by: a number; the increment of the sequence.
2.1.2 Execution of function arguments
We used this function when creating vectors; here's a reminder.
# Creating a sequence of numbers
seq(from = 2, to = 6, by = 2)
Let’s consider what happens instead when we don’t specify the arguments, just their values.
Example
# Creating a sequence of numbers - not specifying arguments
seq(1,10)
This has generated a sequence of numbers from 1 to 10. In this case we did not supply a value for by, so it took the default value, which in this case is 1.
What if we flip 10 and 1 instead?
# Creating a sequence of numbers
seq(10,1)
So it is clear that where we place each value is important, because arguments are resolved by position, in the order specified in the help documentation.
So above:
- In the first example, it is assumed that we want a sequence from 1 that goes to 10.
- Then if we swap the numbers, it is assumed we want a sequence from 10 that goes to 1.
However, if we name the arguments explicitly using argument = value, the order we specify them does not matter.
Let’s see this in action:
# Reversing arguments but using argument name
seq(to = 10, from = 1)
You can sometimes run into errors with more complicated functions by taking the argument order for granted.
Often there are many optional arguments that are resolved (by position) before the required ones.
We would encourage you to name your arguments explicitly, as it makes your code easier to understand.
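As a small illustration using seq() again, positional values rely on you remembering the argument order, whereas naming the arguments keeps the intent clear and lets you write them in any order:
# Positional: from = 2, to = 10 (by is left at its default)
seq(2, 10)

# Named: the intent is explicit and the order no longer matters
seq(to = 10, from = 2, by = 2)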
3 Packages
Our default R installation, often referred to as base R, gives us a lot of functionality out of the box.
If we want to employ the newest methodologies adopted by the wider R community, then we need to install packages to achieve this.
Packages are a collection of functions, compiled code and sometimes data sets which are stored in a library of code within R.
In order to use a package, we first need to install it:
R installs packages from CRAN (the Comprehensive R Archive Network), which contains over 20,000 packages.
You can install packages from outside of CRAN (such as from GitHub) with specific functions. Please be aware of the source and quality in these cases.
Your department may have a slightly different way of installing packages, so it is worth checking this with them first.
3.1 Installing Packages
To install a package, we use the code below for each new package.
install.packages("package_name", dependencies = TRUE), where dependencies = TRUE also installs any other packages your chosen one needs to function correctly.
Depending on your Operating System, another parameter is also recommended:
- For Windows, use type = "win.binary".
- For Mac, use type = "mac.binary" (this may differ for Apple Silicon).
Important: You will only need to install packages once. You should either do this in the console, or comment out the line in your script once it has been run.
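For example, a script might keep the install line commented out, leaving only the library() call active (a pattern sketch; the package name here is just an example):
# install.packages("janitor", dependencies = TRUE)  # already run once in the console
library(janitor)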
3.1.1 Exercise
- Install the packages below using the R console one at a time.
tidyverse
janitor
Note that tidyverse is a collection of R packages that follow the same programming paradigm, so will take quite some time to install.
# Installing packages
install.packages("tidyverse", dependencies = TRUE, type = "win.binary")
install.packages("janitor", dependencies = TRUE, type = "win.binary")
Successful installation should finish with messages as below, sometimes interspersed with some warnings.
## package 'janitor' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\bandai1\AppData\Local\Temp\Rtmpm0ZY69\downloaded_packages
3.2 Loading Packages
Think of a package like a book: you purchase (install) it once, and after that, whenever you need to reference it, you pick it up off your bookshelf.
In R we call this loading, and you should always load packages at the very top of your script.
To load a package, use the code:
library(package_name), where the package name is not a string this time.
3.2.1 Exercise
Load the packages you installed in the prior exercise:
tidyverse
janitor
# loading packages
library(tidyverse)
library(janitor)
There is almost always some output we get when loading packages. Some options are:
package "X" was built under R Version "Y" - this states that your R version may be lower than the one the package was built with. This is not always an issue, but you should endeavour to keep your software up to date.
The following objects are masked from "package:package_name": - this arises when functions from your newly loaded package have identical names to a function in base R or in another loaded package.
- In that case, the package you loaded takes precedence, and its function under that name will be used.
- You can get around this by using the syntax package_name::function_name, which leaves no doubt about which package the function comes from.
3.3 Checking Versions
We can check the R version by running this command.
# To check the version of R
version
You can see that we are running 4.1.3
If your version is a little older, this is fine, but we thoroughly recommend versions 4.1 and above and discourage versions beginning with a 3.
Older versions, such as 3.6.3 (which is popular) are no longer supported by the creators, and will conflict with almost all training you will engage with whilst learning.
Use the packageVersion() function with the package supplied as a string.
# Checking the package version
packageVersion("tidyverse")
You can see that we have the up-to-date version 2.0.0; this is backwards compatible with some older versions, but beware of argument name changes to functions.
If you are working collaboratively you should always check that you are all using the same versions of packages.
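One quick way to compare setups with colleagues is to print the R version and the versions of all attached packages in one go, using base R:
# R version plus versions of all attached packages
sessionInfo()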
One of the masks at play with the tidyverse involves the filter() function:
- Base R has a filter() function to apply on time series.
- Whereas dplyr (a tidyverse package for data manipulation) has a filter() function to select rows based on columns.
These take very different parameters, so it is important to know what has been masked. After loading the tidyverse, R will assume we want the dplyr version of filter() going forward.
To use the alternative we would need to type
stats::filter()
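To make the distinction concrete, here is a small sketch calling each version explicitly; the datasets used are just illustrative built-ins:
# dplyr's filter(): keep rows of a data frame that meet a condition
dplyr::filter(mtcars, cyl == 6)

# stats' filter(): apply a linear (moving-average) filter to a series
stats::filter(1:10, rep(1/3, 3))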
4 Tidyverse
Here we will introduce the tidyverse, a collection of R packages that changed the way many work with R forever.
The packages in tidyverse share a common philosophy for data manipulation and exploration so they work well together.
This philosophy is that of Tidy Data, described first in the seminal paper by Hadley Wickham, the tidyverse’s creator.
4.1 Advantages of the Tidyverse
The tidyverse is:
Well documented. Each sub-library has its own website containing a ‘cheat-sheet’ and vignettes. We thoroughly recommend bookmarking these.
Well established in the R data science community, meaning common issues and queries are already answered on platforms such as Stack Overflow.
Designed such that all sub packages follow a core ‘philosophy’ which encourages best practice.
Open-source software and free to use, as are the books written by the tidyverse creator Hadley Wickham; the one we recommend most highly is R for Data Science.
4.2 Disadvantages of the Tidyverse
Like R, tidyverse can have a steep learning curve, and its reliance on functional programming can confuse beginners.
It is incredibly flexible, which makes it hard to determine which solutions to problems are the best.
Quite verbose (wordy), which can lead to long scripts.
4.3 Tidyverse Breakdown
Below is a list of the core packages in the tidyverse to provide some awareness of what they make possible:
- readr - Data import
- tibble - Tibbles, a modern re-imagining of data frames
- tidyr - Data Tidying
- dplyr - General data manipulation
- stringr - String manipulation
- forcats - Factor variables
- ggplot2 - Data Visualisation
- purrr - Functional Programming
- lubridate - For dealing with dates and times - included in tidyverse 2.0.0 onwards.
The first of the core packages we will delve into is readr, which deals with reading in data, and by extension tibbles, the excellent update to dataframes that the tidyverse provides.
However, we need an understanding of the working directory beforehand.
5 Working Directory
R has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.
We often refer to this as the “starting point” when R looks for a file you specified the path for.
Thankfully, we are using an R project, which makes filepaths and directories reproducible, by ensuring everyone who opens the project has this set by default.
If you are not using a project (we recommend you do), you will need to set your own working directory manually with the setwd() function, which requires the full path to the directory.
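For example (the path below is purely hypothetical; substitute your own folder):
# Manually setting the working directory - only needed when not using a project
setwd("C:/Users/your_username/af_introduction_to_r")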
5.1 Checking Working Directory
The getwd() function (get working directory) is ideal.
# Getting the working directory
getwd()
If you are inside the project created within these materials, you should have the same final step in the “path”, that of the “Course_content” folder.
In Windows file paths are specified using back slashes, but in R a backslash already has a meaning, so we use a forward slash or two back slashes instead.
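For example, both of the following point to the same (hypothetical) Windows folder; only a single-backslash version would cause problems in R:
# Forward slashes - works in R on Windows
"C:/Users/your_username/Documents"
# Doubled backslashes - also works
"C:\\Users\\your_username\\Documents"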
6 Reading in Data
There are a variety of ways of reading data into R; in this chapter we will look at reading data using the packages:
- readr - loaded with tidyverse
- readxl - installed with tidyverse, but loaded separately.
6.1 Readr
The package provides a fast and friendly way to read data from:
- Comma Separated Value (csv) files
- Tab Separated Value (tsv) files
converting them to tibbles, which are the standard data structure used across the tidyverse.
Let’s formally introduce them now.
6.1.1 Tibbles
Tibbles are data frames, but they tweak some older behaviours to make life a little easier (R is over 20 years old, after all):
- Tibbles complain more when variables do not exist - leads to easier error checking.
- Tibbles don’t change variable names.
- Tibbles don’t tweak variable types from the source data.
The key benefits of tibbles are:
- In addition to its name, each column reports its type.
- The dimensions of the tibble are shown at the top.
- Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data.
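To see this refined print for yourself, you can convert a built-in data frame to a tibble; a quick illustration using the iris dataset:
# Convert a data frame to a tibble and print it
library(tibble)
as_tibble(iris)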
6.1.2 How readr works
This package reads in the datasets we supply using a family of functions that share the same prefix:
read_filetype()
The most common, and one we will use throughout the course is read_csv().
6.1.3 Considerations to make
Before importing your data you need to know:
Where is it stored?
What kind of file is it?
Are there missing values in the data?
- Missing values in R are denoted by NA.
The code will take the following form:
data_name <- read_csv(file_path)
and readr will:
Assume the first row of your data is the headings of the columns.
Attempt to guess the datatype of each column, given its content. If a numeric column contains 99 doubles and one character value, then the same coercion that happened with vectors will happen again (since columns are vectors) and we get a character column - see the short example after this list.
- One of the first data checks you should do is that the types of the columns match what you expect.
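As a quick reminder of that coercion behaviour with a plain vector:
# One character value forces the whole vector to character
c(1, 2, 3, "*")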
6.1.4 Example - Our first filepaths
Let’s load in the titanic dataset in the “Data” folder.
We need to figure out where this is and how to get there from our current working directory, so that we can tell R.
An absolute or full filepath is constructed as:
“starting_drive/step_1/step_2/step_3/…./destination”
this details the full path taken to reach the file.
To reach “titanic.csv” the absolute filepath for us is
“C:/Users/marshj1/af_introduction_to_r/data/titanic.csv”
Note that your usernames and drives will differ to ours.
A relative filepath is the path to reach the file relative to the current working directory.
Thus we are already part of the way there, and just need to tell R where to go from here:
“working_directory/step_1/step_2/…/destination”
However, in our case, our working directory is one level deeper than we’d like to be, in the course_content folder.
To reach the dataset, we must exit this folder to get back to the root, then enter the data folder, and select titanic.csv
To do so, we need to know how to go back one folder level, or exit the current directory, this is written as ../ where the two full stops denote going back.
As such, the relative filepath we need to reach the dataset is
“Data/titanic.csv”
Visually, the tree-like folder structure looks something like the following: [folder-structure diagram not reproduced here]
Let’s read in titanic using a relative filepath.
We simply need to go into the data folder, then select the titanic.csv file to load in.
# Read in titanic with read_csv()
titanic_data <- read_csv("Data/titanic.csv")
You will get some information on:
- rows and columns
- Counts of each column datatype
Let’s call the name of the variable we have created to see the output of our first tibble.
# Display the titanic data
titanic_data
Notice the refined print we get from this, which provides much more information than the data frame output did in chapter 2.
It may take more time to read and understand, but prevents you from having to use many additional functions to find out things such as dimensions (rows and cols).
Another example of inspection is with View(), which opens a separate tab in the code editor pane with the dataset in spreadsheet form.
# Viewing the data - Note the capital V
View(titanic_data)
6.1.5 Exercise
Having read in the titanic data above, have a look at the column Age of passenger.
What type of data would you expect this column to be?
Use the “str()” function to see the data type R has set it to be.
The column looks numeric.
# Using the str() function
str(titanic_data)
We can see that R has classed the column as character because of the full stop (.) and asterisk (*) values within it.
Notice how much of the information this provides is also represented in the tibble output, very impressive!
6.1.6 Dealing with Missing Values at Read-in
Whilst we cannot observe all missing values at this stage, examples that cause columns to be cast to unexpected data types are often spotted quickly.
We can easily correct this by adding the na parameter to the read_csv() function.
# Specifying missing values as a vector to read_csv()
titanic_data <- read_csv("Data/titanic.csv",
                         na = c("*", ".", "", "NULL"))
This is read as:
Where there is a full stop, an asterisk, the text NULL, or an entirely blank cell, class it as a missing value.
We can now see that the . and * in the age column have been replaced with NA’s and the age column is now numeric.
You may see all sorts of missing values in practice, deriving from data entry:
- negative numbers where it makes no sense
- abnormally large values such as 999999
There are many other useful arguments you can use when reading in data, check the help documentation for read_csv() for details.
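For instance, a few of the other read_csv() arguments you will find in the help file are sketched below; the values here are illustrative rather than specific to the titanic data:
# Some other read_csv() arguments (illustrative values)
read_csv("Data/titanic.csv",
         col_names = TRUE,   # treat the first row as column headings
         skip = 0,           # number of lines to skip before reading data
         n_max = 100)        # maximum number of rows to read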
6.2 Readxl
We use readxl to read Excel data into R; it supports both .xls and .xlsx formats.
Even though it is installed alongside the tidyverse, it must be loaded separately.
The code for read in is very similar to the read_csv() example, just using the read_excel() function.
# Load in readxl
library(readxl)
Excel files are more complex than flat files as they are workbooks, featuring multiple sheets.
We can output their names using the excel_sheets() function, which will become important shortly.
6.2.1 Example
Let’s read in the police dataset.
# Reading in excel data using the readxl package
police_data <- read_excel("Data/police_data.xlsx")

View(police_data)
We see that this is the first sheet in our workbook, which is just the "Notes" sheet. This is the default behaviour of read_excel() unless we specify otherwise.
# Observe sheet names in police data
excel_sheets("Data/police_data.xlsx")
6.2.2 Exercise
Use an additional argument in “read_excel()” to read in the second sheet of the police dataset.
You will need to look at the help documentation for read_excel() to help you with this.
You can use the name of the sheet or the number/index.
# Using the sheet parameter in 2 ways
police_data <- read_excel("Data/police_data.xlsx",
                          sheet = 2)

# Alternatively
police_data <- read_excel("Data/police_data.xlsx",
                          sheet = "Table P1")
police_data
This is better but still not ideal:
The top rows are mostly blank, with no real significant data.
- We can get around this by specifying the range parameter in read_excel, to denote the range of cells to capture.
# Using the range parameter to avoid empty rows
police_data <- read_excel("Data/police_data.xlsx",
                          sheet = 2,
                          range = "A5:AA48")
police_data
7 Exporting the Data
When you read a file into R, the data is loaded into memory. This means that any changes you make won’t be reflected in the original file you loaded.
If you want to preserve the changes you make to the dataset you have to export the data object to its own file.
Just as with reading in, we have a family of functions for this: the write_filetype() functions.
7.1 Example
As an example, let’s write out the police data which now has the correct sheet and range as a csv (flat) file instead.
We must supply two arguments to write_csv():
- the dataset itself
- file, which must be the filepath of where to save your file.
The path is constructed similarly as before
“../folder/my_file.csv” as we must back out of the course_content folder first
Note that write_csv() will not create a folder that does not exist, so make sure the destination folder exists first. In this case, let's create a separate folder called "outputs".
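A small sketch of creating that folder from R first, using base R functions and assuming the same relative path as in the export below:
# Create the outputs folder if it does not already exist
if (!dir.exists("../outputs")) {
  dir.create("../outputs")
}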
# Exporting data using write_csv()
write_csv(police_data, file = "../outputs/police.csv")
This stores our police_data in the outputs folder under the name “police.csv”.
8 Inspecting the Data
After importing our data, the first thing we may want to do is have a quick look at it:
- We can check it looks similar to the source file.
- We can check if there are the same number of rows and columns we expect.
- We can check if the columns loaded in as the datatype we expect.
- We can check if there are obvious missing values at the tail ends.
We can do these checks in a variety of ways, be that from base R, or in the tidyverse.
8.1 Inspecting Data Functions
We can output the first or last 6 rows of the dataset by using the head() and tail() functions.
# Head of titanic
head(titanic_data)
# Tail of titanic
tail(titanic_data)
We can output the number of rows and columns together, or each separately, with the following functions.
# Number of rows in titanic
nrow(titanic_data)
# Number of columns in titanic
ncol(titanic_data)
# Dimensions - Vector of nrow and ncol
dim(titanic_data)
We can output the column names of a dataset as a vector using the names() function.
# Column names in titanic
names(titanic_data)
The functions above are all from base R; much of the information they provide is already handled for us by the tibble's refined print method.
The tidyverse has dplyr, a data manipulation package, which provides arguably the best inspection function, known as glimpse().
This refines the print even more, to just a snapshot of the content of the columns themselves.
# Have a glimpse at titanic
glimpse(titanic_data)
We still retain much of the information from before:
- The number of rows and columns
- Column names
- Column data types
- The first 5-10 observations
We recommend using this function after each manipulation you make to the data, as a sense check.
9 Summary
Well done for working your way through this solid introduction to the readr package in the tidyverse.
By no means are you expected to remember all of the above; what matters more is that you understand the problems you want to solve and can then use references or the provided material to go about solving them.
In the next chapter we will look at data manipulation with the dplyr package from the tidyverse.