Chapter 2 - Exploratory Data Analysis

Author

Government Analysis Function and ONS Data Science Campus

1 Learning Objectives

The goal of this session is to learn about:

Inspecting the dataset
Finding missing values
Univariate exploratory data analysis
Multivariate exploratory data analysis

We need to load the following packages for this chapter:

# Libraries needed

library(tidyverse) # For data manipulation 
library(visdat)    # For missing values   
library(naniar)    # For missing values
library(moments)   # For skewness
library(GGally)    # For correlation plots

2 Inspecting the dataset

Exploratory data analysis is usually the first step in any data analysis project. First, we will explore the raw_energy_data to familiarise ourselves with the structure of the data.

Import the fake_energy_data and assign it to the variable name raw_energy_data.
Find the following things about the raw_energy_data:
1. First 6 observations
2. Last 6 observations
3. Structure of the dataset (using str() or dplyr::glimpse() functions)

Import the fake_energy_data and assign it to the variable name raw_energy_data.

# Importing fake energy data and assign to 
# variable name raw_energy_data

raw_energy_data <- readr::read_csv("../Data/fake_energy_data.csv")

Find the following things about the raw_energy_data:

First 6 observations

# Calculating first 6 observations using head() function

head(x=raw_energy_data)

Last 6 observations

# Calculating last 6 observations using tail() function

tail(raw_energy_data)

Structure of the dataset (using str() or dplyr::glimpse() functions)

# Looking at structure of dataset using glimpse() in dplyr

dplyr::glimpse(raw_energy_data)

# Alternative is to use str()

str(raw_energy_data)

3 Finding missing values

Arguably one of the first things we will need to establish in our data, before even looking at summary statistics, is whether it contains any missing values. Having missing values in data is a common occurrence. We can use the packages visdat to visualise our missing values and base R functions to find the number of missing values in each column (variable).

# check if there are any NAs in our dataset 

anyNA(x=raw_energy_data)

# Visualise the missing value 
# x can be a dataframe

visdat::vis_dat(x=raw_energy_data)

3.1 Quiz

Quiz
Show Answer

From the plot above, which variable has missing values

The Income variable has missing values.

3.2 Missing values by columns

We can also check the number of missing values in each column by running the code shown below:

# colSums() function gives the sum of missing value in each columns
# is.na() function which checks if the data has missing values

colSums(is.na(x=raw_energy_data))

Also, we can check if there are other values which might represent missing data in categorical variables.

# Select only categorical variables by using if_select() function in dplyr

cat_raw_energy_data <- raw_energy_data %>% 
                       dplyr::select_if(is.character) 


# See unique values by each column 

sapply(cat_raw_energy_data, function(x) unique(x))

As you can see, there are a few Unknown and unknown in the categorical variables. These unknown/Unknown can be a missing value or one of the options as well for example the categorical variable Country has three values England, Wales and Unknown. This unknown can be treated as a missing value or it can be a possibility that the individual’s property is not in either England or Wales. Hence, you need to understand the context of your dataset before assuming them to be a missing value. In this course, we will treat these unknown/Unknown as a missing value. We can also convert our missing values to NA to find the sum of all missing values in raw_energy_dataset as shown below:

# Convert "Unknowns" into NA

raw_energy_data[raw_energy_data == "Unknown"] <- NA
raw_energy_data[raw_energy_data == "unknown"] <- NA

# This will show the sum of missing values in all columns

colSums(is.na(x=raw_energy_data))

If there are values in the data, NA or not, which represent missing or unknown values, we have to decide what to do with them. For now, we will not do anything with missing values in categorical variables. However, for the variable income, we might choose to impute the NA values based on what we know about the rest of the data (for example the mean of income variable can be used). We would do this rather than dropping the variable so that we retain as much information as possible for our future analyses.

# Specify your dataset and then use mutate() function
# Using if_else specify the new column name and conditions
# na.rm omits missing value while calculating mean 

raw_energy_data <- raw_energy_data %>%
                  dplyr::mutate(income =
                  dplyr::if_else(condition = is.na(income),     
                                 true = mean(income, na.rm = TRUE), 
                                 false = income))

# check if there are any NAs in income column after mean imputation
anyNA(x=raw_energy_data$income)

As you can see we do not have any missing values since we have imputed missing values with the mean of income. However, it is not the best practice since mean imputation can reduce the variance of the variable. Please refer to the GSS Editing and Imputation Course for details on how to best impute your missing data.

3.2.1 Quiz

Quiz
Show Answer

In the above code block for imputing missing values, what does the second argument in the mean() function do?

It turns all the observations to NA
It removes the observations which are NA when calculating the mean
It returns that mean
Nothing

It removes the observations which are NA when calculating the mean. The argument na.rm = TRUE means that NA observations aren’t included in the calculation, which means that a non-NA value can be returned.

4 Univariate exploratory data analysis

Univariate exploratory data analysis is the simplest form of exploratory data analysis where the data being analysed contains just one variable. Since it’s a single variable we don’t deal with causes or relationships between variables. The main purpose of the univariate analysis is to describe the data and find patterns that exist within it. First, we will look at continuous data and then categorical data.

4.1 Continuous variables

Continuous variables are numeric variables that have an infinite number of values between two values. An example of a continuous variable could be a weight variable which contains the weight of people as observations. In raw_energy_data there are two numeric variables, the first one is the response variable consumption and the other one is the explanatory variable income. A response variable (or dependent variable) is that variable whose variation depends on other variables. The response variable is often related to the independent variable, sometimes denoted as the explanatory variable. In short, the response variable is the subject of change within an experiment, often as a result of differences in the explanatory variables.

First, we will look at the descriptive statistics of these two numeric variables and then plot them. When summarizing continuous variables, we are typically interested in questions like:

What is the “centre” of the data? (Mean, median)
How spread out is the data? (Standard deviation, variance)
What are the extremes of the data? (Minimum, maximum and outliers)
What is the “shape” of the distribution? Is it symmetric or asymmetric? Are the values mostly clustered about the mean, or are there any values in the “tails” of the distribution? (Skewness)

We can use the following function to examine the summary statistics of consumption variable:

# Get the summary statistics of consumption variable

summary(raw_energy_data$consumption)

This minimum value of consumption variable looks strange since gas consumption can only be positive. Since, the minimum value seems to be a potential outlier, removing or replacing the minimum value can affect the mean or median. We can also see that the mean gas consumption of households in England and Wales is 14505 kWh, while the median gas consumption is slightly greater at 14578 kWh, suggesting the data is slightly negatively skewed (the “tail” of the distribution points to the left). We can check the skewness by using the skewness() function in moments package.The mathematical equation of calculating skewness is shown below:

\[Skew= \frac{ \sum_{i=1}^{n}(x_{i}-\bar{x})^3}{\sum_{i=1}^{n}(x_{i}-\frac{\bar{x}^2}{n})^\frac{3}{2}}\]

If skewness is less than −1 or greater than +1, the distribution is highly skewed.
If skewness is between −1 and -0.5 or between +0.5 and +1, the distribution is moderately skewed.
If skewness is between −0.5 and +0.5, the distribution is approximately symmetric.

# Calculating skewness using moments package 

moments::skewness(x=raw_energy_data$consumption, na.rm = TRUE)

The value is between -0.5 and 0.5, suggesting the data are fairly symmetrical.

To find the summary statistics of income variable we will use the same function.

# Get the summary statistics of income variable

summary(raw_energy_data$income)

This maximum value of the income variable looks very strange since earning exactly £1,000,000 per year seems odd especially considering the interquartile range (The interquartile range, denoted as IQR is the distance between the first quartile Q1 and the third quartile Q3. So 50% of the data are within this range, in our case IQR = Q1-Q2 = 13009). The minimum and maximum values are (Q1 - 1.5 x IQ) and (Q3 + 1.5 x IQ) so it seems like the maximum value is a potential outlier. Also, the mean income of households in England and Wales is £34272 per year. The mean income per year is greater than the median income per year suggesting a positively skewed distribution of the income variable (the “tail” of the distribution points to the right).

# Skewness using moments package

moments::skewness(x=raw_energy_data$income, na.rm = TRUE)

The skewness value supports the suggestion that the income is positively skewed.

Generally, the summary statistics can indicate if there are strange or unexpected values in our data other than NA. For example, the strange values we just saw in for both consumption and income. We’ll look at these in detail in the next section. An important step in understanding our data and the subsequent statistics properly is actually looking at all of the data. We can do this with functions in base R or ggplot2 packages in R. Base R plots are useful to get an immediate sense of the data and identify potential distributions and outliers. However, ggplot graphs are more customisable than base R plots and for that reason, they are probably better to use as ‘final outputs’ for any report.

4.1.1 Plotting continuous variables

Let’s examine our variable income in raw_energy_data. What does this variable look like when visualised? First, we’ll use a histogram, and then a density plot to visualise the data. The most basic graph is the histogram, where each bar represents the frequency (count) or proportion of cases for a range of values. The range of the data for each bar is called a bin.

# Select thedata
# Create a histogram
# Specify the variable to display in aes(), aesthetic attributes are mapped to variables
# Specify the labels
# Specify the title
# Centre the title

ggplot2::ggplot(raw_energy_data) +   
         aes(x=income) + 
         geom_histogram(bins=15) +
         labs(x="Income (£)", y ="Count") +
         ggtitle("Histogram of Income") +
         theme(plot.title = element_text(hjust = 0.5))

A density plot visualises the distribution of data over a continuous interval. The peaks of a density plot help display where values are concentrated over the interval. An advantage density plots have over Histograms is that they’re better at determining the distribution shape because they’re not affected by the number of bins used. The density plot of income variable is shown below:

# Create a density plot - we just use geom_density instead of geom_histogram

ggplot2::ggplot(raw_energy_data) +
                aes(x = income) +
                geom_density() +
                labs(x="Income (£)", y ="Density") +
                ggtitle("Density Plot of Income") +
                theme(plot.title = element_text(hjust = 0.5))

We can see that some very strange plots are being produced. This is yet another way to see if there’s something strange going on in our data, and from our summary statistics and the plots, we can confirm that there is at least one value in the income variable which is abnormally high (the maximum income £1,000,000 per year). Hence, we’ll replace the maximum value of income with the mean income of a household.

# Specify your dataset and then use mutate function
# Using if_else specify the new column name and conditions
# Replace the maximum value of income with mean of income

raw_energy_data <- raw_energy_data %>%
                  dplyr::mutate(income =
                  dplyr::if_else(condition = income==max(income),
                                 true = mean(income), 
                                 false = income))

Now we will again look at the summary statistics and the histogram to see if the data looks like what we were expecting it to be.

# Find the summary statistics of the income variable

summary(raw_energy_data$income)

# Plot the histogram again 

income_hist <- ggplot2::ggplot(raw_energy_data) + 
                               aes(x=income) +                        
                               geom_histogram(color="black", fill="white",bins=25) +
                               ggtitle("Histogram of Income") +
                               labs(x="Income", y ="Count") +
                               theme(plot.title = element_text(hjust = 0.5)) 
               
# Add a mean line in the histogram 

income_hist + geom_vline(aes(xintercept=mean(income)),
              color="blue", linetype="dashed", size=1)

As we can see, the maximum value is now £122,394 which is much lower than the previous maximum value (£1,000,000) suggesting the previous maximum value is indeed an outlier. Also, the mean is now less than the median suggesting the income is slightly negatively skewed. Lastly, the histogram looks much better, although the new maximum value (£122,394) still seems to be a potential outlier.

4.1.1.1 Exercise

Exercise
Show Answer

Find the strange value in the consumption variable and impute it appropriately?
Check the summary statistics of the consumption variable after imputing the data?
Create a histogram of the consumption variable after imputation

Find the strange value in the consumption variable and impute it appropriately?

The summary statistics show the minimum value (-434 kWh) is negative which looks strange since gas consumption in kWh should only be positive.

# The summary statistics of consumption variable

summary(raw_energy_data$consumption)

We will impute the minimum value with the mean of consumption as shown below:

# Specify your dataset and then use the mutate function
# Using if_else specify the new column name and conditions
# Replace the minimum value of consumption with mean of consumption

raw_energy_data <- raw_energy_data %>%
                dplyr::mutate(consumption=
                dplyr::if_else(condition = consumption==min(consumption),
                               true = mean(consumption), 
                               false = consumption))

Check the summary statistics of the consumption variable after imputing the data?

# The minimum value is now positive

summary(raw_energy_data$consumption)

The minimum value is now 41.51 which does not look too strange.

Create a histogram of the consumption variable after imputation

# Create a histogram 

ggplot2::ggplot(raw_energy_data) +
                aes(x=consumption) +            
                geom_histogram() +
                labs(x="Consumption", y ="Count") +             
                ggtitle("Histogram of Consumption") +
                theme(plot.title = element_text(hjust = 0.5))

The histogram looks fairly symmetric.

4.1.2 Scaling continuous variables

We can also look at the spread of our data by using boxplots. Boxplots are a very useful univariate graphical technique. They are very good at presenting information about the central tendency, symmetry and skew, as well as outliers. First, we will create a boxplot of income variable in the raw_energy_data dataset.

# Create a boxplot
# Add whiskers(whiskers are the two lines outside the box 
# that extend to the highest and lowest observations)

ggplot2::ggplot(raw_energy_data) +
                aes(x = income) +
                geom_boxplot() +
                labs(x="Income") +
                ggtitle("Boxplot of Income") +
                theme(plot.title = element_text(hjust = 0.5)) +
                stat_boxplot(geom = 'errorbar',width = 0.2)

# For older version R you will get error
# so use cord_flip() to flip your coordinates

alternative <-ggplot2::ggplot(raw_energy_data) +
aes(y=income) +
coord_flip() +
geom_boxplot() +
labs(x="Income") +
ggtitle("Boxplot of Income") +
theme(plot.title = element_text(hjust =0.5)) +
stat_boxplot(geom = 'errorbar',width = 0.2)

Our box plotting for income shows there are 13 observation beyond the spread of the rest of the data (whiskers). This can be potential outliers. Outliers are unusual values in a dataset which can increase the variability.

Another way to check for outliers is by standardising the variables of interest. Standardising in the process of transforming a variable to have a mean of 0 and a standard deviation of 1. It is done by standardising each value in the variable as shown below:

\[z=\frac{\bar{X}-\mu}{\sigma}\] Where:

\(z\) is the transformed observation
\(\bar{X}\) is the observation (observation in the income variable)
\(\mu\) is the mean of the variable (income mean)
\(\sigma\) is the standard deviation (standard deviation of income variable)

We standardise the variable and then consider any standard value (z-score) above 3 or below -3 to be an outlier. This can be achieved with the scale() function in R. Use the code below to check for outliers using this approach.

# Scale the value variable to give standard value

standard_value <- scale(raw_energy_data$income)

# Check for values greater than 3
standard_value[standard_value > 3]

# check for scores smaller than -3
standard_value[standard_value < -3]

Finally, given that we have used multiple approaches to visualise and assess our data, we can choose to remove outliers by using the dplyr::filter() function as shown below:

# Create a new column using mutate()
# standardised income variable as shown above 
# Keep all observation with a standard value less than 3 and greater than -3 using filter()

energy_data <- raw_energy_data %>%
  mutate(scaled_income = scale(income)) %>%
  filter(scaled_income < 3) %>% 
  filter(scaled_income > -3)

Our dataset is reduced to 1126 observation. You can check the difference between scaled and unscaled income variable in energy_data below:

# Our dataset is reduced to 1126 observations

nrow(energy_data)

# See the difference between scaled and unscaled data 
# Using sample() function 
# Sample of 10 observation of unscaled income

sample(energy_data$income, 10)

# Sample of 10 observation of scaled income

sample(energy_data$scaled_income, 10)

4.1.2.1 Exercise

Exercise
Show Answer

Create a boxplot of the consumption variable in energy_dataand check if there any potential outliers?
Standardise the consumption variable in energy_data and find the outliers (Hint: Standardised values greater than 3 and less than -3 are considered outliers)

Create a boxplot of the consumption variable in energy_dataand check if there any potential outliers?

# Create a boxplot
# Add whiskers 

ggplot2::ggplot(energy_data) +
                aes(x = consumption) + 
                geom_boxplot() +
                labs(x="Consumption") +
                ggtitle("Boxplot of Consumption") +
                theme(plot.title = element_text(hjust = 0.5)) +
                stat_boxplot(geom = 'errorbar',width = 0.2)

# For older version R you will get error
# so use cord_flip() to flip your coordinates

alternative <- ggplot2::ggplot(raw_energy_data) +
aes(y=consumption) +
coord_flip() +
geom_boxplot() +
labs(x="Consumption") +
ggtitle("Boxplot of Consumption") +
theme(plot.title = element_text(hjust =0.5)) +
stat_boxplot(geom = 'errorbar',width = 0.2)

As you can see there are few potential outliers, we will standardise the consumption variable and confirm it.

Standardise the consumption variable in energy_data and find the outliers (Hint: Standardised values greater than 3 and less than -3 are considered outliers)

# Scale the value variable to give standard value
standard_value <- scale(raw_energy_data$consumption)

# Check for values greater than 3
standard_value[standard_value > 3]

# check for scores smaller than -3
standard_value[standard_value < -3]

We will not filter our data because it is not considered as good practice to remove values from your response (dependent) variable in exploratory data analysis. This concludes the cleaning that we need to do with our data. From now on, we can refer to the cleaned data set simply as energy_data.

4.2 Categorical variables

Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order for example gender and material type. In energy_data there are several categorical variables such as property_type, tenure and country. The property_type variable contains the type of property such as semi-detached, mid-terraced, etc. as observations. We can explore the count of the unique number of levels of this variable by running the code shown below:

# Counting six values with dplyr
# This operation gets us the count for each level in descending order

energy_data %>% count(property_type) %>% arrange(desc(n))

4.2.1 Exercise

Exercise
Show Answer

Create a frequency table of the tenure variable in the cleaned energy_data.

Create a frequency table of thetenure variable in the cleaned energy_data.

# Frequency of property type by
# Aggregating the data using group_by()

energy_data %>% group_by(tenure) %>% count()

4.2.2 Plotting categorical variables

We can also plot variables that have discrete values and different levels. Let’s examine the variable property_type from energy_data using a bar chart as shown below:

# Creating a bar chart 
# With angled axis labels

ggplot2::ggplot(data = energy_data) + 
                aes(x = property_type) +
                geom_bar() + 
                labs(x="Type of Property", y ="Count") +             
                ggtitle("Frequency of Different Property Type") +                    
                theme(plot.title = element_text(hjust = 0.5),
                      axis.text.x = element_text(angle = 45, hjust = 1))

This bar chart shows that the most frequent property type is the end terrace house, and that semi-detached houses are the least common. There are 50 NA which means that there are 50 unknown property types in the energy dataset (we converted unknowns to NA in Finding missing values section), it could be because either their type of property didn’t fit into any of the categories or they might be an actual missing value.

4.2.3 Exercise

Exercise
Show Answer

Create a bar chart of the property_age variable in energy_data.

Create a bar chart of the property_age variable in energy_data.

# Creating a bar chart 
# With angled axis labels

ggplot2::ggplot(data = energy_data) + 
                aes(x = property_age) +
                geom_bar() +
                labs(x="Age of Property", y ="Count") +             
                ggtitle("Frequency of Age of Property") +     
                theme(plot.title = element_text(hjust = 0.5), 
                      axis.text.x = element_text(angle = 45, hjust = 1))

5 Multivariate exploratory data analysis

Another essential lens to examine data through is what patterns emerge between variables. To find the relationship between variables, we will do some multivariate exploratory data analysis. First, we will look at the continuous data and then the categorical data. However, since we have only two continuous variables in energy_data we will be using airquality_data to see relationships between different continuous variables, while we will be using energy_data for exploratory analysis of categorical data.

5.1 Continuous variables

Since we are using airquality_data data in this section, we can explore the structure of the data again.

# Clean column names

airquality_data <- janitor::clean_names(airquality)


# Structure of airquality_data

dplyr::glimpse(airquality_data)

We will assume ozone to be response variable and everything else to be the explanatory variables. month is an explanatory variable which has unique values. First, we will look at the unique values in the month variable by running the code below:

# Unique values in the variable month in airquality_data

unique(airquality_data$month)

The variable month is a categorical variable since it has categories. In R factors are used to represent categorical data. The set of values that the elements of a factor can take are called its levels (so in the month variable the levels are 5,6,7,8,9). Factors can be ordered or unordered and are an important class for statistical analysis and for plotting; whenever you use the variable month you should make it a factor.

# Using the categorical data as factor using factor() when plotting graphs

airquality_data$month <- factor(x=airquality_data$month)

We will also convert the variables solar_r, ozone and temp from integers to numeric because it is easier to create a scatterplot matrix (in section Scatterplot Matrix )

# Convert all variables from integers to numeric

airquality_data <- airquality_data     %>% 
                   mutate(solar_r=as.numeric(solar_r)) %>% 
                   mutate(ozone=as.numeric(ozone))     %>% 
                   mutate(temp=as.numeric(temp))

Now, we will look at the structure of the data again and then look at the summary statistics of air quality data by running the following code below:

# The variables ozone, temp and solar_r are numeric now

dplyr::glimpse(airquality_data)


# Find the summary statistics of airquality_data 

summary(airquality_data)

As you can see the mean is slightly greater than the median in variables wind and ozone suggesting a slightly positively skewed data. We can check the skewness by using the code below:

# Finding skewness of ozone using moments package

moments::skewness(x=airquality_data$ozone, na.rm = TRUE)

The data is skewed heavily towards the right. We can apply transformations to make it symmetric for example we can take the square root of the ozone variable. We will look at transformations in Chapter 5.

We can also see in summary statistics that there are 37 missing values in the ozone variable and 7 missing values in solar_r variable. We can visualise the missing values in variables ozone and solar_r by each month from May till September.

# Creating scatterplot 
# Specify x and y axes
# geom_miss_point() has shifted the missing values to now be 10% below the minimum value
# facet_wrap() is used when your variable has several levels
# legend.position specifies the position of the legend

ggplot2::ggplot(data = airquality_data) +
                aes(x = ozone, y = solar_r) + 
                geom_miss_point() + 
                facet_wrap(~month, ncol = 2) + 
                theme(legend.position = "bottom")

We can see how missing values are affecting the spread of the data, you can impute it using mean of ozone and solar_r.

# Impute missing values in solar_r using mean imputation

airquality_data <- airquality_data %>%
                  dplyr::mutate(solar_r=
                  dplyr::if_else(condition=is.na(solar_r),     
                  true = mean(solar_r, na.rm = TRUE), 
                  false=solar_r))

# Impute missing values in ozone using mean imputation

airquality_data <- airquality_data %>%
                  dplyr::mutate(ozone=
                  dplyr::if_else(condition=is.na(ozone),     
                  true = mean(ozone, na.rm = TRUE), 
                  false=ozone))

However mean imputation can cause variability in the data and it is not the best practice. Refer to Editing and Imputation to gain a better insight of possible imputations.

5.1.1 Plotting two continuous variables

As part of our analyses we probably also want to understand the relationship between possibly related variables. We can do this by plotting a graph that has one variable on the x-axis and the other on the y-axis. A typical graph to choose is a scatter plot for two continuous variables. First, we will examine the relationship between two explanatory variables solar_r and wind in the airquality_data by creating a scatterplot as shown below:

# Create a scatter plot
# Specify the x and y axes
# Make the colour conditional on a variable month
# Remember to make month variable as a factor variable
# Specify the default geom_point()
# Specify the titles, x and y labels

ggplot2::ggplot(data = airquality_data) +
                aes(x =solar_r, y = wind, colour = month) +
                geom_point() +
                labs(x="Solar Radiations (Langleys)", 
                     y ="Wind (miles per hours)" ) +
                ggtitle("Wind Speed and Solar Radiations from May till September") +     
                theme(plot.title = element_text(hjust = 0.5))

As you can see the data points are randomly scattered and do not form a pattern. This suggests that the solar radiation and the wind speed don’t have a linear relationship. Hence, there is a very weak correlation between the variables wind and solar_r.

5.1.1.1 Exercise

Exercise
Show Answer

Create a scatterplot of the variables wind and temp.
Comment on the scatterplot.

Create a scatterplot of variables wind and temp.

# Create a scatter plot
# Specify the x and y axes
# Make the colour conditional on a variable month
# Specify the default geom_point()
# Specify the titles, x and y labels

ggplot2::ggplot(data = airquality_data) +
                aes(x =temp, y = wind, colour = month) +
                geom_point() +
                labs(x="Temperature (°F)", 
                     y ="Wind (miles per hours)" ) +
                ggtitle("Wind Speed and Temperature from May till September") +     
                theme(plot.title = element_text(hjust = 0.5))

Comment on the scatterplot.

The data points are not scattered randomly, it is showing a slight negative linear relationship between the speed of wind (in miles per hours) and temperature (in degrees Fahrenheit). This suggests that there is a moderate correlation between the variables temp and wind. Also, the data is clustered by months as well e.g. we can see red points (representing temperature and wind speed in May) is clustered around the range from 55 degrees Fahrenheit to 70 Fahrenheit which makes sense as well since it is less windy in summers and temperature is usually higher in summers than winters.

5.1.2 Covariance and Correlation

We can quantify and explore such a relationship further by retrieving the statistical values for the covariance and correlation between the two variables.

The covariance of two variables X and Y in a data set (denoted by \(Cov(X,Y)\)) measures how the two are linearly related. A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite - as one variable increases, the other decreases.

The correlation coefficient of two variables X and Y (denoted as \(Corr(X,Y)\) or \(r\)) in a data set equals to their covariance (\(Cov(X,Y)\)) divided by the product of their standard deviations (\(\sigma_x\)\(\sigma_y\)).

\[ r=Corr(X,Y)=\frac{Cov(X,Y)}{\sigma_x\sigma_y}\]

The values for correlation range between -1 and 1. Here is the rule of thumbs for correlation coefficient interpretation:

Rule of thumb for correlation coefficients

In R we can calculate the correlation coefficient by using the functions shown below.

# Correlation coefficient of variables temp and wind
# Default method is pearson

cor(x=airquality_data$temp,y=airquality_data$wind)

5.1.2.1 Quiz

Quiz
Show Answer

What does the correlation coefficient of temp and wind shows us:

There is no correlation between the two variables
There is a very strong negative correlation between the two variables
There is a moderate correlation between the two variables

There is a moderate correlation between the two variables

5.1.3 Scatterplot Matrix

We can also create a scatterplot matrix to visualise our data. Scatterplot matrices are tables containing multiple scatterplots, showing all pairwise relationships between variables. This is particularly useful for multiple regression where one of the assumptions is that each response variable shows a linear relationship with the explanatory variable. It is also important that pairs of explanatory variables are not strongly related or else it causes multicollinearity.

Collinearity occurs when explanatory variables are correlated. This correlation is a problem because explanatory variables should not be related to each other. If the degree of correlation between variables is high enough, it can cause problems when you fit the regression model and interpret the results. It is very important to check the collinearity in exploratory data analysis.

The simplest way to plot scatterplot matrices in Base R is to use pairs(), but we will be using ggpairs() from the package GGally because as well as showing the scatterplot of all variable combinations, but also the shape of the distribution and correlation coefficients as well.

# Creating a scatterplot matrix

scatterplot_matrix <- GGally::ggpairs(data=airquality_data)

# Displaying the results 

scatterplot_matrix

The graph shows the following things:

Very weak correlation between solar radiation and wind speed, since the data points are scattered randomly. The correlation coefficient between solar_r and wind is -0.055 which is very small confirming a very weak correlation between both variables.
Ozone concentration is weakly correlated with solar radiation since the correlation coefficient is 0.303. Also, the scatter plot suggests a positive linear relationship between ozone and solar_r.
The variables ozone and wind seem moderately correlated since the correlation coefficient is -0.531. Also, the data points are not randomly scattered and suggest a fairly strong linear relationship between ozone and wind.
Ozone and temperature are strongly correlated as well since the correlation coefficient is 0.609. As the temperature increases, the mean ozone in parts per billion increases.
Solar radiation and temperature are also weakly correlated since the correlation coefficient is 0.263 and the data points appear randomly scattered.
Wind and temperature are moderately correlated since the correlation coefficient is -0.458 and the data points do not have a linear relationship.
Since the variable month is a factor, the scatterplot is not very useful in interpreting the results, that is why we have boxplots and histogram which are easier to interpret. Each boxplot is representing the spread for each month e.g. the spread of ozone concentration in May is represented by the first boxplot in the first row and fifth column.
The median of solar radiation in May, June, August and September are similar, but the distributions are different since July, August and September are negatively skewed (left-skewed) because the medians are closer to the upper quartile and the whiskers are shorter on the upper end of the box. However, the distribution of solar radiation is positively skewed (right-skewed) in June because the median is closer to lower quartile.
There is a potential outlier for July. The distribution of solar radiation is May is fairly symmetric and interquartile range (that is, the box lengths), shows that the data in May is more dispersed compared to other months because the box length is longer than others. Also, the range (as shown by the distances between the ends of the two whiskers) shows that solar radiation in May is the largest suggesting wider distribution of solar radiation in May, that is, more scattered data. So it looks as if we can safely say that solar radiation is related to the variable month.
The distribution of wind speed in May, June and September is fairly symmetrical, however, it is positively skewed in July and August as the median is closer to lower quartile and the whisker is shorter on the lower end of the box. The interquartile ranges and overall range are reasonably similar (as shown by the lengths of the boxes and whiskers respectively). So, wind speed and months may be weakly correlated.
The distribution of temperature in July and September looks fairly symmetrical, but the distribution of temperature in June and August is positively skewed. You can see immediately that the median temperature is greater in July than the lower quartile of the other month (that is, over three-quarters of the temperature is higher in July than the median temperature in other months). So it looks as if temperature and months are related (which is expected).

# We will look at the relationship between month and ozone 
# by taking out the specific plot we are interested in 
# To look at a specific plot use getPlot function in GGally package

GGally::getPlot(pm= scatterplot_matrix, i=1, j=5)

The distribution of ozone concentration in parts per billion is positively skewed for all months except June. There are some potential outliers. The interquartile range and overall range for ozone concentration in June are very small suggesting the ozone concentration are less dispersed and less scattered in June compared to other months. This suggests that ozone concentration and months are related.

5.1.3.1 Exercise

Exercise
Show Answer

Create the scatterplot matrix of airquality_data using base R pairs() function.
Comment on the difference between this scatterplot matrix and the one created in the example using ggpairs() function.

Create the scatterplot matrix of airquality_data using base R pairs() function.

pairs(airquality_data)

Comment on the difference between this scatterplot matrix and the one created in the example using ggpairs() function.

It is much easier to see the relationship between all variables using the ggpairs() function. The interpretation is still the same, but ggpairs() shows the shape of distribution, correlation coefficients and boxplots as well.

5.2 Categorical variables

In this section, we will be using energy_databecause airquality_data doesn’t have many categorical variables. However, in energy_data we have 8 categorical variables: floor_area, property_type,tenure,property_age,country,bedrooms,adultsand region. So, far we have covered covariation and correlation of continuous data, now we will plot categorical variables together to indicate the covariance between them.

5.2.1 Plotting categorical variables

We can then visualize this covariation by using a tile plot, which will introduce a colour fill that can show us the mutual count between the two variables visually - the stronger the colour between two, the higher the count.

# Order your floor_area variable and turn it into a factor
# levels specifies the order 

energy_data$floor_area <- factor(energy_data$floor_area,
                          levels = c("50 or less", "51 to 100",
                                     "101 to 150", "151 to 200",
                                     "201 to 250","Over 250"))

# Create a tile plot

energy_data %>%
  count(property_type, floor_area) %>%
  ggplot2::ggplot() +
           aes(x = property_type,
               y = floor_area, fill = n) +
           geom_tile() +
           labs(x="Type of Property", 
                y = expression("Floor Area (m"^2*")")) +
           scale_fill_gradient(low="white", high="blue") +
           theme(axis.text.x = element_text(angle = 45, hjust = 1))

Finally, ggplot2 can also be used to create distributions of continuous variables which are split by the levels of categorical factors. First, we will create the density plot to see the gas consumption by England and Wales.

# Making a grouped distribution visualisation
# Use filter to filter out only observations for England and Wales
# Then use ggplot to create a density plot 
# Of gas consumption for each country

dplyr::filter(.data=energy_data, 
              country == "England"|country == "Wales") %>%
       ggplot2::ggplot() +
                aes(x=consumption, fill=country) +
                geom_density(alpha=0.4)

The distribution of gas consumption (kWh) is unimodal and relatively symmetric for both England and Wales. The mean gas consumption in England is higher than the mean gas consumption in Wales; this might be due to the difference in population within each country. The distribution of gas consumption in Wales has slightly more variation than the distribution of gas consumption in England.

5.2.2 Exercise

Exercise
Show Answer

Remove any missing values from the property_type variable in energy_data and assign it to a new variable.
Use the function geom_boxplot to see the spread of data of gas consumption by different type of properties.

Remove any missing values from the property_type variable in energy_data and assign it to a new variable.

# Take energy_data 
# use filter() to filter out missing values
# represented as NA
# select the variables you want to create 
# boxplot of 

filtered_property <-  energy_data %>%
                      filter(property_type != "NA") %>% 
                      select(property_type, consumption)

Use the function geom_boxplot to see the spread of data of gas consumption by different type of properties.

# Making a grouped boxplot

ggplot2::ggplot(filtered_property) +
                aes(x = property_type, y = consumption) + 
                geom_boxplot() +
                labs(x="Type of Property", y = "Gas Consumption (kWh)") +
                theme(axis.text.x = element_text(angle = 45, hjust = 1))

There are potential outliers in detached, end-terrace, mid-terraced and unknown property types. The median gas consumption of the different property is similar, but the gas consumption of some property types are more variable than others. Converted flat appears to have larger variability than other property types. The majority of the property types are fairly symmetric, however, detached and semi-detached are slightly skewed to the right (positively skewed).

Summary

By now you should be able to:

Inspect datasets
Find missing values and impute them with mean
Find descriptive statistics of one variable and plot them to visualise data
Standardise variables and filter out data
Find descriptive statistics of multiple variables
Plot two continuous variables and finding the relationship between them
Find the correlation between two variables
Plot and interpreting scatterplot matrices
Plot categorical variables to explore data

Next Chapter

In the next chapter we will look at statistical tests where we will be using the cleaned energy_data and airquality_data which are saved in the “Data” folder

Reuse

Open Government Licence 3.0 (View License)