# Libraries needed
library(tidyverse) # For data manipulation
library(visdat) # For missing values
library(naniar) # For missing values
library(moments) # For skewness
library(GGally) # For correlation plots
Chapter 2 - Exploratory Data Analysis
1 Learning Objectives
The goal of this session is to learn about:
Inspecting the dataset
Finding missing values
Univariate exploratory data analysis
Multivariate exploratory data analysis
We need to load the following packages for this chapter:
2 Inspecting the dataset
Exploratory data analysis is usually the first step in any data analysis project. First, we will explore the raw_energy_data
to familiarise ourselves with the structure of the data.
2.1 Exercise
Import the
fake_energy_data
and assign it to the variable nameraw_energy_data
.Find the following things about the
raw_energy_data
:First 6 observations
Last 6 observations
Structure of the dataset (using
str()
ordplyr::glimpse()
functions)
- Import the
fake_energy_data
and assign it to the variable nameraw_energy_data
.
# Importing fake energy data and assign to
# variable name raw_energy_data
<- readr::read_csv("../Data/fake_energy_data.csv") raw_energy_data
- Find the following things about the
raw_energy_data
:
- First 6 observations
# Calculating first 6 observations using head() function
head(x=raw_energy_data)
- Last 6 observations
# Calculating last 6 observations using tail() function
tail(raw_energy_data)
- Structure of the dataset (using
str()
ordplyr::glimpse()
functions)
# Looking at structure of dataset using glimpse() in dplyr
::glimpse(raw_energy_data) dplyr
# Alternative is to use str()
str(raw_energy_data)
3 Finding missing values
Arguably one of the first things we will need to establish in our data, before even looking at summary statistics, is whether it contains any missing values. Having missing values in data is a common occurrence. We can use the packages visdat
to visualise our missing values and base R functions to find the number of missing values in each column (variable).
# check if there are any NAs in our dataset
anyNA(x=raw_energy_data)
# Visualise the missing value
# x can be a dataframe
::vis_dat(x=raw_energy_data) visdat
3.1 Quiz
- From the plot above, which variable has missing values
- The Income variable has missing values.
3.2 Missing values by columns
We can also check the number of missing values in each column by running the code shown below:
# colSums() function gives the sum of missing value in each columns
# is.na() function which checks if the data has missing values
colSums(is.na(x=raw_energy_data))
Also, we can check if there are other values which might represent missing data in categorical variables.
# Select only categorical variables by using if_select() function in dplyr
<- raw_energy_data %>%
cat_raw_energy_data ::select_if(is.character)
dplyr
# See unique values by each column
sapply(cat_raw_energy_data, function(x) unique(x))
As you can see, there are a few Unknown
and unknown
in the categorical variables. These unknown/Unknown can be a missing value or one of the options as well for example the categorical variable Country
has three values England
, Wales
and Unknown
. This unknown can be treated as a missing value or it can be a possibility that the individual’s property is not in either England
or Wales
. Hence, you need to understand the context of your dataset before assuming them to be a missing value. In this course, we will treat these unknown/Unknown as a missing value. We can also convert our missing values to NA to find the sum of all missing values in raw_energy_dataset
as shown below:
# Convert "Unknowns" into NA
== "Unknown"] <- NA
raw_energy_data[raw_energy_data == "unknown"] <- NA
raw_energy_data[raw_energy_data
# This will show the sum of missing values in all columns
colSums(is.na(x=raw_energy_data))
If there are values in the data, NA or not, which represent missing or unknown values, we have to decide what to do with them. For now, we will not do anything with missing values in categorical variables. However, for the variable income
, we might choose to impute the NA values based on what we know about the rest of the data (for example the mean of income
variable can be used). We would do this rather than dropping the variable so that we retain as much information as possible for our future analyses.
# Specify your dataset and then use mutate() function
# Using if_else specify the new column name and conditions
# na.rm omits missing value while calculating mean
<- raw_energy_data %>%
raw_energy_data ::mutate(income =
dplyr::if_else(condition = is.na(income),
dplyrtrue = mean(income, na.rm = TRUE),
false = income))
# check if there are any NAs in income column after mean imputation
anyNA(x=raw_energy_data$income)
As you can see we do not have any missing values since we have imputed missing values with the mean of income. However, it is not the best practice since mean imputation can reduce the variance of the variable. Please refer to the GSS Editing and Imputation Course for details on how to best impute your missing data.
3.2.1 Quiz
In the above code block for imputing missing values, what does the second argument in the mean()
function do?
- It turns all the observations to NA
- It removes the observations which are NA when calculating the mean
- It returns that mean
- Nothing
It removes the observations which are NA when calculating the mean. The argument na.rm = TRUE means that NA observations aren’t included in the calculation, which means that a non-NA value can be returned.
4 Univariate exploratory data analysis
Univariate exploratory data analysis is the simplest form of exploratory data analysis where the data being analysed contains just one variable. Since it’s a single variable we don’t deal with causes or relationships between variables. The main purpose of the univariate analysis is to describe the data and find patterns that exist within it. First, we will look at continuous data and then categorical data.
4.1 Continuous variables
Continuous variables are numeric variables that have an infinite number of values between two values. An example of a continuous variable could be a weight variable which contains the weight of people as observations. In raw_energy_data
there are two numeric variables, the first one is the response variable consumption
and the other one is the explanatory variable income
. A response variable (or dependent variable) is that variable whose variation depends on other variables. The response variable is often related to the independent variable, sometimes denoted as the explanatory variable. In short, the response variable is the subject of change within an experiment, often as a result of differences in the explanatory variables.
First, we will look at the descriptive statistics of these two numeric variables and then plot them. When summarizing continuous variables, we are typically interested in questions like:
- What is the “centre” of the data? (Mean, median)
- How spread out is the data? (Standard deviation, variance)
- What are the extremes of the data? (Minimum, maximum and outliers)
- What is the “shape” of the distribution? Is it symmetric or asymmetric? Are the values mostly clustered about the mean, or are there any values in the “tails” of the distribution? (Skewness)
We can use the following function to examine the summary statistics of consumption
variable:
# Get the summary statistics of consumption variable
summary(raw_energy_data$consumption)
This minimum value of consumption
variable looks strange since gas consumption can only be positive. Since, the minimum value seems to be a potential outlier, removing or replacing the minimum value can affect the mean or median. We can also see that the mean gas consumption of households in England and Wales is 14505 kWh, while the median gas consumption is slightly greater at 14578 kWh, suggesting the data is slightly negatively skewed (the “tail” of the distribution points to the left). We can check the skewness by using the skewness()
function in moments
package.The mathematical equation of calculating skewness is shown below:
\[Skew= \frac{ \sum_{i=1}^{n}(x_{i}-\bar{x})^3}{\sum_{i=1}^{n}(x_{i}-\frac{\bar{x}^2}{n})^\frac{3}{2}}\]
If skewness is less than −1 or greater than +1, the distribution is highly skewed.
If skewness is between −1 and -0.5 or between +0.5 and +1, the distribution is moderately skewed.
If skewness is between −0.5 and +0.5, the distribution is approximately symmetric.
# Calculating skewness using moments package
::skewness(x=raw_energy_data$consumption, na.rm = TRUE) moments
The value is between -0.5 and 0.5, suggesting the data are fairly symmetrical.
To find the summary statistics of income
variable we will use the same function.
# Get the summary statistics of income variable
summary(raw_energy_data$income)
This maximum value of the income
variable looks very strange since earning exactly £1,000,000 per year seems odd especially considering the interquartile range (The interquartile range, denoted as IQR is the distance between the first quartile Q1 and the third quartile Q3. So 50% of the data are within this range, in our case IQR = Q1-Q2 = 13009). The minimum and maximum values are (Q1 - 1.5 x IQ) and (Q3 + 1.5 x IQ) so it seems like the maximum value is a potential outlier. Also, the mean income of households in England and Wales is £34272 per year. The mean income per year is greater than the median income per year suggesting a positively skewed distribution of the income
variable (the “tail” of the distribution points to the right).
# Skewness using moments package
::skewness(x=raw_energy_data$income, na.rm = TRUE) moments
The skewness value supports the suggestion that the income
is positively skewed.
Generally, the summary statistics can indicate if there are strange or unexpected values in our data other than NA. For example, the strange values we just saw in for both consumption
and income
. We’ll look at these in detail in the next section. An important step in understanding our data and the subsequent statistics properly is actually looking at all of the data. We can do this with functions in base R or ggplot2
packages in R. Base R plots are useful to get an immediate sense of the data and identify potential distributions and outliers. However, ggplot graphs are more customisable than base R plots and for that reason, they are probably better to use as ‘final outputs’ for any report.
4.1.1 Plotting continuous variables
Let’s examine our variable income
in raw_energy_data
. What does this variable look like when visualised? First, we’ll use a histogram, and then a density plot to visualise the data. The most basic graph is the histogram, where each bar represents the frequency (count) or proportion of cases for a range of values. The range of the data for each bar is called a bin.
# Select thedata
# Create a histogram
# Specify the variable to display in aes(), aesthetic attributes are mapped to variables
# Specify the labels
# Specify the title
# Centre the title
::ggplot(raw_energy_data) +
ggplot2aes(x=income) +
geom_histogram(bins=15) +
labs(x="Income (£)", y ="Count") +
ggtitle("Histogram of Income") +
theme(plot.title = element_text(hjust = 0.5))
A density plot visualises the distribution of data over a continuous interval. The peaks of a density plot help display where values are concentrated over the interval. An advantage density plots have over Histograms is that they’re better at determining the distribution shape because they’re not affected by the number of bins used. The density plot of income
variable is shown below:
# Create a density plot - we just use geom_density instead of geom_histogram
::ggplot(raw_energy_data) +
ggplot2aes(x = income) +
geom_density() +
labs(x="Income (£)", y ="Density") +
ggtitle("Density Plot of Income") +
theme(plot.title = element_text(hjust = 0.5))
We can see that some very strange plots are being produced. This is yet another way to see if there’s something strange going on in our data, and from our summary statistics and the plots, we can confirm that there is at least one value in the income
variable which is abnormally high (the maximum income £1,000,000 per year). Hence, we’ll replace the maximum value of income with the mean income of a household.
# Specify your dataset and then use mutate function
# Using if_else specify the new column name and conditions
# Replace the maximum value of income with mean of income
<- raw_energy_data %>%
raw_energy_data ::mutate(income =
dplyr::if_else(condition = income==max(income),
dplyrtrue = mean(income),
false = income))
Now we will again look at the summary statistics and the histogram to see if the data looks like what we were expecting it to be.
# Find the summary statistics of the income variable
summary(raw_energy_data$income)
# Plot the histogram again
<- ggplot2::ggplot(raw_energy_data) +
income_hist aes(x=income) +
geom_histogram(color="black", fill="white",bins=25) +
ggtitle("Histogram of Income") +
labs(x="Income", y ="Count") +
theme(plot.title = element_text(hjust = 0.5))
# Add a mean line in the histogram
+ geom_vline(aes(xintercept=mean(income)),
income_hist color="blue", linetype="dashed", size=1)
As we can see, the maximum value is now £122,394 which is much lower than the previous maximum value (£1,000,000) suggesting the previous maximum value is indeed an outlier. Also, the mean is now less than the median suggesting the income
is slightly negatively skewed. Lastly, the histogram looks much better, although the new maximum value (£122,394) still seems to be a potential outlier.
4.1.1.1 Exercise
Find the strange value in the
consumption
variable and impute it appropriately?Check the summary statistics of the
consumption
variable after imputing the data?Create a histogram of the
consumption
variable after imputation
- Find the strange value in the
consumption
variable and impute it appropriately?
The summary statistics show the minimum value (-434 kWh) is negative which looks strange since gas consumption in kWh should only be positive.
# The summary statistics of consumption variable
summary(raw_energy_data$consumption)
We will impute the minimum value with the mean of consumption as shown below:
# Specify your dataset and then use the mutate function
# Using if_else specify the new column name and conditions
# Replace the minimum value of consumption with mean of consumption
<- raw_energy_data %>%
raw_energy_data ::mutate(consumption=
dplyr::if_else(condition = consumption==min(consumption),
dplyrtrue = mean(consumption),
false = consumption))
- Check the summary statistics of the
consumption
variable after imputing the data?
# The minimum value is now positive
summary(raw_energy_data$consumption)
The minimum value is now 41.51 which does not look too strange.
- Create a histogram of the
consumption
variable after imputation
# Create a histogram
::ggplot(raw_energy_data) +
ggplot2aes(x=consumption) +
geom_histogram() +
labs(x="Consumption", y ="Count") +
ggtitle("Histogram of Consumption") +
theme(plot.title = element_text(hjust = 0.5))
The histogram looks fairly symmetric.
4.1.2 Scaling continuous variables
We can also look at the spread of our data by using boxplots. Boxplots are a very useful univariate graphical technique. They are very good at presenting information about the central tendency, symmetry and skew, as well as outliers. First, we will create a boxplot of income
variable in the raw_energy_data
dataset.
# Create a boxplot
# Add whiskers(whiskers are the two lines outside the box
# that extend to the highest and lowest observations)
::ggplot(raw_energy_data) +
ggplot2aes(x = income) +
geom_boxplot() +
labs(x="Income") +
ggtitle("Boxplot of Income") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_boxplot(geom = 'errorbar',width = 0.2)
# For older version R you will get error
# so use cord_flip() to flip your coordinates
<-ggplot2::ggplot(raw_energy_data) +
alternative aes(y=income) +
coord_flip() +
geom_boxplot() +
labs(x="Income") +
ggtitle("Boxplot of Income") +
theme(plot.title = element_text(hjust =0.5)) +
stat_boxplot(geom = 'errorbar',width = 0.2)
Our box plotting for income
shows there are 13 observation beyond the spread of the rest of the data (whiskers). This can be potential outliers. Outliers are unusual values in a dataset which can increase the variability.
Another way to check for outliers is by standardising the variables of interest. Standardising in the process of transforming a variable to have a mean of 0 and a standard deviation of 1. It is done by standardising each value in the variable as shown below:
\[z=\frac{\bar{X}-\mu}{\sigma}\] Where:
- \(z\) is the transformed observation
- \(\bar{X}\) is the observation (observation in the income variable)
- \(\mu\) is the mean of the variable (income mean)
- \(\sigma\) is the standard deviation (standard deviation of income variable)
We standardise the variable and then consider any standard value (z-score) above 3 or below -3 to be an outlier. This can be achieved with the scale()
function in R. Use the code below to check for outliers using this approach.
# Scale the value variable to give standard value
<- scale(raw_energy_data$income)
standard_value
# Check for values greater than 3
> 3]
standard_value[standard_value
# check for scores smaller than -3
< -3] standard_value[standard_value
Finally, given that we have used multiple approaches to visualise and assess our data, we can choose to remove outliers by using the dplyr::filter()
function as shown below:
# Create a new column using mutate()
# standardised income variable as shown above
# Keep all observation with a standard value less than 3 and greater than -3 using filter()
<- raw_energy_data %>%
energy_data mutate(scaled_income = scale(income)) %>%
filter(scaled_income < 3) %>%
filter(scaled_income > -3)
Our dataset is reduced to 1126 observation. You can check the difference between scaled and unscaled income
variable in energy_data
below:
# Our dataset is reduced to 1126 observations
nrow(energy_data)
# See the difference between scaled and unscaled data
# Using sample() function
# Sample of 10 observation of unscaled income
sample(energy_data$income, 10)
# Sample of 10 observation of scaled income
sample(energy_data$scaled_income, 10)
4.1.2.1 Exercise
Create a boxplot of the
consumption
variable inenergy_data
and check if there any potential outliers?Standardise the
consumption
variable inenergy_data
and find the outliers (Hint: Standardised values greater than 3 and less than -3 are considered outliers)
- Create a boxplot of the
consumption
variable inenergy_data
and check if there any potential outliers?
# Create a boxplot
# Add whiskers
::ggplot(energy_data) +
ggplot2aes(x = consumption) +
geom_boxplot() +
labs(x="Consumption") +
ggtitle("Boxplot of Consumption") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_boxplot(geom = 'errorbar',width = 0.2)
# For older version R you will get error
# so use cord_flip() to flip your coordinates
<- ggplot2::ggplot(raw_energy_data) +
alternative aes(y=consumption) +
coord_flip() +
geom_boxplot() +
labs(x="Consumption") +
ggtitle("Boxplot of Consumption") +
theme(plot.title = element_text(hjust =0.5)) +
stat_boxplot(geom = 'errorbar',width = 0.2)
As you can see there are few potential outliers, we will standardise the consumption variable and confirm it.
- Standardise the
consumption
variable inenergy_data
and find the outliers (Hint: Standardised values greater than 3 and less than -3 are considered outliers)
# Scale the value variable to give standard value
<- scale(raw_energy_data$consumption)
standard_value
# Check for values greater than 3
> 3]
standard_value[standard_value
# check for scores smaller than -3
< -3] standard_value[standard_value
We will not filter our data because it is not considered as good practice to remove values from your response (dependent) variable in exploratory data analysis. This concludes the cleaning that we need to do with our data. From now on, we can refer to the cleaned data set simply as energy_data
.
4.2 Categorical variables
Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order for example gender and material type. In energy_data
there are several categorical variables such as property_type
, tenure
and country
. The property_type
variable contains the type of property such as semi-detached, mid-terraced, etc. as observations. We can explore the count of the unique number of levels of this variable by running the code shown below:
# Counting six values with dplyr
# This operation gets us the count for each level in descending order
%>% count(property_type) %>% arrange(desc(n)) energy_data
4.2.1 Exercise
- Create a frequency table of the
tenure
variable in the cleanedenergy_data
.
- Create a frequency table of the
tenure
variable in the cleanedenergy_data
.
# Frequency of property type by
# Aggregating the data using group_by()
%>% group_by(tenure) %>% count() energy_data
4.2.2 Plotting categorical variables
We can also plot variables that have discrete values and different levels. Let’s examine the variable property_type
from energy_data
using a bar chart as shown below:
# Creating a bar chart
# With angled axis labels
::ggplot(data = energy_data) +
ggplot2aes(x = property_type) +
geom_bar() +
labs(x="Type of Property", y ="Count") +
ggtitle("Frequency of Different Property Type") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
This bar chart shows that the most frequent property type is the end terrace house, and that semi-detached houses are the least common. There are 50 NA which means that there are 50 unknown property types in the energy dataset (we converted unknowns to NA in Finding missing values
section), it could be because either their type of property didn’t fit into any of the categories or they might be an actual missing value.
4.2.3 Exercise
- Create a bar chart of the
property_age
variable inenergy_data
.
- Create a bar chart of the
property_age
variable inenergy_data
.
# Creating a bar chart
# With angled axis labels
::ggplot(data = energy_data) +
ggplot2aes(x = property_age) +
geom_bar() +
labs(x="Age of Property", y ="Count") +
ggtitle("Frequency of Age of Property") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
5 Multivariate exploratory data analysis
Another essential lens to examine data through is what patterns emerge between variables. To find the relationship between variables, we will do some multivariate exploratory data analysis. First, we will look at the continuous data and then the categorical data. However, since we have only two continuous variables in energy_data
we will be using airquality_data
to see relationships between different continuous variables, while we will be using energy_data
for exploratory analysis of categorical data.
5.1 Continuous variables
Since we are using airquality_data
data in this section, we can explore the structure of the data again.
# Clean column names
<- janitor::clean_names(airquality)
airquality_data
# Structure of airquality_data
::glimpse(airquality_data) dplyr
We will assume ozone
to be response variable and everything else to be the explanatory variables. month
is an explanatory variable which has unique values. First, we will look at the unique values in the month
variable by running the code below:
# Unique values in the variable month in airquality_data
unique(airquality_data$month)
The variable month
is a categorical variable since it has categories. In R factors are used to represent categorical data. The set of values that the elements of a factor can take are called its levels (so in the month
variable the levels are 5,6,7,8,9). Factors can be ordered or unordered and are an important class for statistical analysis and for plotting; whenever you use the variable month
you should make it a factor.
# Using the categorical data as factor using factor() when plotting graphs
$month <- factor(x=airquality_data$month) airquality_data
We will also convert the variables solar_r
, ozone
and temp
from integers to numeric because it is easier to create a scatterplot matrix (in section Scatterplot Matrix
)
# Convert all variables from integers to numeric
<- airquality_data %>%
airquality_data mutate(solar_r=as.numeric(solar_r)) %>%
mutate(ozone=as.numeric(ozone)) %>%
mutate(temp=as.numeric(temp))
Now, we will look at the structure of the data again and then look at the summary statistics of air quality data by running the following code below:
# The variables ozone, temp and solar_r are numeric now
::glimpse(airquality_data)
dplyr
# Find the summary statistics of airquality_data
summary(airquality_data)
As you can see the mean is slightly greater than the median in variables wind
and ozone
suggesting a slightly positively skewed data. We can check the skewness by using the code below:
# Finding skewness of ozone using moments package
::skewness(x=airquality_data$ozone, na.rm = TRUE) moments
The data is skewed heavily towards the right. We can apply transformations to make it symmetric for example we can take the square root of the ozone
variable. We will look at transformations in Chapter 5.
We can also see in summary statistics that there are 37 missing values in the ozone
variable and 7 missing values in solar_r
variable. We can visualise the missing values in variables ozone
and solar_r
by each month from May till September.
# Creating scatterplot
# Specify x and y axes
# geom_miss_point() has shifted the missing values to now be 10% below the minimum value
# facet_wrap() is used when your variable has several levels
# legend.position specifies the position of the legend
::ggplot(data = airquality_data) +
ggplot2aes(x = ozone, y = solar_r) +
geom_miss_point() +
facet_wrap(~month, ncol = 2) +
theme(legend.position = "bottom")
We can see how missing values are affecting the spread of the data, you can impute it using mean of ozone
and solar_r
.
# Impute missing values in solar_r using mean imputation
<- airquality_data %>%
airquality_data ::mutate(solar_r=
dplyr::if_else(condition=is.na(solar_r),
dplyrtrue = mean(solar_r, na.rm = TRUE),
false=solar_r))
# Impute missing values in ozone using mean imputation
<- airquality_data %>%
airquality_data ::mutate(ozone=
dplyr::if_else(condition=is.na(ozone),
dplyrtrue = mean(ozone, na.rm = TRUE),
false=ozone))
However mean imputation can cause variability in the data and it is not the best practice. Refer to Editing and Imputation to gain a better insight of possible imputations.
5.1.1 Plotting two continuous variables
As part of our analyses we probably also want to understand the relationship between possibly related variables. We can do this by plotting a graph that has one variable on the x-axis and the other on the y-axis. A typical graph to choose is a scatter plot for two continuous variables. First, we will examine the relationship between two explanatory variables solar_r
and wind
in the airquality_data
by creating a scatterplot as shown below:
# Create a scatter plot
# Specify the x and y axes
# Make the colour conditional on a variable month
# Remember to make month variable as a factor variable
# Specify the default geom_point()
# Specify the titles, x and y labels
::ggplot(data = airquality_data) +
ggplot2aes(x =solar_r, y = wind, colour = month) +
geom_point() +
labs(x="Solar Radiations (Langleys)",
y ="Wind (miles per hours)" ) +
ggtitle("Wind Speed and Solar Radiations from May till September") +
theme(plot.title = element_text(hjust = 0.5))
As you can see the data points are randomly scattered and do not form a pattern. This suggests that the solar radiation and the wind speed don’t have a linear relationship. Hence, there is a very weak correlation between the variables wind
and solar_r
.
5.1.1.1 Exercise
Create a scatterplot of the variables
wind
andtemp
.Comment on the scatterplot.
- Create a scatterplot of variables
wind
andtemp
.
# Create a scatter plot
# Specify the x and y axes
# Make the colour conditional on a variable month
# Specify the default geom_point()
# Specify the titles, x and y labels
::ggplot(data = airquality_data) +
ggplot2aes(x =temp, y = wind, colour = month) +
geom_point() +
labs(x="Temperature (°F)",
y ="Wind (miles per hours)" ) +
ggtitle("Wind Speed and Temperature from May till September") +
theme(plot.title = element_text(hjust = 0.5))
- Comment on the scatterplot.
The data points are not scattered randomly, it is showing a slight negative linear relationship between the speed of wind (in miles per hours) and temperature (in degrees Fahrenheit). This suggests that there is a moderate correlation between the variables temp
and wind
. Also, the data is clustered by months as well e.g. we can see red points (representing temperature and wind speed in May) is clustered around the range from 55 degrees Fahrenheit to 70 Fahrenheit which makes sense as well since it is less windy in summers and temperature is usually higher in summers than winters.
5.1.2 Covariance and Correlation
We can quantify and explore such a relationship further by retrieving the statistical values for the covariance and correlation between the two variables.
The covariance of two variables X and Y in a data set (denoted by \(Cov(X,Y)\)) measures how the two are linearly related. A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite - as one variable increases, the other decreases.
The correlation coefficient of two variables X and Y (denoted as \(Corr(X,Y)\) or \(r\)) in a data set equals to their covariance (\(Cov(X,Y)\)) divided by the product of their standard deviations (\(\sigma_x\)\(\sigma_y\)).
\[ r=Corr(X,Y)=\frac{Cov(X,Y)}{\sigma_x\sigma_y}\]
The values for correlation range between -1 and 1. Here is the rule of thumbs for correlation coefficient interpretation:
In R we can calculate the correlation coefficient by using the functions shown below.
# Correlation coefficient of variables temp and wind
# Default method is pearson
cor(x=airquality_data$temp,y=airquality_data$wind)
5.1.2.1 Quiz
What does the correlation coefficient of temp
and wind
shows us:
- There is no correlation between the two variables
- There is a very strong negative correlation between the two variables
- There is a moderate correlation between the two variables
There is a moderate correlation between the two variables
5.1.3 Scatterplot Matrix
We can also create a scatterplot matrix to visualise our data. Scatterplot matrices are tables containing multiple scatterplots, showing all pairwise relationships between variables. This is particularly useful for multiple regression where one of the assumptions is that each response variable shows a linear relationship with the explanatory variable. It is also important that pairs of explanatory variables are not strongly related or else it causes multicollinearity.
Collinearity occurs when explanatory variables are correlated. This correlation is a problem because explanatory variables should not be related to each other. If the degree of correlation between variables is high enough, it can cause problems when you fit the regression model and interpret the results. It is very important to check the collinearity in exploratory data analysis.
The simplest way to plot scatterplot matrices in Base R is to use pairs()
, but we will be using ggpairs()
from the package GGally because as well as showing the scatterplot of all variable combinations, but also the shape of the distribution and correlation coefficients as well.
# Creating a scatterplot matrix
<- GGally::ggpairs(data=airquality_data)
scatterplot_matrix
# Displaying the results
scatterplot_matrix
The graph shows the following things:
Very weak correlation between solar radiation and wind speed, since the data points are scattered randomly. The correlation coefficient between
solar_r
andwind
is -0.055 which is very small confirming a very weak correlation between both variables.Ozone concentration is weakly correlated with solar radiation since the correlation coefficient is 0.303. Also, the scatter plot suggests a positive linear relationship between
ozone
andsolar_r
.The variables
ozone
andwind
seem moderately correlated since the correlation coefficient is -0.531. Also, the data points are not randomly scattered and suggest a fairly strong linear relationship betweenozone
andwind
.Ozone
andtemperature
are strongly correlated as well since the correlation coefficient is 0.609. As the temperature increases, the mean ozone in parts per billion increases.Solar radiation and temperature are also weakly correlated since the correlation coefficient is 0.263 and the data points appear randomly scattered.
Wind and temperature are moderately correlated since the correlation coefficient is -0.458 and the data points do not have a linear relationship.
Since the variable month is a factor, the scatterplot is not very useful in interpreting the results, that is why we have boxplots and histogram which are easier to interpret. Each boxplot is representing the spread for each month e.g. the spread of ozone concentration in May is represented by the first boxplot in the first row and fifth column.
The median of solar radiation in May, June, August and September are similar, but the distributions are different since July, August and September are negatively skewed (left-skewed) because the medians are closer to the upper quartile and the whiskers are shorter on the upper end of the box. However, the distribution of solar radiation is positively skewed (right-skewed) in June because the median is closer to lower quartile.
There is a potential outlier for July. The distribution of solar radiation is May is fairly symmetric and interquartile range (that is, the box lengths), shows that the data in May is more dispersed compared to other months because the box length is longer than others. Also, the range (as shown by the distances between the ends of the two whiskers) shows that solar radiation in May is the largest suggesting wider distribution of solar radiation in May, that is, more scattered data. So it looks as if we can safely say that solar radiation is related to the variable
month
.The distribution of wind speed in May, June and September is fairly symmetrical, however, it is positively skewed in July and August as the median is closer to lower quartile and the whisker is shorter on the lower end of the box. The interquartile ranges and overall range are reasonably similar (as shown by the lengths of the boxes and whiskers respectively). So, wind speed and months may be weakly correlated.
The distribution of temperature in July and September looks fairly symmetrical, but the distribution of temperature in June and August is positively skewed. You can see immediately that the median temperature is greater in July than the lower quartile of the other month (that is, over three-quarters of the temperature is higher in July than the median temperature in other months). So it looks as if temperature and months are related (which is expected).
# We will look at the relationship between month and ozone
# by taking out the specific plot we are interested in
# To look at a specific plot use getPlot function in GGally package
::getPlot(pm= scatterplot_matrix, i=1, j=5) GGally
- The distribution of ozone concentration in parts per billion is positively skewed for all months except June. There are some potential outliers. The interquartile range and overall range for ozone concentration in June are very small suggesting the ozone concentration are less dispersed and less scattered in June compared to other months. This suggests that ozone concentration and months are related.
5.1.3.1 Exercise
Create the scatterplot matrix of
airquality_data
using base Rpairs()
function.Comment on the difference between this scatterplot matrix and the one created in the example using
ggpairs()
function.
- Create the scatterplot matrix of
airquality_data
using base Rpairs()
function.
pairs(airquality_data)
- Comment on the difference between this scatterplot matrix and the one created in the example using
ggpairs()
function.
It is much easier to see the relationship between all variables using the ggpairs()
function. The interpretation is still the same, but ggpairs()
shows the shape of distribution, correlation coefficients and boxplots as well.
5.2 Categorical variables
In this section, we will be using energy_data
because airquality_data
doesn’t have many categorical variables. However, in energy_data
we have 8 categorical variables: floor_area
, property_type
,tenure
,property_age
,country
,bedrooms
,adults
and region
. So, far we have covered covariation and correlation of continuous data, now we will plot categorical variables together to indicate the covariance between them.
5.2.1 Plotting categorical variables
We can then visualize this covariation by using a tile plot, which will introduce a colour fill that can show us the mutual count between the two variables visually - the stronger the colour between two, the higher the count.
# Order your floor_area variable and turn it into a factor
# levels specifies the order
$floor_area <- factor(energy_data$floor_area,
energy_datalevels = c("50 or less", "51 to 100",
"101 to 150", "151 to 200",
"201 to 250","Over 250"))
# Create a tile plot
%>%
energy_data count(property_type, floor_area) %>%
::ggplot() +
ggplot2aes(x = property_type,
y = floor_area, fill = n) +
geom_tile() +
labs(x="Type of Property",
y = expression("Floor Area (m"^2*")")) +
scale_fill_gradient(low="white", high="blue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Finally, ggplot2
can also be used to create distributions of continuous variables which are split by the levels of categorical factors. First, we will create the density plot to see the gas consumption by England and Wales.
# Making a grouped distribution visualisation
# Use filter to filter out only observations for England and Wales
# Then use ggplot to create a density plot
# Of gas consumption for each country
::filter(.data=energy_data,
dplyr== "England"|country == "Wales") %>%
country ::ggplot() +
ggplot2aes(x=consumption, fill=country) +
geom_density(alpha=0.4)
The distribution of gas consumption (kWh) is unimodal and relatively symmetric for both England and Wales. The mean gas consumption in England is higher than the mean gas consumption in Wales; this might be due to the difference in population within each country. The distribution of gas consumption in Wales has slightly more variation than the distribution of gas consumption in England.
5.2.2 Exercise
Remove any missing values from the
property_type
variable inenergy_data
and assign it to a new variable.Use the function
geom_boxplot
to see the spread of data of gas consumption by different type of properties.
- Remove any missing values from the
property_type
variable inenergy_data
and assign it to a new variable.
# Take energy_data
# use filter() to filter out missing values
# represented as NA
# select the variables you want to create
# boxplot of
<- energy_data %>%
filtered_property filter(property_type != "NA") %>%
select(property_type, consumption)
- Use the function
geom_boxplot
to see the spread of data of gas consumption by different type of properties.
# Making a grouped boxplot
::ggplot(filtered_property) +
ggplot2aes(x = property_type, y = consumption) +
geom_boxplot() +
labs(x="Type of Property", y = "Gas Consumption (kWh)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
There are potential outliers in detached
, end-terrace
, mid-terraced
and unknown
property types. The median gas consumption of the different property is similar, but the gas consumption of some property types are more variable than others. Converted flat appears to have larger variability than other property types. The majority of the property types are fairly symmetric, however, detached
and semi-detached
are slightly skewed to the right (positively skewed).
Summary
By now you should be able to:
Inspect datasets
Find missing values and impute them with mean
Find descriptive statistics of one variable and plot them to visualise data
Standardise variables and filter out data
Find descriptive statistics of multiple variables
Plot two continuous variables and finding the relationship between them
Find the correlation between two variables
Plot and interpreting scatterplot matrices
Plot categorical variables to explore data
Next Chapter
In the next chapter we will look at statistical tests where we will be using the cleaned energy_data and airquality_data which are saved in the “Data” folder