Chapter 5 - Model Adequacy and Model Selection

Author

Government Analysis Function and ONS Data Science Campus

1 Learning Objectives

The goal of this session is to learn about:

  • Fitted values and Residuals

  • Regression Assumptions

  • Residual Diagnostics Plots

  • Model Transformation

  • Model comparisons

  • Multicollinearity

  • Model Selection

Load in the following packages and datasets:

# Libraries needed

library(tidyverse) # For data manipulation 
library(car)       # For finding multicollinearity in a regression model 
library(leaps)     # For model selection 

# Importing datasets 

energy_data <- readr::read_csv("../Data/energy_data.csv")

airquality_data <- readr::read_csv("../Data/airquality_data.csv")

2 Fitted values and Residuals

Before describing regression assumptions and regression diagnostics, two key concepts in regression analysis need to be explained:

  1. Fitted values

  2. Residuals

The fitted (or predicted) values are the y-values that you would expect for the given x-values according to the fitted regression model (visually, the best-fitting straight regression line). The fitted value is denoted \(\hat{y}\). In our simple linear regression example, the fitted (predicted) ozone concentration is \[\widehat{ozone_{i}}= 85.2-4.32 \times wind_{i}\]

The estimated random error is called the residual, and it is the difference between the observed value \(y_{i}\) and the fitted (predicted) value \(\hat{y_{i}}\). The residual is denoted as:

\[\hat{e_{i}}=y_{i}-\hat{y_{i}}\]

Below is a visual illustration of the relationship between the fitted values of ozone concentration and the residuals:

The red lines represent the residuals: the difference between each observed ozone value and its corresponding fitted value. Whenever we fit a regression model, we need to check whether the regression assumptions are fulfilled by examining the distribution of the residuals.
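As a minimal sketch of how these two quantities are obtained in R (using the ozone and wind columns of airquality_data from this chapter), fitted() and residuals() extract them directly from a fitted lm object:

# Fit the simple linear regression used in the example above
simple_model <- lm(formula = ozone ~ wind, data = airquality_data)

# Fitted (predicted) values for each observation
head(fitted(simple_model))

# Residuals: observed ozone values minus fitted values
head(residuals(simple_model))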

3 Regression Assumptions

There are four assumptions associated with a linear regression model:

  1. Linearity - The relationship between the explanatory variable (\(X\)) and the response variable (\(Y\)) must be linear.

  2. Normality - The residuals must be approximately normally distributed.

  3. Homoscedasticity - The residuals should have a constant variance.

  4. Independence - There should be no relationship between the residuals and the response variable (\(Y\)).

You should check whether or not these assumptions hold true. However, potential problems may arise, such as the presence of influential values in the data, which can be:

  • Outliers: extreme values in the response variable (\(Y\)).

  • High-leverage points: extreme values in the explanatory variable (\(X\)).

All these assumptions and potential problems can be checked by producing some residual diagnostic plots (visualising the residuals).

3.1 Residual Diagnostics Plots

In R, regression diagnostics plots (residual diagnostics plots) can be created using the base R function plot() or the autoplot() function in the ggfortify package, which creates graphics based on ggplot2. We will be using the base R plot() function.

# Change the panel layout to 2 x 2
# Look at all 4 plots at once 

par(mfrow = c(2, 2))

# Use plot() function to create diagnostic plots

simple_model <- lm(formula=ozone ~ wind, data=airquality_data)

plot(simple_model)

The diagnostic plots show residuals in four different ways:

  • Residuals vs Fitted: is used to check the linearity assumption. If the residuals are spread equally around a horizontal line without distinct patterns (the red line is approximately horizontal at zero), that is a good indication of a linear relationship.

  • Normal Q-Q: is used to check the normality of residuals assumption. If the majority of the residuals follow the straight dashed line, then the assumption is fulfilled.

  • Scale-Location: is used to check the homoscedasticity of residuals (equal variance of residuals). If the residuals are spread randomly and you see a horizontal line with equally (randomly) spread points, then the assumption is fulfilled.

  • Residuals vs Leverage: is used to identify any influential value in our dataset. Influential values are extreme values that might influence the regression results when included or excluded from the analysis. Look for cases outside of a dashed line.

3.1.1 Quiz

What does the homoscedasticity of residuals stand for?

  1. Different variance of residuals
  2. Same variance of residuals
  3. Normal distribution of residuals

What is the Q-Q plot used to check?

  1. Normality of residuals
  2. Constant variance of residuals
  3. Linearity of residuals

Answers

What does the homoscedasticity of residuals stand for?

  2. Same variance of residuals

What is the Q-Q plot used to check?

  1. Normality of residuals

3.1.2 Linearity of the data

We check the linearity assumption by looking at the “Residuals vs Fitted” plot:

# Create the first diagnostic plot 
plot(simple_model,1)

Ideally, the residuals vs fitted plot should not show any pattern: the residuals should be scattered randomly and the red line should be approximately horizontal at zero. However, our plot shows that the residuals are not randomly scattered and form a parabola, suggesting a non-linear relationship between the explanatory variable (wind) and the response variable (ozone). A simple approach is to use non-linear transformations of the explanatory variable, such as \(log(X)\), \(sqrt(X)\) and \(X^2\), in the regression model.
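As a sketch of one of these options, we could refit the simple model with a log-transformed explanatory variable; whether it actually helps should be judged from the refreshed diagnostic plot:

# Refit the simple model with a log-transformed explanatory variable
# (wind speeds are strictly positive, so log(wind) is defined)
log_x_model <- lm(formula = ozone ~ log(wind), data = airquality_data)

# Re-check the linearity assumption on the transformed fit
plot(log_x_model, 1)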

3.1.3 Normality of residuals

The “Normal Q-Q” plot is used to check the normality assumption.

# Create the second diagnostic plot 
plot(simple_model,2)

The majority of the points fall approximately along the reference line, so we can assume normality. The points at either end deviate from the straight line, suggesting a heavy-tailed distribution (the tails are fatter than those of a normal distribution, so there might be outliers).

3.1.4 Homoscedasticity

The “Scale-Location” plot can be used to check the homoscedasticity of residuals (equal variance of residuals).

# Create the third diagnostic plot 
plot(simple_model,3)

The red line is not approximately horizontal and the residuals form a funnel shape, with the larger end towards larger fitted values. This suggests heteroscedasticity (the residuals have non-constant variance). We can use a logarithmic or square root transformation of the response variable to reduce heteroscedasticity.
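As a sketch of the square root option (the logarithmic transformation is explored in detail in the Model Transformation section below):

# Refit the simple model with a square-root-transformed response
# (ozone concentrations are non-negative, so sqrt(ozone) is defined)
sqrt_model <- lm(formula = sqrt(ozone) ~ wind, data = airquality_data)

# Re-check the Scale-Location plot on the transformed fit
plot(sqrt_model, 3)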

3.1.5 Influential values

The “Residuals vs Leverage” plot is used to identify three things:

  • Outliers
  • Leverage points
  • Influential points

Outliers: Observations whose standardised residuals are greater than 3 in absolute value are considered as potential outliers (as covered in Chapter 2). Not all outliers are influential in linear regression analysis.

Leverage points: A measure of how extreme an observation is on the explanatory variable. As a rule of thumb, when leverage is greater than \(\frac{2(p+1)}{n}\) , we say the observation has high leverage; where \(p\) is the number of explanatory variables and \(n\) is the total number of observations. A high leverage observation may or may not actually be influential.

Influential point: A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses or the estimated slope coefficients. Outliers and high leverage data points have the potential to be influential, but we generally have to investigate further to determine whether or not they are actually influential. Cook’s distance is used to measure the influence of a data point. The dashed red line represents the Cook’s distance; when cases are outside the Cook’s distance, the cases are influential to the regression results and excluding those cases can alter our regression results.
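These three quantities can also be computed directly in base R before looking at the plot. Here is a minimal sketch using the rules of thumb above; note that the Cook's distance cut-off shown is one common convention and is not the dashed line drawn on the plot:

# Standardised residuals, leverage and Cook's distance for the simple model
std_resid <- rstandard(simple_model)
leverage  <- hatvalues(simple_model)
cooks_d   <- cooks.distance(simple_model)

# Rule-of-thumb thresholds from the definitions above
n <- nrow(model.frame(simple_model))  # observations used in the fit
p <- length(coef(simple_model)) - 1   # number of explanatory variables

which(abs(std_resid) > 3)             # potential outliers
which(leverage > 2 * (p + 1) / n)     # high-leverage points
which(cooks_d > 4 / n)                # one common Cook's distance cut-off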

# Create the 5th diagnostic plot 
plot(simple_model,5)

  • The plot above shows that there is one extreme point (117) with a standardised residual above 3, suggesting that observation 117 is a potential outlier.

  • There are a few high-leverage points, since some points have leverage greater than \(\frac{2(p+1)}{n}=\frac{4}{153}=0.026\).

  • There is no observation outside the red dashed line, suggesting there is no influential point.

# Create the 4th diagnostic plot
# which gives Cook's distance

plot(simple_model,4)

By default, the top 3 most extreme values are labelled on the Cook’s distance plot.

3.1.6 Exercise

  1. Fit a multiple linear regression on airquality_data where the response variable is ozone and all explanatory variables are added.

  2. Check the regression assumptions of the model created above.

  1. Fit a multiple linear regression on airquality_data where the response variable is ozone and all explanatory variables are added.
# Fit a multiple linear regression
# the response variable is ozone 
# all explanatory variable 

multiple_exercise <- lm(formula = ozone ~ ., data = airquality_data)

  2. Check the regression assumptions of the model created above.
# Change the panel layout to 2 x 2
# Look at all 4 plots at once 
par(mfrow = c(2, 2))

# Use plot() function to create diagnostic plots
plot(multiple_exercise)

The residuals vs fitted plot shows non-linearity, since the residuals are not randomly scattered around the zero line. The Q-Q plot shows that the residuals are roughly normal, since the majority of the points follow the straight line. The scale-location plot shows that the residuals do not have a constant variance. Lastly, the residuals vs leverage plot shows that observation 117 is a potential outlier, but there is no influential value since no observation falls outside the red dashed line.

4 Model Transformation

In Chapter 4 we saw that the predict() function gave us negative values for ozone levels, and we have seen above that our regression assumptions are not met. In this case, we can perform transformations inside the model formula when we fit the linear regression model with the response variable ozone and the explanatory variable wind. First, we will create a scatterplot of the ozone and wind variables to see the relationship between them.

# Create a scatter plot
# Specify the x and y axes
# Specify the default geom_point()
# Specify the title, x and y labels
# Centrally align the title text

ggplot2::ggplot(airquality_data) +
                aes(x=wind, y=ozone) +
                geom_point() +
                labs(x="Wind (miles per hour)",
                     y = "Ozone concentration (parts per billion)") +
                ggtitle("Ozone Concentration and Wind Speed from May to September") +
                theme(plot.title = element_text(hjust = 0.5))

As you can see, the graph does not suggest a linear relationship between ozone and wind. We also saw in the previous diagnostic plots that the regression assumptions were not fulfilled: the data were non-linear and the residuals did not have equal variance. We can use variance-stabilising transformations such as log(Y) when we see increasing variance in a fitted versus residuals plot. Hence, we will use the logarithmic transformation, which makes the regression model equation look like this:

\[log(ozone_{i})=\beta_{0}+\beta_{1}\times wind_{i}+\epsilon_{i}\]

In R, we will use logarithm transformation on the simple regression model shown below:

# Apply logarithm transformation on the response variable
# wind is the explanatory variable
log_model <- lm(formula=log(ozone) ~ wind, data=airquality_data)

We will find the summary of the transformed regression model as shown below:

# Summary of the transformed model
summary(log_model)

Again, the transformed response is a linear combination of the explanatory variables: \[log(\widehat{ozone_{i}})= 4.4888-0.0997 \times wind_{i}\]

To interpret our results, we re-scale the data from a log scale back to the original scale of the data: \[\widehat{ozone_{i}}= exp(4.4888)exp(-0.0997 \times wind_{i})\]

If the wind speed increases by 1 mile per hour, the predicted daily average ozone concentration is multiplied by \(exp(-0.0997)=0.91\); in other words, it decreases by roughly 9%.
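As a short sketch of this back-transformation in practice (the wind speed of 10 miles per hour is just an illustrative value):

# Predict log(ozone) at an illustrative wind speed of 10 miles per hour
log_pred <- predict(log_model, newdata = data.frame(wind = 10))

# Back-transform to the original scale (parts per billion)
exp(log_pred)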

We can apply transformations on explanatory variables as well. In R, there are various implementations of automatic transformations that choose an optimal transformation for you; one example is the Box-Cox transformation.
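As a hedged sketch, the boxcox() function in the MASS package (not loaded at the start of this chapter) plots the profile log-likelihood over a range of power transformations; a value of the parameter lambda near zero supports the logarithmic transformation used above:

# Box-Cox transformation search (requires a strictly positive response)
# MASS is distributed with R but is not loaded in the setup chunk above
MASS::boxcox(lm(ozone ~ wind, data = airquality_data))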

4.1 Exercise

  1. Apply the logarithmic transformation on the multiple linear regression containing all explanatory variables from airquality_data.
  1. Apply the logarithmic transformation on the multiple linear regression containing all explanatory variables from airquality_data.
# Apply logarithmic transformation on multiple linear regression
# containing all explanatory variables

log_model_exercise <- lm(formula = log(ozone) ~ ., data = airquality_data)

summary(log_model_exercise)

5 Model comparisons

When we are creating models, we often need to find the most informative model with the fewest variables. This is where we compare our regression models while considering metrics such as:

  • R-squared (\(R^2\)): It is the proportion of variation in the response variable explained by the explanatory variables. In general, the higher the R-squared, the better the model fits your data.

  • Root Mean Squared Error (RMSE): It is the square root of the mean squared error (MSE), which is the average squared difference between the observed response values and the fitted values. It indicates how close the observed data points are to the model’s predicted values. Lower values of RMSE indicate better fit.

  • Residual Standard Error (RSE): It is a measure used to assess how well a linear regression model fits the data. The lower the RSE, the better the model.

  • Mean Absolute Error (MAE): It measures the prediction error. Mathematically, it is the average absolute difference between observed and predicted outcomes. MAE is less sensitive to outliers compared to RMSE.
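To make these definitions concrete, here is a minimal sketch computing RMSE and MAE by hand from the residuals of the simple model fitted earlier:

# Residuals of the simple model: observed minus fitted ozone values
res <- residuals(simple_model)

# Root Mean Squared Error: square root of the average squared residual
sqrt(mean(res^2))

# Mean Absolute Error: average absolute residual
mean(abs(res))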

Although these metrics are very useful when you are comparing models, the problem is that they are sensitive to the inclusion of additional variables in the model, even if those variables don't contribute significantly to explaining the outcome. For example, including additional variables in the model will always increase the \(R^2\) and reduce the RMSE. So, we need a more robust metric to guide the model choice, such as:

  • Adjusted R-squared: which adjusts the \(R^2\) for having too many variables in the model.

  • AIC: It stands for Akaike’s Information Criteria. The basic idea of AIC is to penalize the inclusion of additional variables to a model. It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model.

  • BIC: It stands for Bayesian Information Criteria. It is a variant of AIC with a stronger penalty for including additional variables to the model.

Now we will compare two multiple linear regression models.

# Multiple linear regression
# the response variable is ozone 
# explanatory variables are temp and solar_r

model_one <- lm(formula=ozone ~ temp + solar_r, data=airquality_data)

# Multiple linear regression
# the response variable is ozone 
# all explanatory variable 

model_two <- lm(formula=ozone ~ ., data=airquality_data)

We will use the glance() function from the broom package (broom is part of the tidyverse) to compare both models.

# computes the R2, adjusted R2, sigma (RSE), AIC, BIC 
# of model one

broom::glance(model_one)

# computes the R2, adjusted R2, sigma (RSE), AIC, BIC 
# of model two 

broom::glance(model_two)

The Adjusted R-squared of model_two is higher than that of model_one. We can also see that sigma, the residual standard error, is lower in model_two, suggesting that model_two fits the data better than model_one. The AIC and BIC for model_two are also lower than for model_one. Finally, the F-statistic and p-value of model_two are both lower than those of model_one; the lower p-value indicates that model_two is statistically more significant than model_one.

Note: when we are comparing models, we have to use multiple metrics to justify our final decision; using the F-test, AIC or any other metric on its own will not give the appropriate model. Also, adding explanatory variables generally decreases AIC and increases BIC, so using delta AIC and delta BIC to compare models is a better approach, but that is beyond the scope of this course.
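As a sketch, the F-test mentioned in the note can be run directly for the two models compared above, since model_one is nested within model_two (it uses a subset of model_two's explanatory variables):

# F-test comparing the nested models
# A small p-value suggests the extra variables in model_two improve the fit
anova(model_one, model_two)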

5.1 Exercise

  1. Create two multiple linear regressions of your own choice.

  2. Compare both models using the glance() function.

  1. Create two multiple linear regressions of your own choice.
# Multiple linear regression
# the response variable is ozone 
# explanatory variables are month and solar_r

model_one_exercise <- lm(formula=ozone ~ solar_r + month, data=airquality_data)

# Multiple linear regression
# the response variable is ozone 
# all explanatory variable 

model_two_exercise <- lm(formula=ozone ~ ., data=airquality_data)

  2. Compare both models using the glance() function.
# computes the R2, adjusted R2, sigma (RSE), AIC, BIC 
# of model one

broom::glance(model_one_exercise)

# computes the R2, adjusted R2, sigma (RSE), AIC, BIC 
# of model two 

broom::glance(model_two_exercise)

The Adjusted R-squared of model_two is higher than that of model_one. We can also see that sigma, the residual standard error, is lower in model_two, suggesting that model_two fits the data better than model_one. The AIC and BIC for model_two are also lower than for model_one. Finally, the F-statistic of model_two is higher than that of model_one and its p-value is lower. These comparisons indicate that model_two is statistically more significant than model_one.

6 Multicollinearity

Collinearity happens when two or more explanatory variables are correlated with each other. However, there is an extreme situation, called multicollinearity, where collinearity exists between three or more variables even if no pair of variables has a particularly high correlation. This means that there is redundancy between explanatory variables.

Whilst collinearity can be detected with a correlation matrix, multicollinearity is not as easy to detect. The Variance Inflation Factor (VIF) can be used to find how much the variance of a regression coefficient is inflated due to multicollinearity in the model. The smallest possible value is one, indicating no multicollinearity. A value which exceeds 5 or 10 indicates a problematic amount of multicollinearity in the data. In R we use the vif() function from the car package to detect multicollinearity in a multiple regression model (where the response variable is ozone and all explanatory variables are added):

# Use the variance inflation factor from the car package
# multiple_exercise is the multiple linear regression model 
# which contains all explanatory variables 
# that was created in the first section exercise
car::vif(multiple_exercise)

All the variance inflation values are fairly close to one, suggesting our model doesn’t have multicollinearity. If an explanatory variable has a high VIF, we usually remove that explanatory variable from our model. However, we need to look at how removing that variable affects the model.
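The pairwise collinearity mentioned at the start of this section can be checked quickly with a correlation matrix; a minimal sketch (the use argument is a precaution in case of missing values, and all remaining columns are assumed to be numeric):

# Correlation matrix of the explanatory variables only
airquality_data %>%
  dplyr::select(-ozone) %>%
  cor(use = "pairwise.complete.obs")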

7 Model Selection

It is possible to build multiple models from a given set of explanatory variables, but building a good quality model can make all the difference. Here, we explore various approaches to building and evaluating regression models. One possible strategy consists of testing all possible combinations of the explanatory variables and then selecting the best model. This method is called best subsets regression. It is computationally expensive and becomes unfeasible for a large dataset with many variables. A better alternative to best subsets regression is to use stepwise selection, which consists of adding and deleting predictors to find the best performing model with a reduced set of variables.

7.1 Stepwise Selection

Stepwise selection consists of iteratively adding and removing explanatory variables in the regression model to find the subset of variables that results in the best performing model; that is, the model that lowers the prediction error.

There are three strategies of stepwise regression:

  • Forward selection: It starts with no explanatory variables in the regression model, iteratively adds the most contributive explanatory variables, and stops when the improvement is no longer statistically significant.

  • Backward selection: It starts with all explanatory variables in the model, iteratively removes the least contributive explanatory variable, and stops when you have a model where all explanatory variables are statistically significant.

  • Stepwise selection: It is a combination of forward and backward selections. You start with no explanatory variables, then sequentially add the most contributive explanatory variables (like forward selection). After adding each new variable, remove any variables that no longer provide an improvement in the model fit (like backward selection).

We can run a backward selection in R as shown below:

# Fit the full model containing all explanatory variables
full_model <- lm(formula=ozone~., data = airquality_data)

# Backward selection: start from the full model and iteratively drop variables
backward_selection <- step(full_model, direction = "backward")

The best model given is ozone ~ solar_r + wind + temp. Backward, forward, and stepwise searches are all useful, but they have an obvious issue: by not checking every possible model, they can miss the best possible model. With an extremely large number of explanatory variables this compromise is sometimes necessary, since checking every possible model would be time-consuming even with current computers. However, with a reasonably sized dataset, it isn't too difficult to check all possible models using the leaps package.
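Before moving on to the leaps approach, note that the same step() function can also run forward or both-direction searches. Here is a sketch, starting from an intercept-only model and using the full model's formula as the upper scope:

# Intercept-only model as the starting point
null_model <- lm(formula = ozone ~ 1, data = airquality_data)

# Forward selection: add variables up to the scope of the full model
forward_selection <- step(null_model,
                          scope = formula(full_model),
                          direction = "forward")

# Stepwise (both directions) selection from the same starting point
both_selection <- step(null_model,
                       scope = formula(full_model),
                       direction = "both")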

7.2 Best subset linear model using leaps::regsubsets()

Best subsets regression is a model selection approach that consists of testing all possible combinations of the explanatory variables and then selecting the best model according to some statistical criteria.

The method argument of the regsubsets() function can take three values: backward, forward and seqrep (sequential replacement, a combination of forward and backward selection). We also need to specify the tuning argument nvmax, which corresponds to the maximum number of explanatory variables to be incorporated in the model. Since the full model has 5 explanatory variables, we set nvmax=5. This gives the best models of each size: the best 1-variable model, the best 2-variable model, and so on up to the best 5-variable model.

# Use the regsubsets() function to get the best model
# Specify the method to be seqrep (sequential replacement)
# Set nvmax to be 5 since we have 5 explanatory variables

best_model <- leaps::regsubsets(ozone~., data = airquality_data, 
                         nvmax = 5, method = "seqrep")

# summary() reports the best set of variables for each model size

summarise_model <- summary(best_model)

summarise_model

In the result above, we can see that the best model with 1 explanatory variable contains temp, the best model with 2 explanatory variables contains wind and temp, and the last one contains all 5 explanatory variables. We can use this summary() output to determine the best model by looking at metrics such as Adjusted \(R^2\), RSS (residual sum of squares) and BIC. This allows us to identify the best overall model, where best is defined as the model that maximises the adjusted \(R^2\) and minimises the prediction error (RSS and BIC).

# Find which model has the maximum Adjusted R2
# and minimum residual sum of square RSS
# and minimum BIC

data.frame(Adj_R2 = which.max(summarise_model$adjr2),
           RSS = which.min(summarise_model$rss),
           BIC = which.min(summarise_model$bic))

As we can see, the adjusted \(R^2\) and RSS (residual sum of squares) show that the best model is the one with all 5 explanatory variables; however, the BIC shows that the model with 3 explanatory variables is the best. So, we have different “best” models depending on which metric we consider. As the saying goes, “all models are wrong, but some are useful”, so there is no single correct solution to model selection; each of these criteria will lead to slightly different models. There are other techniques as well, such as k-fold cross-validation, which is covered in the additional material.

7.2.1 Exercise

  1. Fit the best subset linear model using regsubsets() function and set the method to be backward and nvmax=3.

  2. Find the best model using the metrics such as Adjusted \(R^2\) , RSS (residual sum of squares) and BIC.

  1. Fit the best subset linear model using regsubsets() function and set the method to be backward and nvmax=3.
# Use regsubsets() function to get the best model
# Specify the method to be backward
# Set nvmax to be 3 since we want models with at most 3 explanatory variables

best_model_exercise <- leaps::regsubsets(ozone~., data = airquality_data, 
                              nvmax = 3, method = "backward")

# summary() reports the best set of variables for each model size

summarise_exer <- summary(best_model_exercise)

summarise_exer

  2. Find the best model using metrics such as Adjusted \(R^2\), RSS (residual sum of squares) and BIC.
# Find which model has the maximum Adjusted R2
# and minimum residual sum of square RSS
# and minimum BIC

data.frame(Adj_R2 = which.max(summarise_exer$adjr2),
           RSS = which.min(summarise_exer$rss),
           BIC = which.min(summarise_exer$bic))

As we can see, the adjusted \(R^2\), RSS (residual sum of squares) and BIC all show that the best model is the one with 3 explanatory variables, ozone ~ solar_r + wind + temp (this is the same model we obtained in the stepwise selection section when we applied the backward method in the step() function).

Summary

By now you should:

  • Know the difference between fitted values and residuals

  • Understand the regression assumptions

  • Be able to create and interpret residual diagnostics plots

  • Be able to use log transformation on response variables

  • Be able to compare different models using metrics such as \(R^2\), RMSE, RSE, AIC, BIC

  • Know how to find multicollinearity between variables using VIF

  • Be able to select the best model using stepwise selection

  • Be able to select the best subset linear model using leaps package

Next Chapter

In the next chapter we will look at generalised linear models.

Reuse

Open Government Licence 3.0