# Libraries needed
library(tidyverse) # For data manipulation
# Importing datasets
energy_data <- readr::read_csv("../Data/energy_data.csv")
airquality_data <- readr::read_csv("../Data/airquality_data.csv")
Chapter 4 - Linear Regression
1 Learning Objectives
The goal of this session is to learn about:
Formula and Formulae in R
One-way ANOVA
Multiple pairwise comparisons
Simple linear regression
Multiple linear regression
Interaction Effects
For this chapter, we need the dplyr package from the tidyverse (version 1.3.0), and we need to import energy_data and airquality_data from the Data folder, as shown in the code at the start of this chapter.
2 Formula and Formulae in R
In R, there are a number of functions which take the argument formula. These require us to formalize our model in a way that R can interpret. A formula does not just mean ‘mathematical expression’ in R; it belongs to a class called ‘formula’.
# A formula for variables x and a saved as object c
c <- x ~ a

# class of c?
class(c)
The tilde ~ generally characterises formulae (of the class formula) written in R. It means “capture the meaning of the code here without immediately evaluating it”. You’ll notice that when we run this code, R doesn’t mind not knowing what ‘x’ and ‘a’ are; it just knows it’s a formula object. The variable on the left of the tilde ~ is the response (dependent) variable, and the variable(s) on the right of the tilde ~ are the explanatory (independent) variables (which can be joined together by +). I am interested in predicting the energy consumption of a property based on information I have about the tenure, region, and the age of the property. In order to do that I can pose this question as a formula where my response (dependent) variable is consumption and my explanatory (independent) variables are tenure, region and property_age in the energy_data.
# Formula from energy_data
consumption ~ tenure + region + property_age
A formula is a specific object of the type ‘language’ in R, which includes the attribute of being able to store the environment. Don’t worry if you’re unsure about this; the important thing is recognising a formula in R and knowing what you can do with it. We’ll return to the formula in the next section on linear regression in R.
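As a quick hedged illustration of this, continuing with the formula object c created above, we can inspect the object’s type and its stored environment:
# typeof() shows the low-level type; environment() shows the stored environment
typeof(c)        # "language"
environment(c)   # the environment captured when the formula was created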
Generally, the formula structure that we’ll be working with will look something like the examples below.
For three explanatory variables in the model: y ~ x1 + x2 + x3
For including all the explanatory variables from a defined dataset in the model: y ~ .
For including all the explanatory variables from a defined dataset minus one variable: y ~ . - x2
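As a minimal sketch of these structures (assuming a small made-up data frame df with columns y, x1, x2 and x3), all three can be passed to a model-fitting function such as lm(), which we cover later in this chapter:
# Made-up data frame for illustration only
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

# Three named explanatory variables
lm(y ~ x1 + x2 + x3, data = df)

# All other columns in df as explanatory variables
lm(y ~ ., data = df)

# All other columns in df except x2
lm(y ~ . - x2, data = df)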
2.1 Quiz
If you want to add all the explanatory variables, which formula structure is appropriate?
y ~ x1 + x2 + x3
y ~ x
y ~ .
If you want to add all the explanatory variables, which formula structure is appropriate?
y ~ .
3 Analysis of Variance (ANOVA)
Like the t-test, the Analysis of Variance test (ANOVA) allows you to compare differences between groups. Unlike a t-test, an ANOVA can be run to compare two or more groups. In ANOVA, we usually have a quantitative response variable and a categorical explanatory variable with two or more groups (or levels). In ANOVA, the categorical explanatory variable is typically referred to as the factor. The use of ANOVA depends on the research design; we commonly use one-way ANOVA and two-way ANOVA. However, we will only cover one-way ANOVA in this course.
3.1 One-way ANOVA
One-way ANOVA has one categorical explanatory variable with several groups (or levels). In one-way ANOVA we compare the variation of the sample means between the groups to the variation within the groups. The breakdown of variability used in one-way ANOVA is shown below:
The total sum of squares (SST) is a measure for the total variability of the variable. It is given by
\[SST=\sum_{i=1}^{n}{(x_{i}-\bar{x})^2}\]
where \(x_i\) is each observation in the sample, \(\bar{x}\) is the overall mean of the sample and \(n\) is the total number of observations
The sum of squares groups (SSG) is a measure for the variability between groups. It is given by
\[SSG=\sum_{j=1}^{k}{n_{j}(\bar{x_{j}}-\bar{x})^2}\]
where \(n_{j}\) denotes the sample size for group/level j, \(\bar{x_{j}}\) denotes the mean of group/level j, \(k\) is the number of groups/levels and \(\bar{x}\) denotes the overall mean of the sample.
Finally, the sum of squares error (SSE) is a measure for the variability within groups. It is given by
\[SSE= SST-SSG\]
The degrees of freedom (df) are defined for each partition of variability (total, between-groups and within-groups variability). Also, \(n\) is the number of observations and \(k\) is the number of groups.
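As a minimal sketch (using a small made-up vector of observations split into two groups), the partition \(SST = SSG + SSE\) can be checked directly in R:
# Made-up observations and their group labels
x <- c(2, 4, 6, 8, 10, 12)
group <- factor(c("a", "a", "a", "b", "b", "b"))

# Total sum of squares
sst <- sum((x - mean(x))^2)

# Sum of squares between groups: n_j * (group mean - overall mean)^2, summed over groups
ssg <- sum(tapply(x, group, function(g) length(g) * (mean(g) - mean(x))^2))

# Sum of squares within groups (error)
sse <- sst - ssg

c(SST = sst, SSG = ssg, SSE = sse)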
We will use one-way ANOVA on energy_data, where we want to understand whether gas consumption significantly varies across known values of property_type. First, we remove the category ‘unknown’ from the property_type variable.
# Remove the NA in the property_type
property_type_data <- energy_data %>%
  filter(property_type != "NA") %>%
  select(property_type, consumption)

# We want to see the counts of each group
property_type_data %>% group_by(property_type) %>% count()
We see that the different property types are not equally distributed in our sample, but that is okay.
Now, we will check if the assumptions for the one-way ANOVA test are fulfilled. The assumptions for one-way ANOVA are as follows:
- Random samples
- Independent samples
- The data of each factor level are normally distributed.
- The standard deviations of the variable under consideration are the same for all the groups.
From our previous knowledge, we know that the samples are random and independent samples. First, we will check the normality assumption by plotting a normal probability plot (Q-Q plot) for each grouped variable.
# Q-Q plot of property type variable by each group
# stat_qq produces Q-Q plot
qqplot_property_type <- ggplot2::ggplot(property_type_data) +
  aes(sample = consumption,
      colour = factor(property_type)) +
  stat_qq()

# Display the graph
qqplot_property_type
We can also add the slope lines to see if the data points follow a straight line.
# stat_qq_line() produces the slope
qqplot_property_type + stat_qq_line()
As you can see, the majority of observations for each group fall on a straight line, suggesting a normal distribution.
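As an optional hedged check (assuming each property type group has between 3 and 5000 observations, which the Shapiro-Wilk test requires), normality can also be assessed numerically per group:
# Shapiro-Wilk normality test of consumption within each property type
# A small p-value suggests a departure from normality for that group
property_type_data %>%
  group_by(property_type) %>%
  summarise(shapiro_p = shapiro.test(consumption)$p.value)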
Then, we will check the last assumption, which is equal standard deviations of gas consumption across the different groups in property_type.
# Grouped tibble by property_type
groupby_property_type <- dplyr::group_by(property_type_data,
                                         property_type)

# Calculating the standard deviation of gas consumption
# by group dataset
standard_consumption <- dplyr::summarise(groupby_property_type,
                                         sd_consump = sd(consumption, na.rm = TRUE))

# Display the results
standard_consumption
To check the assumption of equal standard deviations, we will divide the largest sample standard deviation by the smallest. If the ratio is less than 2, then it is safe to assume constant standard deviation across the different groups.
# Divide the maximum by the minimum standard deviation
max(standard_consumption$sd_consump)/min(standard_consumption$sd_consump)
The ratio of the largest to the smallest sample standard deviation is 1.098, which is much less than 2, hence we conclude that the assumptions are fulfilled.
Now, we construct the hypothesis as follows:
\(H_{0}:\) The mean gas consumption is equal across different property types
\(H_{a}:\) At least one mean is different
Then run the ANOVA test using the aov() function, as shown below:
# Running one-way anova test by using aov() function
# Consumption is the response/dependent variable
# so it will be before the ~
# property_type is the explanatory variable
one_way_anova <- aov(consumption ~ property_type, data = property_type_data)

# Summary of the analysis
summary(one_way_anova)
Df is degrees of freedom
Sum Sq stands for the sum of squares
Mean Sq is mean sum of squares (sum of squares divided by degrees of freedom)
F value is the ratio of between-group sample variance and within-group sample variance
Pr(>F) is the p-value associated with F-statistics
Residuals are the difference between an observed value and predicted value (We will look at residuals in Chapter 5 in more detail).
Then, determine the critical values as shown below:
The one-way ANOVA uses the F distribution which is always right-tailed so the null hypothesis is rejected when the observed test statistics value is greater than the critical value (as covered in Chapter 3).
# Decide significance level
alpha <- 0.05

# Use qf() to find the quantile function
# of F distribution used for ANOVA
# Since F-test is one tailed we only specify
# 1-alpha and do not divide by 2
# Then specify the first degrees of freedom in table
# i.e. degrees of freedom of groups (between-group variability)
# Then specify the second degrees of freedom in table
# i.e. degrees of freedom of errors (within-group variability)
qf(p= 1-alpha, df1= 6, df2 = 1069)
Critical value approach: The observed test statistic (F = 15.44) is greater than the critical value (\(F_{0.05}^{6,1069}=2.107\)), so there is enough evidence to reject the null hypothesis at the 5% significance level - the data suggest that at least one of the property types has a different mean value for gas consumption.
The p-value approach: As the p-value is less than the significance level 0.05, we can conclude that there is a significant difference in mean gas consumption across the different types of properties. This is also highlighted in the model summary. Each star gives the associated significance level; for example, one * means that we can reject the null hypothesis at a 5% significance level, two ** means we can reject the null hypothesis at a 1% significance level, and so on.
An alternative to the ANOVA model where variances are not necessarily assumed to be equal would be the function oneway.test(). This requires similar inputs to aov().
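As a brief sketch of that alternative, using the same response and factor as before, Welch's one-way test can be run as follows:
# One-way ANOVA not assuming equal variances (Welch correction)
oneway.test(consumption ~ property_type, data = property_type_data)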
3.1.1 Exercise
Use one-way ANOVA on energy_data to understand whether gas consumption significantly varies across different floor area sizes. Use the p-value approach to make a conclusion.
- Use one-way ANOVA on energy_data to understand whether gas consumption significantly varies across different floor area sizes.
# Running one-way anova test by using aov() function
# Consumption is the response/dependent variable
# so it will be before the ~
# floor_area is explanatory variable
one_way_exercise <- aov(consumption ~ floor_area, data = energy_data)

# Summary of the analysis
summary(one_way_exercise)
- Use the p-value approach to make a conclusion.
The p-value approach: As the p-value is less than the significance level 0.05, we can conclude that there is a significant difference in mean gas consumption across different floor_area sizes.
3.1.2 Multiple pairwise-comparison
When we reject the null hypothesis, we conclude that not all the means are equal: that is, at least one mean is different from the other means. The ANOVA test itself provides only statistical evidence of a difference but does not tell us which group means are statistically different. For instance, using the previous example for energy_data, the ANOVA test suggested that there is a significant difference in average gas consumption across several different property types, so a follow-up (post-hoc) analysis would be needed to determine which property types differ. We would also want to know if one property type or multiple property types were better/worse than another in gas consumption. To complete this analysis we use a method called multiple pairwise-comparison.
Multiple pairwise-comparisons analyse all possible pairwise means. For example, with 7 different property types, the multiple comparison methods would compare 21 possible pairwise comparisons - each value compared with each other value (see the quick check after the examples below). For example, 3 possible pairs are:
- bungalow and converted flat
- detached and end terrace
- detached and mid-terraced
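The number of possible pairs follows directly from the number of groups; as a quick check in R:
# Number of unordered pairs from 7 property types: 7 choose 2
choose(7, 2)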
3.1.3 Pairwise t-test
We can use a pairwise t-test to compare the mean difference in gas consumption between different property types. However, that means we need to construct multiple hypotheses, which can cause a problem since testing more hypotheses means we are more likely to reject a null hypothesis incorrectly. Therefore, we need to use methods such as the Bonferroni correction to adjust the p-values when conducting pairwise t-tests. There are other adjustment methods available as well.
# Other methods that we can use
p.adjust.methods
# Conduct pairwise t-test
pairwise.t.test(x=property_type_data$consumption,
g=property_type_data$property_type,
p.adjust.method="bonferroni")
The pairwise t-test with the Bonferroni adjustment shows that, at a significance level of 5%, the differences in means for 11 combinations are not statistically significant since their p-values are greater than 0.05. Since this procedure is a bit harder to conduct and interpret, we will use Tukey’s HSD (honest significant difference) test.
3.1.4 Tukey Multiple-Comparison Method
The Tukey Multiple-Comparison Method, also known as Tukey’s HSD (honest significant difference) test compares the means of every group to the means of every other group. It obtains the confidence interval for each
\[\mu_{i}-\mu_{j}\]
If the confidence interval for a pairwise comparison does not include 0 then we reject the null hypothesis. If the confidence interval includes 0, we do not reject the null hypothesis.
# Tukey's test
tukey <- TukeyHSD(one_way_anova)

# Plotting the graph makes it easier to interpret results
# par sets or adjusts plotting parameters
# mar sets margin sizes: mar = c(bottom, left, top, right)
# las indicates the orientation of the tick mark labels
# 0 is parallel to axis, 1 is horizontal, perpendicular is 2,
# vertical is 3
par(mar=c(5,13,3,3))
plot(tukey, col = "brown",las=1)
We can see there are 11 pairs which contain 0 within their confidence intervals, suggesting the difference in 11 pairs is not statistically significant. For the 10 pairs which don’t contain 0 in the confidence interval, the ANOVA test concluded that there is a mean difference of gas consumptions for the different property types and Tukey’s HSD shows the 10 pairs of property type which are statistically different at 5% significance level.
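As a hedged sketch (assuming the TukeyHSD output stores its results under the factor name property_type), the statistically significant pairs can also be pulled out programmatically:
# Convert the Tukey results for property_type to a data frame
tukey_table <- as.data.frame(tukey$property_type)

# Keep only the pairs whose adjusted p-value is below 0.05
tukey_table[tukey_table$`p adj` < 0.05, ]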
3.1.4.1 Exercise
- Use Tukey’s HSD test to identify which group means of floor_area are different in energy_data.
- Use Tukey’s HSD test to identify which group means of floor_area are different in energy_data.
# Tukey's test
tukey_exercise <- TukeyHSD(one_way_exercise)

# Plotting the graph makes it easier to interpret results
# par sets or adjusts plotting parameters
# mar sets margin sizes: mar = c(bottom, left, top, right)
# las indicates the orientation of the tick mark labels
# 0 is parallel to axis, 1 is horizontal, perpendicular is 2,
# vertical is 3
par(mar=c(5,13,3,3))
plot(tukey_exercise, col = "brown",las=1)
We can see there are 13 pairs which contain 0 in their confidence intervals, suggesting these pairs don’t have a statistically significant difference in mean consumption. There are only 2 pairs which do not contain 0 in their confidence intervals, so for these pairings we can conclude that there is a difference in mean gas consumption between the floor area groups.
4 Linear Regression
Regression analysis is a statistical tool for estimating the relationships between two or more variables. The relationship is modeled as \(y∼x\) or \(y=f(x)\). Both of these model definitions indicate that the variable \(y\) is a function of \(x\) - the variable \(y\) is denoted as a response or dependent variable, whereas the variable \(x\) is denoted as a predictor or explanatory variable.
4.1 Simple Linear regression
In simple linear regression, there is only one explanatory variable. Hence, the relationship between the response variable \(y\) and the predictor variable \(x\) is given in form of a linear equation as shown below:
\[y=a+bx\]
where \(a\) and \(b\) are constants. The constant \(a\) is the intercept (the point of intersection of the regression line and the y-axis when \(x=0\)). The constant \(b\) is the slope (how much the y-value changes when the x-value increases by 1 unit).
As we have seen in chapter 2, the variables wind and ozone in airquality_data are moderately correlated since the correlation coefficient is -0.531, and the scatterplot matrix suggested that as the wind speed increases, the daily mean ozone concentration decreases.
# Specify the x and y axes
# Specify the default geom_point()
# Specify the titles, x and y labels
ggplot2::ggplot(data = airquality_data) +
  aes(x = wind, y = ozone) +
  geom_point() +
  labs(x = "Wind (miles per hour)",
       y = "Ozone parts (per billion)") +
  ggtitle("Wind and Ozone from May till September") +
  theme(plot.title = element_text(hjust = 0.5))
The relationship does not seem to be perfectly linear, i.e., the points do not fall on a perfectly straight line, but it does seem to follow a straight line moderately well, just with some variability. This is due to the fact that the response variable is affected by other unknown and/or random processes that are not fully captured by the explanatory variable. To account for these unknowns, a random error term, denoted by \(\epsilon\), is added to the linear regression model:
\[y=a+bx+\epsilon\]
The error term is independent and follows a normal distribution, i.e. \(\epsilon \sim N(0,\sigma^2)\).
In linear regression, \(a\) and \(b\) are denoted by the regression parameters \(\beta_{0}\) and \(\beta_{1}\), respectively. So, the simple linear regression model for an individual observation looks like this:
\[y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}\]
where \(i=1,..,n\)
\(y_{i}\) is the response value; it has a subscript because it is the response value of the i-th observation.
\(\beta_{0}\) is the intercept of the line that best fits the data; it does not vary across each observation.
\(\beta_{1}\) is the slope of the line that best fits the data; it does not vary across each observation.
\(x_{i}\) is the predictor value; it has a subscript because it can vary across each observation.
\(\epsilon_{i}\) is the error term; independent and identically distributed (iid) such that \(\epsilon_i \sim N(0, \sigma^2)\)
The above equation is in scalar form, we can also have it in the matrix form as shown below:
\[\mathbf{Y}=\mathbf{X}\boldsymbol\beta+\boldsymbol{\epsilon}\]
Note: Bold-faced letters represent matrix and vectors.
We run a regression to discover whether the estimated coefficients on the explanatory variables are different from 0 (to see if explanatory variables have any effect on response variable). The estimated linear regression model in scalar form is as follows:
\[\hat{y_{i}}=\hat{\beta_{0}}+\hat{\beta_{1}}x_{i}\]
where the hats indicate sample-based estimates of these values.
4.1.1 Exercise
- Use the help function to find out what the function lm does.
help(lm)
- Use the help function to find out what the function lm does.
help(lm)
The function lm is used to fit the regression model to data using R.
4.1.2 Building linear regression in R
We fit a linear regression model on the airquality_data where the response variable is ozone concentration and the explanatory variable is the speed of wind, using the lm function in R.
# First specify your formula then the dataset
simple_model <- lm(formula = ozone ~ wind, data = airquality_data)

# Get a summary of the model
summary(simple_model)
The summary gives us a range of diagnostic information about the model we’ve fitted:
Call: This is an R feature that shows what function and parameters were used to create the model.
Residuals: difference of the observed value and the predicted value.
Estimate: coefficient estimates for the intercept and explanatory variables.
Std Error: standard errors (i.e. an estimate of the uncertainty) of the coefficient estimates.
t value: t-statistic for the t-test comparing whether the coefficient is different to 0.
Pr(>|t|): p-value for the t statistics, giving the significance of coefficient.
Residual standard error: an expression of the variation of the observations around the regression line.
Multiple R-squared/Adj. R-squared: The proportion of the variance in the observed values explained by the model. The Adj. R-squared takes into account the number of variables in the model.
F-statistic, p-value: model fit information (allows you to compare different models to assess which is best).
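As a small hedged sketch, individual pieces of this summary can also be extracted directly from the fitted model object:
# Estimated intercept and slope
coef(simple_model)

# Confidence intervals for the coefficients
confint(simple_model)

# R-squared from the model summary
summary(simple_model)$r.squared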
First, we will look at the estimated regression coefficients:
\(\hat{\beta_{0}}= 85.2\)
\(\hat{\beta_{1}}= -4.32\)
Then, we will look at the p-value associated with the t statistic. The low p-value suggests that the slope is not zero, which in turn suggests that changes in the explanatory variable (wind) are associated with changes in the response variable (ozone). We can also see that \(R^2=0.282\), which means that the speed of wind explains 28.2% of the variance in average ozone concentration.
We can write the regression model as follows:
\[\widehat{ozone_{i}}= 85.2-4.32 \times wind_{i}\]
Intercept interpretation: The model-predicted mean ozone concentration is 85.2 parts per billion when the speed of the wind is 0 miles per hour.
Slope interpretation: If the speed of wind increases by 1 mile per hour, we predict the daily average ozone concentration decreases by 4.32 parts per billion.
To better understand this, consider the daily reading of mean ozone concentration for two different values of wind speed. The first is when the wind speed is 7 miles per hour and the second is when the wind speed is 8 miles per hour. Now let’s compute the daily mean ozone concentration using the estimated regression coefficients.
# Create a variable where the wind speeds are 7 mph and 8mph
wind_speed <- c(7, 8)

# Compute ozone concentration when wind speed is 7 mph and 8mph
85.2-4.32*wind_speed
The predicted mean ozone concentration differs by 4.32 parts per billion between the two wind speeds, which differ by 1 mph.
4.1.2.1 Exercise
Fit a linear regression model on the airquality_data where the response variable is ozone concentration and the explanatory variable is the temperature. Interpret your estimated regression parameters.
- Fit a linear regression model on the airquality_data where the response variable is ozone concentration and the explanatory variable is the temperature.
# Fit linear model using lm()
simple_model_exercise <- lm(formula = ozone ~ temp, data = airquality_data)

# Get a summary of the model
summary(simple_model_exercise)
- Interpret your estimated regression parameters.
Intercept interpretation: The model-predicted mean ozone concentration is -101.6 parts per billion when the temperature is 0 degrees Fahrenheit.
Slope interpretation: If the temperature increases by 1 degree Fahrenheit, we predict the daily average ozone concentration increases by 1.85 parts per billion.
4.1.3 Predicting using the regression model
Once we have created a linear regression model, we can use the predict() function to predict the response variable for new values of the explanatory variables.
# Create some new data
new_wind <- data.frame(wind = c(23, 26, 30))

# Use the predict function on new values
# And also get confidence intervals for these
predict(simple_model, newdata = new_wind, interval = "confidence")
The fit column is the predicted value, with lwr and upr showing the confidence intervals.
When the wind speed is 23 mph, the predicted average ozone concentration is -14.26965 parts per billion - an unfeasible value.
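For comparison, predict() can also return prediction intervals for individual new observations, which are wider than confidence intervals; a quick sketch using the same new_wind values:
# Prediction intervals account for the variability of individual observations
predict(simple_model, newdata = new_wind, interval = "prediction")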
4.1.3.1 Exercise
- Predict the ozone concentration when the temperature is 48, 52 and 55 degrees Fahrenheit using the linear model created in the last exercise.
- Predict the ozone concentration when the temperature is 48, 52 and 55 degrees Fahrenheit using the linear model created in the last exercise.
# Add some new data
new_temp <- data.frame(temp = c(48, 52, 55))

# Use the predict function on new values
# And also get confidence intervals for these
predict(simple_model_exercise, newdata = new_temp, interval = "confidence")
5 Multiple linear regression
A multiple linear regression model is a generalization of the simple linear regression model. In the multiple linear regression model, there is one response variable and more than one explanatory variable. The model will contain the intercept \(\beta_{0}\), and more than one coefficient, denoted as \(\beta_{1},\beta_{2},...,\beta_{k}\) where \(k\) is the number of explanatory variables in the model. The mathematical equation of a multiple linear regression model is shown below:
\[Y_{i}=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+...+\beta_{k}x_{ik}+\epsilon_{i}\]
where \(i=1,...,n\)
\(Y_{i}\) is the value of the response variable for the ith observation.
\(x_{ik}\) is the value of the kth explanatory variable for the ith observation.
\(\beta_0\) is the intercept of the line that best fits the data.
\(\beta_1,..,\beta_k\) are the regression coefficients for the explanatory variables.
\(\epsilon_{i}\) is the error term; independent and identically distributed (iid) such that \(\epsilon_i \sim N(0, \sigma^2)\)
In chapter 2, we saw that ozone concentration and temperature are positively correlated. So, as temperature increases, the ozone concentration also increases. We also found that the daily temperature and speed of wind are moderately correlated since the correlation coefficient is -0.458 and the data points in the scatterplot matrix suggested a slight linear relationship. So now a multiple linear regression model can be fit with explanatory variables wind and temp.
# First specify your formula then the dataset
multiple_model <- lm(formula = ozone ~ wind + temp, data = airquality_data)

# Get a summary of the model
summary(multiple_model)
Differences in temperature and wind speed explain 44.4% of the variation in ozone concentration in the sample. The p-values imply that there is a significant linear relationship between ozone and wind, and between ozone and temp, in the regression model. Further, it appears that the intercept is significant too. Also, the \(R^2\) is greater than in the previous model. We can write the regression model as follows:
\[\widehat{ozone_{i}}= -41.2-2.6 \times wind_{i} + 1.4 \times temp_{i}\]
Since \(\hat{\beta_{1}}=-2.6\), we can conclude that when the speed of the wind increases by one mph, the mean ozone concentration decreases by 2.6 parts per billion, while keeping all other explanatory variables constant. Also, \(\hat{\beta_{2}}=1.4\), suggesting as temperature increases by 1 degree Fahrenheit, the ozone concentration increases by 1.4 parts per billion, while keeping all other explanatory variables constant. Lastly, when the wind speed is 0 mph and the temperature is 0 degrees Fahrenheit, then the mean ozone concentration is -41.2 parts per billion.
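As a hedged sketch (assuming both models are fitted to the same rows of airquality_data, i.e. wind and temp have no additional missing values), the simple and multiple models can be compared formally with anova():
# Compare the nested models: ozone ~ wind versus ozone ~ wind + temp
# A small p-value suggests the larger model explains significantly more variation
anova(simple_model, multiple_model)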
5.1 Exercise
- Fit a multiple linear regression model with explanatory variables wind, temp and solar_r.
- Fit a multiple linear regression model with explanatory variables wind, temp and solar_r.
# First specify your formula then the dataset
multiple_model_exercise <- lm(formula = ozone ~ wind + temp + solar_r, data = airquality_data)

# Get a summary of the model
summary(multiple_model_exercise)
5.2 Interaction Effects
In the last section, we learned how to build a multiple linear regression model for predicting a continuous response variable \(y\) based on multiple explanatory variables \(x\). For example, to predict ozone concentration, based on daily wind speed and temperature. The model was additive. It assumes that the relationship between a given explanatory variable and the response variable is independent of the other explanatory variables. However, this is not always the case.
Considering our example, the additive model assumes that the effect of wind speed on ozone concentration is independent of the effect of temperature. This assumption might not be true. For example, an increase in temperature might cause a decrease in the speed of the wind, which could then cause an increase in the response variable ozone. In statistics, this is referred to as an interaction effect.
The mathematical equation of regression with an interaction term will be:
\[Y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{1}X_{2}+ \epsilon\]
where everything is the same as before, \(\beta_{3}\) is a regression coefficient, and \(X_{1}X_{2}\) is the interaction between \(X_{1}\) and \(X_{2}\).
In R, we can add the interaction term in two ways, but before adding the interaction term we will mean-centre our explanatory variables. The notion of centring continuous variables in regression is a source of constant debate and questioning, and answers are often given in terms of statistical properties. Centring a variable moves its mean to 0 (which is done by subtracting the mean from the variable). Here’s a list to give a rough idea of when to centre:
Interpreting the intercept: If the intercept term needs to be interpreted, and an explanatory variable does not have a meaningful 0 value (such as weight or height), then that explanatory variable should be centred.
Interactions: If you are testing an interaction between a continuous variable and another variable (continuous or categorical) the continuous variables should be centred to avoid multicollinearity issues, which could affect model convergence and/or inflate the standard errors.
Polynomial terms: If you are transforming a variable (e.g. \(X^2\)), the transformed variable may be highly correlated with the untransformed variable (X). For the same reason as the interaction term, centre the untransformed variable (X) after the transformation.
We should only centre continuous variables (we don’t want to centre categorical dummy variables like gender). Also, you only centre explanatory variables, not the response variable.
# Centre the explanatory variables
# by using scale() function
# center = TRUE means centring the variables and not scaling them
centre_temp <- scale(airquality_data$temp, center = TRUE, scale = FALSE)
centre_wind <- scale(airquality_data$wind, center = TRUE, scale = FALSE)

# Add an interaction term between centred wind and centred temp
interaction_model <- lm(formula = ozone ~ centre_wind + centre_temp +
                          centre_wind:centre_temp, data = airquality_data)

# Or only use * between two variables - a*b is equivalent to a + b + a:b
interaction_model <- lm(formula = ozone ~ centre_wind*centre_temp,
                        data = airquality_data)

# Get a summary of the model
summary(interaction_model)
It can be seen that all the coefficients, including the interaction term coefficient, are statistically significant, suggesting that there is an interactive relationship between the two explanatory variables.
5.2.1 Exercise
- Create a multiple regression model of the response variable ozone and the explanatory variables temp and solar_r, and include an interaction between both explanatory variables.
- Create a multiple regression model of the response variable ozone and the explanatory variables temp and solar_r, and include an interaction between both explanatory variables.
# Create the centered explanatory variables
centered_temp <- scale(airquality_data$temp, center = TRUE, scale = FALSE)
centered_solar <- scale(airquality_data$solar_r, center = TRUE, scale = FALSE)

# Use * between two variables
interaction_exercise <- lm(formula = ozone ~ centered_solar*centered_temp,
                           data = airquality_data)

# Get a summary of the model
summary(interaction_exercise)
Summary
By now, you should:
Know how to use the formula in R
Know how to carry out one-way ANOVA
Know how to carry out simple linear regression and multiple linear regression
Know how to add interaction terms in regression analysis
Next Chapter
In the next chapter we will look at model adequacy.