Chapter 6 - Generalised Linear Models

Author

Government Analysis Function and ONS Data Science Campus

Learning Objectives

The goal of this session is to learn about:

The theory of generalised linear models
Logistic Regression
Gaussian (Normal) Regression
Gamma Regression

In this chapter, we need to import energy_data and airquality_data, also we need to load the tidyverse package.

# Libraries needed

library(tidyverse) # For data manipulation 

# Importing data 

energy_data <- readr::read_csv("../Data/energy_data.csv")

airquality_data <- readr::read_csv("../Data/airquality_data.csv")

1 Theory of generalised linear models

The mathematical equation of multiple linear regression was as follows:

\[Y_{i}=\beta_{0}+\beta_{1}x_{i1}+...+\beta_{k}x_{ik}+\epsilon_{i}\]

where we assumed that the error terms are normally distributed such that \(\epsilon_{i} \sim N(0,\sigma^2)\)

Hence, the distribution of the response variable in linear regression is also normal such that

\[(Y|\mathbf{X=x})\sim N(\beta_{0}+\beta_{1}x_{i1}+...+\beta_{k}x_{ik},\sigma^2)\]

From the information above (and chapter 4), we know that in linear regression:

The response variable conditioned on the explanatory variables has a normal distribution

Conditional mean is a linear combination of the explanatory variables \(\mathbb{E}(Y|\mathbf{X=x})=\beta_{0}+\beta_{1}x_{i1}+...+\beta_{k}x_{ik}\)

In the generalised linear model, we can modify both of these things, so instead of having only a normal distribution for the response variable (which is the case in linear regression), we can have another distribution such as the binomial distribution or Poisson distribution as far as they are in exponential family. Secondly, instead of the conditional mean being a linear combination of the explanatory variables, it can be some function of a linear combination of the explanatory variables.

A generalised linear model is made up of three parts:

A distribution of the response conditioned on the explanatory variables. (Technically this distribution needs to be from the exponential family of distributions.)
A linear combination of the explanatory variables, \(\beta_0+\beta_1x_1+..+\beta_kx_k\) which we write as \(\eta(\mathbf{x})\). That is,

\[\eta(\mathbf{x})=\beta_0+\beta_1x_1+...+\beta_kx_k\]

A link function \(g()\), that defines how \(\eta(\mathbf{x})\) which is a linear combination of explanatory variables, is related to the mean of the response conditioned on the explanatory variable \(\mathbb{E}(Y|\mathbf{X=x})\). That is,

\[\eta(\mathbf{x})=g(\mathbb{E}(Y|\mathbf{X=x}))\]

Hence, in linear regression we have a normal distribution for response variable and the conditional mean is a linear combination of the explanatory variables as shown below:

\[\mathbb{E}(Y|\mathbf{X=x})=\beta_{0}+\beta_{1}x_{i1}+...+\beta_{k}x_{ik}\]

However, in GLM we can have any distribution as far as it is from exponential family and have a link function \(g()\) such that:

\[\mathbb{E}(Y|\mathbf{X=x})=g^{-1}(\boldsymbol\eta(\mathbf{x}))=g^{-1}(\beta_{0}+\beta_{1}x_{i1}+...+\beta_{k}x_{ik})\]

The following table summarizes three examples of a generalized linear model:

Like ordinary linear regression, we will seek to “fit” the model by estimating the \(\beta\) parameters. To do so, we will use the method of maximum likelihood.

In generalised linear models, the response variable can only have a normal distribution.

True
False

We can have any distribution in generalised linear models.

True
False

The mean of the response conditioned on the explanatory variable is denoted as:

\(\mathbb{E}(X|\mathbf{Y=y})\)
\(\mathbb{E}(Y|\mathbf{X=x})\)
\(Var(Y|\mathbf{X=x})\)

What does a link function do?

It links \(\eta(\mathbf{x})\) which is a linear combination of explanatory variables, to the mean of the response conditioned on the explanatory variable \(\mathbb{E}(Y|\mathbf{X=x})\)
It links the distribution to \(\mathbb{E}(X|\mathbf{Y=y})\)

In generalised linear models, the response variable can only have a normal distribution.

False, in generalised linear models, we can have other distribution such as binomial distribution or poison distribution as far as they are in the exponential family.

We can have any distribution in generalised linear models.

False, in generalised linear models, we can only have distributions that are in the exponential family.

The mean of the response conditioned on the explanatory variable is denoted as:

\(\mathbb{E}(Y|\mathbf{X=x})\)

What does a link function do?

It links \(\eta(\mathbf{x})\) which is a linear combination of explanatory variables, to the mean of the response conditioned on the explanatory variable \(\mathbb{E}(Y|\mathbf{X=x})\)

1.2 Table of exponential-family distributions

The table below shows few exponential-family distributions that are commonly used and the data they are typically used for, along with their link functions.

Distribution	Support of Distribution	Typical Uses	Link Name	Link Function
Normal/Gaussian	Real: \((-\infty,+\infty)\)	Linear response data	Identity	\(\mu\)
Gamma/Exponential	Real: \((0,+\infty)\)	Exponential response data, scale parameters	Negative inverse	\(\mu^{-1}\)
Poisson	Integer: \(0,1,2,..\)	Count of occurrences in fixed amount of time/space	Log	\(\log(\mu)\)
Bernoulli	Integer: \([0,1]\)	Outcome of single yes/no occurrence	Logit	\(\log(\frac{\mu}{1-\mu})\)
Binomial	Integer: \(0,1,2\)	Outcome of number of yes occurrences out of N (yes/no) occurrence	Logit	\(\log(\frac{\mu}{n-\mu})\)

The generalised linear models can handle categorical response variables (binomial), integer response variables (Poisson, negative binomial) right-skewed response variables (gamma, inverse Gaussian) and symmetrical response variables (gaussian).

1.2.1 Quiz

Quiz
Show Answer

Which distribution can be used in GLM when the response variable takes only positive integers: 0,1,2,..?

Gamma distribution
Normal distribution
Poisson distribution

Which distribution can be used in GLM when the response variable is continuous but only takes non-negative values?

Gamma distribution
Normal distribution
Poisson distribution

Which distribution can be used in GLM for binary response variable (takes only values 0 or 1)?

Gamma distribution
Bernoulli distribution
Poisson distribution

Which distribution can be used in GLM when the response variable takes only positive integers: 0,1,2,..?

Poisson distribution

Which distribution can be used in GLM when the response variable is continuous but only takes non-negative values?

Gamma distribution

Which distribution can be used in GLM for binary response variable (takes only values 0 or 1)?

Bernoulli distribution

1.2.2 Exercise

Exercise
Show Answer

Use the help() function or ? to find information about the function glm().
Find out what the argument family means in the function.

Use the help() function or ? to find information about the function glm().

# Use help function to find info about the function glm
help(glm)

The function is used to fit a generalised linear model in R.

Find out what the argument family means in the function.

In the argument family, we specify the distribution and the corresponding link function. If you click on “See family for details of family functions” in the argument, you can see that some of the distribution names and their corresponding link function are provided.

2 Logistic regression theory

Logistic regression is used to model a binary outcome, that is a variable which can have only two possible values: 0 or 1. For example, logistic regression can be used:

To predict whether an email is spam (1) or not (0)
Whether the tumour is dangerous (1) or not (0)

The explanatory variables can be continuous, categorical or a mix of both. In general, we call the two outcomes of the response variable “success” and “failure” and represent them by 1 (for success) and 0 (for a failure). If our data has \(n\) independent observations (so we have outcome 0 or 1, n times), then we have the binomial setting because a binomial distribution is a sum of independent and identically distributed Bernoulli random variables. We now define the logistic regression model:

\[\log\Bigg(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\Bigg)=\beta_{0}+\beta_{1}x_{i1}+...+\beta_{k}x_{ik}\]

Where \(i=1,..,n\) and n is the total number of observations

The link function is \(\log\Big(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\Big)\). It is the logarithm of the odds, also known as log-odds or logit. The odds reflect the likelihood that the event will occur. It can be seen as the ratio of “successes” to “non-successes”.
The linear combination of the explanatory variables is \(\beta_0+\beta_1x_1+..+\beta_kx_k\), where k is the total number of explanatory variables.
The probability of success (\(Y_i=1\)) is \(p(\mathbf{x})=\mathbb{P}(Y_i=1|\mathbf{X=x})\)
The probability of failure (\(Y_i=0\)) is \(1-p(\mathbf{x})=\mathbb{P}(Y_i=0|\mathbf{X=x})\)

Note: Applying the inverse logit transformation allow us to obtain an expression for \(p(\mathbf{x}):\)

\[p(\mathbf{x})=\frac{e^{\beta_{0}+\beta_{1}x_{i1}+...+\beta_{k}x_{ik}}}{1+e^{\beta_{0}+\beta_{1}x_{i1}+...+\beta_{k}x_{ik}}}\]

2.1 Quiz

Quiz
Show Answer

We use logistic regression when the response variable has a binary outcome.

True
False

In logistic regression, the explanatory variables should have a binary outcome.

True
False

What is this equation \(\log\Bigg(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\Bigg)\) called?

Logrithm of bernoulli
Logarithm of the odds (log-odds)

We use logistic regression When the response variable has a binary outcome.

True

In logistic regression, the explanatory variables should have a binary outcome.

False, the explanatory variables can be continuous, categorical or a mix of both.

What is this equation \(\log\Bigg(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\Bigg)\) called?

Logarithm of the odds (log-odds)

3 Logistic regression in R

Now we will use logistic regression to check if the variables such as solar radiation, wind speed, temperature affect the ozone concentration to be above the median. Firstly, we create a new column with a binary outcome as shown below:

# Create a new column using mutate function
# where the column will have binary outcomes (yes/no)
# yes if ozone concentration is above median ozone 
# no otherwise 

airquality_data <- airquality_data %>% 
  dplyr::mutate(ozone_above_median = (ozone > median(ozone)))

# View unique values of the column created above 

unique(airquality_data$ozone_above_median)

Now we will assume the ozone_above_median column to be our response variable and the explanatory variables are wind, solar_r, and temp. We will use the binomial distribution because while each observation follows a Bernoulli distribution (since outcomes are either 0 or 1), we have n independent observations and the sum of multiple Bernoulli random variables follows a binomial distribution.

# Fit a generalised linear model 
# using the binomial distribution

logistic_example <- glm(ozone_above_median ~ wind + solar_r + temp, 
                        family="binomial", data=airquality_data)


# Find the results using summary() function

summary(logistic_example)

The summary output for GLM models displays the call (the arguments that have been run) and coefficients similar to the summary for linear models. However, there are a few differences, such as:

Deviance: is defined as the difference of likelihoods between the fitted model and the saturated model (a saturated is a model that fits perfectly). A zero deviance means perfect model so deviance is always greater than and equal to 0. Smaller deviance (close to 0) indicate better model ﬁt.
Deviance Residuals: is a measure of model fit. It shows the distribution of the deviance residuals for individual cases used in the model.
Dispersion parameter: for a likelihood-based model, the dispersion parameter is always fixed to 1. It is adjusted only for methods that are based on quasi-likelihood estimation such as when family = quasibinomial. These methods are particularly suited for dealing with overdispersion.
Null deviance: is the deviance for the null model, that is, a model with no explanatory variables.
Residual deviance: is the deviance for the fit model.
AIC: is a measure of fit that penalizes for the number of parameters. Smaller values indicate better fit and thus the AIC can be used to compare models.
Number of Fisher Scoring iterations: is the output of an iterative weighted least squares. A high number of iterations may be a cause for concern indicating that the algorithm is not converging properly.

3.1 Exercise

Exercise
Show Answer

Using airquality_data, create a new column such that it contains the binary outcome (yes/no) when the ozone concentration is above the mean value.
Use logistic regression to check if the variables solar radiation, wind speed, and temperature affect the ozone concentration to be above the mean.

Using airquality_data, create a new column such that it contains the binary outcome (yes/no) when the ozone concentration is above the mean value.

# Create a new column using mutate function
# where the column will have binary outcomes (yes/no)
# yes if ozone concentration is above mean ozone 
# no otherwise 

airquality_data <- airquality_data %>% 
  dplyr::mutate(ozone_above_mean = (ozone > mean(ozone)))

# View unique values of the column created above 

unique(airquality_data$ozone_above_mean)

Use logistic regression to check if the variables such as solar radiation, wind speed, temperature affect the ozone concentration to be above the mean.

Now we will assume the ozone_above_mean column to be our response variable and the explanatory variables are wind, solar_r, temp. We will again use the binomial distribution as in the example above.

# Fit a generalised linear model  
# using a binomial distribution

logistic_exercise <- glm(ozone_above_mean ~ wind + solar_r + temp, 
                         family="binomial", data=airquality_data)

# Find the results using summary() function

summary(logistic_exercise)

3.2 Interpreting coefficients

In this section, we will focus on the interpretation of the estimated coefficients produced by fitting a logistic regression to check if the variables such as solar radiation, wind speed, temperature affect the ozone concentration to be above the median.

# Find the coefficients using summary()

summary(logistic_example)$coef

The p-value for the variables wind and temp are statistically significant suggesting the change in wind speed and temperature is associated with changes in the response variable (ozone_above_median). However, the variable solar_r is not statistically significant as it larger than 0.05 (5e-2). Including variables which are not statistically significant can cause overfitting, therefore they should be eliminated by using statistical techniques such as stepwise selection covered in Chapter 5.

The estimated coefficient of the variable wind is \(\hat\beta_1=-0.281\), which is negative. This means an increase in wind speed is associated with a decrease in the probability of ozone concentration being above the median. However, the estimated coefficient for the variable temp is \(\hat\beta_3=0.226\), which is positive. This means that an increase in temperature will be associated with an increased probability of ozone concentration being above the median.

The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the explanatory variable. So,

For every one unit change in wind speed (mph), the log odds of ozone concentration being above the median (versus ozone concentration below the median) decreases by 0.281. Hence, the one unit increase in the wind speed (mph) will decrease the odds of ozone concentration being above the median by exp(0.281)= 1.32 times.
For every one unit change in temperature (degrees Fahrenheit), the log odds of ozone concentration being above the median increases by 0.226. Hence, the one unit increase in the temperature will increase the odds of ozone concentration being above the median by exp(0.226)= 1.25 times.

3.2.1 Exercise

Exercise
Show Answer

Use logistic regression to check if the variables wind speed and temperature affects the ozone concentration to be above the mean.
Interpret your results.

Use logistic regression to check if the variables wind speed and temperature affects the ozone concentration to be above the mean.

# Fit a generalised linear model 
# using a binomial distribution
# response variable is ozone_above_mean which is created in the previous exercise 
# explanatory variables are wind and temp

logistic_model <- glm(ozone_above_mean ~ wind + temp, 
                      family="binomial", data=airquality_data)

# Find the estimated coefficients 

summary(logistic_model)$coef

Interpret your results.

The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the explanatory variable. So,

For every one unit change in wind speed (mph), the log odds of ozone concentration being above the mean (versus ozone concentration below the mean) decreases by 0.26. Hence, the one unit increase in the wind speed (mph) will decrease the odds of ozone concentration being above the mean by exp(0.26)= 1.29 times.
For every one unit change in temperature (degrees Fahrenheit), the log odds of ozone concentration being above the mean increases by 0.24. Hence, the one unit increase in the temperature will increase the odds of ozone concentration being above the mean by exp(0.24)= 1.27 times.

3.3 Predicting using logistic regression

We can also make predictions using logistic regression. In section 2 (poisson regression theory ) we saw the log odd can be solved mathematically to calculate \(p(\mathbf{x})\). Hence, the logistic equation for our model can be written as: \[p(\mathbf{x})=\frac{\exp(-17.63-0.281 *wind_i+0.005*solar_{i}+0.226*temp_i)}{1+\exp(-17.63-0.281*wind_i+0.005*solar_{i}+0.226*temp_i)}\]

In R, we can easily predict the probability of ozone concentration being above the median for given values of the explanatory variables. So, first, we create a new data frame with the values we want the explanatory variables to take on to create our predictions.

# Creating new data which has three variables 
# wind, temp and solar_r
# each variable has two values 
# so we will get two predictions

new_data <- data.frame(wind = c(10,5), temp = c(80,90), solar_r= c(150,150))


# using predict function and new_data we predict 
# the probability of ozone concentration above the median 

predict(logistic_example, new_data, type="response")

As you can see when the wind speed is 10 mph, the temperature is 80 degrees Fahrenheit and solar radiations are 150 langleys, then the probability of ozone concentration to be above 42.13 parts per billion (median) is 15.7% and if we increase the temperature (to 90 degrees) and reduce the wind speed (to 5 mph), then the probability of ozone concentration to be above 42.13 parts per billion (median) also increases to 87.9%.

3.3.1 Exercise

Exercise
Show Answer

Using the logistic model created in the previous exercise (where the response variable was ozone_above_mean and explanatory variables were wind and temp), find the probability of ozone concentration to be above the mean when the wind speed is 6mph and the temperature is 85 degrees Fahrenheit.

Using the logistic model created in the previous exercise (where the response variable was ozone_above_mean and explanatory variables were wind and temp), find the probability of ozone concentration to be above the mean when the wind speed is 6mph and the temperature is 85 degrees Fahrenheit.

# Creating a new data which has two variables wind, temp
# wind takes the value 6 and temp takes the value 85

new_data <- data.frame(wind = 6, temp = 85)


# using predict function and new_data we predict 
# the probability of ozone concentration above the median 
# when wind speed is 6mph and the temperature is 85 degrees Fahrenheit

predict(logistic_model, new_data, type="response")

When the wind speed is 6 mph and the temperature is 85 degrees Fahrenheit, then the probability of ozone concentration to be above 42.13 parts per billion (mean) is 68.76%.

3.4 Checking model fit

We may also wish to see measures of how well our model fits. To do that we will look at the summary() of the logistic regression we fitted above where our response variable is ozone_above_median and explanatory variables are temp, solar_r and wind.

# Find the results using summary() function

summary(logistic_example)

The null deviance (deviance for the null model also called intercept model, with no explanatory variables) is higher than the residual deviance (deviance for the model that we have fitted). This shows that it makes sense to use explanatory variables for fitting the model. Second, the residual deviance is relatively lower (compared to null deviance), which indicates that the log-likelihood of our model is closer to the log-likelihood of the saturated model(best model).

We can also use the chi-squared test to measure the significance of the overall model. This test asks whether the residual model (our model with explanatory variables) fits significantly better than the null model. The test statistic is the difference between the null deviance and residual deviance and the degrees of freedom is the difference between the degrees of freedom of both models. Then we calculate the p-value as shown below:

# Calculate p-value of the chi-squared test 
# where the test statistics id difference between 
# null deviance and residual deviance 
# degrees of freedom is the difference between 
# degrees of freedom of both models 

test_statistic <-  183.59 - 102.57 #81.06
degrees_freedom <-  152-149 # 3

pchisq(test_statistic, df=degrees_freedom, lower.tail=FALSE)

The associated p-value of the chi-square of 81.02 with 3 degrees of freedom is less than the 5% (even 1%) significance level. This suggests that our model as a whole fits significantly better than null model (intercept model with no explanatory variables). We don’t have \(R^2\) in GLM, instead we can calculate pseudo R-squared as shown below:

\[R^2=\frac{\text{null deviance - residual deviance}}{\text{null deviance}} \times100\]

Hence, for our model \(R^2=\frac{183.59-102.57}{183.59} \times 100=44.13%\). So, the explanatory variables wind, solar_r and temp explains 44.13% of the variation in ozone concentration being above the median.

A problem with logistic regression (as well as other GLMs) is overdispersed. Overdispersion occurs when an error (residuals) are more variable than expected from the theorized distribution. In the case of logistic regression, the theorized error distribution is the binomial distribution. The variance of the binomial distribution is a function of its mean (or the parameter \(p\)). If there is overdispersion, the coefficient estimates will be more conﬁdent (smaller standard error values) than they should be. One can detect overdispersion by comparing the residual deviance with the degrees of freedom. As a rule of thumb, if the quotient of residual deviance and its corresponding degrees of freedom is greater than 1, there is overdispersion. If it is less than 1, then it is under-dispersed and if it is approximately 1 then there is no overdispersion, nor any under dispersion. In our case residual deviance/ degrees of freedom = 102.57/149 = 0.688 which shows under dispersion. It can be adjusted by using quasi-likelihood estimation such as when family = quasibinomial.

3.4.1 Exercise

Exercise
Show Answer

What do the null deviance and residual deviance for the model created in the previous exercise (where the response variable was ozone_above_mean and explanatory variables were wind and temp) suggest about the model fit?
Use the chi-squared test to measure the significance of the overall model.

What do the null deviance and residual deviance for the model created in the previous exercise (where the response variable was ozone_above_mean and explanatory variables were wind and temp) suggest about the model fit?

# Fit a generalised linear model 
# using the binomial distribution
# response variable is ozone_above_mean which is created in the previous exercises 
# explanatory variables are wind and temp

logistic_model <- glm(ozone_above_mean ~ wind + temp, 
                      family="binomial", data=airquality_data)

# Find the outputs using summary() 

summary(logistic_model)

The null deviance (deviance for the null model, with no explanatory variables) is higher than residual deviance (deviance for the model that was fit by us). This shows that it makes sense to use explanatory variables for fitting the model.

Use the chi-squared test to measure the significance of the overall model.

# Calculate p-value of the chi-squared test 
# where the test statistics id difference between 
# null deviance and residual deviance 
# degrees of freedom is the difference between 
# degrees of freedom of both models 

pchisq(183.59-104.46, df=2, lower.tail=FALSE)

The associated p-value of the chi-square of 79.13 with 2 degrees of freedom is less than the 5% (even 1%) significance level. This suggests that our model as a whole fits significantly better than a null model (intercept model with no explanatory variables).

4 Gaussian (Normal) Regression

Gaussian regression is another example of a generalised linear model. Gaussian regression is also known as linear regression because the response variables are normally distributed (remember in linear regression, we assume the response variable has a normal distribution). Now we will fit a generalised linear model, where the response variable is ozone and the explanatory variable is wind.

# Fit a generalised linear model 
# on ozone and wind
# using a normal distribution
normal_model <- glm(ozone ~ wind, family="gaussian", data=airquality_data)

# Finding the summary
summary(normal_model)

As you can see, the estimated coefficients are the same as the ones we got when we fitted the linear regression in Chapter 4. Hence, the equation of the linear regression model and the generalised linear model using the normal distribution is the same:

\[\widehat{ozone_{i}= 85.2-4.32 \times wind_{i}\]

Therefore, the interpretation should be the same as well (if you want to learn how to interpret the coefficients, look at Chapter 4). We can also see that the residual standard error of the linear model is 24.4 (from chapter 4) which is the square root of the dispersion parameter (595.14) specified in glm. Furthermore, we can see that the \(R^2\) of both models are the same:

# Calculating R squared 
# By calculating the difference between null deviance and 
# residual deviance and divide it by null deviance 

r_squared <- (normal_model$null.deviance - normal_model$deviance)/normal_model$null.deviance

r_squared

In Chapter 5, when we fitted a linear regression (where the response variable was ozone and the explanatory variable was wind), we saw that the model assumption was violated. The Q-Q plot suggested a normal distribution of residuals, but the endpoints suggested a heavy-tailed distribution. The Scale-Location plot suggested heteroscedasticity (residuals have non-constant variance) and we were getting negative values for ozone concentration which is not feasible. We also saw that the relationship between the response variable (ozone) and the explanatory variable (wind) is not linear, suggesting we need a better model.

4.1 Exercise

Exercise
Show Answer

Fit a Gaussian regression on airquality_data, where the response variable is ozone, and explanatory variables are wind, temp, solar_r and month.

Fit a Gaussian regression on airquality_data, where the response variable is ozone, and explanatory variables are wind, temp, solar_r and month.

# Fit a generalised linear model where the response variable is 
# ozone and explanatory variables are wind, temp, solar_r and month
# using a normal distribution

normal_exercise <- glm(ozone ~ wind + temp + solar_r + month, 
                       family="gaussian", data=airquality_data)

# Finding the summary
summary(normal_exercise)

5 Gamma regression

In gamma regression, the response variable is continuous but only takes positive values, unlike the normal distribution used in linear regression (normal distributions can have both positive and negative values). To learn about the relationship between the response variable ozone and explanatory variable wind in airquality_data, a scatterplot and residuals density plot can be created.

# Specify the x and y axes
# Specify the default geom_point()
# Specify the titles, x and y labels

ggplot2::ggplot(data=airquality_data) +
                aes(x=wind, y=ozone) +
                geom_point() +
                labs(x="Wind (miles per hour)", 
                     y ="Ozone (parts per billion)" ) +
                ggtitle("Wind and Ozone from May till September") +  
                theme(plot.title=element_text(hjust=0.5))

The response variable (ozone) is continuous, but is always positive. Fitting a linear regression will predict negative values for ozone concentration, which is not right since ozone concentration cannot be negative.
The relationship between wind speed and ozone does not look perfectly linear.

A residuals density plot created by fitting a linear regression model will show the suitability of a linear model in this case.

# Fitting a simple linear regression 
# on ozone and wind
simple_model <- lm(formula=ozone ~ wind, data=airquality_data)

# Create a residual density plot 
# of simple linear regression
ggplot2::ggplot() +
         geom_density() +
         aes(x=residuals(simple_model)) +
         labs(x="Residuals", y="Density")

The density plot indicates that values at the right tail of the residual distribution are indeed problematic. Since the residuals are not normally distributed, a linear model is not the best model. Since the response variable is skewed and always positive, we can use a gamma distribution.

Hence, a generalised linear model can be fit where the response variable is ozone and the explanatory variable is wind from the airquality_data. The three most commonly used link functions for gamma GLMs are:

Inverse link: \(g(\mu)=\frac{1}{\mu}\) (by default, this is the link function that glm use for Gamma distribution).
Log link: \(g(\mu)=\log\mu\) (we will use this link function)
Identity link: \(g(\mu)=\mu\)

We will use log link because the response variable (ozone) is a continuous, positive variable and it is right skewed. So our model looks like this:

\[\log(ozone_{i})=\beta_{0}+\beta_{1}*wind_{i}+\epsilon_{i}\]

# Fit a generalised linear model 
# on ozone and wind
# using a gamma distribution
gamma_model <- glm(ozone ~ wind, 
                   family=Gamma(link=log), 
                   data=airquality_data)

# Finding the summary
summary(gamma_model)

The model parameters are now reported on the log scale owing to the link function, this means that we need to exponentiate if we want to see what the values of \(\beta_{0}\) and \(\beta_{1}\) are on the original scale of the data. So,

\[\widehat{ozone_{i}}=\exp(\beta_{0})\times\exp(\beta_{1}\times wind_{i})=100.52\times\exp(-0.0934\times wind_{i})\]

These results are somewhat reassuring. The residual deviance is relatively low (compared to null deviance), which indicates that the log-likelihood of our model is closer to the log-likelihood of the saturated model (‘best possible model’). We can also use the chi-squared test to measure the significance of the overall model.

# Calculate p-value of the chi-squared test 
# where the test statistics id difference between 
# null deviance and residual deviance 
# degrees of freedom is the difference between 
# degrees of freedom of both models 

pchisq(74.757-56.221, df=1, lower.tail=FALSE)

The associated p-value of the chi-square of 18.536 with 1 degree of freedom is less than the 5% (even 1%) significance level. This suggests that our model as a whole fits significantly better than null model (intercept model with no explanatory variables).

Also, for a well-fitting model, the residual deviance should be close to the degrees of freedom (151), which is the case here. However, this was not the case for Gaussian regression where the residual deviance was 89866 for the same degrees of freedom (151). Now, let’s have a look at the diagnostic plots.

# Change the panel layout to 2 x 2
# Look at all 4 plots at once 
par(mfrow = c(2, 2))

# Use plot() function to create diagnostic plots
plot(gamma_model)

The residuals in the residual vs fitted plot are randomly scattered around the zero lines, suggesting that the correct link function was used. The Q-Q plot looks slightly better than before, showing the majority of the points following the straight line. The scale-location plot shows that the deviance residuals do have a constant variance. Lastly, the residuals vs leverage plot show that observation 117 is no longer a potential outlier.

6 Exercise

Exercise
Show Answer

Fit a generalised linear model on airquality_data, where the response variable is ozone and explanatory variables are wind, temp, solar_r and month. Use the Poisson distribution.
Why do you get warning messages?

Fit a generalised linear model on airquality_data, where the response variable is ozone and explanatory variables are wind, temp, solar_r and month. Use the Poisson distribution.

# Fit a generalised linear model 
#  ozone is the response variable and
#  explanatory variables are wind,solar_r,temp and month
# using Poisson distribution

poisson_exercise <- glm(ozone ~ wind + temp + solar_r + month, 
                    family="poisson",
                    data=airquality_data)

# Finding the summary
summary(poisson_exercise)

Why do you get warning messages?

Because ozone is a continuous variable, not a discrete variable and Poisson regression is used to model countable data - 0,1,2,… .

Summary

By now you should:

Know the theory of generalised linear models
Know which distributions are appropriate for different type response variables and their link function
Know the theory of logistic regression
Be able to fit a logistic regression in R, interpret the coefficients, predict and check the model fit
Be able to fit other GLM model in R using Gaussian, Gamma and Poisson distribution
Be able to interpret your results

Reuse

Open Government Licence 3.0 (View License)