Additional material

Author

Government Analysis Function and ONS Data Science Campus

Learning Objectives

The goal of this session is to learn about

Proportion tests
Factor levels & contrasts
Best subset linear model using caret

1 Exercise

Exercise
Show Answer

Load in the following packages “tidyverse”, “visdat”, “naniar”, “moments”, “BSDA”, “caret” and “GGally”.
Import the energy_data and airquality_data in your scripts as assign it to the variable names energy_data and airquality_data respectively.

Load in the following packages “tidyverse”, “visdat”, “naniar”, “moments”, “BSDA”, “caret” and “GGally”.

# Loading in packages that we need for this chapter.

library(tidyverse) # For data manipulation
library(BSDA)    # For Z-test
library(visdat)  # For missing values
library(naniar)  # For missing values
library(moments) # For skewness
library(GGally) # For correlation plots
library(caret)  # For model selection

Import the energy_data and airquality_data in your scripts as assign it to the variable names energy_data and airquality_data respectively.

# Importing the energy_data 

energy_data <- readr::read_csv("../Data/energy_data.csv")

#Built in data 

airquality_data <- readr::read_csv("../Data/airquality_data.csv")

2 Proportion testing

The two-proportion z-test is used to determine if the proportions of the two groups are different. The assumptions are:

The sampling method for each group is simple random sampling.
The samples are independent.
Each population is at least 20 times as big as its sample.

We will use the airquality_data to compare the proportion of days which were hotter than 70 degrees Fahrenheit in May and in September. So, we will format our data first as shown below:

# Here we are filtering for the months' May (5) and  September (9)
# Grouping by the month
# Counting how many days were hotter than 70 degrees Fahrenheit

temp_over_seventy <- airquality_data %>% 
  filter(month == 9 | month == 5) %>%
  group_by(month) %>% 
  count(temperature_over_70 = temp >= 70)

# Display results 
temp_over_seventy

There are 7 days where the temperature was over 70 degrees Fahrenheit in May and there were 24 days in September where the temperature was over 70 degrees Fahrenheit. However, the proportional test needs to use in R need to have a specific format where the data should be a 2 by 2 matrix which gives successes and failure so we need to modify the above data into a matrix as shown below:

# Pivot wider reshapes our data from 4 rows and 3 columns to 2 rows and 3 columns
# Since we need 2 columns so we will drop the temperature_over_70 column as it contains boolean values
# Finally as. matrix will turn into a 2x2 matrix


 temp_matrix <- temp_over_seventy %>% 
   pivot_wider(names_from = month, values_from = n) %>%
  select(-temperature_over_70) %>%
  as.matrix()

# To display the data

 temp_matrix

We are left with a 2x2 table which gives the number of days over 70 degrees fahrenheit compared to not; 24 in May and 7 in September respectively. Now we can test whether the proportion of hot days in May is significantly different from the proportion of hot days in September.

Step 1: Construct a two-tailed test where the null hypothesis and alternative hypothesis are shown below:

\(H_{0}:\) The proportion of hot days in May is the same as the proportion of hot days in September

\(H_{a}:\) The proportion of hot days in May is different from the proportion of hot days in September

Step 2: Choose your significance level \(\alpha=0.05\)

# Assigning significance level 

alpha <- 0.05

Step 3: Run two-proportion z-test test by running one function as shown below:

# Set alternative to two.sided because we are running a two-sided test
# Set correct to FALSE for now

proportion_test <- prop.test(x= temp_matrix, alternative = "two.sided", conf.level = 1-alpha, correct = FALSE)

# Display the results

proportion_test

The function returns:

X-squared: the observed value of Pearson’s chi-squared test statistic. We compare this observed value with the critical value. Since we are conducting a two-proportion z-test we will take a square root of this value to obtain the z value.
P-value: the p-value of the test.
95% confidence intervals: confidence interval for the probability of success.
Sample estimate: an estimated probability of success (the proportion of hot days in May and September).

Now we will display the observed test statistics (X-squared) and take square root because the square root of X-squared gives us the value for z statistics.

# Display the observed test statistics
# Output is a list
 
proportion_test$statistic

# Display the observed z-test statistics
# The squareroot of Pearson's chi-squared test statistic is identified as z-test statistics 

sqrt(proportion_test$statistic[[1]])

Step 4: Critical value approach: Determine the critical value to compare with the observed test statistics.

# Finding the critical value 
# Because it is two sided we need to divide 
# significance level by 2

 z_critical <-qnorm(1−alpha/2)
 
# Constructing the Non-rejection region
 
 non_rejection_region <- c(− z_critical, z_critical) 
 non_rejection_region

Critical value approach: The observed test statistics (z = 4.485) is outside the non-rejection region (since the observed test statistic is greater than the upper critical value). The observed statistics are in the rejection region so there is enough evidence to reject the null hypothesis at the 5% significance level, indicating a significant difference between the proportions of hot days in May and September.

Step 5: P-value approach

# Display the p-value 

proportion_test$p.value

# Turn scientific off

format(proportion_test$p.value, scientific = FALSE)

The p-value is much less than \(\alpha=0.05\) and it is even less than 1% this suggests that there is very strong evidence against the null hypothesis. Therefore, we reject the null hypothesis at the 5% significance level. This shows that the difference in the proportion of hot days between May and September is statistically significant.

Step 6: Determine the confidence interval

# Display the 95% confidence interval 

proportion_test$conf.int


# Display the estimated mean difference of temperature in May and September

proportion_test$estimate[[2]]- proportion_test$estimate[[1]]

Confidence Interval: The 95% confidence interval is (0.369,0.780). This shows that we are 95% confident that the difference in the proportion of hot days between May and September is between 0.369 and 0.780. Our best estimate of the difference, the point estimate, is -0.574. Based on this interval, we also conclude that the proportion of hot days in May is significantly different from the proportion of hot days in September because the 95% confidence interval does not include the null value, zero.

Step 7: Interpret the results

There is a large significant difference in the proportion of hot days between the two months at the 5% significance level; therefore they are not the same. We can also use this function to compare the proportion in our sample to the hypothetical proportion in the population from which the sample was drawn (i.e. a one-sample proportion test/ z-test).

Replicate the two-proportion z-test above by comparing the proportion of days which were windy (speed of wind is more than 10 miles per hour) in July and in September in airquality_data.

Filter your data to get a 2x2 table which gives the number of days it was windy in July and September.
Run the two-proportion z-test on the matrix formed above.
Interpret your results using critical value approach or p-value approach.

Replicate the two-proportion z-test above by comparing the proportion of days which were windy (speed of wind is more than 10 miles per hour) in July and in September in airquality_data.

Filter your data to get a 2x2 table which gives the number of days it was windy in July and September.

# Here we are filtering for the months' May (5) and  September (9)
# Grouping by the month
# Counting how many days were hotter than 70 degrees Fahrenheit
# Pivot wider reshapes our data from 4 rows and 3 columns to 2 rows and 3 columns
# Since we need 2 columns so we will drop the windy column as it contains boolean values
# Finally as. matrix will turn into a 2x2 matrix

airqual_exercise <- airquality_data %>% 
  filter(month == 9 | month == 7) %>%
  group_by(month) %>% 
  count(windy = wind >= 10) %>%
  pivot_wider(names_from = month, values_from = n) %>%
  select(-windy) %>%
  as.matrix()

# Display the result

airqual_exercise

Run the two-proportion z-test on the matrix formed above.

# Run the two-proportion z-test

prop.test(airqual_exercise, alternative = "two.sided", conf.level = 0.95, correct = FALSE)

Interpret your results using critical value approach or p-value approach.

P-value approach: The p-value is less than 0.05 so we reject the null hypothesis. There is a significant difference between the proportion of windy days in July and September
Critical value approach: The observed test statistics (z=2.174) is outside the non-rejection region so there is enough evidence to reject the null hypothesis at 5% significance level, indicating a significant difference between the proportion of windy days in July and September.

3 Factor levels & contrasts

Given a factor categorical variable included in a model, R will generate dummy variables automatically for this variable. R will therefore also decide which is the ‘baseline’ condition and which of the dummy variables are included in the model. We can turn our categorical variable into a factor as shown below and check their groups/levels.

# Turn your variable into a factor
month <- as.factor(airquality_data$month)

# Get the levels of the factor
levels(month)

It is important to know what these contrasts are when assessing the use of a categorical variable with more than two groups/levels in a model. If a variable is a factor then it will have these default contrast levels already associated with it. These can be accessed using the contrasts() function.

# Get contrasts
contrasts(month)

As we can see, a table is produced which indicates that, for instance, the first dummy variable takes on the value of 1 if the month is June (6), and 0 otherwise. A value of May (5) corresponds to a zero for each of the other dummy variables; this is the baseline contrast and is the level which is left out if the original categorical variable is included in modelling.

The contrasts function can also be used to manually set the contrast levels. If a variable is an ordered factor, then these levels will have been deliberately determined when the variable was created, for instance using the function factor() with the argument ordered set to TRUE. This will affect the contrasts produced.

It’s also worth being aware that there is an argument called stringsAsFactors in the function read.csv() and data.frame(). When using either of these functions to read in data, this argument is set to TRUE, which means that R will attempt to convert any text variables into factors with levels. This can be very confusing later on when dealing with variables you expect to behave as characters, but will actually behave as factors. Explicitly setting stringsAsFactors = FALSE when reading in data with these functions will prevent this from happening. You can also always call class() on objects to check their type.

3.1 Exercise

Exercise
Show Answer

Find the type of the variable country in energy_data.
Convert it to a factor and find the contrast of country in energy_data.

Find the type of the variable country in energy_data.

# Using typeof() function (covered in intro to R)

typeof(energy_data$country)

Convert it to a factor and find the contrast of country in energy_data.

# Convert character into factor by using as.factor
country<- as.factor(energy_data$country)

# Contrast of the factor

contrasts(country)

4 Best subset linear model using `caret`

We will use K-fold cross-validation, where we first randomly split our dataset into k-subsets (we will split into 10 subsets/folds). Each subset (10%) serves successively as test data set and the remaining subset (90%) as training data. The k-fold cross-validation can be easily computed using the function train() in caret package. The argument method in the train()` function can have three values:

leapBackward: is used to fit a linear regression with backward selection
leapForward: is used to fit a linear regression with forward selection
leapSeq: is used to fit a linear regression with stepwise selection.

# Set seed for reproducibility
set.seed(123)

# Randomly split our dataset into 10 subsets
train_control <- caret::trainControl(method = "cv", number = 10)

# Train the model
# set method as leapBackward
# we have 5 explanatory variable so set nvmax = 1:5
step_model <- caret::train(ozone ~., data = airquality_data,
                           method = "leapBackward", 
                           tuneGrid = data.frame(nvmax = 1:5),
                           trControl = train_control
                           )

# Display the results
step_model$results

We can see that the model with 5 variables has the lowest RMSE and MAE. We can display the best tuning values (nvmax) automatically selected by the train() function, as shown below:

# Check the best model by selecting the best tuning values

step_model$bestTune

This indicates that the best model is the one with nvmax = 5 variables. The function summary() reports the best set of variables for each model size, up to the best 5-variables model.

# See the best set of variables for each model 

summary(step_model$finalModel)

The best 5 variable model is unsuprisingly: ozone ~ solar_r + wind + temp + month + day

Reuse

Open Government Licence 3.0 (View License)