# Loading in packages that we need for this chapter.
library(tidyverse) # For data manipulation
library(BSDA) # For Z-test
library(visdat) # For missing values
library(naniar) # For missing values
library(moments) # For skewness
library(GGally) # For correlation plots
library(caret) # For model selection
Additional material
Learning Objectives
The goal of this session is to learn about
Proportion tests
Factor levels & contrasts
Best subset linear model using
caret
1 Exercise
Load in the following packages “tidyverse”, “visdat”, “naniar”, “moments”, “BSDA”, “caret” and “GGally”.
Import the
energy_data
andairquality_data
in your scripts as assign it to the variable names energy_data and airquality_data respectively.
- Load in the following packages “tidyverse”, “visdat”, “naniar”, “moments”, “BSDA”, “caret” and “GGally”.
- Import the
energy_data
andairquality_data
in your scripts as assign it to the variable names energy_data and airquality_data respectively.
# Importing the energy_data
<- readr::read_csv("../Data/energy_data.csv")
energy_data
#Built in data
<- readr::read_csv("../Data/airquality_data.csv") airquality_data
2 Proportion testing
The two-proportion z-test is used to determine if the proportions of the two groups are different. The assumptions are:
- The sampling method for each group is simple random sampling.
- The samples are independent.
- Each population is at least 20 times as big as its sample.
We will use the airquality_data
to compare the proportion of days which were hotter than 70 degrees Fahrenheit in May and in September. So, we will format our data first as shown below:
# Here we are filtering for the months' May (5) and September (9)
# Grouping by the month
# Counting how many days were hotter than 70 degrees Fahrenheit
<- airquality_data %>%
temp_over_seventy filter(month == 9 | month == 5) %>%
group_by(month) %>%
count(temperature_over_70 = temp >= 70)
# Display results
temp_over_seventy
There are 7 days where the temperature was over 70 degrees Fahrenheit in May and there were 24 days in September where the temperature was over 70 degrees Fahrenheit. However, the proportional test needs to use in R need to have a specific format where the data should be a 2 by 2 matrix which gives successes and failure so we need to modify the above data into a matrix as shown below:
# Pivot wider reshapes our data from 4 rows and 3 columns to 2 rows and 3 columns
# Since we need 2 columns so we will drop the temperature_over_70 column as it contains boolean values
# Finally as. matrix will turn into a 2x2 matrix
<- temp_over_seventy %>%
temp_matrix pivot_wider(names_from = month, values_from = n) %>%
select(-temperature_over_70) %>%
as.matrix()
# To display the data
temp_matrix
We are left with a 2x2 table which gives the number of days over 70 degrees fahrenheit compared to not; 24 in May and 7 in September respectively. Now we can test whether the proportion of hot days in May is significantly different from the proportion of hot days in September.
Step 1: Construct a two-tailed test where the null hypothesis and alternative hypothesis are shown below:
\(H_{0}:\) The proportion of hot days in May is the same as the proportion of hot days in September
\(H_{a}:\) The proportion of hot days in May is different from the proportion of hot days in September
Step 2: Choose your significance level \(\alpha=0.05\)
# Assigning significance level
<- 0.05 alpha
Step 3: Run two-proportion z-test test by running one function as shown below:
# Set alternative to two.sided because we are running a two-sided test
# Set correct to FALSE for now
<- prop.test(x= temp_matrix, alternative = "two.sided", conf.level = 1-alpha, correct = FALSE)
proportion_test
# Display the results
proportion_test
The function returns:
X-squared: the observed value of Pearson’s chi-squared test statistic. We compare this observed value with the critical value. Since we are conducting a two-proportion z-test we will take a square root of this value to obtain the z value.
P-value: the p-value of the test.
95% confidence intervals: confidence interval for the probability of success.
Sample estimate: an estimated probability of success (the proportion of hot days in May and September).
Now we will display the observed test statistics (X-squared) and take square root because the square root of X-squared gives us the value for z statistics.
# Display the observed test statistics
# Output is a list
$statistic
proportion_test
# Display the observed z-test statistics
# The squareroot of Pearson's chi-squared test statistic is identified as z-test statistics
sqrt(proportion_test$statistic[[1]])
Step 4: Critical value approach: Determine the critical value to compare with the observed test statistics.
# Finding the critical value
# Because it is two sided we need to divide
# significance level by 2
<-qnorm(1−alpha/2)
z_critical
# Constructing the Non-rejection region
<- c(− z_critical, z_critical)
non_rejection_region non_rejection_region
Critical value approach: The observed test statistics (z = 4.485) is outside the non-rejection region (since the observed test statistic is greater than the upper critical value). The observed statistics are in the rejection region so there is enough evidence to reject the null hypothesis at the 5% significance level, indicating a significant difference between the proportions of hot days in May and September.
Step 5: P-value approach
# Display the p-value
$p.value
proportion_test
# Turn scientific off
format(proportion_test$p.value, scientific = FALSE)
The p-value is much less than \(\alpha=0.05\) and it is even less than 1% this suggests that there is very strong evidence against the null hypothesis. Therefore, we reject the null hypothesis at the 5% significance level. This shows that the difference in the proportion of hot days between May and September is statistically significant.
Step 6: Determine the confidence interval
# Display the 95% confidence interval
$conf.int
proportion_test
# Display the estimated mean difference of temperature in May and September
$estimate[[2]]- proportion_test$estimate[[1]] proportion_test
Confidence Interval: The 95% confidence interval is (0.369,0.780). This shows that we are 95% confident that the difference in the proportion of hot days between May and September is between 0.369 and 0.780. Our best estimate of the difference, the point estimate, is -0.574. Based on this interval, we also conclude that the proportion of hot days in May is significantly different from the proportion of hot days in September because the 95% confidence interval does not include the null value, zero.
Step 7: Interpret the results
There is a large significant difference in the proportion of hot days between the two months at the 5% significance level; therefore they are not the same. We can also use this function to compare the proportion in our sample to the hypothetical proportion in the population from which the sample was drawn (i.e. a one-sample proportion test/ z-test).
2.1 Exercise
Replicate the two-proportion z-test above by comparing the proportion of days which were windy (speed of wind is more than 10 miles per hour) in July and in September in airquality_data
.
Filter your data to get a 2x2 table which gives the number of days it was windy in July and September.
Run the two-proportion z-test on the matrix formed above.
Interpret your results using critical value approach or p-value approach.
Replicate the two-proportion z-test above by comparing the proportion of days which were windy (speed of wind is more than 10 miles per hour) in July and in September in airquality_data
.
- Filter your data to get a 2x2 table which gives the number of days it was windy in July and September.
# Here we are filtering for the months' May (5) and September (9)
# Grouping by the month
# Counting how many days were hotter than 70 degrees Fahrenheit
# Pivot wider reshapes our data from 4 rows and 3 columns to 2 rows and 3 columns
# Since we need 2 columns so we will drop the windy column as it contains boolean values
# Finally as. matrix will turn into a 2x2 matrix
<- airquality_data %>%
airqual_exercise filter(month == 9 | month == 7) %>%
group_by(month) %>%
count(windy = wind >= 10) %>%
pivot_wider(names_from = month, values_from = n) %>%
select(-windy) %>%
as.matrix()
# Display the result
airqual_exercise
- Run the two-proportion z-test on the matrix formed above.
# Run the two-proportion z-test
prop.test(airqual_exercise, alternative = "two.sided", conf.level = 0.95, correct = FALSE)
- Interpret your results using critical value approach or p-value approach.
P-value approach: The p-value is less than 0.05 so we reject the null hypothesis. There is a significant difference between the proportion of windy days in July and September
Critical value approach: The observed test statistics (z=2.174) is outside the non-rejection region so there is enough evidence to reject the null hypothesis at 5% significance level, indicating a significant difference between the proportion of windy days in July and September.
3 Factor levels & contrasts
Given a factor categorical variable included in a model, R will generate dummy variables automatically for this variable. R will therefore also decide which is the ‘baseline’ condition and which of the dummy variables are included in the model. We can turn our categorical variable into a factor as shown below and check their groups/levels.
# Turn your variable into a factor
<- as.factor(airquality_data$month)
month
# Get the levels of the factor
levels(month)
It is important to know what these contrasts are when assessing the use of a categorical variable with more than two groups/levels in a model. If a variable is a factor then it will have these default contrast levels already associated with it. These can be accessed using the contrasts()
function.
# Get contrasts
contrasts(month)
As we can see, a table is produced which indicates that, for instance, the first dummy variable takes on the value of 1 if the month is June (6), and 0 otherwise. A value of May (5) corresponds to a zero for each of the other dummy variables; this is the baseline contrast and is the level which is left out if the original categorical variable is included in modelling.
The contrasts function can also be used to manually set the contrast levels. If a variable is an ordered factor, then these levels will have been deliberately determined when the variable was created, for instance using the function factor()
with the argument ordered
set to TRUE
. This will affect the contrasts produced.
It’s also worth being aware that there is an argument called stringsAsFactors
in the function read.csv()
and data.frame()
. When using either of these functions to read in data, this argument is set to TRUE
, which means that R will attempt to convert any text variables into factors with levels. This can be very confusing later on when dealing with variables you expect to behave as characters, but will actually behave as factors. Explicitly setting stringsAsFactors = FALSE
when reading in data with these functions will prevent this from happening. You can also always call class()
on objects to check their type.
3.1 Exercise
Find the type of the variable
country
inenergy_data
.Convert it to a factor and find the contrast of
country
inenergy_data
.
- Find the type of the variable
country
inenergy_data
.
# Using typeof() function (covered in intro to R)
typeof(energy_data$country)
- Convert it to a factor and find the contrast of
country
inenergy_data
.
# Convert character into factor by using as.factor
<- as.factor(energy_data$country)
country
# Contrast of the factor
contrasts(country)
4 Best subset linear model using caret
We will use K-fold cross-validation, where we first randomly split our dataset into k-subsets (we will split into 10 subsets/folds). Each subset (10%) serves successively as test data set and the remaining subset (90%) as training data. The k-fold cross-validation can be easily computed using the function train()
in caret
package. The argument method
in the train()` function can have three values:
- leapBackward: is used to fit a linear regression with backward selection
- leapForward: is used to fit a linear regression with forward selection
- leapSeq: is used to fit a linear regression with stepwise selection.
# Set seed for reproducibility
set.seed(123)
# Randomly split our dataset into 10 subsets
<- caret::trainControl(method = "cv", number = 10)
train_control
# Train the model
# set method as leapBackward
# we have 5 explanatory variable so set nvmax = 1:5
<- caret::train(ozone ~., data = airquality_data,
step_model method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:5),
trControl = train_control
)
# Display the results
$results step_model
We can see that the model with 5 variables has the lowest RMSE and MAE. We can display the best tuning values (nvmax) automatically selected by the train()
function, as shown below:
# Check the best model by selecting the best tuning values
$bestTune step_model
This indicates that the best model is the one with nvmax = 5 variables. The function summary()
reports the best set of variables for each model size, up to the best 5-variables model.
# See the best set of variables for each model
summary(step_model$finalModel)
The best 5 variable model is unsuprisingly: ozone ~ solar_r + wind + temp + month + day