Statistics in R

Author

Government Analysis Function and ONS Data Science Campus


1 Introduction

Audience: Anyone interested in learning statistical methods and how to apply them in R. If you only want to learn how to code in R, this is not the right course for you, because it covers theory and coding in equal measure.

Prerequisite: Introduction to R (GSS Capability)

Description: This course was produced by the BEIS data science team and GSS Capability. It is an even split of statistical theory and R code. Each topic follows the same pattern: first the theory is presented, then we show how to apply it in R, then we work through an example, and finally you are asked to complete exercises as we go along.

1.1 Course Materials

The course materials come in several formats:

HTML pages such as the one you are reading now
Data we will use during the course. It’s highly recommended you create a project with a ‘data’ folder and download all the required datasets before starting the course

You can also navigate to the course GitHub repository and clone or fork the website structure for yourself. If you are new to programming and version control, we recommend you stay on the website for the best experience.

2 Learning Objectives

Chapter 1: Introduction

Import the fake energy (CSV file) dataset and explore the structure of the dataset
Familiarise yourself with airquality data and explore the dataset
Do random sampling in R
Compute probability density/mass functions and cumulative distribution functions, and draw random samples, for a few distributions in R
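
As a rough preview of the last two objectives, the sketch below uses only base R. The choice of the normal and binomial distributions, the parameter values and the seed are assumptions made for illustration; the course may use different ones.

```r
set.seed(42)  # arbitrary seed so the random draws are reproducible

# Random sampling: pick 5 row numbers from the built-in airquality data
sampled_rows <- sample(nrow(airquality), size = 5)
airquality[sampled_rows, ]

# Normal distribution: density, cumulative probability and random draws
dnorm(0, mean = 0, sd = 1)     # density at 0
pnorm(1.96, mean = 0, sd = 1)  # P(Z <= 1.96)
normal_draws <- rnorm(10, mean = 0, sd = 1)

# Binomial distribution: probability mass and cumulative probability
dbinom(3, size = 10, prob = 0.5)  # P(X = 3)
pbinom(3, size = 10, prob = 0.5)  # P(X <= 3)
```

R follows the same naming pattern for most distributions: the prefix d gives the density or mass, p the cumulative probability, and r random draws.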

Chapter 2: Exploratory Data Analysis

Inspect data
Find missing values and impute them with mean
Find descriptive statistics of one variable and plot them to visualise data
Standardise variables and filter out data
Find descriptive statistics of multiple variables
Plot two continuous variables and find the relationship between them
Find the correlation between two variables
Plot and interpret scatterplot matrices
Plot categorical variables to explore data
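
Several of these objectives can be sketched with base R on the built-in airquality data. This is only an illustration under stated assumptions: the course itself lists packages such as naniar, visdat and GGally for these tasks, and the choice of Ozone and Temp as example columns is ours.

```r
# Count missing values in each column
colSums(is.na(airquality))

# Impute missing Ozone values with the mean of the observed values
aq <- airquality
aq$Ozone[is.na(aq$Ozone)] <- mean(aq$Ozone, na.rm = TRUE)

# Descriptive statistics for one variable
summary(aq$Ozone)
sd(aq$Ozone)

# Standardise Temp to mean 0 and standard deviation 1
aq$Temp_std <- as.numeric(scale(aq$Temp))

# Correlation between two continuous variables
cor(aq$Ozone, aq$Temp)

# Correlation matrix for several variables, using complete rows only
cor(aq[, c("Ozone", "Solar.R", "Wind", "Temp")], use = "complete.obs")
```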

Chapter 3: Statistical Tests

Know the basic structure of a statistical test
Know the difference between one-sided and two-sided tests in hypothesis testing
Know the critical value approach and the p-value approach
Know statistical tests such as the z-test, t-test and chi-squared test
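
To make the two decision approaches concrete, here is a minimal sketch using a one-sample t-test from base R. The hypothesised mean of 75 and the 5% significance level are arbitrary values chosen for illustration, not values taken from the course.

```r
# Two-sided test: is the mean temperature different from 75 degrees F?
two_sided <- t.test(airquality$Temp, mu = 75, alternative = "two.sided")

# One-sided test: is the mean temperature greater than 75 degrees F?
one_sided <- t.test(airquality$Temp, mu = 75, alternative = "greater")

alpha <- 0.05  # significance level (assumed)

# p-value approach: reject the null hypothesis if the p-value is below alpha
reject_by_p <- two_sided$p.value < alpha

# Critical value approach: reject if |t| exceeds the t quantile
critical_value <- qt(1 - alpha / 2, df = two_sided$parameter)
reject_by_critical <- abs(two_sided$statistic) > critical_value

# Both approaches lead to the same decision
unname(reject_by_p) == unname(reject_by_critical)
```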

Chapter 4: Linear regression

Know how to use the formula in R
Know how to carry out one-way ANOVA
Know how to carry out simple linear regression and multiple linear regression
Know how to add interaction terms in regression analysis
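
The formula interface underlies all of these objectives. The sketch below fits a few models to the built-in airquality data; the choice of response and predictors is an assumption made for illustration only.

```r
# Simple linear regression: Ozone explained by Temp
simple_fit <- lm(Ozone ~ Temp, data = airquality)

# Multiple linear regression: add Wind as a second predictor
multiple_fit <- lm(Ozone ~ Temp + Wind, data = airquality)

# Interaction: Temp * Wind expands to Temp + Wind + Temp:Wind
interaction_fit <- lm(Ozone ~ Temp * Wind, data = airquality)
summary(interaction_fit)

# One-way ANOVA: does mean Ozone differ between months?
# Month is stored as a number, so convert it to a factor first
anova_fit <- aov(Ozone ~ factor(Month), data = airquality)
summary(anova_fit)
```

Rows with a missing Ozone value are dropped automatically by lm() and aov() under R's default na.action setting.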

Chapter 5: Model Adequacy and Model Selection

Know the difference between fitted values and residuals
Understand the regression assumptions
Be able to create and interpret residual diagnostics plots
Be able to use log transformation on response variables
Be able to compare different models using metrics such as \(R^2\), RMSE, RSE, AIC and BIC
Know how to find multicollinearity between variables using VIF
Be able to select the best model using stepwise selection
Be able to select the best subset linear model using the leaps package
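
A base R sketch of several of these ideas follows. It assumes the built-in airquality data and an arbitrary set of predictors; VIF (from the car package) and best subset selection (from the leaps package) are left out so that the example needs no extra packages.

```r
# Drop incomplete rows so every model is fitted to the same observations
aq <- na.omit(airquality)

fit_small <- lm(Ozone ~ Temp, data = aq)
fit_full  <- lm(Ozone ~ Temp + Wind + Solar.R, data = aq)

# Fitted values plus residuals reproduce the observed response
reconstructed <- fitted(fit_full) + residuals(fit_full)

# Comparison metrics for the full model
r_squared <- summary(fit_full)$r.squared
rmse      <- sqrt(mean(residuals(fit_full)^2))
rse       <- sigma(fit_full)

# AIC and BIC for both models (lower is better, same response required)
AIC(fit_small, fit_full)
BIC(fit_small, fit_full)

# Log transformation of the response
fit_log <- lm(log(Ozone) ~ Temp + Wind + Solar.R, data = aq)

# Backward stepwise selection by AIC, starting from the full model
step_fit <- step(fit_full, trace = 0)
formula(step_fit)
```

Note that AIC and BIC values are only comparable between models with the same response, so fit_log should not be compared with fit_full on those metrics directly.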

Chapter 6: Generalised Linear Models

Know the theory of generalised linear models
Know which distributions are appropriate for different types of response variable, and their link functions
Know the theory of logistic regression
Be able to fit a logistic regression in R, interpret the coefficients, predict and check the model fit
Be able to fit other GLMs in R using the Gaussian, Gamma and Poisson distributions
Be able to interpret your results
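
As a hedged illustration of fitting GLMs with glm(), the sketch below builds a hypothetical binary response from the airquality data. The variable high_ozone and its threshold of 60 ppb are invented for this example, and treating Ozone as a count in the Poisson model is likewise only for demonstration.

```r
aq <- na.omit(airquality)

# Hypothetical binary response: 1 if Ozone exceeds 60 ppb (arbitrary threshold)
aq$high_ozone <- as.integer(aq$Ozone > 60)

# Logistic regression: binomial family with the logit link
logit_fit <- glm(high_ozone ~ Temp + Wind, data = aq,
                 family = binomial(link = "logit"))

# Exponentiated coefficients are odds ratios
odds_ratios <- exp(coef(logit_fit))

# Predicted probabilities on the response scale
predicted_prob <- predict(logit_fit, type = "response")

# Poisson regression with the log link (Ozone treated as a count for illustration)
pois_fit <- glm(Ozone ~ Temp + Wind, data = aq,
                family = poisson(link = "log"))
summary(pois_fit)
```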

3 Packages and Data

3.1 Packages

Make sure to install all of these packages before attending this course. The main packages we will be using for this course are:

“tidyverse” (Version 1.3.0) (All chapters)
- “readr” (Version 1.3.1) for importing datasets (All chapters)
- “ggplot2” (Version 3.3.2) for data visualisation (All chapters)
- “dplyr” (Version 1.0.0) for data manipulation (All chapters)
- “broom” (Version 0.7.2) for model comparisons (Chapter 5)
“janitor” (Version 2.0.1) for cleaning column names (Chapter 2)
“visdat” (Version 0.5.3) for visualising missing data (Chapter 2)
“naniar” (Version 0.6.0) for missing values (Chapter 2)
“moments” (Version 0.14) for skewness (Chapter 2)
“GGally” (Version 2.0.0) for correlation plots (Chapter 2)
“BSDA” (Version 1.2.0) for computing z-test statistics (Chapter 3)
“car” (Version 3.0.8) for finding multicollinearity in a regression model (Chapter 5)
“leaps” (Version 3.1) for model selection (Chapter 5)

These packages are mandatory, so please contact your department if you have trouble installing them.
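
One way to check which of the packages above are already installed is sketched below. The variable names are our own, and the install.packages() call is left commented out because installation depends on your department's setup.

```r
# Packages listed above; readr, ggplot2, dplyr and broom are installed with tidyverse
required_packages <- c("tidyverse", "janitor", "visdat", "naniar", "moments",
                       "GGally", "BSDA", "car", "leaps")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  message("Not installed: ", paste(missing_packages, collapse = ", "))
  # Uncomment to install from your configured CRAN mirror:
  # install.packages(missing_packages)
} else {
  message("All course packages are installed.")
}
```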

3.2 Data

We will use the National Energy Efficiency Data, which is stored in the Data folder as fake_energy_data. It has 1,131 rows and 11 columns, where the columns are as follows:

rowid - Contains 1,131 row ids
consumption - Gas consumption by households in kilowatt-hour (kWh)
income - Income of household in pounds (£)
floor_area - Floor area of the property, divided into 6 categories: “50 or less”, “51 to 100”, “101 to 150”, “151 to 200”, “201 to 250” and “Over 250”
property_age - The age of the property, divided into 7 categories: “Pre 1919”, “1919 to 1944”, “1945 to 1964”, “1965 to 1983”, “1984 to 1992”, “1993 to 1999” and “Post 1999”
tenure - The form of tenancy
bedrooms - Number of bedrooms in the property
property_type - The type of property
country - The country where the information was collected
adults - The number of adults in the household
region - The region where the information was collected
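
A minimal sketch of importing this dataset is shown below. The path data/fake_energy_data.csv is an assumption based on the folder and file names mentioned above, so adjust it to match your own project. Base R's read.csv() is used to keep the example free of extra packages; readr's read_csv() can be used the same way.

```r
# Assumed location and file name; change this to match your project
energy_path <- file.path("data", "fake_energy_data.csv")

if (file.exists(energy_path)) {
  energy <- read.csv(energy_path)
  print(dim(energy))  # the course describes 1,131 rows and 11 columns
  str(energy)         # column names and types
} else {
  message("File not found: ", energy_path)
}
```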

We will also use the airquality data that ships with R. It has 153 rows and 6 columns, where the columns are as follows:

Ozone - Mean ozone concentration in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R - Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
Wind - Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp - Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport
Month - Month of the measurement, numbered 5 to 9; the data are daily air quality measurements in New York from May to September 1973
Day - Day of the month of the measurement; each month has either 30 or 31 days of measurements, depending on the month
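
Because airquality is built into R, its structure can be checked without importing anything. A short sketch:

```r
# Load the built-in dataset explicitly (it is also available without this call)
data("airquality")

dim(airquality)    # 153 rows and 6 columns
str(airquality)    # column names and types
head(airquality)   # first six rows

# Number of daily measurements in each month (months are numbered 5 to 9)
table(airquality$Month)
```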

Reuse

Open Government Licence 3.0 (View License)