Statistics in R

Author

Government Analysis Function and ONS Data Science Campus


1 Introduction

Audience: Anyone interested in learning statistical methods and how to apply them in R. If you only want to learn how to code in R, this is not the right course, because it is split equally between theory and coding.

Prerequisite: Introduction to R (GSS Capability)

Description: This course was produced by the BEIS data science team and GSS Capability. It is an even split of statistical theory and R code. Each topic follows the same pattern: the theory is presented first, then how to apply it in R to obtain the desired output, then a worked example, and finally exercises for you to complete as we go through the course.

1.1 Course Materials

The course materials come in several formats:

  • HTML pages such as the one you are reading now

  • Data we will use during the course. It’s highly recommended you create a project with a ‘data’ folder and download all the required datasets before starting the course

You can also navigate to the course GitHub repository and clone or fork the website structure for yourself. If you are new to programming and version control, we recommend that you stay on the website for the best experience.

2 Learning Objectives

Chapter 1: Introduction

  • Import the fake energy (CSV file) dataset and explore the structure of the dataset

  • Familiarise yourself with airquality data and explore the dataset

  • Do random sampling in R

  • Compute probability density/mass functions and cumulative probabilities, and draw random samples, for a few distributions in R
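As a rough preview of the last two objectives, the sketch below uses only base R. The seed, sample sizes and distribution parameters are arbitrary choices for illustration, not values taken from the course.

```r
# Fix the seed so the random draws are reproducible (the value is arbitrary)
set.seed(123)

# Simple random sample of 5 values from 1 to 100, without replacement
draws <- sample(1:100, size = 5, replace = FALSE)

# Normal distribution: d = density, p = cumulative probability,
# q = quantile, r = random draws
dens_at_zero <- dnorm(0, mean = 0, sd = 1)     # density at x = 0
prob_below   <- pnorm(1.96, mean = 0, sd = 1)  # P(X <= 1.96)
q_upper      <- qnorm(0.975, mean = 0, sd = 1) # 97.5% quantile
norm_draws   <- rnorm(10, mean = 0, sd = 1)    # 10 random values

# Binomial distribution: probability mass and cumulative probability
pmf_3 <- dbinom(3, size = 10, prob = 0.5)      # P(X = 3)
cdf_3 <- pbinom(3, size = 10, prob = 0.5)      # P(X <= 3)
```

The same d/p/q/r prefix pattern applies to the other distributions in base R, for example dpois() and ppois() for the Poisson distribution.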

Chapter 2: Exploratory Data Analysis

  • Inspect data

  • Find missing values and impute them with mean

  • Find descriptive statistics of one variable and plot them to visualise data

  • Standardise variables and filter out data

  • Find descriptive statistics of multiple variables

  • Plot two continuous variables and find the relationship between them

  • Find the correlation between two variables

  • Plot and interpret scatterplot matrices

  • Plot categorical variables to explore data
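A minimal sketch of a few of these steps, using base R and the built-in airquality data. The course itself lists packages such as naniar and GGally for this chapter; the base R calls here are a stand-in chosen so the example runs without extra installs, and the choice of columns is only illustrative.

```r
# Work on a copy of the built-in airquality data
aq <- airquality

# Count missing values in each column
missing_counts <- colSums(is.na(aq))

# Impute missing Ozone values with the mean of the observed values
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)
aq$Ozone[is.na(aq$Ozone)] <- ozone_mean

# Descriptive statistics for one variable
summary(aq$Temp)
sd(aq$Temp)

# Standardise Temp to mean 0 and standard deviation 1
aq$Temp_z <- as.numeric(scale(aq$Temp))

# Correlation between two continuous variables
temp_wind_cor <- cor(aq$Temp, aq$Wind)
```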

Chapter 3: Statistical Tests

  • Know the basic structure of the statistical test

  • Know the one-sided and two-sided tests in hypothesis testing

  • Know the critical value approach and the p-value approach

  • Know statistical tests such as z-test, t-test, chi-squared test
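To make the one-sided/two-sided distinction and the two decision approaches concrete, here is a small sketch with a one-sample t-test in base R. The hypothesised mean of 75 and the 5% significance level are arbitrary illustrative values, not ones taken from the course.

```r
# Two-sided test: is the mean temperature different from 75 degrees F?
two_sided <- t.test(airquality$Temp, mu = 75, alternative = "two.sided")

# One-sided test: is the mean temperature greater than 75 degrees F?
one_sided <- t.test(airquality$Temp, mu = 75, alternative = "greater")

alpha <- 0.05

# p-value approach: reject the null hypothesis if the p-value is below alpha
reject_p <- two_sided$p.value < alpha

# Critical value approach: reject if |t| exceeds the critical value
t_stat <- unname(two_sided$statistic)
df     <- unname(two_sided$parameter)
t_crit <- qt(1 - alpha / 2, df = df)
reject_crit <- abs(t_stat) > t_crit
```

Both approaches lead to the same decision for a given test; they are two ways of reading the same test statistic.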

Chapter 4: Linear regression

  • Know how to use the formula in R

  • Know how to carry out one-way ANOVA

  • Know how to carry out simple linear regression and multiple linear regression

  • Know how to add interaction terms in regression analysis
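The formula interface is shared by all of these models. The sketch below shows one possible set of formulas on the built-in airquality data; the choice of response and predictors is an assumption made for illustration only.

```r
# Simple linear regression: one predictor
simple_fit <- lm(Ozone ~ Temp, data = airquality)

# Multiple linear regression: several predictors joined with +
multiple_fit <- lm(Ozone ~ Temp + Wind, data = airquality)

# Interaction: Temp * Wind expands to Temp + Wind + Temp:Wind
interaction_fit <- lm(Ozone ~ Temp * Wind, data = airquality)

# One-way ANOVA: Month is numeric, so convert it to a factor first
anova_fit <- aov(Ozone ~ factor(Month), data = airquality)

# Inspect the fitted models
summary(interaction_fit)
summary(anova_fit)
```

Rows with missing Ozone values are dropped automatically by lm() and aov() under the default na.action.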

Chapter 5: Model Adequacy and Model Selection

  • Know the difference between fitted values and residuals

  • Understand the regression assumptions

  • Be able to create and interpret residual diagnostics plots

  • Be able to use log transformation on response variables

  • Be able to compare different models using metrics such as \(R^2\), RMSE, RSE, AIC and BIC

  • Know how to find multicollinearity between variables using VIF

  • Be able to select the best model using stepwise selection

  • Be able to select the best subset linear model using leaps package
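Several of these objectives can be previewed with base R alone. The sketch below is only an approximation of what the chapter covers: it uses step() for stepwise selection, and it leaves out VIF and best subset selection, which the course handles with the car and leaps packages. The models and predictors are illustrative assumptions.

```r
# Drop rows with missing values so every model uses the same observations
aq <- na.omit(airquality)

fit_small <- lm(Ozone ~ Temp, data = aq)
fit_large <- lm(Ozone ~ Temp + Wind + Solar.R, data = aq)

# Fitted values plus residuals reproduce the observed response
fitted_vals <- fitted(fit_large)
resid_vals  <- residuals(fit_large)

# Log transformation of the response variable
fit_log <- lm(log(Ozone) ~ Temp + Wind + Solar.R, data = aq)

# Comparison metrics
r2_small   <- summary(fit_small)$r.squared
r2_large   <- summary(fit_large)$r.squared
rmse_large <- sqrt(mean(resid_vals^2))    # root mean squared error
rse_large  <- summary(fit_large)$sigma    # residual standard error
AIC(fit_small, fit_large)
BIC(fit_small, fit_large)

# Backward stepwise selection by AIC, without printing each step
step_fit <- step(fit_large, direction = "backward", trace = 0)
```

AIC and BIC are only comparable between models with the same response, so fit_log is not included in that comparison.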

Chapter 6: Generalised Linear Models

  • Know the theory of generalised linear models

  • Know which distributions are appropriate for different types of response variable, and their link functions

  • Know the theory of logistic regression

  • Be able to fit a logistic regression in R, interpret the coefficients, predict and check the model fit

  • Be able to fit other GLMs in R using the Gaussian, Gamma and Poisson distributions

  • Be able to interpret your results
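As a hedged illustration of fitting GLMs with glm(), the sketch below builds a binary response from the airquality data. The high_ozone variable is hypothetical and created only for this example, and treating Ozone as a count in the Poisson model is a simplification made to show the syntax.

```r
# Use complete rows only
aq <- na.omit(airquality)

# Hypothetical binary response: 1 if Ozone is above its median, else 0
aq$high_ozone <- as.integer(aq$Ozone > median(aq$Ozone))

# Logistic regression: binomial family with the logit link
logit_fit <- glm(high_ozone ~ Temp + Wind, data = aq,
                 family = binomial(link = "logit"))

# Coefficients are on the log-odds scale; exponentiate for odds ratios
odds_ratios <- exp(coef(logit_fit))

# Predicted probabilities for the observed data
pred_prob <- predict(logit_fit, type = "response")

# Poisson regression with the log link (syntax illustration only)
pois_fit <- glm(Ozone ~ Temp + Wind, data = aq,
                family = poisson(link = "log"))

summary(logit_fit)
```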

3 Packages and Data

3.1 Packages

Make sure to install all of these packages before attending this course. The main packages we will be using for this course are:

  • “tidyverse” (Version 1.3.0) (All chapters)

    • “readr” (Version 1.3.1) for importing datasets (All chapters)

    • “ggplot2” (Version 3.3.2) for data visualisation (All chapters)

    • “dplyr” (Version 1.0.0) for data manipulation (All chapters)

    • “broom” (Version 0.7.2) for model comparisons (Chapter 5)

  • “janitor” (Version 2.0.1) for cleaning column names (Chapter 2)

  • “visdat” (Version 0.5.3) for visualising missing data (Chapter 2)

  • “naniar” (Version 0.6.0) for missing values (Chapter 2)

  • “moments” (Version 0.14) for skewness (Chapter 2)

  • “GGally” (Version 2.0.0) for correlation plots (Chapter 2)

  • “BSDA” (Version 1.2.0) for computing z-test statistics (Chapter 3)

  • “car” (Version 3.0.8) for finding multicollinearity in a regression model (Chapter 5)

  • “leaps” (Version 3.1) for model selection (Chapter 5)

These packages are mandatory for the course. If you have trouble installing them, please contact your department.
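One possible way to check which of the packages listed above are missing and install them is sketched below. It assumes you have access to a CRAN mirror and installs the current versions rather than the specific versions listed; the interactive() guard is an addition for this example so that nothing is installed when the script runs unattended.

```r
# Packages listed above (readr, ggplot2 and dplyr come with tidyverse)
course_packages <- c("tidyverse", "broom", "janitor", "visdat", "naniar",
                     "moments", "GGally", "BSDA", "car", "leaps")

# Work out which packages are not yet installed
installed <- rownames(installed.packages())
missing_packages <- setdiff(course_packages, installed)

# Install only the missing ones, and only in an interactive session
if (interactive() && length(missing_packages) > 0) {
  install.packages(missing_packages)
}

# Report anything that is still to be installed
missing_packages
```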

3.2 Data

We will use the National Energy Efficiency Data, which is stored in the data folder as fake_energy_data. It has 1,131 rows and 11 columns. The columns are as follows:

  • rowid - Contains 1,131 row ids

  • consumption - Gas consumption by households in kilowatt-hour (kWh)

  • income - Income of household in pounds (£)

  • floor_area - Floor area of the property, divided into 6 categories: “50 or less”, “51 to 100”, “101 to 150”, “151 to 200”, “201 to 250” and “Over 250”

  • property_age - The age of the property, divided into 7 categories: “Pre 1919”, “1919 to 1944”, “1945 to 1964”, “1965 to 1983”, “1984 to 1992”, “1993 to 1999” and “Post 1999”

  • tenure - The form of tenancy

  • bedrooms - Number of bedrooms in the property

  • property_type - The type of property

  • country - The countries where the information is collected from

  • adults - The number of adults in the household

  • region - The region the data was collected from
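A minimal sketch of importing this dataset and checking it against the description above. The path data/fake_energy_data.csv is an assumption based on the recommended ‘data’ folder, so adjust it to wherever you saved the file. Base R's read.csv() is used here so the example runs without extra packages; the course lists readr for importing datasets.

```r
# Assumed location and file name; change this to match your project
energy_path <- file.path("data", "fake_energy_data.csv")

energy <- NULL
if (file.exists(energy_path)) {
  # Base R import; readr::read_csv(energy_path) is the tidyverse equivalent
  energy <- read.csv(energy_path, stringsAsFactors = FALSE)

  # Compare the dimensions and column types with the description above
  print(dim(energy))
  str(energy)
} else {
  message("File not found: ", energy_path)
}
```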

We will also use the airquality dataset that is built into R. It has 153 rows and 6 columns. The columns are as follows:

  • Ozone - Mean ozone concentration in parts per billion from 1300 to 1500 hours at Roosevelt Island

  • Solar.R - Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

  • Wind - Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

  • Temp - Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport

  • Month - Month of the measurement, stored as a number. The daily measurements were taken in New York from May to September 1973

  • Day - Day of the month of the measurement. Every day from May to September is covered, so each month has 30 or 31 days
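The description above can be checked directly in R, since airquality ships with the datasets package and needs no import. This is a short sketch of one way to explore it:

```r
# Load the built-in dataset into the workspace
data("airquality")

# Number of rows and columns
dim(airquality)

# Column names and types, and the first few rows
str(airquality)
head(airquality)

# Missing values per column
aq_missing <- colSums(is.na(airquality))

# Number of daily records in each month
days_per_month <- table(airquality$Month)
```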

Reuse

Open Government Licence 3.0 (View License)