Chapter 1 - Introduction

Author

Government Analysis Function and ONS Data Science Campus

Learning Objectives

The goal of this session is to:

  • Introduce dataset in R
  • Introduce data sampling in R
  • Introduce probability distributions

1 Data

1.1 Importing dataset

In this course, we’ll be using some fictitious data based on the structure of the National Energy Efficiency Database, which is maintained by BEIS, to demonstrate the concepts discussed. First, we will import fake_energy_data in R and then assign it to raw_energy_data. (If you are interested in learning more about importing dataset in R, then look at Chapter 4 in Intro to R)

# Importing fake_energy_data from data folder
# assigning it to the name raw_energy_data 

raw_energy_data <- readr::read_csv("../Data/fake_energy_data.csv")

# We can view the data 
raw_energy_data

The data has the following columns:

  • rowid - Contains 1,131 row ids

  • consumption - Gas consumption by households in kilowatt-hours (kWh)

  • income - Income of household in pounds (£)

  • floor_area - Properties’ floor area in square meter \(m^2\) divided into different categories “50 or less”, “51 to 100”, “101 to 150”, “151 to 200”, “201 to 250” and “Over 250”

  • property_age - The age of property in years, divided into 7 different categories “Pre 1919”,“1919 to 1944”,“1945 to 1964”,“1965 to 1983”,“1984 to 1992”,“1993 to 1999” and Post 1999”

  • tenure - The form of tenancy

  • bedrooms - Number of bedrooms in the property

  • property_type - The type of property

  • country - The country where the information is collected from

  • adults - The number of adults in the household

  • region - The region where the information is collected from

First, we will look at the dataset so look at the columns, rows and structure of the dataset

# Look at the first 10 observations 

head(raw_energy_data, n = 10)

Run the code below to look at the general structure of the data.

dplyr::glimpse(raw_energy_data)

1.2 Quiz

  1. What combination of vector types are there in raw_energy_data?
  • Three numeric, six character
  • Four numeric, seven character
  • Three number, seven string
  • No vectors
  1. How many observations are there in raw_energy_data?
  • 2113
  • 1031
  • 1131
  • 130
  1. Four numeric, seven character. This totals to 11 vectors creating our data frame.

  2. There are 1131 observations

1.3 Built in data

We’ll also use the built-in dataset airquality to illustrate some concepts as well. It shows the daily readings of the air quality from May 1, 1973 to September 30, 1973.

# First assign the airquality dataset stored in R 
# to the name airquality_data

airquality_data <- airquality


# To display the first 5 observation

head(airquality_data,n=5)
# Look at last 10 observations 

tail(airquality_data, n = 10)

The data has the following columns:

Ozone - Mean ozone concentration in parts per billion from 1300 to 1500 hours at Roosevelt Island

Solar.R - Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

Wind-Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

Temp-Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.

Month- Daily air quality measurements in New York from May to September in 1973

Day- Daily air quality measurements for all days from May to September so Days column can either have 30 days or 31 days depending on the corresponding month

1.3.1 Exercise

  1. Load the airquality data stored in R and assign it to the variable name airquality_data.

  2. Clean the column names of airquality_data using janitor::clean_names() function or the gsub function.

  3. Look at the general structure of the airquality_data dataset.

  1. Load the airquality data stored in R and assign it to the variable name airquality_data.
# First assign the airquality dataset stored in R 
# to the name airquality_data

airquality_data <- airquality
  1. Clean column names of airquality_data using janitor::clean_names() function or gsub function?
# Using janitor package to clean names

airquality_data <- janitor::clean_names(airquality_data)
# Using gsub to clean names
# Fixed=TRUE will not treat dot/period as a special character
# Can use \\. as well

names(airquality_data) <- tolower(names(airquality_data)) 

names(airquality_data) <- gsub(".", "_", names(airquality_data),fixed = TRUE)
  1. Look at the general structure of the airquality_data dataset?
# Use dplyr::glimpse() or str()

dplyr::glimpse(airquality_data)

We will cover exploratory data analysis in Chapter 2. Now we will cover random sampling in R using these datasets.

2 Random sampling in R

You can create your own data in R by sampling pseudo-random data points from the different types of distribution. First, it’s worth understanding how randomness works in R. R has a generator from which it can generate random numbers, as far as true randomness is possible. By setting the seed number from which the generation occurs, using set.seed(), we can create reproducible sampling and analyses. Objects with the same seed will produce the same ‘random’ pattern, and therefore analyses can be reproduced. You can use the sample() function to generate a random sample from a given population. To do this, you have to give R a vector of data from which the sample will be drawn. We could do this with a sequence that we’ve decided, or a variable from our existing dataset. Here is an example showing how you can generate ten random numbers from a sequence of numbers from 1 to 100 using sample() function.

# We start the list of random numbers at the position 42. 
# The position of the seed is defined by you so it can be anything.

set.seed(42)

# Pick 10 numbers from the sequence 1 to 100. 
# The sample argument 'replace' is default false so we change it to TRUE.

sample(x=seq(1,100), size=10, replace = TRUE)

Note that the seed works from the first operation, so if you just ran the sampling lines above over and over again, you would get different results if you did not again set the seed explicitly beforehand. As replace has been set to TRUE duplicate values can appear - such as 49.

Be aware as well that while a specific seed will allow for reproducible analysis, random numbers generated will only be consistent if the version of R is consistent.

2.1 Exercise

  1. Generate 5 random numbers from a sequence of numbers starting from 1 till 500 where each number was chosen is not replaced. Also, the value of the seed should be set to 653.

  2. Generate 10 random observations from rowid column using the raw_energy_data dataset.

  1. Generate 5 random numbers from a sequence of numbers starting from 1 till 500 where each number was chosen is not replaced. Also, the value of the seed should be set to 653.
# Set the value of seed to be 653

set.seed(653)

# Using sample() function
# Pick 5 numbers from a sequence of numbers starting from 1 till 500. 
# Numbers generated with R version 3.6.3 

sample(x=seq(1,500), size=5, replace = FALSE)
  1. Generate 10 random observations from rowid column using the raw_energy_data dataset.
# Use sample() function to generate 
# 10 observations from rowid column in raw_energy_data 

sample(x=raw_energy_data$rowid, size=10)

3 Probability Distributions

Uncertainty and variability occur in many aspects of life. Probability helps us make sense and quantify uncertainties. Probability also helps us make informed decisions on what is likely to happened, based on patterns in the data collected.

Probability is the chance of an event occurring, defined between 0 and 1. 0 = event will never happen, 1 = event will always happen. Random Variable is the variable that takes values depending on outcomes of random events. It can be discrete if it can take distinct, separate values (gender) or continuous (age, weight). Probability distribution the probability distribution for a random variable describes how the probabilities are distributed over the values of the random variable. For a discrete random variable, x, the probability distribution is defined by a probability mass function, denoted by \(f(x)\). For a continuous random variable, because there is an infinite number of values in any interval (e.g. you can always go to smaller and smaller decimals), we instead consider the probability that a continuous random variable will lie within a given interval. Here the probability distribution is defined by a probability density function, also denoted by \(f(x)\). Both probability functions must satisfy two requirements:

  1. \(f(x)\) must be non-negative for each value of the random variable.

  2. The sum of the probabilities for each value (or integral over all values) of the random variable must equal one.

3.1 Exercise

In RStudio use help() function to find information about “Distributions”.

In RStudio use help() function to find information about “Distributions”.

help("Distributions")

Image showing the help documentation for the help() function

The “Distributions” in the stats package give density/mass functions of several probability distributions. It does this using the naming convention of the distribution and prefixing this with d, p, q or r.

d for “density”, the density function (p.d.f) p for “probability”, the cumulative distribution function q for “quantile”, the inverse c.d.f r for “random”, a random variable having the specified distribution

3.2 Binomial distribution

The binomial distribution applies when you have a series of trials where the outcome of the trial can only take one of two values (e.g. black/white, heads/tails, alive/dead, present/absent, buyers/non-buyers).

rbinom() function generates required number of random values of given probability from a given sample.

# Picking 8 random values from a sample which follows a binomial distribution, 
# where the total number of observations in the sample is 150 
# and the probability of success is 0.4.

rbinom(n=8,size=150,prob=0.4)

dbinom()function gives the probability density distribution at each point.

# Probability of getting 26 heads from 51 tosses of a coin
# where the probability of success (probability of getting a head) is 0.5

dbinom(x=26,size=51,prob=0.5)

pbinom() gives the cumulative probability of an event.

# Probability of getting up to 26 heads from 51 tosses of a coin.
# where the probability of success is 0.5
# Mathematically it is P(X⩽26) where X~Bin(51,0.5)

pbinom(q=26,size=51,prob=0.5)

qbinom() function takes the probability value and gives a number whose cumulative value matches the probability value.

# What is the maximum number of heads 
# in the bottom 25% of coin flip experiments when the coin is tossed 51 times?

qbinom(p=0.25,size=51,prob=0.5)

3.2.1 Exercise

Suppose there are twelve multiple-choice questions in an English class quiz (size=12). Each question has five possible answers, and only one of them is correct.

  1. Find the probability of having exact four correct answers if a student attempts to answer every question at random? [Hint: Calculate P(X=4), where X~Bin(n,p) where n is the total number of observations and p is the probability of success]

  2. Find the probability of having four or less correct answers if a student attempts to answer every question at random? [Hint: Calculate P(X⩽4), where X~Bin(n,p) where n is the total number of observations and p is the probability of success]

Suppose there are twelve multiple-choice questions in an English class quiz (size=12). Each question has five possible answers, and only one of them is correct.

  1. Find the probability of having exact four correct answers if a student attempts to answer every question at random? [Hint: Calculate P(X=4), where X~Bin(n,p) where n is the total number of observations and p is the probability of success]

Since only one out of five possible answers is correct, the probability of answering a question correctly by random is 1/5=0.2.

# dbinom() gives the probability density distribution at each point
# x=4 because we want the probability of exact 4 correct answers
# prob=0.2 because 1/5=0.2
# size is the total number of observations

dbinom(x=4, size=12, prob=0.2) 
  1. Find the probability of having four or less correct answers if a student attempts to answer every question at random? [Hint: Calculate P(X⩽4), where X~Bin(n,p) where n is the total number of observations and p is the probability of success]
# Since we are interested in cumulative probability we will use pbinom()
# q=4 because we want the probability of 4 or less correct answers
# prob=0.2 because 1/5=0.2
# size is the total number of observations

pbinom(q=4, size=12, prob=0.2) 

The probability of four or less questions answered correctly by random in a twelve question multiple-choice quiz is 92.7%.

3.3 Normal (Gaussian) distribution

The normal distribution is arguably the most commonly used distribution in statistics. It is used for continuous variables (e.g. height, weight, etc.). It has several computational properties which make it easy to work with. For instance, it is symmetric, unimodal, and the mean, median, and mode are all equal. That means it can be captured just by specifying the mean and the variance. A standard normal distribution is a normal distribution with a mean of zero and standard deviation of 1. We can plot a histogram of a random sample derived from a standard normal distribution as shown below:

# Create a sample of 50 random numbers which are normally distributed.
# using rnorm() function which generates random numbers from 
# normal distributions 
# n=50 because we want 50 random numbers 
# standard normal distribution has mean=0 and standard deviation 1

normal_sample <- rnorm(n=50, mean = 0, sd = 1)

# Plot the histogram for this sample.
# x is the vector of values 
# main gives the title, xlab and ylab gives the title for x-axis and y-axis

hist(x=normal_sample, main = "Normal Distribution",
     xlab = "Sample from normal distribution")

We can also calculate the probability of an event happening if the sample follows a normal distribution. For example, if we can find the probability of a man whose height is less and equal to 157.5 cm \(P(X⩽157.5)\), when the mean height of men in sample is 170cm and the standard deviation of men height in the sample is 10cm, \(X \sim N(170,10^2)\)

# The default argument is to have lower.tail = TRUE,
# which return the probabilities less than the height specified
# that is P(X⩽157.5)

pnorm(q=157.5, mean = 170, sd = 10, lower.tail = TRUE)

3.3.1 Exercise

Assume that the test scores of a college entrance exam follow a normal distribution. The sample mean test score is 72, and the sample standard deviation is 15.2. What is the percentage of students scoring 84 or more in the exam? [Hint: Calculate P(X≥84), where X~N(72,15.22)].

Assume that the test scores of a college entrance exam follow a normal distribution. The sample mean test score is 72, and the sample standard deviation is 15.2. What is the percentage of students scoring 84 or more in the exam? [Hint: Calculate P(X≥84), where X~N(72,15.22)].

# Since we are interested in cumulative probability we will use pnorm()
# we specify the mean and standard deviation 
# we are looking for the percentage of students scoring higher than 84
# we are interested in the upper tail of the normal distribution

pnorm(q=84, mean=72, sd=15.2, lower.tail=FALSE) 


# Alternative way is shown below since
# X~N(72,15.2^2)
# P(X≥84) = 1-P(X⩽84) =1 - P(X<84) for a continuous variable
# the probability at each point is 0 so P(X=x)=0

1-pnorm(q=84, mean=72, sd=15.2)

The percentage of students scoring 84 or more in the college entrance exam is 21.5%.

3.4 Poisson distribution

The Poisson distribution gives the distribution of the number of individuals, arrivals, events, counts, etc., in a given time/space/unit of counting. In the Poisson distribution, the mean and the variance are the same.

We can generate random numbers derived from Poisson distribution as shown below:

# Select 15 observations from a poisson distribution
# where the average rate within our window is 10
# n is the size of random numbers we want
# lambda is the mean which is 10 in our case

rpois(n = 15, lambda = 10)

Now we will find the probability of making 2 to 4 sales in a week if the average sales rate is 3 per week.

# Let X~Poisson(3)
# Since Poisson distribution is discrete distribution 
# the probability at each point doesnot have to be 0
# So P(2≤X≤4) = P(X≤4)-P(X<2) = P(X≤4)-P(X≤1)
# because P(X<2) = P(X=1)+P(X=0) = P(X≤1)
# Since we are interested in cumulative probability we will use ppois()


ppois(q = 4, lambda = 3, lower.tail = TRUE) - 
  ppois(q = 1, lambda = 3, lower.tail = TRUE)


# Alternative way
# dpois() gives the probability density distribution at each point


dpois(x = 2, lambda = 3) +
  dpois(x = 3, lambda = 3) +
  dpois(x = 4, lambda = 3)

3.4.1 Exercise

If there are ten cars crossing a bridge per minute on average, find the probability of having twelve or more cars crossing the bridge in a particular minute? [Hint: Calculate P(X≥12), where X~poi(10)].

If there are ten cars crossing a bridge per minute on average, find the probability of having twelve or more cars crossing the bridge in a particular minute? [Hint: Calculate P(X≥12), where X~poi(10)].

# Since we are interested in cumulative probability we will use ppois()
# we are looking for twelve or more cars crossing the bridge so
# we are interested in the upper tail of the poisson distribution

ppois(q=12, lambda=10, lower.tail=FALSE) 

If there are ten cars crossing a bridge per minute on average, the probability of having twelve or more cars crossing the bridge in a particular minute is 20.8%.

Summary

By now you should be able to:

  • Import the fake energy (CSV file) dataset and explore the structure of the dataset

  • Familiarise yourself with airquality data and explore the dataset

  • Do random sampling in R

  • Compute probability density/mass functions, cumulative and random sampling of few distributions in R

Next Chapter

In the next chapter we will look at Exploratory data analysis

Reuse

Open Government Licence 3.0 (View License)