Editing and Imputation in R
1 Course Outline
This is a short course that covers the practical application of editing and imputation in R. A similar course is available for those who prefer working in Python (Editing and Imputation in Python). For each topic, I will first give the name of the method we are using, then an example of how to use that method in R, followed by some exercises for you to complete.
This course doesn't cover the theory of any of the methods specified. The theory of these methods is covered in Introduction to Editing and Imputation, which is a prerequisite of this course. Also, this course doesn't teach which method is best for editing and imputation; it only shows how different methods can be used in R. Feel free to email Expert Group for their expert opinion.
The topics we will be covering are:
- Reviewing and editing data (data entry, automatic editing and error localization);
- Model-based imputation (mean, ratio of means, mean of ratios and regression imputation); and
- Donor-based imputation (random hot deck, sequential hot deck, hierarchical hot deck, nearest neighbour imputation and predictive mean matching).
1.1 Learning Outcomes
After taking this course you should be able to conduct:
- Methods of reviewing and editing data such as data entry, automatic editing, deductive correction and error localization in R.
- Model-based imputation such as mean, ratio and regression imputation in R.
- Donor-based imputation such as random hot deck, sequential hot deck, hierarchical hot deck, k-nearest neighbour and predictive mean matching imputation in R.
1.2 Prerequisites
- Awareness in Editing and Imputation
- Introduction to Editing and Imputation
- Introduction to R
- R Control Flow, Loops and Functions (if you completed Introduction to R before 18/08/2021 you don't need to complete this course, since before that date these two courses were not split)
1.3 Duration
It is estimated that this course will take around 4-6 hours to work through.
2 Packages and Data
2.1 Packages
This course has been written in R Version 4.0.2
The packages used in this course are:
- tidyverse (Version 1.3.0) for data manipulation.
- editrules (Version 2.9.3) for editing data.
- deducorrect (Version 1.3.7) for editing data.
- VIM (Version 6.1.0) for donor-based imputation.
- mice (Version 3.13.0) for model-based imputation.
- lattice (Version 0.20.41) for creating plots.
2.2 Data
We will be using three data sets in this course, which are saved in the Data folder. The first data set we will be using is the income data.
The income_data data set contains the questionnaire data of the 'Census Income' database formatted as a data frame. The data set contains 841 observations and 10 variables:
- age - age of the individual completing the questionnaire.
- education - the level of education as a factor with levels such as 1st-4th, 5th-6th, .., Doctorate.
- education_num - the level of education in numbers.
- occupation - the individual's occupation as a factor with levels such as Adm-clerical, Sales, .., Tech-support.
- income - the level of income as a factor with levels small (if income < 50k) or large (if income >= 50k).
- sex - the sex of each individual as a factor with levels Female and Male.
- marital_status - the individual's marital status as a factor with levels such as Divorced, Married-AF-spouse, .., Widowed.
- capital_gain - yearly profit from the sale of property or an investment.
- capital_loss - yearly loss from the sale of property or an investment.
- hours_per_week - the number of hours worked per week.
Then we will use the cleaned_income data set, which is the cleaned version of the income data. The cleaning is covered in Section 4, Reviewing and Editing Data. We will use the cleaned_income data set in the Model-based Imputation and Donor-based Imputation sections (Sections 5 and 6).
The third data set stored in the Data folder is business_data, which compares the number of toys sold in two different periods for 10 different brands. The data has 10 observations and 3 variables:
business - the business number (which are different brands selling toys).
previous_period - the number of toys sold in the previous period.
current_period - the number of toys sold in the current period.
Exercise
- Load in the following packages: tidyverse, editrules, deducorrect, VIM, mice and lattice.
- Import the income data from the Data folder and assign it to the variable name income_data.
- Look at the structure of the data set using the dplyr::glimpse() function.
- Look at the first 10 observations of the data set.
- Load in the following packages: tidyverse, editrules, deducorrect, VIM, mice and lattice.
# Loading in the following packages
library(package = tidyverse)
library(package = editrules)
library(package = deducorrect)
library(package = VIM)
library(package = mice)
library(package = lattice)
- Import the income data from the Data folder and assign it to the variable name income_data.
# Load in the income data set from the data folder
# using readr package and read_csv() function
# assign to the name income_data
income_data <- readr::read_csv("../Data/income.csv")
- Look at the structure of the data set using the dplyr::glimpse() function.
# Use dplyr::glimpse() function to look at the structure of the data set
dplyr::glimpse(income_data)
- Look at the first 10 observations of the data set.
# Look at the first 10 observations of the income_data data set
head(income_data, n = 10)
3 Editing and Imputation
Editing
Data Editing is essentially error localisation. That is the process of identifying and flagging the specific values in a record deemed to be an error, rendering them subject to subsequent adjustment or correction through imputation.
Imputation
Imputation refers to estimating values for missing or inconsistent data items.
In this course, we will first cover editing data in R and then we will cover model-based imputation and donor-based imputation.
4 Reviewing and Editing Data
Reviewing data and editing it aims to improve the quality and accuracy of data. In this course, we will cover the following aspects of editing data:
- Data Entry
- Automatic Editing
- Error localization
4.1 Data Entry
This stage involves quality assurance of micro data to identify and resolve obvious, known issues such as duplicate records, incorrect formatting, identifying missing values and looking for invalid values.
4.1.1 Identifying Missing Values
So the first thing we will do is to look at all of the unique values in all variables for income_data
.
# See unique values by each column
sapply(income_data, function(x) unique(x))
As you can see, the income variable has * and - values, which seem to be invalid entries. In R, only NA represents a missing value, so we will assign all of these symbols to NA as shown below:
# Assign all missing values to NA
income_data[income_data == "*" | income_data == "-"] <- NA
Now we will look at the unique values of the income
variable to double-check our results.
# Unique values in income column
unique(income_data$income)
After replacing these symbols, we know we have missing values in our data set. More generally, we can check whether any data set has missing values using the anyNA() function.
# Find any missing values in the income data
anyNA(income_data)
Since the output is TRUE, there are missing values in our data. Now we will use the md.pattern() function from the mice package to calculate the frequencies of the missing data patterns as shown below:
# Understand the missing value pattern
mice::md.pattern(income_data, plot = FALSE)
The output can be understood as follows: 1 and 0 under each variable represent its presence and missing state respectively. The numbers before the first variable (646, 117, 40, 9, 15, 11, 1, 1, 1) represent the number of rows with each pattern. For example, 646 samples are complete; there are 117 samples where hours_per_week is missing and all other values are present. Similarly, there are 40 samples where occupation is missing and all other values are present, and 9 samples where occupation and hours_per_week are missing but all others are present. The last column lists the number of missing entries per pattern, e.g. the first 0 tells us that the 646 complete samples have no missing values, the next 1 shows that the next pattern has one missing variable (hours_per_week), and so on.
The bottom row provides the number of missing entries per variable and the total number of missing cells. In our case, only 4 columns have missing values: 3 missing values in age, 27 in income, 49 in occupation and 139 in hours_per_week.
4.1.2 Visualising Missing Values
A perhaps more helpful visual representation can be obtained using the VIM package as follows:
# Plot the missing values
# col is a vector giving the colours to be used for observed and missing
# numbers give the proportion or frequencies of the different combinations
# sortVars determines whether variables are sorted by the number of missing values
# labels are the labels for the x-axis
# cex.axis is the character expansion factor to be used for the axis annotation
# gap is a numeric value giving the distance between the two plots in margin lines
# ylab are the labels on the y-axis
VIM::aggr(income_data, col = mice::mdc(1:2), numbers = TRUE, sortVars = TRUE,
          labels = names(income_data), cex.axis = 0.7, gap = 3,
          ylab = c("Proportion of missingness", "Missingness Pattern"))
The pink colour shows the missing values while the blue shows the present values. The plot helps us understand that almost 77% of the samples are not missing any information, 13.9% of samples are missing the hours_per_week value while other values are present, 0.12% of samples are missing the hours_per_week, income and age values while other values are present, and the remaining ones show other missing patterns.
Notice: If you are getting a plot which does not fit the screen, then please adjust the cex.axis and gap values and use oma to set the outer margin area according to your screen, e.g. cex.axis = 0.6, gap = 1 and oma = c(10, 5, 3, 3).
We can also find all rows which have missing values as shown below:
# complete.cases() returns a logical vector indicating which cases are complete, i.e. have no missing values
# so we use ! as a negator to return a logical vector where cases are not complete
# select all columns by leaving a space after the comma
missing_values <- income_data[!complete.cases(income_data), ]
# First six rows with missing value
head(x = missing_values)
As you can see, this approach gives us all the rows with missing values, e.g. the first person has a missing occupation. We can impute these values using the information provided in the model-based and donor-based imputation sections.
4.1.3 Finding Duplicates
Now we will check if we have any duplicates in our data set.
# Show the repeat entries
income_data[duplicated(income_data), ]
These 4 rows are duplicated rows. The decision to keep or remove duplicate rows depends on the data set we are using. For example, if a data set contains unique IDs for individuals then we would want to remove duplicated rows. However, in our case, we will choose to keep the duplicated rows because it is possible for two or more individuals to have the same information across all variables. Duplicated rows can be removed in R using the distinct() function, as sketched below.
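For illustration only, here is a minimal sketch of how the duplicates could be dropped; income_no_duplicates is a name introduced just for this example, and we keep the duplicates for the rest of the course.
# Drop exact duplicate rows (for illustration only - we keep the duplicates in this course)
income_no_duplicates <- dplyr::distinct(income_data)

# Compare the number of rows before and after
nrow(income_data)
nrow(income_no_duplicates)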
4.2 Automatic editing
In a real-life data set, there are simple errors that need to be treated automatically. For example, a person's age cannot be negative, an under-aged person cannot possess a driver's license, etc. Such knowledge can be expressed as rules or constraints. In the data editing literature these rules are referred to as edit rules or, in short, edits. The editrules package can be used to define rules on categorical, numerical or mixed-type data sets which each record must obey. Automatic editing is a cheap mode of treatment, but you must be confident that your assumptions are correct, otherwise your automatic corrections could introduce more error.
Now, for demonstration purposes, we will define some restrictions on the age column in income_data using the editrules package.
# Set the rule where age should be non-negative and less than and equal to 40 only
E <- editrules::editset(c("age >= 0", "age <= 40"))
# Display the results
E
The income_data
can be checked against these rules with the violatedEdits()
function.
# Check if the rules are violated
violate_edits <- editrules::violatedEdits(E, dat = income_data)
# Display first six rows
violate_edits[1:6, ]
The first column shows the records (rows). The next column, num1 (age >= 0), shows that all values are FALSE, meaning that the rule of age being positive has not been violated (the age column has positive values for the first 6 records). The column num2 (age <= 40) shows that the 5th record violates the rule because the logical value is TRUE, suggesting that the age of the 5th person is greater than 40. We can confirm our results by looking at the data.
# Filter the 5th person in the data set
income_data[5, ]
As you can see the age of the 5th person is 59 which is indeed greater than 40.
Now we will set some restrictions on the education_num column. The education_num column gives the level of education in numbers, e.g. a person who has done a Masters would have education number 14. The maximum education number is 16, which means the person has a Doctorate. Therefore we will set restrictions on education_num, since the education number should only take non-negative values less than or equal to 16.
# Set the rule where education_num should be non-negative and less than and equal to 16
<- editrules::editset(c("education_num >=0", "education_num <= 16"))
E
# Check if the rules are violated
# the function returns a matrix
edits_violated <- editrules::violatedEdits(E, dat = income_data)
# Display the first 6 rows
edits_violated[1:6, ]
The function violatedEdits() returns a matrix, edits_violated. However, to check all rows where the rules have been violated, we have to convert edits_violated to a data frame and filter the rows where the edit rules have been violated.
# Convert the variable to a data frame
violated_data <- data.frame(edits_violated)

# Filter the rows where either of the edit rules have been violated
# once you create a data frame you can see the column names by using the $ sign
violated_data %>%
  dplyr::filter(num1 == TRUE | num2 == TRUE)
The second column num2 (education_num <= 16) shows that the 5th and 93rd records violate the rule because the logical value is TRUE, suggesting that the education_num of both people is greater than 16. We can confirm our results by looking at the data.
# Filter the 5th and 93rd person in the data set
income_data[c(5, 93), ]
As you can see, the value is 100 (instead of 10) for the 5th person and 90 (instead of 9) for the 93rd person. Obviously, we can never be 100% sure that this 100 should be 10, so we will make the assumption that each person made a typing error and the values should be 10 and 9 respectively. We can use the deducorrect package to correct these values, because this package fixes inconsistent observations by altering invalid values in the record based on information from valid values.
# We specify the rule and the correction
correction_rules <- deducorrect::correctionRules(expression(if (education_num > 16) {
  education_num <- education_num / 10
}))
# Apply simple replacement rules to a data.frame
correction <- deducorrect::correctWithRules(rules = correction_rules, dat = income_data)
# list of alterations
correction$corrections
As you can see, the previous values were 100 and 90, which have been corrected to 10 and 9 respectively. Now we will look at the altered data.
# list with altered data
correction$corrected
The altered data has the correct education_num
values now.
Exercise
- Set a restriction on the capital_gain column in income_data, where the column should have non-negative values less than or equal to 100000.
- Using income_data, check if the rules have been violated and assign the result to the variable violate.
- Convert violate to a data frame using the data.frame() function.
- Filter the rows where either of the edit rules have been violated.
- Do the correction using the deducorrect package (Hint: the correction will be that wherever the capital gain is greater than 100000 we divide it by 1000).
- Set a restriction on the capital_gain column in income_data, where the column should have non-negative values less than or equal to 100000.
# Set the rule where capital_gain should be non-negative and less than and equal to 100000
<- editrules::editset(c("capital_gain >=0", "capital_gain <= 100000"))
E
# Display the results
E
- Using income_data, check if the rules have been violated and assign the result to the variable violate.
# Check if the rules are violated
violate <- editrules::violatedEdits(E, dat = income_data)
# Display first 6 rows
violate[1:6, ]
- Convert violate to a data frame using the data.frame() function.
# Convert violate to a dataframe
violate_data <- data.frame(violate)
- Filter the rows where either of the edit rules have been violated.
# Filter the rows where either of the edit rules have been violated
# Once you create a data frame you can see the column names by using $ sign
violate_data %>%
  dplyr::filter(num1 == TRUE | num2 == TRUE)
Records 16 and 273 violate the rule of capital_gain being less than or equal to 100000. To confirm, we will look at the actual income_data.
# select the 16th and 273rd record from the income_data and the capital_gain column
income_data[c(16, 273), 8]
Both values for capital_gain are more than 100000.
- Do the correction using the deducorrect package (Hint: the correction will be that wherever the capital gain is greater than 100000 we divide it by 1000).
# We specify the rule and the correction
# Since the restriction is that the capital_gain should be non-negative and less than and equal to 100000
# we will specify the rule that wherever the capital gain is greater than 100000 we divide it by 1000
corr_rules <- deducorrect::correctionRules(expression(if (capital_gain > 100000) {
  capital_gain <- capital_gain / 1000
}))
# Apply simple replacement rules to a data.frame.
corr <- deducorrect::correctWithRules(rules = corr_rules, dat = income_data)
# list of alterations
corr$corrections
4.3 Error localization
So far we have seen how to set restrictions and correct the rows which violate them. However, in the real world we work with large data sets, which means more variables and therefore more rules. Rather than specifying one condition at a time, we can keep a text file containing all conditions for all data types (i.e. numeric, categorical and mixed data). With the editrules package, we can read in the text file with the rules and restrictions and check whether any have been violated.
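We cannot see the contents of edits.txt here, but as a rough, hypothetical illustration (not the real contents of the file in the Data folder), an editrules text file simply lists numeric, categorical and mixed (conditional) edits written as R expressions:
# Hypothetical example of an edit rules file - the real rules live in ../Data/edits.txt
# numeric edits
age >= 0
age <= 120
hours_per_week <= 100
# categorical edit (the variable may only take these values)
sex %in% c("Female", "Male")
# mixed edit (a conditional rule combining numeric and categorical variables)
if (age < 16) income == "small"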
# Read the text file containing rules
E <- editrules::editfile("../Data/edits.txt")
# Check if the rules are violated
violated_rules <- editrules::violatedEdits(E, income_data)
# Display the summary
summary(violated_rules)
Here, the edit labelled dat7
is violated by 543 records (64.6% of all records). Violated edits are sorted from most to least often violated. We can also check the interconnection of rules by creating a plot as shown below:
# A graph showing the interconnection of rules and restriction
plot(E)
Here we can see the interconnection of restrictions where blue circles represent variables and yellow boxes represent rules. The lines indicate which restrictions are associated with what variables.
As you can see this interconnectivity of edits is what makes error localization difficult. For example, the graph shows that a record violating edit mix5
may contain an error in education_num
and/or education
and/or age
. Suppose that we alter education_num
so that mix5
is not violated anymore. Then we are carrying the risk of violating up to five other edits containing education_num
.
If we have no other information available but the edit violations, it makes sense to minimize the number of fields being altered. This principle is commonly referred to as the principle of Fellegi and Holt (covered in Introduction to Editing and Imputation).
# First two records from the income_data
income_data[1:2, ]
# Check if they are violated
editrules::violatedEdits(E, income_data[1:2, ])
The first record violates dat6
and dat7
while the second record violates dat6
only. We use the localizeErrors()
function, with a mixed-integer programming approach to find the minimal set of variables to adapt.
# find the minimal set of variables to adapt
# method can be errorlocalizer ("bb") or mixed integer programming ("mip")
localised_error <- editrules::localizeErrors(E, dat = income_data[1:2, ], method = "mip")

# adapt indicates the minimal set of variables to be altered in each record
localised_error$adapt
The array localised_error$adapt indicates with a boolean value which variable in which record must be changed. So for the first person, the variables education and occupation need to be changed, and for the second person, the variables education and education_num need to be changed.
5 Model Based Imputation
Model-based imputation is the approximation of missing values using assumptions about the distribution of the data (which includes mean and median imputation) or using assumptions about the relationship between variables or units (which includes ratio and regression imputation).
There are two types of model-based methods:
Deterministic: those methods which are based on models and, when repeated, always impute the same value.
Stochastic: methods which include a random element. When repeated these may impute a different value each time.
First we will cover deterministic imputation: mean imputation, ratio imputation and regression imputation. For this section and later sections we will use the cleaned data set cleaned_income stored in the Data folder.
# Import cleaned_income data and assign it to cleaned_income
cleaned_income <- readr::read_csv("../Data/cleaned_income.csv")
5.1 Mean imputation
We will use the mice package to impute the missing values in the cleaned_income data. mice stands for Multivariate Imputation by Chained Equations and it is one of the packages most commonly used by R users. MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on the observed values and can be predicted using them. It imputes data on a variable-by-variable basis by specifying an imputation model per variable. We will cover mean imputation first.
# Use mice function within mice package
# method used is "mean" and m = 1 means single imputation
# maxit is a scalar giving the number of iterations (default is 5)
mean_impute <- mice::mice(cleaned_income, method = "mean", m = 1, maxit = 5)
The mice
package imputes the missing values in all variables e.g. in our case the imputed variables are age
and hours_per_week
. The method we used is “mean”, but you can use other imputation methods by running the methods(mice)
command which gives a list of the available imputation methods. Since “mean” imputation can only be used for numerical data, that is why the imputed variables are numerical i.e. age
and hours_per_week
. If we use methods such as “polytomous logistic regression”, then we can impute missing values for categorical data such as occupation
.
We can see the rows which had the missing value and their corresponding imputed values as shown below:
# Selecting the rows in the age column which have imputed values
mean_impute$imp$age
As you can see, rows 37, 76 and 830 had missing values in the age column and their imputed value is the mean of the age column, which is 38.76.
# Selecting the first 6 rows in hour_per_week column which have imputed values
head(mean_impute$imp$hours_per_week)
Now we can get back the completed data set using the complete()
function. The missing values will be replaced with the imputed values in the data set.
# Store imputed data using complete() function
mean_data <- mice::complete(mean_impute)
So now we can find the summary statistics of hours_per_week
column in both data sets.
# Summary statistics of hours per week for original data
summary(cleaned_income$hours_per_week)
# Summary statistics of hours per week for imputed data
summary(mean_data$hours_per_week)
Making the comparison we see that only the values of the first and third quartile have changed. The minimum and maximum values are the same and so, of course, is the mean. But of more interest is what happens to the sample standard deviation; its value for the imputed data can be found using:
# Standard deviation of hours per week for imputed data
sd(mean_data$hours_per_week, na.rm = TRUE)
# Standard deviation of hours per week for original data
sd(cleaned_income$hours_per_week, na.rm = TRUE)
The value for the imputed data, 11.32, is lower than that for the original data, 12.40. This was expected, since mean imputation reduces the variance of the imputed variable: the imputation adds more observations equal to the mean and so reduces the average deviation from the mean of the data.
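A quick way to convince yourself of this shrinkage is a toy example; the short vector below is made up for illustration and is not part of the course data:
# Toy vector with two missing values
x <- c(10, 20, 30, NA, NA)

# Mean imputation: replace every NA with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# The standard deviation after imputation is smaller than that of the observed values
sd(x, na.rm = TRUE)
sd(x_imputed)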
Exercise
- Using the mice() function, impute the missing values in the cleaned_income data.
- Store the imputed data using the complete() function.
- Look at the difference in the standard deviation of the age column in the imputed data and the original data.
- Using the mice() function, impute the missing values in the cleaned_income data.
# Use mice function within mice package where the method used is "mean"
# and m = 1 means single imputation
# maxit is a scalar giving the number of iterations (default is 5)
mean_impute <- mice::mice(cleaned_income, method = "mean", m = 1, maxit = 1)
- Store the imputed data using the complete() function.
# Store imputed data using complete() function
mean_data <- mice::complete(mean_impute)
- Look at the difference in the standard deviation of the age column in the imputed data and the original data.
# Standard deviation of the age for original data
sd(cleaned_income$age, na.rm = TRUE)
# Standard deviation of the age for imputed data
sd(mean_data$age, na.rm = TRUE)
As expected the standard deviation of the age
column in the imputed data is lower than the original one.
5.2 Ratio imputation
Now we will look at the ratio imputation. There are two types of ratio imputation used, mainly for business surveys:
- Mean of ratios (e.g. used for the vacancy survey)
- Ratio of means (e.g. used for the financial services survey)
For this section only we will use a short data set called business_data
stored in the data folder.
# Import business_data
business_data <- readr::read_csv("../Data/business_data.csv")
# Display the data
business_data
First, we will impute the missing values using the ratio of means imputation and then we will cover the mean of ratios imputation.
5.2.1 Ratio of means imputation
The steps for the ratio of means imputation are as follows:
- Identify businesses that responded in both periods.
- Sum the current period values for businesses that responded in both periods.
- Sum the previous period values for businesses that responded in both periods.
- Calculate the growth factor – the sum of the values for the current period divided by the sum of the values for the previous period.
- For each business with missing values in the current period, multiply the previous period value by the growth factor.
# Take the business data and select all rows with no missing value
# Divide the sum of the current_period column by the sum of the previous_period column
# to obtain the growth factor
# pull out the value of the growth factor
ratio_sum <- business_data %>%
  tidyr::drop_na() %>%
  dplyr::summarise(ratio_sum = sum(current_period) / sum(previous_period)) %>%
  dplyr::pull()

# Display the value
ratio_sum
Using this value, we will impute the missing value in the current_period
column by multiplying the growth factor with the previous period value as shown below:
# Impute the missing value in the current period column using if_else() function
# we are updating the current_period so use mutate
# specify wherever the missing value is true, replace that with ratio * previous value
# otherwise keep the original value
business_data %>%
  dplyr::mutate(current_period = dplyr::if_else(
    condition = is.na(current_period),
    true = (ratio_sum * previous_period),
    false = current_period))
5.2.2 Mean of ratios imputation
The steps for mean of ratios imputation are as follows:
- Identify businesses that responded in both periods.
- Calculate the ratio for each business that responded in both periods – divide the current period value by the previous period value.
- Sum the ratios.
- Calculate the growth factor – this is the sum of ratios divided by the number of businesses that responded in both the current period and the previous period.
- For each business with missing values for the current period, multiply the previous period value by the growth factor.
# Take the business data and select all rows with no missing value
# create a column of ratios which is the current period divided by the previous period
# Find the sum of ratios and divide it by the number of businesses to get the growth_factor
# pull out the value of the growth factor from the data set
growth_factor <- business_data %>%
  tidyr::drop_na() %>%
  dplyr::mutate(ratio = current_period / previous_period) %>%
  dplyr::summarise(growth_factor = sum(ratio) / n()) %>%
  dplyr::pull()

# Display the value
growth_factor
Using this value, we will impute the missing value in the current_period
column by multiplying the growth factor with the previous period value as shown below:
# Impute the missing value in the current period column using if_else() function
# we are updating the current_period so use mutate
# specify wherever the missing value is true, replace that with growth * previous value
# otherwise keep the original value
business_data %>%
  dplyr::mutate(current_period = dplyr::if_else(
    condition = is.na(current_period),
    true = (growth_factor * previous_period),
    false = current_period))
5.3 Regression imputation
Now, let’s apply a regression imputation to our cleaned_income
data. As covered before there are two types of regression imputation.
- Deterministic: this method imputes the value predicted by the regression model, so the imputed values lie on the regression line.
- Stochastic: an improvement on the deterministic method that aims to preserve the variability of the data by adding a random error (residual) term to the predicted value.
# Deterministic regression imputation
# method used is "norm.predict"
# m = 1 means single imputation
<- mice::mice(cleaned_income, method = "norm.predict", m = 1) deter_impute
The variables that have been imputed are age and hours_per_week.
# Store data
det_data <- mice::complete(deter_impute)
We can use almost the same code for stochastic regression imputation. We only have to change the method from method = "norm.predict"
to method = "norm.nob"
as shown below:
# Stochastic regression imputation
# method used is "norm.nob"
# m = 1 means single imputation
<- mice::mice(cleaned_income, method = "norm.nob", m = 1)
stoch_imp
# Store data
<- mice::complete(stoch_imp) stoch_data
Let’s graphically check how well our missing data imputations worked:
# Graphical comparison of deterministic and stochastic regression imputation
# Both plots in one graphic
par(mfrow = c(1, 2))
# Deterministic regression imputation plot
# Plot of observed values
plot(cleaned_income$age[!is.na(cleaned_income$hours_per_week)],
     det_data$hours_per_week[!is.na(cleaned_income$hours_per_week)],
     main = "Deterministic Regression",
     xlab = "Age", ylab = "Hours per week")

# Plot of imputed values
points(cleaned_income$age[is.na(cleaned_income$hours_per_week)],
       det_data$hours_per_week[is.na(cleaned_income$hours_per_week)],
       col = "red")

# Regression slope
abline(lm(hours_per_week ~ age, det_data), col = "#1b98e0", lwd = 1.5)

# Stochastic regression imputation
# Plot of observed values
plot(cleaned_income$age[!is.na(cleaned_income$hours_per_week)],
     stoch_data$hours_per_week[!is.na(cleaned_income$hours_per_week)],
     main = "Stochastic Regression",
     xlab = "Age", ylab = "Hours per week")

# Plot of imputed values
points(cleaned_income$age[is.na(cleaned_income$hours_per_week)],
       stoch_data$hours_per_week[is.na(cleaned_income$hours_per_week)],
       col = "red")

# Regression slope
abline(lm(hours_per_week ~ age, stoch_data), col = "#1b98e0", lwd = 1.5)
The black points are observed values, whereas the imputed values are represented by red points. The graph visualizes the main drawback of deterministic regression imputation, i.e. the imputed values (red points) are way too close to the regression slope (blue line)!
In contrast, the imputation by stochastic regression worked much better. The distribution of the imputed values is similar to that of the observed values and, hence, much more realistic. As such, this method, and in particular the idea of drawing from the residuals, forms the basis of more advanced imputation techniques. However, as covered in the Introduction to Editing and Imputation course, there are some disadvantages of using stochastic regression. For example, it can lead to implausible values and assumes the data are homoscedastic. The predictive mean matching method covered in Section 6.5 is capable of solving the issues related to stochastic regression imputation.
Exercise
- Impute the missing values of continuous data in the cleaned_income data using stochastic regression and assign the result to stoch_impute.
- Use stoch_impute$imp$hours_per_week to check which rows have been imputed.
- Impute the missing values of continuous data in the cleaned_income data using stochastic regression and assign the result to stoch_impute.
# Stochastic regression imputation
# method used is "norm.nob"
# m = 1 means single imputation
<- mice::mice(cleaned_income, method = "norm.nob", m = 1) stoch_impute
- Use stoch_impute$imp$hours_per_week to check which rows have been imputed.
# Select the imputed value in hours_per_week column
# select only first 6 values
head(stoch_impute$imp$hours_per_week)
It shows the rows which have missing values and their corresponding imputed value. As covered in the Introduction to editing and imputation course, stochastic regression may impute different values each time.
6 Donor Based Imputation
Now we will use donor-based imputation to fill in missing values for a given unit (the recipient) by copying the corresponding observed values of another unit (the donor). We will cover these types of donor-based imputation:
- Random hot deck
- Sequential hot deck
- Hierarchical hot deck
- Nearest neighbour imputation
- Predictive mean matching
The theory for each type of donor-based imputation is covered in Introduction to editing and imputation.
6.1 Random hot deck
In random hot deck imputation, imputation classes (also known as donor pools or adjustment cells) are formed based on matching variables. For each recipient unit i in a given imputation class, the potential donors are units within that same class with the target variable y observed. Of these potential donors, one is selected at random, typically through equal-probability sampling, and used to impute the recipient. To conduct the hot deck imputation I will be using a subset of the cleaned_income data set which has only 10 rows and hence should make this section easier to understand. We will then use the full data set later on when we cover hierarchical hot deck imputation.
# Create a shorter data with 10 rows and all columns
shorter_data <- cleaned_income[1:10, ]
# Display the results
shorter_data
So we want to impute the missing value in the occupation variable for the first person (the recipient) using the matching variable income.
In R we can use the hotdeck() function from the VIM package to carry out hot deck imputation. We first specify the data set, then the variable which has a missing value to impute in the argument variable (in our case occupation), and then the matching variable in domain_var (in our case income).
# Using hotdeck function in VIM package to carry out random hot deck imputation
# variable take the column that we want to impute the missing value of
# domain_var is the variables for building domains and impute within these domains
VIM::hotdeck(data = shorter_data, variable = "occupation", domain_var = "income")
So the value for occupation can be "Adm-clerical", "Other-service" or "Exec-managerial", since the donors are the 4th, 7th, 8th and 9th person (because they have a small income, and the person whose missing value we are imputing also has the value small for the income variable). Since this is random hot deck imputation, the donor is selected at random, so the imputed value can be any of Adm-clerical, Other-service or Exec-managerial.
Also, there is a new column occupation_imp
which returns a boolean value TRUE if the missing value in the occupation variable has been imputed and FALSE otherwise.
We can have more than one matching variable as well and we will cover that later in the hierarchical hot deck imputation section (Section 6.3).
Exercise
- Carry out random hot deck imputation to impute the missing value in the hours_per_week variable for the 4th person in the shorter data using the matching variable sex. Remember shorter data contains the first 10 observations from the cleaned_income data set.
- Carry out random hot deck imputation to impute the missing value in the hours_per_week variable for the 4th person in the shorter data using the matching variable sex. Remember shorter data contains the first 10 observations from the cleaned_income data set.
# Using hotdeck function in VIM package to carry out random hot deck imputation
# variable take the column that we want to impute the missing value of
# domain_var is the variables for building domains and impute within these domains
VIM::hotdeck(data = shorter_data, variable = "hours_per_week", domain_var = "sex")
Since the 4th person is male, the donor can be either the 6th or 10th person, so the imputed value can be either 40 or 45. To see the results more easily you can filter the data as shown below:
# Filter all men in data
# Select only sex and hours_per_week column
shorter_data %>%
  dplyr::filter(sex == "Male") %>%
  dplyr::select(sex, hours_per_week)
So the imputed value can be either 40 or 45.
6.2 Sequential hot deck
In sequential hot deck imputation, the data set is not explicitly split into groups. Instead, one goes over the records in the data set in order and imputes each missing value by the last previously encountered observed value for a unit with the same scores on the matching variables. Thus, the recipient is imputed using as a donor the last unit with y
observed that belongs to the same imputation class and that comes before the recipient in the data file.
We will impute the missing value in the occupation
variable using income
as the matching variable. But this time we will do sequential hot deck imputation where we will order the hours_per_week
variable. If we don’t order the hours_per_week
then R will only carry out random hot deck imputations.
# Using hotdeck function in VIM package to carry out sequential hot deck imputation
# variable take the column that we want to impute
# domain_var is the matching variable
# variables for sorting the data set before imputation (not overlapping with variable)
VIM::hotdeck(data = shorter_data, variable = "occupation", domain_var = "income",
             ord_var = "hours_per_week")
Now the imputed value for the occupation
variable is “Adm-clerical” because when we order the hours_per_week
variable, the structure of the data looks like this:
# Arranging the hours_per_week column in the ascending order
shorter_data %>%
  dplyr::arrange(hours_per_week)
As you can see when we order the hours_per_week
variable, the structure of the data changes where the recipient (the person with missing value in the occupation
variable) is the 6th person and the donor is the 4th person with the same income. (Remember: in sequential hot deck imputation the data set is not explicitly split into groups instead, one goes over the records in the data set in order and imputes each missing value by the last previously encountered observed value.)
Exercise
- Carry out sequential hot deck imputation to impute the missing value in the occupation variable for the 1st person in the shorter data using the matching variable sex while ordering by the education_num column. Remember shorter data contains the first 10 observations from the cleaned_income data set.
- Carry out sequential hot deck imputation to impute the missing value in the occupation variable for the 1st person in the shorter data using the matching variable sex while ordering by the education_num column. Remember shorter data contains the first 10 observations from the cleaned_income data set.
# Using hotdeck function in VIM package to carry out sequential hot deck imputation
# variable take the column that we want to impute
# domain_var takes the matching variable
# variables for sorting the data set before imputation
VIM::hotdeck(data = shorter_data, variable = "occupation",
             domain_var = "sex",
             ord_var = "education_num")
As you can see the imputed value is now “Adm-clerical” because when we order the education_num
variable, the structure of the data looks like this:
# Arranging education_num column in ascending order
shorter_data %>%
  dplyr::arrange(education_num)
So the donor is the 7th person, who is female, and the imputed value for occupation will be "Adm-clerical".
6.3 Hierarchical hot deck
If a single donor is used many times to impute many records, this could lead to a lack of precision in survey estimates. One way to avoid the problems associated with sequential hot-deck imputation is hierarchical hot-deck imputation. This method sorts recipient and donor units into a large number of imputation classes based on a detailed categorisation of a large set of matching variables.
Donor units are then matched with recipients in the smallest class first. If no match is found that class is collapsed with the next one, and so on until a donor is found.
We will impute the missing value in the occupation variable of the first person in the cleaned_income data set using a hierarchy of matching variables constructed from the observed data for that first person: income, education, education_num, hours_per_week, sex, marital_status, capital_gain, capital_loss and age, with age being the first variable dropped if no donor is found.
There are various methods to build a hierarchy like this for each data set, but the most commonly used methods are decision trees.
Now we will impute the missing value in the occupation variable for the first person using all of the matching variables in this hierarchy, so we want the donor to have the same value on all of these variables as the first person (the recipient).
# specify the data and the variable that needs to be imputed
# specifying more than one matching variable now
hierarchy_impute <- VIM::hotdeck(data = cleaned_income, variable = "occupation",
                                 domain_var = c("income", "education", "education_num", "hours_per_week",
                                                "sex", "marital_status", "capital_gain", "capital_loss",
                                                "age"))
# Display first 6 observations
head(hierarchy_impute)
As you can see the missing value hasn’t been imputed because no donor was found that had the same characteristics on all variables as the recipient. So we will work backwards on the hierarchy. Now we want to select donors with the same values on all variables except for age.
# Removing the age variable from the previous code
# so selecting a donor who has the same characteristics as the recipient except for age
hierarchical_impute <- VIM::hotdeck(data = cleaned_income, variable = "occupation",
                                    domain_var = c("income", "education", "education_num", "hours_per_week",
                                                   "sex", "marital_status", "capital_gain", "capital_loss"))
# Display first 6 observations
head(hierarchical_impute)
The imputed value for occupation is "Prof-specialty" because the donor was a female with a small income and a Masters degree (education_num of 14), never married and working 40 hours per week with 0 capital gain and loss, which matches the recipient.
We can also filter the data to see how many donors there are, as shown below:
# Filter rows that have the same values on all variables except age
cleaned_income %>%
  dplyr::filter(income == "small" & education == "Masters" & education_num == 14 &
                hours_per_week == 40 & sex == "Female" & marital_status == "Never-married" &
                capital_gain == 0 & capital_loss == 0)
So there is one donor with the same characteristics as the recipient except for age.
Exercise
- Carry out hierarchical hot deck imputation to impute the missing value in the hours_per_week variable for the 4th person in the cleaned_income data using the matching variables income, education, education_num, occupation, marital_status and sex.
- Carry out hierarchical hot deck imputation to impute the missing value in the hours_per_week variable for the 4th person in the cleaned_income data using the matching variables income, education, education_num, occupation, marital_status and sex.
# Using cleaned_income data
# specifying all matching variables in domain_var
<- VIM::hotdeck(data = cleaned_income, variable = "hours_per_week",
hierarchical_example domain_var = c("income","education","education_num","occupation","marital_status",
"sex"))
# Display first 6 observations
head(hierarchical_example)
The imputed value for hours_per_week for the recipient can be either 45 or 30 because the donors are men with a small income, a Bachelors degree with education_num of 13, never married and working in an Adm-clerical occupation, which are the same characteristics as the recipient. We can see how many donors there are by filtering our data as shown below:
# Filter rows that have the same values on all of the matching variables
# select all columns where hours_per_week comes first
cleaned_income %>%
  dplyr::filter(income == "small" & education == "Bachelors" & education_num == 13 &
                occupation == "Adm-clerical" & sex == "Male" & marital_status == "Never-married") %>%
  dplyr::select(hours_per_week, everything())
So as we can see there are two donors and the imputed value can be 45 or 30.
6.4 K-Nearest Neighbours (KNN)
Nearest neighbour donor imputation removes the restriction that the donor and the recipient have identical values for all matching variables. Instead, the matching variables are used to define a distance function between the recipient and a potential donor.
An extension of this concept is k-nearest neighbours imputation, which is based on a variation of the Gower distance for numerical, categorical, ordered and semi-continuous variables. To compute the Gower distance between two items you compare each element and compute a term. If the element is numeric, the term is the absolute value of the difference divided by the range. If the element is non-numeric, the term is 1 if the elements are different and 0 if they are the same. The Gower distance is the average of the terms. Each individual term is between 0.0 and 1.0 inclusive, so the Gower distance is always between 0.0 and 1.0, where a distance of 0.0 means the two items are the same and a distance of 1.0 means the two items are as far apart as possible relative to the source data set.
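To make the calculation concrete, below is a small worked sketch of the Gower distance between two hypothetical records with one numeric variable (age) and two categorical variables (sex and education); the values and the assumed age range are invented for illustration:
# Two hypothetical records: (age = 35, Male, Bachelors) and (age = 40, Female, Bachelors)
# Assume the age variable ranges from 17 to 90 in the data set
age_range <- 90 - 17

# Numeric term: absolute difference divided by the range of the variable
term_age <- abs(35 - 40) / age_range

# Categorical terms: 1 if the values differ, 0 if they are the same
term_sex <- as.numeric("Male" != "Female")
term_education <- as.numeric("Bachelors" != "Bachelors")

# The Gower distance is the average of the terms, so it always lies between 0 and 1
mean(c(term_age, term_sex, term_education))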
# Using KNN function in VIM package
# we can use any variables so we will use a cleaned income data set
# k is the number of nearest neighbour donors used (k = 1 is Nearest Neighbour imputation)
# dist_var takes the variables to be used for distance calculation
# I am choosing the distance variables arbitrarily here; how to choose them is covered in the theoretical course
knn_imputation <- VIM::kNN(data = cleaned_income, variable = "occupation", k = 4,
                           dist_var = c("education", "sex", "income", "marital_status"))
# Display results
head(x = knn_imputation)
The imputed value for occupation is chosen using the 4 closest donors.
Notice: If you are getting an Rcpp package error, then please install the Rcpp package. If you get a warning then that's fine; warning messages are generally fine in R.
Exercise
- Impute the missing value in the hours_per_week variable for the 4th person using K-Nearest Neighbours.
- Impute the missing value in the hours_per_week variable for the 4th person using K-Nearest Neighbours.
# Using KNN function in VIM package
# k is the number of nearest neighbour donors used
knn_imputation_cont <- VIM::kNN(data = cleaned_income, variable = "hours_per_week", k = 7)

# Display results
knn_imputation_cont %>%
  dplyr::select("hours_per_week", everything()) %>%
  head()
The imputed value in the hours_per_week column is chosen using the 7 closest donors.
6.5 Predictive mean matching
Finally, a special case of the nearest neighbour methods is predictive mean matching, in which the nearest neighbour donor is determined using the predicted y-value from a regression model. For each missing value, it imputes a value randomly from a set of observed values whose predicted values are closest to the predicted value for the missing value from the simulated regression model.
The predictive mean matching method is more appropriate than a standard regression approach if the normality assumption is violated. (This method only imputes missing values for continuous data since it is similar to a regression imputation method but with a donor element.) We will now impute the missing values of numerical data in cleaned_income using predictive mean matching.
# Impute data via predictive mean matching (single imputation)
imp_single <- mice::mice(data = cleaned_income, m = 1, method = "pmm")
The maxit argument is a scalar giving the number of iterations; the default is 5. The imputed variables are age and hours_per_week.
# Store imputed data using complete() function
data_imp_single <- mice::complete(imp_single)
# Display the results
head(data_imp_single)
We can also conduct multiple imputation by specifying m = 5
within the function mice (i.e. 5 imputed data sets).
# Impute missing values multiple times in hours_per_week column
imp_multi <- mice::mice(cleaned_income, m = 5, method = "pmm")

# Store multiple imputed data
# "repeated" and include = TRUE tells the complete function
# how to store our multiply imputed data in the object data_imp_multi_all
data_imp_multi_all <- mice::complete(imp_multi,
                                     "repeated",
                                     include = TRUE)
# Select all of the hours_per_week columns in this new data set
# Select only the first 5 rows
data_imp_multi_all %>%
  dplyr::select(starts_with("hours")) %>%
  head(n = 5)
- hours_per_week.0 is equal to the original hours_per_week with missing values.
- hours_per_week.1 to hours_per_week.5 are the five imputed versions of hours_per_week.
While hours_per_week.0 contains missing values, hours_per_week.1 to hours_per_week.5 are filled with imputed values. Note that the imputed values differ between the five imputed data sets.
Now we will check the goodness of fit, i.e. how well the values have been imputed. The xyplot() function helps us verify our imputations.
# Using xplot to create scatter plot where the x-axis is hours_per_week and y-axis is age
lattice::xyplot(imp_multi, age ~ hours_per_week | .imp, pch = 20, cex = 1.4)
The first row and column show the data of age.0 and hours_per_week.0. The blue points are the observed data. The second column of the first row shows the data of age.1 and hours_per_week.1, and so on. The red points are the imputed data (that is why the first plot does not have any red points: those columns still contain the missing values that haven't been imputed). To check the goodness of fit, ideally the red points should be similar to the blue points, so that the imputed values are similar to the observed values, which is true in our case.
A useful feature of the mice()
function is the ability to specify the set of predictors (matching variable) to be used for each incomplete variable. The basic specification is made through the predictorMatrix
argument, which is a square matrix of size ncol(data)
containing 0/1 data.
# Use mice() function within mice package
impute <- mice::mice(cleaned_income, print = FALSE)

# mice() returns a mids object containing a predictorMatrix entry
impute$predictorMatrix
The rows correspond to incomplete variables, in the sequence as they appear in the data. A value of 1 indicates that the column variable is a predictor (matching variable) used to impute the incomplete row variable, and a 0 means that it is not used. Thus, in the above example, age is predicted from education_num, capital_gain, capital_loss and hours_per_week. Note that the diagonal is 0 since a variable cannot predict itself.
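As a minimal sketch of how this matrix can be customised (the choice of which predictor to switch off is arbitrary and purely illustrative), we can take the matrix returned above, set an entry to 0 and pass the modified matrix back to mice():
# Copy the default predictor matrix
pred <- impute$predictorMatrix

# Stop education_num being used as a predictor (matching variable) when imputing age
pred["age", "education_num"] <- 0

# Re-run the imputation with the customised predictor matrix
custom_impute <- mice::mice(cleaned_income, predictorMatrix = pred, print = FALSE)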
Exercise
- Do a single imputation to impute the missing values of continuous data in the cleaned_income data using the predictive mean matching method.
- Find which rows in the age column have been imputed and their imputed values.
- Do a single imputation to impute the missing values of continuous data in the cleaned_income data using the predictive mean matching method.
# Impute data via predictive mean matching (single imputation)
pmm_impute <- mice::mice(data = cleaned_income, m = 1, method = "pmm")
- Find which rows in the age column have been imputed and their imputed values.
# Display the results
pmm_impute$imp$age
The imputed value in the age
column is shown above for each row with the missing value.
7 Summary
By now you should be able to:
- Apply methods of reviewing data such as identifying missing values, visualising the pattern of missingness and finding duplicates in data in R.
- Do automatic editing where you can apply restrictions, check which restrictions have been violated and correct them.
- Impute missing values using model-based methods such as mean, ratio and regression imputation in R.
- Impute missing values using donor-based imputation such as random hot deck, sequential hot deck and hierarchical hot deck in R.
- Impute missing values using KNN and predictive mean matching imputation in R.