Editing and Imputation in R
1 Course Outline
This is a short course that covers the practical application of editing and imputation in R. A similar course is available for those who prefer working in Python (Editing and Imputation in Python). For each topic, I will first give the name of the method we are using, then an example of how to use that method in R, followed by some exercises for you to complete.
This course doesn't cover the theory of any of the methods specified. The theory of these methods is covered in Introduction to Editing and Imputation, which is a prerequisite of this course. Also, this course doesn't teach which method is best for editing and imputation; it only shows how different methods can be used in R. Feel free to email Expert Group for their expert opinion.
The topics we will be covering are:
- Reviewing and editing data (data entry, automatic editing and error localization);
- Model-based imputation (mean, ratio of means, mean of ratios and regression imputation); and
- Donor-based imputation (random hot deck, sequential hot deck, hierarchical hot deck, nearest neighbour imputation and predictive mean matching).
1.1 Learning Outcomes
After taking this course you should be able to conduct:
- Methods of reviewing and editing data such as data entry, automatic editing, deductive correction and error localization in R.
- Model-based imputation such as mean, ratio and regression imputation in R.
- Donor-based imputation such as random hot deck, sequential hot deck, hierarchical hot deck, k-nearest neighbour and predictive mean matching imputation in R.
1.2 Prerequisites
- Awareness in Editing and Imputation
- Introduction to Editing and Imputation
- Introduction to R
- R Control Flow, Loops and Functions (if you completed Introduction to R before 18/08/2021 you don't need to complete this course, since before that date these two courses were not split)
1.3 Duration
It is estimated that this course will take around 4-6 hours to work through.
2 Packages and Data
2.1 Packages
This course has been written in R Version 4.0.2
The packages used in this course are:
- tidyverse (Version 1.3.0) for data manipulation.
- editrules (Version 2.9.3) for editing data.
- deducorrect (Version 1.3.7) for editing data.
- VIM (Version 6.1.0) for donor-based imputation.
- mice (Version 3.13.0) for model-based imputation.
- lattice (Version 0.20.41) for creating plots.
2.2 Data
We will be using three data sets in this course, which are saved in the Data folder. The first data set we will be using is the income data.
The income_data data set contains the questionnaire data of the 'Census Income' database formatted as a data frame. The data set contains 841 observations and 10 variables:
- age - age of the individual completing the questionnaire.
- education - the level of education as a factor with levels such as 1st-4th, 5th-6th, .., Doctorate.
- education_num - the level of education in numbers.
- occupation - the individual's occupation as a factor with levels such as Adm-clerical, Sales, .., Tech-support.
- income - the level of income as a factor with levels small (if income < 50k) or large (if income >= 50k).
- sex - the sex of each individual as a factor with levels Female and Male.
- marital_status - the individual's marital status as a factor with levels such as Divorced, Married-AF-spouse, .., Widowed.
- capital_gain - yearly profit from the sale of property or an investment.
- capital_loss - yearly loss from the sale of property or an investment.
- hours_per_week - the number of hours worked per week.
Then we will use the cleaned_income data set, which is the cleaned version of the income data. The cleaning is covered in Section 4, Reviewing and Editing Data. We will use the cleaned_income data set in the Model-based Imputation and Donor-based Imputation sections (Sections 5 and 6).
The third data set stored in the Data folder is business_data, which compares the number of toys sold in two different periods for 10 different brands. The data has 10 observations and 3 variables:
business - the business number (which are different brands selling toys).
previous_period - the number of toys sold in the previous period.
current_period - the number of toys sold in the current period.
Exercise
- Load in the following packages: tidyverse, editrules, deducorrect, VIM, mice and lattice.
- Import the income data from the Data folder and assign it to the variable name income_data.
- Look at the structure of the data set using the dplyr::glimpse() function.
- Look at the first 10 observations of the data set.
- Load in the following packages: tidyverse, editrules, deducorrect, VIM, mice and lattice.
# Loading in the following packages
library(package = tidyverse)
library(package = editrules)
library(package = deducorrect)
library(package = VIM)
library(package = mice)
library(package = lattice)
- Import the income data from the Data folder and assign it to the variable name income_data.
# Load in the income data set from the data folder
# using readr package and read_csv() function
# assign to the name income_data
income_data <- readr::read_csv("../Data/income.csv")
- Look at the structure of the data set using the dplyr::glimpse() function.
# Use dplyr::glimpse() function to look at the structure of the data set
dplyr::glimpse(income_data)
- Look at the first 10 observations of the data set.
# Look at the first 10 observations of the income_data data set
head(income_data, n = 10)
3 Editing and Imputation
Editing
Data Editing is essentially error localisation. That is the process of identifying and flagging the specific values in a record deemed to be an error, rendering them subject to subsequent adjustment or correction through imputation.
Imputation
Imputation refers to estimating values for missing or inconsistent data items.
In this course, we will first cover editing data in R and then we will cover model-based imputation and donor-based imputation.
4 Reviewing and Editing Data
Reviewing data and editing it aims to improve the quality and accuracy of data. In this course, we will cover the following aspects of editing data:
- Data Entry
- Automatic Editing
- Error localization
4.1 Data Entry
This stage involves quality assurance of micro data to identify and resolve obvious, known issues such as duplicate records, incorrect formatting, identifying missing values and looking for invalid values.
4.1.1 Identifying Missing Values
So the first thing we will do is to look at all of the unique values in all variables for income_data
.
# See unique values by each column
sapply(income_data, function(x) unique(x))
As you can see, the income variable has * and - values, which seem to be invalid entries. In R, only NA represents a missing value, so we will assign all of these symbols to NA as shown below:
# Assign all missing values to NA
income_data[income_data == "*" | income_data == "-"] <- NA
Now we will look at the unique values of the income
variable to double-check our results.
# Unique values in income column
unique(income_data$income)
After replacing these symbols, we know we have missing values in our data set. More generally, we can check whether any data set has missing values using the anyNA() function.
# Find any missing values in the income data
anyNA(income_data)
Since the output is TRUE, there are missing values in our data. Now we will use the md.pattern() function from the mice package to calculate the frequencies of the missing data patterns as shown below:
# Understand the missing value pattern
mice::md.pattern(income_data, plot = FALSE)
The output can be understood as follows: 1 and 0 under each variable represent its presence and missing state respectively. The numbers before the first variable (646, 117, 40, 9, 15, 11, 1, 1, 1) represent the number of rows with each pattern. For example, 646 samples are complete; there are 117 samples where hours_per_week is missing and all other values are present. Similarly, there are 40 samples where occupation is missing and all other values are present, and 9 samples where occupation and hours_per_week are missing but all others are present. The last column lists the number of missing entries per pattern, e.g. the first 0 tells us that the 646 complete samples have no missing values, the next 1 shows that the next pattern has one missing variable (hours_per_week), and so on.
The bottom row provides the number of missing entries per variable and the total number of missing cells. In our case, only 4 columns have missing values: 3 missing values in age, 27 in income, 49 in occupation and 139 in hours_per_week.
4.1.2 Visualising Missing Values
A perhaps more helpful visual representation can be obtained using the VIM package as follows:
# Plot the missing values
# col is a vector giving the colours to be used for observed and missing
# numbers give the proportion or frequencies of the different combinations
# sortVars determines whether variables are sorted by the number of missing values
# labels are the labels for the x-axis
# cex.axis is the character expansion factor to be used for the axis annotation
# gap is a numeric value giving the distance between the two plots in margin lines
# ylab are the labels on the y-axis
VIM::aggr(income_data, col = mice::mdc(1:2), numbers = TRUE, sortVars = TRUE,
          labels = names(income_data), cex.axis = 0.7, gap = 3,
          ylab = c("Proportion of missingness", "Missingness Pattern"))
The pink colour shows the missing values while the blue shows the present values. The plot helps us understand that almost 77% of the samples are not missing any information, 13.9% of samples are missing the hours_per_week value while other values are present, 0.12% of samples are missing the hours_per_week, income and age values while other values are present, and the remaining ones show other missing patterns.
Notice: If you are getting a plot which does not fit the screen, then please adjust the cex.axis and gap values and use oma to set the outer margin area according to your screen, e.g. cex.axis = 0.6, gap = 1 and oma = c(10, 5, 3, 3).
We can also find all rows which have missing values as shown below:
# complete.cases() returns a logical vector indicating which cases are complete, i.e. have no missing values
# so we use ! as a negator to return a logical vector where cases are not complete
# select all columns by leaving a space after the comma
missing_values <- income_data[!complete.cases(income_data), ]
# First six rows with missing value
head(x = missing_values)
As you can see, this approach gives us all the rows with missing values, e.g. the first person has a missing occupation. We can impute these values using the information provided in the model-based and donor-based imputation sections.
4.1.3 Finding Duplicates
Now we will check if we have any duplicates in our data set.
# Show the repeat entries
income_data[duplicated(income_data), ]
These 4 rows are duplicated rows. The decision to keep or remove duplicate rows depends on the data set we are using. For example, if a data set contains unique IDs for individuals then we would want to remove duplicated rows. However, in our case, we will choose to keep the duplicated rows because it is possible for two or more individuals to have the same information across all variables. Duplicated rows can be removed in R using the distinct() function, as sketched below.
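For illustration only, here is a minimal sketch of how the duplicates could be dropped; income_no_duplicates is a name introduced just for this example, and we keep the duplicates for the rest of the course.
# Drop exact duplicate rows (for illustration only - we keep the duplicates in this course)
income_no_duplicates <- dplyr::distinct(income_data)

# Compare the number of rows before and after
nrow(income_data)
nrow(income_no_duplicates)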
4.2 Automatic editing
In a real-life data set, there are simple errors that need to be treated automatically. For example, a person's age cannot be negative, an under-aged person cannot possess a driver's license, etc. Such knowledge can be expressed as rules or constraints. In the data editing literature these rules are referred to as edit rules or, in short, edits. The editrules package can be used to define rules on categorical, numerical or mixed-type data sets which each record must obey. Automatic editing is a cheap mode of treatment, but you must be confident that your assumptions are correct, otherwise your automatic corrections could introduce more error.
Now, for demonstration purposes, we will define some restrictions on the age column in income_data using the editrules package.
# Set the rule where age should be non-negative and less than and equal to 40 only
E <- editrules::editset(c("age >= 0", "age <= 40"))
# Display the results
E
The income_data
can be checked against these rules with the violatedEdits()
function.
# Check if the rules are violated
violate_edits <- editrules::violatedEdits(E, dat = income_data)
# Display first six rows
violate_edits[1:6, ]
The first column shows the records (rows). The next column, num1 (age >= 0), shows that all values are FALSE, meaning that the rule of age being positive has not been violated (the age column has positive values for the first 6 records). The column num2 (age <= 40) shows that the 5th record violates the rule because the logical value is TRUE, suggesting that the age of the 5th person is greater than 40. We can confirm our results by looking at the data.
# Filter the 5th person in the data set
income_data[5, ]
As you can see the age of the 5th person is 59 which is indeed greater than 40.
Now we will set some restrictions on the education_num column. The education_num column gives the level of education in numbers, e.g. a person who has done a Masters would have education number 14. The maximum education number is 16, which means the person has a Doctorate. Therefore we will set restrictions on education_num, since the education number should only take non-negative values less than or equal to 16.
# Set the rule where education_num should be non-negative and less than and equal to 16
<- editrules::editset(c("education_num >=0", "education_num <= 16"))
E
# Check if the rules are violated
# the function returns a matrix
edits_violated <- editrules::violatedEdits(E, dat = income_data)
# Display the first 6 rows
edits_violated[1:6, ]
The function violatedEdits() returns a matrix, edits_violated. However, to check all rows where the rules have been violated, we have to convert edits_violated to a data frame and filter the rows where the edit rules have been violated.
# Convert the variable to a data frame
violated_data <- data.frame(edits_violated)

# Filter the rows where either of the edit rules have been violated
# once you create a data frame you can see the column names by using the $ sign
violated_data %>%
  dplyr::filter(num1 == TRUE | num2 == TRUE)
The second column num2 (education_num <= 16) shows that the 5th and 93rd records violate the rule because the logical value is TRUE, suggesting that the education_num of both people is greater than 16. We can confirm our results by looking at the data.
# Filter the 5th and 93rd person in the data set
income_data[c(5, 93), ]
As you can see, the value is 100 (instead of 10) for the 5th person and 90 (instead of 9) for the 93rd person. Obviously, we can never be 100% sure that this 100 should be 10, so we will make the assumption that each person made a typing error and the values should be 10 and 9 respectively. We can use the deducorrect package to correct these values, because this package fixes inconsistent observations by altering invalid values in the record based on information from valid values.
# We specify the rule and the correction
correction_rules <- deducorrect::correctionRules(expression(if (education_num > 16) {
  education_num <- education_num / 10
}))
# Apply simple replacement rules to a data.frame
correction <- deducorrect::correctWithRules(rules = correction_rules, dat = income_data)
# list of alterations
correction$corrections
As you can see, the previous values were 100 and 90, which have been corrected to 10 and 9 respectively. Now we will look at the altered data.
# list with altered data
correction$corrected
The altered data has the correct education_num
values now.
Exercise
- Set a restriction on the capital_gain column in income_data, where the column should have non-negative values less than or equal to 100000.
- Using income_data, check if the rules have been violated and assign the result to the variable violate.
- Convert violate to a data frame using the data.frame() function.
- Filter the rows where either of the edit rules have been violated.
- Do the correction using the deducorrect package (Hint: the correction will be that wherever the capital gain is greater than 100000 we divide it by 1000).
- Set a restriction on the capital_gain column in income_data, where the column should have non-negative values less than or equal to 100000.
# Set the rule where capital_gain should be non-negative and less than and equal to 100000
<- editrules::editset(c("capital_gain >=0", "capital_gain <= 100000"))
E
# Display the results
E
- Using income_data, check if the rules have been violated and assign the result to the variable violate.
# Check if the rules are violated
violate <- editrules::violatedEdits(E, dat = income_data)
# Display first 6 rows
violate[1:6, ]
- Convert violate to a data frame using the data.frame() function.
# Convert violate to a dataframe
violate_data <- data.frame(violate)
- Filter the rows where either of the edit rules have been violated.
# Filter the rows where either of the edit rules have been violated
# Once you create a data frame you can see the column names by using $ sign
violate_data %>%
  dplyr::filter(num1 == TRUE | num2 == TRUE)
Records 16 and 273 violate the rule of capital_gain being less than or equal to 100000. To confirm, we will look at the actual income_data.
# select the 16th and 273rd record from the income_data and the capital_gain column
income_data[c(16, 273), 8]
Both values for capital_gain are more than 100000.
- Do the correction using the deducorrect package (Hint: the correction will be that wherever the capital gain is greater than 100000 we divide it by 1000).
# We specify the rule and the correction
# Since the restriction is that the capital_gain should be non-negative and less than and equal to 100000
# we will specify the rule that wherever the capital gain is greater than 100000 we divide it by 1000
corr_rules <- deducorrect::correctionRules(expression(if (capital_gain > 100000) {
  capital_gain <- capital_gain / 1000
}))
# Apply simple replacement rules to a data.frame.
corr <- deducorrect::correctWithRules(rules = corr_rules, dat = income_data)
# list of alterations
corr$corrections
4.3 Error localization
So far we have seen how to set restrictions and correct the rows which violate them. However, in the real world we work with large data sets, which means more variables and therefore more rules. Rather than specifying one condition at a time, we can keep a text file containing all conditions for all data types (i.e. numeric, categorical and mixed data). With the editrules package, we can read in the text file with the rules and restrictions and check whether any have been violated.
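We cannot see the contents of edits.txt here, but as a rough, hypothetical illustration (not the real contents of the file in the Data folder), an editrules text file simply lists numeric, categorical and mixed (conditional) edits written as R expressions:
# Hypothetical example of an edit rules file - the real rules live in ../Data/edits.txt
# numeric edits
age >= 0
age <= 120
hours_per_week <= 100
# categorical edit (the variable may only take these values)
sex %in% c("Female", "Male")
# mixed edit (a conditional rule combining numeric and categorical variables)
if (age < 16) income == "small"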
# Read the text file containing rules
E <- editrules::editfile("../Data/edits.txt")
# Check if the rules are violated
violated_rules <- editrules::violatedEdits(E, income_data)
# Display the summary
summary(violated_rules)
Here, the edit labelled dat7
is violated by 543 records (64.6% of all records). Violated edits are sorted from most to least often violated. We can also check the interconnection of rules by creating a plot as shown below:
# A graph showing the interconnection of rules and restriction
plot(E)
Here we can see the interconnection of restrictions where blue circles represent variables and yellow boxes represent rules. The lines indicate which restrictions are associated with what variables.
As you can see this interconnectivity of edits is what makes error localization difficult. For example, the graph shows that a record violating edit mix5
may contain an error in education_num
and/or education
and/or age
. Suppose that we alter education_num
so that mix5
is not violated anymore. Then we are carrying the risk of violating up to five other edits containing education_num
.
If we have no other information available but the edit violations, it makes sense to minimize the number of fields being altered. This principle is commonly referred to as the principle of Fellegi and Holt (covered in Introduction to Editing and Imputation).
# First two records from the income_data
income_data[1:2, ]
# Check if they are violated
editrules::violatedEdits(E, income_data[1:2, ])
The first record violates dat6
and dat7
while the second record violates dat6
only. We use the localizeErrors()
function, with a mixed-integer programming approach to find the minimal set of variables to adapt.
# find the minimal set of variables to adapt
# method can be errorlocalizer ("bb") or mixed integer programming ("mip")
localised_error <- editrules::localizeErrors(E, dat = income_data[1:2, ], method = "mip")

# adapt indicates the minimal set of variables to be altered in each record
localised_error$adapt
The array localised_error$adapt indicates with a boolean value which variable in which record must be changed. So for the first person, the variables education and occupation need to be changed, and for the second person, the variables education and education_num need to be changed.
5 Model Based Imputation
Model-based imputation is the approximation of missing values using assumptions about the distribution of the data (which includes mean and median imputation) or using assumptions about the relationship between variables or units (which includes ratio and regression imputation).
There are two types of model-based methods:
Deterministic: those methods which are based on models and, when repeated, always impute the same value.
Stochastic: methods which include a random element. When repeated these may impute a different value each time.
First we will cover deterministic imputation: mean imputation, ratio imputation and regression imputation. For this section and later sections we will use the cleaned data set cleaned_income stored in the Data folder.
# Import cleaned_income data and assign it to cleaned_income
cleaned_income <- readr::read_csv("../Data/cleaned_income.csv")
5.1 Mean imputation
We will use the mice package to impute the missing values in the cleaned_income data. mice stands for Multivariate Imputation by Chained Equations and it is one of the packages most commonly used by R users. MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on the observed values and can be predicted using them. It imputes data on a variable-by-variable basis by specifying an imputation model per variable. We will cover mean imputation first.
# Use mice function within mice package
# method used is "mean" and m = 1 means single imputation
# maxit is a scalar giving the number of iterations (default is 5)
mean_impute <- mice::mice(cleaned_income, method = "mean", m = 1, maxit = 5)
The mice
package imputes the missing values in all variables e.g. in our case the imputed variables are age
and hours_per_week
. The method we used is “mean”, but you can use other imputation methods by running the methods(mice)
command which gives a list of the available imputation methods. Since “mean” imputation can only be used for numerical data, that is why the imputed variables are numerical i.e. age
and hours_per_week
. If we use methods such as “polytomous logistic regression”, then we can impute missing values for categorical data such as occupation
.
We can see the rows which had the missing value and their corresponding imputed values as shown below:
# Selecting the rows in the age column which have imputed values
mean_impute$imp$age
As you can see, rows 37, 76 and 830 had missing values in the age column and their imputed value is the mean of the age column, which is 38.76.
# Selecting the first 6 rows in hour_per_week column which have imputed values
head(mean_impute$imp$hours_per_week)
Now we can get back the completed data set using the complete()
function. The missing values will be replaced with the imputed values in the data set.
# Store imputed data using complete() function
mean_data <- mice::complete(mean_impute)
So now we can find the summary statistics of hours_per_week
column in both data sets.
# Summary statistics of hours per week for original data
summary(cleaned_income$hours_per_week)
# Summary statistics of hours per week for imputed data
summary(mean_data$hours_per_week)
Making the comparison we see that only the values of the first and third quartile have changed. The minimum and maximum values are the same and so, of course, is the mean. But of more interest is what happens to the sample standard deviation; its value for the imputed data can be found using:
# Standard deviation of hours per week for imputed data
sd(mean_data$hours_per_week, na.rm = TRUE)
# Standard deviation of hours per week for original data
sd(cleaned_income$hours_per_week, na.rm = TRUE)
The value for the imputed data, 11.32, is lower than that for the original data, 12.40. This was expected, since mean imputation reduces the variance of the imputed variable: the imputation adds more observations equal to the mean and so reduces the average deviation from the mean of the data.
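A quick way to convince yourself of this shrinkage is a toy example; the short vector below is made up for illustration and is not part of the course data:
# Toy vector with two missing values
x <- c(10, 20, 30, NA, NA)

# Mean imputation: replace every NA with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# The standard deviation after imputation is smaller than that of the observed values
sd(x, na.rm = TRUE)
sd(x_imputed)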
Exercise
- Using the mice() function, impute the missing values in the cleaned_income data.
- Store the imputed data using the complete() function.
- Look at the difference in the standard deviation of the age column in the imputed data and the original data.
- Using the mice() function, impute the missing values in the cleaned_income data.
# Use mice function within mice package where the method used is "mean"
# and m = 1 means single imputation
# maxit is a scalar giving the number of iterations (default is 5)
mean_impute <- mice::mice(cleaned_income, method = "mean", m = 1, maxit = 1)
- Store the imputed data using the complete() function.
# Store imputed data using complete() function
mean_data <- mice::complete(mean_impute)
- Look at the difference in the standard deviation of the age column in the imputed data and the original data.
# Standard deviation of the age for original data
sd(cleaned_income$age, na.rm = TRUE)
# Standard deviation of the age for imputed data
sd(mean_data$age, na.rm = TRUE)
As expected the standard deviation of the age
column in the imputed data is lower than the original one.
5.2 Ratio imputation
Now we will look at the ratio imputation. There are two types of ratio imputation used, mainly for business surveys:
- Mean of ratios (e.g. used for the vacancy survey)
- Ratio of means (e.g. used for the financial services survey)
For this section only we will use a short data set called business_data
stored in the data folder.
# Import business_data
business_data <- readr::read_csv("../Data/business_data.csv")
# Display the data
business_data
First, we will impute the missing values using the ratio of means imputation and then we will cover the mean of ratios imputation.
5.2.1 Ratio of means imputation
The steps for the ratio of means imputation are as follows:
- Identify businesses that responded in both periods.
- Sum the current period values for businesses that responded in both periods.
- Sum the previous period values for businesses that responded in both periods.
- Calculate the growth factor – the sum of the values for the current period divided by the sum of the values for the previous period.
- For each business with missing values in the current period, multiply the previous period value by the growth factor.
# Take the business data and select all rows with no missing value
# Divide the sum of the current_period column by the sum of the previous_period column
# to obtain the growth factor
# pull out the value of the growth factor
ratio_sum <- business_data %>%
  tidyr::drop_na() %>%
  dplyr::summarise(ratio_sum = sum(current_period) / sum(previous_period)) %>%
  dplyr::pull()

# Display the value
ratio_sum
Using this value, we will impute the missing value in the current_period
column by multiplying the growth factor with the previous period value as shown below:
# Impute the missing value in the current period column using if_else() function
# we are updating the current_period so use mutate
# specify wherever the missing value is true, replace that with ratio * previous value
# otherwise keep the original value
business_data %>%
  dplyr::mutate(current_period = dplyr::if_else(
    condition = is.na(current_period),
    true = (ratio_sum * previous_period),
    false = current_period))
5.2.2 Mean of ratios imputation
The steps for mean of ratios imputation are as follows:
- Identify businesses that responded in both periods.
- Calculate the ratio for each business that responded in both periods – divide the current period value by the previous period value.
- Sum the ratios.
- Calculate the growth factor – this is the sum of ratios divided by the number of businesses that responded in both the current period and the previous period.
- For each business with missing values for the current period, multiply the previous period value by the growth factor.
# Take the business data and select all rows with no missing value
# create a column of ratios which is the current period divided by the previous period
# Find the sum of ratios and divide it by the number of businesses to get the growth_factor
# pull out the value of the growth factor from the data set
growth_factor <- business_data %>%
  tidyr::drop_na() %>%
  dplyr::mutate(ratio = current_period / previous_period) %>%
  dplyr::summarise(growth_factor = sum(ratio) / n()) %>%
  dplyr::pull()

# Display the value
growth_factor
Using this value, we will impute the missing value in the current_period
column by multiplying the growth factor with the previous period value as shown below:
# Impute the missing value in the current period column using if_else() function
# we are updating the current_period so use mutate
# specify wherever the missing value is true, replace that with growth * previous value
# otherwise keep the original value
business_data %>%
  dplyr::mutate(current_period = dplyr::if_else(
    condition = is.na(current_period),
    true = (growth_factor * previous_period),
    false = current_period))
5.3 Regression imputation
Now, let’s apply a regression imputation to our cleaned_income
data. As covered before there are two types of regression imputation.
- Deterministic: this method imputes the value predicted by the regression model, so the imputed values lie on the regression line.
- Stochastic: an improvement on the deterministic method that aims to preserve the variability of the data by adding a random error (residual) term to the predicted value.
# Deterministic regression imputation
# method used is "norm.predict"
# m = 1 means single imputation
<- mice::mice(cleaned_income, method = "norm.predict", m = 1) deter_impute
The variables that have been imputed are age and hours_per_week.
# Store data
det_data <- mice::complete(deter_impute)
We can use almost the same code for stochastic regression imputation. We only have to change the method from method = "norm.predict"
to method = "norm.nob"
as shown below:
# Stochastic regression imputation
# method used is "norm.nob"
# m = 1 means single imputation
<- mice::mice(cleaned_income, method = "norm.nob", m = 1)
stoch_imp
# Store data
<- mice::complete(stoch_imp) stoch_data
Let’s graphically check how well our missing data imputations worked:
# Graphical comparison of deterministic and stochastic regression imputation
# Both plots in one graphic
par(mfrow = c(1, 2))
# Deterministic regression imputation plot
# Plot of observed values
plot(cleaned_income$age[!is.na(cleaned_income$hours_per_week)],
     det_data$hours_per_week[!is.na(cleaned_income$hours_per_week)],
     main = "Deterministic Regression",
     xlab = "Age", ylab = "Hours per week")

# Plot of imputed values
points(cleaned_income$age[is.na(cleaned_income$hours_per_week)],
       det_data$hours_per_week[is.na(cleaned_income$hours_per_week)],
       col = "red")

# Regression slope
abline(lm(hours_per_week ~ age, det_data), col = "#1b98e0", lwd = 1.5)

# Stochastic regression imputation
# Plot of observed values
plot(cleaned_income$age[!is.na(cleaned_income$hours_per_week)],
     stoch_data$hours_per_week[!is.na(cleaned_income$hours_per_week)],
     main = "Stochastic Regression",
     xlab = "Age", ylab = "Hours per week")

# Plot of imputed values
points(cleaned_income$age[is.na(cleaned_income$hours_per_week)],
       stoch_data$hours_per_week[is.na(cleaned_income$hours_per_week)],
       col = "red")

# Regression slope
abline(lm(hours_per_week ~ age, stoch_data), col = "#1b98e0", lwd = 1.5)
The black points are observed values, whereas the imputed values are represented by red points. The graph visualizes the main drawback of deterministic regression imputation, i.e. the imputed values (red points) are way too close to the regression slope (blue line)!
In contrast, the imputation by stochastic regression worked much better. The distribution of the imputed values is similar to that of the observed values and, hence, much more realistic. As such, this method, and in particular the idea of drawing from the residuals, forms the basis of more advanced imputation techniques. However, as covered in the Introduction to Editing and Imputation course, there are some disadvantages of using stochastic regression. For example, it can lead to implausible values and assumes the data are homoscedastic. The predictive mean matching method covered in Section 6.5 is capable of solving the issues related to stochastic regression imputation.
Exercise
- Impute the missing values of continuous data in the cleaned_income data using stochastic regression and assign the result to stoch_impute.
- Use stoch_impute$imp$hours_per_week to check which rows have been imputed.
- Impute the missing values of continuous data in the cleaned_income data using stochastic regression and assign the result to stoch_impute.
# Stochastic regression imputation
# method used is "norm.nob"
# m = 1 means single imputation
<- mice::mice(cleaned_income, method = "norm.nob", m = 1) stoch_impute
- Use stoch_impute$imp$hours_per_week to check which rows have been imputed.
# Select the imputed value in hours_per_week column
# select only first 6 values
head(stoch_impute$imp$hours_per_week)
It shows the rows which have missing values and their corresponding imputed value. As covered in the Introduction to editing and imputation course, stochastic regression may impute different values each time.
6 Donor Based Imputation
Now we will use donor-based imputation to fill in missing values for a given unit (the recipient) by copying the corresponding observed values of another unit (the donor). We will cover these types of donor-based imputation:
- Random hot deck
- Sequential hot deck
- Hierarchical hot deck
- Nearest neighbour imputation
- Predictive mean matching
The theory for each type of donor-based imputation is covered in Introduction to editing and imputation.
6.1 Random hot deck
In random hot deck imputation, imputation classes (also known as donor pools or adjustment cells) are formed based on matching variables. For each recipient unit i in a given imputation class, the potential donors are units within that same class with the target variable y observed. Of these potential donors, one is selected at random, typically through equal-probability sampling, and used to impute the recipient. To conduct the hot deck imputation I will be using a subset of the cleaned_income data set which has only 10 rows and hence should make this section easier to understand. We will then use the full data set later on when we cover hierarchical hot deck imputation.
# Create a shorter data with 10 rows and all columns
shorter_data <- cleaned_income[1:10, ]
# Display the results
shorter_data
So we want to impute the missing value in the occupation variable for the first person (the recipient) using the matching variable income.
In R we can use the hotdeck() function from the VIM package to carry out hot deck imputation. We first specify the data set, then the variable which has a missing value to impute in the argument variable (in our case occupation), and then the matching variable in domain_var (in our case income).
# Using hotdeck function in VIM package to carry out random hot deck imputation
# variable take the column that we want to impute the missing value of
# domain_var is the variables for building domains and impute within these domains
VIM::hotdeck(data = shorter_data, variable = "occupation", domain_var = "income")
So the value for occupation can be "Adm-clerical", "Other-service" or "Exec-managerial", since the donors are the 4th, 7th, 8th and 9th person (because they have a small income, and the person whose missing value we are imputing also has the value small for the income variable). Since this is random hot deck imputation, the donor is selected at random, so the imputed value can be any of Adm-clerical, Other-service or Exec-managerial.
Also, there is a new column occupation_imp
which returns a boolean value TRUE if the missing value in the occupation variable has been imputed and FALSE otherwise.
We can have more than one matching variable as well and we will cover that later in the hierarchical hot deck imputation section (Section 6.3).
Exercise
- Carry out random hot deck imputation to impute the missing value in the hours_per_week variable for the 4th person in the shorter data using the matching variable sex. Remember shorter data contains the first 10 observations from the cleaned_income data set.
- Carry out random hot deck imputation to impute the missing value in the hours_per_week variable for the 4th person in the shorter data using the matching variable sex. Remember shorter data contains the first 10 observations from the cleaned_income data set.
# Using hotdeck function in VIM package to carry out random hot deck imputation
# variable take the column that we want to impute the missing value of
# domain_var is the variables for building domains and impute within these domains
VIM::hotdeck(data = shorter_data, variable = "hours_per_week", domain_var = "sex")
Since the 4th person is male, the donor can be either the 6th or 10th person, so the imputed value can be either 40 or 45. To see the results more easily you can filter the data as shown below:
# Filter all men in data
# Select only sex and hours_per_week column
shorter_data %>%
  dplyr::filter(sex == "Male") %>%
  dplyr::select(sex, hours_per_week)
So the imputed value can be either 40 or 45.
6.2 Sequential hot deck
In sequential hot deck imputation, the data set is not explicitly split into groups. Instead, one goes over the records in the data set in order and imputes each missing value by the last previously encountered observed value for a unit with the same scores on the matching variables. Thus, the recipient is imputed using as a donor the last unit with y
observed that belongs to the same imputation class and that comes before the recipient in the data file.
We will impute the missing value in the occupation
variable using income
as the matching variable. But this time we will do sequential hot deck imputation where we will order the hours_per_week
variable. If we don’t order the hours_per_week
then R will only carry out random hot deck imputations.
# Using hotdeck function in VIM package to carry out sequential hot deck imputation
# variable take the column that we want to impute
# domain_var is the matching variable
# variables for sorting the data set before imputation (not overlapping with variable)
VIM::hotdeck(data = shorter_data, variable = "occupation", domain_var = "income",
             ord_var = "hours_per_week")
Now the imputed value for the occupation
variable is “Adm-clerical” because when we order the hours_per_week
variable, the structure of the data looks like this:
# Arranging the hours_per_week column in the ascending order
shorter_data %>%
  dplyr::arrange(hours_per_week)
As you can see when we order the hours_per_week
variable, the structure of the data changes where the recipient (the person with missing value in the occupation
variable) is the 6th person and the donor is the 4th person with the same income. (Remember: in sequential hot deck imputation the data set is not explicitly split into groups instead, one goes over the records in the data set in order and imputes each missing value by the last previously encountered observed value.)
Exercise
- Carry out sequential hot deck imputation to impute the missing value in the occupation variable for the 1st person in the shorter data using the matching variable sex while ordering by the education_num column. Remember shorter data contains the first 10 observations from the cleaned_income data set.
- Carry out sequential hot deck imputation to impute the missing value in the occupation variable for the 1st person in the shorter data using the matching variable sex while ordering by the education_num column. Remember shorter data contains the first 10 observations from the cleaned_income data set.
# Using hotdeck function in VIM package to carry out sequential hot deck imputation
# variable take the column that we want to impute
# domain_var takes the matching variable
# variables for sorting the data set before imputation
VIM::hotdeck(data = shorter_data, variable = "occupation",
             domain_var = "sex",
             ord_var = "education_num")
As you can see the imputed value is now “Adm-clerical” because when we order the education_num
variable, the structure of the data looks like this:
# Arranging education_num column in ascending order
shorter_data %>%
  dplyr::arrange(education_num)
So the donor is the 7th person, who is female, and the imputed value for occupation will be "Adm-clerical".
6.3 Hierarchical hot deck
If a single donor is used many times to impute many records, this could lead to a lack of precision in survey estimates. One way to avoid the problems associated with sequential hot-deck imputation is hierarchical hot-deck imputation. This method sorts recipient and donor units into a large number of imputation classes based on a detailed categorisation of a large set of matching variables.
Donor units are then matched with recipients in the smallest class first. If no match is found that class is collapsed with the next one, and so on until a donor is found.
We will impute the missing value in the occupation variable of the first person in the cleaned_income data set using a hierarchy of matching variables constructed from the observed data for that first person: income, education, education_num, hours_per_week, sex, marital_status, capital_gain, capital_loss and age, with age being the first variable dropped if no donor is found.
There are various methods to build a hierarchy like this for each data set, but the most commonly used methods are decision trees.
Now we will impute the missing value in the occupation variable for the first person using all of the matching variables in this hierarchy, so we want the donor to have the same value on all of these variables as the first person (the recipient).
# specify the data and the variable that needs to be imputed
# specifying more than one matching variable now
hierarchy_impute <- VIM::hotdeck(data = cleaned_income, variable = "occupation",
                                 domain_var = c("income", "education", "education_num", "hours_per_week",
                                                "sex", "marital_status", "capital_gain", "capital_loss",
                                                "age"))
# Display first 6 observations
head(hierarchy_impute)
As you can see the missing value hasn’t been imputed because no donor was found that had the same characteristics on all variables as the recipient. So we will work backwards on the hierarchy. Now we want to select donors with the same values on all variables except for age.
# Removing the age variable from the previous code
# so selecting a donor who has the same characteristics as the recipient except for age
hierarchical_impute <- VIM::hotdeck(data = cleaned_income, variable = "occupation",
                                    domain_var = c("income", "education", "education_num", "hours_per_week",
                                                   "sex", "marital_status", "capital_gain", "capital_loss"))
# Display first 6 observations
head(hierarchical_impute)
The imputed value for occupation is "Prof-specialty" because the donor was a female with a small income and a Masters degree (education_num of 14), never married and working 40 hours per week with 0 capital gain and loss, which matches the recipient.
We can also filter the data to see how many donors there are, as shown below:
# Filter rows that have the same values on all variables except age
cleaned_income %>%
  dplyr::filter(income == "small" & education == "Masters" & education_num == 14 &
                hours_per_week == 40 & sex == "Female" & marital_status == "Never-married" &
                capital_gain == 0 & capital_loss == 0)
So there is one donor with the same characteristics as the recipient except for age.
Exercise
- Carry out hierarchical hot deck imputation to impute the missing value in the hours_per_week variable for the 4th person in the cleaned_income data using the matching variables income, education, education_num, occupation, marital_status and sex.
- Carry out hierarchical hot deck imputation to impute the missing value in the hours_per_week variable for the 4th person in the cleaned_income data using the matching variables income, education, education_num, occupation, marital_status and sex.
# Using cleaned_income data
# specifying all matching variables in domain_var
<- VIM::hotdeck(data = cleaned_income, variable = "hours_per_week",
hierarchical_example domain_var = c("income","education","education_num","occupation","marital_status",
"sex"))
# Display first 6 observations
head(hierarchical_example)
The imputed value for hours_per_week for the recipient can be either 45 or 30 because the donors are men with a small income, a Bachelors degree with education_num of 13, never married and working in an Adm-clerical occupation, which are the same characteristics as the recipient. We can see how many donors there are by filtering our data as shown below:
# Filter rows that have the same values on all of the matching variables
# select all columns where hours_per_week comes first
cleaned_income %>%
  dplyr::filter(income == "small" & education == "Bachelors" & education_num == 13 &
                occupation == "Adm-clerical" & sex == "Male" & marital_status == "Never-married") %>%
  dplyr::select(hours_per_week, everything())
So as we can see there are two donors and the imputed value can be 45 or 30.
6.4 K-Nearest Neighbours (KNN)
Nearest neighbour donor imputation removes the restriction that the donor and the recipient have identical values for all matching variables. Instead, the matching variables are used to define a distance function between the recipient and a potential donor.
An extension of this concept is k-nearest neighbours imputation, which is based on a variation of the Gower distance for numerical, categorical, ordered and semi-continuous variables. To compute the Gower distance between two items you compare each element and compute a term. If the element is numeric, the term is the absolute value of the difference divided by the range. If the element is non-numeric, the term is 1 if the elements are different and 0 if they are the same. The Gower distance is the average of the terms. Each individual term is between 0.0 and 1.0 inclusive, so the Gower distance is always between 0.0 and 1.0, where a distance of 0.0 means the two items are the same and a distance of 1.0 means the two items are as far apart as possible relative to the source data set.
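To make the calculation concrete, below is a small worked sketch of the Gower distance between two hypothetical records with one numeric variable (age) and two categorical variables (sex and education); the values and the assumed age range are invented for illustration:
# Two hypothetical records: (age = 35, Male, Bachelors) and (age = 40, Female, Bachelors)
# Assume the age variable ranges from 17 to 90 in the data set
age_range <- 90 - 17

# Numeric term: absolute difference divided by the range of the variable
term_age <- abs(35 - 40) / age_range

# Categorical terms: 1 if the values differ, 0 if they are the same
term_sex <- as.numeric("Male" != "Female")
term_education <- as.numeric("Bachelors" != "Bachelors")

# The Gower distance is the average of the terms, so it always lies between 0 and 1
mean(c(term_age, term_sex, term_education))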
# Using KNN function in VIM package
# we can use any variables so we will use a cleaned income data set
# k is the number of nearest neighbour donors used (k = 1 is Nearest Neighbour imputation)
# dist_var takes the variables to be used for distance calculation
# I am choosing the distance variables arbitrarily here; how to choose them is covered in the theoretical course
knn_imputation <- VIM::kNN(data = cleaned_income, variable = "occupation", k = 4,
                           dist_var = c("education", "sex", "income", "marital_status"))
# Display results
head(x = knn_imputation)
The imputed value for occupation is chosen using the 4 closest donors.
Notice: If you are getting an Rcpp package error, then please install the Rcpp package. If you get a warning then that's fine; warning messages are generally fine in R.
Exercise
- Impute the missing value in the hours_per_week variable for the 4th person using K-Nearest Neighbours.
- Impute the missing value in the hours_per_week variable for the 4th person using K-Nearest Neighbours.
# Using KNN function in VIM package
# k is the number of nearest neighbour donors used
knn_imputation_cont <- VIM::kNN(data = cleaned_income, variable = "hours_per_week", k = 7)

# Display results
knn_imputation_cont %>%
  dplyr::select("hours_per_week", everything()) %>%
  head()
The imputed value in the hours_per_week column is chosen using the 7 closest donors.
6.5 Predictive mean matching
Finally, a special case of the nearest neighbour methods is predictive mean matching, in which the nearest neighbour donor is determined using the predicted y-value from a regression model. For each missing value, it imputes a value randomly from a set of observed values whose predicted values are closest to the predicted value for the missing value from the simulated regression model.
The predictive mean matching method is more appropriate than a standard regression approach if the normality assumption is violated. (This method only imputes missing values for continuous data since it is similar to a regression imputation method but with a donor element.) We will now impute the missing values of numerical data in cleaned_income using predictive mean matching.
# Impute data via predictive mean matching (single imputation)
imp_single <- mice::mice(data = cleaned_income, m = 1, method = "pmm")
The maxit argument is a scalar giving the number of iterations; the default is 5. The imputed variables are age and hours_per_week.
# Store imputed data using complete() function
data_imp_single <- mice::complete(imp_single)
# Display the results
head(data_imp_single)
We can also conduct multiple imputation by specifying m = 5
within the function mice (i.e. 5 imputed data sets).
# Impute missing values multiple times in hours_per_week column
imp_multi <- mice::mice(cleaned_income, m = 5, method = "pmm")

# Store multiple imputed data
# "repeated" and include = TRUE tells the complete function
# how to store our multiply imputed data in the object data_imp_multi_all
data_imp_multi_all <- mice::complete(imp_multi,
                                     "repeated",
                                     include = TRUE)
# Select all of the hours_per_week columns in this new data set
# Select only the first 5 rows
data_imp_multi_all %>%
  dplyr::select(starts_with("hours")) %>%
  head(n = 5)
- hours_per_week.0 is equal to the original hours_per_week with missing values.
- hours_per_week.1 to hours_per_week.5 are the five imputed versions of hours_per_week.
While hours_per_week.0 contains missing values, hours_per_week.1 to hours_per_week.5 are filled with imputed values. Note that the imputed values differ between the five imputed data sets.
Now we will check the goodness of fit, i.e. how well the values have been imputed. The xyplot() function helps us verify our imputations.
# Using xplot to create scatter plot where the x-axis is hours_per_week and y-axis is age
lattice::xyplot(imp_multi, age ~ hours_per_week | .imp, pch = 20, cex = 1.4)
The first row and column show the data of age.0 and hours_per_week.0. The blue points are the observed data. The second column of the first row shows the data of age.1 and hours_per_week.1, and so on. The red points are the imputed data (that is why the first plot does not have any red points: those columns still contain the missing values that haven't been imputed). To check the goodness of fit, ideally the red points should be similar to the blue points, so that the imputed values are similar to the observed values, which is true in our case.
A useful feature of the mice()
function is the ability to specify the set of predictors (matching variable) to be used for each incomplete variable. The basic specification is made through the predictorMatrix
argument, which is a square matrix of size ncol(data)
containing 0/1 data.
# Use mice() function within mice package
impute <- mice::mice(cleaned_income, print = FALSE)

# mice() returns a mids object containing a predictorMatrix entry
impute$predictorMatrix
The rows correspond to incomplete variables, in the sequence as they appear in the data. A value of 1 indicates that the column variable is a predictor (matching variable) used to impute the incomplete row variable, and a 0 means that it is not used. Thus, in the above example, age is predicted from education_num, capital_gain, capital_loss and hours_per_week. Note that the diagonal is 0 since a variable cannot predict itself.
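As a minimal sketch of how this matrix can be customised (the choice of which predictor to switch off is arbitrary and purely illustrative), we can take the matrix returned above, set an entry to 0 and pass the modified matrix back to mice():
# Copy the default predictor matrix
pred <- impute$predictorMatrix

# Stop education_num being used as a predictor (matching variable) when imputing age
pred["age", "education_num"] <- 0

# Re-run the imputation with the customised predictor matrix
custom_impute <- mice::mice(cleaned_income, predictorMatrix = pred, print = FALSE)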
Exercise
- Do a single imputation to impute the missing values of continuous data in the cleaned_income data using the predictive mean matching method.
- Find which rows in the age column have been imputed and their imputed values.
- Do a single imputation to impute the missing values of continuous data in the cleaned_income data using the predictive mean matching method.
# Impute data via predictive mean matching (single imputation)
pmm_impute <- mice::mice(data = cleaned_income, m = 1, method = "pmm")
- Find which rows in the age column have been imputed and their imputed values.
# Display the results
pmm_impute$imp$age
The imputed value in the age
column is shown above for each row with the missing value.
7 Summary
By now you should be able to:
- Apply methods of reviewing data such as identifying missing values, visualising the pattern of missingness and finding duplicates in data in R.
- Do automatic editing where you can apply restrictions, check which restrictions have been violated and correct them.
- Impute missing values using model-based methods such as mean, ratio and regression imputation in R.
- Impute missing values using donor-based imputation such as random hot deck, sequential hot deck and hierarchical hot deck in R.
- Impute missing values using KNN and predictive mean matching imputation in R.