Warning: package 'dplyr' was built under R version 4.4.2
Warning: package 'readr' was built under R version 4.4.2
Warning: package 'kableExtra' was built under R version 4.4.2
Warning: package 'dplyr' was built under R version 4.4.2
Warning: package 'readr' was built under R version 4.4.2
Warning: package 'kableExtra' was built under R version 4.4.2
Our analysis pipeline for average population density is near complete now.
So far, we have been working on one single data set. Now however we have been given access to the population density values for a range of years in different CSV files.
You now need to add to and improve the code and files to meet new requirements. The code to change can be found in example_code/case_study/initial/
.
This is not the exact methodology for calculating the mean population density in a region - however by completing this case study you will gain experience building a pipeline for analysis.
The below tasks are intended to reinforce the content covered in this course.
Tasks 1-4 are similar to exercises already covered.
Tasks 5 and 6 will require more thought and working with data frames and your chosen languages. You may need to do some research in your chosen language/package to complete them.
Answers to all of the tasks combined are given below, and the full code is contained within example_code/case_study/answers/
.
In order to be able to perform the analysis across different years, we need to access different files:
get_analyse_output()
in main.R
pop_density_filepath
locations.csv
data: location_filepath
Test that your refactoring of the code produces the same result by running the program with the original data sets.
Test that your refactoring of the code works for other data sets by using the 2018
data.
As we build analysis on top of the code we have already written, the get_analyse_output()
will not be the highest level function:
get_analyse_output()
function to return the calculated final data frame formatted_statistic
.output_filepath
of the output path of the reslting data frame.
False/FALSE
do not write out the data frame to a fileTest this new parameter by running it with a file path and with False/FALSE
.
We will be joining the data at the end of our process, this means we will be performing multiple joins in the analysis:
joins.R
join_frames()
from manipulation.R
to join.R
Ensure your code still works at this stage.
If we are going to combine all our data, we will need to change how it is represented:
column_name_year()
and put it in manipulation.R
dataframe
original_column
new_column
Test this function on a data frame.
Once we have created a new data frame for each year we will need to combine the data for each region into one data frame:
join_years()
in joins.R
dataframes
, a list containing the data frames to be joinedjoin_column
We need to bring all our new functions together to perform the final analysis:
combined_analysis()
in main.R
output_filepath
to designate where to output the final analysisget_analyse_output()
for each file path, without writing out the data to file; name the frame appropriately
column_name_year()
should be used on each data frame produced to change the mean_population_density
column to the year of the frame; you can access the year of each data set from each filepath, you may want to split the path string upcombined_analysis()
using join_years()
from joins.R
join the results of each get_analyse_output()
togetherwrite_output()
from input_output.R
write out the data frame to output_filepath
if the value is not False/FALSE
Be sure to run the whole process in one command to check that it produces the output expected.
sdg_region_name | 2017 | 2018 | 2019 | 2020 |
---|---|---|---|---|
Australia/New Zealand | 10.53 | 10.63 | 10.72 | 10.82 |
Central and Southern Asia | 317.54 | 324.57 | 330.63 | 335.32 |
Eastern and South-Eastern Asia | 2065.47 | 2089.43 | 2112.67 | 2135.78 |
Europe and Northern America | 756.12 | 760.73 | 764.93 | 768.50 |
Latin America and the Caribbean | 197.25 | 198.43 | 199.62 | 200.85 |
Northern Africa and Western Asia | 222.42 | 228.59 | 234.38 | 239.48 |
Oceania (excluding Australia and New Zealand) | 142.01 | 143.12 | 144.20 | 145.28 |
Sub-Saharan Africa | 121.30 | 123.91 | 126.55 | 129.21 |
Below are the files changed during the case study.
library(tidyr)
library(dplyr)
library(stringr)
library(readr)
source("input_output.r")
source("analysis.r")
source("manipulation.r")
source("joins.r")
#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
<- function(pop_density_filepath, location_filepath, output_filepath) {
get_analyse_output
<- load_formatted_frame(pop_density_filepath)
pop_density <- load_formatted_frame(location_filepath)
locations
<- access_country_code(pop_density)
pop_density_single_code
<- convert_type_to_int(dataframe = pop_density_single_code,
pop_density_correct_types column_name = "country_code",
string_value = "CC")
<- convert_type_to_int(dataframe = locations,
locations_correct_types column_name = "location_id",
string_value = '"')
<- join_frames(pop_density_correct_types,
population_location
locations_correct_types,left_column = "country_code",
right_column = "location_id")
<- aggregate_mean(dataframe = population_location,
aggreagation groupby_column = "sdg_region_name",
statistic_column = "population_density")
<- format_frame(aggreagation, "mean_population_density")
formatted_statistic
# output_filepath == True if not False
if (output_filepath == TRUE) {
write_output(formatted_statistic, output_filepath=output_filepath)
}
return(formatted_statistic)
}
#' Perform population density mean analysis across multiple files
<- function(population_filepaths, location_filepath, output_filepath) {
combined_analysis
= list()
loaded_dataframes for (population_file in population_filepaths) {
# The year is given at the end of the file path, but before '.csv'
= stringr::str_split(population_file, pattern = "_")
path_broken_up = dplyr::last(dplyr::last(path_broken_up))
path_end
= substr(path_end, start = 1, stop = 4)
year
= get_analyse_output(population_file, location_filepath, output_filepath=FALSE)
year_analysis
# Change the column name to the year of the population density
= column_name_year(year_analysis, "mean_population_density", year)
formatted_year_analysis
length(loaded_dataframes) + 1]] <- formatted_year_analysis
loaded_dataframes[[
}
= join_years(loaded_dataframes, join_column="sdg_region_name")
combined_dataframes
if (output_filepath != FALSE) {
write_output(combined_dataframes, output_filepath=output_filepath)
}
return(combined_dataframes)
}
<- "../../../data/population_density_2017.csv"
pop_path_2017 <- "../../../data/population_density_2018.csv"
pop_path_2018 <- "../../../data/population_density_2019.csv"
pop_path_2019 <- "../../../data/population_density_2020.csv"
pop_path_2020
<- "../../../data/locations.csv"
location_path
# Demonstration of final output for case study
<- combined_analysis(list(pop_path_2017, pop_path_2018, pop_path_2019, pop_path_2020),
final_output
location_path, output_filepath = FALSE)
print(final_output)
#' Function to split combined code columns
#' and remove uncessary columns
<- function(dataframe) {
access_country_code # The country_and_parent_code column needs to
# be split into two columns without the strings
<- tidyr::separate(data = dataframe, col = country_and_parent_code,
dataframe into = c("country_code", "parent_code"), sep = "_")
# Remove the parent_code column, not used in later analysis
<- dplyr::select(dataframe, everything(), -parent_code)
dataframe
return(dataframe)
}
#' Function to convert string to integer column type
<- function(dataframe, column_name, string_value) {
convert_type_to_int
# Convert country_code to integer by removing extra strings
# Using dataframe$column_name to get a column won't work when the column name is a variable
<- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
dataframe[column_name]
# Convert type
<- as.integer(dataframe[[column_name]])
dataframe[column_name]
return(dataframe)
}
#' Change name of specified columns in dataframe
<- function(dataframe, original_column, new_column) {
column_name_year
colnames(dataframe)[colnames(dataframe) == original_column] <- new_column
return(dataframe)
}
#' Join the required frames on specified columns,
#' dropping unecessary columns
<- function(left_dataframe, right_dataframe, left_column, right_column) {
join_frames
# Change location_id to be called country_code for join
colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
<- dplyr::left_join(x = left_dataframe,
combined_frames y = right_dataframe,
by = left_column)
<- dplyr::select(combined_frames, sdg_region_name, population_density)
combined_frames_reduced
return(combined_frames_reduced)
}
#' Join a list of frames with an inner join on a specified column name
<- function(dataframes, join_column) {
join_years
<- dataframes %>% purrr::reduce(.x = dataframes,
merged_frame .f = dplyr::inner_join,
by = join_column)
return(merged_frame)
}
Continue on to Summary & Further Resources