Case Study Tasks - R

Author

ONS Data Science Campus

Warning: package 'dplyr' was built under R version 4.4.2
Warning: package 'readr' was built under R version 4.4.2
Warning: package 'kableExtra' was built under R version 4.4.2

Our analysis pipeline for average population density is near complete now.

So far, we have been working on one single data set. Now however we have been given access to the population density values for a range of years in different CSV files.

You now need to add to and improve the code and files to meet new requirements. The code to change can be found in example_code/case_study/initial/.

This is not the exact methodology for calculating the mean population density in a region - however by completing this case study you will gain experience building a pipeline for analysis.

1 Tasks

The below tasks are intended to reinforce the content covered in this course.

Tasks 1-4 are similar to exercises already covered.

Tasks 5 and 6 will require more thought and working with data frames and your chosen languages. You may need to do some research in your chosen language/package to complete them.

Answers to all of the tasks combined are given below, and the full code is contained within example_code/case_study/answers/.

1.1 Task 1

In order to be able to perform the analysis across different years, we need to access different files:

  • add new parameters to the function get_analyse_output() in main.R
  • the first parameter should be the file path to the population density data CSV: pop_density_filepath
  • the second parameter should be the file path to the locations.csv data: location_filepath

Test that your refactoring of the code produces the same result by running the program with the original data sets.

Test that your refactoring of the code works for other data sets by using the 2018 data.

1.2 Task 2

As we build analysis on top of the code we have already written, the get_analyse_output() will not be the highest level function:

  • change the get_analyse_output() function to return the calculated final data frame formatted_statistic.
  • change the function so it takes a parameter output_filepath of the output path of the reslting data frame.
    • if this value given is False/FALSE do not write out the data frame to a file

Test this new parameter by running it with a file path and with False/FALSE.

1.3 Task 3

We will be joining the data at the end of our process, this means we will be performing multiple joins in the analysis:

  • create a new file in the directory called joins.R
  • move the function join_frames() from manipulation.R to join.R

Ensure your code still works at this stage.

1.4 Task 4

If we are going to combine all our data, we will need to change how it is represented:

  • write a function that changes a column name of a given column to a different string. Call it column_name_year() and put it in manipulation.R
    • the first parameter is the data frame to be changed dataframe
    • the second is the column name to be changed original_column
    • the third is the new column name new_column

Test this function on a data frame.

1.5 Task 5

Once we have created a new data frame for each year we will need to combine the data for each region into one data frame:

  • write a new function join_years() in joins.R
    • this function will perform a join on all the frames given to it on the same column name
    • the function takes as a parameter dataframes, a list containing the data frames to be joined
    • the function takes as a parameter the name of the column all frames are to be joined on join_column
    • the function performs an inner join on all the data frames together on the given column and returns this complete data frame of multiple years

1.6 Task 6

We need to bring all our new functions together to perform the final analysis:

  • write a new function combined_analysis() in main.R
    • the function should take a list of population density filepaths to data you want to analyse
    • the function should take as a parameter the location data filepath
    • the function should take as a parameter output_filepath to designate where to output the final analysis
    • the function should call get_analyse_output() for each file path, without writing out the data to file; name the frame appropriately
      • column_name_year() should be used on each data frame produced to change the mean_population_density column to the year of the frame; you can access the year of each data set from each filepath, you may want to split the path string up
    • within combined_analysis() using join_years() from joins.R join the results of each get_analyse_output() together
    • sort the our
    • using write_output() from input_output.R write out the data frame to output_filepath if the value is not False/FALSE
  • The function should return the final data frame

Be sure to run the whole process in one command to check that it produces the output expected.

1.7 Example Answer

1.7.1 Output Example

sdg_region_name 2017 2018 2019 2020
Australia/New Zealand 10.53 10.63 10.72 10.82
Central and Southern Asia 317.54 324.57 330.63 335.32
Eastern and South-Eastern Asia 2065.47 2089.43 2112.67 2135.78
Europe and Northern America 756.12 760.73 764.93 768.50
Latin America and the Caribbean 197.25 198.43 199.62 200.85
Northern Africa and Western Asia 222.42 228.59 234.38 239.48
Oceania (excluding Australia and New Zealand) 142.01 143.12 144.20 145.28
Sub-Saharan Africa 121.30 123.91 126.55 129.21

Below are the files changed during the case study.

library(tidyr)
library(dplyr)
library(stringr)
library(readr)

source("input_output.r")
source("analysis.r")
source("manipulation.r")
source("joins.r")




#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function(pop_density_filepath, location_filepath, output_filepath) {
  
  pop_density <- load_formatted_frame(pop_density_filepath)
  locations <- load_formatted_frame(location_filepath)
  
  
  pop_density_single_code <- access_country_code(pop_density)
  
  
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  
  aggreagation <- aggregate_mean(dataframe = population_location,
                                 groupby_column = "sdg_region_name",
                                 statistic_column = "population_density")
  
  formatted_statistic <- format_frame(aggreagation, "mean_population_density")
  

  # output_filepath == True if not False
  if (output_filepath == TRUE) {
    write_output(formatted_statistic, output_filepath=output_filepath)
  }
  
  return(formatted_statistic)
}



#' Perform population density mean analysis across multiple files
combined_analysis <- function(population_filepaths, location_filepath, output_filepath) {
  

  loaded_dataframes = list()
  for (population_file in population_filepaths) {
    # The year is given at the end of the file path, but before '.csv'
    path_broken_up = stringr::str_split(population_file, pattern = "_")
    path_end = dplyr::last(dplyr::last(path_broken_up))
    
    year = substr(path_end, start = 1, stop = 4)

  
    year_analysis = get_analyse_output(population_file, location_filepath, output_filepath=FALSE)
  
    # Change the column name to the year of the population density
    formatted_year_analysis = column_name_year(year_analysis, "mean_population_density", year)
  
    
    loaded_dataframes[[length(loaded_dataframes) + 1]] <- formatted_year_analysis
    
  }

  combined_dataframes = join_years(loaded_dataframes, join_column="sdg_region_name")
  
  if (output_filepath != FALSE) {
    write_output(combined_dataframes, output_filepath=output_filepath)
  }
  
  return(combined_dataframes)
}


pop_path_2017 <- "../../../data/population_density_2017.csv"
pop_path_2018 <- "../../../data/population_density_2018.csv"
pop_path_2019 <- "../../../data/population_density_2019.csv"
pop_path_2020 <- "../../../data/population_density_2020.csv"

location_path <- "../../../data/locations.csv"

# Demonstration of final output for case study
final_output <- combined_analysis(list(pop_path_2017, pop_path_2018, pop_path_2019, pop_path_2020), 
                                  location_path, 
                                  output_filepath = FALSE)
print(final_output)


#' Function to split combined code columns 
#' and remove uncessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the  parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}

#' Change name of specified columns in dataframe
column_name_year <- function(dataframe, original_column, new_column) {

  colnames(dataframe)[colnames(dataframe) == original_column] <- new_column

  return(dataframe)
}


#' Join the required frames on specified columns, 
#' dropping unecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  
  return(combined_frames_reduced)
}

#' Join a list of frames with an inner join on a specified column name
join_years <- function(dataframes, join_column) {

  merged_frame <- dataframes %>% purrr::reduce(.x = dataframes, 
                                                .f = dplyr::inner_join, 
                                                by = join_column)

  return(merged_frame)
}

Continue on to Summary & Further Resources