Case Study Tasks - Python

Author

ONS Data Science Campus

Our analysis pipeline for average population density is now nearly complete.

So far we have been working with a single data set. Now, however, we have been given access to the population density values for a range of years, held in separate CSV files.

You now need to add to and improve the code and files to meet new requirements. The code to change can be found in example_code/case_study/initial/.

This is not the exact methodology used to calculate the mean population density of a region; however, by completing this case study you will gain experience of building an analysis pipeline.

1 Tasks

The tasks below are intended to reinforce the content covered in this course.

Tasks 1-4 are similar to exercises already covered.

Tasks 5 and 6 will require more thought and more work with data frames in your chosen language. You may need to do some research into your chosen language/package to complete them.

The combined answers to all of the tasks are given below, and the full code is contained within example_code/case_study/answers/.

1.1 Task 1

In order to be able to perform the analysis across different years, we need to access different files:

  • add new parameters to the function get_analyse_output() in main.py
  • the first parameter should be the file path to the population density data CSV: pop_density_filepath
  • the second parameter should be the file path to the locations.csv data: location_filepath

Test that your refactoring of the code produces the same result by running the program with the original data sets.

Test that your refactoring of the code works for other data sets by using the 2018 data.
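
For example, after the refactor a quick check from main.py might look like the following sketch. The file paths match those used in the example answer; any other arguments your function currently takes at this stage are left as they are.

# Sketch: re-run with the original 2017 data to confirm the result is unchanged,
# then run the 2018 data through the same pipeline.
# (Plus any arguments the function already takes for its output step.)
get_analyse_output("../../../data/population_density_2017.csv",
                   "../../../data/locations.csv")
get_analyse_output("../../../data/population_density_2018.csv",
                   "../../../data/locations.csv")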

1.2 Task 2

As we build further analysis on top of the code we have already written, get_analyse_output() will no longer be the highest-level function:

  • change the get_analyse_output() function to return the calculated final data frame formatted_statistic.
  • change the function so it takes a parameter output_filepath for the output path of the resulting data frame.
    • if the value given is False/FALSE, do not write out the data frame to a file

Test this new parameter by running it with a file path and with False/FALSE.
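
A quick check might look like this sketch; the output file name shown here is only an illustration.

# Sketch: run once writing the result out, and once returning it only.
result = get_analyse_output("../../../data/population_density_2017.csv",
                            "../../../data/locations.csv",
                            output_filepath="analysis_2017.csv")  # illustrative path

result = get_analyse_output("../../../data/population_density_2017.csv",
                            "../../../data/locations.csv",
                            output_filepath=False)  # nothing written to file
print(result)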

1.3 Task 3

We will be joining the data at the end of our process, which means we will be performing multiple joins in the analysis:

  • create a new file in the directory called joins.py
  • move the function join_frames() from manipulation.py to joins.py

Ensure your code still works at this stage.
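
After the move, the remaining change is to import the new module wherever join_frames() is called, for example in main.py (a sketch based on the example answer):

# main.py now imports the new module...
import joins

# ...and the existing call points at joins rather than manipulation
population_location = joins.join_frames(left_dataframe=pop_density_correct_types,
                                        right_dataframe=locations_correct_types,
                                        left_column="country_code",
                                        right_column="location_id")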

1.4 Task 4

If we are going to combine all our data, we will need to change how it is represented:

  • write a function that changes the name of a given column to a different string; call it column_name_year() and put it in manipulation.py
    • the first parameter is the data frame to be changed dataframe
    • the second is the column name to be changed original_column
    • the third is the new column name new_column

Test this function on a data frame.
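
For example, a quick test on a small throwaway data frame might look like this sketch (pandas' rename() is one way to implement the function, as in the example answer):

import pandas as pd

import manipulation

# Small throwaway frame to check the renaming behaviour
test_frame = pd.DataFrame({"sdg_region_name": ["Sub-Saharan Africa"],
                           "mean_population_density": [121.3]})

renamed = manipulation.column_name_year(test_frame,
                                        original_column="mean_population_density",
                                        new_column="2017")

# The column should now be named "2017" rather than "mean_population_density"
print(renamed.columns)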

1.5 Task 5

Once we have created a new data frame for each year we will need to combine the data for each region into one data frame:

  • write a new function join_years() in joins.py
    • this function will perform a join on all the frames given to it on the same column name
    • the function takes as a parameter dataframes, a list containing the data frames to be joined
    • the function takes as a parameter the name of the column all frames are to be joined on join_column
    • the function performs an inner join across all the data frames on the given column and returns the complete data frame covering multiple years (a quick check is sketched below)
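
As a quick check, join_years() can be tried on a couple of small frames that share the join column; the values below are taken from the example output purely for illustration.

import pandas as pd

import joins

frame_2017 = pd.DataFrame({"sdg_region_name": ["Sub-Saharan Africa", "Europe and Northern America"],
                           "2017": [121.3, 756.12]})
frame_2018 = pd.DataFrame({"sdg_region_name": ["Sub-Saharan Africa", "Europe and Northern America"],
                           "2018": [123.91, 760.73]})

# Inner join on the shared region column gives one row per region,
# with a column for each year.
combined = joins.join_years([frame_2017, frame_2018], join_column="sdg_region_name")
print(combined)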

1.6 Task 6

We need to bring all our new functions together to perform the final analysis:

  • write a new function combined_analysis() in main.py
    • the function should take a list of population density filepaths to data you want to analyse
    • the function should take as a parameter the location data filepath
    • the function should take as a parameter output_filepath to designate where to output the final analysis
    • the function should call get_analyse_output() for each file path, without writing the data out to a file; name the frame appropriately
      • column_name_year() should be used on each data frame produced to change the mean_population_density column to the year of the frame; you can access the year of each data set from its filepath, so you may want to split the path string up
    • within combined_analysis(), use join_years() from joins.py to join the results of each get_analyse_output() call together
    • sort the output data frame
    • using write_output() from input_output.py, write out the data frame to output_filepath if the value is not False/FALSE
  • the function should return the final data frame

Be sure to run the whole process in one command to check that it produces the output expected.
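
For example, the single call from main.py might look like this sketch; the data paths are those used in the example answer, and the output path shown is only an illustration.

population_paths = ["../../../data/population_density_2017.csv",
                    "../../../data/population_density_2018.csv",
                    "../../../data/population_density_2019.csv",
                    "../../../data/population_density_2020.csv"]

final_frame = combined_analysis(population_paths,
                                "../../../data/locations.csv",
                                output_filepath="combined_analysis.csv")  # illustrative path
print(final_frame)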

1.7 Example Answer

1.7.1 Output Example

sdg_region_name                                    2017     2018     2019     2020
Australia/New Zealand                             10.53    10.63    10.72    10.82
Central and Southern Asia                        317.54   324.57   330.63   335.32
Eastern and South-Eastern Asia                  2065.47  2089.43  2112.67  2135.78
Europe and Northern America                      756.12   760.73   764.93    768.5
Latin America and the Caribbean                  197.25   198.43   199.62   200.85
Northern Africa and Western Asia                 222.42   228.59   234.38   239.48
Oceania (excluding Australia and New Zealand)    142.01   143.12    144.2   145.28
Sub-Saharan Africa                                121.3   123.91   126.55   129.21

Below are the files changed during the case study: main.py, manipulation.py and joins.py.

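# main.py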
import pandas as pd

# Import our required modules
import input_output
import analysis
import manipulation
import joins




def get_analyse_output(pop_density_filepath, location_filepath, output_filepath):
    """
    Access the data, run the analysis of population density means over locations,
    output the data into a csv.
    """

    pop_density = input_output.load_formatted_frame(pop_density_filepath)
    locations = input_output.load_formatted_frame(location_filepath)
    
    
    # Added module names below
    pop_density_single_code = manipulation.access_country_code(pop_density)
    
    
    pop_density_correct_types = manipulation.convert_type_to_int(dataframe=pop_density_single_code,
                                                                 column_name="country_code",
                                                                 string_value="CC")
                                                    
    locations_correct_types = manipulation.convert_type_to_int(dataframe=locations,
                                                               column_name="location_id",
                                                               string_value='"')
    
    population_location = joins.join_frames(left_dataframe=pop_density_correct_types,
                                            right_dataframe=locations_correct_types,
                                            left_column="country_code",
                                            right_column="location_id")
    
    # Added module name here
    aggregation = analysis.aggregate_mean(dataframe=population_location,
                                          groupby_column="sdg_region_name",
                                          aggregate_column="population_density")

    formatted_statistic = analysis.format_frame(aggregation, "mean_population_density")
    
    # Only write the output to a file when an output_filepath has been given
    if output_filepath:
        input_output.write_output(formatted_statistic, output_filepath=output_filepath)
    
    return formatted_statistic
    
    
    
def combined_analysis(population_filepaths, location_filepath, output_filepath):
    """Function to perform population density mean analysis across multiple files"""
    
    loaded_dataframes = []
    for population_file in population_filepaths:
        # The year is given at the end of the file path, but before '.csv'
        path_end = population_file.split("_")[-1]
        year = path_end[:4]
        
        year_analysis = get_analyse_output(population_file, location_filepath, output_filepath=False)
        
        # Change the column name to the year of the population density
        formatted_year_analysis = manipulation.column_name_year(year_analysis, "mean_population_density", year)

        loaded_dataframes.append(formatted_year_analysis)
        
    combined_dataframes = joins.join_years(loaded_dataframes, join_column="sdg_region_name")
    
    if output_filepath:
        input_output.write_output(combined_dataframes, output_filepath=output_filepath)
    
    return combined_dataframes
    
    
    
    
if __name__ == "__main__":
    pop_path_2017 = "../../../data/population_density_2017.csv"
    pop_path_2018 = "../../../data/population_density_2018.csv"
    pop_path_2019 = "../../../data/population_density_2019.csv"
    pop_path_2020 = "../../../data/population_density_2020.csv"

    location_path = "../../../data/locations.csv"

    # Demonstration of final output for case study
    print(combined_analysis([pop_path_2017, pop_path_2018, pop_path_2019, pop_path_2020], 
                            location_path, 
                            output_filepath=False))
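

# manipulation.py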
import pandas as pd

def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                "country_and_parent_code",
                                                "parent_code"
                                              ], axis=1)
    return dataframe_dropped  
    


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    
    dataframe[column_name] = dataframe[column_name].astype(int)
    
    return dataframe



def column_name_year(dataframe, original_column, new_column):
    """Function to change name of specified columns in dataframe"""
    
    dataframe_new_name = dataframe.rename(columns={original_column: new_column})
    
    return dataframe_new_name

    
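# joins.py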
import pandas as pd


def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """
    Function to join the required frames on specified columns, dropping
    unrecessary column
    """
    combined_frames = left_dataframe.merge(right=right_dataframe,
                                           how="left",
                                           left_on=left_column,
                                           right_on=right_column)
                                           
    combined_frames_reduced = combined_frames.drop(labels=[right_column], axis=1)
    
    return combined_frames_reduced
    
    
    
def join_years(dataframes, join_column):
    """Function to join a list of frames with an inner join on a specified column name"""
    
    # Starting with first frame iterate over each other frame and join to initial
    temp_frame = dataframes[0]
    
    for frame in dataframes[1:]:
        temp_frame = temp_frame.merge(frame, how="inner", on=join_column)
    
  
    return temp_frame