sdg_region_name | 2017 | 2018 | 2019 | 2020 |
---|---|---|---|---|
Australia/New Zealand | 10.53 | 10.63 | 10.72 | 10.82 |
Central and Southern Asia | 317.54 | 324.57 | 330.63 | 335.32 |
Eastern and South-Eastern Asia | 2065.47 | 2089.43 | 2112.67 | 2135.78 |
Europe and Northern America | 756.12 | 760.73 | 764.93 | 768.5 |
Latin America and the Caribbean | 197.25 | 198.43 | 199.62 | 200.85 |
Northern Africa and Western Asia | 222.42 | 228.59 | 234.38 | 239.48 |
Oceania (excluding Australia and New Zealand) | 142.01 | 143.12 | 144.2 | 145.28 |
Sub-Saharan Africa | 121.3 | 123.91 | 126.55 | 129.21 |
Case Study Tasks - Python
Our analysis pipeline for the average population density is now nearly complete.
So far we have been working with a single data set. Now, however, we have been given access to the population density values for a range of years, held in separate CSV files.
You now need to extend and improve the code and files to meet the new requirements. The code to change can be found in example_code/case_study/initial/.
This is not the exact methodology for calculating the mean population density in a region; however, by completing this case study you will gain experience building a pipeline for analysis.
1 Tasks
The tasks below are intended to reinforce the content covered in this course.
Tasks 1-4 are similar to exercises already covered.
Tasks 5 and 6 will require more thought and more work with data frames in your chosen language. You may need to do some research in your chosen language/package to complete them.
Answers to all of the tasks are given below, and the full code is contained within example_code/case_study/answers/.
1.1 Task 1
In order to be able to perform the analysis across different years, we need to access different files:

- add new parameters to the function `get_analyse_output()` in `main.py`
- the first parameter should be the file path to the population density data CSV: `pop_density_filepath`
- the second parameter should be the file path to the `locations.csv` data: `location_filepath`

Test that your refactoring of the code produces the same result by running the program with the original data sets.
Test that your refactoring of the code works for other data sets by using the 2018 data.
1.2 Task 2
As we build analysis on top of the code we have already written, `get_analyse_output()` will no longer be the highest-level function:

- change the `get_analyse_output()` function to return the calculated final data frame `formatted_statistic`
- change the function so it takes a parameter `output_filepath` giving the output path of the resulting data frame
- if the value given is `False`/`FALSE`, do not write the data frame out to a file
- if the value given is a file path, write the data frame out to that location

Test this new parameter by running it with a file path and with `False`/`FALSE`.
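The optional-output pattern asked for here can be sketched as follows. This is a minimal illustration using a hypothetical `write_if_path()` helper, not the case study's own function:

```python
import pandas as pd

def write_if_path(dataframe, output_filepath):
    """Write the frame to CSV only when a real path is given; always return it."""
    # Passing False skips the write, since False is falsy in Python
    if output_filepath:
        dataframe.to_csv(output_filepath, index=False)
    return dataframe

frame = pd.DataFrame({"sdg_region_name": ["Sub-Saharan Africa"],
                      "mean_population_density": [121.3]})
result = write_if_path(frame, output_filepath=False)  # nothing is written
```

The same truthiness test works in R, where `FALSE` fails an `if` condition while a non-empty path string can be checked explicitly.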
1.3 Task 3
We will be joining the data at the end of our process, which means we will be performing multiple joins in the analysis:

- create a new file in the directory called `joins.py`
- move the function `join_frames()` from `manipulation.py` to `joins.py`

Ensure your code still works at this stage.
1.4 Task 4
If we are going to combine all our data, we will need to change how it is represented:

- write a function that changes the name of a given column to a different string; call it `column_name_year()` and put it in `manipulation.py`
- the first parameter is the data frame to be changed: `dataframe`
- the second is the column name to be changed: `original_column`
- the third is the new column name: `new_column`

Test this function on a data frame.
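In pandas this rename is a single call; a minimal sketch, where the frame below is made up for illustration:

```python
import pandas as pd

def column_name_year(dataframe, original_column, new_column):
    """Return a copy of the frame with one column renamed."""
    # rename() returns a new frame; the input frame is left untouched
    return dataframe.rename(columns={original_column: new_column})

df = pd.DataFrame({"sdg_region_name": ["Oceania"],
                   "mean_population_density": [142.01]})
renamed = column_name_year(df, "mean_population_density", "2017")
```

In R, `dplyr::rename()` offers the same behaviour of returning a modified copy.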
1.5 Task 5
Once we have created a new data frame for each year, we will need to combine the data for each region into one data frame:

- write a new function `join_years()` in `joins.py`
- this function will perform a join on all the frames given to it, on the same column name
- the function takes as a parameter `dataframes`, a list containing the data frames to be joined
- the function takes as a parameter `join_column`, the name of the column all frames are to be joined on
- the function performs an inner join across all the data frames on the given column and returns this complete data frame of multiple years
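One common way to join a whole list of frames is to fold it with repeated pairwise merges; a sketch using made-up two-row frames:

```python
import pandas as pd

def join_years(dataframes, join_column):
    """Inner-join a list of frames, one at a time, on the shared column."""
    combined = dataframes[0]
    for frame in dataframes[1:]:
        combined = combined.merge(frame, how="inner", on=join_column)
    return combined

df_2017 = pd.DataFrame({"sdg_region_name": ["A", "B"], "2017": [1.0, 2.0]})
df_2018 = pd.DataFrame({"sdg_region_name": ["A", "B"], "2018": [1.5, 2.5]})
combined = join_years([df_2017, df_2018], join_column="sdg_region_name")
```

Because each merge is an inner join, only regions present in every year's frame survive to the final result.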
1.6 Task 6
We need to bring all our new functions together to perform the final analysis:

- write a new function `combined_analysis()` in `main.py`
- the function should take a list of population density file paths to the data you want to analyse
- the function should take as a parameter the location data file path
- the function should take as a parameter `output_filepath`, to designate where to output the final analysis
- the function should call `get_analyse_output()` for each file path, without writing out the data to file; name the frame appropriately
- `column_name_year()` should be used on each data frame produced to change the `mean_population_density` column to the year of the frame; you can access the year of each data set from each file path (you may want to split the path string up)
- within `combined_analysis()`, using `join_years()` from `joins.py`, join the results of each `get_analyse_output()` together
- sort the output
- using `write_output()` from `input_output.py`, write out the data frame to `output_filepath` if the value is not `False`/`FALSE`
- the function should return the final data frame

Be sure to run the whole process in one command to check that it produces the output expected.
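For the year-extraction step, one approach that matches the file-naming pattern used in this course (e.g. `population_density_2017.csv`) is to split on underscores and keep the first four characters of the last piece:

```python
# Split on underscores, take the last piece ("2017.csv"), keep the first 4 chars
path = "../../../data/population_density_2017.csv"
year = path.split("_")[-1][:4]
```

This relies on the year always directly preceding the `.csv` extension; a different naming scheme would need a different split.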
1.7 Example Answer
1.7.1 Output Example
Below are the files changed during the case study.
`main.py`:

```python
import pandas as pd

# Import our required modules
import input_output
import analysis
import manipulation
import joins


def get_analyse_output(pop_density_filepath, location_filepath, output_filepath):
    """
    Access the data, run the analysis of population density means over locations,
    output the data into a csv.
    """
    pop_density = input_output.load_formatted_frame(pop_density_filepath)
    locations = input_output.load_formatted_frame(location_filepath)

    # Added module names below
    pop_density_single_code = manipulation.access_country_code(pop_density)

    pop_density_correct_types = manipulation.convert_type_to_int(
        dataframe=pop_density_single_code,
        column_name="country_code",
        string_value="CC")

    locations_correct_types = manipulation.convert_type_to_int(
        dataframe=locations,
        column_name="location_id",
        string_value='"')

    population_location = joins.join_frames(
        left_dataframe=pop_density_correct_types,
        right_dataframe=locations_correct_types,
        left_column="country_code",
        right_column="location_id")

    # Added module name here
    aggregation = analysis.aggregate_mean(
        dataframe=population_location,
        groupby_column="sdg_region_name",
        aggregate_column="population_density")

    formatted_statistic = analysis.format_frame(aggregation, "mean_population_density")

    # output_filepath is truthy for any file path, falsy for False
    if output_filepath:
        input_output.write_output(formatted_statistic, output_filepath=output_filepath)

    return formatted_statistic


def combined_analysis(population_filepaths, location_filepath, output_filepath):
    """Function to perform population density mean analysis across multiple files"""
    loaded_dataframes = []
    for population_file in population_filepaths:
        # The year is given at the end of the file path, but before '.csv'
        path_end = population_file.split("_")[-1]
        year = path_end[:4]

        year_analysis = get_analyse_output(population_file, location_filepath,
                                           output_filepath=False)

        # Change the column name to the year of the population density
        formatted_year_analysis = manipulation.column_name_year(
            year_analysis, "mean_population_density", year)

        loaded_dataframes.append(formatted_year_analysis)

    combined_dataframes = joins.join_years(loaded_dataframes,
                                           join_column="sdg_region_name")

    if output_filepath:
        input_output.write_output(combined_dataframes, output_filepath=output_filepath)

    return combined_dataframes


if __name__ == "__main__":
    pop_path_2017 = "../../../data/population_density_2017.csv"
    pop_path_2018 = "../../../data/population_density_2018.csv"
    pop_path_2019 = "../../../data/population_density_2019.csv"
    pop_path_2020 = "../../../data/population_density_2020.csv"

    location_path = "../../../data/locations.csv"

    # Demonstration of final output for case study
    print(combined_analysis([pop_path_2017, pop_path_2018, pop_path_2019, pop_path_2020],
                            location_path,
                            output_filepath=False))
```
`manipulation.py`:

```python
import pandas as pd


def access_country_code(dataframe):
    """Function to split combined code columns and remove unnecessary columns"""
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                 .str.split("_", expand=True))

    dataframe_dropped = dataframe.drop(labels=[
        "country_and_parent_code",
        "parent_code"
    ], axis=1)
    return dataframe_dropped


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    dataframe[column_name] = dataframe[column_name].astype(int)
    return dataframe


def column_name_year(dataframe, original_column, new_column):
    """Function to change the name of a specified column in a dataframe"""
    dataframe_new_name = dataframe.rename(columns={original_column: new_column})
    return dataframe_new_name
```
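To see what `convert_type_to_int()` is doing, here are the same two pandas steps applied directly to a made-up frame of `"CC"`-prefixed code strings:

```python
import pandas as pd

# Hypothetical country codes stored as strings with a "CC" prefix
df = pd.DataFrame({"country_code": ["CC36", "CC554"]})

# Step 1: strip the prefix; step 2: cast the remaining digits to int
df["country_code"] = df["country_code"].str.replace("CC", "")
df["country_code"] = df["country_code"].astype(int)
```

The `.astype(int)` cast only succeeds once every value in the column is purely numeric, which is why the prefix must be removed first.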
`joins.py`:

```python
import pandas as pd


def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """
    Function to join the required frames on specified columns, dropping the
    unnecessary column
    """
    combined_frames = left_dataframe.merge(right=right_dataframe,
                                           how="left",
                                           left_on=left_column,
                                           right_on=right_column)

    combined_frames_reduced = combined_frames.drop(labels=[right_column], axis=1)

    return combined_frames_reduced


def join_years(dataframes, join_column):
    """Function to join a list of frames with an inner join on a specified column name"""
    # Starting with the first frame, iterate over the remaining frames,
    # joining each onto the running result
    temp_frame = dataframes[0]

    for frame in dataframes[1:]:
        temp_frame = temp_frame.merge(frame, how="inner", on=join_column)

    return temp_frame
```
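A quick usage sketch for `join_frames()`, with the function restated so the example is self-contained and with made-up miniature frames (the codes 36 and 554 follow the real data's pattern, but the rows here are illustrative only):

```python
import pandas as pd

def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """Left-join two frames on differently named key columns, dropping the duplicate key."""
    combined = left_dataframe.merge(right=right_dataframe, how="left",
                                    left_on=left_column, right_on=right_column)
    # After the merge both key columns are present; keep only the left one
    return combined.drop(labels=[right_column], axis=1)

pop = pd.DataFrame({"country_code": [36, 554],
                    "population_density": [3.3, 18.2]})
locs = pd.DataFrame({"location_id": [36, 554],
                     "sdg_region_name": ["Australia/New Zealand"] * 2})
joined = join_frames(pop, locs, "country_code", "location_id")
```

Dropping `right_column` avoids carrying two copies of the same key through the rest of the pipeline.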