Modular Programming in Python

Author

ONS Data Science Campus

1 Components of Modular Programming

Functions, modules and packages structure programs, making them:

  • more readable
  • easier to fix
  • simpler to add new features to

[Diagram: relationships between packages, modules and functions]

A module is a file that contains one or more units of code - in our case: functions. A collection of module files together forms a package.

You have already been using functions, modules and packages that were written by other people - for example, pandas.
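
As a small illustration, a single pandas call draws on all three levels of structure (the tiny data frame here is just for demonstration):

```python
import pandas as pd  # pandas is a package made up of many modules

# DataFrame is defined inside one of pandas' modules, and
# mean() is a function (method) that completes one task
df = pd.DataFrame({"density": [10.0, 20.0, 30.0]})
print(df["density"].mean())  # → 20.0
```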

You can use automated tests to check that each component of your code performs as expected.

Testing individual sections of your code independently, using code - "unit testing" - is a concept covered in further courses. To do this, your code needs to be structured into functions, modules and packages.

Testing multiple sections together, along with their interactions with each other, is called "integration testing".
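
As a taster, a unit test can be as small as a few assertions on one function's output. A minimal sketch, using a hypothetical add_one function:

```python
# A hypothetical function we want to test
def add_one(x):
    return x + 1

# A unit test: checks one component in isolation
def test_add_one():
    assert add_one(1) == 2
    assert add_one(-1) == 0

test_add_one()  # raises AssertionError if a check fails
print("all tests passed")
```

Testing frameworks such as pytest automate discovering and running functions like this.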

To show how to structure code in an analysis context we will use an example scenario to go through the steps taken.

2 Introducing the Project

You have been assigned to a group within your department responsible for analysing populations across the world. This work is in collaboration with the United Nations.

Your job is to provide analysis of population densities across the different United Nations Sustainable Development Goal (SDG) regions. You must provide average population density values for each SDG region.

One of your colleagues has already conducted this analysis on an ad hoc basis. They have given you their code to start with, but they have only analysed one year of data so far. You have been asked to write code that will be able to analyse multiple years of data, all in different files.

Before tackling the big task of analysing all the data, you are going to restructure your colleague's code. To make the process more reproducible in the future, you will restructure their code into functions and modules.

This process is called “refactoring”.

Refactoring is the process of improving your code while preserving its existing behaviour. This helps clean the code and improve its design.

You have been sent the two data sets needed to reproduce the analysis your colleague performed. Have a look through the data: what steps do you think need to be taken to make the data analysable?

2.1 Population Density

The population_density_2019.csv data contains each country's name and population density, plus a combined country and parent code column. There is also a year column.

The data is only from 2019.

Country Country and parent code Population Density Year
Burundi CC108_PC108 449.01 2019
Comoros CC174_PC174 457.222 2019
Djibouti CC262_PC262 41.9999 2019
Eritrea CC232_PC232 34.6249 2019
Ethiopia CC231_PC231 112.079 2019
Kenya CC404_PC404 92.3744 2019
Madagascar CC450_PC450 46.3553 2019
Malawi CC454_PC454 197.59 2019
Mauritius CC480_PC480 625.453 2019
Mayotte CC175_PC175 709.741 2019
Mozambique CC508_PC508 38.615 2019
Réunion CC638_PC638 355.573 2019
Rwanda CC646_PC646 511.834 2019
Seychelles CC690_PC690 212.48 2019
Somalia CC706_PC706 24.6165 2019
South Sudan CC728_PC728 18.1064 2019
Uganda CC800_PC800 221.558 2019
United Republic of Tanzania CC834_PC834 65.4837 2019
Zambia CC894_PC894 24.0265 2019
Zimbabwe CC716_PC716 37.8583 2019
Angola CC24_PC24 25.5276 2019
Cameroon CC120_PC120 54.7405 2019
Central African Republic CC140_PC140 7.6169 2019
Chad CC148_PC148 12.6643 2019
Congo CC178_PC178 15.7555 2019
Democratic Republic of the Congo CC180_PC180 38.2835 2019
Equatorial Guinea CC226_PC226 48.3416 2019
Gabon CC266_PC266 8.43163 2019
Sao Tome and Principe CC678_PC678 224.008 2019
Botswana CC72_PC72 4.0649 2019
Eswatini CC748_PC748 66.7519 2019
Lesotho CC426_PC426 70.0022 2019
Namibia CC516_PC516 3.02995 2019
South Africa CC710_PC710 48.272 2019
Benin CC204_PC204 104.657 2019
Burkina Faso CC854_PC854 74.2741 2019
Cabo Verde CC132_PC132 136.461 2019
Côte d’Ivoire CC384_PC384 80.8697 2019
Gambia CC270_PC270 231.986 2019
Ghana CC288_PC288 133.681 2019
Guinea CC324_PC324 51.9748 2019
Guinea-Bissau CC624_PC624 68.3114 2019
Liberia CC430_PC430 51.2601 2019
Mali CC466_PC466 16.1106 2019
Mauritania CC478_PC478 4.3909 2019
Niger CC562_PC562 18.4027 2019
Nigeria CC566_PC566 220.652 2019
Saint Helena CC654_PC654 15.541 2019
Senegal CC686_PC686 84.6432 2019
Sierra Leone CC694_PC694 108.246 2019
Togo CC768_PC768 148.6 2019
Algeria CC12_PC12 18.0763 2019
Egypt CC818_PC818 100.847 2019
Libya CC434_PC434 3.85183 2019
Morocco CC504_PC504 81.7203 2019
Sudan CC729_PC729 24.2561 2019
Tunisia CC788_PC788 75.275 2019
Western Sahara CC732_PC732 2.18969 2019
Armenia CC51_PC51 103.889 2019
Azerbaijan CC31_PC31 121.558 2019
Bahrain CC48_PC48 2159.43 2019
Cyprus CC196_PC196 129.716 2019
Georgia CC268_PC268 57.5156 2019
Iraq CC368_PC368 90.5088 2019
Israel CC376_PC376 393.686 2019
Jordan CC400_PC400 113.783 2019
Kuwait CC414_PC414 236.087 2019
Lebanon CC422_PC422 670.157 2019
Oman CC512_PC512 16.0743 2019
Qatar CC634_PC634 243.934 2019
Saudi Arabia CC682_PC682 15.9411 2019
State of Palestine CC275_PC275 827.479 2019
Syrian Arab Republic CC760_PC760 92.9594 2019
Turkey CC792_PC792 108.402 2019
United Arab Emirates CC784_PC784 116.872 2019
Yemen CC887_PC887 55.2341 2019
Kazakhstan CC398_PC398 6.87166 2019
Kyrgyzstan CC417_PC417 33.4507 2019
Tajikistan CC762_PC762 66.5978 2019
Turkmenistan CC795_PC795 12.6446 2019
Uzbekistan CC860_PC860 77.5311 2019
Afghanistan CC4_PC4 58.2694 2019
Bangladesh CC50_PC50 1252.56 2019
Bhutan CC64_PC64 20.0198 2019
India CC356_PC356 459.58 2019
Iran (Islamic Republic of) CC364_PC364 50.9127 2019
Maldives CC462_PC462 1769.86 2019
Nepal CC524_PC524 199.572 2019
Pakistan CC586_PC586 280.933 2019
Sri Lanka CC144_PC144 340.037 2019
China CC156_PC156 152.722 2019
China, Hong Kong SAR CC344_PC344 7082.05 2019
China, Macao SAR CC446_PC446 21419.6 2019
China, Taiwan Province of China CC158_PC158 671.389 2019
Dem. People’s Republic of Korea CC408_PC408 213.156 2019
Japan CC392_PC392 347.987 2019
Mongolia CC496_PC496 2.07598 2019
Republic of Korea CC410_PC410 526.847 2019
Brunei Darussalam CC96_PC96 82.2194 2019
Cambodia CC116_PC116 93.3976 2019
Indonesia CC360_PC360 149.387 2019
Lao People’s Democratic Republic CC418_PC418 31.0635 2019
Malaysia CC458_PC458 97.2448 2019
Myanmar CC104_PC104 82.7281 2019
Philippines CC608_PC608 362.601 2019
Singapore CC702_PC702 8291.92 2019
Thailand CC764_PC764 136.283 2019
Timor-Leste CC626_PC626 86.9617 2019
Viet Nam CC704_PC704 311.098 2019
Anguilla CC660_PC660 165.244 2019
Antigua and Barbuda CC28_PC28 220.716 2019
Aruba CC533_PC533 590.611 2019
Bahamas CC44_PC44 38.9097 2019
Barbados CC52_PC52 667.491 2019
Bonaire, Sint Eustatius and Saba CC535_PC535 79.2165 2019
British Virgin Islands CC92_PC92 200.22 2019
Cayman Islands CC136_PC136 270.617 2019
Cuba CC192_PC192 106.478 2019
Curaçao CC531_PC531 368.07 2019
Dominica CC212_PC212 95.744 2019
Dominican Republic CC214_PC214 222.247 2019
Grenada CC308_PC308 329.418 2019
Guadeloupe CC312_PC312 245.73 2019
Haiti CC332_PC332 408.675 2019
Jamaica CC388_PC388 272.232 2019
Martinique CC474_PC474 354.299 2019
Montserrat CC500_PC500 49.91 2019
Puerto Rico CC630_PC630 330.711 2019
Saint Barthélemy CC652_PC652 447.864 2019
Saint Kitts and Nevis CC659_PC659 203.208 2019
Saint Lucia CC662_PC662 299.664 2019
Saint Martin (French part) CC663_PC663 717.019 2019
Saint Vincent and the Grenadines CC670_PC670 283.572 2019
Sint Maarten (Dutch part) CC534_PC534 1246.74 2019
Trinidad and Tobago CC780_PC780 271.924 2019
Turks and Caicos Islands CC796_PC796 40.2042 2019
United States Virgin Islands CC850_PC850 298.797 2019
Belize CC84_PC84 17.1132 2019
Costa Rica CC188_PC188 98.8555 2019
El Salvador CC222_PC222 311.465 2019
Guatemala CC320_PC320 164.068 2019
Honduras CC340_PC340 87.1044 2019
Mexico CC484_PC484 65.627 2019
Nicaragua CC558_PC558 54.3917 2019
Panama CC591_PC591 57.1219 2019
Argentina CC32_PC32 16.3631 2019
Bolivia (Plurinational State of) CC68_PC68 10.6278 2019
Brazil CC76_PC76 25.2508 2019
Chile CC152_PC152 25.4892 2019
Colombia CC170_PC170 45.3713 2019
Ecuador CC218_PC218 69.9535 2019
Falkland Islands (Malvinas) CC238_PC238 0.277075 2019
French Guiana CC254_PC254 3.53799 2019
Guyana CC328_PC328 3.9765 2019
Paraguay CC600_PC600 17.7313 2019
Peru CC604_PC604 25.3988 2019
Suriname CC740_PC740 3.72669 2019
Uruguay CC858_PC858 19.7791 2019
Venezuela (Bolivarian Republic of) CC862_PC862 32.329 2019
Australia CC36_PC36 3.28068 2019
New Zealand CC554_PC554 18.1651 2019
Fiji CC242_PC242 48.7113 2019
New Caledonia CC540_PC540 15.4681 2019
Papua New Guinea CC598_PC598 19.3793 2019
Solomon Islands CC90_PC90 23.9307 2019
Vanuatu CC548_PC548 24.6007 2019
Guam CC316_PC316 309.806 2019
Kiribati CC296_PC296 145.195 2019
Marshall Islands CC584_PC584 326.617 2019
Micronesia (Fed. States of) CC583_PC583 162.587 2019
Nauru CC520_PC520 538.2 2019
Northern Mariana Islands CC580_PC580 124.376 2019
Palau CC585_PC585 39.1326 2019
American Samoa CC16_PC16 276.56 2019
Cook Islands CC184_PC184 73.1125 2019
French Polynesia CC258_PC258 76.3074 2019
Niue CC570_PC570 6.20769 2019
Samoa CC882_PC882 69.6442 2019
Tokelau CC772_PC772 133 2019
Tonga CC776_PC776 145.135 2019
Tuvalu CC798_PC798 388.5 2019
Wallis and Futuna Islands CC876_PC876 81.6857 2019
Belarus CC112_PC112 46.5842 2019
Bulgaria CC100_PC100 64.4815 2019
Czechia CC203_PC203 138.39 2019
Hungary CC348_PC348 106.978 2019
Poland CC616_PC616 123.723 2019
Republic of Moldova CC498_PC498 123.082 2019
Romania CC642_PC642 84.1315 2019
Russian Federation CC643_PC643 8.90721 2019
Slovakia CC703_PC703 113.48 2019
Ukraine CC804_PC804 75.9401 2019
Channel Islands CC830_PC830 906.653 2019
Denmark CC208_PC208 136.033 2019
Estonia CC233_PC233 31.2727 2019
Faroe Islands CC234_PC234 34.8689 2019
Finland CC246_PC246 18.2045 2019
Iceland CC352_PC352 3.38192 2019
Ireland CC372_PC372 70.8738 2019
Isle of Man CC833_PC833 148.402 2019
Latvia CC428_PC428 30.655 2019
Lithuania CC440_PC440 44.0315 2019
Norway CC578_PC578 14.7258 2019
Sweden CC752_PC752 24.4587 2019
United Kingdom CC826_PC826 279.131 2019
Albania CC8_PC8 105.143 2019
Andorra CC20_PC20 164.14 2019
Bosnia and Herzegovina CC70_PC70 64.7255 2019
Croatia CC191_PC191 73.8081 2019
Gibraltar CC292_PC292 3370.6 2019
Greece CC300_PC300 81.2525 2019
Holy See CC336_PC336 1852.27 2019
Italy CC380_PC380 205.855 2019
Malta CC470_PC470 1376.18 2019
Montenegro CC499_PC499 46.6906 2019
North Macedonia CC807_PC807 82.6113 2019
Portugal CC620_PC620 111.652 2019
San Marino CC674_PC674 564.4 2019
Serbia CC688_PC688 100.3 2019
Slovenia CC705_PC705 103.21 2019
Spain CC724_PC724 93.6984 2019
Austria CC40_PC40 108.667 2019
Belgium CC56_PC56 381.087 2019
France CC250_PC250 118.946 2019
Germany CC276_PC276 239.606 2019
Liechtenstein CC438_PC438 237.625 2019
Luxembourg CC442_PC442 237.734 2019
Monaco CC492_PC492 26152.3 2019
Netherlands CC528_PC528 507.032 2019
Switzerland CC756_PC756 217.415 2019
Bermuda CC60_PC60 1250.16 2019
Canada CC124_PC124 4.11404 2019
Greenland CC304_PC304 0.138044 2019
Saint Pierre and Miquelon CC666_PC666 25.3087 2019
United States of America CC840_PC840 35.9735 2019

2.2 Location IDs

The locations.csv data contains each country's location ID (equivalent to a country code) and the Sustainable Development Goal (SDG) region that each location ID is part of.

The data is valid for all years.

Location ID SDG Region Name
“108” Sub-Saharan Africa
“174” Sub-Saharan Africa
“262” Sub-Saharan Africa
“232” Sub-Saharan Africa
“231” Sub-Saharan Africa
“404” Sub-Saharan Africa
“450” Sub-Saharan Africa
“454” Sub-Saharan Africa
“480” Sub-Saharan Africa
“175” Sub-Saharan Africa
“508” Sub-Saharan Africa
“638” Sub-Saharan Africa
“646” Sub-Saharan Africa
“690” Sub-Saharan Africa
“706” Sub-Saharan Africa
“728” Sub-Saharan Africa
“800” Sub-Saharan Africa
“834” Sub-Saharan Africa
“894” Sub-Saharan Africa
“716” Sub-Saharan Africa
“24” Sub-Saharan Africa
“120” Sub-Saharan Africa
“140” Sub-Saharan Africa
“148” Sub-Saharan Africa
“178” Sub-Saharan Africa
“180” Sub-Saharan Africa
“226” Sub-Saharan Africa
“266” Sub-Saharan Africa
“678” Sub-Saharan Africa
“72” Sub-Saharan Africa
“748” Sub-Saharan Africa
“426” Sub-Saharan Africa
“516” Sub-Saharan Africa
“710” Sub-Saharan Africa
“204” Sub-Saharan Africa
“854” Sub-Saharan Africa
“132” Sub-Saharan Africa
“384” Sub-Saharan Africa
“270” Sub-Saharan Africa
“288” Sub-Saharan Africa
“324” Sub-Saharan Africa
“624” Sub-Saharan Africa
“430” Sub-Saharan Africa
“466” Sub-Saharan Africa
“478” Sub-Saharan Africa
“562” Sub-Saharan Africa
“566” Sub-Saharan Africa
“654” Sub-Saharan Africa
“686” Sub-Saharan Africa
“694” Sub-Saharan Africa
“768” Sub-Saharan Africa
“12” Northern Africa and Western Asia
“818” Northern Africa and Western Asia
“434” Northern Africa and Western Asia
“504” Northern Africa and Western Asia
“729” Northern Africa and Western Asia
“788” Northern Africa and Western Asia
“732” Northern Africa and Western Asia
“51” Northern Africa and Western Asia
“31” Northern Africa and Western Asia
“48” Northern Africa and Western Asia
“196” Northern Africa and Western Asia
“268” Northern Africa and Western Asia
“368” Northern Africa and Western Asia
“376” Northern Africa and Western Asia
“400” Northern Africa and Western Asia
“414” Northern Africa and Western Asia
“422” Northern Africa and Western Asia
“512” Northern Africa and Western Asia
“634” Northern Africa and Western Asia
“682” Northern Africa and Western Asia
“275” Northern Africa and Western Asia
“760” Northern Africa and Western Asia
“792” Northern Africa and Western Asia
“784” Northern Africa and Western Asia
“887” Northern Africa and Western Asia
“398” Central and Southern Asia
“417” Central and Southern Asia
“762” Central and Southern Asia
“795” Central and Southern Asia
“860” Central and Southern Asia
“4” Central and Southern Asia
“50” Central and Southern Asia
“64” Central and Southern Asia
“356” Central and Southern Asia
“364” Central and Southern Asia
“462” Central and Southern Asia
“524” Central and Southern Asia
“586” Central and Southern Asia
“144” Central and Southern Asia
“156” Eastern and South-Eastern Asia
“344” Eastern and South-Eastern Asia
“446” Eastern and South-Eastern Asia
“158” Eastern and South-Eastern Asia
“408” Eastern and South-Eastern Asia
“392” Eastern and South-Eastern Asia
“496” Eastern and South-Eastern Asia
“410” Eastern and South-Eastern Asia
“96” Eastern and South-Eastern Asia
“116” Eastern and South-Eastern Asia
“360” Eastern and South-Eastern Asia
“418” Eastern and South-Eastern Asia
“458” Eastern and South-Eastern Asia
“104” Eastern and South-Eastern Asia
“608” Eastern and South-Eastern Asia
“702” Eastern and South-Eastern Asia
“764” Eastern and South-Eastern Asia
“626” Eastern and South-Eastern Asia
“704” Eastern and South-Eastern Asia
“660” Latin America and the Caribbean
“28” Latin America and the Caribbean
“533” Latin America and the Caribbean
“44” Latin America and the Caribbean
“52” Latin America and the Caribbean
“535” Latin America and the Caribbean
“92” Latin America and the Caribbean
“136” Latin America and the Caribbean
“192” Latin America and the Caribbean
“531” Latin America and the Caribbean
“212” Latin America and the Caribbean
“214” Latin America and the Caribbean
“308” Latin America and the Caribbean
“312” Latin America and the Caribbean
“332” Latin America and the Caribbean
“388” Latin America and the Caribbean
“474” Latin America and the Caribbean
“500” Latin America and the Caribbean
“630” Latin America and the Caribbean
“652” Latin America and the Caribbean
“659” Latin America and the Caribbean
“662” Latin America and the Caribbean
“663” Latin America and the Caribbean
“670” Latin America and the Caribbean
“534” Latin America and the Caribbean
“780” Latin America and the Caribbean
“796” Latin America and the Caribbean
“850” Latin America and the Caribbean
“84” Latin America and the Caribbean
“188” Latin America and the Caribbean
“222” Latin America and the Caribbean
“320” Latin America and the Caribbean
“340” Latin America and the Caribbean
“484” Latin America and the Caribbean
“558” Latin America and the Caribbean
“591” Latin America and the Caribbean
“32” Latin America and the Caribbean
“68” Latin America and the Caribbean
“76” Latin America and the Caribbean
“152” Latin America and the Caribbean
“170” Latin America and the Caribbean
“218” Latin America and the Caribbean
“238” Latin America and the Caribbean
“254” Latin America and the Caribbean
“328” Latin America and the Caribbean
“600” Latin America and the Caribbean
“604” Latin America and the Caribbean
“740” Latin America and the Caribbean
“858” Latin America and the Caribbean
“862” Latin America and the Caribbean
“36” Australia/New Zealand
“554” Australia/New Zealand
“242” Oceania (excluding Australia and New Zealand)
“540” Oceania (excluding Australia and New Zealand)
“598” Oceania (excluding Australia and New Zealand)
“90” Oceania (excluding Australia and New Zealand)
“548” Oceania (excluding Australia and New Zealand)
“316” Oceania (excluding Australia and New Zealand)
“296” Oceania (excluding Australia and New Zealand)
“584” Oceania (excluding Australia and New Zealand)
“583” Oceania (excluding Australia and New Zealand)
“520” Oceania (excluding Australia and New Zealand)
“580” Oceania (excluding Australia and New Zealand)
“585” Oceania (excluding Australia and New Zealand)
“16” Oceania (excluding Australia and New Zealand)
“184” Oceania (excluding Australia and New Zealand)
“258” Oceania (excluding Australia and New Zealand)
“570” Oceania (excluding Australia and New Zealand)
“882” Oceania (excluding Australia and New Zealand)
“772” Oceania (excluding Australia and New Zealand)
“776” Oceania (excluding Australia and New Zealand)
“798” Oceania (excluding Australia and New Zealand)
“876” Oceania (excluding Australia and New Zealand)
“112” Europe and Northern America
“100” Europe and Northern America
“203” Europe and Northern America
“348” Europe and Northern America
“616” Europe and Northern America
“498” Europe and Northern America
“642” Europe and Northern America
“643” Europe and Northern America
“703” Europe and Northern America
“804” Europe and Northern America
“830” Europe and Northern America
“208” Europe and Northern America
“233” Europe and Northern America
“234” Europe and Northern America
“246” Europe and Northern America
“352” Europe and Northern America
“372” Europe and Northern America
“833” Europe and Northern America
“428” Europe and Northern America
“440” Europe and Northern America
“578” Europe and Northern America
“752” Europe and Northern America
“826” Europe and Northern America
“8” Europe and Northern America
“20” Europe and Northern America
“70” Europe and Northern America
“191” Europe and Northern America
“292” Europe and Northern America
“300” Europe and Northern America
“336” Europe and Northern America
“380” Europe and Northern America
“470” Europe and Northern America
“499” Europe and Northern America
“807” Europe and Northern America
“620” Europe and Northern America
“674” Europe and Northern America
“688” Europe and Northern America
“705” Europe and Northern America
“724” Europe and Northern America
“40” Europe and Northern America
“56” Europe and Northern America
“250” Europe and Northern America
“276” Europe and Northern America
“438” Europe and Northern America
“442” Europe and Northern America
“492” Europe and Northern America
“528” Europe and Northern America
“756” Europe and Northern America
“60” Europe and Northern America
“124” Europe and Northern America
“304” Europe and Northern America
“666” Europe and Northern America
“840” Europe and Northern America

2.3 Steps

  • To analyse the data we will need a single data frame. We must join the locations and population densities on a common column.

  • At the moment there is no exactly matching column to join on, so we will need to manipulate the columns first.

  • Both data frames contain a “country code” value somewhere. For the Population Density dataframe, the country code will need to be separated from the parent code. There are also prefixes “CC” and “PC” that we will need to consider. For the Location IDs dataframe, the quotation marks will need to be removed.

  • Once the population densities have their respective SDG region in the same table the data can be aggregated. The data will be grouped by SDG region, then the mean will be calculated on the population density value.
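
The steps above can be sketched with a miniature version of the two data sets (the single-row frames here are illustrative stand-ins, not the real files):

```python
import pandas as pd

# Miniature stand-ins for the two data sets
pop = pd.DataFrame({
    "Country": ["Burundi"],
    "Country and parent code": ["CC108_PC108"],
    "Population Density": [449.01],
})
loc = pd.DataFrame({
    "Location ID": ['"108"'],
    "SDG Region Name": ["Sub-Saharan Africa"],
})

# Take the country-code half, strip the "CC" prefix, make it numeric
pop["country_code"] = (pop["Country and parent code"]
                       .str.split("_").str[0]
                       .str.replace("CC", "")
                       .astype(int))

# Strip the quotation marks so the location ID is numeric too
loc["location_id"] = loc["Location ID"].str.replace('"', "").astype(int)

# Join on the now-matching codes, then aggregate by region
merged = pop.merge(loc, left_on="country_code", right_on="location_id")
print(merged.groupby("SDG Region Name")["Population Density"].mean())
```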

We have stated above that the data is valid for all years, meaning that we expect the structure to be consistent. Once the 2019 data is clean, what things should we consider about applying our program to other years?

3 Building Programs

Before getting started on the task of analysing the population density data, it is important that we are aware of different styles of programming.

3.1 Basic Programs

Scripts and notebooks can be really useful tools for quick analysis; however, they limit how we can scale and improve our project.

Our scripts become one line after another, with the data changed slightly at each step.

This does not group the code into a structure that helps us understand it.

This style of programming is sometimes referred to as “imperative”.

Programmers frequently copy and paste code to reuse it in different parts of a program, with small changes.

If the requirements of our project change, we need to hunt through the code to change all the relevant variables and values. If code sections have been copied and pasted, fixing an error in one place won’t fix the copies.

If the project expands we need to write more and more code. This is often done in the same file, making the code harder to work through and understand.

3.2 Grouping Code

[Illustration: items of clothing used to show grouping]

To structure our code better we need to be able to group a collection of code together into one object. This can be done in two ways:

  • converting to functions
  • converting to classes

Classes are beyond the scope of this course, so we will focus on functions here. However, many of these principles can also be applied to classes.

Properties of functions:

  • functions complete a task
  • functions can take inputs and give outputs

Functions can be run in one line of code, running complicated operations that have been written elsewhere. This helps "hide" some of the detail, making it clearer what is happening in the code - a process known as abstraction.

Well-named functions mean we do not need to understand the details inside the function - just what they achieve.

[Illustration: grouping code into objects]

Within this course there is a programming styles document, explaining some of the different styles of programming. This is suggested further reading at this point in the course.

There are some important principles to keep in mind when we design functions:

  • functions should not have “side-effects”. Data outside the function should not be impacted by using the function
  • functions should serve a single purpose
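
To illustrate the first principle, compare a function that mutates its input with one that leaves it untouched (both functions here are illustrative):

```python
# Has a side-effect: appends to the caller's list in place
def append_total_bad(values):
    values.append(sum(values))
    return values

# Side-effect free: builds and returns a new list instead
def append_total_good(values):
    return values + [sum(values)]

data = [1, 2, 3]
result = append_total_good(data)
print(data)    # → [1, 2, 3] - the original is unchanged
print(result)  # → [1, 2, 3, 6]
```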

4 Scripts to Functions

In this section we will discuss considerations when converting scripts to functions, using an example script to show the steps involved in structuring code.

4.1 Example Analysis Code

This section contains the code given to you by your colleague. At present it is a script that is well commented, but not well structured. Your task is to structure the code, allowing for reproducible analysis in the future.

At a high level, the code:

  • loads in the two data sets
  • cleans the data
  • joins the data so all useful information is together
  • calculates an aggregate statistic
  • tidies the output data
  • writes the data to a CSV file

Have a read through the script you have received, and be sure to look up any sections you are not comfortable with.

If you would prefer to look at it within an IDE it is located in

  • example_code_python/initial_script/

For all scripts and files throughout this course it is assumed that the working directory being used is the location of the file being run. This may need to be changed in your given IDE.

# File to analyse the mean population density data from the UN

# Import relevant libraries for analysis
import pandas as pd
import os

# Load the population density data 2019
population_path = os.path.join("../../data/population_density_2019.csv")
pop_density = pd.read_csv(population_path)

# Clean the column names, following naming conventions similar to PEP8
pop_density.columns = pop_density.columns.str.lower()
pop_density.columns = pop_density.columns.str.replace(" ", "_")

# The country_and_parent_code column needs to 
# be split into two columns without the strings
pop_density[["country_code", "parent_code"]] = (pop_density["country_and_parent_code"]
                                                .str.split("_", expand=True))

# Remove the country_and_parent_code and parent_code columns, not used in later analysis
# axis=1 to remove the columns
pop_density = pop_density.drop(labels=[
                                       "country_and_parent_code", 
                                       "parent_code"
                                       ], 
                               axis=1)

# Convert country_code to integer by removing strings
pop_density["country_code"] = pop_density["country_code"].str.replace("CC", "")
pop_density["country_code"] = pop_density["country_code"].astype(int)


# Load the locations data to get the Sustainable Development Goals sub regions
locations_path = os.path.join("../../data/locations.csv")
locations = pd.read_csv(locations_path)

# Clean the column names, following naming conventions similar to PEP8
locations.columns = locations.columns.str.lower()
locations.columns = locations.columns.str.replace(" ", "_")

# The location_id data has quotation marks making it a string,
# it needs to be converted to a numeric
locations["location_id"] = locations["location_id"].str.replace('"', '')
locations["location_id"] = locations["location_id"].astype(int)

# Join the data sets
# Left merge so we keep all pop_density data
pop_density_location = pop_density.merge(locations, 
                                         how="left",
                                         left_on="country_code",
                                         right_on="location_id")

# Remove the location_id column as it is equal to country_code or missing
pop_density_location = pop_density_location.drop(labels=["location_id"], axis=1)

# Get just the relevant columns in preparation
# for the following groupby
region_density = pop_density_location[["sdg_region_name", "population_density"]]

# Calculate the mean population density for each region
# A non-weighted mean
region_mean_density = (region_density.groupby('sdg_region_name', as_index=False)
                       .agg({"population_density": "mean"}))
                       
region_mean_density = region_mean_density.rename(columns={"population_density": "mean_population_density"})

# Sort the data for clearer reading
region_mean_density = region_mean_density.sort_values(by="mean_population_density",
                                                      ascending=False)

# Round mean density for clearer reading
region_mean_density["mean_population_density"] = region_mean_density["mean_population_density"].round(2)

# Write out the final output
region_mean_density.to_csv("mean_population_density_output.csv", index=False)

Output data:

sdg_region_name mean_population_density
Eastern and South-Eastern Asia 2112.67
Europe and Northern America 764.93
Central and Southern Asia 330.63
Northern Africa and Western Asia 234.38
Latin America and the Caribbean 199.62
Oceania (excluding Australia and New Zealand) 144.2
Sub-Saharan Africa 126.55
Australia/New Zealand 10.72

4.2 Grouping Code by Functionality

Chunks of code that do similar things should be grouped together.

Deciding which sections of code make sense as being part of the same function is a common challenge when structuring code.

When converting code into a function - the main thing we look for is that it achieves one task. It may take us a few lines of code to achieve this “one task” - but the point is the function has a specific purpose.

If a function has more than one task or “responsibility” it will become hard to maintain, as it has many reasons to be modified.

If a function has a single “responsibility”, it will be focussed and much more likely to be reusable elsewhere.

When writing scripts, we often repeat the same tasks at different points in the script. These are good parts of code to start converting into functions. Doing so reduces the amount of code written in the file - and makes what is happening at any step clearer.

You may also wish to consider writing helper functions for any housekeeping tasks that you commonly require.

If a code block isn’t repeated throughout the code that’s okay too - all the code can be converted to functions to be called one after the other.

It is much easier to read a sequence of well-named functions, rather than a long stream of commands.

[Illustration: grouping code based on purpose]

Some code is often very similar, with a variable or two difference in areas of the code. When reading the code, it’s important to think about what is happening to the variables and data involved. Consider whether a similar process is happening elsewhere, rather than whether the same data is involved. These repeating processes present opportunities to reduce the overall length of your script by writing your own custom functions.

Returning to our example script, we are going to take one task, convert it into a function, then improve the function so it can be used multiple times.

The lines of code:

  • load in a data frame given a path
  • reformat the column names of the data frame

4.2.1 Initial Code

# Import relevant libraries for analysis
import pandas as pd
import os

# Load the population density data 2019
population_path = os.path.join("../../data/population_density_2019.csv")
pop_density = pd.read_csv(population_path)

# Clean the column names, following naming conventions similar to PEP8
pop_density.columns = pop_density.columns.str.lower()
pop_density.columns = pop_density.columns.str.replace(" ", "_")

4.2.2 Basic Function

We can wrap the code into a function so that all of it can be run with one command, like so:

def load_formatted_pop_frame():
    """Read population data and reformat column names"""
    # Load the population density data 2019
    population_path = os.path.join("../../data/population_density_2019.csv")
    pop_density = pd.read_csv(population_path)

    # Clean the column names, following naming conventions similar to PEP8
    pop_density.columns = pop_density.columns.str.lower()
    pop_density.columns = pop_density.columns.str.replace(" ", "_")

    return pop_density

# Call the function to assign the data frame

population_density = load_formatted_pop_frame()

4.2.3 Adding Parameters

To improve the function, we can add an argument for something that may change in the future - the path to the data.

Consider how you would have to change the previous function if the location of the population_density_2019.csv file changed.

Variable names in functions should reflect what that variable is. If you don’t know exactly the value the variable will take, then a generic name like dataframe is appropriate. Though consider the framework that you are working in - avoid reserved words or well-established, commonly used function names.

When we add an argument to a function to replace a value within it, we need to be sure to change every place that the original value was used.

Our comments should reflect the changes made too.

Note that comments should add information - the comments in this tutorial are reminders of why we are doing this, and not the style of comment you would be expected to write. Often, if functions and variables are well-named, the code does not require many comments.

The new function can now be used for both the population_density data and locations.csv.

def load_formatted_frame(path_to_data):
    """Read csv and reformat column names"""
    # Load the csv from given path
    formatted_path = os.path.join(path_to_data)
    dataframe = pd.read_csv(formatted_path)
    
    # Clean the column names, following naming conventions similar to PEP8
    dataframe.columns = dataframe.columns.str.lower()
    dataframe.columns = dataframe.columns.str.replace(" ", "_")
    
    return dataframe

# The path can be updated where the function is run if needed
population_density = load_formatted_frame("../../data/population_density_2019.csv")

# The same function is used to load a formatted locations.csv
locations = load_formatted_frame("../../data/locations.csv")

4.3 Scope

Scope is an important concept when creating functions and structuring code.

Scope refers to the places in a program that a variable can be accessed.


When writing scripts, variables can be accessed anywhere in the script - so long as the variable assignment has been run.

When we write scripts, we are storing all our variables at the highest, most accessible area of the program. This is referred to as “Global Scope”.

Variables with global scope are accessible in all locations of the program.

This is the easiest way to store variables when learning to program.

However, using global variables throughout our analysis often creates unexpected results. If a new piece of code accidentally alters a global variable, it affects all of the code run after it, even if that code was never meant to update the variable. Errors like this can be very tricky to track down and fix.
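A minimal sketch of this kind of bug (the variable names are illustrative, not from the project): a function that quietly alters a global list, surprising any later code.

```python
records = [1, 2, 3]

def add_total():
    # Appending alters the global object in place - nothing is passed in
    # or returned, so the side effect is easy to miss at the call site
    records.append(sum(records))

add_total()

# Later code that assumed three records now silently sees four
print(records)       # [1, 2, 3, 6]
print(len(records))  # 4
```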


Some variables can only be accessed in certain locations within a program. When this happens, it is referred to as “Local Scope”.

Variables have local scope if they are accessible within a part of a program such as a function. They cannot be accessed outside the function they are assigned in.
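A short illustration (not from the project code): a variable assigned inside a function cannot be reached once the function has returned.

```python
def double(number):
    result = number * 2  # `result` has local scope
    return result

print(double(5))  # 10

# `result` no longer exists out here, so Python raises a NameError
try:
    print(result)
except NameError as error:
    print(error)  # name 'result' is not defined
```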

diagram scope across different functions and levels

At the highest level of scope are the parts of the programming language that can be accessed anywhere - the built-in functions (e.g. print()).

diagram showing built in functions, global and local scope relationships


To make our functions follow functional programming principles we need to keep variable scope in mind.

When designing functions:

  • all variables within should either:
    • be passed as arguments to the function
    • be created within the function
  • variables with Global Scope should only be given as arguments to the function
    • although all functions can access global variables, doing so makes our code harder to understand
  • if we need to access data with local scope (within a function) it needs to be returned by the function

If we are clear about what variables we are accessing, we can be sure about what their values are. Using only variables passed as arguments clarifies what data a function is operating on, and makes it much easier to reuse elsewhere (as it just needs its arguments defined, no hidden dependencies on global variables).

Think of your functions as having an entrance and an exit.

  • the entrance is the arguments and variables it takes as inputs
  • the exit is the value it returns
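As a toy sketch of this model (the function name and values are illustrative, not from the project): everything the function needs enters through its parameters, and everything it produces leaves through the return statement.

```python
def mean_density(total_population, total_area):
    # Entrance: the two arguments. Exit: the returned value.
    return total_population / total_area

print(mean_density(1000, 40))  # 25.0
```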

When choosing which parameters to give a function there are a few things to consider:

  • scope
  • clarity - avoid bundling parameters together into an object, make sure each parameter is clearly named
  • purpose - only include parameters needed for the task

diagram of function model of inputs and outputs

Not all functions need to return a value - for example, a function that writes out a file. In this case do not use a return statement, making it clear that nothing will be returned. If a function has no return statement, Python returns None by default.
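A quick illustration (not from the project code): calling a function that has no return statement evaluates to None.

```python
def report(message):
    # No return statement - the function only performs an action
    print(message)

result = report("analysis complete")
print(result is None)  # True
```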

4.3.1 Function Inputs

Below are examples of code which have similar purposes, one uses parameter variables well, the other does not.

This is bad because the function depends on data that has not been passed to it as an argument.

letters = ["a", "b", "c", "d", "e"]

def add_letter():
    long_letters  = letters + ["f"]
    return long_letters

# Run on original data
print(add_letter())

# the value of letters could be changed elsewhere in the program
letters = ["1", "2", "3", "4", "5"]

# Without changing our function at all we get a different result
# from the same function call
print(add_letter())
['a', 'b', 'c', 'd', 'e', 'f']
['1', '2', '3', '4', '5', 'f']

This is better because the function is more clearly dependent on the input values.

letters = ["a", "b", "c", "d", "e"]

def add_letter(character_list):
    long_letters = character_list + ["f"]
    return long_letters

# Run on the original data
print("Initial")
print(add_letter(letters))

# If the data changes later so does our result
letters = ["1", "2", "3", "4", "5"]

# This time we can see why the result is different
# We need to check the value of `letters`
print("Changed")
print(add_letter(letters))
Initial
['a', 'b', 'c', 'd', 'e', 'f']
Changed
['1', '2', '3', '4', '5', 'f']

4.3.2 Data Frame Considerations

As analysts and data scientists, we will often use data frames in our programs.

There are some special considerations that need to be taken when working with these objects, with regards to functional programming principles.

We need to avoid unintentionally altering an existing object when we give it to a function.

This is primarily a challenge in Python / pandas because data frames are mutable objects, and Python passes them to functions by reference rather than copying them.

If we give a data frame to a function, some pandas operations inside the function can edit the original data frame "in-place" - affecting it outside of the function's scope.

This means any other variables in the code referring to that data frame will be changed unintentionally.

Below is an example of the unintentional effect this can have.

initial_frame = pd.DataFrame(columns=["first", "second"])

print("initial_frame before function called:", initial_frame)


def add_values(dataframe):
    
    dataframe[["first", "second"]] = pd.Series(["value1, value2"]).str.split(", ", expand=True)
    
    return dataframe


changed_frame = add_values(initial_frame)

print("changed_frame:", changed_frame)

print("initial_frame after function called:", initial_frame)

# Without intending to, we have edited initial_frame
print("Initial and new frame are equal:", initial_frame.equals(changed_frame))
initial_frame before function called: Empty DataFrame
Columns: [first, second]
Index: []
changed_frame:     first  second
0  value1  value2
initial_frame after function called:     first  second
0  value1  value2
Initial and new frame are equal: True

There are different approaches to preventing this phenomenon. We will look at one in particular to tackle the problem.

A local copy of the original data frame can be made within each function that takes a data frame as a parameter. This local copy is worked with and manipulated, then returned.

This prevents the original object from being changed.

The .copy() method is used to achieve this.

initial_frame = pd.DataFrame(columns=["first", "second"])

print("initial_frame before function called:", initial_frame)


def add_values(dataframe):
    # Typically the original data frame name is overwritten
    # This avoids potential naming issues
    dataframe = dataframe.copy()
    
    dataframe[["first", "second"]] = pd.Series(["value1, value2"]).str.split(", ", expand=True)
    
    return dataframe


changed_frame = add_values(initial_frame)

print("changed_frame:", changed_frame)

print("initial_frame after function called:", initial_frame)

# Check if we have edited the original
print("Initial and new frame are equal:", initial_frame.equals(changed_frame))
initial_frame before function called: Empty DataFrame
Columns: [first, second]
Index: []
changed_frame:     first  second
0  value1  value2
initial_frame after function called: Empty DataFrame
Columns: [first, second]
Index: []
Initial and new frame are equal: False

This approach helps to prevent side effects of our functions.

There is however a trade-off - as we make more copies of data our memory usage will increase.

While this is rarely a problem for small data sets, it is something to keep in mind as your projects get bigger.

There are design approaches that can be used to reduce the memory usage of your program.

For example, removing unneeded data and duplicates, piping (method chaining) and manipulating data with inplace parameters can all help.
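A hedged sketch of some of those ideas; the column names, values and dtype choices here are illustrative, not from the course data.

```python
import pandas as pd

frame = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "density": [10.0, 10.0, 20.5, 30.5, 30.5],
})

# Remove exact duplicate rows that are no longer needed
frame = frame.drop_duplicates()

# Downcast a float64 column to float32, halving its footprint
frame["density"] = frame["density"].astype("float32")

# Repeated string labels store compactly as a categorical dtype
frame["region"] = frame["region"].astype("category")

print(frame.memory_usage(deep=True))
```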

4.4 Exercises

Using the code snippets from the example analysis below, write a function that:

  • takes an input population density data frame
  • splits the country_and_parent_code column into parent_code and country_code columns
  • drops the country_and_parent_code and parent_code columns
  • returns the new data frame

Add this function into the file example_code_python/function_input/exercise1.py or example_code_R/function_input/exercise1.R depending on your chosen framework. Use the code already there to test your result on pop_density.

Name the function access_country_code().

# The country_and_parent_code column needs to
# be split into two columns without the strings
pop_density[["country_code", "parent_code"]] = (pop_density["country_and_parent_code"]
                                                .str.split("_", expand=True))

# Remove the country_and_parent_code and parent_code columns, not used in later analysis
# axis=1 to remove the columns
pop_density = pop_density.drop(labels=[
                                       "country_and_parent_code",
                                       "parent_code"
                                       ],
                               axis=1)
import pandas as pd
import os

## Code to be improved to complete exercise 1

def load_formatted_frame(path_to_data):
    """Read csv and reformat column names"""
    formatted_path = os.path.join(path_to_data)
    # Load the csv from given path
    dataframe = pd.read_csv(formatted_path)
    
    # Clean the column names, following naming conventions similar to PEP8
    dataframe.columns = dataframe.columns.str.lower()
    dataframe.columns = dataframe.columns.str.replace(" ", "_")
    
    return dataframe


def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    # Copy the incoming data to prevent editing the original
    dataframe = dataframe.copy()
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                 "country_and_parent_code",
                                                 "parent_code"
                                              ], axis=1)
    return dataframe_dropped
    

# Loading both data frames

pop_density = load_formatted_frame("../../data/population_density_2019.csv")
locations = load_formatted_frame("../../data/locations.csv")

  
pop_density_single_code = access_country_code(pop_density)
print(pop_density_single_code["country_code"])

Using the code snippets from our example analysis below, write a function that:

  • takes a data frame as an input
  • can replace a string within a specified column
  • can convert the type of a given column

This function will be used across both data frames later - so be sure it is general enough to work for both. In addition, it must use only data it gets as arguments.

Add this function into the file example_code/function_input/exercise2.py. Use the code already there to test your result on locations and pop_density.

Name the function convert_type_to_int().

# Replace specific string in column
locations["location_id"] = locations["location_id"].str.replace('"', '')

# Convert the type of the column
locations["location_id"] = locations["location_id"].astype(int)
import pandas as pd
import os

## Code to be improved to complete exercise 2

def load_formatted_frame(path_to_data):
    """Read csv and reformat column names"""
    formatted_path = os.path.join(path_to_data)
    # Load the csv from given path
    dataframe = pd.read_csv(formatted_path)
    
    # Clean the column names, following naming conventions similar to PEP8
    dataframe.columns = dataframe.columns.str.lower()
    dataframe.columns = dataframe.columns.str.replace(" ", "_")
    
    return dataframe


def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    dataframe = dataframe.copy()
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                 "country_and_parent_code",
                                                 "parent_code"
                                              ], axis=1)
    return dataframe_dropped  
    


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    
    dataframe = dataframe.copy()
    
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    
    dataframe[column_name] = dataframe[column_name].astype(int)
    
    return dataframe


## Run the functions created

pop_density = load_formatted_frame("../../data/population_density_2019.csv")
locations = load_formatted_frame("../../data/locations.csv")


pop_density_single_code = access_country_code(pop_density)


pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                column_name="country_code",
                                                string_value="CC")
                                                
locations_correct_types = convert_type_to_int(dataframe=locations,
                                                column_name="location_id",
                                                string_value='"')

print(pop_density_correct_types.dtypes)
print(locations_correct_types.dtypes)

Using the code snippets from our example analysis below, write a function that:

  • takes two data frames as inputs
  • takes two string inputs
  • performs a left join on a column from each data frame, the columns are given by the strings input
  • removes the second specified string column from the joined data frame
  • returns a single data frame

This function will be used after the previous functions using the data frames outputted.

Add this function into the file example_code_python|R/function_input/exercise3.py|r. Use the code already there to test your result on the new data frame.

This function will be useful for our specific case, but also if we want to join other data frames or use different column names.

Our column names could change if we change an upstream function, so it’s important we give them as inputs.

Name the function join_frames().

# Join the data sets
# Left merge so we keep all pop_density data
pop_density_location = pop_density.merge(locations,
                                         how="left",
                                         left_on="country_code",
                                         right_on="location_id")

# Remove the location_id column as it is equal to country_code or missing
pop_density_location = pop_density_location.drop(labels=["location_id"], axis=1)
import pandas as pd
import os


def load_formatted_frame(path_to_data):
    """Read csv and reformat column names"""
    formatted_path = os.path.join(path_to_data)
    # Load the csv from given path
    dataframe = pd.read_csv(formatted_path)
    
    # Clean the column names, following naming conventions similar to PEP8
    dataframe.columns = dataframe.columns.str.lower()
    dataframe.columns = dataframe.columns.str.replace(" ", "_")
    
    return dataframe


def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    dataframe = dataframe.copy()
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                 "country_and_parent_code",
                                                 "parent_code"
                                              ], axis=1)
    return dataframe_dropped  
    


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    
    dataframe = dataframe.copy()
    
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    
    dataframe[column_name] = dataframe[column_name].astype(int)
    
    return dataframe


def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """
    Function to join the required frames on specified columns, dropping
    unnecessary column
    """
    
    left_dataframe = left_dataframe.copy()
    right_dataframe = right_dataframe.copy()
    
    combined_frames = left_dataframe.merge(right=right_dataframe,
                                           how="left",
                                           left_on=left_column,
                                           right_on=right_column)
                                           
    combined_frames_reduced = combined_frames.drop(labels=[right_column], axis=1)
    
    return combined_frames_reduced



## Run the functions created

pop_density = load_formatted_frame("../../data/population_density_2019.csv")
locations = load_formatted_frame("../../data/locations.csv")


pop_density_single_code = access_country_code(pop_density)


pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                column_name="country_code",
                                                string_value="CC")
                                                
locations_correct_types = convert_type_to_int(dataframe=locations,
                                                column_name="location_id",
                                                string_value='"')

population_location = join_frames(left_dataframe=pop_density_correct_types,
                                   right_dataframe=locations_correct_types,
                                   left_column="country_code",
                                   right_column="location_id")

print(population_location.columns)
print(population_location.head(10))

4.5 High Level Functions

This section will introduce some concepts and good practice that are relevant for when you have converted your script into functions.

In the section below, a version of code with all tasks broken into functions is shown. To help consolidate your learning from the previous exercises, an extension exercise is to convert the remaining code to functions yourself.

4.5.1 Extension Exercise

Using exercise3_answers.py convert the remaining script code into functions. The functions should be called:

  • aggregate_statistic()
  • format_frame()
  • write_output()

Each of these functions performs one task. They are general enough to work for our specific situation while leaving room for minor upstream adjustments, such as changed column names or filenames.


Side Note: We are writing the function write_output() as practice. It contains only a single line of code, so in practice it wouldn't be made into a function - it is important to avoid writing functions that are too small.

import pandas as pd
import os


def load_formatted_frame(path_to_data):
    """Read csv and reformat column names"""
    formatted_path = os.path.join(path_to_data)
    # Load the csv from given path
    dataframe = pd.read_csv(formatted_path)
    
    # Clean the column names, following naming conventions similar to PEP8
    dataframe.columns = dataframe.columns.str.lower()
    dataframe.columns = dataframe.columns.str.replace(" ", "_")
    
    return dataframe


def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    dataframe = dataframe.copy()
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                "country_and_parent_code",
                                                "parent_code"
                                              ], axis=1)
    return dataframe_dropped  
    


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    
    dataframe = dataframe.copy()
    
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    
    dataframe[column_name] = dataframe[column_name].astype(int)
    
    return dataframe


def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """
    Function to join the required frames on specified columns, dropping
    unnecessary column
    """
    
    left_dataframe = left_dataframe.copy()
    right_dataframe = right_dataframe.copy()
    
    
    combined_frames = left_dataframe.merge(right=right_dataframe,
                                           how="left",
                                           left_on=left_column,
                                           right_on=right_column)
                                           
    combined_frames_reduced = combined_frames.drop(labels=[right_column], axis=1)
    
    return combined_frames_reduced


def aggregate_mean(dataframe, groupby_column, aggregate_column):
    """Function to group by and calculate the aggregate mean of a column"""
    
    dataframe = dataframe.copy()
    
    # Remove unnecessary columns
    subset = dataframe[[groupby_column, aggregate_column]]
    
    # Perform mean calculation
    statistic = (subset.groupby(groupby_column, as_index=False)
                       .agg({aggregate_column: "mean"}))
                       
    statistic_renamed = statistic.rename(columns={aggregate_column: "mean_" + aggregate_column})
  
    return statistic_renamed
    
    
    
def format_frame(dataframe, statistic_column):
    """Function to format the dataframe for output"""
    
    dataframe = dataframe.copy()
    
    dataframe_sorted = dataframe.sort_values(by=statistic_column,
                                             ascending=False)
                                      
    dataframe_sorted[statistic_column] = dataframe_sorted[statistic_column].round(2)
    
    return dataframe_sorted
    
    
def write_output(dataframe, output_filepath):
    """Function to write output statistic in formatted manner"""

    dataframe.to_csv(output_filepath, index=False, sep=",")
    
    # We are not returning anything so our function
    # does not need a return value. By default this
    # will return `None`



## Run the functions created

pop_density = load_formatted_frame("../../data/population_density_2019.csv")
locations = load_formatted_frame("../../data/locations.csv")


pop_density_single_code = access_country_code(pop_density)


pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                column_name="country_code",
                                                string_value="CC")
                                                
locations_correct_types = convert_type_to_int(dataframe=locations,
                                              column_name="location_id",
                                              string_value='"')
                                              

population_location = join_frames(left_dataframe=pop_density_correct_types,
                                   right_dataframe=locations_correct_types,
                                   left_column="country_code",
                                   right_column="location_id")

aggregation = aggregate_mean(dataframe=population_location,
                              groupby_column="sdg_region_name",
                              aggregate_column="population_density")
                              
formatted_statistic = format_frame(aggregation, "mean_population_density")

write_output(formatted_statistic, "./mean_pop_density.csv")

4.5.2 Execute Program

Now we have converted all our code tasks into functions we can run each function, passing their output into the input of the next function.

Looking at the end of our script, there is a group of lines which describes the running of the program. These lines describe the whole analysis, showing each step in the process with a function corresponding to each step.

When we hit “Run” on our code, the code shown is run. The functions above it in the file are loaded into the program’s global scope, allowing them to be called by this code.

## Run the functions created

pop_density = load_formatted_frame("../../data/population_density_2019.csv")
locations = load_formatted_frame("../../data/locations.csv")


pop_density_single_code = access_country_code(pop_density)


pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                column_name="country_code",
                                                string_value="CC")

locations_correct_types = convert_type_to_int(dataframe=locations,
                                              column_name="location_id",
                                              string_value='"')

population_location = join_frames(left_dataframe=pop_density_correct_types,
                                   right_dataframe=locations_correct_types,
                                   left_column="country_code",
                                   right_column="location_id")

aggregation = aggregate_mean(dataframe=population_location,
                              groupby_column="sdg_region_name",
                              aggregate_column="population_density")

formatted_statistic = format_frame(aggregation, "mean_population_density")

write_output(formatted_statistic, output_filepath="./mean_pop_density.csv")

4.5.3 Main Function

The code above makes what we are doing much easier to understand. To find out what the code is doing at each step, we can just read the name of the function, or look up what it does in the documentation.

The way the code is currently designed, however, still uses variables in the global scope, something to generally avoid.

If we add one more function, that calls our other functions, we can run our whole program by calling this one function. This will make it much easier to run the analysis later down the line, and to extend our code into modules and packages.

Functions that run other functions are called “high level” functions. Using high level functions lets us build more structure to our code.

Often the convention for naming the highest level function in code is to call it main(), however it does not have to have this name. We will call our highest level analysis function get_analyse_output().

In effect, we put all the code that was used to “run” the program within the get_analyse_output() function. This way we can run the program only when we call get_analyse_output().

This is the point where typical convention between Python and R starts to differ. Be sure to check both methods if you regularly code in both.

How many levels of “high level” functions we have should be proportionate to our code. For a small task we probably don’t need high level functions. For a larger pipeline they become significantly more important.

In Python there is an extra line of code we add to help us split our code into modules.

The line is as follows:

if __name__ == "__main__":
    # Add your code to run here

When Python runs a file (module), the interpreter assigns a value to that file’s __name__ attribute.

When we click “Run” in our IDE (such as Spyder), or run a script directly in command line, the __name__ value of that file run is "__main__".

If the file (module) is imported elsewhere in a program, then the value of __name__ is not equal to "__main__". It is instead assigned the name of the module file. Therefore any code within the block if __name__ == "__main__": will not be run.

At the moment, that is not very useful to us, but when we start expanding our code into multiple files (Convert to Modules) it becomes key.
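We can see this behaviour by writing a tiny module to a temporary file and running it both ways (the file name demo_module.py is purely illustrative):

```python
import pathlib
import subprocess
import sys
import tempfile

# The same module reports a different __name__ depending on how it is run
code = 'print("__name__ is", __name__)\n'

with tempfile.TemporaryDirectory() as tmp:
    module_path = pathlib.Path(tmp) / "demo_module.py"
    module_path.write_text(code)

    # Run the file directly, as if clicking "Run" or using the command line
    run = subprocess.run([sys.executable, str(module_path)],
                         capture_output=True, text=True)
    print(run.stdout, end="")  # __name__ is __main__

    # Import the same file as a module instead
    sys.path.insert(0, tmp)
    import demo_module  # prints: __name__ is demo_module
```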

If we want to alter the behaviour of the get_analyse_output() function we have two options:

  • alter the main function to change variables passed to the function
  • add parameters

Below is our get_analyse_output() function, and the code used to run it.

def get_analyse_output():
    """
    Access the data, run the analysis of population density means over locations,
    output the data into a CSV.
    """

    pop_density = load_formatted_frame("../../data/population_density_2019.csv")
    locations = load_formatted_frame("../../data/locations.csv")


    pop_density_single_code = access_country_code(pop_density)


    pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                    column_name="country_code",
                                                    string_value="CC")

    locations_correct_types = convert_type_to_int(dataframe=locations,
                                                  column_name="location_id",
                                                  string_value='"')

    population_location = join_frames(left_dataframe=pop_density_correct_types,
                                       right_dataframe=locations_correct_types,
                                       left_column="country_code",
                                       right_column="location_id")

    aggregation = aggregate_mean(dataframe=population_location,
                                  groupby_column="sdg_region_name",
                                  aggregate_column="population_density")

    formatted_statistic = format_frame(aggregation, "mean_population_density")

    write_output(formatted_statistic, output_filepath="./mean_pop_density.csv")


if __name__ == "__main__":
    get_analyse_output()

If we were to use this analysis on different data sets, it may be useful for us to be able to change the data inputs and outputs.

def get_analyse_output(population_filepath, location_filepath, output_filepath):
    """
    Access the data, run the analysis of population density means over locations,
    output the data into a CSV.
    """

    pop_density = load_formatted_frame(population_filepath)
    locations = load_formatted_frame(location_filepath)


    pop_density_single_code = access_country_code(pop_density)


    pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                    column_name="country_code",
                                                    string_value="CC")

    locations_correct_types = convert_type_to_int(dataframe=locations,
                                                  column_name="location_id",
                                                  string_value='"')

    population_location = join_frames(left_dataframe=pop_density_correct_types,
                                       right_dataframe=locations_correct_types,
                                       left_column="country_code",
                                       right_column="location_id")

    aggregation = aggregate_mean(dataframe=population_location,
                                  groupby_column="sdg_region_name",
                                  aggregate_column="population_density")

    formatted_statistic = format_frame(aggregation, "mean_population_density")

    write_output(formatted_statistic, output_filepath)



if __name__ == "__main__":
    get_analyse_output(population_filepath="/data/population_density_2019.csv",
                       location_filepath="/data/locations.csv",
                       output_filepath="./mean_pop_density.csv")

4.6 Hierarchies

We have now introduced a higher level function that runs other functions for us.

This is a great step forward in structuring our code. If we want to understand what the program does:

  • we first look at this high level get_analyse_output() function
  • each function within the higher function describes a step of the process, a task
  • for more information on how each task is completed, the function can be found in the script

By having some functions that call others we now have levels and dependencies of functions.

Well documented high-level functions mean we do not need to dive into the lower level functions to understand what the code does.

These relationships between functions can be described with hierarchical diagrams. Writing down the relationship between tasks in your code is an extremely useful practice in structuring code.

Below is what the code in main_funcs.py looks like as a hierarchy of functions.

diagram of relationships of functions in main_funcs.py


As you can see, a lot of steps are being run by the single get_analyse_output() function. It is really important we have this high level function, but we can have more if it makes the structure of our program clearer.


Below we will first look at a new code diagram with a different structure to the previous, then the code it corresponds to.

This is slightly overkill for our program at the moment due to its small size, but the principle becomes very useful as our code grows more complex.

The new structure:

  • still has a highest level get_analyse_output() function
  • contains multiple functions in between the lowest level and the highest
  • has middle functions which perform a larger task, grouping smaller tasks together

diagram of relationships of functions in multi-level function hierarchy  in main_funcs_middle.py

Note that we have not added an additional higher level function above write_output(). This is because we don't need a higher function that calls just one lower level function. In addition, we do not always want to write data out while we test the analysis pipeline.

The benefit of this structure is that we can more easily access the data produced by our pipeline at relevant steps.

  • if we want to look at the joined data after cleaning and manipulation we just call the extract_transform() function
  • to perform a different analysis on the cleaned frame we can write a different analyse() function and call that instead within get_analyse_output()
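As a self-contained sketch of that flexibility (the data and the median function here are illustrative, not part of the course files):

```python
import pandas as pd


def extract_transform_demo():
    """Stand-in for extract_transform(): returns a small, already cleaned frame."""
    return pd.DataFrame({"sdg_region_name": ["Oceania", "Oceania", "Europe"],
                         "population_density": [4.0, 6.0, 34.0]})


def analyse_median(full_dataframe):
    """An alternative analysis step: median rather than mean, reusing the cleaned data."""
    return (full_dataframe.groupby("sdg_region_name", as_index=False)
                          .agg({"population_density": "median"}))


# Inspect the cleaned data without writing anything to disk...
cleaned = extract_transform_demo()

# ...then swap in a different analysis step
result = analyse_median(cleaned)
```

Because the middle layer returns a dataframe instead of writing straight to a file, the analysis step can be replaced without touching the extraction code.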

4.6.1 New Structure

import pandas as pd
import os


def load_formatted_frame(path_to_data):
    """Read csv and reformat column names"""
    formatted_path = os.path.join(path_to_data)
    # Load the csv from given path
    dataframe = pd.read_csv(formatted_path)
    
    # Clean the column names, following naming conventions similar to PEP8
    dataframe.columns = dataframe.columns.str.lower()
    dataframe.columns = dataframe.columns.str.replace(" ", "_")
    
    return dataframe


def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    dataframe = dataframe.copy()
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                "country_and_parent_code",
                                                "parent_code"
                                              ], axis=1)
    return dataframe_dropped  
    


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    
    dataframe = dataframe.copy()
    
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    
    dataframe[column_name] = dataframe[column_name].astype(int)
    
    return dataframe


def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """
    Function to join the required frames on specified columns, dropping
    the unnecessary column
    """
    
    left_dataframe = left_dataframe.copy()
    right_dataframe = right_dataframe.copy()
    
    combined_frames = left_dataframe.merge(right=right_dataframe,
                                           how="left",
                                           left_on=left_column,
                                           right_on=right_column)
                                           
    combined_frames_reduced = combined_frames.drop(labels=[right_column], axis=1)
    
    return combined_frames_reduced


def aggregate_mean(dataframe, groupby_column, aggregate_column):
    """Function to groupby and calculate the aggregate mean of two columns"""
    
    dataframe = dataframe.copy()
    
    # Remove unnecessary columns
    subset = dataframe[[groupby_column, aggregate_column]]
    
    # Perform mean calculation
    statistic = (subset.groupby(groupby_column, as_index=False)
                       .agg({aggregate_column: "mean"}))
                       
    statistic_renamed = statistic.rename(columns={aggregate_column: "mean_" + aggregate_column})
  
    return statistic_renamed
    
    
    
def format_frame(dataframe, statistic_column):
    """Function to format the dataframe for output"""
    
    dataframe = dataframe.copy()
    
    dataframe_sorted = dataframe.sort_values(by=statistic_column,
                                             ascending=False)
                                      
    dataframe_sorted[statistic_column] = dataframe_sorted[statistic_column].round(2)
    
    return dataframe_sorted
    
    
def write_output(dataframe, output_filepath):
    """Function to write output statistic in formatted manner"""

    dataframe.to_csv(output_filepath, index=False, sep=",")
    


def extract_transform(population_filepath, location_filepath):
    """Load the data and convert it to clean joined format for analysis"""
    
    pop_density = load_formatted_frame(population_filepath)
    locations = load_formatted_frame(location_filepath)
    
    
    pop_density_single_code = access_country_code(pop_density)
    
    
    pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                    column_name="country_code",
                                                    string_value="CC")
                                                    
    locations_correct_types = convert_type_to_int(dataframe=locations,
                                                  column_name="location_id",
                                                  string_value='"')
    
    population_location = join_frames(left_dataframe=pop_density_correct_types,
                                       right_dataframe=locations_correct_types,
                                       left_column="country_code",
                                       right_column="location_id")
                                       
    return population_location
    
    
def analyse(full_dataframe, groupby_column, aggregate_column, statistic_column):
    """Function to perform groupby mean of population density and reformat result"""
  
    full_dataframe = full_dataframe.copy()
    
    aggregation = aggregate_mean(dataframe=full_dataframe,
                                 groupby_column=groupby_column,
                                 aggregate_column=aggregate_column)

    formatted_statistic = format_frame(aggregation, statistic_column=statistic_column)
    
    
    return formatted_statistic



def get_analyse_output(population_filepath, location_filepath, output_filepath):
    """
    Access the data, run the analysis of population density means over locations,
    output the data into a csv.
    """
    
    population_location = extract_transform(population_filepath=population_filepath,
                                            location_filepath=location_filepath)
    
    formatted_statistic = analyse(full_dataframe=population_location,
                                  groupby_column="sdg_region_name",
                                  aggregate_column="population_density",
                                  statistic_column="mean_population_density")
    
    write_output(formatted_statistic, output_filepath)
    
    
    
if __name__ == "__main__":
    get_analyse_output(population_filepath="../../data/population_density_2019.csv",
                       location_filepath="../../data/locations.csv",
                       output_filepath="./mean_pop_density.csv")

4.7 Interaction

“Who will need to access this part of the program?” is a useful question to think about when structuring your code.

As the main developer you will likely be accessing the whole code base, every function.

To run the program a user only needs to interact with a small part of the program. The part of the program a user will be interacting with is called the “application programming interface”, API. Other areas of the code can be seen, but rarely used by the user.

By structuring the code properly, parts of it can be "hidden" from the user: the end user does not need to understand or access the inner workings of every function - they just need to run the program.

In our code the API part would be the get_analyse_output() function.

Separating public facing and lower level functions improves clarity and usability. All code, whether part of the API or a lower level, should be as clear as possible to help with future development.

Having this distinction allows us to test the code at the correct levels.

diagram of which areas of the code are accessible to the public user and developer

Having a hierarchy of functions, with a clear distinction about what forms the API, can make the code simpler. Structuring the code well makes it easier to run, test and fix for both developers and users.

This concept becomes more important in:

  • software products with non-technical users
  • object-oriented programming, using public interfaces

Ideally, a user does not need to open any code files to run the analysis. Instead, the user can work with a graphical user interface (GUI) or a command line interface. Parameters such as the input data file paths and output paths are supplied in a separate configuration file or entered by the user through the interface.
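As a sketch of this idea, the standard library's argparse module can expose those parameters on the command line (the flag names here are illustrative, not part of the course code):

```python
import argparse


def build_parser():
    """Define the command line interface: the only part a user interacts with."""
    parser = argparse.ArgumentParser(
        description="Mean population density by SDG region")
    parser.add_argument("--population", required=True,
                        help="path to the population density CSV")
    parser.add_argument("--locations", required=True,
                        help="path to the locations CSV")
    parser.add_argument("--output", default="./mean_pop_density.csv",
                        help="where to write the result CSV")
    return parser


# Parse an example command line (a real run would use sys.argv instead)
args = build_parser().parse_args(
    ["--population", "pop.csv", "--locations", "locations.csv"])
```

A user could then run something like python main.py --population data/population_density_2019.csv --locations data/locations.csv without opening any code files.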

“What will the end product of my analysis pipeline be?” is an important question to consider when structuring your project.

5 Functions to Modules

Earlier in this course a scenario was introduced explaining that a single script can grow large and become difficult to maintain.

Although adding structure with functions makes our code better, it can make it longer. Larger code files are difficult to maintain and understand.

We can make our code even clearer and better-structured by moving the functions in our code into different files. By grouping related functions together into different files it will be easier to look up different parts of our code. We no longer need to scroll through thousands of lines of code, we just navigate to the relevant file.

When we move functions (or other objects) into different files, they then need to be imported back into the file we are using those functions in.

When we move code into different files, each of those files is imported as a "module" in Python.

Before structuring code in different files, we need to discuss how to structure our directory properly to help us with this.

5.1 Project Structure

Now we are moving beyond working with just one script we need to consider our project, files, folders/directories and paths.

A key part of building a reproducible collection of code is making the project folder simple to understand, navigate and work with.

There is no single folder structure that is perfect for all analysis, however, there are good minimum requirements and guiding principles.

The situation to avoid is having all your data, source code, notebooks and documentation in the same location. This is confusing to anyone else looking in, and makes it harder for your project to be extended.

In this section we will outline basic components of project structure, their relevance to this course, and point to good resources for deciding your own approach.


5.1.1 Guiding Principles

The main principles are:

  • the complexity of the folder structure should be in line with the size/complexity of the project
    • smaller analyses should have a simpler folder structure
    • larger projects require more depth of structure (more sections, more folders dividing areas)
  • different file types should generally be separated, for example keep the .py files together, the CSVs together, the Jupyter notebooks in one place
  • what the end product of the project is should impact the structure. If the code is to become a package, an appropriate structure should be used.

5.1.2 Minimum requirements

A directory structure for analysis should separate the:

  • data used to analyse
  • source code to perform analysis
  • report generation / notebook files, figures and images
  • documentation
  • READMEs, licenses, package requirements

In addition, relevant version control folders/files will be present (not covered in this course) - .git folder, .gitignore file.

How this is done may depend on your team, language, and specific use case.

An example folder structure for our project is shown below. This is a minimum and could be extended.
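One possible minimal layout meeting these requirements (the folder names here are illustrative, not the course's exact tree):

```
project/
├── data/                # data used to analyse
├── src/                 # source code performing the analysis
├── outputs/             # reports, figures and images
├── docs/                # documentation
├── README.md
├── LICENSE
└── requirements.txt     # package requirements
```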

Note: /src/ stands for “source” - referring to your source code, the files your program is written with.

5.1.3 Additional content

There are other folders and considerations to structure your project beyond the minimum.

You may want to have separate folders for:

  • different parts of your code within the /src/ folder
  • references such as data dictionaries and user manuals
  • further divisions of your /data/ folder
  • models or other output products
  • notebooks for enhanced documentation and examples
  • a way to produce example data
  • environment building

In Python there are alternative resources for data science project structure, such as the Cookiecutter Data Science project template.

For both Python and R there is a project structure designed by the Government Digital Service for data science projects.

5.2 Using Separate Files

Now that we are aware of good project folder structure, we can discuss separating our big full code file into more logical smaller files.

This section will focus on the code contained within the /src/ folder shown in the last section.

Group functions with similar purpose together, such as data cleaning, loading, modelling. Make each file/module as focussed as possible to make it easy to find any required function.

To move the functions between files there are four main steps that need to be taken:

  • move the code between files (copy and paste)
  • check the new file can access all the code it needs
  • import the new file / function into the relevant files in the code base
  • check this has not affected how our code runs (test it hasn’t changed)
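For the final step, one lightweight check is to compare the result produced before and after the move. Below is a sketch using pandas' testing helpers (the results_match() name is ours, not part of the course code):

```python
import pandas as pd


def results_match(before, after):
    """Return True if two result frames hold identical data: a simple regression check."""
    try:
        pd.testing.assert_frame_equal(before.reset_index(drop=True),
                                      after.reset_index(drop=True))
        return True
    except AssertionError:
        return False


# Run the pipeline before and after moving code, then compare the two outputs:
# assert results_match(old_result, new_result)
```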

Moving code between files when a script already exists is a task that can be avoided by designing your project files in a useful way when starting to write your code. Any new analysis should make use of existing modules that you have created.

5.2.1 Moving Code

In this section we will learn how to move functions between files.

In the earlier part of the course “Function Inputs” we discussed why it is important that variables are only accessed through function inputs and outputs. This principle is even more important when moving code between files.

We are first going to make a new file called input_output.py. This file is going to contain all the code we need for loading and exporting our data frames. It is good practice to group related functions into the same file - especially around data access.

In addition, we are going to rename our original script to main.py. This is the file that will run all our code.

Within the input_output.py file we are going to put the following functions, removing them from main.py:

  • load_formatted_frame()
  • write_output()

Our files will now appear as below. Note, they will not currently run.

main.py: this file contains most of the code used to run the program.

import pandas as pd


def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    dataframe = dataframe.copy()
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                "country_and_parent_code",
                                                "parent_code"
                                              ], axis=1)
    return dataframe_dropped  
    


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    
    dataframe = dataframe.copy()
    
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    
    dataframe[column_name] = dataframe[column_name].astype(int)
    
    return dataframe


def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """
    Function to join the required frames on specified columns, dropping
    the unnecessary column
    """
    
    left_dataframe = left_dataframe.copy()
    right_dataframe = right_dataframe.copy()
    
    combined_frames = left_dataframe.merge(right=right_dataframe,
                                           how="left",
                                           left_on=left_column,
                                           right_on=right_column)
                                           
    combined_frames_reduced = combined_frames.drop(labels=[right_column], axis=1)
    
    return combined_frames_reduced


def aggregate_mean(dataframe, groupby_column, aggregate_column):
    """Function to groupby and calculate the aggregate mean of two columns"""
    
    dataframe = dataframe.copy()
    
    # Remove unnecessary columns
    subset = dataframe[[groupby_column, aggregate_column]]
    
    # Perform mean calculation
    statistic = (subset.groupby(groupby_column, as_index=False)
                       .agg({aggregate_column: "mean"}))
                       
    statistic_renamed = statistic.rename(columns={aggregate_column: "mean_" + aggregate_column})
  
    return statistic_renamed
    
    
    
def format_frame(dataframe, statistic_column):
    """Function to format the dataframe for output"""
    
    dataframe = dataframe.copy()
    
    dataframe_sorted = dataframe.sort_values(by=statistic_column,
                                             ascending=False)
                                      
    dataframe_sorted[statistic_column] = dataframe_sorted[statistic_column].round(2)
    
    return dataframe_sorted
    
    



def get_analyse_output():
    """
    Access the data, run the analysis of population density means over locations,
    output the data into a csv.
    """
    
    pop_density = load_formatted_frame("../../data/population_density_2019.csv")
    locations = load_formatted_frame("../../data/locations.csv")
    
    
    pop_density_single_code = access_country_code(pop_density)
    
    
    pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                    column_name="country_code",
                                                    string_value="CC")
                                                    
    locations_correct_types = convert_type_to_int(dataframe=locations,
                                                  column_name="location_id",
                                                  string_value='"')
    
    population_location = join_frames(left_dataframe=pop_density_correct_types,
                                       right_dataframe=locations_correct_types,
                                       left_column="country_code",
                                       right_column="location_id")
    
    aggregation = aggregate_mean(dataframe=population_location,
                                 groupby_column="sdg_region_name",
                                 aggregate_column="population_density")

    formatted_statistic = format_frame(aggregation, "mean_population_density")
    
    write_output(formatted_statistic, output_filepath="./mean_pop_density.csv")
    
    
    
if __name__ == "__main__":
    get_analyse_output()

input_output.py: this file contains the functions used for input and output operations.

import pandas as pd
import os


def load_formatted_frame(path_to_data):
    """Read csv and reformat column names"""
    formatted_path = os.path.join(path_to_data)
    # Load the csv from given path
    dataframe = pd.read_csv(formatted_path)
    
    # Clean the column names, following naming conventions similar to PEP8
    dataframe.columns = dataframe.columns.str.lower()
    dataframe.columns = dataframe.columns.str.replace(" ", "_")
    
    return dataframe
    

def write_output(dataframe, output_filepath):
    """Function to write output statistic in formatted manner"""

    dataframe.to_csv(output_filepath, index=False, sep=",")
    

5.2.2 Loading Code Between Files

The code shown above will not run because the main.py code cannot access the functions contained within input_output.py.

For a program to access code in another location the functions need to be loaded into that program explicitly. In Python this is called importing a module.

We load the code from one file into another, allowing our code to access the contents of the loaded file.

diagram of code being loaded from one file into another


Loading a file puts the objects within it into the scope of our program.

If we load a file’s code in the global scope of our program, then the file’s contents will be accessible anywhere in the program. If we load the file in a specific local scope it will only be accessible in that local area.
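A small standard-library sketch of the difference (the function and file names are made up):

```python
import os  # imported at the top: in scope for the whole file


def count_csv_files(folder):
    # imported here: fnmatch is only in scope inside this function
    import fnmatch
    return sum(1 for name in os.listdir(folder)
               if fnmatch.fnmatch(name, "*.csv"))
```

Top-level imports are the convention; local imports are occasionally used to delay loading a heavy dependency.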

diagram of original scope

In Python we load our own modules (files) in a very similar way to how we load third party packages such as pandas, matplotlib or numpy.

Python will search a range of locations to find the module we have requested. For now we will assume that all modules are in the same directory as the file we are loading them from.
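You can inspect those locations yourself; the exact entries vary between machines:

```python
import sys

# sys.path lists the directories Python searches for modules, in order.
# The first entry is usually the directory containing the running script.
for location in sys.path:
    print(location)
```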

When a module is imported into a file it becomes an object. The contents of the module can be accessed from this module object.
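A quick demonstration with a standard-library module:

```python
import math

# The imported module is itself an object...
print(type(math))

# ...and its contents are accessed as attributes of that object
print(math.pi)
print(math.sqrt(16.0))
```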

To import a module (file):

import input_output

In order to use a function from the module, we prepend the function name with the module name; this would also work for classes.

import input_output

dataframe = input_output.load_formatted_frame(path_to_data="./data.csv")

This method makes it clear which module each function comes from.

This means we will need to change our existing main.py file to use the correct module name and function.

Alternatively, if we only need one specific function from a module, we can import the function on its own.

from input_output import load_formatted_frame

dataframe = load_formatted_frame(path_to_data="./data.csv")

It is good practice to avoid importing specific functions when the program may already contain functions with the same or similar names; this prevents functions being overwritten and avoids confusion around names.
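A tiny illustration of that risk: whichever definition comes last silently wins.

```python
from math import sqrt


def sqrt(number):
    # This definition silently replaces the imported math.sqrt above
    return round(number ** 0.5)


# No warning is raised: calls now go to the local function, not math.sqrt
print(sqrt(10))
```

Using import math and calling math.sqrt() instead would have avoided the collision entirely.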

Below is the main.py file which imports the input_output module. The function calls in main have been changed to refer to the input_output module. Read through the code to see the changes.

By convention modules are loaded at the top of a file. This makes it clear what modules are used in the code and ensures all parts of the code that need the module can access it.

import pandas as pd

import input_output


def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    dataframe = dataframe.copy()
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                "country_and_parent_code",
                                                "parent_code"
                                              ], axis=1)
    return dataframe_dropped  
    


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    
    dataframe = dataframe.copy()
    
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    
    dataframe[column_name] = dataframe[column_name].astype(int)
    
    return dataframe


def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """
    Function to join the required frames on specified columns, dropping
    the unnecessary column
    """
    
    left_dataframe = left_dataframe.copy()
    right_dataframe = right_dataframe.copy()
    
    combined_frames = left_dataframe.merge(right=right_dataframe,
                                           how="left",
                                           left_on=left_column,
                                           right_on=right_column)
                                           
    combined_frames_reduced = combined_frames.drop(labels=[right_column], axis=1)
    
    return combined_frames_reduced


def aggregate_mean(dataframe, groupby_column, aggregate_column):
    """Function to groupby and calculate the aggregate mean of two columns"""
    
    dataframe = dataframe.copy()
    
    # Remove unnecessary columns
    subset = dataframe[[groupby_column, aggregate_column]]
    
    # Perform mean calculation
    statistic = (subset.groupby(groupby_column, as_index=False)
                       .agg({aggregate_column: "mean"}))
                       
    statistic_renamed = statistic.rename(columns={aggregate_column: "mean_" + aggregate_column})
  
    return statistic_renamed
    
    
    
def format_frame(dataframe, statistic_column):
    """Function to format the dataframe for output"""
    
    dataframe = dataframe.copy()
    
    dataframe_sorted = dataframe.sort_values(by=statistic_column,
                                             ascending=False)
                                      
    dataframe_sorted[statistic_column] = dataframe_sorted[statistic_column].round(2)
    
    return dataframe_sorted
    
    



def get_analyse_output():
    """
    Access the data, run the analysis of population density means over locations,
    output the data into a csv.
    """
    # Added the module name here
    pop_density = input_output.load_formatted_frame("../../../data/population_density_2019.csv")
    locations = input_output.load_formatted_frame("../../../data/locations.csv")

    
    pop_density_single_code = access_country_code(pop_density)
    
    
    pop_density_correct_types = convert_type_to_int(dataframe=pop_density_single_code,
                                                    column_name="country_code",
                                                    string_value="CC")
                                                    
    locations_correct_types = convert_type_to_int(dataframe=locations,
                                                  column_name="location_id",
                                                  string_value='"')
    
    population_location = join_frames(left_dataframe=pop_density_correct_types,
                                       right_dataframe=locations_correct_types,
                                       left_column="country_code",
                                       right_column="location_id")
    
    aggregation = aggregate_mean(dataframe=population_location,
                                 groupby_column="sdg_region_name",
                                 aggregate_column="population_density")

    formatted_statistic = format_frame(aggregation, "mean_population_density")
    
    # Added the module name here
    input_output.write_output(formatted_statistic, output_filepath="./mean_pop_density.csv")
    
    
    
if __name__ == "__main__":
    get_analyse_output()

If we want to debug or manually test our functions within input_output we can add the following block of code to the end of our script. The code within this block will only run if we run the input_output.py file itself, not the main.py file.

if __name__ == "__main__":
  # code to debug and test

As you look at others’ code you will see a variety of ways to import modules. Below we discuss two frequent conventions.

You can change the name of an imported module for ease of use by using the as keyword.

This is often seen with common third-party packages, such as pandas or numpy, but will work for any module.

import pandas as pd
import numpy as np

This does not change the functionality of the module at all, it just changes the name we use to refer to it.

All objects in a module can be imported into the global scope at once; however, this should be avoided.

The below code is being shown to make you aware of what can be done, but should be avoided.

Instead of importing the module itself, or specific functions from within it, all of the functions within a module can be imported at once. This is done using the * operator, which signifies "all".

The below code would make all the functions in input_output available to use in the global scope.

from input_output import *

population = load_formatted_frame("path_to_population")

This may seem useful, but it can cause us to overwrite functions of the same name, and makes it unclear which module each function came from. It is not clear in the example above that load_formatted_frame() comes from input_output. For large projects this will cause issues in readability and bug fixing.

5.3 Exercises

These exercises will help you practice splitting code into different files and loading them back into the main.py script.

Create a new file in the example_code/modules/exercises/start/ folder called analysis.py.

Put the following functions within the new file:

  • aggregate_mean()
  • format_frame()

Change the code in main.py such that the file loads the relevant functions and runs the whole analysis.


Create a new file in the example_code/modules/exercises/start/ folder called manipulation.py.

Put the following functions within the new file:

  • convert_type_to_int()
  • access_country_code()
  • join_frames()

Change the code in main.py such that the file loads the relevant functions and runs the whole analysis.

analysis.py:

import pandas as pd


def aggregate_mean(dataframe, groupby_column, aggregate_column):
    """Function to groupby and calculate the aggregate mean of two columns"""
    
    dataframe = dataframe.copy()
    
    # Remove unnecessary columns
    subset = dataframe[[groupby_column, aggregate_column]]
    
    # Perform mean calculation
    statistic = (subset.groupby(groupby_column, as_index=False)
                       .agg({aggregate_column: "mean"}))
                       
    statistic_renamed = statistic.rename(columns={aggregate_column: "mean_" + aggregate_column})
  
    return statistic_renamed
    
    
    
def format_frame(dataframe, statistic_column):
    """Function to format the dataframe for output"""
    
    dataframe = dataframe.copy()
    
    dataframe_sorted = dataframe.sort_values(by=statistic_column,
                                             ascending=False)
                                      
    dataframe_sorted[statistic_column] = dataframe_sorted[statistic_column].round(2)
    
    return dataframe_sorted
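To check what aggregate_mean() produces, the same groupby-aggregate-rename pattern can be run on a small, made-up dataframe (the values here are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "sdg_region_name": ["Europe", "Europe", "Africa"],
    "population_density": [100.0, 200.0, 50.0],
})

# Same pattern as aggregate_mean(): groupby, aggregate, rename
stat = (toy.groupby("sdg_region_name", as_index=False)
           .agg({"population_density": "mean"})
           .rename(columns={"population_density": "mean_population_density"}))

print(stat)
# groupby sorts the keys, so Africa (50.0) comes before Europe (150.0)
```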

# manipulation.py
import pandas as pd

def access_country_code(dataframe):
    """Function to split combined code columns and remove uncessary columns"""
    
    dataframe = dataframe.copy()
    
    dataframe[["country_code", "parent_code"]] = (dataframe["country_and_parent_code"]
                                                .str.split("_", expand=True))
                                                
    dataframe_dropped = dataframe.drop(labels=[
                                                "country_and_parent_code",
                                                "parent_code"
                                              ], axis=1)
    return dataframe_dropped  
    


def convert_type_to_int(dataframe, column_name, string_value):
    """Function to convert string to integer column type"""
    
    dataframe = dataframe.copy()
    
    dataframe[column_name] = dataframe[column_name].str.replace(string_value, "")
    
    dataframe[column_name] = dataframe[column_name].astype(int)
    
    return dataframe


def join_frames(left_dataframe, right_dataframe, left_column, right_column):
    """
    Function to join the required frames on specified columns, dropping
    unrecessary column
    """
    
    left_dataframe = left_dataframe.copy()
    right_dataframe = right_dataframe.copy()
    
    combined_frames = left_dataframe.merge(right=right_dataframe,
                                           how="left",
                                           left_on=left_column,
                                           right_on=right_column)
                                           
    combined_frames_reduced = combined_frames.drop(labels=[right_column], axis=1)
    
    return combined_frames_reduced
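The three manipulation steps can be traced end to end on a toy dataframe (the values below are invented for illustration): split the combined code column as access_country_code() does, strip the string prefix and convert the type as convert_type_to_int() does, then left-join onto a lookup frame as join_frames() does.

```python
import pandas as pd

# Made-up data with the same shape of problem as the project's files
toy = pd.DataFrame({"country_and_parent_code": ["CC1_900", "CC2_900"],
                    "population_density": [100.0, 200.0]})
lookup = pd.DataFrame({"location_id": [1, 2],
                       "sdg_region_name": ["Europe", "Africa"]})

# Split the combined code column, as access_country_code() does
toy[["country_code", "parent_code"]] = (toy["country_and_parent_code"]
                                        .str.split("_", expand=True))

# Strip the "CC" prefix and convert to int, as convert_type_to_int() does
toy["country_code"] = toy["country_code"].str.replace("CC", "").astype(int)

# Left-join onto the lookup frame, as join_frames() does
combined = toy.merge(lookup, how="left",
                     left_on="country_code", right_on="location_id")

print(combined[["country_code", "sdg_region_name"]])
```

Each row now carries its region name, which is exactly the state the data needs to be in before aggregate_mean() is called.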

    
    
# main.py
import pandas as pd

# Import our required modules
import input_output
import analysis
import manipulation





def get_analyse_output():
    """
    Access the data, run the analysis of population density means over locations,
    output the data into a csv.
    """

    pop_density = input_output.load_formatted_frame("../../../../data/population_density_2019.csv")
    locations = input_output.load_formatted_frame("../../../../data/locations.csv")
    
    
    # Added module names below
    pop_density_single_code = manipulation.access_country_code(pop_density)
    
    
    pop_density_correct_types = manipulation.convert_type_to_int(dataframe=pop_density_single_code,
                                                                 column_name="country_code",
                                                                 string_value="CC")
                                                    
    locations_correct_types = manipulation.convert_type_to_int(dataframe=locations,
                                                               column_name="location_id",
                                                               string_value='"')
    
    population_location = manipulation.join_frames(left_dataframe=pop_density_correct_types,
                                                   right_dataframe=locations_correct_types,
                                                   left_column="country_code",
                                                   right_column="location_id")
    
    # Added module name here
    aggregation = analysis.aggregate_mean(dataframe=population_location,
                                          groupby_column="sdg_region_name",
                                          aggregate_column="population_density")

    formatted_statistic = analysis.format_frame(aggregation, "mean_population_density")
    

    input_output.write_output(formatted_statistic, output_filepath="./mean_pop_density.csv")
    
    
    
if __name__ == "__main__":
    get_analyse_output()

Continue on to the case study