Modular Programming

Author

ONS Data Science Campus


1 Components of Modular Programming

Functions, modules and packages structure programs, making them:

  • more readable
  • easier to fix
  • simpler to add new features to

[Diagram: relationships between packages, modules and functions]

A module is a file that contains one or more units of code - in our case: functions. A collection of module files together forms a package.

You have already been using functions, modules and packages that were written by other people. For example, dplyr is a package which contains modules and functions.

You can use automated tests to check that each component of your code performs as expected.

Testing sections of your code independently, using code - “Unit Testing” - is a concept covered in further courses. To do this, your code needs to be structured into functions, modules and packages.

Testing multiple sections together, along with their interactions with each other, is called “Integration Testing”.
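
As a taster of unit testing, a test simply runs a function on known inputs and checks the outputs. The function below is hypothetical, and base R’s stopifnot() stands in for a dedicated testing package; this is only a sketch of the idea covered in later courses.

```r
# A hypothetical function we might want to unit test
count_words <- function(text) {
  length(strsplit(text, " ")[[1]])
}

# Unit tests: run the function on known inputs and check the outputs
stopifnot(count_words("modular code is readable") == 4)
stopifnot(count_words("hello") == 1)
```

If either assertion fails, stopifnot() raises an error, telling us the function no longer performs as expected.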

To show how to structure code in an analysis context, we will work through an example scenario step by step.

2 Introducing the Project

You have been assigned to a group within your department responsible for analysing populations across the world. This work is in collaboration with the United Nations.

Your job is to provide analysis of population densities across the different United Nations Sustainable Development Goal (SDG) regions. You must provide average population density values for each SDG region.

One of your colleagues has already conducted this analysis on an ad hoc basis. They have given you their code to start with, but they have only analysed one year of data so far. You have been asked to write code that will be able to analyse multiple years of data, all in different files.

Before tackling the big task of analysing all the data, you are going to restructure your colleague's code into functions and modules, making the process more reproducible in the future.


This process is called “refactoring”.

Refactoring is the process of improving the structure and design of code while preserving its behaviour: after refactoring, the code still performs the same task.
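
As a toy illustration (the unit conversion here is hypothetical, not part of the project), a small refactor might replace copied-and-pasted code with a function - the results are unchanged, but the structure is cleaner:

```r
# Before refactoring: the same conversion copied and pasted with small changes
miles_a <- 10 / 1.609
miles_b <- 42 / 1.609

# After refactoring: one well-named function performs the same task
km_to_miles <- function(km) {
  km / 1.609
}

# The behaviour is identical before and after the refactor
stopifnot(km_to_miles(10) == miles_a)
stopifnot(km_to_miles(42) == miles_b)
```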

You have been sent two data sets needed to reproduce the analysis your colleague performed. Have a look through the data. What steps do you think need to be taken to make the data analysable?

2.1 Population Density

The population_density_2019.csv data contains each country’s name, its population density, and a combined country and parent code in a single column. There is also a year column.

The data is only from 2019.

Country Country and parent code Population Density Year
Burundi CC108_PC108 4.490100e+02 2019
Comoros CC174_PC174 4.572225e+02 2019
Djibouti CC262_PC262 4.199987e+01 2019
Eritrea CC232_PC232 3.462492e+01 2019
Ethiopia CC231_PC231 1.120787e+02 2019
Kenya CC404_PC404 9.237440e+01 2019
Madagascar CC450_PC450 4.635534e+01 2019
Malawi CC454_PC454 1.975896e+02 2019
Mauritius CC480_PC480 6.254532e+02 2019
Mayotte CC175_PC175 7.097413e+02 2019
Mozambique CC508_PC508 3.861497e+01 2019
Réunion CC638_PC638 3.555728e+02 2019
Rwanda CC646_PC646 5.118337e+02 2019
Seychelles CC690_PC690 2.124804e+02 2019
Somalia CC706_PC706 2.461649e+01 2019
South Sudan CC728_PC728 1.810637e+01 2019
Uganda CC800_PC800 2.215584e+02 2019
United Republic of Tanzania CC834_PC834 6.548370e+01 2019
Zambia CC894_PC894 2.402647e+01 2019
Zimbabwe CC716_PC716 3.785827e+01 2019
Angola CC24_PC24 2.552763e+01 2019
Cameroon CC120_PC120 5.474051e+01 2019
Central African Republic CC140_PC140 7.616904e+00 2019
Chad CC148_PC148 1.266430e+01 2019
Congo CC178_PC178 1.575550e+01 2019
Democratic Republic of the Congo CC180_PC180 3.828348e+01 2019
Equatorial Guinea CC226_PC226 4.834160e+01 2019
Gabon CC266_PC266 8.431630e+00 2019
Sao Tome and Principe CC678_PC678 2.240083e+02 2019
Botswana CC72_PC72 4.064904e+00 2019
Eswatini CC748_PC748 6.675192e+01 2019
Lesotho CC426_PC426 7.000221e+01 2019
Namibia CC516_PC516 3.029946e+00 2019
South Africa CC710_PC710 4.827199e+01 2019
Benin CC204_PC204 1.046572e+02 2019
Burkina Faso CC854_PC854 7.427406e+01 2019
Cabo Verde CC132_PC132 1.364605e+02 2019
Côte d'Ivoire CC384_PC384 8.086967e+01 2019
Gambia CC270_PC270 2.319858e+02 2019
Ghana CC288_PC288 1.336814e+02 2019
Guinea CC324_PC324 5.197479e+01 2019
Guinea-Bissau CC624_PC624 6.831142e+01 2019
Liberia CC430_PC430 5.126011e+01 2019
Mali CC466_PC466 1.611062e+01 2019
Mauritania CC478_PC478 4.390897e+00 2019
Niger CC562_PC562 1.840271e+01 2019
Nigeria CC566_PC566 2.206524e+02 2019
Saint Helena CC654_PC654 1.554103e+01 2019
Senegal CC686_PC686 8.464323e+01 2019
Sierra Leone CC694_PC694 1.082461e+02 2019
Togo CC768_PC768 1.486001e+02 2019
Algeria CC12_PC12 1.807630e+01 2019
Egypt CC818_PC818 1.008469e+02 2019
Libya CC434_PC434 3.851832e+00 2019
Morocco CC504_PC504 8.172029e+01 2019
Sudan CC729_PC729 2.425613e+01 2019
Tunisia CC788_PC788 7.527498e+01 2019
Western Sahara CC732_PC732 2.189692e+00 2019
Armenia CC51_PC51 1.038893e+02 2019
Azerbaijan CC31_PC31 1.215577e+02 2019
Bahrain CC48_PC48 2.159426e+03 2019
Cyprus CC196_PC196 1.297158e+02 2019
Georgia CC268_PC268 5.751564e+01 2019
Iraq CC368_PC368 9.050882e+01 2019
Israel CC376_PC376 3.936864e+02 2019
Jordan CC400_PC400 1.137835e+02 2019
Kuwait CC414_PC414 2.360874e+02 2019
Lebanon CC422_PC422 6.701573e+02 2019
Oman CC512_PC512 1.607429e+01 2019
Qatar CC634_PC634 2.439338e+02 2019
Saudi Arabia CC682_PC682 1.594115e+01 2019
State of Palestine CC275_PC275 8.274787e+02 2019
Syrian Arab Republic CC760_PC760 9.295939e+01 2019
Turkey CC792_PC792 1.084022e+02 2019
United Arab Emirates CC784_PC784 1.168723e+02 2019
Yemen CC887_PC887 5.523405e+01 2019
Kazakhstan CC398_PC398 6.871663e+00 2019
Kyrgyzstan CC417_PC417 3.345074e+01 2019
Tajikistan CC762_PC762 6.659776e+01 2019
Turkmenistan CC795_PC795 1.264464e+01 2019
Uzbekistan CC860_PC860 7.753106e+01 2019
Afghanistan CC4_PC4 5.826939e+01 2019
Bangladesh CC50_PC50 1.252563e+03 2019
Bhutan CC64_PC64 2.001978e+01 2019
India CC356_PC356 4.595797e+02 2019
Iran (Islamic Republic of) CC364_PC364 5.091271e+01 2019
Maldives CC462_PC462 1.769857e+03 2019
Nepal CC524_PC524 1.995725e+02 2019
Pakistan CC586_PC586 2.809326e+02 2019
Sri Lanka CC144_PC144 3.400372e+02 2019
China CC156_PC156 1.527217e+02 2019
China, Hong Kong SAR CC344_PC344 7.082054e+03 2019
China, Macao SAR CC446_PC446 2.141960e+04 2019
China, Taiwan Province of China CC158_PC158 6.713889e+02 2019
Dem. People's Republic of Korea CC408_PC408 2.131564e+02 2019
Japan CC392_PC392 3.479867e+02 2019
Mongolia CC496_PC496 2.075984e+00 2019
Republic of Korea CC410_PC410 5.268469e+02 2019
Brunei Darussalam CC96_PC96 8.221935e+01 2019
Cambodia CC116_PC116 9.339759e+01 2019
Indonesia CC360_PC360 1.493873e+02 2019
Lao People's Democratic Republic CC418_PC418 3.106350e+01 2019
Malaysia CC458_PC458 9.724483e+01 2019
Myanmar CC104_PC104 8.272807e+01 2019
Philippines CC608_PC608 3.626006e+02 2019
Singapore CC702_PC702 8.291919e+03 2019
Thailand CC764_PC764 1.362829e+02 2019
Timor-Leste CC626_PC626 8.696167e+01 2019
Viet Nam CC704_PC704 3.110978e+02 2019
Anguilla CC660_PC660 1.652444e+02 2019
Antigua and Barbuda CC28_PC28 2.207159e+02 2019
Aruba CC533_PC533 5.906111e+02 2019
Bahamas CC44_PC44 3.890969e+01 2019
Barbados CC52_PC52 6.674907e+02 2019
Bonaire, Sint Eustatius and Saba CC535_PC535 7.921646e+01 2019
British Virgin Islands CC92_PC92 2.002200e+02 2019
Cayman Islands CC136_PC136 2.706167e+02 2019
Cuba CC192_PC192 1.064777e+02 2019
Curaçao CC531_PC531 3.680698e+02 2019
Dominica CC212_PC212 9.574400e+01 2019
Dominican Republic CC214_PC214 2.222466e+02 2019
Grenada CC308_PC308 3.294176e+02 2019
Guadeloupe CC312_PC312 2.457297e+02 2019
Haiti CC332_PC332 4.086749e+02 2019
Jamaica CC388_PC388 2.722324e+02 2019
Martinique CC474_PC474 3.542991e+02 2019
Montserrat CC500_PC500 4.991000e+01 2019
Puerto Rico CC630_PC630 3.307107e+02 2019
Saint Barthélemy CC652_PC652 4.478636e+02 2019
Saint Kitts and Nevis CC659_PC659 2.032077e+02 2019
Saint Lucia CC662_PC662 2.996639e+02 2019
Saint Martin (French part) CC663_PC663 7.170189e+02 2019
Saint Vincent and the Grenadines CC670_PC670 2.835718e+02 2019
Sint Maarten (Dutch part) CC534_PC534 1.246735e+03 2019
Trinidad and Tobago CC780_PC780 2.719238e+02 2019
Turks and Caicos Islands CC796_PC796 4.020421e+01 2019
United States Virgin Islands CC850_PC850 2.987971e+02 2019
Belize CC84_PC84 1.711315e+01 2019
Costa Rica CC188_PC188 9.885548e+01 2019
El Salvador CC222_PC222 3.114648e+02 2019
Guatemala CC320_PC320 1.640675e+02 2019
Honduras CC340_PC340 8.710443e+01 2019
Mexico CC484_PC484 6.562696e+01 2019
Nicaragua CC558_PC558 5.439175e+01 2019
Panama CC591_PC591 5.712187e+01 2019
Argentina CC32_PC32 1.636308e+01 2019
Bolivia (Plurinational State of) CC68_PC68 1.062781e+01 2019
Brazil CC76_PC76 2.525078e+01 2019
Chile CC152_PC152 2.548920e+01 2019
Colombia CC170_PC170 4.537129e+01 2019
Ecuador CC218_PC218 6.995352e+01 2019
Falkland Islands (Malvinas) CC238_PC238 2.770748e-01 2019
French Guiana CC254_PC254 3.537993e+00 2019
Guyana CC328_PC328 3.976505e+00 2019
Paraguay CC600_PC600 1.773128e+01 2019
Peru CC604_PC604 2.539880e+01 2019
Suriname CC740_PC740 3.726686e+00 2019
Uruguay CC858_PC858 1.977906e+01 2019
Venezuela (Bolivarian Republic of) CC862_PC862 3.232904e+01 2019
Australia CC36_PC36 3.280684e+00 2019
New Zealand CC554_PC554 1.816514e+01 2019
Fiji CC242_PC242 4.871128e+01 2019
New Caledonia CC540_PC540 1.546811e+01 2019
Papua New Guinea CC598_PC598 1.937932e+01 2019
Solomon Islands CC90_PC90 2.393073e+01 2019
Vanuatu CC548_PC548 2.460066e+01 2019
Guam CC316_PC316 3.098056e+02 2019
Kiribati CC296_PC296 1.451951e+02 2019
Marshall Islands CC584_PC584 3.266167e+02 2019
Micronesia (Fed. States of) CC583_PC583 1.625871e+02 2019
Nauru CC520_PC520 5.382000e+02 2019
Northern Mariana Islands CC580_PC580 1.243761e+02 2019
Palau CC585_PC585 3.913261e+01 2019
American Samoa CC16_PC16 2.765600e+02 2019
Cook Islands CC184_PC184 7.311250e+01 2019
French Polynesia CC258_PC258 7.630738e+01 2019
Niue CC570_PC570 6.207692e+00 2019
Samoa CC882_PC882 6.964417e+01 2019
Tokelau CC772_PC772 1.330000e+02 2019
Tonga CC776_PC776 1.451347e+02 2019
Tuvalu CC798_PC798 3.885000e+02 2019
Wallis and Futuna Islands CC876_PC876 8.168571e+01 2019
Belarus CC112_PC112 4.658424e+01 2019
Bulgaria CC100_PC100 6.448155e+01 2019
Czechia CC203_PC203 1.383896e+02 2019
Hungary CC348_PC348 1.069776e+02 2019
Poland CC616_PC616 1.237233e+02 2019
Republic of Moldova CC498_PC498 1.230824e+02 2019
Romania CC642_PC642 8.413155e+01 2019
Russian Federation CC643_PC643 8.907212e+00 2019
Slovakia CC703_PC703 1.134797e+02 2019
Ukraine CC804_PC804 7.594014e+01 2019
Channel Islands CC830_PC830 9.066526e+02 2019
Denmark CC208_PC208 1.360329e+02 2019
Estonia CC233_PC233 3.127268e+01 2019
Faroe Islands CC234_PC234 3.486891e+01 2019
Finland CC246_PC246 1.820448e+01 2019
Iceland CC352_PC352 3.381915e+00 2019
Ireland CC372_PC372 7.087383e+01 2019
Isle of Man CC833_PC833 1.484018e+02 2019
Latvia CC428_PC428 3.065498e+01 2019
Lithuania CC440_PC440 4.403151e+01 2019
Norway CC578_PC578 1.472579e+01 2019
Sweden CC752_PC752 2.445872e+01 2019
United Kingdom CC826_PC826 2.791310e+02 2019
Albania CC8_PC8 1.051428e+02 2019
Andorra CC20_PC20 1.641404e+02 2019
Bosnia and Herzegovina CC70_PC70 6.472545e+01 2019
Croatia CC191_PC191 7.380806e+01 2019
Gibraltar CC292_PC292 3.370600e+03 2019
Greece CC300_PC300 8.125254e+01 2019
Holy See CC336_PC336 1.852273e+03 2019
Italy CC380_PC380 2.058547e+02 2019
Malta CC470_PC470 1.376178e+03 2019
Montenegro CC499_PC499 4.669056e+01 2019
North Macedonia CC807_PC807 8.261134e+01 2019
Portugal CC620_PC620 1.116517e+02 2019
San Marino CC674_PC674 5.644000e+02 2019
Serbia CC688_PC688 1.002999e+02 2019
Slovenia CC705_PC705 1.032102e+02 2019
Spain CC724_PC724 9.369844e+01 2019
Austria CC40_PC40 1.086666e+02 2019
Belgium CC56_PC56 3.810874e+02 2019
France CC250_PC250 1.189460e+02 2019
Germany CC276_PC276 2.396059e+02 2019
Liechtenstein CC438_PC438 2.376250e+02 2019
Luxembourg CC442_PC442 2.377336e+02 2019
Monaco CC492_PC492 2.615235e+04 2019
Netherlands CC528_PC528 5.070321e+02 2019
Switzerland CC756_PC756 2.174147e+02 2019
Bermuda CC60_PC60 1.250160e+03 2019
Canada CC124_PC124 4.114037e+00 2019
Greenland CC304_PC304 1.380436e-01 2019
Saint Pierre and Miquelon CC666_PC666 2.530870e+01 2019
United States of America CC840_PC840 3.597352e+01 2019

Location IDs

The locations.csv data contains each country’s location ID (equivalent to a country code), and the Sustainable Development Goal (SDG) region that the location ID is part of.

The data is valid for all years.

Location ID SDG Region Name
"108" Sub-Saharan Africa
"174" Sub-Saharan Africa
"262" Sub-Saharan Africa
"232" Sub-Saharan Africa
"231" Sub-Saharan Africa
"404" Sub-Saharan Africa
"450" Sub-Saharan Africa
"454" Sub-Saharan Africa
"480" Sub-Saharan Africa
"175" Sub-Saharan Africa
"508" Sub-Saharan Africa
"638" Sub-Saharan Africa
"646" Sub-Saharan Africa
"690" Sub-Saharan Africa
"706" Sub-Saharan Africa
"728" Sub-Saharan Africa
"800" Sub-Saharan Africa
"834" Sub-Saharan Africa
"894" Sub-Saharan Africa
"716" Sub-Saharan Africa
"24" Sub-Saharan Africa
"120" Sub-Saharan Africa
"140" Sub-Saharan Africa
"148" Sub-Saharan Africa
"178" Sub-Saharan Africa
"180" Sub-Saharan Africa
"226" Sub-Saharan Africa
"266" Sub-Saharan Africa
"678" Sub-Saharan Africa
"72" Sub-Saharan Africa
"748" Sub-Saharan Africa
"426" Sub-Saharan Africa
"516" Sub-Saharan Africa
"710" Sub-Saharan Africa
"204" Sub-Saharan Africa
"854" Sub-Saharan Africa
"132" Sub-Saharan Africa
"384" Sub-Saharan Africa
"270" Sub-Saharan Africa
"288" Sub-Saharan Africa
"324" Sub-Saharan Africa
"624" Sub-Saharan Africa
"430" Sub-Saharan Africa
"466" Sub-Saharan Africa
"478" Sub-Saharan Africa
"562" Sub-Saharan Africa
"566" Sub-Saharan Africa
"654" Sub-Saharan Africa
"686" Sub-Saharan Africa
"694" Sub-Saharan Africa
"768" Sub-Saharan Africa
"12" Northern Africa and Western Asia
"818" Northern Africa and Western Asia
"434" Northern Africa and Western Asia
"504" Northern Africa and Western Asia
"729" Northern Africa and Western Asia
"788" Northern Africa and Western Asia
"732" Northern Africa and Western Asia
"51" Northern Africa and Western Asia
"31" Northern Africa and Western Asia
"48" Northern Africa and Western Asia
"196" Northern Africa and Western Asia
"268" Northern Africa and Western Asia
"368" Northern Africa and Western Asia
"376" Northern Africa and Western Asia
"400" Northern Africa and Western Asia
"414" Northern Africa and Western Asia
"422" Northern Africa and Western Asia
"512" Northern Africa and Western Asia
"634" Northern Africa and Western Asia
"682" Northern Africa and Western Asia
"275" Northern Africa and Western Asia
"760" Northern Africa and Western Asia
"792" Northern Africa and Western Asia
"784" Northern Africa and Western Asia
"887" Northern Africa and Western Asia
"398" Central and Southern Asia
"417" Central and Southern Asia
"762" Central and Southern Asia
"795" Central and Southern Asia
"860" Central and Southern Asia
"4" Central and Southern Asia
"50" Central and Southern Asia
"64" Central and Southern Asia
"356" Central and Southern Asia
"364" Central and Southern Asia
"462" Central and Southern Asia
"524" Central and Southern Asia
"586" Central and Southern Asia
"144" Central and Southern Asia
"156" Eastern and South-Eastern Asia
"344" Eastern and South-Eastern Asia
"446" Eastern and South-Eastern Asia
"158" Eastern and South-Eastern Asia
"408" Eastern and South-Eastern Asia
"392" Eastern and South-Eastern Asia
"496" Eastern and South-Eastern Asia
"410" Eastern and South-Eastern Asia
"96" Eastern and South-Eastern Asia
"116" Eastern and South-Eastern Asia
"360" Eastern and South-Eastern Asia
"418" Eastern and South-Eastern Asia
"458" Eastern and South-Eastern Asia
"104" Eastern and South-Eastern Asia
"608" Eastern and South-Eastern Asia
"702" Eastern and South-Eastern Asia
"764" Eastern and South-Eastern Asia
"626" Eastern and South-Eastern Asia
"704" Eastern and South-Eastern Asia
"660" Latin America and the Caribbean
"28" Latin America and the Caribbean
"533" Latin America and the Caribbean
"44" Latin America and the Caribbean
"52" Latin America and the Caribbean
"535" Latin America and the Caribbean
"92" Latin America and the Caribbean
"136" Latin America and the Caribbean
"192" Latin America and the Caribbean
"531" Latin America and the Caribbean
"212" Latin America and the Caribbean
"214" Latin America and the Caribbean
"308" Latin America and the Caribbean
"312" Latin America and the Caribbean
"332" Latin America and the Caribbean
"388" Latin America and the Caribbean
"474" Latin America and the Caribbean
"500" Latin America and the Caribbean
"630" Latin America and the Caribbean
"652" Latin America and the Caribbean
"659" Latin America and the Caribbean
"662" Latin America and the Caribbean
"663" Latin America and the Caribbean
"670" Latin America and the Caribbean
"534" Latin America and the Caribbean
"780" Latin America and the Caribbean
"796" Latin America and the Caribbean
"850" Latin America and the Caribbean
"84" Latin America and the Caribbean
"188" Latin America and the Caribbean
"222" Latin America and the Caribbean
"320" Latin America and the Caribbean
"340" Latin America and the Caribbean
"484" Latin America and the Caribbean
"558" Latin America and the Caribbean
"591" Latin America and the Caribbean
"32" Latin America and the Caribbean
"68" Latin America and the Caribbean
"76" Latin America and the Caribbean
"152" Latin America and the Caribbean
"170" Latin America and the Caribbean
"218" Latin America and the Caribbean
"238" Latin America and the Caribbean
"254" Latin America and the Caribbean
"328" Latin America and the Caribbean
"600" Latin America and the Caribbean
"604" Latin America and the Caribbean
"740" Latin America and the Caribbean
"858" Latin America and the Caribbean
"862" Latin America and the Caribbean
"36" Australia/New Zealand
"554" Australia/New Zealand
"242" Oceania (excluding Australia and New Zealand)
"540" Oceania (excluding Australia and New Zealand)
"598" Oceania (excluding Australia and New Zealand)
"90" Oceania (excluding Australia and New Zealand)
"548" Oceania (excluding Australia and New Zealand)
"316" Oceania (excluding Australia and New Zealand)
"296" Oceania (excluding Australia and New Zealand)
"584" Oceania (excluding Australia and New Zealand)
"583" Oceania (excluding Australia and New Zealand)
"520" Oceania (excluding Australia and New Zealand)
"580" Oceania (excluding Australia and New Zealand)
"585" Oceania (excluding Australia and New Zealand)
"16" Oceania (excluding Australia and New Zealand)
"184" Oceania (excluding Australia and New Zealand)
"258" Oceania (excluding Australia and New Zealand)
"570" Oceania (excluding Australia and New Zealand)
"882" Oceania (excluding Australia and New Zealand)
"772" Oceania (excluding Australia and New Zealand)
"776" Oceania (excluding Australia and New Zealand)
"798" Oceania (excluding Australia and New Zealand)
"876" Oceania (excluding Australia and New Zealand)
"112" Europe and Northern America
"100" Europe and Northern America
"203" Europe and Northern America
"348" Europe and Northern America
"616" Europe and Northern America
"498" Europe and Northern America
"642" Europe and Northern America
"643" Europe and Northern America
"703" Europe and Northern America
"804" Europe and Northern America
"830" Europe and Northern America
"208" Europe and Northern America
"233" Europe and Northern America
"234" Europe and Northern America
"246" Europe and Northern America
"352" Europe and Northern America
"372" Europe and Northern America
"833" Europe and Northern America
"428" Europe and Northern America
"440" Europe and Northern America
"578" Europe and Northern America
"752" Europe and Northern America
"826" Europe and Northern America
"8" Europe and Northern America
"20" Europe and Northern America
"70" Europe and Northern America
"191" Europe and Northern America
"292" Europe and Northern America
"300" Europe and Northern America
"336" Europe and Northern America
"380" Europe and Northern America
"470" Europe and Northern America
"499" Europe and Northern America
"807" Europe and Northern America
"620" Europe and Northern America
"674" Europe and Northern America
"688" Europe and Northern America
"705" Europe and Northern America
"724" Europe and Northern America
"40" Europe and Northern America
"56" Europe and Northern America
"250" Europe and Northern America
"276" Europe and Northern America
"438" Europe and Northern America
"442" Europe and Northern America
"492" Europe and Northern America
"528" Europe and Northern America
"756" Europe and Northern America
"60" Europe and Northern America
"124" Europe and Northern America
"304" Europe and Northern America
"666" Europe and Northern America
"840" Europe and Northern America



2.2 Steps

  • To analyse the data we will need to have one single data frame. We must join locations and population densities on a column.

  • At the moment there is no exact matching column to join on, therefore we will need to manipulate columns.

  • Both data frames contain a “country code” value somewhere. For the Population Density dataframe, the country code will need to be separated from the parent code. There are also prefixes “CC” and “PC” that we will need to consider. For the Location IDs dataframe, the quotation marks will need to be removed.

  • Once the population densities have their respective SDG region in the same table the data can be aggregated. The data will be grouped by SDG region, then the mean will be calculated on the population density value.

We have stated above that the data is valid for all years, meaning that we expect the structure to be consistent. Once the 2019 data is clean, what things should we consider about applying our program to other years?
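
To make these steps concrete, here is a sketch of the whole pipeline on a two-row toy version of the data (values taken from the tables above). It uses only base R - merge() and aggregate() - whereas the colleague's script later in this course performs the same steps with tidyr and dplyr.

```r
# Toy versions of the two data sets
pop_density <- data.frame(
  country = c("Burundi", "Comoros"),
  country_and_parent_code = c("CC108_PC108", "CC174_PC174"),
  population_density = c(449.01, 457.22)
)
locations <- data.frame(
  location_id = c('"108"', '"174"'),
  sdg_region_name = c("Sub-Saharan Africa", "Sub-Saharan Africa")
)

# Separate the country code from the parent code and strip the "CC" prefix
codes <- sapply(strsplit(pop_density$country_and_parent_code, "_"), `[`, 1)
pop_density$country_code <- as.integer(sub("^CC", "", codes))

# Remove the quotation marks so both key columns are integers
locations$country_code <- as.integer(gsub('"', "", locations$location_id))

# Join on the now-matching column, then aggregate by SDG region
joined <- merge(pop_density, locations, by = "country_code")
region_means <- aggregate(population_density ~ sdg_region_name,
                          data = joined, FUN = mean)
```

Each bullet above maps onto one short block of this sketch, which is exactly the kind of grouping we will exploit when converting the real script into functions.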

3 Building Programs

Before getting started on the task of analysing the population density data, it is important that we are aware of different styles of programming.

3.1 Basic Programs

Scripts and notebooks can be really useful tools for quick analysis; however, they limit how we can scale and improve our project.

Our scripts become one line after another, with the data changing slightly at each step.

Code written this way is not grouped into a structure that helps us understand it.

This style of programming is sometimes referred to as “imperative”.

Programmers frequently copy and paste code to reuse it in different parts of a program, with small changes.

If the requirements of our project change, we need to hunt through the code to change all the relevant variables and values. If code sections have been copied and pasted, fixing an error in one place won’t fix the copies.

If the project expands, we need to write more and more code. This is often done in the same file, making the code harder to work through and understand.

3.2 Grouping Code

[Illustration: items of clothing sorted into groups]

To structure our code better we need to be able to group a collection of code together into one object. This can be done in two ways:

  • converting to functions
  • converting to classes

Classes are beyond the scope of this course and are less prevalent in R, so we will focus on functions here. However, many of these principles can also be applied to classes.

Properties of functions:

  • functions complete a task
  • functions can take inputs and give outputs

Functions can be run in one line of code, running complicated operations that have been written elsewhere. This helps “hide” some of the detail, making it clearer what is happening in the code - a process known as abstraction.

Well-named functions mean we do not need to understand the details inside a function - just what it achieves.

[Illustration: grouping code into objects]


Within this course there is a programming styles document, explaining some of the different styles of programming. This is suggested further reading at this point in the course.

There are some important principles to keep in mind when we design functions:

  • functions should not have “side-effects”. Data outside the function should not be impacted by using the function
  • functions should serve a single purpose
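
A short sketch of the side-effect principle (both functions here are hypothetical):

```r
# A function with a side-effect: it modifies a variable outside itself
total <- 0
add_to_total <- function(x) {
  # "<<-" assigns to "total" in the enclosing environment - avoid this
  total <<- total + x
}

# A pure function: its result depends only on its inputs,
# and nothing outside the function is changed
add <- function(x, y) {
  x + y
}

add_to_total(5)       # "total" is now 5, even though we assigned nothing
result <- add(2, 3)   # nothing outside the function is affected
```

The second style is much easier to reason about and test, because calling the function in a different order, or a different number of times, cannot quietly change other parts of the program.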

4 Scripts to Functions

In this section we will discuss considerations when converting scripts to functions, using an example script to show the steps involved in structuring code.

4.1 Example Analysis Code

This section shows the code your colleague has given you. At present it is a script that is well commented, but not well structured. Your task is to structure the code, allowing for future reproducible analysis.

At a high level, the code:

  • loads in the two data sets
  • cleans the data
  • joins the data so all useful information is together
  • calculates an aggregate statistic
  • tidies the output data
  • writes the data to a CSV file

Have a read through the script you have received, and be sure to look up any sections you are not comfortable with.

If you would prefer to look at it within an IDE it is located in

  • example_code_R/initial_script/.

For all scripts and files throughout this course it is assumed that the working directory being used is the location of the file being run. This may need to be changed in your given IDE.

# File to analyse the mean population density data from the UN

# Import relevant libraries for analysis
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

# Load the population density data 2019
population_path <- file.path("../../data/population_density_2019.csv")
pop_density <- readr::read_csv(population_path)

# Clean the column names, following snake_case convention
colnames(pop_density) <- tolower(colnames(pop_density))
colnames(pop_density) <- stringr::str_replace_all(colnames(pop_density), pattern = " ", replacement = "_")


# The country_and_parent_code column needs to 
# be split into two columns without the strings
pop_density <- tidyr::separate(data = pop_density, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), 
                               sep = "_")


# Remove the  parent_code column, not used in later analysis
pop_density <- dplyr::select(pop_density, everything(), -parent_code)

# Convert country_code to integer by removing strings
pop_density$country_code <- stringr::str_remove_all(pop_density$country_code, pattern = "CC")
pop_density$country_code <- as.integer(pop_density$country_code)


# Load the locations data to get the Sustainable Development Goals sub regions
locations_path <- file.path("../../data/locations.csv")
locations <- readr::read_csv(locations_path)

# Clean the column names, following naming conventions similar to PEP8
colnames(locations) <- tolower(colnames(locations))
colnames(locations) <- stringr::str_replace_all(colnames(locations), pattern = " ", replacement = "_")

# The location_id data has quotation marks making it a string,
# it needs to be converted to a numeric
locations$location_id <- stringr::str_remove_all(locations$location_id, pattern = '"')
locations$location_id <- as.integer(locations$location_id)


# Change location_id to be called country_code for join
colnames(locations)[colnames(locations) == "location_id"] <- "country_code"

# Join the data sets
# Left merge so we keep all pop_density data
pop_density_location <- dplyr::left_join(pop_density,
                                         locations,
                                         by = "country_code")


# Get just the relevant columns in preparation
# for the following groupby
region_density <- dplyr::select(pop_density_location, sdg_region_name, population_density)

# Calculate the mean population density for each region
# A non-weighted mean

region_density_grouped <- dplyr::group_by(region_density, sdg_region_name)

region_mean_density <- dplyr::summarise(region_density_grouped,
                                        "mean_population_density" = mean(population_density)
                                        )

# Sort the data for clearer reading, descending order
region_mean_density <- dplyr::arrange(region_mean_density,  -mean_population_density)

# Round mean density for clearer reading
region_mean_density$mean_population_density <- round(region_mean_density$mean_population_density,
                                                     digits = 2)

# Write out the final output
readr::write_csv(x = region_mean_density, file = "mean_population_density_output.csv")

Output data:

sdg_region_name mean_population_density
Eastern and South-Eastern Asia 2112.67
Europe and Northern America 764.93
Central and Southern Asia 330.63
Northern Africa and Western Asia 234.38
Latin America and the Caribbean 199.62
Oceania (excluding Australia and New Zealand) 144.20
Sub-Saharan Africa 126.55
Australia/New Zealand 10.72

4.2 Grouping Code by Functionality

Chunks of code that do similar things should be grouped together.

Deciding which sections of code make sense as being part of the same function is a common challenge when structuring code.

When converting code into a function, the main thing we look for is that it achieves one task. It may take a few lines of code to achieve this “one task”, but the point is that the function has a specific purpose.

If a function has more than one task or “responsibility” it will become hard to maintain, as it has many reasons to be modified.

If a function has a single “responsibility”, it will be focussed and much more likely to be reusable elsewhere.
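
A sketch of the single-responsibility idea, using hypothetical functions and a temporary file standing in for real data:

```r
# One function with two responsibilities: loading and summarising.
# If either the file format or the statistic changes, it must be edited.
load_and_summarise <- function(path) {
  df <- read.csv(path)
  mean(df$value)
}

# Split by responsibility: each function does one job,
# and each can be reused and tested on its own
load_data <- function(path) {
  read.csv(path)
}

summarise_mean <- function(df) {
  mean(df$value)
}

# Both approaches give the same answer on the same file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(value = c(1, 3)), tmp, row.names = FALSE)
stopifnot(load_and_summarise(tmp) == summarise_mean(load_data(tmp)))
```

With the split version, summarise_mean() can be reused on any data frame with a value column, not just ones read from CSV files.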

When writing scripts, we often repeat the same tasks at different points in the script. These are good parts of code to start converting into functions. Doing so reduces the amount of code written in the file - and makes what is happening at any step clearer.

You may also wish to consider writing helper functions for any housekeeping tasks that you commonly require.

If a code block isn’t repeated throughout the code, that’s okay too - all the code can still be converted into functions that are called one after the other.

It is much easier to read a sequence of well-named functions than a long stream of commands.
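
As a toy illustration (the functions here are hypothetical), a script built from named steps reads almost like a description of the analysis:

```r
# Each step of a (toy) analysis as its own well-named function
square <- function(x) x^2
halve <- function(x) x / 2
describe <- function(x) paste("result:", x)

# The script itself is now a readable sequence of function calls
value <- square(4)
value <- halve(value)
summary_text <- describe(value)
```

Reading the three calls in order tells us what happens, without needing to look inside any of the functions.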

[Illustration: grouping code based on purpose]

Sections of code are often very similar, differing only by a variable or two. When reading the code, think about what is happening to the variables and data involved. Consider whether a similar process is happening elsewhere, rather than whether the same data is involved. These repeated processes are opportunities to reduce the overall length of your script by writing your own functions.


Returning to our example script, we are going to take one task, convert it into a function, then improve the function so it can be used multiple times.

The lines of code:

  • load in a data frame given a path
  • reformat the column names of the data frame

4.2.1 Initial Code

# Load the population density data 2019
pop_density <- readr::read_csv("../../data/population_density_2019.csv")

# Clean the column names, following snake_case convention
colnames(pop_density) <- tolower(colnames(pop_density))
colnames(pop_density) <- stringr::str_replace_all(colnames(pop_density),
                                                  pattern = " ",
                                                  replacement = "_")

4.2.2 Basic Function

We can wrap the code into a function so that it can all be run with one command, like so.

#' Read population data and reformat column names
load_formatted_pop_frame <- function() {
  # Load the population density data 2019
  population_path <- file.path("../../data/population_density_2019.csv")
  pop_density <- readr::read_csv(population_path)

  # Clean the column names, following snake_case convention
  colnames(pop_density) <- tolower(colnames(pop_density))
  colnames(pop_density) <- stringr::str_replace_all(colnames(pop_density), 
                                                    pattern = " ", 
                                                    replacement = "_")
  
  return(pop_density)
}

# Call the function to assign the data frame
population_density <- load_formatted_pop_frame()

4.2.3 Adding Parameters

To improve the function, we can add as an argument something that may change in the future - the path to the data file.

Consider how you would have to change the previous function if the location of the population_density_2019.csv file changed.

Variable names in functions should reflect what that variable is. If you don’t know exactly the value the variable will take, then a generic name like dataframe is appropriate. Though consider the framework that you are working in - avoid reserved words or well-established, commonly used function names.

When we add an argument to a function to replace a value within it, we need to be sure to update every place the original value was used.

Our comments should reflect the changes made too.

Note that comments should add information - the comments in this tutorial are reminders of why we are doing this, and not the style of comment you would be expected to write. Often, if functions and variables are well-named, the code does not require many comments.

The new function can now be used for both the population_density data and locations.csv.

#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)

  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")
  
  return(dataframe)
}

# The path can be updated where the function is run if needed
population_density <- load_formatted_frame("../../data/population_density_2019.csv")

# The same function is used to load a formatted locations.csv
locations <- load_formatted_frame("../../data/locations.csv")

4.3 Scope

Scope is an important concept when creating functions and structuring code.

Scope refers to the places in a program that a variable can be accessed.


When writing scripts, variables can be accessed anywhere in the script - so long as the variable assignment has been run.

When we write scripts, we are storing all our variables at the highest, most accessible area of the program. This is referred to as “Global Scope”.

Variables with global scope are accessible in all locations of the program.

This is the easiest way to store variables when learning to program.

However, using global variables throughout our analysis often creates unexpected results in our code. If a new piece of code accidentally alters a global variable, it will affect all the code run after it, even if the function wasn’t meant to update the variable. Errors like this can be very tricky to track down and fix.


Some variables can only be accessed in certain locations within a program. When this happens, it is referred to as “Local Scope”.

Variables have local scope if they are accessible within a part of a program such as a function. They cannot be accessed outside the function they are assigned in.
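A minimal sketch (using a made-up function) shows local scope in action - the variable created inside the function cannot be reached from outside it:

```r
double_input <- function(x) {
  result <- x * 2  # 'result' has local scope
  return(result)
}

doubled <- double_input(5)  # the returned value, 10

# 'result' itself does not exist outside the function:
exists("result")  # FALSE
# print(result) would raise "object 'result' not found"
```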

diagram scope across different functions and levels

At the highest level of scope are the parts of the programming language that can be accessed anywhere - the built in functions (e.g. print()).

diagram showing built in functions, global and local scope relationships


To make our functions follow functional programming principles we need to keep variable scope in mind.

When designing functions:

  • all variables within should either:
    • be passed as arguments to the function
    • be created within the function
  • variables with Global Scope should only be given as arguments to the function
    • although all functions can access global variables, doing so makes our code harder to understand
  • if we need to access data with local scope (within a function) it needs to be returned by the function

If we are clear about what variables we are accessing, we can be sure about what their values are. Using only variables passed as arguments clarifies what data a function is operating on, and makes it much easier to reuse elsewhere (as it just needs its arguments defined, no hidden dependencies on global variables).

Think of your functions as having an entrance and an exit.

  • the entrance is the arguments and variables it takes as inputs
  • the exit is the value it returns

When choosing which parameters to give a function there are a few things to consider:

  • scope
  • clarity - avoid bundling parameters together into an object, make sure each parameter is clearly named
  • purpose - only include parameters needed for the task

diagram of function model of inputs and outputs

Not all functions need to return a value - for example, a function that writes out a file. In this case do not use a return statement, making it clear nothing will be returned. If there is no return statement, R returns the value of the last expression evaluated in the function, which for side-effect functions such as message() is NULL.
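For example, a toy side-effect-only function (the name and behaviour here are illustrative):

```r
#' Print a row count; there is nothing useful to return
report_rows <- function(dataframe) {
  message("Rows: ", nrow(dataframe))
  # no return statement - message() returns NULL invisibly,
  # so the function's result is NULL
}

result <- report_rows(data.frame(a = 1:3))
is.null(result)  # TRUE
```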

4.3.1 Function Inputs

Below are examples of code which have similar purposes, one uses parameter variables well, the other does not.

This is bad because we are altering data that has not been passed as arguments to the function.

letters <- c("a", "b", "c", "d", "e")

add_letter <- function() {
    long_letters <- c(letters, "f")
    return(long_letters)
}
    
# Run on original data
print("Initial")
[1] "Initial"
print(add_letter())
[1] "a" "b" "c" "d" "e" "f"
# the value of letters could be changed elsewhere in the program
letters <- c("1", "2", "3", "4", "5")

# Without changing our function call at all we get a different result
# with the same function call
print("Changed")
[1] "Changed"
print(add_letter())
[1] "1" "2" "3" "4" "5" "f"
letters <- c("a", "b", "c", "d", "e")

add_letter <- function(character_vector) {
    long_characters <- c(character_vector, "f")
    return(long_characters)
}
    
# Run on original data
print("Initial")
[1] "Initial"
print(add_letter(letters))
[1] "a" "b" "c" "d" "e" "f"
# the value of letters could be changed elsewhere in the program
letters <- c("1", "2", "3", "4", "5")

# This time the change is visible at the call site - the function
# only uses the data passed to it as an argument
print("Changed")
[1] "Changed"
print(add_letter(letters))
[1] "1" "2" "3" "4" "5" "f"

4.3.2 Data Frame Considerations

As analysts and data scientists, we will often use data frames in our programs.

There are some special considerations that need to be taken when working with these objects, with regards to functional programming principles.

R has a number of properties that make it an effective language for writing functions.

There are some considerations that need to be taken into account when passing parameters to functions in R, especially when using the tidyverse family of functions.

4.3.2.1 From Strings to Variables

When we convert a script to a function, we often need to refer to columns through variables rather than hard-coded strings - for example, when writing a function that takes a data frame and a column name as parameters.

These variables need to be handled slightly differently from the strings they once were.

For example, when accessing a column we can specify its name directly using $, or we can use square brackets [], which also accept a column name stored in a variable.

Assuming we have some data frame survey with a column "people".

# To access column "people" from the dataframe "survey"
survey$people

# To access the column name stored as a variable, "people" from "survey"
column_name <- "people"

survey[column_name]

To access the specific column needed you can use single square brackets survey[column_name].

To access the specific vector of values in a column you use double square brackets survey[[column_name]].
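A toy example (with a made-up survey data frame) shows the difference between the two forms:

```r
survey <- data.frame(people = c(10, 20, 30))
column_name <- "people"

survey[column_name]    # single brackets: a one-column data frame
survey[[column_name]]  # double brackets: the vector 10 20 30

class(survey[column_name])    # "data.frame"
class(survey[[column_name]])  # "numeric"
```

This distinction matters inside functions: most computations (such as mean() or as.integer()) expect the vector, so the double-bracket form is usually the one you want.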

4.3.2.2 Lazy Evaluation

By default in R when variables are given to functions their values will not be evaluated.

This means if we pass column names into the function as variables they will not be recognised as names of columns themselves.

When using tidyverse functions this issue can be avoided in two ways.

# This code will not group by the people column as intended

column_name <- "people"

grouped_survey <- dplyr::group_by(.data = survey, column_name)

One option is to use the standard evaluation versions of the tidyverse functions, indicated by an underscore at the end of the function name (note that these underscore variants are deprecated in recent versions of dplyr, though they still work).

# This code will run

column_name <- "people"

grouped_survey <- dplyr::group_by_(.data = survey, column_name)

Alternatively, we can use the base R function get() to evaluate a variable we give to it.

# This code will run

column_name <- "people"

grouped_survey <- dplyr::group_by(.data = survey, get(column_name))

In functions we pass variables to other functions frequently, so it is important to be able to access those variables.
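As an aside, recent versions of dplyr also provide the .data pronoun for this situation, which avoids both the deprecated underscore functions and get(). A small sketch (with a made-up survey data frame):

```r
library(dplyr)

survey <- data.frame(people = c("a", "a", "b"),
                     response = c(1, 2, 3))

column_name <- "people"

# The .data pronoun tells dplyr that the string held in
# column_name refers to a column of the data frame
grouped_survey <- group_by(survey, .data[[column_name]])
```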

4.4 Exercises

Using the code snippets from the example analysis below, write a function that:

  • takes an input population density data frame
  • splits the country_and_parent_code column into parent_code and country_code columns
  • drops the country_and_parent_code and parent_code columns
  • returns the new data frame

Add this function into the file example_code_python/function_input/exercise1.py or example_code_R/function_input/exercise1.R depending on your chosen framework. Use the code already there to test your result on pop_density.

Name the function access_country_code().

# The country_and_parent_code column needs to 
# be split into two columns without the strings
pop_density <- tidyr::separate(data = pop_density, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

# Remove the parent_code column, not used in later analysis
pop_density <- dplyr::select(pop_density, everything(), -parent_code)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

## Code to be improved to complete exercise 1


#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)
  
  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")
  
  return(dataframe)
}


#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), 
                               sep = "_")
  
  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


# Loading both data frames
population_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")

# Run the code created checking output
pop_density_single_code <- access_country_code(population_density)
print(pop_density_single_code$country_code)

Using the code snippets from our example analysis below, write a function that:

  • takes a data frame as an input
  • can replace a string within a specified column
  • can convert the type of a given column

This function will be used across both data frames later - so be sure it is general enough to work for both. In addition, it must use only data it gets as arguments.

Add this function into the file example_code/function_input/exercise2.py|R. Use the code already there to test your result on locations and pop_density.

Name the function convert_type_to_int().

# Convert country_code to integer by removing extra strings
pop_density$country_code <- stringr::str_remove(pop_density$country_code, pattern = "CC")

# Convert type
pop_density$country_code <- as.integer(pop_density$country_code)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

## Code to be improved to complete exercise 2


#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)
  
  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")
  
  return(dataframe)
}



#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[[column_name]] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[[column_name]] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}




pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")

pop_density_single_code <- access_country_code(pop_density)

# Using the conversion function created
population_density_correct_types <- convert_type_to_int(pop_density_single_code,
                                                        column_name = "country_code",
                                                        string_value = "CC")

locations_correct_types <- convert_type_to_int(locations,
                                               column_name = "location_id",
                                               string_value = '"')

str(population_density_correct_types)
str(locations_correct_types)

Using the code snippets from our example analysis below, write a function that:

  • takes two data frames as inputs
  • takes two string inputs
  • performs a left join on a column from each data frame, the columns are given by the strings input
  • removes the second specified string column from the joined data frame
  • returns a single data frame

This function will be used after the previous functions, taking the data frames they output.

Add this function into the file example_code_R/function_input/exercise3.r. Use the code already there to test your result on the new data frame.

This function will be useful for our specific case, but also if we want to join other data frames or use different column names.

Our column names could change if we change an upstream function, so it’s important we give them as inputs.

Name the function join_frames().

# Change location_id to be called country_code for join
locations <- dplyr::rename(locations, country_code = location_id)

# Join the data sets
# Left merge so we keep all pop_density data
pop_density_location <- dplyr::left_join(pop_density,
                                         locations,
                                         by = "country_code")


# Get just the relevant columns in preparation
# for the following groupby
region_density <- dplyr::select(pop_density_location, sdg_region_name, population_density)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

## Code to be improved to complete exercise 3


#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)
  
  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")
  
  return(dataframe)
}


#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[[column_name]] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[[column_name]] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)

  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)

  return(combined_frames_reduced)
}


## Run the functions created

pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")


pop_density_single_code <- access_country_code(pop_density)


pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                 column_name = "country_code",
                                                 string_value = "CC")

locations_correct_types <- convert_type_to_int(dataframe = locations,
                                               column_name = "location_id",
                                               string_value = '"')

population_location <- join_frames(pop_density_correct_types,
                                   locations_correct_types,
                                   left_column = "country_code",
                                   right_column = "location_id")

print(colnames(population_location))
print(head(population_location, 10))

4.5 High Level Functions

This section will introduce some concepts and good practice that are relevant for when you have converted your script into functions.

In the section below, a version of code with all tasks broken into functions is shown. To help consolidate your learning from the previous exercises, an extension exercise is to convert the remaining code to functions yourself.

4.5.1 Final Script Conversion

4.5.1.1 Extension Exercise

Using exercise3_answers.R convert the remaining script code into functions. The functions should be called:

  • aggregate_statistic()
  • format_frame()
  • write_output()

Each of these functions performs one task. They are general enough to work for our specific situation while leaving some room for minor upstream adjustments, such as changes to column names or filenames.


Side Note: We are writing the function write_output() as practice; it contains only a single line of code, so in practice it wouldn’t be wrapped in a function. It’s important to avoid writing functions that are too small.

library(tidyr)
library(dplyr)
library(stringr)
library(readr)



#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)
  
  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")
  
  return(dataframe)
}


#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[[column_name]] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[[column_name]] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  
  return(combined_frames_reduced)
}


#' Function to groupby and calculate the mean of two columns
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {

  # Perform aggregation and summary
  
  # Use group_by_ because of variable column name
  region_mean_density_grouped <- dplyr::group_by_(.data = dataframe, groupby_column)
  
  # use get() to access column name
  region_mean_density <- dplyr::summarise(.data = region_mean_density_grouped,
                                          "mean_population_density" = mean(get(statistic_column)))

  return(region_mean_density)
}


#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {

  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(.data = dataframe, dplyr::desc(get(statistic_column)))
  
  # Round mean density for clearer reading
  dataframe_sorted[[statistic_column]] <- round(dataframe_sorted[[statistic_column]],
                                                digits = 2)

  return(dataframe_sorted)
}

#' write output statistic in formatted manner
write_output <- function(dataframe, output_filepath) {

  readr::write_csv(x = dataframe, file = output_filepath)

}



## Run the functions created

pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")


pop_density_single_code <- access_country_code(pop_density)


pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                 column_name = "country_code",
                                                 string_value = "CC")

locations_correct_types <- convert_type_to_int(dataframe = locations,
                                               column_name = "location_id",
                                               string_value = '"')

population_location <- join_frames(pop_density_correct_types,
                                   locations_correct_types,
                                   left_column = "country_code",
                                   right_column = "location_id")

aggregation <- aggregate_mean(dataframe = population_location,
                              groupby_column = "sdg_region_name",
                              statistic_column = "population_density")

formatted_statistic <- format_frame(aggregation, "mean_population_density")

write_output(formatted_statistic, "./mean_pop_density.csv")
    

4.5.2 Execute Program

Now we have converted all our code tasks into functions we can run each function, passing their output into the input of the next function.

Looking at the code at the end of our script there are a group of lines which describe the running of the program. These lines of code describe the whole analysis, showing each step in the process with a function corresponding to each step.

When we hit “Run”, this is the code that executes. The functions above it in the file are loaded into the program’s global scope, allowing them to be called by this code.

## Run the functions created

pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")


pop_density_single_code <- access_country_code(pop_density)


pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                 column_name = "country_code",
                                                 string_value = "CC")

locations_correct_types <- convert_type_to_int(dataframe = locations,
                                               column_name = "location_id",
                                               string_value = '"')

population_location <- join_frames(pop_density_correct_types,
                                   locations_correct_types,
                                   left_column = "country_code",
                                   right_column = "location_id")

aggregation <- aggregate_mean(dataframe = population_location,
                              groupby_column = "sdg_region_name",
                              statistic_column = "population_density")

formatted_statistic <- format_frame(aggregation, "mean_population_density")

write_output(formatted_statistic, "./mean_pop_density.csv")

4.5.3 Main Function

The code above makes what we are doing much easier to understand. To find out what the code is doing at each step, we can just read the name of the function, or look up what it does in the documentation.

The way the code is currently designed, however, still stores variables in the global scope - something we generally want to avoid.

If we add one more function, that calls our other functions, we can run our whole program by calling this one function. This will make it much easier to run the analysis later down the line, and to extend our code into modules and packages.

Functions that run other functions are called “high level” functions. Using high level functions lets us add more structure to our code.

The convention you will most often see for naming the highest level function is main(), however it does not have to take this name. We will call our highest level analysis function get_analyse_output().

In effect, we put all the code that was used to “run” the program within the get_analyse_output() function. This way we can run the program only when we call get_analyse_output().

This is the point where typical convention between Python and R starts to differ. Be sure to check both methods if you regularly code in both.

How many levels of “high level” functions we have should be proportionate to the size of our code. For a small task we probably don’t need high level functions; for a larger pipeline they become significantly more important.

If we want to alter the behaviour of the get_analyse_output() function we have two options:

  • alter the main function to change variables passed to the function
  • add parameters

Below is our get_analyse_output() function, and the code used to run it.

#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function() {
  
  pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
  locations <- load_formatted_frame("../../data/locations.csv")
  
  
  pop_density_single_code <- access_country_code(pop_density)
  
  
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  
  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")
  
  formatted_statistic <- format_frame(aggregation, "mean_population_density")
  
  write_output(formatted_statistic, "./mean_pop_density.csv")
  
}

get_analyse_output()

If we were to use this analysis on different data sets, it may be useful for us to be able to change the data inputs and outputs.

#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function(population_filepath, location_filepath, output_filepath) {
  
  pop_density <- load_formatted_frame(population_filepath)
  locations <- load_formatted_frame(location_filepath)
  
  
  pop_density_single_code <- access_country_code(pop_density)
  
  
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  
  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")
  
  formatted_statistic <- format_frame(aggregation, "mean_population_density")
  
  write_output(formatted_statistic, output_filepath)
  
}


## Run the main function created

get_analyse_output(population_filepath = "../../../data/population_density_2019.csv",
                   location_filepath = "../../../data/locations.csv",
                   output_filepath = "./mean_pop_density.csv")

4.6 Hierarchies

We have now introduced a higher level function that runs other functions for us.

This is a great step forward in structuring our code. If we want to understand what the program does:

  • we first look at this high level get_analyse_output() function
  • each function within the higher function describes a step of the process, a task
  • for more information on how each task is completed, the function can be found in the script

By having some functions that call others we now have levels and dependencies of functions.

Well documented high-level functions mean we do not need to dive into the lower level functions to understand what the code does.

These relationships between functions can be described with hierarchical diagrams. Writing down the relationship between tasks in your code is an extremely useful practice in structuring code.

Below is what the code in main_func.R looks like as a hierarchy of functions.

diagram of relationships of functions in main_func.R


As you can see, a lot of steps are being run by the single get_analyse_output() function. It is really important we have this high level function, but we can have more if it makes the structure of our program clearer.


Below we will first look at a new code diagram with a different structure to the previous, then the code it corresponds to.

This is slight overkill for our program at the moment due to its small size, but the principle becomes very useful as our code grows more complex.

The new structure:

  • still has a highest level get_analyse_output() function
  • contains multiple functions in between the lowest level and the highest
  • has middle functions which perform a larger task, grouping smaller tasks together

diagram of relationships of functions in the multi-level function hierarchy in main_funcs_middle.R

Note that we have not added an additional higher level function above write_output(). This is because we don't need a higher function that calls just one lower level function. In addition, we do not always want to write data out while we test the analysis pipeline.

The benefit of this structure is that we can more easily access the data produced by our pipeline at the relevant steps:

  • if we want to look at the joined data after cleaning and manipulation we just call the extract_transform() function
  • to perform a different analysis on the cleaned frame we can write a different analyse() function and call that instead within get_analyse_output()

4.6.1 New Structure

library(tidyr)
library(dplyr)
library(stringr)
library(readr)


#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)
  
  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")
  
  return(dataframe)
}


#' Function to split combined code columns 
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the  parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}

#' Join the required frames on specified columns, 
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  
  return(combined_frames_reduced)
}


#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  
  # Perform aggregation and summary
  
  # Use across(all_of()) because the grouping column name is held in a variable
  region_mean_density_grouped <- dplyr::group_by(dataframe,
                                                 dplyr::across(dplyr::all_of(groupby_column)))
  
  # Use .data[[ ]] to access a column whose name is held in a variable
  region_mean_density <- dplyr::summarise(.data = region_mean_density_grouped,
                                          "mean_population_density" = mean(.data[[statistic_column]]))
  
  return(region_mean_density)
}


#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  
  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(dataframe, dplyr::desc(.data[[statistic_column]]))
  
  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)
  
  return(dataframe_sorted)
}

#' write output statistic in formatted manner
write_output <- function(dataframe, output_filepath) {
  
  readr::write_csv(x = dataframe, file = output_filepath)
  
}


#' Load the data and convert it to clean joined format for analysis
extract_transform <- function(population_filepath, location_filepath) {
  
  pop_density <- load_formatted_frame(population_filepath)
  locations <- load_formatted_frame(location_filepath)


  pop_density_single_code <- access_country_code(pop_density)


  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                column_name = "country_code",
                                                string_value = "CC")

  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                              column_name = "location_id",
                                              string_value = '"')

  population_location <- join_frames(left_dataframe = pop_density_correct_types,
                                     right_dataframe = locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")

  return(population_location)
}

#' Perform groupby mean of population density and reformat result
analyse <- function(full_dataframe, groupby_column, aggregate_column, statistic_column) {

  aggregation <- aggregate_mean(dataframe = full_dataframe,
                                groupby_column = groupby_column,
                                statistic_column = aggregate_column)

  formatted_statistic <- format_frame(aggregation, statistic_column = statistic_column)


  return(formatted_statistic)
}


#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function(population_filepath, location_filepath, output_filepath) {

  population_location <- extract_transform(population_filepath = population_filepath,
                                           location_filepath = location_filepath)

  formatted_statistic <- analyse(full_dataframe = population_location,
                                 groupby_column = "sdg_region_name",
                                 aggregate_column = "population_density",
                                 statistic_column = "mean_population_density")

  write_output(formatted_statistic, output_filepath)

}



get_analyse_output(population_filepath="../../data/population_density_2019.csv",
                   location_filepath="../../data/locations.csv",
                   output_filepath="./mean_pop_density.csv")
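The comments inside convert_type_to_int() note that dataframe$column_name does not work when the column name is held in a variable. A minimal, self-contained illustration of the difference:

```r
# A small example frame, with the column name stored in a variable
df <- data.frame(country_code = c("CC1", "CC2"))
column_name <- "country_code"

df$column_name     # NULL - $ looks for a column literally named "column_name"
df[[column_name]]  # "CC1" "CC2" - [[ ]] evaluates the variable first
```

This is why the functions above use `dataframe[[column_name]]` whenever a column name arrives as a function argument.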

4.7 Interaction

“Who will need to access this part of the program?” is a useful question to think about when structuring your code.

As the main developer you will likely be accessing the whole code base, every function.

To run the program, a user only needs to interact with a small part of it. The part of the program a user interacts with is called the "application programming interface" (API). Other areas of the code can be seen by the user, but are rarely used.

Parts of your code can be "hidden" from the user. By structuring the code properly, the end user does not need to understand or access the inner workings of every function - they just need to run the program.

In our code the API part would be the get_analyse_output() function.

Separating public-facing and lower level functions improves clarity and usability. All code, whether part of the API or a lower level, should be as clear as possible to help with future development.

Having this distinction allows us to test the code at the correct levels.

diagram of which areas of the code are accessible to the public user and developer

Having a hierarchy of functions, with a clear distinction about what the API is, can make the code simpler. Structuring the code well makes it easier to run, test and fix for both developers and users.

This concept becomes more important in:

  • software products with non-technical users
  • object-oriented programming, using public interfaces

Ideally, a user does not need to open any code files to run analysis. Instead, the user can work with a graphical user interface (GUI) or a command line interface. Parameters such as the input data file paths and output paths are supplied in a separate file or entered by the user in the interface.
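As a sketch of this idea, base R's commandArgs() lets a user supply the file paths from the command line without opening any code files. The helper name parse_args and the usage shown are illustrative, not part of the course code:

```r
# Illustrative sketch: read the three file paths from the command line,
# e.g. run as:  Rscript main.R population.csv locations.csv output.csv
parse_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 3) {
    stop("Usage: Rscript main.R <population_csv> <locations_csv> <output_csv>")
  }
  list(population_filepath = args[1],
       location_filepath   = args[2],
       output_filepath     = args[3])
}

# paths <- parse_args()
# get_analyse_output(population_filepath = paths$population_filepath,
#                    location_filepath   = paths$location_filepath,
#                    output_filepath     = paths$output_filepath)
```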

“What will the end product of my analysis pipeline be?” is an important question to consider when structuring your project.

5 Functions to Modules

Earlier in this course a scenario was introduced explaining that a single script can grow large and become difficult to maintain.

Although adding structure with functions makes our code better, it can make it longer. Larger code files are difficult to maintain and understand.

We can make our code even clearer and better-structured by moving the functions in our code into different files. By grouping related functions together into different files it will be easier to look up different parts of our code. We no longer need to scroll through thousands of lines of code, we just navigate to the relevant file.

When we move functions (or other objects) into different files, they then need to be imported back into the file we are using those functions in.

When we move code into different files, the code in those files is "sourced" into the R environment.

Before structuring code in different files, we need to discuss how to structure our directory properly to help us with this.

5.1 Project Structure

Now we are moving beyond working with just one script we need to consider our project, files, folders/directories and paths.

A key part of building a reproducible collection of code is making the project folder simple to understand, navigate and work with.

There is no single folder structure that is perfect for all analysis; however, there are good minimum requirements and guiding principles.

The situation to avoid is having all your data, source code, notebooks and documentation in the same location. This is confusing to anyone else looking in, and makes your project harder to extend.

In this section we will outline basic components of project structure, their relevance to this course, and point to good resources for deciding your own approach.


5.1.1 Guiding Principles

The main principles are:

  • the complexity of the folder structure should be in line with the size/complexity of the project
    • smaller analyses should have a simpler folder structure
    • larger projects require more depth of structure (more sections, more folders dividing areas)
  • different file types should generally be separated, for example keep the R files together, the CSVs together, the R Markdowns in one place
  • what the end product of the project is should impact the structure. If the code is to become a package, an appropriate structure should be used.

5.1.2 Minimum requirements

A directory structure for analysis should separate the:

  • data used to analyse
  • source code to perform analysis
  • report generation / notebook files, figures and images
  • documentation
  • READMEs, licenses, package requirements

In addition, relevant version control folders/files will be present (not covered in this course) - .git folder, .gitignore file.

How this is done may depend on your team, language, and specific use case.

An example folder structure for our project is shown below; this is a minimum and could be extended.

population_density_analysis
|   LICENSE.txt
|   README.md
|   requirements.txt
|   
+---data
|   +---processed
|   |       mean_pop_density.csv
|   |       
|   \---raw
|           locations.csv
|           population_density_2019.csv
|           
+---docs
|       documentation.txt
|       user_guide.html
|       
+---reports
|   |   population_analysis_report.html
|   |   population_analysis_report.rmd
|   |   
|   \---figures
|           graph.png
|           
\---src
        main_func (to be broken up).R

Note: /src/ stands for “source” - referring to your source code, the files your program is written with.

5.1.3 Additional content

There are other folders and considerations to structure your project beyond the minimum.

You may want to have separate folders for:

  • different parts of your code within the /src/ folder
  • references such as data dictionaries and user manuals
  • further divisions of your /data/ folder
  • models or other output products
  • notebooks for enhanced documentation and examples
  • a way to produce example data
  • environment building

In R, consider using a predefined project structure via the .Rproj method, which generates a structure for you.

There is a project structure designed by the Government Digital Service for data science projects.

5.2 Using Separate Files

Now that we are aware of good project folder structure, we can discuss separating our big full code file into more logical smaller files.

This section will focus on the code contained within the /src/ folder shown in the last section.

Group functions with similar purpose together, such as data cleaning, loading, modelling. Make each file/module as focussed as possible to make it easy to find any required function.

To move the functions between files there are four main steps that need to be taken:

  • move the code between files (copy and paste)
  • check the new file can access all the code it needs
  • import the new file / function into the relevant files in the code base
  • check this has not affected how our code runs (test it hasn’t changed)
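The last step - checking the code still behaves the same - can be as simple as comparing the new output with a copy of the output saved before the move. A minimal sketch, where the example frames are made up purely for illustration:

```r
# Output saved before restructuring (baseline), and output produced after
baseline <- data.frame(sdg_region_name = c("A", "B"),
                       mean_population_density = c(1.25, 3.50))
current  <- data.frame(sdg_region_name = c("A", "B"),
                       mean_population_density = c(1.25, 3.50))

# TRUE only if moving the code between files changed nothing
isTRUE(all.equal(baseline, current))
```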

Moving code between files when a script already exists is a task that can be avoided by designing your project files in a useful way when starting to write your code. Any new analysis should make use of existing modules that you have created.

Note: in some other people's code, particularly R code, you may see many files with only one function in each file. This is best avoided, as it does not group related code together, making the code harder to work with. Generally, avoid files containing all the functions of a program, and avoid having many files each containing one function. For further information, with reference to R package conventions, have a look at the "R Packages" book.

5.2.1 Moving Code

In this section we will learn how to move functions between files.

In the earlier part of the course “Function Inputs” we discussed why it is important that variables are only accessed through function inputs and outputs. This principle is even more important when moving code between files.

We are first going to make a new file called input_output.R. This file is going to contain all the code we need for loading and exporting our data frames. It is good practice to group related functions into the same file - especially around data access.

In addition, we are going to rename our original script to main.R. This is the file that will run all our code.

Within the input_output.R file we are going to put the following functions, removing them from main.R:

  • load_formatted_frame()
  • write_output()

Our files will now appear as below. Note that they will not currently run.

main.R - contains most of the code used to run the program.

library(tidyr)
library(dplyr)
library(stringr)
library(readr)

#' Split combined code columns 
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the  parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}

#' Join the required frames on specified columns, 
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  
  return(combined_frames_reduced)
}


#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  
  # Perform aggregation and summary
  
  # Use across(all_of()) because the grouping column name is held in a variable
  region_mean_density_grouped <- dplyr::group_by(dataframe,
                                                 dplyr::across(dplyr::all_of(groupby_column)))
  
  # Use .data[[ ]] to access a column whose name is held in a variable
  region_mean_density <- dplyr::summarise(.data = region_mean_density_grouped,
                                          "mean_population_density" = mean(.data[[statistic_column]]))
  
  return(region_mean_density)
}


#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  
  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(dataframe, dplyr::desc(.data[[statistic_column]]))
  
  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)
  
  return(dataframe_sorted)
}


#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function() {
  
  pop_density <- load_formatted_frame("../../../data/population_density_2019.csv")
  locations <- load_formatted_frame("../../../data/locations.csv")
  
  
  pop_density_single_code <- access_country_code(pop_density)
  
  
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  
  aggreagation <- aggregate_mean(dataframe = population_location,
                                 groupby_column = "sdg_region_name",
                                 statistic_column = "population_density")
  
  formatted_statistic <- format_frame(aggreagation, "mean_population_density")
  
  write_output(formatted_statistic, "./mean_pop_density.csv")
  
}


## Run the main function created

get_analyse_output()

input_output.R - contains the functions used for input and output operations.

library(readr)

#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
    # Load the population density data 2019
    formatted_path <- file.path(path_to_data)
    dataframe <- readr::read_csv(formatted_path)
    
    # Clean the column names, following snake_case convention
    colnames(dataframe) <- tolower(colnames(dataframe))
    colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")
    
    return(dataframe)
}

    

#' write output statistic in formatted manner
write_output <- function(dataframe, output_filepath) {
    
    readr::write_csv(x = dataframe, file = output_filepath)
    
}

5.2.2 Loading Code Between Files

The code shown above will not run because the main.R code cannot access the functions contained within input_output.R.

For a program to access code in another location, the functions need to be loaded into that program explicitly. In R this is called "sourcing" the functions.

We load the code from one file into another, allowing our code to access the contents of the loaded file.

diagram of which areas of the code are accessible to the public user and developer

Loading a file puts the objects within it into the scope of our program.

If we load a file’s code in the global scope of our program, then the file’s contents will be accessible anywhere in the program. If we load the file in a specific local scope it will only be accessible in that local area.
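A minimal sketch of this difference in R, using a temporary file so the example is self-contained:

```r
# Write a one-function module to a temporary file (stands in for a real module)
helper_path <- tempfile(fileext = ".R")
writeLines("triple_it <- function(x) x * 3", helper_path)

run_locally <- function() {
  # local = TRUE loads the file into this function's local scope only
  source(helper_path, local = TRUE)
  triple_it(5)
}

run_locally()        # returns 15: triple_it() is in scope inside the function
exists("triple_it")  # FALSE: it never reached the global scope
```

If we had called source(helper_path) at the top level instead, triple_it() would be available everywhere in the program.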

diagram of original scope

In R there is a distinction between loading in code from source files and loading in packages.

To load functions into an R program you need to first put those functions into a .R file.

Within the file containing the program you want to run, you “source” the file containing the functions. This loads the functions into the program’s global scope.

To source a file use the source() function.

Within the source() function give the file path of the .R file you want to source.

source("path to R file")

For our purposes at the top of the main.R file we would write:

source("./input_output.R")

to access the functions within input_output.R.

By convention files are loaded at the top of a file. This makes it clear what files are used in the code and ensures all parts of the code that need the objects in the file can access them.

Sourcing a file will by default load in all the contents of that file - not just the functions. For this reason, it is important to keep the files containing your functions clear of unnecessary objects, such as stray variables.
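A small self-contained demonstration of this, using a temporary file in place of a real module:

```r
# A module file containing a function we want and a stray variable we do not
module_path <- tempfile(fileext = ".R")
writeLines(c("double_it <- function(x) x * 2",
             "leftover <- 'debug value'"),
           module_path)

source(module_path)

exists("double_it")  # TRUE - the function we wanted
exists("leftover")   # TRUE - the stray variable came along too
```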

The main.R file will then look like the below script, allowing us to access the functions from input_output.R.

library(tidyr)
library(dplyr)
library(stringr)
library(readr)

source("input_output.R")

#' Function to split combined code columns 
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the  parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}

#' Join the required frames on specified columns, 
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  
  return(combined_frames_reduced)
}


#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  
  # Perform aggregation and summary
  
  # Use across(all_of()) because the grouping column name is held in a variable
  region_mean_density_grouped <- dplyr::group_by(dataframe,
                                                 dplyr::across(dplyr::all_of(groupby_column)))
  
  # Use .data[[ ]] to access a column whose name is held in a variable
  region_mean_density <- dplyr::summarise(.data = region_mean_density_grouped,
                                          "mean_population_density" = mean(.data[[statistic_column]]))
  
  return(region_mean_density)
}


#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  
  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(dataframe, dplyr::desc(.data[[statistic_column]]))
  
  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)
  
  return(dataframe_sorted)
}


#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function() {
  
  pop_density <- load_formatted_frame("../../../data/population_density_2019.csv")
  locations <- load_formatted_frame("../../../data/locations.csv")
  
  
  pop_density_single_code <- access_country_code(pop_density)
  
  
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  
  aggreagation <- aggregate_mean(dataframe = population_location,
                                 groupby_column = "sdg_region_name",
                                 statistic_column = "population_density")
  
  formatted_statistic <- format_frame(aggreagation, "mean_population_density")
  
  write_output(formatted_statistic, "./mean_pop_density.csv")
  
}


## Run the main function created

get_analyse_output()

5.3 Exercises

These exercises will help you practice splitting code into different files and loading them back into the main.R script.

Create a new file in the example_code/modules/exercises/start/ folder called analysis.R.

Put the following functions within the new file:

  • aggregate_mean()
  • format_frame()

Change the code in main.R such that the file loads the relevant functions and runs the whole analysis.


Create a new file in the example_code/modules/exercises/start/ folder called manipulation.R.

Put the following functions within the new file:

  • convert_type_to_int()
  • access_country_code()
  • join_frames()

Change the code in main.R such that the file loads the relevant functions and runs the whole analysis.

5.3.1 Answers

#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  
  # Perform aggregation and summary
  
  # Use across(all_of()) because the grouping column name is held in a variable
  region_mean_density_grouped <- dplyr::group_by(dataframe,
                                                 dplyr::across(dplyr::all_of(groupby_column)))
  
  # Use .data[[ ]] to access a column whose name is held in a variable
  region_mean_density <- dplyr::summarise(.data = region_mean_density_grouped,
                                          "mean_population_density" = mean(.data[[statistic_column]]))
  
  return(region_mean_density)
}


#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  
  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(dataframe, dplyr::desc(.data[[statistic_column]]))
  
  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)
  
  return(dataframe_sorted)
}
#' Function to split combined code columns 
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the  parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}

#' Join the required frames on specified columns, 
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  
  return(combined_frames_reduced)
}
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

source("input_output.R")
source("analysis.R")
source("manipulation.R")




#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function() {
  
  pop_density <- load_formatted_frame("../../../../data/population_density_2019.csv")
  locations <- load_formatted_frame("../../../../data/locations.csv")
  
  
  pop_density_single_code <- access_country_code(pop_density)
  
  
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  
  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")
  
  formatted_statistic <- format_frame(aggregation, "mean_population_density")
  
  write_output(formatted_statistic, "./mean_pop_density.csv")
  
}


## Run the main function created

get_analyse_output()

6 Case Study

Our analysis pipeline for average population density is now nearly complete.

So far we have been working on a single data set. Now, however, we have been given access to population density values for a range of years, in separate CSV files.

You now need to add to and improve the code and files to meet new requirements. The code to change can be found in example_code/case_study/initial/.

This is not the exact methodology used to calculate the mean population density of a region - however, by completing this case study you will gain experience building an analysis pipeline.

6.1 Tasks

The tasks below are intended to reinforce the content covered in this course.

Tasks 1-4 are similar to exercises already covered.

Tasks 5 and 6 will require more thought and more work with data frames in your chosen language. You may need to do some research in your chosen language/package to complete them.

Answers to all of the tasks combined are given below, and the full code is contained within example_code/case_study/answers/.

6.1.1 Task 1

In order to be able to perform the analysis across different years, we need to access different files:

  • add new parameters to the function get_analyse_output() in main.R
  • the first parameter should be the file path to the population density data CSV: pop_density_filepath
  • the second parameter should be the file path to the locations.csv data: location_filepath

Test that your refactoring of the code produces the same result by running the program with the original data sets.

Test that your refactoring of the code works for other data sets by using the 2018 data.

6.1.2 Task 2

As we build further analysis on top of the code we have already written, get_analyse_output() will no longer be the highest-level function:

  • change the get_analyse_output() function to return the calculated final data frame formatted_statistic.
  • change the function so it takes a parameter output_filepath giving the output path of the resulting data frame.
    • if this value is False/FALSE, do not write the data frame to a file

Test this new parameter by running it with a file path and with False/FALSE.
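The conditional write can be sketched as below. This is an illustrative pattern rather than the answer code - maybe_write_output and the base-R write.csv() call are stand-ins - but it shows why isFALSE() is a safe test here: comparing a path string against FALSE with == silently coerces FALSE to the string "FALSE" first.

```r
# Hypothetical sketch of an optional-output parameter
maybe_write_output <- function(dataframe, output_filepath) {
  # isFALSE() is TRUE only for a single literal FALSE,
  # so any file path string falls through to the write
  if (!isFALSE(output_filepath)) {
    write.csv(dataframe, file = output_filepath, row.names = FALSE)
  }
  return(dataframe)
}
```

Calling maybe_write_output(df, FALSE) returns the data frame without touching disk; passing a file path writes the CSV and still returns the frame.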

6.1.3 Task 3

We will be joining the data at the end of our process, which means we will be performing multiple joins in the analysis:

  • create a new file in the directory called joins.R
  • move the function join_frames() from manipulation.R to joins.R

Ensure your code still works at this stage.

6.1.4 Task 4

If we are going to combine all our data, we will need to change how it is represented:

  • write a function that renames a given column to a different string. Call it column_name_year() and put it in manipulation.R
    • the first parameter is the data frame to be changed dataframe
    • the second is the column name to be changed original_column
    • the third is the new column name new_column

Test this function on a data frame.
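For example, on a throwaway data frame (the column name and values here are made up):

```r
# Rename a column by matching on its current name
df <- data.frame(mean_population_density = c(10.53, 317.54))
colnames(df)[colnames(df) == "mean_population_density"] <- "2019"
```

Base R accepts any string as a column name, so a year such as "2019" is fine - just remember to quote it (or use backticks) when referring to it later.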

6.1.5 Task 5

Once we have created a new data frame for each year we will need to combine the data for each region into one data frame:

  • write a new function join_years() in joins.R
    • this function will perform a join on all the frames given to it on the same column name
    • the function takes as a parameter dataframes, a list containing the data frames to be joined
    • the function takes as a parameter the name of the column all frames are to be joined on join_column
    • the function performs an inner join on all the data frames together on the given column and returns this complete data frame of multiple years
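Task 5 is essentially a fold over a list of data frames. A minimal sketch with toy frames (the regions and year columns are invented) shows the shape of the operation:

```r
library(dplyr)
library(purrr)

frames <- list(
  data.frame(sdg_region_name = c("A", "B"), `2017` = c(1, 2), check.names = FALSE),
  data.frame(sdg_region_name = c("A", "B"), `2018` = c(3, 4), check.names = FALSE),
  data.frame(sdg_region_name = c("A", "B"), `2019` = c(5, 6), check.names = FALSE)
)

# reduce() joins the frames pairwise: (frame1 join frame2) join frame3
combined <- purrr::reduce(frames, dplyr::inner_join, by = "sdg_region_name")
```

Each join adds one year column, so the result has one row per region and one column per year.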

6.1.6 Task 6

We need to bring all our new functions together to perform the final analysis:

  • write a new function combined_analysis() in main.R
    • the function should take a list of population density filepaths to data you want to analyse
    • the function should take as a parameter the location data filepath
    • the function should take as a parameter output_filepath to designate where to output the final analysis
    • the function should call get_analyse_output() for each file path, without writing out the data to file; name the frame appropriately
      • column_name_year() should be used on each data frame produced, changing the mean_population_density column to the year of the frame; you can access the year of each data set from its filepath - you may want to split the path string up
    • within combined_analysis() using join_years() from joins.R join the results of each get_analyse_output() together
    • sort the resulting data frame
    • using write_output() from input_output.R write out the data frame to output_filepath if the value is not False/FALSE
  • The function should return the final data frame

Be sure to run the whole process in one command to check that it produces the output expected.
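One piece of Task 6 worth sketching in isolation is extracting the year from a file path. Assuming the paths follow the population_density_YYYY.csv pattern (the path below is made up):

```r
library(stringr)

path <- "data/population_density_2019.csv"

# Split on underscores; str_split() returns a list with one
# character vector per input string
pieces <- stringr::str_split(path, pattern = "_")[[1]]

# The final piece is "2019.csv"; its first four characters are the year
year <- substr(pieces[length(pieces)], start = 1, stop = 4)
```

This mirrors the splitting approach used in the example answer; a regular expression such as stringr::str_extract(path, "\\d{4}") would be an equally valid route.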

6.2 Example Answer

Output Example

sdg_region_name                                  2017     2018     2019     2020
Australia/New Zealand                           10.53    10.63    10.72    10.82
Central and Southern Asia                      317.54   324.57   330.63   335.32
Eastern and South-Eastern Asia                2065.47  2089.43  2112.67  2135.78
Europe and Northern America                    756.12   760.73   764.93   768.50
Latin America and the Caribbean                197.25   198.43   199.62   200.85
Northern Africa and Western Asia               222.42   228.59   234.38   239.48
Oceania (excluding Australia and New Zealand)  142.01   143.12   144.20   145.28
Sub-Saharan Africa                             121.30   123.91   126.55   129.21

Below are the files changed during the case study.

main.r

library(tidyr)
library(dplyr)
library(stringr)
library(readr)

source("input_output.r")
source("analysis.r")
source("manipulation.r")
source("joins.r")




#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function(pop_density_filepath, location_filepath, output_filepath) {
  
  pop_density <- load_formatted_frame(pop_density_filepath)
  locations <- load_formatted_frame(location_filepath)
  
  
  pop_density_single_code <- access_country_code(pop_density)
  
  
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  
  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")
  
  formatted_statistic <- format_frame(aggregation, "mean_population_density")
  

  # Only write out the data frame when output_filepath is not FALSE;
  # a file path string never equals TRUE, so testing == TRUE would never write
  if (output_filepath != FALSE) {
    write_output(formatted_statistic, output_filepath = output_filepath)
  }
  
  return(formatted_statistic)
}



#' Perform population density mean analysis across multiple files
combined_analysis <- function(population_filepaths, location_filepath, output_filepath) {
  

  loaded_dataframes <- list()
  for (population_file in population_filepaths) {
    # The year is given at the end of the file path, but before '.csv'
    path_broken_up <- stringr::str_split(population_file, pattern = "_")
    path_end <- dplyr::last(dplyr::last(path_broken_up))
    
    year <- substr(path_end, start = 1, stop = 4)
    
    year_analysis <- get_analyse_output(population_file, location_filepath, output_filepath = FALSE)
    
    # Change the column name to the year of the population density
    formatted_year_analysis <- column_name_year(year_analysis, "mean_population_density", year)
    
    loaded_dataframes[[length(loaded_dataframes) + 1]] <- formatted_year_analysis
  }
  
  combined_dataframes <- join_years(loaded_dataframes, join_column = "sdg_region_name")
  
  if (output_filepath != FALSE) {
    write_output(combined_dataframes, output_filepath=output_filepath)
  }
  
  return(combined_dataframes)
}


pop_path_2017 <- "../../../data/population_density_2017.csv"
pop_path_2018 <- "../../../data/population_density_2018.csv"
pop_path_2019 <- "../../../data/population_density_2019.csv"
pop_path_2020 <- "../../../data/population_density_2020.csv"

location_path <- "../../../data/locations.csv"

# Demonstration of final output for case study
final_output <- combined_analysis(list(pop_path_2017, pop_path_2018, pop_path_2019, pop_path_2020), 
                                  location_path, 
                                  output_filepath = FALSE)
print(final_output)

6.2.0.1 manipulation.r

#' Function to split combined code columns 
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to 
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")
  
  
  # Remove the parent_code column; not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  
  return(dataframe)
}


#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)
  
  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  
  return(dataframe)
}

#' Change name of specified columns in dataframe
column_name_year <- function(dataframe, original_column, new_column) {

  colnames(dataframe)[colnames(dataframe) == original_column] <- new_column

  return(dataframe)
}

6.2.0.2 joins.r

#' Join the required frames on specified columns, 
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  
  return(combined_frames_reduced)
}

#' Join a list of frames with an inner join on a specified column name
join_years <- function(dataframes, join_column) {

  # Piping dataframes into reduce() while also naming .x would supply
  # the argument twice, so pass it once
  merged_frame <- purrr::reduce(.x = dataframes,
                                .f = dplyr::inner_join,
                                by = join_column)

  return(merged_frame)
}
