Functions, modules and packages structure programs, making them easier to read, test and reuse.
A module is a file that contains one or more units of code - in our case: functions. A collection of module files together forms a package.
You have already been using functions, modules and packages written by other people. For example, dplyr
is a package that contains modules and functions.
You can use automated tests to check that each component of your code performs as expected.
Testing sections of your code independently using code - “unit testing” - is a concept covered in further courses. To do this, your code needs to be structured into functions, modules and packages.
Testing multiple sections together, along with their interactions with each other, is called “integration testing”.
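As a small taste of what unit testing looks like, here is a sketch using the testthat package. The function being tested, extract_country_code, is a hypothetical illustration rather than part of the course code:

```r
library(stringr)
library(testthat)

# A hypothetical unit of code: convert "CC108_PC108" style codes to integers
extract_country_code <- function(code) {
  as.integer(stringr::str_remove(stringr::str_extract(code, "^CC[0-9]+"), "CC"))
}

# The unit test checks this single component in isolation
testthat::test_that("extract_country_code returns the numeric country code", {
  expect_equal(extract_country_code("CC108_PC108"), 108L)
  expect_equal(extract_country_code("CC24_PC24"), 24L)
})
```

Because the logic lives in its own function, the test can exercise it without running the rest of the analysis.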
To show how to structure code in an analysis context we will use an example scenario to go through the steps taken.
You have been assigned to a group within your department responsible for analysing populations across the world. This work is in collaboration with the United Nations.
Your job is to provide analysis of population densities across the different United Nations Sustainable Development Goal (SDG) regions. You must provide average population density values for each SDG region.
One of your colleagues has already conducted this analysis on an ad hoc basis. They have given you their code to start with, but they have only analysed one year of data so far. You have been asked to write code that will be able to analyse multiple years of data, all in different files.
Before tackling the big task of analysing all the data, you are going to restructure your colleague's code into functions and modules, making the process more reproducible in the future.
This process is called “refactoring”.
Refactoring is the process of improving your code while keeping its behaviour the same. This helps clean the code and improve its design.
You have been sent two data sets needed to reproduce the analysis your colleague performed. Have a look through the data: what steps do you think need to be taken to make it analysable?
The population_density_2019.csv
data contains each country's name, its population density, a combined country and parent group code column, and a year column.
The data is only from 2019.
Country | Country and parent code | Population Density | Year |
---|---|---|---|
Burundi | CC108_PC108 | 4.490100e+02 | 2019 |
Comoros | CC174_PC174 | 4.572225e+02 | 2019 |
Djibouti | CC262_PC262 | 4.199987e+01 | 2019 |
Eritrea | CC232_PC232 | 3.462492e+01 | 2019 |
Ethiopia | CC231_PC231 | 1.120787e+02 | 2019 |
Kenya | CC404_PC404 | 9.237440e+01 | 2019 |
Madagascar | CC450_PC450 | 4.635534e+01 | 2019 |
Malawi | CC454_PC454 | 1.975896e+02 | 2019 |
Mauritius | CC480_PC480 | 6.254532e+02 | 2019 |
Mayotte | CC175_PC175 | 7.097413e+02 | 2019 |
Mozambique | CC508_PC508 | 3.861497e+01 | 2019 |
Réunion | CC638_PC638 | 3.555728e+02 | 2019 |
Rwanda | CC646_PC646 | 5.118337e+02 | 2019 |
Seychelles | CC690_PC690 | 2.124804e+02 | 2019 |
Somalia | CC706_PC706 | 2.461649e+01 | 2019 |
South Sudan | CC728_PC728 | 1.810637e+01 | 2019 |
Uganda | CC800_PC800 | 2.215584e+02 | 2019 |
United Republic of Tanzania | CC834_PC834 | 6.548370e+01 | 2019 |
Zambia | CC894_PC894 | 2.402647e+01 | 2019 |
Zimbabwe | CC716_PC716 | 3.785827e+01 | 2019 |
Angola | CC24_PC24 | 2.552763e+01 | 2019 |
Cameroon | CC120_PC120 | 5.474051e+01 | 2019 |
Central African Republic | CC140_PC140 | 7.616904e+00 | 2019 |
Chad | CC148_PC148 | 1.266430e+01 | 2019 |
Congo | CC178_PC178 | 1.575550e+01 | 2019 |
Democratic Republic of the Congo | CC180_PC180 | 3.828348e+01 | 2019 |
Equatorial Guinea | CC226_PC226 | 4.834160e+01 | 2019 |
Gabon | CC266_PC266 | 8.431630e+00 | 2019 |
Sao Tome and Principe | CC678_PC678 | 2.240083e+02 | 2019 |
Botswana | CC72_PC72 | 4.064904e+00 | 2019 |
Eswatini | CC748_PC748 | 6.675192e+01 | 2019 |
Lesotho | CC426_PC426 | 7.000221e+01 | 2019 |
Namibia | CC516_PC516 | 3.029946e+00 | 2019 |
South Africa | CC710_PC710 | 4.827199e+01 | 2019 |
Benin | CC204_PC204 | 1.046572e+02 | 2019 |
Burkina Faso | CC854_PC854 | 7.427406e+01 | 2019 |
Cabo Verde | CC132_PC132 | 1.364605e+02 | 2019 |
Côte d'Ivoire | CC384_PC384 | 8.086967e+01 | 2019 |
Gambia | CC270_PC270 | 2.319858e+02 | 2019 |
Ghana | CC288_PC288 | 1.336814e+02 | 2019 |
Guinea | CC324_PC324 | 5.197479e+01 | 2019 |
Guinea-Bissau | CC624_PC624 | 6.831142e+01 | 2019 |
Liberia | CC430_PC430 | 5.126011e+01 | 2019 |
Mali | CC466_PC466 | 1.611062e+01 | 2019 |
Mauritania | CC478_PC478 | 4.390897e+00 | 2019 |
Niger | CC562_PC562 | 1.840271e+01 | 2019 |
Nigeria | CC566_PC566 | 2.206524e+02 | 2019 |
Saint Helena | CC654_PC654 | 1.554103e+01 | 2019 |
Senegal | CC686_PC686 | 8.464323e+01 | 2019 |
Sierra Leone | CC694_PC694 | 1.082461e+02 | 2019 |
Togo | CC768_PC768 | 1.486001e+02 | 2019 |
Algeria | CC12_PC12 | 1.807630e+01 | 2019 |
Egypt | CC818_PC818 | 1.008469e+02 | 2019 |
Libya | CC434_PC434 | 3.851832e+00 | 2019 |
Morocco | CC504_PC504 | 8.172029e+01 | 2019 |
Sudan | CC729_PC729 | 2.425613e+01 | 2019 |
Tunisia | CC788_PC788 | 7.527498e+01 | 2019 |
Western Sahara | CC732_PC732 | 2.189692e+00 | 2019 |
Armenia | CC51_PC51 | 1.038893e+02 | 2019 |
Azerbaijan | CC31_PC31 | 1.215577e+02 | 2019 |
Bahrain | CC48_PC48 | 2.159426e+03 | 2019 |
Cyprus | CC196_PC196 | 1.297158e+02 | 2019 |
Georgia | CC268_PC268 | 5.751564e+01 | 2019 |
Iraq | CC368_PC368 | 9.050882e+01 | 2019 |
Israel | CC376_PC376 | 3.936864e+02 | 2019 |
Jordan | CC400_PC400 | 1.137835e+02 | 2019 |
Kuwait | CC414_PC414 | 2.360874e+02 | 2019 |
Lebanon | CC422_PC422 | 6.701573e+02 | 2019 |
Oman | CC512_PC512 | 1.607429e+01 | 2019 |
Qatar | CC634_PC634 | 2.439338e+02 | 2019 |
Saudi Arabia | CC682_PC682 | 1.594115e+01 | 2019 |
State of Palestine | CC275_PC275 | 8.274787e+02 | 2019 |
Syrian Arab Republic | CC760_PC760 | 9.295939e+01 | 2019 |
Turkey | CC792_PC792 | 1.084022e+02 | 2019 |
United Arab Emirates | CC784_PC784 | 1.168723e+02 | 2019 |
Yemen | CC887_PC887 | 5.523405e+01 | 2019 |
Kazakhstan | CC398_PC398 | 6.871663e+00 | 2019 |
Kyrgyzstan | CC417_PC417 | 3.345074e+01 | 2019 |
Tajikistan | CC762_PC762 | 6.659776e+01 | 2019 |
Turkmenistan | CC795_PC795 | 1.264464e+01 | 2019 |
Uzbekistan | CC860_PC860 | 7.753106e+01 | 2019 |
Afghanistan | CC4_PC4 | 5.826939e+01 | 2019 |
Bangladesh | CC50_PC50 | 1.252563e+03 | 2019 |
Bhutan | CC64_PC64 | 2.001978e+01 | 2019 |
India | CC356_PC356 | 4.595797e+02 | 2019 |
Iran (Islamic Republic of) | CC364_PC364 | 5.091271e+01 | 2019 |
Maldives | CC462_PC462 | 1.769857e+03 | 2019 |
Nepal | CC524_PC524 | 1.995725e+02 | 2019 |
Pakistan | CC586_PC586 | 2.809326e+02 | 2019 |
Sri Lanka | CC144_PC144 | 3.400372e+02 | 2019 |
China | CC156_PC156 | 1.527217e+02 | 2019 |
China, Hong Kong SAR | CC344_PC344 | 7.082054e+03 | 2019 |
China, Macao SAR | CC446_PC446 | 2.141960e+04 | 2019 |
China, Taiwan Province of China | CC158_PC158 | 6.713889e+02 | 2019 |
Dem. People's Republic of Korea | CC408_PC408 | 2.131564e+02 | 2019 |
Japan | CC392_PC392 | 3.479867e+02 | 2019 |
Mongolia | CC496_PC496 | 2.075984e+00 | 2019 |
Republic of Korea | CC410_PC410 | 5.268469e+02 | 2019 |
Brunei Darussalam | CC96_PC96 | 8.221935e+01 | 2019 |
Cambodia | CC116_PC116 | 9.339759e+01 | 2019 |
Indonesia | CC360_PC360 | 1.493873e+02 | 2019 |
Lao People's Democratic Republic | CC418_PC418 | 3.106350e+01 | 2019 |
Malaysia | CC458_PC458 | 9.724483e+01 | 2019 |
Myanmar | CC104_PC104 | 8.272807e+01 | 2019 |
Philippines | CC608_PC608 | 3.626006e+02 | 2019 |
Singapore | CC702_PC702 | 8.291919e+03 | 2019 |
Thailand | CC764_PC764 | 1.362829e+02 | 2019 |
Timor-Leste | CC626_PC626 | 8.696167e+01 | 2019 |
Viet Nam | CC704_PC704 | 3.110978e+02 | 2019 |
Anguilla | CC660_PC660 | 1.652444e+02 | 2019 |
Antigua and Barbuda | CC28_PC28 | 2.207159e+02 | 2019 |
Aruba | CC533_PC533 | 5.906111e+02 | 2019 |
Bahamas | CC44_PC44 | 3.890969e+01 | 2019 |
Barbados | CC52_PC52 | 6.674907e+02 | 2019 |
Bonaire, Sint Eustatius and Saba | CC535_PC535 | 7.921646e+01 | 2019 |
British Virgin Islands | CC92_PC92 | 2.002200e+02 | 2019 |
Cayman Islands | CC136_PC136 | 2.706167e+02 | 2019 |
Cuba | CC192_PC192 | 1.064777e+02 | 2019 |
Curaçao | CC531_PC531 | 3.680698e+02 | 2019 |
Dominica | CC212_PC212 | 9.574400e+01 | 2019 |
Dominican Republic | CC214_PC214 | 2.222466e+02 | 2019 |
Grenada | CC308_PC308 | 3.294176e+02 | 2019 |
Guadeloupe | CC312_PC312 | 2.457297e+02 | 2019 |
Haiti | CC332_PC332 | 4.086749e+02 | 2019 |
Jamaica | CC388_PC388 | 2.722324e+02 | 2019 |
Martinique | CC474_PC474 | 3.542991e+02 | 2019 |
Montserrat | CC500_PC500 | 4.991000e+01 | 2019 |
Puerto Rico | CC630_PC630 | 3.307107e+02 | 2019 |
Saint Barthélemy | CC652_PC652 | 4.478636e+02 | 2019 |
Saint Kitts and Nevis | CC659_PC659 | 2.032077e+02 | 2019 |
Saint Lucia | CC662_PC662 | 2.996639e+02 | 2019 |
Saint Martin (French part) | CC663_PC663 | 7.170189e+02 | 2019 |
Saint Vincent and the Grenadines | CC670_PC670 | 2.835718e+02 | 2019 |
Sint Maarten (Dutch part) | CC534_PC534 | 1.246735e+03 | 2019 |
Trinidad and Tobago | CC780_PC780 | 2.719238e+02 | 2019 |
Turks and Caicos Islands | CC796_PC796 | 4.020421e+01 | 2019 |
United States Virgin Islands | CC850_PC850 | 2.987971e+02 | 2019 |
Belize | CC84_PC84 | 1.711315e+01 | 2019 |
Costa Rica | CC188_PC188 | 9.885548e+01 | 2019 |
El Salvador | CC222_PC222 | 3.114648e+02 | 2019 |
Guatemala | CC320_PC320 | 1.640675e+02 | 2019 |
Honduras | CC340_PC340 | 8.710443e+01 | 2019 |
Mexico | CC484_PC484 | 6.562696e+01 | 2019 |
Nicaragua | CC558_PC558 | 5.439175e+01 | 2019 |
Panama | CC591_PC591 | 5.712187e+01 | 2019 |
Argentina | CC32_PC32 | 1.636308e+01 | 2019 |
Bolivia (Plurinational State of) | CC68_PC68 | 1.062781e+01 | 2019 |
Brazil | CC76_PC76 | 2.525078e+01 | 2019 |
Chile | CC152_PC152 | 2.548920e+01 | 2019 |
Colombia | CC170_PC170 | 4.537129e+01 | 2019 |
Ecuador | CC218_PC218 | 6.995352e+01 | 2019 |
Falkland Islands (Malvinas) | CC238_PC238 | 2.770748e-01 | 2019 |
French Guiana | CC254_PC254 | 3.537993e+00 | 2019 |
Guyana | CC328_PC328 | 3.976505e+00 | 2019 |
Paraguay | CC600_PC600 | 1.773128e+01 | 2019 |
Peru | CC604_PC604 | 2.539880e+01 | 2019 |
Suriname | CC740_PC740 | 3.726686e+00 | 2019 |
Uruguay | CC858_PC858 | 1.977906e+01 | 2019 |
Venezuela (Bolivarian Republic of) | CC862_PC862 | 3.232904e+01 | 2019 |
Australia | CC36_PC36 | 3.280684e+00 | 2019 |
New Zealand | CC554_PC554 | 1.816514e+01 | 2019 |
Fiji | CC242_PC242 | 4.871128e+01 | 2019 |
New Caledonia | CC540_PC540 | 1.546811e+01 | 2019 |
Papua New Guinea | CC598_PC598 | 1.937932e+01 | 2019 |
Solomon Islands | CC90_PC90 | 2.393073e+01 | 2019 |
Vanuatu | CC548_PC548 | 2.460066e+01 | 2019 |
Guam | CC316_PC316 | 3.098056e+02 | 2019 |
Kiribati | CC296_PC296 | 1.451951e+02 | 2019 |
Marshall Islands | CC584_PC584 | 3.266167e+02 | 2019 |
Micronesia (Fed. States of) | CC583_PC583 | 1.625871e+02 | 2019 |
Nauru | CC520_PC520 | 5.382000e+02 | 2019 |
Northern Mariana Islands | CC580_PC580 | 1.243761e+02 | 2019 |
Palau | CC585_PC585 | 3.913261e+01 | 2019 |
American Samoa | CC16_PC16 | 2.765600e+02 | 2019 |
Cook Islands | CC184_PC184 | 7.311250e+01 | 2019 |
French Polynesia | CC258_PC258 | 7.630738e+01 | 2019 |
Niue | CC570_PC570 | 6.207692e+00 | 2019 |
Samoa | CC882_PC882 | 6.964417e+01 | 2019 |
Tokelau | CC772_PC772 | 1.330000e+02 | 2019 |
Tonga | CC776_PC776 | 1.451347e+02 | 2019 |
Tuvalu | CC798_PC798 | 3.885000e+02 | 2019 |
Wallis and Futuna Islands | CC876_PC876 | 8.168571e+01 | 2019 |
Belarus | CC112_PC112 | 4.658424e+01 | 2019 |
Bulgaria | CC100_PC100 | 6.448155e+01 | 2019 |
Czechia | CC203_PC203 | 1.383896e+02 | 2019 |
Hungary | CC348_PC348 | 1.069776e+02 | 2019 |
Poland | CC616_PC616 | 1.237233e+02 | 2019 |
Republic of Moldova | CC498_PC498 | 1.230824e+02 | 2019 |
Romania | CC642_PC642 | 8.413155e+01 | 2019 |
Russian Federation | CC643_PC643 | 8.907212e+00 | 2019 |
Slovakia | CC703_PC703 | 1.134797e+02 | 2019 |
Ukraine | CC804_PC804 | 7.594014e+01 | 2019 |
Channel Islands | CC830_PC830 | 9.066526e+02 | 2019 |
Denmark | CC208_PC208 | 1.360329e+02 | 2019 |
Estonia | CC233_PC233 | 3.127268e+01 | 2019 |
Faroe Islands | CC234_PC234 | 3.486891e+01 | 2019 |
Finland | CC246_PC246 | 1.820448e+01 | 2019 |
Iceland | CC352_PC352 | 3.381915e+00 | 2019 |
Ireland | CC372_PC372 | 7.087383e+01 | 2019 |
Isle of Man | CC833_PC833 | 1.484018e+02 | 2019 |
Latvia | CC428_PC428 | 3.065498e+01 | 2019 |
Lithuania | CC440_PC440 | 4.403151e+01 | 2019 |
Norway | CC578_PC578 | 1.472579e+01 | 2019 |
Sweden | CC752_PC752 | 2.445872e+01 | 2019 |
United Kingdom | CC826_PC826 | 2.791310e+02 | 2019 |
Albania | CC8_PC8 | 1.051428e+02 | 2019 |
Andorra | CC20_PC20 | 1.641404e+02 | 2019 |
Bosnia and Herzegovina | CC70_PC70 | 6.472545e+01 | 2019 |
Croatia | CC191_PC191 | 7.380806e+01 | 2019 |
Gibraltar | CC292_PC292 | 3.370600e+03 | 2019 |
Greece | CC300_PC300 | 8.125254e+01 | 2019 |
Holy See | CC336_PC336 | 1.852273e+03 | 2019 |
Italy | CC380_PC380 | 2.058547e+02 | 2019 |
Malta | CC470_PC470 | 1.376178e+03 | 2019 |
Montenegro | CC499_PC499 | 4.669056e+01 | 2019 |
North Macedonia | CC807_PC807 | 8.261134e+01 | 2019 |
Portugal | CC620_PC620 | 1.116517e+02 | 2019 |
San Marino | CC674_PC674 | 5.644000e+02 | 2019 |
Serbia | CC688_PC688 | 1.002999e+02 | 2019 |
Slovenia | CC705_PC705 | 1.032102e+02 | 2019 |
Spain | CC724_PC724 | 9.369844e+01 | 2019 |
Austria | CC40_PC40 | 1.086666e+02 | 2019 |
Belgium | CC56_PC56 | 3.810874e+02 | 2019 |
France | CC250_PC250 | 1.189460e+02 | 2019 |
Germany | CC276_PC276 | 2.396059e+02 | 2019 |
Liechtenstein | CC438_PC438 | 2.376250e+02 | 2019 |
Luxembourg | CC442_PC442 | 2.377336e+02 | 2019 |
Monaco | CC492_PC492 | 2.615235e+04 | 2019 |
Netherlands | CC528_PC528 | 5.070321e+02 | 2019 |
Switzerland | CC756_PC756 | 2.174147e+02 | 2019 |
Bermuda | CC60_PC60 | 1.250160e+03 | 2019 |
Canada | CC124_PC124 | 4.114037e+00 | 2019 |
Greenland | CC304_PC304 | 1.380436e-01 | 2019 |
Saint Pierre and Miquelon | CC666_PC666 | 2.530870e+01 | 2019 |
United States of America | CC840_PC840 | 3.597352e+01 | 2019 |
The locations.csv
data contains each country's location ID (the same as a country code) and the Sustainable Development Goal region that the location ID is part of.
The data is valid for all years.
Location ID | SDG Region Name |
---|---|
"108" | Sub-Saharan Africa |
"174" | Sub-Saharan Africa |
"262" | Sub-Saharan Africa |
"232" | Sub-Saharan Africa |
"231" | Sub-Saharan Africa |
"404" | Sub-Saharan Africa |
"450" | Sub-Saharan Africa |
"454" | Sub-Saharan Africa |
"480" | Sub-Saharan Africa |
"175" | Sub-Saharan Africa |
"508" | Sub-Saharan Africa |
"638" | Sub-Saharan Africa |
"646" | Sub-Saharan Africa |
"690" | Sub-Saharan Africa |
"706" | Sub-Saharan Africa |
"728" | Sub-Saharan Africa |
"800" | Sub-Saharan Africa |
"834" | Sub-Saharan Africa |
"894" | Sub-Saharan Africa |
"716" | Sub-Saharan Africa |
"24" | Sub-Saharan Africa |
"120" | Sub-Saharan Africa |
"140" | Sub-Saharan Africa |
"148" | Sub-Saharan Africa |
"178" | Sub-Saharan Africa |
"180" | Sub-Saharan Africa |
"226" | Sub-Saharan Africa |
"266" | Sub-Saharan Africa |
"678" | Sub-Saharan Africa |
"72" | Sub-Saharan Africa |
"748" | Sub-Saharan Africa |
"426" | Sub-Saharan Africa |
"516" | Sub-Saharan Africa |
"710" | Sub-Saharan Africa |
"204" | Sub-Saharan Africa |
"854" | Sub-Saharan Africa |
"132" | Sub-Saharan Africa |
"384" | Sub-Saharan Africa |
"270" | Sub-Saharan Africa |
"288" | Sub-Saharan Africa |
"324" | Sub-Saharan Africa |
"624" | Sub-Saharan Africa |
"430" | Sub-Saharan Africa |
"466" | Sub-Saharan Africa |
"478" | Sub-Saharan Africa |
"562" | Sub-Saharan Africa |
"566" | Sub-Saharan Africa |
"654" | Sub-Saharan Africa |
"686" | Sub-Saharan Africa |
"694" | Sub-Saharan Africa |
"768" | Sub-Saharan Africa |
"12" | Northern Africa and Western Asia |
"818" | Northern Africa and Western Asia |
"434" | Northern Africa and Western Asia |
"504" | Northern Africa and Western Asia |
"729" | Northern Africa and Western Asia |
"788" | Northern Africa and Western Asia |
"732" | Northern Africa and Western Asia |
"51" | Northern Africa and Western Asia |
"31" | Northern Africa and Western Asia |
"48" | Northern Africa and Western Asia |
"196" | Northern Africa and Western Asia |
"268" | Northern Africa and Western Asia |
"368" | Northern Africa and Western Asia |
"376" | Northern Africa and Western Asia |
"400" | Northern Africa and Western Asia |
"414" | Northern Africa and Western Asia |
"422" | Northern Africa and Western Asia |
"512" | Northern Africa and Western Asia |
"634" | Northern Africa and Western Asia |
"682" | Northern Africa and Western Asia |
"275" | Northern Africa and Western Asia |
"760" | Northern Africa and Western Asia |
"792" | Northern Africa and Western Asia |
"784" | Northern Africa and Western Asia |
"887" | Northern Africa and Western Asia |
"398" | Central and Southern Asia |
"417" | Central and Southern Asia |
"762" | Central and Southern Asia |
"795" | Central and Southern Asia |
"860" | Central and Southern Asia |
"4" | Central and Southern Asia |
"50" | Central and Southern Asia |
"64" | Central and Southern Asia |
"356" | Central and Southern Asia |
"364" | Central and Southern Asia |
"462" | Central and Southern Asia |
"524" | Central and Southern Asia |
"586" | Central and Southern Asia |
"144" | Central and Southern Asia |
"156" | Eastern and South-Eastern Asia |
"344" | Eastern and South-Eastern Asia |
"446" | Eastern and South-Eastern Asia |
"158" | Eastern and South-Eastern Asia |
"408" | Eastern and South-Eastern Asia |
"392" | Eastern and South-Eastern Asia |
"496" | Eastern and South-Eastern Asia |
"410" | Eastern and South-Eastern Asia |
"96" | Eastern and South-Eastern Asia |
"116" | Eastern and South-Eastern Asia |
"360" | Eastern and South-Eastern Asia |
"418" | Eastern and South-Eastern Asia |
"458" | Eastern and South-Eastern Asia |
"104" | Eastern and South-Eastern Asia |
"608" | Eastern and South-Eastern Asia |
"702" | Eastern and South-Eastern Asia |
"764" | Eastern and South-Eastern Asia |
"626" | Eastern and South-Eastern Asia |
"704" | Eastern and South-Eastern Asia |
"660" | Latin America and the Caribbean |
"28" | Latin America and the Caribbean |
"533" | Latin America and the Caribbean |
"44" | Latin America and the Caribbean |
"52" | Latin America and the Caribbean |
"535" | Latin America and the Caribbean |
"92" | Latin America and the Caribbean |
"136" | Latin America and the Caribbean |
"192" | Latin America and the Caribbean |
"531" | Latin America and the Caribbean |
"212" | Latin America and the Caribbean |
"214" | Latin America and the Caribbean |
"308" | Latin America and the Caribbean |
"312" | Latin America and the Caribbean |
"332" | Latin America and the Caribbean |
"388" | Latin America and the Caribbean |
"474" | Latin America and the Caribbean |
"500" | Latin America and the Caribbean |
"630" | Latin America and the Caribbean |
"652" | Latin America and the Caribbean |
"659" | Latin America and the Caribbean |
"662" | Latin America and the Caribbean |
"663" | Latin America and the Caribbean |
"670" | Latin America and the Caribbean |
"534" | Latin America and the Caribbean |
"780" | Latin America and the Caribbean |
"796" | Latin America and the Caribbean |
"850" | Latin America and the Caribbean |
"84" | Latin America and the Caribbean |
"188" | Latin America and the Caribbean |
"222" | Latin America and the Caribbean |
"320" | Latin America and the Caribbean |
"340" | Latin America and the Caribbean |
"484" | Latin America and the Caribbean |
"558" | Latin America and the Caribbean |
"591" | Latin America and the Caribbean |
"32" | Latin America and the Caribbean |
"68" | Latin America and the Caribbean |
"76" | Latin America and the Caribbean |
"152" | Latin America and the Caribbean |
"170" | Latin America and the Caribbean |
"218" | Latin America and the Caribbean |
"238" | Latin America and the Caribbean |
"254" | Latin America and the Caribbean |
"328" | Latin America and the Caribbean |
"600" | Latin America and the Caribbean |
"604" | Latin America and the Caribbean |
"740" | Latin America and the Caribbean |
"858" | Latin America and the Caribbean |
"862" | Latin America and the Caribbean |
"36" | Australia/New Zealand |
"554" | Australia/New Zealand |
"242" | Oceania (excluding Australia and New Zealand) |
"540" | Oceania (excluding Australia and New Zealand) |
"598" | Oceania (excluding Australia and New Zealand) |
"90" | Oceania (excluding Australia and New Zealand) |
"548" | Oceania (excluding Australia and New Zealand) |
"316" | Oceania (excluding Australia and New Zealand) |
"296" | Oceania (excluding Australia and New Zealand) |
"584" | Oceania (excluding Australia and New Zealand) |
"583" | Oceania (excluding Australia and New Zealand) |
"520" | Oceania (excluding Australia and New Zealand) |
"580" | Oceania (excluding Australia and New Zealand) |
"585" | Oceania (excluding Australia and New Zealand) |
"16" | Oceania (excluding Australia and New Zealand) |
"184" | Oceania (excluding Australia and New Zealand) |
"258" | Oceania (excluding Australia and New Zealand) |
"570" | Oceania (excluding Australia and New Zealand) |
"882" | Oceania (excluding Australia and New Zealand) |
"772" | Oceania (excluding Australia and New Zealand) |
"776" | Oceania (excluding Australia and New Zealand) |
"798" | Oceania (excluding Australia and New Zealand) |
"876" | Oceania (excluding Australia and New Zealand) |
"112" | Europe and Northern America |
"100" | Europe and Northern America |
"203" | Europe and Northern America |
"348" | Europe and Northern America |
"616" | Europe and Northern America |
"498" | Europe and Northern America |
"642" | Europe and Northern America |
"643" | Europe and Northern America |
"703" | Europe and Northern America |
"804" | Europe and Northern America |
"830" | Europe and Northern America |
"208" | Europe and Northern America |
"233" | Europe and Northern America |
"234" | Europe and Northern America |
"246" | Europe and Northern America |
"352" | Europe and Northern America |
"372" | Europe and Northern America |
"833" | Europe and Northern America |
"428" | Europe and Northern America |
"440" | Europe and Northern America |
"578" | Europe and Northern America |
"752" | Europe and Northern America |
"826" | Europe and Northern America |
"8" | Europe and Northern America |
"20" | Europe and Northern America |
"70" | Europe and Northern America |
"191" | Europe and Northern America |
"292" | Europe and Northern America |
"300" | Europe and Northern America |
"336" | Europe and Northern America |
"380" | Europe and Northern America |
"470" | Europe and Northern America |
"499" | Europe and Northern America |
"807" | Europe and Northern America |
"620" | Europe and Northern America |
"674" | Europe and Northern America |
"688" | Europe and Northern America |
"705" | Europe and Northern America |
"724" | Europe and Northern America |
"40" | Europe and Northern America |
"56" | Europe and Northern America |
"250" | Europe and Northern America |
"276" | Europe and Northern America |
"438" | Europe and Northern America |
"442" | Europe and Northern America |
"492" | Europe and Northern America |
"528" | Europe and Northern America |
"756" | Europe and Northern America |
"60" | Europe and Northern America |
"124" | Europe and Northern America |
"304" | Europe and Northern America |
"666" | Europe and Northern America |
"840" | Europe and Northern America |
To analyse the data we will need a single data frame, so we must join locations and population densities on a column.
At the moment there is no exactly matching column to join on, so we will need to manipulate some columns first.
Both data frames contain a “country code” value somewhere. In the population density data frame, the country code will need to be separated from the parent code, and there are “CC” and “PC” prefixes to consider. In the locations data frame, the quotation marks around the location IDs will need to be removed.
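These cleaning steps can be sketched on a couple of example values, using stringr and tidyr as the colleague's script does. The two-row data frames below are illustrative only, not the real files:

```r
library(tidyr)
library(stringr)
library(dplyr)

# Illustrative rows only - not the real data sets
pop <- data.frame(country_and_parent_code = c("CC108_PC108", "CC174_PC174"),
                  population_density = c(449.01, 457.22))
loc <- data.frame(location_id = c('"108"', '"174"'),
                  sdg_region_name = "Sub-Saharan Africa")

# Split the combined code into two columns, then strip the "CC" prefix
pop <- tidyr::separate(pop, country_and_parent_code,
                       into = c("country_code", "parent_code"), sep = "_")
pop$country_code <- as.integer(stringr::str_remove_all(pop$country_code, "CC"))

# Strip the quotation marks from the location IDs
loc$country_code <- as.integer(stringr::str_remove_all(loc$location_id, '"'))

# Both frames now share an integer country_code column to join on
joined <- dplyr::left_join(pop, loc, by = "country_code")
```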
Once each population density has its respective SDG region in the same table, the data can be aggregated: grouped by SDG region, with the mean then calculated on the population density values.
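On made-up numbers, the aggregation step looks like this (group first, then summarise):

```r
library(dplyr)

# Made-up densities for illustration
region_density <- data.frame(
  sdg_region_name    = c("Sub-Saharan Africa", "Sub-Saharan Africa", "Australia/New Zealand"),
  population_density = c(100, 200, 10)
)

# Group by region, then take the non-weighted mean density per group
region_density_grouped <- dplyr::group_by(region_density, sdg_region_name)
region_mean_density <- dplyr::summarise(region_density_grouped,
                                        mean_population_density = mean(population_density))
# Sub-Saharan Africa averages to 150; Australia/New Zealand stays at 10
```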
We have stated above that the locations data is valid for all years, meaning that we expect its structure to be consistent. Once the 2019 data is clean, what should we consider before applying our program to other years?
Before getting started on the task of analysing the population density data, it is important that we are aware of different styles of programming.
Scripts and notebooks can be really useful tools for quick analysis; however, they limit how we can scale and improve our project.
Our scripts become one line after another of data being slightly changed at each step.
This does not group the code into a structure that helps us understand it.
This style of programming is sometimes referred to as “imperative”.
Programmers frequently copy and paste code to reuse it in different parts of a program, with small changes.
If the requirements of our project change, we need to hunt through the code to change all the relevant variables and values. If code sections have been copied and pasted, fixing an error in one place won’t fix the copies.
If the project expands we need to write more and more code. This is often done in the same file, making the code harder to work through and understand.
To structure our code better we need to be able to group a collection of code together into one object. This can be done in two ways: with functions or with classes.
Classes are beyond the scope of this course and are less prevalent in R, so we will focus on functions here. However, many of these principles can also be applied to classes.
Properties of functions:
Functions can be run in one line of code, running complicated operations that have been written elsewhere. This helps “hide” some of the detail, making it clearer what is happening in the code - a process known as abstraction.
Well-named functions mean we do not need to understand the details inside the function - just what they achieve.
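For example, the caller of a well-named function only needs to know what it achieves. The function below is our own illustration, not part of the colleague's script:

```r
# A well-named function hides the detail of how the cleaning is done
standardise_column_names <- function(df) {
  colnames(df) <- tolower(colnames(df))
  colnames(df) <- gsub(" ", "_", colnames(df))
  df
}

df <- data.frame("Population Density" = c(449.01), check.names = FALSE)
df <- standardise_column_names(df)  # one readable line; no detail needed
```

Reading the call site, the name alone tells us the column names are being standardised.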
Within this course there is a programming styles document, explaining some of the different styles of programming. This is suggested further reading at this point in the course.
There are some important principles to keep in mind when we design functions.
In this section we will discuss considerations when converting scripts to functions, using an example script to show the steps involved in structuring code.
The code your colleague has given you is shown below. At present it is a well-commented script, but it is not well structured. Your task is to structure the code to allow for reproducible future analysis.
At a high level, the code loads the two data sets, cleans them, joins them on a country code, calculates the mean population density for each SDG region, and writes the result to a CSV file.
Have a read through the script you have received, and be sure to look up any sections you are not comfortable with.
If you would prefer to look at it within an IDE, it is located in
example_code_R/initial_script/
. For all scripts and files throughout this course it is assumed that the working directory is the location of the file being run. This may need to be changed in your IDE.
```r
# File to analyse the mean population density data from the UN

# Import relevant libraries for analysis
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

# Load the population density data 2019
population_path <- file.path("../../data/population_density_2019.csv")
pop_density <- readr::read_csv(population_path)

# Clean the column names, following snake_case convention
colnames(pop_density) <- tolower(colnames(pop_density))
colnames(pop_density) <- stringr::str_replace_all(colnames(pop_density),
                                                  pattern = " ",
                                                  replacement = "_")

# The country_and_parent_code column needs to
# be split into two columns without the strings
pop_density <- tidyr::separate(data = pop_density, col = country_and_parent_code,
                               into = c("country_code", "parent_code"),
                               sep = "_")

# Remove the parent_code column, not used in later analysis
pop_density <- dplyr::select(pop_density, everything(), -parent_code)

# Convert country_code to integer by removing strings
pop_density$country_code <- stringr::str_remove_all(pop_density$country_code, pattern = "CC")
pop_density$country_code <- as.integer(pop_density$country_code)

# Load the locations data to get the Sustainable Development Goals sub regions
locations_path <- file.path("../../data/locations.csv")
locations <- readr::read_csv(locations_path)

# Clean the column names, following naming conventions similar to PEP8
colnames(locations) <- tolower(colnames(locations))
colnames(locations) <- stringr::str_replace_all(colnames(locations),
                                                pattern = " ",
                                                replacement = "_")

# The location_id data has quotation marks making it a string,
# it needs to be converted to a numeric
locations$location_id <- stringr::str_remove_all(locations$location_id, pattern = '"')
locations$location_id <- as.integer(locations$location_id)

# Change location_id to be called country_code for join
colnames(locations)[colnames(locations) == "location_id"] <- "country_code"

# Join the data sets
# Left merge so we keep all pop_density data
pop_density_location <- dplyr::left_join(pop_density,
                                         locations,
                                         by = "country_code")

# Get just the relevant columns in preparation
# for the following groupby
region_density <- dplyr::select(pop_density_location, sdg_region_name, population_density)

# Calculate the mean population density for each region
# A non-weighted mean
region_density_grouped <- dplyr::group_by(region_density, sdg_region_name)

region_mean_density <- dplyr::summarise(region_density_grouped,
                                        "mean_population_density" = mean(population_density))

# Sort the data for clearer reading, descending order
region_mean_density <- dplyr::arrange(region_mean_density, -mean_population_density)

# Round mean density for clearer reading
region_mean_density$mean_population_density <- round(region_mean_density$mean_population_density,
                                                     digits = 2)

# Write out the final output
readr::write_csv(x = region_mean_density, file = "mean_population_density_output.csv")
```
Output data:
sdg_region_name | mean_population_density |
---|---|
Eastern and South-Eastern Asia | 2112.67 |
Europe and Northern America | 764.93 |
Central and Southern Asia | 330.63 |
Northern Africa and Western Asia | 234.38 |
Latin America and the Caribbean | 199.62 |
Oceania (excluding Australia and New Zealand) | 144.20 |
Sub-Saharan Africa | 126.55 |
Australia/New Zealand | 10.72 |
Chunks of code that do similar things should be grouped together.
Deciding which sections of code make sense as being part of the same function is a common challenge when structuring code.
When converting code into a function - the main thing we look for is that it achieves one task. It may take us a few lines of code to achieve this “one task” - but the point is the function has a specific purpose.
If a function has more than one task or “responsibility” it will become hard to maintain, as it has many reasons to be modified.
If a function has a single “responsibility”, it will be focussed and much more likely to be reusable elsewhere.
When writing scripts, we often repeat the same tasks at different points in the script. These are good parts of code to start converting into functions. Doing so reduces the amount of code written in the file - and makes what is happening at any step clearer.
You may also wish to write helper functions for any common housekeeping tasks that you frequently require.
If a code block isn't repeated throughout the code, that's okay too - all the code can be converted into functions that are called one after the other.
It is much easier to read a sequence of well-named functions, rather than a long stream of commands.
Some code is often very similar, with a variable or two difference in areas of the code. When reading the code, it’s important to think about what is happening to the variables and data involved. Consider whether a similar process is happening elsewhere, rather than whether the same data is involved. These repeating processes present opportunities to reduce the overall length of your script by writing your own custom functions.
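As an illustrative sketch (the vectors and function name here are invented, not part of the example analysis), a repeated "drop missing values then take the mean" process can be captured once as a function:

```r
# A hypothetical repeated process: the same cleaning and
# summarising steps applied to two different vectors
clean_mean <- function(values) {
  values <- values[!is.na(values)]  # drop missing values
  return(mean(values))
}

heights <- c(150, 160, NA, 170)
weights <- c(60, NA, 75, 80)

# One well-named function replaces two near-identical blocks of script code
clean_mean(heights)  # 160
clean_mean(weights)
```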
Returning to our example script, we are going to take one task, convert it into a function, then improve the function so it can be used multiple times.
The lines of code:
# Load the population density data 2019
pop_density <- readr::read_csv("../../data/population_density_2019.csv")

# Clean the column names, following snake_case convention
colnames(pop_density) <- tolower(colnames(pop_density))
colnames(pop_density) <- stringr::str_replace_all(colnames(pop_density),
                                                  pattern = " ",
                                                  replacement = "_")
We can wrap this code in a function, so that it can all be run with one command, like so:
#' Read population data and reformat column names
load_formatted_pop_frame <- function() {
  # Load the population density data 2019
  population_path <- file.path("../../data/population_density_2019.csv")
  pop_density <- readr::read_csv(population_path)

  # Clean the column names, following snake_case convention
  colnames(pop_density) <- tolower(colnames(pop_density))
  colnames(pop_density) <- stringr::str_replace_all(colnames(pop_density),
                                                    pattern = " ",
                                                    replacement = "_")

  return(pop_density)
}

# Call the function to assign the data frame
population_density <- load_formatted_pop_frame()
To improve the function, we can add an argument for something that may change in the future - the path to the data file.
Consider how you would have to change the previous function if the location of the population_density_2019.csv file changed.
Variable names in functions should reflect what that variable is. If you don't know exactly the value the variable will take, then a generic name like dataframe is appropriate. Though consider the framework that you are working in - avoid reserved words or well-established, commonly used function names.
When we add an argument to a function to replace a value within it, we need to be sure to update every place that original variable was used.
Our comments should reflect the changes made too.
Note that comments should add information - the comments in this tutorial are reminders of why we are doing this, and not the style of comment you would be expected to write. Often, if functions and variables are well-named, the code does not require many comments.
The new function can now be used for both the population_density data and locations.csv.
#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)

  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")

  return(dataframe)
}

# The path can be updated where the function is run if needed
population_density <- load_formatted_frame("../../data/population_density_2019.csv")

# The same function is used to load a formatted locations.csv
locations <- load_formatted_frame("../../data/locations.csv")
Scope is an important concept when creating functions and structuring code.
Scope refers to the places in a program that a variable can be accessed.
When writing scripts, variables can be accessed anywhere in the script - so long as the variable assignment has been run.
When we write scripts, we are storing all our variables at the highest, most accessible area of the program. This is referred to as “Global Scope”.
Variables with global scope are accessible in all locations of the program.
This is the easiest way to store variables when learning to program.
However, using global variables throughout our analysis often creates unexpected results in our code. If a new piece of code accidentally alters a global variable, it will affect all the code run after it, even if the function wasn’t meant to update the variable… errors like this can be very tricky to track down and fix.
Some variables can only be accessed in certain locations within a program. When this happens, it is referred to as “Local Scope”.
Variables have local scope if they are accessible within a part of a program such as a function. They cannot be accessed outside the function they are assigned in.
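A minimal sketch of local scope (a toy function, not part of the analysis):

```r
demonstrate_scope <- function() {
  local_total <- 1 + 2  # local_total has local scope
  return(local_total)
}

demonstrate_scope()    # returns 3
exists("local_total")  # FALSE - the variable cannot be accessed out here
```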
At the highest level of scope are the parts of the programming language that can be accessed anywhere - the built in functions (e.g. print()).
To make our functions follow functional programming principles we need to keep variable scope in mind.
When designing functions:
If we are clear about what variables we are accessing, we can be sure about what their values are. Using only variables passed as arguments clarifies what data a function is operating on, and makes it much easier to reuse elsewhere (as it just needs its arguments defined, no hidden dependencies on global variables).
Think of your functions as having an entrance and an exit.
When choosing which parameters to give a function there are a few things to consider:
Not all functions need to return a value, such as a function that writes out a file. In this case do not use a return statement, making it clear nothing will be returned. Note that if there is no explicit return statement, R returns the value of the last evaluated expression in the function body.
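As a small sketch (toy functions, not part of the analysis) of how R behaves when there is no return statement:

```r
# With no return() call, R gives back the value
# of the last evaluated expression
add_one <- function(x) {
  x + 1
}
add_one(4)  # 5

# A side-effect function can end with invisible(NULL)
# to make clear that nothing useful is returned
log_message <- function(msg) {
  message(msg)
  invisible(NULL)
}
```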
Below are examples of code which have similar purposes, one uses parameter variables well, the other does not.
This is bad because we are altering data that has not been passed as arguments to the function.
letters <- c("a", "b", "c", "d", "e")

add_letter <- function() {
  long_letters <- c(letters, "f")
  return(long_letters)
}

# Run on original data
print("Initial")
[1] "Initial"
print(add_letter())
[1] "a" "b" "c" "d" "e" "f"

# the value of letters could be changed elsewhere in the program
letters <- c("1", "2", "3", "4", "5")

# Without changing our function call at all we get a different result
# with the same function call
print("Changed")
[1] "Changed"
print(add_letter())
[1] "1" "2" "3" "4" "5" "f"
letters <- c("a", "b", "c", "d", "e")

add_letter <- function(character_vector) {
  long_characters <- c(character_vector, "f")
  return(long_characters)
}

# Run on original data
print("Initial")
[1] "Initial"
print(add_letter(letters))
[1] "a" "b" "c" "d" "e" "f"

# the value of letters could still be changed elsewhere in the program
letters <- c("1", "2", "3", "4", "5")

# But the data the function uses is now passed explicitly,
# so the dependency on the new value is clear at the call site
print("Changed")
[1] "Changed"
As analysts and data scientists, we will often use data frames in our programs.
There are some special considerations that need to be taken when working with these objects, with regards to functional programming principles.
R has a number of properties that make it an effective language for writing functions. However, there are some considerations that need to be taken into account when passing parameters to functions in R, especially when using the tidyverse family of functions.
When we convert a script to a function, we often need to use variables in our code in place of hard-coded strings or names. For example, we may want to write a function that takes a data frame and a column name as parameters. These variables need to be handled slightly differently from the hard-coded strings they replace.
For example, when accessing a column name we can specify the name using $
, or we can use square brackets []
, which also works with columns.
Assuming we have some data frame survey with a column "people":
# To access column "people" from the dataframe "survey"
survey$people

# To access the column name stored as a variable, "people" from "survey"
column_name <- "people"
survey[column_name]
To access the specific column needed you can use single square brackets: survey[column_name]. To access the specific vector of values in a column you use double square brackets: survey[[column_name]].
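A quick sketch with a toy survey data frame (invented for illustration):

```r
survey <- data.frame(people = c(10, 20, 30))
column_name <- "people"

survey[column_name]           # single brackets: a one-column data frame
survey[[column_name]]         # double brackets: the vector 10 20 30

class(survey[column_name])    # "data.frame"
class(survey[[column_name]])  # "numeric"
```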
Many tidyverse functions quote their arguments rather than evaluating them in the usual way. This means that if we pass column names into a function as variables, they will not be recognised as the names of columns themselves. When using tidyverse functions this issue can be avoided in two ways.
# This code will not run
column_name <- "people"

grouped_survey <- dplyr::group_by(.data = survey, column_name)
We can either use the standard evaluation versions of functions, which are indicated with an underscore at the end of the function name. (Note that these underscore versions are deprecated in recent releases of dplyr, though they are still available.)
# This code will run
column_name <- "people"

grouped_survey <- dplyr::group_by_(.data = survey, column_name)
Alternatively, we can use the base R function get() to evaluate the variable we give to it.
# This code will run
column_name <- "people"

grouped_survey <- dplyr::group_by(.data = survey, get(column_name))
In functions we pass variables to other functions frequently, so it is important to be able to access those variables.
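As an aside (this is not part of the course's original code): in current dplyr releases the .data pronoun is the supported way to refer to a column whose name is stored in a variable, replacing the deprecated underscore variants:

```r
library(dplyr)

# Toy data for illustration
survey <- data.frame(people = c("a", "a", "b"))
column_name <- "people"

# .data[[column_name]] tells dplyr to treat the variable's
# value as a column name in the data frame
grouped_survey <- dplyr::group_by(survey, .data[[column_name]])
```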
Using the code snippets from the example analysis below, write a function that:

- splits the country_and_parent_code column into parent_code and country_code columns
- removes the country_and_parent_code and parent_code columns

Add this function into the file example_code_python/function_input/exercise1.py or example_code_R/function_input/exercise1.R depending on your chosen framework. Use the code already there to test your result on pop_density.

Name the function access_country_code().
# The country_and_parent_code column needs to
# be split into two columns without the strings
pop_density <- tidyr::separate(data = pop_density, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

# Remove the parent_code column, not used in later analysis
pop_density <- dplyr::select(pop_density, everything(), -parent_code)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

## Code to be improved to complete exercise 1

#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)

  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")

  return(dataframe)
}

#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"),
                               sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)

  return(dataframe)
}

# Loading both data frames
population_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")

# Run the code created checking output
pop_density_single_code <- access_country_code(population_density)
print(pop_density_single_code$country_code)
Using the code snippets from our example analysis below, write a function that:

- takes a data frame, a column name and a string value as arguments
- removes that string value from the given column
- converts the column to integer type

This function will be used across both data frames later - so be sure it is general enough to work for both. In addition, it must use only data it gets as arguments.

Add this function into the file example_code/function_input/exercise2.py|R. Use the code already there to test your result on locations and pop_density.

Name the function convert_type_to_int().
# Convert country_code to integer by removing extra strings
pop_density$country_code <- stringr::str_remove(pop_density$country_code, pattern = "CC")

# Convert type
pop_density$country_code <- as.integer(pop_density$country_code)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

## Code to be improved to complete exercise 2

#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)

  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")

  return(dataframe)
}

#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)

  return(dataframe)
}

#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  # Convert the column to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[[column_name]] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)

  # Convert type
  dataframe[[column_name]] <- as.integer(dataframe[[column_name]])

  return(dataframe)
}

pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")

pop_density_single_code <- access_country_code(pop_density)

# Using the conversion function created
population_density_correct_types <- convert_type_to_int(pop_density_single_code,
                                                        column_name = "country_code",
                                                        string_value = "CC")
locations_correct_types <- convert_type_to_int(locations,
                                               column_name = "location_id",
                                               string_value = '"')

print(str(population_density_correct_types))
print(str(locations_correct_types))
Using the code snippets from our example analysis below, write a function that:

- takes a left data frame, a right data frame and the join column name from each as arguments
- renames the right data frame's join column to match the left's
- left joins the two data frames on that column
- keeps only the columns needed for the following analysis

This function will be used after the previous functions, using the data frames they output.

Add this function into the file example_code_R/function_input/exercise3.r. Use the code already there to test your result on the new data frame.

This function will be useful for our specific case, but also if we want to join other data frames or use different column names. Our column names could change if we change an upstream function, so it's important we give them as inputs.

Name the function join_frames().
# Change location_id to be called country_code for join
locations <- dplyr::rename(locations, country_code = location_id)

# Join the data sets
# Left merge so we keep all pop_density data
pop_density_location <- dplyr::left_join(pop_density,
                                         locations,
                                         by = "country_code")

# Get just the relevant columns in preparation
# for the following groupby
region_density <- dplyr::select(pop_density_location, sdg_region_name, population_density)
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

## Code to be improved to complete exercise 3

#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)

  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")

  return(dataframe)
}

#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)

  return(dataframe)
}

#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  # Convert the column to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[[column_name]] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)

  # Convert type
  dataframe[[column_name]] <- as.integer(dataframe[[column_name]])

  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  # Rename the right data frame's join column to match the left's
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column

  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)

  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)

  return(combined_frames_reduced)
}

## Run the functions created
pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")

pop_density_single_code <- access_country_code(pop_density)

pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                 column_name = "country_code",
                                                 string_value = "CC")
locations_correct_types <- convert_type_to_int(dataframe = locations,
                                               column_name = "location_id",
                                               string_value = '"')

population_location <- join_frames(pop_density_correct_types,
                                   locations_correct_types,
                                   left_column = "country_code",
                                   right_column = "location_id")

print(colnames(population_location))
print(head(population_location, 10))
This section will introduce some concepts and good practice that are relevant for when you have converted your script into functions.
In the section below, a version of code with all tasks broken into functions is shown. To help consolidate your learning from the previous exercises, an extension exercise is to convert the remaining code to functions yourself.
Using exercise3_answers.R, convert the remaining script code into functions. The functions should be called:

- aggregate_statistic()
- format_frame()
- write_output()
Each of these functions performs one task. They are general enough that they work for our specific situation but leave some room for minor upstream adjustments, such as column or file names.
Side note: we are writing the function write_output() as practice; it only contains a single line of code, so in practice it wouldn't be used as a function. It's important to avoid writing functions that are too small.
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)

  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe), pattern = " ", replacement = "_")

  return(dataframe)
}

#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)

  return(dataframe)
}

#' Function to convert string to integer column type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  # Convert the column to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[[column_name]] <- stringr::str_remove_all(dataframe[[column_name]], pattern = string_value)

  # Convert type
  dataframe[[column_name]] <- as.integer(dataframe[[column_name]])

  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  # Rename the right data frame's join column to match the left's
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column

  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)

  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)

  return(combined_frames_reduced)
}

#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  # Perform aggregation and summary
  # Use group_by_ because of variable column name
  region_mean_density_grouped <- group_by_(.data = dataframe, groupby_column)

  # use get() to access column name
  region_mean_density <- dplyr::summarise(.data = region_mean_density_grouped,
                                          "mean_population_density" = mean(get(statistic_column)))

  return(region_mean_density)
}

#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  # Sort the data for clearer reading, descending order
  # use get() to sort by the variable column name
  dataframe_sorted <- dplyr::arrange(.data = dataframe, -get(statistic_column))

  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)

  return(dataframe_sorted)
}

#' Write output statistic in formatted manner
write_output <- function(dataframe, output_filepath) {
  readr::write_csv(x = dataframe, file = output_filepath)
}

## Run the functions created
pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")

pop_density_single_code <- access_country_code(pop_density)

pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                 column_name = "country_code",
                                                 string_value = "CC")
locations_correct_types <- convert_type_to_int(dataframe = locations,
                                               column_name = "location_id",
                                               string_value = '"')

population_location <- join_frames(pop_density_correct_types,
                                   locations_correct_types,
                                   left_column = "country_code",
                                   right_column = "location_id")

aggregation <- aggregate_mean(dataframe = population_location,
                              groupby_column = "sdg_region_name",
                              statistic_column = "population_density")

formatted_statistic <- format_frame(aggregation, "mean_population_density")

write_output(formatted_statistic, "./mean_pop_density.csv")
Now we have converted all our code tasks into functions we can run each function, passing their output into the input of the next function.
Looking at the code at the end of our script there are a group of lines which describe the running of the program. These lines of code describe the whole analysis, showing each step in the process with a function corresponding to each step.
When we hit "Run", the code shown below is executed. The functions defined above it in the file are loaded into the program's global scope, allowing them to be called by this code.
## Run the functions created
pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
locations <- load_formatted_frame("../../data/locations.csv")

pop_density_single_code <- access_country_code(pop_density)

pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                 column_name = "country_code",
                                                 string_value = "CC")
locations_correct_types <- convert_type_to_int(dataframe = locations,
                                               column_name = "location_id",
                                               string_value = '"')

population_location <- join_frames(pop_density_correct_types,
                                   locations_correct_types,
                                   left_column = "country_code",
                                   right_column = "location_id")

aggregation <- aggregate_mean(dataframe = population_location,
                              groupby_column = "sdg_region_name",
                              statistic_column = "population_density")

formatted_statistic <- format_frame(aggregation, "mean_population_density")

write_output(formatted_statistic, "./mean_pop_density.csv")
The code above makes what we are doing much easier to understand. To find out what the code is doing at each step, we can just read the name of the function, or look up what it does in the documentation.
The way the code is currently designed, however, still uses variables in the global scope, something to generally avoid.
If we add one more function, that calls our other functions, we can run our whole program by calling this one function. This will make it much easier to run the analysis later down the line, and to extend our code into modules and packages.
Functions that run other functions are called "high level" functions. Using high level functions lets us build more structure into our code.
Often the convention you will see for naming the highest level function in code is to call it main(), however it does not have to be this name. We will call our highest level analysis function get_analyse_output().
In effect, we put all the code that was used to "run" the program within the get_analyse_output() function. This way we can run the program only when we call get_analyse_output().
This is the point where typical convention between Python and R starts to differ. Be sure to check both methods if you regularly code in both.
How many levels of "high level" functions we have should be proportionate to our code. For a small task we probably don't need high level functions. For a larger pipeline they become significantly more important.
If we want to alter the behaviour of the get_analyse_output() function we have two options:

- change the code inside the function itself
- add parameters to the function, so its behaviour can be changed when it is called
Below is our get_analyse_output() function, and the code used to run it.
#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function() {

  pop_density <- load_formatted_frame("../../data/population_density_2019.csv")
  locations <- load_formatted_frame("../../data/locations.csv")

  pop_density_single_code <- access_country_code(pop_density)

  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')

  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")

  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")

  formatted_statistic <- format_frame(aggregation, "mean_population_density")

  write_output(formatted_statistic, "./mean_pop_density.csv")
}

get_analyse_output()
If we were to use this analysis on different data sets, it may be useful for us to be able to change the data inputs and outputs.
#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function(population_filepath, location_filepath, output_filepath) {

  pop_density <- load_formatted_frame(population_filepath)
  locations <- load_formatted_frame(location_filepath)

  pop_density_single_code <- access_country_code(pop_density)

  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')

  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")

  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")

  formatted_statistic <- format_frame(aggregation, "mean_population_density")

  write_output(formatted_statistic, output_filepath)
}

## Run the main function created
get_analyse_output(population_filepath = "../../../data/population_density_2019.csv",
                   location_filepath = "../../../data/locations.csv",
                   output_filepath = "./mean_pop_density.csv")
We have now introduced a higher level function that runs other functions for us.
This is a great step forward in structuring our code. If we want to understand what the program does, we only need to read the documentation and body of the get_analyse_output() function.

By having some functions that call others we now have levels and dependencies of functions.
Well documented high-level functions mean we do not need to dive into the lower level functions to understand what the code does.
These relationships between functions can be described with hierarchical diagrams. Writing down the relationship between tasks in your code is an extremely useful practice in structuring code.
Below is what the code in main_func.py|R looks like as a hierarchy of functions.
As you can see, a lot of steps are being run by the single get_analyse_output() function. It is really important we have this high level function, but we can have more if it makes the structure of our program clearer.
Below we will first look at a new code diagram with a different structure to the previous, then the code it corresponds to.
This is slight overkill for our program at the moment due to its small size, but the principle is very useful as our code becomes more complex.
The new structure still has the single highest level get_analyse_output() function at the top.

Note that we have not added an additional higher level function above write_output(). This is because we don't need to have a higher function calling just one lower level function. In addition, we do not always want to write data out while we test the analysis pipeline.
The benefit of this structure is that we can more easily access the data produced by our pipeline at relevant steps:

- the loading and cleaning steps are grouped into an extract_transform() function
- the aggregation and formatting steps are grouped into an analyse() function, and get_analyse_output() calls these instead
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)

  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe),
                                                  pattern = " ", replacement = "_")
  return(dataframe)
}

#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  return(dataframe)
}

#' Function to convert a string column to integer type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]],
                                                    pattern = string_value)

  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  return(combined_frames_reduced)
}

#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  # Perform aggregation and summary
  # Use the .data pronoun because the column names are held in variables
  region_mean_density_grouped <- dplyr::group_by(dataframe, .data[[groupby_column]])
  region_mean_density <- dplyr::summarise(region_mean_density_grouped,
                                          mean_population_density = mean(.data[[statistic_column]]))
  return(region_mean_density)
}

#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(dataframe, dplyr::desc(.data[[statistic_column]]))

  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)
  return(dataframe_sorted)
}

#' Write output statistic in formatted manner
write_output <- function(dataframe, output_filepath) {
  readr::write_csv(x = dataframe, file = output_filepath)
}

#' Load the data and convert it to a clean joined format for analysis
extract_transform <- function(population_filepath, location_filepath) {
  pop_density <- load_formatted_frame(population_filepath)
  locations <- load_formatted_frame(location_filepath)

  pop_density_single_code <- access_country_code(pop_density)
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  population_location <- join_frames(left_dataframe = pop_density_correct_types,
                                     right_dataframe = locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  return(population_location)
}

#' Perform groupby mean of population density and reformat the result
analyse <- function(full_dataframe, groupby_column, aggregate_column, statistic_column) {
  aggregation <- aggregate_mean(dataframe = full_dataframe,
                                groupby_column = groupby_column,
                                statistic_column = aggregate_column)
  formatted_statistic <- format_frame(aggregation, statistic_column = statistic_column)
  return(formatted_statistic)
}

#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function(population_filepath, location_filepath, output_filepath) {
  population_location <- extract_transform(population_filepath = population_filepath,
                                           location_filepath = location_filepath)
  formatted_statistic <- analyse(full_dataframe = population_location,
                                 groupby_column = "sdg_region_name",
                                 aggregate_column = "population_density",
                                 statistic_column = "mean_population_density")
  write_output(formatted_statistic, output_filepath)
}

get_analyse_output(population_filepath = "../../data/population_density_2019.csv",
                   location_filepath = "../../data/locations.csv",
                   output_filepath = "./mean_pop_density.csv")
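The comment in convert_type_to_int() notes that dataframe$column_name will not work when the column name is held in a variable. A minimal sketch of the difference, using a hypothetical toy data frame:

```r
# Toy data frame (hypothetical, for illustration only)
df <- data.frame(country_code = c("CC1", "CC2"))

col <- "country_code"

# $ uses the literal name "col", which is not a column, so it returns NULL
is.null(df$col)   # TRUE

# [[ ]] evaluates the variable first, returning the intended column
df[[col]]         # "CC1" "CC2"
```

Inside dplyr verbs the same idea is expressed with the .data pronoun, for example dplyr::group_by(df, .data[[col]]).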
“Who will need to access this part of the program?” is a useful question to think about when structuring your code.
As the main developer you will likely be accessing the whole code base, every function.
To run the program a user only needs to interact with a small part of the program. The part of the program a user will be interacting with is called the “application programming interface”, API. Other areas of the code can be seen, but rarely used by the user.
Parts of your code can be “hidden” from the user. By structuring the code properly, the private parts stay out of the way - users do not need to understand or access the inner workings of every function - they just need to run the program.
In our code the API part would be the get_analyse_output() function.
Separating public-facing and lower-level functions improves clarity and usability. All code should be as clear as possible, whether it is part of the API or not, to help with future development.
Having this distinction allows us to test the code at the correct levels.
Having a hierarchy of functions, with a clear distinction about what the API is, can make the code simpler. Structuring the code well makes it easier to run, test and fix for developers and users.
This concept becomes more important in:
Ideally, a user does not need to open any code files to run analysis. Instead, the user can work with a graphical user interface (GUI) or command line interface. Parameters such as the input data file paths and output paths are written in a separate file or by the user in the interface.
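As a sketch of this idea (assuming a hypothetical main.R run with Rscript; the argument order here is our own invention), file paths can be supplied on the command line instead of being edited into the script:

```r
# Read any arguments passed after the script name, e.g.
#   Rscript main.R ../data/population_density_2019.csv ./mean_pop_density.csv
args <- commandArgs(trailingOnly = TRUE)

if (length(args) >= 2) {
  input_filepath  <- args[1]
  output_filepath <- args[2]
  # ...call the API function with these paths...
} else {
  message("Usage: Rscript main.R <input_filepath> <output_filepath>")
}
```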
“What will the end product of my analysis pipeline be?” is an important question to consider when structuring your project.
Earlier in this course a scenario was introduced explaining that a single script can grow large and become difficult to maintain.
Although adding structure with functions makes our code better, it can make it longer. Larger code files are difficult to maintain and understand.
We can make our code even clearer and better-structured by moving the functions in our code into different files. By grouping related functions together into different files it will be easier to look up different parts of our code. We no longer need to scroll through thousands of lines of code, we just navigate to the relevant file.
When we move functions (or other objects) into different files, they then need to be imported back into the file we are using those functions in.
In R, the code moved into other files is then “sourced” back into the R environment.
Before structuring code in different files, we need to discuss how to structure our directory properly to help us with this.
Now we are moving beyond working with just one script we need to consider our project, files, folders/directories and paths.
A key part of building a reproducible collection of code is making the project folder simple to understand, navigate and work with.
There is no single folder structure that is perfect for all analysis, however, there are good minimum requirements and guiding principles.
The situation to avoid is having all your data, source code, notebooks and documentation in the same location. This is confusing to anyone else looking in, and makes it harder for your project to be extended.
In this section we will outline basic components of project structure, their relevance to this course, and point to good resources for deciding your own approach.
The main principles are
A directory structure for analysis should separate the:
In addition, relevant version control folders/files will be present (not covered in this course) - .git folder, .gitignore file.
How this is done may depend on your team, language, and specific use case.
An example folder structure for our project is shown below, this is a minimum and could be extended.
population_density_analysis
|   LICENSE.txt
|   README.md
|   requirements.txt
|
+---data
|   +---processed
|   |       mean_pop_density.csv
|   |
|   \---raw
|           locations.csv
|           population_density_2019.csv
|
+---docs
|       documentation.txt
|       user_guide.html
|
+---reports
|   |   population_analysis_report.html
|   |   population_analysis_report.rmd
|   |
|   \---figures
|           graph.png
|
\---src
        main_func (to be broken up).R
Note: /src/ stands for “source” - referring to your source code, the files your program is written with.
There are other folders and considerations to structure your project beyond the minimum.
You may want to have separate folders for:
the /src/ folder
the /data/ folder
In R, consider using a predefined project structure using the .Rproj method, which generates a structure for you:
There is a project structure designed by the Government Digital Service for data science projects.
Now that we are aware of good project folder structure, we can discuss separating our big full code file into more logical smaller files.
This section will focus on the code contained within the /src/ folder shown in the last section.
Group functions with a similar purpose together, such as data cleaning, loading, modelling. Make each file/module as focused as possible to make it easy to find any required function.
To move the functions between files there are four main steps that need to be taken:
Moving code between files when a script already exists is a task that can be avoided by designing your project files in a useful way when starting to write your code. Any new analysis should make use of existing modules that you have created.
Note: in some other people’s code, particularly R code, you may see many files with only one function in each file. This should be avoided, as it does not group related code in a way that makes it easier to work with. Generally, avoid files containing all the functions of a program, and avoid having many files each containing one function. For further information, with reference to R package conventions, have a look at the “R Packages” book.
In this section we will learn how to move functions between files.
In the earlier part of the course “Function Inputs” we discussed why it is important that variables are only accessed through function inputs and outputs. This principle is even more important when moving code between files.
We are first going to make a new file called input_output.R. This file is going to contain all the code we need for loading and exporting our data frames. It is good practice to group related functions into the same file - especially around data access.
In addition, we are going to rename our original script to main.R. This is the file that will run all our code.
Within the input_output.R file we are going to put the following functions, removing them from main.R:
load_formatted_frame()
write_output()
Our files will now appear as below. Note, they will not currently run.
File contains most of the code used to run the program.
library(tidyr)
library(dplyr)
library(stringr)
library(readr)

#' Split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  return(dataframe)
}

#' Function to convert a string column to integer type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]],
                                                    pattern = string_value)

  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  return(combined_frames_reduced)
}

#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  # Perform aggregation and summary
  # Use the .data pronoun because the column names are held in variables
  region_mean_density_grouped <- dplyr::group_by(dataframe, .data[[groupby_column]])
  region_mean_density <- dplyr::summarise(region_mean_density_grouped,
                                          mean_population_density = mean(.data[[statistic_column]]))
  return(region_mean_density)
}

#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(dataframe, dplyr::desc(.data[[statistic_column]]))

  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)
  return(dataframe_sorted)
}

#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function() {
  pop_density <- load_formatted_frame("../../../data/population_density_2019.csv")
  locations <- load_formatted_frame("../../../data/locations.csv")

  pop_density_single_code <- access_country_code(pop_density)
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")
  formatted_statistic <- format_frame(aggregation, "mean_population_density")

  write_output(formatted_statistic, "./mean_pop_density.csv")
}

## Run the main function created
get_analyse_output()
File contains the functions used for input and output operations.
library(readr)
library(stringr)

#' Read population data and reformat column names
load_formatted_frame <- function(path_to_data) {
  # Load the population density data 2019
  formatted_path <- file.path(path_to_data)
  dataframe <- readr::read_csv(formatted_path)

  # Clean the column names, following snake_case convention
  colnames(dataframe) <- tolower(colnames(dataframe))
  colnames(dataframe) <- stringr::str_replace_all(colnames(dataframe),
                                                  pattern = " ", replacement = "_")
  return(dataframe)
}

#' Write output statistic in formatted manner
write_output <- function(dataframe, output_filepath) {
  readr::write_csv(x = dataframe, file = output_filepath)
}
The code shown above will not run because the main.R code cannot access the functions contained within input_output.R.
For a program to access code in another location, the functions need to be loaded into that program explicitly. In R this is called sourcing the function(s).
We load the code from one file into another, allowing our code to access the contents of the loaded file.
Loading a file puts the objects within it into the scope of our program.
If we load a file’s code in the global scope of our program, then the file’s contents will be accessible anywhere in the program. If we load the file in a specific local scope, it will only be accessible in that local area.
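A small demonstration of the difference, using source()'s local argument and a temporary file standing in for a module (the file and function names here are hypothetical):

```r
# Write a one-function module to a temporary file
helper_file <- tempfile(fileext = ".R")
writeLines("double_it <- function(x) x * 2", helper_file)

use_locally <- function(value) {
  # local = TRUE loads double_it() into this function's local scope only
  source(helper_file, local = TRUE)
  double_it(value)
}

result <- use_locally(21)   # 42

# Outside the function the object was never created
exists("double_it")         # FALSE
```

Sourcing at the top level of a script (the default, local = FALSE) instead makes the loaded objects available everywhere in the program.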
In R there is a distinction between loading in code from source files and loading in packages.
To load functions into an R program you need to first put those functions into a .R file.
Within the file containing the program you want to run, you “source” the file containing the functions. This loads the functions into the program’s global scope.
To source a file, use the source() function, giving it the file path of the .R file you want to source.
source("path to R file")
For our purposes, at the top of the main.R file we would write:
source("./input_output.R")
to access the functions within input_output.R.
By convention files are loaded at the top of a file. This makes it clear what files are used in the code and ensures all parts of the code that need the objects in the file can access them.
Sourcing a file will by default load in all the contents of that file - not just functions. For this reason, it is important to keep the files containing your functions clear of other objects, such as unnecessary variables.
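To see why, note that source() runs every top-level statement in the file, not just the function definitions. A sketch with a temporary file standing in for a module (the names here are hypothetical):

```r
# A module file containing a function plus a stray top-level object
module_file <- tempfile(fileext = ".R")
writeLines(c(
  "add_one <- function(x) x + 1",
  "leftover <- 42  # stray variable, executed on every source()"
), module_file)

source(module_file)

add_one(1)          # 2
exists("leftover")  # TRUE - the stray variable leaked into our scope
```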
The main.R file will then look like the below script, allowing us to access the functions from input_output.R.
library(tidyr)
library(dplyr)
library(stringr)
library(readr)
source("input_output.R")

#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  return(dataframe)
}

#' Function to convert a string column to integer type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]],
                                                    pattern = string_value)

  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  return(combined_frames_reduced)
}

#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  # Perform aggregation and summary
  # Use the .data pronoun because the column names are held in variables
  region_mean_density_grouped <- dplyr::group_by(dataframe, .data[[groupby_column]])
  region_mean_density <- dplyr::summarise(region_mean_density_grouped,
                                          mean_population_density = mean(.data[[statistic_column]]))
  return(region_mean_density)
}

#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(dataframe, dplyr::desc(.data[[statistic_column]]))

  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)
  return(dataframe_sorted)
}

#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function() {
  pop_density <- load_formatted_frame("../../../data/population_density_2019.csv")
  locations <- load_formatted_frame("../../../data/locations.csv")

  pop_density_single_code <- access_country_code(pop_density)
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")
  formatted_statistic <- format_frame(aggregation, "mean_population_density")

  write_output(formatted_statistic, "./mean_pop_density.csv")
}

## Run the main function created
get_analyse_output()
These exercises will help you practice splitting code into different files and loading them back into the main.R script.
Create a new file in the example_code/modules/exercises/start/ folder called analysis.R.
Put the following functions within the new file:
aggregate_mean()
format_frame()
Change the code in main.R such that the file loads the relevant functions and runs the whole analysis.
Create a new file in the example_code/modules/exercises/start/ folder called manipulation.R.
Put the following functions within the new file:
convert_type_to_int()
access_country_code()
join_frames()
Change the code in main.R such that the file loads the relevant functions and runs the whole analysis.
#' Function to group by one column and calculate the mean of another
aggregate_mean <- function(dataframe, groupby_column, statistic_column) {
  # Perform aggregation and summary
  # Use the .data pronoun because the column names are held in variables
  region_mean_density_grouped <- dplyr::group_by(dataframe, .data[[groupby_column]])
  region_mean_density <- dplyr::summarise(region_mean_density_grouped,
                                          mean_population_density = mean(.data[[statistic_column]]))
  return(region_mean_density)
}

#' Format the dataframe for output
format_frame <- function(dataframe, statistic_column) {
  # Sort the data for clearer reading, descending order
  dataframe_sorted <- dplyr::arrange(dataframe, dplyr::desc(.data[[statistic_column]]))

  # Round mean density for clearer reading
  dataframe_sorted[statistic_column] <- round(dataframe_sorted[statistic_column],
                                              digits = 2)
  return(dataframe_sorted)
}

#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  return(dataframe)
}

#' Function to convert a string column to integer type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]],
                                                    pattern = string_value)

  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  return(combined_frames_reduced)
}
library(tidyr)
library(dplyr)
library(stringr)
library(readr)
source("input_output.R")
source("analysis.R")
source("manipulation.R")

#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function() {
  pop_density <- load_formatted_frame("../../../../data/population_density_2019.csv")
  locations <- load_formatted_frame("../../../../data/locations.csv")

  pop_density_single_code <- access_country_code(pop_density)
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")
  formatted_statistic <- format_frame(aggregation, "mean_population_density")

  write_output(formatted_statistic, "./mean_pop_density.csv")
}

## Run the main function created
get_analyse_output()
Our analysis pipeline for average population density is now nearly complete.
So far, we have been working on one single data set. Now however we have been given access to the population density values for a range of years in different CSV files.
You now need to add to and improve the code and files to meet new requirements. The code to change can be found in example_code/case_study/initial/.
This is not the exact methodology for calculating the mean population density in a region - however by completing this case study you will gain experience building a pipeline for analysis.
The below tasks are intended to reinforce the content covered in this course.
Tasks 1-4 are similar to exercises already covered.
Tasks 5 and 6 will require more thought and working with data frames in your chosen language. You may need to do some research in your chosen language/package to complete them.
Answers to all of the tasks combined are given below, and the full code is contained within example_code/case_study/answers/.
In order to be able to perform the analysis across different years, we need to access different files:
Add a parameter to get_analyse_output() in main.R for the population density file path: pop_density_filepath
Add a parameter for the locations.csv data: location_filepath
Test that your refactoring of the code produces the same result by running the program with the original data sets.
Test that your refactoring of the code works for other data sets by using the 2018 data.
As we build analysis on top of the code we have already written, get_analyse_output() will no longer be the highest-level function:
Change the get_analyse_output() function to return the calculated final data frame formatted_statistic.
Add a parameter output_filepath for the output path of the resulting data frame.
If output_filepath is False/FALSE, do not write out the data frame to a file.
Test this new parameter by running it with a file path and with False/FALSE.
We will be joining the data at the end of our process, which means we will be performing multiple joins in the analysis:
Create a new file called joins.R
Move join_frames() from manipulation.R to joins.R
Ensure your code still works at this stage.
If we are going to combine all our data, we will need to change how it is represented:
Create a function column_name_year() and put it in manipulation.R
Give it the parameters dataframe, original_column and new_column
Test this function on a data frame.
Once we have created a new data frame for each year, we will need to combine the data for each region into one data frame:
Create a function join_years() in joins.R
Give it the parameters dataframes, a list containing the data frames to be joined, and join_column, the name of the column to join on
We need to bring all our new functions together to perform the final analysis:
Create a function combined_analysis() in main.R
Give it a parameter output_filepath to designate where to output the final analysis
Run get_analyse_output() for each file path, without writing out the data to file; name each frame appropriately
column_name_year() should be used on each data frame produced to change the mean_population_density column to the year of the frame; you can access the year of each data set from its file path, so you may want to split the path string up
Within combined_analysis(), use join_years() from joins.R to join the results of each get_analyse_output() together
Using write_output() from input_output.R, write out the data frame to output_filepath if the value is not False/FALSE
Be sure to run the whole process in one command to check that it produces the output expected.
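One way to pull the year out of a file path (a base-R sketch using a hypothetical path; the answer below uses stringr and dplyr equivalents):

```r
# Hypothetical file path in the shape used by the case study
path <- "data/population_density_2019.csv"

# Split on "_" and keep the last piece: "2019.csv"
pieces   <- strsplit(path, split = "_", fixed = TRUE)[[1]]
path_end <- pieces[length(pieces)]

# The first four characters are the year
year <- substr(path_end, start = 1, stop = 4)
year   # "2019"
```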
sdg_region_name | 2017 | 2018 | 2019 | 2020
---|---|---|---|---
Australia/New Zealand | 10.53 | 10.63 | 10.72 | 10.82
Central and Southern Asia | 317.54 | 324.57 | 330.63 | 335.32
Eastern and South-Eastern Asia | 2065.47 | 2089.43 | 2112.67 | 2135.78
Europe and Northern America | 756.12 | 760.73 | 764.93 | 768.50
Latin America and the Caribbean | 197.25 | 198.43 | 199.62 | 200.85
Northern Africa and Western Asia | 222.42 | 228.59 | 234.38 | 239.48
Oceania (excluding Australia and New Zealand) | 142.01 | 143.12 | 144.20 | 145.28
Sub-Saharan Africa | 121.30 | 123.91 | 126.55 | 129.21
Below are the files changed during the case study.
library(tidyr)
library(dplyr)
library(stringr)
library(readr)
source("input_output.R")
source("analysis.R")
source("manipulation.R")
source("joins.R")

#' Access the data, run the analysis of population density means over locations,
#' output the data into a csv.
get_analyse_output <- function(pop_density_filepath, location_filepath, output_filepath) {
  pop_density <- load_formatted_frame(pop_density_filepath)
  locations <- load_formatted_frame(location_filepath)

  pop_density_single_code <- access_country_code(pop_density)
  pop_density_correct_types <- convert_type_to_int(dataframe = pop_density_single_code,
                                                   column_name = "country_code",
                                                   string_value = "CC")
  locations_correct_types <- convert_type_to_int(dataframe = locations,
                                                 column_name = "location_id",
                                                 string_value = '"')
  population_location <- join_frames(pop_density_correct_types,
                                     locations_correct_types,
                                     left_column = "country_code",
                                     right_column = "location_id")
  aggregation <- aggregate_mean(dataframe = population_location,
                                groupby_column = "sdg_region_name",
                                statistic_column = "population_density")
  formatted_statistic <- format_frame(aggregation, "mean_population_density")

  # Only write to file when a file path (not FALSE) is given
  if (output_filepath != FALSE) {
    write_output(formatted_statistic, output_filepath = output_filepath)
  }
  return(formatted_statistic)
}

#' Perform population density mean analysis across multiple files
combined_analysis <- function(population_filepaths, location_filepath, output_filepath) {
  loaded_dataframes <- list()
  for (population_file in population_filepaths) {
    # The year is given at the end of the file path, but before '.csv'
    path_broken_up <- stringr::str_split(population_file, pattern = "_")
    path_end <- dplyr::last(dplyr::last(path_broken_up))
    year <- substr(path_end, start = 1, stop = 4)

    year_analysis <- get_analyse_output(population_file, location_filepath,
                                        output_filepath = FALSE)

    # Change the column name to the year of the population density
    formatted_year_analysis <- column_name_year(year_analysis,
                                                "mean_population_density", year)

    loaded_dataframes[[length(loaded_dataframes) + 1]] <- formatted_year_analysis
  }
  combined_dataframes <- join_years(loaded_dataframes, join_column = "sdg_region_name")

  if (output_filepath != FALSE) {
    write_output(combined_dataframes, output_filepath = output_filepath)
  }
  return(combined_dataframes)
}

pop_path_2017 <- "../../../data/population_density_2017.csv"
pop_path_2018 <- "../../../data/population_density_2018.csv"
pop_path_2019 <- "../../../data/population_density_2019.csv"
pop_path_2020 <- "../../../data/population_density_2020.csv"

location_path <- "../../../data/locations.csv"

# Demonstration of final output for case study
final_output <- combined_analysis(list(pop_path_2017, pop_path_2018,
                                       pop_path_2019, pop_path_2020),
                                  location_path, output_filepath = FALSE)
print(final_output)
#' Function to split combined code columns
#' and remove unnecessary columns
access_country_code <- function(dataframe) {
  # The country_and_parent_code column needs to
  # be split into two columns without the strings
  dataframe <- tidyr::separate(data = dataframe, col = country_and_parent_code,
                               into = c("country_code", "parent_code"), sep = "_")

  # Remove the parent_code column, not used in later analysis
  dataframe <- dplyr::select(dataframe, everything(), -parent_code)
  return(dataframe)
}

#' Function to convert a string column to integer type
convert_type_to_int <- function(dataframe, column_name, string_value) {
  # Convert country_code to integer by removing extra strings
  # Using dataframe$column_name to get a column won't work when the column name is a variable
  dataframe[column_name] <- stringr::str_remove_all(dataframe[[column_name]],
                                                    pattern = string_value)

  # Convert type
  dataframe[column_name] <- as.integer(dataframe[[column_name]])
  return(dataframe)
}

#' Change the name of a specified column in a dataframe
column_name_year <- function(dataframe, original_column, new_column) {
  colnames(dataframe)[colnames(dataframe) == original_column] <- new_column
  return(dataframe)
}

#' Join the required frames on specified columns,
#' dropping unnecessary columns
join_frames <- function(left_dataframe, right_dataframe, left_column, right_column) {
  # Change location_id to be called country_code for join
  colnames(right_dataframe)[colnames(right_dataframe) == right_column] <- left_column
  combined_frames <- dplyr::left_join(x = left_dataframe,
                                      y = right_dataframe,
                                      by = left_column)
  combined_frames_reduced <- dplyr::select(combined_frames, sdg_region_name, population_density)
  return(combined_frames_reduced)
}

#' Join a list of frames with an inner join on a specified column name
join_years <- function(dataframes, join_column) {
  merged_frame <- purrr::reduce(.x = dataframes,
                                .f = dplyr::inner_join,
                                by = join_column)
  return(merged_frame)
}
Continue on to the case study