Check the CSV Encoding with IATI.Cloud¶
There are issues with the CSV encoding, which are described here
[1]:
from os import remove
import pandas as pd
import requests
from pprint import pprint
[2]:
EXAMPLE_1_ID = "SE-0-SE-6-5504005318-ZAF-99810"
EXAMPLE_2_ID = "NL-KVK-27248417-3818-L15068"
ENCODING_EXAMPLE_1 = "KR-GOV-051-2016140000008"
[3]:
def print_gap():
print("\n>>>>>>>>\n")
def get_iaticloud_data_url_from_activity_id(iati_id, format="csv", number_of_results=1):
return (
f"http://iati.cloud/search/activity?q=(iati_identifier:\"{iati_id}\")&"
f"fl=id,iati_identifier,title_narrative_text,description_narrative_text&"
f"wt={format}&rows={number_of_results}"
)
def get_first_description_from_dataframe(dataframe):
return dataframe["description_narrative_text"][0]
def get_description_from_csv(iati_id):
info_url = get_iaticloud_data_url_from_activity_id(iati_id, format="csv", number_of_results=1)
dataframe = pd.read_csv(info_url)
return get_first_description_from_dataframe(dataframe)
[4]:
print(get_description_from_csv(EXAMPLE_1_ID))
print_gap()
print(get_description_from_csv(EXAMPLE_2_ID)[:1000], ".....")
Program development fund - project preparaations\, meetings and audits,Program development fund - project preparaations\, meetings and audits
>>>>>>>>
Especialy woman and children are our target groups\, because most of the time they have to fetch water now in small creeks with polluted water\, far away from home.\nThe whole community is involved by digging the boreholes.,When the boreholes are completed\, these are accessable for the whole community.\nSo clean drinking water is near (less risk on diseases) and the people can start farming (generate an income\, upgrade their living).,Fianancial: \nIf a pump need any repair (most of the time a minor problem\, like replacing a small rubber ring) they can call Peter Gichuki for assistance. Paying for his transport is the only thing the community has to pay. He will provide the pump with the part\, which we have in stock.\nIf the borehole has to be deepen\, the community has to do this (same like digging the borehole).\n\nInstitutional:\nStichting Namelok is involved by tracing the place to dig a borehole and to support the community by digging. Stichting Namelok will provide the materi .....
Okay, so here we clearly have one kind of encoding, where we have a double \\
to escape the comma and newline characters.
What does the decription look like when we ask for JSON?¶
Below are two functions that help us extract the data from the JSON response from IATI.cloud. The response looks like this:
{'response':
{'docs': [
{'description_narrative_text': [
"STRING WE WANT",
"ANOTHER STRING WE WANT
]
...
}
...
}
...
}
Note that in this case, the description returns a list of strings, whereas we get one long concatenated string from the CSV.
[5]:
def get_json_for_activity(iati_id):
url = get_iaticloud_data_url_from_activity_id(iati_id, format="json", number_of_results=10)
return requests.get(url).json()
def get_description_from_json(json_data):
return json_data['response']['docs'][0]['description_narrative_text']
def get_json_description_type_and_instance(iati_identifier):
json_data = get_json_for_activity(iati_identifier)
description_narrative_text = get_description_from_json(json_data)
print("TYPE:", type(description_narrative_text))
print("LENGTH:", len(description_narrative_text))
print("")
pprint(description_narrative_text)
[6]:
pprint(get_json_description_type_and_instance(EXAMPLE_1_ID))
print_gap()
pprint(get_json_description_type_and_instance(EXAMPLE_2_ID))
TYPE: <class 'list'>
LENGTH: 2
['Program development fund - project preparaations, meetings and audits',
'Program development fund - project preparaations, meetings and audits']
None
>>>>>>>>
TYPE: <class 'list'>
LENGTH: 8
['Especialy woman and children are our target groups, because most of the time '
'they have to fetch water now in small creeks with polluted water, far away '
'from home.\n'
'The whole community is involved by digging the boreholes.',
'When the boreholes are completed, these are accessable for the whole '
'community.\n'
'So clean drinking water is near (less risk on diseases) and the people can '
'start farming (generate an income, upgrade their living).',
'Fianancial: \n'
'If a pump need any repair (most of the time a minor problem, like replacing '
'a small rubber ring) they can call Peter Gichuki for assistance. Paying for '
'his transport is the only thing the community has to pay. He will provide '
'the pump with the part, which we have in stock.\n'
'If the borehole has to be deepen, the community has to do this (same like '
'digging the borehole).\n'
'\n'
'Institutional:\n'
'Stichting Namelok is involved by tracing the place to dig a borehole and to '
'support the community by digging. Stichting Namelok will provide the '
'materials for the foundation and waterpump.\n'
'The community is responsable for the borehole and pump.\n'
'Peter Gichuky is member of stichting Namelok and the Ruai Committee and '
'lives in Kenya.\n'
"If we are not around and there are problems the community's can contact him, "
'so he can help the people.\n'
'\n'
'Environmental:\n'
'Irrigation is possible when there is water (borehole). The community can '
'start farming. That means a green environment and less erosion. If the '
'waterlevel is going down the community has to dig the borehole deeper.\n'
'\n'
'Technical:\n'
'The community is digging the boreholes by hand.\n'
'If this is finished we (stichting Namelok) will help them with constructing '
'the foundation/cover for this borehole.\n'
'After that we put a handpump on this cover.\n'
'The materials for the foundation/cover we buy local.\n'
'The handpump we buy in Nairobi with the company Davis & Shirtliff.\n'
'THe mark of the handpump is AFRIDEV.\n'
'The handpump is very easy to maintain, the only part that needs replacing '
'after about 2 years is a rubber. We have this in stock and Peter is able to '
'replace this. Transport for Peter has te be paid by the community (about € '
"5). With the income of framing this shouldn't be a problem.\n"
'\n'
'Social:\n'
'The waterproject in Isiolo is a great help for women and children.\n'
'The water sources are not that far anymore. So they spend less time in '
'fetching water.\n'
'The children can go to school, the women have extra time to start farming.\n'
'With farming they will get income, so the living standard will grow.',
"All these areas are very dry, there's a shortage of drinking- and irrigation "
'water. \n'
"The local community has to dig there own boreholes, we'll suply them with "
'materials.\n'
"End 2016 and begin 2017 we'll realize these boreholes. And we want to repair "
'at least 5 pumps.',
"The community's dig their own borehole.\n"
'If they reach the waterlevel they contact us.\n'
'We gonna built the foundation and put a handpump.\n'
'In total we want to realize 12 (Afridev) handpumps with foundation and '
'repair 5 pumps.',
'We visited these areas serveral times. During these visits we saw that there '
'is a lack of clean water for domestic use. \n'
'After measuring we told them where to dig their boreholes.\n',
"Some community's in Isiolo (Meru district) are digging their boreholes at "
'this moment.\n'
'If the waterlevel is reached they contact us and then stichting Namelok will '
'suply them with materials for the foundation/cover (sand, stones, cement and '
'wood) and a Afridev waterpump. There are a few Afrdiv pumps (installed long '
'time ago) by on known organisations out of order. We want to repair these '
'pumps.',
'Stichting Namelok']
None
Okay, so it’s clear that the JSON response also gives us these awkward \n
characters in places, but we don’t get the \,
issues, where commas are escaped. This points to some of the harder points of IATI data, where the structure means that we cannot be certain about the structure of the data and so we need to deal with multiple descriptions, and we struggle with joining them together and consistently delivering them accross different format types.
Character Encoding¶
Not only do we need to deal with issues of how descriptions might have been joined together, we also need to consider that we might not be handling the full spectrum of unicode directly.
We can see where characters give us issues with KR-GOV-051-2016140000008
which we’ve stored as variable ENCODING_EXAMPLE_1
[7]:
get_description_from_csv(ENCODING_EXAMPLE_1)
[7]:
'2개 마을 지역주민 2\\,943명의 지속적이고 안전한 급수위생시설 이용과 보건위생 행동양식 습득/2개 마을 주민 2\\,943명의 지속적이고 안전한 급수위생시설 접근성 향상과 보건위생 인식 개선,To provide water and sanitation facilities within the community\\, to establish water management committees\\, to provide health and sanitation education to teachers and students.'
So here we clearly have an example where the characters are being rendered somewhat correctly.
[8]:
get_json_description_type_and_instance(ENCODING_EXAMPLE_1)
TYPE: <class 'list'>
LENGTH: 2
['2개 마을 지역주민 2,943명의 지속적이고 안전한 급수위생시설 이용과 보건위생 행동양식 습득/2개 마을 주민 2,943명의 지속적이고 '
'안전한 급수위생시설 접근성 향상과 보건위생 인식 개선',
'To provide water and sanitation facilities within the community, to '
'establish water management committees, to provide health and sanitation '
'education to teachers and students.']
This seems to be consistent accross our formats and it’s rendering correctly. This leads me to believe that it’s how we’re storing our CSV file that’s the issue
Writing and Reading our CSV File¶
We need to consider that we are not reading the entire response from the IATI.cloud endpoint in to a pandas data frame. In order to keep memory use low, we request the data in chunks and write to the file bit by bit. Let’s do that now with a much smaller request and then have a look at the code that we’re using to see if changing the encodings would help when we read in the file.
[9]:
# A very slightly changed version of the function that we use to download the data
import shutil
from pathlib import Path
from ssl import SSLCertVerificationError
def get_and_write_csv_from_url(url, filename):
try:
with requests.get(url, stream=True) as r:
with open(filename, "wb+") as f:
shutil.copyfileobj(r.raw, f)
except (SSLCertVerificationError, requests.exceptions.SSLError) as e:
print(f"Exception: {e}")
with requests.get(url, stream=True, verify=False) as r:
with open(filename, "wb+") as f:
shutil.copyfileobj(r.raw, f)
print("Download Complete")
# a helper function
def create_csv_filename(iati_identifier):
return f"testfile_{iati_identifier}.csv"
Great! We have our csv, which we’ve written to be alongside our current Python notebook. Let’s use the 2 different encodings and see what changes …
Checking on \n
and \,
characters when reading and writing CSVs¶
[11]:
def compare_csv_encodings(iati_identifier, start_slice=0, end_slice=-1):
FILENAME = create_csv_filename(iati_identifier)
url = get_iaticloud_data_url_from_activity_id(iati_identifier, format="csv", number_of_results=10)
print(f"Downloading CSV for {iati_identifier} to {FILENAME}")
get_and_write_csv_from_url(url, FILENAME)
print("\n")
print(1, "Reading our Example straight from the API into Pandas:")
print(2, "Reading our Example from file using ISO encoding")
print(3, "Reading our Example from file using UTF8 encoding")
print()
# Straight from API into pandas dataframe (no reading or writing to file)
print(1, get_description_from_csv(iati_identifier)[start_slice:end_slice])
# ISO Encoding
iso_encoded_dataframe = pd.read_csv(FILENAME, encoding="iso-8859-1")
print(2, get_first_description_from_dataframe(iso_encoded_dataframe)[start_slice:end_slice])
# UTF-8 Encoding
utf8_encoded_dataframe = pd.read_csv(FILENAME, encoding="utf-8")
print(3, get_first_description_from_dataframe(utf8_encoded_dataframe)[start_slice:end_slice])
print("\n")
print(f"tidying up and removing {FILENAME}")
remove(FILENAME)
compare_csv_encodings(EXAMPLE_1_ID)
Downloading CSV for SE-0-SE-6-5504005318-ZAF-99810 to testfile_SE-0-SE-6-5504005318-ZAF-99810.csv
Download Complete
1 Reading our Example straight from the API into Pandas:
2 Reading our Example from file using ISO encoding
3 Reading our Example from file using UTF8 encoding
1 Program development fund - project preparaations\, meetings and audits,Program development fund - project preparaations\, meetings and audit
2 Program development fund - project preparaations\, meetings and audits,Program development fund - project preparaations\, meetings and audit
3 Program development fund - project preparaations\, meetings and audits,Program development fund - project preparaations\, meetings and audit
tidying up and removing testfile_SE-0-SE-6-5504005318-ZAF-99810.csv
Hmmmmmm. Okay. So it looks like the \,
issue is not solved with the encoding.
We can confirm this with the second example, but we’ll cut the descriptions so it’s easier to compare
[12]:
# write the new CSV
# 800:900 gives us a good example of the string
compare_csv_encodings(EXAMPLE_2_ID, start_slice=800, end_slice=900)
Downloading CSV for NL-KVK-27248417-3818-L15068 to testfile_NL-KVK-27248417-3818-L15068.csv
Download Complete
1 Reading our Example straight from the API into Pandas:
2 Reading our Example from file using ISO encoding
3 Reading our Example from file using UTF8 encoding
1 e like digging the borehole).\n\nInstitutional:\nStichting Namelok is involved by tracing the place
2 e like digging the borehole).\n\nInstitutional:\nStichting Namelok is involved by tracing the place
3 e like digging the borehole).\n\nInstitutional:\nStichting Namelok is involved by tracing the place
tidying up and removing testfile_NL-KVK-27248417-3818-L15068.csv
Again, there is no change. BUT we do have to consider foreign characters! Let’s try with our 3rd example:
[13]:
compare_csv_encodings(ENCODING_EXAMPLE_1)
Downloading CSV for KR-GOV-051-2016140000008 to testfile_KR-GOV-051-2016140000008.csv
Download Complete
1 Reading our Example straight from the API into Pandas:
2 Reading our Example from file using ISO encoding
3 Reading our Example from file using UTF8 encoding
1 2개 마을 지역주민 2\,943명의 지속적이고 안전한 급수위생시설 이용과 보건위생 행동양식 습득/2개 마을 주민 2\,943명의 지속적이고 안전한 급수위생시설 접근성 향상과 보건위생 인식 개선,To provide water and sanitation facilities within the community\, to establish water management committees\, to provide health and sanitation education to teachers and students
2 2ê° ë§ì ì§ì주민 2\,943ëªì ì§ìì ì´ê³ ìì í ê¸ììììì¤ ì´ì©ê³¼ ë³´ê±´ìì íëìì ìµë/2ê° ë§ì 주민 2\,943ëªì ì§ìì ì´ê³ ìì í ê¸ììììì¤ ì ê·¼ì± í¥ìê³¼ ë³´ê±´ìì ì¸ì ê°ì ,To provide water and sanitation facilities within the community\, to establish water management committees\, to provide health and sanitation education to teachers and students
3 2개 마을 지역주민 2\,943명의 지속적이고 안전한 급수위생시설 이용과 보건위생 행동양식 습득/2개 마을 주민 2\,943명의 지속적이고 안전한 급수위생시설 접근성 향상과 보건위생 인식 개선,To provide water and sanitation facilities within the community\, to establish water management committees\, to provide health and sanitation education to teachers and students
tidying up and removing testfile_KR-GOV-051-2016140000008.csv
Aha! We have some success here! Note that when we do iso
encoding, we get garbled characters back. This means that we can change the encoding to UTF-8 and we should get back more sensible results.