Importing and Cleaning Data
In this week’s workshop, we will focus on importing data from various file formats and cleaning it to prepare it for analysis. Clean data is critical for any data science workflow, because missing values and inconsistent column names carry through to every later step. We will use the pandas library for most of the tasks.
Introduction
Live coding
Review the material that we explored in Week 4’s live-coding session.
Lab
Exercise: Clean Gapminder Dataset
Your task is to clean and prepare the Gapminder dataset for analysis:
- Load the gapminder.csv into a DataFrame.
- Inspect the data using:
  - head()
  - info()
  - describe()
- Rename columns:
  - Rename gdpPercap to GDP_per_capita.
- Handle missing values:
  - Drop rows where lifeExp is missing.
  - Fill missing pop values with median population.
- Filter the dataset to include only data for:
  - Nigeria, Ghana, Kenya, and South Africa.
  - Years 2000 and above.
- Export the cleaned dataset to gapminder_cleaned.csv.
- Commit and push your work to GitHub.
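The steps above can be sketched on a small made-up table, so that the pandas calls are visible before you apply them to the real file. This is a minimal illustration and not a reference solution. The rows in `raw_csv` are invented values in the Gapminder column layout, and the names `clean_gapminder` and `TARGET_COUNTRIES` are hypothetical. It also assumes that "median population" means the median over all rows that remain after the lifeExp drop, not a per-country median.

```python
import io

import pandas as pd

# Made-up rows in the Gapminder column layout (illustrative values only).
# In the lab, pass the path to gapminder.csv to pd.read_csv instead.
raw_csv = io.StringIO(
    "country,continent,year,lifeExp,pop,gdpPercap\n"
    "Nigeria,Africa,1997,47.5,106000000,1625.0\n"
    "Nigeria,Africa,2002,46.6,,1615.0\n"
    "Ghana,Africa,2002,58.5,20500000,1112.0\n"
    "Kenya,Africa,2007,,35600000,1463.0\n"
    "South Africa,Africa,2007,49.3,44000000,9270.0\n"
    "Brazil,Americas,2007,72.4,190000000,9066.0\n"
)

# Hypothetical constant holding the countries named in the exercise.
TARGET_COUNTRIES = ["Nigeria", "Ghana", "Kenya", "South Africa"]


def clean_gapminder(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the rename, missing-value, and filter steps in order."""
    df = df.rename(columns={"gdpPercap": "GDP_per_capita"})
    # Drop rows with no life expectancy before computing the median.
    df = df.dropna(subset=["lifeExp"])
    # Assumption: one median over all remaining rows, not one per country.
    df = df.assign(pop=df["pop"].fillna(df["pop"].median()))
    keep = df["country"].isin(TARGET_COUNTRIES) & (df["year"] >= 2000)
    return df.loc[keep].reset_index(drop=True)


df = pd.read_csv(raw_csv)

# Quick inspection before cleaning.
print(df.head())
df.info()
print(df.describe())

cleaned = clean_gapminder(df)

# With no path, to_csv returns the CSV text;
# pass "gapminder_cleaned.csv" as the first argument to write a file.
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```

The order of operations matters here: filling pop before filtering means the median reflects every country in the table, whereas filtering first would compute it from the four selected countries only. Either reading is defensible, so state the choice in your notebook.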
Further reading
- O’Reilly: McKinney, W. (2022). Python for Data Analysis. O’Reilly Media. https://www.oreilly.com/library/view/python-for-data/9781098104023/
- Pandas Official Documentation: Working with Missing Data. https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html