Importing and Cleaning Data

In this week’s workshop, we focus on importing data from common file formats and cleaning it so that it is ready for analysis. Clean data is critical to any data science workflow. We will use the pandas library for most tasks.

Introduction

Live coding

Review the material that we explored in Week 4’s live-coding session.

Lab

Exercise: Clean Gapminder Dataset


Your task is to clean and prepare the Gapminder dataset for analysis:

  1. Load gapminder.csv into a pandas DataFrame.
  2. Inspect the data using:
    • head()
    • info()
    • describe()
  3. Rename columns:
    • Rename gdpPercap to GDP_per_capita.
  4. Handle missing values:
    • Drop rows where lifeExp is missing.
    • Fill missing pop values with the median population.
  5. Filter the dataset to include only data for:
    • Nigeria, Ghana, Kenya, and South Africa.
    • Years 2000 and later.
  6. Export the cleaned dataset to gapminder_cleaned.csv.
  7. Commit and push your work to GitHub.
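
Steps 1 to 6 above could be sketched roughly as follows. This is one possible approach, not a model answer. The small inline CSV is a made-up stand-in so the snippet runs without the real file; its numbers are placeholders, not real Gapminder figures. The column names country, year, lifeExp, pop, and gdpPercap are assumed to match the layout of gapminder.csv, and the helper name clean_gapminder is hypothetical. The sketch also assumes the median is taken over the whole dataset after the lifeExp rows are dropped and before filtering.

```python
import io

import pandas as pd

# Hypothetical stand-in for gapminder.csv: made-up rows, NOT real Gapminder data.
SAMPLE_CSV = """country,continent,year,lifeExp,pop,gdpPercap
Nigeria,Africa,2002,46.0,100,1500.0
Nigeria,Africa,2007,,120,2000.0
Ghana,Africa,1997,58.0,80,1000.0
Ghana,Africa,2002,59.0,,1100.0
Kenya,Africa,2007,54.0,90,1400.0
France,Europe,2007,80.0,200,30000.0
"""

COUNTRIES = ["Nigeria", "Ghana", "Kenya", "South Africa"]


def clean_gapminder(df):
    """Apply steps 3 to 5 of the exercise and return a new DataFrame."""
    # Step 3: rename the GDP column.
    df = df.rename(columns={"gdpPercap": "GDP_per_capita"})
    # Step 4a: drop rows where lifeExp is missing.
    df = df.dropna(subset=["lifeExp"])
    # Step 4b: fill missing pop values with the median of the remaining rows.
    df = df.assign(pop=df["pop"].fillna(df["pop"].median()))
    # Step 5: keep only the four countries, from the year 2000 onward.
    mask = df["country"].isin(COUNTRIES) & (df["year"] >= 2000)
    return df.loc[mask].reset_index(drop=True)


# Step 1: in the lab this would be pd.read_csv("gapminder.csv").
raw = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Step 2: inspect the data.
print(raw.head())
raw.info()
print(raw.describe())

cleaned = clean_gapminder(raw)

# Step 6: export without the index column.
cleaned.to_csv("gapminder_cleaned.csv", index=False)
```

Keeping the cleaning steps in a function that returns a new DataFrame leaves the raw data untouched, which makes it easy to re-run the cleaning after changing a step.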

Further reading