Week 4: Importing and Cleaning Data

2025-06-20

Welcome to Week 4

Theme: Importing and Cleaning Data

This week combines:

  • Loading raw data from different sources

  • Cleaning and preparing it for analysis

This is the real-world part of data science: messy, essential, and very hands-on!

Why This Matters

“Without clean data, or clean enough data, your data science is worthless.” — Michael Stonebraker, adjunct professor, MIT
You need clean, structured data to:

  • Avoid wrong conclusions

  • Work with machine learning tools

  • Build reproducible workflows

Sources of Data

The most common sources of data we work with:

  • .csv, .xlsx (Excel), .json

  • Databases (SQL, SQLite)

  • APIs

  • Web scraping

For this programme, we focus on CSV and Excel.

Tools We’ll Use

We will use pandas for:

  • Importing datasets

  • Cleaning datasets

  • Analysing tabular data

We’ll use Jupyter Notebooks or VS Code to run our scripts.

The Import Process

With pandas, importing is simple:

import pandas as pd

df = pd.read_csv("data/sales.csv")


For Excel files:

df = pd.read_excel("data/finance.xlsx")


After loading, inspect with:

df.head()
df.info()
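Putting the import and inspection steps together, here is a minimal runnable sketch. It writes a tiny CSV first so the example is self-contained; the file name and columns are illustrative, not the course dataset:

```python
import pandas as pd

# Create a small illustrative CSV so the example runs anywhere
# (in the course, you would load an existing file such as data/sales.csv)
csv_text = "Product,Units,Price\nWidget,10,2.50\nGadget,5,9.99\n"
with open("sales_example.csv", "w") as f:
    f.write(csv_text)

df = pd.read_csv("sales_example.csv")

print(df.head())   # first rows of the table
df.info()          # column names, dtypes, non-null counts
```

`head()` gives a quick visual check of the data, while `info()` is the fastest way to spot missing values and wrong column types.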

What is “Dirty” Data?

Common problems include:

  • Missing values

  • Inconsistent formats (e.g. dates, case)

  • Duplicates

  • Wrong data types

  • Outliers

Cleaning = fixing or removing these issues.
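Before fixing anything, it helps to measure how dirty the data is. A short sketch using a made-up DataFrame (the column names are illustrative) that exhibits three of the problems above:

```python
import pandas as pd
import numpy as np

# Illustrative data: one missing value, one fully duplicated row,
# and dates stored as text rather than datetimes
df = pd.DataFrame({
    "Name":   ["ALICE", "BOB", "BOB", "CARA"],
    "Date":   ["2025-01-05", "2025-01-06", "2025-01-06", "2025-01-07"],
    "Income": [45000.0, 52000.0, 52000.0, np.nan],
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows
print(df.dtypes)              # Date shows as object (text), not datetime
```

These three one-liners are a useful first pass on any new dataset: they tell you which cleaning techniques you actually need.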

Common Cleaning Techniques

Examples:

# Remove rows with missing values (assign the result back —
# these methods return a new DataFrame)
df = df.dropna()
# ...or fill missing values instead of dropping them
df = df.fillna(0)
# Convert text dates to proper datetime objects
df['Date'] = pd.to_datetime(df['Date'])
# Remove duplicate rows
df = df.drop_duplicates()


We’ll go deeper into these in the notebook.

Real-Life Example

You receive a spreadsheet with patient data:

  • Names in UPPERCASE

  • Dates in text format

  • Some rows are empty

  • Income listed as text like “£45,000”

Our job: Make it usable for analysis.
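As a sketch of what that cleaning might look like in pandas — the column names `Name`, `Date` and `Income`, and the UK-style day-first dates, are assumptions about the spreadsheet:

```python
import pandas as pd
import numpy as np

# Illustrative patient data matching the problems above
df = pd.DataFrame({
    "Name":   ["JANE SMITH", "JOHN DOE", np.nan],
    "Date":   ["05/01/2025", "06/01/2025", np.nan],
    "Income": ["£45,000", "£52,500", np.nan],
})

# Drop rows that are entirely empty
df = df.dropna(how="all")

# Normalise names: UPPERCASE -> Title Case
df["Name"] = df["Name"].str.title()

# Parse text dates (dayfirst=True for UK-style dd/mm/yyyy)
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# Strip the currency symbol and thousands separator, then convert to numbers
df["Income"] = (df["Income"]
                .str.replace("£", "", regex=False)
                .str.replace(",", "", regex=False)
                .astype(float))

print(df)
```

After these steps every column has a proper type — text, datetime, and float — so the data is ready for grouping, plotting, or modelling.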

Let’s Dive Into The Live Session