2025-06-20
Theme: Importing and Cleaning Data
This week combines:
Loading raw data from different sources
Cleaning and preparing it for analysis
This is the real-world part of data science — messy, essential, and very hands-on!
“Without clean data, or clean enough data, your data science is worthless.” — Michael Stonebraker, adjunct professor, MIT
You need clean, structured data to:
Avoid wrong conclusions
Work with machine learning tools
Build reproducible workflows
Most common sources of data we work with:
.csv, .xlsx (Excel), .json
Databases (SQL, SQLite)
APIs
Web scraping
For this programme, we focus on CSV and Excel.
We will use pandas:
Importing datasets
Cleaning datasets
Analysing tabular data
We’ll use Jupyter Notebooks or VS Code to run our scripts.
With pandas, importing is simple:
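For example, a CSV file loads with a single call to `pd.read_csv`. The sketch below uses an in-memory string instead of a real file so it is self-contained; in practice you would pass a file path such as `"data.csv"`:

```python
import io
import pandas as pd

# Simulated CSV content; with a real file you would write
# df = pd.read_csv("data.csv") instead.
csv_text = "Name,Age\nAlice,34\nBob,29\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (2, 2): two rows, two columns
```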
For Excel files:
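The Excel equivalent is `pd.read_excel`. The sketch below first writes a tiny workbook so the example runs end to end (the file name is made up; reading `.xlsx` files requires the `openpyxl` engine to be installed):

```python
import os
import tempfile
import pandas as pd

# Build a small workbook so the example is self-contained;
# normally the .xlsx file would already exist.
sample = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [34, 29]})
path = os.path.join(tempfile.mkdtemp(), "patients.xlsx")
sample.to_excel(path, index=False)

df = pd.read_excel(path)  # reads the first sheet by default
print(df.head())
```

Use the `sheet_name` argument to read a specific sheet from a multi-sheet workbook.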
After loading, inspect with:
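The usual first-look methods are `head`, `info`, `describe`, and the `shape` attribute. A minimal sketch (the sample data is made up, and deliberately includes a missing value so `info()` has something to reveal):

```python
import io
import pandas as pd

csv_text = "Name,Age\nAlice,34\nBob,\n"   # note Bob's missing Age
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first five rows
df.info()             # column names, dtypes, non-null counts
print(df.describe())  # summary statistics for numeric columns
print(df.shape)       # (rows, columns)
```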
Common problems include:
Missing values
Inconsistent formats (e.g. dates, case)
Duplicates
Wrong data types
Outliers
Cleaning = fixing or removing these issues.
Examples:
# Remove rows with missing values (returns a new DataFrame)
df = df.dropna()
# Or fill missing values instead
df = df.fillna(0)
# Fix dates
df['Date'] = pd.to_datetime(df['Date'])
# Remove duplicates
df = df.drop_duplicates()
We’ll go deeper into these in the notebook.
You receive a spreadsheet with patient data:
Names in UPPERCASE
Dates in text format
Some rows are empty
Income listed as text like “£45,000”
Our job: Make it usable for analysis.
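A sketch of what that cleanup might look like, using a made-up two-patient sample in the shape described above (column names and date format are assumptions):

```python
import io
import pandas as pd

# Hypothetical sample mimicking the spreadsheet described above.
raw = """Name,Date,Income
ALICE SMITH,01/03/2024,"£45,000"
BOB JONES,15/07/2024,"£38,500"
,,
"""
df = pd.read_csv(io.StringIO(raw))

df = df.dropna(how="all")                                # drop fully empty rows
df["Name"] = df["Name"].str.title()                      # UPPERCASE -> Title Case
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)   # text -> datetime
df["Income"] = (df["Income"]
                .str.replace("£", "", regex=False)       # strip currency symbol
                .str.replace(",", "", regex=False)       # strip thousands separator
                .astype(float))                          # "£45,000" -> 45000.0

print(df.dtypes)
```

After these steps the names are consistently cased, the dates are real datetime values, the empty row is gone, and income is numeric — ready for analysis.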