2025-06-20
Theme: Importing and Cleaning Data
This week combines:
Loading raw data from different sources
Cleaning and preparing it for analysis
This is the real-world part of data science — messy, essential, and very hands-on!
“Without clean data, or clean enough data, your data science is worthless.” — Michael Stonebraker, adjunct professor, MIT
You need clean, structured data to:
Avoid wrong conclusions
Work with machine learning tools
Build reproducible workflows
Most common sources of data we work with:
.csv, .xlsx (Excel), .json
Databases (SQL, SQLite)
APIs
Web scraping
For this programme, we focus on CSV and Excel.
We will use pandas:
Importing datasets
Cleaning datasets
Analysing tabular data
We’ll use Jupyter Notebooks or VS Code to run our scripts.
With pandas, importing is simple:
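For example, a CSV file loads with a single call to `pd.read_csv`. The sketch below uses an in-memory string instead of a real file so it is self-contained; in practice you would pass a file path such as `"data.csv"`:

```python
import io
import pandas as pd

# Simulated CSV content; with a real file you would write
# df = pd.read_csv("data.csv") instead.
csv_text = "Name,Age\nAlice,34\nBob,29\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (2, 2): two rows, two columns
```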
For Excel files:
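The Excel equivalent is `pd.read_excel`. The sketch below first writes a tiny workbook so the example runs end to end (the file name is made up; reading `.xlsx` files requires the `openpyxl` engine to be installed):

```python
import os
import tempfile
import pandas as pd

# Build a small workbook so the example is self-contained;
# normally the .xlsx file would already exist.
sample = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [34, 29]})
path = os.path.join(tempfile.mkdtemp(), "patients.xlsx")
sample.to_excel(path, index=False)

df = pd.read_excel(path)  # reads the first sheet by default
print(df.head())
```

Use the `sheet_name` argument to read a specific sheet from a multi-sheet workbook.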
After loading, inspect with:
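The usual first-look methods are `head`, `info`, `describe`, and the `shape` attribute. A minimal sketch (the sample data is made up, and deliberately includes a missing value so `info()` has something to reveal):

```python
import io
import pandas as pd

csv_text = "Name,Age\nAlice,34\nBob,\n"   # note Bob's missing Age
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first five rows
df.info()             # column names, dtypes, non-null counts
print(df.describe())  # summary statistics for numeric columns
print(df.shape)       # (rows, columns)
```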
Common problems include:
Missing values
Inconsistent formats (e.g. dates, case)
Duplicates
Wrong data types
Outliers
Cleaning = fixing or removing these issues.
Examples:
# Remove rows with missing values (returns a new DataFrame)
df = df.dropna()
# Or fill missing values instead
df = df.fillna(0)
# Fix dates
df['Date'] = pd.to_datetime(df['Date'])
# Remove duplicates
df = df.drop_duplicates()
We’ll go deeper into these in the notebook.
You receive a spreadsheet with patient data:
Names in UPPERCASE
Dates in text format
Some rows are empty
Income listed as text like “£45,000”
Our job: Make it usable for analysis.
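A sketch of what that cleanup might look like, using a made-up two-patient sample in the shape described above (column names and date format are assumptions):

```python
import io
import pandas as pd

# Hypothetical sample mimicking the spreadsheet described above.
raw = """Name,Date,Income
ALICE SMITH,01/03/2024,"£45,000"
BOB JONES,15/07/2024,"£38,500"
,,
"""
df = pd.read_csv(io.StringIO(raw))

df = df.dropna(how="all")                                # drop fully empty rows
df["Name"] = df["Name"].str.title()                      # UPPERCASE -> Title Case
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)   # text -> datetime
df["Income"] = (df["Income"]
                .str.replace("£", "", regex=False)       # strip currency symbol
                .str.replace(",", "", regex=False)       # strip thousands separator
                .astype(float))                          # "£45,000" -> 45000.0

print(df.dtypes)
```

After these steps the names are consistently cased, the dates are real datetime values, the empty row is gone, and income is numeric — ready for analysis.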