Week 6: Introduction to NLP in Python

2025-07-14

Welcome to Week 6

Theme: Using NLP in Python to code free-text job responses

Goal: Map survey responses like “software dev” to standardized occupation codes (ISCO-08)

In this session:

What NLP is & why it matters for surveys
A text-to-code pipeline
Tools & skills you’ll apply

Why Use NLP for Occupational Coding?

Survey data often includes open-text job descriptions:

“cashier at supermarket”
“hospital nurse”
“manages IT systems”

Manual coding is time-consuming and inconsistent

NLP lets us automate and standardize classification, e.g., mapping each response to an ISCO code for analysis or reporting

NLP Workflow for Occupational Coding

1. Text Cleaning
2. Tokenization & Lemmatization
3. Matching / Classification
4. Output: ISCO code

This week, we focus on steps 1–3 — the backbone of an NLP coding pipeline.

Step 1: Text Cleaning

Make text more consistent:

Lowercasing
Removing punctuation
Standardizing common terms (“dev” → “developer”)

text = "Software Dev / IT systems"
cleaned = text.lower().replace("/", " ")

Pre-cleaning improves accuracy in later steps.
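
The one-liner above only lowercases and handles the slash. A slightly fuller sketch of all three cleaning steps might look like the following. The ABBREVIATIONS table and the clean_text name are illustrative assumptions, not part of any library; extend the table with the terms that actually appear in your data.

```python
import re
import string

# Hypothetical abbreviation table (an assumption for illustration).
ABBREVIATIONS = {"dev": "developer", "mgr": "manager", "asst": "assistant"}

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    text = text.lower()
    # Replace each punctuation character with a space so "dev/it" splits cleanly.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Expand abbreviations word by word; split() also collapses repeated spaces.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)

print(clean_text("Software Dev / IT systems"))  # software developer it systems
```

One caveat: lowercasing turns “IT” into “it”, which spaCy's English stop-word list would drop in Step 2. Domain terms like this may need protecting before stop-word removal.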

Step 2: Tokenization & Lemmatization

Break the text into words (tokens), drop stop words, and reduce each remaining word to its dictionary form (lemma):

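import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("software developers managing systems")

# Keep the lemma of every token that is not a stop word
tokens = [token.lemma_ for token in doc if not token.is_stop]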

→ ['software', 'developer', 'manage', 'system']

Step 3: Matching

Approaches:

Rule-based: keyword dictionaries (e.g. “nurse” → 2221)
Fuzzy matching: handle typos & variants
ML classifiers: predict ISCO from text

code = None  # default when no rule matches
if "nurse" in tokens:
    code = "2221"  # ISCO-08: Nursing professionals
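
A single if statement does not scale beyond a handful of jobs. One way to generalize the rule-based approach is a keyword dictionary, sketched below. The KEYWORD_TO_ISCO table and the match_keywords helper are hypothetical names for illustration; the three codes are ISCO-08 unit groups, but any table used in practice should be checked against the official ISCO-08 structure.

```python
# Illustrative keyword table (an assumption, not an official mapping).
KEYWORD_TO_ISCO = {
    "nurse": "2221",      # Nursing professionals
    "cashier": "5230",    # Cashiers and ticket clerks
    "developer": "2512",  # Software developers
}

def match_keywords(tokens):
    """Return the ISCO-08 code of the first token found in the table, else None."""
    for token in tokens:
        if token in KEYWORD_TO_ISCO:
            return KEYWORD_TO_ISCO[token]
    return None  # no rule matched; leave for fuzzy matching or manual review

print(match_keywords(["software", "developer", "manage", "system"]))  # 2512
```

Returning None rather than a default code keeps unmatched responses visible, so they can be passed to a fuzzy matcher, a classifier, or a human coder.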

Challenges in Real-World Data

Survey responses are often:

Vague: “I work in IT”
Misspelled: “enginer”
Ambiguous: “consultant” (which kind?)

→ NLP helps structure & interpret messy human input
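
Misspellings such as “enginer” are the easiest of these problems to handle automatically. A minimal sketch using difflib from the Python standard library follows. The KNOWN_TERMS list, the correct_token helper, and the 0.8 cutoff are assumptions chosen for illustration; a real vocabulary would come from your keyword dictionary, and the cutoff should be tuned on real responses.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of job terms we know how to code.
KNOWN_TERMS = ["engineer", "nurse", "cashier", "developer", "consultant"]

def correct_token(token, cutoff=0.8):
    """Return the closest known term, or None if nothing is similar enough."""
    matches = get_close_matches(token, KNOWN_TERMS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(correct_token("enginer"))  # engineer
```

Vague or ambiguous answers (“I work in IT”, “consultant”) cannot be fixed by spelling correction; a sensible design is to flag them for manual review instead of forcing a code.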

Real-World Impact

Structured occupation codes:

Enable labour force analysis
Feed into policymaking, earnings models, diversity studies
Standardize across countries using ISCO-08
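
As a small illustration of the analysis that coded data makes possible: the first digit of an ISCO-08 code is its major group, so coded responses can be rolled up with a simple count. The MAJOR_GROUPS table below is deliberately partial, and count_by_major_group is a hypothetical helper name.

```python
from collections import Counter

# Partial table of ISCO-08 major groups, keyed by the first digit of the code.
MAJOR_GROUPS = {
    "1": "Managers",
    "2": "Professionals",
    "5": "Service and sales workers",
}

def count_by_major_group(codes):
    """Count coded responses per ISCO-08 major group (first digit of the code)."""
    counts = Counter(code[0] for code in codes)
    # Fall back to a generic label for groups missing from the partial table.
    return {MAJOR_GROUPS.get(digit, f"Major group {digit}"): n
            for digit, n in counts.items()}

print(count_by_major_group(["2221", "2512", "5230"]))
```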

We will mirror real tasks in government, academia & industry.