Week 6: Introduction to NLP in Python

2025-07-14

Welcome to Week 6

Theme: Using NLP in Python to code free-text job responses

Goal: Map survey responses like “software dev” to standardized occupation codes (ISCO-08)

In this session:

  • What NLP is & why it matters for surveys

  • A text-to-code pipeline

  • Tools & skills you’ll apply

Why Use NLP for Occupational Coding?

Survey data often includes open-text job descriptions:

  • “cashier at supermarket”

  • “hospital nurse”

  • “manages IT systems”

Manual coding is time-consuming and inconsistent

NLP lets us automate and standardize classification → e.g., map to ISCO codes for analysis or reporting

NLP Workflow for Occupational Coding

  1. Text Cleaning

  2. Tokenization & Lemmatization

  3. Matching / Classification

  4. Output: ISCO code

This week, we focus on steps 1–3 — the backbone of an NLP coding pipeline.

Step 1: Text Cleaning

Make text more consistent:

  • Lowercasing

  • Removing punctuation

  • Standardizing common terms (“dev” → “developer”)

text = "Software Dev / IT systems"
# Lowercase, then turn the slash separator into a space
cleaned = text.lower().replace("/", " ")


Pre-cleaning improves accuracy in later steps.
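
The one-liner above handles only case and the slash. A minimal sketch covering all three cleaning bullets might look like the following. The function name `clean_response` and the `ABBREVIATIONS` table are illustrative assumptions, not part of any library.

```python
import re

# Hypothetical abbreviation table; a real project would maintain a longer list.
ABBREVIATIONS = {
    "dev": "developer",
    "mgr": "manager",
    "asst": "assistant",
}


def clean_response(text: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    text = text.lower()
    # Replace every character that is not a word character
    # (letter, digit, underscore) or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Expand abbreviations word by word; split() also collapses extra spaces.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(clean_response("Software Dev / IT systems"))
# software developer it systems
```

Expanding abbreviations on whole words, rather than with `str.replace`, avoids rewriting "dev" inside a longer word such as "developer".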

Step 2: Tokenization & Lemmatization

Break the text into words (tokens) and reduce each one to its dictionary form (lemma):

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("software developers managing systems")

# Keep the lemma of every token that is not a stop word
tokens = [token.lemma_ for token in doc if not token.is_stop]
print(tokens)


['software', 'developer', 'manage', 'system']

Step 3: Matching

Approaches:

  • Rule-based: keyword dictionaries (e.g. “nurse” → 2221)

  • Fuzzy matching: handle typos & variants

  • ML classifiers: predict ISCO from text

code = None  # default when no rule matches
if "nurse" in tokens:
    code = "2221"  # ISCO-08: Nursing professionals
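
The first two approaches can be combined: try an exact keyword lookup, then fall back to fuzzy matching for typos. The sketch below uses `difflib.get_close_matches` from the standard library. The keyword table, the function name `match_isco`, and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches
from typing import Optional

# Illustrative keyword dictionary: lemma -> ISCO-08 unit group code.
KEYWORD_TO_ISCO = {
    "nurse": "2221",      # Nursing professionals
    "developer": "2512",  # Software developers
    "cashier": "5230",    # Cashiers and ticket clerks
}


def match_isco(tokens: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return an ISCO code via exact lookup, then fuzzy lookup, else None."""
    # Pass 1: rule-based exact match on any token.
    for token in tokens:
        if token in KEYWORD_TO_ISCO:
            return KEYWORD_TO_ISCO[token]
    # Pass 2: fuzzy match to catch typos such as "cashire".
    for token in tokens:
        close = get_close_matches(token, list(KEYWORD_TO_ISCO), n=1, cutoff=cutoff)
        if close:
            return KEYWORD_TO_ISCO[close[0]]
    return None  # nothing matched: leave the response for manual coding


print(match_isco(["hospital", "nurse"]))
print(match_isco(["cashire"]))
```

A lower cutoff catches more typos but also raises the risk of matching unrelated words, so the threshold is worth tuning on real responses.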

Challenges in Real-World Data

Survey responses are often:

  • Vague: “I work in IT”

  • Misspelled: “enginer”

  • Ambiguous: “consultant” (which kind?)

→ NLP helps structure & interpret messy human input
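
One way to handle these three problem types is to triage each response before coding it, so that only clear cases are coded automatically. This is a rough sketch under stated assumptions: the word lists, the status labels, and the function name `triage` are hypothetical, and the input is assumed to be already cleaned of punctuation.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical word lists, chosen only to illustrate the three problem types.
KNOWN_TITLES = ["engineer", "nurse", "cashier", "consultant", "developer"]
AMBIGUOUS_TITLES = {"consultant"}  # titles that need more context to code


def triage(response: str) -> tuple[str, Optional[str]]:
    """Return (status, title); status is 'ok', 'ambiguous', 'corrected' or 'vague'."""
    words = response.lower().split()
    # Exact hit: usable as-is unless the title itself is ambiguous.
    for word in words:
        if word in KNOWN_TITLES:
            status = "ambiguous" if word in AMBIGUOUS_TITLES else "ok"
            return status, word
    # Near miss: treat as a misspelling of the closest known title.
    for word in words:
        close = get_close_matches(word, KNOWN_TITLES, n=1, cutoff=0.85)
        if close:
            return "corrected", close[0]
    # No recognizable job title at all.
    return "vague", None


for text in ["I work in IT", "enginer", "consultant"]:
    print(text, "->", triage(text))
```

Responses flagged as vague or ambiguous could then be routed to a human coder instead of receiving a guessed code.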

Real-World Impact

Structured occupation codes:

  • Enable labour force analysis

  • Feed into policymaking, earnings models, diversity studies

  • Standardize across countries using ISCO-08
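
As a small illustration of the analysis step: the first digit of a four-digit ISCO-08 code identifies its major group, so coded responses can be rolled up with a simple count. The function name and the sample codes below are made up for the example.

```python
from collections import Counter

# ISCO-08 major groups, keyed by the first digit of the four-digit code.
MAJOR_GROUPS = {
    "0": "Armed forces occupations",
    "1": "Managers",
    "2": "Professionals",
    "3": "Technicians and associate professionals",
    "4": "Clerical support workers",
    "5": "Service and sales workers",
    "6": "Skilled agricultural, forestry and fishery workers",
    "7": "Craft and related trades workers",
    "8": "Plant and machine operators, and assemblers",
    "9": "Elementary occupations",
}


def count_by_major_group(codes: list[str]) -> Counter:
    """Count coded responses per ISCO-08 major group."""
    return Counter(MAJOR_GROUPS[code[0]] for code in codes)


# Made-up coded responses, for illustration only.
sample_codes = ["2221", "2512", "5230", "2221"]
print(count_by_major_group(sample_codes))
```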

We will mirror real tasks in government, academia & industry.

Let’s Dive Into The Live Session