Documents and ingestion
This week is the second core session of the course. Last week introduced StatsChat as a full Retrieval-Augmented Generation (RAG) system. This week zooms in on the first part of the pipeline: ingesting and processing PDFs.
The aim is to understand how official reports become machine-readable evidence for a RAG system. We will look at the basic PDF-to-JSON idea, how StatsChat currently approaches this, why PDFs are difficult, and how the pipeline could be improved. Participants should also keep in mind one publication or data source from their own organisation that they might want to feed into a StatsChat-like system.
Slide deck
Session overview
In this session, we focus on five big ideas:
- PDF processing is the foundation of the rest of the RAG pipeline.
- PDFs are designed for human reading, not necessarily for machine-readable structure.
- StatsChat currently uses a practical PDF ingestion pipeline built around downloading PDFs, tracking sources, and converting pages into JSON.
- The current approach works as a prototype, but important structure can be lost: tables, headings, captions, units, footnotes, page context and metadata.
- Better ingestion and better retrieval need to be designed together.
Why this matters
A RAG system can only retrieve evidence that has been ingested and indexed. If a table heading, unit, date, page number or source link is lost during PDF processing, later retrieval and generation may not be able to recover it reliably.
The key message is not simply “extract the text”. The goal is to preserve useful, traceable evidence.
Basic PDF-to-JSON demo
A short coding demo will show the simple version of PDF processing:
- open a PDF with PyMuPDF
- loop through each page
- extract the page text
- save `page_number` and `page_text` to a JSON file
A simple demo script is provided here:
This demo is intentionally much simpler than the full StatsChat pipeline, but it shows the core idea.
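The steps above can be sketched in a few lines of Python. This is a minimal sketch using PyMuPDF (imported as `fitz`); the function names and file paths are illustrative, not the exact demo script:

```python
import json
import pathlib


def page_records(page_texts):
    """Turn a list of page texts into JSON-ready records."""
    return [
        {"page_number": i + 1, "page_text": text}
        for i, text in enumerate(page_texts)
    ]


def pdf_to_json(pdf_path, json_path):
    """Extract the text of each page of a PDF and save it as JSON."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        records = page_records([page.get_text() for page in doc])
    pathlib.Path(json_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

Note that `page.get_text()` returns the page's text in reading order as PyMuPDF reconstructs it, which is exactly where the losses discussed later (tables, headings, layout) creep in.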
How StatsChat goes further
The StatsChat pipeline adds more around the simple PDF-to-JSON step, including:
- crawling KNBS report pages
- downloading PDFs
- recording source URLs in `url_dict.json`
- extracting page-level text
- adding metadata from the KNBS report page
- preserving page URLs for references
- handling setup and update modes
- passing JSON conversions into chunking, embeddings and retrieval
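The practical effect of these extra steps is that each page record carries source tracking as well as raw text. A minimal sketch of what an enriched record might look like; the field names and URL here are illustrative assumptions, not the exact StatsChat schema:

```python
import json


def enrich_pages(pages, report_meta):
    """Attach report-level metadata to each extracted page record."""
    return [{**page, **report_meta} for page in pages]


pages = [{"page_number": 1, "page_text": "Overall inflation stood at 4.4 per cent ..."}]
meta = {
    # Illustrative values; in StatsChat the source URL is tracked via url_dict.json
    "report_title": "Consumer Price Indices and Inflation Rates",
    "source_url": "https://example.org/reports/cpi-march",
}

print(json.dumps(enrich_pages(pages, meta), indent=2))
```

Keeping metadata at the page level, rather than only at the report level, is what later allows retrieved chunks to cite a specific page and link.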
Example PDFs
Suggested examples for the session are listed here:
The recommended anchor examples are:
- CPI March 2026: clean/simple demo PDF
- 2025 Facts and Figures: table-heavy report
- ICT indicators factsheet: visual layout and chart/grouping example
The Economic Survey and full Housing Survey Basic Report could be added later if we want examples of very long reports and multi-attachment report pages.
Exercise for the lab
The Tuesday session includes a short exercise to complete before the Thursday lab. It is designed to be easy and quick: choose one source from your own organisation, then inspect two simple extraction examples. We will discuss answers together in the lab.
Week 5 exercise: choosing a source and inspecting extraction
This short exercise is for the Thursday lab. It should take around 10–15 minutes. The aim is to reinforce the main lesson from Tuesday’s session:
PDF processing is not just extracting text. It is preserving evidence in a form that retrieval and generation can use.
You do not need to write code.
Part 1: choose a source from your own organisation
Think of one publication or data source that could be useful in a StatsChat-like system.
It could be:
- an annual statistical report
- a monthly or quarterly release
- a factsheet or infographic
- a spreadsheet or table collection
- a web page, database, or API
- a mix of the above
Answer briefly:
- What source would you choose?
- Who might ask questions about it?
- What kinds of questions would they ask?
- What format is it in? For example: PDF, HTML, Excel, Word, database/API.
- What would be important to preserve: page numbers, dates, units, geography, table headings, footnotes, charts, metadata?
Part 2: inspect a simple extracted table
In the CPI March 2026 report, a table includes a Total row for overall inflation. A simple PyMuPDF extraction can return text like this:
Total
100.0000
0.5
4.4
The original table headings tell us that the final number, 4.4, is the percentage change on the same month of the previous year.
Answer briefly:
- What information has been preserved?
- What information is only understandable if we keep the table heading or surrounding context?
- Would this be enough to answer: “What was the annual inflation rate in March 2026?”
- What would a better structured output keep?
Part 3: inspect a visual factsheet extraction
In the ICT indicators factsheet, the page is visually clear to a human reader. But extracted text can look more like a sequence of labels and values:
Mobile Phone Ownership by Individuals
National
Male
Female
53.7%
54.5%
52.9%
Internet Usage by Individuals
National
Male
35.0%
37.8%
Female
32.2%
Answer briefly:
- What information has been preserved?
- What visual structure has been weakened or lost?
- If this text were indexed for retrieval, what might go wrong?
- If this text were passed to an LLM, what might go wrong?
- What metadata or structure would you want the ingestion pipeline to preserve?
Part 4: bring one point to the lab
Bring one short comment to the lab:
- one thing that surprised you
- one thing that might be difficult for your organisation’s publications
- or one improvement you think would make extraction more useful
Short bullet points are completely fine. We will discuss examples together in the lab.
Further reading
- Review the StatsChat architecture notes on PDF ingestion.
- Review the week 4 material on the distinction between preparing the document collection and answering each question.