Documents and ingestion
This week is the second core session of the course. Last week introduced StatsChat as a full Retrieval-Augmented Generation (RAG) system. This week zooms in on the first part of the pipeline: ingesting and processing PDFs.
The aim is to understand how official reports become machine-readable evidence for a RAG system. We will look at the basic PDF-to-JSON idea, how StatsChat currently approaches this, why PDFs are difficult, and how the pipeline could be improved. Participants should also keep in mind one publication or data source from their own organisation that they might want to feed into a StatsChat-like system.
Slide deck
Session overview
In this session, we focus on five big ideas:
- PDF processing is the foundation of the rest of the RAG pipeline.
- PDFs are designed for human reading, not necessarily for machine-readable structure.
- StatsChat currently uses a practical PDF ingestion pipeline built around downloading PDFs, tracking sources, and converting pages into JSON.
- The current approach works as a prototype, but important structure can be lost: tables, headings, captions, units, footnotes, page context and metadata.
- Better ingestion and better retrieval need to be designed together.
Why this matters
A RAG system can only retrieve evidence that has been ingested and indexed. If a table heading, unit, date, page number or source link is lost during PDF processing, later retrieval and generation may not be able to recover it reliably.
The key message is not simply “extract the text”. The goal is to preserve useful, traceable evidence.
Basic PDF-to-JSON demo
A short coding demo will show the simple version of PDF processing:
- open a PDF with PyMuPDF
- loop through each page
- extract the page text
- save `page_number` and `page_text` to a JSON file
A simple demo script is provided here:
This demo is intentionally much simpler than the full StatsChat pipeline, but it shows the core idea.
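The steps above can be sketched in a few lines of Python. This is a minimal sketch using PyMuPDF (imported as `fitz`); the function names and file paths are illustrative, not the exact demo script:

```python
import json
import pathlib


def page_records(page_texts):
    """Turn a list of page texts into JSON-ready records."""
    return [
        {"page_number": i + 1, "page_text": text}
        for i, text in enumerate(page_texts)
    ]


def pdf_to_json(pdf_path, json_path):
    """Extract the text of each page of a PDF and save it as JSON."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        records = page_records([page.get_text() for page in doc])
    pathlib.Path(json_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

Note that `page.get_text()` returns the page's text in reading order as PyMuPDF reconstructs it, which is exactly where the losses discussed later (tables, headings, layout) creep in.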
How StatsChat goes further
The StatsChat pipeline adds more around the simple PDF-to-JSON step, including:
- crawling KNBS report pages
- downloading PDFs
- recording source URLs in `url_dict.json`
- extracting page-level text
- adding metadata from the KNBS report page
- preserving page URLs for references
- handling setup and update modes
- passing JSON conversions into chunking, embeddings and retrieval
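The practical effect of these extra steps is that each page record carries source tracking as well as raw text. A minimal sketch of what an enriched record might look like; the field names and URL here are illustrative assumptions, not the exact StatsChat schema:

```python
import json


def enrich_pages(pages, report_meta):
    """Attach report-level metadata to each extracted page record."""
    return [{**page, **report_meta} for page in pages]


pages = [{"page_number": 1, "page_text": "Overall inflation stood at 4.4 per cent ..."}]
meta = {
    # Illustrative values; in StatsChat the source URL is tracked via url_dict.json
    "report_title": "Consumer Price Indices and Inflation Rates",
    "source_url": "https://example.org/reports/cpi-march",
}

print(json.dumps(enrich_pages(pages, meta), indent=2))
```

Keeping metadata at the page level, rather than only at the report level, is what later allows retrieved chunks to cite a specific page and link.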
Example PDFs
Suggested examples for the session are listed here:
The recommended anchor examples are:
- CPI March 2026: clean/simple demo PDF
- 2025 Facts and Figures: table-heavy report
- ICT indicators factsheet: visual layout and chart/grouping example
The Economic Survey and full Housing Survey Basic Report could be added later if we want examples of very long reports and multi-attachment report pages.
Exercise for the lab
The Tuesday session includes a short exercise to complete before the Thursday lab. It is designed to be easy and quick: choose one source from your own organisation, then inspect two simple extraction examples. We will discuss answers together in the lab.
Week 5 exercise: choosing a source and inspecting extraction
This short exercise is for the Thursday lab. It should take around 10–15 minutes. The aim is to reinforce the main lesson from Tuesday’s session:
PDF processing is not just extracting text. It is preserving evidence in a form that retrieval and generation can use.
You do not need to write code.
Part 1: choose a source from your own organisation
Think of one publication or data source that could be useful in a StatsChat-like system.
It could be:
- an annual statistical report
- a monthly or quarterly release
- a factsheet or infographic
- a spreadsheet or table collection
- a web page, database, or API
- a mix of the above
Answer briefly:
- What source would you choose?
- Who might ask questions about it?
- What kinds of questions would they ask?
- What format is it in? For example: PDF, HTML, Excel, Word, database/API.
- What would be important to preserve: page numbers, dates, units, geography, table headings, footnotes, charts, metadata?
Part 2: inspect a simple extracted table
In the CPI March 2026 report, a table includes a Total row for overall inflation. A simple PyMuPDF extraction can return text like this:
Total
100.0000
0.5
4.4
The original table headings tell us that the final number, 4.4, is the percentage change on the same month of the previous year.
Answer briefly:
- What information has been preserved?
- What information is only understandable if we keep the table heading or surrounding context?
- Would this be enough to answer: “What was the annual inflation rate in March 2026?”
- What would a better structured output keep?
Part 3: inspect a visual factsheet extraction
In the ICT indicators factsheet, the page is visually clear to a human reader. But extracted text can look more like a sequence of labels and values:
Mobile Phone Ownership by Individuals
National
Male
Female
53.7%
54.5%
52.9%
Internet Usage by Individuals
National
Male
35.0%
37.8%
Female
32.2%
Answer briefly:
- What information has been preserved?
- What visual structure has been weakened or lost?
- If this text were indexed for retrieval, what might go wrong?
- If this text were passed to an LLM, what might go wrong?
- What metadata or structure would you want the ingestion pipeline to preserve?
Part 4: bring one point to the lab
Bring one short comment to the lab:
- one thing that surprised you
- one thing that might be difficult for your organisation’s publications
- or one improvement you think would make extraction more useful
Short bullet points are completely fine. We will discuss examples together in the lab.
Further reading
- Review the StatsChat architecture notes on PDF ingestion.
- Review the week 4 material on the distinction between preparing the document collection and answering each question.