2026-05-05
Core session 2
From official reports to usable evidence for a RAG system.
By the end, you should be able to explain:
Many of you may use, adapt or help maintain a StatsChat-like system for your own organisation.
Keep one publication or data source in mind during the session. We will return to this in the exercise and lab.
A. A frequently used official report with clear user demand
B. A very old scanned report with poor image quality
C. A visually impressive factsheet with many charts but little text
D. Any PDF, because all PDFs are equally easy to process
Suggested answer: A — start with a useful, realistic source where success is measurable. The others may matter later, but they are harder first examples.
A PDF is designed to preserve a page layout for human readers.
It may contain:
A RAG system needs structured evidence.
It needs to know:
A. PDFs are always scanned images
B. PDFs preserve visual layout, but not always semantic structure
C. PDFs cannot contain text
D. PDFs are too large for Python to open
Suggested answer: B — many PDFs have text, but the structure that makes the text meaningful can be hard to extract.
The aim is not just to extract text. The aim is to preserve enough meaning and source information for retrieval and verification.
The coding demo shows the basic idea behind PDF-to-JSON conversion.
The StatsChat version adds more: source tracking, metadata, update modes, error handling and links back to report pages.
StatsChat treats PDF ingestion as part of the one-time or update-time preparation of the document collection.
This is already much better than anonymous text: it keeps page numbers and source links.
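A minimal sketch of that page-level PDF-to-JSON idea. The field names here (`source_url`, `page_number`, `text`) are illustrative, not the exact StatsChat schema, and the page texts are assumed to have been extracted already:

```python
import json

def pages_to_records(page_texts, source_url, title):
    """Turn a list of extracted page texts into page-level evidence records.

    Field names are illustrative, not the exact StatsChat schema.
    """
    return [
        {
            "title": title,
            "source_url": source_url,
            "page_number": i,  # 1-based, so citations match the PDF
            "text": text.strip(),
        }
        for i, text in enumerate(page_texts, start=1)
    ]

# Hypothetical example input and URL.
records = pages_to_records(
    ["Consumer Price Indices ...", "Table 1: ..."],
    source_url="https://example.org/cpi-march-2026.pdf",
    title="CPI March 2026",
)
print(json.dumps(records[0], indent=2))
```

Each record can later be chunked and indexed while still carrying its page number and source link.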
CPI March 2026
Clean demo PDF
2025 Facts and Figures
Table-heavy report
ICT indicators factsheet
Visual layout example
The ICT factsheet is a separate example from the 2025 Facts and Figures report. It is useful because the page is very visual.
But this is still page text, not a structured table object.
The page has a clear table with:
A better pipeline would preserve those as table fields, not just lines of text.
The visual page is easy for humans, but the extracted text becomes a sequence of labels and values:
The numbers survive, but the visual grouping is weaker.
A. The PDF has not been downloaded
B. The system may mix up which labels, groups or units belong to which values
C. The LLM will refuse to answer every question
D. There is no risk if all the numbers are present
Suggested answer: B — values without their labels, units or grouping can become misleading evidence.
The important information is often not just in paragraphs. It is in headings, tables, captions, notes and page context.
For KNBS reports, the biggest risk is often not OCR alone. It is tables, headings, page structure, metadata and chunk boundaries being lost.
The information may still be present, but the structure that makes it interpretable has been damaged.
A table might visually mean:
| Year | 2020 | 2021 | 2022 | 2023 | 2024 |
|---|---|---|---|---|---|
| Inflation (%) | 5.4 | 6.1 | 7.7 | 7.7 | 4.5 |
But extracted text might look more like:

Year 2020 2021 2022 2023 2024 Inflation (%) 5.4 6.1 7.7 7.7 4.5
A person can sometimes reconstruct this. A retrieval system and LLM may not do so reliably.
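A tiny sketch of the difference, using the inflation table above. The flattening step mimics naive extraction; the paired version shows what structure-aware handling would keep:

```python
# The table from the slide, as rows of cells.
rows = [
    ["Year", "2020", "2021", "2022", "2023", "2024"],
    ["Inflation (%)", "5.4", "6.1", "7.7", "7.7", "4.5"],
]

# Naive extraction: cells run together as one stream of tokens,
# so the year/value pairing is only implicit in word order.
flattened = " ".join(cell for row in rows for cell in row)

# Structure-preserving version: keep the header/value alignment explicit.
header, values = rows
paired = dict(zip(header[1:], values[1:]))

print(flattened)
print(paired)
```

A retriever given only `flattened` has to guess which number belongs to which year; given `paired`, the association is explicit.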
A. PDF download only
B. PDF extraction only
C. Chunking, retrieval or ranking
D. The user question cannot be answered by RAG
Suggested answer: C — this is why we diagnose the whole chain rather than blaming the LLM or the PDF extractor immediately.
A better ingestion system should think about the whole report record, not just a single PDF link.
The current branch has already improved the ingestion pipeline.
- `url_dict.json` as a central list of source document URLs
- `pdfplumber` for selected difficult pages/documents

So the message is not "the current pipeline is bad". It is "the current pipeline works, but document processing is still a major improvement area".
Not every weak answer is a PDF extraction problem. Sometimes the evidence is extracted but not retrieved.
This helps retrieval reason across the whole publication while still citing the exact source page or table.
The target is not perfect extraction everywhere. The target is preserving the evidence needed for retrieval and checking.
A. Only increase the LLM prompt length
B. Preserve tables as structured evidence objects with captions, units and page references
C. Remove all tables before indexing
D. Assume the LLM can reconstruct tables from any text
Suggested answer: B — better generation starts with better evidence representation.
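One way to picture option B is a table stored as a structured evidence object rather than flattened text. The shape below is an assumption for illustration, not the StatsChat schema, and the caption, URL and page number are made up:

```python
# Illustrative evidence object; field names and values are assumptions.
table_evidence = {
    "kind": "table",
    "caption": "Annual inflation rate",  # example caption
    "unit": "%",                         # units kept separate from values
    "source_url": "https://example.org/facts-and-figures-2025.pdf",
    "page_number": 12,                   # hypothetical page
    "columns": ["2020", "2021", "2022", "2023", "2024"],
    "rows": {"Inflation": [5.4, 6.1, 7.7, 7.7, 4.5]},
}

# Generation can now cite the exact cell instead of guessing from free text.
year = "2023"
value = table_evidence["rows"]["Inflation"][table_evidence["columns"].index(year)]
print(f"{value} {table_evidence['unit']} ({year}, p.{table_evidence['page_number']})")
```

With the caption, unit and page reference attached, the answer can be checked against the original report page.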
| Page type | Better handling |
|---|---|
| Clean digital prose | fast native text extraction |
| Table-heavy page | table-aware extraction and validation |
| Scanned page | selective OCR |
| Mixed or problematic page | fallback extractor or manual review |
Do not treat every PDF page the same. Use the simplest method that preserves the evidence.
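The routing idea in the table above can be sketched as a small dispatcher. The page features and thresholds here are illustrative, not taken from the StatsChat codebase:

```python
def choose_extractor(page):
    """Pick the simplest method that preserves the evidence.

    ``page`` is a small dict of page features; the feature names and
    method labels are illustrative, not from the StatsChat codebase.
    """
    if page.get("extraction_errors"):
        return "fallback extractor or manual review"
    if page.get("has_tables"):
        return "table-aware extraction and validation"
    if page.get("char_count", 0) == 0:  # no native text layer: likely scanned
        return "selective OCR"
    return "fast native text extraction"

print(choose_extractor({"char_count": 900}))
print(choose_extractor({"char_count": 0}))
print(choose_extractor({"char_count": 400, "has_tables": True}))
```

Problematic pages are checked first so that a broken page is never silently pushed through the fast path.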
A good ingestion pipeline should preserve:
This checklist is also an evaluation checklist: what should we inspect before trusting the pipeline?
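That inspection step can itself be a small function. This is a sketch only: it assumes the page-record shape used earlier in the session (`page_number`, `text`, `source_url`), and the checks are examples, not a complete evaluation:

```python
def check_ingestion(records, expected_pages):
    """Quick pre-trust checks on a batch of page records.

    Assumes illustrative fields ``page_number``, ``text``, ``source_url``.
    Returns a list of human-readable problems (empty list = passed).
    """
    problems = []
    if len(records) != expected_pages:
        problems.append(f"expected {expected_pages} pages, got {len(records)}")
    for r in records:
        if not r.get("text", "").strip():
            problems.append(f"page {r.get('page_number')}: empty text")
        if not r.get("source_url"):
            problems.append(f"page {r.get('page_number')}: missing source link")
    return problems

good = [{"page_number": 1, "text": "CPI rose...", "source_url": "https://example.org/a.pdf"}]
bad = [{"page_number": 1, "text": "   ", "source_url": ""}]
print(check_ingestion(good, expected_pages=1))
print(check_ingestion(bad, expected_pages=2))
```

Running checks like these before indexing catches empty pages and missing source links early, instead of discovering them through a wrong answer later.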
CPI March 2026
Short release; good for PDF → JSON and page-level text.
2025 Facts and Figures
Good for discussing table preservation and structured evidence.
ICT indicators factsheet
Good for showing how visual grouping can be lost.
If we later add the Economic Survey or the full Housing Survey, they can extend the same story to long reports and multi-attachment report pages.
Before the lab, complete the short exercise:
The aim is not to produce a technical design. It is just to practise thinking like someone adapting a RAG system.