Week 5: Ingesting and Processing PDFs

2026-05-05

Welcome

Ingesting and Processing PDFs

Core session 2

From official reports to usable evidence for a RAG system.

  • Last week: StatsChat end to end
  • This week: how reports become machine-readable
  • Key idea: extraction quality shapes retrieval quality

Where this session fits

Last week

  • StatsChat overview
  • prepare once vs answer each question
  • RAG pipeline as a whole

Today

  • PDF ingestion
  • text and metadata extraction
  • tables, layout and structure
  • how this could be improved

Today’s aims

By the end, you should be able to explain:

  • why PDF processing matters for RAG
  • why PDFs are difficult for machines
  • how StatsChat currently processes PDF reports
  • what can go wrong during extraction
  • how better ingestion could support better retrieval

Why this matters for your organisation

Many of you may use, adapt or help maintain a StatsChat-like system for your own organisation.

  • What publications or data sources would you want users to query?
  • Are they PDFs, spreadsheets, web pages, APIs, or a mixture?
  • What would be at risk if the structure were lost?

Keep one publication or data source in mind during the session. We will return to this in the exercise and lab.

Quick question 1

Which source would you prioritise first for a StatsChat-like system?

A. A frequently used official report with clear user demand
B. A very old scanned report with poor image quality
C. A visually impressive factsheet with many charts but little text
D. Any PDF, because all PDFs are equally easy to process

Suggested answer: A — start with a useful, realistic source where success is measurable. The others may matter later, but they are harder first examples.

The big idea

A PDF is not the same as evidence

A PDF is designed to preserve a page layout for human readers.

It may contain:

  • headings
  • tables
  • charts
  • captions
  • footnotes
  • page numbers

A RAG system needs structured evidence.

It needs to know:

  • what text says
  • where it came from
  • what page it was on
  • what table or section it belonged to
  • whether extraction was reliable

Quick question 2

Why are PDFs often difficult for RAG systems?

A. PDFs are always scanned images
B. PDFs preserve visual layout, but not always semantic structure
C. PDFs cannot contain text
D. PDFs are too large for Python to open

Suggested answer: B — many PDFs have text, but the structure that makes the text meaningful can be hard to extract.

From reports to evidence

Illustration of reports being transformed into structured evidence and used for search

The aim is not just to extract text. The aim is to preserve enough meaning and source information for retrieval and verification.

The simple version

PDF → text → JSON

This is likely what the coding demo will show: the basic idea behind PDF-to-JSON conversion.

Demo: what we expect to see

A simple PyMuPDF workflow

import fitz  # PyMuPDF
import json

# Open the PDF and walk through its pages in order.
pdf = fitz.open("report.pdf")

pages = []
for page_number, page in enumerate(pdf, start=1):
    pages.append({
        "page_number": page_number,
        # Plain text as stored in the PDF; this may not
        # match the visual reading order of the page.
        "page_text": page.get_text()
    })

# Write one JSON record for the whole document.
with open("report.json", "w", encoding="utf-8") as f:
    json.dump({"pages": pages}, f, indent=2)

The StatsChat version adds more: source tracking, metadata, update modes, error handling and links back to report pages.
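
A sketch of that fuller idea is below: per-page error handling plus source tracking, building the same page records but with a deep link back to the source PDF. The function name build_record and its arguments are illustrative choices, not the actual StatsChat code.

import fitz
import json

def build_record(pdf_path, source_url, title, release_date):
    # Illustrative sketch, not the real StatsChat ingestion code.
    pages = []
    with fitz.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf, start=1):
            try:
                text = page.get_text()
            except Exception as exc:
                # Keep going if a single page fails; record the problem.
                text = ""
                print(f"page {page_number} failed: {exc}")
            pages.append({
                "page_number": page_number,
                # Deep link to the exact page of the source PDF.
                "page_url": f"{source_url}#page={page_number}",
                "page_text": text,
            })
    return {
        "metadata": {
            "title": title,
            "release_date": release_date,
            "url": source_url,
        },
        "pages": pages,
    }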

How StatsChat currently processes PDFs

StatsChat treats PDF ingestion as part of the one-time or update-time preparation of the document collection.

Current output: what gets preserved

A simplified StatsChat JSON structure

{
  "metadata": {
    "title": "Consumer Price Indices and Inflation Rates",
    "release_date": "2026-03-31",
    "theme": "Prices",
    "url": "https://..."
  },
  "pages": [
    {
      "page_number": 3,
      "page_url": "https://...pdf#page=3",
      "page_text": "Annual consumer price inflation..."
    }
  ]
}

This is already much better than anonymous text: it keeps page numbers and source links.

The example PDFs we will use

Cover of the March 2026 CPI report

CPI March 2026
Clean demo PDF

Cover of 2025 Facts and Figures

2025 Facts and Figures
Table-heavy report

ICT indicators factsheet

ICT indicators factsheet
Visual layout example

The ICT factsheet is a separate example from the 2025 Facts and Figures report. It is useful because the page is very visual.

Example 1: clean extraction can work well

CPI March 2026 table showing one and twelve month percentage changes

CPI March 2026

PyMuPDF extracts the table in a readable order:

Total
100.0000
0.5
4.4

This is enough for a simple question like:

What was the annual inflation rate in March 2026?

But this is still page text, not a structured table object.

Example 2: table-heavy reports need more than text

2025 Facts and Figures inflation table

2025 Facts and Figures

The page has a clear table with:

  • row labels
  • year columns
  • units implied by the title
  • a weighted average row

A better pipeline would preserve those as table fields, not just lines of text.
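
One way to get closer to that, assuming a recent PyMuPDF release, is the built-in table detector find_tables(). The file name and page index below are made up for illustration; the output dictionary shape is a sketch, not a fixed StatsChat schema.

import fitz

pdf = fitz.open("facts_and_figures_2025.pdf")  # illustrative file name
page = pdf[24]  # a table-heavy page, zero-indexed

tables = []
for table in page.find_tables().tables:
    rows = table.extract()  # list of rows, each a list of cell strings
    if not rows:
        continue
    tables.append({
        "page": 25,
        "headers": rows[0],  # first row usually holds the column labels
        "rows": rows[1:],
    })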

Example 3: visual factsheets are harder

ICT indicators factsheet with icons, charts and percentages

ICT indicators factsheet

The visual page is easy for humans, but the extracted text becomes a sequence of labels and values:

Mobile Phone Ownership by Individuals
National
Male
Female
53.7%
54.5%
52.9%
Internet Usage by Individuals
National
Male
35.0%
37.8%
Female
32.2%

The numbers survive, but the visual grouping is weaker.
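
When grouping matters, position information can help. Below is a sketch using PyMuPDF's get_text("dict"), which returns text blocks with bounding boxes that a later step could use to regroup labels and values that sit near each other on the page. The file name is illustrative.

import fitz

pdf = fitz.open("ict_factsheet.pdf")  # illustrative file name
page = pdf[0]

for block in page.get_text("dict")["blocks"]:
    if block["type"] == 0:  # 0 = text block, 1 = image block
        x0, y0, x1, y1 = block["bbox"]
        text = " ".join(
            span["text"]
            for line in block["lines"]
            for span in line["spans"]
        )
        print(f"({x0:.0f}, {y0:.0f}) {text}")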

Quick question 3

If the numbers are extracted but the visual grouping is lost, what is the biggest risk?

A. The PDF has not been downloaded
B. The system may mix up which labels, groups or units belong to which values
C. The LLM will refuse to answer every question
D. There is no risk if all the numbers are present

Suggested answer: B — values without their labels, units or grouping can become misleading evidence.

Why official statistics PDFs are hard

Mock report spread showing headings, charts, tables, source notes and page numbers

The important information is often not just in paragraphs. It is in headings, tables, captions, notes and page context.

Three kinds of PDF problem

1. Digital text

  • selectable text
  • extraction usually works
  • layout may still be awkward

2. Scanned pages

  • page is an image
  • needs OCR
  • OCR can introduce errors

3. Complex layout

  • text exists
  • tables and columns matter
  • reading order can break

For KNBS reports, the biggest risk is often not OCR alone. It is tables, headings, page structure, metadata and chunk boundaries being lost.
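
A rough heuristic can tell these three cases apart in code, so later steps can route pages differently. This is a sketch only: the rules are illustrative guesses, not tuned logic, and find_tables() assumes a recent PyMuPDF.

import fitz

def classify_page(page):
    text = page.get_text().strip()
    has_images = bool(page.get_images(full=True))
    if not text and has_images:
        return "scanned"        # image-only page: needs OCR
    if page.find_tables().tables:
        return "table-heavy"    # at least one detected table
    return "digital-text"       # selectable text: the simplest case

pdf = fitz.open("report.pdf")
for number, page in enumerate(pdf, start=1):
    print(number, classify_page(page))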

What can go wrong?

The table problem

Illustration of a structured table becoming disordered extracted text

The information may still be present, but the structure that makes it interpretable has been damaged.

Example: table structure lost

Human-readable table → machine-confusing string

A table might visually mean:

Year            2020   2021   2022   2023   2024
Inflation (%)    5.4    6.1    7.7    7.7    4.5

But extracted text might look more like:

YEARPER CENT20232024202220212020012345678 97.77.74.56.15.4

A person can sometimes reconstruct this. A retrieval system and LLM may not do so reliably.

Lost structure is a retrieval problem

If we only preserve text

  • table headers may disappear
  • units may detach from values
  • row and column labels may merge
  • captions may be separated
  • source page may be unclear

Later steps struggle

  • chunks are less meaningful
  • embeddings are weaker
  • retrieval may miss the right evidence
  • answers become less reliable
  • citations are harder to check

Quick question 4

A correct table is extracted into JSON, but StatsChat answers from the wrong page. Where is the problem most likely to be?

A. PDF download only
B. PDF extraction only
C. Chunking, retrieval or ranking
D. The user question cannot be answered by RAG

Suggested answer: C — this is why we diagnose the whole chain rather than blaming the LLM or the PDF extractor immediately.

Another lesson: report pages, not just PDFs

A report page can have several useful files

Report page
├── main PDF
├── popular version
├── highlights or presentation
├── speech or launch material
└── spreadsheet tables, where available

A better ingestion system should think about the whole report record, not just a single PDF link.

Current improvements in StatsChat

The current branch has already improved the ingestion pipeline.

  • downloads all PDF links from a report page, not only the first one
  • records source URLs through url_dict.json
  • handles page-level extraction errors more safely
  • can fall back to pdfplumber for selected difficult pages/documents (sketched below)
  • includes tests for downloader, conversion and page integrity

So the message is not “the current pipeline is bad”. It is “the current pipeline works, but document processing is still a major improvement area”.
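
As an illustration of the fallback idea only, assuming pdfplumber is installed, a minimal sketch might try PyMuPDF first and use the slower extractor when the primary one fails or returns nothing. This is not the actual StatsChat fallback logic.

import fitz
import pdfplumber

def extract_page_text(pdf_path, page_index):
    try:
        with fitz.open(pdf_path) as pdf:
            text = pdf[page_index].get_text()
        if text.strip():
            return text
    except Exception:
        pass  # fall through to the slower, more careful extractor
    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_index].extract_text() or ""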

Diagnosing failures

When an answer is wrong, where did it go wrong?

Not every weak answer is a PDF extraction problem. Sometimes the evidence is extracted but not retrieved.

From current to better: example 1

Treat report pages as parent records

Current/simple view

PDF file
└── page text

Better target

Report record
├── metadata
├── primary PDF
├── supporting attachments
├── pages
├── tables
├── figures/captions
└── quality flags

This helps retrieval reason across the whole publication while still citing the exact source page or table.
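
One way to hold that parent record in code is sketched below with dataclasses; the field names are invented for illustration and do not match the StatsChat codebase.

from dataclasses import dataclass, field

@dataclass
class ReportRecord:
    title: str
    release_date: str
    url: str
    attachments: list = field(default_factory=list)    # supporting files
    pages: list = field(default_factory=list)          # page-level text
    tables: list = field(default_factory=list)         # structured tables
    quality_flags: list = field(default_factory=list)  # extraction warnings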

From current to better: example 2

Move from page text to evidence objects

Current-style object

{
  "page_number": 25,
  "page_text": "Figure 3... YEARPER CENT..."
}

Better-style object

{
  "chunk_type": "table",
  "page": 25,
  "caption": "Inflation, 2020-2024",
  "units": "per cent",
  "headers": ["2020", "2021", "2022"],
  "text": "Inflation table with source page"
}

The target is not perfect extraction everywhere. The target is preserving the evidence needed for retrieval and checking.

Quick question 5

Which improvement is most useful for table-heavy official reports?

A. Only increase the LLM prompt length
B. Preserve tables as structured evidence objects with captions, units and page references
C. Remove all tables before indexing
D. Assume the LLM can reconstruct tables from any text

Suggested answer: B — better generation starts with better evidence representation.

From current to better: example 3

Route different pages differently

Page type                   Better handling
Clean digital prose         fast native text extraction
Table-heavy page            table-aware extraction and validation
Scanned page                selective OCR
Mixed or problematic page   fallback extractor or manual review

Do not treat every PDF page the same. Use the simplest method that preserves the evidence.
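
A sketch of that dispatch, reusing the classify_page heuristic from the earlier sketch; the OCR branch is left as a placeholder because OCR tooling varies.

def route_page(page):
    kind = classify_page(page)  # heuristic sketch from earlier
    if kind == "scanned":
        raise NotImplementedError("selective OCR goes here")
    if kind == "table-heavy":
        # table-aware extraction: keep rows and headers together
        return [t.extract() for t in page.find_tables().tables]
    return page.get_text()  # fast native extraction for clean prose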

Better ingestion checklist

A good ingestion pipeline should preserve:

  • source URL
  • report title
  • release date
  • page number
  • section heading
  • table title/caption
  • units
  • footnotes
  • figure captions
  • reference period
  • geography
  • extraction warnings

This checklist is also an evaluation checklist: what should we inspect before trusting the pipeline?
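
Turned into a quick inspection script, the checklist might look like the sketch below, which flags missing fields in an extracted record before it is trusted. The field names mirror the simplified JSON shown earlier; treat the lists of required keys as assumptions to adapt.

import json

REQUIRED_METADATA = ["title", "release_date", "url"]
REQUIRED_PAGE_FIELDS = ["page_number", "page_url", "page_text"]

with open("report.json", encoding="utf-8") as f:
    record = json.load(f)

for key in REQUIRED_METADATA:
    if not record.get("metadata", {}).get(key):
        print(f"missing metadata: {key}")

for page in record.get("pages", []):
    for key in REQUIRED_PAGE_FIELDS:
        if not page.get(key):
            print(f"page {page.get('page_number', '?')} missing: {key}")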

Anchor examples for today

Clean/simple demo

CPI March 2026
Short release; good for PDF → JSON and page-level text.

Table-heavy report

2025 Facts and Figures
Good for discussing table preservation and structured evidence.

Visual factsheet

ICT indicators factsheet
Good for showing how visual grouping can be lost.

If we later add the Economic Survey or the full Housing Survey, they can extend the same story to long reports and multi-attachment report pages.

Exercise for Thursday lab

Choose a source + inspect extraction

Before the lab, complete the short exercise:

  1. Choose one publication or data source from your organisation.
  2. Say why it would be useful in a StatsChat-like system.
  3. Inspect two short extraction examples from today.
  4. Note what was preserved, what was lost, and what better structure would help.

The aim is not to produce a technical design. It is just to practise thinking like someone adapting a RAG system.

Takeaways

  • PDF processing is where real-world documents enter the RAG pipeline.
  • Good ingestion is not just text extraction; it is evidence preservation.
  • Tables, headings, metadata and page references matter for official statistics.
  • Current StatsChat ingestion works and has improved, but this is still a key area for development.
  • Better ingestion and better retrieval need to be designed together.