StatsChat overview

This week is the first core session of the course. Last week introduced the general ideas behind large language models (LLMs) and Retrieval-Augmented Generation (RAG). This session moves from the general picture to one real system: StatsChat.

The aim is to give you a clear, end-to-end mental model of what StatsChat is, what it is for, and where it fits in an official statistics setting. We will not go deeply into every component yet: later sessions will return to document processing, retrieval, generation, and evaluation in more detail.

Slide deck

Session overview

In this session, we focus on five big ideas:

  • what problem StatsChat is trying to solve
  • how the full system works at a high level
  • which steps happen once and which steps happen for each user question
  • what the system returns, and what those outputs mean
  • why evaluation and human judgement still matter

StatsChat in one sentence

StatsChat helps answer questions about Kenya and Kenyan statistics, using KNBS reports as the evidence base.

A user is not necessarily thinking “which report contains this paragraph?”. They are more likely to ask a question such as “What was Kenya’s annual inflation rate in March 2026?” or “Which publication supports this figure?”. StatsChat then uses a known collection of official reports to retrieve relevant evidence and pass that evidence to the LLM as context.

The key idea is that the LLM is not expected to answer from its general training alone. Instead, the system first searches a trusted report collection and uses the retrieved extracts to support the answer.

A useful mental model: two halves of the system

A simple way to understand StatsChat is to split it into two parts:

1. Prepare the document collection

These are the steps that mainly happen once, or whenever new reports are added:

  • download the reports
  • convert PDFs into machine-readable text and metadata
  • split the text into smaller chunks
  • create embeddings for those chunks
  • build the search index

This is sometimes called the offline, background, or indexing part of the system.
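
As a very rough illustration, this half can be sketched in a few lines of Python. The helper functions here (convert_pdf_to_text, split_into_chunks, embed_texts) are hypothetical placeholders for steps that later sessions cover in more detail; this is a sketch, not the actual StatsChat code.

```python
# Sketch only: the helper functions are hypothetical placeholders, not StatsChat code.
def build_index(pdf_paths):
    index = []  # one entry per chunk: its text, its embedding, and source metadata
    for path in pdf_paths:
        text, metadata = convert_pdf_to_text(path)  # PDF -> machine-readable text + metadata
        for chunk in split_into_chunks(text):       # smaller passages for more precise retrieval
            vector = embed_texts([chunk])[0]        # numerical representation of the chunk
            index.append({"text": chunk, "vector": vector, "source": metadata})
    return index  # in practice this would be persisted as a vector store
```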

2. Answer each user question

These are the steps that happen for each question:

  • receive the user question
  • retrieve relevant chunks from the index
  • send the question and retrieved context to the LLM
  • return an answer and references

This is the online or query-time part of the system.
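
The query-time half can be sketched in the same spirit. Again, embed_texts, most_similar, build_prompt and call_llm are hypothetical placeholders rather than the real implementation; retrieval and generation get their own sessions later.

```python
# Sketch only: embed_texts, most_similar, build_prompt and call_llm are hypothetical placeholders.
def answer_question(question, index, k=5):
    query_vector = embed_texts([question])[0]          # embed the user question
    top_chunks = most_similar(query_vector, index, k)  # retrieve the k closest chunks
    prompt = build_prompt(question, top_chunks)        # question + retrieved context
    answer = call_llm(prompt)                          # generate an answer grounded in that context
    references = [chunk["source"] for chunk in top_chunks]
    return {"answer": answer, "references": references}
```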

This split is helpful because it separates two different jobs:

  • making the reports searchable
  • answering a question from that searchable collection

Why this matters for official statistics

In official statistics, a useful answer usually needs to be more than fluent. We often want it to be:

  • linked to trusted source material
  • easier to verify
  • sensitive to definitions, metadata, and caveats
  • clear about where it came from
  • appropriate to the relevant report or publication

That is one reason a RAG-style system can be useful. It can help users move from a broad question to a relevant evidence base more quickly.

Key terms

Publication / PDF

The original report, bulletin, or statistical publication that the system draws from.

JSON conversion

A machine-readable version of a report, created so the text and metadata can be processed more easily by the rest of the pipeline.
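
As an illustration only, a converted report might be stored as a record along these lines. The field names are invented for this example and will not match the real StatsChat output exactly.

```python
import json

# Illustrative structure only; the field names are invented for this example.
report_record = {
    "title": "Consumer Price Indices and Inflation Rates, March 2026",
    "publisher": "KNBS",
    "release_date": "2026-03-31",
    "sections": [
        {"heading": "Key highlights", "text": "..."},
        {"heading": "Annual inflation", "text": "..."},
    ],
}
print(json.dumps(report_record, indent=2))
```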

Chunk

A smaller section of text used for retrieval. Chunks help the system search more precisely than if it only searched whole documents.
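
One very simple way to chunk, splitting the text into overlapping word windows, is sketched below. The sizes are arbitrary, and real systems often chunk on headings, paragraphs, or sentence boundaries instead.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into overlapping word windows (the sizes here are arbitrary)."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
        if words[start:start + chunk_size]
    ]
```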

Embedding

A numerical representation of a piece of text. Texts with similar meaning should have embeddings that are closer together.

Vector store

A search index over embeddings. In StatsChat this supports semantic search over the report collection.
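
The two terms above can be illustrated together with toy numbers: every chunk and every question becomes a vector, and the vector store simply finds the chunk vectors closest to the question vector. The three-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions and come from a trained model.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real ones come from a trained embedding model.
chunk_vectors = np.array([
    [0.9, 0.1, 0.0],   # a chunk about inflation
    [0.1, 0.8, 0.1],   # a chunk about population
    [0.8, 0.2, 0.1],   # another chunk about prices
])
query_vector = np.array([0.85, 0.15, 0.05])  # the embedded user question

# Cosine similarity between the question and every chunk vector.
similarities = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_k = np.argsort(similarities)[::-1][:2]  # indices of the two closest chunks
print(top_k, similarities[top_k])
```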

Reference

The retrieved evidence returned with the answer. In practice this is often a retrieved chunk with metadata, not just a whole document title.

Worked example: Kenya inflation in March 2026

Suppose a user asks:

In March 2026, what was Kenya’s annual inflation rate, and what drove it?

A good answer needs to check several things:

  • definition: annual CPI inflation is different from monthly inflation
  • date: the comparison is between March 2026 and March 2025
  • source: the answer should come from the KNBS CPI bulletin, not an arbitrary web source
  • caveat: “what drove it?” may refer to division-level price changes, contribution to the headline rate, or the report’s narrative explanation

For the March 2026 KNBS CPI bulletin, the headline annual CPI inflation figure is 4.4%. The report also identifies Food and Non-Alcoholic Beverages, Transport, and Housing, Water, Electricity, Gas and other Fuels as important divisions to inspect.
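
The definition check can be made concrete with a small calculation. Annual (year-on-year) inflation compares the index for March 2026 with the index for March 2025; the numbers below are toy values chosen only to show the arithmetic, not actual KNBS figures.

```python
# Toy index values, chosen only to illustrate the arithmetic (not actual KNBS figures).
cpi_march_2025 = 100.0
cpi_march_2026 = 104.4

annual_inflation = (cpi_march_2026 / cpi_march_2025 - 1) * 100
print(f"Annual inflation, March 2026: {annual_inflation:.1f}%")  # 4.4%

# Monthly inflation would instead compare March 2026 with February 2026.
```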

A StatsChat-style system needs to:

  1. search the indexed report collection
  2. retrieve relevant passages from the CPI bulletin
  3. pass the question and relevant passages to the LLM
  4. generate an answer using that retrieved context
  5. return an answer and references so the user can verify the source
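
Steps 3 to 5 can be sketched with one possible version of the build_prompt step from the earlier query-time sketch: the retrieved passages are written into the prompt, and the same passages come back as references. The prompt wording and the toy chunk are illustrative, not what StatsChat actually uses.

```python
def build_prompt(question, top_chunks):
    """Assemble the question and retrieved passages into one prompt (illustrative wording)."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk['source']['title']}: {chunk['text']}"
        for i, chunk in enumerate(top_chunks)
    )
    return (
        "Answer the question using only the numbered extracts below. "
        "Cite the extract numbers you rely on, and say if the extracts are not sufficient.\n\n"
        f"Extracts:\n{context}\n\nQuestion: {question}"
    )

# Toy usage with one made-up chunk (illustrative paraphrase, not real report text).
top_chunks = [{
    "text": "The annual inflation rate in March 2026 was 4.4 per cent.",
    "source": {"title": "KNBS CPI bulletin, March 2026"},
}]
print(build_prompt("What was Kenya's annual inflation rate in March 2026?", top_chunks))
```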

This is different from asking a plain LLM the question directly. The answer is intended to be grounded in the retrieved evidence.

Prototype status: important caveat

A key message from this session is that StatsChat is a useful prototype, not a finished product.

This matters because the course should neither oversell nor undersell the current system. The current version:

  • is a real RAG system adapted for KNBS
  • is useful as a case study and learning tool
  • has recently shown improved results after changes to the implementation
  • helps us see how a system like this might be adapted in other NSO contexts

But it is not yet production-ready. PDF processing and retrieval remain important areas for improvement, and answers still need careful interpretation and evaluation.

So the value of StatsChat in this course is not that it is already finished. The value is that it is a real prototype through which we can understand how systems like this work, where they help, and where further development is needed.

Where StatsChat fits

A StatsChat-style system may be helpful when users need to:

  • navigate large collections of reports
  • find the right report or passage quickly
  • surface definitions, caveats, and evidence
  • answer repeated questions over official publications

It should be used cautiously where:

  • the evidence is weak or ambiguous
  • retrieval may miss the right source
  • the latest figure matters a great deal
  • users need a fully checked, publication-ready answer

Connected topic: AI search summaries

This is not the main topic of the session, but it is a useful connected discussion.

Tools such as Google AI summaries or other AI search interfaces may help users discover information more quickly. However, they can also:

  • use non-KNBS sources when a user wanted official KNBS evidence
  • mix official and unofficial evidence
  • produce overconfident summaries
  • make it harder to see exactly where a claim came from

That does not mean these tools are useless. It does mean they should not automatically be treated as substitutes for a system designed around trusted sources and traceable retrieval.

A related future topic is how NSOs can make their content more machine-readable, clearly structured, and easier for AI systems to identify and attribute correctly.

Session summary

This week’s big picture is:

  • StatsChat helps answer questions about Kenya and Kenyan statistics using KNBS evidence
  • it helps to split the pipeline into prepare once and answer each question
  • the answer depends on retrieved evidence, not only on the model
  • the current prototype is useful and improving, but not production-ready
  • evaluation and human judgement remain essential

Lab

StatsChat overview exercise

This short task reinforces the main idea from the session: StatsChat is easiest to understand if we separate the one-time preparation of the document collection from the per-question answering process.

Task

Choose one realistic question that you would want a StatsChat-like system to answer in your own organisation.

Then answer the questions below briefly.

Questions
  1. What is your question?
  2. What documents or reports would the system need access to?
  3. Which steps would need to happen once before users could ask this question?
  4. Which steps would happen each time a user asks the question?
  5. What would make you trust the final answer more?
  6. What is one reason you might still be cautious, even if the answer looks good?
Optional extension
  1. A user could also try a general AI search summary instead of a StatsChat-style system. What might be better about a StatsChat-style approach for this question? What might still be difficult?
Guidance
  • Short bullet points are fine.
  • This should take around 5–10 minutes.
  • Be ready to discuss your example in the lab.

Between-session task

Please complete the short exercise before the Thursday lab. Brief bullet points are enough.

Lab Slides

Further reading

  • Revisit last week’s material on LLMs and RAG foundations.
  • Review the architecture notes on PDF ingestion, retrieval, generation, and input/output.