LLMs and RAG foundations
In this week’s workshop, we introduce the core ideas behind large language models (LLMs) and retrieval-augmented generation (RAG).
This session is designed as a general introduction for all course participants. The aim is to give everyone a shared mental model before we move into the main StatsChat sessions.
This is the big-picture week, where we set the scene for the rest of the course. We won’t go into technical detail yet, but we will cover the key concepts and terminology of LLMs and RAG. This will help us understand the strengths and limitations of these tools, especially in the context of official statistics.
We will focus on:
- what an LLM is, in practical terms
- what RAG adds to a plain LLM workflow
- why grounding matters for official statistics
- why RAG can help, but does not guarantee correctness
Introduction
StatsChat and the wider course
StatsChat is the running example for this course. In simple terms, it lets a user ask a question about official reports, retrieves relevant material, and uses a language model to produce a grounded answer.
This week is deliberately concept-led. Later sessions will unpack the pipeline in more detail, especially:
- retrieval
- generation
- evaluation
- adaptation to other document collections and NSO contexts
Session summary
A plain LLM can often produce fluent answers, summaries, and explanations. However, for official statistics, fluency is not enough. We often need answers that are:
- based on specific documents
- up to date
- easier to verify
- clear about where they came from
- sensitive to definitions, caveats, and context
A RAG system helps by retrieving relevant source material before generation. This can make answers more useful and more grounded, especially when working with official reports and organisation-specific document collections.
A core point for this session is that RAG is helpful, but not magical. A grounded answer can still be wrong if the wrong evidence is retrieved, if the model misinterprets the source, or if the source itself is incomplete or ambiguous.
Key terms
LLM
A large language model is a model trained on large amounts of text that generates language by predicting likely next tokens.
Token
A token is a chunk of text the model processes. It is often a fragment of a word, though it may also be a whole word, a punctuation mark, or a piece of whitespace.
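As a toy illustration only (real tokenizers, such as byte-pair encoders, learn their vocabulary from data rather than using a hand-made list), the sketch below greedily matches words against a tiny invented vocabulary, showing how a word can split into sub-word tokens:

```python
# Toy tokenizer: greedy longest-match against a tiny hand-made vocabulary.
# Real LLM tokenizers (e.g. byte-pair encoding) learn their vocabulary from
# data; this only illustrates the idea that tokens can be word pieces.

VOCAB = {"infla", "tion", "rate", "un", "happi", "ness", "the", "s", "a",
         "t", "i", "o", "n", "r", "e", "h", "p", "f", "l", "u", "y"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(word) - i, 0, -1):
            piece = word[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # Fall back to a single character for anything unknown.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("inflation"))    # ['infla', 'tion']
print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
```

Common words tend to survive as single tokens, while rarer words are split into pieces, which is why token counts and word counts differ.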
Prompt
A prompt is the instruction or input given to the model.
Context
Context is the text the model is given to work with when producing its answer.
Retrieval
Retrieval is the step where relevant source material is found for the question.
Grounding
Grounding means tying the answer to specific source material rather than relying only on the model’s general patterns.
Hallucination
A hallucination is when a model produces content that sounds plausible but is unsupported, incorrect, or invented.
RAG
Retrieval-augmented generation is an approach where the system first retrieves relevant information, then asks the model to generate an answer using that information.
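The two-step flow can be sketched in a few lines of Python. This is only an illustration: the word-overlap scorer stands in for a real retriever, and `call_llm` is a hypothetical placeholder for whichever model API a system actually uses.

```python
# Minimal retrieve-then-generate sketch. The word-overlap scorer stands in
# for a real retriever (embeddings, BM25, ...), and call_llm is a
# hypothetical placeholder for an actual model API.

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many question words they share (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Ask the model to answer using only the retrieved passages."""
    context = "\n\n".join(passages)
    return ("Answer the question using only the sources below. "
            "If the sources do not contain the answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:")

# With a real model API this would be:
# answer = call_llm(build_prompt(question, retrieve(question, documents)))
```

Note that the prompt explicitly instructs the model to rely on the retrieved sources and to admit when they are insufficient; this is part of what "grounding" means in practice.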
Worked example
We use a fictional official-statistics example in the session.
Question:
According to the National Statistics Office’s Consumer Price Bulletin, what was the year-on-year inflation rate in March 2025, and which categories contributed most to the increase?
Source excerpt A
The year-on-year inflation rate in March 2025 was 6.2%, up from 5.8% in February 2025. The increase was mainly driven by higher prices for food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels.
Source excerpt B
Year-on-year inflation compares the Consumer Price Index in a given month with the same month in the previous year. Monthly changes and year-on-year changes should not be interpreted as the same measure.
Plain LLM-style answer
Inflation in March 2025 was around 5.8%, mainly driven by food prices and transport costs. Housing-related costs may also have contributed. This suggests inflation remained elevated during the period.
Grounded / RAG-style answer
According to the Consumer Price Bulletin, March 2025, the year-on-year inflation rate was 6.2%, up from 5.8% in February 2025. The bulletin states that the increase was mainly driven by food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels. A technical note in the same publication also clarifies that year-on-year inflation is not the same as a monthly change.
This comparison helps show why grounding can make an answer:
- more specific
- easier to verify
- more appropriate for official-statistics use
It also helps highlight that a grounded answer could still be wrong if the wrong evidence is retrieved or interpreted.
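To make the retrieval step concrete, the sketch below scores the two source excerpts against (an abridged form of) the question using naive word overlap. A real retriever would use embeddings or a ranking function such as BM25, but the intuition is the same: the excerpt sharing more vocabulary with the question is ranked first.

```python
# Toy relevance scoring: count distinct words shared between the question
# and each source excerpt. A real retriever would use embeddings or BM25.

def overlap_score(question: str, passage: str) -> int:
    """Count distinct words the question and passage share."""
    def words(s: str) -> set[str]:
        for ch in "?,.%":
            s = s.replace(ch, " ")
        return set(s.lower().split())
    return len(words(question) & words(passage))

question = ("What was the year-on-year inflation rate in March 2025, and "
            "which categories contributed most to the increase?")

excerpt_a = ("The year-on-year inflation rate in March 2025 was 6.2%, up from "
             "5.8% in February 2025. The increase was mainly driven by higher "
             "prices for food and non-alcoholic beverages, transport, and "
             "housing, water, electricity, gas and other fuels.")
excerpt_b = ("Year-on-year inflation compares the Consumer Price Index in a "
             "given month with the same month in the previous year. Monthly "
             "changes and year-on-year changes should not be interpreted as "
             "the same measure.")

print("A:", overlap_score(question, excerpt_a))
print("B:", overlap_score(question, excerpt_b))
# Excerpt A, which contains the actual figures, shares more words with the
# question and would be ranked first.
```

Notice that excerpt B would still be useful context (it explains the definition), which is why real systems usually retrieve several passages rather than only the top one.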
Lab
Compare a plain LLM answer and a grounded answer
Read the example below and respond briefly to the questions that follow.
Question
According to the National Statistics Office’s Consumer Price Bulletin, what was the year-on-year inflation rate in March 2025, and which categories contributed most to the increase?
Plain LLM-style answer
Inflation in March 2025 was around 5.8%, mainly driven by food prices and transport costs. Housing-related costs may also have contributed. This suggests inflation remained elevated during the period.
Grounded / RAG-style answer
According to the Consumer Price Bulletin, March 2025, the year-on-year inflation rate was 6.2%, up from 5.8% in February 2025. The bulletin states that the increase was mainly driven by food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels. A technical note in the same publication also clarifies that year-on-year inflation is not the same as a monthly change.
Questions
- Which answer would you trust more?
- Why?
- What makes an answer more useful in an official statistics context?
- In your organisation, what kind of report, publication, or document might benefit from a RAG-style system?
- What could still go wrong, even if a RAG system is used?
Optional extension
- Why build our own RAG system (which builds on existing models), rather than simply using the most recent LLM-based products (e.g. ChatGPT 5, Gemini 3, Claude Opus 4), which already have a form of RAG built in?
Guidance
- Short bullet points are fine.
- This task should take around 5–10 minutes.
- Be ready to discuss your answers in the Thursday lab.
Between-session task
Please complete the short lab task before the Thursday session. Keep your answers brief: bullet points are fine.
Further reading
- OpenAI. Introduction to large language models. https://help.openai.com/en/articles/11162441
- IBM. What is retrieval-augmented generation (RAG)? https://www.ibm.com/think/topics/retrieval-augmented-generation
- Google for Developers. Long context. https://ai.google.dev/gemini-api/docs/long-context
- NIST. AI Risk Management Framework. https://www.nist.gov/itl/ai-risk-management-framework