LLMs and RAG foundations

In this week’s workshop, we introduce the core ideas behind large language models (LLMs) and retrieval-augmented generation (RAG).

This session is designed as a general introduction for all course participants. The aim is to give everyone a shared mental model before we move into the main StatsChat sessions.

This is the big-picture week, where we set the scene for the rest of the course. We won’t go into technical details yet, but we will cover the key concepts and terminology of LLMs and RAG. This will help us understand the strengths and limitations of these tools, especially in the context of official statistics.

We will focus on:

  • what an LLM is and how it generates text
  • what RAG adds, and why grounding matters
  • key terms used throughout the course
  • a worked example comparing a plain LLM answer with a grounded answer

Introduction

StatsChat and the wider course

StatsChat is the running example for this course. In simple terms, a user asks a question about official reports; the system retrieves relevant material and uses a language model to produce a grounded answer.

This week is deliberately concept-led. Later sessions will unpack the pipeline in more detail, especially:

  • retrieval
  • generation
  • evaluation
  • adaptation to other document collections and NSO contexts

Session summary

A plain LLM can often produce fluent answers, summaries, and explanations. However, for official statistics, fluency is not enough. We often need answers that are:

  • based on specific documents
  • up to date
  • easier to verify
  • clear about where they came from
  • sensitive to definitions, caveats, and context

A RAG system helps by retrieving relevant source material before generation. This can make answers more useful and more grounded, especially when working with official reports and organisation-specific document collections.

A core point for this session is that RAG is helpful, but not magical. A grounded answer can still be wrong if the wrong evidence is retrieved, if the model misinterprets the source, or if the source itself is incomplete or ambiguous.

Key terms

LLM

A large language model is a model trained on large amounts of text that generates language by predicting likely next tokens.
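To make “predicting likely next tokens” concrete, here is a toy sketch in Python. The probabilities are invented for illustration and do not come from any real model; a real LLM computes a distribution like this over its whole vocabulary at every step.

```python
import random

# Invented, illustrative probabilities for the next token after the
# prefix "The inflation rate" -- not taken from any real model.
next_token_probs = {" was": 0.45, " in": 0.25, " rose": 0.20, " fell": 0.10}

def sample_next_token(probs):
    """Pick one token, with more likely tokens chosen more often."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print("The inflation rate" + sample_next_token(next_token_probs))
# A real LLM repeats this step, feeding each new token back in,
# until a full answer has been generated.
```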

Token

A token is the basic unit of text the model processes. A token is often a whole word or a piece of a word; punctuation and spaces can also form part of tokens.
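One way to see tokenisation in practice is the tiktoken library, which provides the tokenisers used by several OpenAI models. This is a minimal sketch, assuming tiktoken is installed; other models use different tokenisers, so the exact splits vary.

```python
import tiktoken  # pip install tiktoken

# Tokeniser used by several OpenAI models; other models split text differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "year-on-year inflation"
token_ids = enc.encode(text)
print(token_ids)                                 # a list of integer token IDs
print([enc.decode([tid]) for tid in token_ids])  # the text chunk behind each ID
```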

Prompt

A prompt is the instruction or input given to the model.

Context

Context is the text the model is given to work with when producing its answer.

Retrieval

Retrieval is the step where relevant source material is found for the question.

Grounding

Grounding means tying the answer to specific source material rather than relying only on the model’s general patterns.

Hallucination

A hallucination is when a model produces content that sounds plausible but is unsupported, incorrect, or invented.

RAG

Retrieval-augmented generation is an approach where the system first retrieves relevant information, then asks the model to generate an answer using that information.
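To tie these terms together, here is a deliberately minimal, dependency-free sketch of the RAG pattern in Python. The retrieval step is crude word overlap (real systems, including StatsChat, use stronger methods such as embeddings), the documents are invented, and call_llm is a hypothetical placeholder for whatever model the system uses.

```python
# Minimal sketch of the RAG pattern: retrieve, then generate from evidence.

documents = [
    "The year-on-year inflation rate in March 2025 was 6.2%, up from 5.8% "
    "in February 2025.",
    "Year-on-year inflation compares the CPI in a given month with the same "
    "month in the previous year.",
    "The unemployment rate was unchanged at 4.1% in the first quarter.",
]

def retrieve(question, docs, top_k=2):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(question, excerpts):
    """Assemble a grounded prompt: instruction + evidence + question."""
    context = "\n".join(f"- {e}" for e in excerpts)
    return (
        "Answer using only the excerpts below. If they do not contain the "
        f"answer, say so.\n\nExcerpts:\n{context}\n\nQuestion: {question}"
    )

question = "What was the year-on-year inflation rate in March 2025?"
prompt = build_prompt(question, retrieve(question, documents))
print(prompt)
# answer = call_llm(prompt)  # hypothetical: send the grounded prompt to an LLM
```

Note that if retrieval returns the wrong excerpts, the model is grounded in the wrong evidence, which is exactly the failure mode discussed above.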

Worked example

We use a fictional official-statistics example in the session.

Question:

According to the National Statistics Office’s Consumer Price Bulletin, what was the year-on-year inflation rate in March 2025, and which categories contributed most to the increase?

Source excerpt A

The year-on-year inflation rate in March 2025 was 6.2%, up from 5.8% in February 2025. The increase was mainly driven by higher prices for food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels.

Source excerpt B

Year-on-year inflation compares the Consumer Price Index in a given month with the same month in the previous year. Monthly changes and year-on-year changes should not be interpreted as the same measure.

Plain LLM-style answer

Inflation in March 2025 was around 5.8%, mainly driven by food prices and transport costs. Housing-related costs may also have contributed. This suggests inflation remained elevated during the period.

Grounded / RAG-style answer

According to the Consumer Price Bulletin, March 2025, the year-on-year inflation rate was 6.2%, up from 5.8% in February 2025. The bulletin states that the increase was mainly driven by food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels. A technical note in the same publication also clarifies that year-on-year inflation is not the same as a monthly change.

This comparison helps show why grounding can make an answer:

  • more specific
  • easier to verify
  • more appropriate for official-statistics use

It also helps highlight that a grounded answer could still be wrong if the wrong evidence is retrieved, or if the evidence is misinterpreted.

Lab

Compare a plain LLM answer and a grounded answer

Read the example below and respond briefly to the questions that follow.

Question

According to the National Statistics Office’s Consumer Price Bulletin, what was the year-on-year inflation rate in March 2025, and which categories contributed most to the increase?

Plain LLM-style answer

Inflation in March 2025 was around 5.8%, mainly driven by food prices and transport costs. Housing-related costs may also have contributed. This suggests inflation remained elevated during the period.

Grounded / RAG-style answer

According to the Consumer Price Bulletin, March 2025, the year-on-year inflation rate was 6.2%, up from 5.8% in February 2025. The bulletin states that the increase was mainly driven by food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels. A technical note in the same publication also clarifies that year-on-year inflation is not the same as a monthly change.

Questions
  1. Which answer would you trust more?
  2. Why?
  3. What makes an answer more useful in an official statistics context?
  4. In your organisation, what kind of report, publication, or document might benefit from a RAG-style system?
  5. What could still go wrong, even if a RAG system is used?
Optional extension
  1. Why bother building our own RAG system (which builds on existing models), rather than just using the most recent LLM-based models (e.g. ChatGPT 5, Gemini 3, Claude Opus 4), which already have a form of RAG built in?
Guidance
  • Short bullet points are fine.
  • This task should take around 5–10 minutes.
  • Be ready to discuss your answers in the Thursday lab.

Between-session task

Please complete the short lab task before the Thursday session. Keep your answers brief: bullet points are fine.
