LLMs and RAG foundations
In this week’s workshop, we introduce the core ideas behind large language models (LLMs) and retrieval-augmented generation (RAG).
This session is designed as a general introduction for all course participants. The aim is to give everyone a shared mental model before we move into the main StatsChat sessions.
This is the big-picture week, where we set the scene for the rest of the course. We won’t go into technical detail yet, but we will cover the key concepts and terminology of LLMs and RAG. This will help us understand the strengths and limitations of these tools, especially in the context of official statistics.
We will focus on:
- what an LLM is, in practical terms
- what RAG adds to a plain LLM workflow
- why grounding matters for official statistics
- why RAG can help, but does not guarantee correctness
Introduction
StatsChat and the wider course
StatsChat is the running example for this course. In simple terms, it lets a user ask a question about official reports, retrieves relevant material, and uses a language model to produce a grounded answer.
This week is deliberately concept-led. Later sessions will unpack the pipeline in more detail, especially:
- retrieval
- generation
- evaluation
- adaptation to other document collections and NSO contexts
Session summary
A plain LLM can often produce fluent answers, summaries, and explanations. However, for official statistics, fluency is not enough. We often need answers that are:
- based on specific documents
- up to date
- easier to verify
- clear about where they came from
- sensitive to definitions, caveats, and context
A RAG system helps by retrieving relevant source material before generation. This can make answers more useful and more grounded, especially when working with official reports and organisation-specific document collections.
A core point for this session is that RAG is helpful, but not magical. A grounded answer can still be wrong if the wrong evidence is retrieved, if the model misinterprets the source, or if the source itself is incomplete or ambiguous.
Key terms
LLM
A large language model is a model trained on large amounts of text that generates language by predicting likely next tokens.
Token
A token is a chunk of text the model processes. It is often a fragment of a word, though it may also be a whole word, a punctuation mark, or a piece of whitespace.
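As a toy illustration only (real tokenizers, such as byte-pair encoders, learn their vocabulary from data rather than using a hand-made list), the sketch below greedily matches words against a tiny invented vocabulary, showing how a word can split into sub-word tokens:

```python
# Toy tokenizer: greedy longest-match against a tiny hand-made vocabulary.
# Real LLM tokenizers (e.g. byte-pair encoding) learn their vocabulary from
# data; this only illustrates the idea that tokens can be word pieces.

VOCAB = {"infla", "tion", "rate", "un", "happi", "ness", "the", "s", "a",
         "t", "i", "o", "n", "r", "e", "h", "p", "f", "l", "u", "y"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(word) - i, 0, -1):
            piece = word[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # Fall back to a single character for anything unknown.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("inflation"))    # ['infla', 'tion']
print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
```

Common words tend to survive as single tokens, while rarer words are split into pieces, which is why token counts and word counts differ.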
Prompt
A prompt is the instruction or input given to the model.
Context
Context is the text the model is given to work with when producing its answer.
Retrieval
Retrieval is the step where relevant source material is found for the question.
Grounding
Grounding means tying the answer to specific source material rather than relying only on the model’s general patterns.
Hallucination
A hallucination is when a model produces content that sounds plausible but is unsupported, incorrect, or invented.
RAG
Retrieval-augmented generation is an approach where the system first retrieves relevant information, then asks the model to generate an answer using that information.
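The two-step flow can be sketched in a few lines of Python. This is only an illustration: the word-overlap scorer stands in for a real retriever, and `call_llm` is a hypothetical placeholder for whichever model API a system actually uses.

```python
# Minimal retrieve-then-generate sketch. The word-overlap scorer stands in
# for a real retriever (embeddings, BM25, ...), and call_llm is a
# hypothetical placeholder for an actual model API.

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many question words they share (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Ask the model to answer using only the retrieved passages."""
    context = "\n\n".join(passages)
    return ("Answer the question using only the sources below. "
            "If the sources do not contain the answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:")

# With a real model API this would be:
# answer = call_llm(build_prompt(question, retrieve(question, documents)))
```

Note that the prompt explicitly instructs the model to rely on the retrieved sources and to admit when they are insufficient; this is part of what "grounding" means in practice.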
Worked example
We use a fictional official-statistics example in the session.
Question:
According to the National Statistics Office’s Consumer Price Bulletin, what was the year-on-year inflation rate in March 2025, and which categories contributed most to the increase?
Source excerpt A
The year-on-year inflation rate in March 2025 was 6.2%, up from 5.8% in February 2025. The increase was mainly driven by higher prices for food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels.
Source excerpt B
Year-on-year inflation compares the Consumer Price Index in a given month with the same month in the previous year. Monthly changes and year-on-year changes should not be interpreted as the same measure.
Plain LLM-style answer
Inflation in March 2025 was around 5.8%, mainly driven by food prices and transport costs. Housing-related costs may also have contributed. This suggests inflation remained elevated during the period.
Grounded / RAG-style answer
According to the Consumer Price Bulletin, March 2025, the year-on-year inflation rate was 6.2%, up from 5.8% in February 2025. The bulletin states that the increase was mainly driven by food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels. A technical note in the same publication also clarifies that year-on-year inflation is not the same as a monthly change.
This comparison helps show why grounding can make an answer:
- more specific
- easier to verify
- more appropriate for official-statistics use
It also helps highlight that a grounded answer could still be wrong if the wrong evidence is retrieved or interpreted.
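To make the retrieval step concrete, the sketch below scores the two source excerpts against (an abridged form of) the question using naive word overlap. A real retriever would use embeddings or a ranking function such as BM25, but the intuition is the same: the excerpt sharing more vocabulary with the question is ranked first.

```python
# Toy relevance scoring: count distinct words shared between the question
# and each source excerpt. A real retriever would use embeddings or BM25.

def overlap_score(question: str, passage: str) -> int:
    """Count distinct words the question and passage share."""
    def words(s: str) -> set[str]:
        for ch in "?,.%":
            s = s.replace(ch, " ")
        return set(s.lower().split())
    return len(words(question) & words(passage))

question = ("What was the year-on-year inflation rate in March 2025, and "
            "which categories contributed most to the increase?")

excerpt_a = ("The year-on-year inflation rate in March 2025 was 6.2%, up from "
             "5.8% in February 2025. The increase was mainly driven by higher "
             "prices for food and non-alcoholic beverages, transport, and "
             "housing, water, electricity, gas and other fuels.")
excerpt_b = ("Year-on-year inflation compares the Consumer Price Index in a "
             "given month with the same month in the previous year. Monthly "
             "changes and year-on-year changes should not be interpreted as "
             "the same measure.")

print("A:", overlap_score(question, excerpt_a))
print("B:", overlap_score(question, excerpt_b))
# Excerpt A, which contains the actual figures, shares more words with the
# question and would be ranked first.
```

Notice that excerpt B would still be useful context (it explains the definition), which is why real systems usually retrieve several passages rather than only the top one.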
Lab
Compare a plain LLM answer and a grounded answer
Read the example below and respond briefly to the questions that follow.
Question
According to the National Statistics Office’s Consumer Price Bulletin, what was the year-on-year inflation rate in March 2025, and which categories contributed most to the increase?
Plain LLM-style answer
Inflation in March 2025 was around 5.8%, mainly driven by food prices and transport costs. Housing-related costs may also have contributed. This suggests inflation remained elevated during the period.
Grounded / RAG-style answer
According to the Consumer Price Bulletin, March 2025, the year-on-year inflation rate was 6.2%, up from 5.8% in February 2025. The bulletin states that the increase was mainly driven by food and non-alcoholic beverages, transport, and housing, water, electricity, gas and other fuels. A technical note in the same publication also clarifies that year-on-year inflation is not the same as a monthly change.
Questions
- Which answer would you trust more?
- Why?
- What makes an answer more useful in an official statistics context?
- In your organisation, what kind of report, publication, or document might benefit from a RAG-style system?
- What could still go wrong, even if a RAG system is used?
Optional extension
- Why build our own RAG system (which builds on existing models), rather than simply using the most recent LLM-based products (e.g. ChatGPT 5, Gemini 3, Claude Opus 4), which already have a form of RAG built in?
Guidance
- Short bullet points are fine.
- This task should take around 5–10 minutes.
- Be ready to discuss your answers in the Thursday lab.
Between-session task
Please complete the short lab task before the Thursday session. Keep your answers brief: bullet points are fine.
Further reading
- OpenAI. Introduction to large language models. https://help.openai.com/en/articles/11162441
- IBM. What is retrieval-augmented generation (RAG)? https://www.ibm.com/think/topics/retrieval-augmented-generation
- Google for Developers. Long context. https://ai.google.dev/gemini-api/docs/long-context
- NIST. AI Risk Management Framework. https://www.nist.gov/itl/ai-risk-management-framework