UNECA StatsChat course
Welcome to the UNECA StatsChat course!
This course introduces StatsChat as a practical example of a Retrieval-Augmented Generation (RAG) system for official statistics. Across the course, we will look at how a system like this works, how it can be used appropriately, and how similar approaches could be adapted to different national statistical office (NSO) contexts.
The course will have two main parts:
- 3 preliminary sessions, intended to provide a foundation in Git, Python, and LLM / RAG basics
- 6 core sessions, centred on understanding and adapting a RAG system using StatsChat as the main example

The preliminary sessions are designed as support sessions, particularly for anyone who would find it helpful to refresh or build confidence in these areas before the main course begins.
Course Structure
Primers
| Week | Topic | Highlights |
|---|---|---|
| Week 1 | GitHub and repo orientation | How to navigate a repository, read documentation, understand branches and pull requests, and use the repo as technical documentation. |
| Week 2 | Python and command-line basics | A light introduction to files, folders, JSON, running scripts, packages, and the role of Python in the StatsChat pipeline. |
| Week 3 | LLMs and RAG foundations | A shared mental model of LLMs, grounding, hallucination risk, and why retrieval matters for official statistics. |
Core sessions
| Week | Topic | Highlights |
|---|---|---|
| Week 4 | Introduction to StatsChat and RAG | What problem StatsChat solves; end-to-end workflow; what makes this a RAG system; where it fits in an NSO setting. |
| Week 5 | Documents and ingestion | How reports become machine-readable inputs; PDF challenges; metadata; source tracking; what would change for HTML, Word, spreadsheets, or APIs. |
| Week 6 | Chunking, embeddings, and vector search | Why the system does not query raw reports directly; how chunking and embeddings create a searchable representation. |
| Week 7 | Retrieval and evidence selection | How relevant passages are found; reranking; thresholds; recentness; how weak retrieval affects final answers. |
| Week 8 | Generation, prompting, and grounded answers | What the LLM does with retrieved context; answer formatting; confidence, refusal, and traceability. |
| Week 9 | Evaluation, limitations, and local adaptation | Failure modes; evaluation approaches; readiness for deployment; adapting StatsChat for local documents, languages, and users. |
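As a preview of how the ideas in Weeks 6 to 8 fit together, the toy sketch below chunks a short text, scores each chunk against a question, applies a similarity threshold, and builds a grounded prompt. It is an illustration only and is not StatsChat's implementation: the function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`), the word-count "embedding", the threshold value, and the sample text are all assumptions made for this example. A real system would use a trained embedding model and a vector store.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=12):
    """Split a document into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2, threshold=0.3):
    """Return up to top_k (score, chunk) pairs scoring at or above the threshold."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, evidence):
    """Build a grounded prompt, or return None when no evidence was retrieved."""
    if not evidence:
        return None  # nothing passed the threshold: refuse rather than guess
    context = "\n".join(f"- {chunk}" for _, chunk in evidence)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Invented sample text, used only to exercise the functions above.
sample = (
    "The quarterly labour force report describes employment by region and sector. "
    "The consumer price report explains how the price index basket is updated each year."
)
question = "How is the price index basket updated?"
prompt = build_prompt(question, retrieve(question, chunk_text(sample)))
print(prompt)
```

Because the fixed-size chunks split the sample mid-sentence, the first chunk shares only the word "the" with the question and falls below the threshold. This is a small example of how chunking and threshold choices shape which evidence reaches the LLM, a theme the core sessions return to.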