The aims of this demonstration are to:
- Understand the MST synthesis method from a high level
- Visualise the impact of adding noise during synthesis
- Observe some of the limitations of the method
To do this, we will be synthesising the 1% Teaching File and following the path of a particular set of columns.
Moreover, to emphasise the importance of adding noise in achieving disclosure control, we will focus on a pair of columns with a very small count in their two-way marginal table.
The MST method¶
Maximum Spanning Tree (MST) is a data synthesis method for synthesising categorical data while preserving privacy. The authors of MST won the 2018 NIST competition for differential privacy (DP) and data synthesis. The team won using a variant of MST tailored to the contest dataset - extracts from the 1940 US Census.
There are three main steps to MST:
- Select a collection of one- and two-way marginals from the data
- Measure the marginals privately by adding a controlled amount of noise to each cell count
- Generate synthetic data that attempts to match the noisy marginals
MST is a flexible, scalable method that can preserve interactions in the data beyond those it measures.
This accuracy for latent relationships comes in the generate step which employs the Private-PGM
method.
Private-PGM
represents the noisy marginals as a graph with nodes and edges.
Since MST is a differentially private synthesis method, we must also specify a privacy budget. The variant of DP used in MST requires two parameters, $\epsilon > 0$ and $0 < \delta \ll \epsilon$. These parameters control the amount of noise added to marginals, and the pseudo-randomness of the selection process.
Choosing $\delta$ is relatively straightforward so long as it is smaller than $\frac{1}{N}$ where $N$ is the length of the dataset to be synthesised. Choosing the primary privacy budget $\epsilon$ is much more nuanced and broadly determines how accurate the synthetic data may be. While larger values are commonly used in practice, one of the authors of DP argues that $0.01 \approx \ln(1.01) \le \epsilon \le \ln(3) \approx 1.1$ is a suitable range.
For this demonstration, we take $\epsilon = 1$ and $\delta = 10 ^ {- ceil\left(\log_{10} N\right)} = 10 ^ {-6}$, where $ceil$ is the ceiling function.
The 1% Census¶
One of the outputs from the 2011 Census in England and Wales is the Microdata Teaching File. This file contains a 1% sample of the person records from the Census; all records are anonymised and chosen at random. Excluding the person ID, the data comprises 17 columns and 569,741 rows.
The primary purpose of the teaching file is an educational tool, but it also offers up a glance at what genuine Census microdata look like. Using this dataset to demonstrate the MST method seems sensible then, given its previous applications to census data.
Snippets¶
Below is a snapshot of the original data, and a similar snapshot of the synthetic data beneath that. While it may seem like we are skipping ahead, these snippets make clear what our end goal looks like. That is, a synthetic dataset that looks quite like the real thing.
Region | Residence Type | Family Composition | Population Base | Sex | Age | Marital Status | Student | Country of Birth | Health | Ethnic Group | Religion | Economic Activity | Occupation | Industry | Hours worked per week | Approximated Social Grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E12000001 | H | 2 | 1 | 2 | 6 | 2 | 2 | 1 | 2 | 1 | 2 | 5 | 8 | 2 | -9 | 4 |
1 | E12000001 | H | 5 | 1 | 1 | 4 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 8 | 6 | 4 | 3 |
2 | E12000001 | H | 3 | 1 | 2 | 4 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 6 | 11 | 3 | 4 |
3 | E12000001 | H | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 7 | 7 | 3 | 2 |
4 | E12000001 | H | 3 | 1 | 1 | 5 | 4 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 4 | 3 | 2 |
Region | Residence Type | Family Composition | Population Base | Sex | Age | Marital Status | Student | Country of Birth | Health | Ethnic Group | Religion | Economic Activity | Occupation | Industry | Hours worked per week | Approximated Social Grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E12000003 | H | 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | -9 | -9 | -9 | -9 | -9 |
1 | W92000004 | H | 5 | 1 | 1 | 4 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 5 | 3 | 3 | 3 |
2 | E12000007 | H | 2 | 1 | 1 | 8 | 2 | 2 | 1 | 1 | 1 | 1 | 5 | 5 | 3 | -9 | 2 |
3 | W92000004 | H | 2 | 1 | 1 | 6 | 2 | 2 | 1 | 3 | 1 | 2 | 5 | 9 | 10 | -9 | 2 |
4 | W92000004 | H | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 6 | 6 | 11 | -9 | 3 |
Our pair of columns¶
Our demonstration follows the trajectory of two columns in the teaching file, Population Base
and Marital Status
.
Population Base
describes the individual's residency:
- Usual resident
- Student living away from home during term-time
- Short-term resident
Marital Status
describes the individual's marital status:
- Single (never married or never registered a same-sex civil partnership)
- Married or in a registered same-sex civil partnership
- Separated but still legally married or separated but still legally in a same-sex civil partnership
- Divorced or formerly in a same-sex civil partnership which is now legally dissolved
- Widowed or surviving partner from a same-sex civil partnership
These columns are heavily unbalanced in the data with 98% of individuals being usual residents, and 85% of individuals either being single or in a legally binding relationship, i.e., married or in a registered same-sex civil partnership:
One-way counts¶
Population Base¶
count | proportion | |
---|---|---|
Population Base | ||
1 | 561,040 | 0.9847 |
2 | 6,730 | 0.0118 |
3 | 1,971 | 0.0035 |
Marital Status¶
count | proportion | |
---|---|---|
Marital Status | ||
1 | 270,999 | 0.4757 |
2 | 214,180 | 0.3759 |
3 | 11,951 | 0.021 |
4 | 40,713 | 0.0715 |
5 | 31,898 | 0.056 |
Two-way counts¶
The primary reason we are interested in these columns, however, is because they have a two-way marginal table with some very low counts:
Marital Status | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Population Base | |||||
1 | 262,807 | 213,777 | 11,930 | 40,664 | 31,862 |
2 | 6,684 | 28 | 8 | 9 | 1 |
3 | 1,508 | 375 | 13 | 40 | 35 |
Typically, data owners might be hesistant to disclose information like this, since there is only one individual in the data who is both a student and widowed. In the case of the 1% Teaching File, the data is so granular that reidentification of this individual is incredibly small.
However, in a more fine-grained, confidential dataset, this would be a problem.
Adding noise¶
To ensure the privacy of the individuals in the dataset, MST adds noise to all of these counts. In MST, the noise is Gaussian and centred on zero. The standard deviation of the noise distribution is determined by the privacy budget and how many measurements are taken.
MST measures all 17 one-way marginals and a selection of two-way marginals. One third of the privacy budget is spent randomly selecting these two-way marginals.
For $\epsilon = 1$ and $\delta = 10^{-6}$, MST adds $\mathcal{N}(0, 32.36)$ noise to the one-way marginal tables.
MST selects enough two-way marginals to connect all the columns using a graphical representation. In this representation, columns are nodes and an edge on the graph indicates the selection of the two-way marginal between the nodes.
As such, MST must measure a minimum of 16 two-way marginals to form a minimum spanning tree of the column nodes. In practice, the final graph is not always a tree and may contain triangles. Here, this is not the case.
For our example, MST adds $\mathcal{N}(0, 31.39)$ noise to each two-way marginal's cell counts.
Generating the synthetic data¶
With the marginals selected and measured (with added noise), they can be passed to the Private-PGM
method to create tabular data.
In short, Private-PGM
attempts to find a data distribution that matches the noisy marginals by solving an optimisation problem.
Mapping out the synthesis¶
The synthetic dataset is generated by traversing a directed acyclic graph of the selected marginals, and sampling as need from the estimated distribution.
The figure below shows the graph for this synthesis, showing how the generation was performed.
Each column is represented as a node, and an edge between two nodes indicates that their two-way marginal
was selected for synthesis. The direction of an edge (a, b)
indicates that the column a
is used to
synthesise column b
.
This graph shows that Industry
was synthesised first with no dependent columns.
Then the synthetic Industry
column was used to synthesise Occupation
, and so on.
Comparing the marginal counts¶
One-way marginals¶
To demonstrate the impact of adding noise to the marginal tables, consider the tables below, which show the observed, noisy and synthesised counts for each our columns of interest.
Population Base¶
observed | noisy | synthetic | |
---|---|---|---|
Population Base | |||
1 | 561,040 | 561,012.36 | 560,993 |
2 | 6,730 | 6,647.39 | 6,730 |
3 | 1,971 | 1,992.15 | 2,002 |
Marital Status¶
observed | noisy | synthetic | |
---|---|---|---|
Marital Status | |||
1 | 270,999 | 270,970.27 | 270,944 |
2 | 214,180 | 214,115.91 | 214,146 |
3 | 11,951 | 11,939.74 | 11,948 |
4 | 40,713 | 40,718.06 | 40,739 |
5 | 31,898 | 31,937.81 | 31,948 |
We can see that for this privacy budget and this particular run of MST, these counts are all rather close. Success!
However, accurately recreating individuals columns is a relatively low bar for data synthesis.
Two-way marginals¶
The figures below show the marginal table for our pair of columns for the original data, its noisy counterpart, and the synthetic dataset. The colour map indicates the count in each cell, and is consistent across the figures.
From these figures we can see that the larger counts are less affected by the noise and appear to be very accurately preserved, which makes sense - they carry a lower disclosure risk. The smaller counts, especially the unique individual in the original data, however, are perturbed to a much greater effect. This could be a problem if the users of this synthetic data are interested in studying edge-populations such as this.
Looking at the noisy table, we can see that the Gaussian noise has actually pushed some counts below zero. The generation step of MST considers the one-way marginal table and the amount of noise added to form the noisy counts, allowing the synthetic data to still resemble the original. In fact, with the noise added, the mean absolute difference between the original and synthetic counts for this marginal table is 28.0, which falls in line with the Gaussian noise added.
Limitations¶
It is important to note that MST is not a panacaea for differentially private data synthesis, and here we lay out two of its limitations.
Among the limitations of MST, though not demonstrated here, is that it does not scale for columns with large domains; bear this in mind for your applications.
Internal inconsistency¶
Internal inconsistencies are artefacts of synthetic data that should never occur in the real data. The US Census Bureau's application of DP led to a litany of these artefacts, including households with no adults and whole districts comprised of elderly residents.
The Teaching File does not have hierarchical structures like geography or households, but there are some combinations of values can never occur together. This type of internal consistency is called a structural zero. MST does nothing to maintain structural zeroes by default. The Python implementation (used in this demonstration) includes functionality for maintaining structural zeroes, but unfortunately there is no documentation available on how to include them.
As such, we can arrive at a reasonable synthetic dataset that includes thousands of individuals who belong to the student population base and who are not students:
Original¶
Student | 1 | 2 |
---|---|---|
Population Base | ||
1 | 118715 | 442325 |
2 | 6730 | 0 |
3 | 1092 | 879 |
Synthetic¶
Student | 1 | 2 |
---|---|---|
Population Base | ||
1 | 122877 | 438116 |
2 | 3038 | 3692 |
3 | 686 | 1316 |
Unobserved pairs of columns¶
Close inspection of the marginal tables above reveals another limitation of MST. That is, MST does not synthesise relationships it does not observe as accurately as those it does.
In truth, MST actually does a pretty good job of doing this because of the Private-PGM
method.
To demonstrate the efficacy of Private-PGM
in synthesising latent two-way relationships,
let us look at another pair of columns: Sex
and Population Base
.
In the Teaching File, Sex
is considered binary where 1
indicates Male and 2
indicates Female.
Below are the two-way marginals for this latent pair in the original and synthetic data. The third tab shows the difference from the original in each count. At a glance from the marginal tables, it may seem things are well-preserved still.
Then, when looking at the differences, the counts are clearly off by some way. In fact, the mean absolute difference in the marginal table is 117.0, which is considerably larger than for our pair of interest.
So, unobserved relationships are not a wash, but there is a substantial decrease in quality.
Original¶
Sex | 1 | 2 |
---|---|---|
Population Base | ||
1 | 276,174 | 284,866 |
2 | 3,387 | 3,343 |
3 | 1,008 | 963 |
Synthetic¶
Sex | 1 | 2 |
---|---|---|
Population Base | ||
1 | 276,341 | 284,652 |
2 | 3,269 | 3,461 |
3 | 981 | 1,021 |
Difference¶
Sex | 1 | 2 |
---|---|---|
Population Base | ||
1 | 167 | -214 |
2 | -118 | 118 |
3 | -27 | 58 |