The aims of this demonstration are to:

Understand the MST synthesis method from a high level
Visualise the impact of adding noise during synthesis
Observe some of the limitations of the method

To do this, we will be synthesising the 1% Teaching File and following the path of a particular set of columns.

Moreover, to emphasise the importance of adding noise in achieving disclosure control, we will focus on a pair of columns with a very small count in their two-way marginal table.

The MST method¶

Maximum Spanning Tree (MST) is a data synthesis method for synthesising categorical data while preserving privacy. The authors of MST won the 2018 NIST competition for differential privacy (DP) and data synthesis. The team won using a variant of MST tailored to the contest dataset - extracts from the 1940 US Census.

There are three main steps to MST:

Select a collection of one- and two-way marginals from the data
Measure the marginals privately by adding a controlled amount of noise to each cell count
Generate synthetic data that attempts to match the noisy marginals

MST is a flexible, scalable method that can preserve interactions in the data beyond those it measures. This accuracy for latent relationships comes in the generate step which employs the Private-PGM method. Private-PGM represents the noisy marginals as a graph with nodes and edges.

Since MST is a differentially private synthesis method, we must also specify a privacy budget. The variant of DP used in MST requires two parameters, $\epsilon > 0$ and $0 < \delta \ll \epsilon$. These parameters control the amount of noise added to marginals, and the pseudo-randomness of the selection process.

Choosing $\delta$ is relatively straightforward so long as it is smaller than $\frac{1}{N}$ where $N$ is the length of the dataset to be synthesised. Choosing the primary privacy budget $\epsilon$ is much more nuanced and broadly determines how accurate the synthetic data may be. While larger values are commonly used in practice, one of the authors of DP argues that $0.01 \approx \ln(1.01) \le \epsilon \le \ln(3) \approx 1.1$ is a suitable range.

For this demonstration, we take $\epsilon = 1$ and $\delta = 10 ^ {- ceil\left(\log_{10} N\right)} = 10 ^ {-6}$, where $ceil$ is the ceiling function.

The 1% Census¶

One of the outputs from the 2011 Census in England and Wales is the Microdata Teaching File. This file contains a 1% sample of the person records from the Census; all records are anonymised and chosen at random. Excluding the person ID, the data comprises 17 columns and 569,741 rows.

The primary purpose of the teaching file is an educational tool, but it also offers up a glance at what genuine Census microdata look like. Using this dataset to demonstrate the MST method seems sensible then, given its previous applications to census data.

Snippets¶

Below is a snapshot of the original data, and a similar snapshot of the synthetic data beneath that. While it may seem like we are skipping ahead, these snippets make clear what our end goal looks like. That is, a synthetic dataset that looks quite like the real thing.

	Region	Residence Type	Family Composition	Population Base	Sex	Age	Marital Status	Student	Country of Birth	Health	Ethnic Group	Religion	Economic Activity	Occupation	Industry	Hours worked per week	Approximated Social Grade
0	E12000001	H	2	1	2	6	2	2	1	2	1	2	5	8	2	-9	4
1	E12000001	H	5	1	1	4	1	2	1	1	1	2	1	8	6	4	3
2	E12000001	H	3	1	2	4	1	2	1	1	1	1	1	6	11	3	4
3	E12000001	H	3	1	2	2	1	2	1	2	1	2	1	7	7	3	2
4	E12000001	H	3	1	1	5	4	2	1	1	1	2	1	1	4	3	2

	Region	Residence Type	Family Composition	Population Base	Sex	Age	Marital Status	Student	Country of Birth	Health	Ethnic Group	Religion	Economic Activity	Occupation	Industry	Hours worked per week	Approximated Social Grade
0	E12000003	H	5	1	1	1	1	1	1	1	1	2	-9	-9	-9	-9	-9
1	W92000004	H	5	1	1	4	1	2	1	1	1	1	1	5	3	3	3
2	E12000007	H	2	1	1	8	2	2	1	1	1	1	5	5	3	-9	2
3	W92000004	H	2	1	1	6	2	2	1	3	1	2	5	9	10	-9	2
4	W92000004	H	2	1	2	2	1	1	1	1	1	1	6	6	11	-9	3

Our pair of columns¶

Our demonstration follows the trajectory of two columns in the teaching file, Population Base and Marital Status.

Population Base describes the individual's residency:

Usual resident
Student living away from home during term-time
Short-term resident

Marital Status describes the individual's marital status:

Single (never married or never registered a same-sex civil partnership)
Married or in a registered same-sex civil partnership
Separated but still legally married or separated but still legally in a same-sex civil partnership
Divorced or formerly in a same-sex civil partnership which is now legally dissolved
Widowed or surviving partner from a same-sex civil partnership

These columns are heavily unbalanced in the data with 98% of individuals being usual residents, and 85% of individuals either being single or in a legally binding relationship, i.e., married or in a registered same-sex civil partnership:

One-way counts¶

.tabset

Population Base¶

	count	proportion
Population Base
1	561,040	0.9847
2	6,730	0.0118
3	1,971	0.0035

Marital Status¶

	count	proportion
Marital Status
1	270,999	0.4757
2	214,180	0.3759
3	11,951	0.021
4	40,713	0.0715
5	31,898	0.056

Two-way counts¶

The primary reason we are interested in these columns, however, is because they have a two-way marginal table with some very low counts:

Marital Status	1	2	3	4	5
Population Base
1	262,807	213,777	11,930	40,664	31,862
2	6,684	28	8	9	1
3	1,508	375	13	40	35

Typically, data owners might be hesistant to disclose information like this, since there is only one individual in the data who is both a student and widowed. In the case of the 1% Teaching File, the data is so granular that reidentification of this individual is incredibly small.

However, in a more fine-grained, confidential dataset, this would be a problem.

Adding noise¶

To ensure the privacy of the individuals in the dataset, MST adds noise to all of these counts. In MST, the noise is Gaussian and centred on zero. The standard deviation of the noise distribution is determined by the privacy budget and how many measurements are taken.

MST measures all 17 one-way marginals and a selection of two-way marginals. One third of the privacy budget is spent randomly selecting these two-way marginals.

For $\epsilon = 1$ and $\delta = 10^{-6}$, MST adds $\mathcal{N}(0, 32.36)$ noise to the one-way marginal tables.

MST selects enough two-way marginals to connect all the columns using a graphical representation. In this representation, columns are nodes and an edge on the graph indicates the selection of the two-way marginal between the nodes.

As such, MST must measure a minimum of 16 two-way marginals to form a minimum spanning tree of the column nodes. In practice, the final graph is not always a tree and may contain triangles. Here, this is not the case.

For our example, MST adds $\mathcal{N}(0, 31.39)$ noise to each two-way marginal's cell counts.

Generating the synthetic data¶

With the marginals selected and measured (with added noise), they can be passed to the Private-PGM method to create tabular data. In short, Private-PGM attempts to find a data distribution that matches the noisy marginals by solving an optimisation problem.

Mapping out the synthesis¶

The synthetic dataset is generated by traversing a directed acyclic graph of the selected marginals, and sampling as need from the estimated distribution.

The figure below shows the graph for this synthesis, showing how the generation was performed.

Each column is represented as a node, and an edge between two nodes indicates that their two-way marginal was selected for synthesis. The direction of an edge (a, b) indicates that the column a is used to synthesise column b.

This graph shows that Industry was synthesised first with no dependent columns. Then the synthetic Industry column was used to synthesise Occupation, and so on.

Comparing the marginal counts¶

One-way marginals¶

.tabset

To demonstrate the impact of adding noise to the marginal tables, consider the tables below, which show the observed, noisy and synthesised counts for each our columns of interest.

Population Base¶

	observed	noisy	synthetic
Population Base
1	561,040	561,012.36	560,993
2	6,730	6,647.39	6,730
3	1,971	1,992.15	2,002

Marital Status¶

	observed	noisy	synthetic
Marital Status
1	270,999	270,970.27	270,944
2	214,180	214,115.91	214,146
3	11,951	11,939.74	11,948
4	40,713	40,718.06	40,739
5	31,898	31,937.81	31,948

We can see that for this privacy budget and this particular run of MST, these counts are all rather close. Success!

However, accurately recreating individuals columns is a relatively low bar for data synthesis.

Two-way marginals¶

The figures below show the marginal table for our pair of columns for the original data, its noisy counterpart, and the synthetic dataset. The colour map indicates the count in each cell, and is consistent across the figures.

From these figures we can see that the larger counts are less affected by the noise and appear to be very accurately preserved, which makes sense - they carry a lower disclosure risk. The smaller counts, especially the unique individual in the original data, however, are perturbed to a much greater effect. This could be a problem if the users of this synthetic data are interested in studying edge-populations such as this.

Looking at the noisy table, we can see that the Gaussian noise has actually pushed some counts below zero. The generation step of MST considers the one-way marginal table and the amount of noise added to form the noisy counts, allowing the synthetic data to still resemble the original. In fact, with the noise added, the mean absolute difference between the original and synthetic counts for this marginal table is 28.0, which falls in line with the Gaussian noise added.

Limitations¶

It is important to note that MST is not a panacaea for differentially private data synthesis, and here we lay out two of its limitations.

Among the limitations of MST, though not demonstrated here, is that it does not scale for columns with large domains; bear this in mind for your applications.

Internal inconsistency¶

.tabset

Internal inconsistencies are artefacts of synthetic data that should never occur in the real data. The US Census Bureau's application of DP led to a litany of these artefacts, including households with no adults and whole districts comprised of elderly residents.

The Teaching File does not have hierarchical structures like geography or households, but there are some combinations of values can never occur together. This type of internal consistency is called a structural zero. MST does nothing to maintain structural zeroes by default. The Python implementation (used in this demonstration) includes functionality for maintaining structural zeroes, but unfortunately there is no documentation available on how to include them.

As such, we can arrive at a reasonable synthetic dataset that includes thousands of individuals who belong to the student population base and who are not students:

Original¶

Student	1	2
Population Base
1	118715	442325
2	6730	0
3	1092	879

Synthetic¶

Student	1	2
Population Base
1	122877	438116
2	3038	3692
3	686	1316

Unobserved pairs of columns¶

.tabset

Close inspection of the marginal tables above reveals another limitation of MST. That is, MST does not synthesise relationships it does not observe as accurately as those it does.

In truth, MST actually does a pretty good job of doing this because of the Private-PGM method. To demonstrate the efficacy of Private-PGM in synthesising latent two-way relationships, let us look at another pair of columns: Sex and Population Base. In the Teaching File, Sex is considered binary where 1 indicates Male and 2 indicates Female.

Below are the two-way marginals for this latent pair in the original and synthetic data. The third tab shows the difference from the original in each count. At a glance from the marginal tables, it may seem things are well-preserved still.

Then, when looking at the differences, the counts are clearly off by some way. In fact, the mean absolute difference in the marginal table is 117.0, which is considerably larger than for our pair of interest.

So, unobserved relationships are not a wash, but there is a substantial decrease in quality.

Original¶

Sex	1	2
Population Base
1	276,174	284,866
2	3,387	3,343
3	1,008	963

Synthetic¶

Sex	1	2
Population Base
1	276,341	284,652
2	3,269	3,461
3	981	1,021

Difference¶

Sex	1	2
Population Base
1	167	-214
2	-118	118
3	-27	58

Demonstrating MST for the 1% Census

Henry Wilde

The MST method¶

The 1% Census¶

Snippets¶

Our pair of columns¶

One-way counts¶

Population Base¶

Marital Status¶

Two-way counts¶

Adding noise¶

Generating the synthetic data¶

Mapping out the synthesis¶

Comparing the marginal counts¶

One-way marginals¶

Population Base¶

Marital Status¶

Two-way marginals¶

Limitations¶

Internal inconsistency¶

Original¶

Synthetic¶

Unobserved pairs of columns¶

Original¶

Synthetic¶

Difference¶