The aims of this demonstration are to:

  1. Understand the MST synthesis method from a high level
  2. Visualise the impact of adding noise during synthesis
  3. Observe some of the limitations of the method

To do this, we will be synthesising the 1% Teaching File and following the path of a particular set of columns.

Moreover, to emphasise the importance of adding noise in achieving disclosure control, we will focus on a pair of columns with a very small count in their two-way marginal table.


The MST method

Maximum Spanning Tree (MST) is a data synthesis method for synthesising categorical data while preserving privacy. The authors of MST won the 2018 NIST competition for differential privacy (DP) and data synthesis. The team won using a variant of MST tailored to the contest dataset - extracts from the 1940 US Census.

There are three main steps to MST:

  1. Select a collection of one- and two-way marginals from the data
  2. Measure the marginals privately by adding a controlled amount of noise to each cell count
  3. Generate synthetic data that attempts to match the noisy marginals

MST is a flexible, scalable method that can preserve interactions in the data beyond those it measures. This accuracy for latent relationships comes in the generate step which employs the Private-PGM method. Private-PGM represents the noisy marginals as a graph with nodes and edges.

Since MST is a differentially private synthesis method, we must also specify a privacy budget. The variant of DP used in MST requires two parameters, $\epsilon > 0$ and $0 < \delta \ll \epsilon$. These parameters control the amount of noise added to marginals, and the pseudo-randomness of the selection process.

Choosing $\delta$ is relatively straightforward so long as it is smaller than $\frac{1}{N}$ where $N$ is the length of the dataset to be synthesised. Choosing the primary privacy budget $\epsilon$ is much more nuanced and broadly determines how accurate the synthetic data may be. While larger values are commonly used in practice, one of the authors of DP argues that $0.01 \approx \ln(1.01) \le \epsilon \le \ln(3) \approx 1.1$ is a suitable range.

For this demonstration, we take $\epsilon = 1$ and $\delta = 10 ^ {- ceil\left(\log_{10} N\right)} = 10 ^ {-6}$, where $ceil$ is the ceiling function.


The 1% Census

One of the outputs from the 2011 Census in England and Wales is the Microdata Teaching File. This file contains a 1% sample of the person records from the Census; all records are anonymised and chosen at random. Excluding the person ID, the data comprises 17 columns and 569,741 rows.

The primary purpose of the teaching file is an educational tool, but it also offers up a glance at what genuine Census microdata look like. Using this dataset to demonstrate the MST method seems sensible then, given its previous applications to census data.

Snippets

Below is a snapshot of the original data, and a similar snapshot of the synthetic data beneath that. While it may seem like we are skipping ahead, these snippets make clear what our end goal looks like. That is, a synthetic dataset that looks quite like the real thing.

Region Residence Type Family Composition Population Base Sex Age Marital Status Student Country of Birth Health Ethnic Group Religion Economic Activity Occupation Industry Hours worked per week Approximated Social Grade
0 E12000001 H 2 1 2 6 2 2 1 2 1 2 5 8 2 -9 4
1 E12000001 H 5 1 1 4 1 2 1 1 1 2 1 8 6 4 3
2 E12000001 H 3 1 2 4 1 2 1 1 1 1 1 6 11 3 4
3 E12000001 H 3 1 2 2 1 2 1 2 1 2 1 7 7 3 2
4 E12000001 H 3 1 1 5 4 2 1 1 1 2 1 1 4 3 2
Region Residence Type Family Composition Population Base Sex Age Marital Status Student Country of Birth Health Ethnic Group Religion Economic Activity Occupation Industry Hours worked per week Approximated Social Grade
0 E12000003 H 5 1 1 1 1 1 1 1 1 2 -9 -9 -9 -9 -9
1 W92000004 H 5 1 1 4 1 2 1 1 1 1 1 5 3 3 3
2 E12000007 H 2 1 1 8 2 2 1 1 1 1 5 5 3 -9 2
3 W92000004 H 2 1 1 6 2 2 1 3 1 2 5 9 10 -9 2
4 W92000004 H 2 1 2 2 1 1 1 1 1 1 6 6 11 -9 3

Our pair of columns

Our demonstration follows the trajectory of two columns in the teaching file, Population Base and Marital Status.

Population Base describes the individual's residency:

  1. Usual resident
  2. Student living away from home during term-time
  3. Short-term resident

Marital Status describes the individual's marital status:

  1. Single (never married or never registered a same-sex civil partnership)
  2. Married or in a registered same-sex civil partnership
  3. Separated but still legally married or separated but still legally in a same-sex civil partnership
  4. Divorced or formerly in a same-sex civil partnership which is now legally dissolved
  5. Widowed or surviving partner from a same-sex civil partnership

These columns are heavily unbalanced in the data with 98% of individuals being usual residents, and 85% of individuals either being single or in a legally binding relationship, i.e., married or in a registered same-sex civil partnership:

One-way counts

Population Base

  count proportion
Population Base    
1 561,040 0.9847
2 6,730 0.0118
3 1,971 0.0035

Marital Status

  count proportion
Marital Status    
1 270,999 0.4757
2 214,180 0.3759
3 11,951 0.021
4 40,713 0.0715
5 31,898 0.056

Two-way counts

The primary reason we are interested in these columns, however, is because they have a two-way marginal table with some very low counts:

Marital Status 1 2 3 4 5
Population Base          
1 262,807 213,777 11,930 40,664 31,862
2 6,684 28 8 9 1
3 1,508 375 13 40 35

Typically, data owners might be hesistant to disclose information like this, since there is only one individual in the data who is both a student and widowed. In the case of the 1% Teaching File, the data is so granular that reidentification of this individual is incredibly small.

However, in a more fine-grained, confidential dataset, this would be a problem.


Adding noise

To ensure the privacy of the individuals in the dataset, MST adds noise to all of these counts. In MST, the noise is Gaussian and centred on zero. The standard deviation of the noise distribution is determined by the privacy budget and how many measurements are taken.

MST measures all 17 one-way marginals and a selection of two-way marginals. One third of the privacy budget is spent randomly selecting these two-way marginals.

For $\epsilon = 1$ and $\delta = 10^{-6}$, MST adds $\mathcal{N}(0, 32.36)$ noise to the one-way marginal tables.

MST selects enough two-way marginals to connect all the columns using a graphical representation. In this representation, columns are nodes and an edge on the graph indicates the selection of the two-way marginal between the nodes.

As such, MST must measure a minimum of 16 two-way marginals to form a minimum spanning tree of the column nodes. In practice, the final graph is not always a tree and may contain triangles. Here, this is not the case.

For our example, MST adds $\mathcal{N}(0, 31.39)$ noise to each two-way marginal's cell counts.


Generating the synthetic data

With the marginals selected and measured (with added noise), they can be passed to the Private-PGM method to create tabular data. In short, Private-PGM attempts to find a data distribution that matches the noisy marginals by solving an optimisation problem.

Mapping out the synthesis

The synthetic dataset is generated by traversing a directed acyclic graph of the selected marginals, and sampling as need from the estimated distribution.

The figure below shows the graph for this synthesis, showing how the generation was performed.

Each column is represented as a node, and an edge between two nodes indicates that their two-way marginal was selected for synthesis. The direction of an edge (a, b) indicates that the column a is used to synthesise column b.

This graph shows that Industry was synthesised first with no dependent columns. Then the synthetic Industry column was used to synthesise Occupation, and so on.


Comparing the marginal counts

One-way marginals

To demonstrate the impact of adding noise to the marginal tables, consider the tables below, which show the observed, noisy and synthesised counts for each our columns of interest.

Population Base

  observed noisy synthetic
Population Base      
1 561,040 561,012.36 560,993
2 6,730 6,647.39 6,730
3 1,971 1,992.15 2,002

Marital Status

  observed noisy synthetic
Marital Status      
1 270,999 270,970.27 270,944
2 214,180 214,115.91 214,146
3 11,951 11,939.74 11,948
4 40,713 40,718.06 40,739
5 31,898 31,937.81 31,948

We can see that for this privacy budget and this particular run of MST, these counts are all rather close. Success!

However, accurately recreating individuals columns is a relatively low bar for data synthesis.

Two-way marginals

The figures below show the marginal table for our pair of columns for the original data, its noisy counterpart, and the synthetic dataset. The colour map indicates the count in each cell, and is consistent across the figures.

From these figures we can see that the larger counts are less affected by the noise and appear to be very accurately preserved, which makes sense - they carry a lower disclosure risk. The smaller counts, especially the unique individual in the original data, however, are perturbed to a much greater effect. This could be a problem if the users of this synthetic data are interested in studying edge-populations such as this.

Looking at the noisy table, we can see that the Gaussian noise has actually pushed some counts below zero. The generation step of MST considers the one-way marginal table and the amount of noise added to form the noisy counts, allowing the synthetic data to still resemble the original. In fact, with the noise added, the mean absolute difference between the original and synthetic counts for this marginal table is 28.0, which falls in line with the Gaussian noise added.


Limitations

It is important to note that MST is not a panacaea for differentially private data synthesis, and here we lay out two of its limitations.

Among the limitations of MST, though not demonstrated here, is that it does not scale for columns with large domains; bear this in mind for your applications.

Internal inconsistency

Internal inconsistencies are artefacts of synthetic data that should never occur in the real data. The US Census Bureau's application of DP led to a litany of these artefacts, including households with no adults and whole districts comprised of elderly residents.

The Teaching File does not have hierarchical structures like geography or households, but there are some combinations of values can never occur together. This type of internal consistency is called a structural zero. MST does nothing to maintain structural zeroes by default. The Python implementation (used in this demonstration) includes functionality for maintaining structural zeroes, but unfortunately there is no documentation available on how to include them.

As such, we can arrive at a reasonable synthetic dataset that includes thousands of individuals who belong to the student population base and who are not students:

Original

Student 1 2
Population Base
1 118715 442325
2 6730 0
3 1092 879

Synthetic

Student 1 2
Population Base
1 122877 438116
2 3038 3692
3 686 1316

Unobserved pairs of columns

Close inspection of the marginal tables above reveals another limitation of MST. That is, MST does not synthesise relationships it does not observe as accurately as those it does.

In truth, MST actually does a pretty good job of doing this because of the Private-PGM method. To demonstrate the efficacy of Private-PGM in synthesising latent two-way relationships, let us look at another pair of columns: Sex and Population Base. In the Teaching File, Sex is considered binary where 1 indicates Male and 2 indicates Female.

Below are the two-way marginals for this latent pair in the original and synthetic data. The third tab shows the difference from the original in each count. At a glance from the marginal tables, it may seem things are well-preserved still.

Then, when looking at the differences, the counts are clearly off by some way. In fact, the mean absolute difference in the marginal table is 117.0, which is considerably larger than for our pair of interest.

So, unobserved relationships are not a wash, but there is a substantial decrease in quality.

Original

Sex 1 2
Population Base    
1 276,174 284,866
2 3,387 3,343
3 1,008 963

Synthetic

Sex 1 2
Population Base    
1 276,341 284,652
2 3,269 3,461
3 981 1,021

Difference

Sex 1 2
Population Base    
1 167 -214
2 -118 118
3 -27 58