synthgauge.metrics.privacy

Privacy metrics.

Module Contents

Functions

tcap_score(real, synth, key, target)

Target Correct Attribution Probability (TCAP) score.

min_nearest_neighbour(real, synth[, feats, ...])

Minimum nearest-neighbour distance.

sample_overlap_score(real, synth[, feats, ...])

Return percentage of overlap between real and synth data based on random sampling.

synthgauge.metrics.privacy.tcap_score(real, synth, key, target)

Target Correct Attribution Probability (TCAP) score.

This privacy metric calculates the average chance that the key-target pairings in a synthetic dataset reveal the true key-target pairings in the associated real dataset.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • key (list of str) – List of features in synth to use as the key.

  • target (str) – Feature to use as the target.

Returns

The average TCAP score across the dataset.

Return type

float

Notes

This metric estimates how well an intruder could infer attributes of groups in the real dataset by studying the synthetic data. The choice of key and target depends on the dataset in question. We suggest that the key features be those readily available to an outsider, and that the target be a feature we would not want them to discover, such as a protected characteristic.
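The idea of key-target attribution can be illustrated with a small, library-free sketch. This is a simplified illustration and not synthgauge's implementation: the function name `tcap_sketch`, the list-of-dicts data layout, and the rule that a real record with no key match scores 0 are all assumptions made for the example, and any filtering steps used by the full method are omitted.

```python
from collections import Counter, defaultdict


def tcap_sketch(real, synth, key, target):
    """Average, over real records, of the share of key-matched synthetic
    records that carry the same target value (0 when no key matches).

    Hypothetical helper for illustration; not part of synthgauge.
    """
    # Count target values per key combination in the synthetic data.
    synth_targets = defaultdict(Counter)
    for row in synth:
        synth_targets[tuple(row[k] for k in key)][row[target]] += 1

    scores = []
    for row in real:
        counts = synth_targets.get(tuple(row[k] for k in key))
        if not counts:
            # Key never appears in synth, so nothing is revealed (assumption).
            scores.append(0.0)
            continue
        scores.append(counts[row[target]] / sum(counts.values()))
    return sum(scores) / len(scores)


real = [
    {"age": "30-39", "region": "N", "illness": "yes"},
    {"age": "30-39", "region": "N", "illness": "no"},
    {"age": "40-49", "region": "S", "illness": "no"},
    {"age": "50-59", "region": "S", "illness": "yes"},
]
synth = [
    {"age": "30-39", "region": "N", "illness": "yes"},
    {"age": "30-39", "region": "N", "illness": "yes"},
    {"age": "40-49", "region": "S", "illness": "no"},
]

score = tcap_sketch(real, synth, key=["age", "region"], target="illness")
print(score)  # 0.5
```

Here two of the four real records have their target correctly implied by the synthetic data, one is contradicted, and one has no matching key, giving an average of 0.5.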

This method only works with categorical data, so binning of continuous data may be required.
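One way to bin a continuous feature before scoring is sketched below. The helper `bin_value`, the bin edges, and the labels are hypothetical choices for illustration. Whatever binning is used, the same edges should be applied to both the real and the synthetic data so that the categories are comparable.

```python
from bisect import bisect_right


def bin_value(value, edges, labels):
    """Map a number to the label of the half-open interval [edge_i, edge_{i+1}).

    Hypothetical helper for illustration; not part of synthgauge.
    """
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per interval")
    if not edges[0] <= value < edges[-1]:
        raise ValueError(f"{value} is outside the binning range")
    # bisect_right returns the index of the first edge greater than value.
    return labels[bisect_right(edges, value) - 1]


edges = [0, 18, 40, 65, 120]
labels = ["0-17", "18-39", "40-64", "65+"]
ages = [17, 18, 39.5, 64, 65]

binned = [bin_value(a, edges, labels) for a in ages]
print(binned)  # ['0-17', '18-39', '18-39', '40-64', '65+']
```

When the data is held in a pandas.DataFrame, `pandas.cut` offers the same kind of interval binning.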

Full details may be found in:

Taub and Elliott (2019). The Synthetic Data Challenge. The Hague, The Netherlands: Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Session 3.

synthgauge.metrics.privacy.min_nearest_neighbour(real, synth, feats=None, outliers_only=True, threshold=2, n_neighbours=5)

Minimum nearest-neighbour distance.

This privacy metric returns the smallest distance between any point in the real dataset and any point in the synthetic dataset. By default, only the outliers in the real dataset are considered, since these may pose a greater privacy concern.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – Features in real and synth to use when calculating distance. If None (default), all common features are used.

  • outliers_only (bool, default True) – Whether to exclude the inliers in the real data so that only its outliers are compared against synth (default).

  • threshold (number, default 2) – Outlier decision threshold. Increase to include fewer points from real in nearest-neighbour calculations.

  • n_neighbours (int, default 5) – Number of neighbours to consider when identifying local outliers.

Returns

Minimum Manhattan distance between real and synth data.

Return type

float

Notes

This privacy metric provides an insight into whether the synthetic dataset is too similar to the real dataset. It does this by calculating the minimum distance between the real records and the synthetic records.
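The core calculation, a minimum Manhattan distance over all real-synthetic pairs, can be sketched in plain Python as follows. This is a brute-force illustration under assumed names (`manhattan`, `min_nn_distance`) and does not reproduce the library's outlier detection; the "outlier" subset below is picked by hand purely to show how restricting the real records changes the result.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length numeric records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def min_nn_distance(real_rows, synth_rows):
    """Smallest Manhattan distance between any real and any synthetic record.

    Hypothetical brute-force helper for illustration; not part of synthgauge.
    """
    return min(manhattan(r, s) for r in real_rows for s in synth_rows)


real_rows = [(1.0, 2.0), (4.0, 0.0), (10.0, 10.0)]
synth_rows = [(1.5, 2.0), (7.0, 7.0)]

# All real records: the closest pair is (1.0, 2.0) and (1.5, 2.0).
print(min_nn_distance(real_rows, synth_rows))  # 0.5

# Hand-picked "outlier" only (an assumption, not a detected outlier).
print(min_nn_distance([(10.0, 10.0)], synth_rows))  # 6.0
```

A value close to zero suggests that at least one synthetic record nearly replicates a real one.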

This metric treats categorical data as ordinal when calculating distances, unless it has already been suitably one-hot encoded.
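The difference between the two encodings matters for distance calculations, as the following sketch suggests. The category list and the helper names `one_hot` and `l1` are made up for illustration. With integer (ordinal) codes the distance depends on the arbitrary order of the categories, whereas with one-hot encoding any two different categories are equally far apart.

```python
def one_hot(value, categories):
    """Encode value as a 0/1 vector with a single 1 at its category."""
    return [1 if value == c else 0 for c in categories]


def l1(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


categories = ["red", "green", "blue"]  # hypothetical nominal feature
ordinal = {c: i for i, c in enumerate(categories)}

# Ordinal codes: distance depends on where a category sits in the list.
print(abs(ordinal["red"] - ordinal["green"]))  # 1
print(abs(ordinal["red"] - ordinal["blue"]))   # 2

# One-hot codes: every pair of distinct categories is distance 2 apart.
print(l1(one_hot("red", categories), one_hot("green", categories)))  # 2
print(l1(one_hot("red", categories), one_hot("blue", categories)))   # 2
```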

synthgauge.metrics.privacy.sample_overlap_score(real, synth, feats=None, sample_size=0.2, runs=5, seed=None, score_type='unique')

Return percentage of overlap between real and synth data based on random sampling.

Samples from both the real and synthetic datasets are compared for similarity. This similarity, or overlap score, is based on the exact matches of real data records within the synthetic data.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – The features used to match records. If None (default), all common features are used.

  • sample_size (float or int, default 0.2) – The ratio (if sample_size is between 0 and 1) or count (if sample_size > 1) of records to sample. Default is 0.2 (20%).

  • runs (int, default 5) – The number of sampling runs to use when computing the score.

  • seed (int, optional) – Random number seed used for sampling.

  • score_type ({"unique", "sample"}, default "unique") – Method used for calculating the overlap score. If "unique" (default), the score is the percentage of unique records in the real sample that have a match within the synthetic data. If "sample", the score is the percentage of all records within the real sample that have a match within the synthetic sample.

Returns

overlap_score – Estimated overlap score between real and synth.

Return type

float
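The two score types and the ratio-or-count reading of sample_size can be illustrated with a short sketch. This is not synthgauge's implementation. The helper names, the 0 to 100 scale for the percentage, matching both score types against the synthetic sample, treating a sample_size of exactly 1 as a count, and averaging the score across runs are all assumptions made for the example.

```python
import random


def n_records(sample_size, total):
    """Turn sample_size into a record count: a ratio if below 1, else a count."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if sample_size < 1:
        return max(1, int(total * sample_size))
    return min(int(sample_size), total)


def overlap_score(real_sample, synth_sample, score_type="unique"):
    """Percentage (0 to 100) of real records with an exact match in synth_sample."""
    synth_set = set(synth_sample)
    if score_type == "unique":
        candidates = list(set(real_sample))  # each distinct record counts once
    elif score_type == "sample":
        candidates = list(real_sample)  # duplicates count every time
    else:
        raise ValueError("score_type must be 'unique' or 'sample'")
    matched = sum(rec in synth_set for rec in candidates)
    return 100 * matched / len(candidates)


def sampled_overlap(real, synth, sample_size=0.2, runs=5, seed=None,
                    score_type="unique"):
    """Average overlap_score over several random samples (averaging is assumed)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        real_sample = rng.sample(real, n_records(sample_size, len(real)))
        synth_sample = rng.sample(synth, n_records(sample_size, len(synth)))
        scores.append(overlap_score(real_sample, synth_sample, score_type))
    return sum(scores) / runs


real_sample = [("a", 1), ("a", 1), ("a", 1), ("b", 2)]
synth_sample = [("a", 1), ("c", 3)]

print(overlap_score(real_sample, synth_sample, "unique"))  # 50.0
print(overlap_score(real_sample, synth_sample, "sample"))  # 75.0
```

With "unique", the repeated record ("a", 1) counts once, so one of two distinct real records is matched. With "sample", each of its three occurrences counts, so three of four records are matched.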