synthgauge.metrics.nist

Functions for the generic measures from the 2018 NIST competition.

Module Contents

Functions

kway_marginals(real, synth[, k, trials, bins, seed])

A measure based on the similarity of a set of k-way marginals.

hoc(real, synth[, trials, prob, seed])

A measure based on Higher Order Conjunctions (HOC).

synthgauge.metrics.nist.kway_marginals(real, synth, k=3, trials=100, bins=100, seed=None)[source]

A measure based on the similarity of a set of k-way marginals.

This measure works as follows:

  1. Discretise all numeric features (based on the original data).

  2. Randomly select trials sets of k features and calculate the corresponding marginal counts for each dataset.

  3. Calculate the summed absolute deviation in the counts across all bins for each marginal set.

  4. Transform each summed absolute deviation, \(s\), to form a set of scores \(S = \left[\,1 - s/2 \;\middle|\; \text{for each } s\,\right]\). This transformation ensures the scores are in \([0, 1]\), where 0 is the worst case and 1 is the best case.

  5. Return the mean score.

The NIST competition utilised a set of 100 three-way marginals. Details can be found at https://doi.org/10.6028/NIST.TN.2151.
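As a rough illustration of steps 2 to 5, the following pure-Python sketch scores rows that are already discretised. It is not the synthgauge implementation: the names `marginal_score` and `kway_sketch` are hypothetical, step 1 is skipped, and the sketch assumes the deviation is taken over relative frequencies, which is what keeps \(s\) within \([0, 2]\).

```python
from collections import Counter
from itertools import combinations
import random


def marginal_score(real_rows, synth_rows, columns):
    """Score one marginal as 1 - s/2 (hypothetical helper).

    s is the summed absolute difference in relative frequencies
    over every cell observed in either dataset.
    """
    real_counts = Counter(tuple(row[c] for c in columns) for row in real_rows)
    synth_counts = Counter(tuple(row[c] for c in columns) for row in synth_rows)
    cells = set(real_counts) | set(synth_counts)
    s = sum(
        abs(real_counts[cell] / len(real_rows)
            - synth_counts[cell] / len(synth_rows))
        for cell in cells
    )
    return 1 - s / 2


def kway_sketch(real_rows, synth_rows, k=3, trials=100, seed=None):
    """Mean score over at most `trials` randomly chosen k-way marginals."""
    n_features = len(real_rows[0])
    feature_sets = list(combinations(range(n_features), k))
    if len(feature_sets) > trials:
        # Assumption: sample without replacement when there are too many sets.
        feature_sets = random.Random(seed).sample(feature_sets, trials)
    scores = [marginal_score(real_rows, synth_rows, cols) for cols in feature_sets]
    return sum(scores) / len(scores)


real = [("a", "x", 0), ("b", "y", 1)]
print(kway_sketch(real, real, k=2))                # 1.0 (identical data)
print(kway_sketch(real, [("c", "z", 2)], k=2))     # 0.0 (no shared cells)
```

Identical datasets give the best-case score of 1, and datasets with no cells in common give the worst-case score of 0.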

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • k (int, default 3) – Number of features to include in each k-way marginal. Default uses 3 (as done in the NIST competition).

  • trials (int, default 100) – Maximum number of marginals used to estimate the overall score. If there are fewer k-way combinations than trials, all combinations are used. Default uses 100 (as done in the NIST competition).

  • bins (int or str, default 100) – Binning method for sampled numeric features. Can be anything accepted by numpy.histogram. Default uses 100 bins (as done in the NIST competition).

  • seed (int or None, default None) – Random number seed. If None, results will not be reproducible.

Returns

score – The mean transformed summed absolute deviation in k-way densities.

Return type

float

synthgauge.metrics.nist.hoc(real, synth, trials=300, prob=0.1, seed=None)[source]

A measure based on Higher Order Conjunctions (HOC).

This measure compares the relative sizes of randomly selected pools of “similar” rows in the real and synthetic data. Similarity is defined across a set of randomly generated test cases applied to each dataset. Each test case consists of a set of rules.

The \(i\)-th test calculates the fraction of records satisfying its rules in the real data, \(f_{ri}\), and in the synthetic data, \(f_{si}\). Their dissimilarity in test \(i\) is quantified as:

\[d_i = \ln\left(\max(f_{si}, 10^{-6})\right) - \ln(f_{ri})\]

These dissimilarities are summarised as:

\[\Delta = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}\]

where \(N\) is the number of test cases. Finally, this is transformed to a HOC score:

\[HOC = \max \left(0, 1 + \frac{\Delta}{\ln(10^{-3})}\right)\]

This measure is bounded between 0 and 1, indicating whether the datasets are nothing alike or identical based on the test cases, respectively. In the original text this score is multiplied by 1000 to make it human-readable. Full details are available in https://doi.org/10.6028/NIST.TN.2151.
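The three formulas above can be traced with a short pure-Python sketch. The function name `hoc_score` is hypothetical and the per-test fractions are passed in directly, so this shows only the arithmetic, not how synthgauge builds its test cases. It assumes every real fraction is strictly positive, since \(\ln(f_{ri})\) is undefined otherwise.

```python
import math


def hoc_score(real_fractions, synth_fractions):
    """Combine per-test fractions into a HOC score (hypothetical helper)."""
    # d_i = ln(max(f_si, 1e-6)) - ln(f_ri); assumes every f_ri > 0.
    d = [
        math.log(max(f_s, 1e-6)) - math.log(f_r)
        for f_r, f_s in zip(real_fractions, synth_fractions)
    ]
    # Delta is the root mean square of the dissimilarities.
    delta = math.sqrt(sum(x * x for x in d) / len(d))
    # ln(1e-3) is negative, so a larger Delta lowers the score.
    return max(0.0, 1 + delta / math.log(1e-3))


print(hoc_score([0.2, 0.5], [0.2, 0.5]))  # 1.0 (identical fractions)
print(hoc_score([1.0], [0.0]))            # 0.0 (clipped at the lower bound)
```

A synthetic fraction ten times smaller than the real one in every test gives \(\Delta = \ln(10)\) and therefore a score of \(1 - 1/3 \approx 0.667\).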

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • trials (int, default 300) – Number of test cases to create. Default of 300 as in the competition.

  • prob (float, default 0.1) – Probability of any column being included in a test case. Default of 0.1 as in the competition.

  • seed (int or None, default None) – Random number seed. If None, results will not be reproducible.

Returns

score – The overall HOC score.

Return type

float

Notes

It is possible that some test cases will be “empty”, i.e. no columns are selected. In this scenario, the score for that case will be np.nan; the case is not resampled.
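To see how often empty test cases might occur, the sketch below assumes each column is included independently with probability prob, which is one reading of the parameter description rather than a confirmed detail of the implementation. Both function names are hypothetical.

```python
import random


def sample_test_columns(columns, prob=0.1, seed=None):
    """Pick columns for one test case (hypothetical helper).

    Assumption: each column is included independently with probability prob.
    """
    rng = random.Random(seed)
    return [col for col in columns if rng.random() < prob]


def empty_case_probability(n_columns, prob=0.1):
    """Chance that no column is selected under the same assumption."""
    return (1 - prob) ** n_columns


# With five columns and prob=0.1, about 59% of test cases would be empty.
print(round(empty_case_probability(5), 5))  # 0.59049
```

Under this assumption, datasets with few columns would produce many empty cases at the default prob, so a larger prob may be worth considering for narrow tables.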