synthgauge.metrics

A submodule for all utility and privacy metrics.

Package Contents

Functions

classification_comparison(real, synth, feats, target, ...) – Classification utility metric.
clustered_msd(real, synth[, feats, method, k_min, ...]) – (Multiple) clustered mean-squared difference (MSD).
correlation_msd(real, synth[, method, feats]) – Mean-squared difference in correlation coefficients.
correlation_ratio_msd(real, synth[, categorical, numeric]) – Correlation ratio mean-squared difference.
feature_density_mad(real, synth[, feats, bins]) – Mean absolute difference of feature densities.
hoc(real, synth[, trials, prob, seed]) – A measure based on Higher Order Conjunctions (HOC).
kway_marginals(real, synth[, k, trials, bins, seed]) – A measure based on the similarity of a set of k-way marginals.
min_nearest_neighbour(real, synth[, feats, ...]) – Minimum nearest-neighbour distance.
sample_overlap_score(real, synth[, feats, ...]) – Return percentage of overlap between real and synth data based on random sampling.
tcap_score(real, synth, key, target) – Target Correct Attribution Probability (TCAP) score.
pmse(combined, indicator, method, **kwargs) – Calculate the propensity score mean-squared error (pMSE).
pmse_ratio(combined, indicator, method[, num_perms, ...]) – The propensity score mean-squared error ratio.
pmse_standardised(combined, indicator, method[, ...]) – The standardised propensity score mean-squared error.
propensity_metrics(real, synth[, method, feats, ...]) – Propensity score-based metrics.
specks(real, synth, classifier, **kwargs) – Propensity score comparison via the Kolmogorov-Smirnov distance.
jensen_shannon_distance(real, synth, feature[, bins]) – Jensen-Shannon distance.
jensen_shannon_divergence(real, synth, feature[, bins]) – Jensen-Shannon divergence.
kolmogorov_smirnov(real, synth, feature, **kwargs) – Kolmogorov-Smirnov test.
kruskal_wallis(real, synth, feature, **kwargs) – Kruskal-Wallis H test.
kullback_leibler(real, synth, feature[, bins]) – Kullback-Leibler divergence.
mann_whitney(real, synth, feature, **kwargs) – Mann-Whitney U test.
wasserstein(real, synth, feature, **kwargs) – The (first) Wasserstein distance.
wilcoxon(real, synth, feature, **kwargs) – Wilcoxon signed-rank test.

synthgauge.metrics.classification_comparison(real, synth, feats, target, classifier, test_prop=0.2, random_state=None, **kwargs)[source]

Classification utility metric.

This metric fits two (identical) classification models to real and synth, and then tests them both against withheld real data. We obtain utility scores by subtracting the precision, recall and f1 scores of the “synthetic” model predictions from the “real” model’s.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str) – List of column names to use as the input in the classification.

  • target (str) – Column to use as target in the classification.

  • classifier (scikit-learn estimator) – Classifier class with fit and predict methods.

  • test_prop (float or int, default 0.2) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.

  • random_state (int, optional) – Random seed for shuffling during the train-test split, and for the classification algorithm itself.

  • **kwargs (dict, optional) – Keyword arguments passed to the classifier.

Returns

  • precision_difference (float) – Precision of the real model subtracted by that of the synthetic model.

  • recall_difference (float) – Recall of the real model subtracted by that of the synthetic model.

  • f1_difference (float) – f1 score of the real model subtracted by that of the synthetic model.

Notes

Some preprocessing is carried out before the models are trained. Numeric features are scaled and categorical features are one-hot-encoded.

A score of zero tells us the synthetic data is just as good as the real at training the given classification model. Increases in these scores indicate poorer utility.
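
Examples

A minimal usage sketch (illustrative, not part of the original documentation); the toy data, feature and target names, and the choice of scikit-learn classifier are all assumptions:

>>> import pandas as pd
>>> from sklearn.tree import DecisionTreeClassifier
>>> from synthgauge.metrics import classification_comparison
>>> real = pd.DataFrame(
...     {"age": [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
...      "bought": [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]}
... )
>>> synth = pd.DataFrame(
...     {"age": [22, 27, 33, 34, 41, 44, 52, 56, 61, 64],
...      "bought": [0, 1, 0, 1, 1, 0, 1, 1, 1, 0]}
... )
>>> precision_diff, recall_diff, f1_diff = classification_comparison(
...     real, synth, feats=["age"], target="bought",
...     classifier=DecisionTreeClassifier, random_state=0
... )

Differences close to zero indicate that training on the synthetic data is roughly as effective as training on the real data.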

synthgauge.metrics.clustered_msd(real, synth, feats=None, method='kmeans', k_min=2, k_max=10, random_state=None)[source]

(Multiple) clustered mean-squared difference (MSD).

This metric clusters the real and synthetic data together, measuring the synthetic utility according to its representation across the fitted clusters. Since there is often no obvious choice for the number of clusters, \(k\), we consider a range of values.

For each value of \(k\), the chosen clustering method is fit and the proportion of synthetic data in each cluster is recorded. The clustered MSD is then calculated as the mean-squared difference between these proportions and the overall proportion of synthetic data.

This collection of MSDs is summarised by taking its minimum to give the metric value.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – Features to use in the clustering. If None (default), all common features are used.

  • method ({"kmeans", "kprototypes"}, default "kmeans") – Clustering method to use. Only k-means and k-prototypes are implemented. If using k-means (default), only numeric columns are considered, while k-prototypes allows for mixed-type clustering.

  • k_min (int, default 2) – Minimum number of clusters to use. Default of 2.

  • k_max (int, default 10) – Maximum number of clusters to use. Default of 10.

  • random_state (int, optional) – The random seed used to fit the clustering algorithm.

Returns

The minimum observed clustered MSD.

Return type

float

Notes

This function can be used with a single value of k by setting k_min and k_max both to k; for instance, when a sensible number of clusters is known a priori.

This metric says nothing about how appropriate the clustering method may be for the data at hand, nor how the data are distributed among the clusters. Both methods considered here have rather strong assumptions about the relative size and characteristics of the clusters in the data. As such, exploratory analysis is advised to determine whether such centroid-based clustering is well-suited.
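
Examples

A brief usage sketch (not part of the original documentation); the numeric toy columns and the range of k are illustrative:

>>> import pandas as pd
>>> from synthgauge.metrics import clustered_msd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 4, 2, 5, 7, 6], "b": [1, 4, 5, 6, 7, 7, 2, 3]}
... )
>>> synth = pd.DataFrame(
...     {"a": [3, 3, 2, 0, 1, 4, 6, 5], "b": [1, 1, 1, 2, 2, 3, 5, 4]}
... )
>>> msd = clustered_msd(
...     real, synth, method="kmeans", k_min=2, k_max=4, random_state=0
... )

Values of msd near zero suggest the synthetic rows are spread across the fitted clusters in roughly the same proportion as the real rows.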

synthgauge.metrics.correlation_msd(real, synth, method='pearson', feats=None)[source]

Mean-squared difference in correlation coefficients.

This metric calculates the mean squared difference between the correlation matrices for the real and synthetic datasets. This gives an indication of how well the synthetic data has retained bivariate relationships.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • method ({"pearson", "spearman", "cramers_v"}, default "pearson") –

  • feats (list of str or None, default None) – Features to measure correlation across. If method=”cramers_v”, all numeric columns will be filtered out. Likewise, for the other correlation methods, all non-numeric columns are removed. If None (default), all common features that satisfy the needs of method are used.

Returns

Mean-squared difference of correlation coefficients.

Return type

float

See also

numpy.corrcoef

Notes

The smaller the number this function returns, the better the synthetic data captures the correlation between variables in the real data. This method is therefore good for comparing multiple synthetic datasets to see which is closest to the real. However, as this is not a test, there is no threshold value below which we can claim the datasets are statistically the same.

We only consider the coefficients above the main diagonal when calculating the MSD. If we included the entire matrix, we would double-count each pair of features as well as including the trivial ones along the main diagonal.
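
Examples

A minimal sketch of how this might be called (not from the original documentation); the columns are illustrative numeric features:

>>> import pandas as pd
>>> from synthgauge.metrics import correlation_msd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]}
... )
>>> synth = pd.DataFrame(
...     {"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]}
... )
>>> msd = correlation_msd(real, synth, method="pearson")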

synthgauge.metrics.correlation_ratio_msd(real, synth, categorical=None, numeric=None)[source]

Correlation ratio mean-squared difference.

This metric calculates the mean-squared difference in association between categorical and continuous feature pairings in the real and synthetic datasets.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • categorical (list of str or None, default None) – Categorical features in real and synth to include in comparison. If None (default), uses all common object-type columns.

  • numeric (list of str or None, default None) – Numerical features in real and synth to include in comparison. If None (default), uses all common columns not selected by categorical.

Returns

Mean squared difference between real and synth in correlation ratio scores across all categorical-continuous feature pairs.

Return type

float
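
Examples

An illustrative call (not part of the original documentation), assuming one object-type column and one numeric column in each dataset:

>>> import pandas as pd
>>> from synthgauge.metrics import correlation_ratio_msd
>>> real = pd.DataFrame(
...     {"group": ["x", "x", "y", "y", "z", "z"],
...      "value": [1, 2, 5, 6, 9, 10]}
... )
>>> synth = pd.DataFrame(
...     {"group": ["x", "y", "y", "z", "z", "x"],
...      "value": [2, 4, 6, 8, 9, 1]}
... )
>>> msd = correlation_ratio_msd(
...     real, synth, categorical=["group"], numeric=["value"]
... )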

synthgauge.metrics.feature_density_mad(real, synth, feats=None, bins=10)[source]

Mean absolute difference of feature densities.

For each feature, the densities of the real and synthetic data are computed across a common set of bins, and the absolute differences between them are taken. The mean of these absolute differences across all features and bins is then returned. A value close to 0 indicates that the real and synthetic datasets have a similar set of feature distributions.

Parameters
  • real (pandas.DataFrame) – DataFrame containing the real data.

  • synth (pandas.DataFrame) – DataFrame containing the synthetic data.

  • feats (list of str or None, default None) – The features that will be used to compute the densities. If None (default), all common features are used.

  • bins (str or int, default 10) – Binning method for discretising the data. Can be anything accepted by numpy.histogram_bin_edges. Default uses 10 bins.

Returns

Mean absolute error of feature densities.

Return type

float
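
Examples

A brief usage sketch (illustrative, not from the original documentation):

>>> import pandas as pd
>>> from synthgauge.metrics import feature_density_mad
>>> real = pd.DataFrame({"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]})
>>> synth = pd.DataFrame({"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]})
>>> mad = feature_density_mad(real, synth, bins=5)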

synthgauge.metrics.hoc(real, synth, trials=300, prob=0.1, seed=None)[source]

A measure based on Higher Order Conjunctions (HOC).

This measure compares the relative sizes of randomly selected pools of “similar” rows in the real and synthetic data. Similarity is defined by a set of randomly generated test cases applied to each dataset, where each test case consists of a set of rules.

The \(i\)-th test calculates the fraction of records satisfying its rules in the real data, \(f_{ri}\), and the synthetic, denoted \(f_{si}\). Their dissimilarity in test \(i\) is quantified as:

\[d_i = \ln\left(\max(f_{si}, 10^{-6})\right) - \ln(f_{ri})\]

These dissimilarities are summarised as:

\[\Delta = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}\]

where \(N\) is the number of test cases. Finally, this is transformed to a HOC score:

\[HOC = \max \left(0, 1 + \frac{\Delta}{\ln(10^{-3})}\right)\]

This measure is bounded between 0 and 1, indicating whether the datasets are nothing alike or identical based on the test cases, respectively. In the original text this score is multiplied by 1000 to make it human-readable. Full details are available in https://doi.org/10.6028/NIST.TN.2151.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • trials (int, default 300) – Number of test cases to create. Default of 300 as in the competition.

  • prob (float, default 0.1) – Probability of any column being included in a test case. Default of 0.1 as in the competition.

  • seed (int or None, default None) – Random number seed. If None, results will not be reproducible.

Returns

score – The overall HOC score.

Return type

float

Notes

It is possible that some test cases will be “empty”, i.e. when no columns are selected. In this scenario, the score for that case will be np.nan rather than it being resampled.
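
Examples

A small usage sketch (not from the original documentation); the toy data, number of trials and column-inclusion probability are illustrative, and a seed is passed so the result is reproducible:

>>> import pandas as pd
>>> from synthgauge.metrics import hoc
>>> real = pd.DataFrame({"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]})
>>> synth = pd.DataFrame({"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]})
>>> score = hoc(real, synth, trials=50, prob=0.5, seed=0)

Scores near one indicate the datasets respond similarly to the sampled test cases.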

synthgauge.metrics.kway_marginals(real, synth, k=3, trials=100, bins=100, seed=None)[source]

A measure based on the similarity of a set of k-way marginals.

This measure works as follows:

  1. Discretise all numeric features (based on the original data).

  2. Randomly select trials sets of k features and calculate the corresponding marginal counts for each dataset.

  3. Calculate summed absolute deviation in the counts across all bins and marginal sets.

  4. Transform each summed absolute deviation, \(s\), into a score \(1 - s/2\), collecting these into the set \(S\). This transformation ensures the scores are in \([0, 1]\). These extremes represent the worst- and best-case scenarios, respectively.

  5. Return the mean score.

The NIST competition utilised a set of 100 three-way marginals. Details can be found at https://doi.org/10.6028/NIST.TN.2151.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • k (int, default 3) – Number of features to include in each k-way marginal. Default uses 3 (as done in the NIST competition).

  • trials (int, default 100) – Maximum number of marginals to consider to estimate the overall score. If there are fewer k-way combinations than trials, tries all. Default uses 100 (as done in the NIST competition).

  • bins (int or str, default 100) – Binning method for sampled numeric features. Can be anything accepted by numpy.histogram. Default uses 100 bins (as done in the NIST competition).

  • seed (int or None, default None) – Random number seed. If None, results will not be reproducible.

Returns

score – The mean transformed sum absolute deviation in k-way densities.

Return type

float
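
Examples

An illustrative call (not from the original documentation); small values of k, trials and bins are used to suit the toy data:

>>> import pandas as pd
>>> from synthgauge.metrics import kway_marginals
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4],
...      "c": [1, 1, 0, 0, 1, 0]}
... )
>>> synth = pd.DataFrame(
...     {"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3],
...      "c": [0, 1, 1, 0, 0, 1]}
... )
>>> score = kway_marginals(real, synth, k=2, trials=3, bins=4, seed=0)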

synthgauge.metrics.min_nearest_neighbour(real, synth, feats=None, outliers_only=True, threshold=2, n_neighbours=5)[source]

Minimum nearest-neighbour distance.

This privacy metric returns the smallest distance between any point in the real dataset and any point in the synthetic dataset. There is an option to only consider the outliers in the real dataset as these perhaps pose more of a privacy concern.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – Features in real and synth to use when calculating distance. If None (default), all common features are used.

  • outliers_only (bool, default True) – Whether to consider only the outliers of the real data (default) or all real records.

  • threshold (number, default 2) – Outlier decision threshold. Increase to include fewer points from real in nearest-neighbour calculations.

  • n_neighbours (int, default 5) – Number of neighbours to consider when identifying local outliers.

Returns

Minimum Manhattan distance between real and synth data.

Return type

float

Notes

This privacy metric provides an insight into whether the synthetic dataset is too similar to the real dataset. It does this by calculating the minimum distance between the real records and the synthetic records.

This metric assumes that categorical data is ordinal during distance calculations, or that it has already been suitably one-hot-encoded.
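
Examples

A short sketch of a call on purely numeric data (illustrative, not from the original documentation); outliers_only=False is used here so that every real record is considered:

>>> import pandas as pd
>>> from synthgauge.metrics import min_nearest_neighbour
>>> real = pd.DataFrame({"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]})
>>> synth = pd.DataFrame({"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]})
>>> distance = min_nearest_neighbour(real, synth, outliers_only=False)

Very small distances flag synthetic records that sit almost on top of real ones.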

synthgauge.metrics.sample_overlap_score(real, synth, feats=None, sample_size=0.2, runs=5, seed=None, score_type='unique')[source]

Return percentage of overlap between real and synth data based on random sampling.

Samples from both the real and synthetic datasets are compared for similarity. This similarity, or overlap score, is based on the exact matches of real data records within the synthetic data.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – The features used to match records. If None (default), all common features are used.

  • sample_size (float or int, default 0.2) – The ratio (if sample_size between 0 and 1) or count (sample_size > 1) of records to sample. Default is 0.2 (20%).

  • runs (int, default 5) – The number of sampling runs to use when computing the score.

  • seed (int, optional) – Random number seed used for sampling.

  • score_type ({"unique", "sample"}, default "unique") – Method used for calculating the overlap score. If “unique” (default), the score is the percentage of unique records in the real sample that have a match within the synthetic data. If “sample”, the score is the percentage of all records within the real sample that have a match within the synth sample.

Returns

overlap_score – Estimated overlap score between real and synth.

Return type

float
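
Examples

A brief usage sketch (not part of the original documentation); a seed is set so the sampling is reproducible:

>>> import pandas as pd
>>> from synthgauge.metrics import sample_overlap_score
>>> real = pd.DataFrame({"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]})
>>> synth = pd.DataFrame({"a": [0, 1, 2, 0, 2, 3], "b": [3, 7, 1, 1, 6, 3]})
>>> overlap = sample_overlap_score(
...     real, synth, sample_size=0.5, runs=3, seed=0
... )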

synthgauge.metrics.tcap_score(real, synth, key, target)[source]

Target Correct Attribution Probability (TCAP) score.

This privacy metric calculates the average chance that the key-target pairings in a synthetic dataset reveal the true key-target pairings in the associated real dataset.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • key (list of str) – List of features in synth to use as the key.

  • target (str) – Feature to use as the target.

Returns

The average TCAP score across the dataset.

Return type

float

Notes

This metric provides an estimate of how well an intruder could infer attributes of groups in the real dataset by studying the synthetic. The choices for key and target will vary depending on the dataset in question but we would suggest the key features are those that could be readily available to an outsider and the target feature is one we wouldn’t want them finding out, such as a protected characteristic.

This method only works with categorical data, so binning of continuous data may be required.

Full details may be found in:

Taub and Elliott (2019). The Synthetic Data Challenge. The Hague, The Netherlands: Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Session 3.
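
Examples

An illustrative call on wholly categorical data (not from the original documentation); which columns act as the key and which as the target is an assumption here:

>>> import pandas as pd
>>> from synthgauge.metrics import tcap_score
>>> real = pd.DataFrame(
...     {"sex": ["f", "m", "f", "m", "f", "m"],
...      "region": ["n", "n", "s", "s", "n", "s"],
...      "disease": ["yes", "no", "no", "yes", "yes", "no"]}
... )
>>> synth = pd.DataFrame(
...     {"sex": ["f", "m", "f", "m", "m", "f"],
...      "region": ["n", "s", "s", "n", "n", "s"],
...      "disease": ["yes", "no", "no", "no", "yes", "yes"]}
... )
>>> score = tcap_score(real, synth, key=["sex", "region"], target="disease")

Scores near one suggest the synthetic data gives an intruder a good chance of recovering the real key-target pairings.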

synthgauge.metrics.pmse(combined, indicator, method, **kwargs)[source]

Calculate the propensity score mean-squared error (pMSE).

Parameters
  • combined (pandas.DataFrame) – The combined set of real and synthetic data.

  • indicator (numpy.ndarray) – An indicator for which data are real (0) or synthetic (1).

  • method ({"cart", "logr"}) – Which propensity model to use. Must be either CART (“cart”) or logistic regression with first-order interactions (“logr”).

  • **kwargs (dict, optional) – Keyword arguments passed to propensity model.

Returns

Propensity score mean-squared error.

Return type

float

See also

sklearn.linear_model.LogisticRegression, sklearn.tree.DecisionTreeClassifier

Notes

Propensity scores represent probabilities of group membership. By modelling whether an example is synthetic or not, we can use propensity scores as a measure of utility.

This returns zero if the distributions are identical, and is bounded above by \(1 - c\) if they are nothing alike, where \(c\) is the proportion of the data that is synthetic. This method is therefore good for comparing multiple synthetic datasets. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same.

This function assumes that some preprocessing has been carried out so that the data is ready to be passed to the classification function. Encoding of categorical data is performed, but, for example, scaling is not. Without this, erroneous results may be returned. The logistic regression can fail to converge if many variables are considered. Anecdotally, this doesn’t seem to drastically impact the propensity scores, although this should be investigated formally.

Using a CART model as a classifier is recommended in the literature; however, we also support the use of logistic regression. For further details, see: https://doi.org/10.1111/rssa.12358
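
Examples

A minimal sketch of preparing the inputs and calling the function (not from the original documentation); the stacking and indicator construction broadly mirror what propensity_metrics does for you, and random_state is simply a keyword passed on to the CART model:

>>> import numpy as np
>>> import pandas as pd
>>> from synthgauge.metrics import pmse
>>> real = pd.DataFrame({"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]})
>>> synth = pd.DataFrame({"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]})
>>> combined = pd.concat([real, synth], ignore_index=True)
>>> indicator = np.array([0] * len(real) + [1] * len(synth))
>>> error = pmse(combined, indicator, method="cart", random_state=0)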

synthgauge.metrics.pmse_ratio(combined, indicator, method, num_perms=None, estimator='perm', **kwargs)[source]

The propensity score mean-squared error ratio.

This is the ratio of observed pMSE to that expected under the null case, i.e.

\[ratio(pMSE) = \frac{pMSE}{E(pMSE)}\]
Parameters
  • combined (pandas.DataFrame) – Dataframe containing the combined real and synthetic data.

  • indicator (numpy.ndarray) – Indicator for whether data are real (0) or synthetic (1).

  • method ({"cart", "logr"}) – Which propensity model to use. Must be either CART (“cart”) or logistic regression with first-order interactions (“logr”).

  • num_perms (int, optional) – Number of permutations to consider when estimating the null case statistics with a CART model.

  • estimator ({"perm", "boot"}) – Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility.

  • **kwargs (dict, optional) – Keyword arguments passed to the propensity model classifier.

Returns

The observed-to-null pMSE ratio.

Return type

float

Notes

The interpretation of this metric makes more sense for synthetic data. The pMSE alone gives better utility as the value gets closer to zero, which is only attainable when the datasets are identical. However, when generating synthetic data, we do not want to produce identical entries. Rather, we want to achieve similarity between the distributions of the real and synthetic datasets.

This ratio tends towards one when this is achieved, and increases otherwise.

Note that the random_state keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.
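
Examples

A short sketch (not from the original documentation), reusing combined and indicator exactly as constructed in the pmse example above:

>>> from synthgauge.metrics import pmse_ratio
>>> ratio = pmse_ratio(
...     combined, indicator, method="cart", num_perms=10, random_state=0
... )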

synthgauge.metrics.pmse_standardised(combined, indicator, method, num_perms=None, estimator='perm', **kwargs)[source]

The standardised propensity score mean-squared error.

This takes the observed pMSE and standardises it against the null case, i.e.

\[stand(pMSE) = (pMSE - E(pMSE)) / sd(pMSE)\]
Parameters
  • combined (pandas.DataFrame) – Dataframe containing the combined real and synthetic data.

  • indicator (numpy.ndarray) – Indicator for whether data are real (0) or synthetic (1).

  • method ({"cart", "logr"}) – Which propensity model to use. Must be either CART (“cart”) or logistic regression with first-order interactions (“logr”).

  • num_perms (int, optional) – Number of permutations to consider when estimating the null case statistics with a CART model.

  • estimator ({"perm", "boot"}) – Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility.

  • **kwargs (dict, optional) – Keyword arguments passed to the propensity model.

Returns

The null-standardised pMSE.

Return type

float

Notes

The interpretation of this metric makes more sense for synthetic data. The pMSE alone indicates better utility as it gets closer to zero, which is only attainable when the datasets are identical. However, when generating synthetic data, we do not want to produce identical entries. Rather, we want to achieve similarity between the distributions of the real and synthetic datasets.

This standardised value tends towards zero when this is achieved, and increases in magnitude otherwise.

Note that the random_state keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.
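
Examples

A short sketch (not from the original documentation), again reusing combined and indicator as constructed in the pmse example above:

>>> from synthgauge.metrics import pmse_standardised
>>> standardised = pmse_standardised(
...     combined, indicator, method="cart", num_perms=10, random_state=0
... )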

synthgauge.metrics.propensity_metrics(real, synth, method='cart', feats=None, num_perms=20, estimator='perm', **kwargs)[source]

Propensity score-based metrics.

This function calculates three metrics based on the propensity score mean-squared error (pMSE), all of which quantify utility by measuring the distinguishability of the synthetic data. That is, how readily real and synthetic data can be identified.

To do this, the datasets are combined and their origins tracked by a boolean indicator. This combined dataset is then used to fit a binary classification model (CART or logistic regression with first-order interactions) with the indicator as the target. The propensity score for each row is then extracted and summarised to give a metric.

The returned metrics are the observed pMSE along with the pMSE ratio and standardised pMSE. These second two metrics are given relative to the null case where the real and synthetic data are produced from identical processes.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • method ({"cart", "logr"}, default "cart") – Which propensity model to use. Must be either CART (“cart”) or logistic regression with first-order interactions (“logr”).

  • feats (list of str or None, default None) – List of features in the dataset to be used in the propensity model. If None (default), all common features are used.

  • num_perms (int, default 20) – Number of permutations to consider when estimating the null case statistics with a CART model.

  • estimator ({"perm", "boot"}) – Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility.

  • **kwargs (dict, optional) – Keyword arguments passed to the propensity model.

Returns

  • observed (float) – The observed pMSE.

  • standard (float) – The null-standardised pMSE.

  • ratio (float) – The observed-to-null pMSE ratio.

Raises

ValueError – If method is not one of ‘cart’ or ‘logr’.

See also

sklearn.linear_model.LogisticRegression, sklearn.tree.DecisionTreeClassifier, synthgauge.metrics.propensity.pmse, synthgauge.metrics.propensity.pmse_ratio, synthgauge.metrics.propensity.pmse_standardised

Notes

For the CART model, sklearn.tree.DecisionTreeClassifier is used. Meanwhile, the logistic regression model uses sklearn.linear_model.LogisticRegression.

Note that the random_state keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.

Details on these metrics can be found at: https://doi.org/10.1111/rssa.12358
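
Examples

A compact usage sketch (not part of the original documentation); the toy data are illustrative, random_state is passed through to the CART model, and the three returned values are unpacked in the order given in the Returns section:

>>> import pandas as pd
>>> from synthgauge.metrics import propensity_metrics
>>> real = pd.DataFrame({"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]})
>>> synth = pd.DataFrame({"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]})
>>> observed, standardised, ratio = propensity_metrics(
...     real, synth, method="cart", num_perms=10, random_state=0
... )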

synthgauge.metrics.specks(real, synth, classifier, **kwargs)[source]

Propensity score comparison via the Kolmogorov-Smirnov distance.

The SPECKS metric was originally presented in https://arxiv.org/pdf/1803.06763.pdf and works as follows:

  1. Stack the real and synthetic data, and create a variable indicating whether each record is real (0) or synthetic (1).

  2. Calculate the propensity score for each record using a binary classifier on the indicator variable.

  3. Compute the Kolmogorov-Smirnov distance between the empirical CDFs for the real and synthetic propensity scores.

The Kolmogorov-Smirnov distance is defined as the maximum difference between two empirical distributions. Therefore, it is bounded between zero and one. If the synthetic data properly resembles the original data then they will be indistinguishable, leading to close empirical CDFs.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • classifier (scikit-learn estimator) – Any scikit-learn-style classifier class with a predict_proba method.

  • **kwargs (dict, optional) – Keyword arguments to be passed to the classifier.

Returns

The Kolmogorov-Smirnov distance between the real and synthetic propensity score CDFs.

Return type

float

Notes

The combined dataset is one-hot-encoded before being passed to the classifier so categorical features can be handled.

The paper introducing SPECKS has also been published in METRON: https://doi.org/10.1007/s40300-021-00201-0.
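
Examples

A minimal sketch (not from the original documentation); any scikit-learn classifier class exposing predict_proba should work, and logistic regression is used here purely as an illustration:

>>> import pandas as pd
>>> from sklearn.linear_model import LogisticRegression
>>> from synthgauge.metrics import specks
>>> real = pd.DataFrame({"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]})
>>> synth = pd.DataFrame({"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]})
>>> distance = specks(real, synth, LogisticRegression, max_iter=500)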

synthgauge.metrics.jensen_shannon_distance(real, synth, feature, bins='auto', **kwargs)[source]

Jensen-Shannon distance.

Describes the difference between two distributions in terms of entropy. Calculated as the square root of the Jensen-Shannon divergence, the Jensen-Shannon distance satisfies the mathematical definition of a metric.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Feature of the datasets to compare. This must be continuous.

  • bins (int or str or None, default "auto") – The binning method to use. If int, is the number of bins. If str, must be a method accepted by numpy.histogram_bin_edges. If None, the feature is assumed to be categorical and counts are taken for each value in either dataset.

  • **kwargs (dict, optional) – Keyword arguments for scipy.spatial.distance.jensenshannon.

Returns

distance – The computed distance between the distributions.

Return type

float

See also

synthgauge.metrics.univariate_distance.jensen_shannon_divergence, scipy.spatial.distance.jensenshannon

Notes

This is a wrapper for scipy.spatial.distance.jensenshannon. Since this function expects probability vectors, the data is first discretised into evenly-spaced bins.

We can think of the Jensen-Shannon distance as the amount of information, or entropy, encoded in the difference between the real and synth distributions of the feature.

The distance is zero if the distributions are identical, and is bounded above by one if they are nothing alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same.

An optimal ‘bins’ value has not been suggested.

Examples

>>> import pandas as pd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 4], "b": ["cat", "cow", "dog", "emu"]}
... )
>>> synth = pd.DataFrame(
...     {"a": [3, 3, 2, 0], "b": ["cat", "cat", "cow", "dog"]}
... )

The second feature appears to be more similar than the first across datasets:

>>> jensen_shannon_distance(real, synth, "a")
0.41627730557884884
>>> jensen_shannon_distance(real, synth, "b", bins=None)
0.328452092654953
synthgauge.metrics.jensen_shannon_divergence(real, synth, feature, bins='auto', **kwargs)[source]

Jensen-Shannon divergence.

Also known as the information radius, the Jensen-Shannon divergence describes the similarity between two probability distributions in terms of entropy. This divergence modifies the Kullback-Leibler divergence to be symmetric and finite (between 0 and 1).

The divergence does not satisfy the triangle inequality. Thus, it does not describe “distance” in the mathematical sense. Taking its square root provides a metric known as the Jensen-Shannon distance.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Feature of the datasets to compare. This must be continuous.

  • bins (int or str or None, default "auto") – The binning method to use. If int, is the number of bins. If str, must be a method accepted by numpy.histogram_bin_edges. If None, the feature is assumed to be categorical and counts are taken for each value in either dataset.

  • **kwargs (dict, optional) – Keyword arguments for scipy.spatial.distance.jensenshannon.

Returns

The computed divergence between the distributions.

Return type

float

See also

synthgauge.metrics.univariate_distance.jensen_shannon_distance, scipy.spatial.distance.jensenshannon

Notes

This is a wrapper of synthgauge.metrics.univariate_distance.jensen_shannon_distance, which in turn wraps scipy.spatial.distance.jensenshannon. Since this function expects probability vectors, the data is first discretised into evenly-spaced bins.

We can think of the Jensen-Shannon divergence as the amount of information, or entropy, encoded in the difference between the real and synthetic distributions of the feature.

The divergence is zero if the distributions are identical, and is bounded above by one if they are nothing alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same.

An optimal ‘bins’ value has not been suggested.

Examples

>>> import pandas as pd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 4], "b": ["cat", "cow", "dog", "emu"]}
... )
>>> synth = pd.DataFrame(
...     {"a": [3, 3, 2, 0], "b": ["cat", "cat", "cow", "dog"]}
... )

The second feature appears to be more similar than the first across datasets:

>>> jensen_shannon_divergence(real, synth, "a")
0.1732867951399863
>>> jensen_shannon_divergence(real, synth, "b", bins=None)
0.10788077716941784
synthgauge.metrics.kolmogorov_smirnov(real, synth, feature, **kwargs)[source]

Kolmogorov-Smirnov test.

The Kolmogorov-Smirnov test statistic is the maximum difference between the cumulative distribution functions of the real and synthetic features.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Name of the feature to compare. This must be continuous.

  • **kwargs (dict, optional) – Keyword arguments for scipy.stats.ks_2samp.

Returns

  • statistic (float) – Kolmogorov-Smirnov test statistic.

  • pvalue (float) – Two-tailed p-value.

See also

scipy.stats.ks_2samp

Notes

This is a wrapper for scipy.stats.ks_2samp, which tests whether two samples are drawn from the same distribution by calculating the maximum difference between their cumulative distribution functions.

If the returned statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions are the same.

This approach is only defined if the feature is continuous. The documentation further suggests this method works best when one of the samples has a size of only a few thousand.

Examples

>>> import pandas as pd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]}
... )
>>> synth = pd.DataFrame(
...     {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]}
... )

The first feature appears to come from the same distribution in both datasets.

>>> kolmogorov_smirnov(real, synth, "a")
KstestResult(statistic=0.3333333333333333, pvalue=0.9307359307359307)

The second feature appears to come from different distributions in the datasets.

>>> kolmogorov_smirnov(real, synth, "b")
KstestResult(statistic=0.8333333333333334, pvalue=0.025974025974025972)
synthgauge.metrics.kruskal_wallis(real, synth, feature, **kwargs)[source]

Kruskal-Wallis H test.

The Kruskal-Wallis test seeks to determine whether two sets of data originated from the same distribution. This is achieved by pooling and ranking the datasets. A low p-value suggests the two sets originate from different distributions and are not similar.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Feature of the datasets to compare. This must be continuous.

  • **kwargs (dict, optional) – Keyword arguments for scipy.stats.kruskal.

Returns

  • statistic (float) – The Kruskal-Wallis H statistic.

  • pvalue (float) – The p-value for the test.

See also

scipy.stats.kruskal

Notes

This is a wrapper function for scipy.stats.kruskal.

The null hypothesis for this test is that the medians of the distributions are equal. The alternative hypothesis is then that they are different. This would suggest that the synthetic and real data are not similarly distributed.

We notice, however, that failure to reject the null hypothesis only suggests that the medians could be equal and says nothing else about how the data are distributed.

Examples

>>> import pandas as pd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]}
... )
>>> synth = pd.DataFrame(
...     {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]}
... )

The test for the first feature suggests that the data are similarly distributed according to their medians.

>>> kruskal_wallis(real, synth, "a")
KruskalResult(statistic=0.5646387832699667, pvalue=0.45239722100817814)

The second feature test is much clearer that the data are drawn from distributions with different medians.

>>> kruskal_wallis(real, synth, "b")
KruskalResult(statistic=4.877737226277376, pvalue=0.02720526089960062)
synthgauge.metrics.kullback_leibler(real, synth, feature, bins='auto', **kwargs)[source]

Kullback-Leibler divergence.

Describes how much the synthetic feature distribution varies from the real distribution in terms of relative entropy. The divergence is asymmetric and does not satisfy the triangle inequality. Thus, it does not describe “distance” in the mathematical sense.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Feature of the datasets to compare. This must be continuous.

  • bins (int or str or None, default "auto") – The binning method to use. If int, is the number of bins. If str, must be a method accepted by numpy.histogram_bin_edges. If None, the feature is assumed to be categorical and counts are taken for each value in either dataset.

  • **kwargs (dict, optional) – Keyword arguments for scipy.stats.entropy.

Returns

The computed divergence between the distributions.

Return type

float

See also

scipy.stats.entropy

Notes

This is a wrapper function for scipy.stats.entropy. Since this function expects a probability vector, the data is first discretised into evenly-spaced bins.

We can think of the Kullback-Leibler divergence as a measure of the surprise we would expect when seeing an example from the real data, given the distribution of the synthetic.

The divergence is zero if the distributions are identical, and larger values indicate that the two discretised distributions are further from one another. Note that the KL divergence may be infinite; the most common case where this happens is when the synthetic data does not contain any data in at least one bin.

An optimal ‘bins’ value has not been suggested.

Examples

>>> import pandas as pd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 2, 3], "b": ["cat", "cow", "dog", "emu"]}
... )
>>> synth = pd.DataFrame(
...     {"a": [3, 1, 2, 0], "b": ["cat", "cat", "cow", "dog"]}
... )

The first feature is replicated up to a re-ordering in the synthetic data, so its KL divergence is zero:

>>> kullback_leibler(real, synth, "a")
0.0

However, the second feature does not include the "emu" category in the synthetic data, so the divergence is infinite:

>>> kullback_leibler(real, synth, "b", bins=None)
inf
synthgauge.metrics.mann_whitney(real, synth, feature, **kwargs)[source]

Mann-Whitney U test.

The Mann-Whitney test compares two sets of data by examining how well-mixed they are when pooled. This is achieved by ranking the pooled data. A low p-value suggests the data are not similar.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Feature of the datasets to compare. This must be continuous.

  • **kwargs (dict, optional) – Keyword arguments for scipy.stats.mannwhitneyu.

Returns

  • statistic (float) – The Mann-Whitney U statistic, in particular U for synth.

  • pvalue (float) – Two-sided p-value assuming an asymptotic normal distribution.

See also

scipy.stats.mannwhitneyu

Notes

This is a wrapper function for scipy.stats.mannwhitneyu.

The null hypothesis for this test is that for randomly selected real and synthetic values, the probability that the real value is greater than the synthetic is the same as the probability that the synthetic value is greater than the real.

We reject this hypothesis if the p-value is suitably small. This would in turn suggest that the synthetic and real data are not similarly distributed.

Examples

>>> import pandas as pd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]}
... )
>>> synth = pd.DataFrame(
...     {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]}
... )

If we were to choose our p-value threshold as 0.05, we would reach the conclusion that the distributions of the first feature are similar but the distributions of the second feature are not.

>>> mann_whitney(real, synth, "a")
MannwhitneyuResult(statistic=22.5, pvalue=0.5041764308016705)
>>> mann_whitney(real, synth, "b")
MannwhitneyuResult(statistic=31.5, pvalue=0.033439907088311766)
synthgauge.metrics.wasserstein(real, synth, feature, **kwargs)[source]

The (first) Wasserstein distance.

Also known as the “Earth Mover’s” distance, this metric can be thought of as calculating the amount of “work” required to move from the distribution of the synthetic data to the distribution of the real data.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Feature of the datasets to compare. This must be continuous.

  • **kwargs (dict, optional) – Keyword arguments for scipy.stats.wasserstein_distance.

Returns

The computed distance between the distributions.

Return type

float

See also

scipy.stats.wasserstein_distance

Notes

This is a wrapper for scipy.stats.wasserstein_distance. Computationally, we can find the Wasserstein distance by calculating the area between the cumulative distribution functions for the two distributions.

If \(s\) is the synthetic feature distribution, \(r\) is the real feature distribution, and \(R\) and \(S\) are their respective cumulative distribution functions, then

\[W(s, r) = \int_{-\infty}^{+\infty} |S - R|\]

The distance is zero if the distributions are identical and increases as they become less alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same.

Examples

>>> import pandas as pd
>>> real = pd.DataFrame(
...     {"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]}
... )
>>> synth = pd.DataFrame(
...     {"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]}
... )

The first feature appears to be more similar than the second across datasets:

>>> wasserstein(real, synth, "a")
0.6666666666666667
>>> wasserstein(real, synth, "b")
1.166666666666667
synthgauge.metrics.wilcoxon(real, synth, feature, **kwargs)[source]

Wilcoxon signed-rank test.

In this use, the Wilcoxon test compares the distributions of paired data. It does this by ranking the pairwise differences between the real and synthetic data.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Feature of the datasets to compare. This must be continuous.

  • **kwargs (dict, optional) – Keyword arguments for scipy.stats.wilcoxon.

Returns

  • statistic (float) – The sum of the ranks of the differences above or below zero, whichever is greater.

  • pvalue (float) – Two-sided p-value.

See also

scipy.stats.wilcoxon

Notes

This is a wrapper function for scipy.stats.wilcoxon.

The null hypothesis for this test is that the median of the paired differences is zero. The alternative hypothesis is that it is different from zero. This would suggest that the synthetic and real data are not similarly distributed.

This test only makes sense when the synthetic and real data are paired. That is, each synthetic datum is matched to a real one, and the data must be ordered to reflect this pairing.

Examples

>>> import pandas as pd
>>> real = pd.DataFrame({"a": [0, 1, 2, 3, 4, 5, 8, 9]})
>>> synth = pd.DataFrame({"a": [2, 0, 3, 7, 7, 4, 2, 1]})

By eye, you might think these distributions are quite different from one another. However, the Wilcoxon test suggests that these two datasets were drawn from similar distributions.

>>> wilcoxon(real, synth, "a")
WilcoxonResult(statistic=17.0, pvalue=0.9453125)