synthgauge.metrics.propensity

Propensity-based utility metrics.

Module Contents

Functions

pmse(combined, indicator, method, **kwargs)

Calculate the propensity score mean-squared error (pMSE).

pmse_ratio(combined, indicator, method[, num_perms, ...])

The propensity score mean-squared error ratio.

pmse_standardised(combined, indicator, method[, ...])

The standardised propensity score mean-squared error.

propensity_metrics(real, synth[, method, feats, ...])

Propensity score-based metrics.

specks(real, synth, classifier, **kwargs)

Propensity score comparison via the Kolmogorov-Smirnov distance.

synthgauge.metrics.propensity.pmse(combined, indicator, method, **kwargs)[source]

Calculate the propensity score mean-squared error (pMSE).

Parameters
  • combined (pandas.DataFrame) – The combined set of real and synthetic data.

  • indicator (numpy.ndarray) – An indicator for which data are real (0) or synthetic (1).

  • method ({"cart", "logr"}) – Which propensity model to use. Must be either CART (“cart”) or logistic regression with first-order interactions (“logr”).

  • **kwargs (dict, optional) – Keyword arguments passed to propensity model.

Returns

Propensity score mean-squared error.

Return type

float

See also

sklearn.linear_model.LogisticRegression, sklearn.tree.DecisionTreeClassifier

Notes

Propensity scores represent probabilities of group membership. By modelling whether an example is synthetic or not, we can use propensity scores as a measure of utility.

This returns zero if the distributions are identical, and is bounded above by \(1 - c\) if they are nothing alike, where \(c\) is the proportion of the data that is synthetic. This makes the method well-suited to comparing multiple synthetic datasets. However, as this is not a formal statistical test, there is no threshold below which we can claim the distributions are statistically the same.

This function assumes that some preprocessing has been carried out so that the data are ready to be passed to the classification function. Categorical data are encoded automatically, but other steps, such as scaling, are not performed; without them, erroneous results may be returned. The logistic regression can fail to converge if many variables are considered. Anecdotally, this does not seem to drastically affect the propensity scores, although it should be investigated formally.

Using a CART model as the classifier is recommended in the literature; however, logistic regression is also supported. For further details, see: https://doi.org/10.1111/rssa.12358
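
Examples

An illustrative sketch with toy data (the dataframes below are invented for illustration; as noted above, keyword arguments such as random_state are forwarded to the underlying scikit-learn model):

>>> import numpy as np
>>> import pandas as pd
>>> from synthgauge.metrics.propensity import pmse
>>> # Toy data: the "synthetic" set is a jittered copy of the real one.
>>> rng = np.random.default_rng(0)
>>> real = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
>>> synth = real + rng.normal(scale=0.1, size=real.shape)
>>> # Stack the datasets and flag their origins: real -> 0, synthetic -> 1.
>>> combined = pd.concat([real, synth], ignore_index=True)
>>> indicator = np.repeat([0, 1], [len(real), len(synth)])
>>> # Scores close to zero indicate the datasets are hard to tell apart.
>>> score = pmse(combined, indicator, method="cart", random_state=42)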

synthgauge.metrics.propensity.pmse_ratio(combined, indicator, method, num_perms=None, estimator='perm', **kwargs)[source]

The propensity score mean-squared error ratio.

This is the ratio of the observed pMSE to that expected under the null case, i.e.

\[ratio(pMSE) = \frac{pMSE}{E(pMSE)}\]
Parameters
  • combined (pandas.DataFrame) – Dataframe containing the combined real and synthetic data.

  • indicator (numpy.ndarray) – Indicator for whether data are real (0) or synthetic (1).

  • method ({"cart", "logr"}) – Which propensity model to use. Must be either CART (“cart”) or logistic regression with first-order interactions (“logr”).

  • num_perms (int, optional) – Number of permutations to consider when estimating the null case statistics with a CART model.

  • estimator ({"perm", "boot"}, default "perm") – Which estimation process to use with a CART model. By default, permutations are used for backwards compatibility.

  • **kwargs (dict, optional) – Keyword arguments passed to the propensity model classifier.

Returns

The observed-to-null pMSE ratio.

Return type

float

Notes

The interpretation of this metric makes more sense for synthetic data. The pMSE alone gives better utility as the value gets closer to zero, which is only attainable when the datasets are identical. However, when generating synthetic data, we do not want to produce identical entries. Rather, we want to achieve similarity between the distributions of the real and synthetic datasets.

This ratio tends towards one when this is achieved, and increases otherwise.

Note that the random_state keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.
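
Examples

A minimal sketch with invented toy data, using the CART model so that both num_perms and random_state come into play:

>>> import numpy as np
>>> import pandas as pd
>>> from synthgauge.metrics.propensity import pmse_ratio
>>> rng = np.random.default_rng(1)
>>> real = pd.DataFrame({"x": rng.normal(size=150), "y": rng.normal(size=150)})
>>> synth = pd.DataFrame({"x": rng.normal(size=150), "y": rng.normal(size=150)})
>>> combined = pd.concat([real, synth], ignore_index=True)
>>> indicator = np.repeat([0, 1], [len(real), len(synth)])
>>> # The null expectation is estimated from num_perms permutations;
>>> # random_state makes the result reproducible, as noted above.
>>> ratio = pmse_ratio(
...     combined, indicator, method="cart", num_perms=20, random_state=0
... )
>>> # Values near one suggest the two distributions are similar.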

synthgauge.metrics.propensity.pmse_standardised(combined, indicator, method, num_perms=None, estimator='perm', **kwargs)[source]

The standardised propensity score mean-squared error.

This takes the observed pMSE and standardises it against the null case, i.e.

\[stand(pMSE) = \frac{pMSE - E(pMSE)}{sd(pMSE)}\]
Parameters
  • combined (pandas.DataFrame) – Dataframe containing the combined real and synthetic data.

  • indicator (numpy.ndarray) – Indicator for whether data are real (0) or synthetic (1).

  • method ({"cart", "logr"}) – Which propensity model to use. Must be either CART (“cart”) or logistic regression with first-order interactions (“logr”).

  • num_perms (int, optional) – Number of permutations to consider when estimating the null case statistics with a CART model.

  • estimator ({"perm", "boot"}, default "perm") – Which estimation process to use with a CART model. By default, permutations are used for backwards compatibility.

  • **kwargs (dict, optional) – Keyword arguments passed to the propensity model.

Returns

The null-standardised pMSE.

Return type

float

Notes

The interpretation of this metric makes more sense for synthetic data. The pMSE alone indicates better utility as it gets closer to zero, which is only attainable when the datasets are identical. However, when generating synthetic data, we do not want to produce identical entries. Rather, we want to achieve similarity between the distributions of the real and synthetic datasets.

This standardised value tends towards zero when this is achieved, and increases in magnitude otherwise.

Note that the random_state keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.
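
Examples

A minimal sketch with invented toy data, using the logistic regression model; num_perms is omitted here since it only applies to the CART model:

>>> import numpy as np
>>> import pandas as pd
>>> from synthgauge.metrics.propensity import pmse_standardised
>>> rng = np.random.default_rng(2)
>>> real = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100)})
>>> synth = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100)})
>>> combined = pd.concat([real, synth], ignore_index=True)
>>> indicator = np.repeat([0, 1], [len(real), len(synth)])
>>> # Values near zero suggest the two distributions are similar.
>>> stand = pmse_standardised(combined, indicator, method="logr")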

synthgauge.metrics.propensity.propensity_metrics(real, synth, method='cart', feats=None, num_perms=20, estimator='perm', **kwargs)[source]

Propensity score-based metrics.

This function calculates three metrics based on the propensity score mean-squared error (pMSE), all of which quantify utility by measuring the distinguishability of the synthetic data, i.e. how readily real and synthetic records can be told apart.

To do this, the datasets are combined and their origins tracked by a boolean indicator. This combined dataset is then used to fit a binary classification model (CART or logistic regression with first-order interactions) with the indicator as the target. The propensity score for each row is then extracted and summarised to give a metric.

The returned metrics are the observed pMSE along with the pMSE ratio and the standardised pMSE. The latter two metrics are measured relative to the null case, in which the real and synthetic data are produced by identical processes.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • method ({"cart", "logr"}, default "cart") – Which propensity model to use. Must be either CART (“cart”) or logistic regression with first-order interactions (“logr”).

  • feats (list of str or None, default None) – List of features in the dataset to be used in the propensity model. If None (default), all common features are used.

  • num_perms (int, default 20) – Number of permutations to consider when estimating the null case statistics with a CART model.

  • estimator ({"perm", "boot"}, default "perm") – Which estimation process to use with a CART model. By default, permutations are used for backwards compatibility.

  • **kwargs (dict, optional) – Keyword arguments passed to the propensity model.

Returns

  • observed (float) – The observed pMSE.

  • standard (float) – The null-standardised pMSE.

  • ratio (float) – The observed-to-null pMSE ratio.

Raises

ValueError – If method is not one of ‘cart’ or ‘logr’.

See also

sklearn.linear_model.LogisticRegression, sklearn.tree.DecisionTreeClassifier, synthgauge.metrics.propensity.pmse, synthgauge.metrics.propensity.pmse_ratio, synthgauge.metrics.propensity.pmse_standardised

Notes

For the CART model, sklearn.tree.DecisionTreeClassifier is used, while the logistic regression model uses sklearn.linear_model.LogisticRegression.

Note that the random_state keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.

Details on these metrics can be found at: https://doi.org/10.1111/rssa.12358
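
Examples

A minimal sketch with invented toy data; the three values are assumed to come back as a tuple in the order listed under Returns:

>>> import numpy as np
>>> import pandas as pd
>>> from synthgauge.metrics.propensity import propensity_metrics
>>> rng = np.random.default_rng(3)
>>> real = pd.DataFrame({"x": rng.normal(size=200), "y": rng.normal(size=200)})
>>> synth = pd.DataFrame({"x": rng.normal(size=200), "y": rng.normal(size=200)})
>>> # The function stacks the datasets and builds the indicator internally.
>>> observed, standard, ratio = propensity_metrics(
...     real, synth, method="cart", num_perms=20, random_state=0
... )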

synthgauge.metrics.propensity.specks(real, synth, classifier, **kwargs)[source]

Propensity score comparison via the Kolmogorov-Smirnov distance.

The SPECKS metric was originally presented in https://arxiv.org/pdf/1803.06763.pdf and works as follows:

  1. Stack the real and synthetic data, and create a variable indicating whether each record is real (0) or synthetic (1).

  2. Calculate the propensity score for each record using a binary classifier on the indicator variable.

  3. Compute the Kolmogorov-Smirnov distance between the empirical CDFs for the real and synthetic propensity scores.

The Kolmogorov-Smirnov distance is defined as the maximum absolute difference between two empirical cumulative distribution functions, so it is bounded between zero and one. If the synthetic data closely resemble the real data, the two groups will be hard to distinguish, giving similar propensity score distributions and a distance close to zero.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • classifier (scikit-learn estimator) – Any scikit-learn-style classifier class with a predict_proba method.

  • **kwargs (dict, optional) – Keyword arguments passed to the classifier.

Returns

The Kolmogorov-Smirnov distance between the real and synthetic propensity score CDFs.

Return type

float

Notes

The combined dataset is one-hot-encoded before being passed to the classifier so categorical features can be handled.

The paper introducing SPECKS has also been published in METRON: https://doi.org/10.1007/s40300-021-00201-0.
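
Examples

A minimal sketch with invented toy data, including a categorical column to exercise the one-hot encoding noted above. Following the parameter description, the classifier class itself is passed and keyword arguments are forwarded to it:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from synthgauge.metrics.propensity import specks
>>> rng = np.random.default_rng(4)
>>> real = pd.DataFrame(
...     {"x": rng.normal(size=200), "colour": rng.choice(["r", "g", "b"], 200)}
... )
>>> synth = pd.DataFrame(
...     {"x": rng.normal(size=200), "colour": rng.choice(["r", "g", "b"], 200)}
... )
>>> # The categorical "colour" column is one-hot-encoded internally.
>>> distance = specks(real, synth, RandomForestClassifier, random_state=0)
>>> # The distance lies in [0, 1]; lower means less distinguishable.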