:py:mod:`synthgauge.metrics` ============================ .. py:module:: synthgauge.metrics .. autoapi-nested-parse:: A submodule for all utility and privacy metrics. Submodules ---------- .. toctree:: :titlesonly: :maxdepth: 1 classification/index.rst cluster/index.rst correlation/index.rst density/index.rst nist/index.rst privacy/index.rst propensity/index.rst univariate/index.rst Package Contents ---------------- Functions ~~~~~~~~~ .. autoapisummary:: synthgauge.metrics.classification_comparison synthgauge.metrics.clustered_msd synthgauge.metrics.correlation_msd synthgauge.metrics.correlation_ratio_msd synthgauge.metrics.feature_density_mad synthgauge.metrics.hoc synthgauge.metrics.kway_marginals synthgauge.metrics.min_nearest_neighbour synthgauge.metrics.sample_overlap_score synthgauge.metrics.tcap_score synthgauge.metrics.pmse synthgauge.metrics.pmse_ratio synthgauge.metrics.pmse_standardised synthgauge.metrics.propensity_metrics synthgauge.metrics.specks synthgauge.metrics.jensen_shannon_distance synthgauge.metrics.jensen_shannon_divergence synthgauge.metrics.kolmogorov_smirnov synthgauge.metrics.kruskal_wallis synthgauge.metrics.kullback_leibler synthgauge.metrics.mann_whitney synthgauge.metrics.wasserstein synthgauge.metrics.wilcoxon
.. py:function:: classification_comparison(real, synth, feats, target, classifier, test_prop=0.2, random_state=None, **kwargs) Classification utility metric. This metric fits two (identical) classification models to `real` and `synth`, and then tests them both against withheld `real` data. We obtain utility scores by subtracting the precision, recall and f1 scores of the "synthetic" model predictions from the "real" model's. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: List of column names to use as the input in the classification. :type feats: list of str :param target: Column to use as target in the classification. :type target: str :param classifier: Classifier class with `fit` and `predict` methods. :type classifier: scikit-learn estimator :param test_prop: If `float`, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If `int`, represents the absolute number of test samples. :type test_prop: float or int, default 0.2 :param random_state: Random seed for shuffling during the train-test split, and for the classification algorithm itself. :type random_state: int, optional :param \*\*kwargs: Keyword arguments passed to the classifier. :type \*\*kwargs: dict, optional :returns: * **precision_difference** (*float*) -- Precision of the real model minus that of the synthetic model. * **recall_difference** (*float*) -- Recall of the real model minus that of the synthetic model. * **f1_difference** (*float*) -- f1 score of the real model minus that of the synthetic model. .. rubric:: Notes Some preprocessing is carried out before the models are trained. Numeric features are scaled and categorical features are one-hot-encoded. A score of zero tells us the synthetic data is just as good as the real at training the given classification model. Increases in these scores indicate poorer utility.
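.. rubric:: Example

A minimal usage sketch with illustrative toy data (the dataframes, column names and the choice of `sklearn.tree.DecisionTreeClassifier` below are hypothetical, not taken from the package's own documentation):

.. code-block:: python

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    from synthgauge.metrics import classification_comparison

    real = pd.DataFrame(
        {
            "age": [25, 40, 31, 58, 47, 36, 29, 52, 44, 61],
            "income": [20, 45, 30, 60, 52, 38, 26, 55, 48, 63],
            "group": ["a", "b", "a", "b", "b", "a", "a", "b", "b", "b"],
        }
    )
    synth = pd.DataFrame(
        {
            "age": [27, 42, 29, 55, 50, 33, 30, 49, 46, 59],
            "income": [22, 43, 28, 63, 49, 40, 25, 57, 45, 66],
            "group": ["a", "b", "a", "b", "a", "a", "a", "b", "b", "b"],
        }
    )

    # Differences close to zero suggest the synthetic data trains the
    # classifier about as well as the real data does.
    precision_diff, recall_diff, f1_diff = classification_comparison(
        real,
        synth,
        feats=["age", "income"],
        target="group",
        classifier=DecisionTreeClassifier,
        random_state=0,
    )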
.. py:function:: clustered_msd(real, synth, feats=None, method='kmeans', k_min=2, k_max=10, random_state=None) (Multiple) clustered mean-squared difference (MSD). This metric clusters the real and synthetic data together, measuring the synthetic utility according to its representation across the fitted clusters. Since there is often no obvious choice for the number of clusters, :math:`k`, we consider a range of values. For each value of :math:`k`, the chosen clustering method is fit and the proportion of synthetic data in each cluster is recorded. The clustered MSD is then calculated as the mean-squared difference between these proportions and the overall proportion of synthetic data. This collection of MSDs is summarised by taking its minimum to give the metric value. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: Features to use in the clustering. If `None` (default), all common features are used. :type feats: list of str or None, default None :param method: Clustering method to use. Only k-means and k-prototypes are implemented. If using k-means (default), only numeric columns are considered, while k-prototypes allows for mixed-type clustering. :type method: {"kmeans", "kprototypes"}, default "kmeans" :param k_min: Minimum number of clusters to use. Default of 2. :type k_min: int, default 2 :param k_max: Maximum number of clusters to use. Default of 10. :type k_max: int, default 10 :param random_state: The random seed used to fit the clustering algorithm. :type random_state: int, optional :returns: The minimum observed clustered MSD. :rtype: float .. rubric:: Notes This function can be used with a single value of `k` by setting `k_min` and `k_max` both to `k`, for instance when a sensible number of clusters is known a priori. This metric says nothing about how appropriate the clustering method may be for the data at hand, nor how the data are distributed among the clusters. Both methods considered here have rather strong assumptions about the relative size and characteristics of the clusters in the data. As such, exploratory analysis is advised to determine whether such centroid-based clustering is well-suited.
.. py:function:: correlation_msd(real, synth, method='pearson', feats=None) Mean-squared difference in correlation coefficients. This metric calculates the mean-squared difference between the correlation matrices for the real and synthetic datasets. This gives an indication of how well the synthetic data has retained bivariate relationships. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param method: Correlation method to use. :type method: {"pearson", "spearman", "cramers_v"}, default "pearson" :param feats: Features to measure correlation across. If `method="cramers_v"`, all numeric columns will be filtered out. Likewise, for the other correlation methods, all non-numeric columns are removed. If `None` (default), all common features that satisfy the needs of `method` are used. :type feats: list of str or None, default None :returns: Mean-squared difference of correlation coefficients. :rtype: float .. seealso:: :obj:`numpy.corrcoef` .. rubric:: Notes The smaller the number this function returns, the better the synthetic data captures the correlation between variables in the real data. This method is therefore good for comparing multiple synthetic datasets to see which is closest to the real. However, as this is not a test, there is no threshold value below which we can claim the datasets are statistically the same. We only consider the coefficients above the main diagonal when calculating the MSD. If we included the entire matrix, we would double-count each pair of features as well as including the trivial ones along the main diagonal.
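.. rubric:: Example

A minimal usage sketch (the toy dataframes below are illustrative only):

.. code-block:: python

    import pandas as pd

    from synthgauge.metrics import correlation_msd

    real = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 5, 4, 6]})
    synth = pd.DataFrame({"a": [1, 1, 3, 4, 5], "b": [3, 4, 4, 5, 7]})

    # Smaller values indicate that the synthetic data better preserves
    # the pairwise correlations seen in the real data.
    msd = correlation_msd(real, synth, method="pearson")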
.. py:function:: correlation_ratio_msd(real, synth, categorical=None, numeric=None) Correlation ratio mean-squared difference. This metric calculates the mean-squared difference in association between categorical and continuous feature pairings in the real and synthetic datasets. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param categorical: Categorical features in `real` and `synth` to include in comparison. If `None` (default), uses all common object-type columns. :type categorical: list of str or None, default None :param numeric: Numerical features in `real` and `synth` to include in comparison. If `None` (default), uses all common columns not selected by `categorical`. :type numeric: list of str or None, default None :returns: Mean-squared difference between `real` and `synth` in correlation ratio scores across all categorical-continuous feature pairs. :rtype: float
.. py:function:: feature_density_mad(real, synth, feats=None, bins=10) Mean absolute difference of feature densities. For each feature, the difference between the densities across the bins of `real` and `synth` is calculated. Finally, the mean absolute difference across all features and bins is calculated. A value close to 0 indicates that the real and synthetic datasets have a similar set of feature distributions. :param real: DataFrame containing the real data. :type real: pandas.DataFrame :param synth: DataFrame containing the synthetic data. :type synth: pandas.DataFrame :param feats: The features that will be used to compute the densities. If `None` (default), all common features are used. :type feats: list of str or None, default None :param bins: Binning method for discretising the data. Can be anything accepted by `numpy.histogram_bin_edges`. Default uses 10 bins. :type bins: str or int, default 10 :returns: Mean absolute difference of feature densities. :rtype: float
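.. rubric:: Example

A minimal usage sketch (the toy dataframes and the choice of five bins are illustrative only):

.. code-block:: python

    import pandas as pd

    from synthgauge.metrics import feature_density_mad

    real = pd.DataFrame({"a": [0.1, 0.4, 0.6, 0.9, 0.5], "b": [1, 3, 2, 5, 4]})
    synth = pd.DataFrame({"a": [0.2, 0.3, 0.7, 0.8, 0.5], "b": [2, 3, 2, 4, 5]})

    # Values close to zero indicate that the feature distributions
    # of the two datasets are similar.
    mad = feature_density_mad(real, synth, feats=["a", "b"], bins=5)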
.. py:function:: hoc(real, synth, trials=300, prob=0.1, seed=None) A measure based on Higher Order Conjunctions (HOC). This measure compares the relative sizes of randomly selected pools of "similar" rows in the real and synthetic data. This measure of similarity is defined across a set of randomly generated test cases applied to each dataset. Each test case consists of a set of rules. The :math:`i`-th test calculates the fraction of records satisfying its rules in the real data, :math:`f_{ri}`, and the synthetic, denoted :math:`f_{si}`. Their dissimilarity in test :math:`i` is quantified as: .. math:: d_i = \ln\left(\max(f_{si}, 10^{-6})\right) - \ln(f_{ri}) These dissimilarities are summarised as: .. math:: \Delta = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2} where :math:`N` is the number of test cases. Finally, this is transformed to a HOC score: .. math:: HOC = \max \left(0, 1 + \frac{\Delta}{\ln(10^{-3})}\right) This measure is bounded between 0 and 1, indicating whether the datasets are nothing alike or identical based on the test cases, respectively. In the original text this score is multiplied by 1000 to make it human-readable. Full details are available in https://doi.org/10.6028/NIST.TN.2151. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param trials: Number of test cases to create. Default of 300 as in the competition. :type trials: int, default 300 :param prob: Probability of any column being included in a test case. Default of 0.1 as in the competition. :type prob: float, default 0.1 :param seed: Random number seed. If `None`, results will not be reproducible. :type seed: int or None, default None :returns: **score** -- The overall HOC score. :rtype: float .. rubric:: Notes It is possible that some test cases will be "empty", i.e. when no columns are selected. In this scenario, the score for that case will be `np.nan` rather than it being resampled.
.. py:function:: kway_marginals(real, synth, k=3, trials=100, bins=100, seed=None) A measure based on the similarity of a set of k-way marginals. This measure works as follows: 1. Discretise all numeric features (based on the original data). 2. Randomly select `trials` sets of `k` features and calculate the corresponding marginal counts for each dataset. 3. Calculate summed absolute deviation in the counts across all bins and marginal sets. 4. Transform the summed absolute deviations, :math:`s`, to form a set of scores :math:`S = \left\{1 - s/2 \mid \text{for each } s\right\}`. This transformation ensures the scores are in :math:`[0, 1]`. These extremes represent the worst- and best-case scenarios, respectively. 5. Return the mean score. The NIST competition utilised a set of 100 three-way marginals. Details can be found at https://doi.org/10.6028/NIST.TN.2151. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param k: Number of features to include in each k-way marginal. Default uses 3 (as done in the NIST competition). :type k: int, default 3 :param trials: Maximum number of marginals to consider to estimate the overall score. If there are fewer `k`-way combinations than `trials`, tries all. Default uses 100 (as done in the NIST competition). :type trials: int, default 100 :param bins: Binning method for sampled numeric features. Can be anything accepted by `numpy.histogram`. Default uses 100 bins (as done in the NIST competition). :type bins: int or str, default 100 :param seed: Random number seed. If `None`, results will not be reproducible. :type seed: int or None, default None :returns: **score** -- The mean transformed sum absolute deviation in k-way densities. :rtype: float
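.. rubric:: Example

A minimal usage sketch; the toy dataframes are illustrative, and `k`, `trials` and `bins` are reduced from the NIST defaults so that all three possible two-way marginals of this tiny dataset are tried:

.. code-block:: python

    import pandas as pd

    from synthgauge.metrics import kway_marginals

    real = pd.DataFrame(
        {"a": [1, 2, 3, 4, 5, 6], "b": [2, 1, 4, 3, 6, 5], "c": [1, 1, 2, 2, 3, 3]}
    )
    synth = pd.DataFrame(
        {"a": [1, 3, 3, 4, 4, 6], "b": [2, 2, 4, 4, 6, 6], "c": [1, 2, 2, 2, 3, 3]}
    )

    # Scores lie in [0, 1]; higher values mean the k-way marginals of the
    # two datasets agree more closely.
    score = kway_marginals(real, synth, k=2, trials=3, bins=5, seed=0)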
.. py:function:: min_nearest_neighbour(real, synth, feats=None, outliers_only=True, threshold=2, n_neighbours=5) Minimum nearest-neighbour distance. This privacy metric returns the smallest distance between any point in the real dataset and any point in the synthetic dataset. There is an option to only consider the outliers in the real dataset as these perhaps pose more of a privacy concern. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: Features in `real` and `synth` to use when calculating distance. If `None` (default), all common features are used. :type feats: list of str or None, default None :param outliers_only: Boolean indicating whether to filter out the real data inliers (default) or not. :type outliers_only: bool, default True :param threshold: Outlier decision threshold. Increase to include fewer points from `real` in nearest-neighbour calculations. :type threshold: number, default 2 :param n_neighbours: Number of neighbours to consider when identifying local outliers. :type n_neighbours: int, default 5 :returns: Minimum Manhattan distance between `real` and `synth` data. :rtype: float .. rubric:: Notes This privacy metric provides an insight into whether the synthetic dataset is too similar to the real dataset. It does this by calculating the minimum distance between the real records and the synthetic records. This metric assumes that categorical data is ordinal during distance calculations, or that it has already been suitably one-hot-encoded.
.. py:function:: sample_overlap_score(real, synth, feats=None, sample_size=0.2, runs=5, seed=None, score_type='unique') Return percentage of overlap between real and synth data based on random sampling. Samples from both the real and synthetic datasets are compared for similarity. This similarity, or overlap score, is based on the exact matches of real data records within the synthetic data. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: The features used to match records. If `None` (default), all common features are used. :type feats: list of str or None, default None :param sample_size: The ratio (if `sample_size` between 0 and 1) or count (`sample_size` > 1) of records to sample. Default is 0.2 (20%). :type sample_size: float or int, default 0.2 :param runs: The number of sampling runs to use when computing the score. :type runs: int, default 5 :param seed: Random number seed used for sampling. :type seed: int, optional :param score_type: Method used for calculating the overlap score. If "unique" (default), the score is the percentage of unique records in the real sample that have a match within the synthetic data. If "sample", the score is the percentage of all records within the real sample that have a match within the synth sample. :type score_type: {"unique", "sample"}, default "unique" :returns: **overlap_score** -- Estimated overlap score between `real` and `synth`. :rtype: float
.. py:function:: tcap_score(real, synth, key, target) Target Correct Attribution Probability (TCAP) score. This privacy metric calculates the average chance that the key-target pairings in a synthetic dataset reveal the true key-target pairings in the associated real dataset. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param key: List of features in `synth` to use as the key. :type key: list of str :param target: Feature to use as the target. :type target: str :returns: The average TCAP score across the dataset. :rtype: float .. rubric:: Notes This metric provides an estimate of how well an intruder could infer attributes of groups in the real dataset by studying the synthetic. The choices for `key` and `target` will vary depending on the dataset in question but we would suggest the `key` features are those that could be readily available to an outsider and the `target` feature is one we would not want them to find out, such as a protected characteristic. This method only works with categorical data, so binning of continuous data may be required. Full details may be found in: Taub and Elliott (2019). The Synthetic Data Challenge. The Hague, The Netherlands: Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Session 3.
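.. rubric:: Example

A minimal usage sketch with hypothetical categorical data (the column names and values are illustrative only):

.. code-block:: python

    import pandas as pd

    from synthgauge.metrics import tcap_score

    real = pd.DataFrame(
        {
            "region": ["N", "N", "S", "S", "E"],
            "age_band": ["16-30", "31-50", "16-30", "51+", "31-50"],
            "status": ["yes", "no", "no", "yes", "no"],
        }
    )
    synth = pd.DataFrame(
        {
            "region": ["N", "S", "S", "E", "N"],
            "age_band": ["16-30", "16-30", "51+", "31-50", "31-50"],
            "status": ["yes", "no", "yes", "no", "no"],
        }
    )

    # Scores close to one suggest an intruder could reliably infer the
    # target from the key by studying the synthetic data.
    score = tcap_score(real, synth, key=["region", "age_band"], target="status")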
.. py:function:: pmse(combined, indicator, method, **kwargs) Calculate the propensity score mean-squared error (pMSE). :param combined: The combined set of real and synthetic data. :type combined: pandas.DataFrame :param indicator: An indicator for which data are real (0) or synthetic (1). :type indicator: numpy.ndarray :param method: Which propensity model to use. Must be either CART (`"cart"`) or logistic regression with first-order interactions (`"logr"`). :type method: {"cart", "logr"} :param \*\*kwargs: Keyword arguments passed to propensity model. :type \*\*kwargs: dict, optional :returns: Propensity score mean-squared error. :rtype: float .. seealso:: :obj:`sklearn.linear_model.LogisticRegression`, :obj:`sklearn.tree.DecisionTreeClassifier` .. rubric:: Notes Propensity scores represent probabilities of group membership. By modelling whether an example is synthetic or not, we can use propensity scores as a measure of utility. This returns zero if the distributions are identical, and is bounded above by :math:`1 - c` if they are nothing alike, where :math:`c` is the proportion of the data that is synthetic. This method is therefore good for comparing multiple synthetic datasets. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same. This function assumes that some preprocessing has been carried out so that the data is ready to be passed to the classification function. Encoding of categorical data is performed, but, for example, scaling is not. Without this, erroneous results may be returned. The logistic regression can fail to converge if many variables are considered. Anecdotally, this doesn't seem to drastically impact the propensity scores, although this should be investigated formally. Using a CART model as a classifier is recommended in the literature; however, we also support the use of logistic regression. For further details, see: https://doi.org/10.1111/rssa.12358
.. py:function:: pmse_ratio(combined, indicator, method, num_perms=None, estimator='perm', **kwargs) The propensity score mean-squared error ratio. This is the ratio of observed pMSE to that expected under the null case, i.e. .. math:: ratio(pMSE) = \frac{pMSE}{E(pMSE)} :param combined: Dataframe containing the combined real and synthetic data. :type combined: pandas.DataFrame :param indicator: Indicator for whether data are real (0) or synthetic (1). :type indicator: numpy.ndarray :param method: Which propensity model to use. Must be either CART (`"cart"`) or logistic regression with first-order interactions (`"logr"`). :type method: {"cart", "logr"} :param num_perms: Number of permutations to consider when estimating the null case statistics with a CART model. :type num_perms: int, optional :param estimator: Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility. :type estimator: {"perm", "boot"} :param \*\*kwargs: Keyword arguments passed to the propensity model classifier. :type \*\*kwargs: dict, optional :returns: The observed-to-null pMSE ratio. :rtype: float .. rubric:: Notes The interpretation of this metric makes more sense for synthetic data. The pMSE alone gives better utility as the value gets closer to zero, which is only attainable when the datasets are identical. However, when generating synthetic data, we do not want to produce identical entries. Rather, we want to achieve similarity between the distributions of the real and synthetic datasets. This ratio tends towards one when this is achieved, and increases otherwise. Note that the `random_state` keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.
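.. rubric:: Example

A minimal usage sketch. The toy dataframes are illustrative; `num_perms=20` simply mirrors the default used by `propensity_metrics`, and `random_state` is passed through the keyword arguments as described in the notes above:

.. code-block:: python

    import numpy as np
    import pandas as pd

    from synthgauge.metrics import pmse_ratio

    real = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [2, 4, 5, 4, 6, 7]})
    synth = pd.DataFrame({"a": [1, 2, 2, 4, 5, 5], "b": [3, 4, 4, 5, 6, 6]})

    # Stack the datasets and flag which rows are real (0) or synthetic (1).
    combined = pd.concat([real, synth], ignore_index=True)
    indicator = np.array([0] * len(real) + [1] * len(synth))

    # Values near one suggest the real and synthetic distributions are similar.
    ratio = pmse_ratio(
        combined, indicator, method="cart", num_perms=20, random_state=0
    )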
.. py:function:: pmse_standardised(combined, indicator, method, num_perms=None, estimator='perm', **kwargs) The standardised propensity score mean-squared error. This takes the observed pMSE and standardises it against the null case, i.e. .. math:: stand(pMSE) = \frac{pMSE - E(pMSE)}{sd(pMSE)} :param combined: Dataframe containing the combined real and synthetic data. :type combined: pandas.DataFrame :param indicator: Indicator for whether data are real (0) or synthetic (1). :type indicator: numpy.ndarray :param method: Which propensity model to use. Must be either CART (`"cart"`) or logistic regression with first-order interactions (`"logr"`). :type method: {"cart", "logr"} :param num_perms: Number of permutations to consider when estimating the null case statistics with a CART model. :type num_perms: int, optional :param estimator: Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility. :type estimator: {"perm", "boot"} :param \*\*kwargs: Keyword arguments passed to the propensity model. :type \*\*kwargs: dict, optional :returns: The null-standardised pMSE. :rtype: float .. rubric:: Notes The interpretation of this metric makes more sense for synthetic data. The pMSE alone indicates better utility as it gets closer to zero, which is only attainable when the datasets are identical. However, when generating synthetic data, we do not want to produce identical entries. Rather, we want to achieve similarity between the distributions of the real and synthetic datasets. This standardised value tends towards zero when this is achieved, and increases in magnitude otherwise. Note that the `random_state` keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.
.. py:function:: propensity_metrics(real, synth, method='cart', feats=None, num_perms=20, estimator='perm', **kwargs) Propensity score-based metrics. This function calculates three metrics based on the propensity score mean-squared error (pMSE), all of which quantify utility by measuring the distinguishability of the synthetic data. That is, how readily real and synthetic data can be identified. To do this, the datasets are combined and their origins tracked by a boolean indicator. This combined dataset is then used to fit a binary classification model (CART or logistic regression with first-order interactions) with the indicator as the target. The propensity score for each row is then extracted and summarised to give a metric. The returned metrics are the observed pMSE along with the pMSE ratio and standardised pMSE. The latter two metrics are given relative to the null case where the real and synthetic data are produced from identical processes. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param method: Which propensity model to use. Must be either CART (`"cart"`) or logistic regression with first-order interactions (`"logr"`). :type method: {"cart", "logr"}, default "cart" :param feats: List of features in the dataset to be used in the propensity model.
If `None` (default), all common features are used. :type feats: list of str or None, default None :param num_perms: Number of permutations to consider when estimating the null case statistics with a CART model. :type num_perms: int, default 20 :param estimator: Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility. :type estimator: {"perm", "boot"} :param \*\*kwargs: Keyword arguments passed to the propensity model. :type \*\*kwargs: dict, optional :returns: * **observed** (*float*) -- The observed pMSE. * **standard** (*float*) -- The null-standardised pMSE. * **ratio** (*float*) -- The observed-to-null pMSE ratio. :raises ValueError: If `method` is not one of `'cart'` or `'logr'`. .. seealso:: :obj:`sklearn.linear_model.LogisticRegression`, :obj:`sklearn.tree.DecisionTreeClassifier`, :obj:`synthgauge.metrics.propensity.pmse`, :obj:`synthgauge.metrics.propensity.pmse_ratio`, :obj:`synthgauge.metrics.propensity.pmse_standardised` .. rubric:: Notes For the CART model, `sklearn.tree.DecisionTreeClassifier` is used. Meanwhile, the logistic regression model uses `sklearn.linear_model.LogisticRegression`. Note that the `random_state` keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible. Details on these metrics can be found at: https://doi.org/10.1111/rssa.12358
.. py:function:: specks(real, synth, classifier, **kwargs) Propensity score comparison via the Kolmogorov-Smirnov distance. The SPECKS metric was originally presented in https://arxiv.org/pdf/1803.06763.pdf and works as follows: 1. Stack the real and synthetic data, and create a variable indicating whether each record is real (0) or synthetic (1). 2. Calculate the propensity score for each record using a binary classifier on the indicator variable. 3. Compute the Kolmogorov-Smirnov distance between the empirical CDFs for the real and synthetic propensity scores. The Kolmogorov-Smirnov distance is defined as the maximum difference between two empirical cumulative distribution functions. Therefore, it is bounded between zero and one. If the synthetic data properly resembles the original data then they will be indistinguishable, leading to close empirical CDFs. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param classifier: Any `scikit-learn`-style classifier class with a `predict_proba` method. :type classifier: scikit-learn estimator :param \*\*kwargs: Keyword arguments to be passed to `classifier`. :type \*\*kwargs: dict, optional :returns: The Kolmogorov-Smirnov distance between the real and synthetic propensity score CDFs. :rtype: float .. rubric:: Notes The combined dataset is one-hot-encoded before being passed to the classifier so categorical features can be handled. The paper introducing SPECKS has also been published in METRON: https://doi.org/10.1007/s40300-021-00201-0.
.. py:function:: jensen_shannon_distance(real, synth, feature, bins='auto', **kwargs) Jensen-Shannon distance. Describes the difference between two distributions in terms of entropy. Calculated as the square root of the Jensen-Shannon divergence, the Jensen-Shannon distance satisfies the mathematical definition of a metric. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data.
:type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param bins: The binning method to use. If `int`, is the number of bins. If `str`, must be a method accepted by `numpy.histogram_bin_edges`. If `None`, the feature is assumed to be categorical and counts are taken for each value in either dataset. :type bins: int or str or None, default "auto" :param \*\*kwargs: Keyword arguments for `scipy.spatial.distance.jensenshannon`. :type \*\*kwargs: dict, optional :returns: **distance** -- The computed distance between the distributions. :rtype: float .. seealso:: :obj:`synthgauge.metrics.univariate_distance.jensen_shannon_divergence`, :obj:`scipy.spatial.distance.jensenshannon` .. rubric:: Notes This is a wrapper for `scipy.spatial.distance.jensenshannon`. Since this function expects probability vectors, the data is first discretised into evenly-spaced bins. We can think of the Jensen-Shannon distance as the amount of information, or entropy, encoded in the difference between the `real` and `synth` distributions of the `feature`. The distance is zero if the distributions are identical, and is bounded above by one if they are nothing alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same. An optimal 'bins' value has not been suggested. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 4], "b": ["cat", "cow", "dog", "emu"]} ... ) >>> synth = pd.DataFrame( ... {"a": [3, 3, 2, 0], "b": ["cat", "cat", "cow", "dog"]} ... ) The second feature appears to be more similar than the first across datasets: >>> jensen_shannon_distance(real, synth, "a") 0.41627730557884884 >>> jensen_shannon_distance(real, synth, "b", bins=None) 0.328452092654953 .. py:function:: jensen_shannon_divergence(real, synth, feature, bins='auto', **kwargs) Jensen-Shannon divergence. Also known as the information radius, the Jensen-Shannon divergence describes the similarity between two probability distributions in terms of entropy. This divergence modifies the Kullback-Leibler divergence to be symmetric and finite (between 0 and 1). The divergence does not satisfy the triangle inequality. Thus, it does not describe "distance" in the mathematical sense. Taking its square root provides a metric known as the Jensen-Shannon distance. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param bins: The binning method to use. If `int`, is the number of bins. If `str`, must be a method accepted by `numpy.histogram_bin_edges`. If `None`, the feature is assumed to be categorical and counts are taken for each value in either dataset. :type bins: int or str or None, default "auto" :param \*\*kwargs: Keyword arguments for `scipy.spatial.distance.jensenshannon`. :type \*\*kwargs: dict, optional :returns: The computed divergence between the distributions. :rtype: float .. seealso:: :obj:`synthgauge.metrics.univariate_distance.jensen_shannon_distance`, :obj:`scipy.spatial.distance.jensenshannon` .. 
rubric:: Notes This is a wrapper of `synthgauge.metrics.univariate_distance.jensen_shannon_distance`, which in turn wraps `scipy.spatial.distance.jensenshannon`. Since this function expects probability vectors, the data is first discretised into evenly-spaced bins. We can think of the Jensen-Shannon divergence as the amount of information, or entropy, encoded in the difference between the real and synthetic distributions of the feature. The divergence is zero if the distributions are identical, and is bounded above by one if they are nothing alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same. An optimal 'bins' value has not been suggested. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 4], "b": ["cat", "cow", "dog", "emu"]} ... ) >>> synth = pd.DataFrame( ... {"a": [3, 3, 2, 0], "b": ["cat", "cat", "cow", "dog"]} ... ) The second feature appears to be more similar than the first across datasets: >>> jensen_shannon_divergence(real, synth, "a") 0.1732867951399863 >>> jensen_shannon_divergence(real, synth, "b", bins=None) 0.10788077716941784
.. py:function:: kolmogorov_smirnov(real, synth, feature, **kwargs) Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test statistic is the maximum difference between the cumulative distribution functions of the real and synthetic features. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Name of the feature to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.ks_2samp`. :type \*\*kwargs: dict, optional :returns: * **statistic** (*float*) -- Kolmogorov-Smirnov test statistic. * **pvalue** (*float*) -- Two-tailed p-value. .. seealso:: :obj:`scipy.stats.ks_2samp` .. rubric:: Notes This is a wrapper for `scipy.stats.ks_2samp`, which tests whether two samples are drawn from the same distribution by calculating the maximum difference between their cumulative distribution functions. If the returned statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions are the same. This approach is only defined if the feature is continuous. The documentation further suggests this method works best when one of the samples has a size of only a few thousand. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]} ... ) >>> synth = pd.DataFrame( ... {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]} ... ) The first feature appears to come from the same distribution in both datasets. >>> kolmogorov_smirnov(real, synth, "a") KstestResult(statistic=0.3333333333333333, pvalue=0.9307359307359307) The second feature appears to come from different distributions in the datasets. >>> kolmogorov_smirnov(real, synth, "b") KstestResult(statistic=0.8333333333333334, pvalue=0.025974025974025972)
.. py:function:: kruskal_wallis(real, synth, feature, **kwargs) Kruskal-Wallis H test. The Kruskal-Wallis test seeks to determine whether two sets of data originated from the same distribution. This is achieved by pooling and ranking the datasets.
A low p-value suggests the two sets originate from different distributions and are not similar. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.kruskal`. :type \*\*kwargs: dict, optional :returns: * **statistic** (*float*) -- The Kruskal-Wallis H statistic. * **pvalue** (*float*) -- The p-value for the test. .. seealso:: :obj:`scipy.stats.kruskal` .. rubric:: Notes This is a wrapper function for `scipy.stats.kruskal`. The null hypothesis for this test is that the medians of the distributions are equal. The alternative hypothesis is then that they are different. This would suggest that the synthetic and real data are not similarly distributed. We notice, however, that failure to reject the null hypothesis only suggests that the medians could be equal and says nothing else about how the data are distributed. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]} ... ) >>> synth = pd.DataFrame( ... {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]} ... ) The test for the first feature suggests that the data are similarly distributed according to their medians. >>> kruskal_wallis(real, synth, "a") KruskalResult(statistic=0.5646387832699667, pvalue=0.45239722100817814) The test for the second feature is much clearer: the data appear to be drawn from distributions with different medians. >>> kruskal_wallis(real, synth, "b") KruskalResult(statistic=4.877737226277376, pvalue=0.02720526089960062)
.. py:function:: kullback_leibler(real, synth, feature, bins='auto', **kwargs) Kullback-Leibler divergence. Describes how much the synthetic feature distribution varies from the real distribution in terms of relative entropy. The divergence is asymmetric and does not satisfy the triangle inequality. Thus, it does not describe "distance" in the mathematical sense. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param bins: The binning method to use. If `int`, is the number of bins. If `str`, must be a method accepted by `numpy.histogram_bin_edges`. If `None`, the feature is assumed to be categorical and counts are taken for each value in either dataset. :type bins: int or str or None, default "auto" :param \*\*kwargs: Keyword arguments for `scipy.stats.entropy`. :type \*\*kwargs: dict, optional :returns: The computed divergence between the distributions. :rtype: float .. seealso:: :obj:`scipy.stats.entropy` .. rubric:: Notes This is a wrapper function for `scipy.stats.entropy`. Since this function expects a probability vector, the data is first discretised into evenly-spaced bins. We can think of the Kullback-Leibler divergence as a measure of the surprise we should expect when seeing an example from the real data, relative to the distribution of the synthetic. The divergence is zero if the distributions are identical, and larger values indicate that the two discretised distributions are further from one another. Note that the KL divergence may be infinite: the most common case where this happens is when the synthetic data does not contain any data in at least one bin.
An optimal 'bins' value has not been suggested. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 2, 3], "b": ["cat", "cow", "dog", "emu"]} ... ) >>> synth = pd.DataFrame( ... {"a": [3, 1, 2, 0], "b": ["cat", "cat", "cow", "dog"]} ... ) The first feature is replicated up to a re-ordering in the synthetic data, so its KL divergence is zero: >>> kullback_leibler(real, synth, "a") 0.0 However, the second feature does not include the ``"emu"`` category in the synthetic data, so the divergence is infinite: >>> kullback_leibler(real, synth, "b", bins=None) inf
.. py:function:: mann_whitney(real, synth, feature, **kwargs) Mann-Whitney U test. The Mann-Whitney test compares two sets of data by examining how well-mixed they are when pooled. This is achieved by ranking the pooled data. A low p-value suggests the data are not similar. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.mannwhitneyu`. :type \*\*kwargs: dict, optional :returns: * **statistic** (*float*) -- The Mann-Whitney U statistic, in particular U for `synth`. * **pvalue** (*float*) -- Two-sided p-value assuming an asymptotic normal distribution. .. seealso:: :obj:`scipy.stats.mannwhitneyu` .. rubric:: Notes This is a wrapper function for `scipy.stats.mannwhitneyu`. The null hypothesis for this test is that for randomly selected real and synthetic values, the probability that the real value is greater than the synthetic is the same as the probability that the synthetic value is greater than the real. We reject this hypothesis if the p-value is suitably small. This would in turn suggest that the synthetic and real data are not similarly distributed. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]} ... ) >>> synth = pd.DataFrame( ... {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]} ... ) If we were to choose our p-value threshold as 0.05, we would reach the conclusion that the distributions of the first feature are similar but the distributions of the second feature are not. >>> mann_whitney(real, synth, "a") MannwhitneyuResult(statistic=22.5, pvalue=0.5041764308016705) >>> mann_whitney(real, synth, "b") MannwhitneyuResult(statistic=31.5, pvalue=0.033439907088311766)
.. py:function:: wasserstein(real, synth, feature, **kwargs) The (first) Wasserstein distance. Also known as the "Earth Mover's" distance, this metric can be thought of as calculating the amount of "work" required to move from the distribution of the synthetic data to the distribution of the real data. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.wasserstein_distance`. :type \*\*kwargs: dict, optional :returns: The computed distance between the distributions. :rtype: float .. seealso:: :obj:`scipy.stats.wasserstein_distance` .. rubric:: Notes This is a wrapper for `scipy.stats.wasserstein_distance`.
Computationally, we can find the Wasserstein distance by calculating the area between the cumulative distribution functions for the two distributions. If :math:`s` is the synthetic feature distribution, :math:`r` is the real feature distribution, and :math:`R` and :math:`S` are their respective cumulative distribution functions, then .. math:: W(s, r) = \int_{-\infty}^{+\infty} |S - R| The distance is zero if the distributions are identical and increases as they become less alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]} ... ) >>> synth = pd.DataFrame( ... {"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]} ... ) The first feature appears to be more similar than the second across datasets: >>> wasserstein(real, synth, "a") 0.6666666666666667 >>> wasserstein(real, synth, "b") 1.166666666666667 .. py:function:: wilcoxon(real, synth, feature, **kwargs) Wilcoxon signed-rank test. In this use, the Wilcoxon test compares the distributions of paired data. It does this by ranking the pairwise differences between the real and synthetic data. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.wilcoxon`. :type \*\*kwargs: dict, optional :returns: * **statistic** (*float*) -- The sum of the ranks of the differences above or below zero, whichever is greater. * **pvalue** (*float*) -- Two-sided p-value. .. seealso:: :obj:`scipy.stats.wilcoxon` .. rubric:: Notes This is a wrapper function for `scipy.stats.wilcoxon`. The null hypothesis for this test is that the median of the paired differences is zero. The alternative hypothesis is that it is different from zero. This would suggest that the synthetic and real data are not similarly distributed. This test only makes sense when the synthetic and real data are paired. That is, each synthetic datum is matched to a real one. In which case, it is required that data are ordered to reflect this. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame({"a": [0, 1, 2, 3, 4, 5, 8, 9]}) >>> synth = pd.DataFrame({"a": [2, 0, 3, 7, 7, 4, 2, 1]}) By eye, you might think these distributions are quite different from one another. However, the Wilcoxon test suggests that these two datasets were drawn from similar distributions. >>> wilcoxon(real, synth, "a") WilcoxonResult(statistic=17.0, pvalue=0.9453125)