:py:mod:`synthgauge.metrics.propensity`
=======================================

.. py:module:: synthgauge.metrics.propensity

.. autoapi-nested-parse::

   Propensity-based utility metrics.


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   synthgauge.metrics.propensity.pmse
   synthgauge.metrics.propensity.pmse_ratio
   synthgauge.metrics.propensity.pmse_standardised
   synthgauge.metrics.propensity.propensity_metrics
   synthgauge.metrics.propensity.specks


.. py:function:: pmse(combined, indicator, method, **kwargs)

   Calculate the propensity score mean-squared error (pMSE).

   :param combined: The combined set of real and synthetic data.
   :type combined: pandas.DataFrame
   :param indicator: An indicator for which data are real (0) or
                     synthetic (1).
   :type indicator: numpy.ndarray
   :param method: Which propensity model to use. Must be either CART
                  (`"cart"`) or logistic regression with first-order
                  interactions (`"logr"`).
   :type method: {"cart", "logr"}
   :param \*\*kwargs: Keyword arguments passed to the propensity model.
   :type \*\*kwargs: dict, optional

   :returns: Propensity score mean-squared error.
   :rtype: float

   .. seealso::

      :obj:`sklearn.linear_model.LogisticRegression`,
      :obj:`sklearn.tree.DecisionTreeClassifier`

   .. rubric:: Notes

   Propensity scores represent probabilities of group membership. By
   modelling whether an example is synthetic or not, we can use propensity
   scores as a measure of utility.

   This returns zero if the distributions are identical, and is bounded
   above by :math:`1 - c` if they are nothing alike, where :math:`c` is
   the proportion of the data that is synthetic. This method is therefore
   good for comparing multiple synthetic datasets. However, as this is not
   a statistical test, there is no threshold distance below which we can
   claim the distributions are statistically the same.

   This function assumes that some preprocessing has been carried out so
   that the data is ready to be passed to the classification function.
   Encoding of categorical data is performed, but, for example, scaling is
   not. Without this, erroneous results may be returned.

   The logistic regression can fail to converge if many variables are
   considered. Anecdotally, this does not seem to drastically impact the
   propensity scores, although this should be investigated formally.

   Using a CART model as the classifier is recommended in the literature;
   however, we also support the use of logistic regression. For further
   details, see: https://doi.org/10.1111/rssa.12358
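   A minimal usage sketch. The toy data here are an illustrative
   assumption; extra keyword arguments are forwarded to the underlying
   `sklearn.tree.DecisionTreeClassifier`:

   .. code-block:: python

      import numpy as np
      import pandas as pd

      from synthgauge.metrics.propensity import pmse

      # Toy real and synthetic data with matching columns (illustrative).
      real = pd.DataFrame({"a": np.random.normal(size=100)})
      synth = pd.DataFrame({"a": np.random.normal(size=100)})

      # Stack the datasets and flag each row as real (0) or synthetic (1).
      combined = pd.concat([real, synth], ignore_index=True)
      indicator = np.array([0] * len(real) + [1] * len(synth))

      # Fit a CART propensity model; random_state is passed through to
      # the decision tree so the fit is reproducible.
      score = pmse(combined, indicator, method="cart", random_state=0)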
.. py:function:: pmse_ratio(combined, indicator, method, num_perms=None, estimator='perm', **kwargs)

   The propensity score mean-squared error ratio.

   This is the ratio of the observed pMSE to that expected under the null
   case, i.e.

   .. math::

      ratio(pMSE) = \frac{pMSE}{E(pMSE)}

   :param combined: Dataframe containing the combined real and synthetic
                    data.
   :type combined: pandas.DataFrame
   :param indicator: Indicator for whether data are real (0) or
                     synthetic (1).
   :type indicator: numpy.ndarray
   :param method: Which propensity model to use. Must be either CART
                  (`"cart"`) or logistic regression with first-order
                  interactions (`"logr"`).
   :type method: {"cart", "logr"}
   :param num_perms: Number of permutations to consider when estimating
                     the null case statistics with a CART model.
   :type num_perms: int, optional
   :param estimator: Which estimation process to use with a CART model.
                     By default, permutations are used to ensure
                     back-compatibility.
   :type estimator: {"perm", "boot"}
   :param \*\*kwargs: Keyword arguments passed to the propensity model.
   :type \*\*kwargs: dict, optional

   :returns: The observed-to-null pMSE ratio.
   :rtype: float

   .. rubric:: Notes

   The interpretation of this metric makes more sense for synthetic data.
   The pMSE alone indicates better utility as the value gets closer to
   zero, which is only attainable when the datasets are identical.
   However, when generating synthetic data, we do not want to produce
   identical entries. Rather, we want to achieve similarity between the
   distributions of the real and synthetic datasets. This ratio tends
   towards one when this is achieved, and increases otherwise.

   Note that the `random_state` keyword argument is used to
   (independently) create the permutations and to fit the model when using
   a CART model. Without specifying this, the results will not be
   reproducible.

.. py:function:: pmse_standardised(combined, indicator, method, num_perms=None, estimator='perm', **kwargs)

   The standardised propensity score mean-squared error.

   This takes the observed pMSE and standardises it against the null case,
   i.e.

   .. math::

      stand(pMSE) = \frac{pMSE - E(pMSE)}{sd(pMSE)}

   :param combined: Dataframe containing the combined real and synthetic
                    data.
   :type combined: pandas.DataFrame
   :param indicator: Indicator for whether data are real (0) or
                     synthetic (1).
   :type indicator: numpy.ndarray
   :param method: Which propensity model to use. Must be either CART
                  (`"cart"`) or logistic regression with first-order
                  interactions (`"logr"`).
   :type method: {"cart", "logr"}
   :param num_perms: Number of permutations to consider when estimating
                     the null case statistics with a CART model.
   :type num_perms: int, optional
   :param estimator: Which estimation process to use with a CART model.
                     By default, permutations are used to ensure
                     back-compatibility.
   :type estimator: {"perm", "boot"}
   :param \*\*kwargs: Keyword arguments passed to the propensity model.
   :type \*\*kwargs: dict, optional

   :returns: The null-standardised pMSE.
   :rtype: float

   .. rubric:: Notes

   The interpretation of this metric makes more sense for synthetic data.
   The pMSE alone indicates better utility as it gets closer to zero,
   which is only attainable when the datasets are identical. However, when
   generating synthetic data, we do not want to produce identical entries.
   Rather, we want to achieve similarity between the distributions of the
   real and synthetic datasets. This standardised value tends towards zero
   when this is achieved, and increases in magnitude otherwise.

   Note that the `random_state` keyword argument is used to
   (independently) create the permutations and to fit the model when using
   a CART model. Without specifying this, the results will not be
   reproducible.
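   A minimal sketch computing both null-referenced variants on the same
   combined data. The toy data are an illustrative assumption; fixing
   `random_state` makes the permutations and CART fit reproducible, as
   noted above:

   .. code-block:: python

      import numpy as np
      import pandas as pd

      from synthgauge.metrics.propensity import pmse_ratio, pmse_standardised

      real = pd.DataFrame({"a": np.random.normal(size=100)})
      synth = pd.DataFrame({"a": np.random.normal(size=100)})
      combined = pd.concat([real, synth], ignore_index=True)
      indicator = np.array([0] * len(real) + [1] * len(synth))

      # Estimate the null case from 20 permutations of the indicator.
      # A ratio near one (and a standardised value near zero) indicates
      # that the real and synthetic distributions are similar.
      ratio = pmse_ratio(
          combined, indicator, method="cart", num_perms=20, random_state=0
      )
      standardised = pmse_standardised(
          combined, indicator, method="cart", num_perms=20, random_state=0
      )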
.. py:function:: propensity_metrics(real, synth, method='cart', feats=None, num_perms=20, estimator='perm', **kwargs)

   Propensity score-based metrics.

   This function calculates three metrics based on the propensity score
   mean-squared error (pMSE), all of which quantify utility by measuring
   the distinguishability of the synthetic data. That is, how readily the
   real and synthetic data can be told apart.

   To do this, the datasets are combined and their origins tracked by a
   boolean indicator. This combined dataset is then used to fit a binary
   classification model (CART or logistic regression with first-order
   interactions) with the indicator as the target. The propensity score
   for each row is then extracted and summarised to give a metric.

   The returned metrics are the observed pMSE along with the pMSE ratio
   and the standardised pMSE. The latter two metrics are given relative to
   the null case, where the real and synthetic data are produced from
   identical processes.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param method: Which propensity model to use. Must be either CART
                  (`"cart"`) or logistic regression with first-order
                  interactions (`"logr"`).
   :type method: {"cart", "logr"}, default "cart"
   :param feats: List of features in the dataset to be used in the
                 propensity model. If `None` (default), all common features
                 are used.
   :type feats: list of str or None, default None
   :param num_perms: Number of permutations to consider when estimating
                     the null case statistics with a CART model.
   :type num_perms: int, default 20
   :param estimator: Which estimation process to use with a CART model.
                     By default, permutations are used to ensure
                     back-compatibility.
   :type estimator: {"perm", "boot"}
   :param \*\*kwargs: Keyword arguments passed to the propensity model.
   :type \*\*kwargs: dict, optional

   :returns: * **observed** (*float*) -- The observed pMSE.
             * **standard** (*float*) -- The null-standardised pMSE.
             * **ratio** (*float*) -- The observed-to-null pMSE ratio.

   :raises ValueError: If `method` is not one of `'cart'` or `'logr'`.

   .. seealso::

      :obj:`sklearn.linear_model.LogisticRegression`,
      :obj:`sklearn.tree.DecisionTreeClassifier`,
      :obj:`synthgauge.metrics.propensity.pmse`,
      :obj:`synthgauge.metrics.propensity.pmse_ratio`,
      :obj:`synthgauge.metrics.propensity.pmse_standardised`

   .. rubric:: Notes

   For the CART model, `sklearn.tree.DecisionTreeClassifier` is used,
   while the logistic regression model uses
   `sklearn.linear_model.LogisticRegression`.

   Note that the `random_state` keyword argument is used to
   (independently) create the permutations and to fit the model when using
   a CART model. Without specifying this, the results will not be
   reproducible.

   Details on these metrics can be found at:
   https://doi.org/10.1111/rssa.12358
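   A minimal end-to-end sketch. The toy data are an illustrative
   assumption; the unpacking order follows the documented returns
   (`observed`, `standard`, `ratio`):

   .. code-block:: python

      import numpy as np
      import pandas as pd

      from synthgauge.metrics.propensity import propensity_metrics

      real = pd.DataFrame({"a": np.random.normal(size=100)})
      synth = pd.DataFrame({"a": np.random.normal(size=100)})

      # Fit a CART propensity model and summarise it three ways; the
      # random_state keyword fixes both the permutations and the model fit.
      observed, standard, ratio = propensity_metrics(
          real, synth, method="cart", num_perms=20, random_state=0
      )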
.. py:function:: specks(real, synth, classifier, **kwargs)

   Propensity score comparison via the Kolmogorov-Smirnov distance.

   The SPECKS metric was originally presented in
   https://arxiv.org/pdf/1803.06763.pdf and works as follows:

   1. Stack the real and synthetic data, and create a variable indicating
      whether each record is real (0) or synthetic (1).
   2. Calculate the propensity score for each record using a binary
      classifier on the indicator variable.
   3. Compute the Kolmogorov-Smirnov distance between the empirical CDFs
      for the real and synthetic propensity scores.

   The Kolmogorov-Smirnov distance is defined as the maximum difference
   between two empirical distributions. Therefore, it is bounded between
   zero and one. If the synthetic data properly resemble the original
   data, the two will be indistinguishable, leading to close empirical
   CDFs.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param classifier: Any `scikit-learn`-style classifier class with a
                      `predict_proba` method.
   :type classifier: scikit-learn estimator
   :param \*\*kwargs: Keyword arguments to be passed to `classifier`.
   :type \*\*kwargs: dict, optional

   :returns: The Kolmogorov-Smirnov distance between the real and
             synthetic propensity score CDFs.
   :rtype: float

   .. rubric:: Notes

   The combined dataset is one-hot-encoded before being passed to the
   classifier so that categorical features can be handled.

   The paper introducing SPECKS has also been published in METRON:
   https://doi.org/10.1007/s40300-021-00201-0
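   A minimal usage sketch. The toy data and the choice of classifier are
   illustrative assumptions; any scikit-learn classifier class exposing
   `predict_proba` would do:

   .. code-block:: python

      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LogisticRegression

      from synthgauge.metrics.propensity import specks

      real = pd.DataFrame({"a": np.random.normal(size=100)})
      synth = pd.DataFrame({"a": np.random.normal(size=100)})

      # Pass the classifier *class*, not an instance; keyword arguments
      # are forwarded to it when the propensity model is built.
      distance = specks(real, synth, LogisticRegression, max_iter=1000)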