:py:mod:`synthgauge.metrics` ============================ .. py:module:: synthgauge.metrics .. autoapi-nested-parse:: A submodule for all utility and privacy metrics. Submodules ---------- .. toctree:: :titlesonly: :maxdepth: 1 classification/index.rst cluster/index.rst correlation/index.rst density/index.rst nist/index.rst privacy/index.rst propensity/index.rst univariate/index.rst Package Contents ---------------- Functions ~~~~~~~~~ .. autoapisummary:: synthgauge.metrics.classification_comparison synthgauge.metrics.clustered_msd synthgauge.metrics.correlation_msd synthgauge.metrics.correlation_ratio_msd synthgauge.metrics.feature_density_mad synthgauge.metrics.hoc synthgauge.metrics.kway_marginals synthgauge.metrics.min_nearest_neighbour synthgauge.metrics.sample_overlap_score synthgauge.metrics.tcap_score synthgauge.metrics.pmse synthgauge.metrics.pmse_ratio synthgauge.metrics.pmse_standardised synthgauge.metrics.propensity_metrics synthgauge.metrics.specks synthgauge.metrics.jensen_shannon_distance synthgauge.metrics.jensen_shannon_divergence synthgauge.metrics.kolmogorov_smirnov synthgauge.metrics.kruskal_wallis synthgauge.metrics.kullback_leibler synthgauge.metrics.mann_whitney synthgauge.metrics.wasserstein synthgauge.metrics.wilcoxon
.. py:function:: classification_comparison(real, synth, feats, target, classifier, test_prop=0.2, random_state=None, **kwargs) Classification utility metric. This metric fits two (identical) classification models to `real` and `synth`, and then tests them both against withheld `real` data. We obtain utility scores by subtracting the precision, recall and f1 scores of the "synthetic" model predictions from the "real" model's. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: List of column names to use as the input in the classification. :type feats: list of str :param target: Column to use as target in the classification. :type target: str :param classifier: Classifier class with `fit` and `predict` methods. :type classifier: scikit-learn estimator :param test_prop: If `float`, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If `int`, represents the absolute number of test samples. :type test_prop: float or int, default 0.2 :param random_state: Random seed for shuffling during the train-test split, and for the classification algorithm itself. :type random_state: int, optional :param \*\*kwargs: Keyword arguments passed to the classifier. :type \*\*kwargs: dict, optional :returns: * **precision_difference** (*float*) -- Precision of the real model minus that of the synthetic model. * **recall_difference** (*float*) -- Recall of the real model minus that of the synthetic model. * **f1_difference** (*float*) -- f1 score of the real model minus that of the synthetic model. .. rubric:: Notes Some preprocessing is carried out before the models are trained. Numeric features are scaled and categorical features are one-hot-encoded. A score of zero tells us the synthetic data is just as good as the real at training the given classification model. Increases in these scores indicate poorer utility.
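.. rubric:: Example

A minimal usage sketch with illustrative toy data (the dataframes, column names and the choice of `sklearn.tree.DecisionTreeClassifier` below are hypothetical, not taken from the package's own documentation):

.. code-block:: python

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    from synthgauge.metrics import classification_comparison

    real = pd.DataFrame(
        {
            "age": [25, 40, 31, 58, 47, 36, 29, 52, 44, 61],
            "income": [20, 45, 30, 60, 52, 38, 26, 55, 48, 63],
            "group": ["a", "b", "a", "b", "b", "a", "a", "b", "b", "b"],
        }
    )
    synth = pd.DataFrame(
        {
            "age": [27, 42, 29, 55, 50, 33, 30, 49, 46, 59],
            "income": [22, 43, 28, 63, 49, 40, 25, 57, 45, 66],
            "group": ["a", "b", "a", "b", "a", "a", "a", "b", "b", "b"],
        }
    )

    # Differences close to zero suggest the synthetic data trains the
    # classifier about as well as the real data does.
    precision_diff, recall_diff, f1_diff = classification_comparison(
        real,
        synth,
        feats=["age", "income"],
        target="group",
        classifier=DecisionTreeClassifier,
        random_state=0,
    )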
.. py:function:: clustered_msd(real, synth, feats=None, method='kmeans', k_min=2, k_max=10, random_state=None) (Multiple) clustered mean-squared difference (MSD). This metric clusters the real and synthetic data together, measuring the synthetic utility according to its representation across the fitted clusters. Since there is often no obvious choice for the number of clusters, :math:`k`, we consider a range of values. For each value of :math:`k`, the chosen clustering method is fit and the proportion of synthetic data in each cluster is recorded. The clustered MSD is then calculated as the mean-squared difference between these proportions and the overall proportion of synthetic data. This collection of MSDs is summarised by taking its minimum to give the metric value. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: Features to use in the clustering. If `None` (default), all common features are used. :type feats: list of str or None, default None :param method: Clustering method to use. Only k-means and k-prototypes are implemented. If using k-means (default), only numeric columns are considered, while k-prototypes allows for mixed-type clustering. :type method: {"kmeans", "kprototypes"}, default "kmeans" :param k_min: Minimum number of clusters to use. Default of 2. :type k_min: int, default 2 :param k_max: Maximum number of clusters to use. Default of 10. :type k_max: int, default 10 :param random_state: The random seed used to fit the clustering algorithm. :type random_state: int, optional :returns: The minimum observed clustered MSD. :rtype: float .. rubric:: Notes This function can be used with a single value of `k` by setting `k_min` and `k_max` both to `k`, for instance when a sensible number of clusters is known a priori. This metric says nothing about how appropriate the clustering method may be for the data at hand, nor how the data are distributed among the clusters. Both methods considered here have rather strong assumptions about the relative size and characteristics of the clusters in the data. As such, exploratory analysis is advised to determine whether such centroid-based clustering is well-suited.
.. py:function:: correlation_msd(real, synth, method='pearson', feats=None) Mean-squared difference in correlation coefficients. This metric calculates the mean-squared difference between the correlation matrices for the real and synthetic datasets. This gives an indication of how well the synthetic data has retained bivariate relationships. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param method: Correlation method to use. :type method: {"pearson", "spearman", "cramers_v"}, default "pearson" :param feats: Features to measure correlation across. If `method="cramers_v"`, all numeric columns will be filtered out. Likewise, for the other correlation methods, all non-numeric columns are removed. If `None` (default), all common features that satisfy the needs of `method` are used. :type feats: list of str or None, default None :returns: Mean-squared difference of correlation coefficients. :rtype: float .. seealso:: :obj:`numpy.corrcoef` .. rubric:: Notes The smaller the number this function returns, the better the synthetic data captures the correlation between variables in the real data. This method is therefore good for comparing multiple synthetic datasets to see which is closest to the real. However, as this is not a test, there is no threshold value below which we can claim the datasets are statistically the same. We only consider the coefficients above the main diagonal when calculating the MSD. If we included the entire matrix, we would double-count each pair of features as well as including the trivial ones along the main diagonal.
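.. rubric:: Example

A minimal usage sketch (the toy dataframes below are illustrative only):

.. code-block:: python

    import pandas as pd

    from synthgauge.metrics import correlation_msd

    real = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 5, 4, 6]})
    synth = pd.DataFrame({"a": [1, 1, 3, 4, 5], "b": [3, 4, 4, 5, 7]})

    # Smaller values indicate that the synthetic data better preserves
    # the pairwise correlations seen in the real data.
    msd = correlation_msd(real, synth, method="pearson")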
.. py:function:: correlation_ratio_msd(real, synth, categorical=None, numeric=None) Correlation ratio mean-squared difference. This metric calculates the mean-squared difference in association between categorical and continuous feature pairings in the real and synthetic datasets. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param categorical: Categorical features in `real` and `synth` to include in comparison. If `None` (default), uses all common object-type columns. :type categorical: list of str or None, default None :param numeric: Numerical features in `real` and `synth` to include in comparison. If `None` (default), uses all common columns not selected by `categorical`. :type numeric: list of str or None, default None :returns: Mean-squared difference between `real` and `synth` in correlation ratio scores across all categorical-continuous feature pairs. :rtype: float
.. py:function:: feature_density_mad(real, synth, feats=None, bins=10) Mean absolute difference of feature densities. For each feature, the difference between the densities across the bins of `real` and `synth` is calculated. Finally, the mean absolute difference across all features and bins is calculated. A value close to 0 indicates that the real and synthetic datasets have a similar set of feature distributions. :param real: DataFrame containing the real data. :type real: pandas.DataFrame :param synth: DataFrame containing the synthetic data. :type synth: pandas.DataFrame :param feats: The features that will be used to compute the densities. If `None` (default), all common features are used. :type feats: list of str or None, default None :param bins: Binning method for discretising the data. Can be anything accepted by `numpy.histogram_bin_edges`. Default uses 10 bins. :type bins: str or int, default 10 :returns: Mean absolute difference of feature densities. :rtype: float
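.. rubric:: Example

A minimal usage sketch (the toy dataframes and the choice of five bins are illustrative only):

.. code-block:: python

    import pandas as pd

    from synthgauge.metrics import feature_density_mad

    real = pd.DataFrame({"a": [0.1, 0.4, 0.6, 0.9, 0.5], "b": [1, 3, 2, 5, 4]})
    synth = pd.DataFrame({"a": [0.2, 0.3, 0.7, 0.8, 0.5], "b": [2, 3, 2, 4, 5]})

    # Values close to zero indicate that the feature distributions
    # of the two datasets are similar.
    mad = feature_density_mad(real, synth, feats=["a", "b"], bins=5)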
.. py:function:: hoc(real, synth, trials=300, prob=0.1, seed=None) A measure based on Higher Order Conjunctions (HOC). This measure compares the relative sizes of randomly selected pools of "similar" rows in the real and synthetic data. This measure of similarity is defined across a set of randomly generated test cases applied to each dataset. Each test case consists of a set of rules. The :math:`i`-th test calculates the fraction of records satisfying its rules in the real data, :math:`f_{ri}`, and the synthetic, denoted :math:`f_{si}`. Their dissimilarity in test :math:`i` is quantified as: .. math:: d_i = \ln\left(\max(f_{si}, 10^{-6})\right) - \ln(f_{ri}) These dissimilarities are summarised as: .. math:: \Delta = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2} where :math:`N` is the number of test cases. Finally, this is transformed to a HOC score: .. math:: HOC = \max \left(0, 1 + \frac{\Delta}{\ln(10^{-3})}\right) This measure is bounded between 0 and 1, indicating whether the datasets are nothing alike or identical based on the test cases, respectively. In the original text this score is multiplied by 1000 to make it human-readable. Full details are available in https://doi.org/10.6028/NIST.TN.2151. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param trials: Number of test cases to create. Default of 300 as in the competition. :type trials: int, default 300 :param prob: Probability of any column being included in a test case. Default of 0.1 as in the competition. :type prob: float, default 0.1 :param seed: Random number seed. If `None`, results will not be reproducible. :type seed: int or None, default None :returns: **score** -- The overall HOC score. :rtype: float .. rubric:: Notes It is possible that some test cases will be "empty", i.e. when no columns are selected. In this scenario, the score for that case will be `np.nan` rather than it being resampled.
.. py:function:: kway_marginals(real, synth, k=3, trials=100, bins=100, seed=None) A measure based on the similarity of a set of k-way marginals. This measure works as follows: 1. Discretise all numeric features (based on the original data). 2. Randomly select `trials` sets of `k` features and calculate the corresponding marginal counts for each dataset. 3. Calculate summed absolute deviation in the counts across all bins and marginal sets. 4. Transform the summed absolute deviations, :math:`s`, to form a set of scores :math:`S = \left\{1 - s/2 \mid \text{for each } s\right\}`. This transformation ensures the scores are in :math:`[0, 1]`. These extremes represent the worst- and best-case scenarios, respectively. 5. Return the mean score. The NIST competition utilised a set of 100 three-way marginals. Details can be found at https://doi.org/10.6028/NIST.TN.2151. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param k: Number of features to include in each k-way marginal. Default uses 3 (as done in the NIST competition). :type k: int, default 3 :param trials: Maximum number of marginals to consider to estimate the overall score. If there are fewer `k`-way combinations than `trials`, tries all. Default uses 100 (as done in the NIST competition). :type trials: int, default 100 :param bins: Binning method for sampled numeric features. Can be anything accepted by `numpy.histogram`. Default uses 100 bins (as done in the NIST competition). :type bins: int or str, default 100 :param seed: Random number seed. If `None`, results will not be reproducible. :type seed: int or None, default None :returns: **score** -- The mean transformed sum absolute deviation in k-way densities. :rtype: float
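.. rubric:: Example

A minimal usage sketch; the toy dataframes are illustrative, and `k`, `trials` and `bins` are reduced from the NIST defaults so that all three possible two-way marginals of this tiny dataset are tried:

.. code-block:: python

    import pandas as pd

    from synthgauge.metrics import kway_marginals

    real = pd.DataFrame(
        {"a": [1, 2, 3, 4, 5, 6], "b": [2, 1, 4, 3, 6, 5], "c": [1, 1, 2, 2, 3, 3]}
    )
    synth = pd.DataFrame(
        {"a": [1, 3, 3, 4, 4, 6], "b": [2, 2, 4, 4, 6, 6], "c": [1, 2, 2, 2, 3, 3]}
    )

    # Scores lie in [0, 1]; higher values mean the k-way marginals of the
    # two datasets agree more closely.
    score = kway_marginals(real, synth, k=2, trials=3, bins=5, seed=0)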
.. py:function:: min_nearest_neighbour(real, synth, feats=None, outliers_only=True, threshold=2, n_neighbours=5) Minimum nearest-neighbour distance. This privacy metric returns the smallest distance between any point in the real dataset and any point in the synthetic dataset. There is an option to only consider the outliers in the real dataset as these perhaps pose more of a privacy concern. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: Features in `real` and `synth` to use when calculating distance. If `None` (default), all common features are used. :type feats: list of str or None, default None :param outliers_only: Boolean indicating whether to filter out the real data inliers (default) or not. :type outliers_only: bool, default True :param threshold: Outlier decision threshold. Increase to include fewer points from `real` in nearest-neighbour calculations. :type threshold: number, default 2 :param n_neighbours: Number of neighbours to consider when identifying local outliers. :type n_neighbours: int, default 5 :returns: Minimum Manhattan distance between `real` and `synth` data. :rtype: float .. rubric:: Notes This privacy metric provides an insight into whether the synthetic dataset is too similar to the real dataset. It does this by calculating the minimum distance between the real records and the synthetic records. This metric assumes that categorical data is ordinal during distance calculations, or that it has already been suitably one-hot-encoded.
.. py:function:: sample_overlap_score(real, synth, feats=None, sample_size=0.2, runs=5, seed=None, score_type='unique') Return percentage of overlap between real and synth data based on random sampling. Samples from both the real and synthetic datasets are compared for similarity. This similarity, or overlap score, is based on the exact matches of real data records within the synthetic data. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: The features used to match records. If `None` (default), all common features are used. :type feats: list of str or None, default None :param sample_size: The ratio (if `sample_size` between 0 and 1) or count (`sample_size` > 1) of records to sample. Default is 0.2 (20%). :type sample_size: float or int, default 0.2 :param runs: The number of sampling runs to use when computing the score. :type runs: int, default 5 :param seed: Random number seed used for sampling. :type seed: int, optional :param score_type: Method used for calculating the overlap score. If "unique" (default), the score is the percentage of unique records in the real sample that have a match within the synthetic data. If "sample", the score is the percentage of all records within the real sample that have a match within the synth sample. :type score_type: {"unique", "sample"}, default "unique" :returns: **overlap_score** -- Estimated overlap score between `real` and `synth`. :rtype: float
.. py:function:: tcap_score(real, synth, key, target) Target Correct Attribution Probability (TCAP) score. This privacy metric calculates the average chance that the key-target pairings in a synthetic dataset reveal the true key-target pairings in the associated real dataset. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param key: List of features in `synth` to use as the key. :type key: list of str :param target: Feature to use as the target. :type target: str :returns: The average TCAP score across the dataset. :rtype: float .. rubric:: Notes This metric provides an estimate of how well an intruder could infer attributes of groups in the real dataset by studying the synthetic. The choices for `key` and `target` will vary depending on the dataset in question but we would suggest the `key` features are those that could be readily available to an outsider and the `target` feature is one we would not want them to find out, such as a protected characteristic. This method only works with categorical data, so binning of continuous data may be required. Full details may be found in: Taub and Elliott (2019). The Synthetic Data Challenge. The Hague, The Netherlands: Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Session 3.
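.. rubric:: Example

A minimal usage sketch with hypothetical categorical data (the column names and values are illustrative only):

.. code-block:: python

    import pandas as pd

    from synthgauge.metrics import tcap_score

    real = pd.DataFrame(
        {
            "region": ["N", "N", "S", "S", "E"],
            "age_band": ["16-30", "31-50", "16-30", "51+", "31-50"],
            "status": ["yes", "no", "no", "yes", "no"],
        }
    )
    synth = pd.DataFrame(
        {
            "region": ["N", "S", "S", "E", "N"],
            "age_band": ["16-30", "16-30", "51+", "31-50", "31-50"],
            "status": ["yes", "no", "yes", "no", "no"],
        }
    )

    # Scores close to one suggest an intruder could reliably infer the
    # target from the key by studying the synthetic data.
    score = tcap_score(real, synth, key=["region", "age_band"], target="status")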
.. py:function:: pmse(combined, indicator, method, **kwargs) Calculate the propensity score mean-squared error (pMSE). :param combined: The combined set of real and synthetic data. :type combined: pandas.DataFrame :param indicator: An indicator for which data are real (0) or synthetic (1). :type indicator: numpy.ndarray :param method: Which propensity model to use. Must be either CART (`"cart"`) or logistic regression with first-order interactions (`"logr"`). :type method: {"cart", "logr"} :param \*\*kwargs: Keyword arguments passed to propensity model. :type \*\*kwargs: dict, optional :returns: Propensity score mean-squared error. :rtype: float .. seealso:: :obj:`sklearn.linear_model.LogisticRegression`, :obj:`sklearn.tree.DecisionTreeClassifier` .. rubric:: Notes Propensity scores represent probabilities of group membership. By modelling whether an example is synthetic or not, we can use propensity scores as a measure of utility. This returns zero if the distributions are identical, and is bounded above by :math:`1 - c` if they are nothing alike, where :math:`c` is the proportion of the data that is synthetic. This method is therefore good for comparing multiple synthetic datasets. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same. This function assumes that some preprocessing has been carried out so that the data is ready to be passed to the classification function. Encoding of categorical data is performed, but, for example, scaling is not. Without this, erroneous results may be returned. The logistic regression can fail to converge if many variables are considered. Anecdotally, this doesn't seem to drastically impact the propensity scores, although this should be investigated formally. Using a CART model as a classifier is recommended in the literature; however, we also support the use of logistic regression. For further details, see: https://doi.org/10.1111/rssa.12358
.. py:function:: pmse_ratio(combined, indicator, method, num_perms=None, estimator='perm', **kwargs) The propensity score mean-squared error ratio. This is the ratio of observed pMSE to that expected under the null case, i.e. .. math:: ratio(pMSE) = \frac{pMSE}{E(pMSE)} :param combined: Dataframe containing the combined real and synthetic data. :type combined: pandas.DataFrame :param indicator: Indicator for whether data are real (0) or synthetic (1). :type indicator: numpy.ndarray :param method: Which propensity model to use. Must be either CART (`"cart"`) or logistic regression with first-order interactions (`"logr"`). :type method: {"cart", "logr"} :param num_perms: Number of permutations to consider when estimating the null case statistics with a CART model. :type num_perms: int, optional :param estimator: Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility. :type estimator: {"perm", "boot"} :param \*\*kwargs: Keyword arguments passed to the propensity model classifier. :type \*\*kwargs: dict, optional :returns: The observed-to-null pMSE ratio. :rtype: float .. rubric:: Notes The interpretation of this metric makes more sense for synthetic data. The pMSE alone gives better utility as the value gets closer to zero, which is only attainable when the datasets are identical. However, when generating synthetic data, we do not want to produce identical entries. Rather, we want to achieve similarity between the distributions of the real and synthetic datasets. This ratio tends towards one when this is achieved, and increases otherwise. Note that the `random_state` keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.
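.. rubric:: Example

A minimal usage sketch. The toy dataframes are illustrative; `num_perms=20` simply mirrors the default used by `propensity_metrics`, and `random_state` is passed through the keyword arguments as described in the notes above:

.. code-block:: python

    import numpy as np
    import pandas as pd

    from synthgauge.metrics import pmse_ratio

    real = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [2, 4, 5, 4, 6, 7]})
    synth = pd.DataFrame({"a": [1, 2, 2, 4, 5, 5], "b": [3, 4, 4, 5, 6, 6]})

    # Stack the datasets and flag which rows are real (0) or synthetic (1).
    combined = pd.concat([real, synth], ignore_index=True)
    indicator = np.array([0] * len(real) + [1] * len(synth))

    # Values near one suggest the real and synthetic distributions are similar.
    ratio = pmse_ratio(
        combined, indicator, method="cart", num_perms=20, random_state=0
    )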
.. py:function:: pmse_standardised(combined, indicator, method, num_perms=None, estimator='perm', **kwargs) The standardised propensity score mean-squared error. This takes the observed pMSE and standardises it against the null case, i.e. .. math:: stand(pMSE) = \frac{pMSE - E(pMSE)}{sd(pMSE)} :param combined: Dataframe containing the combined real and synthetic data. :type combined: pandas.DataFrame :param indicator: Indicator for whether data are real (0) or synthetic (1). :type indicator: numpy.ndarray :param method: Which propensity model to use. Must be either CART (`"cart"`) or logistic regression with first-order interactions (`"logr"`). :type method: {"cart", "logr"} :param num_perms: Number of permutations to consider when estimating the null case statistics with a CART model. :type num_perms: int, optional :param estimator: Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility. :type estimator: {"perm", "boot"} :param \*\*kwargs: Keyword arguments passed to the propensity model. :type \*\*kwargs: dict, optional :returns: The null-standardised pMSE. :rtype: float .. rubric:: Notes The interpretation of this metric makes more sense for synthetic data. The pMSE alone indicates better utility as it gets closer to zero, which is only attainable when the datasets are identical. However, when generating synthetic data, we do not want to produce identical entries. Rather, we want to achieve similarity between the distributions of the real and synthetic datasets. This standardised value tends towards zero when this is achieved, and increases in magnitude otherwise. Note that the `random_state` keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible.
.. py:function:: propensity_metrics(real, synth, method='cart', feats=None, num_perms=20, estimator='perm', **kwargs) Propensity score-based metrics. This function calculates three metrics based on the propensity score mean-squared error (pMSE), all of which quantify utility by measuring the distinguishability of the synthetic data. That is, how readily real and synthetic data can be identified. To do this, the datasets are combined and their origins tracked by a boolean indicator. This combined dataset is then used to fit a binary classification model (CART or logistic regression with first-order interactions) with the indicator as the target. The propensity score for each row is then extracted and summarised to give a metric. The returned metrics are the observed pMSE along with the pMSE ratio and standardised pMSE. The latter two metrics are given relative to the null case where the real and synthetic data are produced from identical processes. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param method: Which propensity model to use. Must be either CART (`"cart"`) or logistic regression with first-order interactions (`"logr"`). :type method: {"cart", "logr"}, default "cart" :param feats: List of features in the dataset to be used in the propensity model.
If `None` (default), all common features are used. :type feats: list of str or None, default None :param num_perms: Number of permutations to consider when estimating the null case statistics with a CART model. :type num_perms: int, default 20 :param estimator: Which estimation process to use with a CART model. By default, permutations are used to ensure back-compatibility. :type estimator: {"perm", "boot"} :param \*\*kwargs: Keyword arguments passed to the propensity model. :type \*\*kwargs: dict, optional :returns: * **observed** (*float*) -- The observed pMSE. * **standard** (*float*) -- The null-standardised pMSE. * **ratio** (*float*) -- The observed-to-null pMSE ratio. :raises ValueError: If `method` is not one of `'cart'` or `'logr'`. .. seealso:: :obj:`sklearn.linear_model.LogisticRegression`, :obj:`sklearn.tree.DecisionTreeClassifier`, :obj:`synthgauge.metrics.propensity.pmse`, :obj:`synthgauge.metrics.propensity.pmse_ratio`, :obj:`synthgauge.metrics.propensity.pmse_standardised` .. rubric:: Notes For the CART model, `sklearn.tree.DecisionTreeClassifier` is used. Meanwhile, the logistic regression model uses `sklearn.linear_model.LogisticRegression`. Note that the `random_state` keyword argument is used to (independently) create the permutations and to fit the model when using a CART model. Without specifying this, the results will not be reproducible. Details on these metrics can be found at: https://doi.org/10.1111/rssa.12358
.. py:function:: specks(real, synth, classifier, **kwargs) Propensity score comparison via the Kolmogorov-Smirnov distance. The SPECKS metric was originally presented in https://arxiv.org/pdf/1803.06763.pdf and works as follows: 1. Stack the real and synthetic data, and create a variable indicating whether each record is real (0) or synthetic (1). 2. Calculate the propensity score for each record using a binary classifier on the indicator variable. 3. Compute the Kolmogorov-Smirnov distance between the empirical CDFs for the real and synthetic propensity scores. The Kolmogorov-Smirnov distance is defined as the maximum difference between two empirical cumulative distribution functions. Therefore, it is bounded between zero and one. If the synthetic data properly resembles the original data then they will be indistinguishable, leading to close empirical CDFs. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param classifier: Any `scikit-learn`-style classifier class with a `predict_proba` method. :type classifier: scikit-learn estimator :param \*\*kwargs: Keyword arguments to be passed to `classifier`. :type \*\*kwargs: dict, optional :returns: The Kolmogorov-Smirnov distance between the real and synthetic propensity score CDFs. :rtype: float .. rubric:: Notes The combined dataset is one-hot-encoded before being passed to the classifier so categorical features can be handled. The paper introducing SPECKS has also been published in METRON: https://doi.org/10.1007/s40300-021-00201-0.
.. py:function:: jensen_shannon_distance(real, synth, feature, bins='auto', **kwargs) Jensen-Shannon distance. Describes the difference between two distributions in terms of entropy. Calculated as the square root of the Jensen-Shannon divergence, the Jensen-Shannon distance satisfies the mathematical definition of a metric. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data.
:type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param bins: The binning method to use. If `int`, is the number of bins. If `str`, must be a method accepted by `numpy.histogram_bin_edges`. If `None`, the feature is assumed to be categorical and counts are taken for each value in either dataset. :type bins: int or str or None, default "auto" :param \*\*kwargs: Keyword arguments for `scipy.spatial.distance.jensenshannon`. :type \*\*kwargs: dict, optional :returns: **distance** -- The computed distance between the distributions. :rtype: float .. seealso:: :obj:`synthgauge.metrics.univariate_distance.jensen_shannon_divergence`, :obj:`scipy.spatial.distance.jensenshannon` .. rubric:: Notes This is a wrapper for `scipy.spatial.distance.jensenshannon`. Since this function expects probability vectors, the data is first discretised into evenly-spaced bins. We can think of the Jensen-Shannon distance as the amount of information, or entropy, encoded in the difference between the `real` and `synth` distributions of the `feature`. The distance is zero if the distributions are identical, and is bounded above by one if they are nothing alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same. An optimal 'bins' value has not been suggested. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 4], "b": ["cat", "cow", "dog", "emu"]} ... ) >>> synth = pd.DataFrame( ... {"a": [3, 3, 2, 0], "b": ["cat", "cat", "cow", "dog"]} ... ) The second feature appears to be more similar than the first across datasets: >>> jensen_shannon_distance(real, synth, "a") 0.41627730557884884 >>> jensen_shannon_distance(real, synth, "b", bins=None) 0.328452092654953 .. py:function:: jensen_shannon_divergence(real, synth, feature, bins='auto', **kwargs) Jensen-Shannon divergence. Also known as the information radius, the Jensen-Shannon divergence describes the similarity between two probability distributions in terms of entropy. This divergence modifies the Kullback-Leibler divergence to be symmetric and finite (between 0 and 1). The divergence does not satisfy the triangle inequality. Thus, it does not describe "distance" in the mathematical sense. Taking its square root provides a metric known as the Jensen-Shannon distance. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param bins: The binning method to use. If `int`, is the number of bins. If `str`, must be a method accepted by `numpy.histogram_bin_edges`. If `None`, the feature is assumed to be categorical and counts are taken for each value in either dataset. :type bins: int or str or None, default "auto" :param \*\*kwargs: Keyword arguments for `scipy.spatial.distance.jensenshannon`. :type \*\*kwargs: dict, optional :returns: The computed divergence between the distributions. :rtype: float .. seealso:: :obj:`synthgauge.metrics.univariate_distance.jensen_shannon_distance`, :obj:`scipy.spatial.distance.jensenshannon` .. 
rubric:: Notes This is a wrapper of `synthgauge.metrics.univariate_distance.jensen_shannon_distance`, which in turn wraps `scipy.spatial.distance.jensenshannon`. Since this function expects probability vectors, the data is first discretised into evenly-spaced bins. We can think of the Jensen-Shannon divergence as the amount of information, or entropy, encoded in the difference between the real and synthetic distributions of the feature. The divergence is zero if the distributions are identical, and is bounded above by one if they are nothing alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same. An optimal 'bins' value has not been suggested. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 4], "b": ["cat", "cow", "dog", "emu"]} ... ) >>> synth = pd.DataFrame( ... {"a": [3, 3, 2, 0], "b": ["cat", "cat", "cow", "dog"]} ... ) The second feature appears to be more similar than the first across datasets: >>> jensen_shannon_divergence(real, synth, "a") 0.1732867951399863 >>> jensen_shannon_divergence(real, synth, "b", bins=None) 0.10788077716941784
.. py:function:: kolmogorov_smirnov(real, synth, feature, **kwargs) Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test statistic is the maximum difference between the cumulative distribution functions of the real and synthetic features. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Name of the feature to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.ks_2samp`. :type \*\*kwargs: dict, optional :returns: * **statistic** (*float*) -- Kolmogorov-Smirnov test statistic. * **pvalue** (*float*) -- Two-tailed p-value. .. seealso:: :obj:`scipy.stats.ks_2samp` .. rubric:: Notes This is a wrapper for `scipy.stats.ks_2samp`, which tests whether two samples are drawn from the same distribution by calculating the maximum difference between their cumulative distribution functions. If the returned statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions are the same. This approach is only defined if the feature is continuous. The documentation further suggests this method works best when one of the samples has a size of only a few thousand. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]} ... ) >>> synth = pd.DataFrame( ... {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]} ... ) The first feature appears to come from the same distribution in both datasets. >>> kolmogorov_smirnov(real, synth, "a") KstestResult(statistic=0.3333333333333333, pvalue=0.9307359307359307) The second feature appears to come from different distributions in the datasets. >>> kolmogorov_smirnov(real, synth, "b") KstestResult(statistic=0.8333333333333334, pvalue=0.025974025974025972)
.. py:function:: kruskal_wallis(real, synth, feature, **kwargs) Kruskal-Wallis H test. The Kruskal-Wallis test seeks to determine whether two sets of data originated from the same distribution. This is achieved by pooling and ranking the datasets.
A low p-value suggests the two sets originate from different distributions and are not similar. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.kruskal`. :type \*\*kwargs: dict, optional :returns: * **statistic** (*float*) -- The Kruskal-Wallis H statistic. * **pvalue** (*float*) -- The p-value for the test. .. seealso:: :obj:`scipy.stats.kruskal` .. rubric:: Notes This is a wrapper function for `scipy.stats.kruskal`. The null hypothesis for this test is that the medians of the distributions are equal. The alternative hypothesis is then that they are different. This would suggest that the synthetic and real data are not similarly distributed. We notice, however, that failure to reject the null hypothesis only suggests that the medians could be equal and says nothing else about how the data are distributed. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]} ... ) >>> synth = pd.DataFrame( ... {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]} ... ) The test for the first feature suggests that the data are similarly distributed according to their medians. >>> kruskal_wallis(real, synth, "a") KruskalResult(statistic=0.5646387832699667, pvalue=0.45239722100817814) The test for the second feature is much clearer: the data appear to be drawn from distributions with different medians. >>> kruskal_wallis(real, synth, "b") KruskalResult(statistic=4.877737226277376, pvalue=0.02720526089960062)
.. py:function:: kullback_leibler(real, synth, feature, bins='auto', **kwargs) Kullback-Leibler divergence. Describes how much the synthetic feature distribution varies from the real distribution in terms of relative entropy. The divergence is asymmetric and does not satisfy the triangle inequality. Thus, it does not describe "distance" in the mathematical sense. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param bins: The binning method to use. If `int`, is the number of bins. If `str`, must be a method accepted by `numpy.histogram_bin_edges`. If `None`, the feature is assumed to be categorical and counts are taken for each value in either dataset. :type bins: int or str or None, default "auto" :param \*\*kwargs: Keyword arguments for `scipy.stats.entropy`. :type \*\*kwargs: dict, optional :returns: The computed divergence between the distributions. :rtype: float .. seealso:: :obj:`scipy.stats.entropy` .. rubric:: Notes This is a wrapper function for `scipy.stats.entropy`. Since this function expects a probability vector, the data is first discretised into evenly-spaced bins. We can think of the Kullback-Leibler divergence as a measure of the surprise we should expect when seeing an example from the real data, relative to the distribution of the synthetic. The divergence is zero if the distributions are identical, and larger values indicate that the two discretised distributions are further from one another. Note that the KL divergence may be infinite: the most common case where this happens is when the synthetic data does not contain any data in at least one bin.
An optimal 'bins' value has not been suggested. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 2, 3], "b": ["cat", "cow", "dog", "emu"]} ... ) >>> synth = pd.DataFrame( ... {"a": [3, 1, 2, 0], "b": ["cat", "cat", "cow", "dog"]} ... ) The first feature is replicated up to a re-ordering in the synthetic data, so its KL divergence is zero: >>> kullback_leibler(real, synth, "a") 0.0 However, the second feature does not include the ``"emu"`` category in the synthetic data, so the divergence is infinite: >>> kullback_leibler(real, synth, "b", bins=None) inf
.. py:function:: mann_whitney(real, synth, feature, **kwargs) Mann-Whitney U test. The Mann-Whitney test compares two sets of data by examining how well-mixed they are when pooled. This is achieved by ranking the pooled data. A low p-value suggests the data are not similar. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.mannwhitneyu`. :type \*\*kwargs: dict, optional :returns: * **statistic** (*float*) -- The Mann-Whitney U statistic, in particular U for `synth`. * **pvalue** (*float*) -- Two-sided p-value assuming an asymptotic normal distribution. .. seealso:: :obj:`scipy.stats.mannwhitneyu` .. rubric:: Notes This is a wrapper function for `scipy.stats.mannwhitneyu`. The null hypothesis for this test is that for randomly selected real and synthetic values, the probability that the real value is greater than the synthetic is the same as the probability that the synthetic value is greater than the real. We reject this hypothesis if the p-value is suitably small. This would in turn suggest that the synthetic and real data are not similarly distributed. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 3, 4, 4], "b": [1, 4, 5, 6, 7, 7]} ... ) >>> synth = pd.DataFrame( ... {"a": [0, 2, 2, 3, 3, 3], "b": [1, 1, 1, 2, 2, 3]} ... ) If we were to choose our p-value threshold as 0.05, we would reach the conclusion that the distributions of the first feature are similar but the distributions of the second feature are not. >>> mann_whitney(real, synth, "a") MannwhitneyuResult(statistic=22.5, pvalue=0.5041764308016705) >>> mann_whitney(real, synth, "b") MannwhitneyuResult(statistic=31.5, pvalue=0.033439907088311766)
.. py:function:: wasserstein(real, synth, feature, **kwargs) The (first) Wasserstein distance. Also known as the "Earth Mover's" distance, this metric can be thought of as calculating the amount of "work" required to move from the distribution of the synthetic data to the distribution of the real data. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.wasserstein_distance`. :type \*\*kwargs: dict, optional :returns: The computed distance between the distributions. :rtype: float .. seealso:: :obj:`scipy.stats.wasserstein_distance` .. rubric:: Notes This is a wrapper for `scipy.stats.wasserstein_distance`.
Computationally, we can find the Wasserstein distance by calculating the area between the cumulative distribution functions for the two distributions. If :math:`s` is the synthetic feature distribution, :math:`r` is the real feature distribution, and :math:`R` and :math:`S` are their respective cumulative distribution functions, then .. math:: W(s, r) = \int_{-\infty}^{+\infty} |S - R| The distance is zero if the distributions are identical and increases as they become less alike. This method is therefore good for comparing multiple synthetic datasets, or features within a dataset, to see which is closest to the real. However, as this is not a test, there is no threshold distance below which we can claim the distributions are statistically the same. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame( ... {"a": [0, 1, 3, 4, 3, 4], "b": [3, 7, 2, 1, 7, 4]} ... ) >>> synth = pd.DataFrame( ... {"a": [3, 3, 2, 0, 2, 3], "b": [1, 5, 1, 1, 6, 3]} ... ) The first feature appears to be more similar than the second across datasets: >>> wasserstein(real, synth, "a") 0.6666666666666667 >>> wasserstein(real, synth, "b") 1.166666666666667 .. py:function:: wilcoxon(real, synth, feature, **kwargs) Wilcoxon signed-rank test. In this use, the Wilcoxon test compares the distributions of paired data. It does this by ranking the pairwise differences between the real and synthetic data. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feature: Feature of the datasets to compare. This must be continuous. :type feature: str :param \*\*kwargs: Keyword arguments for `scipy.stats.wilcoxon`. :type \*\*kwargs: dict, optional :returns: * **statistic** (*float*) -- The sum of the ranks of the differences above or below zero, whichever is greater. * **pvalue** (*float*) -- Two-sided p-value. .. seealso:: :obj:`scipy.stats.wilcoxon` .. rubric:: Notes This is a wrapper function for `scipy.stats.wilcoxon`. The null hypothesis for this test is that the median of the paired differences is zero. The alternative hypothesis is that it is different from zero. This would suggest that the synthetic and real data are not similarly distributed. This test only makes sense when the synthetic and real data are paired. That is, each synthetic datum is matched to a real one. In which case, it is required that data are ordered to reflect this. .. rubric:: Examples >>> import pandas as pd >>> real = pd.DataFrame({"a": [0, 1, 2, 3, 4, 5, 8, 9]}) >>> synth = pd.DataFrame({"a": [2, 0, 3, 7, 7, 4, 2, 1]}) By eye, you might think these distributions are quite different from one another. However, the Wilcoxon test suggests that these two datasets were drawn from similar distributions. >>> wilcoxon(real, synth, "a") WilcoxonResult(statistic=17.0, pvalue=0.9453125)