:py:mod:`synthgauge.metrics.privacy`
====================================

.. py:module:: synthgauge.metrics.privacy

.. autoapi-nested-parse::

   Privacy metrics.


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   synthgauge.metrics.privacy.tcap_score
   synthgauge.metrics.privacy.min_nearest_neighbour
   synthgauge.metrics.privacy.sample_overlap_score


.. py:function:: tcap_score(real, synth, key, target)

   Target Correct Attribution Probability (TCAP) score.

   This privacy metric calculates the average chance that the
   key-target pairings in a synthetic dataset reveal the true
   key-target pairings in the associated real dataset.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param key: List of features in `synth` to use as the key.
   :type key: list of str
   :param target: Feature to use as the target.
   :type target: str

   :returns: The average TCAP score across the dataset.
   :rtype: float

   .. rubric:: Notes

   This metric provides an estimate of how well an intruder could infer
   attributes of groups in the real dataset by studying the synthetic
   dataset. The choices for `key` and `target` will vary depending on
   the dataset in question, but we suggest that the `key` features are
   those that could be readily available to an outsider and that the
   `target` feature is one we would not want them to find out, such as
   a protected characteristic.

   This method only works with categorical data, so binning of
   continuous data may be required.

   Full details may be found in:

   Taub and Elliott (2019). The Synthetic Data Challenge. The Hague,
   The Netherlands: Joint UNECE/Eurostat Work Session on Statistical
   Data Confidentiality, Session 3.
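
As a rough illustration of the idea in the notes above, the sketch below works on plain Python dictionaries rather than dataframes so that it runs without dependencies. The name `tcap_sketch` and its scoring rules are assumptions made for this example: only key combinations whose synthetic records all agree on one target value are used, and real records without such a key score 0. It is not a description of how `tcap_score` is implemented.

```python
from collections import defaultdict


def tcap_sketch(real, synth, key, target):
    """Simplified attribution-probability sketch (assumed rules, see text)."""
    # Collect the target values seen for each key combination in synth.
    targets_by_key = defaultdict(set)
    for row in synth:
        targets_by_key[tuple(row[k] for k in key)].add(row[target])

    # Keep only the keys that point to exactly one target value.
    unambiguous = {
        k: next(iter(values))
        for k, values in targets_by_key.items()
        if len(values) == 1
    }

    # A real record scores 1 when its key has an unambiguous synthetic
    # target and that target equals the record's true target, else 0.
    scores = []
    for row in real:
        k = tuple(row[name] for name in key)
        hit = k in unambiguous and unambiguous[k] == row[target]
        scores.append(1.0 if hit else 0.0)
    return sum(scores) / len(scores)


# Hypothetical categorical records; the column names are illustrative only.
real = [
    {"age": "30-39", "region": "N", "income": "high"},
    {"age": "30-39", "region": "N", "income": "high"},
    {"age": "30-39", "region": "N", "income": "low"},
    {"age": "40-49", "region": "S", "income": "low"},
]
synth = [
    {"age": "30-39", "region": "N", "income": "high"},
    {"age": "30-39", "region": "N", "income": "high"},
    {"age": "40-49", "region": "S", "income": "low"},
    {"age": "40-49", "region": "S", "income": "high"},
]

score = tcap_sketch(real, synth, key=["age", "region"], target="income")
print(score)  # 0.5
```

Here the synthetic key `("30-39", "N")` always maps to `"high"`, which is correct for two of the three matching real records, while the key `("40-49", "S")` is ambiguous in the synthetic data and so contributes nothing.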


.. py:function:: min_nearest_neighbour(real, synth, feats=None, outliers_only=True, threshold=2, n_neighbours=5)

   Minimum nearest-neighbour distance.

   This privacy metric returns the smallest distance between any point
   in the real dataset and any point in the synthetic dataset.
   Optionally, only the outliers in the real dataset are considered,
   since these may pose more of a privacy concern.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param feats: Features in `real` and `synth` to use when calculating
                 distance. If `None` (default), all common features are used.
   :type feats: list of str or None, default None
   :param outliers_only: Whether to filter out the inliers in the real data
                         (default) or not.
   :type outliers_only: bool, default True
   :param threshold: Outlier decision threshold. Increase to include fewer points
                     from `real` in nearest-neighbour calculations.
   :type threshold: number, default 2
   :param n_neighbours: Number of neighbours to consider when identifying local
                        outliers.
   :type n_neighbours: int, default 5

   :returns: Minimum Manhattan distance between `real` and `synth` data.
   :rtype: float

   .. rubric:: Notes

   This privacy metric provides an insight into whether the synthetic
   dataset is too similar to the real dataset. It does this by
   calculating the minimum distance between the real records and the
   synthetic records.

   This metric assumes that categorical data is ordinal during distance
   calculations, or that it has already been suitably one-hot-encoded.
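
The distance step described above can be sketched in a few lines. This example uses tuples of numbers instead of dataframes and skips outlier filtering entirely, so it corresponds only to the case where every real record is considered. The name `min_manhattan` is hypothetical and the code is an illustration, not the library's implementation.

```python
def min_manhattan(real, synth):
    """Smallest Manhattan (L1) distance between any real and synth row."""
    return min(
        sum(abs(a - b) for a, b in zip(r, s))
        for r in real
        for s in synth
    )


# Hypothetical numeric records with two features each.
real_rows = [(1.0, 2.0), (4.0, 6.0)]
synth_rows = [(2.0, 2.5), (10.0, 10.0)]

closest = min_manhattan(real_rows, synth_rows)
print(closest)  # 1.5
```

A value of 0 would mean that at least one synthetic record coincides exactly with a real record on the chosen features.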


.. py:function:: sample_overlap_score(real, synth, feats=None, sample_size=0.2, runs=5, seed=None, score_type='unique')

   Return the percentage of overlap between real and synthetic data
   based on random sampling.

   Samples from both the real and synthetic datasets are compared for
   similarity. This similarity, or overlap score, is based on the
   exact matches of real data records within the synthetic data.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param feats: The features used to match records. If `None` (default), all
                 common features are used.
   :type feats: list of str or None, default None
   :param sample_size: The ratio (if `sample_size` is between 0 and 1) or count
                       (if `sample_size` > 1) of records to sample. Default is 0.2 (20%).
   :type sample_size: float or int, default 0.2
   :param runs: The number of sampling runs to use when computing the score.
   :type runs: int, default 5
   :param seed: Random number seed used for sampling.
   :type seed: int, optional
   :param score_type: Method used for calculating the overlap score. If "unique"
                      (default), the score is the percentage of unique records in the
                      real sample that have a match within the synthetic data. If
                      "sample", the score is the percentage of all records within the
                      real sample that have a match within the synth sample.
   :type score_type: {"unique", "sample"}, default "unique"

   :returns: **overlap_score** -- Estimated overlap score between `real` and `synth`.
   :rtype: float
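
To make the two `score_type` options concrete, here is a minimal sketch using tuples as records and the standard `random` module for sampling. The names `overlap` and `sample_overlap_sketch` are hypothetical. Several choices are assumptions for this example and are not confirmed by the documentation above: the score is returned as a fraction between 0 and 1, both score types are matched against the synthetic sample, and the run scores are averaged.

```python
import random


def overlap(real_sample, synth_sample, score_type="unique"):
    """Fraction of real records with an exact match in the synth sample."""
    synth_set = set(synth_sample)
    if score_type == "unique":
        candidates = set(real_sample)   # each distinct record counts once
    elif score_type == "sample":
        candidates = list(real_sample)  # duplicates count individually
    else:
        raise ValueError("score_type must be 'unique' or 'sample'")
    matched = sum(1 for record in candidates if record in synth_set)
    return matched / len(candidates)


def sample_overlap_sketch(real, synth, sample_size=0.2, runs=5, seed=None,
                          score_type="unique"):
    """Average the overlap over several random samples (assumed behaviour)."""
    rng = random.Random(seed)

    def n_records(data):
        # Assumption: values below 1 are ratios, anything else is a count.
        n = sample_size * len(data) if sample_size < 1 else sample_size
        return max(1, min(len(data), int(n)))

    scores = []
    for _ in range(runs):
        real_sample = rng.sample(real, n_records(real))
        synth_sample = rng.sample(synth, n_records(synth))
        scores.append(overlap(real_sample, synth_sample, score_type))
    return sum(scores) / runs


# Hypothetical records; the first real record appears twice.
real_sample = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]
synth_sample = [("a", 1), ("d", 4)]

unique_score = overlap(real_sample, synth_sample, "unique")
sample_score = overlap(real_sample, synth_sample, "sample")
```

With these records, one of the three distinct real records has a match, whereas two of the four sampled real records do, so the two score types differ whenever the real sample contains duplicates.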