:py:mod:`synthgauge.metrics.privacy` ==================================== .. py:module:: synthgauge.metrics.privacy .. autoapi-nested-parse:: Privacy metrics. Module Contents --------------- Functions ~~~~~~~~~ .. autoapisummary:: synthgauge.metrics.privacy.tcap_score synthgauge.metrics.privacy.min_nearest_neighbour synthgauge.metrics.privacy.sample_overlap_score .. py:function:: tcap_score(real, synth, key, target) Target Correct Attribution Probability (TCAP) score. This privacy metric calculates the average chance that the key-target pairings in a synthetic dataset reveal the true key-target pairings in associated real dataset. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param key: List of features in `synth` to use as the key. :type key: list of str :param target: Feature to use as the target. :type target: str :returns: The average TCAP score across the dataset. :rtype: float .. rubric:: Notes This metric provides an estimate of how well an intruder could infer attributes of groups in the real dataset by studying the synthetic. The choices for `key` and `target` will vary depending on the dataset in question but we would suggest the `key` features are those that could be readily available to an outsider and the `target` feature is one we wouldn't want them finding out, such as a protected characteristic. This method only works with categorical data, so binning of continuous data may be required. Full details may be found in: Taub and Elliott (2019). The Synthetic Data Challenge. The Hague, The Netherlands: Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Session 3. .. py:function:: min_nearest_neighbour(real, synth, feats=None, outliers_only=True, threshold=2, n_neighbours=5) Minimum nearest-neighbour distance. This privacy metric returns the smallest distance between any point in the real dataset and any point in the synthetic dataset. There is an option to only consider the outliers in the real dataset as these perhaps pose more of a privacy concern. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: Features in `real` and `synth` to use when calculating distance. If `None` (default), all common features are used. :type feats: list of str or None, default None :param outliers_only: Boolean indicating whether to filter out the real data inliers (default) or not. :type outliers_only: bool, default True :param threshold: Outlier decision threshold. Increase to include fewer points from `real` in nearest-neighbour calculations. :type threshold: number, default 2 :param n_neighbours: Number of neighbours to consider when identifying local outliers. :type n_neighbours: int, default 5 :returns: Minimum Manhattan distance between `real` and `synth` data. :rtype: float .. rubric:: Notes This privacy metric provides an insight into whether the synthetic dataset is too similar to the real dataset. It does this by calculating the minimum distance between the real records and the synthetic records. This metric assumes that categorical data is ordinal during distance calculations, or that it has already been suitably one-hot-encoded. .. py:function:: sample_overlap_score(real, synth, feats=None, sample_size=0.2, runs=5, seed=None, score_type='unique') Return percentage of overlap between real and synth data based on random sampling. Samples from both the real and synthetic datasets are compared for similarity. This similarity, or overlap score, is based on the exact matches of real data records within the synthetic data. :param real: Dataframe containing the real data. :type real: pandas.DataFrame :param synth: Dataframe containing the synthetic data. :type synth: pandas.DataFrame :param feats: The features used to match records. If `None` (default), all common features are used. :type feats: list of str or None, default None :param sample_size: The ratio (if `sample_size` between 0 and 1) or count (`sample_size` > 1) of records to sample. Default is 0.2 (20%). :type sample_size: float or int, default 0.2 :param runs: The number of sampling runs to use when computing the score. :type runs: int, default 5 :param seed: Random number seed used for sampling. :type seed: int, optional :param score_type: Method used for calculating the overlap score. If "unique" (default), the score is the percentage of unique records in the real sample that have a match within the synthetic data. If "sample", the score is the percentage of all records within the real sample that have a match within the synth sample. :type score_type: {"unique", "sample"}, default "unique" :returns: **overlap_score** -- Estimated overlap score between `real` and `synth`. :rtype: float