synthgauge.metrics.correlation

Correlation-based utility metrics.

Module Contents

Functions

correlation_msd(real, synth[, method, feats])

Mean-squared difference in correlation coefficients.

correlation_ratio_msd(real, synth[, categorical, numeric])

Correlation ratio mean-squared difference.

synthgauge.metrics.correlation.correlation_msd(real, synth, method='pearson', feats=None)[source]

Mean-squared difference in correlation coefficients.

This metric calculates the mean squared difference between the correlation matrices for the real and synthetic datasets. This gives an indication of how well the synthetic data has retained bivariate relationships.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • method ({"pearson", "spearman", "cramers_v"}, default "pearson") –

  • feats (list of str or None, default None) – Features to measure correlation across. If method=”cramers_v”, all numeric columns will be filtered out. Likewise, for the other correlation methods, all non-numeric columns are removed. If None (default), all common features that satisfy the needs of method are used.

Returns

Mean-squared difference of correlation coefficients.

Return type

float

See also

numpy.corrcoef

Notes

The smaller the number this function returns, the better the synthetic data captures the correlation between variables in the real data. This method is therefore good for comparing multiple synthetic datasets to see which is closest to the real. However, as this is not a test, there is no threshold value below which we can claim the datasets are statistically the same.

We only consider the coefficients above the main diagonal when calculating the MSD. If we included the entire matrix, we would double-count each pair of features as well as including the trivial ones along the main diagonal.

synthgauge.metrics.correlation.correlation_ratio_msd(real, synth, categorical=None, numeric=None)[source]

Correlation ratio mean-squared difference.

This metric calculates the mean-squared difference in association between categorical and continuous feature pairings in the real and synthetic datasets.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • categorical (list of str or None, default None) – Categorical features in real and synth to include in comparison. If None (default), uses all common object-type columns.

  • numeric (list of str or None, default None) – Numerical features in real and synth to include in comparison. If None (default), uses all common columns not selected by categorical.

Returns

Mean squared difference between real and synth in correlation ratio scores across all categorical-continuous feature pairs.

Return type

float