synthgauge.metrics.cluster

Utility metrics derived from centroid-based clustering.

Module Contents

Functions

clustered_msd(real, synth[, feats, method, k_min, ...])

(Multiple) clustered mean-squared difference (MSD).

synthgauge.metrics.cluster.clustered_msd(real, synth, feats=None, method='kmeans', k_min=2, k_max=10, random_state=None)

(Multiple) clustered mean-squared difference (MSD).

This metric clusters the real and synthetic data together, measuring the synthetic utility according to its representation across the fitted clusters. Since there is often no obvious choice for the number of clusters, \(k\), we consider a range of values.

For each value of \(k\), the chosen clustering method is fit and the proportion of synthetic data in each cluster is recorded. The clustered MSD is then calculated as the mean-squared difference between these proportions and the overall proportion of synthetic data.
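Written out, under one reading of the description above, the MSD for a given \(k\) is as follows. The symbols \(C_j\), \(S\), \(p\), \(n_{\text{real}}\) and \(n_{\text{synth}}\) are notation introduced here for illustration, and the unweighted mean over clusters is an assumption rather than something this page states.

\[
\mathrm{MSD}_k = \frac{1}{k} \sum_{j=1}^{k} \left( \frac{|C_j \cap S|}{|C_j|} - p \right)^2,
\qquad
p = \frac{n_{\text{synth}}}{n_{\text{real}} + n_{\text{synth}}}
\]

Here \(C_j\) is the set of rows assigned to cluster \(j\), \(S\) is the set of synthetic rows, and \(p\) is the overall proportion of synthetic data in the combined dataset.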

This collection of MSDs is summarised by taking its minimum to give the metric value.
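The calculation described above can be sketched with NumPy alone, taking the cluster labels as given. This is an illustrative sketch, not the library's implementation: the helper names `cluster_msd` and `min_cluster_msd` are hypothetical, and the unweighted mean over clusters is an assumption.

```python
import numpy as np


def cluster_msd(labels, is_synth):
    """MSD between per-cluster synthetic proportions and the overall one.

    Hypothetical helper: `labels` holds one cluster label per row of the
    combined data, `is_synth` flags which rows are synthetic.
    """
    labels = np.asarray(labels)
    is_synth = np.asarray(is_synth, dtype=bool)

    # Overall proportion of synthetic rows in the combined data.
    overall = is_synth.mean()

    # Proportion of each cluster that is synthetic.
    proportions = np.array(
        [is_synth[labels == cluster].mean() for cluster in np.unique(labels)]
    )

    # Assumption: every cluster counts equally in the mean.
    return float(np.mean((proportions - overall) ** 2))


def min_cluster_msd(labelings, is_synth):
    """Summarise one MSD per clustering by taking the minimum."""
    return min(cluster_msd(labels, is_synth) for labels in labelings)


# Toy data: four real rows followed by four synthetic rows.
is_synth = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=bool)
balanced = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # each cluster is half synthetic
separated = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # clusters split real from synthetic

print(cluster_msd(balanced, is_synth))
print(cluster_msd(separated, is_synth))
print(min_cluster_msd([balanced, separated], is_synth))
```

In this toy case the balanced labelling scores 0 (every cluster mirrors the overall mix), while the fully separated labelling scores 0.25, so lower values indicate that synthetic rows are spread across clusters in the same way as real rows.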

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – Features to use in the clustering. If None (default), all common features are used.

  • method ({"kmeans", "kprototypes"}, default "kmeans") – Clustering method to use. Only k-means and k-prototypes are implemented. If using k-means (default), only numeric columns are considered, while k-prototypes allows for mixed-type clustering.

  • k_min (int, default 2) – Minimum number of clusters to use.

  • k_max (int, default 10) – Maximum number of clusters to use.

  • random_state (int, optional) – The random seed used to fit the clustering algorithm.

Returns

The minimum observed clustered MSD.

Return type

float

Notes

This function can be used with a single value of k by setting k_min and k_max both to k, for instance when a sensible number of clusters is known a priori.
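To illustrate how k_min, k_max and random_state interact, here is a rough end-to-end sketch using scikit-learn's KMeans. It is not the synthgauge source: the function name `sketch_clustered_msd` is hypothetical, the handling of common numeric columns is an assumption based on the parameter descriptions, and only the k-means case is covered.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def sketch_clustered_msd(real, synth, k_min=2, k_max=10, random_state=None):
    """Hypothetical re-creation of the metric for the k-means case only."""
    # Assumption: use the numeric columns shared by both dataframes.
    common = [col for col in real.columns if col in synth.columns]
    combined = pd.concat([real[common], synth[common]], ignore_index=True)
    combined = combined.select_dtypes(include="number")

    # Real rows come first, synthetic rows second.
    is_synth = np.r_[np.zeros(len(real), dtype=bool), np.ones(len(synth), dtype=bool)]
    overall = is_synth.mean()

    msds = []
    for k in range(k_min, k_max + 1):  # k_max is inclusive
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(combined.to_numpy())
        proportions = np.array(
            [is_synth[labels == cluster].mean() for cluster in np.unique(labels)]
        )
        msds.append(float(np.mean((proportions - overall) ** 2)))

    return min(msds)


rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(size=(50, 2)), columns=["a", "b"])
synth = pd.DataFrame(rng.normal(size=(50, 2)), columns=["a", "b"])

# Setting k_min == k_max evaluates a single clustering with k = 3.
single_k = sketch_clustered_msd(real, synth, k_min=3, k_max=3, random_state=42)
print(single_k)
```

With k_min equal to k_max the loop runs once, so the minimum is simply the MSD for that one clustering.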

This metric says nothing about how appropriate the clustering method may be for the data at hand, nor how the data are distributed among the clusters. Both methods considered here make rather strong assumptions about the relative size and characteristics of the clusters in the data. As such, exploratory analysis is advised to determine whether such centroid-based clustering is well-suited.