:py:mod:`synthgauge.metrics.cluster`
====================================

.. py:module:: synthgauge.metrics.cluster

.. autoapi-nested-parse::

   Utility metrics derived from centroid-based clustering.


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   synthgauge.metrics.cluster.clustered_msd


.. py:function:: clustered_msd(real, synth, feats=None, method='kmeans', k_min=2, k_max=10, random_state=None)

   (Multiple) clustered mean-squared difference (MSD).

   This metric clusters the real and synthetic data together, measuring
   the synthetic utility according to its representation across the
   fitted clusters. Since there is often no obvious choice for the
   number of clusters, :math:`k`, we consider a range of values.

   For each value of :math:`k`, the chosen clustering method is fit
   and the proportion of synthetic data in each cluster is recorded.
   The clustered MSD is then calculated as the mean-squared difference
   between these proportions and the overall proportion of synthetic
   data.

   This collection of MSDs is summarised by taking its minimum to give
   the metric value.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param feats: Features to use in the clustering. If `None` (default), all
                 common features are used.
   :type feats: list of str or None, default None
   :param method: Clustering method to use. Only k-means and k-prototypes
                  are implemented. If using k-means (default), only numeric
                  columns are considered, while k-prototypes allows for mixed-type
                  clustering.
   :type method: {"kmeans", "kprototypes"}, default "kmeans"
   :param k_min: Minimum number of clusters to use. Default of 2.
   :type k_min: int, default 2
   :param k_max: Maximum number of clusters to use. Default of 10.
   :type k_max: int, default 10
   :param random_state: The random seed used to fit the clustering algorithm.
   :type random_state: int, optional

   :returns: The minimum observed clustered MSD.
   :rtype: float

   .. rubric:: Notes

   This function can be used with a single value of `k` by setting
   `k_min` and `k_max` both to `k`. For instance, if a sensible number
   of clusters is known a priori.

   This metric says nothing about how appropriate the clustering method
   may be for the data at hand, nor how the data are distributed among
   the clusters. Both methods considered here have rather strong
   assumptions about the relative size and characteristics of the
   clusters in the data. As such, exploratory analysis is advised to
   determine whether such centroid-based clustering is well-suited.