:py:mod:`synthgauge.metrics.correlation`
========================================

.. py:module:: synthgauge.metrics.correlation

.. autoapi-nested-parse::

   Correlation-based utility metrics.


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   synthgauge.metrics.correlation.correlation_msd
   synthgauge.metrics.correlation.correlation_ratio_msd


.. py:function:: correlation_msd(real, synth, method='pearson', feats=None)

   Mean-squared difference in correlation coefficients.

   This metric calculates the mean squared difference between the
   correlation matrices for the real and synthetic datasets. This gives
   an indication of how well the synthetic data has retained bivariate
   relationships.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param method: Correlation coefficient to use.
   :type method: {"pearson", "spearman", "cramers_v"}, default "pearson"
   :param feats: Features to measure correlation across. If `method="cramers_v"`,
                 all numeric columns will be filtered out. Likewise, for the
                 other correlation methods, all non-numeric columns are removed.
                 If `None` (default), all common features that satisfy the needs
                 of `method` are used.
   :type feats: list of str or None, default None

   :returns: Mean-squared difference of correlation coefficients.
   :rtype: float

   .. seealso:: :obj:`numpy.corrcoef`

   .. rubric:: Notes

   The smaller the returned value, the better the synthetic data
   captures the correlation between variables in the real data. This
   makes the metric useful for comparing several synthetic datasets to
   see which is closest to the real data. However, as this is not a
   statistical test, there is no threshold value below which we can
   claim the datasets are statistically the same.

   We only consider the coefficients above the main diagonal when
   calculating the MSD. If we included the entire matrix, we would
   double-count each pair of features as well as including the trivial
   ones along the main diagonal.
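
   The upper-triangle calculation described above can be illustrated
   with a minimal sketch. This is not the library's implementation:
   ``upper_triangle_msd`` is a hypothetical helper, it relies on
   :obj:`pandas.DataFrame.corr` for the coefficients, and it covers
   only the numeric methods (``"pearson"`` and ``"spearman"``), not
   ``"cramers_v"``.

   .. code-block:: python

      import numpy as np
      import pandas as pd


      def upper_triangle_msd(real, synth, method="pearson"):
          """Hypothetical helper: MSD over coefficients above the diagonal."""
          # Keep only the numeric columns present in both dataframes.
          common = real.select_dtypes("number").columns.intersection(
              synth.select_dtypes("number").columns
          )
          real_corr = real[common].corr(method=method).to_numpy()
          synth_corr = synth[common].corr(method=method).to_numpy()

          # k=1 skips the main diagonal, so each feature pair is counted
          # once and the trivial self-correlations are excluded.
          rows, cols = np.triu_indices(len(common), k=1)
          diff = real_corr[rows, cols] - synth_corr[rows, cols]
          return float(np.mean(diff**2))


      real = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]})
      synth = pd.DataFrame({"a": [1, 2, 3, 4], "b": [8, 6, 4, 2]})

      same = upper_triangle_msd(real, real)
      flipped = upper_triangle_msd(real, synth)

   In this toy example the single feature pair has correlation 1 in
   ``real`` and -1 in ``synth``, so ``flipped`` is the square of their
   difference, while ``same`` is zero.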


.. py:function:: correlation_ratio_msd(real, synth, categorical=None, numeric=None)

   Correlation ratio mean-squared difference.

   This metric calculates the mean-squared difference in association
   between categorical and continuous feature pairings in the real and
   synthetic datasets.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param categorical: Categorical features in `real` and `synth` to include in
                       comparison. If `None` (default), uses all common object-type
                       columns.
   :type categorical: list of str or None, default None
   :param numeric: Numerical features in `real` and `synth` to include in
                   comparison. If `None` (default), uses all common columns not
                   selected by `categorical`.
   :type numeric: list of str or None, default None

   :returns: Mean squared difference between `real` and `synth` in
             correlation ratio scores across all categorical-continuous
             feature pairs.
   :rtype: float
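
   As a rough illustration of the idea, the sketch below computes a
   correlation ratio for each categorical-continuous pair and then
   takes the mean-squared difference between the two datasets. It
   assumes the textbook definition of the correlation ratio (the
   square root of the between-group sum of squares over the total sum
   of squares), which may differ in detail from what this function
   does. Both ``correlation_ratio`` and ``ratio_msd`` are hypothetical
   helper names, not part of the library.

   .. code-block:: python

      import numpy as np
      import pandas as pd


      def correlation_ratio(categories, values):
          """Hypothetical helper: textbook correlation ratio (eta)."""
          categories = np.asarray(categories)
          values = np.asarray(values, dtype=float)
          grand_mean = values.mean()

          # Between-group sum of squares, weighted by group size.
          between = 0.0
          for category in np.unique(categories):
              group = values[categories == category]
              between += group.size * (group.mean() - grand_mean) ** 2

          total = ((values - grand_mean) ** 2).sum()
          # Assumption: a constant numeric column has no association.
          return float(np.sqrt(between / total)) if total > 0 else 0.0


      def ratio_msd(real, synth, categorical, numeric):
          """Hypothetical helper: MSD of correlation ratios over all pairs."""
          diffs = [
              correlation_ratio(real[cat], real[num])
              - correlation_ratio(synth[cat], synth[num])
              for cat in categorical
              for num in numeric
          ]
          return float(np.mean(np.square(diffs)))


      real = pd.DataFrame({"g": ["a", "a", "b", "b"], "x": [1, 1, 3, 3]})
      synth = pd.DataFrame({"g": ["a", "a", "b", "b"], "x": [1, 3, 1, 3]})

      eta_real = correlation_ratio(real["g"], real["x"])
      eta_synth = correlation_ratio(synth["g"], synth["x"])
      msd = ratio_msd(real, synth, categorical=["g"], numeric=["x"])

   Here ``g`` fully determines ``x`` in ``real`` but carries no
   information about ``x`` in ``synth``, so the two ratios sit at
   opposite ends of the scale and the resulting difference is as
   large as it can be for a single pair.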