:py:mod:`synthgauge.plot`
=========================

.. py:module:: synthgauge.plot

.. autoapi-nested-parse::

   Functions for visually evaluating synthetic data.


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   synthgauge.plot.plot_histograms
   synthgauge.plot.plot_joint
   synthgauge.plot.plot_histogram3d
   synthgauge.plot.plot_correlation
   synthgauge.plot.plot_crosstab
   synthgauge.plot.plot_qq
   synthgauge.plot.plot_feat_density_diff


.. py:function:: plot_histograms(df, feats=None, groupby=None, figcols=2, figsize=None)

   Plot feature distributions.

   Plot a histogram (or countplot for categorical data) for each
   feature. Where multiple features are provided a grid will be
   generated to store all the plots.

   Optionally, a groupby feature can be specified to apply a grouping
   prior to calculating the distribution.

   :param df: Dataframe containing the feature(s) to be plotted.
   :type df: pandas.DataFrame
   :param feats: Features in to plot. If `None` (default), all features are used.
   :type feats: list of str or None, default None
   :param groupby: Feature on which to group data.
   :type groupby: str, optional
   :param figcols: Number of columns to use in the figure.
   :type figcols: int, default 2
   :param figsize: Size of figure in inches `(width, height)`. Defaults to
                   `matplotlib.pyplot.rcParams["figure.figsize"]`.
   :type figsize: tuple of float, optional

   :rtype: matplotlib.figure.Figure


.. py:function:: plot_joint(df, x, y, groupby=None, x_bins='auto', y_bins='auto', figsize=6)

   Plot bivariate and univariate graphs.

   Convenience function that leverages `seaborn`. For more granular
   control, refer to `seaborn.JointGrid` and `seaborn.jointplot`.

   :param df: DataFrame containing the feature(s) to plot.
   :type df: pandas.DataFrame
   :param x: Feature to plot on the x-axis and -margin.
   :type x: str
   :param y: Feature to plot on the y-axis and -margin.
   :type y: str
   :param groupby: Feature on which to group data.
   :type groupby: str, optional
   :param x_bins: Binning method for axis. If `array_like`, must be sequence of
                  bin edges. If `int`, specifies the number of bins to use. If
                  `str`, can be anything accepted by `numpy.histogram_bin_edges`.
                  Defaults to `"auto"`.
   :type x_bins: array_like or int or str, default "auto"
   :param y_bins: Binning method for axis. If `array_like`, must be sequence of
                  bin edges. If `int`, specifies the number of bins to use. If
                  `str`, can be anything accepted by `numpy.histogram_bin_edges`.
                  Defaults to `"auto"`.
   :type y_bins: array_like or int or str, default "auto"
   :param figsize: Size of each side of the figure in inches (it will be square).
                   Defaults to six inches.
   :type figsize: int, default 6

   :rtype: seaborn.axisgrid.JointGrid


.. py:function:: plot_histogram3d(df, x, y, x_bins='auto', y_bins='auto', figsize=None)

   Plot 3D histogram of two features.

   This is similar to a 2D histogram plot with an extra axis added
   to display the count for each feature-wise pair as 3D bars.

   :param df: DataFrame containing the feature(s) to plot.
   :type df: pandas.DataFrame
   :param x: Feature to plot on the x-axis and -margin.
   :type x: str
   :param y: Feature to plot on the y-axis and -margin.
   :type y: str
   :param x_bins: Binning method for axis. If the corresponding feature is
                  categorical, the bins will be set to the cardinality of that
                  feature. If `array_like`, must be sequence of bin edges. If
                  `int`, specifies the number of bins to use. If `str`, can be
                  anything accepted by `numpy.histogram_bin_edges`. Defaults to
                  `"auto"`.
   :type x_bins: array_like or int or str, default "auto"
   :param y_bins: Binning method for axis. If the corresponding feature is
                  categorical, the bins will be set to the cardinality of that
                  feature. If `array_like`, must be sequence of bin edges. If
                  `int`, specifies the number of bins to use. If `str`, can be
                  anything accepted by `numpy.histogram_bin_edges`. Defaults to
                  `"auto"`.
   :type y_bins: array_like or int or str, default "auto"
   :param figsize: Size of figure in inches `(width, height)`. Defaults to
                   `matplotlib.pyplot.rcParams["figure.figsize"]`.
   :type figsize: tuple of float, optional

   :rtype: matplotlib.figure.Figure


.. py:function:: plot_correlation(*dfs, feats=None, method='pearson', plot_diff=False, figcols=2, figsize=None, **kwargs)

   Plot correlation between features in a dataframe.

   For each dataframe provided a subplot is generated showing a
   correlation heatmap of the features. For numeric features, the
   method can be anything supported by `pandas.DataFrame.corr`; for
   categorical or object feature types `"cramers_v"` must be specified.
   If the method does not match the data type, an error is raised.

   The `plot_diff` parameter will also include a difference plot, i.e.
   correlation difference between two dataframes. This is only used
   when two dataframes are provided.

   :param \*dfs: Any number of dataframes to plot.
   :type \*dfs: pandas.DataFrame
   :param feats: Features to plot. Must be present in all elements of `dfs`.
                 If `None` (default), uses features common to all dataframes.
   :type feats: list of str or None, default None
   :param method: Correlation method. See `pandas.DataFrame.corr` for more details
                  on `"pearson"` and `"spearman"`. When `"cramers_v"` is
                  specified, correlation is calculated using
                  `synthgauge.metrics.correlation.cramers_v`.
   :type method: {"pearson", "spearman", "cramers_v"}, default "pearson"
   :param plot_diff: If `True` and exactly two dataframes are provided, will also
                     plot a heatmap of the absolute differences between the
                     respective datasets' correlations.
   :type plot_diff: bool, default False
   :param figcols: Number of columns to use in the figure. Only used when `feats`
                   contains more than one feature.
   :type figcols: int, default 2
   :param figsize: Size of figure in inches `(width, height)`. Defaults to
                   `matplotlib.pyplot.rcParams["figure.figsize"]`.
   :type figsize: tuple of float, optional
   :param \*\*kwargs: Any other keyword arguments to be passed to `seaborn.heatmap`.
                      For example `annot=True` will turn on cell annotations. See
                      documentation for more examples.
   :type \*\*kwargs: dict, optional

   :raises ValueError: If `method` does not match the data type(s) of `feats`.

   :rtype: matplotlib.figure.Figure


.. py:function:: plot_crosstab(real, synth, x, y, x_bins='auto', y_bins='auto', figsize=None, cmap='rocket', **kwargs)

   Plot cross-tabulation heatmap for two features.

   The two-feature crosstab calculation is performed and plotted as a
   heatmap. One heatmap is shown for the real data and one for the
   synthetic. Numeric features are discretised using the `*_bins`
   arguments.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param x: Feature to plot on the x-axis and -margin.
   :type x: str
   :param y: Feature to plot on the y-axis and -margin.
   :type y: str
   :param x_bins: Binning method for axis. If `array_like`, must be sequence of
                  bin edges. If `int`, specifies the number of bins to use. If
                  `str`, can be anything accepted by `numpy.histogram_bin_edges`.
                  Defaults to `"auto"`. Only used for numeric features.
   :type x_bins: array_like or int or str, default "auto"
   :param y_bins: Binning method for axis. If `array_like`, must be sequence of
                  bin edges. If `int`, specifies the number of bins to use. If
                  `str`, can be anything accepted by `numpy.histogram_bin_edges`.
                  Defaults to `"auto"`. Only used for numeric features.
   :type y_bins: array_like or int or str, default "auto"
   :param figsize: Size of figure in inches `(width, height)`. Defaults to
                   `matplotlib.pyplot.rcParams["figure.figsize"]`.
   :type figsize: tuple of float, optional
   :param cmap: Palette name for heatmap and colour bar. See the documentation
                for `seaborn.color_palette` on available palette formats.
                Defaults to `"rocket"`.
   :type cmap: str, default "rocket"
   :param \*\*kwargs: Any other keyword arguments to be passed to `seaborn.heatmap`.
                      For example, `annot=True` will turn on cell annotations. See
                      documentation for more examples.
   :type \*\*kwargs: dict, optional

   :rtype: matplotlib.figure.Figure


.. py:function:: plot_qq(real, synth, feature, n_quantiles=None, figsize=None)

   Generate a Q-Q plot for a feature of real and synthetic data.

   Quantile-quantile (Q-Q) plots are used to visualise two sets of
   numeric data to see if they are generated from the same
   distribution.

   In this case, it is used to provide some insight into the
   feature distributions for the synthetic and real data. If the
   scatter plot shows a straight line, then it can be inferred that the
   two distributions are similar and therefore the synthetically
   generated data follows the same distribution as the real data.

   See `Q-Q Plot <https://en.wikipedia.org/wiki/Q-Q_plot>`_ for more
   information.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param feature: Feature to plot. Must be in `real` and `synth`.
   :type feature: str
   :param n_quantiles: Number of quantiles to calculate. If `None` (default), uses the
                       length of `real`.
   :type n_quantiles: int or None, default None
   :param figsize: Size of figure in inches `(width, height)`. Defaults to
                   `matplotlib.pyplot.rcParams["figure.figsize"]`.
   :type figsize: tuple of float, optional

   :raises TypeError: If `feature` is not a numeric data type feature.

   :rtype: matplotlib.figure.Figure


.. py:function:: plot_feat_density_diff(real, synth, feats=None, feat_bins=10, diff_bins=10, figsize=None)

   Plot real and synth feature density differences.

   For each feature, the density difference between `real` and `synth`
   is calculated using `metrics.density._feature_density_diff`.

   If a single feature is provided in `feats`, the plot shows the raw
   density differences for each bin in that feature.

   Where multiple features are provided, the density differences are
   pd.concatenated into a flattened array and a histogram plotted. The
   histogram represents the distribution of differences in densities
   across all features and bins.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param feats: Features used to compute the densities. If `None` (default), all
                 common features are used.
   :type feats: list of str or None, default None
   :param feat_bins: Bins to use for computing the feature densities. This value is
                     passed to `numpy.histogram_bin_edges` so can be any value
                     accepted by that function. By default, uses 10 bins.
   :type feat_bins: str or int, default 10
   :param diff_bins: Bins to use when computing the multiple-feature difference
                     histogram. This value is passed to `numpy.histogram_bin_edges`
                     so can be any value accepted by that function. By default, uses
                     10 bins.
   :type diff_bins: str or int, default 10

   :rtype: matplotlib.figure.Figure