synthgauge.plot

Functions for visually evaluating synthetic data.

Module Contents

Functions

plot_histograms(df[, feats, groupby, figcols, figsize])

Plot feature distributions.

plot_joint(df, x, y[, groupby, x_bins, y_bins, figsize])

Plot bivariate and univariate graphs.

plot_histogram3d(df, x, y[, x_bins, y_bins, figsize])

Plot 3D histogram of two features.

plot_correlation(*dfs[, feats, method, plot_diff, ...])

Plot correlation between features in a dataframe.

plot_crosstab(real, synth, x, y[, x_bins, y_bins, ...])

Plot cross-tabulation heatmap for two features.

plot_qq(real, synth, feature[, n_quantiles, figsize])

Generate a Q-Q plot for a feature of real and synthetic data.

plot_feat_density_diff(real, synth[, feats, ...])

Plot real and synth feature density differences.

synthgauge.plot.plot_histograms(df, feats=None, groupby=None, figcols=2, figsize=None)

Plot feature distributions.

Plot a histogram (or countplot for categorical data) for each feature. When multiple features are provided, the plots are arranged in a grid.

Optionally, a groupby feature can be specified to apply a grouping prior to calculating the distribution.

Parameters
  • df (pandas.DataFrame) – Dataframe containing the feature(s) to be plotted.

  • feats (list of str or None, default None) – Features in df to plot. If None (default), all features are used.

  • groupby (str, optional) – Feature on which to group data.

  • figcols (int, default 2) – Number of columns to use in the figure.

  • figsize (tuple of float, optional) – Size of figure in inches (width, height). Defaults to matplotlib.pyplot.rcParams["figure.figsize"].

Return type

matplotlib.figure.Figure
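
As a rough illustration of what each panel summarises, the sketch below computes the counts behind a histogram (numeric feature) or countplot (categorical feature), optionally per group. It uses only NumPy and pandas and is not synthgauge's implementation; the helper feature_distribution, the toy column names, and the choice of sharing bin edges across groups are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "age": [23, 35, 41, 29, 52, 47],
        "sex": ["F", "M", "F", "F", "M", "M"],
        "region": ["N", "S", "N", "S", "N", "S"],
    }
)


def feature_distribution(df, feat, groupby=None, bins=3):
    """Return bin counts (numeric) or category counts (categorical)."""
    if pd.api.types.is_numeric_dtype(df[feat]):
        # Assumption: bin edges come from the whole column so that
        # the per-group counts are directly comparable.
        edges = np.histogram_bin_edges(df[feat], bins=bins)
        if groupby is None:
            return np.histogram(df[feat], bins=edges)[0]
        return {
            key: np.histogram(group[feat], bins=edges)[0]
            for key, group in df.groupby(groupby)
        }
    if groupby is None:
        return df[feat].value_counts().sort_index()
    return df.groupby(groupby)[feat].value_counts().sort_index()


age_counts = feature_distribution(df, "age")
age_by_sex = feature_distribution(df, "age", groupby="sex")
region_counts = feature_distribution(df, "region")
```

With synthgauge installed, the corresponding plotting call for this data would be synthgauge.plot.plot_histograms(df, feats=["age", "region"], groupby="sex").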

synthgauge.plot.plot_joint(df, x, y, groupby=None, x_bins='auto', y_bins='auto', figsize=6)

Plot bivariate and univariate graphs.

Convenience function that leverages seaborn. For more granular control, refer to seaborn.JointGrid and seaborn.jointplot.

Parameters
  • df (pandas.DataFrame) – DataFrame containing the feature(s) to plot.

  • x (str) – Feature to plot on the x-axis and -margin.

  • y (str) – Feature to plot on the y-axis and -margin.

  • groupby (str, optional) – Feature on which to group data.

  • x_bins (array_like or int or str, default "auto") – Binning method for the x-axis. If array_like, must be a sequence of bin edges. If int, specifies the number of bins to use. If str, can be anything accepted by numpy.histogram_bin_edges.

  • y_bins (array_like or int or str, default "auto") – Binning method for the y-axis. If array_like, must be a sequence of bin edges. If int, specifies the number of bins to use. If str, can be anything accepted by numpy.histogram_bin_edges.

  • figsize (int, default 6) – Size of each side of the figure in inches (it will be square). Defaults to six inches.

Return type

seaborn.axisgrid.JointGrid
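
The three accepted forms of x_bins and y_bins can be seen directly with numpy.histogram_bin_edges, which the parameter descriptions above refer to. This is a minimal sketch on made-up values; it shows how NumPy interprets each form, not how synthgauge or seaborn apply the bins internally.

```python
import numpy as np

# Made-up sample values for illustration.
values = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 7.0, 9.0])

# int: that many equal-width bins spanning the data range.
edges_int = np.histogram_bin_edges(values, bins=4)

# str: a named estimator chooses the number of bins.
edges_auto = np.histogram_bin_edges(values, bins="auto")

# array_like: the given edges are used as they are.
edges_explicit = np.histogram_bin_edges(values, bins=[0, 3, 6, 9])

# Counts per bin for the explicit edges; the last bin includes its right edge.
counts, _ = np.histogram(values, bins=edges_explicit)
```

Here edges_int is [1, 3, 5, 7, 9] and counts is [3, 2, 2].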

synthgauge.plot.plot_histogram3d(df, x, y, x_bins='auto', y_bins='auto', figsize=None)

Plot 3D histogram of two features.

This is similar to a 2D histogram plot with an extra axis added to display the count for each feature-wise pair as 3D bars.

Parameters
  • df (pandas.DataFrame) – DataFrame containing the feature(s) to plot.

  • x (str) – Feature to plot on the x-axis.

  • y (str) – Feature to plot on the y-axis.

  • x_bins (array_like or int or str, default "auto") – Binning method for the x-axis. If the corresponding feature is categorical, the number of bins is set to the cardinality of that feature. If array_like, must be a sequence of bin edges. If int, specifies the number of bins to use. If str, can be anything accepted by numpy.histogram_bin_edges.

  • y_bins (array_like or int or str, default "auto") – Binning method for the y-axis. If the corresponding feature is categorical, the number of bins is set to the cardinality of that feature. If array_like, must be a sequence of bin edges. If int, specifies the number of bins to use. If str, can be anything accepted by numpy.histogram_bin_edges.

  • figsize (tuple of float, optional) – Size of figure in inches (width, height). Defaults to matplotlib.pyplot.rcParams["figure.figsize"].

Return type

matplotlib.figure.Figure
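
The bar heights in such a plot are the counts of a 2D histogram. The sketch below computes those counts with numpy.histogram2d for two numeric features; it is an illustration under assumed toy data, not synthgauge's own code, and it does not cover the categorical case.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "height": [150, 160, 170, 180, 190, 165],
        "weight": [50, 60, 70, 80, 90, 62],
    }
)

# Two equal-width bins per axis, derived from each feature's range.
x_edges = np.histogram_bin_edges(df["height"], bins=2)
y_edges = np.histogram_bin_edges(df["weight"], bins=2)

# counts[i, j] is the number of rows falling in x-bin i and y-bin j;
# each entry would be the height of one 3D bar.
counts, _, _ = np.histogram2d(df["height"], df["weight"], bins=[x_edges, y_edges])
```

For this data counts is [[3, 0], [0, 3]]: short, light rows share one cell and tall, heavy rows share the opposite one.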

synthgauge.plot.plot_correlation(*dfs, feats=None, method='pearson', plot_diff=False, figcols=2, figsize=None, **kwargs)

Plot correlation between features in a dataframe.

For each dataframe provided, a subplot is generated showing a correlation heatmap of the features. For numeric features, the method can be anything supported by pandas.DataFrame.corr; for categorical or object feature types, "cramers_v" must be specified. If the method does not match the data type, an error is raised.

If plot_diff is True, a difference plot is also included, showing the correlation difference between two dataframes. This only takes effect when exactly two dataframes are provided.

Parameters
  • *dfs (pandas.DataFrame) – Any number of dataframes to plot.

  • feats (list of str or None, default None) – Features to plot. Must be present in all elements of dfs. If None (default), uses features common to all dataframes.

  • method ({"pearson", "spearman", "cramers_v"}, default "pearson") – Correlation method. See pandas.DataFrame.corr for more details on "pearson" and "spearman". When "cramers_v" is specified, correlation is calculated using synthgauge.metrics.correlation.cramers_v.

  • plot_diff (bool, default False) – If True and exactly two dataframes are provided, will also plot a heatmap of the absolute differences between the respective datasets’ correlations.

  • figcols (int, default 2) – Number of columns to use in the figure. Only used when feats contains more than one feature.

  • figsize (tuple of float, optional) – Size of figure in inches (width, height). Defaults to matplotlib.pyplot.rcParams["figure.figsize"].

  • **kwargs (dict, optional) – Any other keyword arguments to be passed to seaborn.heatmap. For example, annot=True will turn on cell annotations. See the seaborn.heatmap documentation for more examples.

Raises

ValueError – If method does not match the data type(s) of feats.

Return type

matplotlib.figure.Figure
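
The quantities behind the heatmaps can be reproduced with pandas alone. The sketch below computes Pearson correlation matrices for two numeric dataframes over their common features, plus the absolute difference described for plot_diff. The random toy data and variable names are assumptions for illustration, and the Cramér's V case is not covered.

```python
import numpy as np
import pandas as pd

# Made-up numeric data: "synth" is a noisy copy of "real".
rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
synth = real + rng.normal(scale=0.5, size=real.shape)

# Default behaviour described above: use features common to all dataframes.
feats = [col for col in real.columns if col in synth.columns]

real_corr = real[feats].corr(method="pearson")
synth_corr = synth[feats].corr(method="pearson")

# Matrix shown by the difference plot: absolute correlation differences.
abs_diff = (real_corr - synth_corr).abs()
```

With synthgauge installed, the equivalent plot would come from synthgauge.plot.plot_correlation(real, synth, plot_diff=True, annot=True).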

synthgauge.plot.plot_crosstab(real, synth, x, y, x_bins='auto', y_bins='auto', figsize=None, cmap='rocket', **kwargs)

Plot cross-tabulation heatmap for two features.

The two-feature crosstab calculation is performed and plotted as a heatmap. One heatmap is shown for the real data and one for the synthetic. Numeric features are discretised using the *_bins arguments.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • x (str) – Feature to plot on the x-axis.

  • y (str) – Feature to plot on the y-axis.

  • x_bins (array_like or int or str, default "auto") – Binning method for the x-axis. If array_like, must be a sequence of bin edges. If int, specifies the number of bins to use. If str, can be anything accepted by numpy.histogram_bin_edges. Only used for numeric features.

  • y_bins (array_like or int or str, default "auto") – Binning method for the y-axis. If array_like, must be a sequence of bin edges. If int, specifies the number of bins to use. If str, can be anything accepted by numpy.histogram_bin_edges. Only used for numeric features.

  • figsize (tuple of float, optional) – Size of figure in inches (width, height). Defaults to matplotlib.pyplot.rcParams["figure.figsize"].

  • cmap (str, default "rocket") – Palette name for heatmap and colour bar. See the documentation for seaborn.color_palette on available palette formats.

  • **kwargs (dict, optional) – Any other keyword arguments to be passed to seaborn.heatmap. For example, annot=True will turn on cell annotations. See the seaborn.heatmap documentation for more examples.

Return type

matplotlib.figure.Figure
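
The discretise-then-crosstab step can be sketched with pandas.cut and pandas.crosstab. This is an illustration rather than synthgauge's implementation: the toy data, the helper binned_crosstab, and the decision to derive one set of bin edges from the combined real and synthetic values are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
real = pd.DataFrame(
    {"age": [20, 30, 40, 50, 60, 70], "smoker": ["y", "n", "n", "y", "n", "n"]}
)
synth = pd.DataFrame(
    {"age": [25, 35, 45, 55, 65, 20], "smoker": ["n", "n", "y", "y", "n", "y"]}
)

# Assumption: shared edges so both tables have the same rows.
edges = np.histogram_bin_edges(pd.concat([real["age"], synth["age"]]), bins=2)


def binned_crosstab(df):
    """Cross-tabulate binned age (rows) against smoker (columns)."""
    age_bin = pd.cut(df["age"], bins=edges, include_lowest=True)
    return pd.crosstab(age_bin, df["smoker"])


real_ct = binned_crosstab(real)
synth_ct = binned_crosstab(synth)
```

Each table is what one heatmap would display. With synthgauge installed, the plotting call would be synthgauge.plot.plot_crosstab(real, synth, "age", "smoker", x_bins=2).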

synthgauge.plot.plot_qq(real, synth, feature, n_quantiles=None, figsize=None)

Generate a Q-Q plot for a feature of real and synthetic data.

Quantile-quantile (Q-Q) plots are used to visualise two sets of numeric data to see if they are generated from the same distribution.

In this case, it provides some insight into the feature distributions for the synthetic and real data. If the points lie close to the line y = x, it can be inferred that the two distributions are similar and therefore that the synthetically generated data follows approximately the same distribution as the real data.

See Q-Q Plot for more information.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feature (str) – Feature to plot. Must be in real and synth.

  • n_quantiles (int or None, default None) – Number of quantiles to calculate. If None (default), uses the length of real.

  • figsize (tuple of float, optional) – Size of figure in inches (width, height). Defaults to matplotlib.pyplot.rcParams["figure.figsize"].

Raises

TypeError – If feature is not numeric.

Return type

matplotlib.figure.Figure
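
The points of a Q-Q plot are matched quantiles of the two samples. The sketch below computes them with numpy.quantile, defaulting the number of quantiles to the length of real as described above. The helper qq_points, the evenly spaced probabilities, and the random toy data are assumptions for illustration; synthgauge may compute its quantiles differently.

```python
import numpy as np
import pandas as pd

# Made-up data: "synth" shares the distribution of "real", "shifted" does not.
rng = np.random.default_rng(42)
real = pd.DataFrame({"income": rng.normal(50, 10, size=500)})
synth = pd.DataFrame({"income": rng.normal(50, 10, size=400)})
shifted = pd.DataFrame({"income": rng.normal(70, 10, size=400)})


def qq_points(real, synth, feature, n_quantiles=None):
    """Return matched quantiles of a feature in real and synth."""
    if n_quantiles is None:
        n_quantiles = len(real)
    probs = np.linspace(0, 1, n_quantiles)
    return np.quantile(real[feature], probs), np.quantile(synth[feature], probs)


real_q, synth_q = qq_points(real, synth, "income")
_, shifted_q = qq_points(real, shifted, "income")

# Mean distance from the line y = x; smaller means closer distributions.
gap_same = np.mean(np.abs(real_q - synth_q))
gap_shifted = np.mean(np.abs(real_q - shifted_q))
```

Plotting synth_q against real_q gives the scatter. Here gap_same is smaller than gap_shifted, matching the reading that points near y = x indicate similar distributions.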

synthgauge.plot.plot_feat_density_diff(real, synth, feats=None, feat_bins=10, diff_bins=10, figsize=None)

Plot real and synth feature density differences.

For each feature, the density difference between real and synth is calculated using metrics.density._feature_density_diff.

If a single feature is provided in feats, the plot shows the raw density differences for each bin in that feature.

Where multiple features are provided, the density differences are concatenated into a flattened array and a histogram is plotted. The histogram represents the distribution of differences in densities across all features and bins.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – Features used to compute the densities. If None (default), all common features are used.

  • feat_bins (str or int, default 10) – Bins to use for computing the feature densities. This value is passed to numpy.histogram_bin_edges so can be any value accepted by that function. By default, uses 10 bins.

  • diff_bins (str or int, default 10) – Bins to use when computing the multiple-feature difference histogram. This value is passed to numpy.histogram_bin_edges so can be any value accepted by that function. By default, uses 10 bins.

  • figsize (tuple of float, optional) – Size of figure in inches (width, height). Defaults to matplotlib.pyplot.rcParams["figure.figsize"].

Return type

matplotlib.figure.Figure
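
The two modes described above can be sketched with NumPy. The helper below is hypothetical and is not metrics.density._feature_density_diff: it assumes that bin edges are taken from the combined real and synthetic values, that densities are normalised histograms, and that the difference is real minus synth. The actual function may differ on any of these points.

```python
import numpy as np
import pandas as pd


def feature_density_diff(real, synth, feature, bins=10):
    """Per-bin density difference (real minus synth) for one feature."""
    combined = pd.concat([real[feature], synth[feature]])
    edges = np.histogram_bin_edges(combined, bins=bins)
    real_density, _ = np.histogram(real[feature], bins=edges, density=True)
    synth_density, _ = np.histogram(synth[feature], bins=edges, density=True)
    return real_density - synth_density, edges


# Hypothetical toy data; the column names are illustrative only.
real = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [10, 10, 20, 20]})
synth = pd.DataFrame({"a": [0.0, 0.5, 1.0, 4.0], "b": [10, 20, 20, 20]})

# Single feature: the raw per-bin differences are what gets plotted.
diff_a, edges_a = feature_density_diff(real, synth, "a", bins=2)

# Multiple features: concatenate all differences and histogram them.
diff_b, _ = feature_density_diff(real, synth, "b", bins=2)
all_diffs = np.concatenate([diff_a, diff_b])
hist_counts, hist_edges = np.histogram(all_diffs, bins=4)
```

For this data diff_a is [-0.125, 0.125] and diff_b is [0.05, -0.05]. In the multiple-feature case, a histogram concentrated around zero indicates that the real and synthetic densities are close.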