synthgauge.utils

Utility functions for handling real and synthetic data.

Module Contents

Functions

df_combine(real, synth[, feats, source_col_name, ...])

Combine separate dataframes of real and synthetic data.

df_separate(data, source_col_name[, feats, ...])

Separate a dataframe into real and synthetic data.

launder(real, synth[, feats, suffix_real, suffix_synth])

Launder feature names and optionally filter.

cat_encode(df[, feats, return_all, convert_only, force])

Convert object features to categories.

synthgauge.utils.df_combine(real, synth, feats=None, source_col_name='source', source_val_real='real', source_val_synth='synth')

Combine separate dataframes of real and synthetic data.

The dataframes are concatenated along the first axis (rows) and a source column is added to distinguish the real data from the synthetic data. Optionally, specific features can be selected.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – Features to combine. If None (default), all common features are used.

  • source_col_name (str, default "source") – Name of the source column. This column is added to the combined dataset and filled with source_val_real for rows from real and source_val_synth for rows from synth.

  • source_val_real (any, default "real") – Value to use in the source_col_name column to signify the real data.

  • source_val_synth (any, default "synth") – Value to use in the source_col_name column to signify the synthetic data.

Returns

combined – The combined dataframe.

Return type

pandas.DataFrame
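
The behaviour described above can be approximated with plain pandas. The following is a minimal sketch, not the library's implementation: combine_sketch is a hypothetical stand-in, and the choices to keep the column order of real and to reset the row index are assumptions.

```python
import pandas as pd


def combine_sketch(real, synth, feats=None, source_col_name="source",
                   source_val_real="real", source_val_synth="synth"):
    """Hypothetical stand-in that mirrors the documented df_combine behaviour."""
    if feats is None:
        # Assumption: "common features" means columns present in both inputs,
        # kept in the order they appear in `real`.
        feats = [col for col in real.columns if col in synth.columns]

    real_part = real[feats].copy()
    synth_part = synth[feats].copy()

    # Tag each row with its origin before stacking.
    real_part[source_col_name] = source_val_real
    synth_part[source_col_name] = source_val_synth

    # Concatenate along the first axis (rows). Resetting the index is an
    # assumption made for this sketch.
    return pd.concat([real_part, synth_part], axis=0, ignore_index=True)


real = pd.DataFrame({"age": [23, 35], "city": ["Leeds", "York"]})
synth = pd.DataFrame({"age": [29, 41], "city": ["York", "Bath"], "extra": [0, 1]})

combined = combine_sketch(real, synth)
print(combined)
```

Here the extra column is dropped because it appears only in synth, and the combined dataframe gains a source column.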

synthgauge.utils.df_separate(data, source_col_name, feats=None, source_val_real='real', source_val_synth='synth', drop_source_col=True)

Separate a dataframe into real and synthetic data.

The dataframe is split using a source column and real and synthetic flags. Optionally, specific features can be selected.

Parameters
  • data (pandas.DataFrame) – Dataframe to split into real and synthetic components.

  • source_col_name (str) – Name of the column used to signify real versus synthetic data.

  • feats (list of str or None, default None) – Features to separate. If None (default), uses all features.

  • source_val_real (any, default "real") – Value in the source_col_name column signifying the real data.

  • source_val_synth (any, default "synth") – Value in the source_col_name column signifying the synthetic data.

  • drop_source_col (bool, default True) – If True (default), the source_col_name column is dropped from the outputs. If False, it is kept.

Returns

  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.
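
The split described above can be illustrated with boolean masks in pandas. This is a minimal sketch under stated assumptions: separate_sketch is a hypothetical helper, not the library function, and treating "all features" as every column except the source column is an assumption.

```python
import pandas as pd


def separate_sketch(data, source_col_name, feats=None,
                    source_val_real="real", source_val_synth="synth",
                    drop_source_col=True):
    """Hypothetical stand-in that mirrors the documented df_separate behaviour."""
    if feats is None:
        feats = list(data.columns)
    # Handle the source column separately so it is never duplicated.
    keep = [col for col in feats if col != source_col_name]
    if not drop_source_col:
        keep.append(source_col_name)

    # Select rows by the value found in the source column.
    real = data.loc[data[source_col_name] == source_val_real, keep].copy()
    synth = data.loc[data[source_col_name] == source_val_synth, keep].copy()
    return real, synth


data = pd.DataFrame(
    {"age": [23, 35, 29, 41], "source": ["real", "real", "synth", "synth"]}
)

real, synth = separate_sketch(data, "source")
real_kept, synth_kept = separate_sketch(data, "source", drop_source_col=False)
```

With the default drop_source_col=True the outputs contain only the age column; with drop_source_col=False the source column is retained in both outputs.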

synthgauge.utils.launder(real, synth, feats=None, suffix_real='real', suffix_synth='synth')

Launder feature names and optionally filter.

To distinguish clearly between the real and synthetic features, a suffix is appended to every feature name in each dataframe. Optionally, specific features can be selected.

Parameters
  • real (pandas.DataFrame) – Dataframe containing the real data.

  • synth (pandas.DataFrame) – Dataframe containing the synthetic data.

  • feats (list of str or None, default None) – Features to launder. If None (default), all common features are used.

  • suffix_real (str, default "real") – Suffix to append to columns in real.

  • suffix_synth (str, default "synth") – Suffix to append to columns in synth.

Returns

  • real (pandas.DataFrame) – Dataframe containing the laundered real data.

  • synth (pandas.DataFrame) – Dataframe containing the laundered synthetic data.
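
The renaming step can be pictured with pandas' DataFrame.add_suffix. This is a minimal sketch only: launder_sketch is a hypothetical helper, and joining the feature name and suffix with an underscore is an assumption, since the separator is not stated above.

```python
import pandas as pd


def launder_sketch(real, synth, feats=None,
                   suffix_real="real", suffix_synth="synth"):
    """Hypothetical stand-in that mirrors the documented launder behaviour."""
    if feats is None:
        # Assumption: "common features" means columns present in both inputs.
        feats = [col for col in real.columns if col in synth.columns]

    # Assumption: the suffix is joined to the feature name with an underscore.
    real_out = real[feats].add_suffix(f"_{suffix_real}")
    synth_out = synth[feats].add_suffix(f"_{suffix_synth}")
    return real_out, synth_out


real = pd.DataFrame({"age": [23, 35], "city": ["Leeds", "York"]})
synth = pd.DataFrame({"age": [29, 41], "city": ["York", "Bath"]})

real_l, synth_l = launder_sketch(real, synth, feats=["age"])
print(list(real_l.columns), list(synth_l.columns))
```

Only the selected feature is kept, and its name now shows which dataframe it came from, which helps when the two are later placed side by side.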

synthgauge.utils.cat_encode(df, feats=None, return_all=False, convert_only=False, force=False)

Convert object features to categories.

Generates a new version of the input dataframe with the specified features categorically encoded with integer labels. Optionally, the features can be returned as category data type with no encoding.

Before performing the conversion, a check is made to identify any specified features that are not object-type and are thus less suited to categorical encoding. A warning is raised for these features, and they are excluded from the subsequent encoding steps unless force is set to True.

Parameters
  • df (pandas.DataFrame) – Input dataframe to be converted.

  • feats (list of str or None, default None) – Features in df to convert to categorical. If None (default), all object-type columns are selected.

  • return_all (bool, default False) – If True, all features in df will be returned regardless of whether they were converted. If False (default), only the converted features are returned.

  • convert_only (bool, default False) – If True, the features are only converted to the category data type, without being integer-encoded. If False (default), they are also integer-encoded.

  • force (bool, default False) – If True, all features in feats are encoded regardless of their data type. If False (default), features that are not object-type are skipped.

Warns

UserWarning – A warning is given if any of the features in feats are not of an object data type.

Returns

  • out_df (pandas.DataFrame) – Dataframe with (at least) the converted features.

  • cat_dict (dict or NoneType) – A dictionary mapping each encoded feature to its categories. None if convert_only is True.
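
The encoding steps described above can be sketched with pandas' category data type. This is an illustration under stated assumptions, not the library's code: cat_encode_sketch is a hypothetical helper, the warning text is invented, storing the categories as plain lists is a choice made for this sketch, and warning even when force is True is an assumption.

```python
import warnings

import pandas as pd


def cat_encode_sketch(df, feats=None, return_all=False,
                      convert_only=False, force=False):
    """Hypothetical stand-in that mirrors the documented cat_encode behaviour."""
    if feats is None:
        # Default to every object-type column.
        feats = [c for c in df.columns if pd.api.types.is_object_dtype(df[c])]
    feats = list(feats)

    # Warn about requested features that are not object-type.
    non_object = [f for f in feats if not pd.api.types.is_object_dtype(df[f])]
    if non_object:
        warnings.warn(f"Not object-type: {non_object}", UserWarning)
        if not force:
            feats = [f for f in feats if f not in non_object]

    out_df = df.copy()
    cat_dict = None if convert_only else {}
    for feat in feats:
        out_df[feat] = out_df[feat].astype("category")
        if not convert_only:
            # Record the categories, then replace values with integer codes.
            cat_dict[feat] = list(out_df[feat].cat.categories)
            out_df[feat] = out_df[feat].cat.codes

    if not return_all:
        out_df = out_df[feats]
    return out_df, cat_dict


df = pd.DataFrame({
    "colour": pd.Series(["red", "blue", "red"], dtype=object),
    "size": [1, 2, 3],
})

encoded, mapping = cat_encode_sketch(df)
converted, no_mapping = cat_encode_sketch(df, convert_only=True, return_all=True)
```

In the first call only colour is returned, with integer codes that index into mapping["colour"]. In the second call both columns are returned, colour has the category data type, and no mapping is produced.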