:py:mod:`synthgauge.utils`
==========================

.. py:module:: synthgauge.utils

.. autoapi-nested-parse::

   Utility functions for handling real and synthetic data.


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   synthgauge.utils.df_combine
   synthgauge.utils.df_separate
   synthgauge.utils.launder
   synthgauge.utils.cat_encode


.. py:function:: df_combine(real, synth, feats=None, source_col_name='source', source_val_real='real', source_val_synth='synth')

   Combine separate dataframes of real and synthetic data.

   The dataframes are concatenated along the first axis (rows) and a
   source column is added to distinguish the real data from the
   synthetic data. Optionally, specific features can be selected.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param feats: Features to combine. If `None` (default), all common features
                 are used.
   :type feats: list of str or None, default None
   :param source_col_name: Name of the source column. This is added to the combined dataset
                           and filled with the `source_val_real` and `source_val_synth`
                           values to signify the real and synthetic data respectively.
                           Defaults to `"source"`.
   :type source_col_name: str, default "source"
   :param source_val_real: Value to use in `source_col_name` column to signify the real
                           data. Defaults to `"real"`.
   :type source_val_real: any, default "real"
   :param source_val_synth: Value to use in `source_col_name` column to signify the
                            synthetic data. Defaults to `"synth"`.
   :type source_val_synth: any, default "synth"

   :returns: **combined** -- The combined dataframe.
   :rtype: pandas.DataFrame
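
The behaviour described for `df_combine` can be approximated with plain pandas. This is a minimal sketch and not the library's implementation: the toy dataframes are invented for illustration, and resetting the row index after concatenation is an assumption.

```python
import pandas as pd

# Toy data, invented for illustration only.
real = pd.DataFrame({"age": [23, 45], "city": ["Leeds", "York"]})
synth = pd.DataFrame({"age": [31, 52], "city": ["Bath", "Hull"]})

# Keep the features shared by both dataframes (the documented default
# when `feats` is None), in the order they appear in `real`.
feats = [col for col in real.columns if col in synth.columns]

# Stack the rows and label each one with its origin, using the default
# source column name and values from the signature above.
# Assumption: the row index is reset after concatenation.
combined = pd.concat(
    [
        real[feats].assign(source="real"),
        synth[feats].assign(source="synth"),
    ],
    ignore_index=True,
)
```

The result has one row per input row plus the extra `source` column.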


.. py:function:: df_separate(data, source_col_name, feats=None, source_val_real='real', source_val_synth='synth', drop_source_col=True)

   Separate a dataframe into real and synthetic data.

   The dataframe is split using a source column and real and synthetic
   flags. Optionally, specific features can be selected.

   :param data: Dataframe to split into real and synthetic components.
   :type data: pandas.DataFrame
   :param source_col_name: Name of the column used to signify real versus synthetic data.
   :type source_col_name: str
   :param feats: Features to separate. If `None` (default), uses all features.
   :type feats: list of str or None, default None
   :param source_val_real: Value in `source_col_name` column signifying the real data.
                           Defaults to `"real"`.
   :type source_val_real: any, default "real"
   :param source_val_synth: Value in `source_col_name` column signifying the synthetic data.
                            Defaults to `"synth"`.
   :type source_val_synth: any, default "synth"
   :param drop_source_col: If `True` (default), the `source_col_name` column is dropped
                           from the outputs. If `False`, it is kept.
   :type drop_source_col: bool, default True

   :returns: * **real** (*pandas.DataFrame*) -- Dataframe containing the real data.
             * **synth** (*pandas.DataFrame*) -- Dataframe containing the synthetic data.
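
The split performed by `df_separate` can be illustrated with boolean masks in plain pandas. This is a rough sketch under the default flag values, not the library's code, and the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical combined dataset with a "source" column.
data = pd.DataFrame(
    {
        "age": [23, 45, 31, 52],
        "source": ["real", "real", "synth", "synth"],
    }
)

source_col_name = "source"

# Select the rows for each flag, then drop the source column,
# mirroring the documented default `drop_source_col=True`.
real = data[data[source_col_name] == "real"].drop(columns=source_col_name)
synth = data[data[source_col_name] == "synth"].drop(columns=source_col_name)
```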


.. py:function:: launder(real, synth, feats=None, suffix_real='real', suffix_synth='synth')

   Launder feature names and optionally filter.

   To provide a clear distinction between the real and synthetic
   features, suffixes are appended to the feature names in each
   dataframe. Optionally, specific features can be selected.

   :param real: Dataframe containing the real data.
   :type real: pandas.DataFrame
   :param synth: Dataframe containing the synthetic data.
   :type synth: pandas.DataFrame
   :param feats: Features to launder. If `None` (default), all common features
                 are used.
   :type feats: list of str or None, default None
   :param suffix_real: Suffix to append to columns in `real`. Default is `"real"`.
   :type suffix_real: str, default "real"
   :param suffix_synth: Suffix to append to columns in `synth`. Default is `"synth"`.
   :type suffix_synth: str, default "synth"

   :returns: * **real** (*pandas.DataFrame*) -- Dataframe containing the laundered real data.
             * **synth** (*pandas.DataFrame*) -- Dataframe containing the laundered synthetic data.
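
The renaming done by `launder` resembles `pandas.DataFrame.add_suffix`. The sketch below is illustrative only: the source does not state how the suffix is joined to the feature name, so the underscore separator is an assumption, as is the toy data.

```python
import pandas as pd

# Toy data, invented for illustration only.
real = pd.DataFrame({"age": [23, 45], "city": ["Leeds", "York"]})
synth = pd.DataFrame({"age": [31, 52], "city": ["Bath", "Hull"]})

# Use the features common to both dataframes.
feats = [col for col in real.columns if col in synth.columns]

# Assumption: the suffix is joined to the feature name with an underscore.
real_laundered = real[feats].add_suffix("_real")
synth_laundered = synth[feats].add_suffix("_synth")
```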


.. py:function:: cat_encode(df, feats=None, return_all=False, convert_only=False, force=False)

   Convert object features to categories.

   Generates a new version of the input dataframe with the specified
   features categorically encoded with integer labels. Optionally, the
   features can be returned as `category` data type with no encoding.

   Before performing the conversion, a check is made to identify any
   specified features that are not `object`-type and thus less suited
   to categorical encoding. A warning is issued for these features and
   they are excluded from subsequent encoding steps unless `force`
   is set to `True`.

   :param df: Input dataframe to be converted.
   :type df: pandas.DataFrame
   :param feats: Features in `df` to convert to categorical. If `None` (default),
                 all object-type columns are selected.
   :type feats: list of str or None, default None
   :param return_all: If `True`, all features in `df` will be returned regardless of
                      whether they were converted. If `False` (default), only the
                      converted features are returned.
   :type return_all: bool, default False
   :param convert_only: If `True`, the features will only be converted to the `category`
                        data-type without being integer-encoded.
   :type convert_only: bool, default False
   :param force: If `True`, all features in `feats` will be encoded regardless of
                 their data-type.
   :type force: bool, default False

   :Warns: **UserWarning** -- A warning is given if any of the features in `feats` are not of
           an `object` data type.

   :returns: * **out_df** (*pandas.DataFrame*) -- Dataframe with (at least) the converted features.
             * **cat_dict** (*dict or NoneType*) -- A dictionary mapping each encoded feature to its categories. If
               `convert_only=True`, this is `None`.
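
Integer label encoding of the kind `cat_encode` describes can be reproduced with the pandas `category` dtype. This is a minimal sketch, not the library's implementation: the data is invented, the feature list is given explicitly, and storing the categories as a pandas index in the dictionary is an assumption.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [1, 2, 3]})

# Features to encode, listed explicitly for this sketch.
feats = ["colour"]

out_df = pd.DataFrame(index=df.index)
cat_dict = {}
for feat in feats:
    # Convert to the `category` dtype; pandas sorts the inferred categories.
    as_category = df[feat].astype("category")
    # Replace each value with the integer position of its category.
    out_df[feat] = as_category.cat.codes
    # Record the categories so the integer codes can be mapped back.
    cat_dict[feat] = as_category.cat.categories
```

Here `out_df` holds only the converted feature, which matches the documented default `return_all=False`.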