embedder

embedder.embedder

Classes and functions for handling embedding objects.

Classes

Name	Description
EmbeddedDataFrame	A data frame with a reference to an `Embedder` object.
Embedder	Class for embedding a dataset.
SimilarityArray	Augmented NumPy array of similarity scores with extra attributes.

EmbeddedDataFrame

embedder.embedder.EmbeddedDataFrame(self, data, embedder, update_norms=True, update_thresholds=False, *args, **kwargs)

A data frame with a reference to an Embedder object.

An EmbeddedDataFrame (EDF) instance wraps together a pandas.DataFrame with a reference to a pprl.embedder.Embedder object. An EDF also has a mandatory bf_indices column, describing the Bloom filter indices used for linkage.

The EDF instance can also calculate bf_norms and thresholds columns which are used in the Embedder.compare() method to compute pprl.embedder.SimilarityArray instances.

Parameters

Name	Type	Description	Default
`data`	numpy.ndarray \| typing.Iterable \| dict \| pandas.DataFrame	Data to which to attach the embedder. Must include a `bf_indices` column with `list` data type.	required
`embedder`	pprl.embedder.embedder.Embedder	A compatible embedder object for the Bloom filter columns in `data`.	required
`update_norms`	bool	Whether to update the Bloom filter norms on creation. Defaults to `True`.	`True`
`update_thresholds`	bool	Whether to update the similarity thresholds on creation. Defaults to `False`.	`False`
`*args`		Additional positional arguments to pass to `pandas.DataFrame` along with `data`.	`()`
`**kwargs`		Additional keyword arguments to pass to `pandas.DataFrame` along with `data`.	`{}`

Attributes

Name	Type	Description
embedder_checksum	str	Hexadecimal string digest from `self.embedder`.

Notes

An EDF instance is usually created from an existing Embedder object by calling the embedder.embed() method. It can also be initialised using an embedder and a pandas.DataFrame that already has a bf_indices column via EmbeddedDataFrame(df, embedder).

If using the second method, it is up to the user to ensure that the Embedder instance is compatible with the bf_indices column (as well as bf_norms and thresholds, if present) in the data frame. If in doubt, call edf.update_norms() and edf.update_thresholds() to recompute them with the attached embedder.

Methods

Name	Description
anonymise	Remove raw data from embedded dataframe.
to_bloom_matrix	Convert Bloom filter indices into a binary matrix.
update_norms	Generate vector norms for each row.
update_thresholds	Generate matching thresholds for each row of the data.

anonymise

embedder.embedder.EmbeddedDataFrame.anonymise(keep=None)

Remove raw data from embedded dataframe.

Remove all columns from the embedded dataframe except the columns listed in keep, plus bf_indices, bf_norms and thresholds.

Parameters

Name	Type	Description	Default
`keep`	list[str] \| None	Columns to keep as they appear in the data, in addition to `bf_indices`, `bf_norms` and `thresholds` if they are present in the data.	`None`

to_bloom_matrix

embedder.embedder.EmbeddedDataFrame.to_bloom_matrix()

Convert Bloom filter indices into a binary matrix.

The matrix has a row for each row in the EDF. The number of columns is equal to self.embedder.bf_size + self.embedder.offset. Each row in the matrix is a Bloom filter expressed as a binary vector, with the ones corresponding to hashed features. This representation is used in the Embedder.compare() method.

Returns

Type	Description
numpy.ndarray	Binary array of size `(len(self), self.embedder.bf_size + self.embedder.offset)`.
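
As an illustration of the conversion described above, here is a minimal pure-Python sketch. The function name `to_bloom_matrix_sketch` and the use of nested lists instead of a NumPy array are assumptions made to keep the example self-contained; this is not the library's implementation.

```python
def to_bloom_matrix_sketch(bf_indices, bf_size, offset=0):
    """Turn lists of set-bit indices into rows of a 0/1 matrix."""
    width = bf_size + offset  # number of columns, as described above
    matrix = []
    for indices in bf_indices:
        row = [0] * width
        for i in indices:
            row[i] = 1  # a one marks a hashed feature
        matrix.append(row)
    return matrix


rows = to_bloom_matrix_sketch([[0, 3], [1, 3, 4]], bf_size=4, offset=1)
# rows == [[1, 0, 0, 1, 0], [0, 1, 0, 1, 1]]
```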

update_norms

embedder.embedder.EmbeddedDataFrame.update_norms()

Generate vector norms for each row.

Create or update the bf_norms column in the EDF. This method calculates, for each Bloom filter, its Euclidean norm when the filter is expressed as a binary vector, and saves it to the EDF. The norm is used to scale the (Soft) Cosine similarity scores.

Attributes

Name	Type	Description
data.bf_norms	list	Column of vector norms for each row in the EDF.
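
Because a Bloom filter expressed as a binary vector contains only zeros and ones, its Euclidean norm is the square root of the number of set bits. The sketch below shows that calculation on lists of indices; the helper name `bloom_norms` is hypothetical and the code is an illustration rather than the library's own routine.

```python
import math


def bloom_norms(bf_indices):
    """Euclidean norm of each filter when written as a 0/1 vector."""
    # Duplicated indices set the same bit once, so count distinct indices.
    return [math.sqrt(len(set(indices))) for indices in bf_indices]


norms = bloom_norms([[0, 3], [1, 3, 4, 7], []])
```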

update_thresholds

embedder.embedder.EmbeddedDataFrame.update_thresholds()

Generate matching thresholds for each row of the data.

The threshold is the minimum similarity score that will be matched. It is found by getting the pairwise similarities between each row and the other rows in the same EDF, and taking the maximum of these.

Attributes

Name	Type	Description
data.thresholds	numpy.ndarray	Column for maximum similarity of each row within the EDF.
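
The threshold rule can be sketched as follows, assuming an untrained embedder so that similarity reduces to ordinary Cosine similarity between sets of indices. The helper names `cosine` and `self_thresholds` are hypothetical, and treating an empty filter as having similarity zero is an assumption of this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity of two Bloom filters given as index lists."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumption: an empty filter matches nothing
    return len(a & b) / math.sqrt(len(a) * len(b))


def self_thresholds(bf_indices):
    """For each row, the highest similarity to any other row."""
    thresholds = []
    for i, row in enumerate(bf_indices):
        others = [
            cosine(row, other)
            for j, other in enumerate(bf_indices)
            if j != i
        ]
        thresholds.append(max(others, default=0.0))
    return thresholds


filters = [[0, 1, 2, 3], [0, 1, 2, 5], [7, 8]]
thresholds = self_thresholds(filters)
```

The first two filters share three of their four bits, so each gets a threshold of 0.75, while the third overlaps with nothing and gets 0.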

Embedder

embedder.embedder.Embedder(self, feature_factory, ff_args=None, bf_size=1024, num_hashes=2, offset=0, salt=None)

Class for embedding a dataset.

Each instance of the Embedder class represents an embedding space on personal data features. An Embedder instance is defined by three things:

A set of Bloom filter parameters
A set of feature factory functions
An embedding matrix that corresponds to the above

Parameters

Name	Type	Description	Default
`feature_factory`	dict	Mapping from dataset columns to feature generation functions.	required
`ff_args`	dict[str, dict] \| None	Mapping from dataset columns to keyword arguments for their respective feature generation functions.	`None`
`bf_size`	int	Size of the Bloom filter. Default is 1024.	`1024`
`num_hashes`	int	Number of hashes to perform. Default is two.	`2`
`offset`	int	Offset for Bloom filter to enable masking. Default is zero.	`0`
`salt`	str \| None	Cryptographic salt added to tokens from the data before hashing.	`None`
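
To make the roles of these parameters concrete, the sketch below maps string tokens to Bloom filter indices. The hashing scheme (salted SHA-256 with the hash number prepended) and the helper names `token_to_indices` and `features_to_indices` are assumptions for illustration only; the library's actual hashing may differ. Only the meaning of `bf_size`, `num_hashes`, `offset` and `salt` is taken from the table above.

```python
import hashlib


def token_to_indices(token, bf_size=1024, num_hashes=2, offset=0, salt=None):
    """Map one string token to Bloom filter indices (hypothetical scheme)."""
    salted = (salt or "") + token
    indices = set()
    for k in range(num_hashes):
        digest = hashlib.sha256(f"{k}:{salted}".encode("utf-8")).hexdigest()
        # Reduce the digest to a position in [0, bf_size), then shift it by
        # `offset` so that positions below `offset` are never hashed into.
        indices.add(int(digest, 16) % bf_size + offset)
    return sorted(indices)


def features_to_indices(features, **params):
    """Union the indices of every feature token into one Bloom filter."""
    out = set()
    for token in features:
        out.update(token_to_indices(token, **params))
    return sorted(out)


idx = features_to_indices(["_j", "jo", "oh", "hn", "n_"], bf_size=64, offset=1)
```

With `num_hashes=2` and five tokens, at most ten bits are set, and every index lies in the range `offset` to `bf_size + offset - 1`, which matches the matrix width of `bf_size + offset` used elsewhere on this page.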

Attributes

Name	Type	Description
scm_matrix	numpy.ndarray	Soft Cosine Measure matrix. Initialised as an identity matrix of size `bf_size + offset`.
freq_matr_matched	numpy.ndarray	Matched frequency matrix for computing `scm_matrix`. Initialised as an identity matrix of size `bf_size + offset`.
freq_matr_unmatched	numpy.ndarray	Unmatched frequency matrix for computing `scm_matrix`. Initialised as an identity matrix of size `bf_size + offset`.
checksum	str	Hexadecimal string digest of the feature factory, SCM matrix, and other embedding parameters. Used to check an embedder is compatible with an `EmbeddedDataFrame`.

Notes

When an instance is initialised in code, the embedding matrix is initialised as an identity matrix; the matrix can then be trained using a pair of datasets with known match status and the trained Embedder instance pickled to file. The pre-trained Embedder instance can then be reinitialised from the pickle file.

Both the untrained and trained instances provide embed() and compare() methods. Comparing datasets using an untrained Embedder instance is equivalent to calculating Cosine similarities on ordinary Bloom filters. Comparing datasets using a pre-trained Embedder calculates the Soft Cosine Measure between Bloom filters. The Soft Cosine Measure embedding matrix is trained using an experimental method.

Methods

Name	Description
compare	Calculate a `SimilarityArray` on two EDFs.
embed	Encode data columns into features from Bloom embedding.
from_pickle	Initialise Embedder instance from pickle file.
to_pickle	Save Embedder instance to pickle file.
train	Fit Soft Cosine Measure matrix to two matched datasets.

compare

embedder.embedder.Embedder.compare(edf1, edf2, require_thresholds=True)

Calculate a SimilarityArray on two EDFs.

Given two EDFs, calculate all pairwise Soft Cosine Similarities between rows.

Parameters

Name	Type	Description	Default
`edf1`	pprl.embedder.embedder.EmbeddedDataFrame	An EDF instance with N rows. Must have `thresholds` column unless `require_thresholds=False`.	required
`edf2`	pprl.embedder.embedder.EmbeddedDataFrame	An EDF instance with M rows. Must have `thresholds` column unless `require_thresholds=False`.	required
`require_thresholds`	bool	If `True` (default), the comparison will fail if thresholds are not present. Must be explicitly set to `False` to allow comparison without thresholds.	`True`

Returns

Type	Description
pprl.embedder.embedder.SimilarityArray	An N by M array containing the similarity matrix of pairwise Soft Cosine similarities between rows of `edf1` and `edf2`.

Raises

Type	Description
ValueError	If `require_thresholds` is `True` and either EDF lacks a `thresholds` column.
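
The pairwise comparison can be illustrated with the textbook definition of the Soft Cosine Measure, `x·S·y / sqrt((x·S·x)(y·S·y))`, which reduces to ordinary Cosine similarity when `S` is the identity matrix. The sketch below is a pure-Python illustration under that assumption; the helper names are hypothetical and the library's own scaling (for example, its use of precomputed `bf_norms`) may differ.

```python
import math


def soft_cosine(x, y, S):
    """Textbook Soft Cosine Measure between two vectors under matrix S."""
    def form(u, v):
        # Bilinear form u^T S v.
        return sum(
            u[i] * S[i][j] * v[j]
            for i in range(len(u))
            for j in range(len(v))
        )

    denom = math.sqrt(form(x, x) * form(y, y))
    return form(x, y) / denom if denom else 0.0


def compare_sketch(rows1, rows2, S):
    """N-by-M table of similarities between two sets of binary rows."""
    return [[soft_cosine(x, y, S) for y in rows2] for x in rows1]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
rows1 = [[1, 1, 0, 0], [0, 0, 1, 1]]                # N = 2
rows2 = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]  # M = 3
sims = compare_sketch(rows1, rows2, identity)
```

Here `sims` has two rows and three columns; identical filters score 1, filters sharing one of two bits score 0.5, and disjoint filters score 0.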

embed

embedder.embedder.Embedder.embed(df, colspec, update_norms=True, update_thresholds=False)

Encode data columns into features from Bloom embedding.

Given a pandas DataFrame and a column specification, convert columns into string features, and then embed the features into Bloom filters. The method returns an instance of EmbeddedDataFrame, which is an augmented pandas DataFrame.

Parameters

Name	Type	Description	Default
`df`	pandas.DataFrame	Data frame to be embedded.	required
`colspec`	dict	Dictionary mapping columns in `df` to feature factory functions.	required
`update_norms`	bool	Whether to calculate vector norms for SCM and add to EDF. `True` by default.	`True`
`update_thresholds`	bool	Whether to calculate similarity thresholds and add to EDF. Used as an outside option in matching. `False` by default.	`False`

Returns

Type	Description
pprl.embedder.embedder.EmbeddedDataFrame	An embedded data frame with its embedder.

from_pickle

embedder.embedder.Embedder.from_pickle(path=None, pickled=None)

Initialise Embedder instance from pickle file.

Parameters

Name	Type	Description	Default
`path`	str	File path from which to load the pickled embedder.	`None`
`pickled`	bytes	Byte-string containing the pickled embedder.	`None`

Raises

Type	Description
ValueError	If not exactly one of `path` and `pickled` is specified.

Returns

Type	Description
pprl.embedder.embedder.Embedder	The reconstructed instance of the `Embedder` class.

to_pickle

embedder.embedder.Embedder.to_pickle(path=None)

Save Embedder instance to pickle file.

Parameters

Name	Type	Description	Default
`path`	str	File path at which to save the pickled embedder. If not specified, the pickled bytes string is returned.	`None`

Returns

Type	Description
bytes or None	If `path` is not specified, the pickled bytes string is returned. Otherwise, nothing is returned.
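
The calling convention of `to_pickle` and `from_pickle` (bytes when no path is given, and exactly one of `path` and `pickled` on loading) can be sketched with the standard `pickle` module. The functions below are hypothetical stand-ins that operate on a plain dictionary of parameters rather than a real Embedder, so they show the contract only, not the library's code.

```python
import os
import pickle
import tempfile


def to_pickle_sketch(obj, path=None):
    """Return pickled bytes if no path is given, else write to file."""
    if path is None:
        return pickle.dumps(obj)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return None


def from_pickle_sketch(path=None, pickled=None):
    """Load from a file path or from bytes, but never both or neither."""
    if (path is None) == (pickled is None):
        raise ValueError("Specify exactly one of `path` and `pickled`.")
    if pickled is not None:
        return pickle.loads(pickled)
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for an embedder's state (illustrative values only).
state = {"bf_size": 512, "num_hashes": 2, "offset": 0, "salt": None}

from_bytes = from_pickle_sketch(pickled=to_pickle_sketch(state))

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "embedder.pkl")
    written = to_pickle_sketch(state, file_path)
    from_file = from_pickle_sketch(path=file_path)
```

As with any use of `pickle`, only load files and byte strings from a source you trust, since unpickling can execute arbitrary code.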

train

embedder.embedder.Embedder.train(edf1, edf2, update=True, learning_rate=1.0, eps=0.01, random_state=None)

Fit Soft Cosine Measure matrix to two matched datasets.

This function updates the scm_matrix attribute in-place along with its constituent matrices, freq_matr_matched and freq_matr_unmatched.

Provide two datasets of pre-matched data, with matching records aligned. If update=True, the training is cumulative, so that train() can be called more than once, updating the same matrices each time by adding new frequency tables. Otherwise, all three matrices are reinitialised prior to training.

Parameters

Name	Type	Description	Default
`edf1`	pprl.embedder.embedder.EmbeddedDataFrame	An embedded dataset.	required
`edf2`	pprl.embedder.embedder.EmbeddedDataFrame	An embedded dataset of known matches in the same order as `edf1`.	required
`update`	bool	Whether to update the existing SCM matrix, or overwrite it. Defaults to `True`.	`True`
`learning_rate`	float	Scaling factor to dampen matrix updates. Must be in the interval `(0, 1]`. Default is 1.0.	`1.0`
`eps`	float	Small non-negative constant to avoid `-Inf` in log of frequencies. Default is 0.01.	`0.01`
`random_state`	None \| numpy.random.RandomState	Random state to pass to dataset jumbler. Defaults to `None`.	`None`

Attributes

Name	Type	Description
scm_matrix	numpy.ndarray	Soft Cosine Measure matrix that is fitted cumulatively or afresh.

SimilarityArray

embedder.embedder.SimilarityArray()

Augmented NumPy array of similarity scores with extra attributes.

Parameters

Name	Description	Default
`input_array`	Original array of similarity score data.	required
`thresholds`	2-tuple of similarity score thresholds for each axis. These thresholds are used when generating a matching.	required
`embedder_checksum`	Hexadecimal string digest of a `pprl.embedder.Embedder` object.	required

Notes

SimilarityArray objects are usually initialised from an instance of pprl.embedder.Embedder via the embedder.compare() method.

Methods

Name	Description
match	Compute a matching.

match

embedder.embedder.SimilarityArray.match(abs_cutoff=0, rel_cutoff=0, hungarian=True, require_thresholds=True)

Compute a matching.

Given an array of similarity scores, compute a matching of its elements, using the Hungarian algorithm by default. If the SimilarityArray has thresholds, masking is used to ensure that prospective matches whose similarity score is below the thresholds are not returned. An abs_cutoff (global minimum similarity score) can also be supplied.

Parameters

Name	Type	Description	Default
`abs_cutoff`	int or float	A lower cutoff for the similarity score. No pairs with similarity below the absolute cutoff will be matched. By default, this is 0.	`0`
`rel_cutoff`	int or float	A margin above the row/column-specific threshold. Raises all thresholds by a constant. By default, this is 0.	`0`
`hungarian`	bool	Whether to compute the unique matching using the Hungarian algorithm, filtered using `thresholds` and `abs_cutoff`. Default is `True`. If `False`, just return all pairs above the threshold.	`True`
`require_thresholds`	bool	If `True` (default), the matching will fail if `thresholds` is not present and valid. Must be explicitly set to `False` to allow matching without similarity thresholds.	`True`

Returns

Type	Description
tuple[list[int], list[int]]	2-tuple of lists holding the row and column indices of matched pairs, e.g. `([0, 1, ...], [0, 1, ...])`.

Notes

If hungarian=False, the matching returns all pairs with similarity score above the abs_cutoff, respecting thresholds if present. In this mode the result is not guaranteed to be one-to-one: the same row or column index can appear in more than one pair.
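
The non-Hungarian mode can be sketched in pure Python as a filter over all pairs. This is an illustration under stated assumptions, not the library's code: the function name `match_all_pairs` is hypothetical, a pair is kept only when its score is strictly greater than every applicable limit, and `rel_cutoff` is added to both the row and the column threshold.

```python
def match_all_pairs(sims, row_thresholds=None, col_thresholds=None,
                    abs_cutoff=0, rel_cutoff=0):
    """Return (rows, cols) of every pair whose score clears all limits."""
    n_rows = len(sims)
    n_cols = len(sims[0]) if sims else 0
    row_thr = row_thresholds if row_thresholds is not None else [0.0] * n_rows
    col_thr = col_thresholds if col_thresholds is not None else [0.0] * n_cols

    rows, cols = [], []
    for i in range(n_rows):
        for j in range(n_cols):
            # The effective limit is the largest of the global cutoff and
            # the raised row- and column-specific thresholds.
            limit = max(abs_cutoff,
                        row_thr[i] + rel_cutoff,
                        col_thr[j] + rel_cutoff)
            if sims[i][j] > limit:
                rows.append(i)
                cols.append(j)
    return rows, cols


sims = [[0.9, 0.8, 0.1],
        [0.2, 0.85, 0.3]]
loose = match_all_pairs(sims, [0.5, 0.5], [0.4, 0.6, 0.2])
strict = match_all_pairs(sims, [0.5, 0.5], [0.4, 0.6, 0.2], abs_cutoff=0.82)
# loose  == ([0, 0, 1], [0, 1, 1])
# strict == ([0, 1], [0, 1])
```

In `loose`, row 0 and column 1 each appear twice, which is the duplication the note above warns about; raising `abs_cutoff` happens to remove it in this example, whereas the Hungarian option is what guarantees a unique matching.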

Functions

Name	Description
nearest_pos_semi_definite	Calculate nearest positive semi-definite version of a matrix.

nearest_pos_semi_definite

embedder.embedder.nearest_pos_semi_definite(X, eps=0.0)

Calculate nearest positive semi-definite version of a matrix.

The function sets all negative eigenvalues of the matrix to zero, or to a small positive value eps to give a positive definite matrix.

Adapted, with thanks, from a StackOverflow post.

Parameters

Name	Type	Description	Default
`X`	numpy.ndarray	Matrix-like array.	required
`eps`	float	Use a small positive constant to give a positive definite matrix. Default is 0 to give a positive semi-definite matrix.	`0.0`

Returns

Type	Description
numpy.ndarray	A positive (semi-)definite matrix.
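
The eigenvalue-clipping idea can be sketched with NumPy as follows. This is a minimal illustration assuming a symmetric input (the sketch symmetrises `X` first); the function name `nearest_psd_sketch` is hypothetical, and the library's version, which follows the referenced StackOverflow post, may differ in detail.

```python
import numpy as np


def nearest_psd_sketch(X, eps=0.0):
    """Clip eigenvalues below `eps` and rebuild the matrix."""
    X = np.asarray(X, dtype=float)
    sym = (X + X.T) / 2                      # eigh expects a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(sym)   # real eigenvalues, orthonormal vectors
    clipped = np.maximum(eigvals, eps)       # raise anything below eps up to eps
    return eigvecs @ np.diag(clipped) @ eigvecs.T


A = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues are 3 and -1
P = nearest_psd_sketch(A)               # the -1 eigenvalue becomes 0
Q = nearest_psd_sketch(A, eps=0.1)      # the -1 eigenvalue becomes 0.1
```

For this input, `P` is the matrix with every entry equal to 1.5, whose eigenvalues are 3 and 0.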