embedder

embedder.embedder

Classes and functions for handling embedding objects.

Classes

Name Description
EmbeddedDataFrame A data frame with a reference to an Embedder object.
Embedder Class for embedding a dataset.
SimilarityArray Augmented NumPy array of similarity scores with extra attributes.

EmbeddedDataFrame

embedder.embedder.EmbeddedDataFrame(self, data, embedder, update_norms=True, update_thresholds=False, *args, **kwargs)

A data frame with a reference to an Embedder object.

An EmbeddedDataFrame (EDF) instance wraps together a pandas.DataFrame with a reference to a pprl.embedder.Embedder object. An EDF also has a mandatory bf_indices column, describing the Bloom filter indices used for linkage.

The EDF instance can also calculate bf_norms and thresholds columns which are used in the Embedder.compare() method to compute pprl.embedder.SimilarityArray instances.

Parameters

Name Type Description Default
data numpy.ndarray | typing.Iterable | dict | pandas.DataFrame Data to which to attach the embedder. Must include a bf_indices column with list data type. required
embedder pprl.embedder.embedder.Embedder A compatible embedder object for the Bloom filter columns in data. required
update_norms bool Whether to update the Bloom filter norms on creation. Defaults to True. True
update_thresholds bool Whether to update the similarity thresholds on creation. Defaults to False. False
*args Additional positional arguments to pass to pandas.DataFrame along with data. ()
**kwargs Additional keyword arguments to pass to pandas.DataFrame along with data. {}

Attributes

Name Type Description
embedder_checksum str Hexadecimal string digest from self.embedder.

Notes

An EDF instance is usually created from an existing Embedder object by calling the embedder.embed() method. It can also be initialised using an embedder and a pandas.DataFrame that already has a bf_indices column via EmbeddedDataFrame(df, embedder).

If using the second method it is up to the user to ensure that the Embedder instance is compatible with the bf_indices column (as well as bf_norms and thresholds, if present) in the data frame. If in doubt, call edf.update_norms() and edf.update_thresholds() to refresh them.

Methods

Name Description
anonymise Remove raw data from embedded dataframe.
to_bloom_matrix Convert Bloom filter indices into a binary matrix.
update_norms Generate vector norms for each row.
update_thresholds Generate matching thresholds for each row of the data.
anonymise

embedder.embedder.EmbeddedDataFrame.anonymise(keep=None)

Remove raw data from embedded dataframe.

Remove all columns from the embedded dataframe except those listed in keep and the bf_indices, bf_norms and thresholds columns.

Parameters
Name Type Description Default
keep list[str] | None Columns to keep as they appear in the data, in addition to bf_indices, bf_norms and thresholds if they are present in the data. None
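The column-selection rule described above can be sketched without pandas, using a plain list of column names as a stand-in for the data frame. The helper name `anonymise_columns` and the sample columns are hypothetical; the sketch only illustrates which columns would survive.

```python
# Columns that are always retained when present, per the description above.
RESERVED = ("bf_indices", "bf_norms", "thresholds")


def anonymise_columns(columns, keep=None):
    """Return the column names that survive anonymisation, in original order."""
    keep = set(keep or [])
    return [c for c in columns if c in keep or c in RESERVED]


columns = ["id", "name", "dob", "bf_indices", "bf_norms"]
kept = anonymise_columns(columns, keep=["id"])
print(kept)
```

Raw identifying columns such as `name` and `dob` are dropped unless they are listed explicitly in `keep`.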
to_bloom_matrix

embedder.embedder.EmbeddedDataFrame.to_bloom_matrix()

Convert Bloom filter indices into a binary matrix.

The matrix has a row for each row in the EDF. The number of columns is equal to self.embedder.bf_size + self.embedder.offset. Each row in the matrix is a Bloom filter expressed as a binary vector, with the ones corresponding to hashed features. This representation is used in the Embedder.compare() method.

Returns
Type Description
numpy.ndarray Binary array of shape (len(self), self.embedder.bf_size + self.embedder.offset).
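The conversion from index lists to a binary matrix can be illustrated with a minimal NumPy sketch. The standalone function below and its `n_cols` argument (standing in for `bf_size + offset`) are assumptions for illustration, not the library's implementation.

```python
import numpy as np


def to_bloom_matrix(bf_indices, n_cols):
    """Turn lists of Bloom filter indices into a binary matrix."""
    matrix = np.zeros((len(bf_indices), n_cols), dtype=np.int8)
    for row, indices in enumerate(bf_indices):
        # Set a one at each hashed position for this record.
        matrix[row, np.asarray(indices, dtype=int)] = 1
    return matrix


bf_indices = [[0, 3], [1, 3, 5]]
bloom = to_bloom_matrix(bf_indices, n_cols=8)  # n_cols stands in for bf_size + offset
print(bloom)
```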
update_norms

embedder.embedder.EmbeddedDataFrame.update_norms()

Generate vector norms for each row.

Create or update the bf_norms column in the EDF. This method calculates, for each Bloom filter, its Euclidean norm when the filter is expressed as a binary vector, and saves it to the EDF. The norm is used to scale the (Soft) Cosine similarity scores.

Attributes
Name Type Description
data.bf_norms list Column of vector norms for each row in the EDF.
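For a binary vector, the Euclidean norm is the square root of the number of ones, so the norm can be read directly from the index list. The sketch below assumes that repeated indices within a row count once; `bloom_norm` is a hypothetical helper name.

```python
import math


def bloom_norm(indices):
    """Euclidean norm of a Bloom filter given as a list of set positions."""
    return math.sqrt(len(set(indices)))


bf_indices = [[0, 3], [1, 3, 5, 7], []]
bf_norms = [bloom_norm(idx) for idx in bf_indices]
print(bf_norms)
```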
update_thresholds

embedder.embedder.EmbeddedDataFrame.update_thresholds()

Generate matching thresholds for each row of the data.

The threshold is the minimum similarity score that will be matched. It is found by getting the pairwise similarities between each row and the other rows in the same EDF, and taking the maximum of these.

Attributes
Name Type Description
data.thresholds numpy.ndarray Column of the maximum similarity of each row to the other rows in the EDF.
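The threshold calculation can be sketched as a within-dataset similarity matrix whose diagonal is ignored. This example assumes plain cosine similarity (as for an untrained embedder) and rows with at least one set bit; `self_thresholds` is a hypothetical name.

```python
import numpy as np


def self_thresholds(bloom):
    """Maximum cosine similarity of each row to any other row."""
    bloom = bloom.astype(float)
    norms = np.linalg.norm(bloom, axis=1)
    sims = (bloom @ bloom.T) / np.outer(norms, norms)
    np.fill_diagonal(sims, -np.inf)  # ignore each row's similarity to itself
    return sims.max(axis=1)


bloom = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])
thresholds = self_thresholds(bloom)
print(thresholds)
```

The first two rows are identical, so each gets a threshold close to one; a candidate match from another dataset would have to be at least as similar as that near-duplicate.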

Embedder

embedder.embedder.Embedder(self, feature_factory, ff_args=None, bf_size=1024, num_hashes=2, offset=0, salt=None)

Class for embedding a dataset.

Each instance of the Embedder class represents an embedding space on personal data features. An Embedder instance is defined by three things:

  1. A set of Bloom filter parameters
  2. A set of feature factory functions
  3. An embedding matrix that corresponds to the above

Parameters

Name Type Description Default
feature_factory dict Mapping from dataset columns to feature generation functions. required
ff_args dict[str, dict] | None Mapping from dataset columns to keyword arguments for their respective feature generation functions. None
bf_size int Size of the Bloom filter. Default is 1024. 1024
num_hashes int Number of hashes to perform. Default is two. 2
offset int Offset for Bloom filter to enable masking. Default is zero. 0
salt str | None Cryptographic salt added to tokens from the data before hashing. None
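How the Bloom filter parameters above might interact can be sketched as follows. The hashing scheme (SHA-256 over the salt, hash number and token) and the way `offset` shifts positions are assumptions for illustration; the library's actual scheme may differ.

```python
import hashlib


def bloom_indices(tokens, bf_size=1024, num_hashes=2, offset=0, salt=None):
    """Hash string tokens to a sorted list of Bloom filter positions."""
    indices = set()
    for token in tokens:
        for k in range(num_hashes):
            # Assumed scheme: salt + hash number + token fed to SHA-256.
            message = f"{salt or ''}|{k}|{token}".encode("utf-8")
            digest = hashlib.sha256(message).digest()
            position = int.from_bytes(digest, "big") % bf_size
            indices.add(position + offset)  # shift past the first `offset` slots
    return sorted(indices)


tokens = ["_a", "an", "nn", "n_"]
plain = bloom_indices(tokens, bf_size=64)
shifted = bloom_indices(tokens, bf_size=64, offset=10)
print(plain, shifted)
```

Each token sets at most `num_hashes` positions, and every position falls below `bf_size + offset`.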

Attributes

Name Type Description
scm_matrix numpy.ndarray Soft Cosine Measure matrix. Initialised as an identity matrix of size bf_size + offset.
freq_matr_matched numpy.ndarray Matched frequency matrix for computing scm_matrix. Initialised as an identity matrix of size bf_size + offset.
freq_matr_unmatched numpy.ndarray Unmatched frequency matrix for computing scm_matrix. Initialised as an identity matrix of size bf_size + offset.
checksum str Hexadecimal string digest of the feature factory, SCM matrix, and other embedding parameters. Used to check an embedder is compatible with an EmbeddedDataFrame.
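The idea of a checksum as a compatibility check can be sketched with a digest over the embedding parameters. This example covers only scalar parameters and uses SHA-256 over sorted JSON, both of which are assumptions; the real checksum also covers the feature factory and SCM matrix.

```python
import hashlib
import json


def params_checksum(params):
    """Hex digest of a JSON-serialisable parameter dict (keys sorted for stability)."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = params_checksum({"bf_size": 1024, "num_hashes": 2, "offset": 0, "salt": None})
b = params_checksum({"num_hashes": 2, "bf_size": 1024, "salt": None, "offset": 0})
c = params_checksum({"bf_size": 2048, "num_hashes": 2, "offset": 0, "salt": None})
print(a == b, a == c)
```

Two embedders with the same parameters give the same digest regardless of key order, while any parameter change gives a different one.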

Notes

When an instance is initialised in code, the embedding matrix is initialised as an identity matrix; the matrix can then be trained using a pair of datasets with known match status and the trained Embedder instance pickled to file. The pre-trained Embedder instance can then be reinitialised from the pickle file.

Both the untrained and trained instances provide embed() and compare() methods. Comparing datasets using an untrained Embedder instance is equivalent to calculating Cosine similarities on ordinary Bloom filters. Comparing datasets using a pre-trained Embedder calculates the Soft Cosine Measure between Bloom filters. The Soft Cosine Measure embedding matrix is trained using an experimental method.
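The relationship between the two similarity measures can be illustrated with the standard Soft Cosine Measure definition, which reduces to ordinary cosine similarity when the matrix is the identity. The `related` matrix below is a made-up example of a trained matrix, and the library may normalise scores differently.

```python
import numpy as np


def soft_cosine(x, y, m):
    """Soft Cosine Measure between vectors x and y under matrix m."""
    return (x @ m @ y) / np.sqrt((x @ m @ x) * (y @ m @ y))


x = np.array([1.0, 1.0, 0.0])
y = np.array([1.0, 0.0, 1.0])

# Untrained case: identity matrix, so this is plain cosine similarity.
cosine = soft_cosine(x, y, np.eye(3))

# Hypothetical trained matrix that treats positions 1 and 2 as related.
related = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])
soft = soft_cosine(x, y, related)
print(cosine, soft)
```

Because positions 1 and 2 are treated as partially interchangeable, the soft score is higher than the plain cosine score for the same pair of filters.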

Methods

Name Description
compare Calculate a SimilarityArray on two EDFs.
embed Encode data columns into features from Bloom embedding.
from_pickle Initialise Embedder instance from pickle file.
to_pickle Save Embedder instance to pickle file.
train Fit Soft Cosine Measure matrix to two matched datasets.
compare

embedder.embedder.Embedder.compare(edf1, edf2, require_thresholds=True)

Calculate a SimilarityArray on two EDFs.

Given two EDFs, calculate all pairwise Soft Cosine Similarities between rows.

Parameters
Name Type Description Default
edf1 pprl.embedder.embedder.EmbeddedDataFrame An EDF instance with N rows. Must have thresholds column unless require_thresholds=False. required
edf2 pprl.embedder.embedder.EmbeddedDataFrame An EDF instance with M rows. Must have thresholds column unless require_thresholds=False. required
require_thresholds bool If True (default), the comparison will fail if thresholds are not present. Must be explicitly set to False to allow comparison without thresholds. True
Returns
Type Description
pprl.embedder.embedder.SimilarityArray An N by M array containing the similarity matrix of pairwise Soft Cosine similarities between rows of edf1 and edf2.
Raises
Type Description
ValueError If require_thresholds is True and either EDF lacks a thresholds column.
embed

embedder.embedder.Embedder.embed(df, colspec, update_norms=True, update_thresholds=False)

Encode data columns into features from Bloom embedding.

Given a pandas DataFrame and a column specification, convert columns into string features, and then embed the features into Bloom filters. The method returns an instance of EmbeddedDataFrame, which is an augmented pandas DataFrame.

Parameters
Name Type Description Default
df pandas.DataFrame Data frame to be embedded. required
colspec dict Dictionary mapping columns in df to feature factory functions. required
update_norms bool Whether to calculate vector norms for SCM and add to EDF. True by default. True
update_thresholds bool Whether to calculate similarity thresholds and add to EDF. Used as an outside option in matching. False by default. False
Returns
Type Description
pprl.embedder.embedder.EmbeddedDataFrame An embedded data frame with its embedder.
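The first stage of embedding, turning columns into string features, can be sketched in plain Python. The feature functions, the `feature_factory` keys and the convention that `colspec` values name entries in `feature_factory` are all assumptions for illustration.

```python
def bigrams(value):
    """Character bigrams of a padded, lower-cased string."""
    padded = f"_{str(value).lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def year_feature(value):
    """Single labelled token for the year part of an ISO date string."""
    return [f"year<{str(value)[:4]}>"]


# Hypothetical feature factory and column specification.
feature_factory = {"name": bigrams, "dob": year_feature}
colspec = {"forename": "name", "date_of_birth": "dob"}

record = {"forename": "Ann", "date_of_birth": "1990-05-17"}
features = []
for column, feature_type in colspec.items():
    features.extend(feature_factory[feature_type](record[column]))
print(features)
```

The resulting list of string features is what would then be hashed into Bloom filter indices for the record.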
from_pickle

embedder.embedder.Embedder.from_pickle(path=None, pickled=None)

Initialise Embedder instance from pickle file.

Parameters
Name Type Description Default
path str File path from which to load the pickled embedder. None
pickled bytes Byte-string containing the pickled embedder. None
Raises
Type Description
ValueError If path and pickled are both specified, or neither is.
Returns
Type Description
pprl.embedder.embedder.Embedder The Embedder instance reconstructed from the pickle.
to_pickle

embedder.embedder.Embedder.to_pickle(path=None)

Save Embedder instance to pickle file.

Parameters
Name Type Description Default
path str File path at which to save the pickled embedder. If not specified, the pickled bytes string is returned. None
Returns
Type Description
bytes or None The pickled bytes string if path is not specified; otherwise None.
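The path-or-bytes behaviour of to_pickle and from_pickle can be sketched with the standard pickle module. The module-level functions and the `state` dict below are hypothetical stand-ins for the Embedder methods and instance, not the library's code.

```python
import os
import pickle
import tempfile


def to_pickle(obj, path=None):
    """Pickle obj to path, or return the pickled bytes when no path is given."""
    if path is None:
        return pickle.dumps(obj)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return None


def from_pickle(path=None, pickled=None):
    """Load an object from a file path or from pickled bytes, but not both."""
    if (path is None) == (pickled is None):
        raise ValueError("Specify exactly one of path and pickled.")
    if path is not None:
        with open(path, "rb") as f:
            return pickle.load(f)
    return pickle.loads(pickled)


state = {"bf_size": 512, "num_hashes": 2, "offset": 0, "salt": None}
from_bytes = from_pickle(pickled=to_pickle(state))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embedder.pkl")
    to_pickle(state, path)
    from_file = from_pickle(path=path)
```

Unpickling can execute arbitrary code, so pickled embedders should only be loaded from trusted sources.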
train

embedder.embedder.Embedder.train(edf1, edf2, update=True, learning_rate=1.0, eps=0.01, random_state=None)

Fit Soft Cosine Measure matrix to two matched datasets.

This function updates the scm_matrix attribute in-place along with its constituent matrices, freq_matr_matched and freq_matr_unmatched.

Provide two datasets of pre-matched data, with matching records aligned. If update=True, the training is cumulative, so that train() can be called more than once, updating the same matrices each time by adding new frequency tables. Otherwise, all three matrices are reinitialised prior to training.

Parameters
Name Type Description Default
edf1 pprl.embedder.embedder.EmbeddedDataFrame An embedded dataset. required
edf2 pprl.embedder.embedder.EmbeddedDataFrame An embedded dataset of known matches in the same order as edf1. required
update bool Whether to update the existing SCM matrix, or overwrite it. Defaults to True. True
eps float Small non-negative constant to avoid -Inf in log of frequencies. Default is 0.01. 0.01
learning_rate float Scaling factor to dampen matrix updates. Must be in the interval (0, 1]. Default is 1.0. 1.0
random_state None | numpy.random.RandomState Random state to pass to dataset jumbler. Defaults to None. None
Attributes
Name Type Description
scm_matrix numpy.ndarray Soft Cosine Measure matrix that is fitted cumulatively or afresh.
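The difference between cumulative and fresh training can be sketched with a single frequency table. Counting co-occurrences as `bloom1.T @ bloom2` is an assumption made for illustration; the documentation only states that frequency tables are added on each call and that the matrices start as identities. The helper name `add_frequencies` is hypothetical.

```python
import numpy as np


def add_frequencies(freq, bloom1, bloom2, update=True):
    """Add co-occurrence counts of aligned rows to a frequency table."""
    if not update:
        freq = np.eye(freq.shape[0])  # start again from the identity
    return freq + bloom1.T @ bloom2


# Row i of bloom1 is a known match for row i of bloom2.
bloom1 = np.array([[1, 0, 1], [0, 1, 0]])
bloom2 = np.array([[1, 0, 0], [0, 1, 1]])

freq = add_frequencies(np.eye(3), bloom1, bloom2)             # first call
cumulative = add_frequencies(freq, bloom1, bloom2)            # adds on top
fresh = add_frequencies(freq, bloom1, bloom2, update=False)   # discards history

eps = 0.01
log_freq = np.log(fresh + eps)  # eps keeps zero counts finite instead of -inf
```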

SimilarityArray

embedder.embedder.SimilarityArray()

Augmented NumPy array of similarity scores with extra attributes.

Parameters

Name Type Description Default
input_array Original array of similarity score data. required
thresholds 2-tuple of similarity score thresholds for each axis. These thresholds are used when generating a matching. required
embedder_checksum Hexadecimal string digest of a pprl.embedder.Embedder object. required

Notes

SimilarityArray objects are usually initialised from an instance of pprl.embedder.Embedder via the embedder.compare() method.

Methods

Name Description
match Compute a matching.
match

embedder.embedder.SimilarityArray.match(abs_cutoff=0, rel_cutoff=0, hungarian=True, require_thresholds=True)

Compute a matching.

Given an array of similarity scores, compute a matching of its elements, using the Hungarian algorithm by default. If the SimilarityArray has thresholds, masking is used to ensure that prospective matches whose similarity score is below the thresholds are not returned. An abs_cutoff (global minimum similarity score) can also be supplied.

Parameters
Name Type Description Default
abs_cutoff int or float A lower cutoff for the similarity score. No pairs with similarity below the absolute cutoff will be matched. By default, this is 0. 0
rel_cutoff int or float A margin above the row/column-specific threshold. Raises all thresholds by a constant. By default, this is 0. 0
hungarian bool Whether to compute the unique matching using the Hungarian algorithm, filtered using thresholds and abs_cutoff. Default is True. If False, just return all pairs above the threshold. True
require_thresholds bool If True (default), the matching will fail if thresholds is not present and valid. Must be explicitly set to False to allow matching without similarity thresholds. True
Returns
Type Description
tuple[list[int], list[int]] 2-tuple of lists containing the row and column indices of matched pairs, e.g. ([0, 1, ...], [0, 1, ...]).
Notes

If hungarian=False, the matching returns all pairs with similarity score above the abs_cutoff, respecting thresholds if present. In this mode the matching is not guaranteed to be unique: a row or column may appear in more than one pair.
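The hungarian=False behaviour can be sketched in plain Python. The way the cutoffs combine (a pair must reach abs_cutoff and both its row and column thresholds raised by rel_cutoff) and the use of `>=` are assumptions; `pairs_above_thresholds` is a hypothetical name.

```python
def pairs_above_thresholds(scores, thresholds=None, abs_cutoff=0, rel_cutoff=0):
    """Return (rows, cols) of all pairs that clear every applicable cutoff."""
    rows, cols = [], []
    for i, row in enumerate(scores):
        for j, score in enumerate(row):
            cutoff = abs_cutoff
            if thresholds is not None:
                row_thr, col_thr = thresholds
                cutoff = max(cutoff, row_thr[i] + rel_cutoff, col_thr[j] + rel_cutoff)
            if score >= cutoff:  # assumption: scores equal to the cutoff are kept
                rows.append(i)
                cols.append(j)
    return rows, cols


scores = [
    [0.9, 0.2],
    [0.8, 0.7],
]
thresholds = ([0.5, 0.5], [0.6, 0.3])  # (per-row, per-column)
matched = pairs_above_thresholds(scores, thresholds, abs_cutoff=0.4)
stricter = pairs_above_thresholds(scores, thresholds, rel_cutoff=0.25)
print(matched, stricter)
```

In `matched`, row 1 and column 0 each appear in two pairs, which is the kind of duplication that the Hungarian algorithm removes by selecting a one-to-one assignment.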

Functions

Name Description
nearest_pos_semi_definite Calculate nearest positive semi-definite version of a matrix.

nearest_pos_semi_definite

embedder.embedder.nearest_pos_semi_definite(X, eps=0.0)

Calculate nearest positive semi-definite version of a matrix.

This function achieves this by setting all negative eigenvalues of the matrix to zero, or a small positive value to give a positive definite matrix.

Graciously taken from this StackOverflow post

Parameters

Name Type Description Default
X numpy.ndarray Matrix-like array. required
eps float Use a small positive constant to give a positive definite matrix. Default is 0 to give a positive semi-definite matrix. 0.0

Returns

Type Description
numpy.ndarray A positive (semi-)definite matrix.
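The eigenvalue-clipping approach described above can be sketched with NumPy. This version symmetrises the input first and raises every eigenvalue below eps up to eps; both choices are assumptions, and `nearest_psd` is a hypothetical name rather than the library function.

```python
import numpy as np


def nearest_psd(x, eps=0.0):
    """Clip the eigenvalues of a symmetric matrix from below at eps."""
    sym = (x + x.T) / 2  # assumption: symmetrise before decomposing
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.maximum(eigvals, eps)
    return eigvecs @ np.diag(clipped) @ eigvecs.T


x = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues are -1 and 3
psd = nearest_psd(x)
print(psd)
```

The negative eigenvalue is replaced by zero while the eigenvectors are kept, so the result stays close to the input but is a valid positive semi-definite matrix.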