embedder
embedder.embedder
Classes and functions for handling embedding objects.
Classes
Name | Description |
---|---|
EmbeddedDataFrame | A data frame with a reference to an Embedder object. |
Embedder | Class for embedding a dataset. |
SimilarityArray | Augmented NumPy array of similarity scores with extra attributes. |
EmbeddedDataFrame
embedder.embedder.EmbeddedDataFrame(self, data, embedder, update_norms=True, update_thresholds=False, *args, **kwargs)
A data frame with a reference to an Embedder object.
An EmbeddedDataFrame (EDF) instance wraps together a pandas.DataFrame with a reference to a pprl.embedder.Embedder object. An EDF also has a mandatory bf_indices column, describing the Bloom filter indices used for linkage.
The EDF instance can also calculate bf_norms and thresholds columns, which are used in the Embedder.compare() method to compute pprl.embedder.SimilarityArray instances.
Parameters
Name | Type | Description | Default |
---|---|---|---|
data | numpy.ndarray or typing.Iterable or dict or pandas.DataFrame | Data to which to attach the embedder. Must include a bf_indices column with list data type. | required |
embedder | pprl.embedder.embedder.Embedder | A compatible embedder object for the Bloom filter columns in data. | required |
update_norms | bool | Whether to update the Bloom filter norms on creation. Defaults to True. | True |
update_thresholds | bool | Whether to update the similarity thresholds on creation. Defaults to False. | False |
*args | | Additional positional arguments to pass to pandas.DataFrame along with data. | () |
**kwargs | | Additional keyword arguments to pass to pandas.DataFrame along with data. | {} |
Attributes
Name | Type | Description |
---|---|---|
embedder_checksum | str | Hexadecimal string digest from self.embedder . |
Notes
An EDF instance is usually created from an existing Embedder object by calling the embedder.embed() method. It can also be initialised using an embedder and a pandas.DataFrame that already has a bf_indices column via EmbeddedDataFrame(df, embedder).
If using the second method, it is up to the user to ensure that the Embedder instance is compatible with the bf_indices column (as well as bf_norms and thresholds, if present) in the data frame. If in doubt, call edf.update_norms() and edf.update_thresholds() to refresh them.
Methods
Name | Description |
---|---|
anonymise | Remove raw data from embedded dataframe. |
to_bloom_matrix | Convert Bloom filter indices into a binary matrix. |
update_norms | Generate vector norms for each row. |
update_thresholds | Generate matching thresholds for each row of the data. |
anonymise
embedder.embedder.EmbeddedDataFrame.anonymise(keep=None)
Remove raw data from embedded dataframe.
Remove all columns from the embedded dataframe except those listed in keep, plus bf_indices, bf_norms and thresholds.
Returns
Type | Description |
---|---|
list[str] | Columns to be returned as they appear in the data, in addition to bf_indices, bf_norms and thresholds if they are present in the data. |
to_bloom_matrix
embedder.embedder.EmbeddedDataFrame.to_bloom_matrix()
Convert Bloom filter indices into a binary matrix.
The matrix has a row for each row in the EDF. The number of columns is equal to self.embedder.bf_size + self.embedder.offset. Each row in the matrix is a Bloom filter expressed as a binary vector, with the ones corresponding to hashed features. This representation is used in the Embedder.compare() method.
Returns
Type | Description |
---|---|
numpy.ndarray | Binary array of size (len(self), self.embedder.bf_size + self.embedder.offset). |
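The conversion described above can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: the helper name indices_to_bloom_matrix is hypothetical, and it assumes each index already refers to a position in the range [0, bf_size + offset).

```python
import numpy as np


def indices_to_bloom_matrix(bf_indices, bf_size, offset=0):
    """Hypothetical helper: one binary row per list of Bloom filter indices."""
    n_cols = bf_size + offset
    matrix = np.zeros((len(bf_indices), n_cols), dtype=np.uint8)
    for row, indices in enumerate(bf_indices):
        # Assumption: indices are valid column positions in [0, n_cols).
        matrix[row, np.asarray(indices, dtype=int)] = 1
    return matrix


bf_indices = [[0, 3, 5], [1, 3], []]
bloom = indices_to_bloom_matrix(bf_indices, bf_size=8)
print(bloom)
```

Each list of indices becomes one row, so an empty list yields an all-zero filter.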
update_norms
embedder.embedder.EmbeddedDataFrame.update_norms()
Generate vector norms for each row.
Create or update the bf_norms column in the EDF. This method calculates, for each Bloom filter, its Euclidean norm when the filter is expressed as a binary vector, and saves it to the EDF. The norm is used to scale the (Soft) Cosine similarity scores.
Attributes
Name | Type | Description |
---|---|---|
data.bf_norms | list | Column of vector norms for each row in the EDF. |
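For a binary vector, the Euclidean norm is simply the square root of the number of set bits. The sketch below checks this with NumPy on made-up indices; it covers only the plain (untrained) case and is not the library's code.

```python
import numpy as np

# Illustrative Bloom filter indices for three records (made-up values).
bf_indices = [[0, 3, 5, 6], [1, 3], [2]]
bf_size = 8

# Expand the indices into binary vectors, one row per record.
bloom = np.zeros((len(bf_indices), bf_size))
for row, indices in enumerate(bf_indices):
    bloom[row, indices] = 1.0

# Euclidean norm of each binary row.
norms = np.linalg.norm(bloom, axis=1)

# Equivalent shortcut: square root of the number of distinct set bits.
shortcut = np.sqrt([len(set(ix)) for ix in bf_indices])
print(norms, shortcut)
```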
update_thresholds
embedder.embedder.EmbeddedDataFrame.update_thresholds()
Generate matching thresholds for each row of the data.
The threshold is the minimum similarity score that will be matched. It is found by getting the pairwise similarities between each row and the other rows in the same EDF, and taking the maximum of these.
Attributes
Name | Type | Description |
---|---|---|
data.thresholds | numpy.ndarray | Column for maximum similarity of each row within the EDF. |
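The threshold calculation can be sketched as follows, assuming plain Cosine similarity (an untrained embedder) and assuming that each row's similarity with itself is excluded, as "the other rows" suggests. The data are made up, and rows with a zero norm are not handled.

```python
import numpy as np

# Three illustrative Bloom filters as binary vectors (made-up values).
bloom = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 0, 1],
    ],
    dtype=float,
)

# Pairwise Cosine similarities between rows of the same dataset.
norms = np.linalg.norm(bloom, axis=1)
sims = (bloom @ bloom.T) / np.outer(norms, norms)

# Assumption: a row is not compared with itself, so mask the diagonal.
np.fill_diagonal(sims, -np.inf)

# Threshold for each row: its highest similarity to any other row.
thresholds = sims.max(axis=1)
print(thresholds)
```

A record that closely resembles another record in its own dataset gets a high threshold, so a cross-dataset candidate has to score at least that well to be matched.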
Embedder
embedder.embedder.Embedder(self, feature_factory, ff_args=None, bf_size=1024, num_hashes=2, offset=0, salt=None)
Class for embedding a dataset.
Each instance of the Embedder class represents an embedding space on personal data features. An Embedder instance is defined by three things:
- A set of Bloom filter parameters
- A set of feature factory functions
- An embedding matrix that corresponds to the above
Parameters
Name | Type | Description | Default |
---|---|---|---|
feature_factory | dict | Mapping from dataset columns to feature generation functions. | required |
ff_args | dict[str, dict] or None | Mapping from dataset columns to keyword arguments for their respective feature generation functions. | None |
bf_size | int | Size of the Bloom filter. Default is 1024. | 1024 |
num_hashes | int | Number of hashes to perform. Default is two. | 2 |
offset | int | Offset for Bloom filter to enable masking. Default is zero. | 0 |
salt | str or None | Cryptographic salt added to tokens from the data before hashing. | None |
Attributes
Name | Type | Description |
---|---|---|
scm_matrix | numpy.ndarray | Soft Cosine Measure matrix. Initialised as an identity matrix of size bf_size + offset. |
freq_matr_matched | numpy.ndarray | Matched frequency matrix for computing scm_matrix. Initialised as an identity matrix of size bf_size + offset. |
freq_matr_unmatched | numpy.ndarray | Unmatched frequency matrix for computing scm_matrix. Initialised as an identity matrix of size bf_size + offset. |
checksum | str | Hexadecimal string digest of the feature factory, SCM matrix, and other embedding parameters. Used to check an embedder is compatible with an EmbeddedDataFrame. |
Notes
When an instance is initialised in code, the embedding matrix is initialised as an identity matrix; the matrix can then be trained using a pair of datasets with known match status and the trained Embedder instance pickled to file. The pre-trained Embedder instance can then be reinitialised from the pickle file.
Both the untrained and trained instances provide embed() and compare() methods. Comparing datasets using an untrained Embedder instance is equivalent to calculating Cosine similarities on ordinary Bloom filters. Comparing datasets using a pre-trained Embedder calculates the Soft Cosine Measure between Bloom filters. The Soft Cosine Measure embedding matrix is trained using an experimental method.
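The relationship between the two cases can be checked numerically. The sketch below uses the standard Soft Cosine Measure formula, x·S·y / (sqrt(x·S·x) · sqrt(y·S·y)); the function name soft_cosine and the example matrices are illustrative and are not taken from the library.

```python
import numpy as np


def soft_cosine(X, Y, S):
    """Pairwise Soft Cosine Measure between rows of X and rows of Y."""
    numerator = X @ S @ Y.T
    # Row-wise quadratic forms x.S.x and y.S.y give the generalised norms.
    x_norm = np.sqrt(np.einsum("ij,jk,ik->i", X, S, X))
    y_norm = np.sqrt(np.einsum("ij,jk,ik->i", Y, S, Y))
    return numerator / np.outer(x_norm, y_norm)


X = np.array([[1, 1, 0, 0], [0, 1, 1, 0]], dtype=float)
Y = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 1]], dtype=float)

# Untrained case: identity matrix, which reduces to plain Cosine similarity.
identity_sims = soft_cosine(X, Y, np.eye(4))
plain_cosine = (X @ Y.T) / np.outer(
    np.linalg.norm(X, axis=1), np.linalg.norm(Y, axis=1)
)

# Illustrative "trained" matrix: positions 0 and 3 are treated as related.
S = np.eye(4)
S[0, 3] = S[3, 0] = 0.5
trained_sims = soft_cosine(X, Y, S)
print(identity_sims, trained_sims, sep="\n")
```

With the identity matrix, filters that share no bits score zero. Once the matrix has off-diagonal weight, such filters can still receive a positive score.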
Methods
Name | Description |
---|---|
compare | Calculate a SimilarityArray on two EDFs. |
embed | Encode data columns into features from Bloom embedding. |
from_pickle | Initialise Embedder instance from pickle file. |
to_pickle | Save Embedder instance to pickle file. |
train | Fit Soft Cosine Measure matrix to two matched datasets. |
compare
embedder.embedder.Embedder.compare(edf1, edf2, require_thresholds=True)
Calculate a SimilarityArray on two EDFs.
Given two EDFs, calculate all pairwise Soft Cosine Similarities between rows.
Parameters
Name | Type | Description | Default |
---|---|---|---|
edf1 | pprl.embedder.embedder.EmbeddedDataFrame | An EDF instance with N rows. Must have thresholds column unless require_thresholds=False. | required |
edf2 | pprl.embedder.embedder.EmbeddedDataFrame | An EDF instance with M rows. Must have thresholds column unless require_thresholds=False. | required |
require_thresholds | bool | If True (default), the comparison will fail if thresholds are not present. Must be explicitly set to False to allow comparison without thresholds. | True |
Returns
Type | Description |
---|---|
pprl.embedder.embedder.SimilarityArray | An N by M array containing the similarity matrix of pairwise Soft Cosine similarities between rows of edf1 and edf2. |
Raises
Type | Description |
---|---|
ValueError | If require_thresholds is True and either EDF lacks a thresholds column. |
embed
embedder.embedder.Embedder.embed(df, colspec, update_norms=True, update_thresholds=False)
Encode data columns into features from Bloom embedding.
Given a pandas DataFrame and a column specification, convert columns into string features, and then embed the features into Bloom filters. The method returns an instance of EmbeddedDataFrame, which is an augmented pandas DataFrame.
Parameters
Name | Type | Description | Default |
---|---|---|---|
df | pandas.DataFrame | Data frame to be embedded. | required |
colspec | dict | Dictionary mapping columns in df to feature factory functions. | required |
update_norms | bool | Whether to calculate vector norms for SCM and add to EDF. True by default. | True |
update_thresholds | bool | Whether to calculate similarity thresholds and add to EDF. Used as an outside option in matching. False by default. | False |
Returns
Type | Description |
---|---|
pprl.embedder.embedder.EmbeddedDataFrame | An embedded data frame with its embedder. |
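The step from string features to Bloom filter indices can be sketched as below. The hashing scheme shown (salted SHA-256 with the hash number prefixed, reduced modulo bf_size and shifted by offset) is an assumption made for illustration; this page does not specify which scheme the library uses, and the function name bloom_indices is hypothetical.

```python
import hashlib


def bloom_indices(features, bf_size=1024, num_hashes=2, offset=0, salt=None):
    """Hypothetical sketch: hash string features into Bloom filter positions."""
    indices = set()
    for feature in features:
        # Assumption: the salt is prepended to each token before hashing.
        token = (salt or "") + feature
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{token}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % bf_size
            # Assumption: the first `offset` positions are reserved.
            indices.add(position + offset)
    return sorted(indices)


# Illustrative features: padded bigrams of the name "john".
features = ["_j", "jo", "oh", "hn", "n_"]
indices = bloom_indices(features, bf_size=64, num_hashes=2, offset=4, salt="s3cret")
print(indices)
```

Because the indices are collected in a set, hash collisions shrink the list, so its length is at most num_hashes times the number of features.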
from_pickle
embedder.embedder.Embedder.from_pickle(path=None, pickled=None)
Initialise Embedder instance from pickle file.
Parameters
Name | Type | Description | Default |
---|---|---|---|
path | str | File path from which to load the pickled embedder. | None |
pickled | bytes | Byte-string containing the pickled embedder. | None |
Raises
Type | Description |
---|---|
ValueError | If not exactly one of path and pickled is specified. |
Returns
Type | Description |
---|---|
pprl.embedder.embedder.Embedder | The reconstructed instance of the Embedder class. |
to_pickle
embedder.embedder.Embedder.to_pickle(path=None)
Save Embedder instance to pickle file.
Parameters
Name | Type | Description | Default |
---|---|---|---|
path | str | File path at which to save the pickled embedder. If not specified, the pickled bytes string is returned. | None |
Returns
Type | Description |
---|---|
bytes or None | If path is not specified, the pickled byte string is returned. Otherwise, nothing is returned. |
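The path-or-bytes contract of to_pickle() and from_pickle() can be illustrated with a stand-in class. TinyEmbedder is hypothetical and is not the pprl class; as a simplifying assumption it pickles only a dictionary of its parameters rather than the whole object.

```python
import os
import pickle
import tempfile


class TinyEmbedder:
    """Hypothetical stand-in for an embedder; not the pprl class."""

    def __init__(self, bf_size=1024, num_hashes=2):
        self.bf_size = bf_size
        self.num_hashes = num_hashes

    def to_pickle(self, path=None):
        # Simplification: pickle the parameters only.
        state = {"bf_size": self.bf_size, "num_hashes": self.num_hashes}
        if path is None:
            return pickle.dumps(state)
        with open(path, "wb") as handle:
            pickle.dump(state, handle)
        return None

    @classmethod
    def from_pickle(cls, path=None, pickled=None):
        # Exactly one of the two sources must be supplied.
        if (path is None) == (pickled is None):
            raise ValueError("Specify exactly one of path and pickled.")
        if path is not None:
            with open(path, "rb") as handle:
                state = pickle.load(handle)
        else:
            state = pickle.loads(pickled)
        return cls(**state)


embedder = TinyEmbedder(bf_size=512)
blob = embedder.to_pickle()  # no path: bytes are returned
restored = TinyEmbedder.from_pickle(pickled=blob)

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "embedder.pkl")
    result = embedder.to_pickle(file_path)  # path given: nothing is returned
    from_file = TinyEmbedder.from_pickle(path=file_path)
```

As with any use of pickle, only load pickled embedders from sources you trust, since unpickling can execute arbitrary code.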
train
embedder.embedder.Embedder.train(edf1, edf2, update=True, learning_rate=1.0, eps=0.01, random_state=None)
Fit Soft Cosine Measure matrix to two matched datasets.
This function updates the scm_matrix attribute in-place along with its constituent matrices, freq_matr_matched and freq_matr_unmatched.
Provide two datasets of pre-matched data, with matching records aligned. If update=True, the training is cumulative, so that train() can be called more than once, updating the same matrices each time by adding new frequency tables. Otherwise, all three matrices are reinitialised prior to training.
Parameters
Name | Type | Description | Default |
---|---|---|---|
edf1 | pprl.embedder.embedder.EmbeddedDataFrame | An embedded dataset. | required |
edf2 | pprl.embedder.embedder.EmbeddedDataFrame | An embedded dataset of known matches in the same order as edf1. | required |
update | bool | Whether to update the existing SCM matrix, or overwrite it. Defaults to True. | True |
eps | float | Small non-negative constant to avoid -Inf in log of frequencies. Default is 0.01. | 0.01 |
learning_rate | float | Scaling factor to dampen matrix updates. Must be in the interval (0, 1]. Default is 1.0. | 1.0 |
random_state | None or numpy.random.RandomState | Random state to pass to dataset jumbler. Defaults to None. | None |
Attributes
Name | Type | Description |
---|---|---|
scm_matrix | numpy.ndarray | Soft Cosine Measure matrix that is fitted cumulatively or afresh. |
SimilarityArray
embedder.embedder.SimilarityArray()
Augmented NumPy array of similarity scores with extra attributes.
Parameters
Name | Type | Description | Default |
---|---|---|---|
input_array | | Original array of similarity score data. | required |
thresholds | | 2-tuple of similarity score thresholds for each axis. These thresholds are used when generating a matching. | required |
embedder_checksum | | Hexadecimal string digest of a pprl.embedder.Embedder object. | required |
Notes
SimilarityArray objects are usually initialised from an instance of pprl.embedder.Embedder via the embedder.compare() method.
Methods
Name | Description |
---|---|
match | Compute a matching. |
match
embedder.embedder.SimilarityArray.match(abs_cutoff=0, rel_cutoff=0, hungarian=True, require_thresholds=True)
Compute a matching.
Given an array of similarity scores, compute a matching of its elements, using the Hungarian algorithm by default. If the SimilarityArray has thresholds, masking is used to ensure that prospective matches whose similarity score is below the thresholds are not returned. An abs_cutoff (global minimum similarity score) can also be supplied.
Parameters
Name | Type | Description | Default |
---|---|---|---|
abs_cutoff | int or float | A lower cutoff for the similarity score. No pairs with similarity below the absolute cutoff will be matched. By default, this is 0. | 0 |
rel_cutoff | int or float | A margin above the row/column-specific threshold. Raises all thresholds by a constant. By default, this is 0. | 0 |
hungarian | bool | Whether to compute the unique matching using the Hungarian algorithm, filtered using thresholds and abs_cutoff. Default is True. If False, just return all pairs above the threshold. | True |
require_thresholds | bool | If True (default), the matching will fail if thresholds is not present and valid. Must be explicitly set to False to allow matching without similarity thresholds. | True |
Returns
Type | Description |
---|---|
tuple[list[int], list[int]] | 2-tuple of lists containing the row and column indices of matched pairs, e.g. ([0, 1, ...], [0, 1, ...]). |
Notes
If hungarian=False, the matching returns all pairs with similarity score above the abs_cutoff, respecting thresholds if present. In this mode there is no guarantee that each row or column appears in at most one matched pair.
Functions
Name | Description |
---|---|
nearest_pos_semi_definite | Calculate nearest positive semi-definite version of a matrix. |
nearest_pos_semi_definite
embedder.embedder.nearest_pos_semi_definite(X, eps=0.0)
Calculate nearest positive semi-definite version of a matrix.
This function achieves this by setting all negative eigenvalues of the matrix to zero, or a small positive value to give a positive definite matrix.
Graciously taken from this StackOverflow post
Parameters
Name | Type | Description | Default |
---|---|---|---|
X | numpy.ndarray | Matrix-like array. | required |
eps | float | Use a small positive constant to give a positive definite matrix. Default is 0 to give a positive semi-definite matrix. | 0.0 |
Returns
Type | Description |
---|---|
numpy.ndarray | A positive (semi-)definite matrix. |
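The eigenvalue clipping described above can be sketched as follows. This is one common way to do it and not necessarily the library's exact code; symmetrising the input first is an assumption, and the function name clip_to_psd is hypothetical.

```python
import numpy as np


def clip_to_psd(X, eps=0.0):
    """Hypothetical sketch: raise eigenvalues below eps up to eps."""
    X = np.asarray(X, dtype=float)
    # Assumption: work with the symmetric part so the eigenvalues are real.
    symmetric = (X + X.T) / 2
    eigvals, eigvecs = np.linalg.eigh(symmetric)
    clipped = np.clip(eigvals, eps, None)
    # Rebuild the matrix from its eigenvectors and the clipped eigenvalues.
    return (eigvecs * clipped) @ eigvecs.T


X = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues are -1 and 3
psd = clip_to_psd(X)
print(psd)

pd = clip_to_psd(X, eps=0.1)
print(np.linalg.eigvalsh(pd))
```

With eps=0 the negative eigenvalue becomes zero, giving a positive semi-definite result; a small positive eps makes every eigenvalue strictly positive.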