# Embedder API run-through

This article shows the main classes, methods and functionality of the Embedder API.

First, we'll import a few modules, including:

- the `features` module, which has functions for processing data into features for embedding
- the `config` module, which includes our package configuration (such as the location of data directories)
- some classes from the main `embedder` module

```python
import os

import numpy as np
import pandas as pd

from pprl import EmbeddedDataFrame, Embedder, config
from pprl.embedder import features as feat
```
## Data set-up

For this demo we'll create a really minimal pair of datasets. Notice that they don't have to have the same structure or field names.
```python
df1 = pd.DataFrame(
    dict(
        id=[1, 2, 3],
        forename=["Henry", "Sally", "Ina"],
        surname=["Tull", "Brown", "Lawrey"],
        dob=["", "2/1/2001", "4/10/1995"],
        gender=["male", "Male", "Female"],
        county=["", np.NaN, "County Durham"],
    )
)

df2 = pd.DataFrame(
    dict(
        personid=[4, 5, 6],
        full_name=["Harry Tull", "Sali Brown", "Ina Laurie"],
        date_of_birth=["2/1/2001", "2/1/2001", "4/11/1995"],
        sex=["M", "M", "F"],
        county=["Rutland", "Powys", "Durham"],
    )
)
```
Features are extracted from each field as different kinds of string objects, ready to be hash-embedded into the Bloom filters. We need to specify the feature extraction functions we'll use.

In this case we'll need one extractor for names, one for dates of birth, and one for sex/gender records. We create a dict with the functions we need, and another dict to store any keyword arguments we want to pass to each function (here we use all the default arguments, so the keyword-argument dictionaries are empty):
```python
feature_factory = dict(
    name=feat.gen_name_features,
    dob=feat.gen_dateofbirth_features,
    sex=feat.gen_sex_features,
    misc=feat.gen_misc_features,
)

ff_args = dict(name={}, sex={}, dob={})
```
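To make the feature style concrete, here is a simplified sketch of the padded n-gram features that a name extractor like `feat.gen_name_features` produces for the names in this demo. This is an illustration of the idea only, not the package's actual implementation:

```python
def ngram_features(name, sizes=(2, 3)):
    """Pad a name with underscores and slice it into n-grams,
    mimicking the style of feat.gen_name_features (sketch only)."""
    padded = f"_{name.lower().strip()}_"
    return [padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)]

ngram_features("Henry")
# ['_h', 'he', 'en', 'nr', 'ry', 'y_', '_he', 'hen', 'enr', 'nry', 'ry_']
```

These are the same kinds of strings you'll see in the `forename_features` column of the embedded output below.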
## Embedding

Now we can create an `Embedder` object. We want our Bloom filter vectors to have a length of 1024 elements, and we choose to hash each feature two times. These choices seem to work OK, but we haven't explored them systematically.
```python
embedder = Embedder(
    feature_factory,
    ff_args,
    bf_size=2**10,
    num_hashes=2,
)
```
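To give an intuition for what these parameters do: each feature string is hashed `num_hashes` times into positions of a length-`bf_size` binary vector, and the set positions are recorded. Here's a minimal self-contained sketch of the idea, using SHA-256 with a seed prefix as a stand-in for the package's actual hashing scheme:

```python
import hashlib

def bloom_indices(features, bf_size=1024, num_hashes=2):
    """Hash each feature string num_hashes times and collect the
    set-bit positions of the resulting Bloom filter (sketch only)."""
    indices = set()
    for feature in features:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{feature}".encode()).hexdigest()
            indices.add(int(digest, 16) % bf_size)
    return sorted(indices)

bloom_indices(["_h", "he", "sex<m>"])  # at most 6 positions, all in [0, 1024)
```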
Now we can hash-embed the datasets into `EmbeddedDataFrame` (EDF) objects. For this we need to pass a column specification `colspec` that maps each column of the data onto the `feature_factory` functions. Any columns not mapped will not contribute to the embedding.
```python
edf1 = embedder.embed(
    df1, colspec=dict(forename="name", surname="name", dob="dob", gender="sex", county="misc")
)
edf2 = embedder.embed(
    df2, colspec=dict(full_name="name", date_of_birth="dob", sex="sex", county="misc")
)

print(edf1)
print(edf2)
```
```
id forename surname dob gender county \
0 1 Henry Tull male
1 2 Sally Brown 2/1/2001 Male NaN
2 3 Ina Lawrey 4/10/1995 Female County Durham
forename_features \
0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_]
1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_]
2 [_i, in, na, a_, _in, ina, na_]
surname_features \
0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_]
1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_]
2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr...
dob_features gender_features county_features \
0 [] [sex<m>]
1 [day<02>, month<01>, year<2001>] [sex<m>]
2 [day<04>, month<10>, year<1995>] [sex<f>] [county<county durham>]
all_features \
0 [_he, he, _t, ll, tul, ry_, l_, tu, ll_, y_, e...
1 [_br, wn_, ro, ll, al, ly, row, _b, y_, _sa, o...
2 [sex<f>, county<county durham>, na_, re, y_, a...
bf_indices bf_norms
0 [644, 773, 135, 776, 265, 778, 271, 402, 404, ... 6.244998
1 [129, 258, 130, 776, 523, 525, 398, 271, 671, ... 7.141428
2 [647, 394, 269, 13, 15, 532, 667, 28, 413, 155... 7.000000
personid full_name date_of_birth sex county \
0 4 Harry Tull 2/1/2001 M Rutland
1 5 Sali Brown 2/1/2001 M Powys
2 6 Ina Laurie 4/11/1995 F Durham
full_name_features \
0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...
1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...
2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...
date_of_birth_features sex_features county_features \
0 [day<02>, month<01>, year<2001>] [sex<m>] [county<rutland>]
1 [day<02>, month<01>, year<2001>] [sex<m>] [county<powys>]
2 [day<04>, month<11>, year<1995>] [sex<f>] [county<durham>]
all_features \
0 [_t, ll, tul, ry_, l_, county<rutland>, ar, tu...
1 [_br, wn_, i_, ro, li_, al, ali, row, _b, wn, ...
2 [uri, sex<f>, month<11>, na_, ur, ie, a_, au, ...
bf_indices bf_norms
0 [640, 130, 644, 135, 776, 778, 10, 271, 402, 5... 6.855655
1 [130, 523, 525, 398, 271, 152, 671, 803, 806, ... 7.000000
2 [646, 647, 394, 269, 15, 272, 531, 532, 665, 6... 6.928203
```
## Training

*Discuss this at this stage.*
## Computing the similarity scores and the matching

Now that we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.

First, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data with itself.
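The `bf_norms` column is straightforward to reason about: for a binary Bloom vector, the Euclidean norm is just the square root of the number of set bits. A sketch, assuming `bf_indices` lists the set-bit positions:

```python
import math

def bf_norm(bf_indices):
    """Euclidean norm of a binary vector whose set-bit positions
    are listed in bf_indices."""
    return math.sqrt(len(set(bf_indices)))

bf_norm(range(49))  # 7.0, the norm of a vector with 49 set bits
```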
| | personid | full_name | date_of_birth | sex | county | full_name_features | date_of_birth_features | sex_features | county_features | all_features | bf_indices | bf_norms | thresholds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | Harry Tull | 2/1/2001 | M | Rutland | [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... | [day<02>, month<01>, year<2001>] | [sex<m>] | [county<rutland>] | [_t, ll, tul, ry_, l_, county<rutland>, ar, tu... | [640, 130, 644, 135, 776, 778, 10, 271, 402, 5... | 6.855655 | 0.187541 |
| 1 | 5 | Sali Brown | 2/1/2001 | M | Powys | [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... | [day<02>, month<01>, year<2001>] | [sex<m>] | [county<powys>] | [_br, wn_, i_, ro, li_, al, ali, row, _b, wn, ... | [130, 523, 525, 398, 271, 152, 671, 803, 806, ... | 7.000000 | 0.187541 |
| 2 | 6 | Ina Laurie | 4/11/1995 | F | Durham | [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... | [day<04>, month<11>, year<1995>] | [sex<f>] | [county<durham>] | [uri, sex<f>, month<11>, na_, ur, ie, a_, au, ... | [646, 647, 394, 269, 15, 272, 531, 532, 665, 6... | 6.928203 | 0.082479 |
NB: there's also a flag to compute these at the same time as the embedding, but it is off by default because, depending on the workflow, you may wish to compute the norms and thresholds at a different time (e.g. on the server).

Now you can compute the similarities:
```python
similarities = embedder.compare(edf1, edf2)

print(similarities)
```

```
[[0.60728442 0.09150181 0.        ]
 [0.2859526  0.78015612 0.08084521]
 [0.08335143 0.10204083 0.57735028]]
```
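For binary vectors, the Cosine similarity reduces to the size of the intersection of the two sets of set-bit positions divided by the product of the two norms. A minimal sketch of the computation behind the matrix above:

```python
import math

def cosine_from_indices(a, b):
    """Cosine similarity of two binary vectors given as sets of
    set-bit positions: |a & b| / (||a|| * ||b||)."""
    a, b = set(a), set(b)
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))

cosine_from_indices([1, 2, 3, 4], [2, 3, 4, 5])  # 3 / 4 = 0.75
```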
Finally, you can compute the matching:

```python
matching = similarities.match(abs_cutoff=0.5)

print(matching)
```

```
(array([0, 1, 2]), array([0, 1, 2]))
```
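The `match()` method pairs up records whose similarity clears the cutoff. The package's actual algorithm isn't shown here; as an illustration, a simple greedy assignment over the similarity matrix reproduces the result for this example:

```python
import numpy as np

def greedy_match(sim, abs_cutoff=0.5):
    """Greedily pair rows with columns by descending similarity,
    keeping only pairs at or above the cutoff (illustrative sketch,
    not necessarily the package's algorithm)."""
    sim = np.asarray(sim, dtype=float)
    # candidate (row, col) pairs ordered from highest to lowest similarity
    order = np.dstack(np.unravel_index(np.argsort(-sim, axis=None), sim.shape))[0]
    pairs, used_rows, used_cols = [], set(), set()
    for r, c in order:
        if sim[r, c] >= abs_cutoff and r not in used_rows and c not in used_cols:
            pairs.append((int(r), int(c)))
            used_rows.add(r)
            used_cols.add(c)
    pairs.sort()
    return np.array([r for r, _ in pairs]), np.array([c for _, c in pairs])

sim = [[0.607, 0.092, 0.0], [0.286, 0.780, 0.081], [0.083, 0.102, 0.577]]
greedy_match(sim)  # (array([0, 1, 2]), array([0, 1, 2]))
```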
## Serialisation and file I/O

That's how to do the workflow in one session. However, this demo follows a multi-stage workflow, so we need to be able to pass objects around. A couple of methods enable file I/O and serialisation.

First, the `Embedder` object itself needs to be written to file and loaded. The idea is to train it, then share it with the data-owning parties and with the matching server. For this purpose, it's possible to pickle the entire `Embedder` object.
"embedder.pkl")
embedder.to_pickle(
= Embedder.from_pickle("embedder.pkl") embedder_copy
The copy has the same functionality as the original:

```python
similarities = embedder_copy.compare(edf1, edf2)

print(similarities)
```

```
[[0.60728442 0.09150181 0.        ]
 [0.2859526  0.78015612 0.08084521]
 [0.08335143 0.10204083 0.57735028]]
```
NB: this won't work if the two datasets were embedded with different `Embedder` instances, even if the instances are identical. The `compare()` method checks that both EDFs reference the same `Embedder` object in memory, so it fails if one dataset was embedded with the original and the other with the copy. The way to fix this is to re-initialise the `EmbeddedDataFrame` with the new `Embedder` object.
```python
edf2_copy = EmbeddedDataFrame(edf2, embedder_copy)
```
In this case, be careful that the `Embedder` is compatible with the Bloom filter vectors in the EDF (i.e. that it uses the same parameters and feature factories), because while you can refresh the norms and thresholds, you can't refresh the `bf_indices` without re-embedding the data frame.
### Serialising the data

The EDF objects are just a thin wrapper around `pandas.DataFrame` instances, so you can serialise to JSON using the normal methods.
"edf1.json")
edf1.to_json(
= pd.read_json("edf1.json")
edf1_copy
print(isinstance(edf1_copy,EmbeddedDataFrame))
print(isinstance(edf1_copy,pd.DataFrame))
False
True
The `bf_indices`, `bf_norms` and `thresholds` columns will be preserved. However, this demotes the data frames back to normal `pandas.DataFrame` instances and loses the link to an `Embedder` instance.

To fix this, just re-initialise them:
```python
edf1_copy = EmbeddedDataFrame(edf1_copy, embedder_copy)
```