Embedder API run-through

This article shows the main classes, methods and functionality of the Embedder API.

First, we’ll import the modules we need:

import os
import numpy as np
import pandas as pd

from pprl import EmbeddedDataFrame, Embedder, config
from pprl.embedder import features as feat

Data set-up

For this demo we’ll create a minimal pair of datasets. Notice that they don’t need to have the same structure or field names.

df1 = pd.DataFrame(
    dict(
        id=[1, 2, 3],
        forename=["Henry", "Sally", "Ina"],
        surname=["Tull", "Brown", "Lawrey"],
        dob=["", "2/1/2001", "4/10/1995"],
        gender=["male", "Male", "Female"],
        county=["", np.nan, "County Durham"],
    )
)

df2 = pd.DataFrame(
    dict(
        personid=[4, 5, 6],
        full_name=["Harry Tull", "Sali Brown", "Ina Laurie"],
        date_of_birth=["2/1/2001", "2/1/2001", "4/11/1995"],
        sex=["M", "M", "F"],
        county=["Rutland", "Powys", "Durham"],
    )
)

Features are extracted from each field as strings of various kinds, ready to be hash embedded into the Bloom filters. We need to specify the feature extraction functions we’ll use.

In this case we’ll need one extractor for names, one for dates of birth, one for sex/gender records, and a miscellaneous extractor for other fields. We create a dict mapping a label to each function, and another dict to store any keyword arguments we want to pass to each function (here we use all the default arguments, so the keyword argument dictionaries are empty):

feature_factory = dict(
    name=feat.gen_name_features,
    dob=feat.gen_dateofbirth_features,
    sex=feat.gen_sex_features,
    misc=feat.gen_misc_features
)

ff_args = dict(name={}, sex={}, dob={})
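Each extractor maps a raw field value to a list of feature strings. The built-in functions are used as-is here, but the contract is simple enough to sketch. The function below is a hypothetical re-implementation of the name extractor, with the padding scheme and signature inferred from the feature output printed later in this article; it is not the library’s actual code:

```python
def gen_name_features_sketch(name: str) -> list[str]:
    """Hypothetical name extractor: 2-grams and 3-grams of the padded, lowercased name."""
    padded = f"_{name.strip().lower()}_"  # '_' marks the word boundaries
    bigrams = [padded[i:i + 2] for i in range(len(padded) - 1)]
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    return bigrams + trigrams

print(gen_name_features_sketch("Ina"))
# ['_i', 'in', 'na', 'a_', '_in', 'ina', 'na_']
```

This reproduces the forename_features shown for "Ina" in the embedded output below.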

Embedding

Now we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements, and we choose to hash each feature twice. These choices seem to work reasonably well, but we haven’t explored them systematically.

embedder = Embedder(
    feature_factory,
    ff_args,
    bf_size=2**10,
    num_hashes=2,
)
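For intuition: each feature string is hashed num_hashes times, and each hash value modulo bf_size sets one slot in the Bloom filter. A minimal sketch of that idea, using salted SHA-256 hashing purely for illustration (pprl’s actual hash scheme may differ):

```python
import hashlib

def bloom_indices(features: list[str], bf_size: int = 2**10, num_hashes: int = 2) -> set[int]:
    """Hash each feature num_hashes times into a Bloom filter with bf_size slots."""
    indices = set()
    for feature in features:
        for salt in range(num_hashes):
            digest = hashlib.sha256(f"{salt}:{feature}".encode()).hexdigest()
            indices.add(int(digest, 16) % bf_size)
    return indices

idx = bloom_indices(["_i", "in", "na", "a_", "_in", "ina", "na_"])
print(len(idx) <= 14)  # True: at most num_hashes * n_features slots are set
```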

Now we can hash embed each dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification, colspec, which maps each column of the data onto one of the feature_factory functions. Any columns not mentioned in the colspec will not contribute to the embedding.

edf1 = embedder.embed(
    df1, colspec=dict(forename="name", surname="name", dob="dob", gender="sex", county="misc")
)
edf2 = embedder.embed(
    df2, colspec=dict(full_name="name", date_of_birth="dob", sex="sex", county="misc")
)

print(edf1)
print(edf2)
   id forename surname        dob  gender         county  \
0   1    Henry    Tull               male                  
1   2    Sally   Brown   2/1/2001    Male            NaN   
2   3      Ina  Lawrey  4/10/1995  Female  County Durham   

                                   forename_features  \
0  [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_]   
1  [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_]   
2                    [_i, in, na, a_, _in, ina, na_]   

                                    surname_features  \
0           [_t, tu, ul, ll, l_, _tu, tul, ull, ll_]   
1  [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_]   
2  [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr...   

                       dob_features gender_features          county_features  \
0                                []        [sex<m>]                            
1  [day<02>, month<01>, year<2001>]        [sex<m>]                            
2  [day<04>, month<10>, year<1995>]        [sex<f>]  [county<county durham>]   

                                        all_features  \
0  [_he, he, _t, ll, tul, ry_, l_, tu, ll_, y_, e...   
1  [_br, wn_, ro, ll, al, ly, row, _b, y_, _sa, o...   
2  [sex<f>, county<county durham>, na_, re, y_, a...   

                                          bf_indices  bf_norms  
0  [644, 773, 135, 776, 265, 778, 271, 402, 404, ...  6.244998  
1  [129, 258, 130, 776, 523, 525, 398, 271, 671, ...  7.141428  
2  [647, 394, 269, 13, 15, 532, 667, 28, 413, 155...  7.000000  
   personid   full_name date_of_birth sex   county  \
0         4  Harry Tull      2/1/2001   M  Rutland   
1         5  Sali Brown      2/1/2001   M    Powys   
2         6  Ina Laurie     4/11/1995   F   Durham   

                                  full_name_features  \
0  [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...   
1  [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...   
2  [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...   

             date_of_birth_features sex_features    county_features  \
0  [day<02>, month<01>, year<2001>]     [sex<m>]  [county<rutland>]   
1  [day<02>, month<01>, year<2001>]     [sex<m>]    [county<powys>]   
2  [day<04>, month<11>, year<1995>]     [sex<f>]   [county<durham>]   

                                        all_features  \
0  [_t, ll, tul, ry_, l_, county<rutland>, ar, tu...   
1  [_br, wn_, i_, ro, li_, al, ali, row, _b, wn, ...   
2  [uri, sex<f>, month<11>, na_, ur, ie, a_, au, ...   

                                          bf_indices  bf_norms  
0  [640, 130, 644, 135, 776, 778, 10, 271, 402, 5...  6.855655  
1  [130, 523, 525, 398, 271, 152, 671, 803, 806, ...  7.000000  
2  [646, 647, 394, 269, 15, 272, 531, 532, 665, 6...  6.928203  

Computing the similarity scores and the matching

Now that we have two embedded datasets, we can compare them and compute all the pairwise cosine similarity scores.

First, we have to compute the vector norm of each Bloom vector (used to scale the cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially compares the dataset against itself pairwise.

   personid   full_name date_of_birth sex   county  \
0         4  Harry Tull      2/1/2001   M  Rutland   
1         5  Sali Brown      2/1/2001   M    Powys   
2         6  Ina Laurie     4/11/1995   F   Durham   

                                  full_name_features  \
0  [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...   
1  [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...   
2  [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...   

             date_of_birth_features sex_features    county_features  \
0  [day<02>, month<01>, year<2001>]     [sex<m>]  [county<rutland>]   
1  [day<02>, month<01>, year<2001>]     [sex<m>]    [county<powys>]   
2  [day<04>, month<11>, year<1995>]     [sex<f>]   [county<durham>]   

                                        all_features  \
0  [_t, ll, tul, ry_, l_, county<rutland>, ar, tu...   
1  [_br, wn_, i_, ro, li_, al, ali, row, _b, wn, ...   
2  [uri, sex<f>, month<11>, na_, ur, ie, a_, au, ...   

                                          bf_indices  bf_norms  thresholds  
0  [640, 130, 644, 135, 776, 778, 10, 271, 402, 5...  6.855655    0.187541  
1  [130, 523, 525, 398, 271, 152, 671, 803, 806, ...  7.000000    0.187541  
2  [646, 647, 394, 269, 15, 272, 531, 532, 665, 6...  6.928203    0.082479  

NB: there’s also a flag to compute these at the same time as the embedding, but it’s off by default because, depending on the workflow, you may wish to compute the norms and thresholds at a different stage (e.g. on the server).

Now you can compute the similarities:

similarities = embedder.compare(edf1, edf2)

print(similarities)
[[0.60728442 0.09150181 0.        ]
 [0.2859526  0.78015612 0.08084521]
 [0.08335143 0.10204083 0.57735028]]
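For intuition: the printed bf_norms are consistent with binary Bloom vectors, since each norm is the square root of the number of indices in bf_indices (e.g. 7.000000 = √49). Under that assumption, the cosine similarity of two records is just the size of the overlap of their bf_indices, scaled by the two norms:

```python
import numpy as np

def cosine_sim(indices_a: set[int], indices_b: set[int]) -> float:
    """Cosine similarity of two binary Bloom vectors given their set-bit indices."""
    overlap = len(indices_a & indices_b)  # dot product of 0/1 vectors
    return overlap / (np.sqrt(len(indices_a)) * np.sqrt(len(indices_b)))

# Toy example: two vectors sharing 3 set bits
a = {1, 5, 9, 12}
b = {5, 9, 12, 20, 31}
print(round(cosine_sim(a, b), 4))  # 0.6708, i.e. 3 / (sqrt(4) * sqrt(5))
```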

Finally, you can compute the matching:

matching = similarities.match(abs_cutoff=0.5)

print(matching)
(array([0, 1, 2]), array([0, 1, 2]))
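The matching pairs each row of edf1 with at most one row of edf2, keeping only pairs whose similarity clears abs_cutoff. The sketch below uses a simple greedy strategy purely as an illustration (an assumption on our part: match() may instead solve an optimal assignment problem), and it reproduces the result above on this similarity matrix:

```python
import numpy as np

def greedy_match(sims: np.ndarray, abs_cutoff: float):
    """Greedy one-to-one matching: repeatedly pair the highest remaining score."""
    sims = sims.astype(float).copy()
    rows, cols = [], []
    while True:
        i, j = np.unravel_index(np.argmax(sims), sims.shape)
        if sims[i, j] < abs_cutoff:
            break
        rows.append(i)
        cols.append(j)
        sims[i, :] = -np.inf  # each record can be matched at most once
        sims[:, j] = -np.inf
    order = np.argsort(rows)
    return np.array(rows)[order], np.array(cols)[order]

# Similarity matrix printed above
sims = np.array([
    [0.60728442, 0.09150181, 0.0],
    [0.2859526, 0.78015612, 0.08084521],
    [0.08335143, 0.10204083, 0.57735028],
])

print(greedy_match(sims, abs_cutoff=0.5))  # (array([0, 1, 2]), array([0, 1, 2]))
```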

Serialisation and file I/O

That’s how to do the workflow in one session. However, the full workflow is multi-stage and involves several parties, so we need to be able to pass objects around. A couple of methods enable serialisation and file I/O.

First, the Embedder object itself needs to be written to file and loaded back. The idea is to train it once, then share it with the data-owning parties and with the matching server. For this purpose, the entire Embedder object can be pickled.

embedder.to_pickle("embedder.pkl")

embedder_copy = Embedder.from_pickle("embedder.pkl")

The copy has the same functionality as the original:

similarities = embedder_copy.compare(edf1, edf2)

print(similarities)
[[0.60728442 0.09150181 0.        ]
 [0.2859526  0.78015612 0.08084521]
 [0.08335143 0.10204083 0.57735028]]

NB: comparison won’t work between datasets embedded with different Embedder instances, even if those instances are identical. The compare() method checks that both EDFs reference the same Embedder object in memory, so it fails if one dataset was embedded with the original and the other with the copy. The fix is to re-initialise the EmbeddedDataFrame with the new Embedder object.

edf2_copy = EmbeddedDataFrame(edf2, embedder_copy)

In this case, take care that the Embedder is compatible with the Bloom filter vectors in the EDF (i.e. that it uses the same parameters and feature factories): while you can refresh the norms and thresholds, you can’t refresh the bf_indices without re-embedding the data frame.

Serialising the data

The EDF objects are just thin wrappers around pandas.DataFrame instances, so you can serialise them to JSON using the normal methods.

edf1.to_json("edf1.json")

edf1_copy = pd.read_json("edf1.json")

print(isinstance(edf1_copy, EmbeddedDataFrame))
print(isinstance(edf1_copy, pd.DataFrame))
False
True

The bf_indices, bf_norms and thresholds columns will be preserved. However, this demotes the data frames back to normal pandas.DataFrame instances and loses the link to an Embedder instance.
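This demotion is ordinary pandas behaviour rather than anything specific to pprl: pd.read_json always builds a plain DataFrame, so any subclass is lost on the round trip. A minimal sketch with a stand-in subclass (TaggedFrame is hypothetical, used only to illustrate the mechanism):

```python
import io

import pandas as pd

class TaggedFrame(pd.DataFrame):
    """Stand-in for EmbeddedDataFrame: a DataFrame subclass with extra state."""
    pass

tf = TaggedFrame({"id": [1, 2], "bf_norms": [6.24, 7.14]})
payload = tf.to_json()  # serialise to a JSON string

roundtrip = pd.read_json(io.StringIO(payload))
print(isinstance(roundtrip, TaggedFrame))   # False: the subclass is lost
print(isinstance(roundtrip, pd.DataFrame))  # True
```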

To fix this, just re-initialise them:

edf1_copy = EmbeddedDataFrame(edf1_copy, embedder_copy)