Linking the FEBRL datasets

This tutorial shows how the package can be used locally to match the FEBRL datasets, which are included as example datasets in the recordlinkage package.

import os
import time
from functools import partial

import numpy as np
import pandas as pd
from recordlinkage.datasets import load_febrl4

from pprl import EmbeddedDataFrame, Embedder, config
from pprl.embedder import features as feat
Load the data
The data comprise two datasets of 5,000 records each, with no duplicates, and every record has exactly one valid match in the other dataset.
After loading the data, we can parse the true matched ID number from the indices.
feb4a, feb4b = load_febrl4()

feb4a["true_id"] = (
    feb4a.index.str.extract("^rec-([0-9]*)-")
    .iloc[:, 0]
    .astype("int")
    .to_list()
)
feb4b["true_id"] = (
    feb4b.index.str.extract("^rec-([0-9]*)-")
    .iloc[:, 0]
    .astype("int")
    .to_list()
)
Create a feature factory
Define the feature-processing functions that convert each field into a string so that they can be hashed into the Bloom filter. The dictionary of functions is called feature_factory.
If we want to pass the feature factory functions into the embedder with non-default parameters, we have two options:

- Pass a dictionary of dictionaries of keyword arguments as the optional ff_args parameter (e.g. ff_args = {"dob": {"dayfirst": False, "yearfirst": True}}); a sketch of this option follows the list.
- Use functools.partial(), as we have done below.
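For illustration, the ff_args option might look something like the following. This is a hedged sketch only: it assumes the Embedder constructor (initialised properly in a later section) accepts an ff_args mapping of per-feature keyword arguments, as described above; the names feature_factory_plain and embedder_alt are introduced here for the example.

# Sketch of option 1: keyword arguments supplied via ff_args instead of partial().
# Assumes Embedder accepts ff_args as a {feature name: kwargs} mapping.
feature_factory_plain = dict(
    name=feat.gen_name_features,
    dob=feat.gen_dateofbirth_features,
    misc=feat.gen_misc_features,
    address=partial(feat.gen_misc_shingled_features, label="addr", ngram_length=[4]),
)
ff_args = {"dob": {"dayfirst": False, "yearfirst": True}}
embedder_alt = Embedder(feature_factory_plain, ff_args=ff_args, bf_size=1024, num_hashes=2)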
feature_factory = dict(
    name=feat.gen_name_features,
    dob=partial(feat.gen_dateofbirth_features, dayfirst=False, yearfirst=True),
    misc=feat.gen_misc_features,
    address=partial(
        feat.gen_misc_shingled_features, label="addr", ngram_length=[4]
    ),
)
Initialise the embedder instance
This instance embeds each feature twice into a Bloom filter of length 1024.
embedder = Embedder(feature_factory, bf_size=1024, num_hashes=2)
Embed the datasets
The column specification colspec is a dictionary that tells the embedder how to map feature-processing functions to columns of the data. Notice that we can map more than one column to the same function. This means we can easily handle cases where a field we want to compare as a whole, such as an address, spans several columns.
This process makes our Bloom filter robust to inconsistencies where, for example, surname and given name may be swapped, or addresses may be coded inconsistently.
To ensure that different feature types hash into different buckets, the functions gen_misc_features() and gen_misc_shingled_features() add a label to parsed string features, which defaults to the data column name. For example, to ensure suburb doesn't collide with state (if they happened to contain the same value), gen_misc_features() would encode their tokens as suburb<token> and state<token>, respectively. If you want to map different columns into the same feature, such as address below, you can set the label explicitly when passing the function to the embedder.
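As a toy illustration of why the label matters (this mimics the idea only; the actual string format pprl uses internally may differ):

# Illustrative only: prefixing tokens with a label keeps equal values in
# different columns from producing identical features before hashing.
value = "sydney"  # hypothetical value appearing in both suburb and state
suburb_features = {f"suburb<{value}>"}
state_features = {f"state<{value}>"}
assert suburb_features.isdisjoint(state_features)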
colspec = dict(
    given_name="name",
    surname="name",
    date_of_birth="dob",
    street_number="misc",
    state="misc",
    soc_sec_id="misc",
    postcode="misc",
    suburb="misc",
    address_1="address",
    address_2="address",
)
edf1 = embedder.embed(feb4a, colspec=colspec)
edf2 = embedder.embed(feb4b, colspec=colspec)
Store the embedded datasets and their embedder to file.
"party1_data.json")
edf1.to_json("party2_data.json")
edf2.to_json("embedder.pkl") embedder.to_pickle(
Calculate similarity
Compute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.
start = time.time()

edf1.update_thresholds()
edf2.update_thresholds()

end = time.time()
print(f"Updating thresholds took {end - start:.2f} seconds")
Updating thresholds took 8.35 seconds
Compute the matrix of similarity scores.
similarity_scores = embedder.compare(edf1, edf2)
Compute a match
Use the similarity scores to compute a match, using the Hungarian algorithm. First, we compute the match with the row thresholds.
matching = similarity_scores.match(require_thresholds=True)
Using the true IDs, evaluate the precision and recall of the match.
def get_results(edf1, edf2, matching):
    """Get the results for a given matching."""
    trueids_matched1 = edf1.iloc[matching[0], edf1.columns.get_loc("true_id")]
    trueids_matched2 = edf2.iloc[matching[1], edf2.columns.get_loc("true_id")]

    nmatches = len(matching[0])
    truepos = sum(map(np.equal, trueids_matched1, trueids_matched2))
    falsepos = nmatches - truepos

    print(
        f"True pos: {truepos} | False pos: {falsepos} | "
        f"Precision: {truepos / nmatches:.1%} | Recall: {truepos / 5000:.1%}"
    )

    return nmatches, truepos, falsepos


_ = get_results(edf1, edf2, matching)
True pos: 4969 | False pos: 0 | Precision: 100.0% | Recall: 99.4%
Then, we compute the match without using the row thresholds, calculating the same performance metrics:
matching = similarity_scores.match(require_thresholds=False)
_ = get_results(edf1, edf2, matching)
True pos: 5000 | False pos: 0 | Precision: 100.0% | Recall: 100.0%
Without the row thresholds, recall improves (here to 100%), although in general dropping them can admit more false positives. For some uses this balance may be preferable.
In testing, the use of local row thresholds provides a better trade-off between precision and recall, compared to using a single absolute threshold. It has the additional advantage, in a privacy-preserving setting, of being automatic and not requiring clerical review to set the level.
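For contrast, a single absolute threshold could be applied roughly as follows. This is a sketch under assumptions, not part of the pprl API: it treats similarity_scores as an ordinary 2-D array of similarities, uses SciPy's Hungarian-algorithm solver directly, and the cutoff value is hypothetical and would need tuning.

# Sketch of the global-cutoff alternative: solve the assignment problem, then
# keep only the pairs whose similarity clears one fixed threshold.
from scipy.optimize import linear_sum_assignment

scores = np.asarray(similarity_scores)  # assumes an array-like score matrix
rows, cols = linear_sum_assignment(scores, maximize=True)

ABS_THRESHOLD = 0.5  # hypothetical cutoff; would need tuning per dataset
keep = scores[rows, cols] >= ABS_THRESHOLD
matching_abs = (rows[keep], cols[keep])
_ = get_results(edf1, edf2, matching_abs)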