Exploring a simple linkage example
The Python package implements the Bloom filter linkage method (Schnell et al., 2009), and can also implement pretrained Hash embeddings (Miranda et al., 2022), if a suitable large, pre-matched corpus of data is available.
Let us consider a small example where we want to link two excerpts of data on bands. In this scenario, we are looking at some toy data on the members of a fictional, German rock trio called “Verknüpfung”. In this example we will see how to use untrained Bloom filters to match data.
Loading the data
First, we load our data into pandas.DataFrame objects. Here, the first records align, but the other two records should be swapped to have an aligned matching. We will use the toolkit to identify these matches.

import pandas as pd

df1 = pd.DataFrame(
    {
        "first_name": ["Laura", "Kaspar", "Grete"],
        "last_name": ["Daten", "Gorman", "Knopf"],
        "gender": ["F", "M", "F"],
        "date_of_birth": ["01/03/1977", "31/12/1975", "12/7/1981"],
        "instrument": ["bass", "guitar", "drums"],
    }
)
df2 = pd.DataFrame(
    {
        "name": ["Laura Datten", "Greta Knopf", "Casper Goreman"],
        "sex": ["female", "female", "male"],
        "main_instrument": ["bass guitar", "percussion", "electric guitar"],
        "birth_date": ["1977-03-23", "1981-07-12", "1975-12-31"],
    }
)
These datasets don’t have the same column names or follow the same encodings, and there are several spelling mistakes in the band members’ names, as well as a typo in one of the dates. Thankfully, the PPRL Toolkit is flexible enough to handle all of this!
Creating and assigning a feature factory
The next step is to decide how to process each of the columns in our datasets.
To do this, we define a feature factory that maps column types to feature generation functions, and a column specification for each dataset mapping our columns to column types in the factory.
from pprl.embedder import features
from functools import partial

factory = dict(
    name=features.gen_name_features,
    sex=features.gen_sex_features,
    misc=features.gen_misc_features,
    dob=features.gen_dateofbirth_features,
    instrument=partial(features.gen_misc_shingled_features, label="instrument"),
)

spec1 = dict(
    first_name="name",
    last_name="name",
    gender="sex",
    instrument="instrument",
    date_of_birth="dob",
)
spec2 = dict(name="name", sex="sex", main_instrument="instrument", birth_date="dob")
The feature generation functions, features.gen_XXX_features, have sensible default parameters, but sometimes they need to be passed to the feature factory with different parameters, such as to set a feature label in the example above. There are two ways to achieve this: either use functools.partial to set parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the Embedder as ff_args.
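For illustration, here is a sketch of the ff_args route. The exact structure expected (keyed by column type here) is an assumption, so check the package documentation before relying on it:

from pprl.embedder import features
from pprl.embedder.embedder import Embedder

# The shingling function is passed plain, without functools.partial...
factory_alt = dict(
    name=features.gen_name_features,
    instrument=features.gen_misc_shingled_features,
)
# ...and its label is supplied via ff_args instead: a dictionary of
# keyword-argument dictionaries (assumed keyed by column type here).
ff_args = {"instrument": {"label": "instrument"}}
embedder_alt = Embedder(factory_alt, ff_args=ff_args)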
Embedding the data
With our specifications sorted out, we can get to creating our Bloom filter embedding. Before doing so, we need to decide on two parameters: the size of the filter and the number of hashes. By default, these are 1024 and 2, respectively.
Once we’ve decided, we can create our Embedder instance and use it to embed our data with their column specifications.
from pprl.embedder.embedder import Embedder

embedder = Embedder(factory, bf_size=1024, num_hashes=2)

edf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)
edf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)
If we take a look at one of these embedded datasets, we can see that it has a whole bunch of new columns. There is a _features column for each of the original columns containing their pre-embedding string features, and there’s an all_features column that combines the features. Then there are three additional columns: bf_indices, bf_norms and thresholds.
edf1.columns
Index(['first_name', 'last_name', 'gender', 'date_of_birth', 'instrument',
'first_name_features', 'last_name_features', 'gender_features',
'instrument_features', 'date_of_birth_features', 'all_features',
'bf_indices', 'bf_norms', 'thresholds'],
dtype='object')
The bf_indices column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.
print(edf1.bf_indices[0])
[2, 262, 903, 646, 9, 526, 654, 272, 15, 146, 17, 532, 531, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 56, 823, 440, 60, 61, 318, 319, 320, 444, 577, 836, 583, 332, 77, 590, 972, 465, 82, 211, 468, 84, 338, 851, 600, 593, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]
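To make the representation concrete, here is a small numpy sketch (not part of pprl) that expands a record's index list back into the dense binary vector it stands for:

import numpy as np

bf_size = 1024  # must match the Embedder's bf_size
dense = np.zeros(bf_size, dtype=int)
dense[edf1.bf_indices[0]] = 1

# The number of set bits equals the number of stored indices.
assert dense.sum() == len(edf1.bf_indices[0])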
The bf_norms column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case, since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to np.sqrt(len(bf_indices[i])) for record i. The norm is used to scale the similarity measures so that they take values between -1 and 1.
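We can check this relationship directly. The check below assumes the untrained, identity-SCM setting described above:

import numpy as np

# Each norm should be the square root of the number of set bits.
for i in range(len(edf1)):
    assert np.isclose(edf1.bf_norms[i], np.sqrt(len(edf1.bf_indices[i])))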
The thresholds column provides, for each record, a threshold similarity score below which that record will not be matched. It’s like a reserve price in an auction: it stops a record being matched to another record when the similarity isn’t high enough. This is an innovative feature of our method; other linkage methods typically have only one global threshold score for the entire dataset. We sketch the idea in code after the output below.
print(edf1.loc[:,["bf_norms","thresholds"]])
print(edf2.loc[:,["bf_norms","thresholds"]])
bf_norms thresholds
0 8.246211 0.114332
1 9.055386 0.143159
2 8.485281 0.143159
bf_norms thresholds
0 9.695360 0.294345
1 9.380832 0.157014
2 10.862781 0.294345
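To make the reserve-price analogy concrete, here is a conceptual sketch. This is an illustration only, not pprl's internal matching rule, which the match method (shown later) handles for us:

# Conceptual sketch: treat a candidate pairing as viable only if its
# similarity clears the thresholds of both records involved.
def viable(similarity, threshold_a, threshold_b):
    return similarity >= threshold_a and similarity >= threshold_b

print(viable(0.80, edf1.thresholds[0], edf2.thresholds[0]))  # True
print(viable(0.10, edf1.thresholds[0], edf2.thresholds[0]))  # False: below the reserve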
The processed features
Let’s take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.
First, we’ll look at date of birth:
print(edf1.date_of_birth_features[0])
print(edf2.birth_date_features[0])
['day<01>', 'month<03>', 'year<1977>']
['day<23>', 'month<03>', 'year<1977>']
Python can parse the different date formats easily. Although the dates differ slightly between the two datasets, the year and month features will still match, even though the day will not.
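The shingle format is easy to reproduce. Here is a hypothetical re-implementation (not pprl's actual code) that uses pandas to parse each format:

import pandas as pd

def dob_shingles(date_string, dayfirst=True):
    # Hypothetical sketch: parse the date, then emit one labelled
    # shingle each for day, month and year.
    d = pd.to_datetime(date_string, dayfirst=dayfirst)
    return [f"day<{d.day:02}>", f"month<{d.month:02}>", f"year<{d.year}>"]

print(dob_shingles("01/03/1977"))                  # ['day<01>', 'month<03>', 'year<1977>']
print(dob_shingles("1977-03-23", dayfirst=False))  # ['day<23>', 'month<03>', 'year<1977>']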
Then we’ll look at name:
print(edf1.first_name_features[0] + edf1.last_name_features[0])
print(edf2.name_features[0])
['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', 'aur', 'ura', 'ra_', '_d', 'da', 'at', 'te', 'en', 'n_', '_da', 'dat', 'ate', 'ten', 'en_']
['_l', 'la', 'au', 'ur', 'ra', 'a_', '_d', 'da', 'at', 'tt', 'te', 'en', 'n_', '_la', 'lau', 'aur', 'ura', 'ra_', '_da', 'dat', 'att', 'tte', 'ten', 'en_']
The two datasets store the names differently, but this doesn’t matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.
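Those features are straightforward to reproduce. The following is a hypothetical sketch of padded 2-gram and 3-gram generation, not pprl's actual implementation:

def name_ngrams(name, sizes=(2, 3)):
    # Pad each lowercased token with underscores, then slide a window
    # of each size across it.
    grams = []
    for n in sizes:
        for token in name.lower().split():
            padded = f"_{token}_"
            grams += [padded[i : i + n] for i in range(len(padded) - n + 1)]
    return grams

print(name_ngrams("Laura"))  # ['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', ...]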
The sex processing function just converts different formats to lowercase and takes the first letter. This will often be enough:
print(edf1.gender_features[0])
print(edf2.sex_features[0])
['sex<f>']
['sex<f>']
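In other words, something like this hypothetical one-liner (not pprl's actual code):

def sex_feature(value):
    # Normalise to lowercase and keep only the first letter.
    return [f"sex<{value.strip().lower()[0]}>"]

print(sex_feature("F"), sex_feature("female"))  # ['sex<f>'] ['sex<f>']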
Finally, we’ll see how our instrument feature function, partial(features.gen_misc_shingled_features, label="instrument"), processed the data:
print(edf1.instrument_features[0])
print(edf2.main_instrument_features[0])
['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>']
['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_g>', 'instrument<gu>', 'instrument<ui>', 'instrument<it>', 'instrument<ta>', 'instrument<ar>', 'instrument<r_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>', 'instrument<_gu>', 'instrument<gui>', 'instrument<uit>', 'instrument<ita>', 'instrument<tar>', 'instrument<ar_>']
Setting the label argument was important to ensure that the shingles match (and are hashed to the same slots), because the function’s default behaviour is to use the column name as a label: since the two columns have different names, the default wouldn’t have allowed the features to match each other.
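To see why, here is a hypothetical sketch of labelled shingling (not pprl's actual code): the same value yields disjoint feature sets under different labels, so the features contribute nothing to the similarity:

def labelled_shingles(text, label, sizes=(2, 3)):
    # Wrap each padded n-gram in the label, as in the output above.
    grams = []
    for n in sizes:
        for token in text.lower().split():
            padded = f"_{token}_"
            grams += [f"{label}<{padded[i : i + n]}>" for i in range(len(padded) - n + 1)]
    return grams

same = set(labelled_shingles("bass", "instrument"))
other = set(labelled_shingles("bass", "main_instrument"))
print(same & other)  # set(): no overlap despite identical values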
Performing the linkage
We can now perform the linkage by comparing these Bloom filter embeddings. We use the Soft Cosine Measure (which, in this untrained model, is equivalent to an ordinary cosine similarity) to calculate record-wise similarities, and an adapted Hungarian algorithm to match the records based on those similarities.
similarities = embedder.compare(edf1, edf2)
similarities
SimilarityArray([[0.80050047, 0.10341754, 0.10047246],
[0.34170424, 0.16480856, 0.63029481],
[0.12155416, 0.54020787, 0.11933984]])
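Since the untrained SCM matrix is the identity, each entry is just the cosine similarity of two binary vectors: the number of shared Bloom filter indices divided by the product of the two norms. We can reproduce an entry by hand:

shared = set(edf1.bf_indices[0]) & set(edf2.bf_indices[0])
print(len(shared) / (edf1.bf_norms[0] * edf2.bf_norms[0]))  # ≈ similarities[0, 0]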
This SimilarityArray object is an augmented numpy.ndarray that can perform our matching. The matching itself can optionally be called with an absolute threshold score, but it doesn’t need one.
matching = similarities.match()
matching
(array([0, 1, 2]), array([0, 2, 1]))
So, all three of the records in each dataset were matched correctly. Excellent!
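For intuition about the assignment step, scipy’s standard Hungarian algorithm recovers the same matching on this similarity matrix. This is only a sketch of the idea: match uses pprl’s adapted version, which also takes the per-record thresholds into account.

import numpy as np
from scipy.optimize import linear_sum_assignment

# linear_sum_assignment minimises total cost, so negate the
# similarities to maximise total similarity instead.
rows, cols = linear_sum_assignment(-np.asarray(similarities))
print(rows, cols)  # [0 1 2] [0 2 1]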