Exploring a simple linkage example
The Python package implements the Bloom filter linkage method (Schnell et al., 2009), and can also implement pretrained Hash embeddings (Miranda et al., 2022), if a suitable large, pre-matched corpus of data is available.
Let us consider a small example where we want to link two excerpts of data on bands. In this scenario, we are looking at some toy data on the members of a fictional, German rock trio called “Verknüpfung”. In this example we will see how to use untrained Bloom filters to match data.
Loading the data
First, we load our data into pandas.DataFrame objects. Here, the first records align, but the other two records should be swapped to have an aligned matching. We will use the toolkit to identify these matches.

import pandas as pd

df1 = pd.DataFrame(
    {
        "first_name": ["Laura", "Kaspar", "Grete"],
        "last_name": ["Daten", "Gorman", "Knopf"],
        "gender": ["F", "M", "F"],
        "date_of_birth": ["01/03/1977", "31/12/1975", "12/7/1981"],
        "instrument": ["bass", "guitar", "drums"],
    }
)
df2 = pd.DataFrame(
    {
        "name": ["Laura Datten", "Greta Knopf", "Casper Goreman"],
        "sex": ["female", "female", "male"],
        "main_instrument": ["bass guitar", "percussion", "electric guitar"],
        "birth_date": ["1977-03-23", "1981-07-12", "1975-12-31"],
    }
)
These datasets don’t have the same column names or follow the same encodings, and there are several spelling mistakes in the band members’ names, as well as a typo in one of the dates. Thankfully, the PPRL Toolkit is flexible enough to handle all of this!
Creating and assigning a feature factory
The next step is to decide how to process each of the columns in our datasets.
To do this, we define a feature factory that maps column types to feature generation functions, and a column specification for each dataset mapping our columns to column types in the factory.
from pprl.embedder import features
from functools import partial

factory = dict(
    name=features.gen_name_features,
    sex=features.gen_sex_features,
    misc=features.gen_misc_features,
    dob=features.gen_dateofbirth_features,
    instrument=partial(features.gen_misc_shingled_features, label="instrument"),
)

spec1 = dict(
    first_name="name",
    last_name="name",
    gender="sex",
    instrument="instrument",
    date_of_birth="dob",
)
spec2 = dict(name="name", sex="sex", main_instrument="instrument", birth_date="dob")
The feature generation functions, features.gen_XXX_features, have sensible default parameters, but sometimes they need to be passed to the feature factory with different parameters, such as to set a feature label in the example above. There are two ways to achieve this: either use functools.partial to set parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the Embedder as ff_args.
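For illustration, here is a sketch of the ff_args route. The exact structure expected (keyed by column type here) is an assumption, so check the package documentation before relying on it:

from pprl.embedder import features
from pprl.embedder.embedder import Embedder

# The shingling function is passed plain, without functools.partial...
factory_alt = dict(
    name=features.gen_name_features,
    instrument=features.gen_misc_shingled_features,
)
# ...and its label is supplied via ff_args instead: a dictionary of
# keyword-argument dictionaries (assumed keyed by column type here).
ff_args = {"instrument": {"label": "instrument"}}
embedder_alt = Embedder(factory_alt, ff_args=ff_args)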
Embedding the data
With our specifications sorted out, we can get to creating our Bloom filter embedding. Before doing so, we need to decide on two parameters: the size of the filter and the number of hashes. By default, these are 1024 and 2, respectively.
Once we’ve decided, we can create our Embedder instance and use it to embed our data with their column specifications.
from pprl.embedder.embedder import Embedder

embedder = Embedder(factory, bf_size=1024, num_hashes=2)

edf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)
edf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)
If we take a look at one of these embedded datasets, we can see that it has a whole bunch of new columns. There is a _features column for each of the original columns containing their pre-embedding string features, and there’s an all_features column that combines the features. Then there are three additional columns: bf_indices, bf_norms and thresholds.
edf1.columns
Index(['first_name', 'last_name', 'gender', 'date_of_birth', 'instrument',
'first_name_features', 'last_name_features', 'gender_features',
'instrument_features', 'date_of_birth_features', 'all_features',
'bf_indices', 'bf_norms', 'thresholds'],
dtype='object')
The bf_indices column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.
print(edf1.bf_indices[0])
[2, 262, 903, 646, 9, 526, 654, 272, 15, 146, 17, 532, 531, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 56, 823, 440, 60, 61, 318, 319, 320, 444, 577, 836, 583, 332, 77, 590, 972, 465, 82, 211, 468, 84, 338, 851, 600, 593, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]
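To make the representation concrete, here is a small numpy sketch (not part of pprl) that expands a record's index list back into the dense binary vector it stands for:

import numpy as np

bf_size = 1024  # must match the Embedder's bf_size
dense = np.zeros(bf_size, dtype=int)
dense[edf1.bf_indices[0]] = 1

# The number of set bits equals the number of stored indices.
assert dense.sum() == len(edf1.bf_indices[0])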
The bf_norms column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case, since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to np.sqrt(len(bf_indices[i])) for record i. The norm is used to scale the similarity measures so that they take values between -1 and 1.
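We can check this relationship directly. The check below assumes the untrained, identity-SCM setting described above:

import numpy as np

# Each norm should be the square root of the number of set bits.
for i in range(len(edf1)):
    assert np.isclose(edf1.bf_norms[i], np.sqrt(len(edf1.bf_indices[i])))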
The thresholds column provides, for each record, a threshold similarity score below which that record will not be matched. It’s like a reserve price in an auction: it stops a record being matched to another record when the similarity isn’t high enough. This is an innovative feature of our method; other linkage methods typically have only one global threshold score for the entire dataset. We sketch the idea in code after the output below.
print(edf1.loc[:,["bf_norms","thresholds"]])
print(edf2.loc[:,["bf_norms","thresholds"]])
bf_norms thresholds
0 8.246211 0.114332
1 9.055386 0.143159
2 8.485281 0.143159
bf_norms thresholds
0 9.695360 0.294345
1 9.380832 0.157014
2 10.862781 0.294345
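To make the reserve-price analogy concrete, here is a conceptual sketch. This is an illustration only, not pprl's internal matching rule, which the match method (shown later) handles for us:

# Conceptual sketch: treat a candidate pairing as viable only if its
# similarity clears the thresholds of both records involved.
def viable(similarity, threshold_a, threshold_b):
    return similarity >= threshold_a and similarity >= threshold_b

print(viable(0.80, edf1.thresholds[0], edf2.thresholds[0]))  # True
print(viable(0.10, edf1.thresholds[0], edf2.thresholds[0]))  # False: below the reserve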
The processed features
Let’s take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.
First, we’ll look at date of birth:
print(edf1.date_of_birth_features[0])
print(edf2.birth_date_features[0])
['day<01>', 'month<03>', 'year<1977>']
['day<23>', 'month<03>', 'year<1977>']
Python can parse the different date formats easily. Although the dates differ slightly between the two datasets, the year and month features will still match, even though the day will not.
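The shingle format is easy to reproduce. Here is a hypothetical re-implementation (not pprl's actual code) that uses pandas to parse each format:

import pandas as pd

def dob_shingles(date_string, dayfirst=True):
    # Hypothetical sketch: parse the date, then emit one labelled
    # shingle each for day, month and year.
    d = pd.to_datetime(date_string, dayfirst=dayfirst)
    return [f"day<{d.day:02}>", f"month<{d.month:02}>", f"year<{d.year}>"]

print(dob_shingles("01/03/1977"))                  # ['day<01>', 'month<03>', 'year<1977>']
print(dob_shingles("1977-03-23", dayfirst=False))  # ['day<23>', 'month<03>', 'year<1977>']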
Then we’ll look at name:
print(edf1.first_name_features[0] + edf1.last_name_features[0])
print(edf2.name_features[0])
['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', 'aur', 'ura', 'ra_', '_d', 'da', 'at', 'te', 'en', 'n_', '_da', 'dat', 'ate', 'ten', 'en_']
['_l', 'la', 'au', 'ur', 'ra', 'a_', '_d', 'da', 'at', 'tt', 'te', 'en', 'n_', '_la', 'lau', 'aur', 'ura', 'ra_', '_da', 'dat', 'att', 'tte', 'ten', 'en_']
The two datasets store the names differently, but this doesn’t matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.
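Those features are straightforward to reproduce. The following is a hypothetical sketch of padded 2-gram and 3-gram generation, not pprl's actual implementation:

def name_ngrams(name, sizes=(2, 3)):
    # Pad each lowercased token with underscores, then slide a window
    # of each size across it.
    grams = []
    for n in sizes:
        for token in name.lower().split():
            padded = f"_{token}_"
            grams += [padded[i : i + n] for i in range(len(padded) - n + 1)]
    return grams

print(name_ngrams("Laura"))  # ['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', ...]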
The sex processing function just converts different formats to lowercase and takes the first letter. This will often be enough:
print(edf1.gender_features[0])
print(edf2.sex_features[0])
['sex<f>']
['sex<f>']
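In other words, something like this hypothetical one-liner (not pprl's actual code):

def sex_feature(value):
    # Normalise to lowercase and keep only the first letter.
    return [f"sex<{value.strip().lower()[0]}>"]

print(sex_feature("F"), sex_feature("female"))  # ['sex<f>'] ['sex<f>']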
Finally, we’ll see how our instrument feature function, partial(features.gen_misc_shingled_features, label="instrument"), processed the data:
print(edf1.instrument_features[0])
print(edf2.main_instrument_features[0])
['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>']
['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_g>', 'instrument<gu>', 'instrument<ui>', 'instrument<it>', 'instrument<ta>', 'instrument<ar>', 'instrument<r_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>', 'instrument<_gu>', 'instrument<gui>', 'instrument<uit>', 'instrument<ita>', 'instrument<tar>', 'instrument<ar_>']
Setting the label argument was important to ensure that the shingles match (and are hashed to the same slots), because the function’s default behaviour is to use the column name as a label: since the two columns have different names, the default wouldn’t have allowed the features to match each other.
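To see why, here is a hypothetical sketch of labelled shingling (not pprl's actual code): the same value yields disjoint feature sets under different labels, so the features contribute nothing to the similarity:

def labelled_shingles(text, label, sizes=(2, 3)):
    # Wrap each padded n-gram in the label, as in the output above.
    grams = []
    for n in sizes:
        for token in text.lower().split():
            padded = f"_{token}_"
            grams += [f"{label}<{padded[i : i + n]}>" for i in range(len(padded) - n + 1)]
    return grams

same = set(labelled_shingles("bass", "instrument"))
other = set(labelled_shingles("bass", "main_instrument"))
print(same & other)  # set(): no overlap despite identical values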
Performing the linkage
We can now perform the linkage by comparing these Bloom filter embeddings. We use the Soft Cosine Measure (which, in this untrained model, is equivalent to an ordinary cosine similarity) to calculate record-wise similarities, and an adapted Hungarian algorithm to match the records based on those similarities.
similarities = embedder.compare(edf1, edf2)
similarities
SimilarityArray([[0.80050047, 0.10341754, 0.10047246],
[0.34170424, 0.16480856, 0.63029481],
[0.12155416, 0.54020787, 0.11933984]])
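Since the untrained SCM matrix is the identity, each entry is just the cosine similarity of two binary vectors: the number of shared Bloom filter indices divided by the product of the two norms. We can reproduce an entry by hand:

shared = set(edf1.bf_indices[0]) & set(edf2.bf_indices[0])
print(len(shared) / (edf1.bf_norms[0] * edf2.bf_norms[0]))  # ≈ similarities[0, 0]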
This SimilarityArray object is an augmented numpy.ndarray that can perform our matching. The matching itself can optionally be called with an absolute threshold score, but it doesn’t need one.
matching = similarities.match()
matching
(array([0, 1, 2]), array([0, 2, 1]))
So, all three of the records in each dataset were matched correctly. Excellent!
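For intuition about the assignment step, scipy’s standard Hungarian algorithm recovers the same matching on this similarity matrix. This is only a sketch of the idea: match uses pprl’s adapted version, which also takes the per-record thresholds into account.

import numpy as np
from scipy.optimize import linear_sum_assignment

# linear_sum_assignment minimises total cost, so negate the
# similarities to maximise total similarity instead.
rows, cols = linear_sum_assignment(-np.asarray(similarities))
print(rows, cols)  # [0 1 2] [0 2 1]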