perform

matching.perform

Functions for performing the matching itself.

Functions

Name Description
add_private_index Add anonymous match index to input datasets.
calculate_performance Calculate the performance of the match by counting the positives.
perform_matching Initiate the data, get similarities, and match the rows.

add_private_index

matching.perform.add_private_index(df1, df2, match, size_assumed=10000, colname='private_index')

Add anonymous match index to input datasets.

The match index assigns indices to both matched and unmatched records, so that the two are indistinguishable. As a result, the index does not leak any information about the other dataset.

add_private_index only works with unique one-to-one matches. This is because there is no way to match many-to-one without leaking information about the successful matches.

Parameters

Name Type Description Default
df1 pandas.DataFrame A dataset. required
df2 pandas.DataFrame Another dataset. required
match tuple[numpy.ndarray, numpy.ndarray] A pair of matched indices, with no repeated indices. required
size_assumed int The assumed maximum size of each dataset. Default is 10,000. 10000
colname str A column name for the new index. By default "private_index". 'private_index'

Returns

Type Description
df1, df2: pandas.DataFrame The input data frames, each with the private match index added as a column named by colname.
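
The idea behind the private index can be illustrated with a minimal sketch. This is not the library's implementation: the function name `sketch_private_index`, the `seed` parameter, and the requirement that `size_assumed` covers every distinct label are assumptions made for the example. It also assumes `match` holds positional row indices. Matched rows receive a shared random label, and every unmatched row receives a unique label drawn from the same pool, so neither party can tell from its own column which rows were matched.

```python
import numpy as np
import pandas as pd


def sketch_private_index(df1, df2, match, size_assumed=10_000,
                         colname="private_index", seed=None):
    """Illustrative only: shared labels for matched rows, unique labels otherwise."""
    rows1, rows2 = (np.asarray(m, dtype=int) for m in match)
    n1, n2, n_matched = len(df1), len(df2), len(rows1)
    n_labels = n1 + n2 - n_matched
    if n_labels > size_assumed:
        raise ValueError("size_assumed is too small for this sketch")

    # Draw distinct labels from one pool so matched and unmatched look alike.
    rng = np.random.default_rng(seed)
    labels = rng.permutation(size_assumed)[:n_labels]
    shared = labels[:n_matched]   # one label per matched pair
    only1 = labels[n_matched:n1]  # unmatched rows of df1
    only2 = labels[n1:]           # unmatched rows of df2

    index1 = np.empty(n1, dtype=int)
    index2 = np.empty(n2, dtype=int)
    index1[rows1] = shared
    index2[rows2] = shared
    index1[np.setdiff1d(np.arange(n1), rows1)] = only1
    index2[np.setdiff1d(np.arange(n2), rows2)] = only2
    return df1.assign(**{colname: index1}), df2.assign(**{colname: index2})


# Toy data: row 0 of df1 matches row 1 of df2, and row 2 matches row 3.
df1 = pd.DataFrame({"name": ["ann", "bob", "cy"]})
df2 = pd.DataFrame({"name": ["dee", "ann", "eve", "cy"]})
out1, out2 = sketch_private_index(df1, df2, ([0, 2], [1, 3]),
                                  size_assumed=100, seed=0)
```

Joining `out1` and `out2` on `private_index` would recover exactly the matched pairs, while each column on its own is just a set of distinct random integers.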

calculate_performance

matching.perform.calculate_performance(data_1, data_2, match)

Calculate the performance of the match by counting the positives.

Performance metrics are sent to the logger.

Parameters

Name Type Description Default
data_1 pandas.DataFrame Data frame for PARTY1. required
data_2 pandas.DataFrame Data frame for PARTY2. required
match tuple Tuple of indices of matched pairs between the data frames. required
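
As a rough illustration of counting positives and logging the result, the sketch below assumes that both data frames carry a ground-truth identifier column. The column name `true_id`, the function name `sketch_count_positives`, the logger name, and the use of positional indices in `match` are all assumptions for the example, not details confirmed by this page.

```python
import logging

import numpy as np
import pandas as pd

logger = logging.getLogger("match_sketch")  # hypothetical logger name


def sketch_count_positives(data_1, data_2, match, truth_col="true_id"):
    """Illustrative only: count matched pairs whose ground-truth IDs agree."""
    left, right = (np.asarray(m, dtype=int) for m in match)
    truth_1 = data_1[truth_col].to_numpy()[left]
    truth_2 = data_2[truth_col].to_numpy()[right]

    true_pos = int((truth_1 == truth_2).sum())
    false_pos = len(left) - true_pos
    logger.info("True positives: %d; false positives: %d", true_pos, false_pos)
    return true_pos, false_pos


# Toy data: three proposed pairs, one of which joins different entities.
data_1 = pd.DataFrame({"true_id": [1, 2, 3]})
data_2 = pd.DataFrame({"true_id": [3, 1, 9]})
true_pos, false_pos = sketch_count_positives(data_1, data_2, ([0, 1, 2], [1, 2, 0]))
```

The sketch returns the counts only so they can be inspected; the documented function reports its metrics through the logger.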

perform_matching

matching.perform.perform_matching(data_1, data_2, embedder)

Initiate the data, get similarities, and match the rows.

Parameters

Name Type Description Default
data_1 pandas.DataFrame Data frame for PARTY1. required
data_2 pandas.DataFrame Data frame for PARTY2. required
embedder pprl.embedder.embedder.Embedder Instance used to embed both data frames. required

Returns

Type Description
pandas.DataFrame Output for PARTY1.
pandas.DataFrame Output for PARTY2.
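
The "get similarities, and match the rows" stages can be pictured with a small stand-alone sketch. It does not use the `Embedder` class and makes no claim about how the library scores or pairs records: the function name `sketch_match`, the cosine similarity measure, the greedy pairing strategy, and the `threshold` parameter are all assumptions chosen to keep the example short. Rows are assumed to be non-zero numeric vectors.

```python
import numpy as np


def sketch_match(vectors_1, vectors_2, threshold=0.5):
    """Illustrative only: greedy one-to-one matching on cosine similarity."""
    a = np.asarray(vectors_1, dtype=float)
    b = np.asarray(vectors_2, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T  # sims[i, j] compares row i of the first set with row j of the second

    left, right = [], []
    # Visit candidate pairs from most to least similar.
    for flat in np.argsort(sims, axis=None)[::-1]:
        i, j = np.unravel_index(flat, sims.shape)
        if sims[i, j] < threshold:
            break
        # Keep the match one-to-one: each row may be used at most once.
        if i not in left and j not in right:
            left.append(int(i))
            right.append(int(j))
    return np.array(left, dtype=int), np.array(right, dtype=int)


# Toy vectors: row 0 pairs with row 1 of the second set, and row 1 with row 0.
left, right = sketch_match([[1, 0], [0, 1]], [[0, 1], [1, 0], [1, 1]])
pairs = set(zip(left.tolist(), right.tolist()))
```

The resulting pair of index arrays has the shape that `add_private_index` expects for its `match` argument: two arrays of equal length with no repeated indices.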