perform

matching.perform

Functions for performing the matching itself.

Functions

Name	Description
add_private_index	Add anonymous match index to input datasets.
calculate_performance	Calculate the performance of the match by counting the positives.
perform_matching	Initiate the data, get similarities, and match the rows.

add_private_index

matching.perform.add_private_index(df1, df2, match, size_assumed=10000, colname='private_index')

Add anonymous match index to input datasets.

The match index assigns an index to every record, matched or unmatched, so that the two kinds are indistinguishable. As a result, the index does not leak any information about the other dataset.

`add_private_index` only works with unique one-to-one matches, because there is no way to index a many-to-one match without leaking information about the successful matches.

Parameters

Name	Type	Description	Default
`df1`	pandas.DataFrame	A dataset.	required
`df2`	pandas.DataFrame	Another dataset.	required
`match`	tuple[numpy.ndarray, numpy.ndarray]	A pair of matched indices, with no repeated indices.	required
`size_assumed`	int	The assumed maximum size of each dataset. Default is 10,000.	`10000`
`colname`	str	A column name for the new index. By default `"private_index"`.	`'private_index'`

Returns

Type	Description
tuple[pandas.DataFrame, pandas.DataFrame]	The input datasets `df1` and `df2`, each with the private matching index added.
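
The idea described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `sketch_private_index`, the `seed` argument, and the choice of drawing labels from `range(2 * size_assumed)` are all assumptions made for illustration. Matched pairs receive a shared random label, and unmatched records receive labels drawn from the same pool, so neither party can tell from its own column which records matched.

```python
import numpy as np
import pandas as pd


def sketch_private_index(df1, df2, match, size_assumed=10_000,
                         colname="private_index", seed=None):
    """Hypothetical illustration only; not the library's implementation."""
    rng = np.random.default_rng(seed)
    left = np.asarray(match[0], dtype=int)
    right = np.asarray(match[1], dtype=int)

    # Only unique one-to-one matches are supported.
    if np.unique(left).size != left.size or np.unique(right).size != right.size:
        raise ValueError("match must not contain repeated indices")

    n1, n2, n_matched = len(df1), len(df2), left.size
    if max(n1, n2) > size_assumed:
        raise ValueError("datasets are larger than size_assumed")

    # One distinct label per matched pair plus one per unmatched record,
    # all drawn without replacement from the same pool (assumed range).
    pool = rng.choice(2 * size_assumed, size=n1 + n2 - n_matched, replace=False)
    shared = pool[:n_matched]
    only1 = pool[n_matched:n1]
    only2 = pool[n1:]

    idx1 = np.empty(n1, dtype=np.int64)
    idx2 = np.empty(n2, dtype=np.int64)
    idx1[left] = shared
    idx2[right] = shared
    idx1[np.setdiff1d(np.arange(n1), left)] = only1
    idx2[np.setdiff1d(np.arange(n2), right)] = only2

    # Return copies so the inputs are left unchanged.
    out1, out2 = df1.copy(), df2.copy()
    out1[colname] = idx1
    out2[colname] = idx2
    return out1, out2


df1 = pd.DataFrame({"name": ["ann", "bob", "cat"]})
df2 = pd.DataFrame({"name": ["bob", "ann", "dan", "eve"]})
# Positional indices: row 0 of df1 matches row 1 of df2, and row 1 matches row 0.
match = (np.array([0, 1]), np.array([1, 0]))
out1, out2 = sketch_private_index(df1, df2, match, size_assumed=100, seed=0)
```

In this sketch, joining `out1` and `out2` on `private_index` recovers exactly the matched pairs, while each column viewed alone is just a set of distinct random integers.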

calculate_performance

matching.perform.calculate_performance(data_1, data_2, match)

Calculate the performance of the match by counting the positives.

Performance metrics are sent to the logger.

Parameters

Name	Type	Description	Default
`data_1`	pandas.DataFrame	Data frame for `PARTY1`.	required
`data_2`	pandas.DataFrame	Data frame for `PARTY2`.	required
`match`	tuple	Tuple of indices of matched pairs between the data frames.	required
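
As a rough illustration of "counting the positives", the sketch below compares a ground-truth identifier across each matched pair and logs the counts. It is not the library's implementation: the function name `sketch_count_positives` and the ground-truth column `true_id` are assumptions, and the sketch returns the counts only so they can be inspected, whereas the documented function reports through the logger.

```python
import logging

import numpy as np
import pandas as pd

logger = logging.getLogger(__name__)


def sketch_count_positives(data_1, data_2, match, id_col="true_id"):
    """Hypothetical illustration only; `id_col` is an assumed column name."""
    left = np.asarray(match[0], dtype=int)
    right = np.asarray(match[1], dtype=int)

    # Look up the ground-truth identifier on each side of every matched pair.
    ids_1 = data_1[id_col].to_numpy()[left]
    ids_2 = data_2[id_col].to_numpy()[right]

    # A pair is a true positive when both sides carry the same identifier.
    true_pos = int(np.sum(ids_1 == ids_2))
    false_pos = len(left) - true_pos

    logger.info("True positives: %d; false positives: %d", true_pos, false_pos)
    return true_pos, false_pos


logging.basicConfig(level=logging.INFO)
data_1 = pd.DataFrame({"true_id": [1, 2, 3]})
data_2 = pd.DataFrame({"true_id": [2, 1, 9, 3]})
match = (np.array([0, 1, 2]), np.array([1, 0, 2]))
true_pos, false_pos = sketch_count_positives(data_1, data_2, match)
```

Here the first two pairs agree on `true_id` and the third does not, so the sketch counts two true positives and one false positive.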

perform_matching

matching.perform.perform_matching(data_1, data_2, embedder)

Initiate the data, get similarities, and match the rows.

Parameters

Name	Type	Description	Default
`data_1`	pandas.DataFrame	Data frame for `PARTY1`.	required
`data_2`	pandas.DataFrame	Data frame for `PARTY2`.	required
`embedder`	pprl.embedder.embedder.Embedder	Instance used to embed both data frames.	required

Returns

Type	Description
pandas.DataFrame	Output for `PARTY1`.
pandas.DataFrame	Output for `PARTY2`.
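
The "get similarities, and match the rows" steps can be pictured with a small, library-independent sketch. It makes several assumptions that the page does not confirm: each row is already represented by a non-zero numeric vector, similarity is cosine similarity, and rows are paired by a simple greedy one-to-one rule with a hypothetical `threshold`. The `Embedder` may use a different similarity measure and matching algorithm; this is a stand-in to show the shape of the `match` tuple that the other functions consume.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`.

    Assumes no row is all zeros (a zero row would produce NaN).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def greedy_one_to_one(sim, threshold=0.5):
    """Greedy stand-in matcher: repeatedly take the most similar unused pair."""
    sim = np.array(sim, dtype=float)  # work on a copy
    left, right = [], []
    for _ in range(min(sim.shape)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        if sim[i, j] < threshold:
            break
        left.append(i)
        right.append(j)
        # Remove the chosen row and column so each index is used at most once.
        sim[i, :] = -np.inf
        sim[:, j] = -np.inf
    return np.array(left, dtype=int), np.array(right, dtype=int)


# Toy vectors standing in for embedded records (purely illustrative).
emb_1 = [[1, 0], [0, 1], [1, 1]]
emb_2 = [[0, 2], [3, 0]]
sim = cosine_similarity_matrix(emb_1, emb_2)
match = greedy_one_to_one(sim, threshold=0.9)
```

The result is a pair of index arrays with no repeated indices, which is the form that `add_private_index` and `calculate_performance` expect for their `match` argument.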