features

embedder.features

Feature generation functions for various column types.

Functions

| Name | Description |
| --- | --- |
| gen_dateofbirth_features | Generate labelled date features from a series of dates of birth. |
| gen_double_metaphone | Generate the double metaphones of a string. |
| gen_features | Generate string features of various types. |
| gen_misc_features | Generate miscellaneous categorical features for a series. |
| gen_misc_shingled_features | Generate shingled labelled features. |
| gen_name_features | Generate a features series for a series of names. |
| gen_ngram | Generate n-grams from a set of tokens. |
| gen_sex_features | Generate labelled sex features from a series of sexes. |
| gen_skip_grams | Generate skip 2-grams from a set of tokens. |
| split_string_underscore | Split a string at typical punctuation marks and wrap each token in underscores. |

gen_dateofbirth_features

embedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])

Generate labelled date features from a series of dates of birth.

Features take the form ["day<dd>", "month<mm>", "year<YYYY>"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.
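The feature form described above can be illustrated with a minimal standard-library sketch. The helper `date_to_features` and its `fmt` argument are hypothetical and not part of `embedder.features`; the sketch also assumes the angle brackets in the feature strings are literal and that day and month are zero-padded.

```python
from datetime import datetime


def date_to_features(date_string, fmt="%d/%m/%Y"):
    """Hypothetical helper: turn one date string into labelled features."""
    parsed = datetime.strptime(date_string, fmt)
    return [
        f"day<{parsed.day:02d}>",      # zero-padded day (assumed)
        f"month<{parsed.month:02d}>",  # zero-padded month (assumed)
        f"year<{parsed.year:04d}>",
    ]


features = date_to_features("07/03/1985")
print(features)
```

The real function works on a whole pandas Series and handles missing values through its `default` argument, which this single-value sketch does not attempt.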

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dob | pandas.Series | Series of dates of birth. | required |
| dayfirst | bool | Whether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True. | True |
| yearfirst | bool | Whether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False. | False |
| default | list[str] | Default date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01. | [] |

Returns

| Type | Description |
| --- | --- |
| pandas.Series | Series containing lists of date features. |

gen_double_metaphone

embedder.features.gen_double_metaphone(string)

Generate the double metaphones of a string.

This function is a generator that yields all the possible, non-empty double metaphones of a given string, separated by spaces. It uses the metaphone.doublemetaphone() function under the hood and ignores any empty strings it returns. See the metaphone repository for details.
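As a rough illustration of the filtering step, the sketch below drops empty codes from an encoder that returns a (primary, secondary) pair, which is the shape `metaphone.doublemetaphone()` returns. Everything here is an assumption for illustration: `gen_nonempty_codes` is a hypothetical name, `toy_encoder` is a stand-in that does not compute real metaphones, and splitting the input on whitespace is one reading of "separated by spaces".

```python
def gen_nonempty_codes(string, encoder):
    """Hypothetical sketch: yield each non-empty code for each word."""
    for word in string.split():      # assumed: one encoding per word
        for code in encoder(word):   # encoder returns (primary, secondary)
            if code:                 # skip empty strings
                yield code


def toy_encoder(word):
    """Stand-in encoder, NOT a phonetic algorithm."""
    primary = word[:2].upper()
    secondary = word[:1].upper() if word.startswith("j") else ""
    return primary, secondary


codes = list(gen_nonempty_codes("dave james", toy_encoder))
print(codes)
```

Passing the encoder as an argument keeps the sketch free of third-party dependencies; with the metaphone package installed, `doublemetaphone` could be supplied in place of `toy_encoder`.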

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| string | str | String from which to derive double metaphones. | required |

Returns

| Type | Description |
| --- | --- |
| str | The next double metaphone in the sequence. |

gen_features

embedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)

Generate string features of various types.

This function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| string | str | Base string from which to generate features. | required |
| ngram_length | list | Lengths of n-grams to make. Ignored if use_gen_ngram=False. | [2, 3] |
| use_gen_ngram | bool | Whether to create n-grams. Default is True. | True |
| use_gen_skip_grams | bool | Whether to create skip 2-grams. Default is False. | False |
| use_double_metaphone | bool | Whether to create double metaphones. Default is False. | False |

Returns

| Type | Description |
| --- | --- |
| str | The next feature in the sequence. |

gen_misc_features

embedder.features.gen_misc_features(field, label=None)

Generate miscellaneous categorical features for a series.

Useful for keeping raw columns in the linkage data. All features use a label and take the form ["label<option>"] except for missing data, which are coded as "".

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| field | pandas.Series | Series from which to generate our features. | required |
| label | None \| str \| typing.Hashable | Label for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used. | None |

Returns

| Type | Description |
| --- | --- |
| pandas.Series | Series containing lists of miscellaneous features. |

gen_misc_shingled_features

embedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)

Generate shingled labelled features.

Generate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| field | pandas.Series | Series of string data. | required |
| ngram_length | list | Shingle sizes to generate. By default [2, 3]. | [2, 3] |
| use_gen_skip_grams | bool | Whether to generate skip 2-grams. False by default. | False |
| label | str | A label to differentiate from other shingled features. If field has no name, this defaults to zz. | None |

Returns

| Type | Description |
| --- | --- |
| pandas.Series | Series containing lists of shingled string features. |

gen_name_features

embedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)

Generate a features series for a series of names.

Effectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| names | pandas.Series | Series of names. | required |
| ngram_length | list[int] | Lengths of n-grams to make. Ignored if use_gen_ngram=False. | [2, 3] |
| use_gen_ngram | bool | Whether to create n-grams. Default is True. | True |
| use_gen_skip_grams | bool | Whether to create skip 2-grams. Default is False. | False |
| use_double_metaphone | bool | Whether to create double metaphones. Default is False. | False |

Returns

| Type | Description |
| --- | --- |
| pandas.Series | Series containing lists of features. |

gen_ngram

embedder.features.gen_ngram(split_tokens, ngram_length)

Generate n-grams from a set of tokens.

This is a generator function that yields a series of n-grams, each the size of the sliding window.
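The sliding-window idea can be sketched in a few lines of plain Python. The name `sliding_ngrams` is hypothetical, and the order in which the n-grams come out (all of one length before the next) is an assumption; the library's own ordering may differ.

```python
def sliding_ngrams(split_tokens, ngram_length):
    """Hypothetical sketch: slide a window of each length over each token."""
    for n in ngram_length:
        for token in split_tokens:
            # A token shorter than n yields nothing for that length.
            for start in range(len(token) - n + 1):
                yield token[start:start + n]


ngrams = list(sliding_ngrams(["_dave_"], [2, 3]))
print(ngrams)
```

For the single token "_dave_" this gives five 2-grams followed by four 3-grams.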

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| split_tokens | list | All the split-up tokens from which to form n-grams. | required |
| ngram_length | list | Desired lengths of n-grams. For example, ngram_length=[2, 3] would generate all 2-grams and 3-grams. | required |

Returns

| Type | Description |
| --- | --- |
| str | The next n-gram in the sequence. |

gen_sex_features

embedder.features.gen_sex_features(sexes)

Generate labelled sex features from a series of sexes.

Features take the form ["sex<option>"] or [""] for missing data.
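A minimal sketch of this labelling for individual values follows. The helper `sex_to_features` is hypothetical, and treating `None` or a blank string as missing is an assumption; the real function operates on a pandas Series and may detect missing values differently.

```python
def sex_to_features(value):
    """Hypothetical helper: label one value, or return [""] if missing."""
    if value is None or str(value).strip() == "":
        return [""]              # missing data
    return [f"sex<{value}>"]     # labelled feature


features = [sex_to_features(v) for v in ["F", None, "M"]]
print(features)
```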

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sexes | pandas.Series | Series of sex data. | required |

Returns

| Type | Description |
| --- | --- |
| pandas.Series | Series containing lists of sex features. |

gen_skip_grams

embedder.features.gen_skip_grams(split_tokens)

Generate skip 2-grams from a set of tokens.

This function is a generator that yields a series of skip 2-grams. A skip 2-gram pairs each character of a token with the character two positions after it.

Examples

>>> string = "dave james"
>>> tokens = split_string_underscore(string)
>>> skips = list(gen_skip_grams(tokens))
>>> print(skips)
["_a", "dv", "ae", "v_", "_a", "jm", "ae", "ms", "e_"]
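The output above is consistent with pairing each character with the one two positions later. A minimal sketch under that reading is shown here; `skip_2grams` is a hypothetical name, not the library's implementation.

```python
def skip_2grams(split_tokens):
    """Hypothetical sketch: pair each character with the one two ahead."""
    for token in split_tokens:
        for i in range(len(token) - 2):
            yield token[i] + token[i + 2]


skips = list(skip_2grams(["_dave_", "_james_"]))
print(skips)
```

With the underscore-wrapped tokens for "dave james", this reproduces the nine skip 2-grams listed in the example.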

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| split_tokens | list | All the split-up tokens from which to form skip 2-grams. | required |

Returns

| Type | Description |
| --- | --- |
| str | The next skip 2-gram in the sequence. |

split_string_underscore

embedder.features.split_string_underscore(string)

Split a string at typical punctuation marks and wrap each token in underscores.

Currently, we split at any combination of spaces, dashes, dots, commas, or underscores.

Examples

>>> strings = ("dave  william johnson", "Francesca__Hogan-O'Malley")
>>> for string in strings:
...     print(split_string_underscore(string))
["_dave_", "_william_", "_johnson_"]
["_Francesca_", "_Hogan_", "_O'Malley_"]
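The splitting rule can be approximated with a single regular expression. This is a sketch under the assumption that a run of any of the listed separators counts as one split point; `split_and_wrap` is a hypothetical name and the library's actual pattern may differ.

```python
import re

# Any run of spaces, dashes, dots, commas, or underscores is one separator.
SEPARATORS = re.compile(r"[ \-.,_]+")


def split_and_wrap(string):
    """Hypothetical sketch: split at separators, wrap tokens in underscores."""
    parts = SEPARATORS.split(string)
    # Drop empty pieces left by leading or trailing separators.
    return [f"_{part}_" for part in parts if part]


wrapped = [
    split_and_wrap(s)
    for s in ("dave  william johnson", "Francesca__Hogan-O'Malley")
]
print(wrapped)
```

Apostrophes are not in the separator set, so "O'Malley" stays as a single token, matching the example above.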

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| string | str | String to split. | required |

Returns

| Type | Description |
| --- | --- |
| list[str] | List of the split and wrapped tokens. |