features

embedder.features

Feature generation functions for various column types.

Functions

Name	Description
gen_dateofbirth_features	Generate labelled date features from a series of dates of birth.
gen_double_metaphone	Generate the double metaphones of a string.
gen_features	Generate string features of various types.
gen_misc_features	Generate miscellaneous categorical features for a series.
gen_misc_shingled_features	Generate shingled labelled features.
gen_name_features	Generate a features series for a series of names.
gen_ngram	Generate n-grams from a set of tokens.
gen_sex_features	Generate labelled sex features from a series of sexes.
gen_skip_grams	Generate skip 2-grams from a set of tokens.
split_string_underscore	Split a string at typical punctuation marks and wrap each token in underscores.

gen_dateofbirth_features

embedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])

Generate labelled date features from a series of dates of birth.

Features take the form ["day<dd>", "month<mm>", "year<YYYY>"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.

Parameters

Name	Type	Description	Default
`dob`	pandas.Series	Series of dates of birth.	required
`dayfirst`	bool	Whether the day comes first in the DOBs. Passed to `pd.to_datetime()` and defaults to `True`.	`True`
`yearfirst`	bool	Whether the year comes first in the DOBs. Passed to `pd.to_datetime()` and defaults to `False`.	`False`
`default`	list[str]	Default date to fill in missing data in feature (list) form. Default is the feature form of `2050-01-01`.	`[]`

Returns

Type	Description
pandas.Series	Series containing lists of date features.
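To make the feature format concrete, here is a minimal standard-library sketch for a single date string. It is not the library's implementation, which works on a whole series via `pd.to_datetime()`. The helper name `date_to_features`, the fixed day-first `%d/%m/%Y` input format, and the literal angle brackets in the output are all assumptions made for illustration.

```python
from datetime import datetime


def date_to_features(date_string, default=None):
    """Hypothetical helper: turn one day-first date string into labelled features."""
    try:
        parsed = datetime.strptime(date_string, "%d/%m/%Y")
    except (TypeError, ValueError):
        # Missing or unparseable dates fall back to the supplied default.
        return list(default) if default is not None else []
    return [
        f"day<{parsed.day:02d}>",
        f"month<{parsed.month:02d}>",
        f"year<{parsed.year:04d}>",
    ]


print(date_to_features("07/03/1985"))
# ['day<07>', 'month<03>', 'year<1985>']
print(date_to_features(None, default=["day<01>", "month<01>", "year<2050>"]))
# ['day<01>', 'month<01>', 'year<2050>']
```

Labelling each component means that a day of `07` and a month of `07` produce distinct features rather than colliding.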

gen_double_metaphone

embedder.features.gen_double_metaphone(string)

Generate the double metaphones of a string.

This function is a generator that yields all the possible, non-empty double metaphones of a given string, separated by spaces. It uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See the metaphone repository for details.

Parameters

Name	Type	Description	Default
`string`	str	String from which to derive double metaphones.	required

Returns

Type	Description
str	The next double metaphone in the sequence.

gen_features

embedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)

Generate string features of various types.

This function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.

Parameters

Name	Type	Description	Default
`string`	str	Base string from which to generate features.	required
`ngram_length`	list	Lengths of n-grams to make. Ignored if `use_gen_ngram=False`.	`[2, 3]`
`use_gen_ngram`	bool	Whether to create n-grams. Default is `True`.	`True`
`use_gen_skip_grams`	bool	Whether to create skip 2-grams. Default is `False`.	`False`
`use_double_metaphone`	bool	Whether to create double metaphones. Default is `False`.	`False`

Returns

Type	Description
str	The next feature in the sequence.

gen_misc_features

embedder.features.gen_misc_features(field, label=None)

Generate miscellaneous categorical features for a series.

Useful for keeping raw columns in the linkage data. All features use a label and take the form ["label<option>"] except for missing data, which are coded as "".

Parameters

Name	Type	Description	Default
`field`	pandas.Series	Series from which to generate our features.	required
`label`	None | str | typing.Hashable	Label for the series. By default, the name of the series is used if available. Otherwise, if not specified, `misc` is used.	`None`

Returns

Type	Description
pandas.Series	Series containing lists of miscellaneous features.
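The labelling scheme described above can be sketched on a plain list of values. This is only an illustration of the output format, not the library's code: the helper name `misc_to_features` is hypothetical, and treating `None` and the empty string as "missing" is an assumption (the real function operates on a pandas series).

```python
def misc_to_features(values, label="misc"):
    """Hypothetical helper: label each value, coding missing data as ''."""
    features = []
    for value in values:
        if value is None or value == "":
            # Missing data are coded as an empty string, as described above.
            features.append("")
        else:
            features.append([f"{label}<{value}>"])
    return features


print(misc_to_features(["london", None, "leeds"], label="city"))
# [['city<london>'], '', ['city<leeds>']]
```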

gen_misc_shingled_features

embedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)

Generate shingled labelled features.

Generate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().

Parameters

Name	Type	Description	Default
`field`	pandas.Series	Series of string data.	required
`ngram_length`	list	Shingle sizes to generate. By default `[2, 3]`.	`[2, 3]`
`use_gen_skip_grams`	bool	Whether to generate skip 2-grams. `False` by default.	`False`
`label`	str	A label to differentiate from other shingled features. If `field` has no name, this defaults to `zz`.	`None`

Returns

Type	Description
pandas.Series	Series containing lists of shingled string features.

gen_name_features

embedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)

Generate a features series for a series of names.

Effectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.

Parameters

Name	Type	Description	Default
`names`	pandas.Series	Series of names.	required
`ngram_length`	list[int]	Lengths of n-grams to make. Ignored if `use_gen_ngram=False`.	`[2, 3]`
`use_gen_ngram`	bool	Whether to create n-grams. Default is `True`.	`True`
`use_gen_skip_grams`	bool	Whether to create skip 2-grams. Default is `False`.	`False`
`use_double_metaphone`	bool	Whether to create double metaphones. Default is `False`.	`False`

Returns

Type	Description
pandas.Series	Series containing lists of features.

gen_ngram

embedder.features.gen_ngram(split_tokens, ngram_length)

Generate n-grams from a set of tokens.

This is a generator function that yields the n-grams produced by sliding a window of each requested length across the tokens.

Parameters

Name	Type	Description	Default
`split_tokens`	list	All the split-up tokens from which to form n-grams.	required
`ngram_length`	list	Desired lengths of n-grams. For example, `ngram_length=[2, 3]` would generate all 2-grams and 3-grams.	required

Returns

Type	Description
str	The next n-gram in the sequence.
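The sliding-window idea can be sketched as follows. This is a minimal illustration rather than the library's implementation: the name `sliding_ngrams` is hypothetical, and the order in which n-grams are emitted (all of one length before the next) is an assumption.

```python
def sliding_ngrams(split_tokens, ngram_length):
    """Hypothetical sketch: yield every n-gram of each requested length."""
    for n in ngram_length:
        for token in split_tokens:
            # Slide a window of width n across the token, one character at a time.
            # Tokens shorter than n produce an empty range and so yield nothing.
            for start in range(len(token) - n + 1):
                yield token[start:start + n]


print(list(sliding_ngrams(["_dave_"], [2, 3])))
# ['_d', 'da', 'av', 've', 'e_', '_da', 'dav', 'ave', 've_']
```

Because the tokens are already wrapped in underscores, the first and last characters of each word appear in n-grams that mark the word boundary.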

gen_sex_features

embedder.features.gen_sex_features(sexes)

Generate labelled sex features from a series of sexes.

Features take the form ["sex<option>"] or [""] for missing data.

Parameters

Name	Type	Description	Default
`sexes`	pandas.Series	Series of sex data.	required

Returns

Type	Description
pandas.Series	Series containing lists of sex features.

gen_skip_grams

embedder.features.gen_skip_grams(split_tokens)

Generate skip 2-grams from a set of tokens.

This function is a generator that yields skip 2-grams: pairs of characters taken from each token with one character skipped between them.

Examples

>>> string = "dave james"
>>> tokens = split_string_underscore(string)
>>> skips = list(gen_skip_grams(tokens))
>>> print(skips)
['_a', 'dv', 'ae', 'v_', '_a', 'jm', 'ae', 'ms', 'e_']

Parameters

Name	Type	Description	Default
`split_tokens`	list	All the split-up tokens from which to form skip 2-grams.	required

Returns

Type	Description
str	The next skip 2-gram in the sequence.
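A skip 2-gram pairs each character with the one two positions after it. The sketch below reproduces the output of the example above from already-wrapped tokens. It is an illustration only: the name `skip_two_grams` is hypothetical and the library's own code may be structured differently.

```python
def skip_two_grams(split_tokens):
    """Hypothetical sketch: yield skip 2-grams for each wrapped token."""
    for token in split_tokens:
        # Pair the character at `start` with the one two places later,
        # skipping the character in between.
        for start in range(len(token) - 2):
            yield token[start] + token[start + 2]


print(list(skip_two_grams(["_dave_", "_james_"])))
# ['_a', 'dv', 'ae', 'v_', '_a', 'jm', 'ae', 'ms', 'e_']
```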

split_string_underscore

embedder.features.split_string_underscore(string)

Split a string at typical punctuation marks and wrap each token in underscores.

Currently, we split at any combination of spaces, dashes, dots, commas, or underscores.

Examples

>>> strings = ("dave  william johnson", "Francesca__Hogan-O'Malley")
>>> for string in strings:
...     print(split_string_underscore(string))
['_dave_', '_william_', '_johnson_']
['_Francesca_', '_Hogan_', "_O'Malley_"]

Parameters

Name	Type	Description	Default
`string`	str	String to split.	required

Returns

Type	Description
list[str]	List of the split and wrapped tokens.
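The splitting rule can be approximated with a single regular expression. The sketch below is an assumption about how such a splitter might be written, not the library's source; the name `split_and_wrap` is hypothetical. It treats any run of spaces, dashes, dots, commas, or underscores as one separator and drops empty tokens.

```python
import re

# Any run of spaces, dashes, dots, commas, or underscores counts as one separator.
SEPARATORS = re.compile(r"[ \-.,_]+")


def split_and_wrap(string):
    """Hypothetical sketch: split at separators and wrap each token in underscores."""
    return [f"_{token}_" for token in SEPARATORS.split(string) if token]


for string in ("dave  william johnson", "Francesca__Hogan-O'Malley"):
    print(split_and_wrap(string))
# ['_dave_', '_william_', '_johnson_']
# ['_Francesca_', '_Hogan_', "_O'Malley_"]
```

Apostrophes are not separators, so a name such as `O'Malley` stays in one token.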