features
embedder.features
Feature generation functions for various column types.
Functions
| Name | Description |
|---|---|
| gen_dateofbirth_features | Generate labelled date features from a series of dates of birth. |
| gen_double_metaphone | Generate the double metaphones of a string. |
| gen_features | Generate string features of various types. |
| gen_misc_features | Generate miscellaneous categorical features for a series. |
| gen_misc_shingled_features | Generate shingled labelled features. |
| gen_name_features | Generate a features series for a series of names. |
| gen_ngram | Generate n-grams from a set of tokens. |
| gen_sex_features | Generate labelled sex features from a series of sexes. |
| gen_skip_grams | Generate skip 2-grams from a set of tokens. |
| split_string_underscore | Split and underwrap a string at typical punctuation marks. |
gen_dateofbirth_features
embedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])
Generate labelled date features from a series of dates of birth.
Features take the form ["day<dd>", "month<mm>", "year<YYYY>"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| dob | pandas.Series | Series of dates of birth. | required |
| dayfirst | bool | Whether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True. | True |
| yearfirst | bool | Whether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False. | False |
| default | list[str] | Default date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01. | [] |
Returns
| Type | Description |
|---|---|
| pandas.Series | Series containing lists of date features. |
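The feature form described above can be illustrated with a minimal sketch. It is not the library's implementation: `date_features` is a hypothetical helper that handles a single `datetime.date` rather than a pandas Series, and it assumes the angle brackets in the documented form are literal characters.

```python
from datetime import date


def date_features(value, default=None):
    """Return labelled day, month and year features for one date.

    Hypothetical helper for illustration; not part of embedder.features.
    """
    if value is None:
        # Missing dates fall back to the caller-supplied default list.
        return list(default) if default is not None else []
    return [
        f"day<{value.day:02d}>",
        f"month<{value.month:02d}>",
        f"year<{value.year:04d}>",
    ]


print(date_features(date(1990, 2, 1)))
# ['day<01>', 'month<02>', 'year<1990>']
```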
gen_double_metaphone
embedder.features.gen_double_metaphone(string)
Generate the double metaphones of a string.
This function is a generator that yields all the possible non-empty double metaphones of a given string, separated by spaces. It uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See the metaphone package's repository for details.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| string | str | String from which to derive double metaphones. | required |
Returns
| Type | Description |
|---|---|
| str | The next double metaphone in the sequence. |
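The filtering of empty codes can be sketched as follows. This is an illustration under assumptions, not the library's code: `double_metaphones` and `toy_encoder` are hypothetical names, the phonetic encoder is passed in as an argument and assumed to return a tuple of codes (some possibly empty), and splitting the input at whitespace is also an assumption.

```python
def double_metaphones(string, encoder):
    """Yield each non-empty code that `encoder` returns for each word.

    Hypothetical sketch; `encoder` stands in for a phonetic encoder
    that returns a tuple of codes, some of which may be empty strings.
    """
    for word in string.split():
        for code in encoder(word):
            if code:  # skip empty codes
                yield code


def toy_encoder(word):
    # Stand-in only: this is NOT a real phonetic algorithm.
    return (word[:2].upper(), "")


print(list(double_metaphones("dave james", toy_encoder)))
# ['DA', 'JA']
```

In practice the encoder argument would be a real phonetic function; the stand-in here only keeps the example free of third-party dependencies.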
gen_features
embedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)
Generate string features of various types.
This function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| string | str | Base string from which to generate features. | required |
| ngram_length | list | Lengths of n-grams to make. Ignored if use_gen_ngram=False. | [2, 3] |
| use_gen_ngram | bool | Whether to create n-grams. Default is True. | True |
| use_gen_skip_grams | bool | Whether to create skip 2-grams. Default is False. | False |
| use_double_metaphone | bool | Whether to create double metaphones. Default is False. | False |
Returns
| Type | Description |
|---|---|
| str | The next feature in the sequence. |
gen_misc_features
embedder.features.gen_misc_features(field, label=None)
Generate miscellaneous categorical features for a series.
Useful for keeping raw columns in the linkage data. All features use a label and take the form ["label<option>"] except for missing data, which are coded as "".
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| field | pandas.Series | Series from which to generate our features. | required |
| label | None \| str \| typing.Hashable | Label for the series. Defaults to the name of the series if it has one, and to misc otherwise. | None |
Returns
| Type | Description |
|---|---|
| pandas.Series | Series containing lists of miscellaneous features. |
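The labelled form and the coding of missing data can be sketched on a plain list. This is a hedged illustration, not the library's implementation: `misc_features` is a hypothetical helper, `None` is assumed to mark missing data, and the angle brackets are assumed to be literal.

```python
def misc_features(values, label="misc"):
    """Label each value as ["label<value>"], coding missing values as "".

    Hypothetical helper for illustration; works on a list, not a Series.
    """
    return [[f"{label}<{value}>"] if value is not None else "" for value in values]


print(misc_features(["london", None, "leeds"], label="city"))
# [['city<london>'], '', ['city<leeds>']]
```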
gen_misc_shingled_features
embedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)
Generate shingled labelled features.
Generate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| field | pandas.Series | Series of string data. | required |
| ngram_length | list | Shingle sizes to generate. By default [2, 3]. | [2, 3] |
| use_gen_skip_grams | bool | Whether to generate skip 2-grams. False by default. | False |
| label | str | A label to differentiate from other shingled features. If field has no name, this defaults to zz. | None |
Returns
| Type | Description |
|---|---|
| pandas.Series | Series containing lists of shingled string features. |
gen_name_features
embedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)
Generate a features series for a series of names.
Effectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| names | pandas.Series | Series of names. | required |
| ngram_length | list[int] | Lengths of n-grams to make. Ignored if use_gen_ngram=False. | [2, 3] |
| use_gen_ngram | bool | Whether to create n-grams. Default is True. | True |
| use_gen_skip_grams | bool | Whether to create skip 2-grams. Default is False. | False |
| use_double_metaphone | bool | Whether to create double metaphones. Default is False. | False |
Returns
| Type | Description |
|---|---|
| pandas.Series | Series containing lists of features. |
gen_ngram
embedder.features.gen_ngram(split_tokens, ngram_length)
Generate n-grams from a set of tokens.
This generator function yields n-grams by sliding a window of each requested length over the tokens.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| split_tokens | list | All the split-up tokens from which to form n-grams. | required |
| ngram_length | list | Desired lengths of n-grams. For example, ngram_length=[2, 3] would generate all 2-grams and 3-grams. | required |
Returns
| Type | Description |
|---|---|
| str | The next n-gram in the sequence. |
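A sliding-window n-gram generator can be sketched as below. This is a minimal illustration rather than the library's code: `ngrams` is a hypothetical name, and the order in which n-grams are emitted (all of one length before the next) is an assumption.

```python
def ngrams(split_tokens, ngram_length):
    """Yield every n-gram of each requested length from each token.

    Hypothetical sketch; the emission order is an assumption.
    """
    for n in ngram_length:
        for token in split_tokens:
            # Slide a window of width n across the token.
            for start in range(len(token) - n + 1):
                yield token[start:start + n]


print(list(ngrams(["_dave_"], [2, 3])))
# ['_d', 'da', 'av', 've', 'e_', '_da', 'dav', 'ave', 've_']
```

Tokens shorter than the window produce no n-grams, because the range of start positions is empty.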
gen_sex_features
embedder.features.gen_sex_features(sexes)
Generate labelled sex features from a series of sexes.
Features take the form ["sex<option>"] or [""] for missing data.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| sexes | pandas.Series | Series of sex data. | required |
Returns
| Type | Description |
|---|---|
| pandas.Series | Series containing lists of sex features. |
gen_skip_grams
embedder.features.gen_skip_grams(split_tokens)
Generate skip 2-grams from a set of tokens.
This function is a generator that yields a series of skip 2-grams.
Examples
>>> string = "dave james"
>>> tokens = split_string_underscore(string)
>>> skips = list(gen_skip_grams(tokens))
>>> print(skips)
["_a", "dv", "ae", "v_", "_a", "jm", "ae", "ms", "e_"]
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| split_tokens | list | All the split-up tokens from which to form skip 2-grams. | required |
Returns
| Type | Description |
|---|---|
| str | The next skip 2-gram in the sequence. |
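The documented example output is consistent with pairing each character with the one two positions later. The sketch below reproduces it under that assumption; `skip_grams` is a hypothetical name and this is not necessarily how the library implements it.

```python
def skip_grams(split_tokens):
    """Yield skip 2-grams: each character joined with the one two places on.

    Hypothetical sketch inferred from the documented example output.
    """
    for token in split_tokens:
        for start in range(len(token) - 2):
            yield token[start] + token[start + 2]


print(list(skip_grams(["_dave_", "_james_"])))
# ['_a', 'dv', 'ae', 'v_', '_a', 'jm', 'ae', 'ms', 'e_']
```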
split_string_underscore
embedder.features.split_string_underscore(string)
Split and underwrap a string at typical punctuation marks.
Currently, we split at any combination of spaces, dashes, dots, commas, or underscores.
Examples
>>> strings = ("dave william johnson", "Francesca__Hogan-O'Malley")
>>> for string in strings:
... print(split_string_underscore(string))
["_dave_", "_william_", "_johnson_"]
["_Francesca_", "_Hogan_", "_O'Malley_"]
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| string | str | String to split. | required |
Returns
| Type | Description |
|---|---|
| list[str] | List of the split and wrapped tokens. |
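One way to get the splitting and wrapping behaviour described above is a single regular expression. The sketch below is an assumption about how this could be done, not the library's implementation; `split_and_wrap` is a hypothetical name.

```python
import re


def split_and_wrap(string):
    """Split at runs of spaces, dashes, dots, commas or underscores,
    then wrap each token in underscores.

    Hypothetical sketch; not part of embedder.features.
    """
    tokens = re.split(r"[ \-.,_]+", string)
    # Drop empty tokens left by leading or trailing separators.
    return [f"_{token}_" for token in tokens if token]


print(split_and_wrap("Francesca__Hogan-O'Malley"))
# ['_Francesca_', '_Hogan_', "_O'Malley_"]
```

The apostrophe is not in the separator set, so a name such as O'Malley stays in one token, matching the documented example.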