features
embedder.features
Feature generation functions for various column types.
Functions
Name | Description |
---|---|
gen_dateofbirth_features | Generate labelled date features from a series of dates of birth. |
gen_double_metaphone | Generate the double metaphones of a string. |
gen_features | Generate string features of various types. |
gen_misc_features | Generate miscellaneous categorical features for a series. |
gen_misc_shingled_features | Generate shingled labelled features. |
gen_name_features | Generate a features series for a series of names. |
gen_ngram | Generate n-grams from a set of tokens. |
gen_sex_features | Generate labelled sex features from a series of sexes. |
gen_skip_grams | Generate skip 2-grams from a set of tokens. |
split_string_underscore | Split a string at typical punctuation marks and wrap each token in underscores. |
gen_dateofbirth_features
embedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])
Generate labelled date features from a series of dates of birth.
Features take the form ["day<dd>", "month<mm>", "year<YYYY>"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dob | pandas.Series | Series of dates of birth. | required |
dayfirst | bool | Whether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True. | True |
yearfirst | bool | Whether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False. | False |
default | list[str] | Default date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01. | [] |
Returns
Type | Description |
---|---|
pandas.Series | Series containing lists of date features. |
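To make the feature form concrete, here is a minimal sketch that uses only the standard library. The helper name `sketch_date_features`, the fixed `%d/%m/%Y` format, the zero-padding, and the fallback list are all assumptions made for this illustration. The documented function works on a pandas Series and parses it with pd.to_datetime(), so its behaviour may differ in detail.

```python
from datetime import datetime


def sketch_date_features(dates, fmt="%d/%m/%Y", default=None):
    """Hypothetical helper: turn date strings into labelled feature lists."""
    # Assumed fallback, written as the feature form of 2050-01-01.
    if default is None:
        default = ["day<01>", "month<01>", "year<2050>"]

    features = []
    for raw in dates:
        try:
            parsed = datetime.strptime(raw, fmt)
        except (TypeError, ValueError):
            # Missing or unparseable entries fall back to the default features.
            features.append(list(default))
            continue
        features.append(
            [
                f"day<{parsed.day:02d}>",
                f"month<{parsed.month:02d}>",
                f"year<{parsed.year:04d}>",
            ]
        )
    return features


example = sketch_date_features(["31/12/1999", None, "01/02/2000"])
print(example)
```

Labelling each component separately means two records that share only a birth year still have one feature in common.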
gen_double_metaphone
embedder.features.gen_double_metaphone(string)
Generate the double metaphones of a string.
This function is a generator that yields all the possible, non-empty double metaphones of a given string, separated by spaces. It uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.
Parameters
Name | Type | Description | Default |
---|---|---|---|
string | str | String from which to derive double metaphones. | required |
Returns
Type | Description |
---|---|
str | The next double metaphone in the sequence. |
gen_features
embedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)
Generate string features of various types.
This function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.
Parameters
Name | Type | Description | Default |
---|---|---|---|
string | str | Base string from which to generate features. | required |
ngram_length | list | Lengths of n-grams to make. Ignored if use_gen_ngram=False. | [2, 3] |
use_gen_ngram | bool | Whether to create n-grams. Default is True. | True |
use_gen_skip_grams | bool | Whether to create skip 2-grams. Default is False. | False |
use_double_metaphone | bool | Whether to create double metaphones. Default is False. | False |
Returns
Type | Description |
---|---|
str | The next feature in the sequence. |
gen_misc_features
embedder.features.gen_misc_features(field, label=None)
Generate miscellaneous categorical features for a series.
Useful for keeping raw columns in the linkage data. All features use a label and take the form ["label<option>"], except for missing data, which are coded as "".
Parameters
Name | Type | Description | Default |
---|---|---|---|
field | pandas.Series | Series from which to generate our features. | required |
label | None \| str \| typing.Hashable | Label for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used. | None |
Returns
Type | Description |
---|---|
pandas.Series | Series containing lists of miscellaneous features. |
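The labelling scheme can be sketched in plain Python as follows. This is an illustration only: `sketch_misc_features` is a hypothetical helper that works on a list rather than a pandas Series, and the missing-value check is an assumption rather than the library's own logic.

```python
def sketch_misc_features(values, label="misc"):
    """Hypothetical helper: label each category, coding missing values as ""."""
    features = []
    for value in values:
        # Treat None and NaN (which never equals itself) as missing.
        if value is None or value != value:
            features.append("")
        else:
            features.append([f"{label}<{value}>"])
    return features


regions = sketch_misc_features(["north", None, "south"], label="region")
print(regions)
```

Prefixing the label keeps identical values from different columns apart, so a region of "north" and a surname of "north" produce different features.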
gen_misc_shingled_features
embedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)
Generate shingled labelled features.
Generate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().
Parameters
Name | Type | Description | Default |
---|---|---|---|
field | pandas.Series | Series of string data. | required |
ngram_length | list | Shingle sizes to generate. By default [2, 3]. | [2, 3] |
use_gen_skip_grams | bool | Whether to generate skip 2-grams. False by default. | False |
label | str | A label to differentiate from other shingled features. If field has no name, this defaults to zz. | None |
Returns
Type | Description |
---|---|
pandas.Series | Series containing lists of shingled string features. |
gen_name_features
embedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)
Generate a features series for a series of names.
Effectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.
Parameters
Name | Type | Description | Default |
---|---|---|---|
names | pandas.Series | Series of names. | required |
ngram_length | list[int] | Lengths of n-grams to make. Ignored if use_gen_ngram=False. | [2, 3] |
use_gen_ngram | bool | Whether to create n-grams. Default is True. | True |
use_gen_skip_grams | bool | Whether to create skip 2-grams. Default is False. | False |
use_double_metaphone | bool | Whether to create double metaphones. Default is False. | False |
Returns
Type | Description |
---|---|
pandas.Series | Series containing lists of features. |
gen_ngram
embedder.features.gen_ngram(split_tokens, ngram_length)
Generate n-grams from a set of tokens.
This is a generator function that yields n-grams, each the size of the sliding window.
Parameters
Name | Type | Description | Default |
---|---|---|---|
split_tokens | list | All the split-up tokens from which to form n-grams. | required |
ngram_length | list | Desired lengths of n-grams. For example, ngram_length=[2, 3] would generate all 2-grams and 3-grams. | required |
Returns
Type | Description |
---|---|
str | The next n-gram in the sequence. |
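A sliding-window n-gram generator might look like the sketch below. It assumes that n-grams are taken over the characters of each token and that all n-grams of one length are produced before the next length. Neither detail is stated above, and `sketch_ngrams` is a hypothetical name.

```python
def sketch_ngrams(split_tokens, ngram_length):
    """Hypothetical generator: character n-grams within each token."""
    for n in ngram_length:
        for token in split_tokens:
            # Slide a window of width n across the token, one character at a time.
            for start in range(len(token) - n + 1):
                yield token[start:start + n]


bigrams = list(sketch_ngrams(["_dave_"], [2]))
mixed = list(sketch_ngrams(["_al_"], [2, 3]))
print(bigrams)
print(mixed)
```

Tokens shorter than the window yield nothing, because the range of start positions is then empty.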
gen_sex_features
embedder.features.gen_sex_features(sexes)
Generate labelled sex features from a series of sexes.
Features take the form ["sex<option>"] or [""] for missing data.
Parameters
Name | Type | Description | Default |
---|---|---|---|
sexes | pandas.Series | Series of sex data. | required |
Returns
Type | Description |
---|---|
pandas.Series | Series containing lists of sex features. |
gen_skip_grams
embedder.features.gen_skip_grams(split_tokens)
Generate skip 2-grams from a set of tokens.
This function is a generator that yields a series of skip 2-grams.
Examples
>>> string = "dave james"
>>> tokens = split_string_underscore(string)
>>> skips = list(gen_skip_grams(tokens))
>>> print(skips)
["_a", "dv", "ae", "v_", "_a", "jm", "ae", "ms", "e_"]
Parameters
Name | Type | Description | Default |
---|---|---|---|
split_tokens | list | All the split-up tokens from which to form skip 2-grams. | required |
Returns
Type | Description |
---|---|
str | The next skip 2-gram in the sequence. |
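The documented output is consistent with pairing each character with the one two positions after it, within each token. The sketch below implements that reading. `sketch_skip_grams` is a hypothetical name, the tokens are written by hand in underscore-wrapped form, and the library's actual implementation may differ.

```python
def sketch_skip_grams(split_tokens):
    """Hypothetical generator: pair each character with the one two places on."""
    for token in split_tokens:
        # Stop two characters before the end so the partner index stays in range.
        for start in range(len(token) - 2):
            yield token[start] + token[start + 2]


skips = list(sketch_skip_grams(["_dave_", "_james_"]))
print(skips)
```

Because each pair skips the middle character, a single-character substitution leaves some skip 2-grams unchanged even when it alters the adjacent 2-grams.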
split_string_underscore
embedder.features.split_string_underscore(string)
Split a string at typical punctuation marks and wrap each token in underscores.
Currently, we split at any combination of spaces, dashes, dots, commas, or underscores.
Examples
>>> strings = ("dave william johnson", "Francesca__Hogan-O'Malley")
>>> for string in strings:
...     print(split_string_underscore(string))
...
["_dave_", "_william_", "_johnson_"]
["_Francesca_", "_Hogan_", "_O'Malley_"]
Parameters
Name | Type | Description | Default |
---|---|---|---|
string | str | String to split. | required |
Returns
Type | Description |
---|---|
list[str] | List of the split and wrapped tokens. |
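One way to get the splitting and wrapping behaviour described above is a single regular expression over the listed separators. This is a sketch under assumptions: the exact pattern and the handling of empty pieces are guesses, and `sketch_split_underscore` is a hypothetical name rather than the library function.

```python
import re


def sketch_split_underscore(string):
    """Hypothetical helper: split on separators and wrap tokens in underscores."""
    # Split on any run of spaces, dashes, dots, commas, or underscores.
    tokens = re.split(r"[ \-.,_]+", string)
    # Drop empty pieces left by leading or trailing separators, then wrap.
    return [f"_{token}_" for token in tokens if token]


first = sketch_split_underscore("dave william johnson")
second = sketch_split_underscore("Francesca__Hogan-O'Malley")
print(first)
print(second)
```

The apostrophe is not in the separator set, so O'Malley stays as one token, matching the documented example.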