bloom_filters
embedder.bloom_filters
Module for the Bloom filter encoder.
Classes
Name | Description |
---|---|
BloomFilterEncoder | Encoder of tokens and features via hashing and a Bloom filter. |
BloomFilterEncoder
embedder.bloom_filters.BloomFilterEncoder(self, size=1024, num_hashes=2, offset=0, salt=None)
Encoder of tokens and features via hashing and a Bloom filter.
The process for creating a cryptographically secure Bloom filter encoding of a set of tokens is as follows:
- Compute the hash digest for each token
- Convert the digest bytes into an integer
- Map each integer to an index in the Bloom filter vector (modulo the length of the vector)
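The three steps above can be sketched in a few lines. This is a minimal illustration, not the encoder's actual implementation: the function name `sketch_bloom_indices`, the way the salt is combined with the token, and the choice to re-hash the previous digest for each additional hash are all assumptions, so the indices it produces will not necessarily match those returned by `BloomFilterEncoder`.

```python
import hashlib


def sketch_bloom_indices(tokens, size=1024, num_hashes=2, offset=0, salt=None):
    """Hypothetical sketch: hash tokens and map them to Bloom filter indices."""
    indices = []
    for token in tokens:
        # Assumption: the salt is appended to the token before hashing.
        data = (token + (salt or "")).encode("utf-8")
        for _ in range(num_hashes):
            # Step 1: compute the hash digest.
            digest = hashlib.sha256(data).digest()
            # Step 2: convert the digest bytes into an integer.
            value = int.from_bytes(digest, "big")
            # Step 3: map the integer to an index, modulo the filter size,
            # then shift by the offset.
            indices.append(value % size + offset)
            # Assumption: later hashes are derived by re-hashing the digest.
            data = digest
    return indices


indices = sketch_bloom_indices(["a", "b", "c"])
print(len(indices))  # 6 (three tokens, two hashes each)
```

With the default settings every index falls in the range 0 to 1023; a non-zero `offset` shifts the whole range upwards by that amount.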
Parameters
Name | Type | Description | Default |
---|---|---|---|
size | int | Size of the Bloom filter. Defaults to 1024. | 1024 |
num_hashes | int | Number of hashes to perform. Defaults to two. | 2 |
offset | int | Offset for Bloom filter indices to allow for masking. Defaults to zero. | 0 |
salt | str \| None | Cryptographic salt appended to tokens prior to hashing. | None |
Attributes
Name | Type | Description |
---|---|---|
hash_function | func | Hashing function (hashlib.sha256). |
Methods
Name | Description |
---|---|
bloom_filter_vector | Convert a feature vector into indices for a Bloom vector. |
bloom_filter_vector_collision_fraction | Convert a feature vector into indices for a Bloom vector and return them with the collision fraction. |
bloom_filter_vector
embedder.bloom_filters.BloomFilterEncoder.bloom_filter_vector(feature)
Convert a feature vector into indices for a Bloom vector.
The index vector uses an optional offset for masking.
Parameters
Name | Type | Description | Default |
---|---|---|---|
feature | list[str] | List of features to be converted. | required |
Returns
Type | Description |
---|---|
list | Index values used to create the Bloom filter vector. |
Examples
>>> bfe = BloomFilterEncoder()
>>> bfe.bloom_filter_vector(["a","b","c"])
[334, 1013, 192, 381, 18, 720]
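The returned indices are positions to be set in the Bloom filter vector. As a rough sketch of that final step, the hypothetical helper `indices_to_bits` below (not part of the documented API) turns the indices from the example into a binary vector. It assumes a vector length of 1024, matching the default `size` with zero offset.

```python
def indices_to_bits(indices, length=1024):
    """Hypothetical helper: set a 1 at each index of a zero-filled vector."""
    bits = [0] * length
    for idx in indices:
        bits[idx] = 1
    return bits


bits = indices_to_bits([334, 1013, 192, 381, 18, 720])
print(sum(bits))  # 6 (one bit set per distinct index)
```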
bloom_filter_vector_collision_fraction
embedder.bloom_filters.BloomFilterEncoder.bloom_filter_vector_collision_fraction(feature)
Convert a feature vector into indices for a Bloom vector and return them with the collision fraction.
The index vector uses an optional offset for masking.
Parameters
Name | Type | Description | Default |
---|---|---|---|
feature | list[str] | List of features to be processed. | required |
Returns
Type | Description |
---|---|
list | Index values used to create the Bloom filter vector. |
float | Proportion of repeated indices. |
Examples
>>> bfe = BloomFilterEncoder()
>>> bfe.bloom_filter_vector_collision_fraction(["a","b","c"])
([334, 1013, 192, 381, 18, 720], 0.0)
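One plausible reading of "proportion of repeated indices" is one minus the ratio of distinct indices to all generated indices. The sketch below uses that definition as an assumption; the helper name `collision_fraction` is hypothetical and the encoder's exact formula may differ.

```python
def collision_fraction(indices):
    """Hypothetical helper: share of indices that duplicate another index."""
    if not indices:
        # Assumption: an empty input has no collisions.
        return 0.0
    return 1 - len(set(indices)) / len(indices)


print(collision_fraction([334, 1013, 192, 381, 18, 720]))  # 0.0
print(collision_fraction([1, 1, 2, 2]))  # 0.5
```

Under this definition, the six distinct indices from the example give 0.0, which agrees with the value shown above.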