bloom_filters

embedder.bloom_filters

Module for the Bloom filter encoder.

Classes

Name Description
BloomFilterEncoder Encoder of tokens and features via hashing and a Bloom filter.

BloomFilterEncoder

embedder.bloom_filters.BloomFilterEncoder(self, size=1024, num_hashes=2, offset=0, salt=None)

Encoder of tokens and features via hashing and a Bloom filter.

The process for creating a cryptographically secure Bloom filter encoding of a set of tokens is as follows:

  1. Compute the hash digest of each token.
  2. Convert the digest bytes into an integer.
  3. Map the integer to an index in the Bloom filter vector (modulo the length of the vector).
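
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the encoder's actual implementation: the helper name `sketch_indices` is hypothetical, and the way repeated hashes are derived (re-hashing the previous digest) is an assumption. The indices it produces are not expected to match those returned by BloomFilterEncoder.

```python
import hashlib


def sketch_indices(tokens, size=1024, num_hashes=2, offset=0, salt=None):
    """Hypothetical helper: map tokens to Bloom filter indices."""
    indices = []
    for token in tokens:
        # The salt is appended to the token before hashing.
        data = token + (salt or "")
        for _ in range(num_hashes):
            # Step 1: compute the hash digest.
            digest = hashlib.sha256(data.encode("utf-8")).digest()
            # Step 2: convert the digest bytes into an integer.
            value = int.from_bytes(digest, "big")
            # Step 3: map the integer to an index, modulo the filter size.
            indices.append(value % size + offset)
            # Assumption: the next hash re-hashes the previous digest.
            data = digest.hex()
    return indices
```

Calling `sketch_indices(["a", "b", "c"])` returns six indices (three tokens times two hashes), each in the range 0 to 1023 when the offset is zero.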

Parameters

Name Type Description Default
size int Size of the Bloom filter. Defaults to 1024. 1024
num_hashes int Number of hashes to perform. Defaults to two. 2
offset int Offset for Bloom filter indices to allow for masking. Defaults to zero. 0
salt str | None Cryptographic salt appended to tokens prior to hashing. None

Attributes

Name Type Description
hash_function func Hashing function (hashlib.sha256).

Methods

Name Description
bloom_filter_vector Convert a feature vector into indices for a Bloom vector.
bloom_filter_vector_collision_fraction Convert a feature vector and return its collision fraction.

bloom_filter_vector

embedder.bloom_filters.BloomFilterEncoder.bloom_filter_vector(feature)

Convert a feature vector into indices for a Bloom vector.

The index vector uses an optional offset for masking.

Parameters

Name Type Description Default
feature list[str] List of features to be converted. required

Returns

Type Description
list Index values used to create the Bloom filter vector.

Examples

>>> bfe = BloomFilterEncoder()
>>> bfe.bloom_filter_vector(["a","b","c"])
[334, 1013, 192, 381, 18, 720]
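
The returned indices mark which positions of the Bloom filter vector are set. As a rough sketch, assuming a plain Python list as the vector and assuming the offset simply shifts indices into a vector of length `size + offset`, the conversion could look as follows. The helper `indices_to_vector` is hypothetical and not part of the module.

```python
def indices_to_vector(indices, size=1024, offset=0):
    """Hypothetical helper: build a binary vector from Bloom filter indices."""
    # Assumption: offset positions are reserved at the start of the vector.
    vector = [0] * (size + offset)
    for index in indices:
        vector[index] = 1  # repeated indices set the same position once
    return vector


vector = indices_to_vector([334, 1013, 192, 381, 18, 720])
```

With the six distinct indices from the example above, exactly six positions of `vector` are set to 1.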

bloom_filter_vector_collision_fraction

embedder.bloom_filters.BloomFilterEncoder.bloom_filter_vector_collision_fraction(feature)

Convert a feature vector into indices for a Bloom vector and return them together with the collision fraction.

The index vector uses an optional offset for masking.

Parameters

Name Type Description Default
feature list[str] List of features to be processed. required

Returns

Type Description
list Index values used to create the Bloom filter vector.
float Proportion of repeated indices.

Examples

>>> bfe = BloomFilterEncoder()
>>> bfe.bloom_filter_vector_collision_fraction(["a","b","c"])
([334, 1013, 192, 381, 18, 720], 0.0)
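
One plausible reading of "proportion of repeated indices" is one minus the ratio of distinct indices to total indices. The sketch below works under that assumption. The function `collision_fraction` is hypothetical, and the encoder's exact formula may differ.

```python
def collision_fraction(indices):
    """Hypothetical helper: share of indices that repeat an earlier index."""
    if not indices:
        return 0.0  # assumption: an empty input has no collisions
    return 1 - len(set(indices)) / len(indices)


no_repeats = collision_fraction([334, 1013, 192, 381, 18, 720])
one_repeat = collision_fraction([1, 1, 2, 3])
```

Here `no_repeats` is 0.0, consistent with the example above, and `one_repeat` is 0.25 because one of the four indices duplicates another.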