indexers.main

This module provides functionality for creating a VectorStore from a CSV (text) file. It defines the VectorStore class, which is used to model and create vector databases from CSV text files using a Vectoriser object.

The class requires a Vectoriser object from the vectorisers submodule to convert the CSV’s text data into vector embeddings, which are then stored in the VectorStore object.

Key Features:

  • Batch processing of input files to handle large datasets.
  • Support for CSV file format (additional formats may be added in future updates).
  • Integration with a custom embedder for generating vector embeddings.
  • Logging for tracking progress and handling errors during processing.
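
The batch processing mentioned above can be illustrated with a minimal sketch. This is not the module's actual implementation: `iter_batches` is a hypothetical helper, and the default of 8 simply mirrors the `batch_size=8` default shown in the constructor signature.

```python
from typing import Iterator, List, Sequence


def iter_batches(rows: Sequence[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive slices of `rows`, each at most `batch_size` long.

    Hypothetical helper for illustration; not part of indexers.main.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# 20 texts with a batch size of 8 give two full batches and one partial batch.
texts = [f"document {i}" for i in range(20)]
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=8)]
```

Processing a fixed number of rows at a time keeps the number of texts handed to the embedder in one call bounded, whatever the size of the input file.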

VectorStore Class:

  • The VectorStore class is initialized with a Vectoriser object and a CSV knowledgebase.
  • Additional columns in the CSV may be specified as metadata to be included in the vector database.
  • Upon creation, the VectorStore is saved in Parquet format for efficient, quick reloading via the VectorStore’s .from_filespace() method.
  • A new piece of text data (or label) can be queried against the VectorStore in the following ways:
    • .search(): to find the most semantically similar pieces of text in the vector database.
    • .reverse_search(): to find all examples in the knowledgebase that have a given label.
    • .embed(): to generate a vector embedding for a given piece of text data.
  • ‘Hook’ methods may be specified to perform pre-processing on input data before embedding, and post-processing on the output of the search methods.
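
The hook idea in the last bullet can be sketched as a dictionary of callables applied before and after a search. The key names `"preprocess"` and `"postprocess"`, the helper `run_with_hooks`, and the stand-in `fake_search` are all assumptions made for illustration; this section does not state which keys VectorStore actually expects.

```python
from typing import Callable, Dict, List

# Hypothetical hook names; the keys VectorStore really expects are not
# specified in this section.
hooks: Dict[str, Callable] = {
    # Pre-processing: normalise whitespace and case before embedding.
    "preprocess": lambda texts: [t.strip().lower() for t in texts],
    # Post-processing: drop low-scoring results from the search output.
    "postprocess": lambda results: [r for r in results if r["score"] >= 0.5],
}


def fake_search(texts: List[str]) -> List[dict]:
    """Stand-in for a real search: returns one scored result per query."""
    return [{"query": t, "score": 0.9 if "tax" in t else 0.1} for t in texts]


def run_with_hooks(texts: List[str], search_fn: Callable, hooks: Dict[str, Callable]) -> List[dict]:
    """Apply the optional pre-hook, run the search, then apply the optional post-hook."""
    if "preprocess" in hooks:
        texts = hooks["preprocess"](texts)
    results = search_fn(texts)
    if "postprocess" in hooks:
        results = hooks["postprocess"](results)
    return results


results = run_with_hooks(["  Income TAX ", "Weather"], fake_search, hooks)
```

Here the pre-hook turns `"  Income TAX "` into `"income tax"` before the search runs, and the post-hook removes the low-scoring `"weather"` result.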

Classes

Name Description
VectorStore A class to model and create VectorStore objects for building and searching vector databases from CSV text files.

VectorStore

indexers.main.VectorStore(
    file_name,
    data_type,
    vectoriser,
    batch_size=8,
    meta_data=None,
    output_dir=None,
    overwrite=False,
    hooks=None,
)

A class to model and create VectorStore objects for building and searching vector databases from CSV text files.

Attributes

Name Type Description
file_name str the data file containing the knowledgebase to build the VectorStore
data_type str the data type of the data file (currently only csv supported)
vectoriser VectoriserBase A Vectoriser object from the corresponding ClassifAI Package module
batch_size int the batch size to pass to the vectoriser when embedding
meta_data dict key-value pairs of metadata to extract from the input file and their corresponding types
output_dir str the path to the output directory where the VectorStore will be saved
vectors np.array a numpy array of vectors for the vector database
vector_shape int the dimension of the vectors
num_vectors int the number of records saved in the VectorStore
vectoriser_class str the type of Vectoriser used to create embeddings
hooks dict A dictionary of user-defined hooks for preprocessing and postprocessing.

Methods

Name Description
embed Converts text (provided via a VectorStoreEmbedInput object) into vector embeddings using the Vectoriser and returns a VectorStoreEmbedOutput dataframe with columns id, text, and embedding.
from_filespace Creates a VectorStore instance from stored metadata and Parquet files.
reverse_search Reverse searches the VectorStore using a VectorStoreReverseSearchInput object
search Searches the VectorStore using queries from a VectorStoreSearchInput object and returns ranked results in a VectorStoreSearchOutput object.
embed
indexers.main.VectorStore.embed(query)

Converts text (provided via a VectorStoreEmbedInput object) into vector embeddings using the Vectoriser and returns a VectorStoreEmbedOutput dataframe with columns id, text, and embedding.

Parameters
Name Type Description Default
query VectorStoreEmbedInput The VectorStoreEmbedInput object containing the strings to be embedded and their ids. required
Returns
Name Type Description
VectorStoreEmbedOutput The VectorStoreEmbedOutput object containing the embeddings along with their corresponding ids and texts.
Raises
Name Type Description
DataValidationError Raised if invalid arguments are passed.
HookError Raised if user-defined hooks fail.
ClassifaiError Raised if embedding operation fails.
from_filespace
indexers.main.VectorStore.from_filespace(folder_path, vectoriser, hooks=None)

Creates a VectorStore instance from stored metadata and Parquet files. This method reads the metadata and vectors from the specified folder, validates the contents, and initializes a VectorStore object with the loaded data. It checks that the metadata contains the required keys, that the Parquet file exists and is not empty, and that the vectoriser class matches the one used to create the vectors. If any check fails, it raises an error with an appropriate message; the Raises table below lists the exception types. This method is useful for loading previously created vector stores without needing to reprocess the original text data.

Parameters
Name Type Description Default
folder_path str The folder path containing the metadata and Parquet files. required
vectoriser object The Vectoriser object used to transform text into vector embeddings. required
hooks dict [optional] A dictionary of user-defined hooks for preprocessing and postprocessing. Defaults to None. None
Returns
Name Type Description
VectorStore An instance of the VectorStore class.
Raises
Name Type Description
DataValidationError If input arguments are invalid or if there are issues with the metadata or Parquet files.
ConfigurationError If there are configuration issues, such as Vectoriser mismatches.
IndexBuildError If there are failures during loading or parsing the files.
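
The three checks described for from_filespace (required metadata keys, a non-empty Parquet file, a matching vectoriser class) can be sketched as follows. This is an illustration under assumptions, not the library's code: the file names `metadata.json` and `vectors.parquet`, the set of required keys, and the helper `check_store_folder` are all hypothetical, and built-in exceptions stand in for the ClassifAI error types listed above.

```python
import json
import tempfile
from pathlib import Path

# Assumed metadata keys, borrowed from the attribute names documented above.
REQUIRED_KEYS = {"vectoriser_class", "vector_shape", "num_vectors"}


def check_store_folder(folder_path, vectoriser) -> dict:
    """Validate a saved store folder and return its metadata (hypothetical helper)."""
    folder = Path(folder_path)
    meta_file = folder / "metadata.json"       # assumed file name
    vectors_file = folder / "vectors.parquet"  # assumed file name

    if not meta_file.is_file():
        raise FileNotFoundError(f"no metadata file in {folder}")
    metadata = json.loads(meta_file.read_text())

    # Check 1: the metadata contains every required key.
    missing = REQUIRED_KEYS - set(metadata)
    if missing:
        raise ValueError(f"metadata is missing keys: {sorted(missing)}")

    # Check 2: the Parquet file exists and is not empty.
    if not vectors_file.is_file() or vectors_file.stat().st_size == 0:
        raise ValueError("Parquet file is missing or empty")

    # Check 3: the supplied vectoriser is of the class recorded at build time.
    if metadata["vectoriser_class"] != type(vectoriser).__name__:
        raise ValueError("vectoriser class does not match the stored one")

    return metadata


class DummyVectoriser:
    """Placeholder standing in for a real Vectoriser."""


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "metadata.json").write_text(
        json.dumps({"vectoriser_class": "DummyVectoriser", "vector_shape": 3, "num_vectors": 2})
    )
    # Placeholder bytes only; a real store would hold actual Parquet data here.
    Path(tmp, "vectors.parquet").write_bytes(b"placeholder")
    loaded = check_store_folder(tmp, DummyVectoriser())
```

Running the cheap metadata checks before parsing the vectors means a mismatched vectoriser is reported without loading the full vector table.
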
search
indexers.main.VectorStore.search(query, n_results=10, batch_size=8)

Searches the VectorStore using queries from a VectorStoreSearchInput object and returns ranked results in a VectorStoreSearchOutput object. Working in batches, it converts the user's text queries into vector embeddings, computes cosine similarity with the stored document vectors, and retrieves the top results.

Parameters
Name Type Description Default
query VectorStoreSearchInput A VectorStoreSearchInput object containing the text query or list of queries to search for with ids. required
n_results int [optional] Number of top results to return for each query. Default 10. 10
batch_size int [optional] The batch size for processing queries. Default 8. 8
Returns
Name Type Description
VectorStoreSearchOutput A VectorStoreSearchOutput object containing search results with columns for query_id, query_text, document_id, document_text, rank, score, and any associated metadata columns.
Raises
Name Type Description
DataValidationError Raised if invalid arguments are passed.
ConfigurationError Raised if the vector store is not initialized.
HookError Raised if user-defined hooks fail.
VectorisationError Raised if embedding queries fails.
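
The ranking step described for search (cosine similarity between query and document vectors, followed by a top-n selection) can be sketched with NumPy. This is a simplified illustration, not the method's actual implementation: `top_k_cosine` is a hypothetical function, the vectors are made up, and query batching, hooks, and metadata handling are omitted.

```python
import numpy as np


def top_k_cosine(query_vecs: np.ndarray, doc_vecs: np.ndarray, n_results: int = 10):
    """Return (indices, scores) of the most similar documents for each query.

    Assumes 2-D arrays with no all-zero rows (a zero row would divide by zero).
    """
    # Scale every row to unit length so a dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = q @ d.T  # shape: (num_queries, num_documents)

    # Sort each row by descending score and keep the first k columns.
    k = min(n_results, doc_vecs.shape[0])
    order = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, order, axis=1)
    return order, top_scores


docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[2.0, 0.0]])
indices, scores = top_k_cosine(queries, docs, n_results=2)
```

For the single query above, document 0 points in the same direction (score 1.0) and document 2 is at 45 degrees (score about 0.707), so they are ranked first and second; vector length does not affect the ranking.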