Quick Start

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Documentation on the methods utilised and how Optimus functions is pending. This README will be updated to include links to this material once it is made available.

Software Prerequisites

You will need the following tools to set up and use Optimus:

A modern macOS or Linux installation (Windows is not supported; if you try it there, you are on your own)
curl
zsh
Python 3.6 or later
git

Quick Start example

The example.py script in the root directory demonstrates how to use the pipeline. The final dataset is written to optimus_results.csv, also in the root directory.

A graphical UI for running Optimus

To make the tool more accessible, a web-app-based UI was developed. It lets you process data without writing any Python code.

If this is something that interests you please read this README.md file for more info.

Importing Optimus

Import Optimus into Python either as the whole module

import optimus

or by importing the Optimus classes

from optimus import Optimus

Customise settings for Optimus

Configuration of the pipeline is controlled by a config.json file in the following format:

  {
    "data":"location/to/data.csv",
    "model":"location/to/wiki.en.bin",
    ...
  }

After creating a config.json file, the location can be passed when creating an instance of Optimus:

# Python
o = Optimus(config_path='path/to/config.json', ...)
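
As a minimal sketch of producing such a file, the snippet below writes a config with Python's json module and reads it back to confirm it parses. Only the "data" and "model" keys come from the format shown above; the paths are placeholders and the temporary directory is just a convenience for the example.

```python
import json
import tempfile
from pathlib import Path

# Only "data" and "model" are documented keys; both paths are placeholders.
settings = {
    "data": "location/to/data.csv",
    "model": "location/to/wiki.en.bin",
}

# Write the file to a throwaway directory (use your own location in practice).
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(settings, indent=2))

# Reading it back confirms the file is valid JSON.
loaded = json.loads(config_path.read_text())
print(loaded["data"])
```

The resulting path can then be passed as config_path when constructing Optimus.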

Further settings can be passed on an ad hoc basis and will override the corresponding values from the config file. To do so, pass valid keyword arguments to the Optimus constructor like so:

# Python
o = Optimus(
    config_path='path/to/config.json',
    data='path/to/new_data.csv',
    cutoff=6,
    ...
)

Optimus has a default settings file to fall back on if none of this is provided. However, relying on the default settings alone might cause issues, mainly because the paths to the data and models in the default settings are unlikely to be accurate for your machine.

The file etc/config.json stores the default arguments used by Optimus. Please do not edit this file.

Shortened reference:

obj = Optimus() -> Uses default settings
obj = Optimus(config_path='path/to/user/config.json') -> Uses custom config file
obj = Optimus(distance=10, stepsize=2, cutoff=16, ...) -> Overrides specific parameter values defined in the config file
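
The precedence described above (defaults, then the user config file, then keyword arguments) can be illustrated with a small sketch. The resolve_settings helper and the default values below are hypothetical and are not taken from the Optimus source; they only show the layering idea.

```python
import json

def resolve_settings(defaults, config_text=None, **overrides):
    """Hypothetical helper: later layers overwrite earlier ones."""
    settings = dict(defaults)                      # 1. start from the defaults
    if config_text is not None:
        settings.update(json.loads(config_text))   # 2. apply the user config file
    settings.update(overrides)                     # 3. apply keyword arguments last
    return settings

# Made-up values, for illustration only.
defaults = {"data": "default.csv", "cutoff": 16, "stepsize": 1}
user_config = '{"data": "path/to/data.csv", "cutoff": 10}'

resolved = resolve_settings(defaults, user_config, cutoff=6)
print(resolved)
# {'data': 'path/to/data.csv', 'cutoff': 6, 'stepsize': 1}
```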

Running the code & getting outputs

Optimus takes pandas.core.series.Series objects as input. To run a configured Optimus object on a series, call the object with the series as its argument. For example, for a pandas series called text:

from optimus import Optimus

O = Optimus()
results = O(text)

NOTE: If no data is passed into the Optimus object, the data defined in the config file will be used.
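
If your text lives in a DataFrame column, a series suitable for passing in might be prepared along these lines. The column name description and the cleaning steps (dropping missing values, stripping whitespace) are assumptions made for this example; Optimus may not require them.

```python
import pandas as pd

# Toy data; the "description" column name is an assumption for this example.
frame = pd.DataFrame(
    {"description": ["  red apple", None, "green apple ", "bicycle"]}
)

# Selecting a single column yields a pandas Series.
text = (
    frame["description"]
    .dropna()                  # remove missing entries
    .astype(str)               # make sure every entry is a string
    .str.strip()               # trim surrounding whitespace
    .reset_index(drop=True)    # tidy the index after dropping rows
)

print(type(text).__name__)
print(text.tolist())
```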

Additional arguments to Optimus

save_csv: An optional keyword argument. If save_csv=True, Optimus saves the output DataFrame, which includes all the labels from each iteration, in the working directory as labelled.csv.
full: If you only need the DataFrame returned rather than saved, pass full=True to receive the DataFrame containing the mapped labels.
verbose: A boolean that dictates how much is printed to the console as the code runs. Some output is still printed when verbose=False, to give an idea of how the processing is progressing.

Managing Memory

The fastText model is large and requires a sizeable amount of RAM. Each instance of Optimus loads its own fastText model on the first processing call. It does this by checking whether a model has already been loaded and, if not, performing a ft.load_model() operation. Once it is loaded, all subsequent runs on the same instance of Optimus should not reload the model.
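
The load-on-first-call behaviour can be pictured with the following sketch. LazyModelHolder and fake_loader are hypothetical stand-ins written for this README, not the Optimus implementation; the loader argument plays the role that ft.load_model() plays in the description above.

```python
class LazyModelHolder:
    """Hypothetical illustration of loading a model on first use."""

    def __init__(self, model_path, loader):
        self.model_path = model_path
        self._loader = loader   # function that loads a model from a path
        self._model = None      # nothing is loaded at construction time

    def __call__(self, series):
        if self._model is None:
            # First processing call: load the model once and keep it.
            self._model = self._loader(self.model_path)
        return [self._model(item) for item in series]


load_calls = []

def fake_loader(path):
    """Toy loader that records each call and returns a trivial 'model'."""
    load_calls.append(path)
    return str.upper

holder = LazyModelHolder("path/to/model.bin", fake_loader)
holder(["a", "b"])   # triggers the load
holder(["c"])        # reuses the loaded model
print(len(load_calls))
```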

Replacing models and freeing memory

The Optimus object has a replace_model method, which provides a way to control the memory usage of the Optimus object. It allows a user either to load a new model in place of the current one or just to remove the loaded model from the Optimus object.

The method takes either a path string or an already loaded fastText model and assigns it to the Optimus object. If no model parameter is passed, the method will simply delete and garbage collect the existing loaded model.

# Python
o = Optimus(*args, **kwargs)
output = o(some_data)

# Load from a path
o.replace_model('string/path/to/model')

# Provide an already loaded model
o.replace_model(fastText.load_model('string/path/to/model'))

# Delete the existing model in the Optimus object
o.replace_model()
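
To make the three call styles concrete, here is a rough sketch of how such a method could dispatch on its argument. ModelSlot and its loader are hypothetical and the real replace_model may be implemented differently; the point is only the path / loaded model / nothing distinction and the explicit garbage collection.

```python
import gc

class ModelSlot:
    """Hypothetical sketch of a replace_model-style method."""

    def __init__(self, loader):
        self._loader = loader   # function that loads a model from a path
        self.model = None

    def replace_model(self, model=None):
        # Drop the current reference first so its memory can be reclaimed.
        self.model = None
        gc.collect()
        if isinstance(model, str):
            self.model = self._loader(model)   # a path was given: load from it
        elif model is not None:
            self.model = model                 # an already loaded model was given
        # With no argument, the slot simply stays empty.


slot = ModelSlot(loader=lambda path: {"loaded_from": path})
slot.replace_model("string/path/to/model")
from_path = slot.model
slot.replace_model({"loaded_from": "elsewhere"})
from_object = slot.model
slot.replace_model()
print(from_path, from_object, slot.model)
```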

Embedding plot functions

This pipeline comes with a helpful embedding visualiser module. This set of functions allows users to pass in a pandas series of text entries and a fastText model, and uses the model to embed these strings first into an n-dimensional space, which is then reduced to a 2-dimensional space using t-SNE.

The result is then plotted and exported to a fully interactive ‘embedding_plot.html’ file.

# Python
import pandas as pd
from lib.emplot import plot

series = pd.Series(['string1', ..., 'string2'])
plot(series=series,
     model='path/to/model.bin',
     output_path='output_vectors.csv')
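
The dimensionality reduction step can be sketched independently of the plotting module. The example below uses scikit-learn's TSNE on random vectors standing in for fastText embeddings; the choice of scikit-learn, the vector size of 300 and the perplexity value are assumptions for illustration and may not match what lib.emplot does internally.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# 40 random vectors standing in for embedded strings (300 dimensions assumed).
vectors = rng.normal(size=(40, 300))

# Reduce to two dimensions; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
coords = tsne.fit_transform(vectors)

print(coords.shape)  # (40, 2)
```

Each row of coords is an (x, y) position that a plotting library could then draw.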

Working with large datasets

Ward linkage is computationally expensive. The process needs to calculate a pairwise distance matrix for all of the embedded vectors, and the memory this consumes is of order $n^2$ for $n$ data points. When you factor in that the fastText embedding models are already gigabytes in size, this can become a problem.
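
A back-of-the-envelope estimate shows how quickly this grows. The sketch below assumes 8-byte floating point distances and, by default, a condensed matrix that stores each of the $n(n-1)/2$ pairs once; the storage layout actually used by the pipeline is not stated here, so treat the figures as an order-of-magnitude guide.

```python
def pairwise_matrix_bytes(n_points, bytes_per_value=8, condensed=True):
    """Estimate the memory needed for a pairwise distance matrix."""
    if condensed:
        pairs = n_points * (n_points - 1) // 2   # each pair stored once
    else:
        pairs = n_points * n_points              # full square matrix
    return pairs * bytes_per_value


for n in (10_000, 100_000, 1_000_000):
    gib = pairwise_matrix_bytes(n) / 2**30
    print(f"{n:>9,} points -> {gib:,.1f} GiB")
```

Under these assumptions, 100,000 points already need roughly 37 GiB for the distances alone.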

Where the data starts to push the boundaries of what is available to the process, we currently recommend sampling your data points, using Optimus to label the sampled points, and then using (for example) a k-nearest-neighbours (kNN) classifier to ‘smear’ the generated labels across the points nearby.

Example code to do this is provided in the sampling/ directory. The program takes a simple random sample of the contents of your list and embeds these words, before using the approach outlined above to generate labels for the out-of-sample words. This approach is naive, but can provide a starting point for more complex sampling mechanisms such as the use of apricot.
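
The overall idea can be sketched with scikit-learn as follows. This is not the code in sampling/: the random vectors stand in for embedded words, AgglomerativeClustering with Ward linkage stands in for the labelling that Optimus would perform on the sample, and the sample size and number of neighbours are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two well separated blobs standing in for embedded words.
vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(200, 20)),
    rng.normal(loc=5.0, scale=0.5, size=(200, 20)),
])

# 1. Take a simple random sample of the points.
sample_idx = rng.choice(len(vectors), size=60, replace=False)
sample = vectors[sample_idx]

# 2. Label only the sample (stand-in for running Optimus on it).
clusterer = AgglomerativeClustering(n_clusters=2, linkage="ward")
sample_labels = clusterer.fit_predict(sample)

# 3. Spread the labels to every point via its nearest labelled neighbours.
knn = KNeighborsClassifier(n_neighbors=5).fit(sample, sample_labels)
all_labels = knn.predict(vectors)

print(len(all_labels), sorted(set(all_labels.tolist())))
```

Only the 60 sampled points go through the expensive pairwise step; the remaining points are labelled by the much cheaper neighbour lookup.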
