Quick Start
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Documentation on the methods utilised and how Optimus functions is pending. This README will be updated to include links to this material once it is made available.
Software Prerequisites
You will need the following tools to set up and use Optimus:
- A modern macOS or Linux installation. Windows is not supported; you are on your own if you try it there.
- curl
- zsh
- Python 3.6 or later
- git
Quick Start example
A quick start example script, example.py, in the root directory demonstrates how to use the pipeline. The final dataset is written to optimus_results.csv, also in the root directory.
A graphical UI for running Optimus
To make the tool more accessible, a web-app-based UI was developed. This user interface helps you process data without writing any Python code.
If this interests you, please read this README.md file for more info.
Importing Optimus
Import Optimus into Python either as the whole module
import optimus
or by importing the Optimus classes
from optimus import Optimus
Customise settings for Optimus
The pipeline is configured with a config.json file in the following format:
{
"data":"location/to/data.csv",
"model":"location/to/wiki.en.bin"
...
}
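As a minimal sketch, a file in this format can be generated from Python with the standard json module. Only the "data" and "model" keys are taken from the example above; the remaining keys are elided there and are not guessed here.

```python
import json
from pathlib import Path

# Only "data" and "model" appear in the example above; other keys are elided there.
config = {
    "data": "location/to/data.csv",
    "model": "location/to/wiki.en.bin",
}

# Write the settings out as config.json in the current directory.
config_path = Path("config.json")
config_path.write_text(json.dumps(config, indent=4))

# Read the file back to confirm that it is valid JSON.
loaded = json.loads(config_path.read_text())
```

Writing the file with json.dumps rather than by hand avoids trailing-comma and quoting mistakes that would stop the file from parsing.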
After creating a config.json file, the location can be passed when creating an instance of Optimus:
# Python
o = Optimus(config_path='path/to/config.json', ...)
Further settings can be added on an ad hoc basis and will overwrite any previous settings. To do so, pass valid keyword arguments to the Optimus class upon construction:
# Python
o = Optimus(
    config_path='path/to/config.json',
    data="path/to/new_data.csv",
    cutoff=6,
    ...
)
Optimus falls back on a default settings file if none of this is provided. However, relying on the default settings alone might cause issues, mainly because the paths to the data and models in the default settings may not be accurate.
The file etc/config.json stores the default arguments used by Optimus. Please do not edit this file.
Shortened reference:
obj = Optimus()
-> Uses default settings
obj = Optimus(config_path='path/to/user/config.json')
-> Uses custom config file
obj = Optimus(distance=10, stepsize=2, cutoff=16 ...)
-> Replaces specific parameter values instead of those defined in the config file.
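The precedence described above (defaults, then the config file, then keyword arguments) can be illustrated with a small sketch. This is not Optimus's own code: the resolve_settings helper and the default values are hypothetical, and the sketch assumes the sources are merged key by key.

```python
import json

def resolve_settings(defaults, config_path=None, **overrides):
    """Merge settings so that later sources win: defaults < config file < keyword arguments."""
    settings = dict(defaults)
    if config_path is not None:
        # Values from the user's config file replace the defaults.
        with open(config_path) as handle:
            settings.update(json.load(handle))
    # Keyword arguments replace anything set so far.
    settings.update(overrides)
    return settings

# Hypothetical default values, for illustration only.
defaults = {"data": "default.csv", "cutoff": 10, "stepsize": 1}
merged = resolve_settings(defaults, cutoff=6)
```

Here merged keeps the default data and stepsize values, while cutoff takes the value passed as a keyword argument.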
Running the code & getting outputs
Optimus takes in pandas.core.series.Series objects. To run a configured Optimus object on a series, simply call the object with the desired series as its argument. For example, for a pandas series called text:
from optimus import Optimus
O = Optimus()
results = O(text)
NOTE: If no data is passed into the Optimus object, the data defined in the config file will be used.
Additional arguments to Optimus
- save_csv: One can pass save_csv as an optional keyword argument. If the value is set to save_csv=True, this will force Optimus to save the output DataFrame, which includes all the labels from each iteration, in the working directory as labelled.csv.
- full: Similarly, if one just needs a DataFrame to be returned and not saved, use the full=True setting to receive back the DataFrame containing the mapped labels.
- verbose: A boolean value which dictates how much is printed to the console as the code runs. Some output is still printed to the console even if verbose=False, as this gives some idea of the progress of the processing.
Managing Memory
The fastText model is large and requires a sizeable amount of RAM. Each instance of Optimus loads its own fastText model on the first processing call. It does this by checking whether a model has been loaded before and, if not, performing a ft.load_model() operation. Once it is loaded, all subsequent runs (based on the same instance of Optimus) should not reload the model.
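The load-on-first-use behaviour described above follows a common lazy-loading pattern. The sketch below illustrates that pattern only; LazyModelHolder and fake_loader are hypothetical names, and the loader stands in for an expensive fastText load.

```python
class LazyModelHolder:
    """Hypothetical stand-in for the load-on-first-use behaviour described above."""

    def __init__(self, model_path, loader):
        self.model_path = model_path
        self._loader = loader   # e.g. a fastText load_model function
        self._model = None      # nothing is loaded at construction time

    @property
    def model(self):
        # Load only if no model has been loaded before.
        if self._model is None:
            self._model = self._loader(self.model_path)
        return self._model


load_calls = []

def fake_loader(path):
    # Stand-in for an expensive model load; records each call.
    load_calls.append(path)
    return {"path": path}

holder = LazyModelHolder("path/to/model.bin", fake_loader)
first = holder.model    # triggers the load
second = holder.model   # reuses the model loaded above
```

Because the loaded model is cached on the instance, two separate instances would each pay the loading cost and hold their own copy in RAM.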
Replacing models and freeing memory
The Optimus object has a replace_model method, which provides a way to control the memory usage of the Optimus object. It allows a user to replace the loaded model with a new one, or just to remove the loaded model from the Optimus object.
The method takes either a path string or a loaded fastText model and assigns it to the Optimus object. If no model parameter is passed, the method will simply delete and garbage collect the existing loaded model.
# Python
o = Optimus(args, kwargs)
output = o(some_data)

# Load from a path
o.replace_model('string/path/to/model')

# Provide an already loaded model
o.replace_model(fastText.load_model('string/path/to/model'))

# Delete the existing model in the Optimus object
o.replace_model()
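For readers curious how a method with these three call styles might be structured, here is a minimal sketch. It is an assumption-laden illustration, not the Optimus implementation: ModelSlot is a hypothetical class and the loader is a stand-in for a real fastText load.

```python
import gc

class ModelSlot:
    """Hypothetical holder mimicking the replace_model behaviour described above."""

    def __init__(self, loader):
        self._loader = loader
        self.model = None

    def replace_model(self, model=None):
        # Drop the current model first so that its memory can be reclaimed.
        self.model = None
        gc.collect()
        if isinstance(model, str):
            # A string is treated as a path and loaded.
            self.model = self._loader(model)
        elif model is not None:
            # Anything else is assumed to be an already loaded model.
            self.model = model


slot = ModelSlot(loader=lambda path: {"loaded_from": path})

slot.replace_model("string/path/to/model")
from_path = slot.model

slot.replace_model({"preloaded": True})
preloaded = slot.model

slot.replace_model()
after_delete = slot.model
```

Releasing the old model before loading the new one avoids holding two large models in RAM at the same time, at the cost of having no model at all if the new load fails.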
Embedding plot functions
This pipeline comes with an embedding visualiser module. Its functions take a pandas series of text entries and a fastText model, use the model to embed the strings into an n-dimensional space, and then reduce the embeddings to a 2-dimensional space using t-SNE.
The result is then plotted and exported to a fully interactive ‘embedding_plot.html’ file.
# Python
import pandas as pd
from lib.emplot import plot
series = pd.Series(['string1', ..., 'string2'])
plot(series=series,
     model='path/to/model.bin',
     output_path='output_vectors.csv')
Working with large datasets
Ward linkage is computationally expensive. The process needs to calculate a pairwise distance matrix for all of the embedded vectors, and the memory this consumes is of order $n^2$ for $n$ data points. When you factor in that the fastText embedding models are already gigabytes in size, this can become a problem.
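To get a feel for the quadratic growth, the rough estimate below computes the memory needed for the pairwise distances alone. It assumes 8-byte floating point values stored in condensed form, that is, only the n(n-1)/2 unique pairs; the condensed_distance_bytes helper is a hypothetical name, and actual usage inside Optimus may differ.

```python
def condensed_distance_bytes(n_points, bytes_per_value=8):
    """Bytes needed for the n*(n-1)/2 unique pairwise distances between n points."""
    return n_points * (n_points - 1) // 2 * bytes_per_value

# Print the estimate in GiB for a few dataset sizes.
for n in (10_000, 100_000, 1_000_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>9,} points -> {gib:,.1f} GiB")
```

Each tenfold increase in the number of points multiplies the estimate by roughly one hundred, which is why sampling becomes attractive well before the data itself feels large.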
Where the data starts to push the boundaries of the memory available to the process, we currently recommend sampling your data points, using Optimus to label the sampled points, and then using (for example) a k-nearest-neighbours (kNN) classifier to ‘smear’ the generated labels across the nearby points.
Example code to do this is provided in the sampling/ directory. The program takes a simple random sample of the content of your list and embeds these words, before using the approach outlined above to generate labels for the out-of-sample words. This approach is naive, but can provide a starting point for more complex sampling mechanisms such as the use of apricot.
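The label ‘smearing’ step can be sketched in a few lines of plain Python. This is not the code in the sampling/ directory: the knn_label helper, the toy 2-D vectors and the labels are all made up for illustration, and a real pipeline would more likely use fastText embeddings with a library classifier such as scikit-learn's KNeighborsClassifier.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_label(point, labelled_points, labels, k=3):
    """Return the majority label among the k labelled points nearest to `point`."""
    nearest = sorted(
        range(len(labelled_points)),
        key=lambda i: squared_distance(point, labelled_points[i]),
    )[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D "embeddings": the sampled points already carry labels (e.g. from Optimus).
sample_vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                  (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
sample_labels = ["fruit", "fruit", "fruit", "tool", "tool", "tool"]

# Out-of-sample points receive the majority label of their nearest sampled neighbours.
unlabelled = [(0.3, 0.3), (4.8, 5.1)]
propagated = [knn_label(p, sample_vectors, sample_labels) for p in unlabelled]
```

Only the sampled points go through the expensive clustering step; every other point costs one pass over the sample, so memory use stays proportional to the sample size rather than to the square of the full dataset.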