Optimus has now been released. During its development it is fair to say that many of our design choices were those which would allow us to meet our deadlines and customer needs. Now that it has been delivered I am in a position to look back over the work and ask: how would I do it differently if I had the time to do it all again?
A blank canvas
To begin the redesign I am very lucky to have a growing and knowledgeable group of people forming the Campus' NLP group. This allowed me to take stock of the wealth of projects we have completed and are currently tackling (see [1] for the current list) as well as looking back at optimus. In doing so I recognised that almost all of these projects fell into a four step structure:

1. parse - clean the raw text
2. embed - turn the text into a numeric representation
3. cluster - collect 'alike' points together
4. select - assign a label or meaning to each group

and that I could very easily abstract these to build a generic project template for future work, but I wanted more!
Over the course of development of optimus many new and important NLP techniques were released, including BERT, ELMo, ULMFiT, the OpenAI Transformer and flair. Any project should be able to pick up and insert its model of choice with the least effort required. The problem? All of these different techniques offer different interfaces with different models to download. The aim of woffle became to build a project template in which it does not matter what mechanism you choose to do your embeddings; they should all provide the same interface and slot very easily into the existing structure of your project.
An idea was born, the next problem was implementing it. As a mathematician by trade and a Haskeller by nature I am far more comfortable working with functions than anything else. I sought out a means by which to turn all of the processing for any project into a composable combination of almost pure functions. The tasks identified above are a really great starting point from which to consider building these as the output of each step should be the input of the following and thus composition becomes possible.
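The idea can be sketched in a few lines of Python. The four stand-in functions below are purely illustrative toys (woffle's real functions are different); the point is only that, because each step's output is the next step's input, the whole project collapses into one composition:

```python
def compose(f, g):
    """compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))

# Stand-in implementations of the four steps, purely for illustration:
parse   = lambda texts: [t.lower().strip() for t in texts]
embed   = lambda texts: [[float(len(t))] for t in texts]      # toy 1-d 'embedding'
cluster = lambda vecs:  [0 if v[0] < 5 else 1 for v in vecs]  # toy threshold split
select  = lambda ids:   ["short" if i == 0 else "long" for i in ids]

# Each step feeds the next, so the pipeline is a single composed function
pipeline = compose(select, compose(cluster, compose(embed, parse)))
print(pipeline(["Hi ", "a longer sentence"]))   # -> ['short', 'long']
```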
The first, and probably most important, step of any NLP task I have encountered to date is cleaning your text ready for embedding. This task can be further broken down into two typical 'types' of parsing: probabilistic and deterministic. If we write these as separate components, does this make it much harder to control the high level four step plan above? The answer is no: for this task we build a parser combinator by composing (in the mathematical sense) the two different types of parser to make a new parser which performs all of the functionality of both. So we can write the parsers we need and then combine them to make one all-encompassing 'parse'. This is all made possible because text parsing is an endomorphism on the class of strings - put any text in and you should get some text back. The composition becomes a chain of mappings, and these can be performed in any order without straying from the correct domain.
The problem of handling the combination of two different types of parser was easily solved using a higher order function (compose in our repository) which returns a new parser given different parsers as its inputs. There was still the task of allowing user defined regular expressions to be evaluated at run time without clogging up the code. I felt the easiest way to handle this was through the use of two plain text files in ./etc (because I'm a *nix kinda guy). One handles the regex removals and the other deals with replacements which may impact your embedding and obscure the true meaning of the content. The newly composed parser reads these in at import time and then just gets on with its job.
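A minimal sketch of this arrangement might look like the following. The rule lists are hard-coded assumptions here (in woffle they would be read from the plain text files under ./etc), and the probabilistic parser is a do-nothing stand-in for something like a spaCy-based step:

```python
import re

# Hypothetical rule sets; woffle reads these from plain text files in
# ./etc at import time rather than hard-coding them.
removals = [r"<[^>]+>", r"\d+"]        # regexes whose matches are deleted
replacements = [(r"\s+", " ")]         # (pattern, replacement) pairs

def parse_deterministic(text: str) -> str:
    """Apply the removal rules, then the replacement rules."""
    for pat in removals:
        text = re.sub(pat, "", text)
    for pat, rep in replacements:
        text = re.sub(pat, rep, text)
    return text.strip()

def parse_probabilistic(text: str) -> str:
    """Stand-in for e.g. a spaCy-based lemmatiser; identity here."""
    return text

# Both parsers are str -> str (endomorphisms), so composing them
# yields another str -> str parser.
parse = lambda text: parse_probabilistic(parse_deterministic(text))

print(parse("<b>Item 42</b>  in   stock"))   # -> "Item in stock"
```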
This is where I first decided that woffle should be an opinionated template: handling every possible probabilistic parsing technique is very difficult, as the right choice depends on the aim of the work. This led to a new condition on the work: the functions must be easily extensible, so that if the opinions baked into the template do not suit the task at hand then advanced users can swap in their own.
Computers are not capable of really understanding text. However, in recent years we have seen great strides in what computers can do with numeric representations of the text they are given. Currently continuous vector embeddings are all the rage, but we are required to expand our concept of 'embedding' to that of generating any numeric representation of the text. This allows our embeddings not only to cater for the strongest convnet techniques but also to consider discrete vector representations through TF-IDF type mechanisms as well as sentiment scoring. To keep this general it is sufficient that each different technique present an interface called embed which is a function from text into the numeric spaces mentioned above (I have questions about whether it is truly a 'vector space' in the case of the continuous embeddings, but that's for another post).
It makes no sense to compose different embeddings together (just consider the type of a continuous representation: text goes in but numbers come out, so the output cannot be fed to another embedding). All we need for this step is to wrap the embedding mechanism and to have a model for it to run against to look up the vectors. Models are managed through another configuration file, this time in the root directory of woffle because of its importance. This is config.ini, and it specifies the names of the models we prefer for each of the tasks which use them.
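To make the shared interface concrete, here is a toy embed with the same shape of signature - a list of texts in, a list of numeric vectors out. The word-count scheme and the tiny vocabulary are illustrative assumptions; a fastText or BERT backend would expose the same name and signature but return learned continuous vectors:

```python
from typing import List

Vector = List[float]

def embed(texts: List[str]) -> List[Vector]:
    """Toy 'embedding': map each text to a fixed-length numeric vector
    by counting occurrences of a tiny assumed vocabulary. Purely
    illustrative of the interface, not of any real embedding model."""
    vocab = ["good", "bad", "stock"]
    return [[float(t.lower().split().count(w)) for w in vocab] for t in texts]

print(embed(["good good bad", "out of stock"]))
# -> [[2.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Because every backend presents the same embed name and signature, swapping one for another needs nothing more than changing an import.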
Managing all of these embedding mechanisms and their required models (you definitely do not want to train your own BERT model) becomes difficult, as many of them are very large and come with their own prerequisites. In order to deal with this I have introduced a Makefile for managing the models. make is, in my opinion, the best means of managing this kind of program, where you will be pulling in and discarding various different things as you test them and develop your outputs. It provides exactly the type of flexibility we need - it can even control the testing suite or run your main.py. Over time this Makefile has grown longer and more complicated to accommodate the plethora of options which are available for the two example programs we include (a sentiment analyser and a version of optimus built using woffle).
The special thing about optimus is the mechanism by which it generates labels for clustered text items. The main aim of woffle is to replicate this functionality, and so the same clustering mechanism (Ward linkage) is available in woffle. However, there are plenty of other ways of collecting together those points which are 'alike' in order to understand the wider collection of text. For example, it is plausible that you may wish to segment an interval to assign categories for sentiment analysis. These all still have a commonality: they map your numeric space into grouped points in that space. Given a numeric representation space, the collection step embeds this set into a space with one extra dimension, where the extra dimension assigns an integer to each of our vectors representing the cluster to which that vector belongs.
In reality, keeping all of the vectors plus this extra dimension is not a great idea when there are likely to be a large number of vectors. Instead our collection mechanism only returns this last dimension and requires the original text or the embedding vectors to be available to later pieces of the program. This choice breaks the composability of the program into one total function, but I feel it makes the tasks in the process easier to break up and makes the main.py you write a little less visually dense.
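A sketch of a collection step honouring this choice might look like the following. The nearest-centre assignment and the fixed centres are stand-in assumptions (woffle's default is Ward linkage); the point is that only the cluster numbers are returned, not the vectors:

```python
from typing import List

def cluster(vectors: List[List[float]]) -> List[int]:
    """Return only the extra 'cluster number' dimension for each vector,
    rather than carrying the vectors around with labels attached.
    Nearest-centre assignment with fixed centres is a toy stand-in
    for a real mechanism such as Ward linkage."""
    centres = [[0.0, 0.0], [10.0, 10.0]]   # assumed centres, for illustration
    def nearest(v: List[float]) -> int:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centres]
        return dists.index(min(dists))
    return [nearest(v) for v in vectors]

print(cluster([[1.0, 0.5], [9.0, 11.0], [0.2, 0.1]]))   # -> [0, 1, 0]
```

Later steps can zip these integers back up with the original text, which is exactly why the text (or the embeddings) must remain available downstream.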
As I mentioned above, the labelling mechanism of optimus is its main selling point. We iteratively clustered our way up the dendrogram, relaxing what it meant to be part of a cluster at each stage so that the clusters became more and more generic as we went. Building this in a mostly functional way proved interesting. I finally settled on a design which mimics that of optimus: the processing walks through a series of functions which decide on the lexical similarity of the cluster and apply the correct method based on this decision.
For other tasks this step may not be particularly beneficial, but it does allow categories to be applied to text if you have created them. A simple example of this is converting the raw sentiment score to a 'good', 'neutral' or 'bad' label.
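That simple example can be written directly. The cut-off points below are illustrative assumptions, not woffle's actual defaults:

```python
from typing import List

def select(scores: List[float]) -> List[str]:
    """Map raw sentiment scores in [-1, 1] onto coarse labels.
    The thresholds are assumed for illustration only."""
    def label(s: float) -> str:
        if s < -0.33:
            return "bad"
        if s > 0.33:
            return "good"
        return "neutral"
    return [label(s) for s in scores]

print(select([-0.8, 0.0, 0.9]))   # -> ['bad', 'neutral', 'good']
```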
At this stage our processing is done: the original text has passed through the whole pipeline and some label or meaning has been assigned to it, in an almost composable manner with little to no state outside of the functions.
So the above describes how projects using woffle are structured. In order to reach an implementation in python there were a large number of hurdles to clear.
Functional programming in python
Using a Functional Programming (FP) approach in a language where FP is a combination of some implemented features and afterthoughts for the rest is a challenge. woffle makes liberal use of the functools module and map. These may not be the most pythonic of approaches, but they allow us to make frequent use of generator expressions to reduce memory overhead and to perform the composition mentioned earlier.
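As a small illustration of this style (the data and the final fold are invented for the example): map and generator expressions are both lazy in python 3, so nothing is actually computed until the final reduce pulls values through the chain.

```python
from functools import reduce

texts = (" Hello ", " WORLD ", " again ")

# map and generator expressions are lazy: no stripping or lowering
# happens until reduce consumes the chain, so no intermediate lists
# are ever materialised.
cleaned = map(str.strip, texts)
lowered = (t.lower() for t in cleaned)
total_chars = reduce(lambda acc, t: acc + len(t), lowered, 0)

print(total_chars)   # -> 15
```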
Namespaces are one honking great idea
Hopefully the title of this section is familiar to pythonistas, and whilst I agree that namespaces are great, there comes a point in many pieces of work where the namespaces are so deep that it can feel like using Java, given the length and complexity of the directory structure you develop. In order to navigate this problem we introduce the concept of themes into woffle. A theme is nothing more than a namespace simplification block which deals with the default options for the project type. Included with woffle at release are a hierarchical clustering theme, which provides you with the tools used for optimus (fasttext, Ward linkage etc.) under the design laid out above, as well as a sentiment analysis theme. We also include example programs which utilise the functions we provide to obtain the desired outputs.
The aim of the themes is only to simplify the code. Consider the following examples without and with the hierarchical clustering theme:

from woffle.parse.deter.parse import parse
from woffle.embed.numeric.fasttext import embed
from woffle.cluster.deter import cluster
from woffle.select.lexical import select

from woffle.hcluster import parse, embed, cluster, select

and what does the latter look like if we want to use BERT embeddings instead of the fasttext one?

from woffle.hcluster import parse, cluster, select
from woffle.embed.numeric.bert import embed
This gives less advanced users a means by which to get the job done but also allows more advanced users to compose their own pipeline (hopefully) pretty easily.
As an example of how to build these theme blocks, our hcluster.py looks like:

# Setting the scene for doing hierarchical clustering as a task
# this obfuscates all of the much uglier name space stuff and gives you
# the bits you need for the task
import woffle.parse.deter.parse as dp
import woffle.parse.prob.spacy as pp

from woffle.embed.numeric.fasttext import embed, embed_
from woffle.cluster.deter import cluster
from woffle.functions.generics import compose, compose_
from woffle.select.lexical import select, select_

# parse = compose(pp.parse, dp.parse)
# parse_ = compose(pp.parse_, dp.parse_)
parse = dp.parse
parse_ = dp.parse_
You will notice the extra functions ending in an underscore: these are the functions which get applied to single text entities, and you can think of their non-underscore siblings as these same functions mapped across a list of texts. It is not necessary to expose these, but they were particularly useful during building, for extending and for testing. You may also note that probabilistic parsing is turned off by default. This is because the target functionality which it mimics does not use probabilistic parsing until it reaches the labelling portion, long after the parsing step is performed.
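The relationship between the underscore and non-underscore siblings can be sketched like this (a toy parse_ is shown; woffle's real functions differ, and may return lazy iterators rather than lists):

```python
from typing import List

def parse_(text: str) -> str:
    """Operate on a single text item."""
    return text.strip().lower()

def parse(texts: List[str]) -> List[str]:
    """The non-underscore sibling: the single-item function
    mapped across a list of texts."""
    return list(map(parse_, texts))

print(parse([" Foo ", "BAR"]))   # -> ['foo', 'bar']
```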
For the more advanced users who like to tweak things to their own liking, or who want to roll their own functions for each task, we provide helper functions in woffle.functions. We give you two versions of our composition function:

- compose: mathematical composition of functions, where compose(f, g)(x) is f(g(x))
- compose_: a 'programmer' version, where compose_(g, f)(x) is f(g(x)); this can be considered the same as compose but with the arguments reversed.

compose (and thus compose_) is a higher order function and is used liberally across all of woffle, so it is important to become comfortable thinking of functions in this way.
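A minimal sketch of the pair, assuming variadic versions built on functools.reduce (woffle's actual implementations may differ):

```python
from functools import reduce

def compose(*fns):
    """Mathematical composition: compose(f, g)(x) == f(g(x)).
    The rightmost function is applied first."""
    return lambda x: reduce(lambda acc, fn: fn(acc), reversed(fns), x)

def compose_(*fns):
    """'Programmer' composition: compose_(g, f)(x) == f(g(x)).
    The leftmost function is applied first - compose with the
    arguments reversed."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)

inc = lambda x: x + 1
dbl = lambda x: x * 2

print(compose(inc, dbl)(3))    # inc(dbl(3)) -> 7
print(compose_(dbl, inc)(3))   # inc(dbl(3)) -> 7
```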
We also provide a number of functions for working with lists. It is worth noting that these are mostly catamorphisms and so remove structure in some way. From left folds to our unpacking mechanisms for flattening nested structure, these functions are fairly useful where you need to change lists into numbers or lists of lists into lists.
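Helpers of this kind might look like the following (the names foldl and flatten are illustrative, not necessarily woffle's actual API):

```python
from functools import reduce
from itertools import chain

def foldl(f, acc, xs):
    """Left fold: collapse a list to a single value - a catamorphism,
    since the list structure is removed."""
    return reduce(f, xs, acc)

def flatten(xss):
    """Remove one level of nesting: lists of lists -> list."""
    return list(chain.from_iterable(xss))

print(foldl(lambda a, b: a + b, 0, [1, 2, 3]))   # -> 6
print(flatten([[1, 2], [3], []]))                # -> [1, 2, 3]
```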
For more information on optimus please see the excellent post by my colleague Steven Hopkins (who defined all of the best parts of the process) here, and my previous introductory post here. If woffle sounds bad, or you would just rather use optimus as-is, then you can find it here.
If you wonder why we have called this woffle rather than waffle, the name comes from the introductory chapter of an undergraduate mathematics textbook by Miles Reid. I thoroughly recommend this book for anybody who is interested in algebraic geometry and needs an introductory primer.
If you have never come across the phrase 'waffle' or 'waffling', you will find it means 'to talk or write foolishly' - which, if you have managed to read this far, you can see I particularly excel at.