Every day I'm wofflin'

Optimus has now been released. I don’t think it is unfair to say that, during its development, some of our design choices were made to meet deadlines and customer needs. Now that it has been delivered I am in a position to look back over the work and ask “how would I do that differently if I had the time to do it all again?”.

A blank canvas

To begin the redesign I am very lucky to have a growing and knowledgeable group of people forming the Campus’ NLP group. This allowed me to take stock of the wealth of projects we have completed and are currently tackling (see: 1 for the current list) as well as looking back at optimus. In doing so I recognised that almost all of these projects fell into a 4 step structure:

  • Parse
  • Embed
  • Collect
  • Label

and that I could very easily abstract these to build a generic project template for future work, but I wanted more!
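As a rough sketch of what such a generic template might look like (this is not woffle's actual code; the run function, the toy_* stand-ins and all of the type signatures are assumptions made purely for illustration), the four steps can be passed in as plain functions:

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def run(
    texts: Sequence[str],
    parse: Callable[[Sequence[str]], List[str]],
    embed: Callable[[Sequence[str]], List[Vector]],
    collect: Callable[[Sequence[Vector]], List[int]],
    label: Callable[[Sequence[str], Sequence[int]], List[Tuple[str, str]]],
) -> List[Tuple[str, str]]:
    """Run the four steps in order; each step consumes earlier outputs."""
    cleaned = parse(texts)
    vectors = embed(cleaned)
    groups = collect(vectors)
    return label(cleaned, groups)


# Hypothetical toy steps so that the sketch runs end to end.
def toy_parse(texts):
    return [text.strip().lower() for text in texts]


def toy_embed(texts):
    # A one-dimensional "embedding": the length of the text.
    return [[float(len(text))] for text in texts]


def toy_collect(vectors):
    # Two groups, split on an arbitrary length threshold.
    return [0 if vector[0] < 6 else 1 for vector in vectors]


def toy_label(texts, groups):
    names = {0: "short", 1: "long"}
    return [(text, names[group]) for text, group in zip(texts, groups)]


result = run([" Hi ", "Hello world"], toy_parse, toy_embed, toy_collect, toy_label)
print(result)
```

Swapping any one step for another then only means passing a different function with the same shape.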

Enter: woffle

Over the course of development of optimus many new and important NLP techniques were realised, including BERT, ELMo, ULMFit, OpenAI Transformer and flair. Any project should be allowed to pick up and insert their model of choice with the least effort required. The problem? All of these different techniques offered different interfaces with different models to download. The aim of woffle became to build a project template in which it does not matter what mechanism you choose to do your embeddings; they should all provide the same interface and slot very easily into the existing structure of your project.

An idea was born; the next problem was implementing it. As a mathematician by trade and a Haskeller by nature I am far more comfortable working with functions than anything else. I sought out a means by which to turn all of the processing for any project into a composable combination of almost pure functions. The tasks identified above are a great starting point: the output of each step should be the input of the next, and thus composition becomes possible.

Parse

Parsing is the first and probably most important step of any of the NLP tasks I have encountered to date: cleaning your text ready to embed it. This task can be further broken down into two typical ‘types’ of parsing: probabilistic and deterministic. If we write these as separate components, does this make it much harder to control the high level 4 step plan above? The answer is no. For this task we build a parser combinator by composing (in the mathematical sense) the two different types of parser to make a new parser which performs all of the functionality of both. So we can write the parsers we need and then combine them to make one all-encompassing ‘parse’. This is all made possible because text parsing is an endomorphism on the class of strings: put any text in and you should get some text back. The composition becomes a chain of mappings, and these can be performed in any order without straying from the correct domain.

The problem of handling the combination of two different types of parser was easily solved using a higher order function (compose in our repository) to return a new parser given different parsers as its inputs. There was still the task of allowing user defined regular expressions to be evaluated at run time without clogging up the code. I felt the easiest way to handle this was through the use of two plain text files in ./etc (because I’m a *nix kinda guy). One was to handle the regex removals and the other was there to deal with replacements which may impact your embedding and obscure the true meaning of the content. The newly composed parser reads these in at import time and then just gets on with its job.
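To make the idea concrete, here is a minimal sketch of composing string-to-string parsers. It is not the code from the repository: the helper names (removal_parser, replacement_parser, tidy_) and the example rules are hypothetical, and the rules are given inline rather than read from the files in ./etc so that the example is self-contained.

```python
import re
from functools import reduce
from typing import Callable, Iterable, Tuple

Parser = Callable[[str], str]  # a parser maps text back to text


def compose(*functions: Parser) -> Parser:
    """Mathematical composition: compose(g, f)(x) == g(f(x))."""
    return reduce(lambda g, f: lambda x: g(f(x)), functions)


def removal_parser(patterns: Iterable[str]) -> Parser:
    """Build a parser which deletes every match of the given patterns."""
    compiled = [re.compile(pattern) for pattern in patterns]

    def parse_(text: str) -> str:
        for pattern in compiled:
            text = pattern.sub("", text)
        return text

    return parse_


def replacement_parser(rules: Iterable[Tuple[str, str]]) -> Parser:
    """Build a parser which applies (pattern, replacement) rules in order."""
    compiled = [(re.compile(pattern), new) for pattern, new in rules]

    def parse_(text: str) -> str:
        for pattern, new in compiled:
            text = pattern.sub(new, text)
        return text

    return parse_


def tidy_(text: str) -> str:
    return text.strip().lower()


# Hypothetical rules standing in for the contents of the two plain text files.
remove_ = removal_parser([r"\d+", r"[^\w\s]"])
replace_ = replacement_parser([(r"\bkg\b", "kilogram"), (r"\s+", " ")])

# Applied right to left: remove first, then replace, then tidy.
parse_ = compose(tidy_, replace_, remove_)
print(parse_("2 kg of Apples!!"))
```

Because every piece has the type of text in, text out, the composed result is itself a parser and can be composed again.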

This is where I first decided that woffle should be an opinionated template: catering for all of the possible probabilistic parsing techniques is very difficult, as the right choice really depends on the aim of the work. This led to a new condition on the work: if the opinion programmed into a step isn’t suitable for the task at hand, then the functions must be easy for advanced users to extend.

Embed

Computers are not capable of really understanding text. However, in recent years we have seen great strides in what computers can do with numeric representations of the text they are given. Currently continuous vector embeddings are all the rage, but we need to expand our concept of ‘embedding’ to mean generating any numeric representation of the text. This allows our embeddings to cater not only for the strongest convnet techniques but also for discrete vector representations through TF-IDF type mechanisms, as well as sentiment scoring. To keep this general enough, it is sufficient that each technique presents an interface called embed, which is a function from text into the numeric spaces mentioned above (I have questions about whether it is truly a ‘vector space’ in the case of the continuous embeddings but that’s for another day).
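A small sketch of what ‘the same interface’ could mean in practice. Neither embedding here is one that woffle ships: the make_count_embed and make_sentiment_embed factories, the vocabulary and the lexicon are all made up for illustration. The point is only that both hand back an embed with the same shape, so the rest of a program need not care which one it was given.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Vector = List[float]
Embed = Callable[[Iterable[str]], List[Vector]]


def make_count_embed(vocabulary: Sequence[str]) -> Embed:
    """Discrete representation: counts of each vocabulary word."""

    def embed_(text: str) -> Vector:
        tokens = text.split()
        return [float(tokens.count(word)) for word in vocabulary]

    def embed(texts: Iterable[str]) -> List[Vector]:
        return [embed_(text) for text in texts]

    return embed


def make_sentiment_embed(lexicon: Dict[str, float]) -> Embed:
    """One-dimensional representation: mean lexicon score of the tokens."""

    def embed_(text: str) -> Vector:
        tokens = text.split()
        if not tokens:
            return [0.0]
        return [sum(lexicon.get(token, 0.0) for token in tokens) / len(tokens)]

    def embed(texts: Iterable[str]) -> List[Vector]:
        return [embed_(text) for text in texts]

    return embed


texts = ["good good film", "bad film"]

count_embed = make_count_embed(["good", "bad", "film"])
sentiment_embed = make_sentiment_embed({"good": 1.0, "bad": -1.0})

# Either function can be dropped into the same place in a pipeline.
for embed in (count_embed, sentiment_embed):
    print(embed(texts))
```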

It makes no sense to compose different embeddings together (just consider the type of a continuous representation: it takes text in and gives vectors back, so the output of one embedding is not a valid input for another). All we need for this is to wrap the embedding mechanism and to have a model for it to run against to look up the vectors. Models are managed through another configuration file, this time in the root directory of woffle because of its importance. This is config.ini, which specifies the names of the models we prefer for each of the tasks which use them.
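For readers unfamiliar with INI files, this is roughly how such a file can be read with the standard library. The section names, the model key and the values below are assumptions for the sake of the example and are not the actual layout of woffle's config.ini.

```python
from configparser import ConfigParser

# Hypothetical contents; the real config.ini may be laid out differently.
EXAMPLE_CONFIG = """
[fasttext]
model = wiki.en.bin

[bert]
model = bert-base-uncased
"""


def model_for(config: ConfigParser, task: str) -> str:
    """Look up the preferred model name for a task (assumed key: 'model')."""
    return config[task]["model"]


config = ConfigParser()
config.read_string(EXAMPLE_CONFIG)  # config.read("config.ini") for a real file

print(model_for(config, "bert"))
```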

Managing all of these embedding mechanisms and their required models (you definitely do not want to train your own BERT model) becomes difficult, as many of them are very large and come with their own prerequisites. To deal with this I have introduced a Makefile for managing the project. GNU make is, in my opinion, the best means of managing this kind of program where you will be pulling in and discarding various different things as you test them and develop your outputs. It provides exactly the type of flexibility we need: it can even control the testing suite or run your main.py. Over time this has become longer and more complicated to accommodate the plethora of options available for the two example programs we include (a sentiment analyser and a version of optimus built using woffle).

Collect

The special thing about optimus is the mechanism by which it generates labels for clustered text items. The main aim of woffle is to replicate this functionality and so the same clustering mechanism (Ward linkage) is available in woffle. However, there are plenty of other ways of collecting together those points which are ‘alike’ in order to understand the wider collection of text. For example it is plausible that you may wish to segment an interval to assign categories for sentiment analysis. These all have in common that they map your numeric space into a richer representation in which the points are grouped. Given a numeric representation space, the collection embeds this set into a space with one extra dimension, where the extra dimension assigns to each of our vectors an integer representing the cluster that the vector belongs to.

In reality, keeping all of the vectors and this extra dimension is not a great idea when there are likely to be a large number of vectors. Instead our collection mechanism returns only this last dimension and requires the original text or the embedding vectors to be available to later pieces of the program. This choice breaks the composability of the program into one total function, but I feel it makes the tasks in the process easier to break up and makes the main.py you write a little less visually dense.
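As an illustration of a collection step which returns only the extra dimension, here is a sketch of the interval segmentation mentioned above. It is deliberately not Ward linkage, and the make_interval_collect helper and the boundary values are assumptions chosen for the example.

```python
from bisect import bisect_right
from typing import Callable, Iterable, List, Sequence

Vector = List[float]


def make_interval_collect(
    boundaries: Sequence[float],
) -> Callable[[Iterable[Vector]], List[int]]:
    """Segment the real line at the given boundaries and number the pieces."""
    cuts = sorted(boundaries)

    def collect_(vector: Vector) -> int:
        # Only the first coordinate is used: this sketch assumes 1-d scores.
        return bisect_right(cuts, vector[0])

    def collect(vectors: Iterable[Vector]) -> List[int]:
        # Return just the group numbers, not the vectors themselves.
        return [collect_(vector) for vector in vectors]

    return collect


collect = make_interval_collect([-0.25, 0.25])  # hypothetical thresholds
scores = [[-0.8], [0.0], [0.6], [0.25]]
groups = collect(scores)
print(groups)
```

The output has one integer per input vector, in the same order, so later steps can line it up against the original text with zip.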

Label

As I mentioned above, the labelling mechanism of optimus is its main selling point. We iteratively clustered our way up the dendrogram, relaxing what it meant to be part of a cluster at each stage so that the clusters became more and more generic as we went on. Building this in a mostly functional way proved interesting. I finally settled on a design which mimics that of optimus. The processing walks through a series of functions which decide on the lexical similarity of the cluster and apply the correct method based on this decision.

For other tasks this step may not be particularly beneficial but it does allow for text to be applied to categories if you have created them. A simple example of this is converting the raw sentiment score to a ‘good’, ‘neutral’ or ‘bad’ label.
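A sketch of that simple case, assuming the sentiment scores have already been collected into three integer categories. The category numbering, the NAMES table and the example texts are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed numbering of the categories produced by an earlier collect step.
NAMES = {0: "bad", 1: "neutral", 2: "good"}


def label_(group: int) -> str:
    """Name a single category."""
    return NAMES[group]


def label(texts: Iterable[str], groups: Iterable[int]) -> List[Tuple[str, str]]:
    """Pair each original text with the name of its category."""
    return [(text, label_(group)) for text, group in zip(texts, groups)]


texts = ["awful service", "it was fine", "loved it"]
groups = [0, 1, 2]
print(label(texts, groups))
```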

At this stage we consider our processing done. Starting from the original text we have assigned it some label or meaning in an almost composable manner, with little to no state outside of the functions.

Design decisions

So the above describes how projects using woffle are structured. In order to reach implementation in python there were a large number of hurdles to clear.

Functional programming in python

Using a Functional Programming (FP) approach in a language where some FP features are implemented and the rest are an afterthought is challenging. woffle makes liberal use of the functools module and map. These may not be the most pythonic of approaches but they allow us to make frequent use of generator expressions to reduce memory overhead and to perform the composition mentioned earlier.
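A brief sketch of the style being described, using only the standard library. The pipeline function and its three steps are invented for the example; the thing to notice is that map is lazy, so no text is processed until something consumes the result.

```python
import re
from functools import partial
from typing import Iterable, Iterator

# partial fixes the first two arguments of re.sub, leaving a str -> str function.
strip_digits = partial(re.sub, r"\d+", "")


def pipeline(texts: Iterable[str]) -> Iterator[str]:
    """Chain three str -> str steps lazily; nothing runs until iterated."""
    return map(str.lower, map(str.strip, map(strip_digits, texts)))


lazy = pipeline(["Apples 12", "  Pears 7 "])

# A generator expression consumes the pipeline one item at a time,
# so the intermediate lists are never built in memory.
total_length = sum(len(text) for text in lazy)
print(total_length)
```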

Namespaces are one honking great idea

Hopefully the title of this section is familiar to pythonistas, and whilst I agree that namespaces are great there comes a point in many projects where the namespaces are so deep that it can feel like using Java, given the length and complexity of the directory structure you develop. In order to navigate this problem we introduce the concept of themes into woffle. A theme is nothing more than a namespace simplification block which deals with the default options for the project type. Included with woffle at release are a hierarchical clustering theme, which provides you with the tools used for optimus (fasttext, Ward linkage etc.) under the design laid out above, as well as a sentiment analysis theme. We also include example programs which utilise the functions we provide to obtain the desired outputs.

The aim of the themes is only to simplify the code. Consider the following examples with and without the hierarchical clustering theme

from woffle.parse.deter.parse import parse
from woffle.embed.numeric.fasttext import embed
from woffle.cluster.deter import cluster
from woffle.select.lexical import select

vs

from woffle.hcluster import parse, embed, cluster, select

and what does the latter look like if we want to use BERT embeddings instead of the fasttext one?

from woffle.hcluster import parse, cluster, select
from woffle.embed.numeric.bert import embed

This gives less advanced users a means by which to get the job done but also allows more advanced users to compose their own pipeline (hopefully) pretty easily.

As an example of how to build these theme blocks, our hcluster.py looks like

# Setting the scene for doing hierarchical clustering as a task
# this obfuscates all of the much uglier name space stuff and gives you
# the bits you need for the task


import woffle.parse.deter.parse as dp
import woffle.parse.prob.spacy as pp

from woffle.embed.numeric.fasttext import embed, embed_
from woffle.cluster.deter import cluster

from woffle.functions.generics import compose, compose_

from woffle.select.lexical import select, select_

# parse  = compose(pp.parse,  dp.parse)
# parse_ = compose(pp.parse_, dp.parse_)
parse  = dp.parse
parse_ = dp.parse_

You will notice the extra functions ending in an underscore. These are the functions which get applied to a single text entity, and you can think of their non-underscore siblings as these functions mapped across the list of text. They do not need to be exposed but were particularly useful during building for extending and testing. You may also note that probabilistic parsing is turned off by default. This is because the target functionality, which it is mimicking, does not use probabilistic parsing until it reaches the labelling portion, long after the parsing is performed.
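The relationship between the two kinds of function can be sketched as follows. The mapped helper is hypothetical and is not part of woffle, the toy parse_ is a stand-in, and whether woffle's list versions return lists or lazy iterators is not something this sketch claims to know.

```python
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def mapped(function_: Callable[[A], B]) -> Callable[[Iterable[A]], List[B]]:
    """Lift a single-item function to one which works across a list."""

    def function(items: Iterable[A]) -> List[B]:
        return [function_(item) for item in items]

    return function


def parse_(text: str) -> str:
    """Single-item version: lower-case and collapse whitespace."""
    return " ".join(text.lower().split())


# The non-underscore sibling is just parse_ mapped across the texts.
parse = mapped(parse_)

print(parse_("  Hello   World "))
print(parse(["A  b", "C"]))
```

Keeping the single-item version around makes it easy to test one string at a time.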

Additional pieces

For the more advanced users who like to tweak things to their own liking or want to roll their own functions for each task we provide helper functions in woffle.functions.

We give you two versions of our composition function:

  • compose: mathematical composition of functions where compose(g, f)(x) is g(f(x))
  • compose_: a ‘programmer’ version where compose_(g, f)(x) is f(g(x)); this can be considered to be the same as compose but with the arguments reversed.

compose (and thus compose_) is a higher order function and is used liberally across all of woffle so it is important to become comfortable thinking of functions in this way.
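One possible way to write such a pair of functions is sketched below. These definitions are an assumption about the behaviour described above rather than a copy of what lives in woffle.functions, and the inc and double helpers exist only to show the difference in ordering.

```python
from functools import reduce
from typing import Any, Callable

Function = Callable[[Any], Any]


def compose(*functions: Function) -> Function:
    """Right-to-left composition: compose(g, f)(x) == g(f(x))."""
    return reduce(lambda g, f: lambda x: g(f(x)), functions)


def compose_(*functions: Function) -> Function:
    """Left-to-right composition: compose_(g, f)(x) == f(g(x))."""
    return compose(*reversed(functions))


def inc(x: int) -> int:
    return x + 1


def double(x: int) -> int:
    return x * 2


print(compose(inc, double)(3))   # inc(double(3))
print(compose_(inc, double)(3))  # double(inc(3))
```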

We also provide a number of functions for working with lists. It is worth noting that these are mostly catamorphisms, so they remove structure in some way. From left folds to our unpacking mechanisms for flattening nested structure, these functions are useful where you need to change lists into numbers or lists of lists into lists.
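Two small examples of the kind of structure-removing function meant here, written with the standard library. The names foldl and flatten are assumptions and may not match the helpers woffle actually provides.

```python
from functools import reduce
from itertools import chain
from operator import add
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def foldl(function: Callable[[B, A], B], initial: B, items: Iterable[A]) -> B:
    """Left fold: collapse a list into a single value, left to right."""
    return reduce(function, items, initial)


def flatten(nested: Iterable[Iterable[A]]) -> List[A]:
    """Remove one level of nesting: a list of lists becomes a list."""
    return list(chain.from_iterable(nested))


print(foldl(add, 0, [1, 2, 3]))    # a list into a number
print(flatten([[1, 2], [3], []]))  # a list of lists into a list
```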

Making waffles

So that’s it, woffle in a nutshell. If you want to give it a try or get involved see our repo and its documentation. We have issue templates for bugs and feature requests.

For more information on optimus please see the excellent post by my colleague Steven Hopkins, who defined all of the best parts of the process, here and my previous introductory post here. If woffle sounds bad or you would just rather use optimus as-is then you can find it here.

Happy hacking!

Inspiration

If you wonder why we have called this woffle rather than waffle, the name comes from an introductory chapter of the mathematical textbook Undergraduate Algebraic Geometry by Miles Reid. I thoroughly recommend this book to anybody who is interested in algebraic geometry and needs an introductory primer.

If you have never used the word ‘waffle’ or ‘waffling’ then you will find it means ‘to talk or write foolishly’, which, if you have managed to read this far, you can see I particularly excel at.


By Gareth Clews, Senior Data Scientist at ONS Data Science Campus.