Introduction
woffle
is built to allow users to perform their NLP based tasks through a
common interface regardless of their choice of back end processing. The
directory hierarchy of the program is task based and because there are so many
different tasks, which can be split further into probabilistic and deterministic
tasks and so the.import.namespaces.look.rather.messy.and.long
. To circumvent
having to import these things for all tasks there are pre-packaged import blocks
which we refer to as themes.
A theme is a way of masking the import block for specific tasks giving the functions to use with less hassle. The following themes are included in the current version of the repository
Each theme comes with default parsing, embedding (sentiment can be considered as a mapping from a string into a numeric representation space so the interface is still referred to as an embedding), clustering and representative label selection for the cluster options. Please see the links above (or in the sidebar) for instructions on how to use each of these themes and what their default builds look like.
If a theme does not meet your needs then I hope that this project makes it easy enough for you to compose your own processing means. Each step is supposed to be composable with the previous so long as the correct data type is presented at each return statement. For more information on how to build a custom processing procedure then please see the advanced use page.
The breakdown of a theme
Each theme consists of at most four parts
- Parsing
- Embedding
- Grouping
- Labelling
where these parts are purposefully left quite loosely defined. This allows us to consider sentiment scores as a mean of embedding by thinking it as mapping text onto the numeric value corresponding to its sentiment. The internal workings of each of these things can differ based on projects but we offer some opinionated defaults for each theme.
1. Parsing
An endomorphism of strings, the default properties of this function across all
projects are deterministic replacements of some items based on matching regular
expressions. The list of all expressions to match is defined in ./etc/regex
in the form pattern = replacement
. It is also possible to encode specific
patterns into a replacement and this is split out into ./etc/encoding
but the
functionality is the same.
Default functionality
with open('etc/regex') as f:
replace = toml.load(f)
with open('etc/encoding') as f:
encode = toml.load(f)
def regexes(r : dict, x : str) -> str:
return compose(*[zfunctools.partial(re.sub, i, j, flags=re.IGNORECASE)
for i,j in r.items()])(x)
replacements = functools.partial(regexes, replace)
encoding = functools.partial(regexes, encode)
parse_ = compose( encoding
, replacements
, str.strip
)
parse = functools.partial(map, parse_)
There are, however, probabilistic techniques included in woffle.parse.prob
in
the form of spaCy functions. For example, the spacy parse()
function
converts your text into a spacy doc
data type before extracting the root of
the phrase, lemmatising it and checking that it is part of the spacy vocabulary
for your chosen language.
If you wish to use this then you can simply compose them using our provided function
from woffle.parse.deter import parse as dp
from woffle.parse.prob.spacy import parse as pp
from woffle.functions.generics import compose
parse = compose(pp, dp)
2. Embedding
Any function which maps strings into a numeric space is suitable here. For
hierarchical clustering this takes the form of functions which map strings onto
real valued vectors or perhaps strings to integer valued vectors if you wish to
use term frequency for this. For the included themes the sentiment theme
includes a mapping from string onto {-1,0,1}
to represent sentiment polarity.
The interface to this should be provided as embed()
regardless of which mechanism you
use.
3. Grouping
A similarly abstract concept, this can vary from grouping values based on like
sentiment polarity score to Ward linkage which is provided through the
hcluster
theme as its default operation.
4. Labelling
This step is very important for the hierarchical clustering theme (see
./examples/optimus.py
for what it means to perform this in an unsupervised
classification task) where labels are generated through a multi-faceted decision
tree but this task is also available in sentiment where we simply map {-1,0,1}
onto the representative {negative, neutral, positive}
for each piece of text.
Pre-packaged themes
We provide, as of this version of the woffle template, themes for hierarchical
clustering tasks and sentiment polarity classification tasks. These are
augmented by examples of building both of these using woffle in ./examples
and
the examples are expected to inform you on how to go about these things.
If you want to perform a task which is not already available or find a bug, please fork the repo or raise an issue using the appropriate ticket template.