The Data Science Campus has been exploring how to process unlabelled list data that are collected manually, in an uncontrolled fashion, and with no supplementary information that would allow the data to be aggregated.
- Steven Hopkins
- Gareth Clews
- Arturas Eidukas
The project aims to enable analysis of datasets acquired from several ferry operators.
The main output is a set of well-structured hierarchical datasets, produced from the original datasets, that enable aggregation across categories for an analytical understanding of trade flows. More broadly, the project aims to open-source a generalised tool for this kind of problem, which analysts can use to understand similar free-text variables in their own work.
The work involves the unsupervised processing of free text using current methods such as word embeddings and clustering algorithms.
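As a rough illustration of the general idea of embedding free-text entries and clustering them, the sketch below groups similar list entries without any labels. It is not the Optimus implementation: character trigram counts stand in for trained word embeddings, a simple threshold-based single-linkage grouping stands in for whichever clustering algorithm the project uses, and the function names, example entries and threshold value are all hypothetical.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Map a string to character trigram counts (a crude stand-in for a word embedding)."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(items, threshold=0.4):
    """Single-linkage grouping: entries end up together when a chain of
    pairwise similarities at or above `threshold` connects them."""
    vectors = [trigram_vector(item) for item in items]
    clusters = []  # each cluster is a list of indices into `items`
    for i, vec in enumerate(vectors):
        # Find every existing cluster that has a member close to this entry.
        matches = [c for c in clusters
                   if any(cosine(vec, vectors[j]) >= threshold for j in c)]
        merged = [i]
        for c in matches:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return [[items[j] for j in c] for c in clusters]


# Hypothetical free-text entries of the kind a manually compiled list might contain.
entries = ["frozen fish", "Frozen Fish ", "frozen fish fillets", "steel pipes"]
for group in cluster(entries):
    print(group)
```

Each resulting group could then be given a label and nested under broader groups to form a hierarchy that supports aggregation. In practice a trained embedding model would capture meaning rather than spelling, so entries that are spelled differently but mean similar things could also be grouped.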
For processed datasets - the Department for Environment, Food and Rural Affairs (Defra) and, indirectly, the cross-Whitehall group on UK trade. For the generalised tool - the analytical community who use Python for natural language analysis.
Code and outputs
Optimus - GitHub repository optimus – turning free-text lists into hierarchical datasets
Please contact firstname.lastname@example.org for more information.