This project explores whether financial corporations can be assigned to their detailed Standard Industrial Classification 2007 (SIC2007) code using data on their financial assets and liabilities, together with other firm-level information. The project uses a number of unique features derived from both structured data and unstructured text.
- Alex Noyvirt
The ability to classify financial companies into sub-sectors using firm-level information on their financial assets and liabilities, and other business information such as turnover and employment, has been identified as a high priority for the Flow of Funds collaboration project between the Office for National Statistics (ONS) and the Bank of England (BOE).
The outcomes:
1) The recommendations made in the report can be used to improve the financial services survey by designing additional questions with more discriminative power for classification.
2) Companies that the algorithm highlights as potentially misclassified by the SIC 2007 code recorded in the Inter-Departmental Business Register can be included in the process for manual re-evaluation of their activities.
3) The methodology can easily be reapplied to classify companies from other administrative datasets.
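The second outcome can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's Scala code. It assumes that each company has a recorded code, a predicted code and a confidence score from the classifier. The `Prediction` type, the `flag_for_review` function, the 0.8 threshold and the sample records are all hypothetical and invented for illustration.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    company_id: str
    recorded_sic: str   # code held on the business register
    predicted_sic: str  # code proposed by the classifier
    confidence: float   # classifier's probability for predicted_sic


def flag_for_review(predictions, min_confidence=0.8):
    """Return companies whose confident prediction disagrees with the recorded code.

    The threshold is an assumed, tunable parameter, not a value from the project.
    Results are ordered with the most confident disagreements first.
    """
    flagged = [
        p for p in predictions
        if p.predicted_sic != p.recorded_sic and p.confidence >= min_confidence
    ]
    return sorted(flagged, key=lambda p: p.confidence, reverse=True)


# Invented sample records, for illustration only.
sample = [
    Prediction("A001", "64191", "64191", 0.95),  # agrees with the register
    Prediction("A002", "64999", "66120", 0.91),  # confident disagreement
    Prediction("A003", "65120", "65110", 0.55),  # disagreement, low confidence
    Prediction("A004", "64205", "64301", 0.84),  # confident disagreement
]

review_queue = flag_for_review(sample)
print([p.company_id for p in review_queue])
```

Ordering the queue by confidence is one possible design choice: it lets manual reviewers start with the cases where the model disagrees most strongly with the register.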
Machine learning methods, namely Random Forest and XGBoost, were applied in a distributed environment (Cloudera and Apache Spark), together with an extensive set of feature selection methods. The work was implemented in Scala.
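The text does not say which feature selection methods were used. As a hedged illustration of the general idea, the sketch below ranks categorical features by information gain, one common filter-style method. It is written in plain Python rather than the project's Scala and Spark stack, and the feature names, labels and data are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Return (feature name, information gain) pairs, most informative first."""
    names = rows[0].keys()
    scores = {name: information_gain([r[name] for r in rows], labels) for name in names}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy data: one feature separates the classes, the other does not.
rows = [
    {"holds_deposits": "yes", "size_band": "large"},
    {"holds_deposits": "yes", "size_band": "small"},
    {"holds_deposits": "no", "size_band": "large"},
    {"holds_deposits": "no", "size_band": "small"},
]
labels = ["bank", "bank", "insurer", "insurer"]

ranking = rank_features(rows, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy data `holds_deposits` splits the two classes perfectly and therefore ranks first, while `size_band` carries no information about the label.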
ONS Flow of funds - economic statistics
- February 2018 Feature selection methods developed
- May 2018 Processing of 6 million machine learning models
- June 2018 Evaluation of the results
- June 2018 Project report work
- August 2018 Report finished
- September Show & tell to stakeholders (FCA)
- Future Project Completed
Please contact email@example.com for more information.
- No updates yet.