This project explores whether financial corporations can be assigned to their detailed Standard Industrial Classification 2007 (SIC 2007) codes using data on their financial assets and liabilities, together with other firm-level information. The project makes use of a number of unique features derived from both structured and unstructured text.
- Alex Noyvirt
The ability to classify financial companies into sub-sectors using firm-level information on their financial assets and liabilities, and other business information such as turnover and employment, has been identified as a high priority for the Flow of Funds collaboration project between the Office for National Statistics (ONS) and the Bank of England (BoE).
- The recommendations made by the report can be used to improve the financial services survey by designing additional questions with more discriminative power for classification.
- Companies highlighted by the algorithm as potentially misclassified by the SIC 2007 code recorded in the Inter-Departmental Business Register (IDBR) can be included in the process for manual re-evaluation of their activities.
- The methodology can easily be reapplied to classify companies from other administrative datasets.
Machine learning methods, specifically Random Forest and XGBoost, have been applied in a distributed environment (Cloudera and Apache Spark) together with an extensive set of feature selection methods. The work has been implemented in Scala.
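As a rough illustration of what a filter-style feature selection step can look like, the sketch below ranks features by how well they separate two classes. It is not the project's method: the report's actual selection methods are not described here, the scoring rule (between-class variance of the class means divided by within-class variance) is an assumption chosen for simplicity, and plain Python is used instead of Scala on Spark so that the example is self-contained. The function names, feature names, class labels and toy values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def separation_score(values, labels):
    """Between-class variance of class means divided by within-class variance.

    Higher scores mean the feature separates the classes more clearly.
    (Illustrative scoring rule, assumed for this sketch.)
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    overall = mean(values)
    n = len(values)
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values()) / n
    within = sum(len(g) * pvariance(g) for g in groups.values()) / n
    return between / within if within > 0 else float("inf")


def select_top_features(rows, labels, feature_names, k):
    """Score every feature column and return the k best names plus all scores."""
    scores = {
        name: separation_score([row[i] for row in rows], labels)
        for i, name in enumerate(feature_names)
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: one row per company, one column per feature.
feature_names = ["loans_share", "deposits_share", "employment"]
rows = [
    [0.80, 0.10, 12.0],
    [0.75, 0.20, 40.0],
    [0.20, 0.70, 15.0],
    [0.25, 0.60, 38.0],
]
labels = ["sector_a", "sector_a", "sector_b", "sector_b"]  # placeholder classes

top, scores = select_top_features(rows, labels, feature_names, k=2)
print(top)  # ['loans_share', 'deposits_share']
```

In this toy data the two balance-sheet shares differ sharply between the placeholder classes, while employment barely differs, so it is dropped. A selected subset like this would then be passed to a classifier such as Random Forest or XGBoost.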
ONS flow of funds - economic statistics
- February 2018: feature selection methods developed
- May 2018: processing of 6 million machine learning models
- June 2018: evaluation of the results
- June 2018: project report work
- August 2018: report finished
- September 2018: show & tell to stakeholders (FCA)
- Future: project completed
Please contact email@example.com for more information.