Infrastructure for the Data Science Campus

What do you need to start using Data Science tools and techniques with your data? Can you just use your existing IT infrastructure? Or do you need to develop something more specialist?

The answer depends on what you are doing, how you are doing it and how your existing infrastructure is set up.

In this blog we discuss the infrastructure that is under continual development at the Data Science Campus. We explore how it contrasts with, and works alongside, the office-wide offering, and see that the two exist as different but complementary networks addressing different needs.

The challenges

Recent rapid development in technology and the availability of new datasets offer the opportunity to exploit huge amounts of new information. This can be used to help policymakers, researchers and businesses make more informed decisions. The Data Science Campus (the Campus) at the Office for National Statistics (ONS) has been created to respond to this challenge. The goals of the Campus are to investigate the use of new data sources, including administrative data and big data, for public good, and to help build data science capability for the benefit of the UK. A new generation of tools and technologies is being used to exploit the growth and availability of these new data sources, and to provide innovative methods for rich, informed measurement and analysis of the economy, the global environment and wider society.

Achieving these goals in a sustainable and scalable manner requires an infrastructure that provides the tools and technologies needed to develop new techniques and explore new datasets. However, this must be balanced against ensuring that data governance and security are maintained to the highest standard. This is a significant challenge, especially in organisations that are going through extensive digital transformation. In this blog we outline the networks used within the ONS and the Campus and how they address these needs.

The ONS data service

The ONS corporate network is undergoing significant transformation. It is a highly secure area containing standard corporate IT services such as email, Skype for Business and SharePoint. These have followed on from key datacentre and hardware refresh projects within the ONS, and a complete upgrade of business services. This is an ongoing process, but it has already delivered a major upgrade in the technology and services provided within the ONS.


Each of these new IT services is separated and secured through secure zoned networks, which are isolated from the core network. Each zone is secure by default, and changes are managed under strict control using firewall rules. This approach ensures only the relevant users have access to each service. It also increases performance, reliability and resilience within the system, as well as mitigating network saturation.
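As an illustration of what "secure by default" means in practice, here is a minimal sketch of a default-deny rule check. The zone names, ports and the `is_allowed` helper are hypothetical and chosen for illustration only; they do not describe the actual ONS firewall configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One explicit allow rule between two zones (hypothetical model)."""
    source_zone: str
    dest_zone: str
    port: int


# Hypothetical allow-list: anything not listed here is denied.
ALLOW_RULES = frozenset({
    Rule("corporate", "email", 443),
    Rule("corporate", "data_service", 443),
})


def is_allowed(source_zone, dest_zone, port, rules=ALLOW_RULES):
    """Default deny: traffic passes only if an explicit rule matches."""
    return Rule(source_zone, dest_zone, port) in rules


if __name__ == "__main__":
    print(is_allowed("corporate", "email", 443))     # explicit rule exists
    print(is_allowed("email", "data_service", 443))  # no rule, so denied
```

Because the check starts from "deny", opening a new route between zones is always an explicit, reviewable change to the rule set.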


Several zones combine to form the ONS Data Service. This is an office-wide service that sits on the core network and aims to enable teams across ONS to transform by providing access to support, data and technology. To achieve this, the ONS Data Service is also an organisational structure. In addition to the technological services required by the office, it provides an ethical, methodological and standards framework alongside training, support and capability building. This is an important aspect of the service, ensuring that all users can gain as much as possible from the system.


The service manages and provides the tools to acquire, control, securely ingest and prepare data before placing it in a distributed data lake. This data lake, built on the Hadoop Distributed File System (HDFS), contains the bulk of the business-sensitive data. The service also supports the preparation and exploration of data using tools such as Python, R and Spark. Libraries and packages are managed through a repository manager. In addition, production and export pipelines are provided. Users access the platforms and tool sets from the corporate network using Virtual Desktop Infrastructure (VDI) for ease of access, and data exploration tools are provided through the Cloudera Data Science Workbench (CDSW).

The secure and robust implementation of the corporate infrastructure allows data science techniques to be applied to sensitive data; however, due to security constraints, it also restricts the libraries, models and tools that can be used. This can limit innovation and constrain the ability of data scientists to experiment with new tools and develop new techniques. The ONS Data Service is still developing and evolving, but in ensuring secure and controlled access to data it also restricts the use of the latest tools in areas such as image processing, geospatial analysis and natural language processing (NLP).

The Data Science Campus network (DSCN) has been created as a separate infrastructure to provide users with the IT services and tool sets required to investigate more advanced techniques and produce the next generation of statistics.

Getting hold of the data

Data is central to the work at ONS. Data enters the business from a multitude of sources in both structured and unstructured forms. The security, access control and regulation of data are of the utmost importance. The Trusted Data Zone (TDZ) was created to securely ingest this data for use in the ONS Data Service platform. The TDZ contains tools to handle data ingestion, including Apache NiFi and API gateways. The zone also contains a multi anti-virus scanner to ensure that the data is free from malware and viruses before entering the network.
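The idea of passing every payload through several scanners before it is released can be sketched as a simple gate. This is an illustrative sketch only: the `signature_scanner` and `ingest` functions and the byte signatures are hypothetical stand-ins, not the TDZ implementation or any real anti-virus API.

```python
import hashlib
from typing import Callable, Iterable, Sequence

# A scanner returns True when the payload looks clean (hypothetical interface).
Scanner = Callable[[bytes], bool]


def signature_scanner(blocked: Iterable[bytes]) -> Scanner:
    """Build a toy scanner that rejects payloads containing a blocked signature."""
    signatures = tuple(blocked)

    def scan(payload: bytes) -> bool:
        return not any(sig in payload for sig in signatures)

    return scan


def ingest(payload: bytes, scanners: Sequence[Scanner]) -> str:
    """Release a payload only if every scanner passes it.

    Returns the SHA-256 digest so the released file can be audited later.
    """
    if not scanners:
        raise ValueError("at least one scanner is required")
    if not all(scan(payload) for scan in scanners):
        raise PermissionError("payload rejected by at least one scanner")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    scanners = [signature_scanner([b"EVIL"]), signature_scanner([b"BAD"])]
    print(ingest(b"id,value\n1,2\n", scanners))
```

Requiring every scanner to pass means a payload is blocked as soon as any one engine flags it, which is the usual reason for running more than one.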

In contrast, data on the Campus network is non-sensitive, and the network is designed so that data cannot enter the corporate network. Data can be stored in a distributed manner using HDFS, as in the core network, or in relational or NoSQL forms as appropriate for its end use. The initial steps of ingestion (planned acquisition, data control and ethical reviews) are the same whichever network is being used. This is important to ensure that the appropriate platform is being used for the right data.
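The decision described here can be expressed as a small check. The sketch below is an assumption-laden illustration: the `Dataset` fields and the `allowed_platforms` function are hypothetical, and the rule that non-sensitive data may use either platform is our simplification rather than a stated ONS policy.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical record of a dataset and its completed ingestion steps."""
    name: str
    sensitive: bool
    acquisition_planned: bool = False
    data_control_done: bool = False
    ethical_review_done: bool = False


def allowed_platforms(ds: Dataset) -> set:
    """Return the platforms a dataset may be loaded onto.

    The three initial steps apply whichever network is used, so nothing is
    allowed until all of them are complete.
    """
    checks = (ds.acquisition_planned, ds.data_control_done, ds.ethical_review_done)
    if not all(checks):
        return set()
    if ds.sensitive:
        # Sensitive data stays on the secure corporate platform.
        return {"ONS Data Service"}
    # Assumption: non-sensitive data may use either platform.
    return {"ONS Data Service", "DSCN"}
```

For example, a sensitive dataset with all three steps completed would return only `{"ONS Data Service"}`, while a dataset with an outstanding ethical review would return an empty set.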

The Campus network

The DSCN sits in the Demilitarized Zone (DMZ) on dedicated hardware, in complete isolation from the corporate network. The hardware comprises 800 GHz of CPU, 2.5 TB of RAM and 100 TB spanning two data centres, joined by a 20 Gbps inter-link. There are also five physical TPU machines. The DSCN is accessible from the corporate network and external networks. Users can access and use a variety of tools and can develop their own data systems as required, including many that would be restricted on the corporate network. Tools include a variety of languages and integrated development environments, automation and containers. There is also the ability to create web services and APIs to share findings, tools and visualisations with stakeholders. The system is designed to encourage innovation and experimentation. It has been used to develop techniques and tools that can later be migrated to the corporate data service using continuous integration pipelines.
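To give a flavour of the kind of web service a user might stand up to share findings, here is a minimal sketch using only the Python standard library. The `/findings` path, the placeholder payload and the `render` and `serve` helpers are all hypothetical; real Campus services may use different frameworks and conventions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder payload, invented for illustration (not real project output).
FINDINGS = {"project": "example-project", "status": "draft"}


def render(path):
    """Map a request path to an HTTP status code and a JSON body."""
    if path == "/findings":
        return 200, json.dumps(FINDINGS).encode("utf-8")
    return 404, json.dumps({"error": "not found"}).encode("utf-8")


class FindingsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = render(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Start the service; this blocks until interrupted."""
    HTTPServer((host, port), FindingsHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/findings` would return the placeholder JSON. Keeping the routing in `render` separates it from the network code, so it can be tested without opening a socket.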


There are four core areas where the Data Science Campus network is used:

  • Development and exploration of techniques and tools

  • Completion of projects

  • Capability and training portals

  • Integration with other teams outside of the Campus

The freedom and flexibility of the system allow users to quickly set up and test new open-source techniques as well as develop and test their own tools. This includes NLP, computer vision, geospatial and deep learning approaches. This has enabled a number of projects to be successfully completed and published. Pipelines and tools can then be fine-tuned and exported to be used on the main ONS Data Service.


The network also has a training sandbox environment. This is used by the Campus capability team as a reliable way of ensuring the right environment and tools are loaded and available to all users during data science training sessions. These sessions are delivered across ONS and UK Government.

A two-way street

Developing a network that sits separately from the main organisation has its risks. Data scientists need access to new and different tools, while busy IT teams are concerned with security and support. Without care this can lead to barriers and frustration on all sides.

Setting up the Campus network was only possible with the active inclusion of IT architects and information assurance experts. To break down barriers and develop a common understanding of the language, needs and concerns of all parties, an architect was seconded to the Campus. The Campus also takes an active role in developing the roadmap and development priorities for the ONS Data Service, interacting with both users and service providers.

The network is also used by different departments across the ONS, including National Accounts and Economic Statistics, Sustainable Development Goals and International Development. This is important as it opens the network to a wider set of users within ONS, prompting teams to build their capability, increase skills and give feedback on their needs.

These connections are an important aspect of the network, as they provide a two-way channel between the Campus and the rest of the office, building momentum and ensuring innovation and ideas are aligned with the wider needs of the ONS.

Example projects

The Data Science Campus network has been used to deliver various projects, many of which are featured on the Campus GitHub pages. These include:

  • Access to Services - revealing travel information regarding access to health, social and other services. This uses the Campus propeR tool, which analyses multi-modal travel in the UK using open-source transport data.

  • Identifying household gardens - accurate estimate of the green space within UK household gardens.

  • Identifying emerging trends from patent data - patents and other technical literature have key terminology trends identified, which may inform business and government decisions regarding new technologies. The analysis includes when and where terminology usage occurs, considered both nationally and internationally. This builds on the Campus PyGrams pipeline that allows information retrieval from large document collections.

  • Turning free-text lists into hierarchical datasets - using optimus, a processing pipeline that can automatically group items and generate hierarchical labels for each group to structure the data.

  • Mapping the urban forest - generating vegetation density maps for the road network of an entire city. The system is built on recent advancements in deep learning for semantic image segmentation.

  • Re-using the approach - the Campus is partnering with the National Institute of Statistics Rwanda (NISR), supporting them to deliver a similar IT infrastructure for a new Data Hub in Rwanda.

  • Training and development sandpit - to develop data science capability within ONS and across Government, with resources to teach R, Python, Spark, Git, natural language processing and machine learning.

What next?

The Campus also has some isolated hardware, such as GPU units, that will be integrated into the network more fully. Improved access from different machines and locations will also be a short-term improvement. As the main ONS Data Service develops and improves, we will see many of the tools and techniques migrate over to the more secure environment. However, we feel there will always be a need for a developmental and experimental space for data science work.

Although the Campus infrastructure is currently on premises, we envisage that parts of the service will migrate to cloud-based infrastructure. Initially we expect a hybrid solution, depending on the data, stakeholders and services involved.

Summary

An office-wide ONS Data Service provides access to the support, data and technology services that enable teams to transform. It controls the ingestion, technology and security tools that allow data science to be performed on sensitive data. However, the increased security also restricts the libraries, models and tools that can be used. This can limit innovation and constrain the ability of data scientists to experiment with new tools and develop new techniques.

The Data Science Campus network (DSCN) has been created as a separate infrastructure to provide users with the IT services and tool sets required to investigate more advanced techniques and produce the next generation of statistics. Users have much more freedom to build the systems and services needed for cutting-edge techniques and pipelines. These can then be refined for future use on the ONS Data Service. A number of successful projects have been completed using both platforms.


By David Pugh, Senior Data Scientist at ONS Data Science Campus.