<h1><a href="https://datasciencecampus.github.io/privacy-via-image-downsampling">Ensuring privacy using downsampling of streamed images</a> (2021-06-30)</h1>
<p>In this blog we describe how we remove personal information (vehicle number plates) from streamed traffic camera images, in accordance with <a href="https://www.ons.gov.uk/aboutus/transparencyandgovernance/datastrategy/dataprinciples">ONS’ ethical and privacy standards</a>. We are only interested in aggregate, non-identifiable time series, so removing identifiable information is the first part of our processing.</p>
<p>We focus on the process of removing legible and disclosive number plates from downstream processing as part of the <a href="https://datasciencecampus.ons.gov.uk/projects/estimating-vehicle-and-pedestrian-activity-from-town-and-city-traffic-cameras/">Traffic Cameras project</a> published by the Data Science Campus in September 2020. The Traffic Cameras project emerged from the need to better understand changing patterns in mobility and behaviour in near real time in response to the coronavirus pandemic (COVID-19). The project uses publicly available traffic camera imagery, deep learning algorithms, and cloud services to produce near real-time counts of pedestrians and vehicles in selected city centres. Although traffic cameras are not designed to capture number plates, in rare circumstances a plate may be legible in camera imagery. To prevent us accidentally capturing identifiable number plates, we employ an image downsampling method to reduce image resolution, and therefore detail, which obfuscates number plates in any processed images.</p>
<p>This blog details how we evaluated the impact downsampling had on the performance of object detection algorithms through the assessment of object count time series, where we wished to ensure the time series was of sufficient quality whilst obfuscating number plates. This concludes the development work, and we have now released the source code on GitHub in the “<a href="https://github.com/datasciencecampus/chrono_lens">chrono_lens</a>” repository.</p>
<h2 id="identifying-legible-and-disclosive-example-number-plates">Identifying legible and disclosive example number plates</h2>
<p>Collecting, storing, and processing traffic camera images of public spaces poses significant challenges. Our focus here is the accidental disclosure of vehicle registration number plates. The images we collect, store, and analyse are captured across a range of camera settings, locations, illumination levels, and weather conditions. Thousands of images are recorded at 10-minute intervals, a proportion of which are high definition. Definitively recognising a number plate in these images is rare, because numerous factors must coincide. Such factors include:</p>
<ol>
<li>Camera resolution.</li>
<li>Number plate angle and distance relative to the camera.</li>
<li>Speed of the vehicle.</li>
<li>Weather conditions (rain, fog, and/or sun glare).</li>
</ol>
<p>While all these factors aligning perfectly is rare, our continuous sampling means we may record dozens of such disclosive images daily. To address this issue, we proposed downsampling, on download, any image that exceeded a given resolution threshold. We had to consider, first, what resolution would suffice to remove all disclosive number plates and, second, what impact downsampling images would have on the performance of object detection algorithms, and thus on object counts.</p>
<p>Downsizing too aggressively would have a needlessly detrimental impact on object counts. Therefore, an optimal resolution threshold had to be determined. Figure 1 illustrates modified examples of disclosive vehicle number plates recorded on public traffic cameras.</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig1-example-plates.jpg" alt="Example images showing disclosive number plates, with actual plate replaced with 'DSC ONS'" /><br />
<em>Figure 1: Three modified examples of disclosive number plates captured by public traffic cameras (where the original number plate has been replaced with “ONS DSC”)</em></p>
<h2 id="measuring-the-distribution-of-image-resolutions">Measuring the distribution of image resolutions</h2>
<p>We assessed the resolutions of downloaded images, aiming to identify how frequently high-resolution images occur. This also allowed us to identify the specific locations supplying higher resolution images. To achieve this, for a 24-hour period we recorded the metadata of images provided by all our traffic camera image providers, including Transport for Greater Manchester (TfGM). We specifically recorded each image’s width, height, location, and camera ID. Figure 2 illustrates the resolutions of images from TfGM, in addition to the number of observations for each resolution.</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig2-TfGM_X_Y_chart.png" alt="Scatter plot showing quantities of images against image width and height" />
<em>Figure 2: TfGM traffic camera image resolutions and number of observations</em></p>
<p>Figure 2 shows that a noticeable number of images have widths approaching 1000 pixels and heights approaching 450 and 550 pixels. Higher resolution images therefore account for a substantial proportion of the total. The number plates shown in Figure 1 are all sourced from similar higher resolution images. The more higher resolution images we record, the greater the probability of capturing legible vehicle number plates.</p>
<h2 id="downsampling-images">Downsampling images</h2>
<p>Applying the same methodology used to create Figure 2, we recorded the resolutions of images for all available locations, noting that the most common image width was 440 pixels. Given that no legible number plates had been identified in images 440 pixels wide, we chose 440 pixels as the width threshold above which an image is deemed ‘too high resolution’.</p>
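<p>As a rough sketch of how the threshold and the list of higher resolution cameras can be derived from the recorded metadata (the file and column names here are illustrative, not the pipeline’s actual names):</p>
<pre><code class="language-python">import pandas as pd

# metadata recorded for each downloaded image: camera_id, width, height, location
metadata = pd.read_csv("image_metadata.csv")

# the most common image width becomes the downsampling threshold (440 px in our data)
threshold = metadata["width"].mode()[0]

# cameras that ever supplied an image wider than the threshold ('guilty' cameras)
guilty_cameras = sorted(metadata.loc[metadata["width"] > threshold, "camera_id"].unique())
</code></pre>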
<p>We decided to choose a 24-hour period for each location to run our downsampling code on. These sample 24-hour periods would later be the subject of our before-and-after downsampling object count time series analysis. However, choosing a random 24-hour period would not suffice. Traffic camera ‘uptime’ varies significantly by location, and even week to week for any specific location. Luckily, the team had previously developed scripts that recorded the ‘uptime’ of cameras for a given location to accompany our weekly faster indicators release. We could use the output charts to identify healthy weeks and, subsequently, 24-hour periods to run our analysis on.</p>
<p>Figure 3 illustrates what percentage of images for a given location were recorded as faulty, missing, or present. We see that Northern Ireland and TfGM recorded greater than 90% of images as present. Conversely, Durham shows greater than 50% of images as faulty, with approximately 20% missing and 25% present. The week ending 27 December 2020 (2020-12-27) is likely a viable candidate from which to choose a 24-hour period for Northern Ireland, TfGM, and Transport for London (TfL).</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig3-status_report_20201228.png" alt="Stacked bar chart showing present, missing and faulty ratios of cameras per region" />
<em>Figure 3: Cameras status report per location for week ending 2020-12-27</em></p>
<p>Next, we had to assess 24-hour periods within a given week to find the healthiest 24-hour period. Figure 4 shows one week of data identifying periods where individual cameras were either present, missing, or faulty. For TfGM, we see the 22nd of December 00:00 – 23:50 was a healthy 24-hour period. This process of selecting healthy 24-hour periods was repeated for all locations to determine the best possible 24-hour period to conduct our analysis on. Table 1 shows the chosen 24-hour periods per location.</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig4-TfGM-camera-status.png" alt="Horizontal stacked bar chart showing faulty/missing/present status of a selection of TfGM cameras across the week ending 2020-12-27" />
<em>Figure 4: Camera status per camera for TfGM week ending 2020-12-27</em></p>
<table>
<thead>
<tr>
<th>Location</th>
<th>Candidate week ending</th>
<th>Chosen day</th>
</tr>
</thead>
<tbody>
<tr>
<td>TfGM</td>
<td>2020-12-27</td>
<td>2020-12-22</td>
</tr>
<tr>
<td>Northern Ireland</td>
<td>2020-12-27</td>
<td>2020-12-24</td>
</tr>
<tr>
<td>North East</td>
<td>2021-01-10</td>
<td>2021-01-05</td>
</tr>
<tr>
<td>Durham</td>
<td>2021-01-17</td>
<td>2021-01-05</td>
</tr>
<tr>
<td>Reading</td>
<td>2020-12-27</td>
      <td>2020-12-23</td>
</tr>
<tr>
<td>TfL</td>
<td>2020-12-27</td>
<td>2020-12-22</td>
</tr>
<tr>
<td>Southend</td>
<td>2020-12-27</td>
<td>2020-12-22</td>
</tr>
</tbody>
</table>
<p><em>Table 1: Identified 24-hour periods of good camera uptime for each location</em></p>
<p>Lastly, we ran a script that duplicated each chosen 24-hour period, downsampling any higher resolution images (width exceeding 440 pixels) to a width of 440 pixels. Figure 5 shows number plates before and after downsampling; we used the OpenCV ‘resize’ function with the interpolation method set to ‘INTER_AREA’.</p>
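<p>As an illustration, a minimal sketch of this downsampling step is shown below. The function name and structure are ours, for illustration only; the released chrono_lens code may organise this differently.</p>
<pre><code class="language-python">import cv2

MAX_WIDTH = 440  # the width threshold identified above

def downsample_if_needed(image):
    """Shrink an image to MAX_WIDTH pixels wide, preserving its aspect ratio."""
    height, width = image.shape[:2]
    if width > MAX_WIDTH:
        new_height = round(height * MAX_WIDTH / width)
        # INTER_AREA resamples by pixel-area averaging, which suits shrinking and
        # removes the fine detail that could leave a number plate legible
        return cv2.resize(image, (MAX_WIDTH, new_height), interpolation=cv2.INTER_AREA)
    return image
</code></pre>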
<p><img src="/images/blog/privacy-via-image-downsampling/fig5a-sample-number-plate.png" alt="Example number plate, replaced with 'ONS DSC' to show obfuscation" /><br />
<em>a. Before (1920x1080) and after (440x274).</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig5b-sample-number-plate.png" alt="Example number plate, replaced with 'ONS DSC' to show obfuscation" /><br />
<em>b. Before (1280x1024) and after (440x352).</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig5c-sample-number-plate.png" alt="Example number plate, replaced with 'ONS DSC' to show obfuscation" /><br />
<em>c. Before (1280x1024) and after (440x352).</em></p>
<p><em>Figure 5: Comparison of number plates before and after downsampling</em></p>
<h2 id="time-series-comparison-before-and-after-downsampling">Time series comparison before and after downsampling</h2>
<p>Using the metadata described above in the section ‘Measuring the distribution of image resolutions’, we gathered lists of ‘guilty’ cameras known to provide higher resolution images. These lists allowed us to analyse object counts only for images provided by higher resolution cameras. This served two purposes: (1) it reduced processing time, since only a portion of images needed to be analysed rather than the whole collection, and (2) it allowed fair before-and-after object count comparisons between locations, as lower resolution cameras, which make up the majority of images, would have biased the result if included. Figure 6 illustrates a 10-period moving average of summed object counts from ‘guilty’ cameras for each 10-minute snapshot, before and after downsampling, per location over the chosen 24-hour periods. Reading and TfL are excluded from Figure 6 as no cameras from these locations provided images that exceeded our high-resolution threshold.</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6a-Durham_before_and_after_counts_plot.png" alt="Summed object counts for Durham before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6a: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for Durham with an applied 10 period moving average</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6b-Southend_before_and_after_counts_plot.png" alt="Summed object counts for Southend before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6b: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for Southend with an applied 10 period moving average</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6c-TfGM_before_and_after_counts_plot.png" alt="Summed object counts for TfGM before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6c: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for TfGM with an applied 10 period moving average</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6d-Northern-Ireland_before_and_after_counts_plot.png" alt="Summed object counts for Northern Ireland before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6d: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for Northern Ireland with an applied 10 period moving average</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6e-North-East-Travel-Data_before_and_after_counts_plot.png" alt="Summed object counts for the North East before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6e: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for the North East with an applied 10 period moving average</em></p>
<p>When comparing object counts between “native” image resolution and downsampled resolution, we noted two main points. Firstly, downsampling higher resolution images does impact object counts for all categories and locations.
Secondly, before and after counts track closely together, showing the same trends, peaks, and troughs. Before and after counts for person & cyclist and bus deviate the most across all locations. Generally, object counts are negatively impacted by downsampling, as the blue lines trend slightly lower, with exceptions such as Southend-van, Southend-car, and Northern Ireland-van. Given that Figure 6 only shows ‘guilty’ cameras, the impact on the overall counts for each location will be minimal.
We investigate this further with correlation analysis, as presented in Table 2.</p>
<table>
<thead>
<tr>
<th>Pearson Correlation</th>
<th>Before: Car</th>
<th>Before: Van</th>
<th>Before: Bus</th>
<th>Before: Truck</th>
<th>Before: Person/Cyclist</th>
</tr>
</thead>
<tbody>
<tr>
<td>After: Car</td>
<td><em>0.96</em></td>
<td>0.34</td>
<td>0.06</td>
<td>0.14</td>
<td>0.05</td>
</tr>
<tr>
<td>After: Van</td>
<td>0.34</td>
<td><em>0.87</em></td>
<td>0.08</td>
<td>0.24</td>
<td>0.05</td>
</tr>
<tr>
<td>After: Bus</td>
<td>0.06</td>
<td>0.05</td>
<td><em>0.86</em></td>
<td>0.10</td>
<td>0.12</td>
</tr>
<tr>
<td>After: Truck</td>
<td>0.14</td>
<td>0.17</td>
<td>0.15</td>
<td><em>0.80</em></td>
<td>0.01</td>
</tr>
<tr>
<td>After: Person/Cyclist</td>
<td>0.05</td>
<td>0.04</td>
<td>0.11</td>
<td>0.02</td>
<td><em>0.86</em></td>
</tr>
</tbody>
</table>
<p><em>Table 2: Correlation table for global before and after downsampling objects counts</em></p>
<p>Table 2 shows the global Pearson correlation for all data presented in Figure 6. The high Pearson correlation scores in Table 2 reflect the close relationship between the before and after downsampling object counts in Figure 6.
Globally (across all locations), car counts hold up best with a correlation of 0.96, van second with 0.87, bus and person & cyclist tied third with 0.86, and truck last with 0.80.</p>
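<p>For reference, a correlation table like Table 2 can be produced with a few lines of pandas. This is a sketch under assumed inputs: <code>before</code> and <code>after</code> are data frames indexed by 10-minute timestamp with one column per object type, summed over the ‘guilty’ cameras; the names are illustrative rather than taken from our pipeline.</p>
<pre><code class="language-python">import pandas as pd

def before_after_correlation(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of every 'after' column against every 'before' column."""
    # rows are the 'after' object types, columns are the 'before' object types
    return pd.concat(
        {column: after.corrwith(before[column]) for column in before.columns}, axis=1
    )

# correlations = before_after_correlation(before_counts, after_counts)
# print(correlations.round(2))
</code></pre>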
<h2 id="conclusion">Conclusion</h2>
<p>We have demonstrated a simple downsampling technique that removes legible vehicle registration plates to satisfy disclosure control checks. Furthermore, we have shown that time series before and after downsampling remain closely correlated across all locations and object types. This enables our system to simply download images into memory and immediately downsample them (discarding the higher resolution original), prior to passing the smaller image on for subsequent processing. No other sections of our analysis pipeline were impacted, and potentially disclosive images are never stored.</p>
<p>The full source code of the traffic cameras project is now available on GitHub (“<a href="https://github.com/datasciencecampus/chrono_lens">chrono_lens</a>”), where the system creates a Google Cloud Platform project, scaling out to process cameras as required. The system pulls publicly available imagery and records detected object counts into a BigQuery database, which is then used to generate weekly reports and a related CSV file of aggregate counts for export and further analysis.</p>
<p><em>Ryan Lewis (ryan.lewis@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/using-custom-parameters">Using custom parameters to view multiple metrics on the same page</a> (2021-06-16)</h1>
<p>On every Google Data Studio chart, there is the option to add ‘Optional Metrics’, which adds additional metrics to the settings of any chart. However, this becomes unwieldy on a page full of charts that each need optional metrics, or when you are using community data visualisations that may not have optional metrics built in. If, for example, you wanted a single drop-down box that would filter every chart on a report, including community visualisations, this is possible using calculated fields and parameters in Data Studio. These calculated fields link together to provide user-friendly filtering for a range of metrics at whole-page level.</p>
<p>We’ve used custom parameters a lot within the Data Science Campus, mostly to filter charts and tables so they can be viewed across multiple metrics or downloaded to suit a user’s needs. To demonstrate their use, we shall explore COVID-19 open data from the <a href="https://coronavirus.data.gov.uk/">coronavirus dashboard</a>. To show how it is built, I will set up a custom parameter to filter and choose between all the COVID-19 metrics available, on a basic time series graph and bar chart.</p>
<p><img src="/images/blog/using-custom-parameters/figure1.png" alt="A time series chart and bar chart added using an optional metrics menu." /><br />
<em>Figure 1 Two charts with metrics added via ‘optional metrics’ menu. Filtering is only allowed at one chart level.</em></p>
<p>Firstly, let’s address optional metrics and why we steered away from this built-in function. In both charts above I’ve added all the metrics as optional metrics. Within the Data Science Campus, the goal of many of our Data Studio pages is to behave like a synced report, with all charts connecting via controls. As you can see above, the issue with optional metrics is that filtering only applies to one chart at a time, ignoring other parts of the page, which is not good enough for a page with more than one chart.</p>
<p><img src="/images/blog/using-custom-parameters/figure2.png" alt="A time series chart and bar chart added using a custom metrics menu." /><br />
<em>Figure 2 Two charts with a custom metric selector.</em></p>
<p>In this dashboard page, like many pages used within the Data Science Campus, a custom parameter is used to allow exploration across all charts presented. The custom parameters give the user an efficient tool to allow better understanding of the data visualisations.</p>
<h2 id="how-to-setup-custom-parameters">How to setup custom parameters</h2>
<p>In Data Studio, the first step is to create a parameter with a list of all the metrics you want to choose between. In my example I have called it Custom Metric Selector and added the available metrics from the government dashboard.</p>
<p><img src="/images/blog/using-custom-parameters/figure3.png" alt="Parameter creation in Google Data Studio" /></p>
<p>When this parameter is completed, create a calculated field that links to the parameter by adding the created parameter as the field’s formula.</p>
<p><img src="/images/blog/using-custom-parameters/figure4.png" alt="Creating a calculated field" /></p>
<p>Now that the field and parameter have been created, the final calculated field is the metric that will be added to the chart. It is a CASE statement that links the list in the parameter to the metrics of your dataset.</p>
<p><img src="/images/blog/using-custom-parameters/figure5.png" alt="Using a CASE statement to create a calculated field that will act as a custom parameter" /></p>
<p>Finally, you can add these calculated fields to a page instead of normal metrics to make life easier for any user. Add the final calculated field to the metrics of a chart and add the parameter to a drop-down control filter, and the custom metrics will work together across the whole page, as shown below.</p>
<p><img src="/images/blog/using-custom-parameters/figure6a.png" alt="Adding the calculated field to the chart" />
<img src="/images/blog/using-custom-parameters/figure6b.png" alt="Adding the control filter with custom parameter to the page" /></p>
<p>We have shown in this post how we can use Google Data Studio’s custom parameters to bypass the limitations of the optional metrics function and view multiple metrics alongside community visualisations. To find out more about custom parameters, take a look at Google Data Studio’s <a href="https://support.google.com/datastudio/answer/9002005?hl=en#zippy=%2Cin-this-article">documentation</a>. You can also <a href="https://datastudio.google.com/s/opW6eX16BX4">explore the report</a> described in this post.</p>
<p><em>Gabe Shires (gabriel.shires@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/using-custom-queries">Using custom queries to quickly compare two time series in Google Data Studio</a> (2021-06-02)</h1>
<p>Google Data Studio makes it easy to pull data from BigQuery and display it as a simple visualisation, but sometimes it is not clear exactly what data you need until you are looking at a report. In those situations, a <a href="https://support.google.com/datastudio/answer/6370296?hl=en#zippy=%2Cin-this-article">custom query</a> can be built with parameters that allow for pre-filtering, on-the-fly calculations and even introducing an offset to your time series data so that you can compare two traces at a glance. Custom queries are written in SQL within Data Studio, to select, join, filter and perform other transformations on data across any number of tables in BigQuery.</p>
<p>We make use of custom queries for a lot of our work using Data Studio in the Data Science Campus, in order to compare several time series data sets quickly. To demonstrate the use of custom queries we shall explore COVID-19 open data from the <a href="https://coronavirus.data.gov.uk/">UK Government’s Coronavirus Dashboard</a>. For the demonstration we will display the number of new cases (people with a positive test result) alongside the number of deaths within 28 days of a positive test, for each day in the first two weeks of 2021.</p>
<p><em>Disclaimer: this is a cut of official data and our use here as part of the Office for National Statistics is purely for visualisation purposes and not to comment on the data or trends.</em></p>
<p>Two challenges to overcome when visually comparing time series data are that noise can obscure trends, and that the data can be on vastly different scales (such as COVID-19 cases and deaths). To overcome these challenges, we used Boolean parameters in our custom query. For the user, choosing a value for a Boolean parameter is as simple as checking a box, but behind the scenes this introduces branching logic to the SQL query that allows calculations to be performed on demand – in this case, smoothing (calculating a seven-day moving average) and min-max normalisation, as shown in Figure 1.</p>
<p><img src="/images/blog/using-custom-queries/1a_before_normalisation.PNG" alt="Time series chart of cases and deaths for the period 1 Jan 2021 - 15 Jan 2021, before normalisation is applied" />
<img src="/images/blog/using-custom-queries/1b_after_normalisation.PNG" alt="Time series chart of cases and deaths for the period 1 Jan 2021 - 15 Jan 2021, after normalisation is applied" />
<em>Figure 1: Time series before and after min-max normalisation is applied using the tick box. The data in the above figures is for Hampshire in the first two weeks of 2021. Source: <a href="https://coronavirus.data.gov.uk/">Government Coronavirus Dashboard</a></em></p>
<p>The idea of this dashboard page was to provide a tool to quickly explore two or more time series data sets. To allow this, we introduced an offset parameter to the custom query. The user selects any integer and the time series in question is “shifted” by that many days, as shown in Figure 2. If there is a similarity between the two data sets being observed, this offset capability makes it easy to find visually. This process highlights data sets that should be explored and compared more closely.</p>
<p>We used this process for different data sets, but for the purposes of demonstration you can see it below for the open COVID-19 data.</p>
<p><img src="/images/blog/using-custom-queries/2a_before_offset.PNG" alt="Time series chart of cases and deaths for the period 1 Jan 2021 - 15 Jan 2021, smoothed, without any offset" />
<img src="/images/blog/using-custom-queries/2b_after_offset.PNG" alt="Time series chart of cases and deaths for the period 1 Jan 2021 - 15 Jan 2021, smoothed, with the deaths series offset by four days" />
<em>Figure 2: Introducing an offset allows the user to visually inspect two time series in search of patterns. In the second image, the deaths series has been shifted by four days. The data in the above figures is for Hampshire in the first two weeks of 2021. Source: <a href="https://coronavirus.data.gov.uk/">Government Coronavirus Dashboard</a></em></p>
<h2 id="how-to-use-custom-queries-to-connect-to-bigquery">How to use custom queries to connect to BigQuery</h2>
<p>From Data Studio, in edit mode, select <strong>Resource > Manage Added Data Sources</strong> and <strong>+ Add a data source</strong>. Select <strong>BigQuery</strong> as your source, and on the left-hand side of the screen select <strong>CUSTOM QUERY</strong>. From here, select the project you want to connect to and you will be presented with a custom query input box as well as the <strong>Add Parameter</strong> button.</p>
<p><img src="/images/blog/using-custom-queries/3_custom_query.PNG" alt="The custom query screen within Google Data Studio, in which the user can provide a query to bring in data from BigQuery" />
<em>Figure 3: Creating a custom query within Google Data Studio</em></p>
<p>The parameters you create can be Boolean, string, or numeric, and you can either choose to allow any value as input (as we did for offset) or limit input to pre-defined options.</p>
<p><img src="/images/blog/using-custom-queries/4_parameter.PNG" alt="A parameter in the custom query editor of Google Data Studio, demonstrating how the user can limit the input to specific values (in this case UK nations)" />
<em>Figure 4: Creating a parameter for a custom query. This is a string parameter that is limited to take on only four values, the UK nations.</em></p>
<p>To incorporate your parameter into your query, simply type <code class="language-plaintext highlighter-rouge">@[parameter name]</code>, as shown below:</p>
<pre><code class="language-SQL">SELECT
date,
area_name,
cases
FROM `data.covid_uk_incidence`
-- filter by nation
WHERE area_name = @nation
</code></pre>
<p>To introduce branching logic with Boolean parameters, you would use a CASE statement:</p>
<pre><code class="language-SQL">SELECT
date,
area_name,
CASE @smooth
WHEN false THEN cases
WHEN true THEN
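      -- seven-day moving average: the current row plus the six preceding days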
AVG(cases) OVER (
PARTITION BY (area_name)
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
  END AS cases
FROM `data.covid_uk_incidence`
WHERE area_name = @nation
</code></pre>
<p>And to introduce an offset to time series data, create a parameter of type <strong>Number(whole)</strong> and insert it into a <strong>DATE_ADD</strong> function in your query.</p>
<pre><code class="language-SQL">SELECT
DATE_ADD(date, INTERVAL @offset DAY) AS date,
area_name,
CASE @smooth
WHEN false THEN cases
WHEN true THEN
AVG(cases) OVER (
PARTITION BY (area_name)
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
  END AS cases
FROM `data.covid_uk_incidence`
WHERE area_name = @nation
</code></pre>
<p>There are many other ways to make use of parameters in custom queries. For example, using a parameter to filter data (as shown above) prevents Data Studio from loading an entire dataset only to filter it again afterwards.</p>
<p>To find out more about using custom queries with parameters, take a look at Data Studio’s <a href="https://support.google.com/datastudio/answer/6370296?hl=en#parameters">documentation</a>. You can also explore the <a href="https://datastudio.google.com/s/opW6eX16BX4">report</a> described in this post.</p>
<p><em>Mia Hatton (mia.hatton@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/blending-data-using-choromap">Blending Data Using Choromap</a> (2021-05-19)</h1>
<p>Previously, we — the Data Science Campus at the Office for National Statistics — created blog posts on how to use our Google Data Studio Community Visualisations <a href="https://datasciencecampus.github.io/creating-choropleth-maps-in-google-data-studio/">choromap</a> and <a href="https://datasciencecampus.github.io/easier-better-faster-stronger-choromap-lite-for-Google-Data-Studio/">choromap-lite</a>. Both tools have been very successful, not only for projects we run at the Data Science Campus, but also for others around the world looking for a choropleth map solution to their Data Studio visualisation needs. Since releasing these tools, we have spoken to people around the world about how to use choromap and choromap-lite. We have also incorporated outside collaborators’ improvements via our <a href="https://github.com/datasciencecampus/community-visualizations">GitHub page</a>.</p>
<p>Until now, the choice of whether to use choromap or choromap-lite depended on the user’s needs. choromap-lite is useful for small GeoJSON files and time series data; whereas choromap can handle larger GeoJSON files.</p>
<p>However, in this post, we show how the heavyweight choromap can use <a href="https://support.google.com/datastudio/answer/9061420?hl=en">Google Data Studio’s Blend</a> to:</p>
<ol>
<li>improve run-time and efficiency when displaying time series data, and</li>
<li>allow the user to change the GeoJSON boundaries displayed.</li>
</ol>
<p>That is, the user could view data at a national level, then switch to exploring the data at a more granular level. With choromap you can put the power of data exploration on choropleth maps in the hands of your user.</p>
<h2 id="1-why-not-just-a-bigquery-join">1 Why not just a BigQuery join?</h2>
<p>Suppose we have a time series data set comprising multiple geography levels (we call this the Metric data set). Each geography area has a single data point for each day of the week. The total number of data points is n x m, where n is the number of polygons in all geography levels and m is the number of days. On its own, this data set may not be large. For a time period of around two weeks the data may only be 1 MB or less. However, the Geometry (GeoJSON) data for all these polygons may be considerably larger, at around 10 MB for even the most generalised shapes (we call this the Geometry data set). If we joined those two data sets together so that each data point had a geometry cell (as required in choromap), our resulting table would be over 200 MB in size. This is because of the replication of the geometry information for each time series data point. For a year’s worth of data, this would be several GB of data when joined.</p>
<p><img src="/images/blog/blending-data-using-choromap/stitch-image.jpeg" alt="An image showing a sewing machine" />
<em>Figure 1: An image showing a sewing machine</em></p>
<p>Thus, by joining the two data sets, all we have done is make a larger data table before it has been passed to Data Studio and choromap. A useful thing about Data Studio is that metric aggregations are done before the data is passed to a Community Visualisation like choromap. So, what if we could only load the Metric and Geometry data we want to visualise and then join the filtered data sets?</p>
<p>That’s where blending in Data Studio comes in.</p>
<h2 id="2-data-studio-blend">2 Data Studio Blend</h2>
<p>Before blending, both the Metric and Geometry data sets need to be added to the report. After that, choromap needs to be added from the Community visualisations and components menu. Once these are added, the Metric and Geometry data sets can be blended by clicking (+) Blend Data below the data source on the right-hand side when editing the choromap widget. In the first column, the Metric data set can be selected; the second column can then comprise the Geometry data set. The number of join keys depends on whether the Metric data set comprises multiple geography levels. If it is a single geography level, then the join key can just be the unique area ID, common to both data sets; if the data comprises multiple geography levels (national, county, city etc.) then the join key should be the unique ID and a field specifying the geography level. This is because in some cases different geography level areas may share the same code, e.g. a county which is also a city. Finally, the relevant dimensions and metrics can be included in the blend.</p>
<h2 id="3-covid-19-cases-example">3 COVID-19 Cases Example</h2>
<p>COVID-19 cases data is an ideal example of time series data presented at multiple geography levels. The UK Government make a number of statistics available for the general public at <a href="https://coronavirus.data.gov.uk/">coronavirus.data.gov.uk</a>. Here, we have used their API to download COVID-19 case data (by specimen date) for the time period 1st January 2021 to 15th January 2021. We have also downloaded the data for the UK, each UK nation (England, Scotland, Wales and Northern Ireland) and for each English region (London, East of England, South East, Midlands, North West, South West, and North East and Yorkshire).</p>
<p><em>Disclaimer: this is a cut of official data and our use here as part of the Office for National Statistics is purely for visualisation purposes and not to comment on the data or trends.</em></p>
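<p>For anyone wanting to reproduce the download step, the sketch below is illustrative only: it assumes the dashboard’s v1 <code>data</code> endpoint and the <code>newCasesBySpecimenDate</code> metric name, so check the dashboard’s API documentation for the current endpoints, filters and metric names.</p>
<pre><code class="language-python">import json
import requests

# assumed endpoint and metric names; verify against the dashboard's API documentation
ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"
params = {
    "filters": "areaType=nation;areaName=England",
    "structure": json.dumps({"date": "date", "newCases": "newCasesBySpecimenDate"}),
}

response = requests.get(ENDPOINT, params=params, timeout=30)
response.raise_for_status()
cases = response.json()["data"]  # a list of {"date": ..., "newCases": ...} records
</code></pre>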
<p>The downloaded COVID-19 case data was uploaded to Google BigQuery alongside a GeoJSON CSV file comprising the geometries for each geography level. These geometries were downloaded from <a href="https://geoportal.statistics.gov.uk/">geoportal.statistics.gov.uk</a>, converted to WGS84 reference system using QGIS and then saved as a CSV as shown in <em>Section 1. Data: Fitting a square peg in a round hole</em> of <a href="https://datasciencecampus.github.io/creating-choropleth-maps-in-google-data-studio/">this blog post</a>. Both data sources were added to a Data Studio report and then blended, as shown below.</p>
<p><img src="/images/blog/blending-data-using-choromap/blend.png" alt="An image showing how the two data sources were blended" />
<em>Figure 2: An image showing how the two data sources were blended</em></p>
<p>choromap can then use the geometry information from the Geometry data set and the data from the Metric data set to create the visualisation. You can then filter the data based on a date and/or the geography level you’re interested in. The final result is shown below in a GIF.</p>
<p><img src="/images/blog/blending-data-using-choromap/blend.gif" alt="A GIF showing how the user can choose different dates and geography levels to present the data at on a map " />
<em>Figure 3: A GIF showing how the user can choose different dates and geography levels to present the data at on a map</em></p>
<p>We have shown in this post how we can use Google Data Studio’s Blend functionality with choromap to make engaging and dynamic choropleth maps. A report for this work has been created <a href="https://datastudio.google.com/s/opW6eX16BX4">here</a>.</p>
<p><em>Michael Hodge (michael.hodge@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/easier-better-faster-stronger-choromap-lite-for-Google-Data-Studio">Harder (Easier), better, faster, stronger; choromap-lite for Google Data Studio</a> (2020-11-17)</h1>
<p>In <a href="https://datasciencecampus.github.io/creating-choropleth-maps-in-google-data-studio/">this article</a> I introduced a tool for creating choropleth maps in Google’s Data Studio. I was really chuffed with choromap, but it soon became apparent it wasn’t sustainable for certain data sets. Take time series, geospatial data sets, for example, where you have a metric per day, per region. In the original choromap approach, each row of your data set would have to carry the geometry information needed to plot. This is problematic for two reasons:</p>
<ol>
  <li>Geometry information within the raw data increases your file size. With time series data the number of rows is t * a, where t is the number of time series steps and a is the number of areas. Essentially we only need the geometry information once for each of the a areas, but choromap forces us to repeat it t-fold. This is a waste of computation and storage.</li>
<li>D3 is essentially taking all those layers of the same geometry and stacking them. This again is a waste of computation.</li>
</ol>
<p>Instead of having the geometry information within the data, a better approach is to have the geometry information separate and combine the two. This is what <a href="https://datastudio.google.com/reporting/d0c938d0-9e49-4781-8cab-ec159c233499">choromap-lite does</a> (click for Google Data Studio example report).</p>
<p><img src="/images/blog/choromap/choromap-lite1.gif" alt="choromap-lite-gif" /></p>
<h2 id="1-method">1. Method</h2>
<p>Data Studio doesn’t allow the user to load data externally. This means we cannot host geometry data externally and load it into our D3 code — boo. Instead all the information must be passed directly to D3 using Data Studio. Whereas before we used the <em>DATA</em> dimensions to pass through the geometry information, here instead we rely on ingesting the geometry in the <em>STYLE</em> tab.
As we don’t need to combine the data prior to data ingest, this process is much, much easier. All we need is our structured data file, which comprises at minimum the following:</p>
<ul>
<li>An ID for the area (e.g. 1, 2, 3)</li>
<li>A name for the area (e.g. The Place, The Good Place, The Bad Place)</li>
<li>A metric for the area (e.g. review: 3, 5, 1)</li>
</ul>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id | name | review
1 | The Place | 3
2 | The Good Place | 5
3 | The Bad Place | 1
</code></pre></div></div>
<p>You could also include a timestamp for time series data sets, and additional metrics.</p>
<p>For plotting the geometries we need a GeoJSON file with a CRS (Coordinate Reference System) set to WGS84. In the properties field of the GeoJSON object there should be:</p>
<ul>
<li>An ID for the area</li>
<li>A name for the area</li>
</ul>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"type": "FeatureCollection",
"name": "some_polygons",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "id": "1", "name": "That Place"}, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ . . .
</code></pre></div></div>
<p>As you can see, it’s then a simple case of joining the data and the GeoJSON based on the ID.</p>
<h2 id="2-javascript-join">2. JavaScript Join</h2>
<p>Whilst the rest of our code remains similar to choromap, choromap-lite uses a JavaScript method to combine the data and geoJSON:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>geojson.features.forEach(val => {
let { properties } = val
let newProps = new_data[properties.id]
val.properties = { ...properties, ...newProps }
if (typeof val.properties.met === "undefined") {
val.properties.met = null
}
})
</code></pre></div></div>
<p>In essence, this takes the var named geojson, which is our GeoJSON object, and then for each geometry it appends our DATA metric to the object. Nice and easy*.</p>
<p><em>*There are hidden steps in the code to turn all field names in properties to those required above.</em></p>
<h2 id="3-using-in-data-studio">3. Using in Data Studio</h2>
<p>To use choromap-lite, import it into your Data Studio report: click Community visualisations and components and select choromap-lite, or if it isn’t there:</p>
<ol>
<li>click Explore More</li>
<li>click Build your own visualisation</li>
<li>in the manifest path type gs://choromap-lite/</li>
</ol>
<p>Then, to get this to work in Data Studio all we need to do is paste the GeoJSON text into the STYLE tab field named Input GeoJSON. Below this are two boxes where you enter the ID and name of each geometry within the GeoJSON object. For example for:</p>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input GeoJSON
{
"type": "FeatureCollection",
"name": "some_polygons",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "id": "1", "name": "That Place"}, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ . . .
</code></pre></div></div>
<p>The <em>ID</em> field would be <em>id</em> and the <em>name</em> field would be <em>name</em>.</p>
<p>In the <em>DATA</em> tab the top dimension should be the ID within the data set that links the data to the geometry ID, and the bottom dimension should be the name that links the data to the geometry name.</p>
<p>You can then select up to 3 metrics you want to plot. By default the top metric will plot, but the user can use the dropdown box to choose the metric they want (new feature, not in choromap). In addition, you can <em>zoom</em> and <em>pan</em> choromap-lite!</p>
<p>Below is an example of it in action. Note, after you paste the GeoJSON in STYLE you’ll need to refresh the page.</p>
<p><img src="/images/blog/choromap/choromap-lite2.gif" alt="choromap-lite example" /></p>
<p><img src="/images/blog/choromap/choromap-lite3.png" alt="choromap-lite example" /></p>
<blockquote>
<p>An example of the output from using choromap-lite</p>
</blockquote>
<p>And that’s it…just check out the STYLE tab for aesthetics.</p>
<p>Thanks for reading.</p>
<h2 id="4-known-limitations">4. Known Limitations</h2>
<p>So, choromap-lite is super good at doing a lot, but if you throw it some mega-big GeoJSON (10MB+) it’ll probably crash your dashboard. For large, complex geometries it’s probably best to use the original choromap tool. If you have time series data though I do recommend choromap-lite, it’ll save on BigQuery costs if you use them; just simplify your geometries a little first.</p>
<p>This post has also been posted to <a href="https://medium.com/@michaelstvnhodge/harder-easier-better-faster-stronger-choromap-lite-for-google-data-studio-cf60625fe241">medium</a>.</p>
<p><em>Michael Hodge (michael.hodge@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/creating-choropleth-maps-in-google-data-studio">Creating choropleth maps in Google Data Studio</a> (2020-07-13)</h1>
<p>Over the past few months I’ve been using <a href="https://datastudio.google.com/u/0/">Google Data Studio</a> in my role at the Data Science Campus, Office for National Statistics. Having used data visualisation tools like Shiny, D3 and Dash to create bespoke dashboards in other projects, I did groan when asked to use Google’s late entry into the dashboard market. Things got off to an amicable start when using the built-in line and bar chart tools; however, there was a noticeable lack of customisable mapping tools in Data Studio. Ok, so it isn’t fair to point the finger just at Google in this market, because as far as I can see only Tableau offers a decent mapping capability. Even Google’s latest acquisition, Looker, <a href="https://docs.looker.com/exploring-data/visualizing-query-results/interactive-map-options">seems to be limited</a>. It’s just that, after using libraries such as leaflet, D3 or deck.gl, I was a little disappointed that Google hadn’t ported their geo-based infrastructure over to their Data Studio product. Credit to Google though, Data Studio is free, whereas others in the market are not.</p>
<p>Currently, at the time of writing, Google Data Studio only has two mapping tools: Geo Map and Google Maps. Geo Map allows for the creation of <a href="https://en.wikipedia.org/wiki/Choropleth_map">choropleth maps</a> at country and sub-country level. For the US, this is US and state level. For the UK, this is UK and devolved nation level (England, Wales, Scotland and Northern Ireland). Choropleths cannot be created for smaller areas than these. The Google Maps tool in Data Studio allows for the plotting of Bubble Maps, whose centroid can be tied to areas as large as a country, or specific latitude and longitudes. Currently, there is no ability to visualise polygons in Google Maps in Data Studio, meaning that choropleths cannot be created.</p>
<p><img src="/images/blog/choromap/choro-post-img1.png" alt="choropleth map" /></p>
<blockquote>
<p>What we really want to be able to do in Google Data Studio (source: https://cartographicperspectives.org/index.php/journal/article/view/cp81-butler/1437)</p>
</blockquote>
<p>Thankfully, Google haven’t shipped Data Studio as a closed product. Their intention seems to be to see how others use it, and they are willing to receive feedback. In line with this, they allow data visualisation enthusiasts to create ‘community visualisations’. The only community visualisation mapping tool currently on offer is <a href="https://datastudio.google.com/u/0/reporting/09fd83d4-4e99-41bb-853a-30e720a716dc/page/cY2z">r42’s hexmap</a>, made by Ralph Spandl. It’s an excellent tool and will definitely be useful for visualising metrics of dense point data. However, no one had yet cracked (or found the need for) custom choropleth mapping in community visualisations.</p>
<p>Here, I present a new community visualisation tool called <strong><a href="https://github.com/datasciencecampus/community-visualizations/tree/master/choromap">choromap</a></strong> that creates choropleth maps based on user-created polygons. <a href="https://datastudio.google.com/reporting/4617cbac-3514-4c8d-a999-a3cb6683e579">Here is an example of choromap in action in a Data Studio report</a>.</p>
<p><img src="/images/blog/choromap/choro-post-img2.png" alt="choromap" /></p>
<blockquote>
<p>Choromap allows the user to choose the boundaries they want to visualise. Here are Counties and Unitary Authorities for England, UK.</p>
</blockquote>
<h3 id="1-data-fitting-a-square-peg-in-a-round-hole">1. Data: Fitting a square peg in a round hole</h3>
<p>The first challenge was to get Google Data Studio to recognise geographical data. To start, Google Data Studio does not recognise BigQuery GEOGRAPHY data. A side note here: BigQuery GEOGRAPHY data from public data sets does not work well in choromap even after converting it to GeoJSON; see the Known Limitations section of this post. All geographical data must be loaded into Data Studio as a STRING. Here, we need to create a STRING GeoJSON type data set and save it to a CSV file, which can then be read into Data Studio through one of its many ingestion methods (Cloud Storage, Google Sheets, BigQuery, file upload, to name a few).</p>
<p>For the complete guide on how to create the GeoJSON type CSV file, please refer to <a href="https://medium.com/google-cloud/how-to-load-geographic-data-like-zipcode-boundaries-into-bigquery-25e4be4391c8">Lak Lakshmanan’s excellent Medium post on ‘How to load geographic data like shapefiles into BigQuery’</a>. Below is a simplified version of the method, which requires <code class="language-plaintext highlighter-rouge">gdal-bin</code>:</p>
<ol>
<li>download a shapefile (.shp) of the geometries you wish to plot. For example <a href="http://geoportal.statistics.gov.uk/datasets/counties-december-2017-super-generalised-clipped-boundaries-in-england">Counties (December 2017) Super Generalised Clipped Boundaries in England by ONS</a></li>
  <li>Convert the CRS (Coordinate Reference System) to WGS84, if it is not already.</li>
<li><em>unzip</em> the folder.</li>
<li><em>cd</em> to the folder.</li>
<li>run <code class="language-plaintext highlighter-rouge">ogr2ogr -f csv -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from <filename>" <filename>.csv <filename>.shp</code> where <code class="language-plaintext highlighter-rouge"><filename></code> is the filename of the shapefile.</li>
</ol>
<p>A pre-created, sample data set for England’s Counties and Unitary Authorities can be found <a href="https://storage.cloud.google.com/choromap/choromap.csv?organizationId=425126312691&supportedpurview=project">here (6.97 MB)</a>. The schema of this sample data is as follows:</p>
<table>
<thead>
<tr>
<th>Column Name</th>
<th>Description</th>
<th>type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row</td>
<td>The row number</td>
<td>STRING</td>
</tr>
<tr>
<td>ctyua20cd</td>
<td>The area code</td>
<td>STRING</td>
</tr>
<tr>
<td>ctyua20nm</td>
<td>The area name</td>
<td>STRING</td>
</tr>
<tr>
<td>geometry</td>
<td>The geometry</td>
<td>STRING</td>
</tr>
<tr>
<td>st_areashape</td>
<td>The shape area</td>
<td>float</td>
</tr>
<tr>
<td>st_lengthshape</td>
<td>The shape length</td>
<td>float</td>
</tr>
</tbody>
</table>
<p>The important column is <code class="language-plaintext highlighter-rouge">geometry</code>. Below is an extract of the geometry for <code class="language-plaintext highlighter-rouge">The City of London</code> in the UK:</p>
<pre><code class="language-txt">{"type":"MultiPolygon","coordinates":[[[[-0.096786300655468,51.52332130413849],[-0.096469831048681,51.52282154015623],[-0.095088768971737,51.52313723284216],[-0.094344739975396,51.5214831278268],[-0.092516830405343,51.52148578000455],[-0.092374579177695,51.52102758467574],[-0.08969358569076,51.52071506110647],[-0.090004405037251,51.51997008801451],[-0.086227541891935,51.51880878001998],[-0.085217909720332,51.52033453455192],[-0.083325592193781,51.51981439149066],[-0.081762362787683,51.52075732827536],[-0.081050452990135,51.52195339914494],[-0.078471489442032,51.52151013413389],[-0.079429829306423,51.51884510407231],[-0.078082679099946,51.51896786642452],[-0.078146943602546,51.51846889558505],[-0.076876952796642,51.51665852905],[-0.073969190592163,51.51445357376097],[-0.073063262728713,51.5118083055404],[-0.072781089875862,51.51029829096138],[-0.074550758806615,51.50995867750763],[-0.075584143211524,51.5097499891425],[-0.076285605682929,51.5105438078804],[-0.077789551314662,51.51011438065631],[-0.078882667624717,51.50941192157917],[-0.079099326627114,51.50905757810306],[-0.078721305098214,51.50882744549881],[-0.079395354287654,51.50781128259451],[-0.080360355480934,51.50808169862811],[-0.085479453429595,51.50860342872304],[-0.087115487530159,51.50898448206557],[-0.088668830381722,51.50896992503621],[-0.091976761860799,51.50942135075249],[-0.095234497930682,51.51017176588185],[-0.095201094749157,51.51061514125803],[-0.096162542706015,51.51026430527382],[-0.099899320495398,51.51082545323712],[-0.108470620393255,51.51087126554509],[-0.11158056489403,51.51083164484565],[-0.111567244164817,51.51173049399255],[-0.112414895757562,51.51276926532587],[-0.111738730695415,51.51319547361804],[-0.111980747153716,51.51368491737404],[-0.111101534210641,51.51382547859707],[-0.111606871170285,51.51533799647236],[-0.113821109173385,51.51825760445579],[-0.107826700590834,51.51776531637376],[-0.105349963286698,51.51854099504494],[-0.101820882919502,51.5196655764016],[-0.100301084969584,51.52012831764441],[-0.097670260922139,51.5207223492556],[-0.097624204154032,51.52103184600052],[-0.097403288303114,51.5215930126154],[-0.097972548740085,51.52287738232243],[-0.096786300655468,51.52332130413849]]],[[[-0.10423511672847,51.5086262019039],[-0.104688136951949,51.50840920893765],[-0.104701932690223,51.50863143466984],[-0.10423511672847,51.5086262019039]]]]}
</code></pre>
<p>As you can see, it’s just a JSON object masked as a STRING. To combine this file with a data set comprising useful information, you can then either JOIN locally and upload, or upload as is and use Data Studio’s BLEND tool.</p>
<h3 id="2-under-the-hood-of-choromap-d3js">2. Under the hood of choromap: D3.js</h3>
<p>Choromap was created using a medley of D3 guides. I took inspiration and code from the <a href="https://www.d3-graph-gallery.com/graph/choropleth_basic.html">most basic choropleth map in d3.js</a>, the <a href="https://www.d3-graph-gallery.com/graph/choropleth_hover_effect.html">choropleth map with hover effect in d3.js</a> and <a href="https://observablehq.com/@d3/choropleth">Observable’s choropleth guide</a>, for example, although our approach doesn’t use the asynchronous method presented in those guides. Of course, I also spent hours on StackOverflow.</p>
<p>I also largely followed the <a href="https://codelabs.developers.google.com/codelabs/community-visualization/#0">Create Custom Javascript Visualizations in Data Studio Codelab guide</a>, which provides a good basic example of creating a bar chart.</p>
<p>Choromap in essence uses a lot of pre-existing D3 libraries, such as the tool-tip and legend libraries. It relies on the user-defined information specified in Data Studio to create a GeoJSON type object, which it can then plot.</p>
<h3 id="3-using-choromap-in-data-studio">3. Using choromap in Data Studio</h3>
<p>To use choromap, first import it into your Data Studio report: click Community visualisations and components and select <strong>choromap</strong>, or if it isn’t there:</p>
<ul>
<li>click Explore More</li>
<li>click Build your own visualisation</li>
<li>in the manifest path type <code class="language-plaintext highlighter-rouge">gs://choromap/</code></li>
</ul>
<p>Then add the chart to your report. In the <code class="language-plaintext highlighter-rouge">DATA</code> tab, for the <code class="language-plaintext highlighter-rouge">Dimensions</code> make sure the first is the name or code of the area and the second is the STRING geometry column. You can then select a <code class="language-plaintext highlighter-rouge">Metric</code> you want the colour map to plot against. This should be all you need to do (oh, and make sure Community visualisations access is ‘on’ for the data source).</p>
<p>In the <code class="language-plaintext highlighter-rouge">STYLE</code> tab you can then customise:</p>
<ul>
<li>the minimum, middle and maximum value plot colours (and null values)</li>
<li>the boundary line colour and width</li>
<li>the text colour</li>
<li>a number of legend options including orientation, position and title</li>
<li>mapping to linear, quantile or custom colours breaks</li>
</ul>
<p>To see choromap in action, <a href="https://datastudio.google.com/reporting/4617cbac-3514-4c8d-a999-a3cb6683e579">this is an example of the Counties and Unitary Authorities data set</a>.</p>
<h3 id="4-example-usage">4. Example usage</h3>
<p>We can calculate the roundness (circularity) of an object by using the following formula:</p>
<p>Roundness = 4 * π * Area / Perimeter ^ 2</p>
<p>When we apply this to England’s Counties and Unitary Authorities, we find that Enfield in Greater London is England’s most round (circular) authority, with a value of 0.61. The least round authority is Oxfordshire, with a value of 0.19. The infographic below was created using Data Studio and choromap. The legend shows a linear colourmap between grey and hot pink. The two polygons at the bottom were created using Data Studio’s filter option, filtering on Enfield and Oxfordshire.</p>
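<p>For anyone wanting to reproduce the calculation, it only needs the area and perimeter columns from the sample data set described in Section 1. A minimal sketch, assuming that CSV and its column names:</p>
<pre><code class="language-python">import math
import pandas as pd

def roundness(area, perimeter):
    """Roundness (circularity): 1.0 for a perfect circle, lower for elongated shapes."""
    return 4 * math.pi * area / perimeter ** 2

df = pd.read_csv("choromap.csv")  # the sample Counties and Unitary Authorities file
df["roundness"] = roundness(df["st_areashape"], df["st_lengthshape"])
print(df.sort_values("roundness", ascending=False)[["ctyua20nm", "roundness"]].head())
</code></pre>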
<p><img src="/images/blog/choromap/choro-post-img3.png" alt="choromap" /></p>
<blockquote>
<p>England’s authorities based on their ‘roundness’.</p>
</blockquote>
<h3 id="known-limitations">5. Known Limitations</h3>
<p>As with all things in life, things aren’t as simple as they appear. Below are the current known limitations of using and plotting data with choromap.</p>
<ul>
<li>Because of something called a <a href="https://github.com/d3/d3-geo#d3-geo">winding order convention</a>, raw GeoJSON files will not work. The file must be a pseudo-GeoJSON CSV file created using the method above.</li>
<li>Unfortunately, BigQuery public data sets also use the same convention as GeoJSON and therefore cannot be converted easily to use with this tool.</li>
<li>Both of the above limitations can largely be resolved by <em>simplifying</em> the geometries using <code class="language-plaintext highlighter-rouge">ST_SIMPLIFY</code> (see the sketch after this list), but this ultimately makes the geometries more <em>simple</em>. For plotting this isn’t too much of an issue, but complex geometries are important for computational geometry.</li>
</ul>
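<p>As a hedged sketch of that simplification step (the project, dataset, table and column names below are placeholders rather than the ones used here), geometries can be simplified and exported as text via the BigQuery python client:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch only: simplify boundaries in BigQuery before exporting for plotting.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT area_code,
           ST_ASGEOJSON(ST_SIMPLIFY(geom, 100)) AS simplified_geometry  -- tolerance in metres
    FROM `my-project.my_dataset.boundaries`
"""
for row in client.query(sql).result():
    print(row.area_code, row.simplified_geometry[:60], "...")
</code></pre></div></div>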
<h3 id="summary">Summary</h3>
<p>This post shows how I created a community visualisation tool called <a href="https://github.com/datasciencecampus/community-visualizations/tree/master/choromap">choromap</a> for Google’s Data Studio, which allows users to create choropleth maps for bespoke boundaries.</p>
<p>This post has also been posted to <a href="https://link.medium.com/a4QqECWd57">medium</a>.</p>Michael Hodgemichael.hodge@ons.gov.ukUsing D3 to create a community visualisation in Google’s Data Studio to plot choropleth maps.A list of the Campus’ Open Source tools2020-05-13T00:00:00+00:002020-05-13T00:00:00+00:00https://datasciencecampus.github.io/tools<ul>
<li><a href="#graphite">graphite</a></li>
<li><a href="#green-spaces">green-spaces</a></li>
<li><a href="#mobius">mobius</a></li>
<li><a href="#optimus">optimus</a></li>
<li><a href="#proper">propeR</a></li>
<li><a href="#pygrams">pyGrams</a></li>
<li><a href="#readpyne">readpyne</a></li>
<li><a href="#woffle">woffle</a></li>
</ul>
<hr />
<h2 id="graphite">graphite</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/graphite">link ⧉</a></p>
<p>Building an OpenTripPlanner server locally using a Java Virtual Machine (JVM) or Docker.</p>
<hr />
<h2 id="green-spaces">green-spaces</h2>
<p>GitHub pages: <a href="https://datasciencecampus.github.io/green-spaces/">link ⧉</a></p>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/green-spaces">link ⧉</a></p>
<p>The Green Spaces project comprises a python tool that can render GeoJSON polygons over aerial imagery and analyse pixels contained within the polygons.</p>
<p><img src="/images/tools/green-spaces.png" alt="green-spaces" /></p>
<hr />
<h2 id="mobius">mobius</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/mobius">link ⧉</a></p>
<p>A python tool for extracting data from the Google Mobility Reports published to show how mobility has changed during the COVID-19 pandemic.</p>
<p><img src="/images/tools/mobius.png" alt="optimus" /></p>
<hr />
<h2 id="optimus">optimus</h2>
<p>GitHub pages: <a href="https://datasciencecampus.github.io/optimus/">link ⧉</a></p>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/optimus">link ⧉</a></p>
<p>A text processing pipeline for turning unstructured text data into hierarchical datasets.</p>
<p><img src="/images/tools/optimus.png" alt="optimus" /></p>
<hr />
<h2 id="proper">propeR</h2>
<p>GitHub pages: <a href="https://datasciencecampus.github.io/proper/">link ⧉</a></p>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/proper">link ⧉</a></p>
<p>This R package was created to analyse multimodal transport for a number of research projects at the Data Science Campus, Office for National Statistics.</p>
<p><img src="/images/tools/proper.png" alt="proper" /></p>
<hr />
<h2 id="pygrams">pyGrams</h2>
<p>GitHub pages: <a href="https://datasciencecampus.github.io/pygrams/">link ⧉</a></p>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/pygrams">link ⧉</a></p>
<p>This python-based app is designed to extract popular or emergent n-grams/terms (words or short phrases) from free text within a large (>1,000) corpus of documents. Example corpora of granted patent document abstracts are included for testing purposes.</p>
<p><img src="/images/tools/pygrams.png" alt="pygrams" /></p>
<hr />
<h2 id="readpyne">readpyne</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/readpyne">link ⧉</a></p>
<p>Toolkit for extracting relevant lines from receipts or similar image data.</p>
<p><br /><br /></p>
<hr />
<h2 id="tron">tron</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/tron">link ⧉</a></p>
<p>Detecting collective human behaviour in internet bandwidth consumption data.</p>
<hr />
<h2 id="woffle">woffle</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/woffle">link ⧉</a></p>
<p>woffle is a project template which aims to allow you to compose various NLP tasks via a common interface, using each of the most popular currently available tools. This includes spaCy, fastText and flair.</p>
<p><br /><br /></p>Michael Hodgemichael.hodge@ons.gov.ukA list of the Campus' Open Source toolsMx Tea: a tea break session planning bot using Slack and Google Cloud Platform2020-04-21T00:00:00+00:002020-04-21T00:00:00+00:00https://datasciencecampus.github.io/creating-tea-breaks-on-slack<h3 id="updates">Updates</h3>
<p>12/05/2020: Thanks to S.Tietz for his recommendations on grouping and for spotting mistakes in this post.
25/02/2021: Mr Tea is now Mx Tea. channels.history is deprecated and has been replaced by conversations.history.</p>
<p>Since the COVID-19 lockdown came into force, we’ve been working on keeping the Data Science Campus team community going, looking after our wellbeing and making sure we’re all generally ok. And some of that involves novel and fun ways of making working-from-home feel less like working-on-your-own.</p>
<p>As well as the usual slack channels filled with pictures of pets, the outside world, and children’s first steps, we’ve been reinventing <strong>tea breaks</strong>. Here, we explain how this has developed into an automated system that could be used elsewhere. <a href="https://github.com/datasciencecampus/mx-tea">Our code repository for this can be found here ⧉</a>.</p>
<h2 id="the-tea-break">The tea break</h2>
<p>Each Tuesday and Thursday at 14:00, members of the Campus dial into a meeting to have a chat with colleagues whilst sipping a cup of tea. The novelty of these tea break sessions is that the meeting comprises four-to-five randomly assigned colleagues. The benefits of this are two-fold. Firstly, it keeps the group size small enough for all attendees to engage in the conversation. Secondly, it rotates the colleagues one chats to. So far, we have seen a really positive uptake of these sessions.</p>
<p><img src="/images/blog/mxtea/tea.jpg" alt="pexels image of a laptop and tea mug on a table" /></p>
<blockquote>
<p>Source: <a href="https://www.pexels.com/photo/silver-laptop-computer-next-to-ceramic-cup-42408/">pexels</a></p>
</blockquote>
<h2 id="approach">Approach</h2>
<p>The initial approach was to have a dedicated Slack channel. In the morning, users would say whether they were available for the afternoon’s tea break. From this, the list of attendees could be randomised into small groups. To achieve this, we created a simple python script (a sketch of this kind of approach is shown below). A list of meeting links could be created to host the tea break sessions. It’s up to you which blend of meeting hosting applications you choose here. For ease, we created a number of recurring meetings every Tuesday and Thursday at 14:00 so that the meeting links do not change.</p>
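<p>That original script was only a few lines long. A minimal sketch of the shuffle-and-chunk idea (illustrative only, not the exact script we used):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch of the manual grouping approach.
import random

def make_groups(attendees, group_size=4):
    """ Shuffle the attendee list and split it into groups of roughly group_size """
    random.shuffle(attendees)
    return [attendees[i:i + group_size] for i in range(0, len(attendees), group_size)]

print(make_groups(["Ann", "Bea", "Cal", "Dev", "Eli", "Fay", "Gus"]))
</code></pre></div></div>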
<p>Ok, so 20-minutes work every Tuesday and Thursday isn’t exactly hard-graft, but we always wanted to make this more automated so it didn’t rely on one person to be the tea administrator.</p>
<p>Therefore, we created Mx Tea.</p>
<p><img src="/images/blog/mxtea/mxtea.gif" alt="a gif of the mx tea logo" /></p>
<blockquote>
<p>Source: Data Science Campus</p>
</blockquote>
<h2 id="you-created-what">You created what?</h2>
<p>Mx Tea is a <a href="https://api.slack.com/bot-users">Slack Bot</a> who asks a simple question…</p>
<p><img src="/images/blog/mxtea/question.png" alt="Mx tea asking a question in a channel" /></p>
<blockquote>
<p>Source: Data Science Campus</p>
</blockquote>
<p>Keen tea enthusiasts then give the post a thumbs up if they want to participate. We then use the list of people who gave a thumbs up to create groups for the tea break.</p>
<p><img src="/images/blog/mxtea/group.png" alt="Mx tea posting thee group results to a channel" /></p>
<blockquote>
<p>Source: Data Science Campus</p>
</blockquote>
<p>Simple!</p>
<h3 id="ok-but-really-how">Ok, but, really how?</h3>
<p>Ok, so I may have skipped a few steps there on how Mx Tea actually works. Essentially it boils down to three steps:</p>
<ol>
<li>create the Slack App</li>
<li>post and gather responses with the Slack App using python</li>
<li>deploy to Google Cloud Platform</li>
</ol>
<h4 id="step-1-create-the-slack-bot">Step 1. Create the Slack Bot</h4>
<p>First of all, you need to <a href="https://api.slack.com/start">create a Slack App</a> and set the appropriate <strong>Scopes</strong> (minimum: <code class="language-plaintext highlighter-rouge">channels:history</code>, <code class="language-plaintext highlighter-rouge">channels:read</code>, <code class="language-plaintext highlighter-rouge">incoming-webhook</code>, <code class="language-plaintext highlighter-rouge">users:read</code>).</p>
<p>For posting messages to a Slack channel, we use <a href="https://api.slack.com/messaging/webhooks">Slack webhooks</a>. For getting the responses back from the channel we use an <code class="language-plaintext highlighter-rouge">OAuth Access Token</code>.</p>
<p>To use the App, you need to both <em>install</em> it to the desired channel (done in the App settings page), as well as <em>add</em> to the channel (done in the channel, like you would a real user: @app-name).</p>
<p>Now your App (bot) is ready for action.</p>
<h3 id="step-2-post-and-gather-responses-with-the-slack-app-using-python">Step 2. Post and gather responses with the Slack App using python</h3>
<p>Both posting and retrieving messages from Slack use <em>curl</em> to send an HTTP request. This can be done in your terminal, but we are going to use python here. For this we chose the <code class="language-plaintext highlighter-rouge">requests</code> library.</p>
<h4 id="step-21-posting-your-question-in-the-channel">Step 2.1. Posting your question in the channel</h4>
<p>Posting a question is nice and simple. For this you need the following:</p>
<ul>
<li>your <code class="language-plaintext highlighter-rouge">webhook_url</code></li>
<li>your message</li>
</ul>
<p>Then, using <code class="language-plaintext highlighter-rouge">requests</code> and <code class="language-plaintext highlighter-rouge">json</code> libraries we can make a request to the webhook.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env python
import requests
import json
""" Your Slack webhook """
slack_webhook = 'slack_webhook'
""" Question message """
slack_question_message = "Morning all, give this post a thumbs up if you are available for a tea break this afternoon!"
def ask_question():
    """ Asks the question to the Slack channel """
    headers = {'Content-type': 'application/json'}
    payload = {'text': slack_question_message}
    payload = json.dumps(payload)
    requests.post(slack_webhook, data=payload, headers=headers)
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">ask_question()</code> will then post the message to the channel you added the App to.</p>
<h4 id="step-22-creating-the-groups-and-posting-them-in-the-channel">Step 2.2. Creating the groups and posting them in the channel</h4>
<p>This is a little harder, but not too difficult. We will break it down into chunks.</p>
<p>First, you need the following python libraries:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env python
import requests
import random as rnd
import math
import json
</code></pre></div></div>
<p>Then, the first thing to do is to set up all the preamble variables we will need.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>""" Post your meeting links below """
meeting_links = \
[
'link_1',
'link_2',
'etc'
]
""" Your Slack webhook """
slack_webhook = 'slack_webhook'
""" Your Slack token """
slack_token = 'slack_token'
""" Slack channel ID """
slack_channel_id = 'slack_channel_ID'
""" Slack bot ID """
slack_bot_id = 'slack_bot_ID'
""" Question message """
slack_question_message = "Morning all, give this post a thumbs up if you are available for a tea break this afternoon!"
""" Requested group size (note, the algorithm may use +1 if ir provides a better split of people """
group_size = 4
</code></pre></div></div>
<p>Next, we want to get the posts from the channel that relate to our Slack App.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def get_posts():
""" Gets posts from the channel by the bot and returns as JSON object """
url = f"https://slack.com/api/conversations.history?token={slack_token}&channel={slack_channel_id}&bot_id={slack_bot_id}"
return requests.get(url).json()
</code></pre></div></div>
<p><em>Note: We have updated the above to use conversations.history, as channels.history was deprecated in February 2021.</em></p>
<p>Let’s call <code class="language-plaintext highlighter-rouge">posts = get_posts()</code> to get these posts. Ok, now we have all the App’s posts in that channel. But what we really want is the most recent post - the post to check the availability of people for the tea break. Therefore, let’s filter the results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def get_recent_question(posts):
""" Gets the most recent question post from the channel by the bot """
latest_poll = {}
for post in posts['messages']:
if post.get('text') == f"{slack_question_message}":
latest_poll = post
break
return latest_poll
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">poll = get_recent_question(posts)</code> will then bring back that specific post. Next, we want to get a list of people who gave the post a thumbs up. So we request this from the post using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def get_attendees_ids(poll):
""" Gets a list of attendees slack IDs """
attendees_ids = []
reactions = poll.get('reactions')
for reaction in reactions:
if reaction.get('name') == "+1":
attendees_ids.append(reaction.get('users'))
return attendees_ids[0]
</code></pre></div></div>
<p>Great, we call <code class="language-plaintext highlighter-rouge">attendees_ids = get_attendees_ids(poll)</code> and we get a list of those who gave a thumbs up. Slack allows us to tag people using their ID, so let’s do that. Next, let’s allocate these IDs to a group. There are two catches here that increase the group size if needed. First, the group size is increased by one if that gives a more even split of attendees. Second, if there are too many attendees for the number of meeting links, the group size is increased to compensate.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def allocate(ids):
""" Allocates ids to groups """
rnd.shuffle(ids)
result = []
# increase groupsize by one if the groups are more evenly split by doing so
groupsize = group_size if abs(((len(ids) / group_size) % 1) - 0.5) >= abs(((len(ids) / (group_size + 1)) % 1) - 0.5) else group_size + 1
# increase groupsize if we don't have enough meeting links
groupsize = max(groupsize, math.ceil(len(ids) / len(meeting_links)))
while True:
# Make sure no group has less than two people
if len(ids) > groupsize + 2:
result.append(ids[0:groupsize])
ids = ids[groupsize:]
else:
result.append(ids[0:])
break
return result
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">groups = allocate(attendees_ids)</code> then returns groups of randomly shuffled Slack IDs. Next, let’s create a nice message we can get the App to post to our desired Slack channel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def present(groups, meeting_links):
""" Create post for slack channel """
msg = []
msg.append("Good afternoon everyone, below are this afternoons tea break groups")
num = 0
for i, group in enumerate(groups):
msg.append("--------------------------------------------------------------------------------")
msg.append(f"Group {i + 1}: ")
msg.append(meeting_links[num])
msg.append("--------------------------------------------------------------------------------")
for member in group:
msg.append(f"<@{member}>")
num = num + 1
return '\n'.join(map(str, msg))
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">msg = present(groups, meeting_links)</code> returns this message string, built from the groups we just created and the meeting links we defined at the start. <code class="language-plaintext highlighter-rouge">f"<@{member}>"</code> tags the member in the post. The last thing to do therefore is to ask the App to post to the channel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def post_to_slack(msg):
""" Post groups to slack channel """
headers = {'Content-type': 'application/json'}
payload = {'text': msg}
payload = json.dumps(payload)
requests.post(slack_webhook, data=payload, headers=headers)
</code></pre></div></div>
<p>At this point you could call <code class="language-plaintext highlighter-rouge">post_to_slack(msg)</code> to fire off the message to check everything works.</p>
<p>Ok great. So how do we automate this so someone doesn’t have to run the python scripts? Well, for this we are going to use Google Cloud Platform’s <code class="language-plaintext highlighter-rouge">Cloud Function</code> and <code class="language-plaintext highlighter-rouge">Cloud Scheduler</code> features.</p>
<h2 id="step-3-deploy-to-google-cloud-platform">Step 3. Deploy to Google Cloud Platform</h2>
<h3 id="step-31-cloud-function">Step 3.1. Cloud Function</h3>
<p>Cloud Function is what does the bulk of our work here. It processes our two requests: post the original message, and post the groups, to the Slack channel. If you’ve read our other posts you’ll see we have used <a href="https://datasciencecampus.github.io/github-api-to-gcp/">Cloud Function and Scheduler before</a>, so we won’t go into detail here on how to set a function up. The fundamentals we need here are:</p>
<ul>
<li>a Pub/Sub trigger (so we can use Cloud Scheduler)</li>
<li>python 3.7 inline editor</li>
</ul>
<p>This isn’t a memory expensive process, so you can stick to the <code class="language-plaintext highlighter-rouge">256 MB</code> option. The <code class="language-plaintext highlighter-rouge">ask_question</code> function can be used as is, just remember to put <code class="language-plaintext highlighter-rouge">data, context</code> in the function’s header (e.g., <code class="language-plaintext highlighter-rouge">ask_question(data, context)</code>).</p>
<p>For the group posting functions, copy all the parts above (or see our <a href="https://github.com/datasciencecampus/mx-tea">GitHub repository with all the code in</a>) into the editor. Then, we need just one more <code class="language-plaintext highlighter-rouge">process</code> function, as Cloud Functions only runs one, single function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def process(event, context):
""" Pipeline process required for Cloud Function """
users = get_users()
posts = get_posts()
poll = get_recent_question(posts)
attendees_ids = get_attendees_ids(poll)
attendees_names = get_attendees_names(attendees_ids, users)
groups = allocate(attendees_names)
msg = present(groups, meeting_links)
post_to_slack(msg)
</code></pre></div></div>
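<p>The <code class="language-plaintext highlighter-rouge">process</code> function also calls <code class="language-plaintext highlighter-rouge">get_users()</code> and <code class="language-plaintext highlighter-rouge">get_attendees_names()</code>, which map Slack IDs to names; the full versions are in the <a href="https://github.com/datasciencecampus/mx-tea">GitHub repository</a>. A minimal sketch of how such helpers could work, based on Slack’s users.list endpoint (check the repository for the exact implementation):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch only; assumes the requests import and slack_token defined earlier.
def get_users():
    """ Gets the workspace member list from Slack """
    url = f"https://slack.com/api/users.list?token={slack_token}"
    return requests.get(url).json()

def get_attendees_names(attendees_ids, users):
    """ Maps attendee Slack IDs to real names, falling back to the ID if no name is found """
    id_to_name = {member['id']: member.get('profile', {}).get('real_name', member['id'])
                  for member in users.get('members', [])}
    return [id_to_name.get(user_id, user_id) for user_id in attendees_ids]
</code></pre></div></div>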
<p>Calling <code class="language-plaintext highlighter-rouge">post_groups</code> will run the other functions in the chain. It’s probably good at this stage to test both of these out and see if they work.</p>
<h3 id="step-32-cloud-scheduler">Step 3.2. Cloud Scheduler</h3>
<p>Cloud Scheduler runs our functions at a time that suits us using <a href="https://en.wikipedia.org/wiki/Cron">cron</a>. For the <code class="language-plaintext highlighter-rouge">ask_question</code> function we want this to run every Tuesday and Thursday at 09:00 so we set the frequency to <code class="language-plaintext highlighter-rouge">0 9 * * 2,4</code>. For the <code class="language-plaintext highlighter-rouge">post_groups</code> function we want this to post on the same dates but at 13:00, so we set the frequency to <code class="language-plaintext highlighter-rouge">0 13 * * 2,4</code>. We can test whether these work by clicking <code class="language-plaintext highlighter-rouge">RUN NOW</code>.</p>
<h2 id="summary">Summary</h2>
<p>If everything has worked out well you will now have set up a nice Slack App (bot) that can ask your Slack channel users who is available for a tea break session. It will then use the responses to create the randomised tea break session groups and post them to the Slack channel.</p>
<p>Feel free to add a nice photo to the bot in its settings, or make the channel posts a little nicer. Or, put your feet up and have a nice cup of tea.</p>Michael Hodgemichael.hodge@ons.gov.ukHow the Campus uses Slack and Google Cloud Platform to create tea break sessions for informal chatsHow the Campus extracted data from the Google Mobility Reports (Covid-19)2020-04-09T00:00:00+00:002020-04-09T00:00:00+00:00https://datasciencecampus.github.io/google-mobility-reports<h2 id="updates">Updates</h2>
<p><strong>16/04/2020</strong>: Google have released the data in CSV format. Click <a href="https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv">here</a> for the latest data.</p>
<p><strong>10/04/2020</strong>: The <a href="https://github.com/datasciencecampus/mobius">mobius pipeline (click here)</a> has been updated for the Friday 10th of April 2020 release of data. This dataset comprises a time series between Sunday 23rd February 2020 and Sunday 5th April 2020. The blog post below talks about the original data set, but the methodology remains the same.</p>
<h2 id="introduction">Introduction</h2>
<p><img src="/images/blog/google-mobility/logo.png" alt="mobius logo" /></p>
<p><em>Figure 1: mobius logo</em></p>
<p>At 7:00 am Friday 3rd April 2020, Google released <a href="https://www.google.com/covid19/mobility/">Community Mobility Reports</a> for 131 countries and 50 US states, and the District of Columbia. These reports show the changing levels of people visiting different types of locations for areas around Great Britain and other countries. The data provides insight into the impact of social distancing measures, and are created with aggregated, anonymised data from users who have turned on the Location History setting (off by default).</p>
<p>The reports contain <em>headline</em> numbers and <em>trend</em> graphs (Figure 3) showing how <em>mobility</em> has changed over time for six categories (retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential). The graphs comprise a daily mobility percent (%) value for each day between 16th February 2020 and 29th March 2020, relative to a baseline period from the 3rd January 2020 to 6th February 2020. Each report has national-level (or state-level) graphs, and a number of reports have regional-level graphs.</p>
<p>As these reports were released in PDF format, a team of Data Scientists at the Data Science Campus were tasked with gathering the data within the reports so that the raw data could be fed-in to inform the UK government response to Covid-19. We setup a <a href="https://github.com/datasciencecampus/mobius">public GitHub repository</a>, gave the project the name <strong>mobius</strong>, and began our work.</p>
<p>Our first job was to get the data for Great Britain (including all 150 regions, Figure 2), which we were able to provide to colleagues across government working on the Covid-19 response rapidly. This included working with ONS Geography to generate a new geographical boundary dataset that reflected Google’s.</p>
<p>These datasets, among others, can be found at the Campus’ <a href="https://github.com/datasciencecampus/google-mobility-reports-data">Google Mobility Reports Data GitHub repository</a>.</p>
<p><img src="/images/blog/google-mobility/GB-report.png" alt="report graph" /></p>
<p><em>Figure 2: All 78 pages of the Great Britain (GB) Google Mobility Report. The first two pages show national-level headline numbers and graphs. Subsequent pages comprise each of the 150 regions data.</em></p>
<h2 id="methodology">Methodology</h2>
<p>The main task for the team was to extract the text and graphs from the PDF reports and convert the headline numbers and trend line graphs to data. In addition, the team aimed to align this data to ONS geographies for Great Britain. Crucially, the team undertook various quality assurance steps before <a href="https://github.com/datasciencecampus/google-mobility-reports-data">publishing their data</a>. This was to ensure that the data, to the best of our knowledge, was an accurate representation to that presented in the reports.</p>
<h4 id="extracting-relevant-elements-from-scalable-vector-graphics">Extracting relevant elements from Scalable Vector Graphics</h4>
<p>The graphs in the reports were embedded as vector graphics. This means they comprised a vector coordinates system that could theoretically be used to calculate the mobility % value for each data point on the graphs. Until we realised the graphs were embedded as vectors, we were also looking at how to extract trend data directly from the charts using image analysis techniques.</p>
<p>First, we saved a number of individual graphs to Scalable Vector Graphics (SVG) format. We chose SVG as we wanted to retain the vector information of the graph. Crucially, SVG is drawn from mathematically declared shapes and curves (Bézier curves) that keep the vital vector coordinates system we required.</p>
<p>We then began to inspect these SVG files to see what information we could use to calculate the original data. It was found that the most important elements of each graph were the top (80%), middle (baseline), and bottom (-80%) horizontal lines, and the trend line (Figure 3). Both had unique stroke-widths and stroke colours that meant they could be easily filtered from the rest of the SVG elements. To filter these elements the team used the python library <a href="https://github.com/mathandy/svgpathtools">svgpathtools</a>. An example of a filtered SVG graph for the graph in Figure 3 is shown below.</p>
<p><img src="/images/blog/google-mobility/graph.png" alt="report graph" /></p>
<p><em>Figure 3: A graph from the GB report</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="100%" height="100%" viewBox="0 0 174 87" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" xmlns:serif="http://www.serif.com/" style="fill-rule:evenodd;clip-rule:evenodd;"><rect id="Artboard1" x="0" y="0" width="173.43" height="86.648" style="fill:none;"/>
// horizontal lines below //
<path d="M36.542,67.535l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
<path d="M36.542,53.246l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
<path d="M36.542,38.956l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
<path d="M36.542,24.666l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
<path d="M36.542,10.377l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
// trend line below //
<path d="M36.542,36.7l2.722,-1.62l2.722,2.065l2.722,0.986l2.721,1.113l2.722,-0.26l2.722,-0.1l2.722,-0.91l2.722,2.325l2.722,-1.166l2.721,-0.968l2.722,0.342l2.722,-0.561l2.722,-1.525l2.722,0.17l2.722,0.715l2.721,1.33l2.722,-1.502l2.722,2.046l2.722,-0.308l2.722,0.873l2.722,-1.226l2.721,-0.522l2.722,-0.248l2.722,1.104l2.722,1.897l2.722,-0.42l2.722,3.241l2.721,-1.596l2.722,-0.832l2.722,4.205l2.722,2.135l2.722,1.405l2.722,2.574l2.721,7.795l2.722,1.053l2.722,-2.437l2.722,7.716l2.722,1.389l2.722,0.461l2.721,0.062l2.722,2.397l2.722,-0.347" style="fill:none;stroke:#4285f4;stroke-width:1.14px;"/>
</svg>
</code></pre></div></div>
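<p>A minimal sketch of that filtering step with svgpathtools is shown below (the file name is a placeholder; the stroke colours are those visible in the SVG above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch: separate the horizontal reference lines from the trend line.
from svgpathtools import svg2paths2

paths, attributes, _ = svg2paths2("graph.svg")  # placeholder file name

horizontal_lines, trend_line = [], None
for path, attr in zip(paths, attributes):
    style = attr.get("style", "")
    if "stroke:#dadce0" in style:    # grey horizontal (grid/baseline) lines
        horizontal_lines.append(path)
    elif "stroke:#4285f4" in style:  # blue trend line
        trend_line = path
</code></pre></div></div>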
<h4 id="changing-vector-coordinates-to-data-values">Changing vector coordinates to data values</h4>
<p>Once the relevant elements were filtered from the SVG file(s), the team calculated a scaling metric. This scaling metric would be used to convert distances in the vector coordinate system to mobility % values. To do this, the distance between the +/- 80% lines and the baseline were used, and the average distance stored. We then calculated the distance between the trend line and the baseline in the <em>y</em> dimension for each vertex of the trend line. Each vertex in this case reflects a data point. Once these values had been found, the scaling value could then be used to calculate the mobility % value.</p>
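<p>A minimal sketch of that conversion is shown below. It assumes we already have the y coordinates of the +80%, baseline and -80% lines and of the trend-line vertices (remembering that SVG y values increase downwards); the numeric example uses the line coordinates from the SVG above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch of converting SVG y coordinates to mobility % values.
def to_mobility_percent(trend_y_values, top_y, baseline_y, bottom_y):
    """ top_y is the +80% line, bottom_y the -80% line; the scaling metric is the
        average vertical distance between those lines and the baseline """
    scale = 80.0 / (((baseline_y - top_y) + (bottom_y - baseline_y)) / 2.0)
    # SVG y grows downwards, so points above the baseline are positive mobility
    return [(baseline_y - y) * scale for y in trend_y_values]

# Horizontal line y coordinates from the SVG above: 10.377 (+80%), 38.956 (baseline), 67.535 (-80%)
print(to_mobility_percent([36.7, 53.2], 10.377, 38.956, 67.535))  # approx [+6.3, -39.9]
</code></pre></div></div>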
<h4 id="getting-data-for-an-entire-document">Getting data for an entire document</h4>
<p>After we found a solution for a single graph, we needed to scale the process to calculate mobility % values for all graphs in a document.</p>
<p>We explored command line and python tools including <a href="https://www.mankier.com/1/pdftocairo">pdftocairo</a> and <a href="https://pymupdf.readthedocs.io/en/latest/">PyMuPDF</a>, however, the SVGs created with these methods resulted in a number of issues. One issue is that they did not flatten the SVG image. This means that the vector coordinate system in the SVG <code class="language-plaintext highlighter-rouge">path</code> did not represent its true location compared to other graphs and elements of graphs. We would therefore need to use the transform matrix stored within the <code class="language-plaintext highlighter-rouge">attributes</code> of the SVG element to find the distances between the horizontal lines and the trend line. As data was required as soon as possible, we therefore decided against these approaches. Furthermore, for points on the graphs, these approaches did not close the <code class="language-plaintext highlighter-rouge">path</code>, which caused errors loading the SVG using svgpathtools.</p>
<p>Due to the time constraints of this work the team converted all of the PDF documents to SVG documents using Affinity Designer. Affinity Designer is a low-fee vector graphics editor developed for macOS, iOS, and Microsoft Windows. It provides an option to flatten the SVG graphs, as well as automatically closing paths (needed to maintain single data points in graphs).</p>
<p>The <a href="https://console.cloud.google.com/storage/browser/mobility-reports/SVG">SVGs</a> and <a href="https://console.cloud.google.com/storage/browser/mobility-reports/PDF">PDFs</a> were then uploaded to Google Cloud Platform (GCP) so we could add them to the download function in the pipeline. For the SVG documents, graphs are arranged as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ------------ ------------
| Page 1 & 2 | | Page 3 + |
| 1 | | 1 2 3 |
| 2 | | 4 5 6 |
| 3 | | -------- |
| | | 7 8 9 |
| | | 10 11 12 |
----------- ------------
</code></pre></div></div>
<p>Thus, the first six graphs in a document were from pages one and two, and related to the national-level graphs for the six categories. The subsequent pages comprised two sets of six graphs, each set relating to a region.</p>
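<p>A minimal sketch of how that ordering can be turned into labels is shown below; it assumes the six categories always appear in the report order listed earlier and that a list of region names is available in the same order as the report pages.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch: attach (region, category) labels to the flat list of extracted graphs.
CATEGORIES = ["Retail & recreation", "Grocery & pharmacy", "Parks",
              "Transit stations", "Workplaces", "Residential"]

def label_graphs(graph_series, region_names):
    """ graph_series[0:6] are the national-level graphs; each following chunk of six
        belongs to the next region in region_names """
    labelled = {}
    for i, series in enumerate(graph_series):
        region = "NATIONAL" if i < 6 else region_names[(i - 6) // 6]
        labelled[(region, CATEGORIES[i % 6])] = series
    return labelled
</code></pre></div></div>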
<h4 id="quality-assurance">Quality Assurance</h4>
<p>In addition to developing a reusable pipeline to extract the data from the reports, the team undertook comprehensive quality assurance on all the data uploaded to the Campus’ <a href="https://github.com/datasciencecampus/google-mobility-reports-data">Google Mobility Reports Data GitHub repository</a>. Whilst the team were working to include quality assurance steps within their code (see below), they used R code written by Cabinet Office colleague, <a href="https://github.com/mattkerlogue/google-covid-mobility-scrape">Matt Kerlogue</a>. This code enabled us to extract the position number of the graph within each report, in addition to the place name and category. This code also returned the headline value for each graph, which allowed us to sense-check the data being extracted from the graphs. Using this the team could flag any inconsistencies where the SVGs had been extracted in a different order, and gaps within both graphs and the extracted text due to missing data.</p>
<h4 id="adding-quality-assurance-to-our-code">Adding Quality Assurance to our code</h4>
<p>Using python, the team then developed a method to read the headline numbers in the text of the report. These numbers relate to the overall increase or decline in mobility at the end of the time series (compared to the baseline). By parsing the PDF, textual information about the region name and category type could also be found. This information was combined with the graph trend line dataset to provide the user with all the textual and numeric information from the reports. It also allowed the team to automatically cross-check the data from the graph trends (a sketch of this kind of check is shown below), and to align the extracted data with the regions it referred to.</p>
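<p>A minimal sketch of that kind of headline check is shown below, assuming PyMuPDF for the text extraction and that each headline figure is followed by the phrase “compared to baseline” (the file name is a placeholder; the full pipeline lives in the mobius repository).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch of pulling headline percentages out of a report's text.
import re
import fitz  # PyMuPDF

def headline_values(pdf_path):
    """ Returns the headline percentage changes (e.g. -85) found in the report text """
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    return [int(value) for value in re.findall(r"([+-]\d+)%\s+compared to baseline", text)]

print(headline_values("2020-03-29_GB_Mobility_Report_en.pdf"))  # placeholder file name
</code></pre></div></div>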
<h2 id="summary">Summary</h2>
<p>The team at the Campus worked at speed to provide analysts from across government with data from Google’s <a href="https://www.google.com/covid19/mobility/">Community Mobility PDF Reports</a> rapidly. The team also created a public python tool so that data from any Community Mobility Report can be gathered, helping analysts from all over the world understand changes in mobility over time.</p>
<p>Since its release, our code and data has been used by academics, government workers and members of the public in countries around the world, such as Poland and the US. The Campus’ <a href="https://github.com/datasciencecampus/google-mobility-reports-data">Google Mobility Reports Data GitHub repository</a> contains quality checked data for all G20 countries (except Russia and China, where reports were not available).</p>
<p>We hope that Google releases future updates, as the community mobility profiles are a valuable addition to the data we have on mobility. Please keep an eye on our <a href="https://github.com/datasciencecampus/mobius">code repository</a> for updates.</p>
<p><img src="/images/blog/google-mobility/pipeline.gif" alt="the pipeline as a GIF" /></p>
<p><em>Figure 4: A GIF video of the pipeline in action (at 2x speed)</em></p>DSC Mobility Analysis Teamdata.campus@ons.gov.ukHow a team at the Campus extracted data from the Google Mobility Reports within 40 hours so it could be used by analysts around the world.How to Setup a Google Cloud VM, pull a GitHub repo, install docker, and run a script at regular intervals using cron (in 22 steps)2020-03-20T00:00:00+00:002020-03-20T00:00:00+00:00https://datasciencecampus.github.io/creating-a-gcp-vm-and-run-cron<p><img src="../../images/training/vm/vm-cron.jpg" alt="Data Science Campus GitHub" /></p>
<p>This is a short guide on how to set up a Google Cloud Platform VM, pull a GitHub repository and let cron run a script at regular intervals. There probably are some bad practices in this guide, such as not using a virtual environment or conda environment, and instead using the base python environment. However, the aim of this was to create a single-use VM for a specific task - which we could leave running, chugging away.</p>
<h5 id="step-1-to-setup-vm-on-gcp-first-go-to-compute-engine-then-vm-instances-then-create-instance-name-the-vm-instance-for-this-example-all-default-settings-should-be-fine-except">Step 1: To setup VM on GCP first go to <code class="language-plaintext highlighter-rouge">Compute Engine</code>, then <code class="language-plaintext highlighter-rouge">VM instances</code>, then <code class="language-plaintext highlighter-rouge">Create Instance</code>. Name the VM instance. For this example, all default settings should be fine, except:</h5>
<ul>
<li>set region to <code class="language-plaintext highlighter-rouge">europe-west2 (London)</code></li>
<li>set zone to <code class="language-plaintext highlighter-rouge">europe-west2-b</code></li>
<li>you may want to increase the Boot Disk</li>
<li>allow <code class="language-plaintext highlighter-rouge">HTTP</code> and <code class="language-plaintext highlighter-rouge">HTTPS</code> traffic on the Firewall</li>
</ul>
<h5 id="step-2-click-on-the-down-arrow-next-to-ssh-and-choose-view-cloud-command-copy-the-command-to-a-terminal-window-and-run">Step 2: Click on the down arrow next to SSH and choose <code class="language-plaintext highlighter-rouge">view Cloud command</code>. Copy the command to a terminal window and run.</h5>
<h5 id="step-3-run-gcloud-init-in-your-sshd-terminal-window-and-sign-in-with-a-your-personal-google-gcp-account-option-2-follow-the-instructions-and-set-the-required-project-and-region">Step 3: Run <code class="language-plaintext highlighter-rouge">gcloud init</code> in your SSH’d terminal window and sign in with a your personal Google GCP account (option 2). Follow the instructions and set the required <code class="language-plaintext highlighter-rouge">project</code> and <code class="language-plaintext highlighter-rouge">region</code>.</h5>
<h5 id="step-4-step-4-and-5-were-sourced-from-here-install-a-few-crucial-packages-for-debian">Step 4: Step 4 and 5 were sourced from <a href="https://www.datacamp.com/community/tutorials/google-cloud-data-science">here</a>. Install a few crucial packages for Debian.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo apt-get update
$ sudo apt-get install bzip2 git libxml2-dev
```
</code></pre></div></div>
<h5 id="step-5-install-miniconda">Step 5: Install miniconda</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
$ rm Miniconda3-latest-Linux-x86_64.sh
$ source .bashrc
$ conda install scikit-learn pandas jupyter ipython
```
</code></pre></div></div>
<h5 id="step-6-set-your-global-git-credentials">Step 6: Set your global git credentials</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ git config --global user.name 'User Name'
$ git config --global user.email 'User Email'
```
</code></pre></div></div>
<h5 id="step-7-steps-7-to-9-were-sourced-from-here-generate-a-new-ssh-key-for-github-first-paste-the-text-below-into-your-terminal-window-use-the-default-location-and-no-password">Step 7: Steps 7 to 9 were sourced from <a href="https://help.github.com/en/enterprise/2.17/user/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent">here</a>. Generate a new SSH key for GitHub. First, paste the text below into your terminal window. Use the default location and no password.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ ssh-keygen -t rsa -b 4096 -C 'User Email'
```
</code></pre></div></div>
<h5 id="step-8-start-the-ssh-agent-in-the-background">Step 8: Start the ssh-agent in the background.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ eval "$(ssh-agent -s)"
```
</code></pre></div></div>
<h5 id="step-9-add-your-ssh-private-key-to-the-ssh-agent-if-you-created-your-key-with-a-different-name-or-if-you-are-adding-an-existing-key-that-has-a-different-name-replace-id_rsa-in-the-command-with-the-name-of-your-private-key-file">Step 9: Add your SSH private key to the ssh-agent. If you created your key with a different name, or if you are adding an existing key that has a different name, replace id_rsa in the command with the name of your private key file.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ ssh-add ~/.ssh/id_rsa
```
</code></pre></div></div>
<h5 id="step-10-steps-10-and-11-were-sourced-from-here-paste-your-ssh-key-to-terminal-and-then-copy-it-to-clipboard">Step 10: Steps 10 and 11 were sourced from <a href="https://help.github.com/en/enterprise/2.17/user/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account">here</a>. Paste your SSH key to terminal and then copy it to clipboard</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ cat < ~/.ssh/id_rsa.pub
```
</code></pre></div></div>
<h5 id="step-11-go-to-httpsgithubcomsettingssshnew-and-add-a-title-for-your-key-then-paste-in-the-clipboard-text-into-the-key-box-click-add-key-and-then-enter-your-password">Step 11: Go to <a href="https://github.com/settings/ssh/new">https://github.com/settings/ssh/new</a> and add a title for your key, then paste in the clipboard text into the key box. Click <code class="language-plaintext highlighter-rouge">Add Key</code> and then enter your password.</h5>
<h5 id="step-12-make-a-gitprojects-directory-and-then-cd-to-it">Step 12: Make a GitProjects directory and then cd to it.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ mkdir GitProjects
$ cd GitProjects
```
</code></pre></div></div>
<h5 id="step-13-clone-your-github-repo-using-the-ssh-link">Step 13: Clone your GitHub repo using the SSH link</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ git clone repo
```
</code></pre></div></div>
<h5 id="step-14-install-docker-steps-14-to-xx-were-sourced-from-here-install-the-packages-necessary-to-add-a-new-repository-over-https">Step 14: Install docker (steps 14 to XX were sourced from <a href="https://linuxize.com/post/how-to-install-and-use-docker-on-debian-10/">here</a>). Install the packages necessary to add a new repository over HTTPS.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo apt update
$ sudo apt install apt-transport-https ca-certificates curl software-properties-common gnupg2
```
</code></pre></div></div>
<h4 id="step-15-import-the-repositorys-gpg-key-using-the-following-curl-command">Step 15: Import the repository’s GPG key using the following curl command:</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
```
</code></pre></div></div>
<h4 id="step-16-add-the-stable-docker-apt-repository-to-your-systems-software-repository-list">Step 16: Add the stable Docker APT repository to your system’s software repository list.</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
```
</code></pre></div></div>
<h4 id="step-17-update-the-apt-package-list-and-install-the-latest-version-of-docker-ce-community-edition">Step 17: Update the apt package list and install the latest version of Docker CE (Community Edition).</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo apt update
$ sudo apt install docker-ce
```
</code></pre></div></div>
<h4 id="step-18-once-the-installation-is-completed-the-docker-service-will-start-automatically-to-verify-it-type-in">Step 18: Once the installation is completed the Docker service will start automatically. To verify it type in.</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo systemctl status docker
```
</code></pre></div></div>
<h4 id="step-19-run-a-docker-image">Step 19: Run a docker image.</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo docker run -p port:port <image_name>
```
</code></pre></div></div>
<h5 id="step-20-create-a-cron-job-select-nano-as-the-editor-option-1">Step 20: Create a cron job. Select nano as the editor (option 1)</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ crontab -e
```
</code></pre></div></div>
<h5 id="step-21-paste-into-your-crontab-the-cron-jobs-you-want-to-run-the-below-example-goes-to-the-project-folder-and-runs-examplepy-each-hour-of-every-day-log-outputs-are-saved-to-cronout">Step 21: Paste into your crontab the cron jobs you want to run. The below example goes to the project folder and runs <code class="language-plaintext highlighter-rouge">example.py</code> each hour of every day. Log outputs are saved to <code class="language-plaintext highlighter-rouge">cron.out</code>.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```text
$ 0 * * * * cd ~/GitProjects/project && /home/username/miniconda3/bin/python ~/GitProjects/project/example.py >> ~/GitProjects/project/cron.out 2>&1
```
</code></pre></div></div>
<h5 id="step-22-use-x-to-save-the-crontab">Step 22: Use <code class="language-plaintext highlighter-rouge">^X</code> to save the crontab.</h5>
<p>Your python script should now run at XX:00 every hour. See crontab guru for more information about setting cron times.</p>Michael Hodgemichael.hodge@ons.gov.ukHow to Setup a Google Cloud VM, pull a GitHub repo, install docker, and run a script at regular intervals using cron (in 22 steps)