<h1><a href="https://datasciencecampus.github.io/privacy-via-image-downsampling">Ensuring privacy using downsampling of streamed images</a> (2021-06-30)</h1>
<p>In this blog we describe how we remove personal information (vehicle number plates) from streamed traffic camera images, in accordance with <a href="https://www.ons.gov.uk/aboutus/transparencyandgovernance/datastrategy/dataprinciples">ONS’ ethical and privacy standards</a>. We are only interested in aggregate, non-identifiable time series, so removing identifiable information is the first part of our processing.</p>
<p>We focus on the process of removing legible and disclosive number plates from downstream processing as part of the <a href="https://datasciencecampus.ons.gov.uk/projects/estimating-vehicle-and-pedestrian-activity-from-town-and-city-traffic-cameras/">Traffic Cameras project</a> published by the Data Science Campus in September 2020. The Traffic Cameras project emerged from the need to better understand changing patterns in mobility and behaviour in near real time in response to the coronavirus pandemic (COVID-19). The project uses publicly available traffic camera imagery, deep learning algorithms, and cloud services to produce near real-time counts of pedestrians and vehicles in selected city centres. Although traffic cameras are not designed to capture number plates, in rare circumstances a plate may be legible in camera imagery. To prevent us accidentally capturing identifiable number plates, we employ an image downsampling method to reduce image resolution, and therefore detail, which obfuscates number plates in any processed images.</p>
<p>This blog details how we evaluated the impact downsampling had on the performance of object detection algorithms through the assessment of object count time series, where we wished to ensure the time series was of sufficient quality whilst obfuscating number plates. This concludes the development work, and we have now released the source code on GitHub in the “<a href="https://github.com/datasciencecampus/chrono_lens">chrono_lens</a>” repository.</p>
<h2 id="identifying-legible-and-disclosive-example-number-plates">Identifying legible and disclosive example number plates</h2>
<p>Collecting, storing, and processing traffic camera images of public spaces poses significant challenges. Our focus here is the accidental disclosure of vehicle registration number plates. The images we collect, store, and analyse are captured across a range of camera settings, locations, illumination levels, and weather conditions. Thousands of images are recorded at 10-minute intervals, a proportion of which are high definition. Definitively recognising a number plate in these images is rare, because numerous factors must coincide. Such factors include:</p>
<ol>
<li>Camera resolution.</li>
<li>Number plate angle and distance relative to the camera.</li>
<li>Speed of the vehicle.</li>
<li>Weather conditions (rain, fog, and/or sun glare).</li>
</ol>
<p>While all these factors aligning perfectly is rare, our continuous sampling means we may record dozens of such disclosive images daily. To address this issue, we proposed downsampling, on download, any image that exceeded a given resolution threshold. We had to consider, first, what resolution would suffice to remove all disclosive number plates and, second, what impact downsampling images would have on the performance of object detection algorithms, and thus on object counts.</p>
<p>Downsizing too aggressively would have a needlessly detrimental impact on object counts. Therefore, an optimal resolution threshold had to be determined. Figure 1 illustrates modified examples of disclosive vehicle number plates recorded on public traffic cameras.</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig1-example-plates.jpg" alt="Example images showing disclosive number plates, with actual plate replaced with 'DSC ONS'" /><br />
<em>Figure 1: Three modified examples of disclosive number plates captured by public traffic cameras (where the original number plate has been replaced with “ONS DSC”)</em></p>
<h2 id="measuring-the-distribution-of-image-resolutions">Measuring the distribution of image resolutions</h2>
<p>We assessed the resolutions of downloaded images, aiming to identify how frequently high-resolution images occur. This also allowed us to identify the specific locations supplying higher resolution images. To achieve this, for a 24-hour period we recorded the metadata of images provided by all our traffic camera image providers, including Transport for Greater Manchester (TfGM). We specifically recorded each image’s width, height, location, and camera ID. Figure 2 illustrates the resolutions of images from TfGM, in addition to the number of observations for each resolution.</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig2-TfGM_X_Y_chart.png" alt="Scatter plot showing quantities of images against image width and height" />
<em>Figure 2: TfGM traffic camera image resolutions and number of observations</em></p>
<p>Figure 2 shows that a noticeable number of images have widths approaching 1000 pixels and heights approaching 450 and 550 pixels. Higher resolution images therefore account for a substantial proportion of the total. The number plates shown in Figure 1 are all sourced from similar higher resolution images. The more higher resolution images we record, the greater the probability of capturing legible vehicle number plates.</p>
<h2 id="downsampling-images">Downsampling images</h2>
<p>Applying the same methodology used to create Figure 2, we recorded the resolutions of images for all available locations, noting that the most common image width was 440 pixels. Given that no legible number plates had been identified in images 440 pixels wide, we chose 440 pixels as the width threshold above which an image is deemed ‘too high resolution’.</p>
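<p>As a rough sketch of how the threshold and the list of higher resolution cameras can be derived from the recorded metadata (the file and column names here are illustrative, not the pipeline’s actual names):</p>
<pre><code class="language-python">import pandas as pd

# metadata recorded for each downloaded image: camera_id, width, height, location
metadata = pd.read_csv("image_metadata.csv")

# the most common image width becomes the downsampling threshold (440 px in our data)
threshold = metadata["width"].mode()[0]

# cameras that ever supplied an image wider than the threshold ('guilty' cameras)
guilty_cameras = sorted(metadata.loc[metadata["width"] > threshold, "camera_id"].unique())
</code></pre>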
<p>We decided to choose a 24-hour period for each location to run our downsampling code on. These sample 24-hour periods would later be the subject of our before-and-after downsampling object count time series analysis. However, choosing a random 24-hour period would not suffice. Traffic camera ‘uptime’ varies significantly by location, and even week to week for any specific location. Luckily, the team had previously developed scripts that recorded the ‘uptime’ of cameras for a given location to accompany our weekly faster indicators release. We could use the output charts to identify healthy weeks and, subsequently, 24-hour periods to run our analysis on.</p>
<p>Figure 3 illustrates what percentage of images for a given location were recorded as faulty, missing, or present. We see that Northern Ireland and TfGM recorded greater than 90% of images as present. Conversely, Durham shows greater than 50% of images as faulty, with approximately 20% missing and 25% present. The week ending 27 December 2020 (2020-12-27) is likely a viable candidate from which to choose a 24-hour period for Northern Ireland, TfGM, and Transport for London (TfL).</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig3-status_report_20201228.png" alt="Stacked bar chart showing present, missing and faulty ratios of cameras per region" />
<em>Figure 3: Cameras status report per location for week ending 2020-12-27</em></p>
<p>Next, we had to assess 24-hour periods within a given week to find the healthiest 24-hour period. Figure 4 shows one week of data identifying periods where individual cameras were either present, missing, or faulty. For TfGM, we see the 22nd of December 00:00 – 23:50 was a healthy 24-hour period. This process of selecting healthy 24-hour periods was repeated for all locations to determine the best possible 24-hour period to conduct our analysis on. Table 1 shows the chosen 24-hour periods per location.</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig4-TfGM-camera-status.png" alt="Horizontal stacked bar chart showing faulty/missing/present status of a selection of TfGM cameras across the week ending 2020-12-27" />
<em>Figure 4: Camera status per camera for TfGM week ending 2020-12-27</em></p>
<table>
<thead>
<tr>
<th>Location</th>
<th>Candidate week ending</th>
<th>Chosen day</th>
</tr>
</thead>
<tbody>
<tr>
<td>TfGM</td>
<td>2020-12-27</td>
<td>2020-12-22</td>
</tr>
<tr>
<td>Northern Ireland</td>
<td>2020-12-27</td>
<td>2020-12-24</td>
</tr>
<tr>
<td>North East</td>
<td>2021-01-10</td>
<td>2021-01-05</td>
</tr>
<tr>
<td>Durham</td>
<td>2021-01-17</td>
<td>2021-01-05</td>
</tr>
<tr>
<td>Reading</td>
<td>2020-12-27</td>
      <td>2020-12-23</td>
</tr>
<tr>
<td>TfL</td>
<td>2020-12-27</td>
<td>2020-12-22</td>
</tr>
<tr>
<td>Southend</td>
<td>2020-12-27</td>
<td>2020-12-22</td>
</tr>
</tbody>
</table>
<p><em>Table 1: Identified 24-hour periods of good camera uptime for each location</em></p>
<p>Lastly, we ran a script that duplicated each chosen 24-hour period, downsampling any higher resolution images (width exceeding 440 pixels) to a width of 440 pixels. Figure 5 shows number plates before and after downsampling; we used the OpenCV ‘resize’ function with the interpolation method set to ‘INTER_AREA’.</p>
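<p>As an illustration, a minimal sketch of this downsampling step is shown below. The function name and structure are ours, for illustration only; the released chrono_lens code may organise this differently.</p>
<pre><code class="language-python">import cv2

MAX_WIDTH = 440  # the width threshold identified above

def downsample_if_needed(image):
    """Shrink an image to MAX_WIDTH pixels wide, preserving its aspect ratio."""
    height, width = image.shape[:2]
    if width > MAX_WIDTH:
        new_height = round(height * MAX_WIDTH / width)
        # INTER_AREA resamples by pixel-area averaging, which suits shrinking and
        # removes the fine detail that could leave a number plate legible
        return cv2.resize(image, (MAX_WIDTH, new_height), interpolation=cv2.INTER_AREA)
    return image
</code></pre>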
<p><img src="/images/blog/privacy-via-image-downsampling/fig5a-sample-number-plate.png" alt="Example number plate, replaced with 'ONS DSC' to show obfuscation" /><br />
<em>a. Before (1920x1080) and after (440x274).</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig5b-sample-number-plate.png" alt="Example number plate, replaced with 'ONS DSC' to show obfuscation" /><br />
<em>b. Before (1280x1024) and after (440x352).</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig5c-sample-number-plate.png" alt="Example number plate, replaced with 'ONS DSC' to show obfuscation" /><br />
<em>c. Before (1280x1024) and after (440x352).</em></p>
<p><em>Figure 5: Comparison of number plates before and after downsampling</em></p>
<h2 id="time-series-comparison-before-and-after-downsampling">Time series comparison before and after downsampling</h2>
<p>Using the metadata described above in the section ‘Measuring the distribution of image resolutions’, we gathered lists of ‘guilty’ cameras known to provide higher resolution images. These lists allowed us to analyse object counts only for images provided by higher resolution cameras. This served two purposes: (1) it reduced processing time, since only a portion of images needed to be analysed rather than the whole collection, and (2) it allowed fair before-and-after object count comparisons between locations, as lower resolution cameras, which make up the majority of images, would have biased the result if included. Figure 6 illustrates a 10-period moving average of summed object counts from ‘guilty’ cameras for each 10-minute snapshot, before and after downsampling, per location over the chosen 24-hour periods. Reading and TfL are excluded from Figure 6 as no cameras from these locations provided images that exceeded our high-resolution threshold.</p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6a-Durham_before_and_after_counts_plot.png" alt="Summed object counts for Durham before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6a: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for Durham with an applied 10 period moving average</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6b-Southend_before_and_after_counts_plot.png" alt="Summed object counts for Southend before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6b: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for Southend with an applied 10 period moving average</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6c-TfGM_before_and_after_counts_plot.png" alt="Summed object counts for TfGM before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6c: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for TfGM with an applied 10 period moving average</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6d-Northern-Ireland_before_and_after_counts_plot.png" alt="Summed object counts for Northern Ireland before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6d: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for Northern Ireland with an applied 10 period moving average</em></p>
<p><img src="/images/blog/privacy-via-image-downsampling/fig6e-North-East-Travel-Data_before_and_after_counts_plot.png" alt="Summed object counts for the North East before and after downsampling, plotted over 10 minute intervals with 10 period moving average" /><br />
<em>Figure 6e: Summed object counts from ‘guilty’ cameras for each 10-minute snapshot over a 24-hour period before and after downsampling for the North East with an applied 10 period moving average</em></p>
<p>When comparing object counts between “native” image resolution and downsampled resolution, we noted two main points. Firstly, downsampling higher resolution images does impact object counts for all categories and locations.
Secondly, before and after counts track closely together, showing the same trends, peaks, and troughs. Before and after counts for person & cyclist and bus deviate the most across all locations. Generally, object counts are negatively impacted by downsampling, as the blue lines trend slightly lower, with exceptions such as Southend-van, Southend-car, and Northern Ireland-van. Given that Figure 6 only shows ‘guilty’ cameras, the impact on the overall counts for each location will be minimal.
We investigate this further with correlation analysis, as presented in Table 2.</p>
<table>
<thead>
<tr>
<th>Pearson Correlation</th>
<th>Before: Car</th>
<th>Before: Van</th>
<th>Before: Bus</th>
<th>Before: Truck</th>
<th>Before: Person/Cyclist</th>
</tr>
</thead>
<tbody>
<tr>
<td>After: Car</td>
<td><em>0.96</em></td>
<td>0.34</td>
<td>0.06</td>
<td>0.14</td>
<td>0.05</td>
</tr>
<tr>
<td>After: Van</td>
<td>0.34</td>
<td><em>0.87</em></td>
<td>0.08</td>
<td>0.24</td>
<td>0.05</td>
</tr>
<tr>
<td>After: Bus</td>
<td>0.06</td>
<td>0.05</td>
<td><em>0.86</em></td>
<td>0.10</td>
<td>0.12</td>
</tr>
<tr>
<td>After: Truck</td>
<td>0.14</td>
<td>0.17</td>
<td>0.15</td>
<td><em>0.80</em></td>
<td>0.01</td>
</tr>
<tr>
<td>After: Person/Cyclist</td>
<td>0.05</td>
<td>0.04</td>
<td>0.11</td>
<td>0.02</td>
<td><em>0.86</em></td>
</tr>
</tbody>
</table>
<p><em>Table 2: Correlation table for global before and after downsampling objects counts</em></p>
<p>Table 2 shows the global Pearson correlation for all data presented in Figure 6. The high Pearson correlation scores in Table 2 reflect the close relationship between the before and after downsampling object counts in Figure 6.
Globally (across all locations), car counts hold up best with a correlation of 0.96, van second with 0.87, bus and person & cyclist tied third with 0.86, and truck last with 0.80.</p>
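<p>For reference, a correlation table like Table 2 can be produced with a few lines of pandas. This is a sketch under assumed inputs: <code>before</code> and <code>after</code> are data frames indexed by 10-minute timestamp with one column per object type, summed over the ‘guilty’ cameras; the names are illustrative rather than taken from our pipeline.</p>
<pre><code class="language-python">import pandas as pd

def before_after_correlation(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of every 'after' column against every 'before' column."""
    # rows are the 'after' object types, columns are the 'before' object types
    return pd.concat(
        {column: after.corrwith(before[column]) for column in before.columns}, axis=1
    )

# correlations = before_after_correlation(before_counts, after_counts)
# print(correlations.round(2))
</code></pre>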
<h2 id="conclusion">Conclusion</h2>
<p>We have demonstrated a simple downsampling technique that removes legible vehicle registration plates to satisfy disclosure control checks. Furthermore, we have shown that time series before and after downsampling remain closely correlated across all locations and object types. This enables our system to simply download images into memory and immediately downsample them (discarding the higher resolution original), prior to passing the smaller image on for subsequent processing. No other sections of our analysis pipeline were impacted, and potentially disclosive images are never stored.</p>
<p>The full source code of the traffic cameras project is now available on GitHub (“<a href="https://github.com/datasciencecampus/chrono_lens">chrono_lens</a>”), where the system creates a Google Cloud Platform project, scaling out to process cameras as required. The system pulls publicly available imagery and records detected object counts into a BigQuery database, which is then used to generate weekly reports and a related CSV file of aggregate counts for export and further analysis.</p>
<p><em>Ryan Lewis (ryan.lewis@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/using-custom-parameters">Using custom parameters to view multiple metrics on the same page</a> (2021-06-16)</h1>
<p>On every Google Data Studio chart, there is the option to add ‘Optional Metrics’, which adds additional metrics to the settings of any chart. However, this becomes unwieldy on a page full of charts that each need optional metrics, or when you are using community data visualisations that may not have optional metrics built in. If, for example, you wanted a single drop-down box that would filter every chart on a report, including community visualisations, this is possible using calculated fields and parameters in Data Studio. These calculated fields link together to provide user-friendly filtering for a range of metrics at whole-page level.</p>
<p>We’ve used custom parameters a lot within the Data Science Campus, mostly to filter charts and tables so they can be viewed across multiple metrics or downloaded to suit a user’s needs. To demonstrate their use, we shall explore COVID-19 open data from the <a href="https://coronavirus.data.gov.uk/">coronavirus dashboard</a>. To show how it is built, I will set up a custom parameter to filter and choose between all the COVID-19 metrics available, on a basic time series graph and bar chart.</p>
<p><img src="/images/blog/using-custom-parameters/figure1.png" alt="A time series chart and bar chart added using an optional metrics menu." /><br />
<em>Figure 1 Two charts with metrics added via ‘optional metrics’ menu. Filtering is only allowed at one chart level.</em></p>
<p>Firstly, let’s address optional metrics and why we steered away from this built-in function. In both charts above I’ve added all the metrics as optional metrics. Within the Data Science Campus, the goal of many of our Data Studio pages is to behave like a synced report, with all charts connecting via controls. As you can see above, the issue with optional metrics is that filtering only applies to one chart at a time, ignoring other parts of the page, which is not good enough for a page with more than one chart.</p>
<p><img src="/images/blog/using-custom-parameters/figure2.png" alt="A time series chart and bar chart added using a custom metrics menu." /><br />
<em>Figure 2 Two charts with a custom metric selector.</em></p>
<p>In this dashboard page, like many pages used within the Data Science Campus, a custom parameter is used to allow exploration across all charts presented. The custom parameters give the user an efficient tool to allow better understanding of the data visualisations.</p>
<h2 id="how-to-setup-custom-parameters">How to setup custom parameters</h2>
<p>In Data Studio, the first step is to create a parameter with a list of all the metrics you want to choose between. In my example I have called it Custom Metric Selector and added the available metrics from the government dashboard.</p>
<p><img src="/images/blog/using-custom-parameters/figure3.png" alt="Parameter creation in Google Data Studio" /></p>
<p>When this parameter is completed, create a calculated field that links to the parameter by adding the created parameter as the field’s formula.</p>
<p><img src="/images/blog/using-custom-parameters/figure4.png" alt="Creating a calculated field" /></p>
<p>Now that the field and parameter have been created, the final calculated field is the metric that will be added to the chart. It is a CASE statement that links the list in the parameter to the metrics of your dataset.</p>
<p><img src="/images/blog/using-custom-parameters/figure5.png" alt="Using a CASE statement to create a calculated field that will act as a custom parameter" /></p>
<p>Finally, you can add these calculated fields to a page instead of normal metrics to make life easier for any user. Add the final calculated field to the metrics of a chart and add the parameter to a drop-down control filter, and the custom metrics will work together across the whole page, as shown below.</p>
<p><img src="/images/blog/using-custom-parameters/figure6a.png" alt="Adding the calculated field to the chart" />
<img src="/images/blog/using-custom-parameters/figure6b.png" alt="Adding the control filter with custom parameter to the page" /></p>
<p>We have shown in this post how we can use Google Data Studio’s custom parameters to bypass the limitations of the optional metrics function and view multiple metrics alongside community visualisations. To find out more about custom parameters, take a look at Google Data Studio’s <a href="https://support.google.com/datastudio/answer/9002005?hl=en#zippy=%2Cin-this-article">documentation</a>. You can also <a href="https://datastudio.google.com/s/opW6eX16BX4">explore the report</a> described in this post.</p>
<p><em>Gabe Shires (gabriel.shires@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/using-custom-queries">Using custom queries to quickly compare two time series in Google Data Studio</a> (2021-06-02)</h1>
<p>Google Data Studio makes it easy to pull data from BigQuery and display it as a simple visualisation, but sometimes it is not clear exactly what data you need until you are looking at a report. In those situations, a <a href="https://support.google.com/datastudio/answer/6370296?hl=en#zippy=%2Cin-this-article">custom query</a> can be built with parameters that allow for pre-filtering, on-the-fly calculations and even introducing an offset to your time series data so that you can compare two traces at a glance. Custom queries are written in SQL within Data Studio, to select, join, filter and perform other transformations on data across any number of tables in BigQuery.</p>
<p>We make use of custom queries for a lot of our work using Data Studio in the Data Science Campus, in order to compare several time series data sets quickly. To demonstrate the use of custom queries we shall explore COVID-19 open data from the <a href="https://coronavirus.data.gov.uk/">UK Government’s Coronavirus Dashboard</a>. For the demonstration we will display the number of new cases (people with a positive test result) alongside the number of deaths within 28 days of a positive test, for each day in the first two weeks of 2021.</p>
<p><em>Disclaimer: this is a cut of official data and our use here as part of the Office for National Statistics is purely for visualisation purposes and not to comment on the data or trends.</em></p>
<p>Two challenges to overcome when visually comparing time series data are that noise can obscure trends, and that the data can be on vastly different scales (such as COVID-19 cases and deaths). To overcome these challenges, we used Boolean parameters in our custom query. For the user, choosing a value for a Boolean parameter is as simple as checking a box, but behind the scenes this introduces branching logic to the SQL query that allows calculations to be performed on demand – in this case, smoothing (calculating a seven-day moving average) and min-max normalisation, as shown in Figure 1.</p>
<p><img src="/images/blog/using-custom-queries/1a_before_normalisation.PNG" alt="Time series chart of cases and deaths for the period 1 Jan 2021 - 15 Jan 2021, before normalisation is applied" />
<img src="/images/blog/using-custom-queries/1b_after_normalisation.PNG" alt="Time series chart of cases and deaths for the period 1 Jan 2021 - 15 Jan 2021, after normalisation is applied" />
<em>Figure 1: Time series before and after min-max normalisation is applied using the tick box. The data in the above figures is for Hampshire in the first two weeks of 2021. Source: <a href="https://coronavirus.data.gov.uk/">Government Coronavirus Dashboard</a></em></p>
<p>The idea of this dashboard page was to provide a tool to quickly explore two or more time series data sets. To allow this, we introduced an offset parameter to the custom query. The user selects any integer and the time series in question is “shifted” by that many days, as shown in Figure 2. If there is a similarity between the two data sets being observed, this offset capability makes it easy to find visually. This process highlights data sets that should be explored and compared more closely.</p>
<p>We used this process for different data sets, but for the purposes of demonstration you can see it below for the open COVID-19 data.</p>
<p><img src="/images/blog/using-custom-queries/2a_before_offset.PNG" alt="Time series chart of cases and deaths for the period 1 Jan 2021 - 15 Jan 2021, smoothed, without any offset" />
<img src="/images/blog/using-custom-queries/2b_after_offset.PNG" alt="Time series chart of cases and deaths for the period 1 Jan 2021 - 15 Jan 2021, smoothed, with the deaths series offset by four days" />
<em>Figure 2: Introducing an offset allows the user to visually inspect two time series in search of patterns. In the second image, the deaths series has been shifted by four days. The data in the above figures is for Hampshire in the first two weeks of 2021. Source: <a href="https://coronavirus.data.gov.uk/">Government Coronavirus Dashboard</a></em></p>
<h2 id="how-to-use-custom-queries-to-connect-to-bigquery">How to use custom queries to connect to BigQuery</h2>
<p>From Data Studio, in edit mode, select <strong>Resource > Manage Added Data Sources</strong> and <strong>+ Add a data source</strong>. Select <strong>BigQuery</strong> as your source, and on the left-hand side of the screen select <strong>CUSTOM QUERY</strong>. From here, select the project you want to connect to and you will be presented with a custom query input box as well as the <strong>Add Parameter</strong> button.</p>
<p><img src="/images/blog/using-custom-queries/3_custom_query.PNG" alt="The custom query screen within Google Data Studio, in which the user can provide a query to bring in data from BigQuery" />
<em>Figure 3: Creating a custom query within Google Data Studio</em></p>
<p>The parameters you create can be Boolean, string, or numeric, and you can either choose to allow any value as input (as we did for offset) or limit input to pre-defined options.</p>
<p><img src="/images/blog/using-custom-queries/4_parameter.PNG" alt="A parameter in the custom query editor of Google Data Studio, demonstrating how the user can limit the input to specific values (in this case UK nations)" />
<em>Figure 4: Creating a parameter for a custom query. This is a string parameter that is limited to take on only four values, the UK nations.</em></p>
<p>To incorporate your parameter into your query, simply type <code class="language-plaintext highlighter-rouge">@[parameter name]</code>, as shown below:</p>
<pre><code class="language-SQL">SELECT
date,
area_name,
cases
FROM `data.covid_uk_incidence`
-- filter by nation
WHERE area_name = @nation
</code></pre>
<p>To introduce branching logic with Boolean parameters, you would use a CASE statement:</p>
<pre><code class="language-SQL">SELECT
date,
area_name,
CASE @smooth
WHEN false THEN cases
WHEN true THEN
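      -- seven-day moving average: the current row plus the six preceding days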
AVG(cases) OVER (
PARTITION BY (area_name)
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
  END AS cases
FROM `data.covid_uk_incidence`
WHERE area_name = @nation
</code></pre>
<p>And to introduce an offset to time series data, create a parameter of type <strong>Number(whole)</strong> and insert it into a <strong>DATE_ADD</strong> function in your query.</p>
<pre><code class="language-SQL">SELECT
DATE_ADD(date, INTERVAL @offset DAY) AS date,
area_name,
CASE @smooth
WHEN false THEN cases
WHEN true THEN
AVG(cases) OVER (
PARTITION BY (area_name)
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
  END AS cases
FROM `data.covid_uk_incidence`
WHERE area_name = @nation
</code></pre>
<p>There are many other ways to make use of parameters in custom queries. For example, using a parameter to filter data (as shown above) prevents Data Studio from loading an entire dataset only to filter it again afterwards.</p>
<p>To find out more about using custom queries with parameters, take a look at Data Studio’s <a href="https://support.google.com/datastudio/answer/6370296?hl=en#parameters">documentation</a>. You can also explore the <a href="https://datastudio.google.com/s/opW6eX16BX4">report</a> described in this post.</p>
<p><em>Mia Hatton (mia.hatton@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/blending-data-using-choromap">Blending Data Using Choromap</a> (2021-05-19)</h1>
<p>Previously, we — the Data Science Campus at the Office for National Statistics — created blog posts on how to use our Google Data Studio Community Visualisations <a href="https://datasciencecampus.github.io/creating-choropleth-maps-in-google-data-studio/">choromap</a> and <a href="https://datasciencecampus.github.io/easier-better-faster-stronger-choromap-lite-for-Google-Data-Studio/">choromap-lite</a>. Both tools have been very successful, not only for projects we run at the Data Science Campus, but also for others around the world looking for a choropleth map solution to their Data Studio visualisation needs. Since releasing these tools, we have spoken to people around the world about how to use choromap and choromap-lite. We have also incorporated outside collaborators’ improvements via our <a href="https://github.com/datasciencecampus/community-visualizations">GitHub page</a>.</p>
<p>Until now, the choice of whether to use choromap or choromap-lite depended on the user’s needs. choromap-lite is useful for small GeoJSON files and time series data; whereas choromap can handle larger GeoJSON files.</p>
<p>However, in this post, we show how the heavyweight choromap can use <a href="https://support.google.com/datastudio/answer/9061420?hl=en">Google Data Studio’s Blend</a> to:</p>
<ol>
<li>improve run-time and efficiency when displaying time series data, and</li>
<li>allow the user to change the GeoJSON boundaries displayed.</li>
</ol>
<p>That is, the user could view data at a national level, then switch to exploring the data at a more granular level. With choromap you can put the power of data exploration on choropleth maps in the hands of your user.</p>
<h2 id="1-why-not-just-a-bigquery-join">1 Why not just a BigQuery join?</h2>
<p>Suppose we have a time series data set comprising multiple geography levels (we call this the Metric data set). Each geography area has a single data point for each day of the week. The total number of data points is n x m, where n is the number of polygons in all geography levels and m is the number of days. On its own, this data set may not be large. For a time period of around two weeks the data may only be 1 MB or less. However, the Geometry (GeoJSON) data for all these polygons may be considerably larger, at around 10 MB for even the most generalised shapes (we call this the Geometry data set). If we joined those two data sets together so that each data point had a geometry cell (as required in choromap), our resulting table would be over 200 MB in size. This is because of the replication of the geometry information for each time series data point. For a year’s worth of data, this would be several GB of data when joined.</p>
<p><img src="/images/blog/blending-data-using-choromap/stitch-image.jpeg" alt="An image showing a sewing machine" />
<em>Figure 1: An image showing a sewing machine</em></p>
<p>Thus, by joining the two data sets, all we have done is make a larger data table before it has been passed to Data Studio and choromap. A useful thing about Data Studio is that metric aggregations are done before the data is passed to a Community Visualisation like choromap. So, what if we could only load the Metric and Geometry data we want to visualise and then join the filtered data sets?</p>
<p>That’s where blending in Data Studio comes in.</p>
<h2 id="2-data-studio-blend">2 Data Studio Blend</h2>
<p>Before blending, both the Metric and Geometry data sets need to be added to the report. After that, choromap needs to be added from the Community visualisations and components menu. Once these are added, the Metric and Geometry data sets can be blended by clicking (+) Blend Data below the data source on the right-hand side when editing the choromap widget. In the first column, the Metric data set can be selected; the second column can then comprise the Geometry data set. The number of join keys depends on whether the Metric data set comprises multiple geography levels. If it is a single geography level, then the join key can just be the unique area ID, common to both data sets; if the data comprises multiple geography levels (national, county, city etc.) then the join key should be the unique ID and a field specifying the geography level. This is because in some cases different geography level areas may share the same code, e.g. a county which is also a city. Finally, the relevant dimensions and metrics can be included in the blend.</p>
<h2 id="3-covid-19-cases-example">3 COVID-19 Cases Example</h2>
<p>COVID-19 cases data is an ideal example of time series data presented at multiple geography levels. The UK Government make a number of statistics available for the general public at <a href="https://coronavirus.data.gov.uk/">coronavirus.data.gov.uk</a>. Here, we have used their API to download COVID-19 case data (by specimen date) for the time period 1st January 2021 to 15th January 2021. We have also downloaded the data for the UK, each UK nation (England, Scotland, Wales and Northern Ireland) and for each English region (London, East of England, South East, Midlands, North West, South West, and North East and Yorkshire).</p>
<p><em>Disclaimer: this is a cut of official data and our use here as part of the Office for National Statistics is purely for visualisation purposes and not to comment on the data or trends.</em></p>
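<p>For anyone wanting to reproduce the download step, the sketch below is illustrative only: it assumes the dashboard’s v1 <code>data</code> endpoint and the <code>newCasesBySpecimenDate</code> metric name, so check the dashboard’s API documentation for the current endpoints, filters and metric names.</p>
<pre><code class="language-python">import json
import requests

# assumed endpoint and metric names; verify against the dashboard's API documentation
ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"
params = {
    "filters": "areaType=nation;areaName=England",
    "structure": json.dumps({"date": "date", "newCases": "newCasesBySpecimenDate"}),
}

response = requests.get(ENDPOINT, params=params, timeout=30)
response.raise_for_status()
cases = response.json()["data"]  # a list of {"date": ..., "newCases": ...} records
</code></pre>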
<p>The downloaded COVID-19 case data was uploaded to Google BigQuery alongside a GeoJSON CSV file comprising the geometries for each geography level. These geometries were downloaded from <a href="https://geoportal.statistics.gov.uk/">geoportal.statistics.gov.uk</a>, converted to WGS84 reference system using QGIS and then saved as a CSV as shown in <em>Section 1. Data: Fitting a square peg in a round hole</em> of <a href="https://datasciencecampus.github.io/creating-choropleth-maps-in-google-data-studio/">this blog post</a>. Both data sources were added to a Data Studio report and then blended, as shown below.</p>
<p><img src="/images/blog/blending-data-using-choromap/blend.png" alt="An image showing how the two data sources were blended" />
<em>Figure 2: An image showing how the two data sources were blended</em></p>
<p>choromap can then use the geometry information from the Geometry data set and the data from the Metric data set to create the visualisation. You can then filter the data based on a date and/or the geography level you’re interested in. The final result is shown below in a GIF.</p>
<p><img src="/images/blog/blending-data-using-choromap/blend.gif" alt="A GIF showing how the user can choose different dates and geography levels to present the data at on a map " />
<em>Figure 3: A GIF showing how the user can choose different dates and geography levels to present the data at on a map</em></p>
<p>We have shown in this post how we can use Google Data Studio’s Blend functionality with choromap to make engaging and dynamic choropleth maps. A report for this work has been created <a href="https://datastudio.google.com/s/opW6eX16BX4">here</a>.</p>
<p><em>Michael Hodge (michael.hodge@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/easier-better-faster-stronger-choromap-lite-for-Google-Data-Studio">Harder (Easier), better, faster, stronger; choromap-lite for Google Data Studio</a> (2020-11-17)</h1>
<p>In <a href="https://datasciencecampus.github.io/creating-choropleth-maps-in-google-data-studio/">this article</a> I introduced a tool for creating choropleth maps in Google’s Data Studio. I was really chuffed with choromap, but it soon became apparent it wasn’t sustainable for certain data sets. Take time series, geospatial data sets, for example, where you have a metric per day, per region. In the original choromap approach, each row of your data set would have to carry the geometry information needed to plot. This is problematic for two reasons:</p>
<ol>
  <li>Geometry information within the raw data increases your file size. With time series data the number of rows is t * a, where t is the number of time series steps and a is the number of areas. Essentially we only need the geometry information once for each of the a areas, but choromap forces us to repeat it t-fold. This is a waste of computation and storage.</li>
<li>D3 is essentially taking all those layers of the same geometry and stacking them. This again is a waste of computation.</li>
</ol>
<p>Instead of having the geometry information within the data, a better approach is to have the geometry information separate and combine the two. This is what <a href="https://datastudio.google.com/reporting/d0c938d0-9e49-4781-8cab-ec159c233499">choromap-lite does</a> (click for Google Data Studio example report).</p>
<p><img src="/images/blog/choromap/choromap-lite1.gif" alt="choromap-lite-gif" /></p>
<h2 id="1-method">1. Method</h2>
<p>Data Studio doesn’t allow the user to load data externally. This means we cannot host geometry data externally and load it into our D3 code — boo. Instead all the information must be passed directly to D3 using Data Studio. Whereas before we used the <em>DATA</em> dimensions to pass through the geometry information, here instead we rely on ingesting the geometry in the <em>STYLE</em> tab.
As we don’t need to combine the data prior to data ingest, this process is much, much easier. All we need is our structured data file, which comprises at minimum the following:</p>
<ul>
<li>An ID for the area (e.g. 1, 2, 3)</li>
<li>A name for the area (e.g. The Place, The Good Place, The Bad Place)</li>
<li>A metric for the area (e.g. review: 3, 5, 1)</li>
</ul>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id | name | review
1 | The Place | 3
2 | The Good Place | 5
3 | The Bad Place | 1
</code></pre></div></div>
<p>You could also include a timestamp for time series data sets, and additional metrics.</p>
<p>For plotting the geometries we need a GeoJSON file with a CRS (Coordinate Reference System) set to WGS84. In the properties field of the GeoJSON object there should be:</p>
<ul>
<li>An ID for the area</li>
<li>A name for the area</li>
</ul>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"type": "FeatureCollection",
"name": "some_polygons",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "id": "1", "name": "That Place"}, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ . . .
</code></pre></div></div>
<p>As you can see, it’s then a simple case of joining the data and the GeoJSON based on the ID.</p>
<h2 id="2-javascript-join">2. JavaScript Join</h2>
<p>Whilst the rest of our code remains similar to choromap, choromap-lite uses a JavaScript method to combine the data and geoJSON:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>geojson.features.forEach(val => {
let { properties } = val
let newProps = new_data[properties.id]
val.properties = { ...properties, ...newProps }
if (typeof val.properties.met === "undefined") {
val.properties.met = null
}
})
</code></pre></div></div>
<p>In essence, this takes the var named geojson, which is our GeoJSON object, and then for each geometry it appends our DATA metric to the object. Nice and easy*.</p>
<p><em>*There are hidden steps in the code to turn all field names in properties to those required above.</em></p>
<h2 id="3-using-in-data-studio">3. Using in Data Studio</h2>
<p>To use choromap-lite, import it into your Data Studio report: click Community visualisations and components and select choromap-lite, or if it isn’t there:</p>
<ol>
<li>click Explore More</li>
<li>click Build your own visualisation</li>
<li>in the manifest path type gs://choromap-lite/</li>
</ol>
<p>Then, to get this to work in Data Studio all we need to do is paste the GeoJSON text into the STYLE tab field named Input GeoJSON. Below this are two boxes where you enter the ID and name of each geometry within the GeoJSON object. For example for:</p>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input GeoJSON
{
"type": "FeatureCollection",
"name": "some_polygons",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "id": "1", "name": "That Place"}, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ . . .
</code></pre></div></div>
<p>The <em>ID</em> field would be <em>id</em> and the <em>name</em> field would be <em>name</em>.</p>
<p>In the <em>DATA</em> tab the top dimension should be the ID within the data set that links the data to the geometry ID, and the bottom dimension should be the name that links the data to the geometry name.</p>
<p>You can then select up to 3 metrics you want to plot. By default the top metric will plot, but the user can use the dropdown box to choose the metric they want (new feature, not in choromap). In addition, you can <em>zoom</em> and <em>pan</em> choromap-lite!</p>
<p>Below is an example of it in action. Note, after you paste the GeoJSON in STYLE you’ll need to refresh the page.</p>
<p><img src="/images/blog/choromap/choromap-lite2.gif" alt="choromap-lite example" /></p>
<p><img src="/images/blog/choromap/choromap-lite3.png" alt="choromap-lite example" /></p>
<blockquote>
<p>An example of the output from using choromap-lite</p>
</blockquote>
<p>And that’s it…just check out the STYLE tab for aesthetics.</p>
<p>Thanks for reading.</p>
<h2 id="4-known-limitations">4. Known Limitations</h2>
<p>So, choromap-lite is super good at doing a lot, but if you throw it some mega-big GeoJSON (10MB+) it’ll probably crash your dashboard. For large, complex geometries it’s probably best to use the original choromap tool. If you have time series data though I do recommend choromap-lite, it’ll save on BigQuery costs if you use them; just simplify your geometries a little first.</p>
<p>This post has also been posted to <a href="https://medium.com/@michaelstvnhodge/harder-easier-better-faster-stronger-choromap-lite-for-google-data-studio-cf60625fe241">medium</a>.</p>
<p><em>Michael Hodge (michael.hodge@ons.gov.uk)</em></p>
<h1><a href="https://datasciencecampus.github.io/creating-choropleth-maps-in-google-data-studio">Creating choropleth maps in Google Data Studio</a> (2020-07-13)</h1>
<p>Over the past few months I’ve been using <a href="https://datastudio.google.com/u/0/">Google Data Studio</a> in my role at the Data Science Campus, Office for National Statistics. Having used data visualisation tools like Shiny, D3 and Dash to create bespoke dashboards in other projects, I did groan when asked to use Google’s late entry into the dashboard market. Things got off to an amicable start when using the built-in line and bar chart tools; however, there was a noticeable lack of customisable mapping tools in Data Studio. Ok, so it isn’t fair to point the finger just at Google in this market, because as far as I can see only Tableau offers a decent mapping capability. Even Google’s latest acquisition, Looker, <a href="https://docs.looker.com/exploring-data/visualizing-query-results/interactive-map-options">seems to be limited</a>. It’s just that, after using libraries such as leaflet, D3 or deck.gl, I was a little disappointed that Google hadn’t ported their geo-based infrastructure over to their Data Studio product. Credit to Google though, Data Studio is free, whereas others in the market are not.</p>
<p>Currently, at the time of writing, Google Data Studio only has two mapping tools: Geo Map and Google Maps. Geo Map allows for the creation of <a href="https://en.wikipedia.org/wiki/Choropleth_map">choropleth maps</a> at country and sub-country level. For the US, this is US and state level. For the UK, this is UK and devolved nation level (England, Wales, Scotland and Northern Ireland). Choropleths cannot be created for smaller areas than these. The Google Maps tool in Data Studio allows for the plotting of Bubble Maps, whose centroid can be tied to areas as large as a country, or specific latitude and longitudes. Currently, there is no ability to visualise polygons in Google Maps in Data Studio, meaning that choropleths cannot be created.</p>
<p><img src="/images/blog/choromap/choro-post-img1.png" alt="choropleth map" /></p>
<blockquote>
<p>What we really want to be able to do in Google Data Studio (source: https://cartographicperspectives.org/index.php/journal/article/view/cp81-butler/1437)</p>
</blockquote>
<p>Thankfully, Google haven’t shipped Data Studio as a closed product. Their intention seems to be to see how others use it, and they are willing to receive feedback. In line with this, they allow data visualisation enthusiasts to create ‘community visualisations’. The only community visualisation mapping tool currently on offer is <a href="https://datastudio.google.com/u/0/reporting/09fd83d4-4e99-41bb-853a-30e720a716dc/page/cY2z">r42’s hexmap</a>, made by Ralph Spandl. It’s an excellent tool and will definitely be useful for visualising metrics of dense point data. However, no one had yet cracked (or found the need for) custom choropleth mapping in community visualisations.</p>
<p>Here, I present a new community visualisation tool called <strong><a href="https://github.com/datasciencecampus/community-visualizations/tree/master/choromap">choromap</a></strong> that creates choropleth maps based on user-created polygons. <a href="https://datastudio.google.com/reporting/4617cbac-3514-4c8d-a999-a3cb6683e579">Here is an example of choromap in action in a Data Studio report</a>.</p>
<p><img src="/images/blog/choromap/choro-post-img2.png" alt="choromap" /></p>
<blockquote>
<p>Choromap allows the user to choose the boundaries they want to visualise. Here are Counties and Unitary Authorities for England, UK.</p>
</blockquote>
<h3 id="1-data-fitting-a-square-peg-in-a-round-hole">1. Data: Fitting a square peg in a round hole</h3>
<p>The first challenge was to get Google Data Studio to recognise geographical data. To start, Google Data Studio does not recognise BigQuery GEOGRAPHY data. A side note here: BigQuery GEOGRAPHY data from public data sets does not work well in choromap even after converting it to GeoJSON; see the Known Limitations section of this post. All geographical data must be loaded into Data Studio as a STRING. Here, we need to create a STRING GeoJSON type data set and save it to a CSV file, which can then be read into Data Studio through one of its many ingestion methods (Cloud Storage, Google Sheets, BigQuery, file upload, to name a few).</p>
<p>For the complete guide on how to create the GeoJSON type CSV file, please refer to <a href="https://medium.com/google-cloud/how-to-load-geographic-data-like-zipcode-boundaries-into-bigquery-25e4be4391c8">Lak Lakshmanan’s excellent Medium post on ‘How to load geographic data like shapefiles into BigQuery’</a>. Below is a simplified version of the method, which requires <code class="language-plaintext highlighter-rouge">gdal-bin</code>:</p>
<ol>
<li>download a shapefile (.shp) of the geometries you wish to plot. For example <a href="http://geoportal.statistics.gov.uk/datasets/counties-december-2017-super-generalised-clipped-boundaries-in-england">Counties (December 2017) Super Generalised Clipped Boundaries in England by ONS</a></li>
  <li>Convert the CRS (Coordinate Reference System) to WGS84, if it is not already.</li>
<li><em>unzip</em> the folder.</li>
<li><em>cd</em> to the folder.</li>
<li>run <code class="language-plaintext highlighter-rouge">ogr2ogr -f csv -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from <filename>" <filename>.csv <filename>.shp</code> where <code class="language-plaintext highlighter-rouge"><filename></code> is the filename of the shapefile.</li>
</ol>
<p>A pre-created, sample data set for England’s Counties and Unitary Authorities can be found <a href="https://storage.cloud.google.com/choromap/choromap.csv?organizationId=425126312691&supportedpurview=project">here (6.97 MB)</a>. The schema of this sample data is as follows:</p>
<table>
<thead>
<tr>
<th>Column Name</th>
<th>Description</th>
<th>type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row</td>
<td>The row number</td>
<td>STRING</td>
</tr>
<tr>
<td>ctyua20cd</td>
<td>The area code</td>
<td>STRING</td>
</tr>
<tr>
<td>ctyua20nm</td>
<td>The area name</td>
<td>STRING</td>
</tr>
<tr>
<td>geometry</td>
<td>The geometry</td>
<td>STRING</td>
</tr>
<tr>
<td>st_areashape</td>
<td>The shape area</td>
<td>float</td>
</tr>
<tr>
<td>st_lengthshape</td>
<td>The shape length</td>
<td>float</td>
</tr>
</tbody>
</table>
<p>The important column is <code class="language-plaintext highlighter-rouge">geometry</code>. Below is an extract of the geometry for <code class="language-plaintext highlighter-rouge">The City of London</code> in the UK:</p>
<pre><code class="language-txt">{"type":"MultiPolygon","coordinates":[[[[-0.096786300655468,51.52332130413849],[-0.096469831048681,51.52282154015623],[-0.095088768971737,51.52313723284216],[-0.094344739975396,51.5214831278268],[-0.092516830405343,51.52148578000455],[-0.092374579177695,51.52102758467574],[-0.08969358569076,51.52071506110647],[-0.090004405037251,51.51997008801451],[-0.086227541891935,51.51880878001998],[-0.085217909720332,51.52033453455192],[-0.083325592193781,51.51981439149066],[-0.081762362787683,51.52075732827536],[-0.081050452990135,51.52195339914494],[-0.078471489442032,51.52151013413389],[-0.079429829306423,51.51884510407231],[-0.078082679099946,51.51896786642452],[-0.078146943602546,51.51846889558505],[-0.076876952796642,51.51665852905],[-0.073969190592163,51.51445357376097],[-0.073063262728713,51.5118083055404],[-0.072781089875862,51.51029829096138],[-0.074550758806615,51.50995867750763],[-0.075584143211524,51.5097499891425],[-0.076285605682929,51.5105438078804],[-0.077789551314662,51.51011438065631],[-0.078882667624717,51.50941192157917],[-0.079099326627114,51.50905757810306],[-0.078721305098214,51.50882744549881],[-0.079395354287654,51.50781128259451],[-0.080360355480934,51.50808169862811],[-0.085479453429595,51.50860342872304],[-0.087115487530159,51.50898448206557],[-0.088668830381722,51.50896992503621],[-0.091976761860799,51.50942135075249],[-0.095234497930682,51.51017176588185],[-0.095201094749157,51.51061514125803],[-0.096162542706015,51.51026430527382],[-0.099899320495398,51.51082545323712],[-0.108470620393255,51.51087126554509],[-0.11158056489403,51.51083164484565],[-0.111567244164817,51.51173049399255],[-0.112414895757562,51.51276926532587],[-0.111738730695415,51.51319547361804],[-0.111980747153716,51.51368491737404],[-0.111101534210641,51.51382547859707],[-0.111606871170285,51.51533799647236],[-0.113821109173385,51.51825760445579],[-0.107826700590834,51.51776531637376],[-0.105349963286698,51.51854099504494],[-0.101820882919502,51.5196655764016],[-0.100301084969584,51.52012831764441],[-0.097670260922139,51.5207223492556],[-0.097624204154032,51.52103184600052],[-0.097403288303114,51.5215930126154],[-0.097972548740085,51.52287738232243],[-0.096786300655468,51.52332130413849]]],[[[-0.10423511672847,51.5086262019039],[-0.104688136951949,51.50840920893765],[-0.104701932690223,51.50863143466984],[-0.10423511672847,51.5086262019039]]]]}
</code></pre>
<p>As you can see, it’s just a JSON object masked as a STRING. To combine this file with a data set comprising useful information, you can then either JOIN locally and upload, or upload as is and use Data Studio’s BLEND tool.</p>
<h3 id="2-under-the-hood-of-choromap-d3js">2. Under the hood of choromap: D3.js</h3>
<p>Choromap was created using a medley of D3 guides. I took inspiration and code from the <a href="https://www.d3-graph-gallery.com/graph/choropleth_basic.html">most basic choropleth map in d3.js</a>, the <a href="https://www.d3-graph-gallery.com/graph/choropleth_hover_effect.html">choropleth map with hover effect in d3.js</a> and <a href="https://observablehq.com/@d3/choropleth">Observable’s choropleth guide</a>, for example, although our approach doesn’t use the asynchronous method presented in those guides. Of course, I also spent hours on StackOverflow.</p>
<p>I also largely followed the <a href="https://codelabs.developers.google.com/codelabs/community-visualization/#0">Create Custom Javascript Visualizations in Data Studio Codelab guide</a>, which provides a good basic example of creating a bar chart.</p>
<p>Choromap in essence uses a lot of pre-existing D3 libraries, such as the tool-tip and legend libraries. It relies on the user-defined information specified in Data Studio to create a GeoJSON type object, which it can then plot.</p>
<h3 id="3-using-choromap-in-data-studio">3. Using choromap in Data Studio</h3>
<p>To use choromap, first import it into your Data Studio report: click Community visualisations and components and select <strong>choromap</strong>, or if it isn’t there:</p>
<ul>
<li>click Explore More</li>
<li>click Build your own visualisation</li>
<li>in the manifest path type <code class="language-plaintext highlighter-rouge">gs://choromap/</code></li>
</ul>
<p>Then add the chart to your report. In the <code class="language-plaintext highlighter-rouge">DATA</code> tab, for the <code class="language-plaintext highlighter-rouge">Dimensions</code> make sure the first is the name or code of the area and the second is the STRING geometry column. You can then select a <code class="language-plaintext highlighter-rouge">Metric</code> you want the colour map to plot against. This should be all you need to do (oh, and make sure Community visualisations access is ‘on’ for the data source).</p>
<p>In the <code class="language-plaintext highlighter-rouge">STYLE</code> tab you can then customise:</p>
<ul>
<li>the minimum, middle and maximum value plot colours (and null values)</li>
<li>the boundary line colour and width</li>
<li>the text colour</li>
<li>a number of legend options including orientation, position and title</li>
<li>mapping to linear, quantile or custom colours breaks</li>
</ul>
<p>To see choromap in action, <a href="https://datastudio.google.com/reporting/4617cbac-3514-4c8d-a999-a3cb6683e579">this is an example of the Counties and Unitary Authorities data set</a>.</p>
<h3 id="4-example-usage">4. Example usage</h3>
<p>We can calculate the roundness (circularity) of an object by using the following formula:</p>
<p>Roundness = 4 * π * Area / Perimeter ^ 2</p>
<p>When we apply this to England’s Counties and Unitary Authorities, we find that Enfield in Greater London is England’s most round (circular) authority, with a value of 0.61. The least round authority is Oxfordshire, with a value of 0.19. The infographic below was created using Data Studio and choromap. The legend shows a linear colourmap between grey and hot pink. The two polygons at the bottom were created using Data Studio’s filter option, filtering on Enfield and Oxfordshire.</p>
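<p>For anyone wanting to reproduce the calculation, it only needs the area and perimeter columns from the sample data set described in Section 1. A minimal sketch, assuming that CSV and its column names:</p>
<pre><code class="language-python">import math
import pandas as pd

def roundness(area, perimeter):
    """Roundness (circularity): 1.0 for a perfect circle, lower for elongated shapes."""
    return 4 * math.pi * area / perimeter ** 2

df = pd.read_csv("choromap.csv")  # the sample Counties and Unitary Authorities file
df["roundness"] = roundness(df["st_areashape"], df["st_lengthshape"])
print(df.sort_values("roundness", ascending=False)[["ctyua20nm", "roundness"]].head())
</code></pre>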
<p><img src="/images/blog/choromap/choro-post-img3.png" alt="choromap" /></p>
<blockquote>
<p>England’s authorities based on their ‘roundness’.</p>
</blockquote>
<h3 id="known-limitations">5. Known Limitations</h3>
<p>As with all things in life, things aren’t as simple as they appear. Below are the current known limitations of using and plotting data with choromap.</p>
<ul>
<li>Because of something called a <a href="https://github.com/d3/d3-geo#d3-geo">winding order convention</a>, raw GeoJSON files will not work. The file must be a pseudo-GeoJSON CSV file created using the method above.</li>
<li>Unfortunately, BigQuery public data sets also use the same convention as GeoJSON and therefore cannot be converted easily to use with this tool.</li>
<li>Both of the above limitations can largely be resolved by <em>simplifying</em> the geometries using <code class="language-plaintext highlighter-rouge">ST_SIMPLIFY</code> (see the sketch after this list), but this ultimately makes the geometries more <em>simple</em>. For plotting this isn’t too much of an issue, but complex geometries are important for computational geometry.</li>
</ul>
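<p>As a hedged sketch of that simplification step (the project, dataset, table and column names below are placeholders rather than the ones used here), geometries can be simplified and exported as text via the BigQuery python client:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch only: simplify boundaries in BigQuery before exporting for plotting.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT area_code,
           ST_ASGEOJSON(ST_SIMPLIFY(geom, 100)) AS simplified_geometry  -- tolerance in metres
    FROM `my-project.my_dataset.boundaries`
"""
for row in client.query(sql).result():
    print(row.area_code, row.simplified_geometry[:60], "...")
</code></pre></div></div>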
<h3 id="summary">Summary</h3>
<p>This post shows how I created a community visualisation tool called <a href="https://github.com/datasciencecampus/community-visualizations/tree/master/choromap">choromap</a> for Google’s Data Studio, which allows users to create choropleth maps for bespoke boundaries.</p>
<p>This post has also been posted to <a href="https://link.medium.com/a4QqECWd57">medium</a>.</p>Michael Hodgemichael.hodge@ons.gov.ukUsing D3 to create a community visualisation in Google’s Data Studio to plot choropleth maps.A list of the Campus’ Open Source tools2020-05-13T00:00:00+00:002020-05-13T00:00:00+00:00https://datasciencecampus.github.io/tools<ul>
<li><a href="#graphite">graphite</a></li>
<li><a href="#green-spaces">green-spaces</a></li>
<li><a href="#mobius">mobius</a></li>
<li><a href="#optimus">optimus</a></li>
<li><a href="#proper">propeR</a></li>
<li><a href="#pygrams">pyGrams</a></li>
<li><a href="#readpyne">readpyne</a></li>
<li><a href="#woffle">woffle</a></li>
</ul>
<hr />
<h2 id="graphite">graphite</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/graphite">link ⧉</a></p>
<p>Building an OpenTripPlanner server locally using a Java Virtual Machine (JVM) or Docker.</p>
<hr />
<h2 id="green-spaces">green-spaces</h2>
<p>GitHub pages: <a href="https://datasciencecampus.github.io/green-spaces/">link ⧉</a></p>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/green-spaces">link ⧉</a></p>
<p>The Green Spaces project comprises a python tool that can render GeoJSON polygons over aerial imagery and analyse pixels contained within the polygons.</p>
<p><img src="/images/tools/green-spaces.png" alt="green-spaces" /></p>
<hr />
<h2 id="mobius">mobius</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/mobius">link ⧉</a></p>
<p>A python tool for extracting data from the Google Mobility Reports published to show how mobility has changed during the COVID-19 pandemic.</p>
<p><img src="/images/tools/mobius.png" alt="optimus" /></p>
<hr />
<h2 id="optimus">optimus</h2>
<p>GitHub pages: <a href="https://datasciencecampus.github.io/optimus/">link ⧉</a></p>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/optimus">link ⧉</a></p>
<p>A text processing pipeline for turning unstructured text data into hierarchical datasets.</p>
<p><img src="/images/tools/optimus.png" alt="optimus" /></p>
<hr />
<h2 id="proper">propeR</h2>
<p>GitHub pages: <a href="https://datasciencecampus.github.io/proper/">link ⧉</a></p>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/proper">link ⧉</a></p>
<p>This R package was created to analyse multimodal transport for a number of research projects at the Data Science Campus, Office for National Statistics.</p>
<p><img src="/images/tools/proper.png" alt="proper" /></p>
<hr />
<h2 id="pygrams">pyGrams</h2>
<p>GitHub pages: <a href="https://datasciencecampus.github.io/pygrams/">link ⧉</a></p>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/pygrams">link ⧉</a></p>
<p>This python-based app is designed to extract popular or emergent n-grams/terms (words or short phrases) from free text within a large (>1,000) corpus of documents. Example corpora of granted patent document abstracts are included for testing purposes.</p>
<p><img src="/images/tools/pygrams.png" alt="pygrams" /></p>
<hr />
<h2 id="readpyne">readpyne</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/readpyne">link ⧉</a></p>
<p>Toolkit for extracting relevant lines from receipts or similar image data.</p>
<p><br /><br /></p>
<hr />
<h2 id="tron">tron</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/tron">link ⧉</a></p>
<p>Detecting collective human behaviour in internet bandwidth consumption data.</p>
<hr />
<h2 id="woffle">woffle</h2>
<p>GitHub repo: <a href="https://github.com/datasciencecampus/woffle">link ⧉</a></p>
<p>woffle is a project template which aims to allow you to compose various NLP tasks via a common interface, using each of the most popular currently available tools. This includes spaCy, fastText and flair.</p>
<p><br /><br /></p>Michael Hodgemichael.hodge@ons.gov.ukA list of the Campus' Open Source toolsMx Tea: a tea break session planning bot using Slack and Google Cloud Platform2020-04-21T00:00:00+00:002020-04-21T00:00:00+00:00https://datasciencecampus.github.io/creating-tea-breaks-on-slack<h3 id="updates">Updates</h3>
<p>12/05/2020: Thanks to S.Tietz for his recommendations on grouping and for spotting mistakes in this post.
25/02/2021: Mr Tea is now Mx Tea. channels.history is deprecated and has been replaced by conversations.history.</p>
<p>Since the COVID-19 lockdown came into force, we’ve been working on keeping the Data Science Campus team community going, looking after our wellbeing and making sure we’re all generally ok. And some of that involves novel and fun ways of making working-from-home feel less like working-on-your-own.</p>
<p>As well as the usual slack channels filled with pictures of pets, the outside world, and children’s first steps, we’ve been reinventing <strong>tea breaks</strong>. Here, we explain how this has developed into an automated system that could be used elsewhere. <a href="https://github.com/datasciencecampus/mx-tea">Our code repository for this can be found here ⧉</a>.</p>
<h2 id="the-tea-break">The tea break</h2>
<p>Each Tuesday and Thursday at 14:00, members of the Campus dial into a meeting to have a chat with colleagues whilst sipping a cup of tea. The novelty of these tea break sessions is that the meeting comprises four-to-five randomly assigned colleagues. The benefits of this are two-fold. Firstly, it keeps the group size small enough for all attendees to engage in the conversation. Secondly, it rotates the colleagues one chats to. So far, we have seen a really positive uptake of these sessions.</p>
<p><img src="/images/blog/mxtea/tea.jpg" alt="pexels image of a laptop and tea mug on a table" /></p>
<blockquote>
<p>Source: <a href="https://www.pexels.com/photo/silver-laptop-computer-next-to-ceramic-cup-42408/">pexels</a></p>
</blockquote>
<h2 id="approach">Approach</h2>
<p>The initial approach was to have a dedicated Slack channel. In the morning, users would say whether they were available for the afternoon’s tea break. From this, the list of attendees could be randomised into small groups. To achieve this, we created a simple python script (a sketch of this kind of approach is shown below). A list of meeting links could be created to host the tea break sessions. It’s up to you which blend of meeting hosting applications you choose here. For ease, we created a number of recurring meetings every Tuesday and Thursday at 14:00 so that the meeting links do not change.</p>
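<p>That original script was only a few lines long. A minimal sketch of the shuffle-and-chunk idea (illustrative only, not the exact script we used):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch of the manual grouping approach.
import random

def make_groups(attendees, group_size=4):
    """ Shuffle the attendee list and split it into groups of roughly group_size """
    random.shuffle(attendees)
    return [attendees[i:i + group_size] for i in range(0, len(attendees), group_size)]

print(make_groups(["Ann", "Bea", "Cal", "Dev", "Eli", "Fay", "Gus"]))
</code></pre></div></div>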
<p>Ok, so 20-minutes work every Tuesday and Thursday isn’t exactly hard-graft, but we always wanted to make this more automated so it didn’t rely on one person to be the tea administrator.</p>
<p>Therefore, we created Mx Tea.</p>
<p><img src="/images/blog/mxtea/mxtea.gif" alt="a gif of the mx tea logo" /></p>
<blockquote>
<p>Source: Data Science Campus</p>
</blockquote>
<h2 id="you-created-what">You created what?</h2>
<p>Mx Tea is a <a href="https://api.slack.com/bot-users">Slack Bot</a> who asks a simple question…</p>
<p><img src="/images/blog/mxtea/question.png" alt="Mx tea asking a question in a channel" /></p>
<blockquote>
<p>Source: Data Science Campus</p>
</blockquote>
<p>Keen tea enthusiasts then give the post a thumbs up if they want to participate. We then use the list of people who gave a thumbs up to create groups for the tea break.</p>
<p><img src="/images/blog/mxtea/group.png" alt="Mx tea posting thee group results to a channel" /></p>
<blockquote>
<p>Source: Data Science Campus</p>
</blockquote>
<p>Simple!</p>
<h3 id="ok-but-really-how">Ok, but, really how?</h3>
<p>Ok, so I may have skipped a few steps there on how Mx Tea actually works. Essentially it boils down to three steps:</p>
<ol>
<li>create the Slack App</li>
<li>post and gather responses with the Slack App using python</li>
<li>deploy to Google Cloud Platform</li>
</ol>
<h4 id="step-1-create-the-slack-bot">Step 1. Create the Slack Bot</h4>
<p>First of all, you need to <a href="https://api.slack.com/start">create a Slack App</a> and set the appropriate <strong>Scopes</strong> (minimum: <code class="language-plaintext highlighter-rouge">channels:history</code>, <code class="language-plaintext highlighter-rouge">channels:read</code>, <code class="language-plaintext highlighter-rouge">incoming-webhook</code>, <code class="language-plaintext highlighter-rouge">users:read</code>).</p>
<p>For posting messages to a Slack channel, we use <a href="https://api.slack.com/messaging/webhooks">Slack webhooks</a>. For getting the responses back from the channel we use an <code class="language-plaintext highlighter-rouge">OAuth Access Token</code>.</p>
<p>To use the App, you need to both <em>install</em> it to the desired channel (done in the App settings page), as well as <em>add</em> to the channel (done in the channel, like you would a real user: @app-name).</p>
<p>Now your App (bot) is ready for action.</p>
<h3 id="step-2-post-and-gather-responses-with-the-slack-app-using-python">Step 2. Post and gather responses with the Slack App using python</h3>
<p>Both posting and retrieving messages from Slack use <em>curl</em> to send an HTTP request. This can be done in your terminal, but we are going to use python here. For this we chose the <code class="language-plaintext highlighter-rouge">requests</code> library.</p>
<h4 id="step-21-posting-your-question-in-the-channel">Step 2.1. Posting your question in the channel</h4>
<p>Posting a question is nice and simple. For this you need the following:</p>
<ul>
<li>your <code class="language-plaintext highlighter-rouge">webhook_url</code></li>
<li>your message</li>
</ul>
<p>Then, using <code class="language-plaintext highlighter-rouge">requests</code> and <code class="language-plaintext highlighter-rouge">json</code> libraries we can make a request to the webhook.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env python
import requests
import json
""" Your Slack webhook """
slack_webhook = 'slack_webhook'
""" Question message """
slack_question_message = "Morning all, give this post a thumbs up if you are available for a tea break this afternoon!"
def ask_question():
    """ Asks the question to the Slack channel """
    headers = {'Content-type': 'application/json'}
    payload = {'text': slack_question_message}
    payload = json.dumps(payload)
    requests.post(slack_webhook, data=payload, headers=headers)
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">ask_question()</code> will then post the message to the channel you added the App to.</p>
<h4 id="step-22-creating-the-groups-and-posting-them-in-the-channel">Step 2.2. Creating the groups and posting them in the channel</h4>
<p>This is a little harder, but not too difficult. We will break it down into chunks.</p>
<p>First, you need the following python libraries:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env python
import requests
import random as rnd
import math
import json
</code></pre></div></div>
<p>Then, the first thing to do is to set up all the preamble variables we will need.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>""" Post your meeting links below """
meeting_links = \
[
'link_1',
'link_2',
'etc'
]
""" Your Slack webhook """
slack_webhook = 'slack_webhook'
""" Your Slack token """
slack_token = 'slack_token'
""" Slack channel ID """
slack_channel_id = 'slack_channel_ID'
""" Slack bot ID """
slack_bot_id = 'slack_bot_ID'
""" Question message """
slack_question_message = "Morning all, give this post a thumbs up if you are available for a tea break this afternoon!"
""" Requested group size (note, the algorithm may use +1 if ir provides a better split of people """
group_size = 4
</code></pre></div></div>
<p>Next, we want to get the posts from the channel that relate to our Slack App.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def get_posts():
""" Gets posts from the channel by the bot and returns as JSON object """
url = f"https://slack.com/api/conversations.history?token={slack_token}&channel={slack_channel_id}&bot_id={slack_bot_id}"
return requests.get(url).json()
</code></pre></div></div>
<p><em>Note: We have updated the above to use conversations.history, as channels.history was deprecated in February 2021.</em></p>
<p>Let’s call <code class="language-plaintext highlighter-rouge">posts = get_posts()</code> to get these posts. Ok, now we have all the App’s posts in that channel. But what we really want is the most recent post - the post to check the availability of people for the tea break. Therefore, let’s filter the results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def get_recent_question(posts):
""" Gets the most recent question post from the channel by the bot """
latest_poll = {}
for post in posts['messages']:
if post.get('text') == f"{slack_question_message}":
latest_poll = post
break
return latest_poll
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">poll = get_recent_question(posts)</code> will then bring back that specific post. Next, we want to get a list of people who gave the post a thumbs up. So we request this from the post using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def get_attendees_ids(poll):
""" Gets a list of attendees slack IDs """
attendees_ids = []
reactions = poll.get('reactions')
for reaction in reactions:
if reaction.get('name') == "+1":
attendees_ids.append(reaction.get('users'))
return attendees_ids[0]
</code></pre></div></div>
<p>Great, we call <code class="language-plaintext highlighter-rouge">attendees_ids = get_attendees_ids(poll)</code> and we get a list of those who gave a thumbs up. Slack allows us to tag people using their ID, so let’s do that. Next, let’s allocate these IDs to a group. There are two catches here that increase the group size if needed. First, the group size is increased by one if that gives a more even split of attendees. Second, if there are too many attendees for the number of meeting links, the group size is increased to compensate.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def allocate(ids):
""" Allocates ids to groups """
rnd.shuffle(ids)
result = []
# increase groupsize by one if the groups are more evenly split by doing so
groupsize = group_size if abs(((len(ids) / group_size) % 1) - 0.5) >= abs(((len(ids) / (group_size + 1)) % 1) - 0.5) else group_size + 1
# increase groupsize if we don't have enough meeting links
groupsize = max(groupsize, math.ceil(len(ids) / len(meeting_links)))
while True:
# Make sure no group has less than two people
if len(ids) > groupsize + 2:
result.append(ids[0:groupsize])
ids = ids[groupsize:]
else:
result.append(ids[0:])
break
return result
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">groups = allocate(attendees_ids)</code> then returns groups of randomly shuffled Slack IDs. Next, let’s create a nice message we can get the App to post to our desired Slack channel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def present(groups, meeting_links):
""" Create post for slack channel """
msg = []
msg.append("Good afternoon everyone, below are this afternoons tea break groups")
num = 0
for i, group in enumerate(groups):
msg.append("--------------------------------------------------------------------------------")
msg.append(f"Group {i + 1}: ")
msg.append(meeting_links[num])
msg.append("--------------------------------------------------------------------------------")
for member in group:
msg.append(f"<@{member}>")
num = num + 1
return '\n'.join(map(str, msg))
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">msg = present(groups, meeting_links)</code> returns this message string, built from the groups we just created and the meeting links we defined at the start. <code class="language-plaintext highlighter-rouge">f"<@{member}>"</code> tags the member in the post. The last thing to do therefore is to ask the App to post to the channel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def post_to_slack(msg):
""" Post groups to slack channel """
headers = {'Content-type': 'application/json'}
payload = {'text': msg}
payload = json.dumps(payload)
requests.post(slack_webhook, data=payload, headers=headers)
</code></pre></div></div>
<p>At this point you could call <code class="language-plaintext highlighter-rouge">post_to_slack(msg)</code> to fire off the message to check everything works.</p>
<p>Ok great. So how do we automate this so someone doesn’t have to run the python scripts? Well, for this we are going to use Google Cloud Platform’s <code class="language-plaintext highlighter-rouge">Cloud Function</code> and <code class="language-plaintext highlighter-rouge">Cloud Scheduler</code> features.</p>
<h2 id="step-3-deploy-to-google-cloud-platform">Step 3. Deploy to Google Cloud Platform</h2>
<h3 id="step-31-cloud-function">Step 3.1. Cloud Function</h3>
<p>Cloud Function is what does the bulk of our work here. It processes our two requests: post the original message, and post the groups, to the Slack channel. If you’ve read our other posts you’ll see we have used <a href="https://datasciencecampus.github.io/github-api-to-gcp/">Cloud Function and Scheduler before</a>, so we won’t go into detail here on how to set a function up. The fundamentals we need here are:</p>
<ul>
<li>a Pub/Sub trigger (so we can use Cloud Scheduler)</li>
<li>python 3.7 inline editor</li>
</ul>
<p>This isn’t a memory expensive process, so you can stick to the <code class="language-plaintext highlighter-rouge">256 MB</code> option. The <code class="language-plaintext highlighter-rouge">ask_question</code> function can be used as is, just remember to put <code class="language-plaintext highlighter-rouge">data, context</code> in the function’s header (e.g., <code class="language-plaintext highlighter-rouge">ask_question(data, context)</code>).</p>
<p>For the group posting functions, copy all the parts above (or see our <a href="https://github.com/datasciencecampus/mx-tea">GitHub repository with all the code in</a>) into the editor. Then, we need just one more <code class="language-plaintext highlighter-rouge">process</code> function, as Cloud Functions only runs one, single function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def process(event, context):
""" Pipeline process required for Cloud Function """
users = get_users()
posts = get_posts()
poll = get_recent_question(posts)
attendees_ids = get_attendees_ids(poll)
attendees_names = get_attendees_names(attendees_ids, users)
groups = allocate(attendees_names)
msg = present(groups, meeting_links)
post_to_slack(msg)
</code></pre></div></div>
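<p>The <code class="language-plaintext highlighter-rouge">process</code> function also calls <code class="language-plaintext highlighter-rouge">get_users()</code> and <code class="language-plaintext highlighter-rouge">get_attendees_names()</code>, which map Slack IDs to names; the full versions are in the <a href="https://github.com/datasciencecampus/mx-tea">GitHub repository</a>. A minimal sketch of how such helpers could work, based on Slack’s users.list endpoint (check the repository for the exact implementation):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch only; assumes the requests import and slack_token defined earlier.
def get_users():
    """ Gets the workspace member list from Slack """
    url = f"https://slack.com/api/users.list?token={slack_token}"
    return requests.get(url).json()

def get_attendees_names(attendees_ids, users):
    """ Maps attendee Slack IDs to real names, falling back to the ID if no name is found """
    id_to_name = {member['id']: member.get('profile', {}).get('real_name', member['id'])
                  for member in users.get('members', [])}
    return [id_to_name.get(user_id, user_id) for user_id in attendees_ids]
</code></pre></div></div>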
<p>Calling <code class="language-plaintext highlighter-rouge">post_groups</code> will run the other functions in the chain. It’s probably good at this stage to test both of these out and see if they work.</p>
<h3 id="step-32-cloud-scheduler">Step 3.2. Cloud Scheduler</h3>
<p>Cloud Scheduler runs our functions at a time that suits us using <a href="https://en.wikipedia.org/wiki/Cron">cron</a>. For the <code class="language-plaintext highlighter-rouge">ask_question</code> function we want this to run every Tuesday and Thursday at 09:00 so we set the frequency to <code class="language-plaintext highlighter-rouge">0 9 * * 2,4</code>. For the <code class="language-plaintext highlighter-rouge">post_groups</code> function we want this to post on the same dates but at 13:00, so we set the frequency to <code class="language-plaintext highlighter-rouge">0 13 * * 2,4</code>. We can test whether these work by clicking <code class="language-plaintext highlighter-rouge">RUN NOW</code>.</p>
<h2 id="summary">Summary</h2>
<p>If everything has worked out well you will now have set up a nice Slack App (bot) that can ask your Slack channel users who is available for a tea break session. It will then use the responses to create the randomised tea break session groups and post them to the Slack channel.</p>
<p>Feel free to add a nice photo to the bot in its settings, or make the channel posts a little nicer. Or, put your feet up and have a nice cup of tea.</p>Michael Hodgemichael.hodge@ons.gov.ukHow the Campus uses Slack and Google Cloud Platform to create tea break sessions for informal chatsHow the Campus extracted data from the Google Mobility Reports (Covid-19)2020-04-09T00:00:00+00:002020-04-09T00:00:00+00:00https://datasciencecampus.github.io/google-mobility-reports<h2 id="updates">Updates</h2>
<p><strong>16/04/2020</strong>: Google have released the data in CSV format. Click <a href="https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv">here</a> for the latest data.</p>
<p><strong>10/04/2020</strong>: The <a href="https://github.com/datasciencecampus/mobius">mobius pipeline (click here)</a> has been updated for the Friday 10th of April 2020 release of data. This dataset comprises a time series between Sunday 23rd February 2020 and Sunday 5th April 2020. The blog post below talks about the original data set, but the methodology remains the same.</p>
<h2 id="introduction">Introduction</h2>
<p><img src="/images/blog/google-mobility/logo.png" alt="mobius logo" /></p>
<p><em>Figure 1: mobius logo</em></p>
<p>At 7:00 am Friday 3rd April 2020, Google released <a href="https://www.google.com/covid19/mobility/">Community Mobility Reports</a> for 131 countries and 50 US states, and the District of Columbia. These reports show the changing levels of people visiting different types of locations for areas around Great Britain and other countries. The data provides insight into the impact of social distancing measures, and are created with aggregated, anonymised data from users who have turned on the Location History setting (off by default).</p>
<p>The reports contain <em>headline</em> numbers and <em>trend</em> graphs (Figure 3) showing how <em>mobility</em> has changed over time for six categories (retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential). The graphs comprise a daily mobility percent (%) value for each day between 16th February 2020 and 29th March 2020, relative to a baseline period from the 3rd January 2020 to 6th February 2020. Each report has national-level (or state-level) graphs, and a number of reports have regional-level graphs.</p>
<p>As these reports were released in PDF format, a team of Data Scientists at the Data Science Campus were tasked with gathering the data within the reports so that the raw data could be fed-in to inform the UK government response to Covid-19. We setup a <a href="https://github.com/datasciencecampus/mobius">public GitHub repository</a>, gave the project the name <strong>mobius</strong>, and began our work.</p>
<p>Our first job was to get the data for Great Britain (including all 150 regions, Figure 2), which we were able to provide to colleagues across government working on the Covid-19 response rapidly. This included working with ONS Geography to generate a new geographical boundary dataset that reflected Google’s.</p>
<p>These datasets, among others, can be found at the Campus’ <a href="https://github.com/datasciencecampus/google-mobility-reports-data">Google Mobility Reports Data GitHub repository</a>.</p>
<p><img src="/images/blog/google-mobility/GB-report.png" alt="report graph" /></p>
<p><em>Figure 2: All 78 pages of the Great Britain (GB) Google Mobility Report. The first two pages show national-level headline numbers and graphs. Subsequent pages comprise each of the 150 regions data.</em></p>
<h2 id="methodology">Methodology</h2>
<p>The main task for the team was to extract the text and graphs from the PDF reports and convert the headline numbers and trend line graphs to data. In addition, the team aimed to align this data to ONS geographies for Great Britain. Crucially, the team undertook various quality assurance steps before <a href="https://github.com/datasciencecampus/google-mobility-reports-data">publishing their data</a>. This was to ensure that the data, to the best of our knowledge, was an accurate representation to that presented in the reports.</p>
<h4 id="extracting-relevant-elements-from-scalable-vector-graphics">Extracting relevant elements from Scalable Vector Graphics</h4>
<p>The graphs in the reports were embedded as vector graphics. This means they comprised a vector coordinates system that could theoretically be used to calculate the mobility % value for each data point on the graphs. Until we realised the graphs were embedded as vectors, we were also looking at how to extract trend data directly from the charts using image analysis techniques.</p>
<p>First, we saved a number of individual graphs to Scalable Vector Graphics (SVG) format. We chose SVG as we wanted to retain the vector information of the graph. Crucially, SVG is drawn from mathematically declared shapes and curves (Bézier curves) that keep the vital vector coordinates system we required.</p>
<p>We then began to inspect these SVG files to see what information we could use to calculate the original data. It was found that the most important elements of each graph were the top (80%), middle (baseline), and bottom (-80%) horizontal lines, and the trend line (Figure 3). Both had unique stroke-widths and stroke colours that meant they could be easily filtered from the rest of the SVG elements. To filter these elements the team used the python library <a href="https://github.com/mathandy/svgpathtools">svgpathtools</a>. An example of a filtered SVG graph for the graph in Figure 3 is shown below.</p>
<p><img src="/images/blog/google-mobility/graph.png" alt="report graph" /></p>
<p><em>Figure 3: A graph from the GB report</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="100%" height="100%" viewBox="0 0 174 87" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" xmlns:serif="http://www.serif.com/" style="fill-rule:evenodd;clip-rule:evenodd;"><rect id="Artboard1" x="0" y="0" width="173.43" height="86.648" style="fill:none;"/>
// horizontal lines below //
<path d="M36.542,67.535l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
<path d="M36.542,53.246l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
<path d="M36.542,38.956l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
<path d="M36.542,24.666l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
<path d="M36.542,10.377l114.317,0" style="fill:none;stroke:#dadce0;stroke-width:0.29px;"/>
// trend line below //
<path d="M36.542,36.7l2.722,-1.62l2.722,2.065l2.722,0.986l2.721,1.113l2.722,-0.26l2.722,-0.1l2.722,-0.91l2.722,2.325l2.722,-1.166l2.721,-0.968l2.722,0.342l2.722,-0.561l2.722,-1.525l2.722,0.17l2.722,0.715l2.721,1.33l2.722,-1.502l2.722,2.046l2.722,-0.308l2.722,0.873l2.722,-1.226l2.721,-0.522l2.722,-0.248l2.722,1.104l2.722,1.897l2.722,-0.42l2.722,3.241l2.721,-1.596l2.722,-0.832l2.722,4.205l2.722,2.135l2.722,1.405l2.722,2.574l2.721,7.795l2.722,1.053l2.722,-2.437l2.722,7.716l2.722,1.389l2.722,0.461l2.721,0.062l2.722,2.397l2.722,-0.347" style="fill:none;stroke:#4285f4;stroke-width:1.14px;"/>
</svg>
</code></pre></div></div>
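<p>A minimal sketch of that filtering step with svgpathtools is shown below (the file name is a placeholder; the stroke colours are those visible in the SVG above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch: separate the horizontal reference lines from the trend line.
from svgpathtools import svg2paths2

paths, attributes, _ = svg2paths2("graph.svg")  # placeholder file name

horizontal_lines, trend_line = [], None
for path, attr in zip(paths, attributes):
    style = attr.get("style", "")
    if "stroke:#dadce0" in style:    # grey horizontal (grid/baseline) lines
        horizontal_lines.append(path)
    elif "stroke:#4285f4" in style:  # blue trend line
        trend_line = path
</code></pre></div></div>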
<h4 id="changing-vector-coordinates-to-data-values">Changing vector coordinates to data values</h4>
<p>Once the relevant elements were filtered from the SVG file(s), the team calculated a scaling metric. This scaling metric would be used to convert distances in the vector coordinate system to mobility % values. To do this, the distance between the +/- 80% lines and the baseline were used, and the average distance stored. We then calculated the distance between the trend line and the baseline in the <em>y</em> dimension for each vertex of the trend line. Each vertex in this case reflects a data point. Once these values had been found, the scaling value could then be used to calculate the mobility % value.</p>
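<p>A minimal sketch of that conversion is shown below. It assumes we already have the y coordinates of the +80%, baseline and -80% lines and of the trend-line vertices (remembering that SVG y values increase downwards); the numeric example uses the line coordinates from the SVG above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch of converting SVG y coordinates to mobility % values.
def to_mobility_percent(trend_y_values, top_y, baseline_y, bottom_y):
    """ top_y is the +80% line, bottom_y the -80% line; the scaling metric is the
        average vertical distance between those lines and the baseline """
    scale = 80.0 / (((baseline_y - top_y) + (bottom_y - baseline_y)) / 2.0)
    # SVG y grows downwards, so points above the baseline are positive mobility
    return [(baseline_y - y) * scale for y in trend_y_values]

# Horizontal line y coordinates from the SVG above: 10.377 (+80%), 38.956 (baseline), 67.535 (-80%)
print(to_mobility_percent([36.7, 53.2], 10.377, 38.956, 67.535))  # approx [+6.3, -39.9]
</code></pre></div></div>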
<h4 id="getting-data-for-an-entire-document">Getting data for an entire document</h4>
<p>After we found a solution for a single graph, we needed to scale the process to calculate mobility % values for all graphs in a document.</p>
<p>We explored command line and python tools including <a href="https://www.mankier.com/1/pdftocairo">pdftocairo</a> and <a href="https://pymupdf.readthedocs.io/en/latest/">PyMuPDF</a>, however, the SVGs created with these methods resulted in a number of issues. One issue is that they did not flatten the SVG image. This means that the vector coordinate system in the SVG <code class="language-plaintext highlighter-rouge">path</code> did not represent its true location compared to other graphs and elements of graphs. We would therefore need to use the transform matrix stored within the <code class="language-plaintext highlighter-rouge">attributes</code> of the SVG element to find the distances between the horizontal lines and the trend line. As data was required as soon as possible, we therefore decided against these approaches. Furthermore, for points on the graphs, these approaches did not close the <code class="language-plaintext highlighter-rouge">path</code>, which caused errors loading the SVG using svgpathtools.</p>
<p>Due to the time constraints of this work the team converted all of the PDF documents to SVG documents using Affinity Designer. Affinity Designer is a low-fee vector graphics editor developed for macOS, iOS, and Microsoft Windows. It provides an option to flatten the SVG graphs, as well as automatically closing paths (needed to maintain single data points in graphs).</p>
<p>The <a href="https://console.cloud.google.com/storage/browser/mobility-reports/SVG">SVGs</a> and <a href="https://console.cloud.google.com/storage/browser/mobility-reports/PDF">PDFs</a> were then uploaded to Google Cloud Platform (GCP) so we could add them to the download function in the pipeline. For the SVG documents, graphs are arranged as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ------------ ------------
| Page 1 & 2 | | Page 3 + |
| 1 | | 1 2 3 |
| 2 | | 4 5 6 |
| 3 | | -------- |
| | | 7 8 9 |
| | | 10 11 12 |
----------- ------------
</code></pre></div></div>
<p>Thus, the first six graphs in a document were from pages one and two, and related to the national-level graphs for the six categories. The subsequent pages comprised two sets of six graphs, each set relating to a region.</p>
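<p>A minimal sketch of how that ordering can be turned into labels is shown below; it assumes the six categories always appear in the report order listed earlier and that a list of region names is available in the same order as the report pages.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch: attach (region, category) labels to the flat list of extracted graphs.
CATEGORIES = ["Retail & recreation", "Grocery & pharmacy", "Parks",
              "Transit stations", "Workplaces", "Residential"]

def label_graphs(graph_series, region_names):
    """ graph_series[0:6] are the national-level graphs; each following chunk of six
        belongs to the next region in region_names """
    labelled = {}
    for i, series in enumerate(graph_series):
        region = "NATIONAL" if i < 6 else region_names[(i - 6) // 6]
        labelled[(region, CATEGORIES[i % 6])] = series
    return labelled
</code></pre></div></div>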
<h4 id="quality-assurance">Quality Assurance</h4>
<p>In addition to developing a reusable pipeline to extract the data from the reports, the team undertook comprehensive quality assurance on all the data uploaded to the Campus’ <a href="https://github.com/datasciencecampus/google-mobility-reports-data">Google Mobility Reports Data GitHub repository</a>. Whilst the team were working to include quality assurance steps within their code (see below), they used R code written by Cabinet Office colleague, <a href="https://github.com/mattkerlogue/google-covid-mobility-scrape">Matt Kerlogue</a>. This code enabled us to extract the position number of the graph within each report, in addition to the place name and category. This code also returned the headline value for each graph, which allowed us to sense-check the data being extracted from the graphs. Using this the team could flag any inconsistencies where the SVGs had been extracted in a different order, and gaps within both graphs and the extracted text due to missing data.</p>
<h4 id="adding-quality-assurance-to-our-code">Adding Quality Assurance to our code</h4>
<p>Using python, the team then developed a method to read the headline numbers in the text of the report. These numbers relate to the overall increase or decline in mobility at the end of the time series (compared to the baseline). By parsing the PDF, textual information about the region name and category type could also be found. This information was combined with the graph trend line dataset to provide the user with all the textual and numeric information from the reports. It also allowed the team to automatically cross-check the data from the graph trends (a sketch of this kind of check is shown below), and to align the extracted data with the regions it referred to.</p>
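<p>A minimal sketch of that kind of headline check is shown below, assuming PyMuPDF for the text extraction and that each headline figure is followed by the phrase “compared to baseline” (the file name is a placeholder; the full pipeline lives in the mobius repository).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch of pulling headline percentages out of a report's text.
import re
import fitz  # PyMuPDF

def headline_values(pdf_path):
    """ Returns the headline percentage changes (e.g. -85) found in the report text """
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    return [int(value) for value in re.findall(r"([+-]\d+)%\s+compared to baseline", text)]

print(headline_values("2020-03-29_GB_Mobility_Report_en.pdf"))  # placeholder file name
</code></pre></div></div>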
<h2 id="summary">Summary</h2>
<p>The team at the Campus worked at speed to provide analysts from across government with data from Google’s <a href="https://www.google.com/covid19/mobility/">Community Mobility PDF Reports</a> rapidly. The team also created a public python tool so that data from any Community Mobility Report can be gathered, helping analysts from all over the world understand changes in mobility over time.</p>
<p>Since its release, our code and data has been used by academics, government workers and members of the public in countries around the world, such as Poland and the US. The Campus’ <a href="https://github.com/datasciencecampus/google-mobility-reports-data">Google Mobility Reports Data GitHub repository</a> contains quality checked data for all G20 countries (except Russia and China, where reports were not available).</p>
<p>We hope that Google releases future updates, as the community mobility profiles are a valuable addition to the data we have on mobility. Please keep an eye on our <a href="https://github.com/datasciencecampus/mobius">code repository</a> for updates.</p>
<p><img src="/images/blog/google-mobility/pipeline.gif" alt="the pipeline as a GIF" /></p>
<p><em>Figure 4: A GIF video of the pipeline in action (at 2x speed)</em></p>DSC Mobility Analysis Teamdata.campus@ons.gov.ukHow a team at the Campus extracted data from the Google Mobility Reports within 40 hours so it could be used by analysts around the world.How to Setup a Google Cloud VM, pull a GitHub repo, install docker, and run a script at regular intervals using cron (in 22 steps)2020-03-20T00:00:00+00:002020-03-20T00:00:00+00:00https://datasciencecampus.github.io/creating-a-gcp-vm-and-run-cron<p><img src="../../images/training/vm/vm-cron.jpg" alt="Data Science Campus GitHub" /></p>
<p>This is a short guide on how to set up a Google Cloud Platform VM, pull a GitHub repository and let cron run a script at regular intervals. There probably are some bad practices in this guide, such as not using a virtual environment or conda environment, and instead using the base python environment. However, the aim of this was to create a single-use VM for a specific task - which we could leave running, chugging away.</p>
<h5 id="step-1-to-setup-vm-on-gcp-first-go-to-compute-engine-then-vm-instances-then-create-instance-name-the-vm-instance-for-this-example-all-default-settings-should-be-fine-except">Step 1: To setup VM on GCP first go to <code class="language-plaintext highlighter-rouge">Compute Engine</code>, then <code class="language-plaintext highlighter-rouge">VM instances</code>, then <code class="language-plaintext highlighter-rouge">Create Instance</code>. Name the VM instance. For this example, all default settings should be fine, except:</h5>
<ul>
<li>set region to <code class="language-plaintext highlighter-rouge">europe-west2 (London)</code></li>
<li>set zone to <code class="language-plaintext highlighter-rouge">europe-west2-b</code></li>
<li>you may want to increase the Boot Disk</li>
<li>allow <code class="language-plaintext highlighter-rouge">HTTP</code> and <code class="language-plaintext highlighter-rouge">HTTPS</code> traffic on the Firewall</li>
</ul>
<h5 id="step-2-click-on-the-down-arrow-next-to-ssh-and-choose-view-cloud-command-copy-the-command-to-a-terminal-window-and-run">Step 2: Click on the down arrow next to SSH and choose <code class="language-plaintext highlighter-rouge">view Cloud command</code>. Copy the command to a terminal window and run.</h5>
<h5 id="step-3-run-gcloud-init-in-your-sshd-terminal-window-and-sign-in-with-a-your-personal-google-gcp-account-option-2-follow-the-instructions-and-set-the-required-project-and-region">Step 3: Run <code class="language-plaintext highlighter-rouge">gcloud init</code> in your SSH’d terminal window and sign in with a your personal Google GCP account (option 2). Follow the instructions and set the required <code class="language-plaintext highlighter-rouge">project</code> and <code class="language-plaintext highlighter-rouge">region</code>.</h5>
<h5 id="step-4-step-4-and-5-were-sourced-from-here-install-a-few-crucial-packages-for-debian">Step 4: Step 4 and 5 were sourced from <a href="https://www.datacamp.com/community/tutorials/google-cloud-data-science">here</a>. Install a few crucial packages for Debian.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo apt-get update
$ sudo apt-get install bzip2 git libxml2-dev
```
</code></pre></div></div>
<h5 id="step-5-install-miniconda">Step 5: Install miniconda</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
$ rm Miniconda3-latest-Linux-x86_64.sh
$ source .bashrc
$ conda install scikit-learn pandas jupyter ipython
```
</code></pre></div></div>
<h5 id="step-6-set-your-global-git-credentials">Step 6: Set your global git credentials</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ git config --global user.name 'User Name'
$ git config --global user.email 'User Email'
```
</code></pre></div></div>
<h5 id="step-7-steps-7-to-9-were-sourced-from-here-generate-a-new-ssh-key-for-github-first-paste-the-text-below-into-your-terminal-window-use-the-default-location-and-no-password">Step 7: Steps 7 to 9 were sourced from <a href="https://help.github.com/en/enterprise/2.17/user/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent">here</a>. Generate a new SSH key for GitHub. First, paste the text below into your terminal window. Use the default location and no password.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ ssh-keygen -t rsa -b 4096 -C 'User Email'
```
</code></pre></div></div>
<h5 id="step-8-start-the-ssh-agent-in-the-background">Step 8: Start the ssh-agent in the background.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ eval "$(ssh-agent -s)"
```
</code></pre></div></div>
<h5 id="step-9-add-your-ssh-private-key-to-the-ssh-agent-if-you-created-your-key-with-a-different-name-or-if-you-are-adding-an-existing-key-that-has-a-different-name-replace-id_rsa-in-the-command-with-the-name-of-your-private-key-file">Step 9: Add your SSH private key to the ssh-agent. If you created your key with a different name, or if you are adding an existing key that has a different name, replace id_rsa in the command with the name of your private key file.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ ssh-add ~/.ssh/id_rsa
```
</code></pre></div></div>
<h5 id="step-10-steps-10-and-11-were-sourced-from-here-paste-your-ssh-key-to-terminal-and-then-copy-it-to-clipboard">Step 10: Steps 10 and 11 were sourced from <a href="https://help.github.com/en/enterprise/2.17/user/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account">here</a>. Paste your SSH key to terminal and then copy it to clipboard</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ cat < ~/.ssh/id_rsa.pub
```
</code></pre></div></div>
<h5 id="step-11-go-to-httpsgithubcomsettingssshnew-and-add-a-title-for-your-key-then-paste-in-the-clipboard-text-into-the-key-box-click-add-key-and-then-enter-your-password">Step 11: Go to <a href="https://github.com/settings/ssh/new">https://github.com/settings/ssh/new</a> and add a title for your key, then paste in the clipboard text into the key box. Click <code class="language-plaintext highlighter-rouge">Add Key</code> and then enter your password.</h5>
<h5 id="step-12-make-a-gitprojects-directory-and-then-cd-to-it">Step 12: Make a GitProjects directory and then cd to it.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ mkdir GitProjects
$ cd GitProjects
```
</code></pre></div></div>
<h5 id="step-13-clone-your-github-repo-using-the-ssh-link">Step 13: Clone your GitHub repo using the SSH link</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ git clone repo
```
</code></pre></div></div>
<h5 id="step-14-install-docker-steps-14-to-xx-were-sourced-from-here-install-the-packages-necessary-to-add-a-new-repository-over-https">Step 14: Install docker (steps 14 to XX were sourced from <a href="https://linuxize.com/post/how-to-install-and-use-docker-on-debian-10/">here</a>). Install the packages necessary to add a new repository over HTTPS.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo apt update
$ sudo apt install apt-transport-https ca-certificates curl software-properties-common gnupg2
```
</code></pre></div></div>
<h4 id="step-15-import-the-repositorys-gpg-key-using-the-following-curl-command">Step 15: Import the repository’s GPG key using the following curl command:</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
```
</code></pre></div></div>
<h4 id="step-16-add-the-stable-docker-apt-repository-to-your-systems-software-repository-list">Step 16: Add the stable Docker APT repository to your system’s software repository list.</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
```
</code></pre></div></div>
<h4 id="step-17-update-the-apt-package-list-and-install-the-latest-version-of-docker-ce-community-edition">Step 17: Update the apt package list and install the latest version of Docker CE (Community Edition).</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo apt update
$ sudo apt install docker-ce
```
</code></pre></div></div>
<h4 id="step-18-once-the-installation-is-completed-the-docker-service-will-start-automatically-to-verify-it-type-in">Step 18: Once the installation is completed the Docker service will start automatically. To verify it type in.</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo systemctl status docker
```
</code></pre></div></div>
<h4 id="step-19-run-a-docker-image">Step 19: Run a docker image.</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ sudo docker run -p port:port <image_name>
```
</code></pre></div></div>
<h5 id="step-20-create-a-cron-job-select-nano-as-the-editor-option-1">Step 20: Create a cron job. Select nano as the editor (option 1)</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```bash
$ crontab -e
```
</code></pre></div></div>
<h5 id="step-21-paste-into-your-crontab-the-cron-jobs-you-want-to-run-the-below-example-goes-to-the-project-folder-and-runs-examplepy-each-hour-of-every-day-log-outputs-are-saved-to-cronout">Step 21: Paste into your crontab the cron jobs you want to run. The below example goes to the project folder and runs <code class="language-plaintext highlighter-rouge">example.py</code> each hour of every day. Log outputs are saved to <code class="language-plaintext highlighter-rouge">cron.out</code>.</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>```text
$ 0 * * * * cd ~/GitProjects/project && /home/username/miniconda3/bin/python ~/GitProjects/project/example.py >> ~/GitProjects/project/cron.out 2>&1
```
</code></pre></div></div>
<h5 id="step-22-use-x-to-save-the-crontab">Step 22: Use <code class="language-plaintext highlighter-rouge">^X</code> to save the crontab.</h5>
<p>Your python script should now run at XX:00 every hour. See crontab guru for more information about setting cron times.</p>Michael Hodgemichael.hodge@ons.gov.ukHow to Setup a Google Cloud VM, pull a GitHub repo, install docker, and run a script at regular intervals using cron (in 22 steps)