Introduction
The Road Data Pipeline is an unofficial Highways England WebTRIS api client. The work contained within the pipeline was informed by phil8192’s webtri.sh client. This pipeline is less flexible than the webtri.sh client acknowledged above, but includes additional data processing, outputting csv files for the user-specified date range.

The client allows querying of all available sites for specified date ranges, intended for monthly analysis.
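For orientation, the sketch below shows how site metadata can be requested from the WebTRIS api with httr and jsonlite (both pipeline dependencies). The url is taken from the public WebTRIS documentation; the exact requests issued by the pipeline may differ:

```r
library(httr)
library(jsonlite)

# Request the list of all monitoring sites (url based on the public WebTRIS docs).
response <- GET("https://webtris.highwaysengland.co.uk/api/v1.0/sites")
stop_for_status(response)

# The response body is JSON; the site records are assumed to sit under "sites".
sites <- fromJSON(content(response, as = "text", encoding = "UTF-8"))$sites
head(sites)
```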
Output Summary
Main Outputs
| Column | Data Type | Description |
| --- | --- | --- |
| site_id | Integer | Numerical site code |
| site_name | Alphanumeric | Long-form site code |
| report_date | Datetime | The date of data capture |
| time_period_end | Timestamp | The time of data capture |
| interval | Integer | Pending |
| len_x_y_cm | Integer | Multiple columns containing vehicle lengths |
| speed_x_y_mph | Integer | Multiple columns, speed of vehicle |
| speed_avg_mph | Integer | Average speed of vehicle |
| total_vol | Integer | Pending |
| longitude | Floating Point | Coordinate data |
| latitude | Floating Point | Coordinate data |
| status | Character | Active / inactive |
| type | Character | Site type: MIDAS, TAME or TMU |
| direction | Character | Compass direction of traffic |
| easting | Integer | Coordinate data |
| northing | Integer | Coordinate data |
The output files will appear in the output_data folder as site-type_query-date.csv. There are 3 site types available: MIDAS, TAME, and TMU. Please see the explanations of the different site types, as provided by phil8192:
- Motorway Incident Detection and Automatic Signalling (MIDAS): predominantly inductive loops (though there are a few sites where radar technology is being trialled)
- Traffic Appraisal, Modelling and Economics (TAME): inductive loops
- Traffic Monitoring Units (TMU): loops
- Highways Agency’s Traffic Flow Database System (TRADS): legacy
Note that TRADS is available but is not queried by this pipeline.
All Outputs
The full output of the pipeline is as follows:
./output_data/midas_daterange.csv
./output_data/tame_daterange.csv
./output_data/tmu_daterange.csv
./output_data/missing_site_IDs_daterange.txt
The missing site IDs file is to be replaced by the following report, pending testing:
./reports/site_report_daterange.html
./logs/logfile.txt
The output data csvs contain the api output.
The ./output_data/missing_site_IDs_daterange.txt file lists the missing site IDs from the queried date range, a count of api responses that were empty, and the proportion of the overall number of IDs queried that this represents (rounded to 2dp).
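As an illustration of the summary this file contains, the sketch below computes the same figures from two hypothetical vectors, queried_ids and empty_ids (neither name exists in the pipeline itself):

```r
# Hypothetical inputs: all site IDs queried and the IDs whose api response was empty.
queried_ids <- 1:500
empty_ids   <- c(12, 45, 307)

n_empty    <- length(empty_ids)
proportion <- round(n_empty / length(queried_ids), 2)  # rounded to 2dp, as in the pipeline output

writeLines(c(
  paste("Missing site IDs:", paste(empty_ids, collapse = ", ")),
  paste("Empty responses:", n_empty),
  paste("Proportion of queried IDs:", proportion)
))
```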
Currently, executing the pipeline will overwrite any output_data files with the same name.
The logfile.txt is important if you encounter an error. This file can be passed back to the developer in order to investigate the issues encountered. New runs do not overwrite the logs; they append to the pre-existing logs.
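For reference, appending to a single log file is the behaviour given by the log4r package (listed in the dependencies below) when its file appender is created with append = TRUE. This is a minimal sketch, not the pipeline's own logging code, and the file path is illustrative:

```r
library(log4r)

# Create the logs directory if it does not already exist (path is illustrative).
dir.create("logs", showWarnings = FALSE)

logger <- logger(
  threshold = "INFO",
  appenders = file_appender("logs/logfile.txt", append = TRUE)
)

info(logger, "Pipeline run started")
error(logger, "Example error message")
```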
Dependencies
R Studio
64-bit R (note that 32-bit R will not be able to allocate the required memory)
100 GB of free disk space to allocate
Internet connection (preferably high speed fibre optic)
Access to CRAN packages listed below.
Git
GitHub account
Command line interface (Bash, Terminal, CMD prompt etc)
Packages and versions:
rlist 0.4.6.1
this.path 0.2.0
stackoverflow 0.7.0
dplyr 1.0.2
jsonlite 1.7.1
httr 1.4.2
log4r 0.3.2
renv 0.12.3
beepr 1.3
stringr 1.4.0
purrr 0.3.4
data.table 1.13.2
ProjectTemplate 0.9.3
Notes
Memory allocation - 100 GB of memory is requested for use in the R session. The script will stop execution if this is not available.
Internet connection - 1 month of data takes approximately 1.25 hours on a connection averaging 60 Mbps download, 20 Mbps upload.
Package management - this project uses renv to manage package versions. Using renv to ensure package versions are consistent can help to minimise the risk of breaking changes.
Data Processing
This pipeline joins 2 DataFrames ingested via the Highways England api: combo and sites.
DataFrames
combo
This DataFrame holds the site readings from the date range that the user specifies. A full list of combo’s column names:
Site Name, Report Date, Time Period Ending, Time Interval, 0 - 520 cm, 521 - 660 cm, 661 - 1160 cm, 1160+ cm, 0 - 10 mph, 11 - 15 mph, 16 - 20 mph, 21 - 25 mph, 26 - 30 mph, 31 - 35 mph, 36 - 40 mph, 41 - 45 mph, 46 - 50 mph, 51 - 55 mph, 56 - 60 mph, 61 - 70 mph, 71 - 80 mph, 80+ mph, Avg mph, Total Volume, site_id.
All of the columns except site_id are ingested via the api. The api does not respond with row-level site IDs; the values for site_id are extracted from the response url.
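As a rough illustration of this step, the sketch below pulls a site ID out of a request url with stringr. The url shown is an example based on the public WebTRIS documentation, not necessarily the exact format the pipeline receives:

```r
library(stringr)

# Example request url; the "sites=" query parameter carries the site ID.
url <- "https://webtris.highwaysengland.co.uk/api/v1.0/reports/daily?sites=5688&page=1&page_size=10000"

site_id <- as.integer(str_extract(url, "(?<=sites=)\\d+"))
site_id
#> [1] 5688
```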
sites
This DataFrame holds the details and statuses of all the sites at the date of query. Some additional columns are added to the api response during the pipeline. The sources of the columns are specified below.
| Column | Data Type | Description | Source |
| --- | --- | --- | --- |
| row_count | Integer | api response count (different values per site), dropped from output | api |
| sites.Id | Integer | Numerical site code | api |
| sites.Name | Character | Contextual site info, dropped from output | api |
| sites.Description | Factor | Motorway location | api |
| sites.Longitude | Floating Point | Coordinate data | api |
| sites.Latitude | Floating Point | Coordinate data | api |
| sites.Status | Character | Active / inactive | api |
| type | Character | Site type: MIDAS, TAME or TMU | pipeline |
| direction | Character | Compass direction of traffic | pipeline |
| easting | Integer | Coordinate data | pipeline |
| northing | Integer | Coordinate data | pipeline |
Column Names
The column names and order have been adjusted to match those in the Output Summary.
Join Integrity
The sites DataFrame is left joined to the combo DataFrame. For more details, please see the dplyr join documentation; the dplyr join functions are analogous to SQL joins.
An anti join is used if null matches in the site IDs are detected. This would mean that site IDs within the combo DataFrame had no matching site ID within the sites DataFrame. If any null matches are detected, the rows from the sites DataFrame are cached for reporting in the site_report.
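A minimal sketch of this join logic with dplyr is shown below. It uses toy stand-ins for the two DataFrames, assumes both share a site_id key column at this point in the pipeline, and the cache path is purely illustrative; it is not the pipeline's own code:

```r
library(dplyr)

# Toy stand-ins for the pipeline's DataFrames; both share a site_id key.
combo <- tibble(site_id = c(1, 2, 3), total_vol = c(120, 98, 143))
sites <- tibble(site_id = c(1, 2),    status    = c("Active", "Inactive"))

# sites is left joined to combo, so every combo row is kept.
joined <- left_join(combo, sites, by = "site_id")

# combo rows whose site ID has no match in sites ("null matches").
unmatched <- anti_join(combo, sites, by = "site_id")

if (nrow(unmatched) > 0) {
  # Cache the problem rows for reporting in the site report
  # (directory and file name here are illustrative).
  dir.create("cache", showWarnings = FALSE)
  saveRDS(unmatched, "cache/unmatched_site_ids.rds")
}
```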
Download the Pipeline
To get a copy of the pipeline, you will need to download the repository from DSC road data.
Click on the Code button to see your download options.
To access the pipeline, you will currently need to clone this repository using Git. This will allow access to the branch that stores the required files.
It is recommended to specify HTTPS for the cloning option; click on the clipboard icon to copy the required url. For assistance in configuring access to GitHub repositories from a command line interface, please consult the GitHub PAT guidance.

Using Bash, CMD prompt, terminal or whatever command line interface you prefer to use for Git interfacing, navigate to the directory you wish to run the pipeline from and run the line:
git clone https://github.com/datasciencecampus/road-data-dump.git
If the clone is executed correctly, navigate to the newly cloned repository by running:
cd road-data-dump
Once you have arrived at the road-data-dump folder, you will need to check out the r-pipeline branch. In order to do that, run the line:
git checkout r-pipeline
If you successfully checked out the r-pipeline branch, you should now have a directory that looks like this:

If your directory looks like this, you are ready to begin first time configuration. Please proceed to First Time Run.
If you require additional support for running Git and the Git commands, then please refer to this Towards Data Science Complete Beginner’s Guide.
Using the Pipeline
Some guidance before starting:
It’s advisable to shut down all non-essential processes prior to running the pipeline on a full month. Parallel processing is now being used and this puts extra demand on system resources.
If you have any of the output data files open, I advise closing them prior to running the pipeline. If R tries to overwrite a file with the same name, it will error and halt execution. If this happens, close the file, use R Studio to open munge/15-write.R and re-run this script (either click on the source button at the top of the script in R Studio, or press Ctrl + Shift + Enter).
Your system may ask for permission to open additional R Studio sessions during pipeline execution. This will only happen at script number 8 (it appears in the console as 08-GET_daily_reports.R), so it is advisable to wait until this script has been initiated before leaving your workstation and letting the pipeline proceed unattended.
It is also best to start from a blank slate every time you run the pipeline. To do this, select Session > Clear Workspace from the toolbar at the top of R Studio.
This pipeline is intended to produce monthly data for analysis. Specifying time periods longer than a month may result in memory limits being exceeded, so execution of the scripts is halted if the 31-day limit is exceeded (a sketch of this kind of guard is shown below).
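The following is a minimal sketch of such a date-range guard; the variable names start_date and end_date are illustrative rather than the pipeline's own:

```r
# Hypothetical user-selected dates.
start_date <- as.Date("2021-01-01")
end_date   <- as.Date("2021-01-31")

if (end_date < start_date) {
  stop("The start date must precede the end date.")
}
if (as.numeric(end_date - start_date) > 31) {
  stop("Date ranges longer than 31 days are not supported.")
}
```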
Always ensure you are working from the R Project file. This helps to ensure a consistent environment every time you run the pipeline. The project file is called road-data-dump.Rproj.
First Time Run
On your first run, you need to configure your R Studio set up and then test the pipeline.
Configuration
renv
The renv package helps to ensure the pipeline is using the same version of all packages every time it runs. This takes a little set up.
- Ensure renv is installed by running: install.packages("renv").
Successful installation should look something like this:
The downloaded binary packages are in
/var/folders/d3/cjvn_l1n13z5z3t6nz554p8r0000gq/T//RtmpnEKjds/downloaded_packages
- Once renv has been successfully installed, you will need to build a local package library for this pipeline. The packages will all have the required versions and this step will not affect your other R projects. To do this, execute the line: renv::restore().
You will be asked:
Do you want to proceed? [y/N]: y
Enter y and press enter.
This will go to CRAN for all the required package version dependencies. This may take some time. Keep a close eye on R Studio’s console for any error warnings and note the packages that fail. Retry installing any package that failed by running install.packages("insert_package_name"). You can also try: install.packages("insert_package_name", type = "win.binary", dependencies = TRUE).
- Successful loading of required package versions will look something like this:
The following package(s) have been updated:
glue [installed version 1.4.2 != loaded version 1.4.1 ]
data.table [installed version 1.13.2 != loaded version 1.12.8]
rlang [installed version 0.4.9 != loaded version 0.4.8 ]
generics [installed version 0.1.0 != loaded version 0.0.2 ]
magrittr [installed version 2.0.1 != loaded version 1.5 ]
vctrs [installed version 0.3.5 != loaded version 0.3.1 ]
pillar [installed version 1.4.7 != loaded version 1.4.4 ]
tibble [installed version 3.0.4 != loaded version 3.0.1 ]
dplyr [installed version 1.0.2 != loaded version 1.0.0 ]
renv [installed version 0.12.3 != loaded version 0.12.2]
stringi [installed version 1.5.3 != loaded version 1.4.6 ]
Consider restarting the R session and loading the newly-installed packages.
Notice that the installed versions and loaded versions are different. This is what we want for the pipeline: we will be using the specified package versions instead of the most up-to-date versions installed on our machine.
Also note that I have been asked to restart R. If prompted, do this prior to running the pipeline by selecting Session > Restart R from the menu at the top of R Studio.
- Before moving on to the next section, check that all the required packages have been installed correctly by running renv::restore() again. If everything is good, the console should print:
* The library is already synchronized with the lockfile.
- If this is not the case, hopefully the restore function will have gone ahead and brought the packages up to the required versions. If not, you will require support with renv; please contact me by clicking the mail icon.
ProjectTemplate
The ProjectTemplate package (note the camel case; install.packages() is case sensitive) is a robust framework for compartmentalising code in pipelines. It allows us to add logging, run sequential scripts and more. In order to run this pipeline, you will need to have ProjectTemplate installed.
Run the line install.packages("ProjectTemplate").
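For context (you do not need to run this manually; the app handles it), a ProjectTemplate project is normally driven from the project root as sketched below. This is the package's standard usage, not necessarily the pipeline's exact code:

```r
library(ProjectTemplate)

# Reads the project configuration, loads the declared packages and data,
# and runs the munge/ scripts in sequence.
load.project()
```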
If at any point you see an error message that looks like this:
Error in .check.version(config) :
Your configuration is compatible with version 0.9.3 of the ProjectTemplate package.
Please upgrade ProjectTemplate to version 0.9.3 or later
You will need to re-run the line install.packages("ProjectTemplate"). Version 0.9.3 is what the pipeline expects; you can check the version you are running by looking at the Packages pane in the R Studio interface.

You will notice that the image shows a column called Version, which shows the version of the package available in your global R library. This is not the version we are using in the pipeline.
- There should be another column called lockfile, shown in the image above. This is the record of all the local package versions we will be using for the pipeline. If the lockfile column is not appearing within the Packages pane, then please run the line: renv::snapshot(). This should cause it to appear. If the problem persists, then please contact me by clicking the mail button.
Important
The lockfile is a text file stored within the project directory.
Never manually adjust the contents of the lockfile. This may cause the pipeline to break.
Test Pipeline
Open the road-data-dump.Rproj project file.
Using R Studio’s Files pane, navigate to the app folder.

- Open either the server.R or ui.R script. The script should now open within R Studio. Ensure you have clicked on the tab for the script you have just opened. You should now see a Run App button at the top of the script.
- Click on the drop-down arrow next to the Run App button. Ensure the configuration is set up as in the diagram below. This will ensure the app launches in your default internet browser. The app will look best in Chrome. Please ensure Run External is selected.

Click on Run App and the app should now launch within your default browser.
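Clicking Run App is equivalent to launching the app from the console. Assuming the app folder is a standard Shiny application (it contains server.R and ui.R), the following should do the same thing:

```r
# Launch the app from the project root and open it in the default browser.
shiny::runApp("app", launch.browser = TRUE)
```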
On your first run, it is advisable to click on the Take a tour button. This will guide you through the different elements of the app and offers some advice on what the app expects you to do.
On the first run of the pipeline, we just want to do a quick test to see if we can get one site ID for one day. To do this, enter a valid email address and ensure Testing is selected. The messages on the right-hand side of the app should appear as below.
- If everything looks good, click on the Go! button. A dialogue box should now appear as below. Click OK to hide it.

- If the test was successful, you will hear a chime (please ensure your system volume is turned up) and the spinner that appears in the top right-hand corner will disappear. You will also notice that the pipeline status will change to Pipeline executed.

The time taken will vary depending on your system and connection. If this step was successful, you are now ready to move on to Subsequent Runs. If the test was not successful, then please consult the Troubleshooting section.
Subsequent Runs
If you have arrived here, you have successfully configured the pipeline environment and have tested the pipeline successfully from the app interface.
Please note, once you have tested the pipeline to ensure its functionality and are ready to query larger volumes of data, I advise closing R Studio between subsequent queries, particularly when querying whole months. This helps to ensure the environment is configured as expected and can help to avoid memory issues.
Run the app again, as you did when testing the pipeline.
Ensure that you have entered a valid email and this time select Not Testing.
This will activate the date selection widgets. You should now be able to click on the start and end dates to specify your own values, as below. When selecting the dates, ensure that the start date precedes the end date and that no more than 31 days difference is selected. If either of these rules is broken, the pipeline will throw an intentional error. This is to limit sending bad requests to the api and to limit the need for additional memory allocation.

- Once you have completed the required fields, observe the messages on the right of the app. They should look similar to the screenshot below.
Clicking on Go! will now execute a full run of the pipeline. The api will be queried for all available site IDs for the date range that you have selected. You will see a spinner appear at the top right-hand side of the app while it is busy.
Querying a full month takes just under an hour with a good internet connection. On completion, the spinner will disappear, the pipeline status will change to Pipeline executed, and a chime will sound if your system volume is turned up.
The output .csvs will appear within the output_data folder. The site report will appear within the reports folder. If an error is encountered, please consult the Troubleshooting section.
Troubleshooting
This section will be used to document any errors encountered as the pipeline is used. In this way, I hope to produce an extensive guide to troubleshooting the pipeline. To submit an issue for inclusion, either submit an issue on the GitHub repository or email me by clicking the mail icon. Likewise, if this section does not resolve your particular issue, please email me, ensuring you attach your logfile.txt found within the logs folder.
Issue 001
Error in fwrite(tmu, "output_data/tmu.csv", row.names = F, quote = F) :
Permission denied: 'output_data/tmu.csv'. Failed to open existing file for writing. Do you have write permission to it? Is this Windows and does another process such as Excel have it open?
Reason: This means R can’t write to csv as the file is open.
Action: Please close any Excel files and re-run the pipeline.
Issue 002
Your computer crashes, freezes or the application times out (goes grey).
Reason: Likely to be memory issues.
Action: 1. Close and re-open the project file, then re-run the pipeline. 2. If the above did not resolve the issue, inspect the free disk space available on your machine. Please see the Dependencies guidance.