The human footprint in internet traffic data

In 2016, Professor Sir Charles Bean published an Independent review of UK economic statistics, commissioned by the Chancellor of the Exchequer and the Minister for the Cabinet Office. The report points out some important shortcomings of traditional measures such as GDP in capturing the rapidly evolving digital economy. It quotes Will Page, director of economics at Spotify, as saying “GDP faces a ‘square peg, round hole’ dilemma in that it was originally designed to measure tangible manufactured goods which are losing relevance in the modern economy”.

As part of our ongoing research, we are exploring the wider theme of deriving social and economic indicators from alternative and novel data sources for use in new and experimental statistical outputs. Our first project to emerge from this exciting research area aims to explore whether it is possible to extract social and economic insights from real-time internet traffic volume data.

The internet is omnipresent and now extends into nearly every facet of society. It has become a necessity and the facilitating medium for much of the economy. A (now dated) 2010 study estimated the Internet’s contribution to UK GDP to be as much as 5.6%. This contribution can be broken down into a number of components including, but not limited to, the internet value chain, which covers online services, enabling technologies (including cloud computing), connectivity and devices. Now, in 2019, this contribution is likely to be significantly higher and more complex, with contributions from new industries and emerging technologies. As the internet grows and new industries are spawned, it is important to consider new ways to measure its effect on the economy and society.

Individuals and organisations interact with the internet in some form throughout the working day. The sum of these interactions can be loosely described as social and economic related activity. This activity may include streaming services, inter-business communication, inter-process communication, financial transactions, through to web browsing and traditional email use, to name but a few. In addition, these services, processes and interactions are heterogeneous, in the sense that they are delivered using different protocols and layers as per the OSI model. While these services vary greatly in the Application layer, they all share one thing in common: the transmission and flow of data in some form.

In light of this shared characteristic, our proposal is to attempt to measure and study data-driven economic activity at the Data link layer. By analysing the amount of data flowing through a network at a given point in time, we hope to observe a footprint of economic activity. This proposal is in line with an important observation made in the Bean review, in which section 3.23 states:

The production and use of data is a fundamental element of economic activity, in parallel to the production and consumption of goods and services. This idea leads naturally to focusing directly on measuring data generation, flows, use and storage as routes into understanding digitally-based economic activity.

Data sources

Due to the distributed nature of the internet and its network topology, it is difficult to understand the geographic characteristics of network usage. A data packet destined for London from elsewhere in the UK may be routed through any number of physical locations both inside and outside the country. Intercontinental data originating from the US and destined for mainland Europe may in turn be routed via the UK. As such, it is difficult to attribute varying amounts of internet traffic to a specific location or even continent.

While the internet is a distributed network, there exists a physical topology of internet backbone service providers. One specific type of provider, an Internet eXchange Point (IXP) aims to “keep local traffic local” by operating network infrastructure designed to carry internet traffic between local businesses and traffic originating from or destined to a local area from connected peer networks.

There are 483 active IXPs globally (see Wikipedia list and PeeringDB). The largest number of IXPs is in Europe (197), followed by Asia (101), North America (91), Latin America (35), Africa (35), Oceania (12) and the Middle East (12). The largest IXP in Europe, Deutscher Commercial Internet Exchange (DE-CIX), located in Frankfurt, Germany, operates at a peak throughput exceeding 6 Tbps (terabits per second) and connects with more than 700 Internet Service Providers (ISPs).

In the UK, there are 10 active IXPs (according to Wikipedia) operated by different providers, serving different cities:

Name | City
IX Brighton | Brighton
LINX Cardiff | Cardiff
IX Leeds | Leeds / Western Yorkshire
IX Liverpool | Liverpool
LINX LON1 | London
LONAP | London
LINX LON2 | London
Net-IX | London
LINX Manchester | Manchester
LINX Scotland | Scotland

Of these, the largest is operated by the London Internet eXchange (LINX), which runs exchanges in London, Manchester, Edinburgh and Cardiff.

Focusing on London, the LINX network consists of two IXPs, namely LON1 and LON2, which are similar in terms of physical network topology but differ in terms of network switch hardware. The two IXPs are distributed over multiple London data centre locations and are interconnected by dark fibre for resilience. They also differ somewhat in terms of connected peers (694 for LON1, 315 for LON2), which include Internet Service Providers (ISPs), Content Delivery Networks (CDNs) and other entities. The topology of this network can be explored using PeeringDB, which also offers a JSON API. Using this service, the top 20 peers connected to the LON1 (left) and LON2 (right) exchanges, ranked by global IXP connectivity, are:

Organisation (LON1) | Traffic levels | Organisation (LON2) | Traffic levels
Hurricane Electric | 10 Tbps+ | Facebook Inc | 1 Tbps+
Facebook Inc | 1 Tbps+ | Google LLC | -
Cloudflare | - | Microsoft | -
Akamai Technologies | 10 Tbps+ | Amazon.com | -
Packet Clearing House | 1-5 Gbps | Limelight Networks Global | 1 Tbps+
Google LLC | - | Verizon Digital Media Services (EdgeCast Networks) | 10 Tbps+
Microsoft | - | Valve Corporation | 1 Tbps+
Packet Clearing House AS42 | 1-5 Gbps | SoftLayer Technologies, Inc. (an IBM Company) | 1 Tbps+
Yahoo! | - | Riot Games | 100-200 Gbps
Amazon.com | - | OVH | 1 Tbps+
Netflix | 10 Tbps+ | G-Core Labs S.A. | 1 Tbps+
Fastly, Inc. | 1 Tbps+ | Forcepoint Cloud | -
Apple Inc. | - | IPTP Networks | 1 Tbps+
Twitch | - | SG.GS | 10-20 Gbps
Netnod | 100-1000 Mbps | IX Reach | 300-500 Gbps
Limelight Networks Global | 1 Tbps+ | Dropbox | -
ISC | - | Blizzard Entertainment | -
Verizon Digital Media Services (EdgeCast Networks) | 10 Tbps+ | Colt | 500-1000 Gbps
Twitter, Inc. | - | OpenDNS, Inc. | -
Valve Corporation | 1 Tbps+ | Anexia | -

Other notable organisations include the British Broadcasting Corporation (BBC), British Telecom (BT), Her Majesty’s Revenue and Customs (HMRC), Janet (UK research network), Sony (PlayStation), TalkTalk, Verizon, Vodafone, Virgin Media, Visa (international), Booking.com, eBay, and Sky Broadband.
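As a rough illustration of ranking peers "by global IXP connectivity", the sketch below counts how many distinct exchanges each network is connected to and then lists the networks present at one exchange in descending order of that count. This reading of the ranking is our assumption, and the record layout (one row per network and exchange pair, with the hypothetical keys `net` and `ix`) is invented for the example rather than taken from the PeeringDB schema.

```python
from collections import defaultdict


def rank_by_ixp_connectivity(records, exchange_id, top_n=20):
    """Rank the networks present at one exchange by the number of
    distinct exchanges each of them connects to overall."""
    exchanges_per_network = defaultdict(set)
    for rec in records:
        exchanges_per_network[rec["net"]].add(rec["ix"])

    # Keep only the networks that peer at the exchange of interest.
    present = [net for net, ixs in exchanges_per_network.items()
               if exchange_id in ixs]

    # Most-connected first; ties are broken alphabetically for stable output.
    ranked = sorted(present,
                    key=lambda net: (-len(exchanges_per_network[net]), net))
    return [(net, len(exchanges_per_network[net])) for net in ranked[:top_n]]


# Hypothetical records: one per (network, exchange) connection.
sample = [
    {"net": "Network A", "ix": "LON1"},
    {"net": "Network A", "ix": "AMS-IX"},
    {"net": "Network A", "ix": "DE-CIX"},
    {"net": "Network B", "ix": "LON1"},
    {"net": "Network B", "ix": "LON2"},
    {"net": "Network C", "ix": "LON2"},
]

top = rank_by_ixp_connectivity(sample, "LON1")
print(top)
```

In practice the records would come from a peering database export; only the counting and sorting step is shown here.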

Most IXPs provide publicly available network throughput statistics and historical data showing the amount of bandwidth measured in megabits, gigabits and terabits per unit of time. For example, the Amsterdam Internet eXchange (AMS-IX), Moscow Internet eXchange (MSK-IX) and London Internet eXchange (LINX) all provide network traffic statistics at 5-minute intervals. Given its size and points of presence in some major UK cities, we have so far chosen to explore the network use statistics belonging to LINX.

Data characteristics

To date, we have explored data available on the LINX network statistics portal. This data consists of average IXP network throughput for a given time period, measured in bits per second. The highest resolution data available is in 5-minute buckets and extends back as far as June 2015. At daily frequency, the time-series data for the LON2 (London) IXP extends back to February 2011, as shown in the following graph:

IXP daily throughput

The overall trend is in line with the growth of the internet and can be explained by increased bandwidth, network capacity, size and number of IXP connected peers, increased internet use and the rise of streaming and data-intensive services. According to a recent ONS publication, as of 2018, 90% of adults in the UK were recent internet users, although the proportion varies between age groups. Nearly all (99%) adults aged 16 to 34 were reported as recent internet users.
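To make the relationship between the 5-minute buckets and the daily series concrete, here is a minimal sketch that averages 5-minute throughput samples into one value per day. The input layout (pairs of timestamp and bits per second) and the synthetic values are assumptions made for the example, not the format of the LINX portal.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def daily_mean(samples):
    """samples: iterable of (datetime, bits_per_second) pairs.
    Returns {date: mean throughput in bits per second}."""
    by_day = defaultdict(list)
    for timestamp, bps in samples:
        by_day[timestamp.date()].append(bps)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Synthetic data: two days of 5-minute samples (288 per day),
# 100 Gbps on the first day and 200 Gbps on the second.
start = datetime(2017, 1, 2)
samples = [
    (start + timedelta(minutes=5 * i), 100e9 if i < 288 else 200e9)
    for i in range(576)
]

daily = daily_mean(samples)
for day, bps in daily.items():
    print(day, f"{bps / 1e9:.0f} Gbps")
```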

Low frequency characteristics

On a monthly scale, the data indicates increased internet traffic during the winter months and decreased traffic during the summer.

Monthly mean

In the above figure, monthly internet traffic since 2011 for a London Internet eXchange Point has been averaged, which indicates a yearly parabolic pattern of use. Interestingly, this observation is contrary to UK road traffic reported in the DfT road traffic estimates, which show an increase in the number of cars passing a point during summer months, and a decrease during winter months.

If we consider purely leisure-related internet use, the increase in traffic during winter months in this data may be due to reduced outdoor activity, which may in turn be influenced by reduced daylight hours. If we consider “work” related internet use, the summer drop in traffic may be deepened by summer vacations and perhaps by the cumulative effect of longer lunch breaks during spells of good weather. In general, this month-by-month pattern is interesting in its own right, as it may be representative of general human seasonal activity.
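A seasonal profile of this kind can be sketched as follows. One detail here is our own assumption rather than something stated above: each value is first divided by the mean of its calendar year, so that the long-term growth in traffic does not dominate the monthly averages. The data are synthetic.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def seasonal_profile(daily):
    """daily: {date: traffic}. Returns {month: average of traffic
    relative to the mean of the same calendar year}."""
    by_year = defaultdict(list)
    for day, value in daily.items():
        by_year[day.year].append(value)
    year_mean = {year: mean(values) for year, values in by_year.items()}

    by_month = defaultdict(list)
    for day, value in daily.items():
        by_month[day.month].append(value / year_mean[day.year])
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Synthetic data: traffic doubles between years, is 20% higher in
# winter months and 20% lower in summer months.
FACTOR = {1: 1.2, 2: 1.2, 11: 1.2, 12: 1.2, 6: 0.8, 7: 0.8, 8: 0.8}
daily = {
    date(year, month, 1): base * FACTOR.get(month, 1.0)
    for year, base in ((2016, 100.0), (2017, 200.0))
    for month in range(1, 13)
}

profile = seasonal_profile(daily)
for month, ratio in profile.items():
    print(month, round(ratio, 3))
```

Values above 1 mark months with more traffic than the yearly average, and values below 1 mark quieter months.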

LON2 daily throughput, 2016 and 2017

Looking at daily internet traffic, many interesting patterns emerge. The above figure shows daily (London) traffic levels from the IXP data for 2016 and 2017 respectively. Each row corresponds to the day of week and each column corresponds to one of the 52 weeks in a year. Each cell represents the average internet use for each day in each week for the respective year - with bright cells representing increased internet use.

From this visualisation it is clear that weekdays are different from weekends and that internet traffic tends to be lower in the summer months. It is also clear that certain days exhibit significantly less traffic volume.
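The layout of such a grid (one row per day of the week, one column per week of the year) can be sketched as below. Using ISO week numbers is an assumption made for the example; days that fall in a neighbouring ISO year are simply left out, and the traffic values are synthetic.

```python
from datetime import date, timedelta


def weekday_week_grid(daily, iso_year):
    """daily: {date: value}. Returns a 7 x 53 grid (rows Monday to Sunday,
    columns ISO weeks 1 to 53) with None where there is no data."""
    grid = [[None] * 53 for _ in range(7)]
    for day, value in daily.items():
        year, week, weekday = day.isocalendar()
        if year == iso_year:
            grid[weekday - 1][week - 1] = value
    return grid


# Synthetic data for 2017: weekdays at 300, weekends at 200.
daily = {}
day = date(2017, 1, 1)
while day.year == 2017:
    daily[day] = 200.0 if day.weekday() >= 5 else 300.0
    day += timedelta(days=1)

grid = weekday_week_grid(daily, 2017)
print(grid[0][:4])  # first four Mondays
print(grid[5][:4])  # first four Saturdays
```

Each cell could then be mapped to a colour, with brighter colours for higher values.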

LON2 daily throughput 2017

The visualisation above shows two years of internet traffic data for the London IXP. The vertical axis shows the relative intensity of traffic throughout the day, starting at 12am (bottom of the chart), through to 11:59pm (top of chart). Each point on the horizontal axis corresponds to one of the 730 days in a period covering 2016 to 2017. The dark band at the bottom of the visualisation corresponds to the morning hours (2am - 7am), while the thin vertical dark bands correspond to lower internet traffic during the weekends.

LON2 2017 April/May bank holidays

If we zoom in to April - May 2017, shown above, three bank holiday periods are visible.

  1. Friday 14 April (Good Friday) - 17 April (Easter Monday)
  2. Monday 1 May (Early May bank holiday)
  3. Monday 29 May (Spring bank holiday)

While not surprising, the fact that the internet traffic in this data varies between working and non-working days is an indication that the data may be useful for measuring working patterns and anomalous events.
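One simple way to pick out such days automatically is to compare each weekday with the same weekday in the surrounding weeks. The sketch below does this with a median baseline. The threshold (`ratio`), the window (`weeks`) and the traffic values are all illustrative assumptions, not parameters we have tuned.

```python
from datetime import date, timedelta
from statistics import median


def low_traffic_weekdays(daily, ratio=0.9, weeks=4):
    """Flag weekdays whose traffic is below `ratio` times the median of
    the same weekday in the `weeks` weeks either side."""
    flagged = []
    for day in sorted(daily):
        if day.weekday() >= 5:  # skip Saturday and Sunday
            continue
        neighbours = []
        for k in range(-weeks, weeks + 1):
            other = day + timedelta(weeks=k)
            if k != 0 and other in daily:
                neighbours.append(daily[other])
        if neighbours and daily[day] < ratio * median(neighbours):
            flagged.append(day)
    return flagged


# Synthetic April to May 2017: weekdays 300, weekends 220, and the four
# bank holiday dates set close to weekend levels.
HOLIDAYS = {date(2017, 4, 14), date(2017, 4, 17),
            date(2017, 5, 1), date(2017, 5, 29)}
daily = {}
day = date(2017, 4, 1)
while day <= date(2017, 5, 31):
    if day in HOLIDAYS:
        daily[day] = 230.0
    elif day.weekday() >= 5:
        daily[day] = 220.0
    else:
        daily[day] = 300.0
    day += timedelta(days=1)

flagged = low_traffic_weekdays(daily)
print(flagged)
```

A median is used for the baseline so that a second holiday on the same weekday does not pull the reference level down too far.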

High frequency characteristics

During the average week, weekends tend to have lower traffic with mid-week days exhibiting the highest amounts of traffic.

Average week

The graph above shows the average amount of traffic for all 5-minute time points in a week for a London IXP, with the mean daily value shown by the orange dashed line. The inter-day variation could be explained by a number of factors. However, it is interesting to note that this variation is also present in the reported working hours in the Centre for Time Use Research 2014-2015 time use survey, as shown on the right-hand side of the graph below.

Daily mean

In the time-use survey, 3,523 participants logged their work status over a 7-day period in 15-minute intervals. By aggregating this data by day of week, we can see that Wednesday is the busiest day of the week, with Monday and Friday being significantly lower. Sunday is the least busy day (as expected). In the internet traffic shown on the left, Wednesday is also the busiest day, with lower internet use at the weekends. This observation suggests that the internet use data may be influenced by economic activity (in this case, worked hours), and may also be of interest as an alternative or complementary way to measure (internet related) time use.

Perhaps one of the most interesting features of the data is the difference between days of the week in average internet usage for each hour of the day. The graph below shows the average hourly London IXP usage (in Gbps) for Tuesday-Thursday (purple), Monday (pink), Friday (orange), Saturday (dashed) and Sunday (dotted).

Average day

The weekdays follow a similar pattern, with each low point occurring at approximately 4.30am, followed by a rise toward 10am. From 10am, traffic rises at a slower rate until approximately 4.30pm, then drops to a temporary low at approximately 6pm before rising again to a daily peak at 9pm.

An interesting feature is the difference between Fridays and the rest of the working week: Friday internet traffic tends to drop sooner, at a faster rate, and also tends to be lower during the 6pm low point. We speculate that this may be due to a tendency for people to finish work earlier on Friday and to be more likely to engage in non-internet related leisure during the evening.

For all weekdays, there is a temporary drop in internet use which seems to reach a minimum at approximately 6pm each day. Overall, this is one of the most notable characteristics and could be due to a number of reasons which we will discuss shortly.

The average weekend usage is distinct from the weekday pattern. Besides being somewhat lower, Saturday and Sunday differ in that the early evening dip seen on weekdays is missing. We speculate that this phenomenon is due to a change in commuting activity during the weekend period: the 6pm dip during the weekday may be due to large-scale post-work commuting behaviour, where bandwidth consumption may change due to absence of use (e.g., when driving a vehicle), be reduced while using mobile data, or move to an ISP which is not connected to the IXP.

In addition, Saturday and Sunday converge at approximately 1pm, after which Sunday’s usage (dotted line) tends to approach the weekday evening peak, while Saturday evening has the lowest evening use of the week. As with Friday evening, the Saturday evening drop may be due to increased (non-internet) leisure activity, with Sunday reverting to the weekday evening pattern because the next day is a working day.
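The grouping used for the average-day comparison (Tuesday to Thursday together, with Monday, Friday, Saturday and Sunday kept separate) can be sketched as a simple group-and-average step. The input layout and the synthetic values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Map Python's weekday numbers (Monday = 0) to the groups in the graph.
GROUPS = {0: "Monday", 1: "Tuesday-Thursday", 2: "Tuesday-Thursday",
          3: "Tuesday-Thursday", 4: "Friday", 5: "Saturday", 6: "Sunday"}


def hourly_profiles(samples):
    """samples: iterable of (datetime, gbps) pairs.
    Returns {group: {hour: mean gbps}}."""
    buckets = defaultdict(list)
    for timestamp, gbps in samples:
        buckets[(GROUPS[timestamp.weekday()], timestamp.hour)].append(gbps)

    profiles = defaultdict(dict)
    for (group, hour), values in buckets.items():
        profiles[group][hour] = mean(values)
    return dict(profiles)


# Synthetic week of hourly samples starting on Monday 2 January 2017:
# the value equals the hour on weekdays and half of it at weekends.
start = datetime(2017, 1, 2)
samples = []
for i in range(7 * 24):
    timestamp = start + timedelta(hours=i)
    scale = 0.5 if timestamp.weekday() >= 5 else 1.0
    samples.append((timestamp, timestamp.hour * scale))

profiles = hourly_profiles(samples)
print(sorted(profiles))
print(profiles["Tuesday-Thursday"][21], profiles["Saturday"][10])
```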

Peak times

In addition to differences in the average daily traffic volume, there are also differences in the timing and extent of extremes during different time periods. The above graph shows the daily difference for traffic volume (left column) and low/high point time (right column) for the early morning low point (first row), late afternoon high point (second row), and finally, the evening peak (third row).

Starting with the early morning low point shown in the first row, we can see that, on average, Monday morning at 4.45am has the lowest internet use in our London sample data. As we move to mid-week, the early morning traffic volume increases, peaking on Wednesday before gradually declining toward the weekend. Interestingly, the time at which this low point occurs is symmetrically earliest mid-week (4am) and latest at the beginning and end of the week, where traffic is lowest at 4.45am on average.

The next notable feature is the extent and timing of the late afternoon peak, which occurs just before the 6pm temporary drop in traffic each day. On average, the observed volume at this point is highest mid-week, in line with the overall average daily traffic, and, starting from Tuesday, the peak occurs earlier each day toward the weekend. We speculate that this point delimits the end of the average working day and the beginning of the daily commuting period.

Finally, the highest traffic levels can be observed each day at 9pm, with the exception of Sunday, where this peak time occurs one hour earlier on average. While the overall average daily internet traffic is highest on Wednesday, the evening volume is highest on Tuesdays.
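Locating these three points in a daily profile amounts to searching for a minimum or maximum inside a time window. The sketch below shows the idea. The window boundaries and the piecewise-linear synthetic profile (built to resemble the weekday shape described earlier) are assumptions made for the example.

```python
def extreme_in_window(profile, start, end, kind):
    """profile: {minute_of_day: value}. Returns (minute, value) of the
    lowest ("low") or highest ("high") sample with start <= minute < end."""
    window = {m: v for m, v in profile.items() if start <= m < end}
    pick = min if kind == "low" else max
    minute = pick(window, key=window.get)
    return minute, window[minute]


def hhmm(minute):
    """Format minutes since midnight as HH:MM."""
    return f"{minute // 60:02d}:{minute % 60:02d}"


# Synthetic weekday profile: straight lines between these
# (minute, Gbps) knots, sampled every 5 minutes.
KNOTS = [(0, 200.0), (270, 100.0), (600, 250.0), (990, 300.0),
         (1080, 270.0), (1260, 350.0), (1440, 200.0)]


def synthetic_value(minute):
    for (t0, y0), (t1, y1) in zip(KNOTS, KNOTS[1:]):
        if t0 <= minute <= t1:
            return y0 + (y1 - y0) * (minute - t0) / (t1 - t0)
    raise ValueError("minute outside 0..1440")


profile = {m: synthetic_value(m) for m in range(0, 1440, 5)}

# Assumed search windows, in minutes since midnight.
morning_low = extreme_in_window(profile, 2 * 60, 7 * 60, "low")
afternoon_high = extreme_in_window(profile, 15 * 60, 18 * 60, "high")
evening_peak = extreme_in_window(profile, 19 * 60, 24 * 60, "high")

for name, (minute, value) in [("morning low", morning_low),
                              ("afternoon high", afternoon_high),
                              ("evening peak", evening_peak)]:
    print(name, hhmm(minute), value)
```

Running the same search on each day of the week separately would give per-day volumes and times that can then be compared.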

Traffic vs commuting

We showed earlier that the daily average internet traffic appears to exhibit a similar pattern to the 2014-2015 time use survey. Next, we compare average weekday traffic with various reported time-use activities. The above graph compares average daily internet traffic (dashed line) with the number of people reporting to be commuting or eating at each point in the day. Commuting leads internet traffic use in the early hours of the morning (as expected). At 9am, as reported commuting activity declines, the rate of change in internet traffic slows, and by 11am, both internet traffic and commuting activity begin to climb toward the 4.30pm late afternoon peak. Commuting activity followed by eating activity seems to account for the 6pm temporary low point.

Traffic vs working

The time use survey contains an option to indicate use of a device in combination with other activities. The definition of a device includes use of a computer and, as such, this reported activity appears to be highly similar to the average internet use. In the above figure, we overlay the average number of people reporting to be working at each point in the day along with reported device use. Note that the 2014-2015 survey is from a different time period from our data, which extends into 2018, although it is interesting to note the similarity between the two series. The time and rate of change in internet use appear to coincide with working and device use during the morning hours. In addition, the 6pm drop and 9pm peak are in line with reported device use.

Research direction

We are very excited by this data and are currently exploring various research questions.

Relationship with economic activity

We are currently exploring the relationship between this data and existing economic indicators. We have so far studied the relationship between the IXP internet traffic use and monthly Gross Value Added (GVA), and the initial results are promising: they suggest that the data may be correlated with specific sectors of the economy during specific time periods. This is a challenging direction of research, since we might expect little relationship with components of GDP given that GDP does not measure certain aspects of the data and internet-driven economy.
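As a minimal illustration of what such a comparison could look like (it is not our actual analysis), the sketch below computes the Pearson correlation between month-on-month changes in two series. Working with changes rather than levels is an assumption made here so that a shared upward trend does not by itself produce a high correlation. All numbers are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def pct_change(series):
    """Relative change between consecutive values."""
    return [(b - a) / a for a, b in zip(series, series[1:])]


# Invented monthly values: mean traffic (Tbps) and a GVA index.
traffic = [1.90, 1.95, 2.05, 2.00, 2.10, 2.20]
gva = [100.0, 100.4, 101.1, 100.9, 101.5, 102.2]

r = pearson(pct_change(traffic), pct_change(gva))
print(round(r, 3))
```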

Beyond studying the correlation between this data and existing indicators, we are also investigating the potential to use this (or similar) data as a means to measure the impact of anomalous events such as adverse weather conditions. For example, we may observe a rise or fall in traffic during heavy snow as people switch to working from home.

We are also exploring the potential for various derivative indicators. For example, we noted earlier the difference in Friday evening internet activity compared with the rest of the week. It may be possible to use the extent of this difference as a proxy for evening (non-internet related) social activity, which may in turn be used in the context of understanding the night-time economy.

Relationship with road traffic data

In addition to exploring the relationship with economic activity, we are also exploring the idea that the IXP internet traffic data could be used to indirectly measure road traffic and, more generally, commuting behaviour.

This idea is based on the observation that the 4-6pm drop in internet traffic coincides with the commuting period. If we can measure the depth, breadth and timing of this temporary drop in internet traffic, we expect to find a relationship with average road traffic speeds and public transport use.
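One possible way to quantify the depth, breadth and timing of the drop is sketched below. The definitions are our own working assumptions for the example: depth is the fall from the late afternoon peak to the lowest point before the evening peak, timing is when that lowest point occurs, and breadth is the time from the late afternoon peak until traffic first recovers to that peak's level. The profile is synthetic.

```python
def dip_metrics(profile, peak_window=(900, 1080), evening_window=(1140, 1440)):
    """profile: {minute_of_day: value}. Windows are in minutes since
    midnight and are assumptions, not fitted values."""
    minutes = sorted(profile)

    def in_window(window):
        return [m for m in minutes if window[0] <= m < window[1]]

    t_peak = max(in_window(peak_window), key=profile.get)
    t_evening = max(in_window(evening_window), key=profile.get)

    # Lowest sample strictly between the two peaks.
    between = [m for m in minutes if t_peak < m < t_evening]
    t_dip = min(between, key=profile.get)

    # First sample after the dip that is back at the afternoon peak level.
    recovery = next(
        (m for m in between if m > t_dip and profile[m] >= profile[t_peak]),
        t_evening,
    )
    return {
        "timing": t_dip,
        "depth": profile[t_peak] - profile[t_dip],
        "breadth": recovery - t_peak,
    }


# Synthetic weekday profile: straight lines between (minute, Gbps) knots.
KNOTS = [(0, 200.0), (270, 100.0), (600, 250.0), (990, 300.0),
         (1080, 270.0), (1260, 350.0), (1440, 200.0)]


def synthetic_value(minute):
    for (t0, y0), (t1, y1) in zip(KNOTS, KNOTS[1:]):
        if t0 <= minute <= t1:
            return y0 + (y1 - y0) * (minute - t0) / (t1 - t0)
    raise ValueError("minute outside 0..1440")


profile = {m: synthetic_value(m) for m in range(0, 1440, 5)}
metrics = dip_metrics(profile)
print(metrics)
```

Computed per day, these three numbers form a small time series that could be compared with road traffic speeds or public transport use.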

Internet traffic vs Road traffic

The above figure compares UK-wide average weekday road traffic, reported as an index of average car miles in the Department for Transport’s Road traffic estimates, with (London) IXP internet traffic data. The reported road traffic data are in line with the results from the time-use survey and furthermore exhibit similar characteristics to the internet traffic data. On Fridays in particular, there is an overall increase in road traffic coupled with a decrease in internet use. We intend to follow up with a technical article detailing our findings in this area in the very near future.

Conclusion

In this post we have described an interesting data source which we are using as a step towards scalable methods for measuring data-driven social and economic activity. The purpose of this post has been to discuss some of our observations and initial ideas and to describe the direction of our current research in this exciting area. We are actively exploring this and related data sources and plan to produce a series of updates on this blog as our research progresses.

Thanks for reading.


By Phil Stubbings

Data scientist at ONS Data Science Campus. Contact: philip.stubbings AT ons.gov.uk