3. GTFS

Tutorial
Learn how to inspect, validate, and clean General Transit Feed Specification inputs using the transport_performance.gtfs module.
Modified

2024-06-06

Introduction

Outcomes

In this tutorial we will learn how to validate and clean General Transit Feed Specification (GTFS) feeds. This is an important step to ensure quality in the inputs and reduce the computational cost of routing operations.

While working towards this outcome, we will:

  • Download open source GTFS data.
  • Carry out some basic checks across the entire GTFS feed.
  • Visualise the GTFS feed’s stop locations on an interactive map.
  • Filter the GTFS feed to a specific bounding box.
  • Filter the GTFS feed to a specific date range.
  • Check if our filter operations have resulted in an empty feed.
  • Reverse-engineer a calendar.txt if it is missing.
  • Create summary tables of routes and trips in the feed.
  • Attempt to clean the feed.
  • Write the filtered feed out to file.

Requirements

To complete this tutorial, you will need:

  • Python 3.9
  • A stable internet connection
  • The transport_performance package installed (see the getting started explanation for help)
  • The following requirements:
requirements.txt
geopandas
plotly
shapely
. # ensure transport_performance is installed

Working With GTFS

Let’s import the necessary dependencies:

import datetime
import os
import pathlib
import subprocess
import tempfile

import geopandas as gpd
import plotly.express as px
from shapely.geometry import Polygon

from transport_performance.gtfs.multi_validation import MultiGtfsInstance

We require a source of public transit schedule data in GTFS format. The French government publishes all of its transit data, along with many useful validation tools, on the website transport.data.gouv.fr.

Searching through this site for various regions and data types, you may be able to find an example of GTFS for an area of interest. Make a note of the transport modality of your GTFS: is it bus, rail or something else?

You may wish to manually download at least one GTFS feed and store it somewhere in your file system. Alternatively, you may download the data programmatically, as in the solution here.

BUS_URL = "<INSERT_SOME_URL_TO_BUS_GTFS>"
RAIL_URL = "<INSERT_SOME_URL_TO_RAIL_GTFS>"

BUS_PTH = "<INSERT_SOME_PATH_FOR_BUS_GTFS>"
RAIL_PTH = "<INSERT_SOME_PATH_FOR_RAIL_GTFS>"

subprocess.run(["curl", BUS_URL, "-o", BUS_PTH])
subprocess.run(["curl", RAIL_URL, "-o", RAIL_PTH])
BUS_URL = "https://tsvc.pilote4.cityway.fr/api/Export/v1/GetExportedDataFile?ExportFormat=Gtfs&OperatorCode=RTM"
RAIL_URL = "https://eu.ftp.opendatasoft.com/sncf/gtfs/export-intercites-gtfs-last.zip"
# using tmp for tutorial but not necessary
tmp_path = tempfile.TemporaryDirectory()
bus_path = os.path.join(tmp_path.name, "rtm_gtfs.zip")
rail_path = os.path.join(tmp_path.name, "intercity_rail_gtfs.zip")
subprocess.run(["curl", BUS_URL, "-o", bus_path])
subprocess.run(["curl", RAIL_URL, "-o", rail_path])
CompletedProcess(args=['curl', 'https://eu.ftp.opendatasoft.com/sncf/gtfs/export-intercites-gtfs-last.zip', '-o', '/tmp/tmpk4wzxccp/intercity_rail_gtfs.zip'], returncode=0)

Now that we have ingested the GTFS feed(s), you may wish to open the files up on your file system and inspect the contents. GTFS feeds come in compressed formats and contain multiple text files. These files can be read together, a bit like a relational database, to produce a feed object that is useful when undertaking routing with public transport modalities.
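To make this structure concrete, here is a minimal, standard-library sketch of how a GTFS archive bundles related text files. The archive contents below (stop and route values, file names inside the zip aside from the standard GTFS tables) are made up for illustration; in practice you would open the zip you downloaded above.

```python
import csv
import io
import zipfile

# Build a tiny, hypothetical GTFS-like archive in memory for illustration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("stops.txt", "stop_id,stop_name,stop_lat,stop_lon\nS1,Gare Example,43.30,5.38\n")
    zf.writestr("routes.txt", "route_id,route_type\nR1,3\n")

# A GTFS feed is just a zip of text files; each file is a table that can
# be joined to the others on shared IDs, much like a relational database.
with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()
    with zf.open("stops.txt") as f:
        stops = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8")))

print("Files in feed:", members)
print("First stop:", stops[0]["stop_name"])
```

Swapping `buf` for the path of a real downloaded feed lets you list and read its tables the same way.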

To do this, we will need to use a class from the transport_performance package called MultiGtfsInstance. Take a look at the MultiGtfsInstance API documentation for full details on how this class works. You may wish to keep this page open for reference in later tasks.

MultiGtfsInstance, as the name suggests, can cope with multiple GTFS feeds at a time. If you have chosen to download several feeds, point the path parameter at a directory that contains all of the feeds. If you have chosen to download a single feed, you may pass the full path to that feed.

Instantiate a feed object by pointing the MultiGtfsInstance class at a path to the GTFS feed(s) that you have downloaded. Once you have successfully instantiated the feed, inspect the correct attribute to confirm the number of separate feed instances contained within it.

gtfs_pth = "<INSERT_PATH_TO_GTFS>"
feed = MultiGtfsInstance(path=gtfs_pth)
print(len(feed.<INSERT_CORRECT_ATTRIBUTE>))
gtfs_pth = pathlib.Path(tmp_path.name) # need to use pathlib for tmp_path
feed = MultiGtfsInstance(path=gtfs_pth)
print(f"There are {len(feed.instances)} feed instances")
There are 2 feed instances

Each individual feed can be accessed separately. Its contents should match the files on disk. The GtfsInstance API documentation can be used to view the methods and attributes available for the following task.

By accessing the appropriate attribute, print out the first 5 stops of the first instance within the feed object.

feed.<INSERT_CORRECT_ATTR>[0].feed.stops.<INSERT_CORRECT_METHOD>(5)

These records will match the contents of the stops.txt file within the feed that you downloaded.


feed.instances[0].feed.stops.head(5)
                       stop_id             stop_name stop_desc   stop_lat  stop_lon zone_id stop_url  location_type        parent_station
0         StopArea:OCE87393009  Versailles Chantiers       NaN  48.795826  2.135883     NaN      NaN              1                   NaN
1  StopPoint:OCEOUIGO-87393009  Versailles Chantiers       NaN  48.795826  2.135883     NaN      NaN              0  StopArea:OCE87393009
2         StopArea:OCE87393579       Massy-Palaiseau       NaN  48.726421  2.257528     NaN      NaN              1                   NaN
3  StopPoint:OCEOUIGO-87393579       Massy-Palaiseau       NaN  48.726421  2.257528     NaN      NaN              0  StopArea:OCE87393579
4         StopArea:OCE87394007              Chartres       NaN  48.448202  1.481313     NaN      NaN              1                   NaN

Checking Validity

Transport routing operations require services that run upon a specified date. It is a useful sanity check to confirm that the dates that you expect to perform routing on exist within the GTFS feed. To do this, we can use the get_dates() method to print out the first and last date in the available date range, as below.

s0, e0 = feed.get_dates()
print(f"Feed starts at: {s0}\nFeed ends at: {e0}")
Feed starts at: 20240624
Feed ends at: 20240923

How can we have this method print out the full list of dates available within the feed?

Examine the MultiGtfsInstance documentation and find the name of the parameter that controls the behaviour of get_dates().

feed.get_dates(return_range=False)
['20240624',
 '20240625',
 '20240626',
 '20240627',
 '20240628',
 '20240629',
 '20240630',
 '20240701',
 '20240702',
 '20240703',
 '20240704',
 '20240705',
 '20240706',
 '20240707',
 '20240708',
 '20240709',
 '20240710',
 '20240711',
 '20240712',
 '20240713',
 '20240714',
 '20240715',
 '20240716',
 '20240717',
 '20240718',
 '20240719',
 '20240720',
 '20240721',
 '20240722',
 '20240723',
 '20240724',
 '20240725',
 '20240726',
 '20240727',
 '20240728',
 '20240729',
 '20240730',
 '20240731',
 '20240801',
 '20240802',
 '20240803',
 '20240804',
 '20240805',
 '20240806',
 '20240807',
 '20240808',
 '20240809',
 '20240810',
 '20240811',
 '20240812',
 '20240813',
 '20240814',
 '20240815',
 '20240816',
 '20240817',
 '20240818',
 '20240819',
 '20240820',
 '20240821',
 '20240822',
 '20240823',
 '20240824',
 '20240825',
 '20240826',
 '20240827',
 '20240828',
 '20240829',
 '20240830',
 '20240831',
 '20240901',
 '20240902',
 '20240903',
 '20240904',
 '20240905',
 '20240906',
 '20240907',
 '20240908',
 '20240909',
 '20240910',
 '20240911',
 '20240912',
 '20240913',
 '20240914',
 '20240915',
 '20240916',
 '20240917',
 '20240918',
 '20240919',
 '20240920',
 '20240921',
 '20240922',
 '20240923']

Openly published GTFS feeds from a variety of different providers have varying degrees of quality and not all feeds strictly adhere to the defined specification for this type of data. When working with new sources of GTFS, it is advisable to investigate the types of errors or warnings associated with your particular feed(s).

Check if the feed you’ve instantiated contains valid GTFS.

Check the MultiGtfsInstance documentation for an appropriate method. Depending upon the source of the GTFS, you may need to ensure that checking of fast travel is switched off. To turn off fast travel checks, pass the following dictionary to the method’s appropriate argument: {"far_stops": False}. See the open issue #183.

feed.is_valid(validation_kwargs={"far_stops": False})
Validating GTFS from path /tmp/tmpk4wzxccp/rtm_gtfs.zip: 100%|██████████| 2/2 [00:04<00:00,  2.11s/it]
      type                                  message       table                                               rows                                      GTFS
0  warning              Unrecognized column feed_id   feed_info                                                 []  /tmp/tmpk4wzxccp/intercity_rail_gtfs.zip
1  warning             Unrecognized column conv_rev   feed_info                                                 []  /tmp/tmpk4wzxccp/intercity_rail_gtfs.zip
2  warning             Unrecognized column plan_rev   feed_info                                                 []  /tmp/tmpk4wzxccp/intercity_rail_gtfs.zip
3  warning  Repeated pair (trip_id, departure_time)  stop_times  [236, 239, 241, 243, 245, 248, 251, 253, 254, ...             /tmp/tmpk4wzxccp/rtm_gtfs.zip
4  warning                   Stop has no stop times       stops                                             [2134]             /tmp/tmpk4wzxccp/rtm_gtfs.zip

Note that it is common to come across multiple warnings when working with GTFS. Many providers include additional columns that are not part of the GTFS specification. This typically poses no problem when using the feed for routing operations.

In certain feeds, you may notice errors flagged due to unrecognised route types. This is because certain providers publish feeds that conform to Google’s proposed GTFS extension. Although flagged as an error, valid codes associated with the proposed extension typically do not cause problems with routing operations.
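One way to see why such flags are usually benign is to separate core codes from extended ones. Below is a minimal sketch; the core route_type set is taken from the GTFS specification, while the specific codes checked (101, 715) are made-up examples of the extension's 100-series rail and 700-series bus ranges.

```python
# Route types defined in the core GTFS specification (tram, metro, rail,
# bus, ferry, cable tram, aerial lift, funicular, trolleybus, monorail).
STANDARD_ROUTE_TYPES = {0, 1, 2, 3, 4, 5, 6, 7, 11, 12}

def uses_extended_code(route_type: int) -> bool:
    """Flag route_type codes that fall outside the core specification."""
    return route_type not in STANDARD_ROUTE_TYPES

# hypothetical values as they might appear in a feed's routes.txt
for code in (3, 101, 715):
    print(code, "extended" if uses_extended_code(code) else "standard")
```

A validator that only knows the core set will report the extended codes as errors, even though routing engines generally handle them.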

Viz Stops

A sensible check when working with GTFS for an area of interest is to visualise the stop locations of your feed.

By accessing an appropriate method for your feed, plot the stop locations on an interactive folium map.

Inspect the MultiGtfsInstance docstring for the appropriate method.

feed.viz_...()
feed.viz_stops()
[Interactive folium map showing the feed's stop locations]

By inspecting the location of the stops, you can visually assess that they concur with the road network depicted on the folium basemap.

Filtering GTFS

Cropping GTFS feeds can help optimise routing procedures. GTFS feeds can often be much larger than needed for smaller, more constrained routing operations. Holding an entire GTFS in memory may be unnecessary and burdensome. In this section, we will crop our feeds in two ways:

  • Spatially by restricting the feed to a specified bounding box.
  • Temporally by providing a date (or list of dates).

Before undertaking the filter operations, examine the size of our feed on disk:

out = subprocess.run(
    ["du", "-sh", tmp_path.name], capture_output=True, text=True)
size_out = out.stdout.strip().split("\t")[0]
print(f"Unfiltered feed is: {size_out}")
Unfiltered feed is: 4.8M

By Bounding Box

To help understand the requirements for spatially cropping a feed, inspect the API documentation for the filter_to_bbox() method.

To perform this crop, we need to get a bounding box. This could be:

  • The boundary of the urban centre calculated using the transport_performance.urban_centres module. Or
  • Any boundary from an open service such as klokantech in csv format.

The bounding box should be in EPSG:4326 projection (longitude & latitude).

Below I define a bounding box and visualise it for context. Feel free to update the code with your own bounding box values.

BBOX = [4.932916,43.121441,5.644253,43.546931] # crop around Marseille
xmin, ymin, xmax, ymax = BBOX
poly = Polygon(((xmin,ymin), (xmin,ymax), (xmax,ymax), (xmax,ymin)))
poly_gdf = gpd.GeoDataFrame({"geometry": poly}, crs=4326, index=[0])
poly_gdf.explore()
[Interactive folium map showing the bounding box polygon]

Crop your feed to the extent of your bounding box.

Pass the BBOX list in [xmin, ymin, xmax, ymax] order to the filter_to_bbox() method.

feed.filter_to_bbox(BBOX)
Filtering GTFS from path /tmp/tmpk4wzxccp/rtm_gtfs.zip: 100%|██████████| 2/2 [00:00<00:00,  8.17it/s]

Notice that a progress bar confirms the number of filter operations performed, one per GTFS zip file passed to MultiGtfsInstance.

Below I plot the filtered feed stop locations in relation to the bounding box used to restrict the feed’s extent.

imap = feed.viz_stops()
poly_gdf.explore(m=imap)
[Interactive folium map showing the filtered stop locations and the bounding box]

Although there should be fewer stops observed, you will likely observe that stops outside of the bounding box you provided remain in the filtered feed. This is to be expected, particularly where GTFS feeds contain long-haul schedules that intersect with the bounding box that you provided.
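If you want to quantify this, a simple point-in-box check over the filtered stop coordinates will list the stops lying outside the box. A minimal sketch, using the bounding box from above and two made-up stop coordinates:

```python
# The bounding box defined earlier, in (xmin, ymin, xmax, ymax) order
BBOX = (4.932916, 43.121441, 5.644253, 43.546931)

def in_bbox(lon: float, lat: float, bbox=BBOX) -> bool:
    """Check whether a longitude/latitude pair falls inside the box."""
    xmin, ymin, xmax, ymax = bbox
    return xmin <= lon <= xmax and ymin <= lat <= ymax

# hypothetical stops: one inside the Marseille box, one far away (Paris);
# a long-haul trip intersecting the box can keep distant stops in the feed
stops = {"marseille_stop": (5.38, 43.30), "paris_stop": (2.35, 48.85)}
retained_outside = [name for name, (lon, lat) in stops.items()
                    if not in_bbox(lon, lat)]
print(retained_outside)
```

Running the same check over your filtered feed's stops table would identify exactly which retained stops fall beyond the box.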

By Date

If the routing analysis you wish to perform takes place over a specific time window, we can further reduce the GTFS data volume by restricting to dates. To do this, we need to specify either a single datestring, or a list of datestrings. The format of the date should be “YYYYMMDD”.

today = datetime.datetime.today().strftime(format="%Y%m%d")
print(f"The date this document was updated at would be passed as: {today}")
The date this document was updated at would be passed as: 20240625

Filter your GTFS feed to a date or range of dates.

Pass either a single date string in “YYYYMMDD” format, or a list of datestrings in this format, to the filter_to_date method. Print out the new start and end dates of your feed by calling the get_dates() method once more.

feed.filter_to_date(today)
print(f"Filtered GTFS feed to {today}")
Filtering GTFS from path /tmp/tmpk4wzxccp/rtm_gtfs.zip: 100%|██████████| 2/2 [00:00<00:00,  2.26it/s]
Filtered GTFS feed to 20240625
s1, e1 = feed.get_dates()
print(f"After filtering to {today}\nstart date: {s1}\nend date: {e1}")
After filtering to 20240625
start date: 20240624
end date: 20240823

Notice that even if we specify a single date to restrict the feed to, filter_to_date() may still return a range of dates. The filtering method restricts the GTFS to stops, trips or shapes active upon the specified date(s). If your GTFS contains trips or routes that are active across a range of dates including the date you restricted to, the full range of dates for those services will be returned.
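This behaviour can be illustrated with a simplified activity check. The sketch below ignores weekday flags and calendar exceptions, and uses a made-up service spanning the feed's full date range: filtering to a single date keeps the whole service, so the reported date range is unchanged.

```python
from datetime import date

def parse_gtfs_date(d: str) -> date:
    """Convert a 'YYYYMMDD' datestring to a date object."""
    return date(int(d[:4]), int(d[4:6]), int(d[6:8]))

def active_on(start: str, end: str, day: str) -> bool:
    """A service spanning [start, end] is kept by a single-date filter
    when the requested day falls inside that span."""
    return parse_gtfs_date(start) <= parse_gtfs_date(day) <= parse_gtfs_date(end)

print(active_on("20240624", "20240823", "20240625"))  # True: kept
print(active_on("20240624", "20240823", "20240901"))  # False: dropped
```

Because the kept service still runs from 20240624 to 20240823, get_dates() continues to report that full span.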

Check Empty Feeds

After performing the filter operations on GTFS, it is advisable to check whether any of the filtered feeds are now empty. Empty feeds can easily arise when filtering GTFS to mismatched dates or bounding boxes, and will cause errors when undertaking routing analysis.

We check for empty feeds in the following way (an empty list means no empty feeds were found):

feed.validate_empty_feeds()
[]

Create Calendar

Occasionally, providers will publish feeds that use a calendar_dates.txt file rather than a calendar.txt. This is permitted within GTFS and is an alternative approach to encoding the same sort of information about the feed timetable.

However, missing calendar.txt files currently cause exceptions when attempting to route with these feeds in r5py. To avoid this issue, we can use calendar_dates.txt to populate the required calendar.txt file.
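The idea behind such a reconstruction can be sketched as follows. This is a simplified, standard-library illustration with made-up calendar_dates.txt rows, not the package's actual implementation: for each service, collect the dates on which it runs, set the corresponding weekday flags, and take the first and last dates as the service's range.

```python
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

# hypothetical calendar_dates.txt rows; exception_type "1" means the
# service runs on that date
calendar_dates = [
    {"service_id": "000005", "date": "20240625", "exception_type": "1"},
    {"service_id": "000005", "date": "20240702", "exception_type": "1"},
]

def build_calendar_row(service_id: str, rows: list) -> dict:
    """Derive a calendar.txt-style row from calendar_dates entries."""
    dates = sorted(r["date"] for r in rows
                   if r["service_id"] == service_id
                   and r["exception_type"] == "1")
    row = {"service_id": service_id, **{day: 0 for day in DAYS}}
    for d in dates:
        # set the weekday flag for each date the service runs on
        row[DAYS[datetime.strptime(d, "%Y%m%d").weekday()]] = 1
    row["start_date"], row["end_date"] = dates[0], dates[-1]
    return row

print(build_calendar_row("000005", calendar_dates))
```

Both example dates are Tuesdays, so only the tuesday flag is set, mirroring the shape of the calendar table produced by the package below.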

We can check whether any of our feed instances have no calendar file:

for i, inst in enumerate(feed.instances):
    if inst.feed.calendar is None:
        problem_ind = i
        print(f"Feed instance {i} has no calendar.txt")
Feed instance 1 has no calendar.txt

If any of your feeds are missing calendars, ensure that these files are created from the calendar_dates.txt files. Once complete, print out the head of the calendar table to ensure it is populated.

Examine the MultiGtfsInstance docstring to find the appropriate method. Access the calendar DataFrame attribute from the same feed and print the first few rows.

feed.<INSERT_CORRECT_METHOD>()
print(feed.instances[<INDEX_OF_MISSING_CALENDAR>].feed.calendar.head())
feed.ensure_populated_calendars()
/home/runner/work/transport-network-performance/transport-network-performance/src/transport_performance/gtfs/validation.py:297: UserWarning:

No calendar found for /tmp/tmpk4wzxccp/intercity_rail_gtfs.zip. Creating from calendar dates
print("Newly populated calendar table:")
print(feed.instances[problem_ind].feed.calendar.head())
Newly populated calendar table:
  service_id  monday  tuesday  wednesday  thursday  friday  saturday  sunday  \
0     000005       0        1          0         0       0         0       0   
1     000026       0        1          0         0       0         0       0   
2     000042       0        1          0         0       0         0       0   
3     000063       0        1          0         0       0         0       0   
4     000106       0        1          0         0       0         0       0   

  start_date  end_date  
0   20240625  20240625  
1   20240625  20240625  
2   20240625  20240625  
3   20240625  20240625  
4   20240625  20240625  

Trip and Route Summaries

Now that we have ensured all GTFS instances have a calendar table, we can calculate tables of counts and other summary statistics on the routes and trips within the feed.

Print 2 summary tables:

  1. Counts for routes on every date in the feed.
  2. Statistics for trips on every day in the feed.

Examine the documentation for MultiGtfsInstance. Use the appropriate methods to produce the summaries.

  1. Use the default behaviour to produce the first table.
  2. Ensure that the parameter allowing stats for days of the week is toggled to True for the trips summary.
feed.summarise_routes()
           date  route_type  route_count
0    2024-06-24           0            3
1    2024-06-24           1            2
2    2024-06-24           3          116
3    2024-06-24           4            4
7    2024-06-25           3          117
..          ...         ...          ...
155  2024-08-21           4            1
156  2024-08-22           3           24
157  2024-08-22           4            1
158  2024-08-23           3           21
159  2024-08-23           4            1

[160 rows x 3 columns]

feed.summarise_trips(to_days=True)
          day  route_type  trip_count_max  trip_count_mean  trip_count_median  trip_count_min
0      monday           0             986            986.0              986.0             986
1      monday           1             816            816.0              816.0             816
2      monday           3           10507           3192.0             2148.0            1665
3      monday           4             231            164.0              156.0             156
4     tuesday           3           10512           3193.0             2148.0            1665
5     tuesday           0             986            986.0              986.0             986
6     tuesday           4             231            164.0              156.0             156
7     tuesday           1             816            816.0              816.0             816
8     tuesday           2              12             12.0               12.0              12
9   wednesday           3           10500           3191.0             2148.0            1665
10  wednesday           4             231            164.0              156.0             156
11  wednesday           0             986            986.0              986.0             986
12  wednesday           1             816            816.0              816.0             816
13   thursday           4             231            164.0              156.0             156
14   thursday           3           10500           3191.0             2148.0            1665
15   thursday           1             816            816.0              816.0             816
16   thursday           0             986            986.0              986.0             986
17     friday           0             986            986.0              986.0             986
18     friday           4             231            164.0              156.0             156
19     friday           3           10500           3176.0             2148.0            1529
20     friday           1             816            816.0              816.0             816
21   saturday           4             156            156.0              156.0             156
22   saturday           3            3654           2278.0             1906.0            1665
23   saturday           1             816            816.0              816.0             816
24   saturday           0             986            986.0              986.0             986
25     sunday           1             816            816.0              816.0             816
26     sunday           0             986            986.0              986.0             986
27     sunday           4             156            156.0              156.0             156
28     sunday           3            3654           2278.0             1906.0            1665

From these summaries we can also create visualisations, such as a timeseries plot of trip counts by route type and date:

# sort by route_type and date to order plot correctly
df = feed.summarise_trips().sort_values(["route_type", "date"])
fig = px.line(
    df,
    x="date",
    y="trip_count",
    color="route_type",
    title="Trip Counts by Route Type and Date Across All Input GTFS Feeds",
)

# set y axis min to zero, improve y axis formatting, and overall font style
fig.update_yaxes(rangemode="tozero", tickformat=",.0f")
fig.update_layout(
    font_family="Arial",
    title_font_family="Arial",
)
fig.show()

Visualisations like this can be very helpful when reviewing the quality of the input GTFS feeds and determining a suitable routing analysis date.

Clean Feed

We can attempt to remove common issues with GTFS feeds by running the clean_feeds() method. This may resolve problems associated with trips, routes, or violations of the specification.

feed.clean_feeds()
feed.is_valid(validation_kwargs={"far_stops": False})
Cleaning GTFS from path /tmp/tmpk4wzxccp/rtm_gtfs.zip: 100%|██████████| 2/2 [00:00<00:00,  2.48it/s]
Validating GTFS from path /tmp/tmpk4wzxccp/rtm_gtfs.zip: 100%|██████████| 2/2 [00:01<00:00,  1.39it/s]
KeyError. Feed was not cleaned.
      type                                  message       table                                               rows                                      GTFS
0  warning              Unrecognized column feed_id   feed_info                                                 []  /tmp/tmpk4wzxccp/intercity_rail_gtfs.zip
1  warning             Unrecognized column conv_rev   feed_info                                                 []  /tmp/tmpk4wzxccp/intercity_rail_gtfs.zip
2  warning             Unrecognized column plan_rev   feed_info                                                 []  /tmp/tmpk4wzxccp/intercity_rail_gtfs.zip
3  warning  Repeated pair (trip_id, departure_time)  stop_times  [236, 239, 241, 243, 245, 248, 251, 253, 254, ...             /tmp/tmpk4wzxccp/rtm_gtfs.zip

Write Filtered Feed

Once we have finished the filter and cleaning operations, we can now go ahead and write the feed out to disk, for use in future routing operations.

Write your filtered feed out to a new location on disk. Confirm that the size of the filtered feed on disk is smaller than that of the original feed.

  1. Pass a string or a pathlike object to the save_feeds() method of MultiGtfsInstance.
  2. Once the feed is written successfully, check the disk usage of the new filtered feed.
filtered_pth = os.path.join(tmp_path.name, "filtered_feed")
try:
    os.mkdir(filtered_pth)
except FileExistsError:
    pass
feed.save_feeds(filtered_pth, overwrite=True)
Saving at /tmp/tmpk4wzxccp/filtered_feed/rtm_gtfs_new.zip: 100%|██████████| 2/2 [00:01<00:00,  1.85it/s]

Check filtered file size

out = subprocess.run(
    ["du", "-sh", filtered_pth], capture_output=True, text=True)
filtered_size = out.stdout.strip().split("\t")[0]
print(f"After filtering, feed size reduced from {size_out} to {filtered_size}")
After filtering, feed size reduced from 4.8M to 1.6M

Conclusion

Congratulations, you have successfully completed this tutorial. We have examined the features, errors and warnings within a GTFS feed. We have also filtered the feed by bounding box and by date in order to reduce its size on disk.

To continue learning how to work with the transport_performance package, it is suggested that you continue with the OpenStreetMap tutorials.

For any problems encountered with this tutorial or the transport_performance package, please open an issue on our GitHub repository.