```python
import datetime
import os
import pathlib
import subprocess
import tempfile

import geopandas as gpd
import plotly.express as px
from shapely.geometry import Polygon

from assess_gtfs.multi_validation import MultiGtfsInstance
```
GTFS

A tutorial on working with GTFS feeds using the assess-gtfs package.
Introduction
Outcomes
In this tutorial we will learn how to validate and clean General Transit Feed Specification (GTFS) feeds. This is an important step to ensure quality in the inputs and reduce the computational cost of routing operations.
While working towards this outcome, we will:
- Download open source GTFS data.
- Carry out some basic checks across the entire GTFS feed.
- Visualise the GTFS feed’s stop locations on an interactive map.
- Filter the GTFS feed to a specific bounding box.
- Filter the GTFS feed to a specific date range.
- Check if our filter operations have resulted in an empty feed.
- Reverse-engineer a calendar.txt if it is missing.
- Create summary tables of routes and trips in the feed.
- Attempt to clean the feed.
- Write the filtered feed out to file.
Requirements
To complete this tutorial, you will need:
- Python 3.9
- A stable internet connection
- The assess-gtfs package installed (see the getting started explanation for help)
Working With GTFS
We have already imported the necessary dependencies at the top of this tutorial.
We require a source of public transit schedule data in GTFS format. The French government publishes all of its transit data, along with many useful validation tools, on the website transport.data.gouv.fr.
Searching through this site for various regions and data types, you may be able to find an example of GTFS for an area of interest. Make a note of the transport modality of your GTFS: is it bus, rail or something else?
You may wish to manually download at least one GTFS feed and store it somewhere in your file system. Alternatively, you may programmatically download the data, as in the solution here.
```python
BUS_URL = "<INSERT_SOME_URL_TO_BUS_GTFS>"
RAIL_URL = "<INSERT_SOME_URL_TO_RAIL_GTFS>"
BUS_PTH = "<INSERT_SOME_PATH_FOR_BUS_GTFS>"
RAIL_PTH = "<INSERT_SOME_PATH_FOR_RAIL_GTFS>"
subprocess.run(["curl", BUS_URL, "-o", BUS_PTH])
subprocess.run(["curl", RAIL_URL, "-o", RAIL_PTH])
```
```python
BUS_URL = "https://tsvc.pilote4.cityway.fr/api/Export/v1/GetExportedDataFile?ExportFormat=Gtfs&OperatorCode=RTM"
RAIL_URL = "https://eu.ftp.opendatasoft.com/sncf/gtfs/export-intercites-gtfs-last.zip"
# using tmp for tutorial but not necessary
tmp_path = tempfile.TemporaryDirectory()
bus_path = os.path.join(tmp_path.name, "rtm_gtfs.zip")
rail_path = os.path.join(tmp_path.name, "intercity_rail_gtfs.zip")
subprocess.run(["curl", BUS_URL, "-o", bus_path])
subprocess.run(["curl", RAIL_URL, "-o", rail_path])
```
```
CompletedProcess(args=['curl', 'https://eu.ftp.opendatasoft.com/sncf/gtfs/export-intercites-gtfs-last.zip', '-o', '/tmp/tmpxzeeqykk/intercity_rail_gtfs.zip'], returncode=0)
```
Now that we have ingested the GTFS feed(s), you may wish to open the files up on your file system and inspect the contents. GTFS feeds come in compressed formats and contain multiple text files. These files can be read together, a bit like a relational database, to produce a feed object that is useful when undertaking routing with public transport modalities.
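Because GTFS feeds are ordinary zip archives, you can peek inside one with Python's standard library before loading it. A minimal sketch (the helper name and example path are ours, not part of assess-gtfs):

```python
import zipfile


def list_gtfs_tables(gtfs_zip_path):
    """Return the .txt table names contained in a GTFS zip archive."""
    with zipfile.ZipFile(gtfs_zip_path) as zf:
        return sorted(n for n in zf.namelist() if n.endswith(".txt"))


# e.g. list_gtfs_tables("rtm_gtfs.zip") would typically include
# "stops.txt", "trips.txt", "stop_times.txt", and so on.
```

This is purely a convenience for manual inspection; the class introduced next handles all of the parsing for you.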
To read the feeds in this way, we will use a class from the assess-gtfs package called `MultiGtfsInstance`. Take a look at the `MultiGtfsInstance` API documentation for full details on how this class works. You may wish to keep this page open for reference in later tasks.

`MultiGtfsInstance`, as the name suggests, can cope with multiple GTFS feeds at a time. If you have chosen to download several feeds, point the `path` parameter at a directory that contains all of the feeds. If you have downloaded a single feed, you may instead pass the full path to that feed.
Instantiate a `feed` object by pointing the `MultiGtfsInstance` class at a path to the GTFS feed(s) that you have downloaded. Once you have successfully instantiated `feed`, inspect the correct attribute to confirm the number of separate feed instances contained within it.
```python
gtfs_pth = "<INSERT_PATH_TO_GTFS>"
feed = MultiGtfsInstance(path=gtfs_pth)
print(len(feed.<INSERT_CORRECT_ATTRIBUTE>))
```
```python
gtfs_pth = pathlib.Path(tmp_path.name)  # need to use pathlib for tmp_path
feed = MultiGtfsInstance(path=gtfs_pth)
print(f"There are {len(feed.instances)} feed instances")
```
There are 2 feed instances
Each individual feed can be accessed separately. Their contents should match the files on disk. The `GtfsInstance` API documentation can be used to view the methods and attributes available for the following task.
By accessing the appropriate attribute, print out the first 5 stops of the first instance within the `feed` object.

```python
feed.<INSERT_CORRECT_ATTR>[0].feed.stops.<INSERT_CORRECT_METHOD>(5)
```
These records will match the contents of the stops.txt file within the feed that you downloaded.
```python
feed.instances[0].feed.stops.head(5)
```
| | stop_id | stop_name | stop_desc | stop_lat | stop_lon | zone_id | stop_url | location_type | parent_station |
|---|---|---|---|---|---|---|---|---|---|
| 0 | StopArea:OCE87393009 | Versailles Chantiers | NaN | 48.795826 | 2.135883 | NaN | NaN | 1 | NaN |
| 1 | StopPoint:OCEOUIGO-87393009 | Versailles Chantiers | NaN | 48.795826 | 2.135883 | NaN | NaN | 0 | StopArea:OCE87393009 |
| 2 | StopArea:OCE87393579 | Massy-Palaiseau | NaN | 48.726421 | 2.257528 | NaN | NaN | 1 | NaN |
| 3 | StopPoint:OCEOUIGO-87393579 | Massy-Palaiseau | NaN | 48.726421 | 2.257528 | NaN | NaN | 0 | StopArea:OCE87393579 |
| 4 | StopArea:OCE87394007 | Chartres | NaN | 48.448202 | 1.481313 | NaN | NaN | 1 | NaN |
Checking Validity
Transport routing operations require services that run on a specified date. It is a useful sanity check to confirm that the dates you expect to perform routing on exist within the GTFS feed. To do this, we can use the `get_dates()` method to print out the first and last date in the available date range, as below.
```python
s0, e0 = feed.get_dates()
print(f"Feed starts at: {s0}\nFeed ends at: {e0}")
```
Feed starts at: 20240703
Feed ends at: 20241002
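Since `get_dates()` returns dates as "YYYYMMDD" strings, the feed's coverage in days can be computed with the standard library. A small sketch (the helper name is ours; the two date strings match the output above):

```python
from datetime import datetime


def feed_span_days(start, end, fmt="%Y%m%d"):
    """Number of days covered by a feed, inclusive of both endpoints."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.days + 1


print(feed_span_days("20240703", "20241002"))  # 92
```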
How can we have this method print out the full list of dates available within the feed?
Examine the `MultiGtfsInstance` API reference and find the name of the parameter that controls the behaviour of `get_dates()`.

```python
feed.get_dates(return_range=False)
```
['20240703',
'20240704',
'20240705',
'20240706',
'20240707',
'20240708',
'20240709',
'20240710',
'20240711',
'20240712',
'20240713',
'20240714',
'20240715',
'20240716',
'20240717',
'20240718',
'20240719',
'20240720',
'20240721',
'20240722',
'20240723',
'20240724',
'20240725',
'20240726',
'20240727',
'20240728',
'20240729',
'20240730',
'20240731',
'20240801',
'20240802',
'20240803',
'20240804',
'20240805',
'20240806',
'20240807',
'20240808',
'20240809',
'20240810',
'20240811',
'20240812',
'20240813',
'20240814',
'20240815',
'20240816',
'20240817',
'20240818',
'20240819',
'20240820',
'20240821',
'20240822',
'20240823',
'20240824',
'20240825',
'20240826',
'20240827',
'20240828',
'20240829',
'20240830',
'20240831',
'20240901',
'20240902',
'20240903',
'20240904',
'20240905',
'20240906',
'20240907',
'20240908',
'20240909',
'20240910',
'20240911',
'20240912',
'20240913',
'20240914',
'20240915',
'20240916',
'20240917',
'20240918',
'20240919',
'20240920',
'20240921',
'20240922',
'20240923',
'20240924',
'20240925',
'20240926',
'20240927',
'20240928',
'20240929',
'20240930',
'20241001',
'20241002']
Openly published GTFS feeds from a variety of different providers have varying degrees of quality and not all feeds strictly adhere to the defined specification for this type of data. When working with new sources of GTFS, it is advisable to investigate the types of errors or warnings associated with your particular feed(s).
Check if the feed you’ve instantiated contains valid GTFS.
Check the API reference for `validation.GtfsInstance` for an appropriate method.
```python
feed.is_valid()
```
| | type | message | table | rows | GTFS |
|---|---|---|---|---|---|
| 0 | warning | Unrecognized column feed_id | feed_info | [] | /tmp/tmpxzeeqykk/intercity_rail_gtfs.zip |
| 1 | warning | Unrecognized column conv_rev | feed_info | [] | /tmp/tmpxzeeqykk/intercity_rail_gtfs.zip |
| 2 | warning | Unrecognized column plan_rev | feed_info | [] | /tmp/tmpxzeeqykk/intercity_rail_gtfs.zip |
| 3 | warning | Repeated pair (route_short_name, route_long_name) | routes | [16, 17, 18, 19, 20] | /tmp/tmpxzeeqykk/intercity_rail_gtfs.zip |
| 4 | warning | Repeated pair (trip_id, departure_time) | stop_times | [1352, 1361, 1363, 1367, 1372, 1375, 1377, 138... | /tmp/tmpxzeeqykk/rtm_gtfs.zip |
| 5 | warning | Stop has no stop times | stops | [146, 1498, 1499, 2130] | /tmp/tmpxzeeqykk/rtm_gtfs.zip |
Note that it is common to come across multiple warnings when working with GTFS. Many providers include additional columns that are not part of the GTFS specification. This typically poses no problem when using the feed for routing operations.
In certain feeds, you may notice errors flagged due to unrecognised route types. This is because certain providers publish feeds that conform to Google’s proposed GTFS extension. Although flagged as an error, valid codes associated with the proposed extension typically do not cause problems with routing operations.
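Because the validity table shown above is an ordinary pandas DataFrame with a `type` column, hard errors can be separated from warnings with plain filtering. A sketch (the helper name is ours, and we assume the `"error"`/`"warning"` labels seen in the output above):

```python
import pandas as pd


def split_validity(valid_df):
    """Split a validity table into (errors, warnings) DataFrames."""
    errors = valid_df[valid_df["type"] == "error"]
    warnings = valid_df[valid_df["type"] == "warning"]
    return errors, warnings


# errors, warnings = split_validity(feed.is_valid())
```

Reviewing only the `errors` frame first is a reasonable triage step, since warnings are often benign.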
Viz Stops
A sensible check when working with GTFS for an area of interest is to visualise the stop locations of your feed.
By accessing an appropriate method for your feed, plot the stop locations on an interactive folium map.
Inspect the `MultiGtfsInstance` API reference for the appropriate method.

```python
feed.viz_...()
```

```python
feed.viz_stops()
```
By inspecting the location of the stops, you can visually confirm that they align with the road network depicted on the folium basemap.
Filtering GTFS
Cropping GTFS feeds can help optimise routing procedures. GTFS feeds can often be much larger than needed for smaller, more constrained routing operations. Holding an entire GTFS in memory may be unnecessary and burdensome. In this section, we will crop our feeds in two ways:
- Spatially by restricting the feed to a specified bounding box.
- Temporally by providing a date (or list of dates).
Before undertaking the filter operations, examine the size of our feed on disk:
```python
out = subprocess.run(
    ["du", "-sh", tmp_path.name], capture_output=True, text=True)
size_out = out.stdout.strip().split("\t")[0]
print(f"Unfiltered feed is: {size_out}")
```
Unfiltered feed is: 4.0M
By Bounding Box
To help understand the requirements for spatially cropping a feed, inspect the API documentation for the `filter_to_bbox()` method.
To perform this crop, we need to get a bounding box. This could be any boundary from an open service such as klokantech in csv format.
The bounding box should be in EPSG:4326 projection (longitude & latitude).
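A quick sanity check that a bounding box is plausibly in EPSG:4326 and correctly ordered can save confusing empty-filter results later. A hedged sketch (the function name is ours, not part of assess-gtfs):

```python
def check_bbox_4326(bbox):
    """Validate an [xmin, ymin, xmax, ymax] bbox in lon/lat order."""
    xmin, ymin, xmax, ymax = bbox
    if not (-180 <= xmin < xmax <= 180):
        raise ValueError("longitudes out of order or out of range")
    if not (-90 <= ymin < ymax <= 90):
        raise ValueError("latitudes out of order or out of range")
    return True


check_bbox_4326([4.932916, 43.121441, 5.644253, 43.546931])  # True
```

A bbox that passes this check may still be in the wrong place, but one that fails it (e.g. swapped lon/lat, or projected metre coordinates) will definitely misbehave.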
Below I define a bounding box and visualise it for context. Feel free to update the code with your own bounding box values.
```python
BBOX = [4.932916, 43.121441, 5.644253, 43.546931]  # crop around Marseille
xmin, ymin, xmax, ymax = BBOX
poly = Polygon(((xmin, ymin), (xmin, ymax), (xmax, ymax), (xmax, ymin)))
poly_gdf = gpd.GeoDataFrame({"geometry": poly}, crs=4326, index=[0])
poly_gdf.explore()
```
Crop your feed to the extent of your bounding box.
Pass the `BBOX` list in [xmin, ymin, xmax, ymax] order to the `filter_to_bbox()` method.

```python
feed.filter_to_bbox(BBOX)
```
Notice that a progress bar confirms the number of successful filter operations performed, depending on the number of separate GTFS zip files passed to `MultiGtfsInstance`.
Below I plot the filtered feed stop locations in relation to the bounding box used to restrict the feed’s extent.
```python
imap = feed.viz_stops()
poly_gdf.explore(m=imap)
```
Although there should be fewer stops observed, you will likely observe that stops outside of the bounding box you provided remain in the filtered feed. This is to be expected, particularly where GTFS feeds contain long-haul schedules that intersect with the bounding box that you provided.
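If you want to quantify this, you can count how many of the remaining stops actually fall inside the box. A pure-Python sketch with toy coordinates (the function name and coordinates are ours; in practice you would pass the `stop_lon`/`stop_lat` columns from the stops table):

```python
def stops_within(bbox, lons, lats):
    """Count (inside, outside) stops relative to an [xmin, ymin, xmax, ymax] bbox."""
    xmin, ymin, xmax, ymax = bbox
    inside = sum(
        xmin <= x <= xmax and ymin <= y <= ymax for x, y in zip(lons, lats)
    )
    return inside, len(lons) - inside


# One stop near Marseille (inside) and one near Paris (outside):
print(stops_within([4.93, 43.12, 5.64, 43.55], [5.37, 2.13], [43.30, 48.80]))  # (1, 1)
```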
By Date
If the routing analysis you wish to perform takes place over a specific time window, we can further reduce the GTFS data volume by restricting to dates. To do this, we need to specify either a single date string, or a list of date strings. The format of the date should be “YYYYMMDD”.
```python
today = datetime.datetime.today().strftime(format="%Y%m%d")
print(f"The date this document was updated at would be passed as: {today}")
```
The date this document was updated at would be passed as: 20240704
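Because a list of date strings is also accepted, a contiguous window of dates can be built with the standard library. A sketch (the helper name and the four-day window are ours, chosen only for illustration):

```python
import datetime


def date_window(start_yyyymmdd, n_days):
    """Return n_days consecutive dates as 'YYYYMMDD' strings."""
    start = datetime.datetime.strptime(start_yyyymmdd, "%Y%m%d")
    return [
        (start + datetime.timedelta(days=i)).strftime("%Y%m%d")
        for i in range(n_days)
    ]


print(date_window("20240729", 4))  # ['20240729', '20240730', '20240731', '20240801']
```

Note that `timedelta` handles month and year rollovers for you, which naive string arithmetic would not.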
Filter your GTFS feed to a date or range of dates.
Pass either a single date string in “YYYYMMDD” format, or a list of date strings in this format, to the `filter_to_date` method. Print out the new start and end dates of your feed by calling the `get_dates()` method once more.

```python
feed.filter_to_date(today)
print(f"Filtered GTFS feed to {today}")
```
Filtered GTFS feed to 20240704
```python
s1, e1 = feed.get_dates()
print(f"After filtering to {today}\nstart date: {s1}\nend date: {e1}")
```
After filtering to 20240704
start date: 20240703
end date: 20240901
Notice that even if we specify a single date to restrict the feed to, `filter_to_date()` may still return a range of dates. The filtering method restricts the GTFS to stops, trips or shapes active on the specified date(s). If your GTFS contains trips or routes that are active across a range of dates including the date you restricted to, the filtered feed will retain the full range of dates for those services.
Check Empty Feeds
After performing the filter operations on GTFS, it is advisable to check in case any of the filtered feeds are now empty. Empty feeds can cause errors when undertaking routing analysis. Empty feeds can easily arise when filtering GTFS to mismatched dates or bounding boxes.
We check for empty feeds in the following way:
```python
feed.validate_empty_feeds()
```
[]
Create Calendar
Occasionally, providers will publish feeds that use a calendar_dates.txt file rather than a calendar.txt. This is permitted within GTFS and is an alternative approach to encoding the same sort of information about the feed timetable.

However, missing calendar.txt files currently cause exceptions when attempting to route with these feeds in r5py. To avoid this issue, we can use calendar_dates.txt to populate the required calendar.txt file.
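The idea behind this reconstruction can be sketched with pandas: each service_id in calendar_dates.txt (with exception_type 1, meaning "service added") becomes a calendar.txt row flagging the weekdays on which it runs, bounded by its first and last active dates. This is an illustration of the concept only, not the package's actual implementation:

```python
import pandas as pd

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def calendar_from_dates(calendar_dates):
    """Sketch: derive a calendar-style table from a calendar_dates table."""
    # keep only "service added" exceptions
    active = calendar_dates[calendar_dates["exception_type"] == 1].copy()
    dates = pd.to_datetime(active["date"], format="%Y%m%d")
    active["weekday"] = dates.dt.day_name().str.lower()
    rows = []
    for service_id, grp in active.groupby("service_id"):
        row = {"service_id": service_id}
        # flag each weekday on which this service appears at least once
        row.update({d: int(d in set(grp["weekday"])) for d in DAYS})
        grp_dates = pd.to_datetime(grp["date"], format="%Y%m%d")
        row["start_date"] = grp_dates.min().strftime("%Y%m%d")
        row["end_date"] = grp_dates.max().strftime("%Y%m%d")
        rows.append(row)
    return pd.DataFrame(rows)
```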
We can check whether any of our feed instances have no calendar file:
```python
for i, inst in enumerate(feed.instances):
    if inst.feed.calendar is None:
        problem_ind = i
        print(f"Feed instance {i} has no calendar.txt")
```
Feed instance 1 has no calendar.txt
If any of your feeds are missing calendars, ensure that these files are created from the calendar_dates tables. Once complete, print out the head of the calendar table to ensure it is populated.
Examine the `MultiGtfsInstance` API reference to find the appropriate method. Access the calendar DataFrame attribute from the same feed and print the first few rows.

```python
feed.<INSERT_CORRECT_METHOD>()
print(feed.instances[<INDEX_OF_MISSING_CALENDAR>].feed.calendar.head())
```
```python
feed.ensure_populated_calendars()
```
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/assess_gtfs/validation.py:294: UserWarning:
No calendar found for /tmp/tmpxzeeqykk/intercity_rail_gtfs.zip. Creating from calendar dates
```python
print("Newly populated calendar table:")
print(feed.instances[problem_ind].feed.calendar.head())
```
Newly populated calendar table:
service_id monday tuesday wednesday thursday friday saturday sunday \
0 000027 0 0 0 1 0 0 0
1 000243 0 0 0 1 0 0 0
2 000247 0 0 0 1 0 0 0
3 000267 0 0 0 1 0 0 0
4 000309 0 0 0 1 0 0 0
start_date end_date
0 20240704 20240704
1 20240704 20240704
2 20240704 20240704
3 20240704 20240704
4 20240704 20240704
Trip and Route Summaries
Now that we have ensured all GTFS instances have a calendar table, we can calculate tables of counts and other summary statistics on the routes and trips within the feed.
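Conceptually, a route summary is a group-and-count over the services active in the feed; the pattern can be sketched on a toy table (illustrative only — column names are ours and this is not the package's implementation):

```python
import pandas as pd

# toy table of active routes per date
trips = pd.DataFrame({
    "date": ["2024-07-03", "2024-07-03", "2024-07-03", "2024-07-04"],
    "route_type": [3, 3, 0, 3],
    "route_id": ["r1", "r2", "r3", "r1"],
})

# count distinct routes per date and route_type
route_counts = (
    trips.groupby(["date", "route_type"])["route_id"]
    .nunique()
    .reset_index(name="route_count")
)
print(route_counts)
```

The summarise methods below produce tables of this shape directly from the feed.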
Print 2 summary tables:
- Counts for routes on every date in the feed.
- Statistics for trips on every day in the feed.
Examine the api reference help for MultiGtfsInstance
. Use the appropriate methods to produce the summaries.
- Use the default behaviour to produce the first table.
- Ensure that the appropriate parameter allowing stats for days of the week is toggled to `True` for the trip summary.
```python
feed.summarise_routes()
```
| | date | route_type | route_count |
|---|---|---|---|
| 0 | 2024-07-03 | 0 | 3 |
| 1 | 2024-07-03 | 1 | 2 |
| 2 | 2024-07-03 | 3 | 104 |
| 3 | 2024-07-03 | 4 | 3 |
| 4 | 2024-07-04 | 0 | 3 |
| ... | ... | ... | ... |
| 136 | 2024-08-30 | 3 | 84 |
| 137 | 2024-08-30 | 4 | 4 |
| 138 | 2024-08-31 | 3 | 20 |
| 139 | 2024-08-31 | 4 | 4 |
| 140 | 2024-09-01 | 4 | 4 |
141 rows × 3 columns
```python
feed.summarise_trips(to_days=True)
```
| | day | route_type | trip_count_max | trip_count_mean | trip_count_median | trip_count_min |
|---|---|---|---|---|---|---|
| 0 | monday | 4 | 165 | 165.0 | 165.0 | 165 |
| 1 | monday | 3 | 8792 | 7700.0 | 7544.0 | 7544 |
| 2 | monday | 1 | 816 | 816.0 | 816.0 | 816 |
| 3 | monday | 0 | 986 | 986.0 | 986.0 | 986 |
| 4 | tuesday | 4 | 165 | 165.0 | 165.0 | 165 |
| 5 | tuesday | 0 | 986 | 986.0 | 986.0 | 986 |
| 6 | tuesday | 3 | 8628 | 7680.0 | 7544.0 | 7544 |
| 7 | tuesday | 1 | 816 | 816.0 | 816.0 | 816 |
| 8 | wednesday | 4 | 165 | 156.0 | 165.0 | 87 |
| 9 | wednesday | 0 | 986 | 986.0 | 986.0 | 986 |
| 10 | wednesday | 1 | 816 | 816.0 | 816.0 | 816 |
| 11 | wednesday | 3 | 9267 | 7856.0 | 7544.0 | 7544 |
| 12 | thursday | 3 | 9557 | 7888.0 | 7544.0 | 7544 |
| 13 | thursday | 2 | 12 | 12.0 | 12.0 | 12 |
| 14 | thursday | 1 | 816 | 816.0 | 816.0 | 816 |
| 15 | thursday | 4 | 165 | 165.0 | 165.0 | 165 |
| 16 | thursday | 0 | 986 | 986.0 | 986.0 | 986 |
| 17 | friday | 1 | 816 | 816.0 | 816.0 | 816 |
| 18 | friday | 3 | 8826 | 7792.0 | 7544.0 | 7408 |
| 19 | friday | 4 | 165 | 165.0 | 165.0 | 165 |
| 20 | friday | 0 | 986 | 986.0 | 986.0 | 986 |
| 21 | saturday | 3 | 8792 | 7023.0 | 7544.0 | 1606 |
| 22 | saturday | 1 | 816 | 816.0 | 816.0 | 816 |
| 23 | saturday | 0 | 986 | 986.0 | 986.0 | 986 |
| 24 | saturday | 4 | 165 | 165.0 | 165.0 | 165 |
| 25 | sunday | 3 | 8792 | 7700.0 | 7544.0 | 7544 |
| 26 | sunday | 1 | 816 | 816.0 | 816.0 | 816 |
| 27 | sunday | 0 | 986 | 986.0 | 986.0 | 986 |
| 28 | sunday | 4 | 165 | 165.0 | 165.0 | 165 |
From these summaries we can also create visualisations, such as a timeseries plot of trip counts by route type and date:
```python
# sort by route_type and date to order plot correctly
df = feed.summarise_trips().sort_values(["route_type", "date"])
fig = px.line(
    df,
    x="date",
    y="trip_count",
    color="route_type",
    title="Trip Counts by Route Type and Date Across All Input GTFS Feeds",
)
# set y axis min to zero, improve y axis formatting, and overall font style
fig.update_yaxes(rangemode="tozero", tickformat=",.0f")
fig.update_layout(
    font_family="Arial",
    title_font_family="Arial",
)
fig.show()
```
Visualisations like this can be very helpful when reviewing the quality of the input GTFS feeds and determining a suitable routing analysis date.
Clean Feed
We can attempt to remove common issues with GTFS feeds by running the `clean_feeds()` method. This may remove problems associated with trips, routes, or places where the specification is violated.
```python
feed.clean_feeds()
feed.is_valid()
```
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/gtfs_kit/cleaners.py:80: FutureWarning:
DataFrame.applymap has been deprecated. Use DataFrame.map instead.
KeyError. Feed was not cleaned.
| | type | message | table | rows | GTFS |
|---|---|---|---|---|---|
| 0 | warning | Unrecognized column feed_id | feed_info | [] | /tmp/tmpxzeeqykk/intercity_rail_gtfs.zip |
| 1 | warning | Unrecognized column conv_rev | feed_info | [] | /tmp/tmpxzeeqykk/intercity_rail_gtfs.zip |
| 2 | warning | Unrecognized column plan_rev | feed_info | [] | /tmp/tmpxzeeqykk/intercity_rail_gtfs.zip |
| 3 | warning | Repeated pair (trip_id, departure_time) | stop_times | [1352, 1361, 1363, 1367, 1372, 1375, 1377, 138... | /tmp/tmpxzeeqykk/rtm_gtfs.zip |
You may note warnings printed to the console and a statement about whether a feed was successfully cleaned.
Write Filtered Feed
Once we have finished the filter and cleaning operations, we can now go ahead and write the feed out to disk, for use in future routing operations.
Write your filtered feed out to a new location on disk. Confirm that the size of the filtered feed on disk is smaller than that of the original feed.
- Pass a string or a pathlike object to the `save_feeds()` method of `MultiGtfsInstance`.
- Once the feed is written successfully, check the disk usage of the new filtered feed.
```python
filtered_pth = os.path.join(tmp_path.name, "filtered_feed")
try:
    os.mkdir(filtered_pth)
except FileExistsError:
    pass
feed.save_feeds(filtered_pth, overwrite=True)
```
Check the filtered feed's size on disk:
```python
out = subprocess.run(
    ["du", "-sh", filtered_pth], capture_output=True, text=True)
filtered_size = out.stdout.strip().split("\t")[0]
print(f"After filtering, feed size reduced from {size_out} to {filtered_size}")
```
After filtering, feed size reduced from 4.0M to 1.5M
Conclusion
Congratulations, you have completed this tutorial. We examined the features, errors and warnings within a GTFS feed. We also filtered the feed by bounding box and by date in order to reduce its size on disk.
For any problems encountered with this tutorial or the `assess-gtfs` package, please open an issue on our GitHub repository.