Analyzing Millions of Taxi Trips in the City of Chicago

The City of Chicago’s open data portal provides a large amount of human mobility data, including taxi trips, TNP rideshare trips, Divvy bikeshare trips, and E-scooter trips (data in 2019, data in 2020, data since 2022). There is a brief summary (see the following table) of annual trips of these travel modes in the City of Chicago, depending on the data availability. More detailed statistical analysis (e.g., daily and monthly aggregates) of taxi and ridehailing usage in Chicago can be found in this post, including total trips/farebox and average trip duration/distance/fare/speed.

Year	Taxi trips	Rideshare trips	Divvy trips	E-scooter trips
2013	21.6M		760K
2014	32.0M		2.45M
2015	27.3M		3.18M
2016	26.7M		3.60M
2017	22.0M		3.83M
2018	18.7M		3.60M
2019	14.6M	96.9M	3.82M	711K
2020	3.46M	42.1M	3.54M	631K
2021	3.37M	42.2M	5.60M
2022	5.47M	57.3M	5.67M	1.49M
2023	5.79M	68.7M	-M	2.31M

Taxi data is available since January 2013, TNP rideshare data since November 2018 (see the download pages of trips (2018 - 2022) and trips (2023 -)). As of 2019 there are three licensed TNPs in Chicago: Uber, Lyft, and Via. The TNP dataset does not identify which company provided each trip.

Notably, despite the taxi and ridehailing usage in Chicago, one can also check out the taxi and ridehailing usage in New York City. Below is a summary of NYC taxi (e.g., yellow taxi) and rideshare (e.g., high volume for-hire vehicle (HVFHV)) trips over the past few years, depending on the data availability at the TLC trip record data.

Year	Taxi trips	Rideshare trips
2011	176.9M
2012	171.4M
2013	171.8M
2014	165.4M
2015	146.0M
2016	131.1M
2017	113.5M
2018	102.9M
2019	84.6M	234.6M
2020	24.6M	143.3M
2021	30.9M	174.6M
2022	39.7M	212.4M
2023	38.3M	232.5M

Visualizing Boundaries of Community Areas in Chicago

The data can be viewed on the Chicago Data Portal with a web browser, see 77 community areas in Chicago. To use these data, one can export and download the data in the Shapefile format. In this post, we rename four files of the Shapefile data as follows,

areas.dbf
areas.prj
areas.shp
areas.shx

and place these files at the folder Boundaries_Community_Areas.

Then it is not hard to use the geopandas and matplotlib packages in Python to visualize the boundaries of community areas.

import geopandas as gpd
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (14, 8))
ax = fig.subplots(1)
shape = gpd.read_file("Boundaries _Community_Areas/areas.shp")
shape.plot(cmap = 'RdYlBu_r', ax = ax)
plt.xticks([])
plt.yticks([])
for _, spine in ax.spines.items():
    spine.set_visible(False)
plt.show()
fig.savefig("boundaries_community_areas_chicago.png", bbox_inches = "tight")

Figure 1 shows the boundaries of 77 community areas in the City of Chicago. Note that we can set the cmap as RdYlBu_r or YlOrRd_r.

Figure 1. Boundaries of community areas in the City of Chicago, USA.

Matching Taxi Trips with Community Areas

There are three basic steps to follow for processing taxi trip data:

Download taxi trips in 2022 in the .csv format, e.g., Taxi_Trips_-_2022.csv.
Use the pandas package in Python to process the raw trip data.
Match trip pickup/dropoff locations with boundaries of the community area.

import pandas as pd

data = pd.read_csv('Taxi_Trips_-_2022.csv')
data.head()

For each taxi trip, one can select some important information:

Trip Start Timestamp: When the trip started, rounded to the nearest 15 minutes.
Trip Seconds: Time of the trip in seconds.
Trip Miles: Distance of the trip in miles.
Pickup Community Area: The Community Area where the trip began. This column will be blank for locations ourside Chicago.
Dropoff Community Area: The Community Area where the trip ended. This column will be blank for locations outside Chicago.

df = pd.DataFrame()
df['Trip Start Timestamp'] = data['Trip Start Timestamp']
df['Trip Seconds'] = data['Trip Seconds']
df['Trip Miles'] = data['Trip Miles']
df['Pickup Community Area'] = data['Pickup Community Area']
df['Dropoff Community Area'] = data['Dropoff Community Area']
del data
df

By doing so, there are 6,382,425 rows in this new dataframe. For the following analysis, one should remove the trips whose pickup/dropoff locations outside Chicago. In addition, one should clean the outliers that are with 0 (trip) seconds or 0 (trip) miles.

df = df.dropna() # Remove rows with NaN
df = df.drop(df[df['Trip Seconds'] == 0].index)
df = df.drop(df[df['Trip Miles'] == 0].index)
df = df.reset_index()
df = df.drop(['index'], axis = 1)
df.to_csv('taxi_trip_2022.csv', index = False)

import numpy as np

print(np.mean(df['Trip Seconds'].values))
print(np.mean(df['Trip Miles'].values))

By doing so, there are 4,763,961 remaining taxi trips in the dataframe. If you want to aggregate the trip counts of each pickup/dropoff community area, the simplest way to get row counts per pickup/dropoff community area is by calling .groupby().size().

pickup_df = df.groupby(['Pickup Community Area']).size().reset_index(name = 'pickup_counts')
dropoff_df = df.groupby(['Dropoff Community Area']).size().reset_index(name = 'dropoff_counts')

Visualizing Pickup and Dropoff Trips in 2022

It is not hard to first use the geopandas package to merge pickup/dropoff trip counts into the .shp data and then visualize the trip data with the matplotlib package.

import geopandas as gpd
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14

chicago = gpd.read_file("Boundaries_Community_Areas/areas.shp")
pickup_df = df.groupby(['Pickup Community Area']).size().reset_index(name = 'pickup_counts')
dropoff_df = df.groupby(['Dropoff Community Area']).size().reset_index(name = 'dropoff_counts')
pickup_df['area_numbe'] = pickup_df['Pickup Community Area']
dropoff_df['area_numbe'] = dropoff_df['Dropoff Community Area']
chicago['area_numbe'] = chicago.area_numbe.astype(float)

pickup = chicago.set_index('area_numbe').join(pickup_df.set_index('area_numbe')).reset_index()
dropoff = chicago.set_index('area_numbe').join(dropoff_df.set_index('area_numbe')).reset_index()

fig = plt.figure(figsize = (14, 8))
for i in [1, 2]:
    ax = fig.add_subplot(1, 2, i)
    if i == 1:
        pickup.plot('pickup_counts', cmap = 'YlOrRd', legend = True,
                    legend_kwds = {'shrink': 0.618, 'label': 'Pickup trip count'},
                    vmin = 0, vmax = 1.4e+6, ax = ax)
    elif i == 2:
        dropoff.plot('dropoff_counts', cmap = 'YlOrRd', legend = True,
                     legend_kwds = {'shrink': 0.618, 'label': 'Dropoff trip count'},
                     vmin = 0, vmax = 1.4e+6, ax = ax)
    plt.xticks([])
    plt.yticks([])
plt.show()
fig.savefig("pickup_dropoff_trips_chicago_2022.png", bbox_inches = "tight")

Figure 2 shows taxi pickup and dropoff trips (2022) on 77 community areas in the City of Chicago. Note that the average trip duration is 1207.75 seconds and the average trip distance is 6.16 miles.

Figure 2. Taxi pickup and dropoff trips (2022) in the City of Chicago, USA. There are 4,763,961 remaining trips after the data processing.

Figure 3 shows rideshare pickup and dropoff trips (2022) on 77 community areas in the City of Chicago. Note that the average trip duration is 953.26 seconds and the average trip distance is 5.19 miles.

Figure 3. Rideshare pickup and dropoff trips (2022) in the City of Chicago, USA. There are 57,290,954 remaining trips after the data processing.

For comparison, Figure 4 shows taxi pickup and dropoff trips (2019) on 77 community areas in the City of Chicago. Note that the average trip duration is 915.62 seconds and the average trip distance is 3.93 miles. Figure 5 shows rideshare pickup and dropoff trips (2019) in which the average trip duration is 987.51 seconds and the average trip distance is 4.88 miles.

Figure 4. Taxi pickup and dropoff trips (2019) in the City of Chicago, USA. There are 12,484,572 remaining trips after the data processing. See the data processing codes.

Figure 5. Rideshare pickup and dropoff trips (2019) in the City of Chicago, USA. There are 96,642,881 remaining trips after the data processing.

In addition, one can analyze the trips of other travel modes in the City of Chicago. Figure 6 shows E-scooter pickup and dropoff trips on 77 community areas in the City of Chicago, see how to process and visualize E-scooter trips. Note that the average trip duration is 913.18 seconds and the average trip distance is 2448.60 meters.

Figure 6. E-scooter pickup and dropoff trips (2022) in the City of Chicago, USA. There are 1,476,028 remaining trips after the data processing.

Taxi Travel Time Changes of Popular Pickup-Dropoff Pairs between 2019 and 2022

As shown in Figure 2 and Figure 4, there are some most popular pickup community areas, e.g., see left panel of Figure 2:

Community area 8: 1,261,696 trips
Community area 32: 888,724 trips
Community area 76: 688,553 trips
Community area 28: 448,476 trips

The pickup trips in these four areas are about 69% of all trips in 2022. For comparison, the left panel of Figure 4 shows that these four areas are also the most popular pickup areas. Specifically, we have

Community area 8: 4,006,793 trips
Community area 32: 3,647,522 trips
Community area 28: 1,451,411 trips
Community area 76: 1,096,552 trips

It seems to be 81.72% of all trips in 2019.

import pandas as pd

data19 = pd.read_csv('taxi_trip_2019.csv')
data22 = pd.read_csv('taxi_trip_2022.csv')

# Extract the most popular pickup areas
df19 = data19.groupby(['Pickup Community Area']).size().reset_index(name = 'pickup_counts')
df19 = df19.sort_values(by = ['pickup_counts'], ascending = False)
df22 = data22.groupby(['Pickup Community Area']).size().reset_index(name = 'pickup_counts')
df22 = df22.sort_values(by = ['pickup_counts'], ascending = False)

In what follows, one can choose some pickup-dropoff pairs to analyze taxi travel times.

# Return hour from datetime column
data19['hour'] = pd.to_datetime(data19['Trip Start Timestamp'],
                                errors = 'coerce').dt.hour
data22['hour'] = pd.to_datetime(data22['Trip Start Timestamp'],
                                errors = 'coerce').dt.hour

From Area 8 to Area 76

When analyzing taxi travel times and movement speeds, one should remove some outliers (e.g., anomalies in trip seconds/miles).

# From Area 8 to Area 76 in 2019
df1 = data19[(data19['Pickup Community Area'] == 8) & (data19['Dropoff Community Area'] == 76)]
df1 = df1.drop(df1[df1['Trip Seconds'] < 600].index)
df1 = df1.drop(df1[df1['Trip Seconds'] > 7200].index)
df1 = df1.drop(df1[df1['Trip Miles'] < 10].index)
df1 = df1.drop(df1[df1['Trip Miles'] > 25].index)

# From Area 8 to Area 76 in 2022
df2 = data22[(data22['Pickup Community Area'] == 8) & (data22['Dropoff Community Area'] == 76)]
df2 = df2.drop(df2[df2['Trip Seconds'] < 600].index)
df2 = df2.drop(df2[df2['Trip Seconds'] > 7200].index)
df2 = df2.drop(df2[df2['Trip Miles'] < 10].index)
df2 = df2.drop(df2[df2['Trip Miles'] > 25].index)

In what follows, we visualize the average travel time and speed from area 8 (i.e., Downtown) to area 76 (i.e., Airport) in both 2019 and 2022. Figure 7 shows average travel time curves and movement speed curves. It is not hard to see the remarkable reduction of average travel time in 2022 when comparing to 2019. The results of average speed demonstrate that the paths/routes from area 8 to area 76 is less congested in 2022 when comparing to 2019.

Figure 7. Average travel time and speed from area 8 (i.e., Downtown) to area 76 (i.e., Airport) in both 2019 and 2022.

Figure 8. Average travel time and speed from area 32 (i.e., Downtown) to area 76 (i.e., Airport) in both 2019 and 2022.

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (8, 2.5))
ax = fig.add_subplot(1, 2, 1)
# Average travel time in 2019
m1 = df1.groupby(['hour'])['Trip Seconds'].mean().values / 30
s1 = df1.groupby(['hour'])['Trip Seconds'].std().values / 30
plt.plot(m1, color = 'blue', linewidth = 1.8, label = '2019')
upper = m1 + s1
lower = m1 - s1
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 24)),
                              np.array([24 - 1, 24 - 1])), np.arange(24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower),
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'blue', alpha = 0.05)

# Average travel time in 2022
m1 = df2.groupby(['hour'])['Trip Seconds'].mean().values / 30
s1 = df2.groupby(['hour'])['Trip Seconds'].std().values / 30
plt.plot(m1, color = 'red', linewidth = 1.8, label = '2022')
upper = m1 + s1
lower = m1 - s1
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 24)),
                              np.array([24 - 1, 24 - 1])), np.arange(24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower),
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'red', alpha = 0.05)

plt.xticks(np.arange(0, 24 + 1, 2))
plt.xlabel('Time (hour)')
plt.ylabel('Average travel time (min)')
plt.grid(axis = 'both', linestyle='dashed', linewidth = 0.1, color = 'gray')
ax.tick_params(direction = "in")
ax.set_xlim([-1, 24])
plt.legend()

ax = fig.add_subplot(1, 2, 2)
# Average speed in 2019
df1['speed'] = df1['Trip Miles'].values / (df1['Trip Seconds'] / 3600)
m1 = df1.groupby(['hour'])['speed'].mean().values
s1 = df1.groupby(['hour'])['speed'].std().values
plt.plot(m1, color = 'blue', linewidth = 1.8, label = '2019')
upper = m1 + s1
lower = m1 - s1
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 24)),
                              np.array([24 - 1, 24 - 1])), np.arange(24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower),
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'blue', alpha = 0.05)

# Average speed in 2022
df2['speed'] = df2['Trip Miles'].values / (df2['Trip Seconds'] / 3600)
m1 = df2.groupby(['hour'])['speed'].mean().values
s1 = df2.groupby(['hour'])['speed'].std().values
plt.plot(m1, color = 'red', linewidth = 1.8, label = '2022')
upper = m1 + s1
lower = m1 - s1
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 24)),
                              np.array([24 - 1, 24 - 1])), np.arange(24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower),
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'red', alpha = 0.05)

plt.xticks(np.arange(0, 24 + 1, 2))
plt.xlabel('Time (hour)')
plt.ylabel('Average speed (mph)')
plt.grid(axis = 'both', linestyle='dashed', linewidth = 0.1, color = 'gray')
ax.tick_params(direction = "in")
ax.set_xlim([-1, 24])
plt.legend()

plt.savefig("travel_time_plus_speed_pickup_8_dropoff_76.png", bbox_inches = "tight")
plt.show()

(Posted by Xinyu Chen on November 12, 2023.)