
Data Forensic Examination of the Demise of the

Taxi Industry in Chicago

Author(s)
David Faust, Loyola University Chicago, dfaust@luc.edu, Accounting and Information Systems
Undergraduate
Grace Boos, Loyola University Chicago, gboos@luc.edu, Information Systems and Management
Undergraduate
Natalie Dusina, Loyola University Chicago, ndusina@luc.edu, Accounting and Information Systems
Undergraduate
Katherine Fox, Loyola University Chicago, kfox@luc.edu, Information Systems and Marketing
Undergraduate
Samantha Schaller, Loyola University Chicago, sschaller@luc.edu, Information Systems and Marketing
Undergraduate

Faculty Advisor
Professor Svetlozar Nestorov, Loyola University Chicago, ISSCM Department, snestorov@luc.edu

Summary
We live in an age of easy access to transportation. Affordable rides are now readily available with
a tap of the finger. This new transportation industry is booming, and as a result, the taxi industry has
suffered greatly. Looking at industry trends as represented in the media, it is easy to assume that the
taxi industry is dying. However, is that really true? After analyzing the changes in urban
transportation in the city of Chicago using a 60GB dataset of 113 million taxi trips from 2013 to 2017 that is
publicly accessible on the Chicago Data Portal, our team discovered a niche community area in
Chicago that has experienced an increase in rides over the past 5 years. Whereas the majority of the taxi
industry in Chicago is clearly in a dramatic downward spiral, this particular group is managing
to survive and thrive.

1) Problem and Motivation
Our team began with a desire to answer the question: Will taxis become obsolete in 5-10 years?
We sought to answer this question with analytics and data mining. Visualization 1A shows the number of
daily taxi trips in Chicago from January 1, 2013 to July 31, 2017. The precipitous decline is striking. The
busiest day of the year for taxis in Chicago is the Saturday before St. Patrick’s Day, which is traditionally
when the city of Chicago hosts its renowned St. Patrick’s Day parade. The number of taxi rides on this
benchmark day has declined from over 150,000 trips in 2014 to just 56,000 in 2017, a stunning 63%
decline. However, the state of the taxi industry is in fact even worse. Visualization 1B displays the
year-over-year monthly change rates for 2014-2017. In 2014 the number of rides was still growing, albeit at a
decreasing pace. Starting in March 2015, the year-over-year monthly change turned negative, and the
decline has deepened roughly linearly since then. The most recent value shows that rides in July 2017 decreased
by almost 50% from July 2016. After observing this overall decrease in taxi trips, we were motivated to see
whether the trend applied to all areas of the taxi industry. We became interested in identifying specific facets
of the industry that remain sustainable despite the prevalence of newer rideshare apps and systems.
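The percentages quoted above follow from simple percent-change arithmetic. The sketch below, written in Python rather than the Tableau workflow the team used, recomputes the benchmark-day decline from the two trip counts given in the text and shows one way a year-over-year monthly change could be derived from trip dates. The function names and the toy dates are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter
from datetime import date


def pct_change(old, new):
    """Percent change from old to new (negative means a decline)."""
    return 100.0 * (new - old) / old


def yoy_monthly_change(trip_dates):
    """Map (year, month) to the percent change in trip count versus the
    same month one year earlier. Months with no prior-year count are skipped."""
    counts = Counter((d.year, d.month) for d in trip_dates)
    return {
        (year, month): pct_change(counts[(year - 1, month)], n)
        for (year, month), n in counts.items()
        if (year - 1, month) in counts
    }


# Benchmark-day figures quoted in the text (the 2014 count is "over 150,000").
benchmark_decline = pct_change(150_000, 56_000)

# Toy dates only: 4 trips in July 2016 and 2 trips in July 2017.
toy_dates = [date(2016, 7, d) for d in (1, 2, 3, 4)] + [date(2017, 7, d) for d in (1, 2)]
toy_change = yoy_monthly_change(toy_dates)

print(round(benchmark_decline, 1))
print(toy_change)
```

Comparing each month with the same month of the previous year, rather than with the preceding month, keeps seasonal effects such as the St. Patrick’s Day peak from distorting the trend.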

2) Approach
Our analysis started from the dire state of the taxi industry: the dataset as a whole shows steep downward
trends. As a group, we initially considered over 30 different variables and their
potential influence on the decrease in taxi rides in Chicago. We also considered combining the taxi data with
other public datasets, including CTA (Chicago Transit Authority) trains, weather, and Divvy Bike (a
bike-sharing company in Chicago) data, to see if effective conclusions could be drawn using Tableau, our
primary software for analysis. Tableau is a tool that spurs creativity because it allows one to quickly create
dynamic visualizations. After experimenting with different ways to visually display the complex dataset,
we settled on the most informative facets of our exploration: trip duration, community areas, and taxi
drivers. We discovered that these three attributes reveal interesting insights into recent
developments within the taxi industry.

3) Datasets
Our dataset is from the Chicago Data Portal, a free data catalog that allows users to view and
interact with Chicago data. The dataset covers all taxi trips from 2013 to 2017; it is
structured and open to the public. There are 113 million records, each of which represents a single taxi
trip in the Chicagoland area, and 23 columns. In our analysis we focused on various subsets of this
large dataset and used a Hyper extract (as explained in Section 4) to improve performance.

4) Tools & Analytics


We used Tableau Desktop 10 for our analysis. Tableau uses a language called VizQL to generate
visualizations based on the dimensions and measures specified. A common starting point for our
analysis was to create histograms to understand the distribution of our data in order to create
meaningful clusters. Hyper is a new feature within Tableau that improves performance
because it is designed for faster data processing on large, complex datasets such as this one.
The team also used RStudio in our analysis of clusters for Taxi IDs. RStudio is an open-source
development environment for the programming language R, which is used for statistical computation and data
mining. We used RStudio to run a cluster optimization function and plot the change in the within-cluster sum
of squares in order to determine the optimal number of clusters to use.
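To make the cluster-optimization step concrete, here is a minimal pure-Python sketch of the elbow method: run k-means for several values of k, record the within-cluster sum of squares for each, and look for the k where the curve flattens. This is not the team's R code, which is not reproduced in this paper. The function names, the toy points, the seed, and the restart count are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def kmeans_wss(points, k, restarts=10, iters=100, seed=0):
    """Best within-cluster sum of squares over several runs of Lloyd's algorithm."""
    rng = random.Random(seed)
    best = float("inf")
    for _run in range(restarts):
        centers = rng.sample(points, k)  # start from k distinct data points
        for _step in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
                clusters[nearest].append(p)
            # Keep the old center if a cluster ends up empty.
            updated = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
            if updated == centers:
                break
            centers = updated
        wss = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best = min(best, wss)
    return best


# Toy (average trip total, average trip minutes) pairs for hypothetical Taxi IDs.
points = [(10, 8), (11, 9), (9, 7), (25, 18), (26, 19), (24, 17), (50, 30), (51, 31), (49, 29)]
wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

Plotting `wss_by_k` against k gives the elbow curve. In practice the two attributes would usually be standardized first so that neither dominates the distance calculation.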
Since this project was a culmination of the efforts of five individuals over several months, effective
and collaborative data storage was vital. Initially, our group attempted to use Dropbox, but because of the
size of our extract, we ended up utilizing Google Drive to share updated Tableau and note files.

5) Results
As discussed in Section 1, Visualization 1A shows the daily number of trips for each year, with
peaks and troughs that coincide with the annual St. Patrick’s Day parade and Christmas Day, respectively.
Visualization 1B shows the percent difference in the number of trips from a month of one year to the
corresponding month of the next year. The average cost of a trip in 2013 was $14.20, and the average cost of a
trip in 2017 was $16.40. So, even while the total number of trips is decreasing, the average cost of a trip has
increased. One attribute that we found particularly insightful was trip duration. Visualization 2A shows
the histogram of trip duration. The bin size is 5 minutes, and we used this
visualization to define group parameters. We decided to group all the rides into five distinct groups. These
groups are defined as “Very Short” 0-5 minute trips (13.7% of the data), “Short” 5-10 minute trips (35.1% of
the data), “Medium” 10-20 minute trips (32.5% of the data), “Long” 20-40 minute trips (14.1% of the
data), and “Very Long” trips of over 40 minutes (4.2% of the data).
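The grouping described above can be sketched as a lookup against the four bin edges. The Python below is an illustration only; the team built these groups in Tableau. The text does not say which group a trip lasting exactly 5, 10, 20, or 40 minutes belongs to, so the sketch assumes it goes to the longer group. The toy durations are made up.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges (minutes) and labels taken from the text.
EDGES = [5, 10, 20, 40]
LABELS = ["Very Short", "Short", "Medium", "Long", "Very Long"]


def duration_group(minutes):
    """Return the duration-group label for a trip length in minutes.

    Assumption: a trip that lands exactly on an edge goes to the longer group.
    """
    return LABELS[bisect_right(EDGES, minutes)]


def group_shares(durations):
    """Percent of trips that fall in each duration group."""
    counts = Counter(duration_group(m) for m in durations)
    total = sum(counts.values())
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Toy durations in minutes, not values from the dataset.
toy_minutes = [3, 7, 8, 12, 15, 18, 25, 33, 41, 55]
print(group_shares(toy_minutes))
```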
Using those trip duration groups, we created Visualization 2B, which shows the average cost per
trip for each group. From this we learned that the average cost of the “Long” and “Very Long”
trips has increased over time, while the cost of the other trip duration groups has remained stagnant or
decreased. Based on the grouping by trip duration, we see that the increase in average trip cost is due
only to “Long” and “Very Long” trips. Since trips longer than 20 minutes are producing a growing share of
revenue in the taxi industry, we were inspired to find more attributes of these “Long” and “Very Long”
trips. We looked into where these trips were occurring and into the attributes of the taxi drivers who drive them.
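The per-group averages behind a chart like Visualization 2B amount to a grouped mean. The sketch below shows that aggregation in plain Python, assuming each trip record already carries its year and duration-group label. The record layout, the function name, and the dollar amounts are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def average_cost_by_year_and_group(trips):
    """Average trip total per (year, duration group).

    `trips` is an iterable of (year, duration_group, trip_total) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, group, trip_total in trips:
        totals[(year, group)] += trip_total
        counts[(year, group)] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy records only; the dollar amounts are made up for illustration.
toy_trips = [
    (2013, "Short", 8.0), (2013, "Short", 10.0),
    (2013, "Long", 30.0), (2013, "Long", 34.0),
    (2017, "Short", 9.0), (2017, "Short", 9.0),
    (2017, "Long", 38.0), (2017, "Long", 42.0),
]
averages = average_cost_by_year_and_group(toy_trips)
for key in sorted(averages):
    print(key, averages[key])
```

In this toy data the “Short” average is flat while the “Long” average rises, which mirrors the pattern the text describes without reproducing its actual figures.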
To find out more about the characteristics of “Long” and “Very Long” trips, our team created
Visualization 3A. This visualization provided an interesting insight: a jump in the number of records
for trips that covered 17 to 18 miles. Upon identifying this jump, we decided to
explore that data further. We connected these trips to community area 76, O’Hare Airport. O’Hare Airport
is located just 17 miles northwest of downtown Chicago. There is also a smaller jump in the number of trips
at 12-13 miles, which corresponds to the distance from downtown Chicago to Chicago’s Midway Airport.
To support the conclusion that these 17-18 mile trips were connected to O’Hare Airport, our team created
Visualization 3B, which depicts the distribution of trip minutes for trips between 17 and 18 miles long. This
showed us that most of these trips were 24 minutes long or longer, which aided in our determination of trip
distance categories.
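One way to spot jumps like those at 17-18 and 12-13 miles programmatically is to bin trips by whole miles and flag bins that stand above both neighbors. The team found these jumps by inspecting a Tableau histogram, so the peak rule below is an assumption introduced for illustration, as are the function names and the toy mileages.

```python
from collections import Counter
import math


def mile_histogram(trip_miles):
    """Count trips per whole-mile bin, e.g. 17.4 miles falls in bin 17."""
    return Counter(math.floor(m) for m in trip_miles)


def local_peaks(hist):
    """Bins whose count is strictly higher than both neighboring bins.

    A Counter returns 0 for bins that hold no trips.
    """
    return sorted(b for b in hist if hist[b] > hist[b - 1] and hist[b] > hist[b + 1])


# Toy mileages only: many short trips plus small bumps near 12 and 17 miles.
toy_miles = (
    [1.2] * 5 + [2.5] * 4 + [3.1] * 3
    + [11.4] + [12.6] * 2 + [13.3]
    + [16.8] + [17.2] * 3 + [18.5] * 2
)
print(local_peaks(mile_histogram(toy_miles)))
```

The first peak is simply the mode of short city trips. The later peaks are the ones that would prompt a closer look at where those trips start and end.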
In our analysis of “Far” trips (defined as trips between 10 and 20 miles) in the Downtown and
O’Hare community areas, our team found that both areas experienced an increase in trips from 2013
through 2016, as seen in Visualization 4A. This visualization also displays a dashed line representing the
trend for each community area. Closer inspection confirms that the number of Far O’Hare trips
increased significantly, while the number of Far Downtown trips increased less
sharply. The O’Hare community area is the only community area in the dataset that experienced
an increase in all trips over the total time period analyzed (almost 32%). This reconfirms that O’Hare is a
hotspot for the “Long” and “Very Long” trips as previously defined, which could be due to a variety of
reasons such as convenience, familiarity, business travel, and personal preference.
After identifying O’Hare Airport as a sustainable location for taxis, we wanted to identify the
individuals behind the wheel. This led to our analysis of Taxi IDs. Building on our previous analysis of trip
duration and community areas, we discovered a niche group of top-earning Taxi IDs that continue to
thrive in the industry. This became evident when we looked at the aggregated Trip Total of each Taxi ID over
the years observed and saw that the top earners in one year carried over to the other years in the
dataset. In order to learn more about this unique group, we created a set of Taxi IDs that combines the
top 5 Taxi IDs from each year based on revenue (Visualization 5A--Q2 2013). We feel this group of taxi
drivers is representative of success over time in the taxi industry. The size of each circle on the map shows the
number of taxi trips by top-earning taxis, to emphasize the increase in trips from O’Hare over time
(Visualization 5B--Q2 2016). The noticeable increase in taxi trips at O’Hare and the decrease in other areas in
Visualization 5B illustrate this point. Through our investigation of these top-earning Taxi IDs, we discovered
that successful Chicago drivers engaged with O’Hare more in 2016 than they did in 2013.
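The set of top earners described above is the union of each year's highest-revenue Taxi IDs. The sketch below expresses that selection in Python, assuming revenue means the summed Trip Total per Taxi ID per year. The record layout, the function name, and the toy IDs and amounts are hypothetical; the team built the set in Tableau.

```python
from collections import defaultdict


def top_earners_by_year(trips, n=5):
    """Union of the n highest-revenue Taxi IDs from each year.

    `trips` is an iterable of (year, taxi_id, trip_total) tuples.
    """
    revenue = defaultdict(lambda: defaultdict(float))
    for year, taxi_id, trip_total in trips:
        revenue[year][taxi_id] += trip_total

    selected = set()
    for by_taxi in revenue.values():
        ranked = sorted(by_taxi, key=by_taxi.get, reverse=True)
        selected.update(ranked[:n])
    return selected


# Toy records only; IDs and amounts are made up. n=2 keeps the example small.
toy_trips = [
    (2013, "A", 60.0), (2013, "A", 40.0), (2013, "B", 80.0),
    (2013, "C", 50.0), (2013, "D", 10.0),
    (2014, "A", 90.0), (2014, "B", 20.0), (2014, "C", 95.0), (2014, "D", 5.0),
]
print(sorted(top_earners_by_year(toy_trips, n=2)))
```

Because top earners tend to repeat from year to year, the union is usually much smaller than n times the number of years, which is what makes the group easy to track over time.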
To show that the most sustainable taxi drivers earn the highest revenue and perform
longer trips on average, we used the clustering method outlined in Section 4. After creating a sub-dataset
of average trip total and average trip duration in minutes at the Taxi ID level, our team ran
these numeric attributes through a cluster optimization function. This function determined that the optimal
number of clusters is 6. Interpreting the results from the Tableau k-means clustering tool, as seen in
Visualization 6A, the team found that O’Hare trips belonged to clusters 1, 2, and 3. The team separately
determined the top 50 Taxi IDs producing the highest total trip cost (Trip Total) over the entire dataset
and analyzed which clusters they belonged to (Visualization 6B). We highlighted 41 of the top 50
Taxi IDs (those categorized into the O’Hare clusters) and analyzed their duration and cost
averages at the airport.
Visualization 6C displays the average trip minutes and trip cost for each of the 41 selected Taxi
IDs, showing that their average trip durations mostly fell between 27 and 36 minutes and that average trip
cost was centered around $50. This is significant because the best drivers make more on average when
they go to and from O’Hare Airport than at other locations in Chicago, including Downtown.
This finding corroborates our previous analysis and supports the team’s conclusion that
the taxi drivers who have figured out how to profit in the taxi industry perform longer rides and
normally work to and from O’Hare.

6) Contributions and Uniqueness


The novelty of our approach comes from the structure of our analysis. By breaking trips into
well-defined groups (for example, by trip duration or community area), we can take this huge dataset and draw
conclusions that would otherwise have been impossible. Using this approach, we discovered a facet that runs
counter to the general decreasing trend of the taxi industry. We can confidently conclude that trips
over 20 minutes with pick-up or drop-off at the airports still represent a growth area. The taxi drivers
who routinely drive these routes have experienced continued success. This represents a silver lining in an
otherwise bleak outlook for the taxi industry in Chicago.
Our study can serve as a template for studying taxi industry trends in other metropolitan areas
nationwide and internationally. Such efforts could help city planners and urban transportation
researchers examine and reassess the role of taxis in future urban centers.

7) Appendix: Data Visualizations

8) References
Chicago Data Portal: https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew
Tableau Technical Manual Resource Center: https://www.tableau.com/support/desktop
RStudio Documentation: https://support.rstudio.com/hc/en-us/categories/200035113-Documentation
