B Tech Project Thesis

DATA ANALYTICS BASED
DYNAMIC PASSENGER INFORMATION SYSTEM

A Project Report
submitted by
RAKESH BEHERA
in partial fulfilment of the requirements
for the award of the degree of
BACHELOR OF TECHNOLOGY
TRANSPORTATION DIVISION
DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
MAY 2014
CERTIFICATE
This is to certify that the project report titled Data Analytics Based Dynamic Passenger
Information System, submitted by Rakesh Behera, to the Indian Institute of Technology,
Madras, for the award of the degree of Bachelor of Technology, is a bona fide record of
the research work done by him under my supervision. The contents of this report, in full or
in parts, have not been submitted to any other Institute or University for the award of any
degree or diploma.
Dr. Lelitha Devi V.
Project Guide
Associate Professor
Dept. of Civil Engineering
IIT-Madras, 600 036
Prof. Meher Prasad A.
Head of the Department
Professor
Dept. of Civil Engineering
IIT-Madras, 600 036
Place: Chennai
Date: 19th May 2014
ACKNOWLEDGEMENTS
My earnest thanks to Dr. Lelitha Devi, for her support throughout the study. It is through her
guidance that the project has gained structure and been accomplished in such a short span
of time. Her foresight and expertise have helped us make the right choices in the project
and otherwise. I am thoroughly indebted to her for the amount of time she has spent in
reviewing my analyses and reports. I thank her for her belief in my potential in carrying out
the tasks involved. I consider it a privilege to have worked under her guidance.
I also owe my gratitude to Dr. Shankar Ram C. S. for his valuable inputs. His contribu-
tion could not have been substituted by anyone else. I also thank Dr. J. Murali Krishnan for
the constant support and encouragement that he has provided me throughout my academic
life at IITM. I take this opportunity to thank Akhilesh, Krishna, Siddharth and Anil for the
help offered by them in data acquisition and the development of the online version of the
framework. I would also like to acknowledge all the other project staff and students at the
Centre of Excellence in Urban Transportation, IIT Madras.
Friends have been an integral part of my stay here at IIT Madras. Life at IITM
cannot be complete without them. I thank all my friends and wing mates for making my
stay here at IIT Madras a memorable one.
Finally, I would like to thank my parents and my younger brothers for their enduring
support and unconditional love, without which this project would not have been possible.
ABSTRACT
KEYWORDS: Travel Time Prediction, Historical Trajectory Search, Kalman Fil-
ter, V-clustering.
The present study developed a reliable system for real-time bus arrival/travel time
prediction under the heterogeneous traffic conditions that exist in India. The study differs
from (and is more challenging than) most previous studies, which involved homogeneous
traffic conditions. To accomplish this goal, a robust framework, namely Historical
Trajectory and Kalman Filter based Travel/Arrival Time Prediction (HTKFTP), is proposed
in this study. The proposed framework has two major components: (i) similar trajectory
search; (ii) travel time prediction using similar trajectories. Through the data analysis
performed, travel time correlations (between spatially close stretches of road) and other
temporal patterns in travel times were identified, which were used to develop
various schemes for the selection of historical trajectories. The prediction algorithm based
on the Kalman Filter was also improved to account for the high variance in travel times at
certain locations or during certain times of the day. The proposed schemes were corroborated
using real-world GPS trajectory data collected from Metropolitan Transport Corporation
(MTC) buses in Chennai.
TABLE OF CONTENTS
CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
NOTATION
1 INTRODUCTION
1.1 Motivation
1.2 Background
1.3 Research Overview
1.4 Chapter Outline
2 LITERATURE REVIEW
2.1 A Brief History of Traffic Prediction
2.2 Approaches Exploiting "Similarity"
2.3 Trajectory Patterns
3 DATA ANALYSIS
3.1 Raw Data
3.2 Data Transformation
3.2.1 Data Cleaning
3.2.2 Extracting Trip Data
3.2.3 Calculation of Segment-wise Travel Times
3.3 Correlation Between Segments
3.4 Travel Time Patterns
4 THE FRAMEWORK AND THE CLUSTERING ALGORITHM
4.1 Terms and Definitions
4.2 Problem Formulation
4.3 Overview of the Framework
4.4 Trajectory Search based on Passed Segments Scheme
4.5 The Clustering Algorithm
4.6 Nearest Neighbour Search in Passed Segments Scheme
4.7 Similarity based on Temporal Features
4.8 Summary
5 THE PREDICTION ALGORITHM
5.1 Travel Time Prediction using Kalman Filter
5.2 The Base KF Algorithm
5.3 Integration of Trajectory Search and Prediction Algorithms
5.4 Modifications
6 PERFORMANCE EVALUATION
6.1 Measures of Performance
6.2 Parameter Optimization in Passed Segment Scheme
6.2.1 Spatial lag
6.2.2 Minimum Number of Trajectories (MNT) in a Cluster
6.3 Evaluation of the PS scheme
6.4 Evaluation of the Weekday/Weekend Temporal Feature
6.5 Evaluation of the Temporal Neighbourhood Feature
6.6 Evaluation of the base KF Algorithm for Prediction
6.7 Evaluation of the Adaptive KF Algorithm
6.8 Evaluation Summary
7 SUMMARY AND CONCLUSIONS
7.1 Summary
7.2 Conclusions
7.3 Scope for Further Research
A PYTHON CODE LISTING FOR CLUSTERING ALGORITHM
A.1 Method for creating clusters from similar trips
A.2 Auxiliary method for finding optimum splits in the clustering algorithm
A.3 Method for finding nearest neighbours from clusters
LIST OF TABLES
3.1 A sample of the raw data received from the GPS devices on the buses.
3.2 A sample of data records after transformation.
4.1 An example of segment-wise travel times on historical trajectories.
4.2 An example of partitioned segment-wise travel times after application of the clustering algorithm.
LIST OF FIGURES
3.1 Pearson's correlation coefficients versus the segment distance.
3.2 Average correlation coefficient versus the segment distance.
3.3 Travel time analysis by hours of the day.
3.4 Comparison between weekday peak and weekday off-peak trips.
3.5 Correlations between travel times occurring in different hours of the day.
3.6 Comparison between weekday and weekend trips.
3.7 Comparison between the weekdays.
4.1 Overall architecture of the HTKFTP framework.
5.1 Variation of travel time variance across the segments of the 19B route.
6.1 Optimum values of parameters involved in the clustering algorithm.
6.2 Comparison of MAE for individual test trips before and after adding the PS scheme to the naive method.
6.3 Comparison of MAE for individual test trips before and after adding the weekday/weekend feature to the PS scheme.
6.4 Comparison of MAE for individual test trips before and after adding the temporal neighbourhood feature.
6.5 Comparison of MAE for individual test trips before and after using the base KF algorithm for prediction.
6.6 Comparison of MAE for individual test trips before and after using the Adaptive KF algorithm.
6.7 Improvement of the mean MAE (over all the test trips) throughout the evolution of the method.
6.8 Comparison between HTKFTP and the prediction method using static inputs in KF.
ABBREVIATIONS
AI Artificial Intelligence
ANN Artificial neural networks
DTW Dynamic Time Warping
ED Euclidean Distance
GPS Global Positioning System
HTD Historical Trajectory Database
HTKFTP Historical Trajectory and Kalman Filter based Travel/Arrival Time Prediction
HTTP Historical Trajectory based Travel time Prediction
KF Kalman Filtering
k-NN k-Nearest Neighbors
LCSS Longest Common Subsequence
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
MLR Multivariate Linear Regression
MTC Metropolitan Transport Corporation (Chennai)
NNS Nearest Neighbour Search
RBMS Real-time Bus Status Monitoring
SARIMA Seasonal Autoregressive Integrated Moving Average
SVR Support Vector Regression
TTP Travel Time Prediction
NOTATION
Correlation coefficient between two variables
$R_{raw}$ A raw route represented as a sequence of points, $p_i$s
$S_i$ A segment of road between two points, $p_i$ and $p_{i+1}$
$R$ A raw route represented as a sequence of segments, $S_i$s
$B_i$ The $i$th bus stop on a route
$t_i$ Time taken to reach a point $p_i$ on a route, starting from $p_0$
$T_{raw}$ A raw trajectory represented as a sequence of pairs of the form $(p_i, t_i)$
$t_i$ Actual time taken to cover a segment $S_i$
$t_i$ Predicted travel time on $S_i$
$A_{B_i}$ Actual arrival time of the bus at the bus stop $B_i$
$A_{B_i}^{j}$ The $j$th predicted arrival time at the bus stop $B_i$
$T$ A trajectory represented as a sequence of $t_i$s
$ST_i$ List of historical travel times on segment $S_i$
$SC_i$ List of clusters or intervals for $S_i$
$C_{S_i}^{i}$ The $i$th cluster for $S_i$
$T_{curr}$ The current (or incomplete or test) trajectory. Also denoted as $T_{test}$
$wav_i$ The weighted average variance for a split at the $i$th element of a list
$a_i$ Travel time evolution factor from $S_i$ to $S_{i+1}$
$w_i$ Process disturbance in travel time evolution at $S_i$
$z_i$ Measured travel time on $S_i$
$v_i$ Measurement noise associated with $S_i$
$Q_i$ Variance of the historical $w_i$s for $S_i$
$R_i$ Variance of the historical $v_i$s for $S_i$
CHAPTER 1
INTRODUCTION
1.1 Motivation
With the ever-increasing number of vehicles on roads in urban areas, traffic congestion has
become one of the most serious problems facing society, especially commuters.
In India, the problem is more prominent in metropolitan cities such as Mumbai, New
Delhi and Chennai. One of the reasons people are shifting to private transportation is the
unreliability of the public transportation systems (Bende, 2012). Holeywell (2013) points
out that travellers care most about getting picked up from their stop in 10 minutes or less
so that they can make their scheduled connections, and that they are not
so interested in whether their rides are crowded or whether they can find a seat.
In today's busy society, information regarding the arrival time or travel time of transport
from one place to another is becoming more and more valuable. With a schedule of predicted
arrival times at each bus stop available via variable message signs (VMS) or as a mobile or
web application, people can make timely plans for their upcoming activities and business,
which will reduce the anxiety caused by uncertain delays. Thus, there is a need for a system
that can inform travellers about the latest travel times of the concerned buses before they
make their transit plans. This may also attract more passengers to public transport, which
in turn can lead to less traffic congestion.
1.2 Background
Accurate estimation of travel times of public transportation is a challenging research
problem that has remained open for the past thirty years in the transportation research
community (Abkowitz, 1981; Polus, 1978). A simple prediction approach is to adopt the average
travel time derived from historical data. However, a constant estimate of the travel
time for a path does not capture the dynamic traffic conditions very well. Thus,
advanced techniques for travel time estimation were proposed in the early literature (Ghosh
and Knapp, 1978; Oda, 1990; Nihan and Holmesland, 1980). Even though the specific
approaches adopted in these studies are different, they share a common idea, i.e., to discover
certain regular patterns from the historical data collected over time. Some proposed to fit
historical data to statistical models such as Gaussian models, Bayesian networks and Markov
chains in order to facilitate statistical analysis (Polus, 1978; Sumi et al., 1990). Techniques
based on regression models learn from historical data. They involve building
regression functions for estimating travel time in terms of various external factors (Polus, 1979;
Ghosh and Knapp, 1978). A prediction is made by using the known values of those factors
under the current situation as input. Techniques based on time series models focus on discovering
internal relationships among historical time-series data in order to identify similar patterns
to make predictions under the current situation (Oda, 1990; Nihan and Holmesland, 1980).
However, the performance of the above approaches is highly constrained by the
quality/quantity as well as the types of data available. For example, conventional collection of
traffic data is typically conducted by surveys or using expensive sensors deployed along the
roads at specific locations to record arrival times, traffic flow volumes, and other statistics
of vehicles.
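As a rough illustration of the simple historical-average approach mentioned above, the following Python sketch computes the mean travel time per segment from a few past trips. The function name `historical_average` and the travel times are hypothetical, chosen only for illustration; they are not taken from the data used in this study.

```python
from statistics import mean


def historical_average(trips):
    """Return the mean travel time per segment over past trips.

    `trips` is a list of trips; each trip is a list of segment travel
    times (in seconds), and all trips cover the same segments in order.
    """
    # zip(*trips) groups the travel times of each segment across trips.
    return [mean(segment_times) for segment_times in zip(*trips)]


# Made-up travel times (seconds) for three past trips over four segments.
past_trips = [
    [60, 90, 120, 75],
    [70, 95, 180, 80],
    [65, 85, 150, 70],
]
baseline = historical_average(past_trips)
print(baseline)
```

Such a baseline gives the same estimate regardless of what the current trip has experienced so far, which is the limitation noted above.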
In recent years, due to the advent of positioning and wireless communication
technologies, wireless devices equipped with the Global Positioning System (GPS) have been widely
deployed on various private and public vehicles, generating massive amounts of vehicle
trajectory data which can be used for fleet management and other transportation applications.
Time-tagged location data, usually represented in the form of trajectories, bring great
potential for real-time prediction of vehicle travel times. Among public transportation
systems, the travel times of buses, which share roads with other vehicles, are more
difficult to predict than those of trains and subways, which run on exclusive paths. First, the travel
condition of a bus may easily be affected by various internal and external factors, including
accidents, weather, road construction, government policies and even temperature. Second,
for vehicles in metropolitan areas (such as Chennai), errors often exist in positional-data
acquisition due to interference by urban canopies and other sources of error. Thus, in
this study, we propose a hybrid prediction framework to estimate the travel time of buses
by exploiting selected historical trajectory data and an efficient state estimation technique
capable of making precise estimations from a series of travel time measurements.
1.3 Research Overview
Recently, research on discovering traffic patterns from historical data collected from
vehicles has received significant attention (Chen et al., 2011; Li and Rose, 2011; Tiesyte
and Jensen, 2009). These works show that traffic patterns exist in road segments and thus
could be used to predict the future traffic condition on the same segment and on a few
upcoming segments. This finding provides the basis for using similar trajectories to predict
the travel time of an ongoing bus journey. In this study, a new bus travel time prediction
framework, called Historical Trajectory and Kalman Filter based Travel/Arrival Time
Prediction (HTKFTP), is proposed for real-time prediction of travel time at upcoming segments
(and thus the arrival time at bus stops) of an ongoing bus journey. The basic idea behind
HTKFTP is to use a collection of historical trajectories similar to the current bus journey
to predict the travel times in future segments of the bus journey. Specifically, the HTKFTP
framework (i) identifies a set of similar trajectories as the basis for travel time estimation
instead of relying on only the one historical trajectory best matching the ongoing bus journey;
(ii) explores different features (e.g., travel times of passed segments as well as time/day of
the bus trajectories) to identify the sample set of similar trajectories; (iii) uses the similar
trajectories as inputs to the Kalman Filter based prediction method.
Several issues were faced in the design of the HTKFTP framework. For example, many
features are associated with the trajectories. Some of these features are categorical while
the others are numerical. Discriminative features and properly defined similarity functions
for those features needed to be used in order to identify a sample set of similar trajectories
effective for travel time prediction. To determine a set of similar trajectories based on travel
time on passed segments, the V-clustering algorithm, which partitions the whole spectrum
of travel times on a segment into a number of intervals (or clusters), was considered. To
determine a set of similar trajectories based on hours/days, exploratory data analysis
involving space-time trajectory plots of the historical trips was carried out. Accordingly, the
HTKFTP framework is able to retrieve the sample set of similar trajectories efficiently and
in turn use that sample set to estimate the travel times. To corroborate the proposed ideas
and evaluate the proposed prediction schemes, an empirical experiment using real bus
trajectory data collected in Chennai, India, was conducted. This research work has made a
number of significant contributions, as summarized below.
• A new framework, namely HTKFTP, for predicting the travel times over future
  segments of an ongoing bus journey based on historical trajectory data. The framework
  consists of two major components: (i) similar trajectory retrieval; and (ii) travel time
  estimation.
• A detailed data analysis to investigate the correlation between bus travel times in
  route segments and a number of trajectory features, e.g., passed segment travel time,
  hours, days, etc. Based on this analysis, a number of trajectory features were selected to
  identify similar trajectories.
• A clustering algorithm for passed segment travel times and space-time trajectory
  analysis in order to group similar trajectories together. These similar trajectory
  clusters allow a sample set of trajectories similar to the ongoing bus trajectory to be
  retrieved efficiently and effectively.
• An efficient state estimation technique based on the Kalman Filter, capable of making
  precise estimations by exploiting a series of travel time measurements in an inherent
  feedback mechanism. The base estimation scheme was modified to take into account
  the large variance in the data observed at selected locations/times.
Through a comprehensive experimental study using a real data set collected from buses
in Chennai, India, the proposed ideas were validated. The framework was evaluated in
terms of prediction accuracy. The experimental results show that the proposed prediction
scheme significantly outperforms the baseline and state-of-the-art schemes.
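To illustrate the idea of partitioning the travel times on a segment into intervals, the sketch below splits a sorted list of travel times at the point that minimises the weighted average variance of the two parts (compare $wav_i$ in the Notation). It is only a minimal illustration under assumed details: population variance, a single binary split, made-up travel times and hypothetical function names. The clustering algorithm actually used in this study is described in Chapter 4 and listed in Appendix A.

```python
from statistics import pvariance


def weighted_average_variance(values, i):
    """Weighted average of the variances of values[:i] and values[i:]."""
    left, right = values[:i], values[i:]
    total = len(values)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / total


def best_split(travel_times):
    """Split sorted travel times where the weighted average variance is lowest.

    Assumes at least two travel times so that both parts are non-empty.
    """
    values = sorted(travel_times)
    candidates = range(1, len(values))  # each part keeps at least one value
    i = min(candidates, key=lambda k: weighted_average_variance(values, k))
    return values[:i], values[i:]


# Made-up travel times (seconds) on one segment: free-flow and congested trips.
low, high = best_split([62, 58, 60, 121, 118, 125])
print(low, high)
```

Applying such a split repeatedly to each part would yield several intervals per segment; when to stop splitting is a design choice that this sketch does not address.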
1.4 Chapter Outline
The remainder of this report is organized as follows:
• Chapter 2 reviews some literature in the concerned area.
• Chapter 3 analyses the collected historical trajectory data.
• Chapter 4 gives an overview of the HTKFTP framework and details the design of the
  similar trajectory selection algorithm.
• Chapter 5 details the prediction scheme and the real-time prediction system design.
• Chapter 6 reports a comprehensive experimental study using the collected real data
  set of bus trajectories, queried in real-time.
• Finally, Chapter 7 concludes this work with a summary, followed
  by conclusions and scope for future work.
CHAPTER 2
LITERATURE REVIEW
This chapter reviews the past research that has motivated our work on predicting the
movement of vehicles. We begin with a brief history of traffic prediction, and then review the
major research that has focused specifically on similarity-based prediction of arrival/travel
times and on trajectory patterns.
2.1 A Brief History of Traffic Prediction
Research in transportation dates back to the 1930s. With few vehicles on
roads and under-developed technologies, it was then almost impossible to collect significant
data about traffic conditions. Thus, studies during that time were mainly about identifying
certain rules that could be used to guide traffic management and the construction of
transportation infrastructure. For example, the relations between traffic volumes and the weather
were reported by Johnson (1930). The study justified the improvement of road surfaces during bad
weather. As another typical example, Vey and Pope (1935) verified a definite
relationship between highway lighting and highway accidents: in general, where adequate
lighting is provided, there is a substantial reduction in night accidents.
With the development of technologies and the increasing number of vehicles on roads, more
data about traffic conditions could be collected, which led to the emergence of
research on traffic prediction in the 1950s. However, during this period, the traffic data adopted
in most cases were vehicle volumes (or flow) because they were easily collected by hand.
For example, in Glanville (1955), Lighthill and Whitham (1955) and Buckley (1968), to
obtain the vehicle volume on a road, observers were placed at certain locations to record
the number of vehicles passing by. Such an approach was inefficient and made it difficult
to collect a large amount of data. Therefore, arrival/travel time prediction did not arise
until the 1970s (Wong and Sussman, 1973; Sussman et al., 1974), when traffic sensors were
widely adopted, enabling researchers to have sufficient data for analysis.
Estimation of arrival/travel times, especially for buses, started to attract increasing
attention from the 1980s (Abkowitz, 1981; Polus, 1978; Sumi et al., 1990). As cities
developed, congestion became increasingly common, which
created a need to improve the quality of public transportation service. As the most important
aspect of public transportation service, arrival/travel time prediction became the most
critical topic in the traffic prediction area. At the early stage of research on this topic, constrained
by technologies, researchers had to work off-line on data collected from traffic sensors and
surveys. Since the development of GPS devices and wireless technologies, it has been possible
to collect large volumes of traffic-related data in real time. Therefore, real-time
arrival/travel time prediction has become a hot topic as these technologies are widely applied
in public transportation systems. Over the decades, researchers have applied different models
and methods to real-time arrival/travel time prediction. In Zhu et al. (2011), the authors
developed mathematical models taking into account the travel times on links, dwell times
at stops, and delays at intersections. The algorithm proposed in Lin and Zeng (2001)
provides real-time bus arrival information based on the bus location data, the schedule
information, the difference between scheduled and actual arrival times, and the waiting time at
time-check stops. Prediction methods based on historical data were also developed in Tiesyte
and Jensen (2008).
With the development of Artificial Intelligence, researchers have widely adopted Artificial
Intelligence methods in real-time arrival/travel time prediction. As a result, travel
time prediction approaches in the modern literature can be broadly classified as model based
and data driven. Model based approaches predict travel times using traffic flow models and
the underlying physical phenomena. For example, Krishnan and Polak (2008) explored
recurring themes in traffic conditions and used k-Nearest Neighbors (k-NN) for indirectly
predicting short term travel times using 15-minute aggregate flow data. Esawey and Sayed
(2011) used a VISSIM micro-simulation model (VISSIM is a microscopic traffic flow
simulation package) of downtown Vancouver to predict travel
times using traffic volume and travel time data of nearby segments. Kalman Filtering (KF)
is one of the most widely adopted methods in travel time prediction in the recent literature.
Vanajakshi et al. (2009) used a KF based method for predicting segment-wise travel times
(using the travel times of the previous two vehicles) in the heterogeneous traffic conditions
prevalent in Indian cities such as Chennai. KF takes into account the stochastic properties of the
process disturbance and the measurement noise. It works well for short-term prediction.
Other notable works in KF include Xu et al. (2008), Shalaby et al. (2004) and Zhu and
Wang (2000). Westgate et al. (2013) used a Bayesian model for travel time estimation of
ambulances using GPS data.
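To make the role of the process disturbance and the measurement noise concrete, the sketch below shows one predict/update step of a textbook scalar Kalman filter. It assumes the common model in which the travel time evolves as $x_{i+1} = a_i x_i + w_i$ and is measured as $z_i = x_i + v_i$, with $Q_i$ and $R_i$ the variances of $w_i$ and $v_i$ (symbols as in the Notation). The function name `kalman_step` and all numbers are hypothetical; the base KF algorithm actually used in this study is presented in Chapter 5 and may differ in detail.

```python
def kalman_step(x_post, p_post, a, q, r, z):
    """One predict/update step of a scalar Kalman filter.

    x_post, p_post: previous a posteriori estimate and its error variance
    a: evolution factor to the next segment
    q: process disturbance variance, r: measurement noise variance
    z: measured travel time on the next segment
    Returns (a priori prediction, a posteriori estimate, its error variance).
    """
    # Predict: propagate the estimate and its uncertainty to the next segment.
    x_prior = a * x_post
    p_prior = a * a * p_post + q
    # Update: blend the prediction with the measurement using the Kalman gain.
    gain = p_prior / (p_prior + r)
    x_new = x_prior + gain * (z - x_prior)
    p_new = (1.0 - gain) * p_prior
    return x_prior, x_new, p_new


# Made-up values: 60 s estimated on the current segment, 100 s then measured.
x_prior, x_new, p_new = kalman_step(
    x_post=60.0, p_post=4.0, a=1.5, q=5.0, r=14.0, z=100.0
)
print(x_prior, x_new, p_new)
```

A larger measurement noise variance lowers the gain, so the corrected estimate stays closer to the prediction; a larger process disturbance variance has the opposite effect.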
Data driven approaches predict travel time with the use of statistical relationships, which
are derived from historical data (travel times, speeds, volumes, etc.). The most commonly
reported data driven approaches in the literature include machine learning techniques, time
series analysis and historical averaging approaches. In machine learning techniques, the
prediction model learns some properties from several instances of historical data. For
example, Patnaik et al. (2004) used a machine learning technique called multivariate linear
regression for bus arrival time estimation using automatic passenger counter (APC) data.
Artificial neural networks (ANNs) are another widely used method. Liu et al. (2009)
used neural networks to indirectly predict travel times using traffic volume and flow data.
ANNs have the major advantage that they can model complex non-linear relationships. However,
they are limited by extremely long training times. Other notable works using ANNs include
van Lint (2006), Zou et al. (2008) and Batool and Khan (2005). Other machine
learning methods have also become popular in recent years. Real-time prediction using Support
Vector Regression (SVR) and Support Vector Machines (SVM) has become a hot topic recently.
For example, Wu et al. (2004) used SVR for travel time prediction using highway traffic
data. Vanajakshi and Rilett (2007), Vanajakshi and Rilett (2004) and Yu et al. (2006) are
other instances of the use of SVR. Similar to ANNs, SVR is too expensive in training for
real-time updates. In a time series analysis approach, temporal patterns are identified in the
historical data and future values are predicted with the assumption that these patterns hold
in the near future. For example, Guin (2006) used a time series analysis approach called
seasonal autoregressive integrated moving average (SARIMA) to predict travel times using
historical travel time data.
Though the model based approaches provide valuable insights into the mechanisms of
traffic flow and queue dynamics, their inherent limitations hinder their application in
real-time systems. The major disadvantages include high computational complexity, intensive
model/parameter calibration, the requirement for predicting traffic demand/capacity and the
degree of expertise required for design and maintenance. On the other hand, data driven
approaches can be deployed more quickly and cheaply than model based approaches.
They can provide scope for prediction when there is a large diversity (or variance) in the
historical data. In such cases, predicting using physical models, which are narrow in scope,
can be expensive. In this study, a data driven approach is chosen for travel time prediction
exploiting similar historical trajectories, as explained in the following sections.
2.2 Approaches Exploiting "Similarity"
Since bus trips repeat on the same route, more or less around the same time, on
different days, a similarity-based approach is a straightforward way to predict future
travel times. A great amount of work has been done on identifying similar trajectories or
similar time series, in both one dimension and multiple dimensions. Yi and Faloutsos (2000)
proposed using the Lp-norm to compute the Manhattan Distance or Euclidean Distance as a
measure of similarity. The Lp-norm is widely applied in various applications but is only applicable
to time series of the same length. Therefore, other similarity measures have been developed and
adopted. Berndt and Clifford (1994) introduced Dynamic Time Warping (DTW), which was
adopted later in Assent et al. (2009) and Vlachos et al. (2006). The concept of edit distance
was introduced in Levenstein (1966), and the most widely used distance based on edit
distance is the Longest Common Subsequence (LCSS) distance. Vlachos et al. (2002), Fashandi
and Moghaddam (2005) and Hermes et al. (2009) applied LCSS as the distance measure to
fetch similar trajectories or time series. However, these algorithms tend to emphasize the
overall similarity of the whole trajectory, without considering the similarity of trajectories
in individual segments or subsets of segments. Additionally, while LCSS and DTW are applicable to
trajectory data, they are highly sensitive to noise and errors. In this project, a similarity
measure of trajectories based on the similarity of corresponding individual segments is proposed.
Recently, prediction methods based on historical trajectory data have also been developed
in Jensen and Tie (2008), Tiesyte. and Jensen (2009) and Tiesyte and Jensen (2008). The
authors show that the similarity between historical trajectories and current position data
of a bus can be exploited to predict bus arrival time at bus stations. This shares the same
intuition with the development of the Historical Trajectory based Travel time Prediction
(HTTP) framework in Lee et al. (2012). The present project can be said to be built on top
of HTTP with the inclusion of additional features for the selection of similar trajectories
and the use of Kalman Filter for prediction. In Jensen and Tie (2008), the authors devel-
9
oped a system called TransDB, that searches the historical trajectory database for the most
similar trajectory based on the passed segments of the current bus trajectory in order to
make a good prediction. The basic idea is that, based on the proposed trajectory similarity
function, the nearest neighbourhood trajectory (NNT) and the trajectory of current bus ride
are anticipated to exhibit similar travelling behaviour (in terms of travel time). Based on
this assumption, the NNT serves as a good basis for predicting the future travel time of
current bus ride without explicitly taking into account various external and internal factors.
However, Lee et al. (2012) argue that the historical trajectory that is most similar to the
passed segments of the current bus trajectory alone may not provide the best prediction of
the on-going bus ride. Thus, they collect a set of similar trajectories and adopt a statisti-
cal approach to make predictions. Additionally, they exploit different features associated
with trajectories and develop different similarity functions to find similar trajectories that
make significantly more accurate travel time predictions. Our approach differs from HTTP
in that it is not a statistical approach. From the data analysis studies as explained in
Chapter 4, it was found that, though the statistical approach provides satisfactory predictions
for a few upcoming segments, the predictions get worse when they are made for a time far
into the future (future segments far from the current one). It was observed that the historical
trajectories within a temporal neighbourhood of 30 minutes around the ongoing trajectory
are more significant in improving the prediction accuracy for farther future segments.
Additionally, with the error feedback mechanism inherent in the KF based prediction algorithm,
the accuracy tends to improve from one future segment to the next one during prediction.
2.3 Trajectory Patterns
Patterns of historical trajectories are described in two classes: trend and periodicity. The
trend represents a general systematic linear or non-linear component that changes over
time and does not repeat or at least does not repeat within the time range captured by data.
The periodicity represents the component that repeats itself in certain intervals of time. In
Wu et al. (2003), the authors display the daily periodicity from historical data of travel
times around the same location. Zhu et al. (2009) also conducted an analysis to verify
the existence of periodicity of speeds over time on a route segment. Chen et al. (2011)
and Li and Rose (2011) verify the pattern by measuring the correlation between the traffic
on a specific route in different time periods. Vanajakshi et al. (2009) analysed travel time
variation plots in heterogeneous traffic conditions, using GPS trajectory data from the buses
in Chennai, India. From their analysis, they concluded that the travel time patterns were
more related for consecutive vehicles (with a headway 15 min) on the same day. Weekly
and daily patterns were not as significant as the above one. Hence, they used the travel times
of the previous two vehicles for prediction. Similarly, Kumar and Vanajakshi (2012) used
a statistical test to check whether the previous trips on the same day, the previous day(s)'
same-time trips or the previous week(s)' same-day/same-time trips are significant in predicting
the travel times of an ongoing trip. The authors concluded that the previous two weeks'
same-day/same-time trips and the previous three trips on the same day were significant and
could be included as inputs in the prediction model developed using a simple exponential
smoothing technique.
It is clear from the above attempts that the travel time behaviour of the vehicles moving
on fixed routes is not random. There exist significant patterns in travel times for trips made
around the same time of the day. Such patterns verify the possibility of using historical
data of a certain segment to predict the future traffic condition on the same segment.
This forms the basis for the development of the HTKFTP framework introduced in Chapter
1. The next chapter discusses the various kinds of analyses carried out on real-world bus
trajectory data to explore possible patterns in travel times.
CHAPTER 3
DATA ANALYSIS
Several analyses were carried out to explore the correlations and patterns in the historical
trajectory data comprising segment-wise travel times. Our goal in data analysis is two-fold:
To verify the suitability of using historical trajectories for prediction of future travel
times for an ongoing trajectory; and
To explore any patterns in travel time data that can be used for the prediction.
3.1 Raw Data
The raw GPS data used in this study were collected over a period of 4 months, from January
2014 to April 2014, from the Metropolitan Transport Corporation (MTC) buses running on
one of the busiest routes in Chennai, namely 19B, which connects Kelambakkam in the south
to Saidapet in central Chennai. Each bus is equipped with a GPS device that records the
status of the bus along with its movement and pushes the status to a central server every 10
seconds. Each data point consists of the GPS coordinates and the corresponding time-stamp
as shown in Table 3.1. Each bus and each route have their own identifications. Location
details of each bus stop on a selected route are collected and stored. Each bus station has its
own name, GPS coordinates as well as the IDs of the routes it belongs to, so that all the
bus stations for each route can be found. For a specific route, a bus station has a sequence
number among all bus stations belonging to this route. In most cases, there is more than
one bus travelling on a route. Each bus travels on a fixed route several times in a day.
Taking the data of the four months from January to April 2014, there were 28 buses in
total on the 19B route, which completed 3,686 trajectories running back and forth.
The north-bound 19B route with ID 1101 is chosen for analysis. This route has 15 stops,
with the origin at the Kelambakkam Bus Station and the last stop at Saidapet Bus Depot. It
covers a distance of 29.4 kilometres (i.e. 147 segments) and the average trip duration from
the origin to the destination is about 4000 seconds. From January to April 2014, there are
2,212 north-bound trajectories in total on this route.
Table 3.1: A sample of the raw data received from the GPS devices on the buses.
Timestamp            Longitude   Latitude
04-Apr-14 09:28:15   80.242317   13.005729
04-Apr-14 09:28:25   80.242317   13.005729
04-Apr-14 09:28:35   80.242241   13.005681
04-Apr-14 09:28:45   80.241928   13.005391
04-Apr-14 09:28:55   80.241828   13.004879
3.2 Data Transformation
From the raw data of time-stamp and latitude/longitude, other useful quantities such as
distance, cumulative distance, UNIX time, time difference and speed were calculated as
explained below. Distance (assuming straight line travel) between two consecutive GPS
locations of a bus was found out using the haversine formula as shown in Equation 3.1.
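\[
D = R \cos^{-1}(a + b) \tag{3.1}
\]
where
R = radius of Earth = 6371000 m (mean)
\[
a = \cos\left(\frac{\pi}{2} - \mathrm{lat}_1\right) \cos\left(\frac{\pi}{2} - \mathrm{lat}_2\right)
\]
\[
b = \sin\left(\frac{\pi}{2} - \mathrm{lat}_1\right) \sin\left(\frac{\pi}{2} - \mathrm{lat}_2\right) \cos(\mathrm{lon}_1 - \mathrm{lon}_2)
\]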
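As an illustration, Equation 3.1 can be evaluated as in the minimal sketch below. The function name and the degree-to-radian handling are assumptions made for this example and are not taken from the project code. Note that the D = R cos^-1(a + b) form is the spherical law of cosines written with colatitudes; for points only a few metres apart its result is sensitive to floating-point rounding, so the argument of the inverse cosine is clamped to the valid range.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius used in Equation 3.1


def gps_distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees.

    Hypothetical helper: evaluates D = R * acos(a + b) from Equation 3.1.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon1 - lon2)
    # a and b of Equation 3.1, written with colatitudes (pi/2 - latitude).
    a = math.cos(math.pi / 2 - phi1) * math.cos(math.pi / 2 - phi2)
    b = math.sin(math.pi / 2 - phi1) * math.sin(math.pi / 2 - phi2) * math.cos(dlon)
    # Clamp so that rounding error cannot push acos outside its domain.
    return EARTH_RADIUS_M * math.acos(max(-1.0, min(1.0, a + b)))


# Two consecutive, distinct GPS points from Table 3.1 (latitude, longitude).
d = gps_distance_m(13.005729, 80.242317, 13.005681, 80.242241)
```

For the two points above the result is roughly 10 m, which is plausible for a bus moving slowly over a 10 second interval.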
Table 3.2 shows sample transformed data. From the calculated distances, the corresponding
cumulative distance travelled till each GPS point was also calculated. The timestamp data,
initially in the format "dd-mm-yyyy HH:MM:SS" (a string), was converted into UNIX time
format [1]. This conversion speeds up several operations with the time-stamps. With the help
of these time-stamps, the time difference between each pair of consecutive GPS points was
calculated (column with heading Δt(s) in Table 3.2). Speed of the bus at a particular point
was calculated by dividing the corresponding value of distance by the time difference.
[1] The UNIX time form of time-stamp is the number of seconds (an integer) passed since
00:00:00 hours, January 01, 1970 till the timestamp under consideration.
Table 3.2: A sample of data records after transformation.
UnixTime(s)   Lon(°)      Lat(°)      Δt(s)   Dist(m)     CumDist(m)   Speed(m/s)
1379390422    80.127693   12.923      10      43.1092     294.1599     4.31092
1379390432    80.127899   12.922989   10      22.384146   316.544      2.238414
1379390443    80.128448   12.92292    11      60.106274   376.6503     5.464206
1379390453    80.128997   12.922869   10      59.87249    436.5228     5.987249
1379390463    80.129661   12.922829   10      72.165902   508.688      7.21659
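The transformation steps described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the timestamp pattern follows the sample rows of Table 3.1, the strings are interpreted as UTC, and month names are assumed to be parsed in an English locale. None of these choices is stated in the report.

```python
from datetime import datetime, timezone


def to_unix(stamp, fmt="%d-%b-%y %H:%M:%S"):
    """Convert a timestamp string to whole seconds since 1970-01-01.

    Assumption: the string is interpreted as UTC.
    """
    parsed = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def transform(unix_times, distances):
    """Derive time difference, cumulative distance and speed per GPS point.

    distances[k] is the distance from point k-1 to point k in metres;
    distances[0] is ignored because the first point has no predecessor.
    Returns tuples (unix_time, dt, dist, cum_dist, speed).
    """
    rows = []
    cum_dist = 0.0
    for k in range(1, len(unix_times)):
        dt = unix_times[k] - unix_times[k - 1]
        cum_dist += distances[k]
        # Speed is undefined when two logs share the same time-stamp.
        speed = distances[k] / dt if dt > 0 else float("nan")
        rows.append((unix_times[k], dt, distances[k], cum_dist, speed))
    return rows


rows = transform([1379390422, 1379390432, 1379390443],
                 [43.1092, 22.384146, 60.106274])
```

With the first three records of Table 3.2 as input, the last row reproduces the 11 second gap and the speed of about 5.46 m/s listed in the table; the cumulative distance differs from the table because the sketch starts counting at the first record supplied.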
3.2.1 Data Cleaning
There were several stray records in the raw data. These could be detected using the distance
and time difference values. Some records had distance more than 1000 metres in a 10
second interval which is impossible since the corresponding speed becomes more than 360
km/h. This may be because of errors in (or misplacement of) the longitude and latitude
values. A higher value of time difference implies the absence of several GPS logs. The
distance in these cases is also inaccurate (since we assumed straight line travel and the bus
might have undergone several changes in direction in a long time). Such data were detected
in an automated way and were not considered for analysis.
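A minimal sketch of such an automated check is given below. The speed bound corresponds to the 1000 metres in 10 seconds mentioned above; the limit on the time gap is a hypothetical value chosen for illustration, since the report does not state the threshold that was used. The record layout (dictionaries with "dt" and "dist" keys) is likewise an assumption.

```python
MAX_SPEED_MPS = 100.0  # 1000 m in 10 s, i.e. 360 km/h
MAX_GAP_S = 30         # hypothetical limit on the gap between two GPS logs


def is_stray(dt, dist):
    """Return True if a record should be excluded from the analysis."""
    if dt <= 0 or dt > MAX_GAP_S:
        # Missing logs or duplicated time-stamps: distance is unreliable.
        return True
    return dist / dt > MAX_SPEED_MPS


def clean(records):
    """Keep only the records that pass the distance and time-gap checks."""
    return [r for r in records if not is_stray(r["dt"], r["dist"])]


sample = [
    {"dt": 10, "dist": 43.1},     # plausible record, kept
    {"dt": 10, "dist": 1500.0},   # implies 150 m/s, dropped
    {"dt": 120, "dist": 300.0},   # several GPS logs missing, dropped
]
kept = clean(sample)
```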
3.2.2 Extracting Trip Data
The daily data files for each device included multiple trips made by that bus in that day. The
first task was to extract the data trip-wise and store it in separate CSV files. Each such trip file
consisted of about 600 records with the first record corresponding to departure of the bus
from the origin bus station and the last one corresponding to the arrival at the destination
bus station. There were 3 - 4 trips made by each bus per day. Each trip file was named in
the following name format: "IMEI_date_start time_direction.csv", where IMEI and
date are the IMEI number corresponding to the device and the date of data. Start time is the
timestamp of the first record of the file (i.e. departure time) and direction implies whether
it is a north bound trip or a south bound trip.
After this extraction, the cumulative distance was updated for each trip separately. The
cumulative distance and the UNIX time were used for plotting the space-time trajectories
(cumulative distance vs. cumulative time) for further analysis as discussed in Section 3.4.
3.2.3 Calculation of Segment-wise Travel Times
For the calculation of travel time, the study routes were discretised into smaller segments
of length 200 meters each. These segments had fixed end points which were maintained
throughout the analysis for calculating the historical travel times. For a particular route,
these segmental travel times were stored in a grid layout where each column represented a
segment and each row represented a trip. Thus, a column consisted of the travel times on a
particular segment for all the trips over four months and a row consisted of the travel times
on all the segments of the route for a particular trip.
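One way to obtain such a row of segment-wise travel times from the space-time data of a trip is sketched below. Since GPS points rarely coincide with the 200 meter segment boundaries, the sketch assumes that the crossing time of each boundary is found by linear interpolation between the two neighbouring GPS points; the report does not state how the crossing times were obtained, and the function names are hypothetical.

```python
SEGMENT_LENGTH_M = 200.0


def crossing_time(boundary, cum_dist, times):
    """Linearly interpolate the time at which the bus passes `boundary`."""
    for k in range(1, len(cum_dist)):
        d0, d1 = cum_dist[k - 1], cum_dist[k]
        if d0 <= boundary <= d1 and d1 > d0:
            frac = (boundary - d0) / (d1 - d0)
            return times[k - 1] + frac * (times[k] - times[k - 1])
    return None  # the trip does not cover this boundary


def segment_travel_times(cum_dist, times, n_segments):
    """Return one row of the grid: the travel time on each segment."""
    marks = [crossing_time(SEGMENT_LENGTH_M * s, cum_dist, times)
             for s in range(n_segments + 1)]
    return [None if a is None or b is None else b - a
            for a, b in zip(marks, marks[1:])]


# Cumulative distance (m) and elapsed time (s) of a short, made-up trip.
row = segment_travel_times([0, 100, 300, 400], [0, 10, 30, 50], n_segments=2)
```

Stacking one such row per trip gives the grid described above, with one column per segment.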
3.3 Correlation Between Segments
Analysis was carried out to find correlations between the segments, in order to check that the
choice of passed segment travel times as a similarity measure to find historical trajectories
is suitable for prediction of future segment travel times. If a correlation in terms of travel
times exists between segments, previous segments can be taken as related to later segments
along the route. Given a current trajectory and its similar historical trajectory in terms of
passed segments, their future travel times are then also similar with a high probability. We use
Pearson's correlation as the tool to measure the correlation between segments. Pearson
Product Moment Correlation (Pearson's correlation for short) is widely used to measure
the linear association between two variables. The value of the Pearson's correlation coefficient
always falls between -1 and 1. Positive values mean positive correlations and negative
values mean negative correlations. The farther the value from 0, the stronger is the correlation.
Given two variables X and Y, with means $\bar{X}$ and $\bar{Y}$ and standard deviations
$\sigma_X$ and $\sigma_Y$, the correlation between them is computed as,
\[
\rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\,\sigma_X \sigma_Y} \tag{3.2}
\]
where n is the number of elements in X and Y. The farther two segments are from each
other, the weaker will be the influence of one on the other.
Figure 3.1 can be used to read off the Pearson's correlation for any two segments as well
as its trend with distance. Y-axis values are the Pearson's correlation coefficients
between two travel time arrays (historical travel times corresponding to two segments) and
the X-axis shows the number of segments between them, which is termed the Segment-
Distance. For example, given a Pearson's correlation between segment 20 and segment 25,
a corresponding point is drawn on the figure with the X-value being 5 (which is 25 minus
20). However, such a figure is not able to offer a clear illustration of the change of Pearson's
correlation because there are too many points for each X-axis value. To solve this problem,
we plotted Figure 3.2, which represents the average value of all points for each X-value.
As shown in Figure 3.1, a Pearson's correlation commonly exists between arbitrary
segments. However, the correlation does not appear to be high for most pairs of
segments. Specifically, when two segments are near each other, the Pearson's correlation
is clearly higher than for other pairs. Therefore a segment is more related to
nearby segments than to farther ones. Figure 3.2 shows a clear decline from 1
along the X-axis. Based on this, it can be concluded that the segments closer to the one being
analysed are the most correlated ones and can be used as input for prediction.
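The analysis behind the two figures can be sketched as follows, assuming the grid of segment-wise travel times is available as a list of rows (one per trip). The function names are hypothetical. The first function is a direct transcription of Equation 3.2 with sample standard deviations; the second averages the coefficients of all segment pairs that share the same Segment-Distance. Segments with identical travel times in every trip (zero standard deviation) would need to be excluded beforehand, which is not handled here.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * sd_x * sd_y)


def average_by_segment_distance(grid):
    """Map each Segment-Distance to the mean correlation of its pairs.

    grid[trip][segment] holds the travel time of a trip on a segment.
    """
    columns = list(zip(*grid))  # one tuple of travel times per segment
    buckets = defaultdict(list)
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            buckets[j - i].append(pearson(columns[i], columns[j]))
    return {dist: sum(vals) / len(vals) for dist, vals in sorted(buckets.items())}


# Three made-up trips on a route with three segments.
averages = average_by_segment_distance([[1, 2, 3], [2, 4, 1], [3, 6, 2]])
```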
Figure 3.1: Pearson's correlation coefficients versus the segment distance.
Figure 3.2: Average correlation coefficient versus the segment distance.
3.4 Travel Time Patterns
The second goal of data analysis was to explore any pattern inside the data that could be
used for the prediction. Intuitively, travel times of a segment should not only be related to
those of nearby segments, but also to other segment specific or traffic related parameters. For
example, in a city area, the traffic conditions are usually the worst during the peak hours in
the morning and evening. Therefore, we can associate the travel times with a temporal feature.
Similarly, the travel times on the same segment may differ between weekdays and
weekends. On weekdays, the travel time may be higher than that on weekends.
The present study analyses two of those patterns, which are the most common, namely the
day-wise pattern and the time-of-the-day pattern. During peak hours in the morning and evening,
congestion happens with a high probability. Therefore the travel time of a segment may be
high during peak hours and low in off-peak hours. To visualize the travel time variation
within a day, within-day travel times are grouped into 14 time periods [2] of 1 hour each.
Figure 3.3 shows the variations in travel times along a day for two typical segments, namely
Segment 28 and Segment 100. For each of these segments, travel times are assigned to
the selected 14 bins in terms of the hour in which they happened. The Y-axis represents
the travel time in seconds. For each box plot, the thick line in the middle of the box is the
median. The upper edge and lower edge of the box are the 75th and 25th percentiles of the
data, respectively. Some data regarded as outliers are shown as bubbles (outside the upper
and lower fences [3]).
[2] The usual working hours of the MTC buses.
[3] The upper fence (end point of dotted line) is calculated as Median + 1.5(IQR) and the lower fence as
Median - 1.5(IQR), where IQR = the inter-quartile range, i.e., the difference between the 75th and 25th
percentile values of the data.
(a) Variation of travel times on Segment 28 across the hours of the day
(b) Variation of travel times on Segment 100 across the hours of the day
Figure 3.3: Travel time analysis by hours of the day
It can be seen that the travel times in the morning from 8 am to 10 am and in the evening
from 5 pm to 7 pm are relatively higher than others. It can be expected that travel times
on a segment happening in peak hours are more similar to those in other peak hours, and
travel times in off-peak hours are also likely to be similar to each other. As a general
rule, travel times which occurred around the same time of the day are more similar to
each other. This forms the basis for the temporal neighbourhood scheme introduced in
Chapter 4. According to the scheme, historical trajectories which occurred within a fixed
temporal neighbourhood (of 30 minutes or 1 hour) of the test trajectory are more reliable
for prediction than those outside the neighbourhood.
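The idea of a temporal neighbourhood can be illustrated with the short sketch below. It assumes that trajectories are compared by the time of day of their departure (taken from the UNIX time-stamp) and that the neighbourhood size is a radius on either side of the test trajectory; both are assumptions made for this example, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600


def time_of_day(unix_time):
    """Seconds elapsed since midnight for a UNIX time-stamp."""
    return unix_time % SECONDS_PER_DAY


def in_temporal_neighbourhood(hist_start, test_start, radius_s=1800):
    """True if both trips started within `radius_s` of each other's time of day."""
    gap = abs(time_of_day(hist_start) - time_of_day(test_start))
    gap = min(gap, SECONDS_PER_DAY - gap)  # wrap around midnight
    return gap <= radius_s


def select_neighbours(history_starts, test_start, radius_s=1800):
    """Indices of the historical trips inside the temporal neighbourhood."""
    return [k for k, start in enumerate(history_starts)
            if in_temporal_neighbourhood(start, test_start, radius_s)]


# Test trip starts at 09:00 on day 10; two historical trips on day 3.
test_start = 10 * SECONDS_PER_DAY + 9 * 3600
history = [3 * SECONDS_PER_DAY + 9 * 3600 + 20 * 60,   # 09:20, inside
           3 * SECONDS_PER_DAY + 9 * 3600 + 40 * 60]   # 09:40, outside
neighbours = select_neighbours(history, test_start)
```

Because only differences in the time of day are used, the result does not depend on the time zone in which the time-stamps were recorded.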
Figure 3.4: Comparison between weekday peak and weekday off-peak trips.
Figure 3.4 shows the space-time trajectories [4] for all the 2,212 trajectories. The blue
trajectories happened in the peak hours whereas the green ones happened in the off-peak
hours during the weekdays. Clearly, the peak hour trajectories have more variance than
those in off-peak hours.
Figure 3.5 shows a heat-map that represents the correlation matrix which was obtained
by binning all the historical travel times on Segment 28 into 14 bins (corresponding to the 14
working hours in a day) and calculating the Pearson's correlation coefficients among them.
The diagonal squares are all white (correlation = 1) since these represent the correlation of
one bin with itself. It is clear from the heat-map that the squares closer to the diagonal are
whiter than those away from the diagonal. This means that the historical travel times which
occurred temporally closer (within a radius of 1-2 hours) to each other are more correlated
to each other. This conclusion forms the basis of the temporal neighbourhood feature for
the selection of similar historical trajectories, as discussed in Section 4.7.
Travel times are not only related to the hour in which they happen, but also to the day on
which they happen. To verify the correlations between travel times and the day, we classified
the days into 2 classes, namely weekday and weekend. As Figure 3.6 indicates, travel times
on weekdays have higher variance than those on weekends. Thus, the assumption of taking
weekday/weekend as a discriminative feature for trajectory selection is also valid. A similar
analysis across different days of the week is shown in Figure 3.7; the weekdays
are not distinctly different from each other and hence they were not separately analysed.
[4] A plot between the cumulative distance and the cumulative time taken to cover that distance.
Figure 3.5: Correlations between travel times occurring in different hours of the day
Figure 3.6: Comparison between weekday and weekend trips.
From the above analyses, it can be concluded that several patterns exist in the travel
times of buses moving on the same route. In the present case, the weekday/weekend pattern
and the intra-day hourly pattern are the most significant. Based on these patterns, two
schemes based on the temporal features of the trajectories are proposed in Chapter 4. From
the correlation analysis, it was concluded that the correlation between closely spaced segments
is significant. This forms the basis for the passed segments scheme proposed in the
next chapter.
Figure 3.7: Comparison between the weekdays.
CHAPTER 4
THE FRAMEWORK AND THE CLUSTERING
ALGORITHM
Through the data analysis presented earlier in Chapter 3, we observed the correlations
between the segment travel times and the various trajectory features. Cluster analysis was
adopted for the identification of the most correlated trips and is discussed in this chapter
along with the other schemes based on temporal features. Using the identified trips as input,
a novel travel time prediction framework, called Historical Trajectory and Kalman Filter
based Arrival/Travel Time Prediction (HTKFTP), based on a large collection of historical
bus trajectories is developed. This chapter focuses on the historical trajectory selection
part whereas the prediction algorithm is discussed in detail in Chapter 5. The section
below defines the necessary terminology that is used in this framework.
4.1 Terms and Definitions
Since the buses are travelling on fixed routes, the geometrical routes in a two-dimensional
space can be represented in a one-dimensional space, where the position of each point on
the route is the distance from the start of the route. A route can be considered as consisting
of points on it and in a classical way, we choose a series of points to represent a route. In
our case, each point on the route is at a distance from the origin which is a multiple of
200 meters (along the route), so that the entire route is split into segments of 200 meters
length. This choice, of having smaller segments to represent the route, was made to closely
capture the pattern of segment-wise travel times along the route for a particular journey.
The various terms used in this study are defined below.
1. A raw route $R_{raw}$ is represented as a sequence of points, $p_0, p_1, \ldots, p_n$.
Each point, $p_i$, stands for the starting point of the $i$-th segment and its value denotes the
total distance along the route from the starting point of the route to the end of the $(i-1)$-th
segment. Thus, $p_i < p_{i+1}$. $n$ is the total number of points on the route including the origin
and destination.
2. A segment $S_i$ is a part of a route between two adjacent points $p_i$ and $p_{i+1}$.
3. A route $R$ is represented by a sequence of segments, $S_0, S_1, \ldots, S_{n-1}$.
The value of $S_i$ denotes $p_{i+1} - p_i$. Our goal is to predict the arrival times at the bus stops.
Each bus stop has a latitude and longitude which lies on the route.
4. A route is also represented by a sequence of bus stops, $B_0, B_1, \ldots, B_m$.
The value of $B_i$ denotes $S_0 + \ldots + S_{l-1} + d$ if the bus stop $B_i$ lies on the segment $S_l$.
Here, $d$ is the distance along the route from $p_l$ (end point of segment $S_{l-1}$) to the location
of bus stop $B_i$. The trajectory data of a bus journey consists of a series of time-stamped
locations of the bus on the route.
5. A raw trajectory $T_{raw}$ is represented as a sequence $(p_0, t_0), \ldots, (p_n, t_n)$.
$p_i \in R$ and $t_i$ denotes the travel time from $p_0$ to $p_i$.
6. A trajectory $T$ is a sequence $t_0, \ldots, t_N$.
$t_i$ denotes the travel time on $S_i$ and $N$ ($= n - 1$) is the number of segments on the
route. During a complete trajectory, a bus generates travel times on all the segments on the
route. Therefore, given $M$ historical trajectories, there are $M$ travel times for each segment.
7. For a segment $S_i$, there is a corresponding sequence of travel times, $t^{S_i}_0, \ldots, t^{S_i}_{M-1}$.
$t^{S_i}_j$ is the travel time of a bus on this segment in the $(j+1)$-th trajectory and $M$ is the
number of historical trajectories. This sequence is denoted as $ST_i$ for $S_i$.
For a particular route, all the historical trajectory data can be stored in a table format in
which the columns represent the attributes of the trips such as start time from the origin,
date of trip and the travel times on each segment whereas the rows (or records) represent
the individual trajectories. As discussed later in this chapter, the historical travel times on
a particular segment are clustered into smaller groups so as to minimize the within-cluster
variance for each group.
8. Given a sequence of historical travel times $ST_i$ for a segment $S_i$, it can be split into
a sequence of intervals (or clusters) $SC_i := C^{S_i}_0, \ldots, C^{S_i}_{K-1}$.
$K$ is the number of travel time clusters for $S_i$. Note that $K$ is a random variable which
depends on the variance of the historical travel times on the segment. The process from $ST_i$
to $SC_i$ is explained later in this chapter.
4.2 Problem Formulation
Consider a bus route $R := S_0, \ldots, S_{N-1}$ with $N$ segments. For a bus travelling on segment
$S_i$, its current (incomplete) trajectory $T_{curr}$ can be represented as a sequence of travel times
on the passed segments, i.e. $T_{curr} := t^{curr}_0, \ldots, t^{curr}_i$ ($0 \le i \le N$). Let $d$ be the
distance of the bus from $p_{i+1}$ (end point of $S_i$) along the route. Suppose the bus stop $B_j$ at
which to predict the bus arrival time lies on $S_l$ ($l > i$) and let $d'$ be the distance of $B_j$ from
$p_l$ (start point of $S_l$).
Given $M$ historical trajectories and $T_{curr}$, we aim to develop an effective framework to
predict the travel times $\hat{t}_i, \ldots, \hat{t}_l$ on the segments $S_i, \ldots, S_l$. The arrival time of the
bus at $B_j$ is given by,
\[
A_{B_j} = T + \frac{d}{S_i} \cdot \hat{t}_i + \hat{t}_{i+1} + \ldots + \hat{t}_{l-1} + \frac{d'}{S_l} \cdot \hat{t}_l \tag{4.1}
\]
where $T$ is the current time and $A_{B_j}$ is the arrival time at $B_j$.
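The arithmetic of Equation 4.1 can be illustrated with the minimal sketch below. The function and argument names are hypothetical: `d_bus` is the distance of the bus from the end point of its current segment, `d_stop` is the distance of the bus stop from the start point of the segment on which it lies, and the predicted travel times are simply supplied as a list rather than produced by the prediction algorithm.

```python
def predict_arrival_time(now, d_bus, d_stop, seg_len, t_hat, i, l):
    """Arrival time at a bus stop according to Equation 4.1.

    now     : current time T in seconds
    seg_len : seg_len[k] is the length of segment S_k in metres
    t_hat   : t_hat[k] is the predicted travel time on S_k in seconds
    i, l    : indices of the current segment and of the stop's segment (l > i)
    """
    remaining_current = (d_bus / seg_len[i]) * t_hat[i]
    full_segments = sum(t_hat[k] for k in range(i + 1, l))
    partial_last = (d_stop / seg_len[l]) * t_hat[l]
    return now + remaining_current + full_segments + partial_last


# Bus on segment 3 with 50 m left; the stop lies 100 m into segment 6.
seg_len = [200.0] * 7
t_hat = [0, 0, 0, 40, 30, 50, 20]
arrival = predict_arrival_time(1000, 50.0, 100.0, seg_len, t_hat, i=3, l=6)
# 1000 + (50/200)*40 + (30 + 50) + (100/200)*20 = 1100
```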
4.3 Overview of the Framework
In this section, we first provide an overview of the proposed HTKFTP system framework
and then discuss the details of the cluster analysis (in the passed segments scheme) carried
out for pattern identification. Figure 4.1 shows the system design of the HTKFTP
framework. As illustrated, the proposed HTKFTP system (i.e., a location based service)
continuously collects bus trajectory data from GPS-equipped buses which report the latest
bus status including time-stamped geographical coordinates of the bus and instant speed.
The HTKFTP server is responsible for receiving and storing the trajectory data, monitoring
the incomplete trajectories of on-going buses, and making prediction of bus travel time on
the routes in response to (i) passenger enquiries and (ii) real time updates of bus arrival
times at bus stops. As shown in Figure 4.1, the HTKFTP server consists of three modules:
a) Real-time Bus Status Monitoring (RBSM) module; b) Travel Time Prediction (TTP)
module; and c) Nearest Neighbour Search (NNS) module.
Figure 4.1: Overall architecture of the HTKFTP framework
The RBSM module is responsible for communicating with the buses to receive bus
status information and GPS data updates of the on-going trajectories. Once an update from
a bus b reaches the server, RBSM catches the status (such as current bus coordinate and
new time stamp) of b, extracts features associated with the developing trajectory $T_b$, and
stores the information as part of $T_b$ in the historical trajectory repository.
The TTP module is responsible for predicting the arrival times of buses at bus stops,
which can be reduced to a problem of predicting the travel times of buses on their remaining
route segments. As mentioned, the TTP module can be invoked to make predictions by
(i) a passenger enquiry; or (ii) the real-time updates of bus arrival information at stops. The
former arrives on demand and the latter happens periodically. In this study, for simplicity,
we focus on predicting the travel time of a bus, given its current location, on the remaining
segments of its journey on the bus route. Moreover, instead of constantly making predictions,
we assume that TTP is invoked every time RBSM receives an update that the bus has
crossed a segment (including the GPS data of bus location) and passes the required input
parameters for prediction to TTP. Our idea behind the TTP module is to use a few best
matches for the ongoing trajectory as inputs to the Kalman Filter, which efficiently predicts
the travel times on the future segments by employing a robust mechanism.
As illustrated in the figure, TTP relies on the NNS module to search for similar trajectories
effectively and efficiently. As there could be different ways to identify the sample set of
similar trajectories, different notions of similarity could be explored to ensure the effectiveness
of TTP. On the other hand, with a massive amount of historical data, it is infeasible to
make an exhaustive comparison of the trajectory of the current bus journey against all the
historical trajectories in the database. To ensure the search efficiency, we create indices of
trajectories and related patterns in the NNS module to avoid retrieval of irrelevant trajectories
that are not helpful for our travel time estimation. In other words, we only fetch a
relatively small set of candidate trajectories and return them to TTP.
The HTKFTP is introduced above as a general framework to support travel time prediction.
The remaining task is to devise similarity trajectory based prediction schemes which
first invoke the NNS module to retrieve a sample set of trajectories for making effective
travel time estimation in the TTP module. Based on our data analysis, we observed the
travel time correlation between two segments and the travel time patterns corresponding to
some temporal features such as hours and days. Therefore, we follow these observations
to introduce two schemes based on passed segments and temporal features. As their names
suggest, these two schemes use the passed segments (PS) and temporal features (TF) of an
on-going bus journey, respectively, to identify similar trajectories for prediction.
4.4 Trajectory Search based on Passed Segments Scheme
In the PS scheme, the prediction is done by finding the historical trajectories "similar" to the
current one in terms of the travel times on the segments already crossed by the moving bus.
Thus a similarity measuring algorithm has to be taken into consideration. As mentioned
before, the conventional algorithms measuring similarity of time series, such as Lp-norm,
Dynamic Time Warping (DTW) and Longest Common Subsequence (LCSS), are not appropriate
in this project because those algorithms are highly sensitive to any error or outlier
in the data. As a result, a slight variation in the collected data might result in dramatic
mismatches between the current trajectory and historical trajectories. In addition, these
algorithms only evaluate the overall similarity of the whole trajectory, rather than the
similarity of trajectories with respect to each segment. To address the above problems, we
propose a new similarity measure that takes into account the similarity between two
trajectories on each segment. Given two trajectories, $t_0, \ldots, t_n$ and $t'_0, \ldots, t'_n$, we
compare each pair of travel times $t_i$ and $t'_i$. If the difference between each pair is less
than a threshold specific to that segment, the two trajectories are considered "similar". This
method improves on the conventional distance measure algorithms in that, for two "similar"
trajectories, not only the whole trajectories but also the corresponding segments should be similar.
However, this method is limited by the low efficiency that is caused by searching for
similar travel times based on each segment, especially when the number of historical
trajectories is large. To overcome this, we can allocate travel times into clusters and match the
current travel time to the cluster averages to find the appropriate cluster.
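The segment-wise similarity test described above (before the clustering step is introduced) can be sketched as follows. The function names and the threshold values are hypothetical; the report does not specify at this point how the segment-specific thresholds are chosen. Only the segments already passed by the current bus are compared.

```python
def is_similar(current, historical, thresholds):
    """True if every passed-segment travel time differs by less than its threshold.

    current    : travel times on the segments passed so far
    historical : travel times of a complete historical trajectory
    thresholds : one threshold per segment (hypothetical values)
    """
    # zip stops at the shortest input, i.e. at the last passed segment.
    return all(abs(c - h) < th
               for c, h, th in zip(current, historical, thresholds))


def similar_trajectories(current, history, thresholds):
    """Indices of the historical trajectories similar to the current one."""
    return [k for k, traj in enumerate(history)
            if is_similar(current, traj, thresholds)]


history = [[20, 155, 63, 29],
           [32, 262, 69, 33],
           [15, 89, 55, 21]]
matches = similar_trajectories([20, 150], history, thresholds=[10, 10, 10, 10])
```

This exhaustive scan visits every historical trajectory for every query, which is the inefficiency that the clustering approach is meant to avoid.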
To better illustrate this problem, we provide the following example. For a specific route,
given a number of historical trajectories on this route, we can create a table with attributes
(columns) corresponding to the travel times of each segment of the route and a record
corresponding to each historical trajectory (Table 4.1).
Table 4.1: An example of segment-wise travel times on historical trajectories.
Trajectory ID   Segment0   Segment1   Segment2   Segment3
Trajectory1     20         155        63         29
...             ...        ...        ...        ...
TrajectoryM     t0         t1         t2         t3
By using the clustering algorithm that will be discussed next, we partition each column
into several non-overlapping ranges. Each range contains at least one value and each value
falls in only one range. Table 4.1 can then be transformed into a table as shown in Table 4.2,
where the number of ranges for each segment is not necessarily equal.
Table 4.2: An example of partitioned segment-wise travel times after application of the
clustering algorithm.
Segment0   Segment1   Segment2   Segment3
15;20      75;89;90   55         21
32         155;180    61;63      26;29
68         262        69;73      33;36;37
82;93      -          77         -
4.5 The Clustering Algorithm
In this section we consider an algorithm used to partition each sequence of travel time
values, $ST_i$, into a sequence of clusters $SC_i$. Since $ST_i$ is a sequence of numerical values,
it can be represented in a one-dimensional space, and splitting such a sequence amounts to
allocating a set of one-dimensional data into clusters. Lee et al. (2012) compared two robust
clustering algorithms, namely K-means and V-clustering. As they pointed out in their work,
the K-means algorithm has two limitations. Firstly, the initial cluster centroids are chosen
randomly and different choices may cause different clustering results. Secondly, it is hard
to determine the value of K, since there is no common guideline for choosing it.
Using experiments with real-world data, they also found V-clustering to perform better
(in the passed segment scheme) than K-means. So, in
this study, we concentrate only on the V-clustering algorithm.
This V-clustering algorithm was introduced by Yuan et al. (2010) to allocate a sorted
list of one-dimensional data into clusters. In this algorithm, a list of values is first sorted.
Then it is split into clusters in an iterative manner. At each iteration, the list is split into two
parts and the weighted average variance (WAV) is calculated for the resulting child lists.
The optimum split is the one that minimizes the WAV of the resulting child lists. The
WAV for a split at the $i$th element of the list $L$ is defined in Equation 4.2.
$$ wav_i = \frac{|L_i^1|}{|L|} Var(L_i^1) + \frac{|L_i^2|}{|L|} Var(L_i^2) \tag{4.2} $$
where $|L_i^1|$ and $|L_i^2|$ are the cardinalities of the resulting child lists for the $i$th
split, $|L|$ is the cardinality of the parent list, and $Var(L_i^1)$ and $Var(L_i^2)$ are the
variances of the child lists. The list is recursively partitioned so
that the running time of the clustering algorithm for a segment with $M$ historical travel
times becomes $O(\log M)$. Hence, the running time for the entire trajectory database is
$O(N \log M)$ (which is fast), where $N$ is the total number of segments on the route. The
iteration is stopped when each cluster is left with a minimum number of travel times (or
minimum number of trajectories, MNT), which is a tunable parameter (i.e. its value is
decided to strike a balance between minimizing the errors in prediction and maximizing the
computational speed). Each cluster for a segment is associated with a cluster average, i.e.,
the average of all the travel times in it. The values of the various parameters of
the clustering algorithm are selected after the experiments with real-world data discussed
in Chapter 6.
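The splitting procedure of this section can be illustrated with a short Python sketch. It is not
the listing of Appendix A: the names `wav` and `v_cluster` are hypothetical, the population
variance is assumed for Var, and the stopping rule (split only while both children keep at least
`mnt` values) is one possible reading of the MNT criterion.

```python
from statistics import pvariance


def wav(values, i):
    """Weighted average variance of splitting sorted `values` into values[:i] and values[i:]."""
    left, right = values[:i], values[i:]
    n = len(values)
    return len(left) / n * pvariance(left) + len(right) / n * pvariance(right)


def v_cluster(values, mnt):
    """Recursively split travel times so that every cluster keeps at least `mnt` values."""
    if mnt < 1:
        raise ValueError("mnt must be at least 1")
    values = sorted(values)
    # Assumed stopping rule: a list is split only if both children can hold mnt values.
    if len(values) < 2 * mnt:
        return [values]
    candidates = range(mnt, len(values) - mnt + 1)
    best = min(candidates, key=lambda i: wav(values, i))
    return v_cluster(values[:best], mnt) + v_cluster(values[best:], mnt)


def cluster_averages(clusters):
    """Average travel time of each cluster, used later for matching."""
    return [sum(c) / len(c) for c in clusters]
```

For the six values 15, 20, 32, 68, 82, 93 and an MNT of 2, this sketch returns the two clusters
[15, 20, 32] and [68, 82, 93].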
4.6 Nearest Neighbour Search in Passed Segments Scheme
Given a current trajectory with travel times $t_0$, $t_1$, $t_2$ on the passed segments $S_0$,
$S_1$ and $S_2$, let $t_2$ fall in a certain cluster for $S_2$, which we take as the match to
$t_2$. All trajectories whose travel times on $S_2$ fall in the matching cluster are marked.
Matching is usually done by finding the cluster for the particular segment whose cluster
average is closest to $t_2$ (which is the current trajectory's actual travel time on $S_2$). The
same operation is applied to $S_1$ and $S_0$, and then we can find the trajectories whose travel
times on the three passed segments fall in all the matched clusters. This method, known as
segment filtering, was introduced by Lee et al. (2012). Since, for each of the three passed
segments, the historical trajectories' travel times are similar to those of the current
trajectory, they can be considered "similar" to the current trajectory and can be used for
prediction. However, the segment filtering method has a limitation when the number of
historical trajectories is small (i.e. < 5,000 as in our case, compared to 24,000 in the case of
Lee et al. (2012)) and when the segments are shorter. In such a case, the sets of similar
trajectories found for $S_0$, $S_1$ and $S_2$ may not overlap at all (or may have a negligible
intersection). To overcome this issue, we first aggregate the travel times for a fixed number of
consecutive segments (which is a tunable parameter) and then apply the clustering algorithm.
Then, we use the set of similar trajectories found for $S_2$ as the final set of similar
trajectories, without going for segment filtering.
For example, if the initial idea was to use three passed segments in segment filtering,
we aggregate the travel times for every triple of consecutive segments and then cluster the
aggregated travel times for each such triple. As the bus moves from one segment to the
next, this window of three passed segments is maintained and the clusters for the triple are
searched to find the match.
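The windowed matching described above can be sketched as follows. The helper names
`window_sum` and `match_cluster` are hypothetical, and each cluster is assumed to be stored as a
list of (aggregated travel time, trip id) pairs; the thesis does not prescribe this layout.

```python
def window_sum(times, end, lag):
    """Aggregated travel time over the `lag` segments ending at index `end` (inclusive)."""
    return sum(times[end - lag + 1 : end + 1])


def match_cluster(clusters, current_sum):
    """Trip ids of the cluster whose average aggregated time is closest to `current_sum`."""

    def average(cluster):
        return sum(t for t, _ in cluster) / len(cluster)

    best = min(clusters, key=lambda c: abs(average(c) - current_sum))
    return [trip_id for _, trip_id in best]


# Current trip has passed segments 0..2; the window of three segments is aggregated.
current = [20, 155, 63]
clusters = [[(150, "a"), (160, "b")], [(230, "c"), (250, "d")]]
similar = match_cluster(clusters, window_sum(current, end=2, lag=3))
```

Here the aggregated time of the current trip is 238, which is closest to the second cluster
average (240), so trips "c" and "d" are returned as the similar trajectories.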
4.7 Similarity based on Temporal Features
Besides the passed segment (PS) scheme, we also propose a scheme that uses features inside
the historical data that are directly related to the travel time. Using similar trajectories found
from the PS scheme, we can provide satisfactory predictions. However, this method cannot
guarantee accuracy under all circumstances. For example, when unusual events happen
on a future segment, it is hard to make a reliable prediction from historical data because the
events might never have happened in the history (limited by the amount of historical data
collected). Fortunately, by resorting to features related to traffic information on the current
segment, we can make predictions of travel times by first selecting trajectories that match
the temporal features and then using the PS scheme. For example, the time when a bus
enters a segment is important because the traffic changes with the time of day. During peak
hours in the morning and evening, congestion occurs with a
high probability. Also, intuitively, the segment-wise travel times on two trajectories close
together temporally may bear a high correlation with each other. As verified by Figure 3.6
in Chapter 3, the weekday and weekend trips have different variances in their space-time
trajectories. Hence, in the final hybrid scheme, in order to make predictions for an ongoing
trip, the day on which it is occurring is first used to select weekday or weekend trajectories.
On this set of trajectories the TF (temporal neighbourhood) and PS schemes are applied in
sequence to find the final set of similar trajectories.
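The temporal part of the hybrid scheme can be sketched as a simple filter applied before the PS
scheme. The function names, the use of the trip start time as the temporal feature, and the
representation of the history as (trip id, start time) pairs are assumptions made for this
illustration only; time-of-day differences across midnight are ignored.

```python
from datetime import datetime


def is_weekend(ts):
    """True for Saturday and Sunday."""
    return ts.weekday() >= 5


def seconds_of_day(ts):
    return ts.hour * 3600 + ts.minute * 60 + ts.second


def temporal_candidates(history, test_start, window_seconds):
    """Trips of the same day type that started within `window_seconds` of the test trip's time of day."""
    selected = []
    for trip_id, start in history:
        if is_weekend(start) != is_weekend(test_start):
            continue  # weekday/weekend feature
        if abs(seconds_of_day(start) - seconds_of_day(test_start)) <= window_seconds:
            selected.append(trip_id)  # temporal neighbourhood feature
    return selected


history = [
    ("a", datetime(2014, 2, 25, 9, 20)),  # Tuesday, 20 min later in the day
    ("b", datetime(2014, 3, 1, 9, 0)),    # Saturday, same time of day
    ("c", datetime(2014, 2, 26, 11, 0)),  # Wednesday, two hours later in the day
]
candidates = temporal_candidates(history, datetime(2014, 3, 4, 9, 0), 2400)
```

Only trip "a" survives both filters; the PS scheme would then be applied to the surviving trips.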
4.8 Summary
The final refined set of similar trajectories that results from the application of the hybrid
scheme introduced above is used for prediction of travel times on the upcoming segments
of the ongoing trajectory. Experiments to support our claim that the hybrid scheme is
more effective in prediction than the individual ones were carried out with real-world data,
as discussed in Chapter 6 on performance evaluation. The prediction algorithm based on the
Kalman Filter and the modifications made to the base Kalman algorithm are explained in
detail in the next chapter.
CHAPTER 5
THE PREDICTION ALGORITHM
In Chapter 4, we explored the various schemes to search for similar historical trajectories
that are effective for travel time prediction (TTP). The problem now is to use the travel
times of the identified similar trajectories to predict the travel times of the current
trajectory. The simplest
way to predict the travel time on an upcoming segment of the current trip is to use the mean
(or median) of all the travel times from the identified similar trajectories on the
corresponding upcoming segment. Another way is to give weights to the individual trajectories
before calculating the mean. The weight given to a similar trajectory can be the inverse square of
its Euclidean distance [1] from the ongoing trajectory, as explained in Larose (2005).
However, in this study, we focus on a robust, short-term prediction technique based on the Kalman
Filter (KF), which can take into account the associated variability to a certain extent. Before
moving on, we review some previous work involving the Kalman Filter for travel time
prediction, including work attempted in the heterogeneous traffic conditions prevalent in
India.
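The inverse-square-distance weighting mentioned above can be sketched as follows. This is an
illustration rather than the method adopted in this study; the function names are hypothetical,
and the small constant `eps` that guards against a zero distance is an added assumption.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared differences over corresponding segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_prediction(current_passed, similar_trips, next_index, eps=1e-6):
    """Weighted mean of the similar trips' travel times on segment `next_index`.

    Each trip is weighted by 1 / d^2, where d is its distance from the current
    trip over the segments crossed so far.
    """
    m = len(current_passed)
    numerator = denominator = 0.0
    for times in similar_trips:
        d = max(euclidean_distance(current_passed, times[:m]), eps)
        weight = 1.0 / d ** 2
        numerator += weight * times[next_index]
        denominator += weight
    return numerator / denominator


prediction = weighted_prediction([10, 20], [[10, 22, 30], [10, 16, 50]], next_index=2)
```

In this toy case the two historical trips lie at distances 2 and 4 from the current trip, so
their weights are 0.25 and 0.0625 and the predicted travel time is 34, closer to the nearer trip.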
5.1 Travel Time Prediction using Kalman Filter
The first introduction of the Kalman Filter dates back to 1960, when Kalman (1960) published
his famous paper describing a recursive solution to the discrete-data linear filtering problem.
The Kalman filter is a set of mathematical equations that provides an efficient computational
(recursive) means to estimate the state of a process, in a way that minimizes the mean of
the squared error. As mentioned in Welch and Bishop (2006), the filter is very powerful in
several aspects: it supports estimations of past, present, and even future states, and it can
do so even when the precise nature of the modelled system is unknown.
In the literature on travel time prediction, Chein and Kuchipudi (2002), Liu et al. (2006),
Nanthawichit et al. (2003), Chen and Chein (2001) and Yang (2005) are some of the earliest
to introduce KF. Nanthawichit et al. (2003) and Yang (2005) explored the possibility of
using GPS probe vehicle data in KF for travel time prediction. Vanajakshi et al. (2009)
is one of the earliest attempts that used KF with GPS probe vehicle data for short-term
travel time prediction under heterogeneous traffic conditions such as those prevalent in
India. From their travel time variation plots (across the route), the authors concluded that
the travel time patterns along the route were more related for consecutive vehicles (with a
headway 15 min) on the same day. Weekly and daily patterns were not as significant as
the above one. Hence, they used the travel times of the previous two vehicles for predicting
the travel time of the test vehicle (the ongoing trip). However, when the headways between
the consecutive vehicles are larger (1 hour), the accuracy of the approach decreases. This
is a serious issue when the previous vehicles passed during an off-peak hour and the test
vehicle passes in a peak hour, or vice versa. Since the inputs to KF in this method are fixed
most of the time, there is no means to restore the accuracy once one of the above-mentioned
issues creeps in. Hence, there was a need to modify the method such that it uses dynamic
inputs for prediction in order to address the traffic condition prevalent at the moment. This is
where the similar trajectory search discussed in Chapter 4 can be helpful. Based on the
latest actual travel times of the test vehicle, the trajectory search algorithm finds all the
historical trajectories which occurred under the same traffic conditions as the current one. In
Chapter 6, we show, using real-world data, that the dynamic input method outperforms the
static input method. In the following section, we discuss the base KF algorithm as
described in Vanajakshi et al. (2009). In the subsequent sections, we discuss the changes made
to both the base KF algorithm and the trajectory search algorithm to effectively integrate
them for travel time prediction.
[1] Square root of the sum of squares of differences between the corresponding segments of the
historical and the current trajectory (till the segments crossed in the current trip).
5.2 The Base KF Algorithm
It is assumed that the evolution of travel time between the various segments is governed by
$$ t_{i+1} = a_i t_i + w_i \tag{5.1} $$
where $t_i$ is the travel time taken for covering $S_i$ (the $i$th subsection), $a_i$ is a
parameter that relates the travel time taken in $S_i$ to the travel time taken in $S_{i+1}$, and
$w_i$ is the process disturbance associated with $S_i$. The measurement process was assumed to be
governed by
$$ z_i = t_i + v_i \tag{5.2} $$
where $z_i$ is the measured time of travel in $S_i$ and $v_i$ is the measurement noise. It was
further assumed that $w_i$ and $v_i$ are zero-mean white Gaussian noise signals with $Q_i$ and
$R_i$ being their corresponding variances.
The prediction algorithm requires as input at least two trajectories in the form of
segment-wise travel times. The trajectory which is more similar to the current one is called the
base trajectory (denoted by $T_{base}$) and the other one is called the correction trajectory
(denoted by $T_{corr}$). The data obtained from $T_{base}$ was used to obtain the value of $a_i$
for each subsection. The data from $T_{corr}$ were used in the prediction algorithm to obtain the
estimate of travel time of the test (or the ongoing) trajectory (denoted by $T_{test}$). The
steps involved in the algorithm are as follows:
1. The travel time data from $T_{base}$ was used to obtain the value of $a_i$ through
   $a_i = t^{T_{base}}_{i+1} / t^{T_{base}}_i$, $i = 1, ..., (N-1)$, where $t^{T_{base}}_i$ is the
   travel time taken in $T_{base}$ to cover $S_i$.
2. The discretisation is carried out over space rather than over time (as is done in
   traditional applications of the KF). Let $t^{T_{test}}_i$ denote the travel time taken in
   $T_{test}$ to cover $S_i$. It is assumed that $E[t^{T_{test}}_1] = \hat{t}_1$ and
   $E[(t^{T_{test}}_1 - \hat{t}_1)^2] = P_1$, where $\hat{t}_1$ is the estimate of the travel
   time in $T_{test}$ on $S_1$.
3. For $i = 2, ..., (N-1)$, the following steps are performed:
   (a) The a priori estimate of the travel time is calculated using
       $\hat{t}^-_{i+1} = a_i \hat{t}^+_i$,
       where the superscript $-$ denotes the a priori estimate and the superscript $+$
       denotes the a posteriori estimate.
   (b) The a priori error variance (denoted by $P^-$) was calculated using
       $P^-_{i+1} = a_i P^+_i a_i + Q_i$.
   (c) The Kalman gain (denoted by $K$) was calculated using
       $K_{i+1} = P^-_{i+1} / (P^-_{i+1} + R_{i+1})$.
   (d) The a posteriori travel time estimate and error variance were calculated using,
       respectively,
       $\hat{t}^+_{i+1} = \hat{t}^-_{i+1} + K_{i+1} [z_{i+1} - \hat{t}^-_{i+1}]$ and
       $P^+_{i+1} = [I - K_{i+1}] P^-_{i+1}$,
       where the data measured from $T_{corr}$ was used for providing the values of $z_{i+1}$
       in the equation to calculate $\hat{t}^+_{i+1}$.
Thus, the objective here is to predict the travel times of $T_{test}$ using the travel time data
obtained from $T_{base}$ and $T_{corr}$. When $T_{test}$ is in $S_i$, its travel time for
$S_{i+1}$, which is denoted by $t_{i+1}$, is predicted. The KF algorithm works like a
predictor-corrector algorithm. The a posteriori estimate of $t_i$ of $T_{test}$ is used to obtain
the a priori estimate of $t_{i+1}$ (this being the prediction step), and then the measurement of
the travel time of $T_{corr}$ in $S_{i+1}$ (which is denoted by $z_{i+1}$ in the equations in
step 3(d) above) is used to obtain the a posteriori estimate of $t_{i+1}$ of $T_{test}$ (this
being the correction step). In the following section, we discuss the modifications made to the
trajectory search algorithm and the base KF algorithm
in order to integrate them and to tackle a few issues concerned with the variance of travel
times.
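The predictor-corrector recursion of this section can be sketched in a few lines of Python. The
sketch makes several assumptions that are not in the text: the function name `base_kf` is
hypothetical, the lists are 0-indexed (index 0 is the first segment), the recursion starts from
the first segment, and the noise variances `q` and `r` are held constant.

```python
def base_kf(t_base, t_corr, t1_est, p1, q, r):
    """Scalar Kalman filter over the segments of a route.

    t_base, t_corr: segment-wise travel times of the base and correction trajectories
    t1_est, p1:     initial travel time estimate for the first segment and its error variance
    q, r:           process and measurement noise variances (assumed constant here)
    Returns the a priori and a posteriori estimates for the second segment onwards.
    """
    n = len(t_base)
    # Step 1: a_i relates the travel time on one segment to that on the next.
    a = [t_base[i + 1] / t_base[i] for i in range(n - 1)]
    t_post, p_post = t1_est, p1
    a_priori, a_posteriori = [], []
    for i in range(n - 1):
        t_prior = a[i] * t_post                     # step 3(a): prediction
        p_prior = a[i] * p_post * a[i] + q          # step 3(b): a priori error variance
        denom = p_prior + r
        gain = p_prior / denom if denom > 0 else 0.0  # step 3(c): Kalman gain
        t_post = t_prior + gain * (t_corr[i + 1] - t_prior)  # step 3(d): correction
        p_post = (1 - gain) * p_prior
        a_priori.append(t_prior)
        a_posteriori.append(t_post)
    return a_priori, a_posteriori
```

With `t_base = [10, 20]`, `t_corr = [12, 30]`, an initial estimate of 10 with variance 1, `q = 0`
and `r = 4`, the a priori estimate for the second segment is 20 and the gain is 0.5, so the
a posteriori estimate moves halfway towards the measurement, to 25.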
5.3 Integration of Trajectory Search and Prediction Algorithms
As we discussed in the previous section, the KF-based algorithm needs only the two best-matched
trajectories for travel time prediction. Based on the actual travel times received
in real time from the test vehicle, the trajectory search algorithm finds similar historical
trajectories with travel time patterns matching that of the current one. The task now is to
rank the matched trajectories based on some metric and send the top two to the prediction
algorithm. To accomplish this, for each matched trajectory, its Euclidean distance from the
test trajectory is computed using the equation
$$ ED = \sqrt{(t^{T_{test}}_1 - t^{T_{hist}}_1)^2 + (t^{T_{test}}_2 - t^{T_{hist}}_2)^2 + \dots + (t^{T_{test}}_m - t^{T_{hist}}_m)^2} \tag{5.3} $$
where $ED$ is the Euclidean distance between the test trajectory and a matched historical
trajectory, $t^{T_{test}}_i$ is the travel time on $S_i$ for the test vehicle, $t^{T_{hist}}_i$
is the travel time on $S_i$ for a matched historical trip and $m$ is the number of segments
crossed by the test vehicle when the request is made. The above Euclidean distance gives a
measure of similarity between two trajectories with respect to their individual segment travel
times. The matched trajectories are now ranked in increasing order of their EDs from the test
trajectory. The top two are sent to the prediction algorithm. As the test vehicle moves from
one segment to the next one, with the newly available actual travel time of the test vehicle, the
trajectory search algorithm again finds the best matches from history, ranks them and sends
the top two to the prediction algorithm, which updates the previous predictions with more
accurate ones, thus making the process dynamic in nature.
Figure 5.1: Variation of travel time variance across the segments of 19B route
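The ranking step of this section can be sketched as follows. The function name
`top_two_matches` and the dictionary layout (trip id mapped to segment-wise travel times) are
assumptions for illustration; ties are broken by the order in which the trips are stored.

```python
import math


def top_two_matches(test_passed, matched):
    """Rank matched historical trips by Euclidean distance (Eq. 5.3) and keep the best two.

    test_passed: travel times of the test vehicle on the m segments crossed so far
    matched:     dict mapping trip id -> segment-wise travel times of a matched trip
    The first id returned plays the role of T_base, the second that of T_corr.
    """
    m = len(test_passed)

    def ed(times):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(test_passed, times[:m])))

    ranked = sorted(matched, key=lambda trip_id: ed(matched[trip_id]))
    return ranked[:2]


matched = {
    "h1": [22, 150, 60, 30],
    "h2": [20, 156, 63, 28],
    "h3": [40, 100, 90, 35],
}
base_id, corr_id = top_two_matches([20, 155, 63], matched)
```

In this toy example "h2" (distance 1) becomes the base trajectory and "h1" (distance of about
6.2) the correction trajectory. The call would be repeated each time the test vehicle completes
another segment.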
5.4 Modifications
The main issue faced by the existing algorithm was the high variance in travel times during
certain periods of the day and on certain segments, which led to higher prediction errors on
selected trips or segments. As can be seen in the box plots in Figure 3.3 (Chapter 3), during
the peak hours, besides the median travel time (thick line inside the box), the variance of
travel times also increases (indicated by the increased height of the box). Figure 5.1
shows that the variance of travel times is also high in certain segments on the route. Each
line in the plot is obtained by calculating the variance of travel times at each segment for
the trips that occurred in a two-hour band in the history.
To address the high variance (to some extent), the $Q_i$ and $R_i$ values, which represent
the variances of the process disturbance and the measurement noise in the KF algorithm,
are updated using the actual travel times of the test vehicle. This modified KF algorithm,
known as the Adaptive Kalman Filter, was introduced in Tripathi (2013). Since the method
uses the actual travel times of the test vehicle, it is more effective towards the end of the trip,
when more actual travel times from the test vehicle are available. Irrespective of the high
variances in travel times, this method learns from the errors it makes in the past segments
and updates the future predictions to reduce further errors. According to Tripathi (2013),
$Q_i$ is given as follows,
$$ Q_i = \frac{1}{i} \sum_{j=1}^{i} (w_j - \bar{w}_i)^2 \tag{5.4} $$
where $\bar{w}_i = \frac{1}{i} \sum_{j=1}^{i} w_j$ and each
$w_j = t^{T_{test}}_{j+1} - a_j t^{T_{test}}_j$. Similarly, $R_i$ is calculated as,
$$ R_i = \frac{1}{i} \sum_{j=1}^{i} (v_j - \bar{v}_i)^2 \tag{5.5} $$
where $\bar{v}_i = \frac{1}{i} \sum_{j=1}^{i} v_j$ and each $v_j = z_j - t^{T_{test}}_j$, where
$z_j = t^{T_{corr}}_j$.
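Equations 5.4 and 5.5 amount to taking the population variance of the residuals observed so far,
which can be sketched as below. The function name `adaptive_q_r` is hypothetical and the lists
are 0-indexed; with `i + 1` actual travel times of the test vehicle available, `i` residuals of
each kind are used.

```python
from statistics import pvariance


def adaptive_q_r(t_test, t_corr, a):
    """Estimate Q_i and R_i from the residuals of the segments crossed so far.

    t_test: actual travel times of the test vehicle (at least two values)
    t_corr: travel times of the correction trajectory on the same segments
    a:      the a_j parameters derived from the base trajectory
    """
    i = len(t_test) - 1
    if i < 1:
        raise ValueError("at least two actual travel times are needed")
    # w_j = t_{j+1} - a_j * t_j : process residuals (Eq. 5.1 rearranged)
    w = [t_test[j + 1] - a[j] * t_test[j] for j in range(i)]
    # v_j = z_j - t_j with z_j taken from the correction trajectory (Eq. 5.2 rearranged)
    v = [t_corr[j] - t_test[j] for j in range(i)]
    return pvariance(w), pvariance(v)


q, r = adaptive_q_r([10, 22, 18], [12, 20, 25], [2, 1])
```

For this toy input the process residuals are 2 and -4 and the measurement residuals are 2 and
-2, giving Q = 9 and R = 4; these values would replace the fixed variances in the KF recursion.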
Another issue with the trajectory search algorithm arises when the travel times on certain
segments of a peak hour trip are low (similar to those in off-peak hour trips). In such a case,
the algorithm searches for historical trajectories with lower travel times on the corresponding
segments and most of the returned trajectories are from off-peak hours, thus reducing
the accuracy of the prediction algorithm. To address this issue, the temporal features
scheme included in the trajectory search algorithm is useful to some extent. According
to the scheme, only those historical trajectories which occurred within a fixed temporal
neighbourhood (30 min or 1 hour) of the test trajectory are searched. This makes sure that
most of the matched trajectories are from the same traffic conditions.
To corroborate the various modifications and schemes proposed above, experiments
were performed with actual GPS data collected from the MTC buses in Chennai. In the
next chapter, comparisons of errors in predictions, made with and without the application
of the modifications, are shown to demonstrate the effectiveness of the modifications. We also
discuss the process carried out to find the optimum values of the parameters of the clustering
algorithm using real data.
CHAPTER 6
PERFORMANCE EVALUATION
In the performance evaluation stage, the performance of the proposed schemes was evaluated
and the various parameters of the algorithms were optimized. For the purpose of
evaluations and comparisons, all the trips made on Tuesday, March 04, 2014 (a typical
working day) were taken as the test trips. For the prediction of arrival times in case of each
test trip, all the historical trips that happened between January 01, 2014 and the test trip
were considered for trajectory search. Arrival time predictions at the Saidapet Bus Depot
(the last stop) were considered for the evaluation. In the next section, the various measures
of performance considered in this study are defined.
6.1 Measures of Performance
For a particular bus in an ongoing trip, approaching a particular bus stop (target bus stop,
denoted by $B_{tar}$), several measures were considered to evaluate the accuracy of the
predicted arrival times at the bus stop. As the bus moves from the origin bus stop $B_0$ towards
$B_{tar}$ (for which evaluation is conducted), a series of predicted arrival times
$\hat{A}^{B_{tar}}_i$ (found by using Equation (4.1)) are quoted for the target bus stop as and
when the bus completely moves past a segment (200 m). These quoted arrival times are logged
along with the corresponding time-stamp and the location of the bus at that time. When the bus
reaches $B_{tar}$, the actual arrival time is noted and the errors in the series of quoted
arrival times are calculated in seconds. The following are the important measures of
performance used in this study for evaluation:
Mean Absolute Error
It is the mean of the absolute values of errors in arrival time prediction at $B_{tar}$. It is
denoted simply by MAE. The lower the MAE value, the more accurate the predicted travel
times and hence the better the method. MAE is calculated as follows:
$$ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{A}^{B_{tar}}_i - A^{B_{tar}} \right| \tag{6.1} $$
where $\hat{A}^{B_{tar}}_i$ is the $i$th predicted arrival time at $B_{tar}$ and $A^{B_{tar}}$
is the actual arrival time (calculated after the bus arrives at $B_{tar}$). $n$ is the total
number of arrival time predictions made.
Mean Absolute Percentage Error
It is the mean absolute value of the percentage errors in arrival time prediction at $B_{tar}$.
It is denoted by MAPE. The lower the MAPE value, the more accurate the predicted travel times
and hence the better the method. MAPE is calculated as follows:
$$ MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| \hat{A}^{B_{tar}}_i - A^{B_{tar}} \right|}{A^{B_{tar}}} \times 100 \tag{6.2} $$
where the symbols used bear their usual meaning. The following section discusses the
method followed to find the optimum values of the parameters of the clustering algorithm
(used in the passed segment scheme and introduced in Section 4.5).
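The two measures can be computed as in the following sketch. The function names are
hypothetical, and the arrival times are assumed to be expressed in seconds relative to a common
origin (for instance the start of the trip), so that the division in MAPE is meaningful.

```python
def mae(predicted, actual):
    """Mean absolute error (Eq. 6.1) of a series of quoted arrival times, in seconds."""
    return sum(abs(p - actual) for p in predicted) / len(predicted)


def mape(predicted, actual):
    """Mean absolute percentage error (Eq. 6.2) of a series of quoted arrival times."""
    return 100.0 * sum(abs(p - actual) / actual for p in predicted) / len(predicted)


# Three arrival times quoted during a trip whose actual arrival was at 1000 s.
quoted = [1100, 950, 1020]
trip_mae = mae(quoted, 1000)
trip_mape = mape(quoted, 1000)
```

For this toy trip the absolute errors are 100, 50 and 20 seconds, giving an MAE of about 56.7
seconds and a MAPE of about 5.67 percent.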
6.2 Parameter Optimization in Passed Segment Scheme
Different parameters associated with the pattern analysis and prediction were studied first
to identify the optimum values that can be used in the implementation. This was carried out
offline, using historic GPS data. Predictions were made for different values of a parameter
keeping the other parameters constant. The optimum value of a parameter was assumed
to be the one which gives the lowest MAPE. Since there were several test trips on the
test date, each associated with a MAPE value, the MAPE value plotted versus the
corresponding value of a parameter in the subsequent figures is the average MAPE for all the
test trips on that date.
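The one-parameter-at-a-time search described above can be sketched as follows. The function
`sweep_parameter` and the callback `evaluate_trip` (which is assumed to return the MAPE of one
test trip for a given parameter setting) are hypothetical names introduced for this
illustration; they are not part of the implementation reported here.

```python
def sweep_parameter(evaluate_trip, test_trips, name, candidates, fixed):
    """Vary one parameter while keeping the others fixed.

    evaluate_trip: callable(trip, **params) -> MAPE of that test trip (assumed interface)
    name:          the parameter being varied, e.g. "spatial_lag"
    candidates:    the values to try for that parameter
    fixed:         dict of the remaining parameters, held constant
    Returns the value with the lowest average MAPE and the full curve.
    """
    mean_mape = {}
    for value in candidates:
        params = dict(fixed)
        params[name] = value
        scores = [evaluate_trip(trip, **params) for trip in test_trips]
        mean_mape[value] = sum(scores) / len(scores)
    best = min(mean_mape, key=mean_mape.get)
    return best, mean_mape
```

The returned dictionary corresponds to one curve of average MAPE against parameter value; the
same function would be called once per parameter, each time with the other parameters fixed.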
6.2.1 Spatial lag
The spatial lag is the number of passed segments to consider in the passed segment scheme,
as explained in Section 4.4. Predictions of arrival times were made using an MNT [1] of 20
and with different values of spatial lag ranging from 1 to 10. In Figure 6.1a, it can be seen
that the MAPE decreases up to a spatial lag of 7, after which the effect of change is small.
Hence, the optimum value was chosen to be 7.
(a) Variation of arrival time prediction MAPE
with spatial lag.
(b) Variation of arrival time prediction MAPE
with MNT.
Figure 6.1: Optimum values of parameters involved in the clustering algorithm
6.2.2 Minimum Number of Trajectories (MNT) in a Cluster
Predictions were made using a spatial lag of 7 and with different values of MNT ranging
from 5 to 40 in steps of 5. As can be seen in Figure 6.1b, the MAPE decreases up to an MNT
of 20 and then increases beyond it. Hence, the optimum value was chosen to be 20.
Traditional clustering approaches, such as those in Lee et al. (2012), optimize another
parameter called v-thresh, which is the threshold for the maximum variance allowed for a
cluster resulting from the clustering algorithm. However, since the length of the segments
considered in this study (200 m) is much less than those considered in Lee et al. (2012)
(where each segment was the stretch of the road between two bus stops), the travel time
variances are also lower (for most segments) in this study. Hence, the only governing
parameter in the clustering algorithm is the MNT value.
[1] Minimum Number of Trajectories allowed in a cluster (i.e. the minimum cluster size). This is
also a tunable parameter.
The subsequent sections compare the measures of accuracy (particularly the mean absolute
error or MAE in arrival time predictions at Saidapet) with and without the application
of the various schemes and modifications discussed in Chapters 4 and 5. These comparisons
help to visualize the improvement in accuracy brought by each scheme/modification.
6.3 Evaluation of the PS scheme
In order to show that the PS scheme (involving clustering of historical trajectories) is
effective, it was compared with the simple averaging method in which the predicted travel times
on the future segments were found by averaging the travel times on the corresponding
segments from the historical trajectories. Note that no temporal features were applied
yet. Only the PS scheme was introduced to the naive approach and the improvement was
measured. After receiving the similar trips from the trajectory search algorithm, the
prediction was still done by averaging those similar trips (the KF algorithm was not introduced
yet). The parameters of the PS scheme were those found in Section 6.2.
As can be seen in Figure 6.2, the MAE improved for most of the test trips. However,
for the peak hour trips, the MAE values were still high for both the methods.
6.4 Evaluation of the Weekday/Weekend Temporal Feature
To measure the improvement caused by the weekday/weekend feature, it was added to the
PS scheme. That is, depending on the day on which the test trip occurs, either the weekday
or weekend historical trajectories would be clustered and the prediction would be made by
averaging the trajectories returned by the combination of the schemes. Since all the test
trips were on a weekday, all the weekday trajectories would be clustered. As can be seen in
Figure 6.3, the MAE for some peak hour test trips reduced considerably (by about 200
seconds). Hence, the added feature was indeed effective in improving the results. However,
the MAE values for the peak trips are still high, at around 600 seconds (10 minutes).
Figure 6.2: Comparison of MAE for individual test trips before and after adding the PS
scheme to the naive method.
6.5 Evaluation of the Temporal Neighbourhood Feature
The temporal neighbourhood feature was added on top of the weekday/weekend feature
before the trajectories were sent to the clustering algorithm. Since all the test trips were
made on a weekday, only those weekday trajectories which happened within a fixed temporal
neighbourhood (2400 seconds in this case) of the test trip would be clustered.
From Figure 6.4, it is clear that the effect of including this feature is drastic. MAE values
for almost all the test trips are within 300 seconds (5 minutes).
6.6 Evaluation of the base KF Algorithm for Prediction
Till now, the prediction was being done by averaging the travel times of similar trajectories
returned by the combination of PS and temporal feature schemes. At this stage, the base
KF algorithm was introduced to the status quo method. The predictions were made by
the integrated method, as discussed in Section 5.3. It is clear from Figure 6.5 that, while the
MAE values have improved for some off-peak trips, they have increased for the peak trips.
Hence, the overall improvement is not that significant.
Figure 6.3: Comparison of MAE for individual test trips before and after adding the
weekday/weekend feature to the PS scheme.
Figure 6.4: Comparison of MAE for individual test trips before and after adding the
temporal neighbourhood feature.
Figure 6.5: Comparison of MAE for individual test trips before and after using the base KF
algorithm for prediction.
6.7 Evaluation of the Adaptive KF Algorithm
It was observed in the previous section that the overall improvement brought by the base
KF algorithm was insignificant. In this section, the effect of the modifications to the base
KF algorithm for prediction is visualized. To the status quo method for prediction, the
modifications suggested in Section 5.4 were applied. As is clear from Figure 6.6, the MAE values
have reduced for most of the trips including the peak trips, thus demonstrating the effectiveness
of the modification.
In the next section, the improvements brought by the individual modifications are
summarized.
Figure 6.6: Comparison of MAE for individual test trips before and after using the Adaptive
KF algorithm.
6.8 Evaluation Summary
To summarize the improvements brought by the introduction of the individual modifications
step by step, the mean of the MAE values (for all the test trips) was plotted against the stage
at which a particular modification was introduced. As can be seen in Figure 6.7, there
is a gradual decline in the mean MAE values along with the evolution of the method. The
method that evolved as a result of these gradual additions is the final method for the HTKFTP
framework. In Figure 6.8, the KF algorithm using static inputs (two previous vehicles) and
the final HTKFTP method are compared. Clearly, the new HTKFTP method outperforms
the KF with static inputs for all the test trips.
The next chapter concludes this study, summarizing the entire process of development
of the HTKFTP framework. The possible enhancements which will improve the scalabil-
ity and performance of the system are also listed. An application which can utilize the
predicted arrival/travel times by the HTKFTP framework is also introduced.
Figure 6.7: Improvement of the mean MAE (over all the test trips) throughout the evolution
of the method.
Figure 6.8: Comparison between HTKFTP and the prediction method using static inputs in
KF.
CHAPTER 7
SUMMARY AND CONCLUSIONS
7.1 Summary
The present study was an attempt to develop a reliable framework for real-time bus arrival
prediction under the heterogeneous traffic conditions prevalent in India. The heterogeneity
and lack of lane discipline make Indian traffic different from (and more challenging
than) western traffic, and hence most of the existing solutions may not work in the scenario
under consideration.
Several phases were involved in the development of the framework. First, travel time
pattern analysis was carried out using GPS data, collected for four months, from the northbound
MTC buses running on the 19B route. Based on the analysis, the trajectory search
schemes using temporal features of the trajectories were proposed and tested. The high
correlation between spatially close segments formed the basis of the passed segment scheme,
which implements a robust clustering algorithm to search for similar trips using passed
segment travel times. In order to address high errors in certain traffic conditions prevalent in
certain locations or times of the day, the base KF algorithm used for prediction was modified
into the adaptive KF algorithm, which updates the predictions based on the errors made in
the past.
All the proposed schemes/modifications were corroborated by experiments conducted
using the GPS data from the MTC buses running on the 19B route. The improvements
brought by each modification were measured separately. The final prediction method,
evolved as a result of the present study, was compared with the traditional prediction method
that used static inputs, and the improvements were demonstrated.
7.2 Conclusions
Accurate information regarding bus arrival/travel time at the bus stops will help in reducing
the uncertainty and waiting time associated with the public transit system. This may help in
attracting more passengers to public transportation, which in turn may help in reducing
congestion in the road network. Also, such information will be useful for the transit authorities
for planning the routing and scheduling activities. Thus, information regarding travel time
or arrival time of buses is very valuable in today's society and has attracted researchers
towards travel time prediction in recent years.
The promising results suggest that the HTKFTP framework proposed in this study can
be used to implement real-time APTS applications on a large scale. This study sets the
stage for the development of a Dynamic Transit Trip Planner that can utilize the predicted
travel times on a number of routes to find the quickest path between two locations in the
city. The media for conveying this information to the travellers can range from
variable message sign (VMS) boards and kiosks to websites and mobile applications.
7.3 Scope for Further Research
Though the overall results have improved as compared to the traditional prediction method
using static inputs, considerable errors still exist for certain time periods and certain
locations/stretches where the travel times have high variance. This can be addressed by developing
prediction methods that take care of high variance in the training data. An ideal prediction
algorithm should also require less historical training data to produce satisfactory results.
Although the system is able to provide real-time updates for the buses on a few routes
satisfactorily (with negligible latency), the response time (for the queries made by users or
scheduled updates at the bus station) may get affected if the number of buses to handle
is increased (due to the increase in running times of the clustering and the trajectory search
algorithms). An ideal system should be able to serve thousands of buses running
simultaneously in a city with limited resources (such as server memory, network bandwidth,
database capacity, etc.). As future enhancements, a lot of research can be done along these
lines to improve the performance and scalability of the system for a city-wide
implementation.
APPENDIX A
PYTHON CODE LISTING FOR CLUSTERING
ALGORITHM
A.1 Method for creating clusters from similar trips
import numpy


def create_clusters(training_trips, spatial_lag, mnt):
    """Cluster the training trips on every segment by their travel time
    (TT) summed over the preceding (spatial_lag - 1) segments.
    mnt is the minimum number of trips in a cluster."""
    # total no. of segments
    num_segs = len(training_trips[0][2])
    clusters_for_test_trip = {}
    # form clusters for each seg one by one;
    # if spatial_lag is 7, it will start from
    # index 6, i.e. the 7th value
    for seg in range(spatial_lag - 1, num_segs):
        # keys: TTs, values: corresponding trip_ids
        seg_tt_hash = {}
        for trip in training_trips:
            # key is the TT summed over the preceding segments
            seg_tt_hash[sum(trip[2][seg - (spatial_lag - 1):seg])] = trip[0]
        sorted_hash_keys = sorted(seg_tt_hash.keys())
        # list to store the elements at
        # which to split the entire parent list
        split_elements = []
        find_splits(sorted_hash_keys, split_elements, mnt)
        split_indices = sorted([sorted_hash_keys.index(tt)
                                for tt in split_elements])
        seg_clusters = []
        clusters_for_this_seg = {}
        # each iteration is a cluster
        for i in range(len(split_indices)):
            if i == 0:
                keys = sorted_hash_keys[:split_indices[i]]
            elif i == len(split_indices) - 1:
                keys = sorted_hash_keys[split_indices[i]:]
            else:
                keys = sorted_hash_keys[split_indices[i - 1]:split_indices[i]]
            # skip clusters with fewer than mnt trips (the source listing
            # shows this size check only in the first branch; applying it
            # to all three branches is an assumption)
            if len(keys) < mnt:
                continue
            cluster_hash = {}
            for key in keys:
                cluster_hash[key] = seg_tt_hash[key]
            seg_clusters.append(cluster_hash)
            clust_tt_list = sorted(cluster_hash.keys())
            clust_trip_id_list = [cluster_hash[key]
                                  for key in sorted(cluster_hash.keys())]
            route_id = 1101
            seg_id = seg + 1
            avg_tt = round(numpy.mean(clust_tt_list), 2)
            # (a statement ending in "clust_int_tt_list])[0][0])" is
            # truncated in the source listing and is omitted here)
            tt_var = round(numpy.var(clust_tt_list), 2)
            clusters_for_this_seg[avg_tt] = {'avg_tt': avg_tt,
                                             'tt_list': clust_tt_list,
                                             'trip_id_list': clust_trip_id_list,
                                             'tt_var': tt_var}
        # NOTE: the listing ends here in the source. Presumably
        # clusters_for_this_seg is then stored in clusters_for_test_trip
        # under a segment key and the dictionary is returned.
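The way a set of split positions partitions the sorted travel times into contiguous clusters, each summarised by its mean and variance, can be sketched independently of the database and trip structures. The following is a minimal illustration, not the listing above: the function name summarise_clusters and the sample values are hypothetical, and the standard-library statistics.pvariance is used because it computes the population variance, which is also the default of numpy.var.

```python
import statistics


def summarise_clusters(sorted_tts, split_indices):
    """Cut a sorted list of travel times at the given indices and
    summarise every resulting contiguous cluster."""
    bounds = [0] + sorted(split_indices) + [len(sorted_tts)]
    clusters = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        members = sorted_tts[lo:hi]
        if not members:
            continue                 # ignore empty slices
        clusters.append({
            "avg_tt": round(float(statistics.mean(members)), 2),
            "tt_list": members,
            "tt_var": round(float(statistics.pvariance(members)), 2),
        })
    return clusters


# Hypothetical travel times (seconds), split before index 3 and index 5
for cluster in summarise_clusters([100, 102, 104, 200, 202, 300], [3, 5]):
    print(cluster)
```

With n split indices this produces n + 1 clusters, one for every gap between consecutive boundaries.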
A.2 Auxiliary method for finding optimum splits in the
clustering algorithm
import numpy


def find_splits(tt_list, split_elements, mnt):
    """Find splitting elements of the TT list
    recursively until min no. of trips in the
    list is less than mnt"""
    if len(tt_list) < 2 * mnt:
        return split_elements
    else:
        list_var = numpy.var(tt_list)
        list_len = len(tt_list)
        min_wav = list_var
        min_wav_id = 0
        # at least 2 elems req. for var calc.
        for i in range(1, len(tt_list) - 1):
            # weighted average variance (wav) of the two parts
            wav = ((len(tt_list[:i]) * numpy.var(tt_list[:i]))
                   + (len(tt_list[i:]) * numpy.var(tt_list[i:]))) / list_len
            if wav < min_wav:
                min_wav = wav
                min_wav_id = i
        # split only if both parts keep more than mnt trips
        if ((tt_list[min_wav_id] not in split_elements)
                and (len(tt_list[:min_wav_id]) > mnt)
                and (len(tt_list[min_wav_id:]) > mnt)):
            split_elements.append(tt_list[min_wav_id])
            find_splits(tt_list[:min_wav_id], split_elements, mnt)
            find_splits(tt_list[min_wav_id:], split_elements, mnt)
        return split_elements
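The split criterion can be stated compactly: for a split position i that divides n sorted travel times into a left part of n_L values and a right part of n_R values, the weighted average variance is wav(i) = (n_L * var_L + n_R * var_R) / n, and the position with the lowest wav is chosen. A minimal, non-recursive sketch of this single step follows; the function name best_split and the sample data are hypothetical, and statistics.pvariance stands in for numpy.var (both compute the population variance).

```python
import statistics


def best_split(sorted_tts):
    """Return (index, wav) of the split with the lowest weighted
    average variance; index 0 means that no split improves on the
    variance of the whole list."""
    n = len(sorted_tts)
    best_i, best_wav = 0, statistics.pvariance(sorted_tts)
    for i in range(1, n):
        left, right = sorted_tts[:i], sorted_tts[i:]
        wav = (len(left) * statistics.pvariance(left)
               + len(right) * statistics.pvariance(right)) / n
        if wav < best_wav:
            best_i, best_wav = i, wav
    return best_i, best_wav


# Two well-separated groups of hypothetical travel times (seconds)
index, wav = best_split([10, 11, 12, 50, 51, 52])
print(index, round(wav, 3))   # 3 0.667
```

Here each group of three values has a population variance of 2/3, so splitting between them gives wav = (3 * 2/3 + 3 * 2/3) / 6 = 2/3, far below the variance of the unsplit list.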
A.3 Method for finding nearest neighbours from clusters
import collections

import db_ops  # project database helper module (not listed in this appendix)


def get_nearest_neighbours(cluster_hash, crossed_seg_tt,
                           num_nearest_neighbours, min_num_segs):
    # 76 values if bus is on seg77
    num_segs_crossed = len(crossed_seg_tt)
    last_seg_tt = crossed_seg_tt[-1]  # tt on seg76
    last_seg_id = num_segs_crossed  # = 76
    # pick the cluster whose key (average TT) is closest to last_seg_tt
    seg_clusters = cluster_hash[num_segs_crossed]
    nearest_key = min(seg_clusters, key=lambda tt: abs(tt - last_seg_tt))
    cand_cluster = seg_clusters[nearest_key]
    cand_trip_ids = cand_cluster['trip_id_list']
    # comma-separated, single-quoted trip ids for the database query
    cand_trip_ids_str = "'" + "','".join(cand_trip_ids) + "'"
    cand_trip_data = db_ops.get_tt_from_db(trip_ids=cand_trip_ids_str,
                                           num_segs=min_num_segs)
    cand_num_segs = len(cand_trip_data[0][2])
    # Euclidean distance calculations
    rmsd_hash = {}
    for cand_trip in cand_trip_data:
        sum_sq_diff = 0
        for i in range(0, num_segs_crossed):
            sum_sq_diff += (crossed_seg_tt[i] - cand_trip[2][i]) ** 2
        rmsd_hash[sum_sq_diff ** 0.5] = cand_trip
    # rank the candidate trips by increasing distance
    ranked_best_trips = collections.OrderedDict(sorted(
        rmsd_hash.items(), reverse=False))
    return list(ranked_best_trips.items())[:num_nearest_neighbours]
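The ranking step can be tried without a database. The sketch below is a simplified stand-in for the listing above, under stated assumptions: candidate trips are passed in directly as a dictionary from a hypothetical trip id to its list of segment travel times, and the function name rank_by_distance is illustrative.

```python
import math


def rank_by_distance(current_tts, candidate_trips, k):
    """Return the k candidates closest to the current trip as
    (distance, trip_id) pairs, using the Euclidean distance over the
    segments crossed so far."""
    n = len(current_tts)
    scored = []
    for trip_id, tts in candidate_trips.items():
        dist = math.sqrt(sum((a - b) ** 2
                             for a, b in zip(current_tts, tts[:n])))
        scored.append((dist, trip_id))
    scored.sort()
    return scored[:k]


# Hypothetical travel times (seconds) on the three segments crossed so far
current = [60, 90, 120]
candidates = {
    "t1": [60, 90, 120, 80],
    "t2": [63, 94, 120, 75],
    "t3": [100, 100, 100, 90],
}
print(rank_by_distance(current, candidates, 2))   # [(0.0, 't1'), (5.0, 't2')]
```

Storing the results as a list of pairs rather than a dictionary keyed by distance keeps both candidates when two trips happen to lie at exactly the same distance.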
LIST OF PAPERS BASED ON THESIS
1. Rakesh Behera, Devarsh Kumar and Lelitha Vanajakshi, Data Analytics based
Dynamic Passenger Information System. In Urban Mobility India (UMI) Research
Symposium, New Delhi, (2013).