Thesis
The design of traffic light and traffic sign recognition systems for autonomous vehicles

Submitted in partial fulfilment of the requirements of
BITS C421T/422T Thesis

By
Rachit Bhargava
ID No. 2011A4TS232P

Under the supervision of
Dr. K. Madhava Krishna
Associate Professor, IIIT Hyderabad
&
Dr. R.K. Mittal
Professor, BITS Pilani
Acknowledgement
Firstly, I would like to thank the Head of the Mechanical Department, Dr. Sai Jagan Mohan, for granting me permission to explore my interests in robotics and pursue a semester off campus at one of the leading robotics facilities in India, IIIT Hyderabad. I appreciate the freedom he has given me throughout my thesis to pursue what I am interested in.

I would also like to thank Dr. R.K. Mittal, who has been my on-campus advisor for the duration of this thesis. I am very thankful to him for his support during this period and, more importantly, for spending his precious time guiding me on how to pursue my interests in robotics.

Finally, and most importantly, I would like to thank my supervisor, Dr. K. Madhava Krishna, for believing in me and trusting me with projects of vast importance to him even when I had no prior research experience in this field. With the help of my mentor, Harit Pandya, and Dr. Madhava Krishna, I have been able to explore the highly exciting field of computer vision and gain valuable practical exposure in it. This experience has made me much more confident about exploring further the highly complex, diverse and ever-evolving field of robotics.

My time here has been very enjoyable, and I hope my work has been up to your standards and has reflected my enthusiasm to work in this field.
CERTIFICATE
This is to certify that the thesis entitled "The design of traffic light and traffic sign recognition systems for autonomous vehicles", submitted by Rachit Bhargava, ID No. 2011A4TS232P, in partial fulfilment of the requirements of BITS C421T/422T Thesis, embodies the work done by him under my supervision.
Project abstract
Road traffic accidents claim more than a million lives each year, and the majority of these accidents are due to driver error. This figure can be significantly reduced if we are able to provide information that helps the driver react more quickly, or even have the automobile react on the driver's behalf. Thus there is a need for an intelligent driver safety system. This project deals with two modules for such an intelligent system, namely a traffic sign and a traffic light recognition system. The aim of the project is to develop a system that takes in images of dense urban environments, detects traffic lights and traffic signs and, more importantly, recognises them correctly. An even more important requirement is that this must be achieved in real time for it to be implemented successfully in an autonomous vehicle. This is achieved using basic image processing techniques like segmentation, dilation and erosion, and machine learning techniques like cascade training and SVMs (Support Vector Machines).
Table of contents
1. Design of traffic light detection system for autonomous vehicle
   1.1 Introduction
   1.2 The image and the RGB and HSV colour space
   1.6 Results
   1.7 Challenges
2. Design of traffic sign recognition system for autonomous vehicle
   2.1 Introduction
   2.2 The image and the RGB and HSV colour space
   2.6 Results
   2.7 Challenges
3. Conclusion
4. References
1.2 The image and the RGB and HSV colour space
Any image input into a computer is stored as a 3-dimensional matrix, with each point represented by three channels of red, green and blue intensities. The intensities range from 0 (extremely dark) to 255 (extremely bright). It is from this matrix that we extract the information indicating the position of the object we wish to find. For instance, traffic lights can be distinguished from other objects in an image because they can be of only two colours, namely red and green, and because their intensity is extremely high in comparison to their surroundings. Thus our first step is to use this information to our advantage through colour segmentation of the image.

Colour segmentation algorithms include the histogram threshold method, the feature space clustering method, the region-based method, the edge-based detection method, the fuzzy set method, the neural network, the method based on the physical model and so on. Of these, the threshold method has good real-time performance, and the hue of each kind of light is essentially fixed: when a light turns on, its intensity is largely constant.
Therefore, the threshold segmentation method based on the HSV colour space is suitable
for recognition of traffic lights.
The HSV colour space is a colour model based on human vision, and many studies have used colour thresholds with the HSI colour model. It can be used as a measure for distinguishing the predefined colours of traffic lights. In this work, a segmentation method in HSV colour space was used. About 900 samples with different lighting conditions, background environments and brightness levels were selected to calculate the H, S and V statistical curves of the red and green lights. The characteristic parameters of the threshold segmentation in HSV colour space were obtained from these statistical curves and by trial and error. Through various trials we finally decided upon a region of H, S and V that minimised both the number of candidates for the next step and the number of false negatives.
The resultant HSV ranges taken in the final algorithm were:
For the red light:
H: 165-34 (the hue range wraps around 0)
S: 192-255
V: 192-255
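As a sketch of how such a threshold is applied, the check below tests a single RGB pixel against the red-light window above, assuming OpenCV-style scales (H in 0-179, S and V in 0-255). The helper names are illustrative, not taken from the thesis code; note how the hue test handles the wrap-around near 0.

```python
import colorsys

def rgb_to_hsv255(r, g, b):
    """Convert 8-bit RGB to OpenCV-style HSV (H: 0-179, S/V: 0-255)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 180.0, s * 255.0, v * 255.0

def in_red_window(r, g, b, h_lo=165, h_hi=34, s_lo=192, v_lo=192):
    """True if the pixel falls inside the red-light HSV window."""
    h, s, v = rgb_to_hsv255(r, g, b)
    # The hue window wraps around zero, so accept either side of the split.
    hue_ok = (h >= h_lo) or (h <= h_hi)
    return hue_ok and s >= s_lo and v >= v_lo
```

A bright, saturated red pixel passes the test; a bright green one fails on hue, and a dark red one fails on the V (brightness) bound, which is exactly why lit spotlights survive the segmentation while dull red objects tend not to.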
2. Reducing the number of candidates for the recognition phase: Let us return to the question of what is unique about the shape of a traffic light. The fact that it is circular, of course! So we find the contour associated with each blob of connected components and obtain the bounding box surrounding each blob. This is done using predefined functions in the OpenCV library: the centroid of the blob is measured by means of moments, and the extreme corners of the blob are found. Once we have the bounding boxes, we get a measure of their aspect ratio. As is obvious, this results in the removal of blobs which do not satisfy the circularity condition, thereby further reducing the number of candidates for the spotlight recognition phase.
The resultant of these two steps is:
Fig 6: The result after noise removal and applying geometric constraints
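The aspect-ratio filter described in step 2 can be sketched as below. The tolerance value is an assumption for illustration; the thesis does not state the exact cut-off used.

```python
def aspect_ratio_ok(width, height, tol=0.3):
    """Accept a bounding box whose width/height ratio is within tol of 1,
    as the box around a circular spotlight is roughly square."""
    if width == 0 or height == 0:
        return False
    return abs(width / height - 1.0) <= tol

def filter_blobs(boxes, tol=0.3):
    """boxes: list of (x, y, w, h) tuples; keep only near-square candidates."""
    return [b for b in boxes if aspect_ratio_ok(b[2], b[3], tol)]
```

For example, a 10x10 box survives while a 30x10 box (a long horizontal streak of red, say a car tail-light trail) is discarded before the recognition phase.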
The next step is extremely important. We now concentrate on the structure of the
traffic light as shown below.
Fig 7: Schematic of the traffic light used to decide the region of interest
From the blob candidates we are able to determine their heights and widths as well as their centroids. This is done with the image moments m00, m01 and m10. The general raw image moment Mij is given by the equation

Mij = sum over x and y of x^i * y^j * I(x, y)

where x and y are the image coordinates (matrix column and row) and I(x, y) is the pixel intensity value.
By studying the schematic representation of the traffic lights in the area, we used the centroid, width and height information of the blob (the traffic spotlight) to generate a region of interest (by expanding the bounding box using ratios of dimensions obtained from the schematic representation) that contains the entire traffic light and not just the spotlight. It is this region of interest that is passed on to the recognition phase.
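A minimal sketch of the two operations just described follows: the raw moments give the blob centroid, and the spotlight box is grown by fixed ratios. The expansion factors below are illustrative assumptions, not the actual ratios measured from the traffic-light schematic.

```python
def raw_moment(img, i, j):
    """Mij = sum over x, y of x^i * y^j * I(x, y), for img as a 2-D list."""
    return sum((x ** i) * (y ** j) * img[y][x]
               for y in range(len(img))
               for x in range(len(img[0])))

def centroid(img):
    """Blob centroid (cx, cy) = (m10/m00, m01/m00)."""
    m00 = raw_moment(img, 0, 0)
    return raw_moment(img, 1, 0) / m00, raw_moment(img, 0, 1) / m00

def expand_roi(x, y, w, h, up=1.0, down=4.0, side=0.5):
    """Grow the spotlight's box (x, y, w, h) to cover the whole housing.
    up/down/side are hypothetical schematic ratios."""
    return (x - side * w, y - up * h, w * (1 + 2 * side), h * (1 + up + down))
```

The `down` factor is larger than `up` here on the assumption that the detected spotlight is the top (red) lamp; a real implementation would pick the ratios from the local traffic-light schematic.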
So, in the image that we have been following, we get the following regions of interest:
Fig 8: Example rectangle features used in the Haar cascade. The sum of the pixels in the white rectangles is subtracted from the sum of the pixels in the grey rectangles. Here A and B are two-rectangle features, while C and D are three- and four-rectangle features (Viola & Jones, 2001).
combination of the 1st and 2nd classifiers, and so on until the final classifier is learned. Therefore, the final classifier is the combination of all previous n classifiers, as shown in Fig 9. The AdaBoost cascade of classifiers is one of the fastest and most robust methods of detection and characterisation; however, it presents some limitations on complex scenes, especially those containing shapes that change (Sialat et al., 2009).

The final contribution (or weight) of each of the n classifiers is linearly dependent on the ratio of the true positives and the false positives attained in each case.
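For concreteness, in standard AdaBoost (Viola & Jones, 2001) each weak classifier's vote weight is derived from its weighted error rate as alpha = 0.5 * ln((1 - err) / err), and the strong classifier is the sign of the weighted vote. This is the textbook formulation, shown here as a sketch rather than the thesis's exact training code.

```python
import math

def stage_weight(error_rate):
    """AdaBoost vote weight for a weak classifier with the given error rate
    (0 < error_rate < 1); lower error gives a larger weight."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def strong_classify(weak_outputs, error_rates):
    """Weighted vote of weak classifiers; weak_outputs are +1 / -1 labels."""
    total = sum(stage_weight(e) * o for o, e in zip(weak_outputs, error_rates))
    return 1 if total >= 0 else -1
```

A weak classifier that is no better than chance (error 0.5) gets zero weight, while more accurate stages dominate the final vote.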
1. The image is stored in the form of a 3-dimensional matrix in which the dimensions represent the red, green and blue channels respectively.
2. The image is then converted to an HSV image because it is easier to handle and process (for the reasons explained before).
3. The image is then thresholded within the given constraints, which were decided through statistical analysis of sample data.
4. We now have one binary image each for the green threshold and the red threshold.
5. Morphological transformations and geometric constraints are applied to the binary images so that noise is removed and the number of candidate blobs is reduced.
6. The centroid of each blob is calculated via moments (m00, m01 and m10), and from these so are its width and height.
7. Using the schematics of the standard traffic light, we expand the region of interest beyond the width and height of the blob.
8. The regions of interest are then extracted from the greyscale version of the original image, and their Haar descriptors are calculated.
9. These Haar descriptors serve as the input for the cascade classifier generated using AdaBoost and the sample positive and negative images.
10. After colour segmentation, the colour information of each blob is stored.
11. Coupling this information with the windows detected by the cascade classifier, we are able to successfully detect and recognise traffic lights in an input image.
1.6 Results
We first need to understand how results are computed in the field of computer vision. Two values are used to determine the efficiency of the system, namely precision and recall.
Precision is the ratio of the number of relevant records retrieved (true positives) to the total number of records retrieved, relevant or not (true positives + false positives, i.e. the number of detections). It is usually expressed as a percentage.

Recall is the ratio of the number of relevant records retrieved (true positives) to the total number of relevant records in the database (true positives + false negatives, i.e. the total number of positive data). It is usually expressed as a percentage.
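The two definitions reduce to simple ratios of detection counts; the example counts below are chosen to reproduce the figures reported for this system.

```python
def precision(tp, fp):
    """Fraction of detections that are correct: tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of ground-truth positives detected: tp / (tp + fn)."""
    return tp / (tp + fn)

# Example: 90 correct detections, 10 false alarms, 35 missed lights
# gives precision 90/100 = 0.90 and recall 90/125 = 0.72.
```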
In a system built for driver safety, precision, recall, speed of detection and accuracy of recognition become of utmost importance. For the system in its current stage, the following results were achieved:
Precision: 90%
Recall: 72%
Fig 12: Four examples of results from our system on the LARA traffic light detection dataset
1.7 Challenges
Of course, since we are dealing with real-life images, we have to deal with obstacles that arise from exposure to the environment.

Diverse illumination conditions: This issue is central to any computer vision problem. Detection and recognition depend entirely on our ability to extract the features of the object accurately. But in outdoor conditions the amount of light available is not always ideal. In extremely bright conditions, specular highlights may occur (where reflected light follows a specular model rather than a Lambertian model of reflection), while in darkness there is no light available for the object to reflect. In either case there may be a loss of essential features, leading to false negatives.

Fig 13: Example of the loss of relevant features (like the colour of the object) and the addition of irrelevant ones (like image gradients around the white spot even though the object is actually uniform) due to specular highlighting

Motion blurring: The roads are not always smooth, and neither are conditions always ideal for operating a camera steadily. There are bound to be many vibrations. Thus the camera may not capture a steady image of the environment, and the resulting blur may lead to inaccurate feature extraction and a large amount of noise.
Similar background: The nature of the objects found in the surroundings of an object of interest is always unpredictable. It is bound to happen that the background of the object of interest possesses a colour within the threshold range and hence is not filtered out. The object and the background then appear fused into one blob in the binary image, and this merged blob may be discarded because it no longer satisfies the geometric constraints.

Partial occlusions: We are not always going to have a clear view of the objects. A large vehicle or a tree may block our view of the traffic light. This partial hiding of the object behind obstacles is known as occlusion. Obviously, it is impossible to capture all the features of an object that is partially hidden from view, and the loss of essential features may prevent it from being detected.

Lack of robustness to location: As mentioned earlier, positive samples are required to train the cascade classifier. These positive samples serve as a template for deciding whether a region of interest actually contains the object of interest. But there is no uniformity in the colour, shape, structure etc. of traffic lights around the world. In a country like India, such features may even differ from city to city! So it is not possible to train a cascade classifier that is robust to the location of the car or, more importantly, robust to changes in the features of the object of interest.
2.2 The image and the RGB and HSV colour space
The concept applied here is the same as for the traffic lights (refer to section 1.2). Of course, the sample data being used is different, and hence the thresholds differ from those observed for the traffic lights. We take into account two primary colours of traffic sign, namely red and blue signs. The following were the thresholds in the HSV space:

For red signs:
H: 160-15 (the hue range wraps around 0)
S: 40-255
V: 0-255
If you observe the traffic signs in the image, you will see how well they have been extracted from the dense urban background.
To compute the HOG descriptor, we operate on 10x10 pixel blocks within the detection window, with adjacent blocks overlapping by 50 percent. Each block is divided into 2x2 cells, with each cell having a size of 5x5 pixels. For each cell we calculate the image gradient at each pixel as shown below. The gradient is calculated by sliding a gradient mask (an example can be seen below) over the entire image and computing the values by 2-dimensional discrete convolution.
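The per-pixel gradient step can be sketched as follows, assuming the common 1-D masks [-1, 0, 1] (horizontal) and its transpose (vertical); border pixels are skipped for brevity, and the function name is illustrative.

```python
import math

def gradients(img):
    """img: 2-D list of intensities; returns (magnitude, angle-in-degrees)
    grids computed with the [-1, 0, 1] masks."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # horizontal [-1, 0, 1] mask
            gy = img[y + 1][x] - img[y - 1][x]   # vertical mask
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx))
    return mag, ang
```

In the full HOG pipeline, each cell then accumulates a histogram of these orientations weighted by magnitude, and the per-block histograms are concatenated into the feature vector.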
We first assume a 900-dimensional space, with each point, of course, represented by a 900-dimensional vector. We then collect positive samples (images of traffic signs, in our case) and an even greater number of negative samples (images of anything other than traffic signs). We then calculate the HOG features (the 900-length feature vector) of each sample and plot them in this space along with their class labels (+1 for positive and -1 for negative). We get a figure as below:
Here w is the normal vector to the hyper-plane and b determines its distance from the origin. Both b and w are normalized so that the two parallel hyper-planes on which the support vectors lie are at unit distance from the separating hyper-plane.

Thus this becomes an optimization problem of maximizing the margin D or, equivalently, minimizing ||w||. In summary, the primal form of our problem becomes:

minimize (1/2)||w||^2 subject to yi(w . xi + b) >= 1 for every training sample (xi, yi)
The problem is then converted to its dual form, and on applying the Karush-Kuhn-Tucker conditions it is observed that the solution depends on the support vectors and a few parameters (like the penalty factor and the Gaussian factor), not discussed here. When a test vector z is input, its distance from the optimal hyper-plane is calculated by substituting z for the variable x in the hyper-plane equation. Depending on the sign and magnitude of this value, we can determine the probability of the test image belonging to each class.

Of course, in real situations linearly separable data is not available; instead we have something like the figure below (left-hand side):
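The decision rule just described amounts to evaluating the signed distance w . z + b: the sign gives the predicted class and the magnitude a confidence. A minimal sketch follows; the w and b values in the test are made-up illustrations, not trained parameters.

```python
def decision_value(w, b, z):
    """Signed distance proxy of test vector z from the hyper-plane w.x + b = 0."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b

def classify(w, b, z):
    """Predicted class: +1 on the positive side of the hyper-plane, -1 otherwise."""
    return 1 if decision_value(w, b, z) >= 0 else -1
```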
by means of the application of a kernel. The one I have used is shown in the figure above and is known as the Radial Basis Function (RBF), given by:

K(x, y) = exp(-gamma * ||x - y||^2)

The addition of the new dimension, based on the information present in the vectors, helps generate a mapping in which a viable hyper-plane exists. This is the advantage of the kernel. It is a very interesting and vast topic; please check the references for further information.
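The RBF kernel is a one-liner in code; gamma is a free parameter normally chosen by validation, and the value below is only a placeholder default.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2) for equal-length vectors x, y."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The kernel equals 1 when the two vectors coincide and decays towards 0 as they move apart, which is what lets the SVM behave like a similarity-based classifier in the implicit high-dimensional space.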
In my traffic sign detection system there are 38 different classes, which have been trained in two different ways:

Multiclass non-linear SVM with an RBF kernel: Similar to the process explained above, just involving 38 different classes. Similar images belong to a particular class, and each test image is matched against a class.

Exemplar-SVM: A recently developed method in which each sample is a class of its own and a hyper-plane exists between that sample and the rest. Here the test data is matched against each individual sample rather than a class.
1. The image is stored in the form of a 3-dimensional matrix in which the dimensions represent the red, green and blue channels respectively.
2. The image is then converted to an HSV image because it is easier to handle and process (for the reasons explained before).
3. The image is then thresholded within the given constraints, which were decided through statistical analysis of sample data.
4. We now have one binary image each for the blue threshold and the red threshold.
5. We then find the blobs in the image, find the approximate polygon corresponding to each blob, and hence classify the blobs as circles or triangles (two common shapes of traffic signs).
6. We then apply certain geometric constraints (size, aspect ratio etc.) to the contours obtained, thereby reducing the number of test candidates. The bounding boxes of the remaining candidates are obtained.
7. We then train three multiclass non-linear SVMs, one for each of the following cases:
   1. Blue traffic signs (containing 8 signs / positive classes and 1 negative class)
   2. Red triangular traffic signs (containing 16 signs / positive classes and 1 negative class)
   3. Red circular traffic signs (containing 14 signs / positive classes and 1 negative class)
8. The bounding boxes are used to extract the regions of interest from the greyscale version of the original image. Each region is resized to 30x30 so that it can be tested using the appropriate classifier (one of the three mentioned above).
9. If the region of interest matches any of the positive classes, its bounding box is marked on the original image along with its label.
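The circle/triangle step above can be sketched as a vertex-count test: once a contour has been reduced to an approximate polygon (e.g. with OpenCV's polygon approximation), three corners indicate a triangle while many short edges indicate a near-circular sign. The cut-off for "circle" below is an assumption, not a value from the thesis.

```python
def classify_shape(polygon_vertices):
    """Classify an approximated contour by its number of corners."""
    if polygon_vertices == 3:
        return "triangle"
    if polygon_vertices >= 8:      # many short edges approximate a circle
        return "circle"
    return "other"
```

Blobs classified as "other" (e.g. quadrilaterals from windows or vehicles) are dropped before the SVM stage, which is part of how the number of test candidates is kept small.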
2.6 Results
This system is still at a very early stage, and hence precision and recall measurements cannot yet be made. The computation time was measured at roughly 0.29 seconds per image, owing to the higher complexity involved in comparison with the traffic light recognition problem. Some of the result images have been compiled in the attached folder titled Traffic Signs (the images are too large for MS Word).
2.7 Challenges
The challenges involved are exactly the same as those observed in the traffic light system (refer to section 1.7) but are amplified owing to the more complex structure of traffic signs. Here are a few challenges specific to traffic signs.

Differentiating between the speed limit signs, i.e. distinguishing between the numbers 120 and 100 or 50 and 30, is extremely complex, and a quicker and more accurate process is required to distinguish such numbers.
Unlike traffic lights, traffic signs are not so well maintained in the urban setup and are prone to rusting, which can lead to loss of essential features.

Traffic signs are made of highly reflective surfaces and are very susceptible to specular highlighting.

Traffic signs are so diverse in structure and colour that it is impossible to create a system that works perfectly even within the same city.

Traffic signs are placed at the sides of the roads and are hence more likely to be occluded and to merge with their surroundings.
3. Conclusion
Autonomous vehicles (driverless cars) are going to be available in the not-too-distant future. The problem of detecting objects of interest in the unpredictable outdoor environment is still a very common problem that researchers are trying to solve. The system presented above is an example of one of the real-time solutions available to attack this problem.

There are still a large number of problems to be tackled before the system reaches a level at which it can be implemented in commercial cars. This is evident from sections 1.7 and 2.7 of the report. A major reason for these problems is that such systems rely on the infrastructure they interact with. In a country like India, where roads are not smooth (more motion blurring) and traffic lights and signs are not well maintained, developing an intelligent system that deals with such a high amount of unpredictability is an altogether different and highly complex problem to solve.

However, rapid developments are being made in computer architecture, and if Moore's Law is to be believed, we may in the near future be able to emulate a human brain capable of solving such complex problems. For now, until that day arrives, we will have to rely on systems like the one shown above, designed assuming nearly perfect environmental conditions and fairly uniform infrastructure (in shape and features). As the results show, the system works well and provides a base upon which future work can be done.
4. References
Traffic light recognition
Conference paper:
The Recognition and Tracking of Traffic Lights Based on Color Segmentation
and CAMSHIFT for Intelligent Vehicles
Jianwei Gong, Yanhua Jiang, Guangming Xiong, Chaohua Guan, Gang Tao and
Huiyan Chen
Conference paper:
Traffic Light Recognition using Image Processing Compared to Learning
Processes
Raoul de Charette, Fawzi Nashashibi
Website (for training dataset):
http://www.lara.prd.fr/lara
Conference paper:
Rapid Object Detection using a Boosted Cascade of Simple Features
Paul Viola and Michael Jones
Website:
http://www.kirupa.com/design/little_about_color_hsv_rgb.htm
Website:
http://docs.opencv.org/
Website:
http://in.mathworks.com/help/images/morphology-fundamentals-dilation-and-erosion.html
Traffic sign recognition

Conference paper:
Traffic Sign Recognition: How far are we from the solution?
Markus Mathias, Radu Timofte, Rodrigo Benenson and Luc Van Gool
Conference paper:
Road-Sign Detection and Recognition Based on Support Vector Machines
Saturnino Maldonado-Bascón, Hilario Gómez-Moreno
Journal Paper:
Automatic Road-Sign Detection and Classification Based on Support Vector Machines and HOG Descriptors
A. Adam, C. Ioannidis
Conference paper:
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs