International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278 0882

Volume 4, Issue 1, January 2015

Pedestrian Detection: A Survey of Methodologies, Techniques and Current Advancements
Tanmay Bhadra, Joydeep Sonar, Arup Sarmah, Chandan Jyoti Kumar
Dept. of CSE & IT, School of Technology
Assam Don Bosco University

Abstract

Detecting pedestrians in an image is a challenging task in the field of object detection. As the number of pedestrian fatalities on roads increases, so does the significance of pedestrian detection. In this paper we briefly study some existing pedestrian detection systems and discuss in detail some of the benchmark datasets currently used in the field. Most of these datasets are publicly available so that other researchers can use them in their own surveys and research. We also look at some of the state-of-the-art techniques used by many pedestrian detection systems and give a brief description of them. Furthermore, we present the typical working structure of a pedestrian detection system. We conclude that the ability to detect and track people is a key research problem, and that machine vision plays a crucial role in this task.

I. INTRODUCTION
Pedestrian detection is a canonical instance of object detection. Its applications in car safety, surveillance, robotics and similar areas have drawn considerable attention to it in recent years. Nevertheless, pedestrian detection remains a challenging task in the field of object detection. It is becoming more significant as the number of pedestrian fatalities increases day after day (more than 30999 pedestrians are killed and 430000 injured in traffic around the world every year) [1]. One of the main concerns of car manufacturers is to have an automated system that is able to detect pedestrians in the surroundings of a vehicle.
Detecting pedestrians effectively from vision is challenging for a number of reasons. Pedestrians appear against different backgrounds, with a wide variety of appearances, body sizes, poses, clothing and outdoor lighting conditions.
The distance of the pedestrian from the camera also plays a vital role, as pedestrians standing relatively far from the camera appear small in the image [2]. Most pedestrian detectors achieve satisfactory performance on high-resolution datasets but encounter difficulties with low-resolution images [3] [4].

II. DATASETS

Despite the availability of various benchmark datasets, Bastian Leibe et al. used 44 recorded sequences of 33 different people walking parallel to the camera image plane for their training dataset [5]. Figure 1 shows a sample from the training dataset used by Bastian Leibe et al.

Figure 1: Training Dataset


Over the years, many public pedestrian datasets have been released. INRIA [6], ETH [7], TUD-Brussels [8], Daimler [9] (Daimler stereo [10]), Caltech-USA [11], and KITTI [12] are the most commonly used ones.
INRIA is one of the oldest datasets and hence has comparatively few images. However, it has high-quality annotations of pedestrians in various settings, and as such it is widely used for training. ETH and TUD-Brussels are mid-sized video datasets. Daimler lacks color channels and as such is not considered by all methods. Daimler stereo, ETH and KITTI provide stereo information. All datasets except INRIA are obtained from video and hence enable the use of optical flow.
Caltech-USA and KITTI are currently the predominant benchmarks. Caltech-USA stands out for the large number of methods that have been evaluated on it, whereas KITTI stands out for the diversity of its test set, although it is not yet used as frequently [12,13]. INRIA, ETH (monocular), TUD-Brussels, Daimler (monocular) and Caltech-USA are available under a unified evaluation toolbox, whereas KITTI uses its own separate one with unpublished test data. Both toolboxes maintain an online ranking where published methods can be compared.


III. BLOCK DIAGRAM

Figure 2: Flow Chart of a Typical Pedestrian Detection System


Training: In machine learning, one of the most important steps is to train the algorithm on a training set. The training set must be distinct from the test set. If the same dataset is used for both training and testing, the resulting model may appear accurate and still fail to detect pedestrians in unseen data. Hence it is important to separate the data into a training set and a test set. Once a model has been created using the training set, it can be tested with the help of the test set.
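The separation of training and test data described above can be sketched as follows. This is only an illustration: the sample names, the 25% test fraction and the helper `train_test_split` are assumptions made for the example, not part of any surveyed system.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and split them into disjoint training and test lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled windows: (image name, label), label 1 = pedestrian.
samples = [("img_%03d" % i, i % 2) for i in range(20)]
train_set, test_set = train_test_split(samples, test_fraction=0.25, seed=42)
print(len(train_set), "training samples,", len(test_set), "test samples")
```

Because the two lists come from one shuffled sequence, no sample can appear in both, which is the property the paragraph asks for.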
In the training phase the main goal is to extract features of the object so that they can be fed to the classifier. After normalizing the data in the training set, we can extract features such as Haar-like features or Histogram of Oriented Gradients (HOG) features. Algorithms such as AdaBoost use a number of training samples to help select appropriate features from the dataset. AdaBoost is able to combine classifiers with poor performance into a bigger classifier with much higher performance [14].
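How AdaBoost combines weak classifiers can be illustrated with a small sketch. The weak learners here are one-feature threshold tests (decision stumps), which is an assumption made for the example; the toy data and all function names are likewise hypothetical.

```python
import math

def stump_predict(stump, x):
    """A stump answers `polarity` when the chosen feature exceeds the threshold."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for thr in sorted(set(x[f] for x in X)):
            for pol in (1, -1):
                stump = (f, thr, pol)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost_train(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # vote strength of this stump
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_train(X, y, rounds=3)
predictions = [adaboost_predict(model, x) for x in X]
```

On this toy set the best single stump still misclassifies a third of the samples, while the weighted vote of three stumps reproduces every label.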
Once the system has been trained and we have a classifier, we can feed the test set to the classifier to check the effectiveness of our algorithm. Once the image to be processed is loaded, the system extracts sub-images, or Regions of Interest (ROIs), from the image and passes them to the classifier. Based on the features learned in the training phase, the classifier decides whether or not a pedestrian is present in a particular sub-image.
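One common way to generate the sub-images is a sliding window, sketched below. The text does not say how the ROIs are chosen, so the sliding-window strategy, the stride and the `classify` callback are assumptions; the 64x128 window matches the size of the INRIA crops mentioned later in this paper.

```python
def sliding_windows(image_width, image_height, win_w=64, win_h=128, stride=8):
    """Yield (x, y, w, h) boxes that fit entirely inside the image."""
    for y in range(0, image_height - win_h + 1, stride):
        for x in range(0, image_width - win_w + 1, stride):
            yield (x, y, win_w, win_h)

def detect(image_width, image_height, classify, **window_args):
    """Return the boxes that the (hypothetical) classifier accepts."""
    return [box for box in sliding_windows(image_width, image_height, **window_args)
            if classify(box)]

# Example: a 128x256 image scanned with a coarse stride of 32 pixels.
windows = list(sliding_windows(128, 256, stride=32))
# Stand-in classifier that only accepts the window at the top-left corner.
hits = detect(128, 256, classify=lambda box: box[0] == 0 and box[1] == 0, stride=32)
```

In a real detector `classify` would crop the box, compute features and query the trained classifier; scanning over several image scales is omitted here for brevity.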

IV. SURVEY ON EXISTING SYSTEM


Bastian Leibe et al. [15] evaluated two shape-based local descriptors, the shape context descriptor and the local chamfer descriptor, along with four different interest point detectors for pedestrian detection. The results were then compared to the standard global chamfer matching approach. They show that shape context descriptors trained on real edge images outperform those trained on clean pedestrian silhouettes.
The paper first compares global chamfer matching with local chamfer matching. The outcome clearly indicated that local chamfer matching outperformed global chamfer matching. A second comparison was carried out between local chamfer matching and local shape context. Here local chamfer matching was again found to be the better technique when silhouette information alone is used. However, when real edge images were used for training, local shape context outperformed all other tested approaches.
Global chamfer matching is a very popular detection approach based on global features. It matches object shape silhouettes to image structures. To achieve this, a silhouette is shifted over the image, and at each location l a distance between the silhouette T and the edge image is calculated, D_chamfer(T, l). This distance is based on a distance transform DT, which gives for each pixel the distance to the nearest feature pixel:

$$D_{chamfer}(T, l) = \frac{1}{|T|} \sum_{t \in T} DT(t + l)$$

where |T| is the number of silhouette pixels. The approach stands out mainly because the resulting similarity measure is smoother [21], which allows the matching process of the pedestrian detector to be sped up by employing a hierarchical search.
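The chamfer distance can be sketched in a few lines. This is an illustration under simplifying assumptions: the distance transform is computed by brute force with Euclidean distances, the search is exhaustive rather than hierarchical, and the tiny template and edge image are made up for the example.

```python
import math

def distance_transform(edge_pixels, width, height):
    """For every pixel, the Euclidean distance to the nearest edge pixel."""
    return [[min(math.hypot(x - ex, y - ey) for ex, ey in edge_pixels)
             for x in range(width)]
            for y in range(height)]

def chamfer_distance(template, location, dt):
    """Mean DT value under the template points shifted to `location`."""
    lx, ly = location
    return sum(dt[ty + ly][tx + lx] for tx, ty in template) / len(template)

def best_match(template, dt, tmpl_w, tmpl_h):
    """Exhaustively shift the template over the image and keep the best location."""
    height, width = len(dt), len(dt[0])
    candidates = [(x, y)
                  for y in range(height - tmpl_h + 1)
                  for x in range(width - tmpl_w + 1)]
    return min(candidates, key=lambda loc: chamfer_distance(template, loc, dt))

# Template: a vertical bar of three points, given as (x, y) offsets.
template = [(0, 0), (0, 1), (0, 2)]
# Edge image (8x8) containing the same bar with its top at (4, 2).
edges = [(4, 2), (4, 3), (4, 4)]
dt = distance_transform(edges, 8, 8)
location = best_match(template, dt, tmpl_w=1, tmpl_h=3)
```

The distance is zero only where every template point lies on an edge pixel and grows smoothly as the template moves away, which is what makes a coarse-to-fine search possible.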
Another local approach they described is the Implicit Shape Model (ISM), which is trained by extracting local features from training images and then modelling their spatial occurrence on the object. An interest point detector D is applied to each training image, and local features F are calculated around the extracted points. These local features are clustered to form a visual vocabulary (codebook) of typical local features. The spatial occurrence distributions on the training data are then recorded for each typical feature. This step is known as Model Training.
After this step comes Hypothesis Generation, in which the extracted features are matched to codebook entries and votes are cast for possible object locations according to the occurrence distributions learned in the training phase. A segmentation mask can then be inferred for each hypothesis. This is done by projecting the supporting features of a hypothesis back to the image and using the segmentation masks stored with the local features. Finally, they applied a Minimum Description Length (MDL) based verification step to disambiguate overlapping hypotheses.
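The voting part of Hypothesis Generation can be sketched as below. It is a strongly simplified illustration: features are two-dimensional toy vectors, each feature is matched only to its single nearest codebook entry, votes fall into a coarse grid, and scale, segmentation and the MDL step are left out. The codebook, offsets and detections are all made-up values.

```python
from collections import defaultdict

def nearest_entry(feature, codebook):
    """Index of the codebook entry closest to the feature (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(feature, codebook[i])))

def cast_votes(detections, codebook, occurrences, bin_size=4):
    """Accumulate votes for object centres in a coarse grid of bins."""
    votes = defaultdict(float)
    for (px, py), feature in detections:
        offsets = occurrences[nearest_entry(feature, codebook)]
        for dx, dy in offsets:
            cx, cy = px + dx, py + dy
            # Each matched entry spreads one unit of weight over its offsets.
            votes[(cx // bin_size, cy // bin_size)] += 1.0 / len(offsets)
    return votes

# Hypothetical codebook: entry 0 looks like a head, entry 1 like a foot.
codebook = [(1.0, 0.0), (0.0, 1.0)]
# Learned offsets from a feature position to the object centre.
occurrences = {0: [(0, 20)], 1: [(0, -20)]}
# Detected interest points: (position, descriptor).
detections = [((50, 30), (0.9, 0.1)),   # head-like patch
              ((50, 70), (0.1, 0.9)),   # foot-like patch
              ((10, 10), (0.8, 0.2))]   # isolated head-like patch
votes = cast_votes(detections, codebook, occurrences)
best_bin = max(votes, key=votes.get)
```

The head-like and foot-like patches agree on the same centre, so that bin collects two votes, while the isolated patch yields only a weak hypothesis.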
To extract feature points, they used an interest point detector within the ISM approach. They evaluated four different interest point detectors: Harris [16,17], Difference-of-Gaussian (DoG) [18], Harris-Laplace [19], and Hessian-Laplace [20]. In addition, two shape-based local feature descriptors were compared within the ISM framework.
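As an illustration of what an interest point detector computes, the sketch below evaluates the Harris corner response on a tiny synthetic image. It makes simplifying assumptions: central-difference gradients, an unweighted 3x3 window instead of a Gaussian one, and the commonly used constant k = 0.04. The function name and test image are made up for the example.

```python
def harris_response(img, k=0.04):
    """Harris response R = det(M) - k * trace(M)^2 with a 3x3 box window."""
    h, w = len(img), len(img[0])

    def px(x, y):
        # Clamp coordinates so that border pixels replicate the image edge.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    ix = [[px(x + 1, y) - px(x - 1, y) for x in range(w)] for y in range(h)]
    iy = [[px(x, y + 1) - px(x, y - 1) for x in range(w)] for y in range(h)]
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = syy = sxy = 0.0
            for v in range(y - 1, y + 2):
                for u in range(x - 1, x + 2):
                    sxx += ix[v][u] ** 2
                    syy += iy[v][u] ** 2
                    sxy += ix[v][u] * iy[v][u]
            resp[y][x] = (sxx * syy - sxy ** 2) - k * (sxx + syy) ** 2
    return resp

# 12x12 image with a bright square covering rows and columns 3..8.
img = [[1 if 3 <= x <= 8 and 3 <= y <= 8 else 0 for x in range(12)]
       for y in range(12)]
resp = harris_response(img)
corner, edge, flat = resp[3][3], resp[6][3], resp[6][6]
```

The response is positive at the corner of the square, negative along its straight edge and zero in the flat interior, so thresholding it keeps corner-like points.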

Figure 3: Interest points (in yellow) on an example image: (b) Harris (c) DoG (d) Harris-Laplace (e) Hessian-Laplace
Local chamfer and shape context descriptors are generally trained on silhouette images. They can, however, be applied to real edge images, of which the silhouette images are an idealized approximation. Both descriptors were trained not only on silhouette images but also on real edge images to make them robust and realistic. When shape-based features are learned on real edge images rather than on silhouette images, both foreground and background structures influence the extracted features.
From their experimental results they concluded that the shape context descriptor trained on real edge images performed best. Compared to raw image patches and local chamfer, it achieved a 20% gain in equal error rate (EER) performance. The choice of interest point detector also had a big impact on detection performance, with the Hessian-Laplace detector outperforming the other detectors. Hence, shape context descriptors trained on real edge images together with the Hessian-Laplace detector represent a good combination for pedestrian detection. The shape context descriptor also speeds up computation because of its relatively low dimensionality.
E. Naranjo et al. [22] study the combination of feature extraction methods for vision-based pedestrian detection. Their system has two basic components: first, locating pedestrian candidates in the image, and second, classifying them with an SVM-based classifier. A candidate selection mechanism is applied to ease the pedestrian recognition task in Intelligent Transportation Systems. Candidate selection can be implemented in the 3D scene or in the 2D image plane by performing object segmentation. For 3D scene object segmentation, stereo vision is used [24][25]. 2D image object segmentation tackles the problem of candidate selection using a single camera.
Candidate selection: The candidate selection method plays a crucial role in the overall performance of the pedestrian detection system. It must ensure that no missed detections occur, so that all real pedestrians are detected. Moreover, the candidates, which are described by bounding boxes in the image, must be located as precisely as possible, since detection accuracy strongly affects the performance of the recognition stage. Information can be extracted from 3D images using disparity map techniques [27] or segmentation based on v-disparity [25][26]. Here they propose a candidate selection method based on the direct combination of the 3D coordinates of the relevant points. Accordingly, a non-dense 3D geometrical representation is created and used for candidate segmentation. This kind of representation allows robust object segmentation whenever the number of relevant points in the image is high. A major advantage is that outliers can easily be filtered out in 3D space, which makes the method less sensitive to noise.
SVM classifier: The Support Vector Machine (SVM), as used in [28][29], is a common approach for pedestrian detection. An SVM provides a method to calculate the hyperplane that optimally separates two high-dimensional classes of objects. Other important aspects when constructing a classifier are the global classification structure, the use of single or multiple cascaded classifiers, and the training strategy.
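The idea of a separating hyperplane can be illustrated with a small linear SVM trained by subgradient descent on the regularized hinge loss. This is a sketch, not the solver used in the surveyed work: the learning rate, the regularization constant `lam`, the number of epochs and the two-dimensional toy features are all assumptions.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit w, b so that sign(w.x + b) separates the labels +1 / -1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Sample is inside the margin: hinge-loss and regularizer gradients.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Sample is safely classified: only the regularizer shrinks w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy feature vectors: +1 = pedestrian, -1 = non-pedestrian.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [svm_predict(w, b, x) for x in X]
```

The regularization term keeps the weights small, which corresponds to preferring the hyperplane with the widest margin between the two classes.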
The by-components approach described here divides the candidate body into several parts. Each body part is learnt independently by a specialized classifier in a first learning stage. The body parts are then integrated by another classifier in a second learning stage, which helps in cases such as partially occluded pedestrians. Independent classifiers are used for each body part to keep the learning process simple. After a large number of trials, they proposed 6 different sub-regions for each candidate region of interest. The 1st sub-region is located in the head zone, and the arms and legs are covered by the 2nd, 3rd, 4th and 5th sub-regions. Each classifier produces a theoretical output between -1 (non-pedestrian) and +1 (pedestrian).
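The second learning stage can be pictured as a function that merges the six per-part outputs into one decision. The sketch below uses a plain weighted sum, which is an assumption made for illustration; the paper only states that another classifier integrates the parts. The scores and weights are made-up values.

```python
def combine_part_scores(part_scores, weights, bias=0.0):
    """Merge per-part outputs in [-1, +1] into a single +1 / -1 decision."""
    if len(part_scores) != len(weights):
        raise ValueError("one weight is needed per body part")
    total = sum(wt * s for wt, s in zip(weights, part_scores)) + bias
    return 1 if total >= 0 else -1

# Hypothetical outputs of the six sub-region classifiers
# (head, four arm/leg regions, area between the legs).
weights = [1.0 / 6] * 6
occluded_pedestrian = [0.9, 0.8, 0.7, -0.6, -0.5, 0.1]   # legs hidden
background = [-0.8, -0.7, -0.9, -0.6, -0.5, -0.4]

decision_occluded = combine_part_scores(occluded_pedestrian, weights)
decision_background = combine_part_scores(background, weights)
```

Even with two parts voting against it, the occluded pedestrian is still accepted, which shows why integrating independent part classifiers helps with partial occlusion.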
A set of features must be extracted from each sub-region and fed to the classifier. Seven different feature extraction methods were tested, including the Canny filter, Haar wavelets, gradient magnitude and orientation, co-occurrence matrix, Normalized Histograms (HON), and Number of Texture Unit (NTU). After applying these features to various images, it was found that NTU and histograms suit the head and arms, HON suits the legs, and NTU suits the area between the legs.
In summary, a comparative study of feature extraction methods for vision-based pedestrian detection was carried out. To reduce the variability of pedestrians, the learning process was simplified by dividing candidates into 6 local sub-regions, which are fed to individual SVM classifiers.
Edgar Seemann et al. [30] consider the problem of detecting pedestrians in crowded real-world scenes. The core of the method combines local and global cues via probabilistic top-down segmentation, and consists of a series of iterative evidence aggregation steps. The main objective is to detect the presence of pedestrians (who may be occluded) in an image and to localize them. Still gray-scale images are considered. The first step is to sample local features from the image and combine them to generate hypotheses about possible object locations. Then, for every hypothesis, a probabilistic top-down segmentation is computed to determine its area of support in the image. The evaluation criterion covers not only the yes/no detection decision but also the precise locations and extents of the pedestrians.
Training approach: First, a codebook of local appearances that are characteristic of the object category is learned. To this end, a scale-invariant DoG interest point operator [31] is applied to all training images, and image patches are extracted with a radius of 3σ of the detected scale. All patches are rescaled to a fixed size and then grouped using an agglomerative clustering algorithm.
During recognition, the same patch extraction process is used, and the local information from the sampled patches is collected in a probabilistic Hough voting procedure. Each patch is matched to the codebook, and the matching entries cast votes for possible object positions.
The possibility of generating top-down segmentations using learned knowledge about an object category is a recent approach [32,33,34]. Here it is used to improve recognition. For each hypothesis, they trace back through the image to determine, at a per-pixel level, where its support came from, thus segmenting the object from the background. A central theme of the paper is the aggregation of evidence from the image over multiple iterations.
The first step in this direction is to use the top-down segmentation to refine the object hypotheses. The main idea behind this step is to integrate only information about the object and discard misleading information from the background.
Navneet Dalal et al. [35] show experimentally that grids of Histogram of Oriented Gradients (HOG) descriptors significantly outperform other feature sets used to detect pedestrians. They further study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization are all important for good results. Because the approach gives near-perfect results on the MIT pedestrian test set, they introduced a more challenging dataset containing over 1800 annotated human images with a range of pose variations and backgrounds.
Locally normalized HOG descriptors perform better than existing feature sets, including wavelets [36, 37]. The descriptors are reminiscent of edge orientation histograms [38, 39] and shape contexts [40], but they improve on them by using overlapping local contrast normalizations. A linear SVM classifier is used throughout for simplicity and speed. The detectors gave essentially perfect results on the MIT pedestrian test set [41,36].
The feature extraction method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Local object appearance can be characterized well by the distribution of local intensity gradients. The image is divided into small cells, a histogram of gradient orientations is accumulated in each cell, and the combined histogram entries form the representation. It is also necessary to contrast-normalize the local responses before using them.
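The main steps of this descriptor can be sketched as follows. The sketch makes several simplifying assumptions compared with the published method: votes go into a single orientation bin without interpolation, blocks are 2x2 cells with plain L2 normalization, and there is no gamma correction. Cell size 8 and 9 unsigned orientation bins are typical choices; all function names are made up for the example.

```python
import math

def gradients(img):
    """Centred [-1, 0, 1] gradients; returns magnitude and unsigned angle (0-180)."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang

def cell_histogram(mag, ang, x0, y0, cell=8, bins=9):
    """Orientation histogram of one cell, weighted by gradient magnitude."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(y0, y0 + cell):
        for x in range(x0, x0 + cell):
            hist[int(ang[y][x] // bin_width) % bins] += mag[y][x]
    return hist

def l2_normalize(vec, eps=1e-6):
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]

def hog_descriptor(img, cell=8, bins=9):
    mag, ang = gradients(img)
    rows, cols = len(img) // cell, len(img[0]) // cell
    cells = [[cell_histogram(mag, ang, cx * cell, cy * cell, cell, bins)
              for cx in range(cols)] for cy in range(rows)]
    descriptor = []
    # Overlapping 2x2-cell blocks, shifted by one cell, each normalized separately.
    for cy in range(rows - 1):
        for cx in range(cols - 1):
            block = (cells[cy][cx] + cells[cy][cx + 1] +
                     cells[cy + 1][cx] + cells[cy + 1][cx + 1])
            descriptor.extend(l2_normalize(block))
    return descriptor

# 16x16 test image: dark left half, bright right half (one vertical edge).
img = [[0 if x < 8 else 255 for x in range(16)] for y in range(16)]
descriptor = hog_descriptor(img)
```

For this image all gradient energy falls into the first orientation bin of each cell, and after block normalization the result no longer depends on the contrast of the edge.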
Orientation histograms have had many precursors [42,38,39], but the idea only reached maturity when it was combined with local spatial histogramming and normalization. The shape context work [40] initially used only edge pixel counts, without the orientation histogramming that makes the representation so effective. The success of these sparse feature-based representations has somewhat overshadowed the power and simplicity of HOGs as dense image descriptors.
The HOG/SIFT representation has several advantages. It captures edge or gradient structure, which is important for local shape, and it does so in a representation with an easily controllable degree of invariance to local geometric transformations. For human detection, coarse spatial sampling, fine orientation sampling and strong local photometric normalization turn out to be the best strategy, presumably because this permits limbs and body segments to change appearance and move from side to side considerably, provided they remain roughly upright.
They tested their detector on two different datasets. The first is the MIT pedestrian database [29], which contains 509 training and 200 test images of pedestrians in a city. It contains mostly front or back views with a limited range of poses. The second, 'INRIA', is a significantly more challenging dataset that they produced themselves. It contains 1805 images of humans, each 64x128 pixels, cropped from personal photos. Most people are standing, but they appear in any orientation and against a wide variety of backgrounds.
Dariu M. Gavrila and M. Enzweiler [9] provide an overview of the current state-of-the-art systems in pedestrian detection, which is a rapidly evolving field. They covered the main aspects of a typical pedestrian detection system and also presented a brief experimental study of certain systems, such as the Haar wavelet-based AdaBoost cascade [23], HOG/linSVM [35] and a few others. They used a large dataset of approximately 8.5 GB containing more than 20,000 images, which was also made public for benchmarking. From their experiments they concluded that HOG/linSVM had a clear advantage on higher-resolution images, at the cost of lower processing speeds, whereas the Haar wavelet-based AdaBoost cascade had an advantage on lower-resolution images and showed (near) real-time processing speeds. They decomposed the task of detecting pedestrians into three components: ROI selection, classification, and tracking or temporal integration.

V. CONCLUSION
Vision-based pedestrian detection is still an open challenge. Pedestrians must be detected against different backgrounds, with some in motion, some stationary, and some changing direction unpredictably. Different approaches have been developed to address these and other such complexities. Although pedestrian traffic fatalities remain a concern, car manufacturers are working hard to protect both vehicle occupants and pedestrians through pedestrian detection systems that alert the driver when a pedestrian is detected in front of the car. Cars such as the Volvo S60, Mercedes S65 AMG and Audi A8L, as well as other high-end cars, already use pedestrian detection. Although these systems may sting a little upfront, they are worth every single penny when it comes to saving a pedestrian's life.

REFERENCES
[1] D. Gavrila, "Sensor-based pedestrian protection", IEEE Intelligent Systems, vol. 16, no. 6, pp. 77-81, November 2001.
[2] D. Gavrila, J. Giebel, and S. Munder, "Vision-based pedestrian detection: the PROTECTOR system", in Proc. IEEE Intelligent Vehicles Symposium, pp. 13-18, June 2004.
[3] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art", TPAMI, 2012.
[4] D. Hoiem, Y. Chodpathumwan, and Q. Dai, "Diagnosing error in object detectors", ECCV, 2012.
[5] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian Detection in Crowded Scenes", Multimodal Interactive Systems, TU Darmstadt, Germany.
[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection", in CVPR, 2005.
[7] A. Ess, B. Leibe, K. Schindler, and L. Van Gool, "A mobile vision system for robust multi-person tracking", in CVPR, IEEE Press, June 2008.
[8] C. Wojek, S. Walk, and B. Schiele, "Multi-cue onboard pedestrian detection", in CVPR, 2009.
[9] M. Enzweiler and D. M. Gavrila, "Monocular pedestrian detection: Survey and experiments", PAMI, 2009.
[10] C. Keller, D. Fernandez, and D. Gavrila, "Dense stereo-based ROI generation for pedestrian detection", in DAGM, 2009.
[11] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: A benchmark", in CVPR, 2009.
[12] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite", in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[13] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art", TPAMI, 2011.
[14] http://msdn.microsoft.com/enIN/library/bbbb895173.aspx
[15] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele, "An Evaluation of Local Shape-Based Features for Pedestrian Detection".
[16] C. Harris and M. Stephens, "A combined corner and edge detector", in Alvey Vision Conference, pp. 147-151, 1988.
[17] C. Schmid and R. Mohr, "Local grayvalue invariants for image retrieval", PAMI, pp. 530-535, 1997.
[18] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", IJCV, 2004.
[19] K. Mikolajczyk and C. Schmid, "Indexing based on scale invariant interest points", in ICCV, 2001.
[20] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors", submitted to PAMI, 2004.
[21] D. Gavrila, "Pedestrian detection from a moving vehicle", in ECCV, pp. 37-49, Springer, 2000.
[22] M. A. Sotelo, I. Parra, D. Fernandez, and E. Naranjo, "Pedestrian Detection using SVM and Multi-feature Combination", 2006 IEEE Intelligent Transportation Systems Conference.
[23] P. Viola, M. Jones, and D. Snow, "Detecting Pedestrians Using Patterns of Motion and Appearance", Int'l J. Computer Vision, vol. 63, no. 2, pp. 153-161, 2005.
[24] D. M. Gavrila, J. Giebel, and S. Munder, "Vision-based pedestrian detection: The PROTECTOR system", in Proc. IEEE Intelligent Vehicles Symposium, pp. 13-18, Parma, Italy, June 2004.
[25] G. Grubb, A. Zelinsky, L. Nilsson, and M. Rilbe, "3D vision sensing for improved pedestrian safety", in Proc. IEEE Intelligent Vehicles Symposium, pp. 19-24, Parma, Italy, June 2004.
[26] R. Labayrade, C. Royere, D. Gruyer, and Aubert, "Cooperative fusion for multi-obstacles detection with use of stereovision and laser scanner", in Proc. International Conference on Advanced Robotics, pp. 1538-1543, 2003.
[27] L. Zhao and C. E. Thorpe, "Stereo and neural network-based pedestrian detection", IEEE Transactions on ITS, vol. 1, no. 3, September 2000.
[28] C. Papageorgiou and T. Poggio, "A trainable system for object detection", Int'l J. Computer Vision, vol. 38, no. 1, pp. 15-33, 2000.
[29] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object detection in images by components", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 4, 2001.
[30] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian Detection in Crowded Scenes".
[31] D. Lowe, "Distinctive image features from scale-invariant keypoints", IJCV, 60(2):91-110, 2004.
[32] E. Borenstein and S. Ullman, "Class-specific, top-down segmentation", in ECCV02, LNCS 2353, pp. 109-122, 2002.
[33] S. X. Yu and J. Shi, "Object-specific figure-ground segregation", in CVPR03, 2003.
[34] B. Leibe and B. Schiele, "Interleaved object categorization and segmentation", in BMVC03, pp. 759-768, 2003.
[35] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection".
[36] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object detection in images by components", PAMI, 23(4):349-361, April 2001.
[37] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance", The 9th ICCV, Nice, France, vol. 1, pp. 734-741, 2003.
[38] W. T. Freeman and M. Roth, "Orientation histograms for hand gesture recognition", Int'l Workshop on Automatic Face and Gesture Recognition, IEEE Computer Society, Zurich, Switzerland, pp. 296-301, June 1995.
[39] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma, "Computer vision for computer games", 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pp. 100-105, October 1996.
[40] S. Belongie, J. Malik, and J. Puzicha, "Matching shapes", The 8th ICCV, Vancouver, Canada, pp. 454-461, 2001.
[41] C. Papageorgiou and T. Poggio, "A trainable system for object detection", IJCV, 38(1):15-33, 2000.
[42] R. K. McConnell, "Method of and apparatus for pattern recognition", U.S. Patent No. 4,567,610, January 1986.
