
Image and Vision Computing 30 (2012) 292–305

Scene Aware Detection and Block Assignment Tracking in crowded scenes


Genquan Duan a,*, Haizhou Ai a, Junliang Xing a, Song Cao b, Shihong Lao c

a Computer Science and Technology Department, Tsinghua University, Beijing, China
b Electronic Engineering Department, Tsinghua University, Beijing, China
c Development Center, OMRON Social Solutions Co., LTD, Kyoto, Japan

Article history: Received 18 July 2011; Received in revised form 7 February 2012; Accepted 10 February 2012

Keywords: Visual surveillance; Object detection; Object tracking; Particle filter

Abstract
How far can human detection and tracking go in real world crowded scenes? Many algorithms often fail in such scenes due to frequent and severe occlusions as well as viewpoint changes. In order to handle these difficulties, we propose Scene Aware Detection (SAD) and Block Assignment Tracking (BAT), which incorporate available scene models (e.g. background, layout, ground plane and camera models). The SAD achieves accurate detection by utilizing 1) a camera model to deal with viewpoint changes by rectifying sub-images, 2) a structural filter approach to handle occlusions based on a feature sharing mechanism in which a three-level hierarchical structure is built for humans, and 3) foregrounds for pruning negative and false positive samples and for merging intermediate detection results. Many detection or appearance based tracking systems are prone to errors in occluded scenes because of detector failures and interactions of multiple objects. In contrast, the BAT formulates tracking as a block assignment process, where blocks with the same label form the appearance of one object. In the BAT, we model objects on two levels: the ensemble level measures how much a region looks like an object using discriminative models, and the block level measures how much it looks like a target object using appearance and motion models. The main advantage of BAT is that it can track an object even when all the part detectors fail, as long as the object has assigned blocks. Extensive experiments in many challenging real world scenes demonstrate the efficiency and effectiveness of our approach.
© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Human detection and tracking are classic problems in computer vision, with applications in visual surveillance, driver assistance systems and traffic management, and significant progress has been achieved recently. Many existing detection and tracking methods, however, face great challenges from radial distortions, illumination variations, viewpoint changes and occlusions, all of which are quite common in real world scenes.
The goal of our work is to cope with these difficulties to detect and track multiple humans in surveillance scenes using a single stationary camera. Many detection and tracking systems developed so far assume that the viewpoint is frontal, that a person enters the scene without occlusions, that a person appears or disappears only at some special locations, that a person exists in the scene for a given number of frames, or that the flow of people is gentle. In this paper, we present a robust detection and tracking system attempting to minimize such constraining assumptions, which is able to handle the following difficulties: 1) occlusion, when multiple persons enter and move in the scene in a crowd; 2) relatively unconstrained camera viewpoints, rotations and heights; 3) relatively unconstrained human motions, appearances and positions with respect to the camera; 4) humans appearing for only a small number of frames; and 5) relatively slowly moving humans.


We only assume that humans stand on the ground plane in the scene, and ignore those below this ground plane or standing in other places such as rooftops, windows or the sky. This is a reasonable assumption that applies in most surveillance scenes.
We innovate in both detection and tracking for scenes with occlusions and viewpoint changes. Our main contributions are twofold.
A Scene Aware Detection for accurate detection. Specifically, it includes: (1) a simple but efficient learning algorithm that uses foregrounds to prune negative and false positive samples; (2) a structural filter approach to detect occluded humans with a feature sharing mechanism; and (3) a foreground aware merging strategy to explain foregrounds by detection results.
A Block Assignment Tracking for robust tracking, where tracking is formulated as a block assignment process and objects are modeled on different levels, i.e. the block level and the ensemble level. Blocks with the same label form the appearance of one object, from which robust appearance and motion models can be established. Its main advantage is that it can track an object even when all the part detectors fail, as long as the object has assigned blocks.


The rest of this paper is organized as follows. Related work is discussed in the next section. Our system is overviewed in Section 3.
Scene Aware Detection is presented in Section 4. Block Assignment
Tracking is described in Section 5. Experimental results on many challenging real world datasets are provided along with some discussions
in Section 6. Conclusions and future work are given in Section 7.


2. Related work


There is a great deal of work in the literature on object detection, such as faces [1] and pedestrians [2–4], and on multiple target tracking, such as vehicles [5] and humans [6–8]. Here we first review some robust detection methods that cope with occlusions and viewpoint changes, and then discuss some detection related and detection free tracking algorithms.
2.1. Robust detection

2.1.1. Occlusion handling
Using multiple part detectors, Wu et al. [2] proposed a Bayesian approach for combination, while Huang et al. [3] introduced a dynamic search. Wang et al. [9] proposed a global-part occlusion handling method, where an occlusion likelihood map is first produced from HOG feature responses and then segmented by a mean shift approach.

2.1.2. Viewpoint change handling
Due to changes of viewpoint, human appearances and poses vary a lot. To address this difficulty, Li et al. [10] detected objects in rectified sub-images with a detector learned for the frontal viewpoint. Another strategy is to learn one powerful detector for all possible viewpoints, such as [11,12]. Duan et al. [11] first clustered the complex multiple viewpoint samples into several sub-categories and then learned a classifier for each sub-category. Felzenszwalb et al. [12] proposed a more efficient model, the Deformable Part based Model, in which a root filter and several part models are learned for each object category, which can detect objects with some pose changes.

2.1.3. Integration with other models
Beleznai et al. [13] used local shape descriptors to infer human locations in images of absolute difference from a background model. Hoiem et al. [14] and Huang et al. [15] utilized a scene geometric model to restrict object locations and a ground plane model to restrict objects' heights at a particular location.

2.2. Robust tracking

2.2.1. Detection free tracking
Some techniques assume that objects enter the scene at some specific location [5], or appear in the scene without occlusions [5,16] for a period of time that allows object models to be built up while they are isolated. Some techniques (e.g. [5,6]) depend on accurate segmentation of moving foreground objects from a background color intensity model, where Kamijo et al. [5] segmented foreground blocks into vehicles using spatial–temporal information, and Zhao et al. [6] developed a tracker based on a human shape model. All of them rely on the inherent assumption that there is a significant difference in color intensity between foreground and background. Unfortunately, background modeling suffers from many problems, such as being inaccurate, noise sensitive, and weak under shadow. Similar assumptions are made in [17–20], where the authors extracted features, e.g. intensity, colors, edges, contours and feature points, and used them to establish correspondences between model images and target images. Moreover, shape based approaches [6,21] encounter challenges when body parts are not isolated, which may cause significant occlusions, and appearance based ones [16] often fail when several objects get close together, since this kind of algorithm fails to allocate pixels to the correct object. In order to overcome some of these problems, Kelly et al. [22] used 3D stereo information to detect pedestrians via a 3D clustering process and tracked them by a weighted maximum cardinality matching scheme.

2.2.2. Detection related tracking

2.2.2.1. Detection based tracking. With the fast development of object detection techniques, object detectors play an important role in many tracking algorithms. Some tracking algorithms use detection as their observation model. One of the most successful techniques is the particle filter [23]. The particle filter is based on Sequential Monte Carlo sampling, and has gained much attention because of its simplicity, generality, and extensibility in a wide range of challenging applications. Xing et al. [7] combined multiple part detectors with a particle filter to track multiple objects under occlusions. Another line of work associates detection results of video frames locally [24] or globally [8,15,25–27]. Wu et al. [24] associated detection results in two consecutive frames. Jiang et al. [25] adapted Linear Programming for association, while Zhang et al. [26] used min-cost flow. Andriluka et al. [8] tailored the Viterbi algorithm to link detection results, which combines the advantages of detection and tracking. Huang et al. [15] presented a three-level hierarchical association approach in which short tracks and long tracks are obtained at the low and middle levels respectively, and the final trajectories are refined with estimated scene knowledge at the high level. Pirsiavash et al. [27] proposed globally optimal greedy algorithms to estimate the number of tracks and their birth and death states in a cost function. Global association based tracking methods can theoretically obtain a global optimum, since the results of all the frames are available before tracking. However, the cost of heavy computation and temporal delay limits them in real time applications.

2.2.2.2. Online learning. Avidan [28] trained an ensemble of weak classifiers online to distinguish between the object and the background. Grabner et al. [29] described an online boosting algorithm for real-time tracking, which is very adaptive but may drift. To limit the drifting problem, Grabner et al. [30] introduced a semi-supervised learning algorithm using unlabeled data explored in a principled manner, while Babenko et al. [31] proposed online Multiple Instance Learning using one positive bag consisting of several image patches to update a learned classifier. However, manual initialization and the focus on single object tracking prevent their application in the scenes we are interested in.

3. System overview

We propose to detect and track multiple humans in surveillance scenes with occlusions and viewpoint changes using a single stationary camera, by taking advantage of some available scene models (e.g. background, camera, layout and ground plane models). We believe that the models we use are generic and applicable to a wide variety of situations. The models used are listed as follows.
(a) A camera model to rectify an image with large viewpoint
changes into a frontal viewpoint;
(b) A background model to direct the system's attention to the regions showing difference from the background;
(c) A layout model to restrict objects in the scene;
(d) A ground plane model to restrict objects standing on the ground.
The whole system is overviewed in Fig. 1, which mainly includes two components, Scene Aware Detection and Block Assignment Tracking. The three key factors of the SAD are foreground aware pruning to prune negative and false positive samples, a structural filter approach based on our previous work [4] to detect occluded objects, and foreground aware merging to explain foregrounds by detection results.

Fig. 1. System overview. Round rectangle box: inputs and outputs. Rectangle box: procedure. Solid arrow: data flow. Double-line arrow: extra input models. The key factors of our system are marked out in bold.

The BAT formulates tracking as a block assignment process, which can track an object even when all the part detectors fail, as long as the object has
assigned blocks. The BAT proceeds as follows. It first maintains the spatial and temporal consistency at the block level (Block Tracking), then precisely estimates the locations and sizes of objects at the ensemble level using appearance, motion and discriminative models (Ensemble Tracking), and finally assigns blocks so that blocks with the same label look like a part of a human, by combining both previous results (Ensemble to Block Assignment). In the implementation, we split each frame into 8 × 8 blocks; a 640 × 480 image typically contains 80 × 60 = 4800 blocks. A block is called a foreground block if the number of its pixels inside the foreground region is larger than 20% of the total number of pixels in the block. Similar to [5], the BAT takes foreground blocks into account and ignores background ones.
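For concreteness, the following short sketch (our own Python/NumPy illustration, not code from the paper; the function name and the synthetic mask are hypothetical) shows how a frame-sized foreground mask can be tiled into 8 × 8 blocks and how a block is flagged as foreground when more than 20% of its pixels are foreground:

    import numpy as np

    def foreground_blocks(fg_mask, block=8, ratio=0.2):
        """Tile a binary foreground mask into block x block cells and
        return a boolean grid marking cells whose foreground fraction
        exceeds `ratio` (e.g. an 80 x 60 grid for a 640 x 480 frame)."""
        h, w = fg_mask.shape
        gh, gw = h // block, w // block
        # Sum foreground pixels inside every block with one reshape.
        cells = fg_mask[:gh * block, :gw * block].astype(np.int32)
        cells = cells.reshape(gh, block, gw, block).sum(axis=(1, 3))
        return cells > ratio * block * block

    # Example: a 480 x 640 mask yields a 60 x 80 grid of block flags.
    mask = np.zeros((480, 640), dtype=bool)
    mask[100:180, 200:260] = True          # a synthetic foreground region
    print(foreground_blocks(mask).sum())   # number of foreground blocks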
The BAT addresses a particular segmentation problem, coarser than pixel level segmentation but finer than bounding boxes, as illustrated in Fig. 2. Pixel level segmentations are defined to achieve the most accurate results, but they are somewhat prone to errors under occlusions, and particularly for non-rigid objects like humans with viewpoint changes, since their contours are disturbed and vary drastically. These restrictions prevent such methods from being applied in the scenes we focus on. Bounding boxes may take extra (non-object or other-object) pixels into account and miss some real pixels. These drawbacks also exist in the BAT but are relatively more moderate, since the BAT considers foreground blocks and ignores background ones. More importantly, the BAT can build more robust appearance and motion models for objects from these blocks than from bounding boxes.
4. Scene aware detection

4.1. Scene models
The background model is widely used in many tracking systems. In order to establish a background model robust to noise, motion and illumination variations, we employ the lifespan background modeling algorithm of our previous work [32], where short-, middle- and long-lifespan models are adaptively built online and updated in a collaborative manner.

Fig. 2. Comparisons of BAT, bounding boxes and pixel level segmentations for one object. (a) an image; (b) the foreground image; (c) ideal pixel level segmentations labeled manually; (d) bounding boxes with extra pixels (left) and missed pixels (right); and (e) BAT with extra blocks (left) and missed blocks (right). Please see Section 3 for more discussion.


Fig. 3. Models in detection: (a) original images; (b) foregrounds; (c) scene layouts; (d) some searching points in red, with lines whose lengths indicate the corresponding human heights; (e) cropped sub-images and their foregrounds; and (f) detection results projected as quadrangles in the original images. The top and bottom rows show a common frontal viewpoint scene and a changed viewpoint one respectively. Note that, in the latter case, camera models are adopted to handle the difficulty of viewpoint changes.

Camera models are utilized to handle viewpoint changes in detection. We follow the method of [10], which first detects objects in sub-images rectified from a changed viewpoint to a frontal viewpoint, and then projects the detection results back into the original image. This kind of method is able to take advantage of detectors learned for the frontal viewpoint and avoids the more difficult training on multiple viewpoint samples. During detection, the sampling in 3D space is projected into the image coordinate frame as shown in Fig. 3(d) (bottom). Moreover, there is no need to do such rectifications for frontal viewpoint scenes. To speed up detection in these scenes, we assume a linear mapping from the 2D coordinate (x, y) to the human height $L_h$: $c_1 x + c_2 y + c_3 = L_h$, where $c_1$, $c_2$ and $c_3$ are unknown parameters that can be estimated through a RANSAC style algorithm as in [33]. During detection, the sampling in 2D space is a scanning window process restrained by the linear mapping as shown in Fig. 3(d) (top). Please refer to [33,10] for details.
Layout models can easily be marked out for stationary scenes, such as in Fig. 3(c). We assume that humans stand on the ground plane in the layout. After integrating these two models with the linear mapping or the camera model mentioned earlier, we obtain the sampled searching points and the corresponding human heights in the scene, as illustrated in Fig. 3(d).

4.2. Foreground aware pruning (FAP)

This step prunes negative and false positive samples by foregrounds as shown in Fig. 4(a). We treat this pruning problem as a 2-class classification problem on binary images, and design a simple discriminative learning algorithm under the boosting framework [1]. The aim is to mine some features to learn a fast and effective pruning detector.
Our features are based on the zero-order moment of a region $RG$, $M(RG) = \sum_{(x,y) \in RG} I^B(x, y)$, in a binary image $I^B$. Each feature $r$ is a sub-region of $I^B$, shown in black in Fig. 4(c). The feature value is calculated as

$$f(r, I^B) = \frac{M(r) - M(I^B \setminus r)}{|I^B|} \quad (1)$$

where $|I^B|$ is the total number of pixels in $I^B$. We restrict $r$ to be a rectangle, and hence Eq. (1) can be calculated efficiently through an integral image without generating image pyramids as in [1].

Fig. 4. Foreground pruning. (a) Typical pruned negative and false positive examples: a negative, and false positives of the left-body, head-shoulder, upper-body, right-body and whole-body detectors. (b) Whole body positive masks, from which the positive masks of the other parts can be generated. (c) The five features used.

Positive samples for the pruning can be obtained by manual labeling as shown in Fig. 4(b). However, collecting negative samples is impractical for two reasons. One is that negative samples can take any form, which makes manual labeling too time consuming. The other is that when the pruning detector is applied, the negative samples themselves are always inaccurate because of noise in background modeling, so parts of real objects may be missing from the foreground and some background may be included in objects.


In fact, negative samples are not necessary because 1) a small number of negative samples may cause overfitting, and 2) a large number of negative samples might make the pruning detectors very complex and thus inefficient at pruning negative and false positive samples. Motivated by the above, the pruning classifiers are learned from positive samples only. The classifier on feature $r$ is determined as

$$h_r(I^B) = \begin{cases} 1, & f(r, I^B) - T_r > 0 \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where $T_r = \min_{x_i^B} f(r, x_i^B) - \epsilon$, $\epsilon$ is a small positive number ($10^{-2}$), and $x_i^B$ is a positive sample. To account for the inaccuracy of background modeling, positive samples are perturbed by moving them 3 pixels left or right, or 2 pixels up or down.
This pruning should be fast and effective. Instead of automatically selecting good features from a large feature pool as in [1], we simply design several features as shown in Fig. 4(c). All classifiers learned on these features are combined into one strong detector, whose order is not constrained. A search window is then considered only if its corresponding foreground passes this strong detector. For an n × m image, the pre-processing of an integral image costs O(nm) time and space. Our feature can then be calculated in O(1) time, and thus a bunch of classifiers costs approximately constant time. Its effectiveness is evaluated in the experiments.
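As a minimal sketch of how the pruning feature and the positive-only threshold rule can be computed (our own Python/NumPy illustration; the exact form of Eq. (1) is reconstructed from the text, and all function names are hypothetical):

    import numpy as np

    def integral_image(binary):
        """Summed-area table with a zero row/column prepended."""
        return np.pad(binary.astype(np.int64), ((1, 0), (1, 0))).cumsum(0).cumsum(1)

    def region_sum(ii, x0, y0, x1, y1):
        """Sum of pixels in the rectangle [y0:y1, x0:x1) in O(1)."""
        return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

    def feature_value(binary, rect):
        """Eq. (1): (M(r) - M(rest of I^B)) / |I^B| for a rectangular sub-region r."""
        ii = integral_image(binary)
        h, w = binary.shape
        x0, y0, x1, y1 = rect
        m_r = region_sum(ii, x0, y0, x1, y1)
        m_rest = ii[h, w] - m_r
        return (m_r - m_rest) / float(h * w)

    def learn_threshold(positive_masks, rect, eps=1e-2):
        """Positive-only training, as in Eq. (2): T_r = min_i f(r, x_i) - eps."""
        return min(feature_value(m, rect) for m in positive_masks) - eps

    def passes(binary, rect, t_r):
        """A window is kept only if its foreground passes the pruning classifier."""
        return feature_value(binary, rect) - t_r > 0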
4.3. Structural filter approach

The detection is based on our previous work [4,34]. We proposed to learn an Integral Structural Filter (ISF) detector in [4] to detect humans with occlusions and articulated poses using a feature sharing mechanism. We build a three-level hierarchical model for humans consisting of words, sentences and paragraphs, where words are the most basic units, sentences are meaningful sub-structures, and paragraphs are the appearance statuses (e.g., head-shoulder, upper-body, left-part, right-part and whole-body in occluded scenes). An example is shown in Fig. 5. We integrate the detectors for the three levels through inference from word to sentence, from sentence to paragraph and from word to paragraph. All detectors for structures (words, sentences and paragraphs) are based on the Real AdaBoost algorithm and Associated Pairing Comparison Features (APCFs) [34]. APCF describes the invariance of color and gradient of an object to some extent, and it contains two essential elements, Pairing Comparison of Color (PCC) and Pairing Comparison of Gradient (PCG). A PCC (or PCG) is a Boolean color (or gradient) comparison of two granules, where a granule is a square window patch. Please refer to [4,34] for more details.
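As a rough illustration of the pairing comparison idea behind APCF (a simplified sketch of our own; [34] defines the actual features, and the mean-of-granule comparison and margin below are our assumptions, not the authors' implementation):

    import numpy as np

    def granule_mean(img, cx, cy, size):
        """Mean value of a square granule centered at (cx, cy)."""
        half = size // 2
        patch = img[cy - half:cy + half + 1, cx - half:cx + half + 1]
        return patch.mean()

    def pcc(channel, granule_a, granule_b, margin=0.0):
        """Pairing Comparison of Color: True if granule A is brighter than
        granule B (by `margin`) in the given color channel.
        Each granule is a (cx, cy, size) triple."""
        return granule_mean(channel, *granule_a) - granule_mean(channel, *granule_b) > margin

    # A PCG would apply the same Boolean comparison to a gradient-magnitude
    # map instead of a color channel.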

4.4. Foreground aware merging (FAM)

We now discuss the merging strategy applied after obtaining all detection results. Different from previous approaches (e.g. [2,3]) which stick to the detection results, we integrate foreground information into the post-processing. We consider objects one by one after extending them to the whole body, through adding and deleting operations defined on the visible and invisible parts of objects. To reduce the computational complexity, the two operations are based on the blocks defined in Section 3.
A hypothesis $h$ is a detected response. We denote the block set and foreground block set of $h$ by $B_h$ and $F_h$ respectively. For a hypothesis set $H$, we have $B_H = \bigcup_{h \in H} B_h$ and $F_H = \bigcup_{h \in H} F_h$ correspondingly. The score of adding $h$ to $H$ is defined as

$$sc_{add}(h) = \begin{cases} \dfrac{|F_{H \cup \{h\}} - F_H|}{|B_{H \cup \{h\}} - B_H|}, & |F_{H \cup \{h\}} - F_H| > T_M \cdot |F_h|, \ h \notin H \\ 0, & \text{otherwise.} \end{cases} \quad (3)$$

$h$ can be added if $sc_{add}(h) > T_{add}$. $T_M$ is a threshold. The score of deleting $h$ from $H$ is defined as

$$sc_{del}(h) = \begin{cases} \dfrac{|F_H - F_{H \setminus \{h\}}|}{|B_H - B_{H \setminus \{h\}}|}, & |F_H - F_{H \setminus \{h\}}| > T_M \cdot |F_h|, \ h \in H \\ 0, & \text{otherwise.} \end{cases} \quad (4)$$

$h$ can be deleted if $sc_{del}(h) < T_{del}$. $T_{add}$ and $T_{del}$ are empirical parameters: the smaller $T_{add}$, the more objects are added, and the larger $T_{del}$, the more objects are deleted. In the implementation, we adopt a greedy strategy that first uses the adding operation to find possible hypotheses and then uses the deleting operation to remove bad ones. Although the strategy is very simple, it yields promising detection results in the experiments.
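A hedged sketch of this greedy add-then-delete strategy follows (Python; the set-valued helpers `blocks` and `fg_blocks`, the ordering of candidates by foreground size, and all names are our own assumptions rather than the authors' implementation):

    def sc_add(h, H, blocks, fg_blocks, t_m):
        """Score of adding hypothesis h to set H (Eq. (3)); `blocks` and
        `fg_blocks` map a hypothesis set to its block / foreground-block sets."""
        new_fg = fg_blocks(H | {h}) - fg_blocks(H)
        new_bl = blocks(H | {h}) - blocks(H)
        if len(new_fg) > t_m * len(fg_blocks({h})) and new_bl:
            return len(new_fg) / len(new_bl)
        return 0.0

    def greedy_merge(hypotheses, blocks, fg_blocks, t_m, t_add, t_del):
        """First add promising hypotheses, then delete weak ones (Eqs. (3)-(4))."""
        kept = set()
        for h in sorted(hypotheses, key=lambda h: -len(fg_blocks({h}))):
            if sc_add(h, kept, blocks, fg_blocks, t_m) > t_add:
                kept.add(h)
        for h in list(kept):
            rest = kept - {h}
            lost_fg = fg_blocks(kept) - fg_blocks(rest)
            lost_bl = blocks(kept) - blocks(rest)
            if lost_bl and len(lost_fg) > t_m * len(fg_blocks({h})):
                score = len(lost_fg) / len(lost_bl)
            else:
                score = 0.0
            if score < t_del:
                kept.discard(h)
        return kept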
5. Block Assignment Tracking

The previous section mainly discussed accurately locating objects in scenes with occlusions and viewpoint changes. In this section, we concentrate on robustly tracking them. In the following, we first derive the formulation of our block assignment tracking problem, and then present our solution.

Fig. 5. The hierarchical structure of pedestrian [4].


5.1. Problem formulation

Denoting the object state sequence from frame 1 to frame T as $S_{1:T} = \{S_1, \ldots, S_T\}$ and the corresponding observation sequence collected from the frame data as $O_{1:T} = \{O_1, \ldots, O_T\}$, a tracking problem can be formulated as solving the following MAP (maximum a posteriori) problem

$$S_t^{\ast} = \arg\max_{S_t} p(S_t \mid O_{1:t}). \quad (5)$$

Generally, an object state can be modeled as the location and size of the object on the ensemble level as in [7], or as a set of blocks forming the appearance as in [5]. Tracking on the ensemble level is efficient when objects are isolated. However, it tends to have errors when objects interact with each other, since ensemble observations can be ambiguous or missing because of occlusions. When objects are well initialized, tracking on the block level is effective even under heavy occlusions, since it mainly considers the block persistence in the spatial and temporal domains. But it cannot guarantee that a segmented region looks like an object part; in fact, the region might contain no object or several objects. Moreover, it has no explicit correcting mechanism to rectify errors that arise during initialization and tracking. In order to combine their merits and get rid of their restrictions, we propose to model object states on both the ensemble and block levels as $S_t = \{Z_t, V_t\}$, where $Z_t = \{z_{t,k}\}_{k=1}^{K}$ is the ensemble level state of all K objects and $V_t = \{v_{t,i}\}_{i=1}^{N}$ is the block level state of all N blocks. $v_{t,i}$ is the label of block $b_{t,i}$, indicating that $b_{t,i}$ belongs to object $z_{t,v_{t,i}}$ if $v_{t,i} \geq 0$, or to the background if $v_{t,i} < 0$. All blocks with the same label form the appearance of an object, while the ensembles describe the coarse shapes of objects and cover some blocks assigned to them, as illustrated in Fig. 6. Therefore, we modify Eq. (5) and formulate our problem as

$$(Z_t^{\ast}, V_t^{\ast}) = \arg\max_{Z_t, V_t} p(Z_t, V_t \mid O_{1:t}, V_{t-1}). \quad (6)$$

Compared to Eq. (5), $V_{t-1}$ on the right side of Eq. (6) takes the previous assignment into account. However, the optimization of Eq. (6) is not tractable because $V_{1:t}$ and $Z_t$ are closely intertwined at time t. The inference between $V_t$ and $V_{t-1}$ should preserve the spatial and temporal persistence of block assignments. Meanwhile, $Z_t$ encourages blocks with the same label in $V_t$ to look like an object. Moreover, $V_{1:t-1}$ can provide robust appearance and motion models of objects for inferring $Z_t$. To make the optimization tractable, we propose to split Eq. (6) into three steps. The first step obtains an intermediate assignment $\tilde{V}_t$ through inference on the block level between two sequential frames, ignoring $Z_t$,

$$\tilde{V}_t = \arg\max_{\tilde{V}_t} p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}). \quad (7)$$

This step preserves the persistence of block assignments in the spatial and temporal domains. The second step then focuses on inferring $Z_t$ with the aid of robust appearance and motion models of objects estimated from $V_{1:t-1}$,

$$Z_t^{\ast} = \arg\max_{Z_t} p(Z_t \mid O_{1:t}). \quad (8)$$

Afterwards, the third step achieves the final assignment by combining the previous results $\tilde{V}_t$ and $Z_t^{\ast}$,

$$V_t^{\ast} = \arg\max_{V_t} p(V_t \mid Z_t^{\ast}, O_t, \tilde{V}_t). \quad (9)$$

The third step builds on the other two: it makes blocks with the same label look like a part of some object and can potentially rectify errors made during initialization and tracking. After integrating these three steps into Eq. (6), we obtain

$$p(Z_t, V_t \mid O_{1:t}, V_{t-1}) \propto p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}) \, p(Z_t \mid O_{1:t}) \, p(V_t \mid Z_t, O_t, \tilde{V}_t). \quad (10)$$

Therefore, Eq. (10) can be efficiently solved by the max-product algorithm. These three steps are further explained in the next section. Since the last step assigns blocks at every frame, we term the whole approach Block Assignment Tracking. Compared to [5,7], our formulation provides a simple way to integrate block and ensemble level information.

Fig. 6. Tracking problem formulation. Left: original image. Middle: foreground block image. Right: an assignment where blocks in the same color (label) form the appearance of one object and the quadrangles indicate the coarse shapes of objects.

5.2. Solution

In this subsection, we present the details of the three steps of Eq. (10), which are Block Tracking, Ensemble Tracking and Ensemble to Block Assignment respectively. At the end, we give a summary of our tracking algorithm.
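The per-frame structure implied by this three-step factorization can be sketched as follows (a skeleton only, in Python; the three sub-routines are placeholders for Sections 5.2.1–5.2.3 and are passed in as callables, since their details follow):

    def bat_step(frame, fg_blocks, prev_assignment, objects,
                 block_tracking, ensemble_tracking, ensemble_to_block_assignment):
        """One BAT frame following the factorization of Eq. (10)."""
        # Step 1 (Eq. (7)): propagate block labels from the previous frame.
        intermediate = block_tracking(fg_blocks, prev_assignment, frame)
        # Step 2 (Eq. (8)): estimate ensemble states, e.g. with the particle filter.
        new_objects = ensemble_tracking(objects, prev_assignment, frame)
        # Step 3 (Eq. (9)): re-assign blocks to objects (graph cuts in the paper).
        assignment = ensemble_to_block_assignment(intermediate, new_objects, frame)
        return assignment, new_objects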

5.2.1. Block Tracking


This step predicts an intermediate result by taking advantage of label, color and shape constraints. Inspired by the similar problem in [5], we define

$$-\ln p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}) = \sum_{k=0}^{K} \sum_{i=1}^{N} \alpha_i(b_{t,i}, z_{t,k} \mid V_{t-1}, O_{t-1:t}) \, \delta(v_{t,i}, k) + \sum_{i=1}^{N} \beta_i(b_{t,i}, b_{t,j_1}, \ldots, b_{t,j_l} \mid V_{t-1}, O_{t-1:t}) \quad (11)$$

where $\alpha_i$ is the penalty if $b_{t,i}$ is assigned to $z_{t,k}$, $\beta_i$ is the penalty when $b_{t,i}$ and its neighbors are assigned to different objects, $\delta(i, j)$ is the Kronecker function, equaling 1 if $i = j$ and 0 otherwise, $l = |N_{b_{t,i}}|$, and $N_{b_{t,i}}$ are the 8-neighboring blocks of $b_{t,i}$. The observations here are actually image sequences, and object states are updated straightforwardly from their previous states as $z_{t,k} = z_{t-1,k} + r_{z_{t,k}}$ by their motions $r_{z_{t,k}} = (r_{z_{t,k}}^x, r_{z_{t,k}}^y)$. The motion of an object is represented by the most frequent motion among all its blocks, where the motion of one block can be obtained by block matching. We now give the definitions of $\alpha_i$ and $\beta_i$.

$$\alpha_i(b_{t,i}, z_{t,k} \mid V_{t-1}, O_{t-1:t}) = a \, DL_{t,i,k} + b \, DM_{t,i,k} + c \, DA_{t,i,k}. \quad (12)$$

$DL_{t,i,k}$ is a rough shape constraint that restricts the spread of block labels. As object shapes are quadrangles, we need to eliminate the effects of scale along the axes and of rotation in the 2D plane. Our idea is to make use of a normalization matrix $\tilde{x} = \begin{pmatrix} 1/W & 0 \\ 0 & 1/H \end{pmatrix} \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$, where $[W, H]^T$ is the minimum detection size and $\theta$ is the angle between the object and the vertical. Let the centers of $b_{t,i}$ and $z_{t,k}$ be $x_{t,i} = (x_{t,i}, y_{t,i})^T$ and $x_{z_{t,k}} = (x_{z_{t,k}}, y_{z_{t,k}})^T$ respectively. We define $DL_{t,i,k} = \exp(-\|\tilde{x}(x_{t,i} - x_{z_{t,k}})\|^2)$.
$DM_{t,i,k}$ is a temporal constraint on label consistency, defined as $DM_{t,i,k} = (M_{t,i,k} - M_i)^2$, where $M_{t,i,k}$ is the number of pixels in the overlapping area between $z_{t-1,k}$ and the region obtained by moving $b_{t,i}$ by $r_{z_{t,k}}$, and $M_i$ is the total number of pixels in a block.
$DA_{t,i,k}$ is a color constraint, which measures temporal color coherence. Letting $I_t$ be the gray scale frame at time t, we define

$$DA_{t,i,k} = \sum_{0 \le dx < 8} \sum_{0 \le dy < 8} \left| I_t(x + dx, y + dy) - I_{t-1}(x + dx - r_{z_{t,k}}^x, y + dy - r_{z_{t,k}}^y) \right|. \quad (13)$$

$\beta_i$ is the spatial constraint on label consistency,

$$\beta_i(b_{t,i}, b_{t,j_1}, \ldots, b_{t,j_l} \mid V_{t-1}, O_{t-1:t}) = d \sum_{k=1}^{K} (N_{i,k} - N_k)^2 + g \sum_{n=1}^{l} \|r_{t,i} - r_{t,j_n}\|^2. \quad (14)$$

Similar to [5], we set a = 1, b = 1, c = 0.125, d = 0.00000025 and g = 0.5, and adopt the Gibbs sampler algorithm [35] to solve Eq. (11). Please refer to [5,35] for more details.

5.2.2. Ensemble Tracking
This step estimates object locations accurately on the ensemble level and offers the potential to amend possible errors in initialization and tracking, as discussed earlier. Such errors are not notable over a short time (NE frames for simplicity), but are magnified greatly as time passes. For the former situation, object states updated by their motions are adequate. For the latter situation, we refer to the update step of a sequential Bayesian estimation problem:

$$p(Z_t \mid O_{1:t}) \propto L(O_t \mid Z_t) \, p(Z_t \mid O_{1:t-1}) \quad (15)$$

in which $p(Z_t \mid O_{1:t-1})$ is the prediction step

$$p(Z_t \mid O_{1:t-1}) = \int D(Z_t \mid Z_{t-1}) \, p(Z_{t-1} \mid O_{1:t-1}) \, dZ_{t-1} \quad (16)$$

where $L(O_t \mid Z_t)$ is the observation likelihood and $D(Z_t \mid Z_{t-1})$ is the dynamic model of the system, which is modeled as a first order Gaussian by considering object motions.
In order to approximate the filtering distribution, the Particle Filter (PF) approach [23] uses a set of weighted particles. Its direct extension to multiple object tracking models objects as unrelated. However, this may cause ID switches when tracking adjacent objects, because it is ambiguous which object an observation should be assigned to. Differently, we do not distinguish particles generated from different objects. Fig. 7 compares the two strategies. Formally, we extend [23] by

$$p(Z_t \mid O_{1:t}) \approx \sum_{n=1}^{N_p} \omega_{t,k}^n \, \delta_{z_t^n}(z_t) \quad (17)$$

in which $N_p$ is the total number of particles and $\delta_z(\cdot)$ denotes the delta-Dirac function at position z. The nth particle is denoted as $p_n = (x_t^n, s_t^n, H_t^n, \{\omega_{t,k}^n\}_1^K)$, where $x_t^n = (x_t^n, y_t^n)$ is the location, $s_t^n$ is the scale, $H_t^n$ is the appearance model, and $\omega_{t,k}^n$ is the weight for $z_{t,k}$. Motivated by the successes of [7,16], we define

$$\omega_{t,k}^n = \begin{cases} \lambda \, \omega_{t,k}^{n,D} + (1 - \lambda) \, \omega_{t,k}^{n,G}, & \|x_t^n - x_{z_{t,k}}\| < \tau \\ 0, & \text{otherwise} \end{cases} \quad (18)$$

where $\omega_{t,k}^{n,D}$ is a discriminative weight modeled using the detector confidence and $\omega_{t,k}^{n,G}$ is an appearance weight measured from an online learned appearance model. $\lambda$ is a parameter (= 0.5 here) and $\tau$ is a distance threshold. The appearance models of particles and objects are built from the pixels in foreground blocks. We use the HSV color space and set the number of bins for each channel to 16.
Objects may still get lost during tracking. If an object cannot gather enough support from particles (i.e., its total particle weight falls below a threshold), it is declared lost and buffered for possible matching with newly detected objects. We perform object detection (SAD) every NF frames to find new objects. If a lost object cannot be matched within TW frames, it is discarded.

5.2.3. Ensemble to Block Assignment


This step achieves the final result from the intermediate assignment $\tilde{V}_t$ and the estimated object states $Z_t$. This is a multi-label problem, which can easily be converted into a series of 2-label problems by adding objects one by one, each solved by graph cut algorithms. Suppose the object map $V_t$ has been obtained after adding objects $z_{t,1}, \ldots, z_{t,k-1}$ and object $z_{t,k}$ is to be added. The target is then to minimize the following energy function each time
$$E_k = \sum_{i=1}^{N} \varphi_i(b_{t,i}, z_{t,k}) + \sum_{i=1}^{N} \sum_{b_{t,j} \in N_{b_{t,i}}} \varphi_{i,j}(b_{t,i}, b_{t,j}). \quad (19)$$

Fig. 7. Comparisons of sampling strategies. (a) shows a scene with six persons. (b) PF [23] models objects as unrelated. (c) In our strategy, particles from different objects are not distinguished, but those far away from the concerned object are ignored (e.g., only particles from objects C, D and E contribute to D).


The unary term $\varphi_i$ encodes the data likelihood, which imposes penalties for assigning block $b_{t,i}$ to object $z_{t,k}$. We consider the shape model and the prior knowledge in

$$\varphi_i(b_{t,i}, z_{t,k}) = \kappa(x_{t,i}, x_{z_{t,k}}) \, \eta^{-n_i} \, \left(1 - \delta(\tilde{v}_{t,i}, k)\right) \quad (20)$$

where $\kappa(\cdot,\cdot)$ is a kernel function defined as $\kappa(x_{t,i}, x_{z_{t,k}}) = DL_{t,i,k}$ and $\eta$ is an occlusion factor. Let $n_i$ be the number of objects that occlude $z_{t,k}$ in block $b_{t,i}$, where an object is occluded by another if they overlap and its y-axis value is larger. Intuitively, the larger $n_i$, the lower $\varphi_i$. We have $\eta \geq 1$ (set to 1.25 in our experiments).
The pairwise term $\varphi_{i,j}$ encourages spatial coherence and imposes penalties when $b_{t,i}$ and $b_{t,j}$ are assigned different labels. As a sub-modular energy function can be solved by graph cut algorithms, we adopt a Potts-like model for simplicity

$$\varphi_{i,j} = \begin{cases} \lambda \exp\left(-\dfrac{\chi^2(A_{t,i}, A_{t,j})}{\sigma_A}\right) + (1 - \lambda) \exp\left(-\dfrac{\|r_{t,i} - r_{t,j}\|^2}{\sigma_r}\right), & v_{t,i} \neq v_{t,j} \\ 0, & \text{otherwise} \end{cases} \quad (21)$$

where $A_{t,l}$ and $r_{t,l}$ are the appearance and motion of $b_{t,l}$ ($l = i, j$), $\lambda$ is a parameter (= 0.5 here), and $\sigma_A$ and $\sigma_r$ are normalization factors. Here the appearance of a block is modeled as a 4-bin histogram of the gray image. $\sigma_A$ is set to the number of pixels in a block (= 64 here). Supposing the maximum motion of a block is the block size (8 × 8), we set $\sigma_r = 8^2 + 8^2 = 128$.
After achieving the final assignment, we then update the appearance models of the objects. Intuitively speaking, if an object is occluded by others, meaning that some of its overlapped foreground blocks are not assigned to it, the update ratio should be small: the more occlusion, the smaller the update ratio. Based on this, we define the update ratio as $\rho = 0.5 \, N_k / N_a$, where $N_k$ is the number of blocks assigned to $z_{t,k}$ and $N_a$ is the total number of blocks overlapped by $z_{t,k}$. Given the previous and current appearance models of $z_{t,k}$, $A_p$ and $A_c$, the update is $A = (1 - \rho) A_p + \rho A_c$.
We have now described the three key components of the BAT. For easy reference, the entire procedure of the BAT is summarized in Fig. 8.
6. Experiments

In this section, we carry out extensive experiments to evaluate our proposed detection and tracking system. We first describe the training and testing datasets, then list the detection and tracking metrics used for evaluation, then evaluate the performance of our system, and finally give some discussion.

6.1. Datasets

We have labeled 2470 positive masks of size 24 × 58, as shown in Fig. 4(b), for training the foreground pruning detector. We have also collected 18,474 whole body positive samples of size 24 × 58 for learning the object detectors, as shown in Fig. 9. The positive masks and samples of the other parts can be generated from those of the whole body using the definitions in Fig. 5.
We use a large variety of challenging test datasets with different situations of occlusions and viewpoints for evaluation, as summarized in Fig. 10. The occlusions and viewpoint changes in these real world datasets make them valuable for evaluating detection and tracking systems.

Fig. 8. The algorithm of our system.


Fig. 9. Positive samples for the whole body.

Fig. 10. Test datasets. The CAVIAR dataset can be downloaded from http://homepages.inf.ed.ac.uk/rbf/CAVIAR/. PETS2007 can be downloaded from http://www.cvg.rdg.ac.uk/PETS2007/. Humans in CAVIAR2 are too small, and we therefore double the original video size (384 × 288).

As the viewpoint in CAVIAR1 is frontal, the learned detectors can be applied directly. But since the viewpoints in CAVIAR2, PETS2007 and our dataset are tilted, we utilize camera models to cope with them.
In our experiments, we aim at improving both detection and tracking performance with off-line discriminative models. Therefore, the test datasets are completely independent of the training set, and we apply the generally trained detectors to all test sequences without retraining them for a specific scene.
6.2. Metrics

We use False Positives Per Image (FPPI) for detection evaluation. When the intersection between a detection response and a ground-truth box is larger than 50% of their union, we consider it a successful detection. Only one detection per annotation is counted as correct.
For multi-object tracking, there is no single established protocol. We follow two existing sets of metrics. The metrics of [36] count the numbers of mostly tracked (MT), partially tracked (PT) and mostly lost (ML) trajectories as well as the numbers of track fragmentations (FM) and identity switches (IDS). The CLEAR metrics [37] calculate the Multiple Object Tracking Accuracy (MOTA), which takes into account false positives, missed targets and identity switches, and the Multiple Object Tracking Precision (MOTP), which measures the precision with which objects are located using the intersection of the estimated region with the ground truth region.
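For reference, the CLEAR scores can be computed from per-frame counts as sketched below (our own illustration following the usual definitions of [37]; the per-frame bookkeeping is assumed to be available):

    def clear_mot(frames):
        """MOTA / MOTP from per-frame counts. Each frame entry provides
        (false_positives, misses, id_switches, num_gt, sum_overlap, num_matches)."""
        fp = sum(f[0] for f in frames)
        fn = sum(f[1] for f in frames)
        ids = sum(f[2] for f in frames)
        gt = sum(f[3] for f in frames)
        overlap = sum(f[4] for f in frames)
        matches = sum(f[5] for f in frames)
        mota = 1.0 - (fp + fn + ids) / float(gt)
        motp = overlap / float(matches) if matches else 0.0
        return mota, motp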
6.3. Performance evaluations

6.3.1. Detection evaluations
In this subsection, we concentrate on evaluating the performance of the key components of our SAD: foreground aware pruning (FAP), the structural filter approach (ISF) and foreground aware merging (FAM). Since the number of available frames in the test datasets is quite large, we select 200 representative frames from each test dataset for evaluation.

6.3.1.1. Efficiency of FAP. The aim of FAP is to efficiently prune negatives and false positives. Table 1 shows the pruned window proportions and the saved time on these datasets with the default detection parameters. From Table 1, we can see that about 79%–94.4% of windows are pruned, which yields a substantial time saving (0.29 s–4.6 s). Since there are many people in our dataset, its pruned proportion is lower than those of the other datasets. Compared to CAVIAR1, the other three datasets need to rectify sub-images, and thus they cost much more time than CAVIAR1. However, as there are only a few (<4) persons in CAVIAR2, its cost is not as large as for S02 and our dataset. This experiment sufficiently demonstrates the efficiency of FAP.
6.3.1.2. Efficiency of SAD. We choose two state-of-the-art works [12,38] for comparison with our SAD. ACF [38] has achieved good performance for pedestrian detection and is a strong competitor for frontal viewpoint detection. ACF is learned on the same training dataset as our ISF for a fair comparison. Since there are no publicly available detectors for multiple viewpoints of humans, we use the Deformable Part Model (DPM) [12] as a baseline, which is well known for detecting objects with large variations. The original DPM detector is provided by the authors and trained on Pascal VOC 2008. For a fair comparison, we also train a new DPM detector on the same training dataset as our ISF. To distinguish them, we denote them as DPM1 and DPM2 respectively.
In the following, for concise descriptions, we let MAP be the Bayesian method in [2] and NAIVE be the simplest strategy of combining detection results at nearby locations.
Table 1
Evaluations of foreground aware pruning. NH is the average number of humans. PPW is the average proportion of pruned windows among all scanned windows. T is the time cost without foreground aware pruning and t is the time saved when using it.

           CAVIAR1   S02      Our dataset   CAVIAR2
NH         6         9        11            4
PPW        94.4%     90%      79%           86%
t (ms)     700       4600     1200          290
T (ms)     1210      10,400   7560          650


Fig. 11. Evaluation of our SAD compared to DPM [12] and ACF [38]. (a), (b), (c) and (d) compare our approach with several state-of-the-art works on CAVIAR1, S02, our dataset and CAVIAR2 respectively. (e) and (f) zoom in on our approach on S02 and our dataset respectively to show more detail.

Note that except for CAVIAR1, the other three test datasets need rectification with camera models. The methods using camera models are indicated by CAM. The ROC curves are shown in Fig. 11.
6.3.1.2.1. Improvements of FAP. Compared to ISF + NAIVE, ISF + NAIVE + FAP improves the detection rate by about 3% on CAVIAR1. Compared to ISF + NAIVE + CAM, ISF + NAIVE + CAM + FAP improves the performance by about 4% on S02, 4% on our dataset and 1% on CAVIAR2. Similar performance improvements are achieved by ISF + MAP + CAM + FAP. From the experiments in Fig. 11 and Table 1, we can see that FAP not only works well for pruning but also improves the detection performance.
6.3.1.2.2. Improvements of ISF and scene models. ISF + MAP performs better than or comparably to ACF + MAP on CAVIAR1, S02 and our dataset, demonstrating that ISF can detect occluded humans in scenes without large viewpoint changes. ISF (ISF + MAP and ISF + NAIVE) is also better than DPM (DPM1 and DPM2), for which there might be two main reasons: 1) the ability of the deformable part based model is limited on strongly labeled samples like our training dataset, and 2) the weak feature in DPM uses only gradient information, while the weak feature in ISF combines both color and gradient information, which is more discriminative for pedestrian detection. Note that, as DPM2 is more focused on pedestrians than DPM1, it performs better than DPM1 on S02 and comparably to DPM1 on CAVIAR1 and our dataset. But all these detectors fail on CAVIAR2 because of the large viewpoint changes, which can be better handled by camera models. Compared with ISF + NAIVE, ISF + NAIVE + CAM improves the performance by about 3% on S02 and 26% on our dataset, and it works well on CAVIAR2. Similar improvements are achieved by ISF + MAP + CAM. As the viewpoint of CAVIAR1 is frontal, the linear mapping from 2D coordinates to human height is used there. In the experiment, we find that the linear mapping does not reduce the detection performance, while it speeds up detection by about 0.6 s on average compared to using ISF alone.
6.3.1.2.3. Improvements of FAM. We replace the post-processing method by FAM to show further performance improvements. Compared to ISF + MAP + FAP, our approach (ISF + FAM + FAP) improves the detection rate by about 11% on CAVIAR1. Compared to ISF + MAP + FAP + CAM, our approach (ISF + FAM + FAP + CAM) improves the detection rate by about 16% on our dataset and 14% on S02. As MAP adds objects in y-descending order, which does not hold in scenes with large viewpoint changes, it does not work well on CAVIAR2 and is sometimes even worse than NAIVE. By contrast, our approach still works well in such scenes and achieves a 52% detection rate at FPPI = 0.1 on CAVIAR2. We also observe another interesting phenomenon: the curves of our approach are much steeper than the others, indicating that we can detect more objects with fewer false samples. This is mainly due to the pruned false positive samples and the scene models used. We zoom in on the curves of Fig. 11(b) and (c) to show more detail in Fig. 11(e) and (f) respectively.
6.3.1.2.4. Summary. These experiments have shown the effectiveness of the key components (FAP, ISF and FAM) of our SAD in occluded and viewpoint-changed scenes. Therefore, our SAD as a whole outperforms many state-of-the-art detection algorithms such as [12,38]. But the speed is not satisfactory. The detection costs on average about 0.51 s, 5.8 s, 6.36 s and 0.36 s on CAVIAR1, our dataset, S02 and CAVIAR2 respectively. Because of the changed viewpoints and heavy occlusions, it costs much more time on our dataset and S02 than on CAVIAR1 and CAVIAR2. For further speedup and performance improvements, we recommend our proposed BAT, which is evaluated in the next subsection.
6.3.2. Tracking evaluations
In this section, we report the tracking performance of our BAT on all test datasets based on the SAD results, without retraining the detectors for specific scenes. For concise descriptions, we denote our BAT with and without camera models as BAT + 3D and BAT + 2D respectively.

6.3.2.1. Algorithms for comparisons. We compare our approach with some state-of-the-art tracking algorithms [26,24,15,7,36,27]. We utilize the publicly available implementation of [27] (http://www.ics.uci.edu/~dramanan/) to carry out experiments by ourselves.


Fig. 12. Quantitative results on CAVIAR1. *The fragment and IDS numbers reported in [26,7] are obtained with looser evaluation metrics.

In this implementation, the authors do not use appearance after detecting objects; therefore it obtains relatively more fragments and ID switches as well as missed detections. We improve its performance by (1) utilizing background modeling to remove false positive samples, (2) building appearance models for detected objects to associate them, and (3) adjusting some parameters to achieve better tracking results. After this improvement, it can track more humans, but there are still too many fragments and ID switches. Thus, we only use it for comparisons on the following metrics: MT, PT, ML and MOTP.
Besides these state-of-the-art algorithms, we also use two simplified versions of our BAT as baselines to demonstrate the improvement from combining both block and ensemble information. One baseline only uses Ensemble Tracking, shortened as BAT(ET). BAT(ET) + 2D, where camera models are not used in detection, is similar to [7]. BAT(ET) + 3D, where camera models are used in detection, is a better way to show the improvement of BAT + 3D over ensemble-only information. The other baseline only uses Block Tracking, shortened as BAT(BT). Objects can be well initialized in CAVIAR2 because there is little occlusion, but not in CAVIAR1, our dataset and S02 because of severe and frequent occlusions. Therefore, BAT(BT) + 3D is a fair comparison with BAT + 3D only on CAVIAR2, where camera models are used in detection.
6.3.2.2. Quantitative results. The obtained results are shown in Figs. 12 and 13.
6.3.2.2.1. CAVIAR1. We compare our BAT with [26,24,15,7,36] in Fig. 12. Among them, our method achieves the highest MT. Our FM and IDS are a little higher than those of [36], mainly because we handle sequences online whereas [36] uses all detection results to obtain a global optimization. The MOTA and MOTP of our approach are both better than those of [7], showing the benefit of combining block and ensemble information. In general, CAVIAR1 is relatively easy for many tracking systems; the further test datasets are more challenging.
6.3.2.2.2. Our dataset and S02. We compare our approach with [7,27] on these two datasets in Fig. 13 (top) and (middle). As described in Section 6.3.1, many state-of-the-art detection algorithms do not perform as well as our detection approach in scenes with heavy occlusions and slightly changed viewpoints. The detection processes in [7,27] lose many humans on our dataset and S02, which reduces the tracking performance.
Fig. 13. Quantitative results of our method on our dataset, S02 and CAVIAR2.


In contrast, our SAD can detect many humans, and our BAT generally performs much better and more stably than [7,27]. BAT(ET) + 3D can track more objects than [7,27], but it produces many fragments and ID switches. Compared to BAT(ET) + 3D, BAT + 3D achieves better performance: it obtains higher MT/PT/MOTP/MOTA and lower FM and IDS. This improvement shows that combining block and ensemble information is superior to using ensemble information alone for tracking. Compared to BAT + 2D, the improvement of BAT + 3D mainly lies in the usage of camera models, because of the slightly changed viewpoints. Furthermore, BAT + 3D is always better than BAT + 2D in MT/PT/ML, but not in the other metrics. Part of the reason is that the ground truths are labeled as rectangles, while the humans tracked by BAT + 3D are quadrangles. However, because our dataset is much more crowded than S02, there are still many partially tracked objects.
6.3.2.2.3. CAVIAR2. Because of the extremely large viewpoint changes, the methods that do not use camera models (such as [7,27] and BAT + 2D) fail completely on this dataset. As far as we know, there are no publicly available implementations that handle multiple human tracking in such scenes.

Fig. 14. Tracking results. (a) and (b) compare [7] (top) and our approach (bottom) on OneStopMoveEnter1cor of CAVIAR1 and on S02. (c) and (d) illustrate sample results on our dataset and on Meet_crowd of CAVIAR2 respectively. The layouts of (a) and (b) are already shown in Fig. 3(c). The layouts of (c) and (d) are illustrated at the far left. More descriptions can be found in Section 6.3.2.


Thus, we compare our BAT + 3D with BAT(BT) + 3D and BAT(ET) + 3D in Fig. 13 (bottom). Compared to BAT(ET) + 3D, BAT(BT) + 3D achieves higher MT, MOTA and MOTP, but more IDS and FM. Our BAT + 3D integrates both of their advantages and achieves better performance: it improves MT by 13.8% and MOTA by 7.2%, and reduces PT by 18.2%, compared with the second best.
6.3.2.2.4. Summary. As described earlier, the application of Block Tracking alone is limited, because it requires good initialization, which is difficult to achieve in occluded scenes. Comparing BAT + 2D with [7] on CAVIAR1, and BAT + 3D with BAT(ET) + 3D on the other three datasets (our dataset, S02 and CAVIAR2), we can conclude that combining block and ensemble information improves tracking performance. From the experiments in Figs. 12 and 13, we can see that our proposed detection and tracking system works robustly in scenes with heavy occlusions and viewpoint changes.
6.3.2.3. Sample results. Fig. 14 shows some tracking results of our tracking algorithm, where the green and red arrows point to some ID switches, the purple dotted ellipses mark some missed or lost targets, and the blue arrow points to a false alarm. Panels (a) and (b) illustrate scenes with targets walking against a crowd. We compare [7] (top) with our approach (bottom). Our method consistently tracks these objects, while [7] experiences several instances of ID switches, target loss and false alarms. Panel (c) features a subway scene with many people walking, where the occlusions are very severe and the viewpoint is slightly changed. Our tracker succeeds in tracking many of them. Panel (d) shows a scene with several walking people where the viewpoint is extremely changed. Our tracker tracks them successfully throughout the sequence.

6.4. Discussions

6.4.1. Parameters
There are some parameters in the SAD and BAT, listed in Fig. 15 with corresponding descriptions and default values. The effects of some key parameters on our framework are as follows. For SAD, the parameters T_M, T_add and T_del directly impact the post-processing of detection. A smaller T_M indicates a higher probability of adding or deleting a detected response each time; a smaller T_add adds more objects and a larger T_del deletes more objects. For BAT, N_O and a temporal consistency parameter are key. A larger N_O (i.e. more particles) can improve the performance, but costs more time. A larger value of the consistency parameter enforces more consistency across the video, which can improve the tracking performance when the detection is not so accurate, especially on CAVIAR2 because of the large viewpoint changes. Therefore, we keep most parameters at their defaults, which are relatively robust in the experiments, except that we set N_O = 300 and the consistency parameter to 5 for CAVIAR2.

Fig. 15. Default parameters.

6.4.2. Processing speeds
The entire system is implemented in a single thread in C++, without special code optimization or GPU processing. On a workstation with an Intel Core(TM)2 2.33 GHz CPU and 2 GB of memory, we achieve processing speeds of 2.7–15 fps (depending on the video size, the number of objects and whether the viewpoint is changed), as shown in Fig. 16 in comparison with detection only. The current bottleneck is the detection stage. As not all speedup possibilities have been explored yet, the current run-time raises hope that online experiments in real world applications are not too far away.

Fig. 16. Speed comparisons of detection and tracking (ms).

6.4.3. Failure cases
Objects are initialized by detection in our system, so detection failures (e.g. missed detections and false alarms) cannot be avoided in the tracking process. If an initialized object is not accurate, such as object 8 in Frame 20 of Fig. 14(c), it drifts easily and tends to be lost. In particular, camera models have a large impact on detection in viewpoint-changing scenes: badly estimated camera parameters lead to unexpected detection results. Besides, our system cannot handle a near-vertical viewpoint where the camera is right above the objects, since it is impossible to recover the objects' frontal viewpoint in this situation, as pointed out in [10].

7. Conclusion


In this paper, we propose a robust system for multi-object detection and tracking in surveillance scenes with occlusions and viewpoint changes. Our SAD achieves robust detection through: (1) camera models to cope with viewpoint changes; (2) a structural filter approach to handle occlusions; and (3) foreground aware pruning and foreground aware merging, with the aid of some scene models. Our BAT, which formulates tracking as a block assignment process, can track objects robustly even when all the part detectors fail, as long as the object has assigned blocks. Its key factors are: (1) Block Tracking to maintain the spatial and temporal consistency of labels; (2) Ensemble Tracking to precisely estimate the locations and sizes of objects; and (3) Ensemble to Block Assignment to make blocks with the same label look like a part of a human.
Although our method tracks remarkably well even through occlusions and viewpoint changes, one unavoidable drawback is fuzzy object boundaries. To overcome this, we could learn and extract discriminative patches to represent and track objects. Another drawback is that the tracking results jitter, which could be amended using estimated object trajectories. For detection improvement, we could use online algorithms to make the offline, general detectors adaptive to a fixed scene. Although the current system only considers humans, the proposed mechanism can easily be extended to other kinds of objects. Based on the detection and tracking results, some high level analysis of object behaviors becomes possible. Furthermore, we hope to make our approach applicable to real world needs.
Acknowledgements
This work is supported in part by the National Science Foundation of China under Grant No. 61075026 and the National Basic Research Program of China under Grant No. 2011CB302203. Mr. Shihong Lao is partially supported by the R&D Program for Implementation of Anti-Crime and Anti-Terrorism Technologies for a Safe and Secure Society, Special Coordination Fund for Promoting Science and Technology of MEXT, the Japanese Government.
Appendix A. Supplementary data
Supplementary data to this article can be found online at doi:10.1016/j.imavis.2012.02.008.
References
[1] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Kauai, HI, USA, 2001, pp. I-511–I-518.
[2] B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in: Proc. IEEE Int. Conf. Comput. Vis., Beijing, China, 2005, pp. 90–97.
[3] C. Huang, R. Nevatia, High performance object detection by collaborative learning of joint ranking of granules features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., San Francisco, California, USA, 2010, pp. 41–48.
[4] G. Duan, H. Ai, S. Lao, A structural filter approach to human detection, in: Proc. Eur. Conf. Comput. Vis., Crete, Greece, 2010, pp. 238–251.
[5] S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi, Traffic monitoring and accident detection at intersections, IEEE Trans. Intell. Transp. Syst. 1 (2000) 108–118.
[6] T. Zhao, R. Nevatia, Tracking multiple humans in complex situations, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1208–1221.
[7] J. Xing, H. Ai, S. Lao, Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 1200–1207.


[8] M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection and people-detection-by-tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
[9] X. Wang, T.X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in: Proc. IEEE Int. Conf. Comput. Vis., Kyoto, Japan, 2009, pp. 32–39.
[10] Y. Li, B. Wu, R. Nevatia, Human detection by searching in 3D space using camera and scene knowledge, in: Proc. IEEE Int. Conf. Image Process., Tampa, Florida, USA, 2008, pp. 1–5.
[11] G. Duan, H. Ai, S. Lao, Human detection in video over large viewpoint changes, in: Proc. IEEE Asi. Conf. Comput. Vis., Queenstown, New Zealand, 2010, pp. 683–696.
[12] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
[13] C. Beleznai, H. Bischof, Fast human detection in crowded scenes by contour integration and local shape estimation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 2246–2253.
[14] D. Hoiem, A.A. Efros, M. Hebert, Putting objects in perspective, Int. J. Comput. Vis. 80 (2008) 3–15.
[15] C. Huang, B. Wu, R. Nevatia, Robust object tracking by hierarchical association of detection responses, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 788–801.
[16] A. Senior, Tracking with probabilistic appearance models, in: ECCV Workshop on Performance Evaluation of Tracking and Surveillance Systems, Copenhagen, Denmark, 2002, pp. 48–55.
[17] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 564–577.
[18] P. Fieguth, D. Terzopoulos, Color based tracking of heads and other mobile objects at video frame rates, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., San Juan, Puerto Rico, 1997, pp. 21–27.
[19] M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, in: Proc. Eur. Conf. Comput. Vis., Cambridge, UK, 1996, pp. 343–356.
[20] J.C. Clarke, A. Zisserman, Detection and tracking of independent motion, Image Vis. Comput. 14 (1996) 565–572.
[21] M.D. Rodriguez, M. Shah, Detecting and segmenting humans in crowded scenes, in: Proc. IEEE Int. Conf. Multimed., Augsburg, Germany, 2007, pp. 353–356.
[22] P. Kelly, N.E. O'Connor, A.F. Smeaton, Robust pedestrian detection and tracking in crowded scenes, Image Vis. Comput. 27 (2009) 1445–1458.
[23] M. Isard, A. Blake, Condensation-conditional density propagation for visual tracking, Int. J. Comput. Vis. 28 (1998) 5–28.
[24] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors, Int. J. Comput. Vis. 75 (2007) 247–266.
[25] H. Jiang, S. Fels, J.J. Little, A linear programming approach for multiple object tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Minneapolis, MN, USA, 2007, pp. 1–8.
[26] L. Zhang, Y. Li, R. Nevatia, Global data association for multi-object tracking using network flows, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
[27] H. Pirsiavash, D. Ramanan, C.C. Fowlkes, Globally-optimal greedy algorithms for tracking a variable number of objects, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Colorado Springs, CO, USA, 2011, pp. 1201–1208.
[28] S. Avidan, Ensemble tracking, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 261–271.
[29] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Conference, Edinburgh, UK, 2006.
[30] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 234–247.
[31] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 983–990.
[32] J. Xing, L. Liu, H. Ai, Background subtraction through multiple life span modeling, in: Proc. IEEE Int. Conf. Image Process., Brussels, Belgium, 2011.
[33] B. Wu, R. Nevatia, Y. Li, Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
[34] G. Duan, C. Huang, H. Ai, S. Lao, Boosting associated pairing comparison features for pedestrian detection, in: Proc. IEEE Workshop Visual Surveillance, Kyoto, Japan, 2009, pp. 1097–1104.
[35] S. Geman, D. Geman, Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.
[36] Y. Li, C. Huang, R. Nevatia, Learning to associate: hybrid boosted multi-target tracker for crowded scene, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 2953–2960.
[37] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, J. Image Video Process. 2008 (2008).
[38] W. Gao, H. Ai, S. Lao, Adaptive contour features in oriented granular space for human detection and segmentation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 1786–1793.
