Article info

Article history:
Received 18 July 2011
Received in revised form 7 February 2012
Accepted 10 February 2012

Keywords:
Visual surveillance
Object detection
Object tracking
Particle filter

Abstract
How far can human detection and tracking go in real-world crowded scenes? Many algorithms often fail in such scenes due to frequent and severe occlusions as well as viewpoint changes. To handle these difficulties, we propose Scene Aware Detection (SAD) and Block Assignment Tracking (BAT), which incorporate available scene models (e.g., background, layout, ground plane and camera models). The SAD achieves accurate detection by utilizing 1) camera models to deal with viewpoint changes by rectifying sub-images, 2) a structural filter approach to handle occlusions based on a feature sharing mechanism in which a three-level hierarchical structure is built for humans, and 3) foregrounds for pruning negative and false positive samples and for merging intermediate detection results. Many detection- or appearance-based tracking systems are prone to errors in occluded scenes because of detector failures and interactions of multiple objects. In contrast, the BAT formulates tracking as a block assignment process, where blocks with the same label form the appearance of one object. In the BAT, we model objects on two levels: the ensemble level, which measures how much a region resembles an object using discriminative models, and the block level, which measures how much a block resembles a target object using appearance and motion models. The main advantage of the BAT is that it can track an object even when all the part detectors fail, as long as the object has assigned blocks. Extensive experiments in many challenging real-world scenes demonstrate the efficiency and effectiveness of our approach.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
Human detection and tracking are classic problems in computer vision, with applications in visual surveillance, driver assistance systems and traffic management, and significant progress has been made recently. Many existing detection and tracking methods, however, face great challenges from radial distortions, illumination variations, viewpoint changes and occlusions, all of which are quite common in real-world scenes.
The goal of our work is to cope with these difficulties to detect and track multiple humans in surveillance scenes using a single stationary camera. Many detection and tracking systems developed so far assume that the viewpoint is frontal, that a person enters the scene without occlusion, that a person appears or disappears only at certain locations, that a person stays in the scene for a given number of frames, or that the human flow is light. In this paper, we present a robust detection and tracking system that attempts to minimize such constraining assumptions and is able to handle the following difficulties: 1) occlusion, when multiple persons enter and move through the scene in crowds;
The rest of this paper is organized as follows. Related work is discussed in the next section. An overview of our system is given in Section 3. Scene Aware Detection is presented in Section 4. Block Assignment Tracking is described in Section 5. Experimental results on many challenging real-world datasets are provided, along with some discussion, in Section 6. Conclusions and future work are given in Section 7.
2. Related work
2.2.2.2. Online learning. Avidan [28] trained an ensemble of weak classifiers online to distinguish between the object and the background. Grabner et al. [29] described an online boosting algorithm for real-time tracking, which was very adaptive but may drift. To limit the drifting problem, Grabner et al. [30] introduced a semi-supervised learning algorithm that explores unlabeled data in a principled manner, while Babenko et al. [31] proposed online Multiple Instance Learning, using one positive bag consisting of several image patches to update a learned classifier. However, the need for manual initialization and the focus on single-object tracking prevent their application in the scenes of interest here.
Fig. 1. System overview. Round rectangle box: inputs and outputs. Rectangle box: procedure. Solid arrow: data flow. Double-line arrow: extra input models. The key factors of our system are marked in bold.
Fig. 2. Comparisons of BAT, bounding boxes and pixel-level segmentation of one object. (a) An image; (b) the foreground image; (c) ideal pixel-level segmentation labeled manually; (d) bounding boxes with extra pixels (left) and missed pixels (right); and (e) BAT with extra blocks (left) and missed blocks (right). Please see Section 3 for more discussion.
Fig. 3. Models in detection: (a) original images; (b) foregrounds; (c) scene layouts; (d) some searching points in red, with lines whose lengths indicate the corresponding human heights; (e) cropped sub-images and their foregrounds; and (f) detection results projected as quadrangles in the original images. The top and bottom rows show a common frontal-viewpoint scene and a changed-viewpoint one, respectively. Note that, in the latter case, camera models are adopted to handle the difficulty of viewpoint changes.
Camera models are utilized to handle viewpoint changes in detection. We follow the method of [10], which first detects objects in sub-images rectified from a changed viewpoint to a frontal viewpoint, and then projects the detection results back into the original image. This kind of method is able to take advantage of detectors learned for a frontal viewpoint and avoids the more difficult training on multi-viewpoint samples. During detection, the sampling in 3D space is projected into the image coordinates, as shown in Fig. 3(d) (bottom). Moreover, there is no need to perform such rectifications for frontal-viewpoint scenes. To speed up detection in these scenes, we assume a linear mapping from the 2D coordinate (x, y) to the human height L_h, namely c_1 x + c_2 y + c_3 = L_h, where c_1, c_2 and c_3 are unknown parameters that can be estimated through a RANSAC-style algorithm like [33]. During detection, the sampling in 2D space is a scanning-window process constrained by the linear mapping, as shown in Fig. 3(d) (top). Please refer to [33,10] for details.
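As a rough illustration of such a RANSAC-style fit (this is not the code of [33]; the function name, tolerance and iteration count are our assumptions), the three parameters can be estimated from noisy (x, y, height) observations:

```python
import numpy as np

def fit_height_map_ransac(points, heights, n_iters=500, inlier_tol=5.0):
    """Fit c1*x + c2*y + c3 = Lh from (x, y, height) observations.
    points: (N, 2) image coordinates; heights: (N,) heights in pixels."""
    rng = np.random.default_rng(0)
    n = len(points)
    A = np.hstack([points, np.ones((n, 1))])  # rows: [x, y, 1]
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(n, size=3, replace=False)  # minimal sample
        try:
            c = np.linalg.solve(A[idx], heights[idx])
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample
        inliers = np.abs(A @ c - heights) < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit on all inliers by least squares for the final estimate
    c, *_ = np.linalg.lstsq(A[best_inliers], heights[best_inliers], rcond=None)
    return c  # (c1, c2, c3)
```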
Layout models can be easily marked out for stationary scenes, as in Fig. 3(c). We assume that humans stand on the ground plane in the layout. After integrating these two models with the linear mapping or camera model mentioned earlier, we obtain the sampled searching points and the corresponding human heights in the scene, as illustrated in Fig. 3(d).
We define a foreground feature over a region r of a searching window I^B as

$$f(r, I^B) = \frac{|M_r \cap M_{I^B}|}{|I^B|}, \tag{1}$$

where M_{I^B} is the set of foreground pixels of I^B, M_r is the set of pixels covered by r, and |I^B| is the total number of pixels in I^B. We restrict r to be a rectangle, and hence Eq. (1) can be calculated efficiently through an integral image, without generating image pyramids as in [1].
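A minimal sketch of this computation (the helper names are ours): after an O(nm) cumulative-sum pass over the binary foreground mask, any rectangle sum, and hence Eq. (1), costs O(1):

```python
import numpy as np

def integral_image(mask):
    # O(nm) preprocessing: cumulative sums over the binary foreground mask.
    return mask.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    # Sum of the mask over [y0, y1) x [x0, x1) in O(1) time.
    s = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        s -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        s -= ii[y1 - 1, x0 - 1]
    if x0 > 0 and y0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s

def foreground_feature(ii, rect, window_area):
    # f(r, I^B): foreground pixels inside rectangle r, normalized by |I^B|.
    x0, y0, x1, y1 = rect
    return rect_sum(ii, x0, y0, x1, y1) / float(window_area)
```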
Positive samples for the pruning can be obtained by manual labeling, as shown in Fig. 4(b). However, collecting negative samples is impractical for two reasons. First, negative samples can take any form, which makes manual labeling too time-consuming. Second, when the pruning detector is applied, negative samples themselves are always inaccurate because of noise in background modeling: parts of real objects may be missing from the foreground, and some background may be included in objects. In fact, negative samples are not necessary, because 1) a small amount of negative samples may cause overfitting, and 2) a large amount of negative samples might make the pruning detectors very
Fig. 4. Foreground pruning. (a) Typical pruned negative and false positive examples. (b) Whole-body positive masks, from which the other part positive masks can be generated. (c) The five features used.
complex and thus inefficient at pruning negative and false positive samples. Motivated by the above, pruning classifiers are learned with positive samples only. The classifier on feature r is determined as
$$h_r(I^B) = \begin{cases} 1, & f(r, I^B) - T_r + \epsilon > 0 \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$
where $T_r = \min_{x_i^B} f(r, x_i^B)$, $\epsilon$ is a small positive constant ($10^{-2}$), and the $x_i^B$ are positive samples. In consideration of the inaccuracy of background modeling, positive samples are perturbed by shifting up to 3 pixels left or right, and up to 2 pixels up or down.
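A sketch of this positive-only learning, under our reading of Eq. (2) (the jitter offsets follow the paper; the function names are ours):

```python
import numpy as np

SHIFTS = [(dx, dy) for dx in (-3, 0, 3) for dy in (-2, 0, 2)]

def jittered_positives(masks):
    # Simulate background-modeling noise: shift each positive mask up to
    # 3 pixels horizontally and 2 pixels vertically.
    return [np.roll(np.roll(m, dy, axis=0), dx, axis=1)
            for m in masks for dx, dy in SHIFTS]

def learn_threshold(feature_fn, positives, eps=1e-2):
    # T_r is the minimum feature response over the (jittered) positives;
    # Eq. (2) then accepts a window when f(r, I^B) > T_r - eps.
    return min(feature_fn(x) for x in positives) - eps

def strong_detector(window, feature_fns, thresholds):
    # A window survives only if every per-feature classifier fires;
    # the order of the classifiers is unconstrained.
    return all(f(window) > t for f, t in zip(feature_fns, thresholds))
```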
This pruning should be fast and effective. Instead of automatically selecting good features from a large feature pool as in [1], we simply design several features, shown in Fig. 4(c). All classifiers learned on these features are combined into one strong detector, in which their order is unconstrained. A searching window is then considered only if its corresponding foreground passes this strong detector. For an n × m image, the pre-processing of an integral image costs O(nm) time and space. Each of our features can then be calculated in O(1) time, and thus a bank of classifiers costs approximately constant time. Its effectiveness will be evaluated in the experiments.
4.3. Structural filter approach
The detection is based on our previous work [4,34]. We proposed to learn an Integral Structural Filter (ISF) detector in [4] to detect humans with occlusions and articulated poses through a feature sharing mechanism. We build a three-level hierarchical model for humans: words, sentences and paragraphs, where words are the most basic units, sentences are meaningful sub-structures, and paragraphs are the appearance statuses (e.g., head-shoulder, upper-body, left-part, right-part and whole-body in occluded scenes). An example is shown in Fig. 5. We integrate the detectors for the three levels through inference from word to sentence, from sentence to paragraph, and from word to paragraph. All detectors for structures (words, sentences and paragraphs) are based on the Real AdaBoost algorithm and Associated Pairing Comparison Features (APCFs) [34]. An APCF describes the invariance of color and gradient of an object to some extent, and it contains two essential elements: Pairing Comparison of Color (PCC) and Pairing Comparison of Gradient (PCG). A PCC (or PCG) is a Boolean color (or gradient) comparison of two granules, where a granule is a square window patch. Please refer to [4,34] for more details.
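For concreteness, here is a minimal sketch of a PCC-style test, assuming the Boolean comparison is taken over mean granule colors; the exact APCF definition in [34] may differ in detail:

```python
import numpy as np

def pcc(image, g1, g2, channel=0):
    """Pairing Comparison of Color: compare two square granules,
    each given as (y, x, size), on one color channel of an HxWx3 image."""
    def mean_color(g):
        y, x, s = g
        return image[y:y + s, x:x + s, channel].mean()
    return mean_color(g1) > mean_color(g2)  # Boolean feature value
```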
A hypothesis h is scored against the current hypothesis set H by its gain,

$$\mathrm{sc}_{\mathrm{add}}(h) = F(H \cup \{h\}) - F(H),$$

and h is added to H if sc_add(h) > T_add;
h can be deleted if sc_del(h) < T_del. T_add and T_del are empirical parameters: the smaller T_add, the more objects are added; the larger T_del, the more objects are deleted. In the implementation, we propose a greedy scheme that first uses the adding operation to find possible hypotheses and then uses the deleting operation to remove bad ones. Although the strategy is very simple, it yields promising detection results in the experiments.
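A sketch of this greedy post-processing, with the set score F and the deletion score treated as black boxes (these interfaces are our assumptions):

```python
def greedy_post_process(candidates, F, sc_del, T_add, T_del):
    """Adding pass: keep hypotheses whose gain F(H + h) - F(H) exceeds
    T_add. Deleting pass: drop hypotheses with sc_del(h) < T_del."""
    H = []
    for h in candidates:
        if F(H + [h]) - F(H) > T_add:
            H.append(h)
    return [h for h in H if sc_del(h) >= T_del]
```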
5. Block Assignment Tracking
The previous section mainly discussed accurately locating objects in scenes with occlusions and viewpoint changes. In this section, we concentrate on robustly tracking them. In what follows, we first derive the formulation of our block assignment tracking problem, and then present our solution.
The problem is formulated as

$$(Z_t^*, V_t^*) = \arg\max_{Z_t, V_t} p(Z_t, V_t \mid O_{1:t}, V_{t-1}). \tag{6}$$

Compared to Eq. (5), V_{t-1} on the right side of Eq. (6) takes the previous assignment into account. However, the optimization of Eq. (6) is not tractable, because V_{1:t} and Z_t are closely intertwined at time t. The inference between V_t and V_{t-1} should maintain the spatial and temporal persistence of block assignments. Meanwhile, Z_t encourages blocks with the same label in V_t to look like an object. Moreover, V_{1:t-1} can provide robust appearance and motion models of objects for inferring Z_t. To make the optimization tractable, we propose to split Eq. (6) into three steps. The first step is to obtain an intermediate assignment Ṽ_t through inference on the block level of two sequential frames, ignoring Z_t:

$$\tilde{V}_t = \arg\max_{\tilde{V}_t} p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}). \tag{7}$$

The second step is to infer the object states Z_t with the aid of robust appearance and motion models of objects estimated from V_{1:t-1}:

$$Z_t = \arg\max_{Z_t} p(Z_t \mid O_{1:t}). \tag{8}$$

The third step is based on the other two steps; it can make blocks with the same label look like a part of some object and potentially rectify possible errors made during initialization and tracking. After integrating these three steps into Eq. (6), we obtain

$$p(Z_t, V_t \mid O_{1:t}, V_{t-1}) \approx p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t})\, p(Z_t \mid O_{1:t})\, p(V_t \mid Z_t, O_t, \tilde{V}_t). \tag{10}$$
$$\tilde{V}_t = \arg\min_{V_t} \sum_{i=1}^{N} \big[ \varphi_i(b_{t,i}, z_{t,k} \mid V_{t-1}, O_{t-1:t}) + \psi_i(b_{t,i}, b_{t,j_1}, \ldots, b_{t,j_l} \mid V_{t-1}, O_{t-1:t}) \big], \tag{11}$$

where φ_i is the penalty if b_{t,i} is assigned to z_{t,k}, ψ_i is the penalty when b_{t,i} and its neighbors are assigned to different objects, δ(i, j) is the Kronecker function, equal to 1 if i = j and 0 otherwise, l = |N_{b_{t,i}}|, and N_{b_{t,i}} is the set of 8-neighbor blocks of b_{t,i}. The observations here are actually image sequences, and object states are updated straightforwardly from their previous states as z_{t,k} = z_{t-1,k} + Δz_{t,k}, with motions Δz_{t,k} = (Δz^x_{t,k}, Δz^y_{t,k}). The motion of an object is represented by the most frequent motion among all its blocks, where the motion of one block is obtained by block matching. We now give the definitions of φ_i and ψ_i.

Fig. 6. Tracking problem formulation. Left: original image. Middle: foreground block image. Right: an assignment where blocks with the same color (label) form the appearance of one object and the quadrangles indicate coarse object shapes.
$$\varphi_i(b_{t,i}, z_{t,k} \mid V_{t-1}, O_{t-1:t}) = a\,DL_{t,i,k} + b\,DM_{t,i,k} + c\,DA_{t,i,k}, \tag{12}$$

where DL_{t,i,k}, DM_{t,i,k} and DA_{t,i,k} are the location, motion and appearance distances between block b_{t,i} and object z_{t,k} (Eqs. (13) and (14)); the appearance distance compares the 8 × 8 pixels of a block at offsets 0 ≤ dx < 8, 0 ≤ dy < 8 in frame t with the motion-compensated pixels (x + dx − Δz^x_{t,k}, y + dy − Δz^y_{t,k}) in frame t − 1. The neighborhood penalty is

$$\psi_i(b_{t,i}, b_{t,j_1}, \ldots, b_{t,j_l} \mid V_{t-1}, O_{t-1:t}) = d \sum_{k=1}^{K} (N_{i,k} - N_k)^2 + g \sum_{n=1}^{l} \| r_{t,i} - r_{t,i_n} \|^2. \tag{16}$$
Similar to [5], we set a = 1, b = 1, c = 0.125, d = 0.00000025 and g = 0.5, and adopt the Gibbs sampler algorithm [35] to solve Eq. (11). Please refer to [5,35] for more details.
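For illustration, a minimal Gibbs-sampler sketch over block labels, under assumed interfaces for the unary (φ-like) and pairwise (ψ-like) penalties of Eq. (11); the sweep count and temperature are our choices, not the paper's:

```python
import numpy as np

def gibbs_block_labels(unary, pairwise, neighbors, n_sweeps=5, temp=1.0):
    """unary[i, k]: penalty of giving block i label k.
    pairwise(i, k, nbr_labels): penalty given the labels of block i's
    8-neighborhood. Both interfaces are illustrative assumptions."""
    rng = np.random.default_rng(0)
    n_blocks, n_labels = unary.shape
    labels = unary.argmin(axis=1)  # greedy initialization
    for _ in range(n_sweeps):
        for i in range(n_blocks):
            e = np.array([unary[i, k] + pairwise(i, k, labels[neighbors[i]])
                          for k in range(n_labels)])
            p = np.exp(-(e - e.min()) / temp)  # Gibbs conditional
            labels[i] = rng.choice(n_labels, p=p / p.sum())
    return labels
```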
5.2.2. Ensemble Tracking
This step estimates object locations accurately at the ensemble level and offers the potential to amend possible errors in initialization and tracking, as discussed earlier. The errors are not notable over a short time (N_E frames, for simplicity), but are magnified vastly as time passes. For the former situation, object states updated by their motions are adequate. For the latter situation, we resort to the prediction and update steps of a sequential Bayesian estimation problem:

$$p(Z_t \mid O_{1:t-1}) = \int D(Z_t \mid Z_{t-1})\, p(Z_{t-1} \mid O_{1:t-1})\, dZ_{t-1},$$

$$p(Z_t \mid O_{1:t}) \propto L(O_t \mid Z_t)\, p(Z_t \mid O_{1:t-1}). \tag{15}$$
The posterior is approximated with a set of weighted particles,

$$p(Z_t \mid O_{1:t}) \approx \sum_{n=1}^{N_p} \pi^n_{t,k}\, \delta\big(z^n_t - z_t\big), \tag{17}$$

in which N_p is the total number of particles and δ(·) denotes the Dirac delta function. The nth particle is denoted as p^n = (x^n_t, s^n_t, H^n_t, \{\pi^n_{t,k}\}_{k=1}^{K}), where x^n_t = (x^n_t, y^n_t) is the location, s^n_t is the scale, H^n_t is the appearance model, and π^n_{t,k} is the weight for z_{t,k}. Motivated by the successes of [7,16], we define
$$\pi^n_{t,k} = \lambda\, \pi^{n,D}_{t,k} + (1 - \lambda)\, \pi^{n,G}_{t,k}, \tag{18}$$

where π^{n,D}_{t,k} is a discriminative weight modeled using the detector confidence, π^{n,G}_{t,k} is an appearance weight measured from an online learned appearance model, λ is a parameter (0.5 here), and τ is a distance
learned appearance model. is a parameter (=0.5 here). is a distance
threshold. The appearance models for particles or objects come from
pixels in foreground blocks. We utilize HSV color space and the number
of bins for each channel is set to 16.
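To make the appearance model and the combined weight (as we read Eq. (18)) concrete, here is a sketch; the histogram layout, value range and function names are our assumptions:

```python
import numpy as np

def hsv_histogram(pixels_hsv, bins=16):
    # Appearance model from foreground-block pixels: one 16-bin histogram
    # per HSV channel (value range assumed 0..255), concatenated and
    # normalized to sum to one.
    h = np.concatenate([
        np.histogram(pixels_hsv[:, c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]).astype(float)
    return h / max(h.sum(), 1.0)

def particle_weight(w_disc, w_app, lam=0.5):
    # Eq. (18) as we read it: blend the discriminative weight (detector
    # confidence) with the appearance weight.
    return lam * w_disc + (1.0 - lam) * w_app
```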
However, objects may sometimes get lost during tracking. If an object cannot gather enough supporting particles (particles with π^n_{t,k} > τ), it is declared lost and buffered for possible matching with newly detected objects. We perform object detection (SAD) every N_F frames to find new objects. If a lost object cannot be matched within T_W frames, it is discarded.
The final ensemble-to-block assignment then minimizes the total energy

$$\sum_{i=1}^{N} \Big[ \varphi_i(b_{t,i}, z_{t,k}) + \sum_{b_{t,j} \in N_{b_{t,i}}} \psi_{i,j}(b_{t,i}, b_{t,j}) \Big], \tag{19}$$

with the pairwise term ψ_{i,j} defined below.
Fig. 7. Comparisons of sampling strategies. (a) A scene with six persons. (b) PF [23] models objects as unrelated. (c) In our strategy, particles from different objects are not distinct, but those far from the concerned object are ignored (e.g., only particles from objects D, C and E contribute to D).
$$\psi_{i,j}(b_{t,i}, b_{t,j}) = \begin{cases} \mu \exp\!\Big(-\frac{\| A_{t,i} - A_{t,j} \|}{\sigma_A}\Big) + (1 - \mu)\Big(1 - \exp\!\Big(-\frac{\| r_{t,i} - r_{t,j} \|}{\sigma_r}\Big)\Big), & v_{t,i} \neq v_{t,j} \\ 0, & \text{otherwise,} \end{cases} \tag{20}$$

where A_{t,l} and r_{t,l} are the appearance and motion of b_{t,l} (l = i, j), μ is a parameter (0.5 here), and σ_A and σ_r are normalization factors. The appearances of blocks are modeled as 4-bin histograms of gray values; σ_A is set to the number of pixels in a block (64 here). Supposing the maximum motion of a block is the block size (8 × 8), we set σ_r = 8² + 8² = 128.

After achieving the final assignment, we update the appearance models of objects. Intuitively speaking, if an object is occluded by others, meaning that some of its overlapped foreground blocks are
not assigned to it, the update ratio should be small: the more occlusion, the smaller the update ratio. Based on this, we define the update ratio as ρ = 0.5 N_k / N_a, where N_k is the number of blocks assigned to z_{t,k} and N_a is the total number of blocks overlapped by z_{t,k}. Given the previous and current appearance models A_p and A_c for z_{t,k}, the update is A = (1 − ρ)A_p + ρA_c.
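This occlusion-aware update is simple enough to state directly; a minimal sketch under the definitions above:

```python
def update_appearance(A_prev, A_curr, n_assigned, n_overlapped):
    # rho = 0.5 * Nk / Na: the more occluded the object, the smaller the
    # update ratio; then A = (1 - rho) * A_prev + rho * A_curr.
    rho = 0.5 * n_assigned / max(n_overlapped, 1)
    return (1.0 - rho) * A_prev + rho * A_curr
```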
We have now described the three key components of the BAT. For easy reference, the entire procedure of the BAT is summarized in Fig. 8.
6. Experiments
In this section, we carry out extensive experiments to evaluate our proposed detection and tracking system. We first describe the training and testing datasets, then list the detection and tracking metrics used for evaluation, then evaluate the performance of our system, and finally provide some discussion.
6.1. Datasets
Fig. 10. Test datasets. The CAVIAR dataset can be downloaded from http://homepages.inf.ed.ac.uk/rbf/CAVIAR/. PETS2007 can be downloaded from http://www.cvg.rdg.ac.uk/PETS2007/. Humans in CAVIAR2 are too small, and therefore we double the original video size (384 × 288).
Table 1

Dataset       NH    PPW     t (ms)   T (ms)
CAVIAR1       6     94.4%   700      1210
S02           9     90%     4600     10,400
Our dataset   11    79%     1200     7560
CAVIAR2       4     86%     290      650
Fig. 11. Evaluation of our SAD compared to DPM [12] and ACF [38]. (a), (b), (c) and (d) compare our approach with several state-of-the-art works on CAVIAR1, S02, our dataset and CAVIAR2, respectively. (e) and (f) zoom in on our approach on S02 and our dataset, respectively, to illustrate more details.
detection results by the near locations in the following. Note that, except for CAVIAR1, the other three test datasets need rectification with camera models. The methods using camera models are indicated by CAM. The ROC curves are shown in Fig. 11.
6.3.1.2.1. Improvements of FAP. Compared to ISF + NAIVE, ISF + NAIVE + FAP improves the detection rate by about 3% on CAVIAR1. Compared to ISF + NAIVE + CAM, ISF + NAIVE + CAM + FAP improves the performance by about 4% on S02, 4% on our dataset and 1% on CAVIAR2. Similar performance improvements are achieved with ISF + MAP + CAM + FAP. From the experiments in Fig. 11 and Table 1, we can see that FAP not only works well for pruning but also improves detection performance.
6.3.1.2.2. Improvements of ISF and scene models. ISF + MAP performs better than or comparably to ACF + MAP on CAVIAR1, S02 and our dataset, demonstrating that ISF can detect occluded humans in scenes without large viewpoint changes. ISF (ISF + MAP and ISF + NAIVE) is also better than DPM (DPM1 and DPM2), for which there might be two main reasons: 1) the ability of the deformable part-based model is limited on strongly labeled samples like our training dataset, and 2) the weak feature in DPM uses only gradient information, while the weak feature in ISF combines both color and gradient information, which is more discriminative for pedestrian detection. Note that, as DPM2 is more focused on pedestrians than DPM1, it performs better than DPM1 on S02, and comparably to DPM1 on CAVIAR1 and our dataset. But all these detectors fail on CAVIAR2 because of large viewpoint changes, which can be better handled with camera models. Compared with ISF + NAIVE, ISF + NAIVE + CAM improves the performance by about 3% on S02 and 26% on our dataset, and it works well on CAVIAR2. Similar improvements are achieved with ISF + MAP + CAM. As the viewpoint of CAVIAR1 is frontal, the linear mapping from 2D coordinates to human height is used. In the experiments, we find that the linear mapping does not reduce detection performance, while it speeds up detection by about 0.6 s on average compared to using ISF alone.
6.3.1.2.3. Improvements of FAM. We replace the post-processing method with FAM to show further performance improvements. Compared to ISF + MAP + FAP, our approach (ISF + FAM + FAP) improves the detection rate by about 11% on CAVIAR1. Compared to ISF + MAP + FAP + CAM, our approach (ISF + FAM + FAP + CAM) improves the detection rate by about 16% on our dataset and 14% on S02. As MAP adds objects in y-descent order, which does not hold in scenes with large viewpoint changes, it does not work well on CAVIAR2 and is even worse than NAIVE sometimes. In contrast, our approach still works well in such scenes and achieves a 52% detection rate at FPPI = 0.1 on CAVIAR2. We also observe another interesting phenomenon: the curves of our approach are much steeper than the others, indicating that we can detect more objects with fewer false positives. This is mainly because of the pruned false positive samples and the scene models used. We zoom in on the curves of Fig. 11(b) and (c) to illustrate more details in Fig. 11(e) and (f), respectively.
6.3.1.2.4. Summary. These experiments have shown the effectiveness of the key components (FAP, ISF and FAM) of our SAD in occluded scenes and scenes with viewpoint changes. Therefore, our SAD as a whole outperforms many state-of-the-art detection algorithms such as [12,38]. But the speed is not satisfactory: detection costs on average about 0.51 s, 5.8 s, 6.36 s and 0.36 s on CAVIAR1, our dataset, S02 and CAVIAR2, respectively. Because of changed viewpoints and heavy occlusions, it costs much more time on our dataset and S02 than on CAVIAR1 and CAVIAR2. For further speedup and performance improvements, we recommend our proposed BAT, which is evaluated in the next subsection.
6.3.2. Tracking evaluations
In this section, we report the tracking performance of our BAT on all test datasets, based on the SAD results, without retraining detectors for specific scenes. For concise description, we denote our BAT with and without camera models as BAT + 3D and BAT + 2D, respectively.

6.3.2.1. Algorithms for comparisons. We compare our approach with some state-of-the-art tracking algorithms [26,24,15,7,36,27]. We utilize the implementation of [27], available at http://www.ics.uci.edu/~dramanan/, to carry out experiments by
Fig. 12. Quantitative results on CAVIAR1. *The definitions of fragment and IDS numbers in [26,7] are based on looser evaluation metrics.
Fig. 13. Quantitative results of our method on our dataset, S02 and CAVIAR2.
tracking performances. In contrast, our SAD can detect many humans, and our BAT generally performs much better and more stably than [7,27]. BAT(ET) + 3D can track more objects than [7,27], but it obtains many fragments and ID switches. Compared to BAT(ET) + 3D, BAT + 3D achieves better performance: it obtains higher MT/PT/MOTP/MOTA and lower FM and IDS. This improvement shows that combining block and ensemble information is superior to using ensemble information alone for tracking. Compared to BAT + 2D, the improvement of BAT + 3D lies mainly in the usage of camera models because of
Fig. 14. Tracking results. (a) and (b) compare [7] (top) and our approach (bottom) on OneStopMoveEnter1cor of CAVIAR1 and on S02. (c) and (d) illustrate sample results on our dataset and on Meet_crowd of CAVIAR2, respectively. The layouts of (a) and (b) are already shown in Fig. 3(c); the layouts of (c) and (d) are illustrated at the far left. More descriptions can be found in Section 6.3.2.
6.4. Discussions

6.4.1. Parameters
There are some parameters in the SAD and BAT, listed in Fig. 15 with corresponding descriptions and default values. The effects of some key parameters on our framework are as follows. For SAD, the parameters T_M, T_add and T_del directly impact the post-processing of detection. A smaller T_M indicates a higher probability of

7. Conclusion
In this paper, we propose a robust system for multi-object detection and tracking in surveillance scenes with occlusions and viewpoint changes. Our SAD achieves robust detection through:
(1) camera models to cope with viewpoint changes; (2) a structural filter approach to handle occlusions; and (3) foreground-aware pruning and foreground-aware merging with the aid of some scene models. Our BAT, which formulates tracking as a block assignment process, can track objects robustly even when all the part detectors fail, as long as the object has assigned blocks. Its key factors are: (1) Block Tracking to maintain the spatial and temporal consistency of labels; (2) Ensemble Tracking to precisely estimate the locations and sizes of objects; and (3) Ensemble-to-Block Assignment to make blocks with the same label look like a part of a human.
Although our method tracks remarkably well even through occlusions and viewpoint changes, one unavoidable drawback is fuzzy object boundaries. To overcome this, we could learn and extract discriminative patches to represent and track objects. Another drawback is that the tracking results jitter, which could be amended by estimating object trajectories. For detection improvement, we could use online algorithms to make the offline, general detectors adaptive to a fixed scene. Although the current system only considers humans, the proposed mechanism can easily be extended to other kinds of objects. Based on the detection and tracking results, some high-level analysis of object behaviors becomes possible. Furthermore, we hope to make our approach applicable to real-world needs.
Acknowledgements
This work is supported in part by the National Science Foundation of China under grant No. 61075026 and the National Basic Research Program of China under grant No. 2011CB302203. Mr. Shihong Lao is partially supported by the R&D Program for Implementation of Anti-Crime and Anti-Terrorism Technologies for a Safe and Secure Society, Special Coordination Fund for Promoting Science and Technology of MEXT, the Japanese Government.
Appendix A. Supplementary data
Supplementary data to this article can be found online at doi:10.1016/j.imavis.2012.02.008.
References
[1] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Kauai, HI, USA, 2001, pp. I-511–I-518.
[2] B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in: Proc. IEEE Int. Conf. Comput. Vis., Beijing, China, 2005, pp. 90–97.
[3] C. Huang, R. Nevatia, High performance object detection by collaborative learning of joint ranking of granules features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., San Francisco, California, USA, 2010, pp. 41–48.
[4] G. Duan, H. Ai, S. Lao, A structural filter approach to human detection, in: Proc. Eur. Conf. Comput. Vis., Crete, Greece, 2010, pp. 238–251.
[5] S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi, Traffic monitoring and accident detection at intersections, IEEE Trans. Intell. Transp. Syst. 1 (2000) 108–118.
[6] T. Zhao, R. Nevatia, Tracking multiple humans in complex situations, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1208–1221.
[7] J. Xing, H. Ai, S. Lao, Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 1200–1207.
[8] M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection and people-detection-by-tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
[9] X. Wang, T.X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in: Proc. IEEE Int. Conf. Comput. Vis., Kyoto, Japan, 2009, pp. 32–39.
[10] Y. Li, B. Wu, R. Nevatia, Human detection by searching in 3D space using camera and scene knowledge, in: Proc. IEEE Int. Conf. Image Process., Tampa, Florida, USA, 2008, pp. 1–5.
[11] G. Duan, H. Ai, S. Lao, Human detection in video over large viewpoint changes, in: Proc. Asian Conf. Comput. Vis., Queenstown, New Zealand, 2010, pp. 683–696.
[12] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
[13] C. Beleznai, H. Bischof, Fast human detection in crowded scenes by contour integration and local shape estimation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 2246–2253.
[14] D. Hoiem, A.A. Efros, M. Hebert, Putting objects in perspective, Int. J. Comput. Vis. 80 (2008) 3–15.
[15] C. Huang, B. Wu, R. Nevatia, Robust object tracking by hierarchical association of detection responses, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 788–801.
[16] A. Senior, Tracking with probabilistic appearance models, in: ECCV Workshop on Performance Evaluation of Tracking and Surveillance Systems, Copenhagen, Denmark, 2002, pp. 48–55.
[17] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 564–577.
[18] P. Fieguth, D. Terzopoulos, Color-based tracking of heads and other mobile objects at video frame rates, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., San Juan, Puerto Rico, 1997, pp. 21–27.
[19] M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, in: Proc. Eur. Conf. Comput. Vis., Cambridge, UK, 1996, pp. 343–356.
[20] J.C. Clarke, A. Zisserman, Detection and tracking of independent motion, Image Vis. Comput. 14 (1996) 565–572.
[21] M.D. Rodriguez, M. Shah, Detecting and segmenting humans in crowded scenes, in: Proc. Int. Conf. Multimed., Augsburg, Germany, 2007, pp. 353–356.
[22] P. Kelly, N.E. O'Connor, A.F. Smeaton, Robust pedestrian detection and tracking in crowded scenes, Image Vis. Comput. 27 (2009) 1445–1458.
[23] M. Isard, A. Blake, Condensation: conditional density propagation for visual tracking, Int. J. Comput. Vis. 28 (1998) 5–28.
[24] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors, Int. J. Comput. Vis. 75 (2007) 247–266.
[25] H. Jiang, S. Fels, J.J. Little, A linear programming approach for multiple object tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Minneapolis, MN, USA, 2007, pp. 1–8.
[26] L. Zhang, Y. Li, R. Nevatia, Global data association for multi-object tracking using network flows, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
[27] H. Pirsiavash, D. Ramanan, C.C. Fowlkes, Globally-optimal greedy algorithms for tracking a variable number of objects, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Colorado Springs, CO, USA, 2011, pp. 1201–1208.
[28] S. Avidan, Ensemble tracking, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 261–271.
[29] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Conference, Edinburgh, UK, 2006.
[30] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 234–247.
[31] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 983–990.
[32] J. Xing, L. Liu, H. Ai, Background subtraction through multiple life span modeling, in: Proc. IEEE Int. Conf. Image Process., Brussels, Belgium, 2011.
[33] B. Wu, R. Nevatia, Y. Li, Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
[34] G. Duan, C. Huang, H. Ai, S. Lao, Boosting associated pairing comparison features for pedestrian detection, in: Proc. IEEE Workshop on Visual Surveillance, Kyoto, Japan, 2009, pp. 1097–1104.
[35] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.
[36] Y. Li, C. Huang, R. Nevatia, Learning to associate: hybrid boosted multi-target tracker for crowded scene, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 2953–2960.
[37] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, EURASIP J. Image Video Process. 2008 (2008).
[38] W. Gao, H. Ai, S. Lao, Adaptive contour features in oriented granular space for human detection and segmentation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 1786–1793.