
Pattern Recognition Letters 34 (2013) 2135–2143

Contents lists available at ScienceDirect

Pattern Recognition Letters


journal homepage: www.elsevier.com/locate/patrec

Localizing people in multi-view environment using height map reconstruction in real-time

Ákos Kiss a,b,⇑, Tamás Szirányi a

a Distributed Events Analysis Laboratory, Institute for Computer Science and Control (MTA SZTAKI), Kende u. 13-17, H-1111 Budapest, Hungary
b Department of Control Engineering and Information Technology, Budapest University of Technology and Economics, Magyar tudósok körútja 2., H-1117 Budapest, Hungary

a r t i c l e   i n f o

Article history:
Received 13 February 2013
Available online 14 August 2013
Communicated by D. Coeurjolly

Keywords:
Multi-view localization
3D position
Projection
Real-time processing

a b s t r a c t

In this article we address the problem of visual people localization, based on the detection of their feet. Localization is based on searching cone intersections. The altitude of location is also retrieved, which eliminates the need of planar ground, which is a common restriction in the related literature. We found that positions can be computed accurately, and despite a large number of false positives, the height map of the scene can be reconstructed with small error. Precision of the detector can be increased given the height map, so that results of our method are comparable to state of the art methods in case of planar ground, but adding the ability to handle arbitrary ground. Our algorithm is capable of real-time operation, based on two optimizations: decreasing the number of cones, and approximating intersection bodies. Cones are back-projections of ellipses in images covering feet regions. Moreover, most demanding steps are parallelizable, and distributable due to lack of data dependencies.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Single camera object detection can rely on the visual appearance of the object, or can utilize its dynamic properties. Appearance models can incorporate some kind of descriptors, like color, shape, and texture. However, sometimes the temporal behavior of the object is more characteristic, for example, in case of gait (Jung and Nixon, 2013).

Non-typical and occluding objects are hard to detect, so multiple cameras are often used to overcome these limitations. Using more viewpoints the possibility of clear visibility increases, and by gaining spatial information the localization of the objects also becomes possible.

In such cases, consistency of object pixels among views will signal objects in 3D space. There are many methods for finding object pixels. Using stereo cameras, the disparity map can highlight/segment foreground, or using wide-baseline stereo imaging, image-wise foreground detection can be carried out (Benedek and Szirányi, 2008). In this latter case, eliminating shadows and reflections is a hard problem in itself (Benedek and Szirányi, 2007; Prati et al., 2003).

The resulting set of object pixels can be further segmented, and consistency checking might be reduced to likely correspondent segments. This might be done using some descriptors (e.g. color, Mittal and Davis, 2001, 2002), however color calibration is necessary due to different sensor properties and different illumination conditions (Jeong and Jaynes, 2008).

Foreground masks can be projected to a ground plane, where overlapping pixels mark consistent regions (Iwase and Saito, 2004). Robustness can be improved using multiple parallel planes (Khan and Shah, 2009). Furthermore, looking for certain patterns in the projection plane increases reliability (Utasi and Benedek, 2011).

In many works the authors assume that the homography and the ground plane are known for carrying out the projections. In Havasi and Szlávik (2011) homography parameters are estimated from co-motion statistics from multimodal input videos, eliminating the need of human supervision.

The projection of whole foreground masks is computationally expensive, but filtering pixels can reduce complexity. In some works, points associated with feet are searched, reducing foreground masks from arbitrary blobs to points and lines (Kim and Davis, 2006; Iwase and Saito, 2004).

Another gain of filtering foreground points is that, depending on the geometry, feet are less occluded than whole bodies, eliminating a great source of errors. Reducing occlusion is especially important in dense crowds, for example, by using top-view cameras (Eshel and Moses, 2010).

⇑ Corresponding author at: Distributed Events Analysis Laboratory, Institute for Computer Science and Control (MTA SZTAKI), Kende u. 13-17, Budapest, H-1111 Hungary. Tel.: +36 703230688.
E-mail addresses: kiss.akos@sztaki.mta.hu (Á. Kiss), sziranyi.tamas@sztaki.mta.hu (T. Szirányi).

0167-8655/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.patrec.2013.08.007

2. Overview

Our goal was to design an algorithm for localizing people in a multiview environment where the geometry of the ground is

arbitrary, which problem is not addressed in the field. Moreover, we aimed at reaching real-time processing with possibly many views to make it available for surveillance systems.

We utilize the fact that foreground pixels of a view correspond to lines in 3D scene space through projective back-projection, and lines coming from different views intersect in scene space at surface points of the object. Detecting such intersections however would require too many computations, as the number of line pairs n_p is proportional to

n_p \propto \binom{n}{2} \left( s_x s_y r_{fg} \right)^2    (1)

where s_x, s_y is the resolution of the camera image, r_{fg} is the ratio of foreground pixels in the image and n is the number of views. We found that it is not the number of views that can cause bottleneck problems, but the increasing number of foreground pixels, either due to high resolutions or noisy foreground masks. Therefore we

1. filter the foreground mask to find pixels relevant to detecting people positions, and
2. cluster spatially coherent pixels into one primitive instead of handling them separately.

Some aspects of this approach were briefly introduced in Kiss and Szirányi (2013); in this work we give more details, explain the selection of algorithm parameters and show comparative test results. We filter pixels possibly corresponding to feet, and replace these pixels with encompassing ellipses. These ellipses can be back-projected to cones in scene space, thus the detection of intersecting cones can replace the detection of many line pair intersections. Our approach has several advantages in means of both precision and speed:

- for determining cone parameters, undistortion may be carried out with few computations, leading to accurate parameters,
- the number of cones is proportional to the number of objects regardless of image resolution, which makes our cone matching approach scalable as opposed to pixel matching,
- ground does not have to be flat (unlike when using homographies in Utasi and Benedek (2011), Khan and Shah (2009, 2006), Iwase and Saito (2004) and Berclaz et al. (2006)),
- we can compute a synergy map without presumption about subjects' heights (which is required in Utasi and Benedek (2011) and Khan and Shah (2009, 2006)).

However, the presented approach has some requirements and in some cases can have some drawbacks:

- Precise calibration is required for reliable estimation of cone parameters (both intrinsic and extrinsic parameters of the camera). A small error in cone or camera parameters can slightly change the direction of the cone, which can cause large errors at locations far from the camera.
- Precise synchronization of the cameras is required, because the positions of feet may change fast (comparable to the size of feet in 0.1 s).
- Our algorithm may fail on incorrect foreground masks. As we will show in Section 3.2, if the foreground mask is incorrect, the extracted ellipses will not cover the feet regions precisely, or will not even intersect any foot.

We show the applied preprocessing steps for forming ellipses from foreground masks in Section 3. Section 4 introduces the steps of forming cones in scene space and matching these cones. Section 5 describes feet localization from cone matches as well as details of height map reconstruction. We show results and comparison to state of the art methods in Section 6 and conclude our work in Section 7.

3. Preprocessing

Foreground detection can be corrupted by shadows, reflections or changes in illumination conditions. To minimize such errors, we use a foreground detection method capable of eliminating shadows and reflections (described in Benedek and Szirányi (2008)).

Different color spaces characterize colors differently, which highly affects the appearance of shadows. Consequently, the selection of the color space highly impacts performance of shadow detection (Benedek and Szirányi, 2007). We evaluated a number of color spaces and use XYZ as it leads to the best overall performance of our algorithm; however, in certain situations the foreground mask is still corrupted by shadows, as we can see in Fig. 2(a).

Digital cameras' auto white balance function is aimed at adapting to changes in illumination. However, large objects entering or exiting the field of view may affect the measurement of illumination parameters. This results in changing chromaticity of the output image, even if lighting conditions remain the same. Consequently, pixel values will not fit the static background model and the foreground mask becomes invalid.

In many digital cameras this feature cannot be turned off at all. To overcome this issue, we have to neglect its effect. We do this by extending the background model with white balance adaptation: if we observe a tendency in relative changes of values for a color channel, we conclude that the gain of the channel changed, and update parameters of the static background accordingly. We computed the tendency of change T_c for channel c only on background pixels x \in B to discard scene changes:

T_c = \frac{1}{|B|} \sum_{x \in B} \frac{\tilde{I}_{c,x}}{I_{c,x}}    (2)

\tilde{I}'_{c,x} = \begin{cases} \tilde{I}_{c,x} & T_c \in [0.97, 1.03] \\ \tilde{I}_{c,x} / T_c & \text{otherwise} \end{cases}    (3)

where I_{c,x} is the value for channel c at position x, \tilde{I} is the static background parameter and \tilde{I}' is the updated parameter. This method leads to a more reliable foreground detection according to our visual validation (see Fig. 1(a) for example).

We filter out small areas from the foreground mask to suppress noise. In our tests, we eliminated areas covering less than 7 pixels, as we found foreground blobs due to noise mostly did not exceed this size. For other videos, this threshold might be adjusted according to video quality. Note that this step is not necessary for our method to work, it only speeds up further processing.

3.1. Filtering foreground pixels

We assume that cameras are in upright position so that pixels of feet are bottom pixels of vertical lines in the foreground mask. We consider these pixels as candidates for feet, calling them candidate pixels. A sample of extracted candidate pixels can be seen in Fig. 1(b).

Note that in different setups, if the vanishing point for the vertical direction is known from calibration, candidate pixel filtering can also be carried out.

We cluster the candidate pixel set by connecting pixels closer than 3 px (choice of parameter is based on experiments, see Fig. 5(d)). This is necessary since candidate pixels are often not connected because of noise and steep edges. Note that these do not necessarily change when using higher resolutions, so this threshold can be used for different image sizes; however, adjustment might be required for different image quality.

Fig. 1. Steps of extracting feet.

After clustering, we obtain a set of pixel clusters. We model each cluster with an ellipse according to the moments of the pixel coordinates. This makes our algorithm robust to image resolution, as increasing resolution results in more candidate pixels, but the same number of clusters, thus reducing the number of 3D primitives, which drastically speeds up pairwise matching. Only the steps of preprocessing are affected by the resolution.

3.2. Forming ellipses

We want to model the possible position of a foot in space with a cone. The intersection of a cone and a plane (the image plane in our case) is an ellipse lying on the plane. For this reason, we consider a cluster to be a sampling of an elliptical area. For an ellipse with major and minor radii a and b parallel to the axes, we know:

\int x^2 = \frac{a^3 b \pi}{4}    (4)

\int y^2 = \frac{a b^3 \pi}{4}    (5)

\int xy = 0    (6)

Feet can appear in different orientations, therefore the ellipse can also be rotated. Rotating around the origin with \alpha we get:

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} c & -s \\ s & c \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}    (7)

\int x'^2 = c^2 \int x^2 + s^2 \int y^2    (8)

\int y'^2 = s^2 \int x^2 + c^2 \int y^2    (9)

\int x'y' = sc \left( \int x^2 - \int y^2 \right)    (10)

\tan 2\alpha = \frac{2 \int x'y'}{\int x'^2 - \int y'^2}    (11)

\int x'^2 + \int y'^2 = \int x^2 + \int y^2 = \frac{a b \pi}{4} \left( a^2 + b^2 \right)    (12)

\frac{\int x'y'}{sc} = \int x^2 - \int y^2 = \frac{a b \pi}{4} \left( a^2 - b^2 \right)    (13)
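The moment relations above let us recover the radii a, b and the rotation α of a cluster from its second moments. The sketch below uses the equivalent covariance eigen-decomposition (our formulation, not code from the paper):

```python
import numpy as np

def ellipse_from_cluster(points):
    """Fit the moment-equivalent ellipse (radii a >= b, rotation alpha)
    to a cluster of candidate pixel coordinates, shape (N, 2)."""
    pts = np.asarray(points, dtype=float)
    mu = pts.mean(axis=0)
    # Second central moments; for a uniformly sampled ellipse these relate
    # to the radii as var_x = a^2/4, var_y = b^2/4 in the ellipse frame
    # (Eqs. (4)-(5) divided by the area).
    cov = np.cov(pts.T, bias=True)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    b, a = 2.0 * np.sqrt(eigvals)            # minor, major radii
    major = eigvecs[:, 1]                    # direction of the major axis
    alpha = np.arctan2(major[1], major[0])   # rotation, cf. Eq. (11)
    return mu, a, b, alpha
```

On a cluster densely sampling an axis-aligned ellipse with a = 10, b = 4, the recovered radii are close to 10 and 4 and the recovered rotation is close to 0 (mod π).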

\int 1 = a b \pi    (14)

where s = \sin\alpha, c = \cos\alpha. Now we can compute major and minor radii and rotation. We reject too small and upright ellipses. Minimal size can be adjusted for different setups and image resolution. In our experiments ellipses covering less than 7 candidate pixels were rejected; this ensured feet far from the camera are still detected (choice of parameter is based on evaluation, see Fig. 5(d)). We defined the maximal angle as 45°. Foot direction does not depend on setup or resolution, so this threshold should work in any setup. Sample results can be seen in Fig. 1(c).

4. Detecting correspondences

We model candidate pixel clusters with ellipses in every view. We then back-project ellipses in all images to cones in scene space, the 3D reference space. Cones corresponding to a foot will intersect close to its location.

However, cones corresponding to different feet can intersect, and candidate pixels can appear on arms or on foreground artifacts. Both can lead to false detections; Fig. 2(b) shows examples.

The intersection of cones is a complex body, except for some extraordinary situations. Determining this body would be difficult, but it is not even necessary. We use simple heuristics to detect cone intersections, which we call cone matching.

4.1. Forming cones

We describe a cone with a vertex C and three orthogonal vectors u, v and w, where w is the unit length direction vector of the axis and u, v are direction vectors of the major and minor axes (Fig. 3(a)). Bevel angles are determined by the length of the u, v vectors. A cone consists of points p where

\left( (p - C) \cdot u \right)^2 + \left( (p - C) \cdot v \right)^2 \le \left( (p - C) \cdot w \right)^2, \quad (p - C) \cdot w \ge 0    (15)

From the extrinsic parameters of the cameras we know the 3D position of points of the image plane as well as the optical center O. Thus computation of w is straightforward. We determine major and minor axes directions by vector products to ensure the orthogonality of vectors:

v = u_i \times w    (16)

u = w \times v    (17)

Finally, u and v are scaled according to major and minor bevel angles so that (15) stands.

4.2. Cone matching

The intersection of cones is a complex body, especially when bevel angles differ for major and minor axes. Our experiments showed that we do not need the exact intersection; an approximate solution is sufficient for searching cone matches. We simplify computations in two ways:

1. Generally a foot covers a small area of the image, so cones have small bevel angles. Consequently, in a relatively small space segment they can be replaced by cylinders. We assume that an intersection will take place near the point where the two cone axes are closest, so we compute the major and minor radii at this offset (\overrightarrow{CP}_{close} in Fig. 3(b)).
2. Computing the exact intersection of cylinders is still a hard problem, and the intersection body is still complex. We only seek an optimal point p in space, subject to minimal distance from the cylinder axes, considering different major and minor radii.

After computing the |\overrightarrow{CP}_{close}| distance, we scale the u and v vectors to match major and minor cylinder radii:

u' = u \, |\overrightarrow{CP}_{close}|
v' = v \, |\overrightarrow{CP}_{close}|    (18)

Fig. 2. Sources of errors.
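The cone frame construction of Section 4.1 can be sketched as follows (a simplified version; scaling of u and v by the bevel angles is omitted, and all inputs are assumed to be given already in scene coordinates):

```python
import numpy as np

def cone_axes(O, P, major_dir):
    """Build the orthogonal cone frame (u, v, w), cf. Eqs. (16)-(17).

    O: optical center; P: 3D position of the ellipse center on the
    image plane; major_dir: 3D direction of the ellipse's major axis
    in the image plane (u_i in Eq. (16)).
    """
    w = P - O
    w = w / np.linalg.norm(w)       # unit axis direction
    v = np.cross(major_dir, w)      # Eq. (16)
    v = v / np.linalg.norm(v)
    u = np.cross(w, v)              # Eq. (17), orthogonal by construction
    return u, v, w
```

Building v and u via cross products guarantees an orthogonal frame even if the supplied major-axis direction is not exactly perpendicular to the cone axis.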



Fig. 3. Steps of extracting feet.
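Finding the approximate intersection point of two such cylinders reduces to a small linear least-squares problem. A sketch of this computation (ours), assuming the scaled direction vectors u', v' of Eq. (18) are already available:

```python
import numpy as np

def match_point(C1, u1, v1, C2, u2, v2):
    """Least-squares point p closest to two cylinder axes.

    C1, C2 are cylinder (cone) vertices; u1, v1 and u2, v2 are the
    scaled major/minor direction vectors u', v', so that
    ((p - C).u')**2 + ((p - C).v')**2 <= 1 tests cylinder membership.
    Returns (p, inside): inside is True when p lies in both cylinders,
    i.e. the cones are considered to intersect (a "match").
    """
    # Stack the four linear constraints "p lies on both axes": A p = b.
    A = np.vstack([u1, v1, u2, v2])
    b = np.array([u1 @ C1, v1 @ C1, u2 @ C2, v2 @ C2])
    p, *_ = np.linalg.lstsq(A, b, rcond=None)  # pseudo-inverse solution
    d1 = ((p - C1) @ u1) ** 2 + ((p - C1) @ v1) ** 2
    d2 = ((p - C2) @ u2) ** 2 + ((p - C2) @ v2) ** 2
    return p, bool(d1 <= 1.0 and d2 <= 1.0)
```

Because the rows of A are scaled by the cylinder radii, the least-squares residual automatically weights the major and minor directions differently, as the text describes; the residual itself can serve as the match weight.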

Now, the cylinder equation becomes

\left( (p - C) \cdot u' \right)^2 + \left( (p - C) \cdot v' \right)^2 \le 1    (19)

The "distance from axis" left side of (19) is a linear function of p, enabling us to write a linear equation system expressing that p is on both axes (the index refers to the cylinders):

\left( (p - C_1) \cdot u'_1 \right)^2 + \left( (p - C_1) \cdot v'_1 \right)^2 = 0
\left( (p - C_2) \cdot u'_2 \right)^2 + \left( (p - C_2) \cdot v'_2 \right)^2 = 0    (20)

which is equivalent to

\begin{pmatrix} u'^T_1 \\ v'^T_1 \\ u'^T_2 \\ v'^T_2 \end{pmatrix} p = \begin{pmatrix} u'_1 \cdot C_1 \\ v'_1 \cdot C_1 \\ u'_2 \cdot C_2 \\ v'_2 \cdot C_2 \end{pmatrix}    (21)

A p = b    (22)

Of course, axes will practically never intersect; we can only compute an approximate solution. Solving subject to least square error is straightforward and practical in our case. It is straightforward, because it is a distance from the axes considering different major and minor radii, and it is practical, because this optimization is easy to carry out, for example using the pseudo inverse.

If p is outside both cylinders, we conclude that the cones are not intersecting; otherwise they intersect and p is the position of the intersection, which we call a match. Using the distances of p from the cylinder axes (the error of the approximate solution) we also assign a weight to matches.

4.3. Indexing cones

In stereo vision, epipolar constraints are often used to reduce the search space for finding correspondences (Rodrigues and Fernandes, 2004; Mittal and Davis, 2001). The corresponding lines in two images refer to a plane on both optical centers (O_1, O_2). The lines are characterized by only a rotation around this O_1O_2 axis, so a point in space could be indexed by this rotation angle.

Cones have extent, so their index could be the angle interval they touch. However, due to the small number of primitives and the very low computational time, applying indexing would not affect overall performance, and therefore we do not utilize epipolar constraints.

5. Localizing feet

5.1. Detection

One match itself can come from a false correspondence. However, for two matches, the foot has to be visible in at least three views, which is not guaranteed in a dense crowd. Matches from random correspondences tend to have low weights, so detecting a foot from a single match is possible by thresholding the match's weight.

We merge matches close in space to form one detection. We also assign weights to such match sets, computed from the weights of the matches, so that the possibility of detection highly increases with the number of matches.

We merge matches over a grid on a horizontal plane. As feet do not appear above each other in a cell of the grid, we do not mix matches from different altitudes. We collect matches in overlapping 2 × 2 cell blocks, then compute the weights of match sets in every block. Using overlapping blocks is necessary to obtain a smooth weight map over the grid.

Every local maximum in this 2D weight map with a weight above a given threshold is considered a detected foot. This threshold is a parameter of our algorithm. A higher threshold leads to increased precision along with decreased recall. The 3D location of the foot is computed as the barycenter of matches in the block; the resulting z coordinate is the altitude (it is parallel to the upright direction).

Fig. 4. Sample results of localization and determining height map.

5.2. Height map

As we mentioned before, and showed in Fig. 2(b), false correspondences can occur due to incorrect foreground detection or random intersections of non-related cones. Especially in a dense crowd, many false matches appear. However, the altitude of these detections is quite random, and they usually occur far from the ground surface, either above or below.

Consequently, these false positives appear as noise. To suppress this noise, we collect all feet locations from sample videos and filter dense regions in scene space. This is easily done using statistical filtering by assuming normal distribution. The resulting dense points will be height points of the ground surface.

The phrase height map can be somewhat misleading, because at the border of surfaces at different altitudes it is possible to have height points above each other; however, in our tests this never occurred. Thus we kept the height map expression.

After extracting the height map, we ignore detections far from any height point, leading to a much more reliable method. We speed up height point searching by accumulating detections in a 3D grid, which makes periodic reruns possible. This enables us to refine the height map in time, as more and more detections occur. A sample height map of our non planar test case is shown in Fig. 4(a). There are certain areas where few detections took place, which led to an incomplete height map.

6. Experiments

We tested our algorithm on a commonly used test sequence for multiview detection, the EPFL terrace sequence (EPFL, 2011) (sample results can be seen in Fig. 4(b)), and our own test videos (SZTAKI sequence).

Fig. 5. Evaluation of height point calculation and detection.

We made test videos of a scene where three planar surfaces were present at different altitudes. This was required to demonstrate the capabilities of our method, as available multiview test sets present planar ground. We recorded videos from 4 views with different kinds of consumer digital cameras with video capture function. Our algorithm performed well on this dataset despite the different camera parameters, distortion and image quality.

Our approach for detecting feet has advantages and disadvantages as well. Feet are always near the ground, which makes height map detection possible and helps rejecting false positives. Also, for certain camera setups, feet are less likely to be occluded.

On the other hand, in case of detecting foot positions, precise time synchronization is necessary, as a small time skew can corrupt detection. In Fig. 2(c), we can see the moving leg is bent in one view while straight in the other. This is hardly noticeable in the small image; however, the distance in space is greater than the size of the foot itself.

6.1. Height map reconstruction

Our algorithm reconstructed the height maps for both EPFL and SZTAKI sequences with high precision in both cases.

6.1.1. EPFL sequence

In case of the EPFL sequence, we measured that the reconstructed ground surface is a flat plane with a small error: \mu = 1.6 cm (mean), \sigma = 1.7 cm (std. deviation). Fig. 5(a) shows the histogram of altitude values.

Assuming noise-like error, altitude values on a flat ground should follow a normal distribution; however, we observed a different distribution. A possible source of systematic error is an incorrect plane normal, so we computed the optimal normal vector by fitting a plane on the height points. The optimal normal vector was very close to the upright direction specified by camera calibration. It diverged by only 0.36°, resulting in a slightly smaller error:

Table 1
Numerical evaluation of two main aspects: accuracy and running time.

(a) Statistical information on surfaces found in scene.

Surface   | Height | μ       | σ      | # of points | min     | max
Floor     | 0 cm   | 1.1 cm  | 1.7 cm | 131         | −0.3 cm | 11.2 cm
Box       | 50 cm  | 50.2 cm | 0.6 cm | 6           | 49.1 cm | 50.8 cm
Table top | 73 cm  | 74.4 cm | 1.2 cm | 23          | 70.3 cm | 75.9 cm

(b) Average processing time of each step for all 4 views (single threaded implementation, 2.4 GHz Core 2 Quad CPU). Foreground detection, filtering and cone forming can be done in parallel (per camera); cone matching and detection requires all cone information to be present.

Test set | Foreground detection | Candidate filtering | Cone forming | Σ (per camera) | Cone matching | Detection | Σ (per system)
EPFL     | 51.2 ms              | 3.37 ms             | 63 μs        | 13.7 ms        | 879 μs        | 28 μs     | 907 μs
SZTAKI   | 32.5 ms              | 2.88 ms             | 1.99 ms      | 9.35 ms        | 600 μs        | 18 μs     | 618 μs

Fig. 6. Comparing our results to a SoA method.

\mu = 1.6 cm, \sigma = 1.5 cm. However, using this normal led to a significantly better distribution (Fig. 5(a), right histogram).

6.1.2. SZTAKI sequence

In case of our sequence, all three surfaces were found, and their altitude was determined with high accuracy. Measurements are summarized in Table 1(a). Fig. 6(b) shows the histogram of altitude for points on the ground surface.

There are 3 outlier points (above 6 cm height) for the floor. These appear on the edge of the area of interest, as can be seen in Fig. 4(a). This is due to sagging people which lead to false candidate pixels on image borders. Without these outlier points, mean and \sigma for the floor become 0.9 cm and 1.1 cm respectively.

6.2. Localizing people

We tested our method using manually created ground truth information of feet positions. We implemented an application to create ground truth, in which we have to mark a foot in at least 2 views and its position is determined by solving a linear equation similar to (21). If we mark a foot in more views, the number of rows of A and the length of b increases.

We collected feet positions frame by frame, person by person. 1 or 2 legs can be specified for a person, because sometimes only one foot is visible from more views.

For the evaluation, we defined a region of interest (ROI), a rectangle on the ground plane, and matched locations to ground truth inside this area. In case of the EPFL dataset, this rectangle is specified along with the dataset; for our test case this area was chosen so that every part is visible from at least 3 views.

We accepted a location as true positive (or detected foot) if it was not further than 25 cm from a ground truth foot position. This is approximately the length of a foot. Afterwards we eliminated people and positions outside the ROI, and computed false negatives and false positives from the remaining people and locations accordingly.

Other multiview pedestrian localization methods localize whole bodies, unlike our method, which localizes feet. Consequently, a different evaluation criterion had to be used: we considered a person detected if either of his legs was detected. With this definition, our results become comparable to other results.

The performance of our algorithm depends on some parameters. As mentioned in Section 5.1, a detection threshold is applied to the weight map. The trade-off between precision and recall is balanced with this threshold. To evaluate our method, we ran tests with different threshold values, resulting in precision-recall curves. Results are shown in Fig. 5(c).

Two other parameters were introduced in Section 3: maximal distance for candidate pixel clustering and minimal cluster size. We measured the achievable maximum of min(precision, recall) (in function of the detection threshold) and averaged over the datasets; Fig. 5(d) shows results. We found best overall performance at 3 px clustering distance and minimal cluster size of 7. Increased minimal cluster size might increase precision, but leads to filtering out small feet, which decreases achievable recall.

6.2.1. Comparison to SoA methods

We compared our method to two SoA methods referred to as POM (Fleuret et al., 2008) and 3DMPP (Utasi and Benedek, 2011). Comparison is not straightforward, as these methods localize bodies projected to a ground plane; in contrast, we localize feet in space. We consider someone detected if any foot is detected.

Fig. 6(a) shows results found in Utasi and Benedek (2011) for POM and 3DMPP evaluated on EPFL and PETS (PETS, 2009) datasets compared to our measurements. We evaluated our method on the same dataset, but we found that PETS did not meet the requirements of our algorithm (in means of precise calibration and synchronization), so we could only use the EPFL dataset.
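The evaluation protocol (25 cm acceptance radius, one-to-one matching of detections to ground truth feet) can be sketched as follows; the greedy nearest-first assignment is our illustrative choice, not necessarily the exact matching used in the paper:

```python
import numpy as np

def precision_recall(detections, ground_truth, radius=0.25):
    """Match detected feet to ground-truth feet within a radius (meters)
    and return (precision, recall) over one frame."""
    det = [np.asarray(d, dtype=float) for d in detections]
    gt = [np.asarray(g, dtype=float) for g in ground_truth]
    matched_gt = set()
    tp = 0
    for d in det:
        # nearest still-unmatched ground-truth foot within the radius
        best, best_dist = None, radius
        for i, g in enumerate(gt):
            if i in matched_gt:
                continue
            dist = np.linalg.norm(d - g)
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            matched_gt.add(best)
            tp += 1
    precision = tp / len(det) if det else 1.0
    recall = tp / len(gt) if gt else 1.0
    return precision, recall
```

Sweeping the detection threshold of Section 5.1 and recording (precision, recall) pairs from such a routine yields the precision-recall curves of Fig. 5(c).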

We used a similar amount of test data: 179 test frames (every 25th frame from the test sequence) with 661 objects appearing in the ground truth data with at least one leg inside the ROI, which was the same as in Utasi and Benedek (2011). Results show that our method is comparable to these SoA methods in case of flat ground surface.

We also compared results on our SZTAKI dataset to show the effects of a non-planar ground surface. For this we chose POM, as it was easy to generate input for its available implementation. For our dataset, foreground detection was challenging due to reflective ground and poor video quality. Consequently both methods performed worse in this case; however, our algorithm performed noticeably better (see Fig. 6(b)).

6.3. Running time

We measured the average running time of the processing steps, with results shown in Table 1(b). Every step highly depends on the input data, thus the processing time is not linear in the number of pixels in the images. In these cases, the video resolution was 360 × 288 for EPFL and 320 × 240 for SZTAKI sequences.

Results show that our method is capable of real-time processing for both EPFL and SZTAKI sequences even with a single threaded implementation (including foreground detection). However, parallel processing is also possible with some restrictions.

Several steps of the processing run on single images, therefore they can be parallelized. Clearly, foreground detection, filtering, ellipse and cone forming are done image-wise; no information exchange is required between different views at all. Consequently, processing individual images can run on separate threads or even be distributed to different computers (possibly smart cameras) due to lack of data dependency.

Cone matching and localization requires all cone information to be present. Still, these steps could be parallelized, but only in a non-trivial manner. Therefore, due to their short processing time, these steps are preferred to run sequentially.

In case of the above mentioned SoA methods, the mean processing time of foreground masks is in the order of 0.5 s for POM, and 1 s for 3DMPP, using unoptimized code on a similar processor (data provided by the author of Utasi and Benedek (2011)). However, in our case, processing the foreground took less than 5 ms.

Moreover, many methods implicate data dependencies which make distributed computing impossible, like projecting masks in 3DMPP, or iterative probability computation in POM.

7. Conclusion

We proposed a multiview localization algorithm that extracts the 3D position of people using multiple calibrated and synchronized views. In our case, unlike other algorithms, a non-planar ground surface can be present. This is done by modeling possible feet locations with 3D primitives, cones in scene space, and searching for intersections of these cones.

For good precision, the height map of the ground surface should be known. Our method can compute the height map on the fly, reaching high precision after a startup time.

After height map reconstruction we measured precision and recall values comparable to SoA methods on a commonly used dataset. Our algorithm also worked well on our test videos made to demonstrate the capabilities of handling non-planar ground surface.

In the future we plan to examine tracking people by their leaning leg positions (Havasi et al., 2007).

Acknowledgements

The authors thank François Fleuret for making the implementation of POM available, and Ákos Utasi for sharing his experiments of evaluating POM and 3DMPP algorithms. This work has been supported by the Hungarian Scientific Research Fund grant OTKA #106374.

References

Benedek, C., Szirányi, T., 2007. Study on color space selection for detecting cast shadows in video surveillance. International Journal of Imaging Systems and Technology 17 (3), 190–201.
Benedek, C., Szirányi, T., 2008. Bayesian foreground and shadow detection in uncertain frame rate surveillance videos. IEEE Transactions on Image Processing 17 (4), 608–621.
Berclaz, J., Fleuret, F., Fua, P., 2006. Robust people tracking with global trajectory optimization. In: IEEE CVPR, pp. 744–750.
EPFL, 2011. Multi-camera pedestrian videos. http://www.cvlab.epfl.ch/data/pom/
Eshel, R., Moses, Y., 2010. Tracking in a dense crowd using multiple cameras. International Journal of Computer Vision 88, 129–143.
Fleuret, F., Berclaz, J., Lengagne, R., Fua, P., 2008. Multicamera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), 267–282.
Havasi, L., Szlávik, Z., 2011. A method for object localization in a multiview multimodal camera system. In: CVPRW, pp. 96–103.
Havasi, L., Szlávik, Z., Szirányi, T., 2007. Detection of gait characteristics for scene registration in video surveillance system. IEEE Transactions on Image Processing 16 (2), 503–510.
Iwase, S., Saito, H., 2004. Parallel tracking of all soccer players by integrating detected positions in multiple view images. In: IEEE ICPR, pp. 751–754.
Jeong, K., Jaynes, C., 2008. Object matching in disjoint cameras using a color transfer approach. Machine Vision and Applications 19, 443–455.
Jung, S.-U., Nixon, M.S., 2013. Heel strike detection based on human walking movement for surveillance analysis. Pattern Recognition Letters 34 (8), 895–902.
Khan, S., Shah, M., 2006. A multiview approach to tracking people in crowded scenes using a planar homography constraint. In: ECCV 2006, Lecture Notes in Computer Science, pp. 133–146.
Khan, S., Shah, M., 2009. Tracking multiple occluding people by localizing on multiple scene planes. PAMI 31 (3), 505–519.
Kim, K., Davis, L.S., 2006. Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In: ECCV, pp. 98–109.
Kiss, Á., Szirányi, T., 2013. Multi-view people detection on arbitrary ground in real-time. In: VISAPP, pp. 675–680.
Mittal, A., Davis, L., 2001. Unified multi-camera detection and tracking using region-matching. In: IEEE Multi-Object Tracking, pp. 3–10.
Mittal, A., Davis, L., 2002. M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. In: ECCV 2002, pp. 18–33.
PETS, 2009. Performance evaluation of tracking and surveillance.
Prati, A., Mikic, I., Trivedi, M.M., Cucchiara, R., 2003. Detecting moving shadows: algorithms and evaluation. IEEE Transactions on PAMI 25, 918–923.
Rodrigues, R., Fernandes, A.R., 2004. Accelerated epipolar geometry computation for 3D reconstruction using projective texturing. In: SCCG '04, pp. 200–206.
Utasi, Á., Benedek, C., 2011. A 3-D marked point process model for multi-view people detection. In: IEEE CVPR, pp. 3385–3392.
