

A Thesis for the degree of Master





A Study on Moving Object Detection
and Tracking with Partial Decoding
in H.264|AVC Bitstream Domain




















Wonsang You
School of Engineering
Information and Communications University
2008
Abstract
Object detection and tracking has long been an important topic in the fields
of computer vision and video processing, since it enables efficient analysis of
video content. It can be utilized not only for surveillance systems but also for
interactive broadcasting services.
However, most current object detection and tracking techniques, which
operate only on raw pixel data, are impractical due to their tremendously high
computational complexity. Furthermore, most videos are communicated in the
form of encoded bitstreams in order to improve transmission efficiency. In
that case, the pixel domain approach requires additional computation time to
fully decode the encoded bitstream.
Meanwhile, H.264|AVC has become a popular compression tool for video
due to its high coding efficiency and the availability of real-time encoding
devices. Fortunately, the H.264|AVC bitstream contains encoded information
such as motion vectors, residual data, and macroblock types, which can be
directly utilized as effective clues for object detection and tracking. Traditional
compressed domain algorithms which make use of such encoded information
have shown fast computation times with low computational complexity.
However, these algorithms work only under limited circumstances. In addition,
they are difficult to combine with subsequent color extraction of objects or with
object recognition, which distinguishes one object from another.
In this thesis, two methods for moving object detection and tracking with
partial decoding in the H.264|AVC bitstream domain are introduced. One is a
semi-automatic method in which users initially select a target object in
stationary or non-stationary scenes; the other is an automatic method in which
all moving objects are automatically detected and tracked, particularly in
stationary scenes. The former is beneficial for metadata authoring tools which
generate additional content, such as the position information of an object, for
interactive broadcasting services. The latter is effective for surveillance
systems with fixed cameras. Unlike conventional compressed domain
algorithms, the proposed methods utilize partially decoded pixel data for object
detection and tracking. Therefore, these methods show reliable performance in
various scene situations as well as processing fast enough to run in real time.
Moreover, these methods can support the color extraction of objects and
object recognition.


Contents
A Thesis for the degree of Master ....................................................................... i
Abstract ................................................................................................................. i
Contents ............................................................................................................... iii
List of Tables ........................................................................................................ v
List of Figures ..................................................................................................... vi
List of Abbreviations .......................................................................................... ix
I Introduction .................................................................................................. 1
II Related Works ............................................................................................... 7
2.1 Overview of MPEG-4 Advanced Video Coding ...................................... 8
2.2 Pixel Domain Approach ......................................................................... 10
2.2.1 Region-based Methods ...........................................................................10
2.2.2 Contour-based Methods .......................................................................... 11
2.2.3 Feature-based Methods ...........................................................................12
2.2.4 Template-based Methods ........................................................................13
2.3 Compressed Domain Approach ............................................................. 14
2.3.1 Clustering-based Methods ......................................................................16
2.3.2 Filtering-based Methods .........................................................................21
2.3.4 Issues in Compressed Domain Approach ...............................................29
III Proposed Schemes for Moving Object Detection and Tracking with
Partial Decoding in H.264|AVC Bitstream Domain ....................................... 32
3.1 Semi-automatic Approach ...................................................................... 33
3.1.1 Forward Mapping of Backward Motion Vectors ....................................34
3.1.2 Texture Dissimilarity Energy ..................................................................36
3.1.3 Form Dissimilarity Energy .....................................................................39
3.1.4 Motion Dissimilarity Energy ..................................................................40
3.1.5 Energy Minimization ..............................................................................42
3.1.6 Adaptive Weight Factors.........................................................................43
3.2 Automatic Approach .............................................................................. 45
3.2.1 Block Group Extraction ..........................................................................46
3.2.2 Spatial Filtering ......................................................................................48
3.2.3 Temporal Filtering ..................................................................................49
3.2.4 Region Prediction of Moving Objects in I-frames ..................................55
3.2.5 Partial Decoding and Background Subtraction in I-frames ....................58
3.2.6 Motion Interpolation in P-frames ...........................................................59
IV Experiments ................................................................................................ 61
4.1 Semi-automatic Approach ...................................................................... 61
4.2 Automatic Approach .............................................................................. 71
V Conclusions and Future Works ................................................................. 83
References .......................................................................................................... 88



List of Tables
Table 1. The processing time of compressed domain algorithms. ...................... 29
Table 2. The processing time of the proposed automatic method. ...................... 77



List of Figures
Figure 1. The region-matching method for constructing the forward motion field
............................................................................................................ 34
Figure 2. The search region is centered at the predicted point located by a
forward motion vector. A candidate point inside the search region has
its neighborhood of square form to compute E_C. ................................. 37
Figure 3. The structure of partial decoding in the first P-frame of a GOP which
contains one I-frame and three P-frames. Two decoded block sets
D_k,n(k+1) and D_k,n(k+2) in the first P-frame are projected from two
predicted search regions P_k,n(k+1) and P_k,n(k+2). .............................. 39
Figure 4. The network of feature points in the previous frame and the network of
candidate points in the current frame. ................................................ 40
Figure 5. The reliability of forward motion vectors. The great gap between a
forward motion vector and a backward motion vector results in low
reliability. ............................................................................................ 41
Figure 6. The neural network for updating weight factors. ................................. 44
Figure 7. A procedure of object region extraction and refinement ...................... 46
Figure 8. Block groups before and after spatial filtering .................................... 47
Figure 9. Temporal filtering based on the occurrence probability of active group
trains ................................................................................................... 50
Figure 10. Train tangling. (a) Train merging. (b) Train separation. .................... 54
Figure 11. Optimizing the feature vector of an object through background
subtraction in an I frame. (a) The background image. (b) The I frame
in the original sequence. (c) A partially decoded image from
H.264|AVC bitstream. (d) A background-subtracted image. .............. 57


Figure 12. Motion interpolation. The dotted rectangle boxes are estimated simply
by enclosing active groups corresponding to the real object. These
boxes are replaced by the rectangle boxes through motion
interpolation. ....................................................................................... 60
Figure 13. The object tracking in Coastguard with 100 frames....................... 62
Figure 14. The object tracking in Stefan with 100 frames. .............................. 63
Figure 15. The object tracking in Lovers with 300 frames. Partially decoded
regions are shown in Lovers. .......................................................... 64
Figure 16. (a) The processing time which includes partial decoding in
Coastguard, and (b) the processing time which does not include
partial decoding. ................................................................................. 67
Figure 17. (a) Dissimilarity energies in Stefan and (b) Coastguard ............. 68
Figure 18. (a) The variation of weight factors in Coastguard, and (b) the
squared error of dissimilarity energy in Stefan. .............................. 69
Figure 19. (a) The average reliabilities of forward motion vectors in
Coastguard and (b) in Stefan. ...................................................... 70
Figure 20. The performance measurement of spatial filtering and temporal
filtering in the indoor sequence. (a) The plot of spatial filtering rates,
and (b) The temporal filtering results in which one active group train
becomes the real object. ...................................................... 72
Figure 21. The performance measurement of spatial filtering and temporal
filtering in the outdoor sequence. (a) The plot of spatial filtering rates,
and (b) The temporal filtering results in which three active group
trains become the real objects. ............................................................ 73
Figure 22. The effect of motion interpolation on correction of object trajectory.
(a)-(b) are object locations and sizes in one GOP before motion
interpolation, and (c)-(d) after motion interpolation. ......................... 75


Figure 23. The performance measurement of spatial filtering and temporal
filtering. (a) The plot of spatial filtering rates in the indoor sequence,
(b) The temporal filtering results in the indoor sequence, in which one
active group train becomes the real object. ......................................... 79
Figure 24. The performance measurement of spatial filtering and temporal
filtering. (a) The plot of spatial filtering rates in the outdoor sequence,
(b) The temporal filtering results in the outdoor sequence, in which three
active group trains become the real objects. ....................................... 81
Figure 25. The measurement of computational complexity. The processing time
taken for (a) the indoor sequence, and (b) the outdoor sequence. ...... 82



List of Abbreviations
AC Alternating Current
ADE Accumulated Dissimilarity Energy
AVC MPEG-4 Part 10: Advanced Video Coding
DC Direct Current
DCT Discrete Cosine Transform
DSP Digital Signal Processing
HS Hue Saturation
ISDB-T Integrated Services Digital Broadcasting - Terrestrial
ISO International Organization for Standardization
IT Integer Transform
MAF Multimedia Application Format
MPEG Moving Picture Experts Group
MRF Markov Random Field
MV Motion Vector
RD Rate Distortion
ROI Region of Interest
S-DMB Satellite Digital Multimedia Broadcasting
STMF Spatial and Temporal Macroblock Filter
T-DMB Terrestrial Digital Multimedia Broadcasting




I Introduction
Recent technologies for video processing and computer vision have
evolved to become more intelligent, since they are required to be aware of
video content such as background, object motion and behavior, and scene
situation. In this sense, the extraction and analysis of moving objects in a video
scene has become an indispensable function for intelligent systems. The visual
information of objects includes what the objects are, how they behave in a piece
of footage, how long they are present, and in which direction they are heading.
The object information illustrated above can be utilized for surveillance
or interactive broadcasting services. For example, it is useful for monitoring a
suspicious person in public places such as airports, railway stations, banks,
supermarkets, and parking lots. Likewise, recent broadcasting services are
required to provide additional information, such as objects and scene situations,
along with audio-visual content in order to support user interactivity.
To extract object information from a video, an object detection and
tracking tool can be utilized in real-time surveillance systems and in metadata
authoring tools for interactive broadcasting services. In general, such a tool is
required to run at high processing speed or in real time. In the case of tracking a
criminal in a public place, as soon as any object appears on the monitoring
screen, the surveillance system has to immediately detect and track it in real
time. Also, a metadata authoring tool which extracts object information from an
input video should have a fast computation time in order to be practically usable
in the industrial field.
Before discussing the computational complexity of current object
detection and tracking techniques, we need to introduce the two classes into
which they fall: the pixel domain approach and the compressed domain
approach. Traditionally, most object detection and tracking techniques belong
to the pixel domain approach, which uses only raw pixel data as its resource.
However, pixel domain algorithms are difficult to run at high speed or in real
time because they require a tremendous amount of computation. Moreover,
most video content used in industry and in public is encoded with a
compression tool such as MPEG in order to improve communication efficiency
by reducing its size. In that case, the pixel domain approach requires additional
computation time to fully decode the encoded bitstream before the main
detection and tracking algorithm can even start. Although this difficulty has
recently been alleviated, since special-purpose DSP circuits for surveillance
have been developed and the performance of personal computers has steadily
improved, the problem of limited hardware resources remains unresolved,
especially in applications such as large-scale distributed surveillance systems
which have to process several surveillance video streams at the same time.
Likewise, pixel domain algorithms place a great computational burden on a
metadata authoring tool, which makes it impractical to run on a general-purpose
PC. For this reason, maximizing the performance of object detection and
tracking under restricted hardware resources has become an important issue in
designing real-time surveillance systems and metadata authoring tools for
interactive broadcasting services.
As an effective alternative, compressed domain algorithms have been
proposed by many researchers. Unlike pixel domain algorithms, compressed
domain algorithms utilize encoded information such as motion vectors, DCT
coefficients, and macroblock types, which are included in the encoded
bitstream. Note that this encoded information helps reduce computational
complexity because it can be directly exploited as effective clues for object
detection and tracking. Accordingly, the compressed domain approach is more
effective than the pixel domain approach for implementing real-time or fast
object detection and tracking systems under restricted hardware resources.
Nevertheless, conventional compressed domain algorithms have serious
drawbacks even though their computational complexity is lower than that of
pixel domain algorithms. First of all, these algorithms tend to show poor
detection and tracking performance due to the unreliability or insufficiency of
the data extracted from the encoded bitstreams. In particular, since some
assumptions adopted in these algorithms do not hold for various situations in a
video scene, the algorithms cannot guarantee reliable performance in such
situations and can fail to detect and track objects.

The second problem with the compressed domain approach is that these
algorithms cannot support the color extraction of objects, which is necessary
for object recognition or for metadata construction for interactive broadcasting
services. Since they exploit only the encoded information instead of raw pixel
data, the pixel data of the object region are never extracted. Although some
algorithms perform partial decoding, the decoded regions are restricted to the
boundaries of objects in order to refine the edges of the object regions. That is
why representative color information of objects cannot be obtained from
compressed domain algorithms.
Lastly, most compressed domain algorithms deal only with video content
encoded with MPEG-1, MPEG-2, or MPEG-4 Visual (ISO/IEC 14496 Part 2).
However, H.264|AVC has recently become a popular compression tool for
video content due to its high coding efficiency and the availability of real-time
encoding devices. As an example, five representative standards for mobile
broadcasting services, namely T-DMB, S-DMB, ISDB-T, DVB-H, and
MediaFLO, have adopted or considered adopting H.264|AVC as their video
compression technology. Also, MPEG recently started to standardize the
surveillance MAF, which adopts H.264|AVC as its video resource. Given this
trend, the need for object detection and tracking algorithms applicable to
H.264|AVC-encoded videos has grown. Nevertheless, most traditional
compressed domain algorithms are not directly applicable to H.264|AVC
videos, since the intra and inter prediction schemes in H.264|AVC differ
slightly from those of MPEG-1, MPEG-2, and MPEG-4 Visual. Although some
researchers have proposed algorithms specific to H.264|AVC videos, these
algorithms do not ensure reliable performance in various scenes for the reasons
noted above.
In this thesis, the proposed methods for object detection and tracking in
the H.264|AVC bitstream domain exploit partially decoded pixel data as well as
the encoded information in order to overcome the limitations of conventional
compressed domain algorithms. The main contribution of this thesis is that,
unlike conventional compressed domain algorithms, the proposed methods
show reliable performance in various scene situations as well as processing fast
enough to run in real time.
The proposed methods are divided into a semi-automatic method and an
automatic method. In the semi-automatic method, users can manually choose
what they want to track in any kind of scene. This method is valuable for
metadata authoring tools which quickly generate the position and motion
information of a predefined target object in the form of metadata for interactive
broadcasting services. On the other hand, the automatic method can
automatically detect and track all moving objects in an environment where the
camera is fixed. It is beneficial for real-time surveillance systems in which the
monitored video content is sent in the form of H.264|AVC bitstreams from the
camera to the main processing unit.
This thesis is organized as follows. In Section II, MPEG-4 Advanced
Video Coding is briefly explained; in addition, related research is described and
compared with the proposed methods. Section III introduces two proposed
schemes for moving object detection and tracking with partial decoding in the
H.264|AVC bitstream domain: the semi-automatic method and the automatic
method. In particular, a dissimilarity minimization algorithm is applied in the
semi-automatic method, while a spatial and temporal macroblock filter (STMF)
is adopted in the automatic method. In Section IV, experimental results for the
proposed methods are provided and analyzed in terms of performance and
computation time. Finally, conclusions and future work are addressed in
Section V.

II Related Works
Object detection and tracking techniques can be categorized into the
pixel domain approach and the compressed domain approach. The pixel
domain approach utilizes original pixel data which are fully decoded from
compressed bitstreams such as MPEG videos. On the other hand, the
compressed domain approach exploits encoded information such as motion
vectors, DCT coefficients, and macroblock types, which are extracted from a
compressed bitstream. Traditionally, the main research on object detection and
tracking has concentrated on the pixel domain approach, since it can provide
powerful object tracking capabilities by using computer vision technologies.
However, the pixel domain approach takes a long time to perform object
detection and tracking, even though it can detect and track objects precisely.
Since the late 1990s, the compressed domain approach has been seriously
considered as a way to reduce the computational complexity of object detection
and tracking. It can greatly reduce the computational complexity and make
real-time or fast processing possible, although its precision is lower than that of
the pixel domain approach. Recently, H.264|AVC-based algorithms, which deal
with videos encoded with H.264|AVC, the most popular compression
technology, have begun to be proposed. The conventional H.264|AVC-based
algorithms utilize motion vectors rather than DCT coefficients. In this chapter,
H.264|AVC technology is summarized with respect to its Baseline profile only.
Then, the pixel domain approach and the compressed domain approach for
object detection and tracking are explained in turn.
2.1 Overview of MPEG-4 Advanced Video Coding
H.264|AVC, that is, H.264|MPEG-4 Advanced Video Coding, is a video
compression standard developed jointly by the ITU-T Video Coding Experts
Group (VCEG) and the Moving Picture Experts Group (MPEG), a working
group of the International Organization for Standardization (ISO). In contrast to
MPEG-4 Visual, which emphasizes high flexibility in coding techniques and
resources, H.264|AVC concentrates on the efficiency and reliability of video
compression and transmission. To support popular video compression
applications, it defines only three profiles: the Baseline profile, the Extended
profile, and the Main profile. The Baseline profile is particularly beneficial for
real-time applications such as video conferencing and wireless mobile systems,
since it contains error resilience functions as well as the basic video
compression technology. The Main profile is defined with applications such as
broadcasting and multimedia storage in mind. Since it deals with large amounts
of content, it emphasizes technical functions which enhance compression
efficiency, even though they require high computational complexity. The
Extended profile is useful for video streaming applications. Video streaming
applications aim at real-time playback of pre-encoded video content which is
delivered in serial order. For this reason, the Extended profile does not consider
real-time encoding techniques but pursues high compression efficiency.
In this thesis, only the Baseline profile is considered among the three
profiles, because it is suitable for applications which need object detection and
tracking, such as surveillance systems and metadata authoring tools for digital
broadcasting services. This profile allows only I-slices and P-slices. I-slices
include only I-type macroblocks, which are encoded using intra prediction,
while P-slices can contain P-type macroblocks predicted using inter prediction
as well as I-type macroblocks. The encoded information includes motion
vectors, macroblock types, and DCT coefficients, which encode the residual
differences between original pixel values and predicted values.
In the Baseline profile, I-type macroblocks can be encoded with either
4x4 or 16x16 intra prediction. For intra prediction, some neighboring blocks of
each block in the I-type macroblock are used as reference data. On the other
hand, P-type macroblocks can be split into macroblock partitions (16x16, 16x8,
8x16, and 8x8) or sub-macroblock partitions (8x4, 4x8, and 4x4) for inter
prediction. Each partition can have a single motion vector. P-type macroblocks
can also be encoded in skip mode, in which the pixels of the motion-
compensated region are used directly as the reconstructed data without residual
data. Such macroblocks occur mainly in background regions whose color does
not change with camera motion. Note that a macroblock in skip mode can still
have its own motion vector.
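To make the macroblock structure above concrete, the following sketch models a P-type macroblock as a set of partitions, each carrying one motion vector, with the skip mode still exposing its predicted motion vector. All class and field names here are hypothetical, invented for illustration; this is not the data model of any real H.264 decoder.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Partition:
    """One inter-prediction partition (e.g. 16x16, 16x8, 8x8, 4x4)."""
    width: int
    height: int
    mv: Tuple[int, int]          # one motion vector per partition

@dataclass
class PMacroblock:
    """A P-type macroblock: either skipped or split into partitions."""
    skipped: bool = False
    skip_mv: Tuple[int, int] = (0, 0)   # skip mode still carries a predicted MV
    partitions: List[Partition] = field(default_factory=list)

    def motion_vectors(self) -> List[Tuple[int, int]]:
        """Collect every motion vector this macroblock contributes."""
        if self.skipped:
            return [self.skip_mv]
        return [p.mv for p in self.partitions]

# a macroblock split into two 16x8 partitions
mb = PMacroblock(partitions=[Partition(16, 8, (3, -1)), Partition(16, 8, (2, 0))])
print(mb.motion_vectors())   # -> [(3, -1), (2, 0)]
```

A compressed domain tracker would gather such per-partition vectors over a frame to form the sparse motion field that later sections build on.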

2.2 Pixel Domain Approach
In the special case of two-dimensional rigid object detection and tracking,
the pixel domain approach can be classified into four categories: 1) region-
based methods, 2) contour-based methods, 3) feature-based methods, and 4)
template-based methods. Region-based methods perform object detection and
tracking using characteristics of the regions of interest (ROIs), such as color
histograms and motion distributions. Contour-based methods find the position
and shape of ROIs by modeling the contours of objects. Feature-based methods
calculate the motion parameters of feature points which are automatically or
semi-automatically defined inside objects; some algorithms in this category use
cross-correlation or Gabor wavelets. Template-based methods define a template
or model of the objects in advance and extract the ROI that best matches that
model. In this chapter, the four kinds of pixel domain methods are explained in
turn.
2.2.1 Region-based Methods
In region-based methods, the region of an object is defined as a set of
pixels with similar properties. Such an object region can be separated from an
image sequence by using motion information and object properties such as
color histograms [27-29]. Color information is especially useful for region-
based methods when the representative colors of an object are clearly
distinguishable from the background or from other objects. Since region-based
methods which rely on color information tend to be sensitive to illumination
change, they use illumination invariance or color correction as a
countermeasure [30].
Modeling the color distribution of objects in advance is effective for
improving the performance of object detection and tracking. Such color
modeling can be categorized into parametric and nonparametric methods. The
former fits Gaussian models to a color space normalized with respect to
illumination [31]. In every frame, the regions that best match the color
reference model are searched for. Since the color distribution of objects can
change with illumination conditions, it is statistically estimated and updated
frame by frame. The latter applies a lookup table or a Bayesian probability map
to the Hue-Saturation (HS) color space [32].
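The nonparametric variant can be sketched as a simple HS-histogram back-projection: a normalized 2-D histogram built from object pixels serves as the lookup table, and each scene pixel is assigned the probability of its HS bin. This is a minimal NumPy toy for illustration, not the exact method of [32]; the function names are made up for this example.

```python
import numpy as np

def hs_histogram(hs_pixels, bins=(16, 16)):
    """Build a normalized 2-D Hue-Saturation histogram from object pixels.
    hs_pixels: (N, 2) array with hue and saturation both scaled to [0, 1)."""
    hist, _, _ = np.histogram2d(hs_pixels[:, 0], hs_pixels[:, 1],
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist / max(hist.sum(), 1)

def back_project(hs_image, hist, bins=(16, 16)):
    """Look up each pixel's HS bin in the model histogram, producing a
    per-pixel likelihood map of belonging to the object."""
    h_idx = np.minimum((hs_image[..., 0] * bins[0]).astype(int), bins[0] - 1)
    s_idx = np.minimum((hs_image[..., 1] * bins[1]).astype(int), bins[1] - 1)
    return hist[h_idx, s_idx]

# object sample: reddish hues with high saturation
obj = np.column_stack([np.random.uniform(0.0, 0.1, 500),
                       np.random.uniform(0.7, 1.0, 500)])
model = hs_histogram(obj)
scene = np.random.uniform(0, 1, (4, 4, 2))     # toy HS image
likelihood = back_project(scene, model)
print(likelihood.shape)   # (4, 4)
```

Thresholding the likelihood map then yields candidate object regions, which is the role the color model plays in the region-based trackers cited above.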
Background subtraction is also one of the popular techniques frequently
used in region-based methods [29,33-35]. Note that although most object
regions can be extracted by background subtraction, background-subtracted
images can be erroneous due to measurement errors or changes in the scene
environment. Thus, techniques such as morphological filtering or dynamic
updating of the background model are used to extract more precise foreground
regions [36].
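A minimal sketch of this idea, assuming grayscale frames: a running-average background model is updated each frame, the foreground is obtained by thresholding the absolute difference, and a morphological opening stands in for the more elaborate refinement techniques cited above. Parameter values here are arbitrary illustrations, not recommendations from the literature.

```python
import numpy as np
from scipy.ndimage import binary_opening

def update_background(bg, frame, alpha=0.05):
    """Running-average background model: bg <- (1-alpha)*bg + alpha*frame."""
    return (1.0 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=25.0):
    """Threshold |frame - bg|, then remove speckle noise with a
    morphological opening."""
    mask = np.abs(frame.astype(float) - bg) > thresh
    return binary_opening(mask, structure=np.ones((3, 3)))

bg = np.zeros((32, 32))
frame = bg.copy()
frame[10:20, 10:20] = 200.0          # a bright moving "object"
frame[5, 5] = 200.0                  # isolated noise pixel
mask = foreground_mask(bg, frame)
print(mask[15, 15], mask[5, 5])      # True False: object kept, noise removed
bg = update_background(bg, frame)    # background slowly absorbs the scene
```

The opening erases the single-pixel false detection while preserving the coherent object region, which is exactly the failure mode the morphological filtering in [36] addresses.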
2.2.2 Contour-based Methods
Contour-based methods for object detection and tracking find both the
position and the shape of objects by modeling their contour information. They
are not only more robust to partial occlusion but also applicable to deformable
as well as rigid objects, although their computational complexity is higher than
that of region-based methods.
The active contour is a concept that has been widely used in contour-
based algorithms [18,37-38]. An active contour is a polygon consisting of
several points placed at specific features such as lines and edges. The total
energy of an active contour is defined as the sum of internal and external
energies, which are related to contour elasticity and image content respectively.
Among several candidates for the active contour of an object, the one with the
minimum energy is chosen.
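The energy minimization over candidate contours can be illustrated with a deliberately simplified toy: internal energy as contour elasticity (squared distances between consecutive points) and external energy as negative gradient magnitude sampled at the contour points. This is a sketch of the selection rule only, not a full snake optimizer, and the weights are arbitrary.

```python
import numpy as np

def internal_energy(contour):
    """Elasticity term: sum of squared distances between consecutive
    points of the closed polygon."""
    diffs = np.diff(np.vstack([contour, contour[:1]]), axis=0)
    return float((diffs ** 2).sum())

def external_energy(contour, grad_mag):
    """Image term: negative gradient magnitude at the contour points,
    so that strong edges lower the total energy."""
    xs = contour[:, 0].astype(int)
    ys = contour[:, 1].astype(int)
    return float(-grad_mag[ys, xs].sum())

def best_contour(candidates, grad_mag, w_int=0.1, w_ext=1.0):
    """Pick the candidate contour with minimum total energy."""
    energies = [w_int * internal_energy(c) + w_ext * external_energy(c, grad_mag)
                for c in candidates]
    return int(np.argmin(energies))

# toy image: a strong vertical edge at x = 8
grad = np.zeros((16, 16))
grad[:, 8] = 10.0
on_edge  = np.array([[8, 2], [8, 5], [8, 8], [8, 11]])   # (x, y) points
off_edge = np.array([[3, 2], [3, 5], [3, 8], [3, 11]])
print(best_contour([off_edge, on_edge], grad))   # -> 1 (the on-edge contour)
```

Both candidates have identical elasticity, so the external term alone pulls the winner onto the edge, mirroring how the image term attracts a snake toward object boundaries.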
Another type of contour-based method is the graph-cut algorithm [39-40].
It represents the object region as a graph built from the combination of an inner
and an outer boundary. The two boundaries consist of sets of points called
sources (on the inner boundary) and sinks (on the outer boundary). The final
deformation is then decided by computing the minimum cut of the graph.
2.2.3 Feature-based Methods
Feature-based methods for object detection and tracking in the pixel
domain calculate the motion parameters of feature points. These motion
parameters are related to an affine transformation composed of rotation and
translation in 2D space. In this approach, the object which users want to track is
usually defined as a bounding box or a convex hull.
This approach can suffer more tracking errors than other methods, since
it is sensitive to partial occlusion. Furthermore, its performance strongly
depends on the selection of feature points. That is to say, feature points should
be visually prominent, such as edges or boundaries which can be clearly
separated and recognized from their neighborhood. To select feature points
with such distinctive properties, various techniques such as the Hough
transform and Gabor wavelets can be used [41-42].
Once feature points are selected, the displacement of these points can be
computed by minimizing the dissimilarity, as described in [43]. Instead of
dissimilarity minimization, the cross-correlation method is also effective for
tracking feature points, as introduced in [44-45]. It searches for the best
candidate, which maximizes the cross-correlation, within the square neighborhood
of the point in the previous frame. Alternatively, some researchers make use of
the 2D golden section algorithm based on a mesh which can be created by
interconnecting feature points [42].
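The cross-correlation search of [44-45] can be sketched as follows; the patch size, the search radius, and the list-of-lists image representation are illustrative assumptions rather than the cited papers' exact settings.

```python
def ncc(a, b):
    """Normalized cross-correlation of two equally sized, flattened patches."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return num / (da * db) if da and db else 0.0

def track_point(prev, cur, pt, half=1, search=2):
    """Find the position in `cur` whose patch best matches the patch around
    `pt` in `prev`, scanning a square search neighborhood."""
    def patch(img, x, y):
        return [img[y + dy][x + dx] for dy in range(-half, half + 1)
                                    for dx in range(-half, half + 1)]
    ref = patch(prev, *pt)
    best, best_score = pt, -2.0  # NCC is bounded by [-1, 1]
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = pt[0] + dx, pt[1] + dy
            # keep the whole patch inside the image
            if half <= x < len(cur[0]) - half and half <= y < len(cur) - half:
                s = ncc(ref, patch(cur, x, y))
                if s > best_score:
                    best, best_score = (x, y), s
    return best
```

When a textured patch is simply translated between frames, the exact match yields a correlation of 1 and is selected.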
2.2.4 Template-based Methods
Template-based methods are the way of tracking special objects like face
by using the predefined template. First of all, a template can be created by ob-
serving during a particular period or from a database which is statistically made
[46]. The next step is the template matching which searches the best region

14
matched with the template. The template is projected onto the target image
through minimizing the distance measure; in other words, the parameters of this
geometric transformation are estimated [47-48].
However, since such an algorithm is applicable only to rigid objects, algorithms
for deformable objects have been introduced [49-51]. In these algo-
rithms, some parameters of a deformable template are obtained by minimizing
the template energy which is composed of terms attracting the template into
prominent features such as edges. Then, the deformable template can be acquired
by deformation of the template based on such parameters. In the meantime, to
cope with problems like viewpoint change and illumination change, the color
invariant features can be extracted by updating the template based on Kalman or
particle filters [48].
2.3 Compressed Domain Approach
The conventional compressed domain algorithms exploit motion vectors
or DCT coefficients instead of original pixel data as resources in order to reduce
computational complexity of object detection and tracking. However, these encoded
data are not sufficiently credible or informative to detect and track moving objects. For
example, the motion vectors in the encoded bitstreams are not always coincident
with 2D projected true motion (sometimes called optical flow) since the block
matching algorithm for producing motion vectors in a video encoder is designed
to pursue data reduction instead of optical flow estimation. Furthermore, the motion
vectors are sparsely distributed in an image in the units of blocks such as
8x8, 16x16, and so forth; that is, the motion vector field is not dense. With re-
spect to DCT coefficients in MPEG-1 or MPEG-2, the DC image can be con-
structed from DCT coefficients in an I-frame which are directly produced from
original pixel value without intra prediction. However, it provides insufficient
information about texture because its resolution is lower than that of the original
images. For these reasons, the compressed domain algorithms have concentrated
on overcoming these limitations.
Most of compressed domain algorithms for object detection and tracking
carry out the object segmentation which partitions an image into several seg-
ments which represent background or moving objects with block unit boundaries.
Since these algorithms extract the boundaries of objects as well as their location
and size, they tend to require more computation time than those which extract
only object location and size without describing object boundaries. Nevertheless,
it is worth surveying these object segmentation algorithms because they
are intimately involved in the object detection and tracking procedure.
In general, the compressed domain algorithms consist of two steps: a
clustering step and a filtering step. Depending on how these steps are organized,
the compressed domain algorithms can be categorized into clustering-based
methods and filtering-based methods. The clustering-based methods group and
merge all blocks into several regions according to their spatial or temporal
similarity. Then, these regions are merged with each other or classified as
background or foreground. On the other hand, the filtering-based methods extract
foreground regions by filtering out blocks which are expected to belong to the
background or by classifying all blocks into foreground and background. Then,
the foreground region is split into several object parts through a clustering
procedure. In this chapter, the two types of compressed domain algorithms are
described respectively.
2.3.1 Clustering-based Methods
As the most important measurement for extracting the moving object re-
gion, the clustering-based methods emphasize the local similarity of blocks ra-
ther than the global similarity in a whole image. First of all, they split an image
into several regions which consist of blocks with homogeneous properties of
motion vectors or DCT coefficients. Then, after similar regions are merged,
these regions are classified as background or foreground. In most clustering-
based methods, a preferential clue for block-clustering is the similarity of motion
vectors while the similarity of DCT coefficients is complementarily employed to
improve the performance or refine object boundaries.
In the simplest algorithm introduced in [11], some blocks can be merged
by grouping similar nearby motion vectors, and the merged block group is con-
sidered as a moving object. After the target object is manually selected among
several block groups, it is tracked by searching for the corresponding block
group which has a similar average motion vector. If such a group cannot be found
due to occlusion, a similarity measure based on DCT coefficients can be employed.
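A minimal sketch of this block-grouping step follows; the 4-connectivity and the Euclidean similarity threshold are illustrative assumptions, not the exact merging rule of [11].

```python
from collections import deque

def group_blocks(mv_field, thresh=1.0):
    """Merge 4-connected blocks with similar motion vectors into groups.

    mv_field -- 2D list of (mvx, mvy) tuples, one per block
    Returns a label map of the same shape; each merged group (a candidate
    moving object) receives a distinct integer id.
    """
    h, w = len(mv_field), len(mv_field[0])
    labels = [[-1] * w for _ in range(h)]
    nid = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] != -1:
                continue
            labels[sy][sx] = nid
            q = deque([(sx, sy)])
            while q:  # breadth-first flood fill over similar neighbors
                x, y = q.popleft()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and labels[ny][nx] == -1:
                        dx = mv_field[y][x][0] - mv_field[ny][nx][0]
                        dy = mv_field[y][x][1] - mv_field[ny][nx][1]
                        if dx * dx + dy * dy <= thresh * thresh:
                            labels[ny][nx] = nid
                            q.append((nx, ny))
            nid += 1
    return labels
```

Each resulting label corresponds to one block group; the target group can then be followed across frames by its average motion vector, as the text describes.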
The leveled watershed technique is another possible approach for block
clustering. As described in [10], the leveled watershed technique can be applied
to low-resolution images which are generated from the DC and first two AC
coefficients. Then, it constructs a motion map in which the dominant motion
blocks are extracted from the histogram of accumulated motion vectors. Notably,
intra-coded macroblocks are also added to the motion map. Based on such a
motion map, all connected regions which have similar motion vectors are merged.
However, the above similarity measure of motion vectors is not accurate
and credible since motion vectors do not always correspond to optical flow. Thus,
the performance of object detection and tracking can be improved by measuring
the reliability of motion vectors. As described in [9], reliable motion vectors can
be extracted by the noise adaptive soft switching median (NASM) filter. Then,
for spatial segmentation in P-frames, it clusters NASM-filtered reliable motion
vectors into an optimal number of homogeneous groups according to motion
homogeneity. To compensate for the limitation of spatial segmentation due to the
sparse distribution of motion vectors, temporal segmentation is additionally employed.
After moving object regions in a P-frame are projected onto the current I-frame,
more precise boundaries can be obtained by maximizing the entropy based on
the DC image constructed from DCT coefficients.
In addition to the unreliability of motion vectors, the motion vector field
extracted from compressed videos is too sparse since each motion vector is
assigned per macroblock. Therefore, the clustering-based methods have evolved
to overcome the insufficiency of motion information caused by the sparse motion
vector field. In [13], Babu et al. introduced an advanced technique for
extracting more reliable and dense motion information from sparse and unrelia-
ble motion vector field. The algorithm calculates the reliability of motion vectors
based on the energy of DCT coefficients. Only reliable motion vectors are tem-
porally accumulated over several frames, and then are spatially interpolated by
median filtering to get the dense motion field; that is, one motion vector is as-
signed to each pixel. Basically, the dense motion field can be clustered by incor-
porating affine parametric motion model; however, such a clustering cannot be
precise since the dense motion field still remains unreliable. This problem can be
addressed by the expectation maximization (EM) algorithm, which is an itera-
tive technique that alternately estimates and refines the segmentation and motion
estimation. Also, the optimal number of motion models is estimated by K-means
clustering. Such initially segmented object partitions are temporally tracked over
frames. Finally, the edge refinement process is done based on partially decoded
data from each edge block and its eight neighboring blocks.
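The spatial interpolation step can be sketched roughly as below; this simplified version only fills unreliable motion vectors with the median of reliable 3x3 neighbors, and omits the temporal accumulation, affine model fitting, and EM refinement described above.

```python
import statistics

def median_fill(mv_field, reliable):
    """Spatially interpolate unreliable motion vectors by the median of
    reliable neighbors in a 3x3 window (a simplified stand-in for the
    accumulation-and-median-filtering step of the dense-field construction).

    mv_field -- 2D list of (mvx, mvy) tuples
    reliable -- 2D list of booleans, True where the vector is trusted
    """
    h, w = len(mv_field), len(mv_field[0])
    out = [row[:] for row in mv_field]
    for y in range(h):
        for x in range(w):
            if reliable[y][x]:
                continue  # keep reliable vectors untouched
            neigh = [mv_field[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))
                     if reliable[ny][nx]]
            if neigh:  # replace by the componentwise median of neighbors
                out[y][x] = (statistics.median(v[0] for v in neigh),
                             statistics.median(v[1] for v in neigh))
    return out
```

An outlier vector surrounded by consistent reliable neighbors is thus replaced by the locally dominant motion.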
Some algorithms make use of DCT coefficients rather than motion vectors
as their main resource because they do not trust the reliability of motion vectors.
An example of such algorithms is introduced in [3]. The algorithm merges spatially
similar blocks based on DCT coefficients. A region with spatial homogeneity can
contain both true motion and false motion blocks. The decision rule for true mo-
tion blocks is dependent on the motion-compensated error which is derived from
motion vectors and DCT coefficients. Basically, a region which includes more
true motion blocks than false motion blocks can be considered as a part of moving
objects and is called a "dynamic region". If a non-dynamic region overlaps
with the regions projected from moving objects in the previous frame, its
status can also be changed to "projected dynamic region" according to the number
of true motion blocks in the non-dynamic region. All dynamic regions and
projected dynamic regions are merged into moving objects in the current frame.
The algorithm described in [8] never utilizes motion vectors; it exploits
only DCT coefficients for object detection and segmentation. It initially clusters
a frame into several fragments according to the similarity of AC components and
DC image which is constructed from DCT coefficients. Next, it merges homoge-
neous fragments based on two spatiotemporal similarity measures. One measure
is to merge spatiotemporally similar fragments, and the other is to merge the
fragments with lower spatiotemporal similarity but high average temporal
change within the fragments. These similarity measures are defined as the com-
bination of the spatial similarity and the temporal similarity. While the spatial
similarity is based on the entropy of AC coefficients, the temporal similarity is
measured through performing the 3D Sobel filter along x-, y-, and t-axes. Finally,
the fragments with high average temporal change are classified as objects, and
others are classified as background. Then, detailed features like edge information
can be extracted by decoding DCT coefficients around the boundaries of objects.
Region growing is one of the typical techniques popularly employed in
clustering-based methods. Chen et al. suggested a region growing al-
gorithm which is available in MPEG-2 compressed domain [26]. At the first
stage, the algorithm extracts DC images in I-frames and P-frames. While the DC
image in an I-frame is extracted directly from DC coefficients, the DC image in
a P-frame can be estimated from the DC image of its reference frame as de-
scribed in [21]. The second stage is the object segmentation which is composed
of three steps: (1) obtaining object fragments by extracting blocks with high
temporal change in the DC image, (2) region growing which is performed by
merging interconnected nearby object fragments, and (3) merging the regions
with similar motion vectors and spatially close regions. In the case of non-
stationary cameras, global motion compensation can be applied. The last stage is
the object tracking, which is performed by searching for the corresponding object
in the subsequent frame. The correspondence of objects is judged based on the
similarity of the numbers of pixels and the similarity of center positions.
Another region growing algorithm has been proposed by Porikli and Sun
in [17]. The algorithm defines the frequency-temporal data structure which is
constructed from DCT coefficients and motion vectors. The feature vector at a
block indexed by (m,n) in the t-th frame is defined as follows:

    f_{t:m,n} = [ dc_y, dc_u, dc_v, ac_h, ac_v, ac_d, e, mv_x, mv_y ]^T        (1)
where dc_y, dc_u, dc_v denote the DC coefficients of the Y, U, and V channels;
ac_h, ac_v, ac_d denote the averages of the horizontal, vertical, and diagonal AC
coefficients; e denotes the energy of the DCT coefficients in a block; and
mv_x, mv_y denote the x- and y-components of the forward motion vector. It
should be noted that the original backward motion vectors are converted into
forward motion vectors. After constructing the frequency-temporal data struc-
ture, a seed region grows into a volume in the spatial and temporal directions by
merging homogeneous blocks which have similar feature vectors. Next, each
volume is fitted to a motion model by estimating affine motion parameters. Lastly,
each segmented volume is hierarchically clustered into objects by using its
motion parameters. To terminate the iteration of hierarchical clustering, the
algorithm measures a validity score that evaluates the result of object
segmentation. The algorithm achieves a processing time of 0.9~2 ms per frame.
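The feature vector of Eq. (1) and the homogeneity test used while growing a volume can be sketched as follows; the Euclidean distance and the threshold are illustrative assumptions, since [17] does not necessarily use this exact similarity measure.

```python
def block_feature(dc_yuv, ac_avgs, ac_energy, fwd_mv):
    """Build the 9-dimensional feature vector of Eq. (1):
    [dc_y, dc_u, dc_v, ac_h, ac_v, ac_d, e, mv_x, mv_y]."""
    return list(dc_yuv) + list(ac_avgs) + [ac_energy] + list(fwd_mv)

def similar(f1, f2, thresh):
    """Homogeneity test for merging a block into a growing volume: the two
    feature vectors must be close in Euclidean distance (the distance
    measure and threshold are assumptions for illustration)."""
    d = sum((a - b) ** 2 for a, b in zip(f1, f2)) ** 0.5
    return d <= thresh
```

In the full algorithm, blocks passing this test are absorbed into the seed's volume along both the spatial and temporal axes.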
2.3.2 Filtering-based Methods
While the first step in clustering-based methods is related to block cluster-
ing, the filtering-based methods first extract the foreground region by removing
all blocks which are unreliable or judged to belong to background. After such a
global segmentation is performed, the foreground region is split into multiple
objects by an appropriate clustering technique. A variety of filtering-based tech-
niques have been proposed.
Wang et al. employed three (spatial, temporal, and texture) confidence
measures and global motion compensation for filtering unreliable macroblocks.
The spatial confidence measure assesses how a motion vector is conformable to
the local motion smoothness constraint within a neighborhood region in terms of
its magnitude and direction. The temporal confidence measure assesses how a
motion vector is smoothly changed over neighborhood frames. On the other
hand, the texture confidence measure is based on the assumption that a low-
textured region tends to cause false motion vectors which do not coincide with
optical flow. In other words, the average energy of AC coefficients in four
neighborhood blocks is computed; if the AC energy is higher than a predefined
threshold, the current block has the perfect texture confidence. Combining the
three confidence measures, motion vectors with a low confidence score are
rejected, and the holes that occur in the motion field are repaired by spatial and
temporal motion filtering. The dominant foreground region is separated through
iterative estimation of global motion parameters such as zoom, vertical, and ho-
rizontal translations. Then, it is split into multiple objects by performing K-
means and EM clustering based on spatial and motion features. These objects are
tracked by their location and motion.
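Two of the confidence measures can be sketched as follows; the window sizes, thresholds, and the mean-deviation form of the smoothness test are illustrative assumptions, and the combination rule in the cited work may differ.

```python
def texture_confident(ac_energy, y, x, thresh):
    """Texture confidence: the average AC energy of the four neighboring
    blocks must exceed a threshold (a low-textured neighborhood suggests
    an unreliable motion vector)."""
    h, w = len(ac_energy), len(ac_energy[0])
    neigh = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    vals = [ac_energy[ny][nx] for ny, nx in neigh if 0 <= ny < h and 0 <= nx < w]
    return sum(vals) / len(vals) > thresh

def spatial_confident(mv, y, x, max_dev):
    """Spatial confidence: the motion vector should conform to the local
    motion smoothness constraint, i.e. stay close to the mean of its
    neighborhood motion vectors."""
    h, w = len(mv), len(mv[0])
    neigh = [mv[ny][nx] for ny in range(max(0, y - 1), min(h, y + 2))
                        for nx in range(max(0, x - 1), min(w, x + 2))
                        if (ny, nx) != (y, x)]
    mx = sum(v[0] for v in neigh) / len(neigh)
    my = sum(v[1] for v in neigh) / len(neigh)
    dev = ((mv[y][x][0] - mx) ** 2 + (mv[y][x][1] - my) ** 2) ** 0.5
    return dev <= max_dev
```

A motion vector that fails either test is rejected, and the resulting hole is filled by the spatial and temporal motion filtering mentioned above.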
As shown in the above algorithm, global motion estimation is one of the
popular filtering-based techniques for extracting the foreground region. Similarly,
the foreground/background segmentation can be performed by iterative
macroblock rejection, which iteratively estimates the parameters of the global-
motion model [22]. Then, the foreground is clustered by examining the temporal
consistency of the iterative rejection output.
The background subtraction technique is also beneficial for extracting the
foreground in filtering-based methods. Zeng et al. have proposed a change-based
algorithm which extracts moving objects by background detection based on the
inter-frame difference of DC images. First, the background subtraction is
performed by applying the moment-preserving thresholding method to the
histogram of the inter-frame difference, on the basis of the experimental
observation that the background tends to have low inter-frame difference values.
Then, if moving objects are assumed to be non-Gaussian signals, they
can be detected by the fourth-order moment measure which is computed within a
moving window of inter-frame difference in the block unit. The blocks with high
moment are considered as a part of moving objects.
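The fourth-order moment measure can be sketched as follows; the window size is an illustrative assumption, and the moment is taken about the local mean of the inter-frame difference.

```python
def fourth_moment(diff, y, x, half=1):
    """Fourth-order central moment of the inter-frame difference within a
    moving window centered at (y, x); non-Gaussian (object) regions yield
    high values, while flat background regions yield values near zero."""
    h, w = len(diff), len(diff[0])
    vals = [diff[ny][nx] for ny in range(max(0, y - half), min(h, y + half + 1))
                         for nx in range(max(0, x - half), min(w, x + half + 1))]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 4 for v in vals) / len(vals)
```

Blocks whose moment exceeds a threshold are then marked as parts of moving objects, as the text describes.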
An advanced background subtraction technique has been introduced by
Aggarwal et al. in [2]. In this algorithm, a user manually selects a target object in an I-
frame by drawing a rectangle box. The location of the object in the subsequent
frames of one GOP is estimated by using motion vectors within the target object
region. The object region in the subsequent I-frame can be found by the back-
ground subtraction of DC images. In other words, all foreground regions are ex-
tracted by subtracting the DC image which is constructed from DC coefficients
from the low-resolution background image. The foreground is clustered into sev-
eral candidate objects, and then the target object is found among these candidate
objects. To be specific, the previous target object is projected into the current I-
frame based on its motion vector. Then each candidate object is compared with
the projected object in terms of the histograms of DCT coefficients of chromin-
ance components (Cb and Cr) and the distance from the projected object; that is,
a candidate object which has the smallest difference with the projected object is
considered as the target object in the current I-frame. Finally, the locations of the
target object in the previous P- or B-frames are updated by object interpolation.
This method is available only for surveillance videos which are taken from a sta-
tionary camera.
Exceptionally, the algorithm proposed by Yu et al. in [25] combines the
background subtraction with the region-growing technique as a kind of cluster-
ing-based methods. It makes a motion mask, by clustering the given motion vec-
tor field based on the region growing algorithm, as well as a difference mask by
applying the background subtraction to the DC image extracted from DCT coef-
ficients. Two masks are combined to obtain the final mask as moving object re-
gions.
Another technique for moving object segmentation is based on the Mar-
kov random field (MRF) theory. The MRF-based algorithms provide more reliable
performance; however, they do not show a significant reduction in computational
complexity. Benzougar et al. first proposed an algorithm based on a
Markovian macroblock labeling framework [6]. First of all, a 2D affine motion
model of the dominant image motion in each P-frame is computed from motion
vectors. Then, a frame is divided into two groups: the regions corresponding to
the estimated dominant image motion and those not corresponding to it. Since
the dominant image motion represents the global motion, the regions not con-
forming to the dominant image motion can be thought as the moving object re-
gions. However, this is still not reliable due to the drawbacks of motion vectors.
To make segmentation more reliable, the algorithm employs the displaced frame
difference (DFD) between successive DC images constructed from the DC
coefficients of P-frames. That is to say, if a block has a low DFD value, the
above labeling decision for this block is confirmed to be more reliable. For this
reason, this algorithm considers two factors: the DFD and the difference between
the motion vectors of each block and the dominant image motion. The two factors
are combined in the form of an energy, and this energy is then minimized by the
Markovian labeling framework to find the optimal configuration of moving object
regions.
As an another MRF-based algorithm, Treetasanatavorn et al. have pro-
posed the algorithm that applies the Gibbs-Markov random field theory and the
Bayesian estimation framework to separate the significant foreground object
from compressed motion vector field [15]. This algorithm performs object detec-
tion and tracking by maximizing the following probability density based on the
maximum a posteriori probability (MAP) estimation and the Gibbs-Markov ran-
dom field theory:
    Pr(S', S, V) = Pr(V | S', S) Pr(S' | S) Pr(S)        (2)

where S denotes an initial partition, S' is its predicted partition, and V is
the compressed motion vector field. At first, this method uses only reliable
motion vectors after evaluating the reliability of motion vectors. The object
segmentation in the first frame is performed by the stochastic motion coherency
analysis introduced in [23]. In the subsequent frames, it additionally applies the
partition projection and relaxation scheme introduced in [24] in order to predict
S'. The conflict between the two predictions, that from the stochastic motion
coherency analysis and that from the partition projection and relaxation scheme,
is resolved by checking the incongruity between the two predicted partitions. In
the next step, the method attempts to obtain the optimal partition from the
predicted partition of the first step by using the Bayesian estimation framework.
For this purpose, the algorithm relaxes region boundaries and then searches for
the configuration of the partition which maximizes Pr(S', S, V). Finally, it
classifies partitions into background and foreground based on the reliability of
motion vectors.
Recently, an MRF-based algorithm for H.264|AVC compressed video has
been proposed by Zeng et al. in [12]. The basic structure of this algorithm is
similar to that of other MRF-based algorithms; that is, similar motion vectors are
merged into multiple moving objects by minimizing the MRF energy. Its unique
trait is that it considers the variable block sizes (such as 4x4, 4x8, 16x16, and so
forth) which are supported not by MPEG-1 or MPEG-2 but by H.264|AVC. The
algorithm has two steps: (1) motion vector classification and (2) moving block
extraction. In the first step, motion vectors are classified into four types:
background MVs, edge MVs, foreground MVs, and noise MVs. In stationary scenes,
motion vectors with small magnitude are considered as background MVs while
motion vectors with large magnitude are classified as foreground MVs. Motion
vectors with intermediate magnitude are regarded as noise MVs. On the other
hand, a motion vector which is similar to the average motion vector of its
neighboring blocks can be considered as an edge MV. The number of neigh-
borhood blocks is decided by the macroblock partition type. In the second step,
the MRF classification is applied to find the optimal configuration of block labe-
ling (foreground or background) and extract the blocks which represent moving
objects, on the consideration of two clues such as (1) the MV spatial similarity
within a moving region and (2) the temporal consistency of moving objects. The
object segmentation in I-frames is achieved through projecting the object regions
in the previous P-frame to the current I-frame according to the inverted motion
vectors.
Thilak and Creusere have also proposed an algorithm for H.264|AVC
compressed videos which uses the probabilistic data association filter (PDAF), a
kind of Bayesian algorithm [25]. The algorithm has two separate steps: the
detection step and the tracking step. In the detection step, it constructs a binary
image; the pixels whose motion vectors have neither small nor large magnitude
are considered to belong to moving objects. Mathematically, such blocks should
satisfy the following condition:

    b_i = 1   if M'_L <= M_i <= M'_H,   and   b_i = 0   otherwise        (3)
where b_i is the indicator for the block at location i, M_i is the magnitude of its
motion vector, and M'_L and M'_H are the lower and upper thresholds for the
motion vector magnitude. To improve the classification performance, the optimal
values of M'_L and M'_H can be obtained by minimizing the Bayesian risk, which
is constructed from the probability densities and prior probabilities of the two
classes (target and background). Then, all pixels which are interconnected in the
binary image are merged, and several fragments are formed. In the
tracking step, the motion of the target object is modeled as follows:

    x_{k+1} = F x_k + Γ v_k,    z_k = H x_k + w_k        (4)

where x_k is the state vector of the target at time k, z_k is the observation vector
of the target, v_k and w_k are zero-mean white Gaussian noise sequences with
covariance matrices Γ Q_k Γ^T and R_k respectively, and F and H are matrices
that are independent of time. Classically, this kind of object can be tracked by
the Kalman filter; however, since the Kalman filter can track only one fragment,
it can make serious errors when one object is split into several fragments.
Therefore, the PDAF can be applied to handle such cases. It seems to show
reliable performance, but it does not assure a reduction of computational
complexity.
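The detection rule of Eq. (3) and one cycle of the linear model of Eq. (4) can be sketched as follows; the constant-velocity choice of F and H, the folding of Γ Q Γ^T into a scalar q, and the noise levels are illustrative assumptions (the cited work runs the PDAF on top of this model rather than a plain Kalman filter).

```python
def motion_mask(mag, m_lo, m_hi):
    """Eq. (3): mark a block as moving when its motion-vector magnitude
    lies between the lower and upper thresholds."""
    return [[1 if m_lo <= v <= m_hi else 0 for v in row] for row in mag]

def kalman_step(x, P, z, q=0.1, r=1.0):
    """One predict/update cycle for a constant-velocity instance of
    Eq. (4): F = [[1, 1], [0, 1]], H = [1, 0]; q and r are assumed
    process- and measurement-noise levels.

    x -- [position, velocity], P -- 2x2 covariance, z -- measured position
    """
    # predict: x' = F x,  P' = F P F^T + Q
    xp = [x[0] + x[1], x[1]]
    p00 = P[0][0] + P[0][1] + P[1][0] + P[1][1] + q
    p01 = P[0][1] + P[1][1]
    p10 = P[1][0] + P[1][1]
    p11 = P[1][1] + q
    # update: K = P' H^T / (H P' H^T + R),  x = x' + K (z - H x')
    s = p00 + r
    k0, k1 = p00 / s, p10 / s
    y = z - xp[0]
    xn = [xp[0] + k0 * y, xp[1] + k1 * y]
    Pn = [[(1 - k0) * p00, (1 - k0) * p01],
          [p10 - k1 * p00, p11 - k1 * p01]]
    return xn, Pn
```

Fed consistent position measurements of an object moving at constant speed, the filter's position and velocity estimates converge to the true trajectory within a few frames.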

2.3.4 Issues in Compressed Domain Approach
The essential goal of the compressed domain approach is to significantly
reduce the computational complexity while only slightly degrading the
performance of object detection and tracking. The processing time of the major
algorithms is shown in Table 1; the other algorithms have not reported how fast
they are.
Table 1. The processing time of compressed domain algorithms.
Authors Frames/sec PC Note
Zen et al. [11] 5~10 unknown
Wang et al. [7] 2 450 MHz
Chen et al. [26] 43 unknown
Benzougar et al. [6] 40 400 MHz Excluding video decoding
Mezaris et al. [22] 200 800 MHz Excluding video decoding
Zeng et al. [12] 2~16 700 MHz Available for H.264|AVC
Treetasanatavorn et al. [15] 0.1 500 MHz
Porikli and Sun [17] 111~500 4.3 GHz
Aggarwal et al. [2] 100 1.8 GHz

The algorithms which show significantly fast processing times are those of Chen
et al., Mezaris et al., Porikli and Sun, and Aggarwal et al. Especially, although
Mezaris et al.'s algorithm and Porikli and Sun's algorithm simultaneously
perform object segmentation as well as object detection and tracking, their
processing times are remarkably fast.
Nevertheless, these algorithms have some serious shortcomings which
cause poor performance in object detection and tracking. First of all, they are
available only in extremely restricted environments; that is, they can make
serious errors in particular scene situations. For instance, Chen et al.'s algorithm
first extracts the foreground region from the difference image of temporally neigh-
boring DC images [26]. This is not reliable because most internal parts of the
object region can be excluded from the extracted foreground region when the
interior of the objects is low-textured. In other words, it can achieve successful
results only in cases where the texture of most object regions is obviously
altered. In the case
of Mezaris et al.'s algorithm, the foreground is obtained through global-motion
compensation based on an iterative macroblock rejection scheme [22]. That is, a
motion vector which is greatly different from the global motion is considered as
background. However, it can fail to extract the whole foreground region in the
case that the motion of moving objects is not exactly distinguishable from the
global motion. Porikli and Sun's algorithm can also make errors due to the
limitation of its region merging technique [17]. For spatiotemporal segmentation,
it merges blocks that have similar motion vectors and DCT coefficients. However,
an object region can contain chaotic motion vectors; for example, when a
deformable object moves in the same direction as the camera, or when it consists
of homogeneous texture over a large portion, a chaotic set of motion vectors with
various amplitudes or directions is produced in unpredictable patterns. Likewise,
the limitation of Aggarwal et al.'s algorithm is that it does not consider changes
in the size of the target object, which is manually selected as a rectangular box
[2]. The algorithm is applicable only when the object size is constant over
frames.
Another problem with these algorithms is that they are not compatible with
H.264|AVC. These algorithms commonly exploit the DC images which are
formed from DCT coefficients in an I-frame. In MPEG-1 or MPEG-2 bitstreams,
the DC image formation is possible because raw pixel data in I-frames is directly
converted by the discrete cosine transform (DCT) without intra prediction. On
the other hand, since in H.264|AVC the difference between original pixel data
and intra-predicted pixel value is converted by the integer transform (IT), the DC
image cannot be built in I-frames and P-frames.
Additionally, these algorithms do not support consistent object recognition
based on color information. For example, when multiple persons are tracked, a
person can repeatedly enter and leave the camera view. Then, in H.264|AVC
videos it is difficult to recognize the person's identity based on motion vectors
and IT coefficients.
In the proposed methods, the above problems are addressed in three ways:
(1) reinforcing the adaptability to various scenes, (2) reflecting the distinctive
features of H.264|AVC bitstreams, and (3) decoding the ROIs partially. As a
result, the proposed methods can not only maintain fast computation time, but
also achieve more reliable performance than the traditional algorithms.

III Proposed Schemes for Moving Object De-
tection and Tracking with Partial Decod-
ing in H.264|AVC Bitstream Domain
In this chapter, two algorithms for object detection and tracking in
H.264|AVC bitstream domain are introduced. One approach is the semi-
automatic method for interactive broadcasting services, and the other approach is
the automatic method especially for real-time surveillance applications. The
semi-automatic method adopts the dissimilarity minimization algorithm, whereas
the automatic method is based on the spatial and temporal macroblock filter
(STMF). Both techniques concentrate on improving the performance in various
scenes where the traditional compressed domain algorithms are not applicable.
It should be noticed that unlike traditional compressed domain algorithms,
the proposed algorithms exploit partially decoded pixel data as well as encoded
information like motion vectors or IT coefficients in order to detect and track
moving objects. Even though some compressed domain algorithms contain par-
tial decoding process, it does not positively contribute to object detection and
tracking procedure; it is just for boundary refinement [8,13]. The partial decod-
ing in the proposed algorithms can increase the processing time; however, it
makes a great contribution to finding more accurate locations and sizes of moving
objects. Not only that, but it also gives the color information of multiple ob-
jects which can be used for object recognition or metadata formation.
3.1 Semi-automatic Approach
In order to extract location information of a predefined target object from
stationary or non-stationary scenes encoded by H.264|AVC, the dissimilarity
energy minimization algorithm can be exploited. It makes use of motion vectors
and partially decoded luminance signals to perform tracking adaptively accord-
ing to properties of the target object in H.264/AVC videos. It is one of the semi-
automatic feature-based approaches that tracks some feature points selected by a
user. First, it roughly predicts the position of each feature point using motion
vectors extracted from H.264/AVC bitstream. Then, it finds out the best position
inside the given search region by considering three clues such as texture, form,
and motion dissimilarity energies. Since just neighborhood regions of feature
points are partially decoded to compute this energy, the computational complexity
is greatly reduced. The set of best positions of feature points in each frame is
selected to minimize the total dissimilarity energy by dynamic programming.
Also, weight factors for dissimilarity energies are adaptively updated by the
neural network. Compared with the traditional compressed domain algorithms,
the algorithm can successfully track the target object even when its shape is de-
formable over frames or its motion vectors are not homogeneous due to high-
textured background.

3.1.1 Forward Mapping of Backward Motion Vectors
The motion vectors extracted directly from H.264|AVC bitstream can be
used to roughly predict the motion of feature points. Since all motion vectors in P-frames point in the backward direction, they must be converted to the forward direction. Following Porikli and Sun [17], the forward motion field is built by the region-matching method. First, motion vectors of blocks with various sizes are dispersed to 4x4 unit blocks. After each block is projected onto the previous frame, the set of overlapping blocks is extracted as shown in Figure 1.

Figure 1. The region-matching method for constructing the forward motion field
Forward motion vectors of overlapped blocks in the previous frame are
updated with respect to the ratio of the overlapping area to the whole block area.
Assuming that the jth 4x4 block b_{k,j} in the kth frame overlaps the ith 4x4 block b_{k-1,i} in the (k-1)th frame, the forward motion vector fmv_{k-1}(b_{k-1,i}) is given by
$$fmv_{k-1}(b_{k-1,i}) = -\frac{1}{16}\sum_{j=1}^{N} S_{k-1}(i,j)\, mv_k(b_{k,j}) \qquad (5)$$

where S_{k-1}(i,j) stands for the overlapping area between b_{k,j} and b_{k-1,i}, and mv_k(b_{k,j}) denotes the backward motion vector of b_{k,j}, with i, j = 1, 2, ..., N. We assume that H.264/AVC videos are encoded in the baseline profile, in which each GOP contains just one I-frame and several P-frames. It should be noted that the above region-matching method cannot be applied to the last P-frame of a GOP, since the next I-frame has no backward motion vectors. Assuming that the motion of each block is approximately constant within a small time interval, the forward motion vector of any block in the last P-frame can be assigned as the reverse of its backward motion vector, as expressed by
$$fmv_{k-1}(b_{k-1,i}) = -\, mv_{k-1}(b_{k-1,i}) \qquad (6)$$
Thereafter, the positions of feature points in the next frame are predicted using forward motion vectors. If the nth feature point in the (k-1)th frame has the displacement vector f_{k-1,n} = (fx_{k-1,n}, fy_{k-1,n}) and is included in the ith block b_{k-1,i}, the predicted displacement vector p_{k,n} = (px_{k,n}, py_{k,n}) in the kth frame is defined as

$$p_{k,n} = f_{k-1,n} + fmv_{k-1}(b_{k-1,i}) \qquad (7)$$
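As an illustration of Equations (5)-(7), the forward mapping can be sketched in plain Python on a toy grid of 4x4 blocks. This is a minimal sketch, not the thesis implementation; the function names (`forward_field`, `predict_point`) and the dictionary-based motion field are assumptions made for the example.

```python
def overlap_area(ax, ay, bx, by, size=4):
    # Overlap between two size x size squares at (ax, ay) and (bx, by).
    ox = max(0, min(ax + size, bx + size) - max(ax, bx))
    oy = max(0, min(ay + size, by + size) - max(ay, by))
    return ox * oy

def forward_field(backward_mvs, width, height, size=4):
    """Region matching (Eq. 5): disperse each backward vector to the
    previous frame's 4x4 blocks it overlaps, weighted by overlap/16,
    with the direction reversed to obtain a forward vector."""
    fwd = {}
    for (bx, by), (mvx, mvy) in backward_mvs.items():
        px, py = bx + mvx, by + mvy        # project onto frame k-1
        for gx in range(0, width, size):
            for gy in range(0, height, size):
                s = overlap_area(px, py, gx, gy, size)
                if s > 0:
                    fx, fy = fwd.get((gx, gy), (0.0, 0.0))
                    w = s / (size * size)
                    fwd[(gx, gy)] = (fx - w * mvx, fy - w * mvy)
    return fwd

def predict_point(f_prev, fwd, size=4):
    """Eq. (7): move a feature point by the forward vector of its block."""
    bx, by = (f_prev[0] // size) * size, (f_prev[1] // size) * size
    mvx, mvy = fwd.get((bx, by), (0.0, 0.0))
    return (f_prev[0] + mvx, f_prev[1] + mvy)

# Toy example: one block at (4, 4) with backward vector (-4, 0),
# i.e. its content came from (0, 4) in the previous frame.
fwd = forward_field({(4, 4): (-4, 0)}, width=8, height=8)
print(fwd[(0, 4)])               # full overlap -> forward vector (4.0, 0.0)
print(predict_point((1, 5), fwd))
```

A fully overlapped block receives the whole reversed vector; partially overlapped blocks receive a weighted fraction of it.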
Since the predicted position of a feature point is not precise, we need to search for the best position of each feature point inside the search region centered at the predicted position p_{k,n} = (px_{k,n}, py_{k,n}). Each candidate point inside the search region is evaluated as a potential best position using the dissimilarity energies related to texture, form, and motion. The set of candidate points with the minimum total dissimilarity energy is selected as the optimal configuration of feature points.
3.1.2 Texture Dissimilarity Energy
The similarity of texture means how similar the luminance property in the neighborhood of a candidate point is to that in the previous frame. The set of candidate points inside the square search region is denoted as C_{k,n} = {c_{k,n}(1), c_{k,n}(2), ..., c_{k,n}(L)} with L = (2M+1)(2M+1) for the nth feature point in the kth frame. Then, the texture dissimilarity energy E_C for the ith candidate point c_{k,n}(i) = (cx_{k,n}(i), cy_{k,n}(i)) is defined as
$$E_C(k;n,i) = \frac{1}{(2W+1)^2} \sum_{x=-W}^{W} \sum_{y=-W}^{W} \Big[ s_k\big(x + cx_{k,n}(i),\, y + cy_{k,n}(i)\big) - s_{k-1}\big(x + fx_{k-1,n},\, y + fy_{k-1,n}\big) \Big]^2 \qquad (8)$$
where s_k(x,y) stands for the luminance value at pixel (x,y) of the kth frame, and W is the maximum half interval of the neighborhood. The smaller E_C is, the more similar the texture of the candidate's neighborhood is to that of the corresponding feature point in the previous frame. This energy forces the best point to be the position with the most plausible neighborhood texture. Figure 2 shows how the search region and the neighborhood of a candidate point are used to calculate E_C.
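Equation (8) amounts to a mean squared difference between two luminance patches. The following Python sketch illustrates it; the helper name `texture_energy` and the list-of-lists frame layout are illustrative assumptions, not the thesis implementation.

```python
def texture_energy(s_cur, s_prev, cand, feat, W=1):
    """Eq. (8): mean squared luminance difference between the (2W+1)x(2W+1)
    neighborhood of a candidate point in the current frame and that of the
    feature point in the previous frame. Frames are 2-D lists indexed [y][x]."""
    total = 0.0
    for dy in range(-W, W + 1):
        for dx in range(-W, W + 1):
            a = s_cur[cand[1] + dy][cand[0] + dx]
            b = s_prev[feat[1] + dy][feat[0] + dx]
            total += (a - b) ** 2
    return total / (2 * W + 1) ** 2

# Identical flat patches give zero energy; a brighter patch does not.
prev = [[10] * 5 for _ in range(5)]
cur_same = [[10] * 5 for _ in range(5)]
cur_bright = [[12] * 5 for _ in range(5)]
print(texture_energy(cur_same, prev, (2, 2), (2, 2)))    # 0.0
print(texture_energy(cur_bright, prev, (2, 2), (2, 2)))  # 4.0
```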


Figure 2. The search region is centered at the predicted point located by a forward motion vector. A candidate point inside the search region has a square neighborhood used to compute E_C.
Only the necessary blocks are partially decoded in P-frames to reduce the computational complexity. Intra-coded blocks, however, cannot be partially decoded since they are spatially predicted from their neighboring blocks.
General partial decoding takes a long time, since decoding particular blocks in P-frames requires many reference blocks in the previous frames to be decoded. We can predict which blocks will need to be decoded in order to reduce the computation time. To predict the decoded blocks in the kth P-frame, we assume that the velocity inside one GOP is uniform and equal to the forward motion vector of the (k-2)th frame. For the ith frame with i = k, k+1, ..., K, the predicted search region P_{k,n}(i) is defined as the set of pixels which are necessary to calculate the texture dissimilarity energies of all possible candidate points for the nth feature point. The maximum half interval T_{k,i} of P_{k,n}(i) is T_{k,i} = (i-k+1)M + W + ε, where ε denotes the prediction error. Then, P_{k,n}(i) is given as follows:

$$P_{k,n}(i) = \left\{\, m = (x_m, y_m) \,:\, |x_m - \hat{x}| \le T_{k,i},\; |y_m - \hat{y}| \le T_{k,i} \,\right\},\quad (\hat{x}, \hat{y}) = p_{k,n} + (i-k)\, fmv_{k-2}\big(b(f_{k-2,n})\big) \qquad (9)$$
where b(f_{k-2,n}) stands for the block which includes the nth feature point f_{k-2,n}. The decoded block set D_{k,n}(i) is defined as the set of blocks which should be decoded to reconstruct P_{k,n}(i). Using the motion vector of the (k-1)th frame, D_{k,n}(i) is given by
$$D_{k,n}(i) = \left\{\, b(d) \,:\, d = p + (i-k)\, mv_{k-1}\big(b(f_{k-1,n})\big),\; p \in P_{k,n}(i) \,\right\} \qquad (10)$$
Assuming that there exist F feature points, the total decoded block set D_k in the kth frame can finally be computed as

$$D_k = \bigcup_{n=1}^{F} \bigcup_{i=k}^{K} D_{k,n}(i) \qquad (11)$$
Figure 3 shows how partial decoding is performed in the first P-frame of
one GOP which contains one I-frame and three P-frames. It should be noticed
that the time for calculating the total decoded block set is proportional to the
GOP size.


Figure 3. The structure of partial decoding in the first P-frame of a GOP which contains one I-frame and three P-frames. Two decoded block sets D_{k,n}(k+1) and D_{k,n}(k+2) in the first P-frame are projected from two predicted search regions P_{k,n}(k+1) and P_{k,n}(k+2).
3.1.3 Form Dissimilarity Energy
The similarity of form means how similar the network of candidate points is to the network of feature points in the previous frame. The feature points are jointly linked by straight lines as in Figure 4. After a feature point is initially selected, it is connected to the closest one among the non-linked feature points. In this way, the feature network in the first frame is built by connecting all feature points successively.
To calculate the form dissimilarity energy of each candidate point, we assume that the feature points are arranged in the order assigned at the first frame. The feature point f_{k-1,n} in the (k-1)th frame has the difference vector fd_{k-1,n} = f_{k-1,n} - f_{k-1,n-1}, as shown in Figure 4. Likewise, the ith candidate point of the nth feature point in the kth frame has the difference vector cd_{k,n}(i) = c_{k,n}(i) - c_{k,n-1}(j).
Then, the form dissimilarity energy E_F for the ith candidate point of the nth feature point (n>0) is defined as follows:

$$E_F(k; n, i) = \left\| cd_{k,n}(i) - fd_{k-1,n} \right\|^{1/2} \qquad (12)$$
All candidate points of the first feature point (n=0) have zero form dissimilarity energy, E_F(k;0,i) = 0. The smaller E_F is, the less the form of the feature network is deformed. The form dissimilarity energy forces the best position of a candidate point to be the position where the form of the feature network changes as little as possible.
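A minimal Python sketch of Equation (12); the function name and argument layout are assumptions made for illustration, and `best_prev_cand` stands for the candidate c_{k,n-1}(j) chosen for the previous feature point.

```python
import math

def form_energy(cand_n, best_prev_cand, feat_n, feat_prev):
    """Eq. (12): square root of the distance between the candidate
    difference vector cd and the feature difference vector fd."""
    cdx, cdy = cand_n[0] - best_prev_cand[0], cand_n[1] - best_prev_cand[1]
    fdx, fdy = feat_n[0] - feat_prev[0], feat_n[1] - feat_prev[1]
    dist = math.hypot(cdx - fdx, cdy - fdy)
    return math.sqrt(dist)

# If the candidate network keeps exactly the same shape, the energy is zero.
print(form_energy((5, 5), (0, 0), (15, 15), (10, 10)))  # 0.0
# Stretching one link of the network raises the energy.
print(form_energy((8, 5), (0, 0), (15, 15), (10, 10)))
```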

Figure 4. The network of feature points in the previous frame and the network of
candidate points in the current frame.
3.1.4 Motion Dissimilarity Energy
The reliability of a forward motion vector means how similar it is to the true motion, i.e., how exactly it locates the predicted point. Following Fu et al. [6], if the predicted point p_{k,n}, which has been located by the forward motion vector fmv_{k-1}, returns to its original location in the previous frame via the backward motion vector mv_k, then fmv_{k-1} is highly reliable. Assuming that p_{k,n} is included in the jth block b_{k,j}, the reliability R is given as follows:
$$R(p_{k,n}) = \exp\left( -\frac{\left\| fmv_{k-1}(b_{k-1,i}) + mv_k(b_{k,j}) \right\|^2}{2\sigma^2} \right) \qquad (13)$$
where σ² is the variance of the reliability. Figure 5 shows forward motion vectors with high and low reliability. In a similar way to Fu's definition [18], the motion dissimilarity energy E_M for the ith candidate point is defined as follows:
$$E_M(k; n, i) = R(p_{k,n}) \left\| c_{k,n}(i) - p_{k,n} \right\| \qquad (14)$$
With high reliability R, E_M has a greater effect on finding the best point than E_C or E_F, since it varies sharply with the distance between the predicted point and a candidate point.

Figure 5. The reliability of forward motion vectors. The great gap between a forward
motion vector and a backward motion vector results in low reliability.
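Equations (13) and (14) can be sketched as follows; this is an illustrative Python fragment, not the thesis code, and the function names are assumed.

```python
import math

def reliability(fmv_prev, mv_cur, sigma=1.0):
    """Eq. (13): the closer fmv_{k-1} is to the reverse of mv_k, the higher R."""
    gap2 = (fmv_prev[0] + mv_cur[0]) ** 2 + (fmv_prev[1] + mv_cur[1]) ** 2
    return math.exp(-gap2 / (2 * sigma ** 2))

def motion_energy(cand, pred, fmv_prev, mv_cur, sigma=1.0):
    """Eq. (14): distance to the predicted point, weighted by reliability."""
    d = math.hypot(cand[0] - pred[0], cand[1] - pred[1])
    return reliability(fmv_prev, mv_cur, sigma) * d

# A backward vector that exactly reverses the forward one -> reliability 1.
print(reliability((3, 0), (-3, 0)))                     # 1.0
print(motion_energy((7, 4), (5, 4), (3, 0), (-3, 0)))   # 2.0
# A conflicting backward vector -> near-zero reliability, weak E_M.
print(reliability((3, 0), (3, 0)) < 1e-7)               # True
```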

3.1.5 Energy Minimization
The dissimilarity energy E_{k,n}(i) for the ith candidate point of the nth feature point is defined as follows:

$$E_{k,n}(i) = \omega_C(k)\, E_C(k;n,i) + \omega_F(k)\, E_F(k;n,i) + \omega_M(k)\, E_M(k;n,i) \qquad (15)$$

where ω_C(k), ω_F(k), and ω_M(k) are the weight factors for the texture, form, and motion dissimilarity energies. If the configuration of candidate points is denoted as I = {c_{k,1}(i_1), c_{k,2}(i_2), ..., c_{k,F}(i_F)}, the optimal configuration I_opt(k) in the kth frame is selected as the one that minimizes the total dissimilarity energy E_k(I), expressed by

$$E_k(I) = \sum_{n=1}^{F} E_{k,n}(i_n) \qquad (16)$$
When all possible configurations of candidate points are considered, the computation takes O((2M+1)^{2F}) time, which causes high computational complexity, especially for a large search region or many feature points. The amount of computation can be reduced to grow only linearly with F by the discrete multistage decision process called dynamic programming, which consists of two steps [19]:

A. The accumulated dissimilarity energy (ADE) E_local(n,i) for the ith candidate point of the nth feature point (n>0) is calculated as follows:
$$E_{local}(n,i) = \min_j \left[ E_{k,n}(i) + E_{local}(n-1,j) \right] \qquad (17)$$

The ADE for the first feature point is E_local(0,i) = E_{k,0}(i). Then, the point which minimizes the ADE is selected among the candidate points of the (n-1)th feature point; the index of this point is saved as

$$s(n,i) = \arg\min_j \left[ E_{k,n}(i) + E_{local}(n-1,j) \right] \qquad (18)$$
B. For the last feature point, the candidate point with the smallest ADE is selected as the best point o_F. Then, the best point o_n for the nth feature point is decided by backtracking as follows:

$$o_F = \arg\min_i E_{local}(F,i) \qquad \text{and} \qquad o_n = s(n+1, o_{n+1}) \qquad (19)$$

The best position of the nth feature point is f_{k,n} = c_{k,n}(o_n).
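The two-step dynamic programming of Equations (17)-(19) can be sketched in Python. In this simplified sketch E_{k,n}(i) does not depend on the previous index j, so the minimization factors out; the function name and the matrix layout of the energies are assumptions made for illustration.

```python
def best_configuration(E):
    """Dynamic programming of Eqs. (17)-(19). E[n][i] is the dissimilarity
    energy of candidate i for feature point n; returns one best candidate
    index per feature point, minimizing the total energy along the chain."""
    F, L = len(E), len(E[0])
    local = [E[0][:]]                        # ADE of the first feature point
    back = []                                # back-pointers s(n, i), Eq. (18)
    for n in range(1, F):
        prev_best = min(range(L), key=lambda j: local[-1][j])
        back.append([prev_best] * L)         # E[n][i] independent of j here
        local.append([E[n][i] + local[-1][prev_best] for i in range(L)])
    best = [0] * F
    best[-1] = min(range(L), key=lambda i: local[-1][i])   # o_F
    for n in range(F - 2, -1, -1):                         # o_n = s(n+1, o_{n+1})
        best[n] = back[n][best[n + 1]]
    return best

# Three feature points, three candidates each.
E = [[5.0, 1.0, 2.0],
     [0.5, 3.0, 0.2],
     [2.0, 0.1, 4.0]]
print(best_configuration(E))  # [1, 2, 1]
```

Enumerating all configurations would cost (2M+1)^{2F} evaluations; the chain decomposition above visits each candidate only a constant number of times per stage.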
3.1.6 Adaptive Weight Factors
Arbitrarily assigned weight factors for the texture, form, and motion dissimilarity energies can give rise to tracking errors, since the target object can have various properties. For this reason, the weight factors need to be decided adaptively according to the properties of the target object. For instance, for an object whose texture scarcely changes, the weight factor ω_C should automatically be set to a high value.
The weight factors can be automatically updated in each frame by using the neural network shown in Figure 6. The dissimilarity energy E_k is transformed to its output value E'_k by the nonlinear activation function φ. The update of the weight factors is performed by the backpropagation algorithm, which minimizes the square output error ε_k defined as follows:
$$\varepsilon_k = \frac{1}{2}\left( E_d - E'_k \right)^2 \qquad (20)$$
where E_d denotes the ideal output value. If the activation function is the unipolar sigmoid function φ(x) = 1/(1+e^{-x}), the gradient of a weight factor is calculated as
$$\Delta\omega_x(k) = \eta \left( E_d - E'_k \right) E'_k \left( 1 - E'_k \right) E_x(k) \qquad (21)$$
where x can be C (texture), F (form), or M (motion), and η is the learning constant [20].
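A single update step of Equations (20)-(21) might be sketched as follows; the dictionary-based representation of the weights and energies is an assumption made for this example, not the thesis implementation.

```python
import math

def update_weights(weights, energies, E_d, eta=0.5):
    """One backpropagation step of Eqs. (20)-(21) for the three weight
    factors. `weights` and `energies` map 'C', 'F', 'M' to numbers."""
    E_k = sum(weights[x] * energies[x] for x in weights)  # weighted sum, Eq. (15)
    out = 1.0 / (1.0 + math.exp(-E_k))                    # sigmoid output E'_k
    delta = (E_d - out) * out * (1.0 - out)               # core term of Eq. (21)
    return {x: weights[x] + eta * delta * energies[x] for x in weights}

w = {'C': 0.3, 'F': 0.3, 'M': 0.3}
e = {'C': 2.0, 'F': 0.5, 'M': 0.1}
w2 = update_weights(w, e, E_d=0.0)
# The weight tied to the dominant (texture) energy moves the most.
print(abs(w2['C'] - w['C']) > abs(w2['M'] - w['M']))  # True
```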

Figure 6. The neural network for updating weight factors.

3.2 Automatic Approach
For the automatic detection and tracking of moving objects in H.264|AVC
bitstream domain, a novel method based on the spatial and temporal macroblock
filter (STMF) is introduced. The STMF exploits macroblock types and IT coefficients, which indicate the existence of motion and the temporal texture change in a macroblock; this encoded information is used to extract foreground regions.
As depicted in Figure 7, the method is composed of two stages: the object
extraction and the object refinement. In the object extraction stage, all object re-
gions are roughly extracted by the STMF based on the occurrence probability of
the objects. The STMF first removes blocks which are judged to be background
based on macroblock types and IT coefficients, and then clusters them into sev-
eral fragments called block groups. Since some block groups can also belong to
background, it calculates the occurrence probability of each block group based
on its temporal consistency. Only block groups with high probability are consi-
dered as real objects. In the object refinement stage, the location and size of ob-
ject regions are then precisely refined by background subtraction with partial de-
coding in I-frames and motion interpolation in P-frames.


[Figure 7: Object Extraction in P-frames (Block Group Extraction, Spatial Filtering, Temporal Filtering), followed by Object Refinement (Region Prediction, Partial Decoding and Background Subtraction in the I-frame, Motion Interpolation in P-frames)]

Figure 7. A procedure of object region extraction and refinement
3.2.1 Block Group Extraction
To detect and track moving objects in surveillance videos encoded by an H.264|AVC baseline profile encoder, we assume that the surveillance camera is fixed, so that there is no camera motion, and that I-frames are periodically inserted at intervals of no more than 10 frames. It is observed that with a fixed camera, most macroblocks of the background tend to be encoded in the skip mode in P-frames, while most parts of the foreground tend to be encoded in non-skip modes. From these observations, we may consider sets of non-skip blocks as foreground candidates for moving object detection and tracking.


[Figure 8: block groups labeled F1 and B1~B8]

Figure 8. Block groups before and after spatial filtering
Figure 8 shows that the approximate foreground in a P-frame consists of a set of block groups, each of which consists of non-skip blocks connected in the horizontal, vertical, or diagonal directions. However, such a simple segmentation into block groups is not enough to define moving objects, since blocks in non-skip modes may also occur in the background, and skip-mode blocks may occur in the foreground region. For example, some macroblocks in a homogeneous region of the background are encoded as inter-coded blocks with motion vectors instead of skip-mode blocks. Likewise, when the visual change caused by object motion is negligible, the whole object or some of its parts can be encoded as skip-mode blocks. Moreover, one object region can be separated into two or more block groups which are disconnected from one another. Therefore, block grouping based on the simple classification into skip-mode and non-skip-mode blocks is not sufficient to define moving objects as ROIs. To decide whether each block group represents a real object or a part of the background, we use the spatial and temporal macroblock filter (STMF), which is applied only in P-frames. The filter consists of two modules: spatial filtering and temporal filtering.
3.2.2 Spatial Filtering
The spatial filtering removes most of the block groups in the background by using IT coefficients. That is, block groups which contain just one non-skip macroblock or which contain no non-zero IT coefficients are considered to belong to the background, since such groups tend to occur in the background rather than in the foreground. In other words, we regard as a candidate for a real object only the block groups which contain more than one non-skip macroblock and include non-zero IT coefficients. Although some block groups of a real foreground object may be misclassified as background, this rarely happens in the foreground; instead, many more block groups are removed in the background. Thus, spatial filtering offers a good chance of removing a large number of false block groups in the background.

As shown in Figure 8, nine block groups (indicated as F1 and B1~B8) are detected first in a frame. After spatial filtering, two active block groups (F1, B4) are left, while the other block groups are removed. It can be seen that most of the block groups belonging to the background consist of only a single macroblock, except B3 and B4. After spatial filtering, B3 is removed due to its all-zero IT coefficient values, but B4 survives due to its non-zero IT coefficient values. Each frame after spatial filtering can contain several active block groups, so the proposed method can support multiple-object detection and tracking.
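The spatial filtering rule can be sketched as a simple predicate over block groups. The dictionary representation (a `nonskip` macroblock count and an `it_nonzero` flag per group) is an assumption made for this illustration, with the group names taken from Figure 8.

```python
def spatial_filter(block_groups):
    """Spatial filtering sketch: keep only block groups with more than one
    non-skip macroblock AND at least one non-zero IT coefficient."""
    return {name: g for name, g in block_groups.items()
            if g['nonskip'] > 1 and g['it_nonzero']}

groups = {
    'F1': {'nonskip': 12, 'it_nonzero': True},   # the real object
    'B3': {'nonskip': 3,  'it_nonzero': False},  # removed: all-zero IT
    'B4': {'nonskip': 2,  'it_nonzero': True},   # survives (false positive)
    'B7': {'nonskip': 1,  'it_nonzero': True},   # removed: single macroblock
}
active = spatial_filter(groups)
print(sorted(active))  # ['B4', 'F1']
```

Groups like B4 that survive spatial filtering are then handed to the temporal filter described next.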
3.2.3 Temporal Filtering
The temporal filtering process further removes the block groups in the background which survive spatial filtering. The block groups that survive spatial filtering are called active block groups. The active block groups are then labeled with object IDs by object detection and tracking through temporal evolution. Each block group can be classified as a real object or as background. The active block groups for which it has not yet been determined whether they are real objects or background are called candidate objects. Hence, an active block group can be labeled as a candidate object C, a real object R, or background B.

For the classification of active block groups, a newly appeared (or detected) active block group is initially regarded as a candidate object. The candidate object is regarded as a real object when it exhibits temporal coherence, i.e., when a high occurrence probability is obtained during an observation period. The active block groups in the background tend to appear and disappear randomly in time, while those in the foreground tend to move smoothly and to appear over a relatively long period in subsequent frames.

The more frequently a candidate object occurs during a given observation period, the higher its occurrence probability becomes. The longer the observation period is, the more precise the classification becomes. The structure of temporal filtering is illustrated in Figure 9.
[Figure 9: six active group trains T^1, ..., T^6 across frames; train T^6 consists of the succeeding active block groups G^6_1 = A, G^6_2, G^6_3, ..., G^6_Φ accumulated over the observation period, after which trains are classified as real objects]

Figure 9. Temporal filtering based on the occurrence probability of active group trains
Before applying temporal filtering to an initial active block group A, it is assigned the active group train T^l, which is labeled by l and is defined as follows:

$$T^l = \left\{ G^l_i \;\middle|\; G^l_1 = A,\; i = 1, \ldots, \Phi \right\} \qquad (22)$$

where Φ indicates the length of the observation period and G^l_i, called the set of succeeding active block groups, denotes the set of active groups corresponding to A in the ith frame of the observation period:

$$G^l_i = \left\{ X \;\middle|\; X \cap G^l_{i-1} \neq \emptyset,\; X \in C_i \right\} \qquad (23)$$

where C_i denotes the set of all active block groups in the ith frame of the observation period and X is an active block group. In other words, G^l_i consists of all active block groups in the ith frame that overlap with G^l_{i-1}. If G^l_i = ∅, we let G^l_i = G^l_{i-1}, assuming that the corresponding object does not move or that there is little change in the intensity of the active block group; G^6_3 in Figure 9 corresponds to such a case.

In this way, we compute G^l_i recursively for 1 ≤ i ≤ Φ, and then obtain T^l (a sequence of G^l_i) by accumulating the initial active block group and its succeeding active block groups through all frames of the observation period.

Thereafter, in the last frame of the observation period, we calculate the occurrence probability P^l for the active group train T^l, which is defined as follows:

$$P^l = P\left( L^l = \mathrm{R} \;\middle|\; G^l_1, G^l_2, \ldots, G^l_\Phi \right) \qquad (24)$$
where L^l indicates the type assigned to the active group train T^l after the observation period. That is, P^l describes the probability that the candidate objects corresponding to the active group train T^l constitute a real object. According to Bayes' rule, we have:
$$P\left( L^l = \mathrm{R} \mid G^l_1, G^l_2, \ldots, G^l_\Phi \right) = \frac{P\left( G^l_1, \ldots, G^l_\Phi \mid L^l = \mathrm{R} \right) P\left( L^l = \mathrm{R} \right)}{P\left( G^l_1, G^l_2, \ldots, G^l_\Phi \right)} = \frac{\left[ \prod_{i=2}^{\Phi} P\left( G^l_i \mid G^l_{i-1}, \ldots, G^l_1, L^l = \mathrm{R} \right) \right] P\left( G^l_1 \mid L^l = \mathrm{R} \right) P\left( L^l = \mathrm{R} \right)}{P\left( G^l_1, G^l_2, \ldots, G^l_\Phi \right)} \qquad (25)$$

Suppose that the succeeding candidate object G^l_i in the current frame depends only on G^l_{i-1} in the previous frame. Then, we have

$$P\left( G^l_i \mid G^l_{i-1}, \ldots, G^l_1, L^l = \mathrm{R} \right) = P\left( G^l_i \mid G^l_{i-1}, L^l = \mathrm{R} \right) \qquad (26)$$

From (25) and (26), we have

$$P\left( L^l = \mathrm{R} \mid G^l_1, \ldots, G^l_\Phi \right) = \frac{\left[ \prod_{i=2}^{\Phi} P\left( G^l_i \mid G^l_{i-1}, L^l = \mathrm{R} \right) \right] P\left( G^l_1 \mid L^l = \mathrm{R} \right) P\left( L^l = \mathrm{R} \right)}{P\left( G^l_1, G^l_2, \ldots, G^l_\Phi \right)} \qquad (27)$$

Since P(L^l = R) and P(G^l_1, G^l_2, ..., G^l_Φ) reflect the nature of the scenes, that is, they are a priori probabilities, we only consider the conditional probabilities in (27). Accordingly, we judge that the active group train T^l is a real object if the following condition is satisfied:
$$-\sum_{i=2}^{\Phi} \ln P\left( G^l_i \mid G^l_{i-1}, L^l = \mathrm{R} \right) < \Omega \qquad (28)$$
where Ω is the occurrence threshold with Ω > 0. If Equation (28) does not hold, the active group train T^l is removed because it is regarded as a part of the background. If G^l_i ≠ ∅, P(G^l_i | G^l_{i-1}, L^l = R) can be calculated as follows:

$$P\left( G^l_i \mid G^l_{i-1}, L^l = \mathrm{R} \right) = \frac{n\left( G^l_i \cap G^l_{i-1} \right)}{n\left( G^l_{i-1} \right)} \qquad (29)$$

where n(G^l_{i-1}) denotes the number of macroblocks in the region of G^l_{i-1}. If G^l_i = ∅, we have

$$P\left( G^l_i \mid G^l_{i-1}, L^l = \mathrm{R} \right) = \frac{c(l)}{\Phi} \qquad (30)$$

where c(l) is the number of frames in which the succeeding candidate objects of the active group train T^l are found during the observation period.
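The decision rule of Equations (28)-(30) can be sketched as follows. Representing each succeeding group G^l_i as a set of macroblock coordinates, and the concrete threshold value, are assumptions made for this illustration only.

```python
import math

def is_real_object(train, phi, omega=2.0):
    """Eqs. (28)-(30): accept a train as a real object when the negative
    log-likelihood of its frame-to-frame overlaps stays below omega.
    `train` is a list of per-frame succeeding groups, each a set of
    macroblock coordinates; an empty set means no overlap in that frame."""
    nll = 0.0
    occupied = sum(1 for g in train if g)      # c(l) of Eq. (30)
    prev = train[0]
    for g in train[1:]:
        if g:                                  # Eq. (29)
            p = len(g & prev) / len(prev)
            prev = g
        else:                                  # Eq. (30)
            p = occupied / phi
        nll -= math.log(max(p, 1e-12))
    return nll < omega

steady = [{(0, 0), (0, 1)}, {(0, 1), (0, 2)}, {(0, 2), (0, 3)}]
flicker = [{(5, 5)}, set(), set()]
print(is_real_object(steady, phi=3))   # True: consistent overlaps
print(is_real_object(flicker, phi=3))  # False: random appearance
```

A smoothly moving object keeps large overlaps frame to frame and accumulates a small negative log-likelihood, while a flickering background group quickly exceeds the threshold.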
Once an active block group train is regarded as the motion trajectory of a real object, object tracking is performed by searching, in each subsequent frame after the observation period, for the candidate objects that overlap with the corresponding real object group in the previous frame. In this case, the train becomes a real object's train and is extended into the subsequent frames. Real object tracking is performed in the same way as candidate object tracking in Equation (23). If a real object has no succeeding candidate objects in some subsequent frame, the real object is assumed not to move, staying at its location.

When we detect and track multiple objects with active block groups, a train tangling problem may arise, in which either at least two trains are merged together, called train merging, or one train is separated into two or more individual trains, called train separation. Train merging occurs when one active block group overlaps with several candidate or real objects in the previous frame, as shown in Figure 10(a). For simplicity, we only consider train merging of two active group trains. Figure 10(b) shows train separation, where an active group train is divided into two active groups.




[Figure 10: (a) trains T^{l1} and T^{l2} merging into one active block group A; (b) train T^l separating into two active block groups A_1 and A_2]

Figure 10. Train tangling. (a) Train merging. (b) Train separation.

When two active group trains T^{l1} and T^{l2} overlap with a single active block group and their corresponding objects are both labeled as candidate objects (L^{l1} = C and L^{l2} = C), one of the two trains is removed. In the case of one real object and one candidate object, the candidate object train is removed. When both trains are real objects, the overlapped active block group is split into two active block groups, corresponding to T^{l1} and T^{l2}, respectively, each of which corresponds to a real object. That is, if both active group trains are real objects, the two objects are not merged, which means that two overlapping real objects are considered to move independently.

On the other hand, train separation occurs when several active block groups overlap with one candidate or real object in the previous frame, as shown in Figure 10(b). If the active group train T^l in Figure 10(b) in the previous frame is a candidate object, the two active block groups A_1 and A_2 overlapping with T^l are merged into one candidate object. If the active group train T^l is a real object, the two active block groups are considered independent objects: one is regarded as the real object corresponding to T^l and the other as a new candidate object.
3.2.4 Region Prediction of Moving Objects in I-frames
Finally, the location and size of a real object are determined by the rectangle that encompasses the exterior of its active block group. We define the feature vector f_{i,l} of a real object corresponding to the train T^l in the ith frame as follows:

$$f_{i,l} = \left( p_{i,l},\; h_{i,l},\; w_{i,l} \right) \qquad (31)$$

where p_{i,l} = (x_{i,l}, y_{i,l}) denotes the location of f_{i,l}, and (h_{i,l}, w_{i,l}) is the size of the object, given by its height and width.
In practice, there is some discrepancy between the rectangle box and the true object size. Therefore, the object region defined by the rectangle box must be refined every frame during object detection and tracking. For this, we employ background subtraction and motion interpolation, as shown in Figures 11 and 12. That is, we periodically update the size and location of a real object every GOP by background subtraction. The background subtraction is performed on every I-frame by comparing it with the background, and is followed by the refinement process for the real object region in the I-frame. Then motion interpolation is performed over the P-frames between the current I-frame and its previous I-frame.

Since an I-frame does not contain macroblock partition types and temporal prediction residuals, its object regions need to be estimated by projecting the real object regions in the previous P-frames onto the I-frame. The projection of a real object in a P-frame onto the next I-frame is made as follows:

$$f'_{i,l} = \left( p_{i-1,l},\; \max_{k=1,\ldots,N-1} h_{i-k,l},\; \max_{k=1,\ldots,N-1} w_{i-k,l} \right) \qquad (32)$$
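Equation (32) can be sketched as a short function over the P-frame feature vectors of a GOP; the tuple layout (x, y, h, w) and the function name are assumptions made for illustration.

```python
def project_to_iframe(p_frames):
    """Eq. (32): predict an object's region in the next I-frame from the
    feature vectors of the preceding P-frames. Each entry is (x, y, h, w);
    the location comes from the last P-frame, the size is the GOP maximum."""
    x, y = p_frames[-1][0], p_frames[-1][1]
    h = max(f[2] for f in p_frames)
    w = max(f[3] for f in p_frames)
    return (x, y, h, w)

# Three P-frames of one GOP: the object drifts and its box changes size.
print(project_to_iframe([(10, 20, 16, 12), (12, 20, 18, 10), (14, 21, 15, 11)]))
# -> (14, 21, 18, 12)
```

Taking the maximum height and width over the GOP makes the projected rectangle more likely to enclose the whole object before refinement.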


[Figure 11: four panels (a)-(d); the partially decoded area D_l and the extracted object region S_l are marked]

Figure 11. Optimizing the feature vector of an object through background subtraction in an I-frame. (a) The background image. (b) The I-frame in the original sequence. (c) A partially decoded image from the H.264|AVC bitstream. (d) A background-subtracted image.
where f'_{i,l} denotes the predicted feature vector of the object f_{i,l}, and N denotes the length of one GOP. The predicted location p'_{i,l} is the same as the location p_{i-1,l} in the previous P-frame. Likewise, the predicted height and width are determined by the respective maxima of the heights and widths over the P-frames between two consecutive I-frames (one GOP), which increases the possibility of encompassing the entire region of the real object. Then, the region estimated by the maximum height and width is partially decoded. The partially decoded regions in Figure 11(c) are subtracted from the initial background image in Figure 11(a). After subtraction, the final real object region is determined by the rectangle that most tightly encompasses the real object, as shown in Figure 11(d).
3.2.5 Partial Decoding and Background Subtraction in I-frames
In the H.264|AVC baseline profile, I-frame decoding can be performed in either 16x16 macroblock or 4x4 sub-macroblock units. To be more specific, each unit block refers to the pixels of its neighboring blocks for spatial prediction. In order to partially decode a certain block in an I-frame, its neighboring blocks need to be decoded a priori for spatial prediction. In the worst case, partially decoding the bottom-right block requires a large number of blocks to its left and above to be decoded a priori, which increases the computational complexity of partial decoding. In order to avoid this problem, we substitute the reference pixels in the neighboring blocks with the pixels of the initial background, without actual decoding. In this case, perfect reconstruction is not possible, which causes imperfect reconstruction of the blocks in each macroblock. However, we observe that this approach is reasonable for a surveillance environment with a fixed camera and no significant illumination change. The imperfect reconstruction problem can be further alleviated by comparing the difference between the partially decoded pixels and the initial background with a preset threshold, as indicated in Equation (33):

$$S_l = \left\{\, x \,:\, \left| p_l(x) - p_B(x) \right| > \Delta,\; x \in D_l \,\right\} \qquad (33)$$
where S_l is the region of the real object, found by comparing the difference between the pixels of the partially decoded region and the initial background pixels with a predefined threshold Δ. Here, p_l(x) and p_B(x) are the pixels belonging to the partially decoded region and the initial background, respectively, Δ is used to distinguish the foreground from the background, and D_l denotes the partially decoded area. Then, the size (h_{i,l}, w_{i,l}) is refined as the height and width of the rectangle box which most tightly encloses S_l, and the location p_{i,l} = (x_{i,l}, y_{i,l}) is set to the center point of that rectangle box.
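Equation (33) together with the rectangle refinement can be sketched as follows; the list-of-lists image layout and the function name are illustrative assumptions.

```python
def refine_by_subtraction(decoded, background, region, delta=10):
    """Eq. (33) plus the refinement step: threshold the difference between
    partially decoded pixels and the background inside region D_l, then
    return the tightest bounding box (x, y, h, w) around the foreground."""
    fg = [(x, y) for (x, y) in region
          if abs(decoded[y][x] - background[y][x]) > delta]
    if not fg:
        return None
    xs, ys = [p[0] for p in fg], [p[1] for p in fg]
    return (min(xs), min(ys), max(ys) - min(ys) + 1, max(xs) - min(xs) + 1)

bg = [[0] * 6 for _ in range(6)]
dec = [row[:] for row in bg]
for y in range(2, 4):            # a 2x2 bright object at (2, 2)
    for x in range(2, 4):
        dec[y][x] = 50
D = [(x, y) for x in range(6) for y in range(6)]   # decoded area D_l
print(refine_by_subtraction(dec, bg, D))  # (2, 2, 2, 2)
```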


3.2.6 Motion Interpolation in P-frames
Object tracking is performed by projecting the active block groups in the current P-frame onto the previous P-frame. In this case, it is observed that the sizes and locations of the real objects vary significantly over P-frames. In Figure 12, the sizes and locations of the regions R^t_2, R^t_3, and R^t_4 for a real object being tracked change over three P-frames. If we assume that an object moves slowly enough, with uniform motion between two successive I-frames, then the object feature vector (sizes and locations) can be linearly interpolated, so that the interpolated regions (shaded rectangles in the P-frames in Figure 12) become the final object regions in the P-frames.


[Figure 12: a GOP of frames I, P, P, P, I with tracked regions R^t_2, R^t_3, R^t_4 in the P-frames]

Figure 12. Motion interpolation. The dotted rectangle boxes are estimated simply by enclosing the active groups corresponding to the real object. These boxes are replaced by the rectangle boxes obtained through motion interpolation.
Therefore, the interpolated object feature vector in a P-frame can be computed as follows:

$$f_{i-k,l} = \frac{k}{N}\left( f_{i-N,l} - f_{i,l} \right) + f_{i,l} \qquad (34)$$

where N is the length of a GOP and k (0 < k < N) is the index of the P-frame. Note that as the length of one GOP increases, the updated feature vectors become less reliable, because the linearity assumption of uniform motion no longer holds.
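Equation (34) is plain linear interpolation between the feature vectors at the two bracketing I-frames; a minimal sketch follows, with the tuple layout (x, y, h, w) assumed for illustration.

```python
def interpolate_feature(f_i, f_i_minus_N, k, N):
    """Eq. (34): linearly interpolate the object feature vector for the
    P-frame k steps before the current I-frame i (0 < k < N)."""
    return tuple(fi + (k / N) * (fp - fi) for fi, fp in zip(f_i, f_i_minus_N))

f_cur_I = (40.0, 20.0, 16.0, 16.0)    # (x, y, h, w) at the current I-frame
f_prev_I = (20.0, 20.0, 12.0, 16.0)   # at the previous I-frame (N frames back)
print(interpolate_feature(f_cur_I, f_prev_I, k=2, N=4))
# -> (30.0, 20.0, 14.0, 16.0)
```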

IV Experiments
4.1 Semi-automatic Approach
To demonstrate the performance of the proposed semi-automatic method, the tracking results of various objects were extracted from CIF-size videos such as "Stefan", "Coastguard", and "Lovers". Each video was encoded with an IPPP GOP structure in the baseline profile, where each P-frame can reference only the immediately previous frame. Figure 13 shows the tracking results for a rigid object with slow motion in "Coastguard". Four feature points were tracked well, with the feature network keeping a uniform form. Figure 14 shows the tracking result for a deformable object with fast motion in "Stefan". We can observe that tracking is successful even though the form of the feature network changes greatly due to fast three-dimensional motion.
Figure 15 represents the visual results of partial decoding in P-frames of
Lovers when the search half interval M and the neighborhood half interval W
are assigned as 5 and 3. Only the neighborhood region of three feature points
was partially decoded. Even in a sequence Lovers with 300 frames, no track-
ing errors were found.





Figure 13. The object tracking in Coastguard with 100 frames.





Figure 14. The object tracking in Stefan with 100 frames.





Figure 15. The object tracking in Lovers with 300 frames; partially decoded regions
are shown.

Numerical tracking data for the two video samples is shown in Figures 16
to 19. In Figure 17(a) and (b), the dissimilarity energies in Coastguard are
lower than those in Stefan, which indicates that the variation of texture, form,
and motion in Coastguard is smaller than in Stefan. Figure 19(a) and (b)
shows that the forward motion vectors in Stefan are less reliable than those in
Coastguard due to Stefan's complex motion. The average reliability in
Coastguard is 93.9%, higher than the 81.7% in Stefan; this confirms that the
forward motion vector field in Stefan is less reliable than that in Coastguard.
In fact, the Stefan sequence contains a highly textured background (e.g., many
spectators) as well as a fast-moving deformable object (e.g., the tennis player),
which causes false and chaotic motion vectors during the motion estimation
process of the encoder. Even in such an intricate sequence as Stefan, the
tracking performance is satisfactory, as shown in Figure 14, since the three
traits of the target object (texture, form, and motion) are considered jointly.
Through the neural network, the squared error of the dissimilarity energy is
minimized over a few frames, as shown in Figure 18(b). When the learning
constant was set to 5, this error reached approximately zero after the 15th frame.
Moreover, the weight factors converge to optimal values, as shown in Figure
18(a). The weight factor variations and dissimilarity energies increase greatly
from the 61st frame to the 66th frame in Coastguard; this illustrates that the
weight factors are adaptively adjusted when another ship approaches.
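The adaptive weight control can be pictured as gradient descent on the squared dissimilarity energy. The sketch below assumes the energy is a weighted sum of the texture, form, and motion dissimilarity terms; the update rule, learning constant, and numbers are illustrative, not the exact network used in the thesis:

```python
def update_weights(w, d, lr=0.05):
    """One gradient-descent step on the squared dissimilarity energy.

    w  : weight factors for the texture, form, and motion terms
    d  : the corresponding dissimilarity values in the current frame
    lr : learning constant
    Returns the updated weights and the energy before the update.
    """
    energy = sum(wi * di for wi, di in zip(w, d))
    # d(energy^2)/d(w_i) = 2 * energy * d_i
    new_w = [wi - lr * 2.0 * energy * di for wi, di in zip(w, d)]
    return new_w, energy

# With fixed hypothetical dissimilarities the energy decays toward zero.
w = [1.0, 1.0, 1.0]
for _ in range(20):
    w, energy = update_weights(w, [0.4, 0.2, 0.1])
print(round(energy, 3))
```

In the actual tracker the dissimilarities change every frame, which is why the weights react when the scene changes, as in the ship-approach example above.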
When the JM10.2 reference software was used to read the H.264/AVC bit-
stream, the processing time, including the partial decoding process, in Coast-
guard is shown in Figure 16(a); in particular, the processing time increases
abruptly at every I-frame due to full decoding of I-frames. The average
processing time was about 58.9 ms per frame (17 frames per second) on a PC
with an Intel Pentium 4 CPU at 3.2 GHz and 1 GB of RAM. However, since
most of this time (about 45 ms per frame) originates in the partial decoding
process of JM10.2, the processing time can be reduced effectively by using a
decoder with a faster decoding process than JM10.2. As illustrated in Figure
16(b), when the partial decoding time, which depends on the capability of the
decoder, is excluded, the processing time is 14.2 ms per frame (70.3 frames per
second). Thus, the proposed algorithm is suitable for real-time applications.
It should be noted that the computation time varies sharply with the size
of the search range. For example, when the search half interval M is doubled
(M=10) with a margin of 10, the processing time increases nearly seven times
(430 ms per frame), as shown in [16].
In conclusion, the experimental results demonstrate that the proposed
semi-automatic method guarantees reliable performance even in scenes that
include complicated backgrounds or deformable objects. Moreover, its
processing time remains fast enough for the algorithm to be built into
applications that must work quickly, such as metadata authoring tools.


Figure 16. (a) The processing time including partial decoding in Coastguard, and
(b) the processing time excluding partial decoding.


Figure 17. Dissimilarity energies in (a) Stefan and (b) Coastguard.


Figure 18. (a) The variation of weight factors in Coastguard, and (b) the squared error
of dissimilarity energy in Stefan.


Figure 19. The average reliabilities of forward motion vectors in (a) Coastguard and
(b) Stefan.

4.2 Automatic Approach
To test the proposed automatic method for object detection and tracking in
the H.264|AVC bitstream domain, we used two sequences taken by a fixed
camera in indoor and outdoor environments. In the indoor sequence, a single
person walks along a corridor of a university building; in the outdoor sequence,
three persons enter the scene individually, without visual occlusion between
them. In neither sequence is there significant illumination change in the
background. Each sequence, at 320×240 resolution, was encoded at 30 frames
per second by the JM 12.4 reference software with the GOP structure of IPPIP
based on the H.264|AVC baseline profile. In particular, P-frames were set to
have no intra-coded macroblocks, and the length of the observation period for
temporal filtering was set to 8 frames.
To evaluate the performance of spatial filtering, we define the spatial
filtering rate as the ratio of the number of filtered block groups to the total
number of block groups; it represents how many block groups are filtered in
each frame. Figures 20(a) and 21(a) show the spatial filtering rates in each
frame, averaged over 60 frames. The average spatial filtering rate is 70.8% in
the indoor sequence and 64.2% in the outdoor sequence, which means that most
block groups are removed by the spatial filtering process.
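The spatial filtering rate defined above is straightforward to compute per frame and average; a small sketch with hypothetical per-frame counts (the function name and numbers are illustrative):

```python
def spatial_filtering_rate(num_filtered, num_total):
    """Ratio of block groups removed by spatial filtering in one frame."""
    return num_filtered / num_total if num_total else 0.0

# Hypothetical (filtered, total) block-group counts for three frames
frames = [(7, 10), (8, 10), (6, 10)]
rates = [spatial_filtering_rate(f, t) for f, t in frames]
print(f"average spatial filtering rate: {sum(rates) / len(rates):.1%}")
```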

[Figure 20 plots: (a) the spatial filtering rate versus frame number, and (b) the number
of active group trains versus frame number, marking merged active trains and the
surviving real object.]

Figure 20. The performance measurement of spatial filtering and temporal filtering in
the indoor sequence. (a) The plot of spatial filtering rates, and (b) the temporal filtering
results, in which one active group train becomes the real object.

[Figure 21 plots: (a) the spatial filtering rate versus frame number, and (b) the number
of active group trains versus frame number, marking merged active trains and the three
surviving real objects.]

Figure 21. The performance measurement of spatial filtering and temporal filtering in
the outdoor sequence. (a) The plot of spatial filtering rates, and (b) the temporal
filtering results, in which three active group trains become the real objects.


Figures 20(b) and 21(b) illustrate the results of temporal filtering in the
indoor and outdoor sequences. Among the active group trains that survive
spatial filtering, one active group train is decided to be a real object in the
indoor sequence, whereas three active group trains are decided to be real objects
in the outdoor sequence. These results exactly match the actual situations in the
two sequences. It should be noted that active group trains which are not real
objects are not always removed by temporal filtering; sometimes they are
merged into a neighboring real object. As a result, 96% of all active group
trains are removed in both sequences.
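The temporal filtering decision can be sketched as keeping only the active group trains that persist through the observation period; merging of spurious trains into a neighboring real object is not modeled here, and all names and numbers are illustrative:

```python
def temporal_filter(trains, observation_period=8):
    """Decide which active group trains are real objects.

    trains : dict mapping a train id to the list of frames in which the
             train contains an active group
    A train is kept only if it persists for at least observation_period frames.
    """
    return [tid for tid, frames in trains.items()
            if len(frames) >= observation_period]

# One long-lived train (a walking person) and two short-lived noise trains
trains = {"A": list(range(100, 160)), "B": [10, 11], "C": [40, 41, 42]}
print(temporal_filter(trains))  # -> ['A']
```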
Then, to obtain more precise object regions, we used background
subtraction and motion interpolation as explained before. Figure 11 shows the
three steps of background subtraction in I-frames: partial decoding, foreground
extraction, and optimization of object location and size. In particular, as shown
in Figure 25, partial decoding of I-frames is significantly faster than full
decoding. When we partially decoded only the object regions in the indoor
sequence, the frame rate was 49.5 frames per second, compared with 20.46
frames per second in full decoding mode. In other words, the computational
load is greatly reduced in the partial decoding mode.
It can be observed in Figure 22(a) and (b) that, before the ROI refinement,
some parts of the real object are not included in the rectangle boxes. After
background subtraction in I-frames and motion interpolation in P-frames, the refined

ROIs enclose the entire real object, as shown in Figure 22(c) and (d).
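The foreground-extraction step of the background subtraction above can be sketched as thresholding the difference between a partially decoded region and a background model, then shrinking the box to the detected foreground. `refine_roi`, the threshold, and the list-of-lists image layout are illustrative assumptions, not the thesis's actual implementation:

```python
def refine_roi(region, background, box, thresh=25):
    """Tighten a coarse object box using background subtraction.

    region, background : 2-D lists of luma values (the partially decoded
                         I-frame region and the background model)
    box                : (x, y, w, h) coarse box inside the region
    Returns a box that just encloses the foreground pixels, or the
    original box if no pixel difference exceeds the threshold.
    """
    x, y, w, h = box
    xs, ys = [], []
    for row in range(y, y + h):
        for col in range(x, x + w):
            if abs(region[row][col] - background[row][col]) > thresh:
                ys.append(row)
                xs.append(col)
    if not xs:
        return box
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)

# Hypothetical 40x40 region with a bright 10x10 object on a dark background
bg = [[0] * 40 for _ in range(40)]
fr = [[0] * 40 for _ in range(40)]
for r in range(15, 25):
    for c in range(12, 22):
        fr[r][c] = 100
print(refine_roi(fr, bg, (5, 5, 30, 30)))  # -> (12, 15, 10, 10)
```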


Figure 22. The effect of motion interpolation on correction of object trajectory. (a)-(b)
are object locations and sizes in one GOP before motion interpolation, and (c)-(d) after
motion interpolation.
Figures 23 and 24 show the object detection and tracking results for the
indoor sequence with a single moving object and the outdoor sequence with
three moving objects, respectively. The proposed method of tracking moving
objects in the H.264|AVC bitstream domain exhibits satisfactory performance
over the 720 frames of the indoor sequence and the 990 frames of the outdoor
sequence.
It can be seen in Figure 23 that when the object moves toward the camera,
the detection and tracking performance remains good even though the object
scales up on the move. Moreover, although the parts of the object (head, arms,
and legs) have different motions, the rectangle box always encloses the whole
body precisely. Likewise, even in the outdoor sequence, which contains
multiple objects (persons), the proposed method works very well, as shown in
Figure 24. Although the three persons move in different directions, the
proposed algorithm detects and tracks them separately without failure.
Computation of object detection and tracking, together with the refinement
of the resulting ROIs, involves three processes: (1) partial decoding in I-frames;
(2) extraction of MB types and IT coefficients in P-frames; and (3) object
detection and tracking. The measured time does not include the loading time of
the H.264|AVC bitstream, since that depends mostly on the decoder used. The
processing times for the two sequences are shown in Figure 25: 20.2
milliseconds per frame (49.5 frames per second) in the indoor sequence and
26.9 milliseconds per frame (37.12 frames per second) in the outdoor sequence,
on a PC with a Pentium 4 CPU at 3.2 GHz and 1 GB of RAM. The proposed
algorithm is fast enough to be applied to real-time surveillance systems.

In the case of full decoding of I- and P-frames with the proposed method,
the processing times were 48.9 milliseconds per frame (20.46 frames per
second) in the indoor sequence and 52.2 milliseconds per frame (19.17 frames
per second) in the outdoor sequence. The partial decoding approach is thus
roughly twice as fast as full decoding. The comparison between partial and full
decoding in terms of processing time is summarized in Table 2.
Table 2. The processing time of the proposed automatic method.

                       Partial decoding     Full decoding
  Indoor sequence      49.50 frames/sec     20.46 frames/sec
  Outdoor sequence     37.12 frames/sec     19.17 frames/sec

Notice that traditional pixel-domain approaches to object detection and
tracking often take extremely long. In contrast, the proposed automatic
algorithm runs in real time with fast processing times, as shown in Table 2. It is
worthwhile to compare the processing time and performance of the proposed
algorithm with the other recent technologies listed in Table 1. The proposed
algorithm has somewhat higher computational complexity than these traditional
algorithms; however, its performance is more reliable and remains robust in
various situations that those algorithms cannot handle successfully, as discussed
in Section 2.3.4.


Figure 23. The object detection and tracking results, (a)-(l), in the indoor sequence.


Figure 24. The object detection and tracking results, (a)-(l), in the outdoor sequence.

[Figure 25 plots: per-frame processing time in seconds for partial decoding versus full
decoding, with peaks at the I-frames; (a) the indoor sequence, (b) the outdoor sequence.]

Figure 25. The measurement of computational complexity. The processing time taken
for (a) the indoor sequence, and (b) the outdoor sequence.

V Conclusions and Future Work
Recently, moving object detection and tracking techniques have become a
necessary component of intelligent visual systems for surveillance or interactive
broadcasting. In this thesis, two methods for moving object detection and
tracking with partial decoding in the H.264|AVC bitstream domain are proposed:
the semi-automatic method and the automatic method. The semi-automatic
method exploits the dissimilarity minimization algorithm, which tracks feature
points adaptively according to their properties of texture, form, and motion. The
automatic method, which is suitable for surveillance with fixed cameras,
exploits several techniques such as spatial and temporal macroblock filtering,
background subtraction, and motion interpolation; in particular, it can detect and
track multiple objects at the same time. While the former utilizes only motion
vectors, the latter makes use of integer transform (IT) coefficients and
macroblock types to detect and track multiple moving objects. Unlike
traditional compressed-domain algorithms, the proposed methods exploit the
distinctive features of the encoded information in H.264|AVC bitstreams.
The main contribution of the proposed methods, above all, is that they not
only have computational complexity low enough for real-time execution but
also perform well in a wider variety of situations than those considered by
traditional algorithms. Specifically, the proposed semi-automatic method
successfully tracks a predefined target object that is deformable or moves fast
against a complex textured background, while the proposed automatic method
can detect and track moving objects that move toward or away from the camera.
It should be noted that the proposed methods include a partial decoding
process to obtain detailed texture information from H.264|AVC bitstreams.
Unlike traditional compressed-domain algorithms, partial decoding makes color
extraction and object recognition possible. In the case of MPEG-7 metadata
authoring tools, the partially decoded color information of each object can be
converted into MPEG-7 metadata for interactive broadcasting services.
Likewise, the color information can be used to distinguish one object from
others by applying pixel-domain object recognition techniques. Consequently,
the proposed methods make it possible to combine compressed-domain
techniques with pixel-domain techniques for stronger performance and
extended functionality.
In future work, the proposed methods need to be extended to other profiles
of the H.264|AVC standard, such as the Main profile as well as the Baseline
profile; in other words, B-frames should also be handled for object detection
and tracking along with I-frames and P-frames. Novel compressed-domain
techniques that deal with specific situations such as illumination change,
occlusion, and silhouettes also need to be developed for better performance. In
addition, the proposed methods can evolve into an object segmentation
technique based on partial decoding in the compressed domain; that is, partially
decoded data can be used to obtain the exact boundaries of objects at the pixel
level. Lastly, the proposed methods are beneficial to large-scale distributed
surveillance systems in which the main server processes several compressed
videos arriving from multiple cameras at the same time. Such a system needs
color information of objects in order to distinguish one object from another and
to track a target object over a network of broadly distributed cameras. Since the
proposed methods can extract color information of objects, they are applicable
to such a system, unlike traditional compressed-domain algorithms.

References
[1] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, Overview of the
H.264|AVC Video Coding Standard, IEEE Trans. Circuits Syst. Video Technol.,
vol. 13, no. 7, pp. 560-576, July 2003.
[2] A. Aggarwal, S. Biswas, S. Singh, S. Sural, and A.K. Majumdar, Object Tracking
Using Background Subtraction and Motion Estimation in MPEG Videos, ACCV
2006, LNCS, vol. 3852, pp. 121-130, Springer, Heidelberg (2006).
[3] S. Ji and H. W. Park, Moving object segmentation in DCT-based compressed vid-
eo, Electronics Letters, vol. 36, no. 21, October 2000.
[4] X. -D. Yu, L.-Y. Duan, and Q. Tian, Robust moving video object segmentation in
the mpeg compressed domain, in Proc. IEEE Int. Conf. Image Processing, 2003,
vol. 3, pp.933-936.
[5] W. Zeng, W. Gao, and D. Zhao, Automatic moving object extraction in MPEG
video, in Proc. IEEE Int. Symp. Circuits Syst., 2003, vol. 2, pp.524-527.
[6] A. Benzougar, P. Bouthemy, and R. Fablet, MRF-based moving object detection
from MPEG coded video, in Proc. IEEE Int. Conf. Image Processing, 2001, vol.
3, pp.402-405.
[7] R. Wang, H.-J. Zhang, Y.-Q. Zhang, A confidence measure based moving object
extraction system built for compressed domain, in Proc. IEEE Int. Symp. Circuits
Syst., 2000, vol. 5, pp.21-24.
[8] O. Sukmarg and K. R. Rao, Fast object detection and segmentation in MPEG
compressed domain, in Proc. TENCON 2000, vol. 3, pp.364-368.
[9] H.-L. Eng and K.-K. Ma, Spatiotemporal segmentation of moving video objects
over MPEG compressed domain, in Proc. IEEE Int. Conf. Multimedia and Expo,
2000, vol. 3, pp.1531-1534.

[10] M. L. Jamrozik and M. H. Hayes, A compressed domain video object segmenta-
tion system, in Proc. IEEE Int. Conf. Image Processing, 2002, vol. 1, pp.113-116.
[11] H. Zen, T. Hasegawa, and S. Ozawa, Moving object detection from MPEG coded
picture, in Proc. IEEE Int. Conf. Image Processing, 1999, vol. 4, pp.25-29
[12] W. Zeng, J. Du, W. Gao, and Q. Huang, Robust moving object segmentation on
H.264|AVC compressed video using the block-based MRF model, Real-Time Im-
aging, vol. 11(4), 2005, pp.290-299.
[13] R. V. Babu, K. R. Ramakrishnan, and S. H. Srinivasan, Video object segmenta-
tion: A compressed domain approach, IEEE Trans. Circuits Syst. Video Technol.,
vol. 14, no. 4, pp. 462-474, April 2004.
[14] V. Thilak and C. D. Creusere, Tracking of extended size targets in H.264 com-
pressed video using the probabilistic data association filter, EUSIPCO 2004,
pp.281-284.
[15] S. Treetasanatavorn, U. Rauschenbach, J. Heuer, and A. Kaup, Bayesian method
for motion segmentation and tracking in compressed videos, DAGM 2005,
LNCS, vol. 3663, pp.277-284, Springer, Heidelberg (2005).
[16] W. You, M.S. H. Sabirin, and M. Kim, "Moving Object Tracking in H.264/AVC
bitstream," MCAM 2007, LNCS, vol. 4577, pp.483-492.
[17] Fatih Porikli and Huifang Sun, Compressed domain video object segmentation,
Technical Report TR2005-040 of Mitsubishi Electric Research Lab, 2005.
[18] Y. Fu, T. Erdem, and A. M. Tekalp, "Tracking visible boundary of objects using
occlusion adaptive motion snake," IEEE Trans. Image Processing, vol. 9, pp.
2051-2060, Dec. 2000.
[19] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algo-
rithms. Cambridge, MA: MIT Press, 2001.
[20] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification. New York: John Wiley
& Sons, 2001.

[21] B. Yeo, B. Liu, "Rapid scene analysis on compressed video," IEEE Trans. Circuits
Syst. Video Technol., vol. 5, No. 6, pp. 533-544, December 1995.
[22] V. Mezaris, I. Kompatsiaris, E. Kokkinou, and M.G. Strintzis, "Real-time
compressed-domain spatiotemporal video segmentation," in Proc. CBMI03, Sep-
tember 2003, pp.373-380.
[23] S. Treetasanatavorn, U. Rauschenbach, J. Heuer, and A. Kaup, Stochastic motion
coherency analysis for motion vector field segmentation on compressed video se-
quences, in Proc. WIAMIS, April 2005.
[24] S. Treetasanatavorn, U. Rauschenbach, J. Heuer, and A. Kaup, Model based seg-
mentation of motion fields in compressed video sequences using partition projec-
tion and relaxation, in Proc. VCIP, July 2005, pp.111-120.
[25] V. Thilak and C.D. Creusere, Tracking of extended size targets in H.264 com-
pressed video using the probabilistic data association filter, in Proc. EUSIPCO-
2004, September 2004.
[26] H. Chen, Y. Zhan and F. Qi, Rapid object tracking on compressed video, in Proc.
2nd IEEE Pacific Rim Conference on Multimedia, pp.1066-1071, October 2001.
[27] M.J. Swain and D.H. Ballard, Color indexing, International Journal of Comput-
er Vision 7, pp.11-32, 1991.
[28] M. Vezhnevets, Face and facial feature tracking for natural Human-Computer In-
terface, in International Conference on Computer Graphics between Europe and
Asia (GraphiCon-2002), pp.86-90, September 2002.
[29] K. Schwerdt and J.L. Crowley, Robust face tracking using color, in Internation-
al Conference on Automatic Face and Gesture Recognition (AFGR2000), pp.90-
95, March 2000.
[30] G. Finlayson, S. Hordley, and P. Hubel, Color by correction: A simple, unifying
framework for colour constancy, IEEE Trans. Pattern Anal. Mach. Intell. 23,
pp.1209-1221, 2001.

[31] B.D. Zarit, B.J. Super, and F.K.H. Quek, Comparison of five color models in skin
pixel classification, in ICCV99 International Workshop on Recognition, Analysis,
and Tracking of Faces and Gestures in Real-Time Systems (RATFG-RTS99),
pp.58-63, September 1999.
[32] C. Terrillon, M. David, and S. Akamatsu, Automatic detection of human faces in
natural scene images by use of a skin color model and invariant moments, in
Third IEEE International Conference on Automatic Face and Gesture Recognition
(AFGR98), pp.112-117, April 1998.
[33] W. Lu, J. Yang, and A. Waibel, Skin-color modeling and adaptation, in Third
Asian Conference on Computer Vision (ACCV98), vol. 2, pp.687-694, January
1998.
[34] P. Perez, C. Hue, J. Vermaak, and M. Gangnet, Color-based probabilistic track-
ing, in European Conference on Computer Vision (ECCV2002), vol. 1, pp.661-
675, May-June 2002.
[35] G.R. Bradski, Real time face and object tracking as a component of a perceptual
user interface, in Workshop on Applications of Computer Vision (WACV98),
pp.214-219, October 1998.
[36] I. Haritaoglu, D. Harwood, and L.S. Davis, W4: Real-time surveillance of people
and their activities, IEEE Trans. Pattern Anal. Mach. Intell. 22, pp. 809-830,
2000.
[37] M. Kass, A. Witkin, and D. Terzopoulos, Snakes: Active contour models, Inter-
national Journal of Computer Vision 1, pp.321-331, 1988.
[38] H. Wang, J. Leng, and Z.M. Guo, Adaptive dynamic contour for real-time object
tracking, in Image and Vision Computing New Zealand (IVCNZ2002), December
2002.
[39] N. Xu and N. Ahuja, Object contour tracking using graph cuts based active con-
tours, in Proc. ICIP2002, vol. 3, pp.277-280, September 2002.

[40] N. Xu, R. Bansal, and N. Ahuja, Object segmentation using graph cuts based ac-
tive contours, in Proc. CVPR2003, vol. 2, pp.46-53, June 2003.
[41] A. Nikolaidis and I. Pitas, Probabilistic multiple face detection and tracking using
entropy measures, Pattern Recognition 33, pp.1783-1791, 2000.
[42] H. Chao, Y.F. Zheng, and S.C. Ahalt, Object tracking using the Gabor wavelet
transform and the golden section algorithm, IEEE Transactions on Multimedia 4,
pp.528-538, 2002.
[43] C. Tomasi and T. Kanade, Detection and tracking of point features, Technical
Report CMU-CS-91-132, School of Computer Science, Carnegie Mellon Univer-
sity, Pittsburgh, 1991.
[44] A. Shokurov, A. Khropov, and D. Ivanov, Feature tracking in images and video,
in International Conference on Computer Graphics between Europe and Asia
(GraphiCon-2003), pp.177-179, September 2003.
[45] P. Beardsley, P.H.S. Torr, and A. Zisserman, 3D model acquisition from extended
image sequences, in European Conference on Computer Vision (ECCV96), vol. 2,
pp.683-695, April 1996.
[46] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neu-
roscience 3, pp.71-96, 1991.
[47] H.T. Nguyen and A.W.M. Smeulders, Template tracking using color invariant
pixel features, in Proc. ICIP2000, vol. 1, pp.569-572, September 2000.
[48] H.T. Nguyen and A.W.M. Smeulders, Fast occluded object tracking by a robust
appearance filter, IEEE Trans. Pattern Anal. Mach. Intell. 26, pp.1099-1104,
2004.
[49] L.V. Tsap, D.B. Goldgof, and S. Sarkar, Fusion of physically-based registration
and deformation modeling for nonrigid motion analysis, IEEE Trans. Image
Process. 10, pp.1659-1669, 2001.
[50] Y. Wang and S. Zhu, Analysis and synthesis of textured motion, particles and

waves, IEEE Trans. Pattern Anal. Mach. Intell. 26, pp.1348-1363, 2004.
[51] T. Schoepflin, V. Chalana, D.R. Haynor, and Y. Kim, Video object tracking with a
sequential hierarchy of template deformations, IEEE Trans. Circuits Syst. Video
Technol. 11, pp.1171-1182, 2001.