
Image Analysis Towards Very Low Bitrate Video Coding

Diogo Cortez, Paulo Nunes, Manuel Menezes de Sequeira, Fernando Pereira


INSTITUTO SUPERIOR TÉCNICO - Secção de Telecomunicações
Av. Rovisco Pais, 1096 Lisboa Codex - Portugal, email: ddcf@tele1.ist.utl.pt

Abstract

Very low bitrate video coding has become, in recent years, one of the most important areas of image
communication, due to the identification of several very low bitrate applications such as mobile
videotelephony, multimedia mail, electronic newspapers, entertainment, traffic control, and interactive
databases.
Since conventional video coding techniques are reaching a saturation point, a new generation
of techniques, aiming at a deeper “understanding” of the image, is being studied. In this context,
image analysis, particularly the identification of regions or objects in images (segmentation), is an
important step in very low bitrate video coding, since it will lead to a better representation of images
and consequently to an improvement of the encoded image quality (for a fixed bitrate).
This paper describes a segmentation algorithm based on split & merge and shortest spanning
trees (SST). The image is first split according to a quad tree structure and then the resulting regions
are merged, using the SST concept, in three steps: merge, elimination of small regions and control
of the number of regions.
Results are presented for video sequences from studio and mobile videotelephony, which show the
good properties of the algorithm for application in very low bitrate video coding schemes.

Keywords: object-oriented coding, image segmentation, region growing, split & merge, shortest span-
ning trees.

1 Introduction

The recent identification of a wide variety of very low bitrate video coding applications (such as mobile
videotelephony, multimedia mail, electronic newspapers, entertainment, traffic control, and interactive
databases) made this area one of the most important in image communication. This fact led to the
creation of study groups and consortia, such as ISO/MPEG4, RACE MAVT, and COST 211, to study
and standardize very low bitrate video coding algorithms.

The very low bitrate video coding challenge must be addressed considering two different approaches in
terms of scheduling, concepts and techniques involved:

1. Conventional approach – Improvement of conventional well-known techniques may soon lead to
acceptable quality, enabling the long expected PSTN video coding standard. This approach will
further explore the conventional pixel-based techniques in two different ways:
Hybrid – Based on temporal prediction and transform coding, usually DCT. Compatible and non-
compatible extensions of the CCITT H.261 standard for very low bitrates can be considered.
Non-hybrid – Involves pixel-based well-known techniques, such as subband coding or vector
quantization.
2. Non-conventional approach – Since pixel based coding techniques are reaching a saturation
point, in terms of coding efficiency, it is urgent to initiate the study of a new generation of techniques
and concepts oriented to a higher image structural level, such as the object or the region. This will
be a long term approach, since it will take some time before the image coding community discovers
and familiarizes itself with these new concepts. The relevance of this approach is recognized by

the most important working groups in the world, which foresee a period of 3 to 5 years before the
first consistent results are obtained.

In the context of non-conventional approaches, the promising object oriented video coding strategy
considers the segmentation of images into a set of objects according to a given model (e.g. 2D or 3D
rigid or flexible objects).

Object oriented algorithms have two main blocks: analysis and synthesis. The first block analyses the
images, identifying individual objects and estimating their characteristic parameters (e.g. colour, shape,
and motion) which can then be encoded. The second block reconstructs the images from the given
(decoded) parameters, thus being an important part of the video decoder. The synthesis block is used
also in the video encoder, since temporal redundancy is usually dealt with by encoding the current image
relative to the previous decoded one. A good example of such an algorithm can be found in [8].

Image segmentation is of paramount importance in object oriented video coding. This paper presents
an image segmentation algorithm based on split & merge techniques and shortest spanning trees (SST).
The image is first split according to a quad tree structure and then the resulting regions are merged,
using the SST concept, in three steps: merge, elimination of small regions and control of the number
of regions. The split step generates an over-segmented image, but nevertheless allows a reduction in
the computational effort of the algorithm, when compared to a solution without split, i.e. where each
pixel is initially considered as an individual region. The merge step intends to merge the most similar
adjacent regions resulting from the split step, removing the false boundaries introduced by the quad tree
structure used in the split step. The next step eliminates the large number of small irrelevant regions
resulting from the merge step. These small regions, if not eliminated, lead frequently to an erroneous
final segmentation, since they have a large contrast relative to their surroundings. Small regions are thus
eliminated by merging them to their most similar neighbours. The last step is similar to the merge step,
the stopping condition being however the final number of regions. Since this step is done in a progressive
way, in terms of the decreasing number of regions, it can be seen as producing a segmentation hierarchy
with a decreasing level of detail (increasing scale).

The behaviour of the proposed algorithm and its potential for application on object based video coding
can be seen in the results presented in section 4.

2 Segmentation basics

The identification of regions or objects within an image, i.e. image segmentation, is one of the most
important steps in object-oriented image coding. Some authors [4] define image segmentation as the
process of decomposing an image into parts which are meaningful with respect to a particular application.
This interpretation usually leads to a decomposition into a set of connected and homogeneous regions
(or into a set of connected regions which are dissimilar to one another).

If I is the set of all pixels in the image, Ri (region i) is a subset of adjacent pixels (according to a
connectivity criterion), and H(·) is the homogeneity condition, then the set of N regions Ri (i = 1, · · · , N )
is a segmentation of image I if and only if:
⋃ᵢ Ri = I
Ri ∩ Rj = Ø, ∀ i ≠ j
H(Ri) = True, i = 1, · · · , N          (1)
H(Ri ∪ Rj) = False, ∀ i ≠ j with Ri adjacent to Rj

Considering a similarity criterion instead of a homogeneity criterion, with S(·, ·) as the similarity condi-
tion, the two last conditions of (1) change into:

S(Ri, Rj) = False, ∀ i ≠ j with Ri adjacent to Rj
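The conditions above can be checked mechanically on a label image. The following is a minimal sketch of our own (not part of the original algorithm), in which the homogeneity condition H(·) is supplied by the caller, e.g. as a dynamic range test; the union and disjointness conditions hold by construction of a label image, so only the two homogeneity conditions need checking:

```python
import numpy as np

def is_valid_segmentation(labels, homogeneous):
    """Check the homogeneity conditions of (1) on a label image.

    `labels` assigns every pixel an integer region id;
    `homogeneous(mask)` is a caller-supplied predicate standing in
    for H(.) (a hypothetical placeholder, e.g. a dynamic range test).
    """
    # Every region Ri must be homogeneous.
    for r in np.unique(labels):
        if not homogeneous(labels == r):
            return False
    # H(Ri U Rj) must be False for every pair of 4-adjacent regions.
    pairs = set()
    h = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    v = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    for a, b in np.vstack([h, v]):
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    for a, b in pairs:
        if homogeneous((labels == a) | (labels == b)):
            return False
    return True
```

With a dynamic range predicate, a two-region label map over a step image satisfies (1), while a single all-covering region does not.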

The existence of a wide variety of image features (e.g. shadows, texture, small contrast zones, noise,
etc.) makes it very difficult to define robust and generic homogeneity or similarity criteria. In fact, it is

possible to find a large number of interpretations for these criteria, associated, for example, with gray level
or to texture.

In [2] D. J. Granrath describes some characteristics of the human visual system (HVS) that can be useful
in image analysis systems. He notes the existence of a pervasive two-channel organization of spatial
information in the neural representation of imagery. One of these channels is spatially low pass in nature
and carries information about the local brightness average across the image, i.e. about the degree of
contrast across the image, while the other channel (band pass) carries the line and edge information.
In this context, it is useful to define homogeneity criteria with the same characteristic as the low pass
channel of the HVS. Several image features can be used with this purpose, such as the local dynamic
range variation, the local average and the local variance of gray levels.

The way homogeneity or similarity criteria are used defines the segmentation method. Two of the most
referred segmentation methods in the literature [3] are region growing and split & merge, which will be
succinctly described in the following sections.

2.1 Region growing and region merging

Region growing starts with an initial set of “seeds” (small regions, possibly only one pixel) to which
adjacent pixels are successively merged according to homogeneity or similarity criteria. New seeds can
be introduced along the process. Regions stemming from different seeds but considered to be similar, or
to result in a homogeneous region, can be merged. The process is complete when every pixel is assigned
to one of the regions and there are no similar pairs of adjacent regions, nor pairs which, when merged,
would lead to a homogeneous region. When each individual pixel is initially considered as a seed, region
growing becomes region merging.

In 1986, Morris, Lee and Constantinides presented a region merging image segmentation algorithm using
graph theory, in particular the concept of shortest spanning tree¹ [7]. The algorithm, called recursive
SST segmentation, merges two regions if they are adjacent, i.e. connected by a link in the graph (which
represents the image), and if all other pairs of adjacent regions have a higher link weight, which measures
dissimilarity between regions. The weight vi of vertex Vi is the average gray level of its pixels. The link
weights are the absolute differences between vertex weights, i.e. ei,j = |vi − vj|. After each merge, all
vertices and link weights affected by the merge are recalculated. The algorithm finishes when a predefined
number of regions (vertices) is reached.
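The SST of footnote 1 can be built with a standard Kruskal-style procedure: links are taken in increasing weight order and accepted only when they connect previously disconnected vertices. The sketch below is a simplified illustration of that construction (it is not the full recursive SST segmentation of [7], which additionally recomputes vertex and link weights after each merge):

```python
def shortest_spanning_tree(n, links):
    """Build a shortest spanning tree of a weighted graph.

    `links` is a list of (weight, i, j) tuples over `n` vertices.
    Returns the subset of links forming a spanning tree of minimum
    total weight (Kruskal's construction, with union-find).
    """
    parent = list(range(n))

    def find(a):
        # Follow parent pointers to the set representative,
        # compressing the path along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for w, i, j in sorted(links):
        ri, rj = find(i), find(j)
        if ri != rj:           # accept the link only if it creates no cycle
            parent[ri] = rj
            tree.append((w, i, j))
    return tree
```

On a small four-vertex graph, the three lightest cycle-free links are selected, giving the minimum total weight over all spanning trees.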

The computational effort required by region merging algorithms increases with the number of initial
seeds. In order to reduce this number, another class of segmentation algorithms can be used, viz. split
& merge, where pixels are initially considered in groups instead of individually.

2.2 Split & merge

In 1976, Horowitz and Pavlidis [5] developed an image segmentation algorithm combining two methods
used independently until then: region splitting and region merging. In the first phase, region splitting²,
the image is initially analyzed as a single region and, if considered non-homogeneous by a dynamic range
criterion, it is split into four regions (according to a quad tree structure). This algorithm is recursively
applied to each of the resulting regions, until the homogeneity criterion is fulfilled. At the end of the
¹ A weighted graph, G = (V, E), is composed of a set of vertices connected by links, where Ei,j is the link that connects
vertices Vi and Vj, and vi, vj and ei,j are the weights of the vertices and link, respectively. If every vertex has a link to
all other vertices, the graph is complete. A partial graph has the same number of vertices but only a subset of the links of
the original graph. A chain is a list of successive vertices in which each vertex is connected to the next by a link in the
graph. A cycle is a chain whose end links meet at the same vertex. A tree is a connected set of chains such that there are
no cycles. A spanning tree is a tree which is also a partial graph. If the sum of all link weights of a spanning tree is
minimum over all possible spanning trees, it is called a shortest spanning tree (SST). An image can be mapped onto a
weighted graph by making vi = y(x, y) and ei,j = f(vi, vj), where y(x, y) is the gray level of point (x, y) in the image.
² Horowitz and Pavlidis actually define the first phase as the split & merge phase and the second as the grouping phase.

However, the first one can be simplified (though this can make it less computationally efficient) to a simple split if one
starts by considering the entire image (level 0). The name of the second phase has been changed from grouping to merging
because the concepts have shifted during the last years.

split phase the regions correspond to the leaves of a tree (figure 1). If split were the only phase of the
segmentation algorithm, the segmented image would have many false boundaries. The second phase of
the algorithm is region merging (formerly grouping phase), where pairs of adjacent regions are analyzed
and merged if their union satisfies the homogeneity criterion.

Figure 1: Image split using a quad tree structure.

Several problems may occur in split & merge algorithms, namely artificial or badly located region bound-
aries. These problems usually stem from the split criterion used, which is thus determinant for the final
segmentation quality.

In 1990, Pavlidis and Liow [9] presented a method that combines split & merge with edge detection to
avoid the typical split & merge problems (e.g. boundaries that do not correspond to edges and have no
edges nearby; boundaries that correspond to edges but do not coincide with them; edges with
no boundaries near them). The method is applied to the possibly over-segmented image resulting from
the split & merge algorithm. It is based on criteria that integrate contrast with boundary smoothness,
variation of the image gradient along the boundary, and a criterion that penalizes for the presence of
artifacts reflecting the quad tree structure used during image splitting.

3 Segmentation algorithm

As was referred in section 1, the algorithm presented in this paper uses split & merge techniques together
with the SST concept. It consists of three main stages, the last one, related to the post-processing of
boundaries, being only suggested:

1. Image simplification.
2. Feature extraction.
3. Post-processing of boundaries.

This algorithm can be used as the base of a hierarchical segmentation algorithm in a very low bitrate
video coding system such as the one described in [10].

3.1 Image simplification

Segmentation itself can be considered as a simplification process, its first objective being to obtain a
simpler representation of the original image. Since it is generally not possible to achieve high compression
factors without loss of information, it is convenient to eliminate small details which are perceptually less
relevant from the HVS point of view, and thus convey little or no information. The purpose of this
stage is hence to pre-process (simplify) the original image in order to eliminate the mentioned irrelevant
information. This also considerably reduces the computational load of the second algorithm stage (feature
extraction), see table 1 (section 4.1).

A relatively easy method of simplifying an image is by using low pass filters, such as median or mean
filters with appropriate windows. However, these filters have some drawbacks; for instance, they can

attenuate edges or modify their positions. These effects are particularly negative if segmentation is based
on contrast features.

Mathematical morphology [3] proposes some good simplifying tools without the mentioned drawbacks.
This work uses the morphological Open–Close operator, i.e. the successive application of the Erosion–
Dilation (Opening) and Dilation–Erosion (Closing) operators, with an n × n planar structuring element. An
application of this operator can be viewed in figure 2. Though this operator produces acceptable results,
improvements may be obtained using more complex operators, such as the opening by reconstruction or
the closing by reconstruction [1].
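For a flat (planar) structuring element, gray-level erosion and dilation reduce to local minima and maxima over the n × n window, so the Open–Close simplification can be sketched in a few lines. The shift-and-stack implementation below, with edge-replicated borders, is our own assumption, not the implementation used in this work:

```python
import numpy as np

def _local_extreme(img, n, op):
    """Local min/max over an n x n window (flat structuring element);
    the image border is handled by edge replication."""
    r = n // 2
    p = np.pad(img, r, mode='edge')
    h, w = img.shape
    shifts = [p[dy:dy + h, dx:dx + w] for dy in range(n) for dx in range(n)]
    return op(np.stack(shifts), axis=0)

def open_close(img, n=3):
    """Morphological Open-Close simplification with a flat n x n
    structuring element: opening (erosion then dilation) followed by
    closing (dilation then erosion)."""
    erode = lambda x: _local_extreme(x, n, np.min)
    dilate = lambda x: _local_extreme(x, n, np.max)
    opened = dilate(erode(img))          # opening removes bright details
    return erode(dilate(opened))         # closing removes dark details
```

On a flat test image, a single bright or dark impulse smaller than the structuring element is removed, which is exactly the simplification behaviour wanted here; unlike median or mean filtering, large flat edges are left in place.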


Figure 2: MAVT Foreman (image 0): a) Original; b) Simplified using a planar 3 × 3 structuring element.

3.2 Feature extraction

The purpose of this stage is to obtain an approximate segmentation of the input image, which is refined in
the post-processing of boundaries stage. In this work, however, post-processing of boundaries is included
implicitly in feature extraction, which is thus the main segmentation stage of the algorithm. Feature
extraction is divided in the following steps:

1. Split.
2. Merge.
3. Elimination of small regions.
4. Control of the number of regions.

3.2.1 Split

During this step, the image is recursively split into smaller regions according to a quad tree structure.
This means that each region is either split into four or remains intact. The splitting decision can be
based on two types of criteria:

Homogeneity: A region R is split if it is not homogeneous, otherwise it remains intact (e.g. for the
dynamic range criterion, R is split if max{R} − min{R} ≥ ts, where max{R} and min{R} are,
respectively, the maximum and the minimum gray level values of the pixels belonging to R; for the
variance criterion, R is split if σ²(R) ≥ ts, where σ²(R) is the variance of R).
Similarity: A region R is split unless all regions resulting from the partition of that region are similar
(e.g. for the average criterion, R remains intact if ∀ i, j ∈ {1, · · · , 4}: |µ(i) − µ(j)| ≤ ts, where µ(k) is
the average of the grey levels of region R(k), partition k of R).

Due to the format of the images used in this work – Common Intermediate Format (CIF)³ – the algorithm
starts with 32 × 32 blocks instead of the whole image, since using square blocks with power of two side
lengths allows splitting down to pixel level.

The main purpose of this step is to reduce the computational load of the merge step and hence of the
overall algorithm: the smaller the initial number of regions for the merge steps, the smaller the total
computational load will be. Thus, it is necessary to establish a compromise between the computational
effort and the final segmentation quality, which is very dependent on the split criteria. Usually the split
step output is an over-segmented image with a lot of small regions and false boundaries reflecting the
quad tree structure used.
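The split step can be sketched as a short recursion. The version below uses the dynamic range criterion on a single square block (our illustration; in this work the recursion starts from 32 × 32 blocks, and the threshold ts is a parameter):

```python
import numpy as np

def quadtree_split(img, ts, x=0, y=0, size=None):
    """Recursive quad tree split with the dynamic range criterion:
    a block is split into four while max - min >= ts and it is larger
    than one pixel. Returns the leaf blocks as (x, y, size) tuples."""
    if size is None:
        size = img.shape[0]   # assumes a square block with power-of-two side
    block = img[y:y + size, x:x + size]
    if size > 1 and block.max() - block.min() >= ts:
        h = size // 2
        return (quadtree_split(img, ts, x, y, h)
                + quadtree_split(img, ts, x + h, y, h)
                + quadtree_split(img, ts, x, y + h, h)
                + quadtree_split(img, ts, x + h, y + h, h))
    return [(x, y, size)]
```

A homogeneous block stays intact as a single leaf; a block with a contrasted quadrant is split once and each homogeneous quadrant becomes a leaf, mirroring the tree of figure 1.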

3.2.2 Merge

As happened with region splitting, region merging can be based on homogeneity or similarity:

Homogeneity: Two regions are merged if the resulting region is homogeneous (e.g. for the dynamic
range criterion, two regions Ra and Rb are merged if max{Ra ∪ Rb} − min{Ra ∪ Rb} ≤ tm, i.e. if
Ra ∪ Rb is homogeneous; for the variance criterion, Ra and Rb are merged if σ²(Ra ∪ Rb) ≤ tm).
Similarity: Two regions are merged if they are similar (e.g. for the average criterion, two regions are
merged if |µRa − µRb| ≤ tm, where µRk is the average of region Rk).

The objective of this step is to merge the most similar adjacent regions resulting from the split step.
Some of these regions should have been considered as a single region but were separated due to the quad
tree structure used. This step is characterized by the order of region merging, which is determinant
for the final segmentation attained. In this algorithm region merging is done by merging successively
the two most similar adjacent regions or the two adjacent regions which when merged lead to the most
homogeneous region, following ideas in [7, 6]. This step stops when the merging criterion fails for every
pair of adjacent regions.
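On a region adjacency graph, the merge step can be sketched as the loop below (average similarity criterion only; the dictionary-based graph representation and the linear scan for the best pair are our assumptions, a priority queue would be used in practice):

```python
def merge_similar(adjacency, means, sizes, tm):
    """Merge step sketch: repeatedly merge the two most similar adjacent
    regions (average criterion, |mu_a - mu_b| <= tm), recomputing the
    merged region's mean, until no adjacent pair satisfies the criterion.

    `adjacency` maps region ids to sets of neighbouring ids;
    `means`/`sizes` give each region's gray-level average and pixel count.
    """
    while True:
        candidates = [(abs(means[i] - means[j]), i, j)
                      for i in adjacency for j in adjacency[i] if i < j]
        if not candidates:
            break
        d, i, j = min(candidates)
        if d > tm:            # most similar pair already fails the criterion
            break
        # Merge region j into region i, updating the weighted mean.
        total = sizes[i] + sizes[j]
        means[i] = (means[i] * sizes[i] + means[j] * sizes[j]) / total
        sizes[i] = total
        # Transfer j's neighbours to i.
        for k in adjacency.pop(j):
            adjacency[k].discard(j)
            if k != i:
                adjacency[k].add(i)
                adjacency[i].add(k)
        del means[j], sizes[j]
    return means
```

The same loop with a fixed target number of regions as the stopping condition gives the control-of-the-number-of-regions step described below.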

3.2.3 Elimination of small regions

Many small, perceptually irrelevant, regions result from the previous step. These regions are usually
very contrasted to the surroundings, and hence could not be merged into the bigger, more perceptually
relevant, neighbours (using the similarity or homogeneity criterion). These small regions, if not dealt
with appropriately, usually lead to an erroneous final segmentation after the last algorithm step (control
of the number of regions): the majority of the final regions end up small and irrelevant, while the most
perceptually relevant ones are merged into one another or into the background.

In this algorithm, any region smaller than 0.004% of the total image area is eliminated. Afterwards,
regions smaller than 0.02% of the total image area are eliminated by growing order of size⁴, but only
while the overall area of eliminated regions is less than 10% of the image area. The elimination is always
done by merging the small regions to the most similar adjacent region or to the region leading to the
greatest homogeneity.
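A sketch of the second, area-capped elimination pass follows (the first, unconditional pass and the tie rule of footnote 4 are omitted; the thresholds are parameters, and the graph representation is the same hypothetical one used above):

```python
def eliminate_small_regions(adjacency, means, sizes, min_area, image_area,
                            max_eliminated_frac=0.10):
    """Small-region elimination sketch: regions smaller than `min_area`
    are merged, in growing order of size, into their most similar
    adjacent region (average criterion), while the overall eliminated
    area stays below `max_eliminated_frac` of the image."""
    eliminated = 0
    while True:
        small = sorted((sizes[i], i) for i in sizes if sizes[i] < min_area)
        if not small:
            break
        area, j = small[0]
        if eliminated + area > max_eliminated_frac * image_area:
            break
        # Most similar neighbour by the average criterion.
        i = min(adjacency[j], key=lambda k: abs(means[k] - means[j]))
        total = sizes[i] + sizes[j]
        means[i] = (means[i] * sizes[i] + means[j] * sizes[j]) / total
        sizes[i] = total
        eliminated += area
        for k in adjacency.pop(j):
            adjacency[k].discard(j)
            if k != i:
                adjacency[k].add(i)
                adjacency[i].add(k)
        del means[j], sizes[j]
    return sizes
```

Note that the small region is absorbed by its most similar neighbour rather than by the first neighbour found, which is what prevents a contrasted speck from corrupting a perceptually relevant region.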

3.2.4 Control of the number of regions

This is the final step of the feature extraction stage. Its main objective is to control the segmentation
result in terms of the final number of regions. It is similar to the merge step in the way regions are
merged, but now the process stops when the desired number of regions is attained. Since this step is
done in a progressive way, in terms of the decreasing number of regions, it can be seen as producing a
segmentation hierarchy with a decreasing level of detail (increasing scale).
³ 352 × 288 pixels for the luminance and 176 × 144 pixels for the chrominances.
⁴ In case of a tie, the small region which has the “closest” adjacent region, in terms of similarity or resulting homogeneity,
is chosen for elimination.

3.3 Post-processing of boundaries

The post-processing of boundaries was not implemented in this study; it is however mentioned here
because it seems to be of great relevance and will be studied further in the future. This final stage
can be seen as a post-processing of segmentation results or as a definition of regions based on a previous
estimation of their position. An example of the first case can be found in [9], where good results are
presented for simple images.

4 Results and discussion

The results presented in this section are based on two typical video telephony sequences, viz. MAVT
Foreman (mobile videotelephony) and Claire (studio videotelephony). These sequences clearly illustrate
the performance of this algorithm and its potential for application in very low bitrate video coding.

The segmentation results presented were obtained using the dynamic range (DR) homogeneity criterion
and the average (AVG) similarity criterion.

4.1 Simplification

The simplification stage introduces an important trade-off between computational load and segmentation
quality. Too much simplification reduces the computational load but also degrades the final segmentation
quality. On the other hand, a carefully chosen simplification degree may in fact improve the segmentation
results by reducing the effect of undesirable image features, such as noise.

Notice, however, that the desired segmentation quality depends heavily on the application at hand. In
the very low bitrate video coding framework, for instance, some quality degradation may be acceptable
or even desirable. For CIF resolutions and for fairly high quality standards, good results can be obtained
using morphological open-close filters with a planar 3 × 3 structuring element, see figure 2 (section 3.1).

The computational load reduction with increasing simplification can be seen in table 1. It presents the
algorithm running times⁵ for image 0 of MAVT Foreman, in three different situations: no simplification,
simplification using a planar 3 × 3 structuring element, and simplification using a planar 5 × 5 structuring
element.

Running times – MAVT Foreman (image 0)


Simplification          Number of regions                  Time
                1st Step  2nd Step  3rd Step  4th Step   mm:ss.dd
      —          32901      4831       306       20      44:10.06
    3 × 3        19119      2204       232       20      13:24.60
    5 × 5        15942      1762       193       20       9:04.75

Table 1: MAVT Foreman (image 0): 1st Step – Split; 2nd Step – Merge; 3rd Step – Elimination of the
small regions; 4th Step – Control of the number of regions.

4.2 Split and merge criteria comparison

The split step is decisive for the reduction of the computational load of the algorithm. Moreover, the
split criterion used is determinant to the final segmentation result. From the two homogeneity/similarity
criteria tested, dynamic range is markedly better for image splitting, as can be seen in figure 3⁶.

As can be seen, the average criterion leads to the disappearance of a few relevant details. Two reasons
concur for this. The first is that often a small number of contrasted pixels inside a split block have a
small influence on the block's average. The second is that a contrasted object inside a split block often
has the same influence on the averages of each of its sub-blocks, thus preventing it from being split.

⁵ These running times were obtained on a SUN SPARCstation 10, with the gprof command.
⁶ In figures 3, 4, 5, and 6 each region is filled with the corresponding grey level average.

The dynamic range criterion is more sensitive to noise than the average criterion, but this is not very
problematic at the split step. However, concerning the merge steps, the sensitivity to noise of the
dynamic range criterion usually leads to poor results. The average criterion, on the other hand, has a
very robust behaviour in noisy situations.

Figure 4 shows the same images as in figure 3 after the first merge step. This figure shows the great
influence that the split criterion has on the segmentation results, and also shows the better performance
attained using the dynamic range split criterion. In particular, figure 4 b) clearly shows artifacts resulting
from the quad tree structure (e.g. top right of Claire's hair and coat lapels).


Figure 3: Claire (image 17, no simplification) – End of the split step (ts = 12): a) DR – 9828 regions;
b) AVG – 6822 regions.


Figure 4: Claire (image 17) – End of the merge step (ts = 12 and tm = 10): a) DR–AVG – 1722 regions;
b) AVG–AVG – 1600 regions.

4.3 Elimination of small regions

The elimination of small regions is a fundamental step of this algorithm, since it substantially reduces
irrelevant information. Figure 5 compares results of running the algorithm with and without elimination
of small regions (for the same segmentation parameters). Both images have 100 regions. It can be seen
that the elimination of small regions substantially improves the segmentation quality.


Figure 5: MAVT Foreman (image 0, 3 × 3 simplification) – DR–AVG (100 regions, ts = 12 and tm = 10):
a) with and b) without elimination of small regions.

4.4 Control of the number of regions

This step controls the final number of segmentation regions, hence implicitly controlling the final level of
detail, or scale, of the segmentation. Figure 6 shows four different levels of detail for image 17 of Claire,
using the dynamic range criterion to split the image and the average criterion to merge the resulting
regions.

5 Future work

Although the implemented segmentation algorithm leads to good results, some topics still need to be
investigated:

• Implementation of filters by reconstruction for image simplification.


• Use of the algorithm’s hierarchical output for hierarchical video coding.
• Development of homogeneity criteria able to cope with textures.
• Development of merging techniques which take the regions’ geometric and topological information
into account, since the simple adjacency criterion seems to be insufficient.
• Investigation of methods able to automatically choose the final number of regions depending on
image content and on the available resources (e.g. total number of bits available for the current
image in the framework of video coding).

Acknowledgments

The authors would like to thank the RACE programme and JNICT for their support to this work.

References
[1] José Crespo, Jean Serra, and Ronald W. Schafer. Image segmentation using connected filters. In Jean
Serra and Philippe Salembier, editors, Proceedings of the International Workshop on Mathematical
Morphology and its Applications to Signal Processing, pages 52–57, Barcelona, May 1993. UPC,
EMP and EURASIP, UPC Publications Office.


Figure 6: Claire (image 17, no simplification) – DR–AVG: a) 24 regions; b) 20 regions; c) 12 regions;
d) 4 regions.

[2] Douglas J. Granrath. The role of human visual models in image processing. Proceedings of the
IEEE, 69(5):552–561, May 1981.
[3] Robert M. Haralick and Linda G. Shapiro. Computer and Robot Vision, volume I. Addison-Wesley
Publishing Company, Inc., Reading, Massachusetts, 1992.
[4] Robert M. Haralick and Linda G. Shapiro. Computer and Robot Vision, volume II. Addison-Wesley
Publishing Company, Inc., Reading, Massachusetts, 1993.
[5] Steven L. Horowitz and Theodosios Pavlidis. Picture segmentation by a tree traversal algorithm.
Journal of the Association for Computing Machinery, 23(2):368–388, April 1976.
[6] Murat Kunt, Michel Bénard, and Riccardo Leonardi. Recent results in high-compression image
coding. IEEE Transactions on Circuits and Systems, 34(11):1306–1336, November 1987.
[7] O. J. Morris, M. de J. Lee, and A. G. Constantinides. Graph theory for image analysis: an approach
based on the shortest spanning tree. IEE Proceedings – Part F, 133(2):146–152, April 1986.
[8] Hans Georg Musmann, Michael Hötter, and Jörn Ostermann. Object-oriented analysis-synthesis
coding of moving images. Signal Processing: Image Communication, 1(2):117–138, October 1989.
[9] Theodosios Pavlidis and Yuh-Tay Liow. Integrating region growing and edge detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 12(3):225–233, March 1990.
[10] Luis Torres, Philippe Salembier, Ferran Marqués, and Pere Hierro. Image coding for storage and
transmission based on morphological segmentation. In Rudy A. Mattheus, André J. Duerinckx, and
Peter J. van Otterloo, editors, Video Communications and PACS for Medical Applications, volume
Proc. SPIE 1977, pages 304–315, Berlin, April 1993. EOS and SPIE, SPIE – The International
Society for Optical Engineering.

