Content-Based Retrieval
Mais M. Fatayer
Amman – Jordan
Email: mais.fatayer@gmail.com
Introduction
Traditional database management systems cannot handle the demands of managing multimedia data. With the rapid growth of multimedia platforms and the World Wide Web, database management systems must now process, store, index, and retrieve alphanumeric data, bitmapped and vector-based graphics, and video and audio clips, both compressed and uncompressed.
Before the emergence of content-based retrieval, media was annotated with text, allowing it to be accessed by text-based searching. Through textual description, media can be managed based on classification of subject or semantics. This hierarchical structure allows users to easily navigate and browse, and to search using standard Boolean queries. However, with the emergence of massive multimedia databases, traditional text-based search suffers from the following limitations:
- Manual annotations require too much time and are expensive to implement. As the number of media items in a database grows, the difficulty of finding desired information increases, and it becomes infeasible to manually annotate all attributes of the media content. Annotating a sixty-minute video containing more than 100,000 images consumes a vast amount of time and expense.
Applications
Content-based retrieval has been proposed by different communities for various applications. These include:
Medical diagnosis: The number of digital medical images used in hospitals has increased tremendously. Since images with similar pathology-bearing regions can be found and interpreted, those images can be applied to aid diagnosis through image-based reasoning. For example, Wei & Li (2004) proposed a general framework for content-based medical image retrieval and constructed a retrieval system for locating digital mammograms with similar pathological parts.
Images are stored in a database in raw form, as a set of pixels or cell values, or in compressed form to save space. Each image is represented as a grid of cells. There are several approaches to image indexing and retrieval.
The first approach is attribute-based: the image contents are modeled as a set of attributes extracted manually and managed within the framework of conventional DBMSs, and queries are specified using these attributes. Examples of such attributes are image file name, image category, date of creation, subject, author, and image source. However, database attributes may not be able to describe the image contents completely. Another problem is that the types of queries are limited to those attributes.
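The attribute-based approach can be sketched as a simple filter over manually assigned metadata; the field names and records below are illustrative, not taken from any specific DBMS:

```python
# Attribute-based image retrieval: images are described only by
# manually assigned metadata fields, and queries filter on those fields.
images = [
    {"file": "beach01.jpg", "category": "landscape", "author": "Lee", "year": 2003},
    {"file": "xray17.png",  "category": "medical",   "author": "Wei", "year": 2004},
    {"file": "city04.jpg",  "category": "landscape", "author": "Wei", "year": 2002},
]

def query(images, **conditions):
    """Return images whose metadata matches every given attribute."""
    return [img for img in images
            if all(img.get(k) == v for k, v in conditions.items())]

print(query(images, category="landscape", author="Wei"))
```

Note that such a query can only match what the annotator recorded: nothing about the actual pixel content is searchable, which is precisely the limitation described above.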
The second approach, feature-extraction/object-recognition, depends on a subsystem to automate feature extraction and object recognition. The limitations of this approach are that it is computationally expensive, difficult to implement, and tends to be domain specific.
Another method is annotating images with high-level features and using IR techniques to carry out retrieval. Text describes the high-level features contained in the images, and for retrieval this approach uses relevance feedback and domain knowledge, which can overcome some problems of incompleteness and subjectivity.
Finally, a low-level feature method can be used to index and retrieve images.
In practice, the second and fourth approaches have provided good performance; however, the second approach is not applicable to general applications.
In the following sections, low-level feature techniques combined with text-based retrieval are described in more detail, covering methods based on color, shape, and texture. In practice, text-based and low-level feature-based techniques are combined to achieve high relative performance.
The advantage of text-based image retrieval is that it captures high-level abstractions and concepts, such as "smile" and "happy", contained in images. However, it cannot retrieve images by example, and some high-level features, such as shape and texture, are difficult to describe.
For example, suppose we have three images of 8×8 pixels, where each pixel is one of the eight colors C1 to C8. Image 1 has 8 pixels in each of the eight colors; image 2 has 7 pixels in each of colors C1 to C4 and 9 pixels in each of colors C5 to C8; image 3 has 2 pixels in each of colors C1 and C2 and 10 pixels in each of colors C3 to C8. Then we have the following three histograms:
H1 = (8, 8, 8, 8, 8, 8, 8, 8)
H2 = (7, 7, 7, 7, 9, 9, 9, 9)
H3 = (2, 2, 10, 10, 10, 10, 10, 10)
Therefore, images 1 and 2 are most similar and images 1 and 3 most different according to the histogram.
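This ranking can be checked with the city-block (L1) distance between histograms, one common choice among several possible similarity measures:

```python
def l1_distance(h1, h2):
    """City-block distance: sum of absolute bin-by-bin differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

H1 = (8, 8, 8, 8, 8, 8, 8, 8)
H2 = (7, 7, 7, 7, 9, 9, 9, 9)
H3 = (2, 2, 10, 10, 10, 10, 10, 10)

print(l1_distance(H1, H2))  # 8  -> images 1 and 2 are closest
print(l1_distance(H1, H3))  # 24 -> images 1 and 3 are farthest
print(l1_distance(H2, H3))  # 20
```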
There are generally two types of shape representations: contour-based and region-based. Contour-based methods need extraction of boundary information, which in some cases may not be available. Region-based methods, however, do not necessarily rely on shape boundary information, but they do not reflect local features of a shape. Therefore, for generic purposes, both types of shape representation are necessary. Several shape descriptors have been widely adopted for CBIR, including Fourier descriptors (FD) and grid descriptors (GD).
Not all six features are used in texture-based image retrieval systems. For example, in the QBIC system, texture is described by coarseness, contrast, and directionality. Retrieval is based on similarity instead of exact match.
The following section describes a shot-based video indexing and retrieval technique:
[Figure: a video stream segmented into consecutive shots]
1- Segment the video into shots.
2- Index each shot. The common approach is to first identify key frames, or representative frames (r frames), for each shot, then use an image indexing method (described before).
3- Apply a similarity measurement between the query and the video shots, and retrieve shots with high similarities. This is achieved by using the image retrieval methods based on the indexes or feature vectors obtained in step 2.
A simple approach compares the gray-level histograms of consecutive frames:

SD_i = Σ_{j=1}^{G} |H_i(j) − H_{i+1}(j)|

where H_i(j) denotes the histogram of the ith frame and j is one of the G possible gray levels. If SD_i is larger than a predetermined threshold, a shot boundary is declared.
Another simple but more effective approach compares histograms based on a color code derived from the R, G, and B components. This measurement is called the χ² test; here, j denotes a color code instead of a gray level. In this technique, selection of appropriate threshold values is a key issue in determining the segmentation performance.
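The pairwise histogram comparison can be sketched as follows; the toy frame histograms and the threshold value are illustrative:

```python
def histogram_difference(h1, h2):
    """SD between two frames: sum of absolute bin differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_shot_boundaries(frame_histograms, threshold):
    """Declare a boundary between frames i and i+1 when SD_i > threshold."""
    boundaries = []
    for i in range(len(frame_histograms) - 1):
        sd = histogram_difference(frame_histograms[i], frame_histograms[i + 1])
        if sd > threshold:
            boundaries.append(i + 1)  # first frame of the new shot
    return boundaries

# Toy sequence: a large histogram change between frames 2 and 3.
frames = [(8, 8, 8), (8, 7, 9), (8, 8, 8), (1, 2, 21), (1, 3, 20)]
print(detect_shot_boundaries(frames, threshold=10))  # [3]
```

As the text notes, the result is sensitive to the threshold: raising it misses real cuts, lowering it produces false detections.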
Fade-in is when a scene gradually appears; fade-out is when a scene gradually disappears. Dissolve is when one scene gradually disappears while another gradually appears. Wipe is when one scene gradually enters across the frame while another gradually leaves.
In such operations, the difference values tend to be higher than those within a shot but significantly lower than the shot threshold. Here a single threshold does not work, because to capture these boundaries the threshold must be lowered significantly, causing many false detections.
To overcome this situation, Zhang et al. developed a twin-comparison technique that can detect both normal camera breaks and gradual transitions. This technique requires the use of two difference thresholds:
Tb: used to detect normal camera breaks.
Ts: a lower threshold used to detect potential frames where a gradual transition may occur.
During the shot boundary detection process, consecutive frames are compared using one of the previously described methods. If the difference is larger than Tb, a shot boundary is declared. If the difference is less than Tb but larger than Ts, the frame is declared a potential transition frame. The frame-to-frame differences of consecutively occurring potential transition frames are then accumulated. If the accumulated difference is larger than Tb, a transition is declared and the consecutive potential frames are treated as a special segment. Here, the accumulated difference is only computed while the frame-to-frame difference remains larger than Ts.
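The twin-comparison procedure can be sketched as follows, operating on a sequence of precomputed frame-to-frame differences; the threshold values and difference sequence are illustrative:

```python
def twin_comparison(diffs, tb, ts):
    """Classify frame-to-frame differences into camera breaks and
    gradual transitions using two thresholds (tb > ts).

    diffs[i] is the difference between frames i and i+1. Returns
    (breaks, transitions): cut positions, and (start, end) index
    pairs of detected gradual transitions.
    """
    breaks, transitions = [], []
    acc, start = 0.0, None
    for i, d in enumerate(diffs):
        if d > tb:                      # sharp change: normal camera break
            breaks.append(i + 1)
            acc, start = 0.0, None
        elif d > ts:                    # potential transition frame
            if start is None:
                start = i
            acc += d
            if acc > tb:                # accumulated change exceeds Tb
                transitions.append((start, i + 1))
                acc, start = 0.0, None
        else:                           # difference fell below Ts: reset
            acc, start = 0.0, None
    return breaks, transitions

diffs = [1, 2, 25, 1, 6, 7, 8, 1, 2]   # Tb = 20, Ts = 5
print(twin_comparison(diffs, tb=20, ts=5))
```

With these values, the single large difference (25) is detected as a camera break, while the run of moderate differences (6, 7, 8) accumulates past Tb and is detected as a gradual transition.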
To determine how many r frames should be used, a number of methods have been proposed:
1- Use one r frame per shot. However, this method does not consider the length and content changes of shots.
2- Assign the number of r frames to shots according to their length: one r frame represents each second or less of a shot, so if a shot is longer than one second, one r frame is assigned to each second of the video. This method can partially overcome the limitation of the first method, but it ignores shot contents.
3- Divide a shot into subshots, or scenes, and assign one r frame to each subshot. A subshot is detected based on changes in content, determined from motion vectors, optical flow, and frame-to-frame differences.
Among the above methods it is hard to determine which is best, as the choice of r frames is application dependent.
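Method 2, one r frame per second (or part thereof) of shot length, can be sketched as follows; choosing the first frame of each second as the r frame is an assumption, since the text does not say which frame within a second to pick:

```python
def r_frames_per_second(shot_start, shot_end, fps=25):
    """One r frame per second (or part thereof) of a shot.

    shot_start and shot_end are inclusive frame indices; the first
    frame of each second is chosen as the r frame (an assumption).
    """
    return list(range(shot_start, shot_end + 1, fps))

print(r_frames_per_second(0, 59, fps=25))     # [0, 25, 50]: 3 r frames
print(r_frames_per_second(100, 110, fps=25))  # [100]: shot under 1 s
```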
The next section addresses some additional techniques for video indexing and retrieval.
Audio signals are represented in the time domain or the frequency domain, and different features are extracted from these two representations.
Time-Domain Features
• Average energy: indicates the loudness of the audio signal.
• Zero crossing rate: indicates the frequency of signal amplitude sign changes.
• Silence ratio: indicates the proportion of the sound piece that is silent.
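The three time-domain features can be sketched directly on a list of samples; the silence threshold value and the test signal are illustrative assumptions:

```python
def average_energy(samples):
    """Mean squared amplitude: indicates loudness."""
    return sum(s * s for s in samples) / len(samples)

def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

def silence_ratio(samples, threshold=0.05):
    """Proportion of samples below a silence threshold (threshold
    value is an illustrative assumption)."""
    return sum(1 for s in samples if abs(s) < threshold) / len(samples)

signal = [0.0, 0.5, -0.5, 0.5, -0.5, 0.01, 0.02, -0.01]
print(average_energy(signal))
print(zero_crossing_rate(signal))
print(silence_ratio(signal))   # 0.5: half the samples are near-silent
```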
Frequency-Domain Features
• Sound spectrum: shows the frequency components and frequency distribution of a sound signal. In the frequency domain, the signal is represented as amplitude varying with frequency, indicating the amount of energy at different frequencies.
• Bandwidth: indicates the frequency range of a sound; it can be taken as the difference between the highest and lowest frequencies of the non-zero spectrum components, where "non-zero" may be defined as at least 3 dB above the silence level.
• Energy distribution: the signal's distribution across frequency components. One important feature derived from the energy distribution is the centroid, the mid-point of the spectral energy distribution of a sound. The centroid is also called brightness.
• Harmonicity: in harmonic sound, the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency, called the fundamental frequency. Music is normally more harmonic than other sounds.
• Pitch: the distinctive quality of a sound, dependent primarily on the frequency of the sound waves produced by its source. Only periodic sounds, such as those produced by musical instruments and the voice, give rise to a sensation of pitch. In practice, we use the fundamental frequency as an approximation of the pitch.
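Bandwidth and centroid (brightness) can be computed from a magnitude spectrum as defined above; the toy spectrum below is illustrative:

```python
def spectral_centroid(freqs, mags):
    """Brightness: energy-weighted mean frequency of the spectrum."""
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)

def bandwidth(freqs, mags, floor=0.0):
    """Frequency range spanned by spectrum components above a floor
    (the text suggests ~3 dB above the silence level)."""
    nonzero = [f for f, m in zip(freqs, mags) if m > floor]
    return max(nonzero) - min(nonzero)

# Toy magnitude spectrum: frequency bins in Hz and their energies.
freqs = [100, 200, 300, 400, 500]
mags = [0.0, 4.0, 2.0, 2.0, 0.0]
print(spectral_centroid(freqs, mags))  # 275.0 Hz
print(bandwidth(freqs, mags))          # 200 Hz
```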
Spectrogram
The previous two representations are simple, but the amplitude–time representation does not show the frequency components of the signal, and the spectrum does not show when the different frequency components occur.
To overcome the limitations of the two representations, a combined representation called the spectrogram is used. The spectrogram of a signal shows the relation between three variables: frequency content, time, and intensity. In the spectrogram, frequency content is shown along the vertical axis and time along the horizontal one, with intensity shown in gray scale, the darkest parts marking the greatest amplitude/power.
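A minimal spectrogram sketch using a naive windowed DFT; the window length and the test signal are illustrative choices:

```python
import cmath
import math

def spectrogram(samples, window=8, step=8):
    """Naive spectrogram: magnitude DFT of successive windows, giving
    intensity as a function of both time (window index) and frequency
    (bin index)."""
    frames = []
    for start in range(0, len(samples) - window + 1, step):
        chunk = samples[start:start + window]
        mags = []
        for k in range(window // 2 + 1):  # non-negative frequency bins
            coeff = sum(chunk[n] * cmath.exp(-2j * cmath.pi * k * n / window)
                        for n in range(window))
            mags.append(abs(coeff))
        frames.append(mags)
    return frames  # frames[t][k]: intensity at time t, frequency bin k

# A tone whose frequency changes halfway through the signal.
tone = [math.sin(2 * math.pi * 1 * n / 8) for n in range(8)] + \
       [math.sin(2 * math.pi * 3 * n / 8) for n in range(8)]
spec = spectrogram(tone)
```

Each row of the result corresponds to one time window: the tone's energy appears in frequency bin 1 in the first window and bin 3 in the second, a change that neither the plain waveform nor a single spectrum of the whole signal would localize in time.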
Audio Classification
We need to classify audio into speech, music, and possibly other categories and subcategories, since different audio types require different processing, indexing, and retrieval techniques, and they have different significance for different applications.
Main Characteristics of Different Types of Sound
The following are the main characteristics of speech and music, as they are the basis for audio classification.
Speech
Speech has a low bandwidth compared to music, within the range 0–7 kHz; hence, the spectral centroids (brightness) of speech signals are usually lower than those of music. Speech signals have a higher silence ratio than music, because of the frequent pauses in speech occurring between words and sentences.
Music
Music normally has a high frequency range, from 16 to 20,000 Hz; thus, its spectral centroid is higher than that of speech. Music has a low silence ratio compared to speech; one exception may be music produced by a solo instrument, or singing without accompanying music.
[Figure: an audio classification decision tree. The audio input is first tested for a high spectral centroid; if so, it is classified as music. Otherwise, if the silence ratio is not high, it is also classified as music. If the silence ratio is high, high ZCR variability indicates speech, while low ZCR variability indicates solo music.]
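These classification rules can be expressed as a small rule-based classifier; thresholding of the actual feature values is abstracted into boolean inputs, so this is an illustrative sketch rather than a tested classifier:

```python
def classify_audio(centroid_high, silence_ratio_high, zcr_variability_high):
    """Rule-based speech/music classification from thresholded features."""
    if centroid_high:
        return "music"          # music has a higher spectral centroid
    if not silence_ratio_high:
        return "music"          # low silence ratio also suggests music
    if zcr_variability_high:
        return "speech"         # frequent pauses + varying ZCR: speech
    return "solo music"         # high silence ratio but steady ZCR

print(classify_audio(False, True, True))   # speech
print(classify_audio(False, True, False))  # solo music
```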
Parameter        Range
Speaking mode    Isolated words to continuous speech
Speaking style   Read speech to spontaneous speech
Enrollment       Speaker-dependent to speaker-independent
Vocabulary       Small (<20 words) to large (>20,000 words)
Language model   Finite-state to context-sensitive
Perplexity       Small (<10) to large (>100)
SNR              High (>30 dB) to low (<10 dB)
Transducer       Voice-cancelling microphone to telephone
An isolated-word speech recognition system requires the speaker to pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment, in which a user must provide samples of his or her speech before using the system, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced as a sequence of words, language models or artificial grammars are used to restrict the combinations of words.
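A finite-state restriction on word combinations can be sketched as follows; the vocabulary and allowed transitions are purely illustrative:

```python
# A tiny finite-state language model: only listed word-to-word
# transitions are allowed, restricting which word sequences the
# recognizer may hypothesize ("<s>" marks the sentence start).
allowed = {
    "<s>":  {"call", "dial"},
    "call": {"home", "office"},
    "dial": {"home"},
}

def is_valid(sentence):
    """Check a word sequence against the finite-state grammar."""
    prev = "<s>"
    for word in sentence:
        if word not in allowed.get(prev, set()):
            return False
        prev = word
    return True

print(is_valid(["call", "home"]))  # True
print(is_valid(["home", "call"]))  # False
```

By pruning impossible word sequences before acoustic scoring, such a grammar reduces the effective perplexity the recognizer must cope with.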
Conclusion
So far, the main concepts, issues, and techniques in developing multimedia information indexing and retrieval systems have been discussed. The importance of multimedia databases has led researchers to focus their efforts on designing more efficient methods and techniques to retrieve the best results from these databases.
Terms and Definitions
Boolean Query: A query that uses Boolean operators (AND, OR, and NOT) to formulate a complex condition. A Boolean query example is "university" OR "college".
Content-Based Retrieval: An application that directly makes use of the contents of media, rather than annotations input by humans, to locate desired data in large databases.
Feature Extraction: A subject of multimedia processing that involves applying algorithms to calculate and extract attributes describing the media.
Similarity Measure: A measure that compares the similarity of any two objects represented in a multi-dimensional space. The general approach is to represent the data features as multi-dimensional points and then calculate the distances between the corresponding points.