
Multimedia Databases and Content-Based Retrieval

Mais M. Fatayer

Department of Computer Science

Amman Arab University

Amman – Jordan

Email: mais.fatayer@gmail.com
Introduction
Traditional database management systems cannot handle the demands of managing multimedia data. With the rapid growth of multimedia platforms and the World Wide Web, database management systems must now process, store, index, and retrieve alphanumeric data, bitmapped and vector-based graphics, and video and audio clips, both compressed and uncompressed.
Before the emergence of content-based retrieval, media were annotated with text, allowing them to be accessed by text-based searching. Through textual description, media can be managed based on the classification of subject or semantics. This hierarchical structure allows users to navigate and browse easily, and to search using standard Boolean queries. However, with the emergence of massive multimedia databases, traditional text-based search suffers from the following limitations:
- Manual annotations require too much time and are expensive to implement. As the number of media items in a database grows, the difficulty of finding desired information increases, and it becomes infeasible to manually annotate all attributes of the media content. Annotating a sixty-minute video containing more than 100,000 images consumes a vast amount of time and expense.

- Manual annotations fail to deal with the discrepancy of subjective perception. The phrase "an image says more than a thousand words" implies that a textual description is insufficient for depicting subjective perception; capturing all the concepts, thoughts, and feelings evoked by the content of any media is almost impossible.

- Some media contents are difficult to describe concretely in words. For example, a melody without lyrics or an irregular organic shape cannot easily be expressed in textual form, yet people expect to search for media with similar contents based on examples they provide.

In an attempt to overcome these difficulties, content-based retrieval employs content information to index data automatically, with minimal human intervention.

APPLICATIONS
Content-based retrieval has been proposed by different communities for various applications. These include:

Medical diagnosis: The amount of digital medical images used in hospitals has increased tremendously. As images with similar pathology-bearing regions can be found and interpreted, those images can be applied to aid diagnosis through image-based reasoning. For example, Wei & Li (2004) proposed a general framework for content-based medical image retrieval and constructed a retrieval system for locating digital mammograms with similar pathological parts.

Intellectual property: Trademark image registration has applied content-based retrieval techniques to compare a new candidate mark with existing marks to ensure that there is no repetition. Copyright protection can also benefit from content-based retrieval, as copyright owners are able to search for and identify unauthorized copies of images on the Internet. For example, Wang & Chen (2002) developed a content-based system using hit statistics to retrieve trademarks.
Broadcasting archives: Every day, broadcasting companies produce large amounts of audio-visual data. To deal with these large archives, which can contain millions of hours of video and audio data, content-based retrieval techniques are used to annotate their contents and summarize the audio-visual data to drastically reduce the volume of raw footage. For example, Yang et al. (2003) developed a content-based video retrieval system to support personalized news retrieval.

Information searching on the Internet: A large amount of media has been made available on the Internet for retrieval. Existing search engines mainly perform text-based retrieval. To access the various media on the Internet, content-based search engines can assist users in finding the information with the most similar contents based on their queries. For example, Hong & Nah (2004) designed an XML scheme to enable content-based image retrieval on the Internet.

TEXT DOCUMENT INDEXING AND RETRIEVAL


Information retrieval (IR) techniques are important in multimedia information management systems, since a large number of text documents exist in many organizations such as libraries. Text is a very important information source for any organization, and it can be used to annotate other media such as audio, images, and video.
Two major design issues of IR systems are how to represent documents and queries, and how to compare similarities between document and query representations. A retrieval model defines these two aspects. The most common technique is the exact-match technique, and the Boolean model is discussed below as an example of this retrieval method.

Automatic Text Document Indexing and the Boolean Retrieval Model

Basic Boolean Retrieval Model
Most commercial IR systems can be classified as Boolean IR systems or text-pattern search systems. Text-pattern search queries are strings or regular expressions; during retrieval, all documents are searched and those containing the query string are retrieved. Text-pattern systems are more common for searching small document databases or collections.
In a Boolean retrieval system, documents are indexed by sets of keywords. Queries are also represented by sets of keywords, joined by logical (Boolean) operators that specify relationships between the query terms.
Three types of operators are in common use: OR, AND, and NOT. Their retrieval rules are:
- The OR operator treats two terms as effectively synonymous. For example, given the query (term1 OR term2), the presence of either term in a record or document suffices to retrieve that record.
- The AND operator combines terms into term phrases; thus the query (term1 AND term2) indicates that both terms must be present in the document in order for it to be retrieved.
- The NOT operator is a restriction, or term-narrowing, operator that is normally used in conjunction with the AND operator to restrict the applicability of particular terms; thus the query (term1 AND NOT term2) leads to the retrieval of records containing term1 but not term2.
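As a sketch of how these operator rules can be applied over keyword-indexed documents, the following is a minimal Boolean retrieval function; the document collection and keyword sets are invented for illustration:

```python
# Minimal sketch of Boolean retrieval over keyword-indexed documents.
# The collection below is a hypothetical example.

documents = {
    1: {"database", "retrieval", "index"},
    2: {"image", "retrieval", "color"},
    3: {"audio", "index", "speech"},
}

def boolean_query(docs, required=(), any_of=(), excluded=()):
    """Return IDs of documents satisfying (AND of required terms),
    (OR of any_of terms, if given), and NOT any excluded term."""
    results = set()
    for doc_id, terms in docs.items():
        if not all(t in terms for t in required):
            continue  # AND: every required term must be present
        if any_of and not any(t in terms for t in any_of):
            continue  # OR: at least one of these terms must be present
        if any(t in terms for t in excluded):
            continue  # NOT: none of these terms may be present
        results.add(doc_id)
    return results

# (retrieval AND index) -> {1}
print(boolean_query(documents, required=("retrieval", "index")))
# (retrieval AND NOT color) -> {1}
print(boolean_query(documents, required=("retrieval",), excluded=("color",)))
```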
Term Operations and Automatic Indexing
A document contains many terms or words, but not every word is useful and important. For example, prepositions and articles such as "of", "the", and "a" are not useful for representing the content of a document. These terms are called stop words.
During the indexing process, a document is treated as a list of words, and stop words are removed from the list. The remaining terms are further processed to improve indexing and retrieval efficiency and effectiveness. Common operations carried out on these terms are stemming, thesaurus operations, and weighting.
Stemming is the automated conflation of related words, usually by reducing them to a common root form. For example, suppose the words "retrieval", "retrieved", "retrieving", and "retrieve" all appear in a document. Instead of treating these as four different words, for indexing purposes they are reduced to a common root, "retrieve", which is used as an index term of the document.
Another way of conflating related terms is with a thesaurus that lists synonymous terms and sometimes the relationships among them. For example, the words "study", "learning", "schoolwork", and "reading" have similar meanings, so instead of using four index terms, the general term "study" can represent all four.
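The indexing operations described above can be sketched as follows. The stop list, suffix rules, and thesaurus entries are tiny illustrative stand-ins for the resources a real IR system would use, and the crude suffix stripping yields the root "retriev" rather than the dictionary form "retrieve":

```python
# Sketch of indexing-time term operations: stop-word removal,
# crude suffix-stripping "stemming", and thesaurus-based conflation.
# All three resources below are illustrative, not real IR data.

STOP_WORDS = {"of", "the", "a", "and", "in", "is"}
SUFFIXES = ("ing", "ed", "al", "e", "s")
THESAURUS = {"learning": "study", "schoolwork": "study", "reading": "study"}

def stem(word):
    """Strip the longest matching suffix, keeping a minimal root length."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    terms = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue                       # drop stop words
        word = THESAURUS.get(word, word)   # conflate synonyms
        terms.append(stem(word))
    return terms

print(index_terms("the retrieval of a document"))  # ['retriev', 'document']
```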

Different indexing terms have different frequencies of occurrence and different importance to the document. Note that the occurrence frequency of a term after stemming or thesaurus operations is the sum of the occurrence frequencies of all its variations.
Introducing term-importance weights for document terms and query terms distinguishes the terms that are more important to the document for retrieval purposes from the less important ones.
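The text does not name a particular weighting scheme; one widely used choice, shown here as an assumption, is tf-idf (term frequency times inverse document frequency):

```python
import math

# Sketch of tf-idf term weighting (an assumed scheme; the text only
# speaks of term-importance weights in general). A term is weighted
# highly when it is frequent in the document but rare in the collection.

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    tf = term_count / doc_length            # term frequency in the document
    idf = math.log(num_docs / docs_with_term)  # inverse document frequency
    return tf * idf

# A term occurring 5 times in a 100-word document and appearing in
# 10 of 1,000 documents:
print(tf_idf(term_count=5, doc_length=100, num_docs=1000, docs_with_term=10))
```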

IMAGE INDEXING AND RETRIEVAL

Images are stored in a database in raw form, as a set of pixel or cell values, or in compressed form to save space. Each image is represented as a grid of cells. There are several approaches to image indexing and retrieval.
In the first approach, attribute-based, the image contents are modeled as a set of attributes extracted manually and managed within the framework of conventional DBMSs. Queries are specified using these attributes; examples of such attributes are image file name, image category, date of creation, subject, author, and image source. However, database attributes may not be able to describe the image contents completely, and the types of queries are limited to those attributes.
The second approach, feature extraction/object recognition, depends on a subsystem to automate feature extraction and object recognition. The limitations of this approach are that it is computationally expensive, difficult to implement, and tends to be domain specific.
A third method annotates images with high-level features and uses IR techniques to carry out retrieval. Text can describe the high-level features contained in the images, and retrieval uses relevance feedback and domain knowledge, which can overcome some problems of incompleteness and subjectiveness.
Finally, a fourth method uses low-level features to index and retrieve images.
In practice, the second and fourth approaches have provided good performance; however, the second approach is not applicable to general applications.

The following sections describe low-level feature-based retrieval techniques, combined with text-based techniques, in more detail: methods based on color, shape, and texture. In practice, text-based and low-level feature-based techniques are combined to achieve high retrieval performance.

Text-Based Image Retrieval

In text-based image retrieval, images are described with free text. Queries are in the form of keywords, with or without Boolean operators. The retrieval techniques are based on similarities between the query and the text descriptions of images.
There are two main differences between text-based image retrieval and conventional text document retrieval.
First, text annotation is a manual process, since high-level image understanding by machine is not possible. In image annotation, we care about efficiency and about describing the image contents completely and consistently. Domain knowledge or a thesaurus should be used to overcome the completeness and consistency problems, and relationships between words or terms must also be considered. For example, a user may issue a query using the keyword "human being", intending to retrieve all images containing human beings, including those annotated only with "child", "man", or "woman".
Second, the text description may not be complete and may be subjective; thus the use of a knowledge base and relevance feedback is extremely important for text-based image retrieval.

The advantage of text-based image retrieval is that it captures high-level abstractions and concepts, such as "smile" and "happy", contained in images. However, it cannot retrieve images by example, and some low-level features, such as shape and texture, are difficult to describe in words.

Color-Based Image Indexing and Retrieval Technique

This is a commonly used approach among content-based retrieval techniques. The idea of color-based image retrieval is to retrieve from the database images that have colors similar to the user's query.
Each image in the database is represented using the three channels of the chosen color space. The most common color space is RGB (red, green, and blue). Each color channel is discretized into m intervals, so the total number of discrete color combinations (called bins), n, is equal to m³. For example, if each color channel is discretized into 16 intervals, we have 4,096 bins in total.
A color histogram H(M) is a vector (h1, h2, h3, ..., hj, ..., hn), where element hj represents the number of pixels in image M falling into bin j. This histogram is the feature vector stored as the index of the image.
During image retrieval, a histogram is computed for the query image or estimated from the user's query. The distances between the histogram of the query image and the histograms of images in the database are measured. Images with a histogram distance smaller than a predefined threshold are retrieved from the database and presented to the user; alternatively, the k images with the smallest distances are retrieved.
The L1 metric, defined below, is used as the distance between images I and H:

d(I, H) = Σ_{l=1..n} | i_l − h_l |

where i_l and h_l are the numbers of pixels falling into bin l of images I and H, respectively.
For example, suppose we have three images of 8×8 pixels, where each pixel takes one of eight colors, C1 to C8. Image 1 has 8 pixels in each of the eight colors; image 2 has 7 pixels in each of colors C1 to C4 and 9 pixels in each of colors C5 to C8; image 3 has 2 pixels in each of colors C1 and C2 and 10 pixels in each of colors C3 to C8. We then have the following three histograms:

H1 = (8, 8, 8, 8, 8, 8, 8, 8)
H2 = (7, 7, 7, 7, 9, 9, 9, 9)
H3 = (2, 2, 10, 10, 10, 10, 10, 10)

The distances between these three images are:

d(H1, H2) = 1+1+1+1+1+1+1+1 = 8
d(H1, H3) = 6+6+2+2+2+2+2+2 = 24
d(H2, H3) = 5+5+3+3+1+1+1+1 = 20

Therefore, images 1 and 2 are the most similar, and images 1 and 3 the most different, according to the histograms.
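The distances in this example can be checked directly with a minimal L1-distance computation over the three histograms:

```python
# L1 (sum of absolute per-bin differences) distance between the
# 8-bin color histograms of the three 8x8 example images.

def l1_distance(h1, h2):
    """Sum of absolute per-bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

H1 = (8, 8, 8, 8, 8, 8, 8, 8)
H2 = (7, 7, 7, 7, 9, 9, 9, 9)
H3 = (2, 2, 10, 10, 10, 10, 10, 10)

print(l1_distance(H1, H2))  # 8
print(l1_distance(H1, H3))  # 24
print(l1_distance(H2, H3))  # 20
```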

Image Retrieval Based on Shape

Shape representation is a fundamental issue in the newly emerging multimedia applications. In content-based image retrieval (CBIR), shape is an important low-level image feature. A good shape representation and similarity measure for recognition and retrieval purposes should have the following two important properties:
- Each shape should have a unique representation, invariant to translation, rotation, and scale.
- Similar shapes should have similar representations, so that retrieval can be based on distances among shape representations.

There are generally two types of shape representations: contour-based and region-based. Contour-based methods need the extraction of boundary information, which in some cases may not be available. Region-based methods, however, do not necessarily rely on shape boundary information, but they do not reflect the local features of a shape. Therefore, for generic purposes, both types of shape representations are necessary. Two shape descriptors that have been widely adopted for CBIR are Fourier descriptors (FD) and grid descriptors (GD).

Fourier descriptors: In the Fourier descriptor-based method, a shape is first represented by a feature function called a shape signature. A discrete Fourier transform is then applied to the signature to obtain the FDs of the shape, which are used to index the shape and to calculate shape similarity.
Grid descriptors: In the grid shape representation, a shape is projected onto a grid of fixed size. The grid cells are assigned the value 1 if they are covered by the shape (or covered beyond a threshold) and 0 if they are outside the shape. A shape number consisting of a binary sequence is created by scanning the grid in left-to-right, top-to-bottom order, and this binary sequence is used as the shape descriptor to index the shape.
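As a sketch of the grid descriptor, the following projects a small binary shape mask (an invented 4×4 example) onto its shape number by left-to-right, top-to-bottom scanning:

```python
# Sketch of a grid descriptor: scan a binary shape mask row by row
# into a binary "shape number". The 4x4 mask is an illustrative
# triangle-like shape, not data from a real system.

def grid_descriptor(mask):
    """Flatten a binary grid (list of rows) in left-right,
    top-bottom order into a binary sequence."""
    return [cell for row in mask for cell in row]

shape_mask = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
print(grid_descriptor(shape_mask))
# [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
```

Two shapes can then be compared by the number of positions in which their shape numbers differ.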

Image Retrieval Based on Texture

Texture is an important image feature, but it is difficult to describe, and its perception is subjective to a certain extent.
One of the best-known descriptions is the one proposed by H. Tamura, S. Mori, and T. Yamawaki. To find a texture description, they conducted psychological experiments, aiming to make the description conform to human perception as closely as possible. According to their specification, six features describe texture:
- Coarseness: Coarse is the opposite of fine. Coarseness is the most fundamental texture feature; to some people, texture simply means coarseness. The larger the distinctive image elements, the coarser the image; so an enlarged image is coarser than the original.
- Contrast: Contrast is measured using four parameters: the dynamic range of gray levels in the image; the polarization of the distribution of black and white in the gray-level histogram, or the ratio of black to white areas; the sharpness of edges; and the period of repeating patterns.
- Directionality: A global property over the given region. It measures both element shape and placement; the orientation of the texture pattern itself is not important.
- Line-likeness: Concerned with the shape of a texture element. Two common types of shapes are line-like and blob-like.
- Regularity: Measures variation in the element-placement rule; it is concerned with whether the texture is regular or irregular. Different element shapes reduce regularity. A fine texture tends to be perceived as regular.
- Roughness: Measures whether the texture is rough or smooth. It is related to coarseness and contrast.

Not all six features are used in texture-based image retrieval systems. For example, in the QBIC system, texture is described by coarseness, contrast, and directionality. Retrieval is based on similarity instead of exact match.

Integrated Image Indexing and Retrieval Techniques

An individual feature cannot describe an image adequately. For example, it is not possible to distinguish a red car from a red apple based on color alone; therefore, a combination of features is required for effective image indexing and retrieval.
A practical system, QBIC, was developed by IBM Corporation. It allows a large image database to be queried by visual properties such as colors, color percentages, texture, shape, and sketch, as well as by keywords. QBIC capabilities have been incorporated into IBM's DB2 Universal Database product.

VIDEO INDEXING AND RETRIEVAL

Video is information rich. A complete video may consist of text, a sound track (both speech and non-speech), and images recorded or played out continuously at a fixed rate.

The following methods are used for video indexing and retrieval:
- Metadata-based method: Video is indexed and retrieved based on structured metadata such as author/producer/director, date of production, and type of video.
- Text-based method: Video is indexed and retrieved based on associated text, using IR techniques.
- Audio-based method: Using speech recognition and IR techniques, video can be indexed and retrieved based on the spoken words associated with video frames.
- Content-based method: There are two approaches. The first treats video as independent frames or images and uses the image indexing and retrieval methods. The other divides the video into groups of similar frames and bases indexing on a representative frame of each group; this approach is called shot-based video indexing and retrieval.
- Integrated approach: Two or more of the above methods can be combined to provide more effective video indexing and retrieval.

The following section describes the shot-based video indexing and retrieval technique.

Shot-Based Video Indexing and Retrieval Technique

A video sequence consists of a sequence of images taken at a certain rate. A long video contains many frames; if these are treated individually, indexing and retrieval become very hard. A video is therefore divided into a number of logical units or segments called shots.

A shot can have the following features:

- The frames depict the same scene.
- The frames signify a single camera operation.
- The frames contain a distinct event and/or action, such as the significant presence of an object.
- The frames are chosen as a single indexable entity by the user.

[Figure: a video partitioned into shots. Frames taken in the same scene and featuring the same group of people correspond to one shot; we need to identify the part of the video that contains the required information.]
Shot-based video indexing and retrieval consists of the following main steps:
1. Segment the video into shots (called video temporal segmentation, partitioning, or shot detection).
2. Index each shot. The common approach is to first identify key frames or representative frames (r frames) for each shot, then use the image indexing methods described before.
3. Apply a similarity measurement between the query and the video shots and retrieve the shots with high similarities. This is achieved using the image retrieval methods, based on the indexes or feature vectors obtained in step 2.

SEGMENT THE VIDEO INTO SHOTS

Video Shot Detection or Segmentation

Consecutive frames on either side of a camera break generally display a significant quantitative change, so a suitable quantitative measure that captures the difference between a pair of frames is needed. If the difference between two consecutive frames exceeds a given threshold, it may be interpreted as indicating a segment boundary.
The camera break is the simplest transition between two scenes; a camera may also produce other transitions such as dissolve, wipe, fade-in, and fade-out. These operations create a more gradual change between consecutive frames than a camera break does.

Basic Video Segmentation Techniques

The key issue in shot detection is how to measure frame-to-frame differences. The simplest way is to measure the sum of pixel-to-pixel differences between neighboring frames; if the sum is larger than a preset threshold, a shot boundary is assigned between the two frames.
However, this method is not effective, and many false shot detections will be reported, because two frames within one shot may have a large pixel-to-pixel difference due to object movement from frame to frame.
To overcome this limitation, methods were introduced that measure the color histogram distance between neighboring frames. The principle behind these methods is that object motion causes only small histogram differences, so if a large difference is found, a camera break has likely occurred.
The following formula is used to measure the difference between the ith frame and its successor:

SD_i = Σ_j | H_i(j) − H_{i+1}(j) |

where H_i(j) denotes the histogram value of the ith frame for level j, and j ranges over the G possible gray levels. If SD_i is larger than the predetermined threshold, a shot boundary is declared.
Another simple but more effective approach compares histograms based on a color code derived from the R, G, and B components:

SD_i = Σ_j ( H_i(j) − H_{i+1}(j) )² / H_{i+1}(j)

This measurement is called the χ² test. Here, j denotes a color code instead of a gray level. In this technique, the selection of appropriate threshold values is a key issue in determining the segmentation performance.
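The histogram-difference test can be sketched as follows; the frame histograms and the threshold are invented values, not measurements from a real video:

```python
# Sketch of histogram-based shot detection: compute SD_i between
# consecutive frame histograms and declare a boundary where SD_i
# exceeds a threshold. Histograms and threshold are illustrative.

def frame_difference(h_i, h_next):
    """SD_i = sum over bins j of |H_i(j) - H_{i+1}(j)|."""
    return sum(abs(a - b) for a, b in zip(h_i, h_next))

def detect_boundaries(histograms, threshold):
    """Return indices i where a boundary lies between frames i and i+1."""
    return [
        i for i in range(len(histograms) - 1)
        if frame_difference(histograms[i], histograms[i + 1]) > threshold
    ]

frames = [
    (50, 30, 20),   # shot A
    (48, 31, 21),   # small differences: object motion within the shot
    (10, 60, 30),   # abrupt change: shot B starts here
    (11, 59, 30),
]
print(detect_boundaries(frames, threshold=20))  # [1]
```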

Detecting Shot Boundaries with Gradual Change

The above techniques rely on a single frame-to-frame difference threshold for shot detection. In practice, they cannot detect shot boundaries when the change between frames is gradual, as in videos produced with fade-in, fade-out, dissolve, and wipe operations, or when the color histograms of two frames from two different scenes are similar.
Fade-in is when a scene gradually appears; fade-out is when a scene gradually disappears. Dissolve is when one scene gradually disappears while another gradually appears. Wipe is when one scene gradually enters across the frame while another gradually leaves.
In such operations, the difference values tend to be higher than those within a shot but significantly lower than the shot threshold. A single threshold does not work here, because to capture these boundaries the threshold would have to be lowered significantly, causing many false detections.
To overcome this situation, Zhang et al. developed a twin-comparison technique that can detect both normal camera breaks and gradual transitions. This technique requires two difference thresholds:
- Tb: used to detect normal camera breaks.
- Ts: a lower threshold used to detect the potential frames at which a gradual transition may occur.
During the shot-boundary detection process, consecutive frames are compared using one of the previously described methods. If the difference is larger than Tb, a shot boundary is declared. If the difference is less than Tb but larger than Ts, the frame is declared a potential transition frame. The frame-to-frame differences of consecutively occurring potential transition frames are then accumulated; if the accumulated difference exceeds Tb, a transition is declared and the consecutive potential frames are treated as a special segment. Note that the accumulated difference is computed only while the frame-to-frame difference stays larger than Ts.
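A minimal sketch of the twin-comparison logic, using invented frame-to-frame difference values:

```python
# Sketch of the twin-comparison technique: Tb catches camera breaks;
# the lower Ts flags potential transition frames, whose consecutive
# differences are accumulated; a gradual transition is declared when
# the accumulated difference exceeds Tb. Difference values below are
# illustrative, not from a real video.

def twin_comparison(diffs, tb, ts):
    """diffs[i] is the difference between frames i and i+1.
    Returns ('break', i) and ('gradual', start, end) events."""
    events = []
    acc, start = 0, None
    for i, d in enumerate(diffs):
        if d > tb:
            events.append(("break", i))       # abrupt camera break
            acc, start = 0, None
        elif d > ts:
            if start is None:
                start = i                     # potential transition begins
            acc += d
            if acc > tb:
                events.append(("gradual", start, i))
                acc, start = 0, None
        else:
            acc, start = 0, None              # run of potential frames ended
    return events

diffs = [2, 3, 40, 2, 8, 9, 8, 9, 2]  # with Tb=30, Ts=5
print(twin_comparison(diffs, tb=30, ts=5))
# [('break', 2), ('gradual', 4, 7)]
```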

VIDEO INDEXING AND RETRIEVAL


Now we need to represent and index each shot so that shots can be located and retrieved quickly in response to queries. The most common way is to represent each shot with one or more key frames or representative frames (r frames). Retrieval is then based on the similarity between the query and the r frames.

Indexing and Retrieval Based on r Frames of Video Shots

Using a representative frame is the most common way to represent a shot. The r frame captures the main contents of the shot. Features of this frame are extracted and indexed based on color, shape, and texture, as in image retrieval. During retrieval, queries are compared with the indices or feature vectors of these frames. If an r frame is similar to the query, it is presented to the user, who can then play out the shot it represents.
When a shot is static, any frame is good enough to be the representative frame; but when there is a lot of object movement in the shot, other methods should be used.

Two issues must be addressed regarding r-frame selection: first, how many r frames should be used for a shot; second, how to select these r frames within a shot.

To determine how many r frames should be used, a number of methods have been proposed:
1. Use one r frame per shot. However, this method does not consider the length and content changes of shots.
2. Assign the number of r frames to shots according to their length: one r frame represents each second of video, or a shot of one second or less. This method can partially overcome the limitation of the first method, but it ignores shot contents.
3. Divide a shot into subshots or scenes and assign one r frame to each subshot. A subshot is detected based on changes in content, determined from motion vectors, optical flow, and frame-to-frame differences.

In the second step, we need to determine how these r frames are selected. Corresponding to the above methods for determining the number of r frames per shot, four possibilities have been proposed (here, the general term "segment" is used to refer to a shot, a second of video, or a subshot, depending on the method used):
1. In the first method, the first frame of each segment is normally used as the r frame. This choice is based on the observation that cinematographers attempt to "characterize" a segment with the first few frames, before beginning to track or zoom to a close-up; thus the first frame of a segment normally captures the overall contents of the segment.
2. In the second method, an average frame is defined so that each pixel in this frame is the average of the pixel values at the same grid point in all frames of the segment. The frame within the segment that is most similar to this average frame is then selected as the representative frame of the segment.
3. In the third method, the histograms of all the frames in the segment are averaged. The frame whose histogram is closest to this average histogram is selected as the representative frame.
4. The fourth method is mainly used for segments captured with camera panning. Each frame within the segment is divided into background and foreground objects. A large background is constructed from the backgrounds of all frames, and then the main foreground objects of all frames are superimposed onto the constructed background.
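The third selection method (the frame closest to the average histogram) can be sketched as follows, with invented segment histograms:

```python
# Sketch of r-frame selection by average histogram: average the
# histograms of all frames in a segment and pick the frame whose
# histogram is closest (L1 distance) to that average.
# The segment histograms below are illustrative values.

def average_histogram(histograms):
    n = len(histograms)
    return [sum(h[j] for h in histograms) / n for j in range(len(histograms[0]))]

def select_r_frame(histograms):
    """Return the index of the frame closest to the average histogram."""
    avg = average_histogram(histograms)
    def dist(h):
        return sum(abs(a - b) for a, b in zip(h, avg))
    return min(range(len(histograms)), key=lambda i: dist(histograms[i]))

segment = [
    (10, 20, 30),
    (12, 19, 29),
    (30, 5, 25),   # outlier frame, far from the segment average
]
print(select_r_frame(segment))  # 1 (the most "average" frame)
```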

Among all the above methods, it is hard to determine which is the best; the choice of r frame is application dependent. The next sections address some additional techniques for video indexing and retrieval.

Indexing and Retrieval Based on Motion Information

Video indexing and retrieval based on motion information has been proposed to complement the r-frame-based approach, which treats a video as a collection of still images. In this method, motion information is derived from motion vectors and determined for each r frame; thus r frames are indexed based on both image contents and motion information.

Indexing and Retrieval Based on Objects

Object-based indexing schemes find a way to distinguish individual objects throughout a given scene, which is a complex collection of objects, and carry out the indexing process based on information about each object. This indexing strategy can capture the changes in content throughout the sequence.

Indexing and Retrieval Based on Metadata


Metadata for video is available in some standard video formats. Video indexing and retrieval can be based on this metadata using conventional DBMSs.

Indexing and Retrieval Based on Annotation

Annotations can be obtained by manually interpreting and annotating the video, by using transcripts and subtitles, or by applying speech recognition to the sound track to extract spoken words; these annotations can then be used for indexing and retrieval.

AUDIO INDEXING AND RETRIEVAL


Digital audio is represented as a sequence of samples and is normally stored in compressed form.
For human beings, it is easy to recognize different types of audio: we can all tell whether a piece of audio is music, noise, or a human voice, and also its mood, whether happy, sad, relaxing, etc.
For a computer, audio is just a sequence of sample values, so a retrieval technique is needed to access audio files and answer queries. The traditional method of accessing audio pieces is based on their titles or file names, which is not good enough for a query such as "find audio pieces similar to the one being played", in other words, query by example.
To overcome this problem, content-based audio retrieval techniques are required. The following general approach to content-based audio retrieval is normally taken:
- Audio is classified into common types such as speech, music, and noise.
- Different audio types are processed and indexed in different ways. For example, if the audio type is speech, speech recognition is applied and the speech is indexed based on the recognized words.
- Query audio pieces are similarly classified, processed, and indexed.
- Audio pieces are retrieved based on the similarity between the query index and the indexes in the database.

Audio signals are represented in the time domain or the frequency domain, and different features are extracted from these two representations.

Time-Domain Features
• Average energy: indicates the loudness of the audio signal.
• Zero-crossing rate: indicates the frequency of sign changes in the signal amplitude.
• Silence ratio: indicates the proportion of the sound piece that is silent.
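These three time-domain features can be sketched for a short sampled signal; the sample values and the silence threshold are illustrative:

```python
# Sketch of the three time-domain audio features over a sampled signal.
# The example signal and silence threshold are illustrative values.

def average_energy(samples):
    """Mean of squared sample values (loudness)."""
    return sum(s * s for s in samples) / len(samples)

def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose amplitude changes sign."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

def silence_ratio(samples, threshold):
    """Proportion of samples whose magnitude falls below the threshold."""
    return sum(1 for s in samples if abs(s) < threshold) / len(samples)

signal = [0.5, -0.4, 0.3, -0.2, 0.0, 0.0, 0.0, 0.6]
print(average_energy(signal))
print(zero_crossing_rate(signal))
print(silence_ratio(signal, threshold=0.1))  # 3/8 = 0.375
```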

Frequency-Domain Features
• Sound Spectrum: Shows the frequency components and frequency distribution of
a sound signal, represented in the frequency domain. In the frequency domain, the signal
is represented as amplitude varying with frequency, indicating the amount of energy
at different frequencies.
• Bandwidth: Indicates the frequency range of a sound; it can be taken as the difference
between the highest and lowest frequencies of the non-zero spectrum
components, where "non-zero" may be defined as at least 3 dB above the silence level.
• Energy Distribution: The distribution of signal energy across frequency components. One
important feature derived from the energy distribution is the centroid, which is
the mid-point of the spectral energy distribution of a sound. The centroid is also
called brightness.
• Harmonicity: In a harmonic sound, the spectral components are mostly whole-number
multiples of the lowest, and most often loudest, frequency. This lowest
frequency is called the fundamental frequency. Music is normally more harmonic than
other sounds.
• Pitch: The distinctive quality of a sound, dependent primarily on the frequency of
the sound waves produced by its source. Only periodic sounds, such as those produced
by musical instruments and the voice, give rise to a sensation of pitch. In
practice, the fundamental frequency is used as an approximation of the pitch.
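The spectral centroid and bandwidth can be computed from the FFT of the signal. In this sketch, the "non-zero" threshold is an assumed noise floor 40 dB below the spectral peak, a stand-in for the "3 dB above the silence level" rule mentioned above:

```python
import numpy as np

def spectral_features(x, fs):
    """Spectral centroid ("brightness") and bandwidth of x sampled at fs Hz."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = spectrum ** 2
    # Centroid: power-weighted mid-point of the spectral energy distribution.
    centroid = np.sum(freqs * power) / np.sum(power)
    # Bandwidth: span of frequencies whose magnitude exceeds the assumed floor.
    active = freqs[spectrum > spectrum.max() / 100.0]
    bandwidth = active.max() - active.min()
    return centroid, bandwidth

fs = 8000
t = np.linspace(0, 1, fs, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)            # pure 440 Hz tone
centroid, bandwidth = spectral_features(tone, fs)
```

A pure tone concentrates all its energy at one frequency, so its centroid sits at 440 Hz and its bandwidth is near zero; a speech or music signal would spread energy over many bins.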
Spectrogram
The previous two representations are simple, but each is incomplete: the amplitude-time
representation does not show the frequency components of the signal, and the spectrum
does not show when the different frequency components occur.
To overcome these limitations, a combined representation called the spectrogram is used.
The spectrogram of a signal shows the relation between three variables: frequency
content, time and intensity. In the spectrogram, the frequency content is shown along
the vertical axis and time along the horizontal one, with intensity shown in gray
scale, the darkest parts marking the greatest amplitude/power.
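A spectrogram is built by taking the FFT of successive short, windowed frames of the signal, a short-time Fourier transform. This is a minimal sketch with assumed frame and hop sizes:

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# A signal that switches from 500 Hz to 1500 Hz halfway through:
fs = 8000
t = np.linspace(0, 0.5, fs // 2, endpoint=False)
x = np.concatenate([np.sin(2 * np.pi * 500 * t),
                    np.sin(2 * np.pi * 1500 * t)])
S = spectrogram(x)
early_peak = S[0].argmax()      # dominant frequency bin in the first frame
late_peak = S[-1].argmax()      # dominant frequency bin in the last frame
```

Unlike a single spectrum of the whole signal, the frame-by-frame view shows that 500 Hz dominates early frames and 1500 Hz dominates late ones, which is exactly the time-frequency information the spectrogram adds.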

Audio Classification
We need to classify audio into speech, music and possibly
other categories and subcategories, because different audio types require different
processing, indexing and retrieval techniques, and because they have different
significance to different applications.
Main Characteristics of Different Types of Sound
Following are the main characteristics of speech and music, as they are the basis for
audio classification.
Speech
Speech has a low bandwidth compared to music, within the range of 0-7 kHz; hence,
the spectral centroid (brightness) of speech signals is usually lower than that of
music. Speech signals also have a higher silence ratio than music, because of the
frequent pauses in speech occurring between words and sentences.
Music
Music normally has a wide frequency range, from 16 Hz to 20,000 Hz; thus, its spectral
centroid is higher than that of speech. Music has a low silence ratio compared to
speech; one exception may be music produced by a solo instrument, or singing without
accompanying music.

Audio Classification Framework
All classification methods are based on calculated feature values; however, they differ
in how these features are used.
• Step-by-Step Classification
Each feature is used individually in a different classification step, so each feature
serves as a filtering criterion. At each step, an audio piece is determined to be of
one type or another.
In this classification method, the centroid of each input audio piece is calculated
first. If the centroid is higher than a pre-determined threshold, the piece is music;
otherwise, it is either music or speech (since some music has a low centroid). Then
the silence ratio is calculated: if it is low, the piece is music; otherwise, it is
either solo music or speech (solo music can have a high silence ratio). Finally, the
ZCR (zero crossing rate) variability is calculated: if the input has high ZCR
variability, it is speech; otherwise, it is solo music.
The order of these steps is based on the differences between features: the less
complicated feature with the higher differentiating power is used first. A possible
filtering process is shown in Figure 1.

Figure 1: Audio classification process. The flowchart filters the audio input in three
steps: high centroid? (yes: music); then high silence ratio? (no: music); then high
ZCR variability? (yes: speech; no: solo music).

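The cascade in Figure 1 can be sketched as a chain of threshold tests. All threshold values here are invented placeholders; a real system would set them from labelled training audio:

```python
def classify_step_by_step(centroid, silence_ratio, zcr_variability,
                          centroid_thresh=2000.0, silence_thresh=0.2,
                          zcr_var_thresh=0.1):
    """Step-by-step filtering: each feature is a separate filtering criterion."""
    if centroid > centroid_thresh:
        return "music"          # high brightness: music
    if silence_ratio < silence_thresh:
        return "music"          # few pauses: music (with a low centroid)
    if zcr_variability > zcr_var_thresh:
        return "speech"         # ZCR varies between voiced/unvoiced segments
    return "solo music"

bright_piece = classify_step_by_step(2500.0, 0.05, 0.02)
pausing_piece = classify_step_by_step(900.0, 0.35, 0.30)
steady_piece = classify_step_by_step(900.0, 0.35, 0.02)
```

Note how the cheapest, most discriminative test runs first, mirroring the ordering argument above.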
• Feature Vector Based Audio Classification
The values of a set of features are calculated and used as a feature vector. During the
training stage, the average feature vector is found for each class of audio. During
classification, the feature vector of an input is calculated, and the vector distance
between the input feature vector and each of the reference vectors is computed. The
input is classified into the class from which it has the least vector distance.
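This is a nearest-centroid classifier. A minimal sketch, using Euclidean distance and made-up two-dimensional feature vectors (e.g. silence ratio and spectral centroid in kHz):

```python
import numpy as np

def train(examples):
    """examples maps class name -> list of feature vectors.
    Returns the average (reference) vector per class, as in the training stage."""
    return {c: np.mean(vecs, axis=0) for c, vecs in examples.items()}

def classify(refs, vec):
    """Assign vec to the class with the nearest reference vector."""
    vec = np.asarray(vec, dtype=float)
    return min(refs, key=lambda c: np.linalg.norm(refs[c] - vec))

refs = train({
    "speech": [[0.4, 1.0], [0.5, 1.2]],   # high silence ratio, low centroid
    "music":  [[0.1, 3.0], [0.2, 3.4]],   # low silence ratio, high centroid
})
label = classify(refs, [0.45, 1.1])
```

Unlike the step-by-step method, all features are used jointly, so no single threshold has to carry the whole decision.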
Speech Recognition
Speech recognition is the process of converting an acoustic signal, captured by a
microphone or a telephone, to a set of words. The recognized words can be the final
result, as in applications such as command and control, data entry, and document
preparation. They can also serve as the input to further linguistic processing in order
to achieve speech understanding.

Speech recognition systems can be characterized by many parameters, some of the
more important of which are:

Parameter        Range
Speaking mode    Isolated words to continuous speech
Speaking style   Read speech to spontaneous speech
Enrollment       Speaker-dependent to speaker-independent
Vocabulary       Small (<20 words) to large (>20,000 words)
Language model   Finite-state to context-sensitive
Perplexity       Small (<10) to large (>100)
SNR              High (>30 dB) to low (<10 dB)
Transducer       Noise-cancelling microphone to telephone

An isolated-word speech recognition system requires that the speaker pause briefly
between words, whereas a continuous speech recognition system does not. Spontaneous,
or extemporaneously generated, speech contains disfluencies and is much more
difficult to recognize than speech read from a script. Some systems require speaker
enrollment, where a user must provide samples of his or her speech before using them,
whereas other systems are said to be speaker-independent, in that no enrollment is
necessary. Some of the other parameters depend on the specific task. Recognition is
generally more difficult when vocabularies are large or have many similar-sounding
words. When speech is produced in a sequence of words, language models or artificial
grammars are used to restrict the combination of words.
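A toy illustration of how a grammar restricts word combinations: a bigram model that only accepts word pairs seen in training, roughly playing the role of a finite-state language model. The training sentences are invented for illustration:

```python
from collections import defaultdict

def train_bigrams(sentences):
    """Record which word may follow which, from training sentences."""
    allowed = defaultdict(set)
    for s in sentences:
        words = s.split()
        for a, b in zip(words, words[1:]):
            allowed[a].add(b)
    return allowed

def is_plausible(allowed, sentence):
    """Accept a hypothesis only if every adjacent word pair was seen in training."""
    words = sentence.split()
    return all(b in allowed[a] for a, b in zip(words, words[1:]))

model = train_bigrams(["open the file", "close the file", "open the door"])
ok = is_plausible(model, "close the door")    # both bigrams were seen
bad = is_plausible(model, "the close file")   # "the close" never occurred
```

Real recognizers score hypotheses with smoothed n-gram probabilities rather than a hard accept/reject, but the restricting effect on word combinations is the same idea.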

Basic Concepts of ASR (Automatic Speech Recognition)
There are two stages in ASR:
1. Training: The features of each speech unit are extracted and stored in the system.
2. Recognition: The features of an input speech unit are extracted and compared with
each of the stored feature sets, and the speech unit with the best-matching features
is taken as the recognized unit.
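The two stages can be sketched as template matching. Here dynamic time warping (DTW), a classic matching technique for isolated-word recognition, stands in for the comparison step; the stored feature templates are made-up one-dimensional sequences, not real speech features:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences,
    tolerating differences in speaking rate."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Training stage: one feature template is stored per speech unit.
templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 5, 1, 1, 1]}

def recognize(features):
    # Recognition stage: the unit with the best-matching template wins.
    return min(templates, key=lambda w: dtw_distance(templates[w], features))

word = recognize([1, 3, 3, 5, 5, 3, 1])   # a time-stretched "yes"
```

Modern systems replace templates with statistical models (HMMs or neural networks), but the train-then-match structure is the same.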

Music Indexing and Retrieval
There are two types of music: structured (synthetic) music and sample-based music.
Indexing and Retrieval of Structured Music and Sound Effects
Structured music and sound effects are represented by a set of commands or
algorithms. The most common structured music format is MIDI, which represents music
as a number of notes and control commands. MPEG-4 is a newer standard for structured
audio, which represents sound in algorithms and control languages.
These standards were developed for sound transmission, synthesis, and production;
they were not designed for indexing and retrieval purposes. However, the explicit
structure and note descriptions in these formats make the retrieval process easy,
since there is no need for feature extraction from audio signals.
User queries for sound files also depend on an exact match between queries and
database sound files. Sometimes the sound produced by a retrieved sound file may not
be what the user wants, because different devices can render the same structured
sound file differently.
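Because the notes are already explicit in structured formats, retrieval can be an exact symbolic match with no feature extraction. A minimal sketch; the note names and the tiny database are invented:

```python
# Toy "database" of note sequences, as a MIDI-like format might expose them.
database = {
    "twinkle": ["C", "C", "G", "G", "A", "A", "G"],
    "scale":   ["C", "D", "E", "F", "G", "A", "B"],
}

def contains_run(notes, query):
    """True if query occurs as a contiguous run within notes."""
    q = len(query)
    return any(notes[i:i + q] == query for i in range(len(notes) - q + 1))

def find_by_phrase(query):
    """Exact-match retrieval: return titles containing the queried note phrase."""
    return [title for title, notes in database.items()
            if contains_run(notes, query)]

hits = find_by_phrase(["G", "A", "A"])
```

The exact-match requirement is also the weakness noted above: the match is on the symbolic description, not on how a given device will actually render it.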

Indexing and Retrieval of Sample-Based Music
There are two general approaches to indexing and retrieval of sample-based music.
Retrieval based on a set of features
A model is built for each class based on a set of features, and the similarity
between the features of the query and the models is then computed.
Retrieval based on pitch
The pitch of each note is extracted or estimated, converting the musical sound into
a symbolic representation.
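The pitch-extraction step can be approximated by autocorrelation: for a periodic signal, the lag with the strongest self-similarity is one period, and its inverse is the fundamental frequency. A toy sketch with an assumed pitch search range of 50-1000 Hz:

```python
import numpy as np

def estimate_pitch(x, fs, fmin=50.0, fmax=1000.0):
    """Estimate fundamental frequency of x (sampled at fs Hz) by autocorrelation."""
    corr = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0 .. N-1
    lo, hi = int(fs / fmax), int(fs / fmin)               # candidate period range
    best_lag = lo + np.argmax(corr[lo:hi])                # strongest repetition
    return fs / best_lag

fs = 8000
t = np.linspace(0, 0.5, 4000, endpoint=False)
tone = np.sin(2 * np.pi * 200 * t)      # 200 Hz note, period = 40 samples
pitch = estimate_pitch(tone, fs)
```

Applying this note by note yields the symbolic pitch sequence that the retrieval step then matches against the query.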

FUTURE RESEARCH ISSUES AND TRENDS
Since the 1990s, remarkable progress has been made in theoretical research and system
development. However, there are still many challenging research problems. This
section identifies and addresses some issues on the future research agenda.
Automatic Metadata Generation
Metadata (data about data) is the data associated with an information object for the
purposes of description, administration, technical functionality and so on. Metadata
standards have been proposed to support the annotation of multimedia content.
Automatic generation of annotations for multimedia involves high-level semantic
representation and machine learning to ensure the accuracy of annotation.
Content-based retrieval techniques can be employed to generate the metadata, which
can then be used by text-based retrieval.
Embedding Relevance Feedback
Multimedia contains large quantities of rich information and involves the
subjectivity of human perception. The design of content-based retrieval systems has
come to emphasize an interactive approach instead of a computer-centric one. A
user-interaction approach requires human and computer to interact in refining
high-level queries. Relevance feedback is a powerful technique for facilitating this
interaction between the user and the system. The research issues include the design
of the interface with regard to usability, and learning algorithms that can
dynamically update the weights embedded in the query object to model high-level
concepts and perceptual subjectivity.

Bridging the Semantic Gap
One of the main challenges in multimedia retrieval is bridging the gap between
low-level representations and high-level semantics (Lew & Eakins, 2002). The semantic
gap exists because low-level features are easily computed in the system design
process, but high-level queries are the starting point of the retrieval process. The
semantic gap involves not only the conversion between low-level features and
high-level semantics, but also the understanding of the contextual meaning of a query
involving human knowledge and emotion. Current research aims to develop mechanisms or
models that directly associate high-level semantic objects with representations of
low-level features.

Conclusion
So far, the main concepts, issues and techniques in developing multimedia information
indexing and retrieval systems have been discussed. The importance of multimedia
databases has led researchers to focus their efforts on designing more efficient
methods and techniques to retrieve the best results from these databases.

References

Guojun Lu, Multimedia Database Management Systems, Artech House Publishers, 1999.

Chia-Hung Wei and Chang-Tsun Li, "Design of Content-based Multimedia Retrieval",
Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK.

Leung, "Survey papers on Audio Indexing and Retrieval", 2004/2005,
http://www.it.cityu.edu.hk

Jahanzeb Farooq and Michael Osadebey, "Content-based Image Retrieval & Shape as
Feature of Image", Media Signal Processing, presentation.

Dengsheng Zhang and Guojun Lu, "Content-Based Shape Retrieval Using Different Shape
Descriptors: A Comparative Study", Gippsland School of Computing and Information
Technology, Monash University, Churchill, Victoria 3842, Australia.
Terms and Definitions

Boolean Query: A query that uses Boolean operators (AND, OR, and NOT) to formulate a
complex condition. An example of a Boolean query is "university" OR "college".

Content-Based Retrieval: An application that directly makes use of the contents of
media, rather than annotations input by humans, to locate desired data in large
databases.

Feature Extraction: A subject of multimedia processing which involves applying
algorithms to calculate and extract attributes that describe the media.

High-Level Feature: A feature, such as timbre, rhythm, instruments, or events,
involving different degrees of semantics contained in the media.

Intensity: The power of a frequency component at a particular time interval.

Low-Level Feature: A feature such as object motion, color, shape, texture, loudness,
power spectrum, bandwidth or pitch.

Query by Example: A method of searching a database using example media as the search
criteria. This mode allows users to select predefined examples without requiring them
to learn a query language.

Segmentation: The process of dividing a video sequence into shots.

Shot: A short sequence of contiguous frames.

Similarity Measure: A measure that compares the similarity of any two objects
represented in a multi-dimensional space. The general approach is to represent the
data features as multi-dimensional points and then calculate the distances between
the corresponding points.

Video: A combination of text, audio and images with a time dimension.