Neural Networks for Intelligent Multimedia Processing
SUN-YUAN KUNG, FELLOW, IEEE, AND JENQ-NENG HWANG, SENIOR MEMBER, IEEE
This paper reviews key attributes of neural processing essential to intelligent multimedia processing (IMP). The objective is to show why neural networks (NNs) are a core technology for the following multimedia functionalities: 1) efficient representations for audio/visual information; 2) detection and classification techniques; 3) fusion of multimodal signals; and 4) multimodal conversion and synchronization. It also demonstrates how the adaptive NN technology presents a unified solution to a broad spectrum of multimedia applications. As substantiating evidence, representative examples where NNs are successfully applied to IMP applications are highlighted. The examples cover a broad range, including image visualization, tracking of moving objects, image/video segmentation, texture classification, face-object detection/recognition, audio classification, multimodal recognition, and multimodal lip reading.

Keywords: Access-control security, ACON/OCON networks, active contour model, adaptive expectation-maximization, audio classification, audio-to-visual conversion, audio/visual fusion, decision-based neural network, face recognition, hidden Markov models, image segmentation, image/video representation, independent components analyses, intelligent multimedia processing, linear/nonlinear fusion network, mixture of experts, motion-based video segmentation, multilayer perceptron, multimodal lip reading, neural-network technology, object classification, object detection, object tracking, personal authentication, principal components analyses, RBF networks, self-organizing feature map, subject-based retrieval, texture classification, time-delay neural network, vector quantization, video indexing and browsing.

Manuscript received August 12, 1997; revised December 15, 1997. The Guest Editor coordinating the review of this paper and approving it for publication was K. J. R. Liu.
S.-Y. Kung is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: kung@princeton.edu).
J.-N. Hwang is with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195 USA (e-mail: hwang@ee.washington.edu).
Publisher Item Identifier S 0018-9219(98)03525-7.

I. INTRODUCTION

Multimedia technologies will profoundly change the way we access information, conduct business, communicate, educate, learn, and entertain [21], [95], [142]. Multimedia technologies also represent a new opportunity for research interactions among a variety of media such as speech, audio, image, video, text, and graphics. As digitization and encoding of image/video have become more affordable, computer and Web data base systems are starting to store voluminous image/video data. Consequently, a massive amount of visual information on-line has come closer to a reality. It promises a quantum elevation of the level of tomorrow's world in entertainment and business. As data-acquisition technology advances rapidly, however, we have now substantially fallen behind in terms of technologies for indexing and retrieving visual information in large archives.

A. Challenges of Multimedia Information Processing

Consider only a small sample of the challenging problems:
- how to find some desired pictures from a large archive of photographs;
- how to search for a specific clip from a video archive consisting of tons of tapes;
- how to support visual-based interactive learning systems servicing a broad spectrum of customers, including homemakers, students, and professionals;
- how to create and maintain multimedia data bases capable of providing information in multiple media, including text, audio, image, and video.

For example, it would be desirable to have a tool that efficiently searches the Web for a desired picture (or video clip) and/or audio clip by using as a query a shot of multimedia information [139], [152]. Nowadays, some popular queries might look like: "Find frames with 30% blue on top and 70% green in bottom" or "Find the images or clips similar to this drawing." In contrast to the above similarity-based queries, it was argued that a so-called subject-based query [152] might be more likely to be used by users, e.g., a query request such as "Find Reagan speaking to the Congress." The subject-based query offers a more user-friendly interface, but it also introduces a greater technical challenge, which calls for advances in two distinctive research frontiers [21].

Computer networking technology: Novel communication and networking technologies are critical for a multimedia data base system to support interactive dynamic interfaces. A truly integrated media system must
Table 1 Comparison Between Extractions of PCA and ICA
required performance and reliability with maximum economy. Actual design changes are frequently kept to a minimum to reduce the risk of failure. As a result, it is important to analyze the configurations, components, and materials of past designs so that good aspects may be reused and poor ones changed. A generic method of configuration evaluation based on an SOFM has been successfully reported [105]. The SOFM architecture with activation retention and decay, which creates unique distributed response patterns for different sequences, has also been successfully proposed for mapping arbitrary sequences of binary and real numbers, as well as phonemic representations of English words [57]. By using a selective learnable SOFM, which has the special property of effectively creating spatially organized internal representations and nonlinear relations of various input signals, a practical and generalized method was proposed in which effective nonlinear shape restoration is possible regardless of the existence of distortion models [44]. For many other examples of successful applications, the interested reader may see, e.g., [29], [72], and [131].

5) Principal and Independent Components Analyses (PCAs and ICAs): PCA provides an effective way to find representative components of a large set of multivariate data. The basic learning rules for extracting principal components follow the Hebbian rule and the Oja rule [111]. The PCA problem can be viewed as an unsupervised learning network following the traditional Hebbian-type learning. The basic network is one where the neuron is a simple linear unit with

y(n) = w(n)^T x(n).    (10)

To enhance the correlation between the input x and the extracted component y, it is natural to use a Hebbian-type rule

w(n+1) = w(n) + eta y(n) x(n).    (11)

The above Hebbian rule is impractical for PCA, taking into account the finite-word-length effect, since the training weights will eventually overflow (i.e., exceed the limit of dynamic range) before the first component totally dominates and the other components sufficiently diminish. An effective technique to overcome the overflow problem is to ensure that the weight vectors remain normalized after updating, leading to the following Oja rule:

w(n+1) = w(n) + eta y(n) [x(n) - y(n) w(n)].    (12)

A lateral neural network (called APEX) was found useful for the extraction of multiple principal components [33], [67]. The structure incorporates lateral connections into the network. The structure, together with an orthogonalization learning rule, helps ensure that orthogonality is preserved between multiple principal components. A numerical analysis of their learning rates and convergence properties has also been established.

ICA has played a prominent role in many communication and signal/image-processing application domains [8], [15], [27]. It has recently received a lot of interest from the neural-network community [110]. Just like the PCA, the output of an ICA network is meant for extracting an independent component. A hybrid mixture is a mixture of super-Gaussian, Gaussian, and sub-Gaussian independent components (ICs). It was shown in [69] and [70] that the kurtosis of a single-output process y = w^T x~ has the very desirable property that most extrema points correspond to pure ICs. In fact, all the positive local maxima (resp. negative local minima) can yield super-Gaussian (resp. sub-Gaussian) ICs from any mixture [69].

It is interesting to note that PCA and ICA can share a learning network. For example, the APEX lateral network could be used for either PCA or ICA. Table 1 summarizes a comparison between extractions of PCA and ICA. Briefly speaking, after prewhitening of the input x, which yields a new observation space x~, the following learning rule [69]

w(n+1) = w(n) +/- eta y^3(n) x~(n)    (13)

can be applied to extract super-Gaussian (via the "+" rule) and sub-Gaussian (via the "-" rule) ICs. (A numerical analysis indicated that renormalization of the weight vector w will be needed for sub-Gaussian extraction.) This learning rule has a very similar appearance to the Oja rule (12). The learning rule and an APEX-like orthogonalization (deflation) technique form the basis of an ICA neural network called the kurtosis-based independent component network (KuicNet). Some KuicNet application examples for real speech/image data processing are presented in [70], and its extension to blind deconvolution/equalization is found in [35].

a) Application examples: ICA has recently found interesting applications in signal processing (e.g., interference removal, blind source separation, the cocktail party problem) and communication (e.g., blind deconvolution, multipath, multichannel equalization) [8], [15]. PCA has been successfully applied in much broader signal/image processing and compression applications. Due to space constraints, only a few samples can be highlighted here.
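The PCA rules (10)-(12) and the kurtosis-based rule (13) can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the authors' APEX/KuicNet code; the batch-gradient form of the kurtosis update and all step sizes are assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def oja_pca(X, eta=0.002, epochs=20):
    """First principal component of X (samples x dims) via the Oja
    rule (12): w <- w + eta * y * (x - y * w), with y = w^T x as in (10)."""
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x                    # linear unit, Eq. (10)
            w += eta * y * (x - y * w)   # Oja rule, Eq. (12)
    return w / np.linalg.norm(w)

def kurtosis_ic(Xw, eta=0.5, epochs=300, sign=+1.0):
    """One independent component from prewhitened data Xw via a batch
    version of rule (13): w <- w +/- eta * E[y^3 * x~], renormalizing
    the weight vector after every step (cf. the remark in the text)."""
    w = rng.normal(size=Xw.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        y = Xw @ w
        w = w + sign * eta * ((y ** 3)[:, None] * Xw).mean(axis=0)
        w /= np.linalg.norm(w)           # keep w on the unit sphere
    return w
```

With sign=+1.0 the rule climbs toward a super-Gaussian IC; with sign=-1.0 it descends toward a sub-Gaussian one, matching the "+"/"-" choice in (13).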
are available, such as real-time recurrent learning (RTRL) networks [148] and back-propagation through time (BPTT) networks [50], [126]. The computational requirement of these and several variants is very high. Among all the recurrent networks, BPTT's performance is best unless on-line learning is required, in which case RTRL is needed instead. But for many applications involving temporal sequence data, an SRN or a TDNN may suffice and is much less costly than RTRL or BPTT.

A hidden Markov model (HMM) [117]-[119] is a doubly stochastic process with an underlying stochastic process that is not observable (i.e., hidden) but can only be observed through another set of stochastic processes that produce the sequence of observed symbols [118]. The trellis diagram realization of an HMM can be considered as a BPTT network expanding in time, since its connections (transition probabilities) carry the information about the environment and it consists of a multilayer network of simple units activated by the weighted sum of the unit activations at the previous iteration. In addition, the learning technique used in HMMs has a close algorithmic analogy with that used in the BPTT networks [52].

1) TDNN: Fig. 3 shows the TDNN architecture [141] for a three-class temporal sequence recognition task. A TDNN is basically a feed-forward multilayer (four layers) NN with time-delay connections at the hidden layer to capture varying amounts of context. The basic unit in each layer computes the weighted sum of its inputs and then passes this sum through a nonlinear sigmoid function to the higher layer. The TDNN classifier shown in Fig. 3 has an input layer with 16 units, a hidden layer with eight units, and an output layer with three units (one output unit represents one class).

When a TDNN is used for speech recognition, the speech utterance is partitioned frame by frame (e.g., a 30-ms frame with a 15-ms advance). Each frame is transformed to 16 coefficients serving as input to the network. Every three frames, with time delays zero, one, and two, are input to the eight time-delay hidden units, i.e., each neuron in the first hidden layer receives input (via 3 x 16 weighted connections) from the coefficients in the three-frame window. The eight-unit hidden layer is delayed five times to form a 40-unit layer. At the second hidden layer, each unit looks at all five copies of the delayed eight-unit hidden blocks of the first hidden layer. Last, the output is obtained by integrating the evidence from the second hidden layer over time and connecting it to its output unit. This procedure can be formalized as the following equation:

y_c = f( sum_t sum_i w_ci h_i^(2)(t) ).    (25)
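The windowed forward pass just described can be sketched as follows. The layer sizes follow the text (16 inputs, eight first-hidden units over a three-frame window, a five-frame window at the second hidden layer, three outputs); the number of second-hidden units H2 and the random weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tdnn_forward(frames, W1, W2, W3):
    """Sketch of the TDNN forward pass of Fig. 3.
    frames: (T, 16) spectral coefficients, one row per 30-ms frame.
    W1: (8, 3 * 16)   three-frame window (delays 0, 1, 2) -> 8 units
    W2: (H2, 5 * 8)   five delayed copies of hidden layer 1 -> H2 units
    W3: (3, H2)       class evidence, integrated (summed) over time
    """
    T = frames.shape[0]
    h1 = np.array([sigmoid(W1 @ frames[t:t + 3].ravel())
                   for t in range(T - 2)])
    h2 = np.array([sigmoid(W2 @ h1[t:t + 5].ravel())
                   for t in range(len(h1) - 4)])
    return sigmoid(W3 @ h2.sum(axis=0))   # one score per class

# Illustrative random weights (H2 = 4 is an assumption).
W1 = rng.normal(scale=0.1, size=(8, 3 * 16))
W2 = rng.normal(scale=0.1, size=(4, 5 * 8))
W3 = rng.normal(scale=0.1, size=(3, 4))
scores = tdnn_forward(rng.normal(size=(20, 16)), W1, W2, W3)
```

The shared windowing weights W1 and W2 are what give the TDNN its shift invariance along the time axis.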
1) Given the observation sequence O = o_1 o_2 ... o_T and the model lambda, how to compute P(O | lambda), i.e., the probability that the observation sequence was produced by the model.

III. NEURAL NETWORKS FOR IMP APPLICATIONS

Neural networks have played a very important role in the development of multimedia application systems [21],
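Problem 1 above is solved by the forward algorithm in O(N^2 T) time rather than by enumerating all state paths. A minimal sketch, using the standard HMM notation (A, B, pi); the matrices below are made-up toy values, not a model from the paper:

```python
import numpy as np

def forward_prob(A, B, pi, obs):
    """Forward algorithm: P(O | lambda) for an HMM with N states.
    A: (N, N) transition matrix, B: (N, M) emission matrix,
    pi: (N,) initial state distribution, obs: observation indices."""
    alpha = pi * B[:, obs[0]]           # induction base
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # induction step
    return alpha.sum()                  # termination

# Toy two-state example.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
p = forward_prob(A, B, pi, [0, 1])      # P(O | lambda) for O = (0, 1)
```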
bandwidth commonly introduces artifacts such as jerky motion and loss of lip synchronization in talking-head video. Therefore, lip synchronization becomes an important issue in video telephony and video conferencing. An important enabling technology was proposed for bimodal speech processing by mapping from audio speech [e.g., linear predictive coding (LPC) cepstra] to lip movements (visual parameters) [23]. This approach contains two stages. In the first stage, the acoustics must be classified into one of a number of groups. The second stage maps each acoustic group into a corresponding visual output.

A (typical) left-to-right audio-visual HMM was adopted. Consider estimating a single visual parameter v from the corresponding multidimensional acoustic parameter a. Defining the combined observation to be o = [a, v], the audio-to-visual conversion process using HMMs can be treated as a missing-data problem. More specifically, a continuous-density HMM was trained with a sequence of o's for each word in the vocabulary. In the training phase, the Gaussian mixtures in each state of the HMM model the joint distribution of the audio-visual parameters. When presented with a sequence of acoustic vectors that correspond to a particular word, conversion can be made by using the HMM to segment the sequence of acoustic parameters into the optimal state sequence using the Viterbi algorithm. More exactly, when presented with a sequence of acoustic vectors that correspond to a particular word, the maximum-likelihood estimator for the associated visual parameter vectors is equal to the conditional expectation, which can be derived from the optimal state sequence using the Viterbi algorithm of the HMM.

In [23] and [121], audio-visual parameter sets were used to train one static MLP and one temporal model (HMM). In the former approach, an OCON structure was adopted (see Fig. 1). One static MLP network per word was used to estimate the visual height, which was considered to be the most salient feature. Each MLP has six nodes in the first hidden layer, eight nodes in the second hidden layer, and one node in the output layer. On the other hand, the temporal HMM that was trained had seven states, three mixtures per state, and diagonal covariance matrices. The results indicate that the temporal model possesses a better capability (than the static MLP) in breaking the word up into phonetically meaningful states. This demonstrates the power of the temporal model for synchronization applications.

2) Audio and Visual Integration for Lip-Reading Applications: Although the signal-processing theory of automatic speech recognition is well advanced, practical applications lag because the speech signals to be processed are usually contaminated with background noise in adverse environments such as offices, automobiles, aircraft, and factories. To improve the performance of the speech-recognition system, the following approaches can be used: 1) compensate for the noise in the acoustic speech signals prior to or during the recognition process [98] or 2) use multimodal information sources, such as semantic knowledge and visual features, to assist acoustic speech recognition. The latter approach is supported by the evidence that humans rely on other knowledge sources, such as visual information, to help constrain the set of possible interpretations [149].

Due to the maturity of digital video technology, it is now feasible to incorporate visual information in the speech-understanding process (lip reading). These new approaches offer effective integration of visually derived information into state-of-the-art speech-recognition systems so as to gain improved performance in noise without suffering degraded performance on clean speech [129]. Other important evidence as to the use of lip reading in human speech perception is offered by the auditory-visual blend illusion, or the McGurk effect [96].

Three mechanisms about the means by which the two disparate (audio and visual) streams of information are
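Under a joint-Gaussian assumption within a state, the conditional expectation used for audio-to-visual conversion has a closed form; once the Viterbi algorithm has assigned each frame to a state, the visual parameter can be estimated frame by frame. A sketch follows; the symbols a and v and the single-Gaussian-per-state simplification are illustrative (the models above use mixtures).

```python
import numpy as np

def visual_estimate(a, mu, cov):
    """Conditional mean E[v | a] for one HMM state whose Gaussian
    models the joint observation o = [a, v] (a of dim d, v scalar):
    v_hat = mu_v + C_va C_aa^{-1} (a - mu_a).
    mu: (d+1,) joint mean, cov: (d+1, d+1) joint covariance."""
    d = a.shape[0]
    mu_a, mu_v = mu[:d], mu[d]
    C_aa = cov[:d, :d]       # acoustic covariance block
    C_va = cov[d, :d]        # visual/acoustic cross-covariance
    return mu_v + C_va @ np.linalg.solve(C_aa, a - mu_a)

# d = 1 toy check: correlation 0.8 between a and v, zero means.
v_hat = visual_estimate(np.array([1.0]),
                        np.array([0.0, 0.0]),
                        np.array([[1.0, 0.8], [0.8, 1.0]]))
```

In a full converter this function would be applied per frame, with mu and cov taken from the state that Viterbi segmentation assigned to that frame.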
Fig. 7. A subject-based indexing system. (a) Visual search methodology. (b) Tagging procedure. (c) Tagging illustration.
audio features consist of 16 mel-scale Fourier coefficients obtained at a 10-ms frame rate. The visual features were formed from the PCA transform with reduced dimensionality (32 only) out of 24 x 16 smoothed eigenlips. The two TDNNs are actually used for recognizing the corresponding phoneme (out of 62) and viseme (out of 42), which were then combined statistically for recognition of the continuous letter sequence based on the dynamic time-warping algorithm.

The project presented in [10] also combines the acoustic and visual features for effective lip reading. Instead of using neural networks as the temporal sequence classifier, this project adopted HMMs and used a multilayer perceptron to calculate the observation probabilities P(phone | audio, visual). The system combines the ten-order PCA transform coefficients (and/or the delta-features) from a gray-level eigenlip matrix (instead of the PCA from the snake points) from the video data and nine of the acoustic features from audio data [48]. They used a discriminatively trained MLP to compute the observation probabilities (the likelihood of the input speech data given the state of a subword model) needed by the Viterbi algorithm. Theoretically, the MLP provides the posterior probabilities instead of the likelihood, which can be easily converted to likelihood according to Bayes' rule using the prior probability information. This bimodal hybrid
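The Bayes'-rule conversion mentioned here is the standard hybrid MLP/HMM device: dividing each MLP posterior by the class prior gives the likelihood scaled by the constant factor p(x), which Viterbi scoring can use directly since the common factor does not change path comparisons. A one-line sketch with illustrative values:

```python
def scaled_likelihoods(posteriors, priors):
    """p(x | state) / p(x) = P(state | x) / P(state), by Bayes' rule.
    The common factor p(x) does not affect Viterbi path comparison."""
    return [post / prior for post, prior in zip(posteriors, priors)]

# Toy example: three states with MLP posteriors and training priors.
likes = scaled_likelihoods([0.6, 0.3, 0.1], [0.5, 0.3, 0.2])
```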
speech-recognition system has already been applied to a multispeaker spelling task, and work is in progress to apply it to a speaker-independent spontaneous speech-recognition system, the Berkeley Restaurant Project.

c) Decision integration: As discussed in the previous section, the audio and visual features can be combined into one vector before pattern recognition. Then, the decision is solely based on the result of the pattern recognizer. In the case of some lip-reading systems, which perform independent visual and audio evaluation, some rule is required to combine the two evaluation scores into a single one. Typical examples included the use of heuristic rules to incorporate knowledge of the relative confusability of phonemes in the evaluation of the two modalities [114]; others used a multiplicative combination of the independent evaluation scores for each modality. These postintegration methods possess the advantages of conceptual and implementational simplicity, as well as giving the user the flexibility to use just one of the subsystems if desired.

D. Video Browsing and Content-Based Indexing

Digital video processing has recently become an important core information-processing technology. The MPEG-4 audio/visual coding standards tend to allow content-based interactivity, universal accessibility, and a high degree of flexibility and extensibility. To accommodate voluminous multimedia data, researchers have long suggested the content-based indexing and retrieval paradigm. Content-based intelligent processing is so critical because it offers a very broad application domain, including video coding, compaction, object-oriented representation of video, content-based retrieval in the digital library, video mosaicing, video composition (hybrid of natural and synthetic scenes), etc. [19].

1) Subject-Based Retrieval for Image and Video Data Bases: An NN-based tagging algorithm is proposed for subject-based retrieval for image and video data bases [152]. Object classification for tagging is performed off-line using DBNN. A hierarchical multiresolution approach is used, which helps cut down the search space of looking for a feature in an image. The classification is performed in two phases. In the first phase, color is used, and in the second, texture features are applied to refine the classification (both via DBNN). The general indexing scheme and tagging procedure are depicted in Fig. 7. The system [152] allows a customer to search the image data base by semantic subject. The images are not manipulated directly in the on-line phase. Each image is classified into a series of predefined subjects off-line using color and texture features and neural-network techniques. Queries are answered by searching over the tag data base. Unlike previous approaches, which directly manipulate images on-line using templates or low-level image parameters, the system tags the images off-line, which greatly enhances performance.

Compared to most of the other existing content-based retrieval systems, which only support similarity-based retrieval, this system supports subject-based retrieval, allowing the system to retrieve by visual objects. The difference between subject- and similarity-based retrieval lies in the necessity for visual object recognition. Therefore, previous low-level models are not suitable for subject-based retrieval. Novel models are needed for subject-based retrieval, which could be utilized in film- and television-program-oriented digital video data bases. Neural networks provide a natural and effective technology for intelligent information processing.

The tagging procedure includes four steps. In the first step, each image is cut into 25 equal-size blocks. Each block may contain single or multiple objects. In the second step, color information is employed for an initial classification, where each block is classified into one of the following families in the HSV color space: black, gray, white, red, yellow, green, cyan, blue, and magenta. In the next step, texture features are applied to refine the classification using DBNN if the result of color classification is a nonsingleton set of subject categories. Each block may be further classified into one of the following categories: sky, foliage, fleshtone, blacktop, white-object, ground, light, wood, unknown, and unsure. Last, an image tag generated from the lookup table using the object-recognition results is saved in the tag data base. The experimental results on the Web-based implementation show that this model is very efficient for a large film- or television-program-oriented digital video data base.

Fig. 9. (a) Representative frames of a news report sequence. (b) Representative frames of the anchorman found by the PDBNN face recognizer.

2) Face-Based Video Indexing and Browsing: A video indexing and browsing scheme based on human faces is proposed by Lin et al. [82], [83]. The scheme is implemented by applying face detection and recognition techniques. In many video applications, browsing through a large amount of video material to find the relevant clips is
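The color phase of the tagging procedure can be illustrated with a small sketch. The hue boundaries and the saturation/value thresholds below are illustrative guesses, not the system's actual HSV decision rules, and the DBNN texture-refinement phase is omitted entirely.

```python
import numpy as np

# Illustrative hue boundaries in degrees (NOT the system's actual rules).
HUE_FAMILIES = [(0, 30, "red"), (30, 90, "yellow"), (90, 150, "green"),
                (150, 210, "cyan"), (210, 270, "blue"),
                (270, 330, "magenta"), (330, 360, "red")]

def color_family(h, s, v):
    """Phase-one label for a single HSV pixel."""
    if v < 0.2:
        return "black"
    if s < 0.15:
        return "white" if v > 0.8 else "gray"
    for lo, hi, name in HUE_FAMILIES:
        if lo <= h < hi:
            return name
    return "red"   # h == 360 wraps around to red

def tag_blocks(hsv_image):
    """Step 1: cut the image into 25 equal-size blocks (5 x 5 grid).
    Step 2: give each block the majority color family of its pixels."""
    H, W, _ = hsv_image.shape
    tags = []
    for i in range(5):
        for j in range(5):
            blk = hsv_image[i * H // 5:(i + 1) * H // 5,
                            j * W // 5:(j + 1) * W // 5]
            labels = [color_family(*px) for px in blk.reshape(-1, 3)]
            tags.append(max(set(labels), key=labels.count))
    return tags
```

In the full system a block whose color label is ambiguous (a nonsingleton set of subjects) would then be passed to the texture-based DBNN refinement stage.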