Yuli You
Audio Coding
Theory and Applications
ISBN 978-1-4419-1753-9
e-ISBN 978-1-4419-1754-6
DOI 10.1007/978-1-4419-1754-6
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010931862
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Since branching off from speech coding in the early 1970s, audio coding has
slipped into our daily lives in a variety of applications, such as mobile music/video
players, digital television/audio broadcasting, optical discs, online media streaming,
and electronic games. It has become one of the essential technologies in today's
consumer electronics and broadcasting equipment.
In its more than 30 years of evolution, many audio coding technologies have come
into the spotlight and then become obsolete; only a minority have survived and
are deployed in major modern audio coding algorithms. While covering all the major
turns and branches of this evolution would be valuable for technology historians or for
readers with an intense interest, it would be distracting, even overwhelming, for most readers.
Therefore, those historic events are omitted and this book focuses instead on
the current state of this evolution. Such a focus also allows full coverage
of the selected topics.
This state of the art is presented from the perspective of a practicing engineer and
adjunct associate professor who single-handedly developed the whole DRA audio
coding standard, from algorithm architecture to assembly-code implementation
to subjective listening tests. This perspective has a clear focus on the why and the
how. In particular, many purely theoretical details, such as proofs of the perfect
reconstruction property of various filter banks, are omitted. Instead, the emphasis is on
the motivation for a particular technology: why it is useful, what it is, and how it is
integrated into a complete algorithm and implemented in practical products. Consequently,
many practical aspects of audio coding normally excluded from
audio coding books, such as transient detection and the implementation of decoders on
low-cost microprocessors, are covered in this book.
This book should help readers grasp state-of-the-art audio coding technologies
and build a solid foundation from which to understand and implement
various audio coding standards, or to develop their own should the need arise. It is,
therefore, a valuable reference for engineers in the consumer electronics and broadcasting
industries and for graduate students of electrical engineering.
Audio coding seeks to achieve data compression by removing perceptual irrelevance
and statistical redundancy from a source audio signal. The efficiency of this removal
is powerfully augmented by data modeling, which compacts and/or decorrelates the
source signal. The presentation of this book is therefore centered around these three
basic elements and organized into the following five parts.
Part I gives an overview of audio coding, describing the basic ideas, the key challenges, important issues, fundamental approaches, and the basic codec architecture.
Part II is devoted to quantization, the tool for removing perceptual irrelevance.
Chapter 2 delineates scalar quantization, which quantizes a source signal one sample
at a time. Both uniform and nonuniform quantization, including the Lloyd-Max
algorithm, are discussed. Companding is presented as a structured and simple method
for implementing nonuniform quantization.
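As a rough illustration of the companding idea (a sketch, not the book's own formulation), the code below realizes a nonuniform quantizer by compressing the signal with a μ-law characteristic, quantizing uniformly, and expanding back; the constant μ = 255 and the 8-bit midrise quantizer are illustrative assumptions.

```python
import math

def mu_law_compress(x, mu=255.0):
    """Map x in [-1, 1] to y in [-1, 1], expanding small amplitudes."""
    return math.copysign(math.log(1.0 + mu * abs(x)) / math.log(1.0 + mu), x)

def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

def uniform_quantize(x, bits=8):
    """Midrise uniform quantizer on [-1, 1]."""
    step = 2.0 / (1 << bits)
    return (math.floor(x / step) + 0.5) * step

def nonuniform_quantize(x, bits=8, mu=255.0):
    """Nonuniform quantization implemented as compress-quantize-expand."""
    return mu_law_expand(uniform_quantize(mu_law_compress(x, mu), bits), mu)
```

For a low-level sample such as x = 0.01, the companded quantizer yields a much smaller error than the uniform one at the same 8-bit budget, which is exactly the point of companding.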
Chapter 3 describes vector quantization, which quantizes two or more samples
of a source signal as one block at a time. Also included is the Linde-Buzo-Gray
(LBG), or k-means, algorithm, which builds an optimal VQ codebook from a set of
training data.
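The LBG iteration can be sketched in a few lines: alternate the nearest-neighbor condition (assign each training vector to its closest codeword) with the centroid condition (move each codeword to the mean of its cell). Initializing the codebook from the first k training vectors is an illustrative choice, not the book's prescription.

```python
def lbg(training, k, iters=20):
    """LBG / k-means: alternate nearest-codeword assignment and
    centroid update to build a VQ codebook."""
    codebook = [list(v) for v in training[:k]]  # illustrative init
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for v in training:
            # nearest-neighbor condition
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(v, codebook[j])))
            cells[i].append(v)
        for i, cell in enumerate(cells):
            if cell:  # centroid condition
                codebook[i] = [sum(c) / len(cell) for c in zip(*cell)]
    return codebook
```

On training vectors clustered around two centers, the codebook converges to (approximately) the two cluster centroids.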
Part III is devoted to data modeling, which transforms a source signal into a representation
that is energy-compact and/or decorrelated. Chapter 4 describes linear
prediction, which uses a linear combination of past samples of the source
signal as a prediction for the current sample so as to arrive at a prediction error
signal that has lower energy and is decorrelated. It first explains why quantizing
the prediction error signal, instead of the source signal, can dramatically improve
coding efficiency. It then presents open-loop DPCM and DPCM, the two most common
forms of linear prediction, derives the normal equation for optimal prediction,
presents the Levinson-Durbin algorithm that iteratively solves the normal equation,
shows that the prediction error signal has a white spectrum and is thus decorrelated,
and illustrates that the prediction decoder filter provides an estimate of the
spectrum of the source signal. Finally, a general framework for linear prediction
that can shape the spectrum of quantization noise into desirable shapes, such as that
of the absolute threshold of hearing, is presented.
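As a minimal sketch of the Levinson-Durbin recursion (the full derivation is in Chapter 4), the routine below solves the normal equation for the optimal predictor coefficients given autocorrelation values r[0..order], tracking the prediction error power as the order grows.

```python
def levinson_durbin(r, order):
    """Solve the normal equation for the optimal linear predictor
    from autocorrelations r[0..order]; returns (coefficients, error power)."""
    a = []        # predictor coefficients, x_hat(n) = sum a[i] * x(n-1-i)
    err = r[0]    # prediction error power at order 0
    for m in range(1, order + 1):
        # reflection coefficient for order m
        k = (r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))) / err
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= (1.0 - k * k)
    return a, err
```

For an AR(1)-like autocorrelation r = [1, 0.9, 0.81], the order-2 solution correctly yields a first coefficient of 0.9, a second of 0, and an error power of 1 − 0.9² = 0.19.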
Chapter 5 deals with transforms, which linearly transform a block of source signal
samples into another block of coefficients whose energy is compacted into a minority
of them. It first explains why this compaction of energy leads to dramatically improved
coding efficiency, through the AM-GM inequality and the associated optimal bit allocation
strategy. It then derives the Karhunen-Loève transform from the search for
the optimal transform. Finally, it presents suboptimal and practical transforms, such
as the discrete Fourier transform (DFT) and the discrete cosine transform (DCT).
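The energy-compaction property is easy to demonstrate numerically. The sketch below implements an orthonormal DCT-II directly from its definition (an O(N²) illustration, not a fast algorithm) and applies it to a smooth 16-sample signal; nearly all of the energy lands in the first two coefficients while the total energy is preserved (Parseval).

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II: concentrates the energy of smooth
    signals into a few low-frequency coefficients."""
    n = len(x)
    out = []
    for k in range(n):
        s = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(s * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                           for i in range(n)))
    return out

# a smooth signal: DC offset plus one slow cosine
signal = [3.0 + 2.0 * math.cos(math.pi * (i + 0.5) / 16) for i in range(16)]
coeffs = dct_ii(signal)
```

Here 16 time samples of comparable magnitude collapse into two significant coefficients, which is precisely what makes optimal bit allocation across coefficients pay off.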
Chapter 6 presents subband filter banks as extended transforms in which historic
blocks of source samples overlap with the current block. It describes various aspects
of subband coding, including reconstruction error and polyphase representation, and
illustrates that the dramatically improved coding efficiency is likewise achieved through
energy compaction.
Chapter 7 is devoted to cosine-modulated filter banks (CMFB), whose structure
is amenable to fast implementation. It first builds this filter bank from the DFT and
explains that it has the structure of a prototype filter plus cosine modulation. It then
presents non-perfect-reconstruction and perfect-reconstruction CMFBs and their efficient
implementation structures. Finally, it illustrates that the modified discrete cosine
transform (MDCT), the most widely used filter bank in audio coding, is a special
and simple case of CMFB.
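A rough sketch of the MDCT idea (not the book's derivation) can be written directly from the transform definition: a sine window, a cosine modulation of 2n samples down to n coefficients, and 50% overlap-add on the synthesis side, where the time-domain aliasing of adjacent blocks cancels. The block length n = 4 and the test signal are arbitrary illustrative choices.

```python
import math

def mdct(block, n):
    """MDCT of a 2n-sample block (sine-windowed) -> n coefficients."""
    w = [math.sin(math.pi / (2 * n) * (i + 0.5)) for i in range(2 * n)]
    return [sum(w[i] * block[i] *
                math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
                for i in range(2 * n))
            for k in range(n)]

def imdct(coeffs, n):
    """IMDCT back to 2n windowed samples (with time-domain aliasing)."""
    w = [math.sin(math.pi / (2 * n) * (i + 0.5)) for i in range(2 * n)]
    return [(2.0 / n) * w[i] *
            sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
                for k in range(n))
            for i in range(2 * n)]

# 50% overlap-add: aliasing from adjacent blocks cancels (TDAC)
n = 4
x = [0.3, -1.2, 0.8, 0.1, -0.5, 0.9, 0.4, -0.7, 0.2, 1.1, -0.3, 0.6]
out = [0.0] * len(x)
for start in range(0, len(x) - 2 * n + 1, n):
    y = imdct(mdct(x[start:start + 2 * n], n), n)
    for i in range(2 * n):
        out[start + i] += y[i]
# samples n .. len(x)-n-1, covered by two overlapping blocks, are
# perfectly reconstructed
```

Each individual block is not invertible (n coefficients cannot represent 2n samples), yet the overlapped sum is exact: this is the time-domain alias cancellation that makes the MDCT critically sampled and perfectly reconstructing at once.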
Part IV is devoted to entropy coding, the tool for removing statistical redundancy.
Chapter 8 establishes that entropy is determined by the probability distribution of the
source signal and is the fundamental lower limit of bit rate reduction. It then shows
that any meaningful entropy code has to be uniquely decodable and, to be practically
implementable, should be instantaneously decodable. Finally, it illustrates that
prefix-free codes are just such codes and further proves Shannon's noiseless coding
theorem, which essentially states that the entropy can be asymptotically approached
by a prefix-free code if source symbols are coded as blocks and the block size goes
to infinity.
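The entropy bound itself is a one-line computation from the probability distribution; the sketch below evaluates it for an example distribution (the probabilities are illustrative).

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the lower limit on average code length."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# a dyadic distribution: entropy is exactly achievable by a prefix-free code
h = entropy([0.5, 0.25, 0.125, 0.125])  # 1.75 bits per symbol
```

For dyadic probabilities like these, a prefix-free code with lengths 1, 2, 3, 3 meets the bound exactly; for general distributions the bound is approached only by coding blocks of symbols.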
Chapter 9 presents the Huffman code, an optimal prefix-free code widely used in
audio coding. It first presents Huffman's algorithm, an iterative procedure
that builds a prefix-free code from the probability distribution of the source signal,
and then proves its optimality. It also addresses some practical issues related to
the application of Huffman coding, emphasizing the importance of coding source
symbols as longer blocks.
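Huffman's algorithm — repeatedly merge the two least probable nodes and prepend a bit to every symbol inside them — fits in a short sketch using a binary heap; the tie-breaking by insertion order is an implementation detail, not part of the algorithm's optimality.

```python
import heapq

def huffman_code(freqs):
    """Build a prefix-free code from symbol probabilities by repeatedly
    merging the two least probable nodes (Huffman's algorithm)."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    code = {s: "" for s in freqs}
    counter = len(heap)  # tie-breaker so tuples never compare lists
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1:
            code[s] = "0" + code[s]
        for s in syms2:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return code
```

For probabilities {a: 1/2, b: 1/4, c: 1/8, d: 1/8} the resulting lengths are 1, 2, 3, 3, so the average code length equals the 1.75-bit entropy of this dyadic distribution.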
While the previous parts apply to signal coding in general, Part V is devoted
to audio. Chapter 10 covers perceptual models, which determine which parts
of the source signal are inaudible (perceptually irrelevant) and thus can be removed. It
starts with the absolute threshold of hearing, the absolute sensitivity level
of the human ear. It then illustrates that the human ear processes audio signals in
the frequency domain using nonlinear and analog subband filters, and presents the Bark
scale and critical bands as tools to describe the nonuniform bandwidths of these subband
filters. Next, it covers masking effects, which describe the phenomenon that
a weak sound becomes less audible due to the presence of a strong sound nearby.
Both simultaneous and temporal masking are covered, but emphasis is given to the
former because it is more thoroughly studied and more extensively used in audio coding.
The rest of the chapter addresses a few practical issues, such as perceptual bit allocation,
converting the masked threshold to the subband domain, perceptual entropy, and
an example perceptual model.
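As a concrete taste of such models, the widely cited Terhardt approximation of the absolute threshold of hearing can be evaluated in one function (this particular closed form is a common textbook approximation, not necessarily the model used in this book).

```python
import math

def ath_db_spl(f_hz):
    """Approximate absolute threshold of hearing (dB SPL) versus
    frequency, after Terhardt's closed-form approximation."""
    f = f_hz / 1000.0  # kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

The curve reproduces the familiar shape: high thresholds at low and very high frequencies, and maximum sensitivity (a dip below 0 dB SPL) around 3-4 kHz. Any quantization noise kept below this curve is, by definition, inaudible in quiet.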
Chapter 11 addresses the resolution challenge posed by transients. It first illustrates
that audio signals are mostly quasistationary, hence need fine frequency resolution
to maximize energy compaction, but are frequently interrupted by transients,
which require fine time resolution to avoid pre-echo artifacts. The challenge,
therefore, arises because a filter bank cannot have fine frequency and time resolution
simultaneously, according to the Fourier uncertainty principle. One
approach to this challenge is to adapt frequency resolution in time to the
presence and absence of transients; switched-window MDCT is presented as
an embodiment: the window length of the MDCT is switched in such a way that short
windows are applied to transients and long ones to quasistationary episodes. Two
such examples are given, which can switch between two and three window lengths,
respectively. For the two-window-length example, two more techniques, temporal
noise shaping and transient localization, are given, which can further improve the
temporal resolution of the short windows. Practical methods for transient detection
are finally presented.
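One simple family of transient detectors compares short-term energy between consecutive frames; a jump beyond some threshold flags a transient and triggers window switching. The frame length and ratio below are illustrative parameters, not values prescribed by this book.

```python
def detect_transients(x, frame=64, ratio=8.0):
    """Flag frames whose short-term energy exceeds that of the
    previous frame by more than `ratio` -- a simple transient detector."""
    flags = []
    prev = None
    for start in range(0, len(x) - frame + 1, frame):
        e = sum(v * v for v in x[start:start + frame]) + 1e-12
        flags.append(prev is not None and e / prev > ratio)
        prev = e
    return flags

# a quiet passage followed by a sudden burst
signal = [0.01] * 128 + [0.9] * 64
```

On this signal the detector stays quiet for the first two frames and fires on the frame where the burst begins, which is exactly where an encoder would switch to short windows.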
Chapter 12 deals with joint channel coding. Only two widely used methods are
covered: joint intensity coding and sum/difference (M/S stereo) coding.
Methods to deal with low-frequency effect (LFE) channels are also included.
Chapter 13 covers a few practical issues frequently encountered in the development of audio coding algorithms, such as how to organize various data, how to
assign entropy codebooks, how to optimally allocate bit resources, how to organize
bits representing various compressed data and control commands into a bit stream
suitable for transmission over various channels, and how to make the algorithm
amenable for implementation on low-cost microprocessors.
Chapter 14 is devoted to performance assessment, which, for a given bit rate,
becomes an issue of how to evaluate coding impairments. It first points out that
objective methods are highly desired, but are generally inadequate, so subjective
listening tests are necessary. The double-blind principle of subjective listening tests
is then presented, along with two methods that implement it, the ABX test and ITU-R
BS.1116.
Finally, Chap. 15 presents the Dynamic Resolution Adaptation (DRA) audio coding
standard as an example that illustrates how to integrate the technologies described
in this book into a practical audio coding algorithm. The DRA algorithm has been
approved by the Blu-ray Disc Association as part of its BD-ROM 2.3 specification
and by the Chinese government as a national standard.
Yuli You
Adjunct Associate Professor
Department of Electrical and Computer Engineering
yuliyou@hotmail.com
Contents
Part I Prelude
1 Introduction ..... 3
   1.1 Audio Coding ..... 4
   1.2 Basic Idea ..... 6
   1.3 Perceptual Irrelevance ..... 8
   1.4 Statistical Redundancy ..... 9
   1.5 Data Modeling ..... 9
   1.6 Resolution Challenge ..... 11
   1.7 Perceptual Models ..... 13
   1.8 Global Bit Allocation ..... 13
   1.9 Joint Channel Coding ..... 14
   1.10 Basic Architecture ..... 14
   1.11 Performance Assessment ..... 16
Part II Quantization
2 Scalar Quantization ..... 19
   2.1 Scalar Quantization ..... 21
   2.2 Re-Quantization ..... 23
   2.3 Uniform Quantization ..... 24
       2.3.1 Formulation ..... 24
       2.3.2 Midtread and Midrise Quantizers ..... 25
       2.3.3 Uniformly Distributed Signals ..... 27
       2.3.4 Nonuniformly Distributed Signals ..... 28
   2.4 Nonuniform Quantization ..... 33
       2.4.1 Optimal Quantization and Lloyd-Max Algorithm ..... 35
       2.4.2 Companding ..... 39
3 Vector Quantization ..... 43
   3.1 The VQ Advantage ..... 43
   3.2 Formulation ..... 46
   3.3 Optimality Conditions ..... 48
   3.4 LBG Algorithm ..... 48
   3.5 Implementation ..... 49
4 Linear Prediction ..... 53
   4.1 Linear Prediction Coding ..... 53
   4.2 Open-Loop DPCM ..... 55
       4.2.1 Encoder and Decoder ..... 55
       4.2.2 Quantization Noise Accumulation ..... 57
   4.3 DPCM ..... 59
       4.3.1 Quantization Error ..... 59
       4.3.2 Coding Gain ..... 60
   4.4 Optimal Prediction ..... 61
       4.4.1 Optimal Predictor ..... 61
       4.4.2 Levinson-Durbin Algorithm ..... 63
       4.4.3 Whitening Filter ..... 65
       4.4.4 Spectrum Estimator ..... 68
   4.5 Noise Shaping ..... 69
       4.5.1 DPCM ..... 69
       4.5.2 Open-Loop DPCM ..... 71
       4.5.3 Noise-Feedback Coding ..... 71
5 Transform Coding ..... 73
   5.1 Transform Coder ..... 73
   5.2 Optimal Bit Allocation and Coding Gain ..... 76
       5.2.1 Quantization Noise ..... 76
       5.2.2 AM-GM Inequality ..... 77
       5.2.3 Optimal Conditions ..... 78
       5.2.4 Coding Gain ..... 79
       5.2.5 Optimal Bit Allocation ..... 80
       5.2.6 Practical Bit Allocation ..... 81
       5.2.7 Energy Compaction ..... 82
   5.3 Optimal Transform ..... 82
       5.3.1 Karhunen-Loève Transform ..... 83
       5.3.2 Maximal Coding Gain ..... 84
       5.3.3 Spectrum Flatness ..... 85
   5.4 Suboptimal Transforms ..... 85
       5.4.1 Discrete Fourier Transform ..... 86
       5.4.2 DCT ..... 88
6 Subband Coding ..... 91
   6.1 Subband Filtering ..... 91
       6.1.1 Transform Viewed as Filter Bank ..... 92
       6.1.2 DFT Filter Bank ..... 93
       6.1.3 General Filter Banks ..... 94
   6.2 Subband Coder ..... 96
   6.3 Reconstruction Error ..... 97
       6.3.1 Decimation Effects ..... 98
       6.3.2 Expansion Effects ..... 100
       6.3.3 Reconstruction Error ..... 102
   6.4 Polyphase Implementation ..... 103
       6.4.1 Polyphase Representation ..... 103
       6.4.2 Noble Identities ..... 107
       6.4.3 Efficient Subband Coder ..... 109
       6.4.4 Transform Coder ..... 109
   6.5 Optimal Bit Allocation and Coding Gain ..... 110
       6.5.1 Ideal Subband Coder ..... 110
       6.5.2 Optimal Bit Allocation and Coding Gain ..... 111
       6.5.3 Asymptotic Coding Gain ..... 112
Part IV Entropy Coding

Part V Audio Coding
Part I
Prelude
Chapter 1
Introduction
Sounds are physical waves that propagate in the air or other media. Such waves,
which may be expressed as changes in air pressure, may be transformed by an analog
audio system using a transducer, such as a microphone, into continuous electrical
waves in the form of current and/or voltage changes. This transformation of sounds
into an electrical representation, which we call an audio signal, facilitates the storage, transmission, duplication, amplification, and other processing of sounds. To
reproduce the sounds, the electrical representation, or audio signal, is converted
back into physical waves via loudspeakers.
Since electronic circuits and storage/transmission media are inherently noisy and
nonlinear, audio signals are susceptible to noise and distortion, resulting in loss of
sound quality. Consequently, modern audio systems are mostly digital in that the
audio signals obtained above are sampled into discrete-time signals and then digitized
into numerical representations, which we call digital audio signals. Once in the
digital domain, many technologies can be deployed to ensure that no inadvertent
loss of audio quality occurs.
Pulse-code modulation (PCM) is usually the standard representation format for
digital audio signals. To obtain a PCM representation, the waveform of an analog
audio signal is sampled regularly at uniform intervals (sampling period) to generate
a sequence of samples (a discrete-time signal), which are then quantized to generate
a sequence of symbols, each represented as a numerical (usually binary) code.
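The sample-then-quantize pipeline can be sketched in a few lines; the 1 kHz test tone, 48 kHz sample rate, and 16-bit word length are illustrative choices matching the examples later in this chapter.

```python
import math

def pcm_encode(duration_s=0.001, fs=48000, bits=16, freq=1000.0):
    """Sample a sinusoid at rate fs and quantize each sample to a
    signed `bits`-bit integer code (linear PCM)."""
    full_scale = (1 << (bits - 1)) - 1          # 32767 for 16 bits
    n = int(duration_s * fs)
    return [round(full_scale * math.sin(2 * math.pi * freq * i / fs))
            for i in range(n)]

samples = pcm_encode()  # 48 integer codes: one millisecond of audio
```

One millisecond at 48 kHz yields 48 samples, each a 16-bit signed integer; the peak codes reach the ±32767 full-scale range of the format.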
The Nyquist-Shannon sampling theorem states that a sampled analog signal can be
perfectly reconstructed from its samples if the sample rate exceeds
twice the highest frequency in the original analog signal [68]. To ensure this
condition is satisfied, the input analog signal is usually filtered with a low-pass filter
whose stopband corner frequency is less than half of the sample rate. Since the
human ear's perceptual range for pure tones is widely believed to be between 20 Hz
and 20 kHz (see Sect. 10.2) [102], such low-pass filters may be designed so
that the cutoff frequency starts at 20 kHz and a few kilohertz are allowed as
the transition band before the stopband. For example, the sample rate is 44.1 kHz
for compact discs (CD) and 48 kHz for sound tracks in DVD-Video. Some people,
however, believe that the human ear can perceive frequencies much higher than
20 kHz, especially when transients are present, so sampling rates as high as 192 kHz
are used in some audio systems, such as DVD-Audio. Note that there is still power
in the stopband for any practical low-pass filter, so perfect reconstruction is only
approximately satisfied.
The subsequent quantization process also introduces noise. The more bits used
to represent each audio sample, the lower the quantization noise
becomes (see Sect. 2.3). The compact disc (CD), for example, uses 16 bits to represent
each sample. Due to the limited resolution and dynamic range of the human ear,
16 bits per sample are argued by some to be sufficient to deliver the full dynamics of
almost all music, but higher-resolution audio formats are called for in applications
such as soundtracks in feature films, where there is often a very wide dynamic range
between whispered conversations and explosions. A higher-resolution format also
leaves more headroom for audio processing, which may inadvertently or
intentionally introduce noise. Twenty-four bits per sample, used by DVD, are widely
believed to be sufficient for most applications, but 32 bits are not uncommon.
Digital audio signals rarely come as a single channel, or monaural sound. The
CD delivers two channels (stereo) and the DVD up to 7.1 channels (surround
sound), which consist of seven normal channels (front left, front right, center,
surround left, surround right, back left, and back right, for example) and one low-frequency
effect (LFE) channel. The terminology of 0.1 indicates that
an LFE channel has a very low bandwidth, usually no more than 120 Hz. The number
of channels is clearly on an increasing trend. For example, Japan's NHK
demonstrated 22.2-channel surround sound in 2005 and China's DRA audio coding
standard (see Chap. 15) allows for 64.3-channel surround sound.
Storage channels usually have the largest channel capacity. DVD-Video, for
example, is designed to hold at least two hours of film with standard-definition
video and 5.1 surround sound. It was given a capacity of 4.7 GB (gigabytes), which
was the state of the art when DVD was standardized. If two hours of 5.1 surround
sound were delivered in the standard PCM format (24 bits per sample and 48 kHz
sample rate), it would need about 6.22 GB of storage space. This is more than the
capacity of the whole DVD disc, leaving no capacity for the video, whose
demand for bit rate is usually more than ten times that of audio signals. There is,
apparently, a problem of insufficient channel capacity for the delivery of audio signals.
This problem is much more acute with wireless channels. For example, over-the-air
audio and/or television broadcasting usually allocates no more than 64 kbps of
channel capacity to deliver stereo audio. If delivered as PCM at 24 bits per sample and
a 48 kHz sample rate, a stereo audio signal needs a bit rate of 2,304 kbps, which
is 36 times the allocated channel capacity.
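The arithmetic behind these two examples is worth making explicit:

```python
bits_per_sample = 24
fs = 48000  # samples per second

# stereo PCM bit rate versus a 64 kbps broadcast channel
stereo_bps = bits_per_sample * fs * 2
# 2,304,000 bps = 2,304 kbps, i.e. 36 times the 64 kbps channel
excess = stereo_bps // 64000

# two hours of 5.1 surround sound (6 discrete channels) in bytes
surround_bytes = (bits_per_sample // 8) * fs * 6 * 2 * 3600
# 6,220,800,000 bytes, about 6.22 GB -- more than a 4.7 GB DVD
```

Both numbers in the text (2,304 kbps and 6.22 GB) follow directly from bits per sample × sample rate × channel count × duration.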
This problem of insufficient channel capacity for delivering audio signals may
be addressed by either allocating more channel capacity or reducing the demand
for it. Allocating more channel capacity is usually very expensive and even physically
impossible in situations such as wireless communication or broadcasting. It is often
more plausible and effective to pursue the demand reduction route: reducing the
bit rate necessary for delivering audio signals. This is the task of digital audio
(compression) coding.
Audio coding achieves this goal of bit rate reduction through an encoder and
a decoder, as shown in Fig. 1.1. The encoder obtains a compact representation of
the input audio signal, often referred to as the source signal, that demands fewer
bits. The bits for this compact representation are delivered through a communication/broadcasting
or storage channel to the decoder, which then reconstructs the
original audio signal from the received compact representation.
Note that the term channel used here is an abstraction or aggregation of channel
coder, modulator, physical channel, channel demodulator, and channel decoder in
communication literature. Since the channel is well known for introducing bit errors,
the compact representation received by the decoder may be different from that at the
output of the encoder. From the viewpoint of audio coding, however, the channel
Fig. 1.1 Audio coding involves an encoder to transform a source audio signal into a compact representation for transmission through a channel and a decoder to decode the compact representation
received from the channel to reconstruct the source audio signal
The quantity defined in (1.3) is the component in the source signal that is statistically
redundant for the purpose of transmitting the source signal to the decoder and is thus
called statistical redundancy.
The goal of lossless audio coding is to remove statistical redundancy from the
source signal as much as possible so that the signal is delivered to the decoder with a bit rate
B as close as possible to the entropy. This is illustrated in Fig. 1.2. Note that, while
entropy coding is a general term for coding techniques that remove statistical
redundancy, a lossless audio coding algorithm usually involves sophisticated data
modeling (to be discussed in Sect. 1.3), so the use of entropy coding in Fig. 1.2 is an
oversimplification if the context involves lossless coding algorithms and should be
understood to include data modeling as part of it.
The compression ratio achievable by lossless audio coding is usually very limited;
an overall compression ratio of 2:1 may be considered high. This level of compression
cannot satisfy many practical needs. As stated before, over-the-air
audio and/or television broadcasting, for example, may require a compression ratio
of 36:1. To achieve this level of compression, some information in the source signal
has to be irreversibly discarded by the encoder.
Fig. 1.2 A lossless audio coder removes, through entropy coding, statistical redundancy from the
source audio signal to arrive at a compact representation
Fig. 1.3 A lossy audio coder removes both perceptual irrelevancy and statistical redundancy from
the source audio signal to achieve a much higher compression ratio
Note that, while quantization is a general term for coding techniques that
remove perceptual irrelevance, a lossy audio coding algorithm usually involves
sophisticated data modeling, so the use of quantization in Fig. 1.3 is an oversimplification
if the context involves lossy audio coding algorithms and should be understood
to include data modeling as part of it.
1/2 + 1/4 × 2 + 1/8 × 3 + 1/8 × 4 = 1.875 bits,
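The average code length computed above can be checked directly; the probabilities and codeword lengths are those of the example:

```python
import math

probs = [1 / 2, 1 / 4, 1 / 8, 1 / 8]  # symbol probabilities
lengths = [1, 2, 3, 4]                # codeword lengths of the example code

# average code length of this code
avg_len = sum(p * l for p, l in zip(probs, lengths))   # 1.875 bits

# the entropy of the distribution, the lower bound on any code
ent = -sum(p * math.log2(p) for p in probs)            # 1.75 bits
```

The example code averages 1.875 bits per symbol, slightly above the 1.75-bit entropy of the distribution, illustrating that a particular prefix-free code need not reach the bound exactly.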
Fig. 1.4 A 1,000 Hz sinusoidal signal represented as 16-bit PCM with a sample rate of 48 kHz
(top), its linear prediction error signal (middle), and its DFT spectrum (bottom)
remove correlation is through linear prediction (see Chap. 4). Let x(n) denote the nth sample of the sinusoidal signal and x̂(n) its predicted value. An extremely simple prediction scheme is to use the immediately preceding sample value as the prediction for the current sample value:

x̂(n) = x(n − 1).

This prediction is, of course, not perfect, so there is a prediction error, or residue,

e(n) = x(n) − x̂(n),

which is shown in the middle of Fig. 1.4. If we elect to send this residue signal, instead of the original signal, to the decoder, we will end up with a much smaller number of bits due to its significantly reduced dynamic range. In fact, its dynamic range is [−4278, 4278], which can be represented using 14-bit PCM, resulting in a compression ratio of 16:14.
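The gain from this first-order predictor can be checked numerically. The sketch below, in Python, regenerates a 1,000 Hz sinusoid as in Fig. 1.4 and measures the peak magnitude of the residue; the full-scale amplitude of 32,767 and zero initial phase are assumptions, so the exact peak may differ slightly from the book's figure.

```python
import math

# 1 kHz sinusoid as 16-bit PCM at 48 kHz, as in Fig. 1.4
# (full-scale amplitude 32767 and zero initial phase are assumed here)
fs, f, amp = 48_000, 1_000, 32_767
x = [round(amp * math.sin(2 * math.pi * f * n / fs)) for n in range(480)]

# first-order prediction x^(n) = x(n - 1), residue e(n) = x(n) - x^(n)
e = [x[n] - x[n - 1] for n in range(1, len(x))]

peak = max(abs(v) for v in e)
print(peak)   # ≈ 4278: the residue fits into 14-bit PCM instead of 16
```

The peak residue magnitude is roughly amp · 2 sin(πf/fs), which is about 13% of the original 16-bit range, hence the two-bit saving.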
Fig. 1.5 Audio signals are quasistationary (such as the one shown at the top) most of the time, but are frequently interrupted by dramatic transients (such as the one shown at the bottom)
or filter bank is largely proportional to the block size, this second assumption calls
for the deployment of transforms or filter banks with large block sizes. To achieve
a high degree of energy compaction, the block size should be as large as possible,
limited only by the physical memory of the encoder/decoder as well as the delay
associated with buffering a long block of samples.
Unfortunately, the first assumption above is not correct all the time: quasistationary episodes of audio signals are frequently interrupted by dramatic transients which may rise from absolute quiet to extreme loudness within a few samples. Examples of such transients include sudden gun shots and explosions. A less dramatic transient produced by a musical instrument is shown at the bottom of Fig. 1.5.
For such a transient, it is well known that a long transform or filter bank would produce a flat spectrum that corresponds to small, if any, energy compaction, resulting
in poor compression performance. To mitigate this problem, a short transform or
filter bank should be used that has fine time resolution to localize the transient in the
time domain.
To be able to code all audio signals with high coding performance all the time,
a transform/filter bank that would have good resolution in both time and frequency
domains is desired. Unfortunately, the Fourier uncertainty principle [75], which is
related to the Heisenberg uncertainty principle [90], states that this is impossible:
a transform/filter bank can have a good resolution either in the time or frequency
domain, but not both (see Sect. 11.1). This poses one of the most difficult challenges
in audio coding.
This challenge is usually addressed by adapting the resolution of transforms/filter banks over time to the changing resolution demands of the input signal: applying long block sizes to quasistationary episodes and short ones to transients. There are a variety of ways to implement this scheme; the most dominant among them seems to be the switched block-size MDCT (see Sect. 11.2).
In order for the resolution adaptation to occur on the fly, a transient detection
mechanism which detects the occurrence of transients and identifies their locations
(see Sect. 11.7) is needed.
The basic bit allocation strategy is to allocate bits (by decreasing the quantization step size) iteratively to the group of transform coefficients or subband samples whose quantization noise is most audible, until either the bit pool is exhausted or the quantization noise for all transform coefficients/subband samples is below the masking thresholds.
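This greedy strategy can be sketched as follows. The band count, noise-to-mask ratios, and per-band bit cap below are illustrative values, not from any particular coding standard; the 6 dB-per-bit assumption is the usual rule of thumb for uniform quantization.

```python
def allocate_bits(noise_to_mask, bit_pool, max_bits_per_band=16):
    """Greedy allocation sketch: repeatedly grant one bit to the band whose
    quantization noise is most audible (largest noise-to-mask ratio, in dB),
    assuming each extra bit lowers that band's noise by about 6 dB."""
    bits = [0] * len(noise_to_mask)
    nmr = list(noise_to_mask)
    while bit_pool > 0 and max(nmr) > 0:       # stop: pool empty or all masked
        worst = nmr.index(max(nmr))
        if bits[worst] >= max_bits_per_band:
            nmr[worst] = float("-inf")         # band saturated, skip it
            continue
        bits[worst] += 1
        nmr[worst] -= 6.02
        bit_pool -= 1
    return bits

print(allocate_bits([18.0, 7.0, -3.0, 12.0], bit_pool=8))  # → [3, 2, 0, 2]
```

Note that the loop may stop before the pool is exhausted: once every band's noise is below its masking threshold, further bits buy no audible improvement.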
Fig. 1.6 The basic audio encoder architecture. The solid lines represent movement of audio data
and the dashed line indicates control information
Fig. 1.7 The basic audio decoder architecture. The solid lines represent movement of audio data
and the dashed line indicates control information
In addition to the modules shown in Figs. 1.6 and 1.7, the development of
an audio coding algorithm also involves many practical and important issues,
including:
Data Structure. How transform coefficients or subband samples from all audio
channels are organized and accessed by the audio coding algorithm.
Bit Stream Format. How entropy codes and bits representing other control information are packed into a coherent bit stream.
Part II
Quantization
Chapter 2
Scalar Quantization
Table 2.1 An example of mapping analog sample values to integer values that would take place in a process called quantization

Analog sound pressure level    Integer sound pressure level
3.4164589759                   3
3.124341                       3
2.14235                        2
1.409086743                    1
0.61341984378562890423         1
0.37892458                     0
0.61308                        1
1.831401348156                 2
2.8903219654710                2
3.208913064                    3
{b_q; q = 0, 1, ..., M}  (2.1)

and a set of quantized values

{x̂_q; q = 1, 2, ..., M},  (2.2)

which are also called output values or representative values. A source sample value x is quantized to the quantization index q if and only if x falls into the qth decision interval

I_q = [b_{q−1}, b_q),  (2.3)

so the operation of forward quantization is

q = Q(x), if and only if b_{q−1} ≤ x < b_q.  (2.4)

The quantized value can be reconstructed from the quantization index by the following inverse quantization

x̂_q = Q⁻¹(q),  (2.5)

which is also referred to as backward quantization. Since q is a function of x as shown in (2.4), x̂_q is also a function of x and can be written as

x̂(x) = x̂_q = Q⁻¹(Q(x)).  (2.6)

This quantization scheme is called scalar quantization (SQ) because the source signal is quantized one sample at a time.
The function in (2.6) is another approach to describing the input-output map of a quantizer, in addition to the quantization table. Figure 2.2 plots such a function, describing the quantization map of Table 2.2.
The quantization operation in (2.4) obviously causes loss of information: the reconstructed quantized value obtained in (2.5) or (2.6) is different from the input to the quantizer. The difference between them is called the quantization error

q(x) = x̂(x) − x,  (2.7)

so the reconstructed value may be viewed as the input corrupted by additive noise (Fig. 2.3):

x̂(x) = x + q(x).  (2.8)
Fig. 2.2 Input-output map for the quantizer shown in Table 2.2

Fig. 2.3 Additive noise model for quantization
The mean squared quantization error (MSQE) is then

σ_q² = ∫_{−∞}^{∞} q²(x) p(x) dx
     = ∫_{−∞}^{∞} (x̂(x) − x)² p(x) dx
     = Σ_{q=1}^{M} ∫_{b_{q−1}}^{b_q} (x̂(x) − x)² p(x) dx.  (2.9)
Since x̂(x) = x̂_q is a constant within the decision interval [b_{q−1}, b_q), we have

σ_q² = Σ_{q=1}^{M} ∫_{b_{q−1}}^{b_q} (x̂_q − x)² p(x) dx.  (2.10)
The MSQE may be better appreciated when compared with the power of the source signal. This may be achieved using the signal-to-noise ratio (SNR) defined below:

SNR (dB) = 10 log₁₀ (σ_x² / σ_q²),  (2.11)

where σ_x² is the variance of the source signal.
It is obvious that the smaller the decision intervals, the smaller the error term (x̂_q − x)² in (2.10), thus the smaller the mean squared quantization error σ_q². This indicates that σ_q² is inversely related to the number of decision intervals M. The placement of each individual decision boundary and quantized value also plays a major role in the final σ_q². The problem of quantizer design may be posed in a variety of ways, including:
Given a fixed M,

M = Constant,  (2.12)

find the optimal placement of decision boundaries and quantized values so that σ_q² is minimized. This is the most widely used approach.
Given a distortion constraint,

σ_q² < Threshold,  (2.13)

find the optimal placement of decision boundaries and quantized values so that M is minimized. A minimal M means a minimal number of bits needed to represent the quantized values, hence a minimal bit rate.
2.2 Re-Quantization

The quantization process was presented above with the assumption that the source random variable or sample values are continuous, or analog. The name quantization usually gives the impression that it applies only to analog sample values. When dealing with such analog source sample values, the associated forward quantization is referred to as ADC (analog-to-digital conversion) and the inverse quantization as DAC (digital-to-analog conversion).
Table 2.3 A quantization table for re-quantizing a discrete source

Decision interval    Quantization index    Re-quantized value
[0, 10)              0                     5
[10, 20)             1                     15
[20, 30)             2                     25
[30, 40)             3                     35
[40, 50)             4                     45
[50, 60)             5                     55
[60, 70)             6                     65
[70, 80)             7                     75
[80, 90)             8                     85
[90, 100)            9                     95
Discrete source sample values can also be further quantized. For example, consider a source that takes integer sample values between 0 and 100. If it is decided, for some reason, that this resolution is too fine or irrelevant for a particular application, and sample values spaced at an interval of 10 are really what are needed, the quantization table shown in Table 2.3 can be established to re-quantize the integer sample values.
With discrete source sample values, the formulation of the quantization process in Sect. 2.1 is still valid, with the probability density function replaced by a probability distribution and integration replaced by summation.
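The mapping of Table 2.3 is simple enough to sketch directly; the function below is an illustrative rendering, not from the book.

```python
def requantize(value):
    """Re-quantize an integer in [0, 100) per Table 2.3: decision interval
    [10q, 10(q+1)) maps to index q and re-quantized value 10q + 5."""
    q = value // 10
    return q, 10 * q + 5

print(requantize(37))   # → (3, 35)
print(requantize(90))   # → (9, 95)
```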
2.3.1 Formulation

Let us consider a uniform quantizer that covers an interval [Xmin, Xmax] of a random variable X with M decision intervals. Its quantization step size is

Δ = (Xmax − Xmin)/M,  (2.14)

so the decision boundaries are

b_q = Xmin + qΔ, q = 0, 1, ..., M.  (2.15)

The mean of a decision interval is often selected as the quantized value for that interval:

x̂_q = Xmin + (q − 0.5)Δ, q = 1, 2, ..., M.  (2.16)
The MSQE (2.10) then becomes

σ_q² = Σ_{q=1}^{M} ∫_{Xmin+(q−1)Δ}^{Xmin+qΔ} (x̂_q − x)² p(x) dx.  (2.17)

Let

y = Xmin + (q − 0.5)Δ − x;

then (2.17) becomes

σ_q² = Σ_{q=1}^{M} ∫_{−0.5Δ}^{0.5Δ} y² p(Xmin + (q − 0.5)Δ − y) dy.  (2.18)

Since Xmin + (q − 0.5)Δ = b_q − 0.5Δ, this may be written as

σ_q² = Σ_{q=1}^{M} ∫_{−0.5Δ}^{0.5Δ} x² p(b_q − 0.5Δ − x) dx,  (2.19)

or, in terms of the quantized values,

σ_q² = Σ_{q=1}^{M} ∫_{−0.5Δ}^{0.5Δ} x² p(x̂_q − x) dx.  (2.20)
important for audio signals because the zero value is needed to represent absolute quiet. Due to the midtreading of zero, the number of decision intervals M is odd if a symmetric sample value range (Xmin = −Xmax) is to be covered.
Since both the decision boundaries and the quantized values can be derived from a single step size, the implementation of the midtread uniform quantizer is simple and straightforward. The forward quantizer may be implemented as

q = round(x/Δ),  (2.21)

where round(·) is the rounding function, which returns the integer that is closest to the input. The corresponding inverse quantizer may be implemented as

x̂_q = qΔ.  (2.22)
The other uniform quantizer does not have zero as one of its quantized values, and is hence called midrise. This is shown in Fig. 2.4. Its number of decision intervals is even if a symmetric sample value range is to be covered. The forward quantizer may be implemented as

q = truncate(x/Δ) + 1, if x > 0;
q = truncate(x/Δ) − 1, otherwise;  (2.23)

where truncate(·) is the truncation function, which returns the integer part of the input, without the fractional digits. Note that q = 0 is forbidden for a midrise quantizer. The corresponding inverse quantizer is expressed below:

x̂_q = (q − 0.5)Δ, if q > 0;
x̂_q = (q + 0.5)Δ, otherwise.  (2.24)
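Equations (2.21)–(2.24) translate directly into code. A sketch in Python (note that Python's round() breaks ties to even, which differs from plain nearest-integer rounding only exactly at the half-way points):

```python
import math

def midtread(x, step):
    """Midtread quantizer, (2.21)-(2.22): zero is one of the quantized values."""
    q = round(x / step)            # forward quantization (2.21)
    return q, q * step             # inverse quantization (2.22)

def midrise(x, step):
    """Midrise quantizer, (2.23)-(2.24): q = 0 is forbidden."""
    q = math.trunc(x / step) + 1 if x > 0 else math.trunc(x / step) - 1
    xq = (q - 0.5) * step if q > 0 else (q + 0.5) * step
    return q, xq

print(midtread(0.26, 1.0))    # → (0, 0.0)
print(midrise(0.26, 1.0))     # → (1, 0.5)
print(midrise(-0.26, 1.0))    # → (-1, -0.5)
```

The example illustrates the difference between the two: near zero, the midtread quantizer outputs exactly zero while the midrise quantizer must output ±0.5Δ.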
For a uniformly distributed source,

p(x) = 1/(Xmax − Xmin), x ∈ [Xmin, Xmax],  (2.25)

(2.20) becomes

σ_q² = 1/(Xmax − Xmin) Σ_{q=1}^{M} ∫_{−0.5Δ}^{0.5Δ} y² dy
     = 1/(Xmax − Xmin) Σ_{q=1}^{M} Δ³/12
     = M Δ³ / [12 (Xmax − Xmin)].

Due to the step size given in (2.14), the above equation becomes

σ_q² = Δ²/12.  (2.26)

The variance of this uniformly distributed source (assuming zero mean) is

σ_x² = 1/(Xmax − Xmin) ∫_{Xmin}^{Xmax} x² dx = (Xmax − Xmin)²/12.  (2.27)
The SNR of the uniform quantizer is then

SNR (dB) = 10 log₁₀ (σ_x²/σ_q²) = 10 log₁₀ [(Xmax − Xmin)²/Δ²] = 20 log₁₀ [(Xmax − Xmin)/Δ].  (2.28)

Due to the step size given in (2.14), the above SNR expression becomes

SNR (dB) = 20 log₁₀(M) = [20/log₂(10)] log₂(M) ≈ 6.02 log₂(M).  (2.29)
If the quantization indexes are represented using fixed-length codes, each codeword can be represented using

R = ceil(log₂(M)) bits,  (2.30)

which is referred to as bits per sample or bit rate. Consequently, (2.29) becomes

SNR (dB) = [20/log₂(10)] R ≈ 6.02R dB,  (2.31)

which indicates that, for each additional bit allocated to the quantizer, the SNR is increased by about 6.02 dB.
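The 6.02 dB-per-bit rule of (2.31) can be verified empirically. The sketch below sweeps a uniformly distributed source over [−1, 1) through a mid-interval uniform quantizer and measures the SNR; the dense deterministic sweep stands in for an actual random source.

```python
import math

def uniform_sq_snr(rate_bits, num_samples=100_000):
    """Measure the SNR (dB) of a uniform quantizer with M = 2^R intervals
    on a uniformly distributed source over [-1, 1)."""
    m = 2 ** rate_bits
    step = 2.0 / m                                  # (2.14) with Xmin = -1, Xmax = 1
    signal_power = noise_power = 0.0
    for i in range(num_samples):
        x = -1.0 + (i + 0.5) * 2.0 / num_samples   # dense sweep stands in for the source
        q = min(int((x + 1.0) / step), m - 1)      # forward quantization index
        xq = -1.0 + (q + 0.5) * step               # mid-interval quantized value
        signal_power += x * x
        noise_power += (x - xq) ** 2
    return 10.0 * math.log10(signal_power / noise_power)

for r in (4, 6, 8):
    print(r, round(uniform_sq_snr(r), 2))          # each ≈ 6.02 · r dB, per (2.31)
```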
Fig. 2.5 Source PDF with the granular region [−Xmax, Xmax] and the overload regions beyond it

A source sample value falling outside [Xmin, Xmax] is quantized to the nearest extreme quantized value:

x̂(x) = x̂_1, if x < Xmin; x̂(x) = x̂_M, if x > Xmax,  (2.32)

so the corresponding overload error, or overload noise, is

σ²_q(overload) = ∫_{−∞}^{Xmin} (x̂_1 − x)² p(x) dx + ∫_{Xmax}^{∞} (x̂_M − x)² p(x) dx.  (2.33)
The MSQE given in (2.17) only accounts for quantization error within [Xmin, Xmax], which is referred to as granular error or granular noise. The total quantization error is

σ²_q(total) = σ_q² + σ²_q(overload).  (2.34)
For a given PDF p(x) and number of decision intervals M, (2.17) indicates that the smaller the quantization step size is, the smaller the granular quantization noise σ_q² becomes. According to (2.14), however, a smaller quantization step size also translates into a smaller |Xmin| and Xmax for a fixed M. A smaller |Xmin| and Xmax obviously leads to larger overload areas, hence a larger overload quantization error σ²_q(overload). Therefore, the choice of Δ, or equivalently the range [Xmin, Xmax] of the uniform quantizer, represents a trade-off between granular and overload quantization errors.
This trade-off is, of course, relative to the effective width of the given PDF, which may be characterized by its standard deviation σ. The ratio of the quantization range [Xmin, Xmax] to the signal standard deviation,

Fl = (Xmax − Xmin)/σ,  (2.35)

called the loading factor, is apparently a good description of this trade-off. For a Gaussian distribution, a loading factor of 4 means that the probability of input samples going beyond the range is 0.045. For a loading factor of 6, the probability reduces to 0.0027. For most applications, 4σ loading is sufficient.
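The quoted Gaussian overload probabilities follow from the complementary error function: a loading factor Fl covers [−(Fl/2)σ, (Fl/2)σ], and P(|X| > kσ) = erfc(k/√2). A quick check:

```python
import math

def overload_probability(loading_factor):
    """P(|X| > (Fl/2)·σ) for a zero-mean Gaussian source, where the quantizer
    covers [-(Fl/2)σ, (Fl/2)σ], i.e. loading factor Fl = (Xmax - Xmin)/σ."""
    k = loading_factor / 2.0               # range boundary in units of σ
    return math.erfc(k / math.sqrt(2.0))

print(round(overload_probability(4), 4))   # → 0.0455
print(round(overload_probability(6), 4))   # → 0.0027
```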
The total quantization error of a uniform quantizer may thus be written as

σ²_q(total) = Σ_{q=1}^{M} ∫_{Xmin+(q−1)Δ}^{Xmin+qΔ} (x̂_q − x)² p(x) dx
            + ∫_{−∞}^{Xmin} (x̂_1 − x)² p(x) dx + ∫_{Xmax}^{∞} (x̂_M − x)² p(x) dx.  (2.36)

For a symmetric sample value range,

Xmin = −Xmax,  (2.37)

the step size (2.14) becomes

Δ = 2Xmax/M.  (2.38)

Replacing all Xmin and Xmax with Δ using the above equations, we have

σ²_q(total) = Σ_{q=1}^{M} ∫_{(q−1−0.5M)Δ}^{(q−0.5M)Δ} (x̂_q − x)² p(x) dx
            + ∫_{−∞}^{−0.5MΔ} (x̂_1 − x)² p(x) dx + ∫_{0.5MΔ}^{∞} (x̂_M − x)² p(x) dx.  (2.39)

Noting that a symmetric PDF satisfies

p(x) = p(−x)  (2.40)

and doing a variable change of y = −x in the last term of (2.39), it turns out that this last term becomes the same as the second term, so (2.39) becomes

σ²_q(total) = Σ_{q=1}^{M} ∫_{(q−1−0.5M)Δ}^{(q−0.5M)Δ} (x̂_q − x)² p(x) dx
            + 2 ∫_{0.5MΔ}^{∞} (x̂_M − x)² p(x) dx.  (2.41)
Now that both (2.39) and (2.41) are functions of Δ only, their minimum can be found by setting their respective first-order derivatives with respect to Δ to zero:

∂σ²_q(total)/∂Δ = 0.  (2.42)

This equation can be solved using a variety of numerical methods; see [76], for example.
Figure 2.6 shows the optimal SNR achieved by a uniform quantizer at various bits per sample (see (2.30)) for Gaussian, Laplacian, and Gamma distributions [33]. The SNR given in (2.31) for the uniform distribution, which is the best SNR that a uniform quantizer can achieve, is plotted as the benchmark. It is a straight line of the form

SNR(R) = a + bR (dB),  (2.43)

with a slope of

b = 20/log₂(10) ≈ 6.02  (2.44)

and an intercept of

a = 0.  (2.45)
Fig. 2.6 Optimal SNR achieved by a uniform quantizer for uniform, Gaussian, Laplacian, and
Gamma distributions
Apparently, the curves for the other PDFs also fit straight lines, with different slopes and intercepts. Notice that both the slope b and the intercept a decrease as the peakedness, or kurtosis, of the PDF increases in the order of uniform, Gaussian, Laplacian, and Gamma, indicating that the overall performance of a uniform quantizer is inversely related to PDF kurtosis. This degradation in performance is mostly reflected in the intercept a; the slope b is only moderately affected. There is, nevertheless, a reduction in slope when compared with the uniform distribution. This reduction indicates that the quantization performance for the other distributions, relative to the uniform distribution, becomes worse at higher bit rates.
Figure 2.7 shows the optimal step size normalized by the signal standard deviation, Δopt/σ_x, for Gaussian, Laplacian, and Gamma distributions as a function of the number of bits per sample [33]. The data for the uniform distribution is used as the benchmark. Due to (2.14), (2.27), and (2.30), the normalized quantization step size for the uniform distribution is

log₁₀(Δ/σ_x) = log₁₀(2√3/M) = log₁₀(2√3) − R log₁₀(2),  (2.46)
so it is a straight line. Apparently, as the peakedness or kurtosis increases in the order of uniform, Gaussian, Laplacian, and Gamma distributions, the step size also increases. This is necessary for an optimal balance between granular and overload quantization errors: an increased kurtosis means that the probability density spreads more toward the tails, resulting in more overload error, so the step size has to be increased to counteract it.
Fig. 2.7 Optimal step size used by a uniform quantizer to achieve optimal SNR for uniform,
Gaussian, Laplacian, and Gamma distributions
The empirical formula (2.43) is very useful for estimating the minimal total MSQE for a particular quantizer, given the signal power and bit rate. In particular, plugging the SNR definition in (2.11) into (2.43), we can represent the total MSQE as

10 log₁₀ σ_q² = 10 log₁₀ σ_x² − a − bR  (2.47)

or

σ_q² = 10^{−0.1(a+bR)} σ_x².  (2.48)
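As a quick illustration, (2.48) is a one-liner in code; the default slope and intercept below are those of the uniform-distribution benchmark (a = 0, b ≈ 6.02) and must be replaced by the fitted values for other PDFs.

```python
def msqe_estimate(signal_power, rate_bits, a=0.0, b=6.02):
    """Estimate the total MSQE from (2.48): σ_q² = 10^(-0.1(a + bR)) · σ_x².
    Defaults a = 0, b ≈ 6.02 are the uniform-distribution benchmark line."""
    return 10.0 ** (-0.1 * (a + b * rate_bits)) * signal_power

print(msqe_estimate(1.0, 8))   # ≈ 1.5e-5
```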
Fig. 2.8 Quantization resources are under-utilized by the uniform quantizer (top) in the tail areas
and over-utilized in the head area of the PDF. These resources are re-distributed in the nonuniform quantizer (bottom) so that individual pieces of resources carry the same amount of MSQE
contribution, leading to smaller MSQE
The above two considerations indicate that the MSQE can be reduced by assigning the size of each decision interval inversely proportional to the probability density. The consequence of this strategy is that the more densely distributed the PDF is, the more densely placed the decision intervals can be, and thus the smaller the MSQE becomes.
One approach to nonuniform quantizer design is to pose it as an optimization problem: finding the decision intervals and quantized values that minimize the MSQE. This leads to the Lloyd-Max algorithm. Another approach is to transform the source signal through a nonlinear function in such a way that the transformed signal has a PDF that is almost uniform, so that a uniform quantizer may be used to deliver improved performance. This leads to companding.
2.4.1 Lloyd-Max Algorithm

Consider first the partial derivative of the MSQE (2.10) with respect to the quantized value x̂_q:

∂σ_q²/∂x̂_q = 2 ∫_{b_{q−1}}^{b_q} (x̂_q − x) p(x) dx
            = 2 x̂_q ∫_{b_{q−1}}^{b_q} p(x) dx − 2 ∫_{b_{q−1}}^{b_q} x p(x) dx.  (2.49)

Setting it to zero, we obtain

x̂_q = ∫_{b_{q−1}}^{b_q} x p(x) dx / ∫_{b_{q−1}}^{b_q} p(x) dx,  (2.50)

which indicates that the quantized value for each decision interval is the centroid of the probability mass in the interval.
Let us now consider another partial derivative, taken with respect to the decision boundary b_q:

∂σ_q²/∂b_q = (x̂_q − b_q)² p(b_q) − (x̂_{q+1} − b_q)² p(b_q).  (2.51)

Setting it to zero, we obtain

b_q = (x̂_q + x̂_{q+1})/2,  (2.52)

which indicates that the decision boundary is simply the midpoint of the neighboring quantized values.
Solving (2.50) and (2.52) would give us the optimal set of decision boundaries {b_q}_0^M and quantized values {x̂_q}_1^M that minimize σ_q². Unfortunately, to solve (2.50) for x̂_q we need b_{q−1} and b_q, but to solve (2.52) for b_q we need x̂_q and x̂_{q+1}. The problem is a little difficult.
For the special case of a uniform distribution, the centroid in (2.50) reduces to the midpoint of the decision interval:

x̂_q = (b_{q−1} + b_q)/2,  (2.53)

and similarly

x̂_{q+1} = (b_q + b_{q+1})/2.  (2.54)

Plugging these into (2.52), we have

b_q = [(b_{q−1} + b_q)/2 + (b_q + b_{q+1})/2] / 2,  (2.55)

which leads us to

b_{q+1} − b_q = b_q − b_{q−1},  (2.56)

so the decision boundaries are uniformly spaced. Let us denote

b_q − b_{q−1} = Δ;  (2.57)

plugging it into (2.56), we have

b_{q+1} − b_q = Δ.  (2.58)

It then follows from (2.53) and (2.54) that

x̂_{q+1} − x̂_q = (b_{q+1} − b_{q−1})/2  (2.59)
             = Δ,  (2.60)

which indicates that the quantized values are also uniformly spaced. Therefore, the uniform quantizer is optimal for the uniform distribution.
For unbounded inputs, we may set Xmin = −∞ and/or Xmax = ∞. Also, we may rearrange (2.52) into

x̂_{q+1} = 2b_q − x̂_q,  (2.62)

so that, starting from a guess for x̂_1, the decision boundaries b_q can be obtained from (2.50) and the quantized values x̂_{q+1} from (2.62) in alternation. After the last quantized value x̂_M is obtained in this way, its deviation from the centroid of the last decision interval,

ε = x̂_M − ∫_{b_{M−1}}^{b_M} x p(x) dx / ∫_{b_{M−1}}^{b_M} p(x) dx,  (2.63)

measures the quality of the initial guess. The procedure stops when

|ε| < a predetermined threshold;  (2.64)

otherwise the guess for x̂_1 is adjusted and the procedure repeated.
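Rather than the scan over initial guesses described above, the sketch below uses the equivalent fixed-point formulation, alternating the centroid condition (2.50) and the midpoint condition (2.52) on a densely discretized PDF; the grid, iteration count, and Gaussian example are illustrative choices.

```python
import math

def lloyd_max(pdf, grid, m, iterations=200):
    """Alternate the centroid condition (2.50) and the midpoint
    condition (2.52) on a densely sampled PDF until (near) convergence."""
    b = [grid[0] + (grid[-1] - grid[0]) * q / m for q in range(m + 1)]
    xq = [0.0] * m
    w = [pdf(x) for x in grid]                     # PDF sampled on the grid
    for _ in range(iterations):
        for q in range(m):                         # centroids, (2.50)
            num = den = 0.0
            for x, p in zip(grid, w):
                if b[q] <= x < b[q + 1]:
                    num += x * p
                    den += p
            xq[q] = num / den if den > 0 else 0.5 * (b[q] + b[q + 1])
        for q in range(1, m):                      # midpoints, (2.52)
            b[q] = 0.5 * (xq[q - 1] + xq[q])
    return b, xq

gauss = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
grid = [-5 + i * 0.01 for i in range(1001)]
b, xq = lloyd_max(gauss, grid, 4)
print([round(v, 3) for v in xq])   # ≈ [-1.51, -0.453, 0.453, 1.51]
```

For M = 4 and a unit-variance Gaussian, the quantized values converge close to the well-known optimum ±0.4528 and ±1.510, up to the discretization error of the grid.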
Fig. 2.9 Optimal SNR versus bits per sample achieved by Lloyd-Max algorithm for uniform,
Gaussian, Laplacian, and Gamma distributions
Apparently, the optimal SNR curves in Fig. 2.9 also fit straight lines well, and so can be approximated by the same equation given in (2.43), with improved slope b and intercept a. The improved performance of nonuniform quantization results in a better fit to a straight line than those in Fig. 2.6.
Similar to uniform quantization in Fig. 2.6, both the slope b and the intercept a decrease as the peakedness or kurtosis of the PDF increases in the order of uniform, Gaussian, Laplacian, and Gamma, indicating that the overall performance of a Lloyd-Max quantizer is inversely related to PDF kurtosis. Compared with the uniform distribution, all other distributions have reduced slopes b, indicating that their performance relative to the uniform distribution becomes worse as the bit rate increases. However, the degradations of both a and b are less conspicuous than those in Fig. 2.6.
In order to compare the performance of the Lloyd-Max quantizer with that of the uniform quantizer, Fig. 2.10 shows the optimal SNR gain of the Lloyd-Max quantizer over the uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions:

Optimal SNR Gain = SNR_Nonuniform − SNR_Uniform,

where SNR_Nonuniform is taken from Fig. 2.9 and SNR_Uniform from Fig. 2.6. Since the Lloyd-Max quantizer for the uniform distribution is a uniform quantizer, the optimal SNR gain is zero for the uniform distribution. It is obvious that the optimal SNR gain is more profound when the distribution is more peaked, that is, of larger kurtosis.
Fig. 2.10 Optimal SNR gain of Lloyd-Max quantizer over uniform quantizer for uniform,
Gaussian, Laplacian, and Gamma distributions
2.4.2 Companding

Finding the whole set of decision boundaries {b_q}_0^M and quantized values {x̂_q}_1^M of an optimal nonuniform quantizer using the Lloyd-Max algorithm usually involves a large number of iterations, hence may be computationally intensive, especially for a large M. The storage requirement for these decision boundaries and quantized values may also become excessive, especially for the decoder. Companding is an alternative.
Companding is motivated by the observation that a uniform quantizer is simple and effective for a uniformly distributed source signal. For a nonuniformly distributed source signal, one could use a nonlinear function f(x) to convert it into one with a PDF similar to a uniform distribution, and then use the simple and effective uniform quantizer. After the quantization indexes are transmitted to and subsequently received by the decoder, they are first inversely quantized to reconstruct the uniformly quantized values, and then the inverse function f⁻¹(x) is applied to produce the final quantized values. This process is illustrated in Fig. 2.11.
The nonlinear function in Fig. 2.11 is called a compressor because it usually has a shape similar to that shown in Fig. 2.12: it stretches the source signal when its sample value is small and compresses it otherwise. This shape matches the typical PDF shape, such as Gaussian or Laplacian, which has a large probability density at small absolute sample values and tails off towards large absolute sample values, so that the converted signal has a PDF similar to a uniform distribution.
Fig. 2.11 The source sample value is first converted by the compressor into another one with a
PDF similar to a uniform distribution. It is then quantized by a uniform quantizer and the quantization index is transmitted to the decoder. After inverse quantization at the decoder, the uniformly
quantized value is converted by the expander to produce the final quantized value
Fig. 2.12 μ-law companding deployed in North American and Japanese telecommunication systems
The μ-law compressor is

y = f(x) = sign(x) · ln(1 + μ|x|)/ln(1 + μ), −1 ≤ x ≤ 1,  (2.65)

and its corresponding expander is

x = f⁻¹(y) = sign(y) · [(1 + μ)^{|y|} − 1]/μ, −1 ≤ y ≤ 1.  (2.66)
The A-law compressor used in European telecommunication systems is

y = f(x) = sign(x) · A|x|/(1 + ln A),           0 ≤ |x| ≤ 1/A;
y = f(x) = sign(x) · [1 + ln(A|x|)]/(1 + ln A), 1/A < |x| ≤ 1;  (2.67)

where A = 87.7 and the normalized sample value x is limited to 12 magnitude bits. Its corresponding expanding function is

x = f⁻¹(y) = sign(y) · (1 + ln A)|y|/A,          0 ≤ |y| ≤ 1/(1 + ln A);
x = f⁻¹(y) = sign(y) · e^{|y|(1 + ln A) − 1}/A,  1/(1 + ln A) < |y| ≤ 1.  (2.68)
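A direct rendering of the μ-law pair (2.65)–(2.66) in Python; the value μ = 255, used in the G.711 μ-law standard, is an assumption here.

```python
import math

MU = 255.0   # assumed value; 255 is used in the G.711 μ-law standard

def mu_law_compress(x):
    """Compressor (2.65): y = sign(x) · ln(1 + μ|x|) / ln(1 + μ), |x| <= 1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Expander (2.66): x = sign(y) · ((1 + μ)^|y| - 1) / μ, |y| <= 1."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)

for x in (-1.0, -0.01, 0.0, 0.25, 1.0):   # round-trip check
    assert abs(mu_law_expand(mu_law_compress(x)) - x) < 1e-12
```

Since the expander is the exact inverse of the compressor, all of the coding gain comes from the uniform quantizer operating between the two, where small input values occupy a disproportionately large share of the compressed range.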
The implementation cost of the above exponential function is a notable issue in decoder development. Piece-wise linear approximation may lead to degradation in audio quality, hence may be unacceptable for high-fidelity applications. Another alternative is to store the exponential function as a quantization table. This amounts to 2¹³ × 3 = 24 KB if each of the 2¹³ entries in the table is stored using 24 bits.
The most widely used companding in audio coding is the companding of the quantization step sizes of uniform quantizers. Since quantization step sizes are needed by the inverse quantization process in the decoder, they need to be packed into the bit stream and transmitted to the decoder. Transmitting these step sizes with arbitrary resolution is out of the question, so it is necessary that they be quantized.
The perceived loudness of quantization noise is usually considered logarithmically proportional to the quantization noise power, or linearly proportional to the quantization noise power in decibels. Due to (2.28), this means the perceived loudness is linearly proportional to the quantization step size in decibels. Therefore, almost all audio coding algorithms use logarithmic companding to quantize quantization step sizes:
η = f(Δ) = log₂(Δ),  (2.71)

where Δ is the step size of a uniform quantizer. The corresponding expander is obviously

Δ = f⁻¹(η) = 2^η.  (2.72)

Another motivation for logarithmic companding is the need to cope with the wide dynamic range of audio signals, which may amount to more than 24 bits per sample.
Chapter 3
Vector Quantization
The scalar quantization discussed in Chap. 2 quantizes the samples of a source signal one by one in sequence. It is simple because it deals with only one sample at a time, but its quantization efficiency is limited. We now consider quantizing two or more samples as one block each time, and call this approach vector quantization (VQ).
(3.1)
If we use the scalar midtread quantizer given in Table 2.2 and Fig. 2.2, we get the following SQ indexes:

{1, 1, 2, 2, 2, 2, 3, 3},  (3.2)

which is also the sequence of quantized values, since the quantization step size is one. Since the range of the quantization indexes is [1, 3], we need 2 bits to convey each index. This amounts to 8 × 2 = 16 bits for encoding the whole sequence.
If two samples are quantized as a block, or vector, each time using the VQ codebook given in Table 3.1, we end up with the following sequence of indexes:

{0, 1, 1, 2}.

When this sequence is used by the decoder to look up Table 3.1, we obtain exactly the same reconstructed sequence as in (3.2), so the total quantization error is the same. Now 2 bits are still needed to convey each index, but there are only four indexes, so we need 4 × 2 = 8 bits to convey the whole sequence. This is only half the number of bits needed by SQ, while the total quantization error is the same.
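The two-sample VQ of Table 3.1 can be sketched as a nearest-neighbor search. Since the original source sequence (3.1) is not reproduced here, the example below feeds in the SQ-quantized sequence from (3.2), for which the codebook happens to be exact.

```python
CODEBOOK = [(1, 1), (2, 2), (3, 3)]   # Table 3.1

def vq_encode(seq):
    """Group consecutive sample pairs and map each pair to the index of the
    nearest codebook vector (squared Euclidean distance)."""
    indexes = []
    for i in range(0, len(seq), 2):
        v = (seq[i], seq[i + 1])
        d = [(v[0] - c[0]) ** 2 + (v[1] - c[1]) ** 2 for c in CODEBOOK]
        indexes.append(d.index(min(d)))
    return indexes

def vq_decode(indexes):
    """Inverse VQ: table lookup of representative vectors."""
    out = []
    for q in indexes:
        out.extend(CODEBOOK[q])
    return out

print(vq_encode([1, 1, 2, 2, 2, 2, 3, 3]))   # → [0, 1, 1, 2]
print(vq_decode([0, 1, 1, 2]))               # → [1, 1, 2, 2, 2, 2, 3, 3]
```

With three codevectors, each index needs 2 bits, so the eight samples cost 4 × 2 = 8 bits, exactly the halving described above.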
To explain why VQ can achieve much better performance than SQ, let us view
the sequence in (3.1) as a sequence of two-dimensional vectors:
Table 3.1 A VQ codebook for quantizing two samples at a time

Index    Representative vector
0        [1, 1]
1        [2, 2]
2        [3, 3]
(3.3)
and plot them in Fig. 3.1 as dots using the first and second elements of each vector as
x and y coordinates, respectively. For the first element of each vector, we use solid
vertical lines to represent its decision boundaries and dashed vertical lines its quantized value. Consequently, its decision intervals are represented by vertical strips
defined by two adjacent vertical lines. For the second element of each vector, we do
the same with horizontal solid and dashed lines, respectively, so that its decision intervals are represented by horizontal strips defined by two adjacent horizontal lines.
When the first element is quantized using SQ, a vertical decision strip is activated
to produce the quantized value. Similarly, a horizontal decision strip is activated
when the second element is quantized using SQ. Since the first and second elements
of each vector are quantized separately, the quantization of the whole sequence can be viewed as a process of alternating activation of vertical and horizontal decision strips.
However, if the results of the SQ described above for the two elements of each
vector are viewed jointly, the quantization decision is actually represented by the
squares where the vertical and horizontal decision strips cross. Each decision square
represents the decision intervals for both elements of a vector. The crossing point of
dashed lines in the middle of the decision square represents the quantized values for
both the elements of the vector. Therefore, the decision square and the associated
crossing point inside it are the real decision boundaries and quantized values for
each source vector.
It is now easy to realize that many of the decision squares are never used by SQ. Due to their existence, however, we still need a two-dimensional vector to identify, and hence represent, each of the decision squares. For the current example, each such vector needs 2 × 2 = 4 bits to be represented, corresponding to 2 bits per sample. So no bits are saved. What a waste those unused decision squares cause!
To avoid this waste, we need to consider the quantization of each source vector as a joint and simultaneous action, and forgo the SQ decision squares. Along this line of thinking, we can arbitrarily place decision boundaries in the two-dimensional space. For example, noting that the data points are scattered almost along a straight line, we can re-design the decision boundaries as those depicted by the curved solid lines in Fig. 3.1. With this design, we designate three crossing points, represented by the diamonds in the figure, as the quantized or representative vectors. They obviously represent the same quantized values as those obtained by SQ, thus leaving the quantization error unchanged. However, the two-dimensional decision boundaries carve out only three decision regions, or decision cells, with only three representative vectors that need to be indexed and transmitted to the decoder. Therefore, we need to transmit 2 bits per vector, or 1 bit per sample, to the decoder, amounting to a bit reduction of 2 to 1.
Of course, source sequences in the real world are not as simple as those in (3.1) and Fig. 3.1. To look at realistic sequences, Fig. 3.2 plots a correlated Gaussian sequence (mean = 0, variance = 1) in the same way as Fig. 3.1. Also plotted are the decision boundaries (solid lines) and quantized values (dashed lines) of the same uniform scalar quantizer used above. Apparently, the samples are highly concentrated along a straight line, and many SQ decision squares are wasted.
One may argue that a nonuniform SQ would do much better. Recall that it was concluded in Sect. 2.4 that the size of a decision interval of a nonuniform SQ should be inversely proportional to the probability density. The translation of this rule into two dimensions is that the size of the SQ decision squares should be inversely proportional to the probability density. So a nonuniform SQ can improve the coding performance by placing small SQ decision squares in densely distributed areas and large ones in loosely distributed areas. However, there still are many SQ decision regions placed in areas that are extremely sparsely populated by the source samples, causing a waste of bit resources.
Fig. 3.2 A correlated Gaussian sequence (mean = 0, variance = 1) plotted as two-dimensional vectors over the decision boundaries (solid lines) and quantized values (dashed lines) of a midtread uniform SQ (step size = 1)
Fig. 3.3 An independent Gaussian sequence (mean = 0, variance = 1) is plotted as two-dimensional vectors over the decision boundaries (solid lines) and quantized values (dashed lines) of a midtread uniform SQ (step size = 1)
To avoid this waste, the restriction that decision boundaries have to form squares has to be removed. Once this restriction is removed, we arrive at the world of VQ, where decision regions can be of arbitrary shapes and can be arbitrarily allocated to match the probability distribution of the source sequence, achieving better quantization performance.
Even with uncorrelated sequences, VQ can still achieve better performance than SQ. Figure 3.3 plots an independent Gaussian sequence (mean = 0, variance = 1) over the decision boundaries (solid lines) and quantized values (dashed lines) of the same midtread uniform SQ. Apparently the samples are still concentrated, even though not as much as the correlated sequence in Fig. 3.2, so a VQ can still achieve better performance than an SQ. At the least, an SQ has to allocate decision squares to cover the four corners of the figure, but a VQ can use arbitrarily shaped decision regions to cover those areas without wasting bits.
3.2 Formulation

Let us consider an N-dimensional random vector

x = (x_0, x_1, ..., x_{N−1})^T   (3.4)

with a joint PDF p(x) over a vector space or support Ω. Suppose this vector space is divided by a set of M regions {Ω_q}_{q=0}^{M−1} in a mutually exclusive and collectively exhaustive way:

Ω = ∪_{q=0}^{M−1} Ω_q   (3.5)

and

Ω_p ∩ Ω_q = ∅  for all  p ≠ q,   (3.6)
where ∅ is the null set. These regions, referred to as decision regions, play a role similar to decision intervals in SQ. To each decision region Ω_q, a representative vector r_q is assigned:

r_q ↦ Ω_q,  for q = 0, 1, ..., M − 1.   (3.7)

The source vector x is vector-quantized to VQ index q as follows:

q = Q(x)  if and only if  x ∈ Ω_q.   (3.8)

The inverse process reconstructs the source vector from the VQ index as x̂ = r_q, which may be modeled by an additive quantization error q(x):

x̂ = x + q(x).   (3.12)
A common distortion measure is the squared Euclidean distance

d(x, x̂) = Σ_{k=0}^{N−1} (x_k − x̂_k)².   (3.13)

The average quantization error over the support is then

D = ∫ d(x, x̂) p(x) dx = Σ_{q=0}^{M−1} ∫_{x∈Ω_q} d(x, r_q) p(x) dx.   (3.14)

If the Euclidean distance is used, the average quantization noise may again be called MSQE.

The goal of VQ design is to find a set of decision regions {Ω_q}_{q=0}^{M−1} and representative vectors {r_q}_{q=0}^{M−1}, referred to as a VQ codebook, that minimizes this average quantization error.
One necessary condition for an optimal solution is the nearest-neighbor rule:

q = Q(x)  if and only if  d(x, r_q) ≤ d(x, r_p) for all p,   (3.16)

i.e., each decision region consists of all vectors closest to its representative vector:

Ω_q = {x : d(x, r_q) ≤ d(x, r_p) for all p}.   (3.17)

Such disjoint sets are referred to as Voronoi regions, which rids us of the trouble of literally defining and representing the boundary of each decision region. This also indicates that a whole VQ scheme can be fully described by the VQ codebook, or the set of representative vectors {r_q}_{q=0}^{M−1}.
Another condition for an optimal solution is that, given a decision region Ω_q, the best choice for the representative vector is the conditional mean of all vectors within the decision region [19]:

r_q = ∫_{x∈Ω_q} x p(x | x ∈ Ω_q) dx.   (3.18)
In practice, the PDF is rarely available, so the average quantization error is measured over a training set of L vectors:

D = Σ_{k=0}^{L−1} d(x_k, x̂(x_k)).   (3.19)

For the same reason, the best choice of the representative vector in (3.18) needs to be replaced by the average of the L_q training vectors falling in Ω_q:

r_q = (1/L_q) Σ_{x_k∈Ω_q} x_k.   (3.20)
3.5 Implementation
Once the VQ codebook is obtained, it can be used to quantize the source signal that
generates the training set. Similar to SQ, this also involves two stages as shown in
Fig. 3.4.
Fig. 3.4 VQ involves an encoding or forward VQ stage, represented by VQ in the figure, which maps a source vector to a VQ index, and a decoding or inverse VQ stage, represented by VQ⁻¹, which maps a VQ index to its representative vector
Fig. 3.5 The
vector-quantization of a
source vector entails
searching through the VQ
codebook to find the
representative vector that is
closest to the source vector
Part III
Data Model
Coding with a data model generally involves the following steps:
1. Transform the source signal into another one using a data model.
2. Quantize the transformed signal.
3. Transmit the quantization indexes to the decoder.
4. Inverse-quantize the received quantization indexes to reconstruct the transformed signal.
5. Inverse-transform the reconstructed signal to reconstruct the original signal.
A necessary condition for this scheme to work is that the transformation must be invertible, either exactly or approximately. Otherwise, the original signal cannot be reconstructed even if no quantization is applied.
The key to the success of such a scheme is that the transformed signal must be compact. Companding achieves this using a nonlinear function to arrive at a PDF that is similar to a uniform distribution. There are other methods that are much more powerful, the most prominent among them being linear prediction, linear transforms, and subband filter banks.
Linear prediction uses a linear combination of past samples as a prediction for the current sample. As long as the samples are fairly correlated, the predicted value will be a good estimate of the current sample value, resulting in a small prediction error signal, which may be characterized by a smaller variance. Since the MSQE of an optimal quantizer is proportional to the variance of the source signal (see (2.48)), the reduced variance will result in a reduced MSQE.
A linear transform takes a block of input samples and generates another block of transform coefficients whose energy is compacted into a minority of them. Bit resources can then be concentrated on those high-energy coefficients to arrive at a significantly reduced MSQE.
Chapter 4
Linear Prediction
Let us consider the source signal x(n) shown at the top of Fig. 4.1. A simple approach to linear prediction is to just use the previous sample x(n − 1) as the prediction for the current sample:

p(n) = x(n − 1).   (4.1)

This prediction is, of course, not perfect, so there is a prediction error or residue

r(n) = x(n) − p(n) = x(n) − x(n − 1),   (4.2)

which is shown at the bottom of Fig. 4.1. The dynamic range of the residue is obviously much smaller than that of the source signal. The variance of the residue is 2.0282, which is much smaller than 101.6028, the variance of the source signal. The histograms of the source signal and the residue, both shown in Fig. 4.2, clearly indicate that, if the residue, instead of the source signal itself, is quantized, the quantization error will be much smaller.
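This variance reduction is easy to reproduce numerically. The following sketch, assuming a synthetic first-order autoregressive source (an illustrative choice, not the book's test signal), applies the simple predictor (4.1) and compares variances:

```python
import numpy as np

# Simple predictor p(n) = x(n-1) from (4.1) applied to an assumed AR(1)
# source; the residue r(n) = x(n) - x(n-1) has a much smaller variance.
rng = np.random.default_rng(0)
w = rng.standard_normal(8000)
x = np.empty_like(w)
x[0] = w[0]
for n in range(1, len(w)):
    x[n] = 0.95 * x[n - 1] + w[n]         # strongly correlated source

r = x[1:] - x[:-1]                        # prediction residue (4.2)
pg_db = 10 * np.log10(x.var() / r.var())  # variance ratio in dB
print(f"var(x)={x.var():.2f}  var(r)={r.var():.2f}  gain={pg_db:.1f} dB")
```

For a source this correlated, the residue variance is roughly an order of magnitude smaller than the source variance.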
More generally, a Kth-order linear predictor forms the prediction from the past K samples using a set of prediction coefficients {a_k}_{k=1}^K:

p(n) = Σ_{k=1}^K a_k x(n − k),   (4.3)

whose transfer function is

A(z) = Σ_{k=1}^K a_k z^{−k}.   (4.4)
[Fig. 4.1 Input signal (top) and prediction residue (bottom)]
[Fig. 4.2 Histograms for the source signal and its prediction residue]
The difference between the source sample and its prediction,

r(n) = x(n) − p(n) = x(n) − Σ_{k=1}^K a_k x(n − k),   (4.5)

is also referred to as the prediction residue. With proper design of the prediction coefficients, this prediction error can be significantly reduced. To assess the performance of this prediction error reduction, the prediction gain is defined as

PG = σ_x² / σ_r²,   (4.6)

where σ_x² and σ_r² denote the variances of the source and residue signals, respectively. For example, the prediction gain for the simple predictor in the last section is

PG = 10 log₁₀ (101.6028 / 2.0282) ≈ 17 dB.
Since the prediction residue is likely to have a much smaller variance than the
source signal, the MSQE could be significantly reduced if the residue signal is quantized in place of the source signal. This amounts to linear prediction coding (LPC).
The transfer function from the source signal to the prediction residue at the encoder is

H(z) = 1 − Σ_{k=1}^K a_k z^{−k} = 1 − A(z).   (4.7)
Fig. 4.3 Encoder for open-loop DPCM. The quantizer is attached to the prediction residue and is
modeled by an additive noise source
quantization. This process may be viewed as if the quantized residue r̂(n) were received directly by the decoder. This is the convention adopted in Figs. 4.3 and 4.4.
Since the original source signal x(n) is not available at the decoder, the prediction scheme in (4.3) cannot be used by the decoder. Instead, the prediction at the decoder has to use the past reconstructed sample values:

p̂(n) = Σ_{k=1}^K a_k x̂(n − k).   (4.8)

The reconstructed sample at the decoder output is then

x̂(n) = Σ_{k=1}^K a_k x̂(n − k) + r̂(n).   (4.9)
This leads us to the decoder shown in Fig. 4.4, where the overall transfer function of the LPC decoder or the LPC reconstruction filter is

D(z) = 1 / (1 − Σ_{k=1}^K a_k z^{−k}) = 1 / (1 − A(z)).   (4.10)
If there were no quantizer, the overall transfer function of the encoder (4.7) and
decoder (4.10) is obviously one, so the reconstructed signal at the decoder output
is the same as the encoder input. When the quantizer is deployed, the prediction
residue used by the decoder is different from that used by the encoder, and the difference is the additive quantization noise. This causes a reconstruction error at the decoder output

e(n) = x̂(n) − x(n),   (4.11)

which is the quantization error of the overall open-loop DPCM.
The quantizer may be modeled as an additive noise source:

r̂(n) = r(n) + q(n),   (4.12)

where q(n) is again the quantization error or noise. The reconstructed sample at the decoder output (4.9) can then be rewritten as

x̂(n) = p̂(n) + r(n) + q(n).   (4.13)
Substituting (4.13) and (4.3) into (4.11), we obtain

e(n) = p̂(n) + r(n) + q(n) − x(n)   (4.14)
     = p̂(n) − p(n) + q(n)   (4.15)
     = Σ_{k=1}^K a_k [x̂(n − k) − x(n − k)] + q(n)
     = Σ_{k=1}^K a_k e(n − k) + q(n),   (4.16)

which indicates that the reconstruction error is equal to the current quantization error plus a weighted sum of reconstruction errors in the past.
For the simple first-order predictor in (4.1), we have K = 1 and a_1 = 1, so (4.16) becomes

e(n) = e(n − 1) + q(n).   (4.17)

Recursively substituting e(n − 1), e(n − 2), and so on, we obtain

e(n) = Σ_{k=1}^n q(k),   (4.20)

which is the summation of quantization noise in all previous steps, starting from the beginning of prediction.
To obtain a closed-form representation of this noise accumulation, let E(z) denote the Z-transform of the reconstruction error e(n) and Q(z) the Z-transform of the quantization noise q(n); then the reconstruction error in (4.16) may be expressed as

E(z) = Q(z) / (1 − A(z)),   (4.21)

which indicates that the reconstruction error at the decoder output is the quantization error filtered by an all-pole LPC reconstruction filter (see (4.10)).
An all-pole IIR filter may be unstable and may produce large reconstruction errors that are perceptually annoying, so open-loop DPCM is avoided in many applications. But it is sometimes purposely deployed in other applications to exploit the shaping of the quantization noise spectrum by the all-pole reconstruction filter; see Sects. 4.5 and 11.4 for details.
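The error accumulation predicted by (4.20) can be observed directly. The sketch below, assuming a random-walk source and the simple first-order predictor (both illustrative choices), shows that the open-loop decoder's reconstruction error is the running sum of all past quantization errors:

```python
import numpy as np

# Open-loop DPCM with p(n) = x(n-1): the encoder computes the residue
# from the *original* samples, while the decoder integrates the quantized
# residue with the all-pole filter 1/(1 - z^-1).
rng = np.random.default_rng(1)
x = np.cumsum(rng.standard_normal(5000))  # assumed random-walk source

step = 0.5
r = np.diff(x, prepend=0.0)               # open-loop residue
rhat = step * np.round(r / step)          # uniform quantizer
q = rhat - r                              # quantization noise, |q| <= step/2

xhat = np.cumsum(rhat)                    # decoder reconstruction
e = xhat - x                              # reconstruction error (4.11)

# e(n) equals the accumulated quantization noise, as (4.20) predicts,
# and grows far beyond the single-step bound step/2:
print("max |e| =", np.abs(e).max(), " vs single-step bound", step / 2)
```

Running this, the accumulated error exceeds the per-sample quantization bound by a wide margin, which is exactly the behavior that motivates the closed-loop DPCM of the next section.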
4.3 DPCM
This problem of quantization error accumulation can be avoided by forcing the
encoder to use the same predictor as the decoder, i.e., the predictor given in (4.8).
This entails moving the quantizer in Fig. 4.3 inside the encoding loop, leading to
the encoder scheme shown in Fig. 4.5. This LPC scheme is often referred to as
differential pulse code modulation (DPCM) [9]. Note that a uniform quantizer is
usually deployed.
The residue quantized by the encoder is now

r(n) = x(n) − p̂(n).   (4.22)

Plugging this into the additive noise model (4.12) for quantization, we have

r̂(n) = x(n) − p̂(n) + q(n),   (4.23)

which is the quantized residue at the input to the decoder. Substituting the above equation into (4.9), we obtain the reconstructed value at the decoder

x̂(n) = p̂(n) + x(n) − p̂(n) + q(n) = x(n) + q(n).   (4.24)

The reconstruction error is therefore

e(n) = x̂(n) − x(n) = q(n),   (4.25)

or equivalently

E(z) = Q(z).   (4.26)
Therefore, the reconstruction error at the decoder output is exactly the same as the quantization error of the residue in the encoder; there is no quantization error accumulation.
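The contrast with the open-loop scheme can be checked numerically. In the minimal sketch below (the random-walk source is again an assumption), the quantizer sits inside the prediction loop as in Fig. 4.5, and the reconstruction error never exceeds half a quantization step, as (4.24)-(4.26) guarantee:

```python
import numpy as np

# Closed-loop DPCM with predictor p(n) = xhat(n-1): the encoder predicts
# from the *reconstructed* past, so it tracks the decoder exactly.
rng = np.random.default_rng(1)
x = np.cumsum(rng.standard_normal(5000))  # assumed random-walk source

step = 0.5
xhat_prev = 0.0
xhat = np.empty_like(x)
for n in range(len(x)):
    r = x[n] - xhat_prev                  # residue against reconstructed past
    rhat = step * np.round(r / step)      # uniform quantizer inside the loop
    xhat[n] = xhat_prev + rhat            # encoder and decoder reconstruction
    xhat_prev = xhat[n]

e = xhat - x
print("max |e| =", np.abs(e).max())       # bounded by step/2: no accumulation
```

The maximum reconstruction error stays at or below step/2, no matter how long the signal runs.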
The coding gain of DPCM over direct quantization (PCM) is defined as the ratio of their MSQEs:

G_DPCM = σ²_q(PCM) / σ²_q(DPCM).   (4.27)

Since the DPCM quantizer operates on the residue,

σ²_q(DPCM) = σ²_q(Residue).   (4.28)

Since σ²_q(Residue) is related to the variance of the residue σ_r² by (2.48) for uniform and nonuniform scalar quantization, we have

σ²_q(DPCM) = 10^{−0.1(a+bR)} σ_r²,   (4.29)

for a given bit rate of R. Note that the slope b and intercept a in the above equation are determined by the particular quantizer and the PDF of the residue.
On the other hand, if the source signal is quantized directly with the same bit rate R in what constitutes PCM, (2.48) gives the following MSQE:

σ²_q(PCM) = 10^{−0.1(a+bR)} σ_x²,   (4.30)

where σ_x² is the variance of the source signal. Here it is assumed that both the source signal and the DPCM residue share the same set of parameters a and b. This assumption may not be valid in many practical applications.
Consequently, the coding gain for DPCM is

G_DPCM = σ²_q(PCM) / σ²_q(DPCM) = σ_x² / σ_r² = PG,   (4.31)
which is the same as the prediction gain defined in (4.6). This indicates that the quantization performance of the DPCM system is dependent on the prediction gain, or how well the predictor predicts the source signal. If the predictor in the DPCM is properly designed so that

σ_r² ≪ σ_x²,   (4.32)

then G_DPCM ≫ 1, i.e., a significant reduction in quantization error can be achieved.
The DPCM encoder forms its prediction as

p̂(n) = Σ_{k=1}^K a_k x̂(n − k)   (4.33)

using (4.8). Therefore, the design problem is to find the set of prediction coefficients {a_k}_{k=1}^K that minimizes σ_r²:

min_{a_k} σ_r² = E[ (x(n) − Σ_{k=1}^K a_k x̂(n − k))² ],   (4.34)
where E[·] denotes the expectation operator, E[y] = ∫ y(ξ) p(ξ) dξ.   (4.35)
Since x̂(n) in (4.34) is the reconstructed signal and is related to the source signal by the additive quantization noise (see (4.25)), the minimization problem in (4.34) involves optimal selection of the prediction coefficients as well as the minimization of quantization error. As discussed in Chaps. 2 and 3, independent minimization of quantization error, or quantizer design itself, is frequently very difficult, so it is highly desirable that the problem be simplified by taking quantization out of the picture. This essentially implies that the DPCM scheme is given up and the open-loop DPCM encoder in Fig. 4.3 is used when it comes to predictor design.
Under the fine-quantization assumption,

x̂(n) ≈ x(n),   (4.36)

the prediction becomes

p(n) = Σ_{k=1}^K a_k x(n − k),   (4.37)

and the error function to be minimized over {a_k}_{k=1}^K is

σ_r² = E[ (x(n) − Σ_{k=1}^K a_k x(n − k))² ].   (4.38)
To minimize this error function, we set the derivative of σ_r² with respect to each prediction coefficient a_j to zero:

∂σ_r²/∂a_j = −2 E[ (x(n) − Σ_{k=1}^K a_k x(n − k)) x(n − j) ] = 0,   (4.39)

which gives

E[r(n) x(n − j)] = 0,  j = 1, 2, ..., K.   (4.40)

This indicates that the minimal prediction error or residue must be orthogonal to all data used in the prediction. This is called the orthogonality principle.
By moving the expectation inside the summation, (4.39) may be rewritten as

Σ_{k=1}^K a_k E[x(n − k) x(n − j)] = E[x(n) x(n − j)],  j = 1, 2, ..., K.   (4.42)
Now we are ready to make the second assumption: the source signal x(n) is a wide-sense stationary process, so that its autocorrelation function can be defined as

R(k) = E[x(n) x(n − k)],   (4.43)

which is symmetric: R(k) = R(−k).   (4.44)

With this, (4.42) becomes the normal or Wiener-Hopf equations

Σ_{k=1}^K a_k R(k − j) = R(j),  j = 1, 2, ..., K.   (4.45)
In matrix form, these equations can be written as

R a = r,   (4.46)

where

a = (a_1, a_2, a_3, ..., a_K)^T,   (4.47)

r = (R(1), R(2), ..., R(K))^T,   (4.48)

and

R = [ R(0)     R(1)     R(2)     ...  R(K−1)
      R(1)     R(0)     R(1)     ...  R(K−2)
      R(2)     R(1)     R(0)     ...  R(K−3)
      ...      ...      ...      ...  ...
      R(K−1)   R(K−2)   R(K−3)   ...  R(0)   ].   (4.49)
The Toeplitz structure of R allows the normal equations to be solved efficiently by the Levinson-Durbin recursion, which builds the solution order by order. Let a_k^n denote the kth coefficient of the order-n predictor obtained in the nth iteration. To get the prediction filter of order K, we need to iterate through the following sets of filter coefficients:
n = 1:  {a_1^1}
n = 2:  {a_1^2, a_2^2}
...
n = K:  {a_1^K, a_2^K, ..., a_K^K}

The iteration above is possible because both the matrix R and the vector r are built from the autocorrelation values {R(k)}_{k=0}^K. The algorithm proceeds as follows:
1. Set n = 0.
2. Set E^0 = R(0).
3. Set n = n + 1.
4. Calculate

k_n = ( R(n) − Σ_{k=1}^{n−1} a_k^{n−1} R(n − k) ) / E^{n−1}.   (4.52)

5. Calculate

a_n^n = k_n.   (4.53)

6. Calculate

a_k^n = a_k^{n−1} − k_n a_{n−k}^{n−1},  for k = 1, 2, ..., n − 1.   (4.54)

7. Calculate

E^n = (1 − k_n²) E^{n−1}.   (4.55)

8. Go to step 3 if n < K.
For example, to get the prediction filter of order K = 2, two iterations are needed as follows. For n = 1, we have

E^0 = R(0),   (4.56)

k_1 = R(1) / E^0 = R(1) / R(0),

a_1^1 = k_1 = R(1) / R(0),   (4.57)

E^1 = (1 − k_1²) E^0 = [1 − (R(1)/R(0))²] R(0)   (4.58)
    = (R²(0) − R²(1)) / R(0).   (4.59)
For n = 2, we have

k_2 = (R(2) − a_1^1 R(1)) / E^1 = (R(2) R(0) − R²(1)) / (R²(0) − R²(1)),   (4.60)

a_2^2 = k_2,   (4.61)

a_1^2 = a_1^1 − k_2 a_1^1 = R(1) (R(0) − R(2)) / (R²(0) − R²(1)),   (4.62)

and the order-2 prediction coefficients are

a_1 = a_1^2  and  a_2 = a_2^2.   (4.63)
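The recursion in steps 1-8 can be sketched as follows; the autocorrelation values are illustrative assumptions, and the result is checked against the closed-form order-2 solution (4.60)-(4.63):

```python
import numpy as np

def levinson_durbin(R, K):
    """Solve the normal equations for prediction coefficients a[1..K]
    from autocorrelation values R[0..K] (steps 1-8 in the text)."""
    a = np.zeros(K + 1)          # a[k] holds a_k; a[0] is unused
    E = R[0]                     # E^0 = R(0)
    for n in range(1, K + 1):
        k_n = (R[n] - np.dot(a[1:n], R[n - 1:0:-1])) / E   # (4.52)
        a_new = a.copy()
        a_new[n] = k_n                                     # (4.53)
        a_new[1:n] = a[1:n] - k_n * a[n - 1:0:-1]          # (4.54)
        a = a_new
        E = (1 - k_n**2) * E                               # (4.55)
    return a[1:], E

R = np.array([2.0, 1.2, 0.5])    # assumed values R(0), R(1), R(2)
a, E = levinson_durbin(R, 2)

den = R[0]**2 - R[1]**2
a2 = (R[2] * R[0] - R[1]**2) / den        # (4.60)-(4.61)
a1 = R[1] * (R[0] - R[2]) / den           # (4.62)
print("a =", a, " residual energy E =", E)
```

The recursion reproduces the closed-form order-2 coefficients exactly, and the residual energy E shrinks with every order, per (4.55).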
Consider the autocorrelation of the prediction residue, R_r(j) = E[r(n) r(n − j)]. Using (4.3), it may be expanded into

R_r(j) = E[r(n) x(n − j)] − Σ_{k=1}^K a_k E[r(n) x(n − j − k)],  j = 1, 2, ..., K.   (4.67)

Applying the orthogonality principle (4.40) to each term on the right-hand side yields

R_r(j) = { σ_r²,  j = 0;  0,  otherwise. }   (4.68)
It indicates that the prediction residue sequence is a white noise process. Note that
this conclusion is valid only when the predictor has an infinite number of prediction
coefficients.
4.4.3.2 Markov Process
For predictors with a finite number of coefficients, the above condition is generally not true, unless the source signal is a Markov process with an order N ≤ K. Also called an autoregressive process and denoted as AR(N), such a process x(n) is generated by passing a white-noise process w(n) through an Nth-order all-pole filter
X(z) = W(z) / (1 − B(z)) = W(z) / (1 − Σ_{k=1}^N b_k z^{−k}),   (4.69)

where

B(z) = Σ_{k=1}^N b_k z^{−k},   (4.70)

and X(z) and W(z) are the Z-transforms of x(n) and w(n), respectively. The corresponding difference equation is

x(n) = w(n) + Σ_{k=1}^N b_k x(n − k).   (4.71)
kD1
D Ew.n/x.n j / C
bk Ex.n k/x.n j /
kD1
N
X
D Ew.n/x.n j / C
bk R.j k/
(4.72)
kD1
w2 ; j D 0I
(4.73)
0; j > 0:
Consequently,
(
Rx .j / D
w2 C
PN
PN
kD1
kD1
bk R.k/; j D 0I
(4.74)
A comparison with the Wiener-Hopf equations (4.45) leads to the following set of optimal prediction coefficients:

a_k = { b_k,  0 < k ≤ N;  0,  N < k ≤ K. }   (4.75)

The above result makes intuitive sense because it sets the LPC encoder filter (4.7) to be the inverse of the filter (4.69) that generates the AR process. In particular, the Z-transform of the prediction residue is given by

R(z) = [1 − A(z)] X(z)   (4.77)
     = [1 − B(z)] W(z) / (1 − B(z)) = W(z),   (4.78)

so the residue is exactly the white-noise process driving the AR model. Equivalently, the encoder filter is the inverse of the LPC reconstruction filter:

1 − A(z) = 1 / D(z).   (4.80)
[Fig. 4.6 Power spectra of the prediction residue (top) and of the source signal (bottom), together with the estimate by the reconstruction filter (the envelope in the bottom panel)]
Solving (4.77) for X(z), the source signal may be expressed in terms of the residue as

X(z) = R(z) / H(z) = R(z) / (1 − Σ_{k=1}^K a_k z^{−k}),   (4.81)

so its power spectrum is

S_xx(e^{jω}) = S_rr(e^{jω}) / |1 − Σ_{k=1}^K a_k e^{−jkω}|²,   (4.82)

where S_xx(e^{jω}) and S_rr(e^{jω}) are the power spectra of x(n) and r(n), respectively. Since r(n) is white or nearly white, the equation above becomes

S_xx(e^{jω}) = σ_r² / |1 − Σ_{k=1}^K a_k e^{−jkω}|².   (4.83)
Therefore, the decoder or the LPC reconstruction filter (4.10) provides an estimate of the spectrum of the source signal, and linear prediction is sometimes considered a spectral analysis tool.
Note that this is an all-pole model for the source signal, so it can model spectral peaks well, but it is incapable of modeling zeros (deep valleys) that may exist in the source signal. For this reason, the linear prediction spectrum is sometimes referred to as a spectral envelope. Furthermore, if the source signal cannot be modeled by poles, linear prediction may fail completely.
The spectrum of the source signal and the estimate by the LPC reconstruction filter (4.83) are shown at the bottom of Fig. 4.6. It can be observed that the spectrum estimated by the LPC reconstruction filter matches the signal spectrum envelope very well.
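A minimal numerical sketch of this use of linear prediction as a spectrum estimator follows; the AR(2) source and its coefficients are assumptions for illustration. The prediction coefficients are obtained by solving the normal equations (4.46) directly, and the envelope is evaluated via (4.83):

```python
import numpy as np

# Generate an assumed AR(2) source with a clear spectral peak.
rng = np.random.default_rng(2)
w = rng.standard_normal(16384)
x = np.empty_like(w)
x[0], x[1] = w[0], w[1]
for n in range(2, len(w)):
    x[n] = 1.3 * x[n - 1] - 0.8 * x[n - 2] + w[n]

# Estimate autocorrelation R(0..K) and solve the normal equations (4.46).
K = 2
R = np.array([np.dot(x[:len(x) - k], x[k:]) / len(x) for k in range(K + 1)])
Rmat = np.array([[R[abs(i - j)] for j in range(K)] for i in range(K)])  # (4.49)
a = np.linalg.solve(Rmat, R[1:])
print("estimated coefficients:", a)        # close to the true (1.3, -0.8)

# Evaluate the spectral envelope (4.83) at one frequency.
sigma_r2 = R[0] - np.dot(a, R[1:])         # residual energy
omega = 0.5
A = sum(a[k - 1] * np.exp(-1j * k * omega) for k in range(1, K + 1))
S = sigma_r2 / abs(1 - A) ** 2
print(f"envelope estimate at omega={omega}: {S:.2f}")
```

Because the source really is all-pole here, the estimated coefficients recover the generating filter, as the AR analysis in Sect. 4.4 predicts.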
4.5.1 DPCM

Let us revisit the DPCM system in Fig. 4.5. Its output given by (4.23) may be expanded using (4.8) to become

r̂(n) = x(n) − Σ_{k=1}^K a_k x̂(n − k) + q(n)   (4.84)
     = x(n) − Σ_{k=1}^K a_k x(n − k) + q(n) − Σ_{k=1}^K a_k q(n − k),   (4.85)

where the additive noise model x̂(n) = x(n) + q(n) from (4.24) is used.
Fig. 4.7 Different schemes of linear prediction for shaping quantization noise at the decoder output: DPCM (top), open-loop DPCM (middle) and general noise feedback coding (bottom)
In the Z domain, (4.85) reads

R̂(z) = [1 − A(z)] [X(z) + Q(z)],   (4.86)

so passing it through the reconstruction filter D(z) = 1/(1 − A(z)) cancels the factor 1 − A(z) and makes the noise transfer function equal to one. Therefore, the spectrum of the quantization noise at the decoder output is the same as that produced by the quantizer (see (4.26)). In other words, the spectrum of the quantization noise as produced by the quantizer is faithfully duplicated at the decoder output; there is no shaping of quantization noise.
Many quantizers, including the uniform quantizer, produce a white quantization
noise spectrum when the quantization step size is small (fine quantization). This
white spectrum is faithfully duplicated at the decoder output by DPCM.
More generally, the quantization noise may be fed back through a filter F(z) before being added to the residue, so that the encoder output becomes

R̂(z) = [1 − A(z)] X(z) + [1 − F(z)] Q(z).   (4.87)

Figure 4.7 implies an important advantage of noise feedback coding: the shaping of quantization noise is accomplished completely in the encoder; there is zero implementation impact or cost at the decoder.
When designing the noise feedback function, it is important to realize that the noise-feedback function is only half of the overall noise-shaping filter
S(z) = (1 − F(z)) / (1 − A(z));   (4.88)

the other half is the LPC reconstruction filter, which is determined by solving the normal equations (4.46). The Z-transform of the quantization error at the decoder output is given by

E(z) = S(z) Q(z),   (4.89)

so the noise spectrum is

S_ee(e^{jω}) = |S(e^{jω})|² S_qq(e^{jω}).   (4.90)
In audio coding, this spectrum is supposed to be shaped to match the masked threshold for a given source signal. In principle, this matching can be achieved for all
masked threshold shapes provided by a perceptual model [38].
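As a small numerical sketch of (4.88)-(4.90), the snippet below evaluates |S(e^{jω})|² for assumed filter coefficients (the values of a_k and f_k are illustrative, not from the text) and confirms that choosing F(z) = A(z) reduces S(z) to one, i.e., plain DPCM with no noise shaping:

```python
import numpy as np

# Noise-shaping filter S(z) = (1 - F(z))/(1 - A(z)) from (4.88).
a = np.array([1.3, -0.8])   # assumed LPC coefficients a_k
f = np.array([0.9, -0.4])   # assumed noise-feedback coefficients f_k

def S_mag2(omega, f, a):
    """|S(e^{jw})|^2, the noise-shaping power gain (4.90) at frequency omega."""
    z = np.exp(-1j * omega * np.arange(1, len(a) + 1))
    return abs(1 - np.dot(f, z)) ** 2 / abs(1 - np.dot(a, z)) ** 2

# F = A recovers DPCM: S(z) = 1 at every frequency, no shaping.
print("F = A:", S_mag2(0.7, a, a))
# F != A shapes the quantizer noise differently at different frequencies.
for omega in (0.5, 1.5, 3.0):
    print(f"|S|^2 at omega={omega}: {S_mag2(omega, f, a):.3f}")
```

Sweeping ω shows how the choice of F(z) redistributes the quantization noise power over frequency, which is what a perceptual model exploits to push noise under the masked threshold.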
Chapter 5
Transform Coding

Transform coding (TC) is a method that transforms a source signal into another one with a more compact representation. The goal is to quantize the transformed signal in such a way that the quantization error in the reconstructed signal is smaller than if the source signal were quantized directly.
The source signal is first grouped into blocks of M samples to form source vectors

x(n) = (x_0(n), x_1(n), ..., x_{M−1}(n))^T,   (5.1)

each of which is multiplied by a transformation matrix to give

y(n) = T x(n),   (5.2)

where

y(n) = (y_0(n), y_1(n), ..., y_{M−1}(n))^T   (5.3)

is called the transform of x(n) or the transform coefficients, and the M × M matrix
T = [ t_{0,0}       t_{0,1}       ...  t_{0,M−1}
      t_{1,0}       t_{1,1}       ...  t_{1,M−1}
      ...           ...                ...
      t_{M−1,0}     t_{M−1,1}     ...  t_{M−1,M−1} ]   (5.4)

is called the transformation matrix or simply the transform. This transform operation is shown on the left of Fig. 5.1.
Fig. 5.1 Flow chart of a transform coder. The coding delay for transforming and inverse
transforming the signals is ignored in this graph
Transform coding is shown in Fig. 5.1. The transform coefficients y(n) are quantized into quantized coefficients ŷ(n), and the resulting quantization indexes are transmitted to the decoder. The decoder reconstructs from the received indexes the quantized coefficients ŷ(n) through inverse quantization. This process may be viewed as if the quantized coefficients ŷ(n) were received directly by the decoder. The decoder then reconstructs an estimate x̂(n) of the source signal vector x(n) from the quantized coefficients ŷ(n) through an inverse transform

x̂(n) = T⁻¹ ŷ(n),   (5.5)

where T⁻¹ represents the inverse transform. The reconstructed vector x̂(n) can then be unblocked to rebuild an estimate x̂(n) of the original source signal x(n).
A basic requirement for a transform in the context of transform coding is that it must be invertible,

T⁻¹ T = I,   (5.6)

so that a source block can be recovered from its transform coefficients in the absence of quantization:

x(n) = T⁻¹ y(n).   (5.7)
Orthogonal transforms are most frequently used in practical applications. To be orthogonal, a transform must satisfy

T⁻¹ = T^T,   (5.8)

or

T^T T = I,   (5.9)
where I is the identity matrix. Writing T in terms of its row vectors,

T = (t_0, t_1, ..., t_{M−1})^T,   (5.11)

the orthogonality condition (5.9) becomes

t_j^T t_k = δ(j − k),  j, k = 0, 1, ..., M − 1,   (5.12)

i.e., the rows are orthonormal. The inverse transform (5.7) can then be written as

x = Σ_{k=0}^{M−1} y_k t_k,   (5.13)

which means that the source can be represented as a linear combination of the vectors {t_k}_{k=0}^{M−1}. Therefore, the rows of T are often referred to as basis vectors or basis functions.
One of the advantages of an orthogonal transform is that its inverse transform T^T is immediately available without resort to matrix inversion. The fact that the inverse matrix is just the transposition of the transform matrix means that the inverse transform can be implemented by simply transposing the transform flow graph, or running it backwards.
Another advantage of an orthogonal transform is energy conservation, which means that the transform coefficients have the same total energy as the source signal:

Σ_{k=0}^{M−1} y²(k) = Σ_{k=0}^{M−1} x²(k).   (5.14)

This follows from

Σ_{k=0}^{M−1} y²(k) = y^T y = x^T T^T T x = x^T x = Σ_{k=0}^{M−1} x²(k),   (5.15)

where the orthogonality condition in (5.9) is used. In the derivation above, the block index n is dropped for convenience. When the context is appropriate, this practice will be followed in the remainder of this book.
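Energy conservation is easy to verify for any orthogonal matrix. The sketch below uses a random orthogonal matrix from a QR decomposition as a stand-in for a generic transform (an assumption for illustration; practical coders use structured transforms such as the DCT):

```python
import numpy as np

# Any orthogonal T satisfies (5.9) and therefore conserves energy (5.14).
rng = np.random.default_rng(3)
M = 8
T, _ = np.linalg.qr(rng.standard_normal((M, M)))  # random orthogonal matrix
assert np.allclose(T.T @ T, np.eye(M))            # condition (5.9)

x = rng.standard_normal(M)       # one source block
y = T @ x                        # transform coefficients (5.2)
print("block energy:", np.sum(x**2), "->", np.sum(y**2))   # identical (5.14)
assert np.allclose(T.T @ y, x)   # perfect inversion via the transpose (5.8)
```

The same check passes for the DFT and DCT matrices introduced later in this chapter, since they are (unitary or) orthogonal by construction.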
Let q_k = y_k − ŷ_k denote the quantization error of the kth transform coefficient and q̂ = x − x̂ the resulting error in the reconstructed block:

q̂ = x − x̂ = T⁻¹ (y − ŷ) = T^T q.   (5.16)

The MSQE is therefore

σ²_q(x̂) = (1/M) Σ_{k=0}^{M−1} E[q̂_k²]   (5.17)
        = (1/M) E[q̂^T q̂]   (5.19)
        = (1/M) E[q^T T T^T q]   (5.21)
        = (1/M) E[q^T q] = (1/M) Σ_{k=0}^{M−1} σ²_{q_k},   (5.22)

i.e., the MSQE of the reconstructed signal is the average of the MSQEs of all the transform-coefficient quantizers.
Suppose that the bit rate for the transform coder is R bits per source sample. Since there are M samples in a block, the total number of bits available for coding one block of source samples is MR. These bits are allocated to the quantizers in Fig. 5.1. If the kth quantizer is allocated r_k bits, the total must be MR, i.e.,

R = (1/M) Σ_{k=0}^{M−1} r_k.   (5.23)
By (2.48), the MSQE of the kth quantizer is

σ²_{q_k} = 10^{−0.1(a+b r_k)} σ²_{y_k},   (5.24)

where the parameters a and b are dependent on the quantization scheme and the probability distribution of y_k(n). Dropping (5.24) into (5.22), we obtain the following total MSQE:

σ²_q(x̂) = (1/M) Σ_{k=0}^{M−1} 10^{−0.1(a+b r_k)} σ²_{y_k}.   (5.25)

Apparently, the MSQE is a function of both the signal variance σ²_{y_k} of each transform coefficient and the number of bits allocated to quantize it. While the former is determined by the source signal and the transform T, the latter is determined by how the bits are allocated to the quantizers, i.e., the bit allocation strategy. For a given bit rate R, a different bit allocation strategy assigns a different set of {r_k}_{k=0}^{M−1}, called a bit allocation, which results in a different MSQE. It is, therefore, imperative to find the optimal one that gives the minimum MSQE.
The optimal allocation can be found using the inequality between the arithmetic and geometric means. For a set of nonnegative numbers {p_k}_{k=0}^{M−1}, consider the arithmetic mean

(1/M) Σ_{k=0}^{M−1} p_k   (5.26)

and the geometric mean

(Π_{k=0}^{M−1} p_k)^{1/M}.   (5.27)

The AM-GM inequality states that

(1/M) Σ_{k=0}^{M−1} p_k ≥ (Π_{k=0}^{M−1} p_k)^{1/M},   (5.28)

with equality if and only if

p_0 = p_1 = ... = p_{M−1}.   (5.29)
Applying this to (5.22), we have

σ²_q(x̂) = (1/M) Σ_{k=0}^{M−1} σ²_{q_k} ≥ (Π_{k=0}^{M−1} σ²_{q_k})^{1/M},   (5.30)

with equality if and only if all quantizer MSQEs are equal:

σ²_{q_0} = σ²_{q_1} = ... = σ²_{q_{M−1}}.   (5.31)

Using (5.24), the geometric mean on the right-hand side can be expanded as

(Π_{k=0}^{M−1} σ²_{q_k})^{1/M} = (Π_{k=0}^{M−1} 10^{−0.1(a+b r_k)} σ²_{y_k})^{1/M}
= 10^{−0.1a} 10^{−0.1b Σ_{k} r_k / M} (Π_{k=0}^{M−1} σ²_{y_k})^{1/M}
= 10^{−0.1(a+bR)} (Π_{k=0}^{M−1} σ²_{y_k})^{1/M},   (5.32)

where (5.23) is used to arrive at the last equation. Plugging this back into (5.30), we have

σ²_q(x̂) ≥ 10^{−0.1(a+bR)} (Π_{k=0}^{M−1} σ²_{y_k})^{1/M}.   (5.33)
Optimal MSQE. The MSQE from any bit allocation {r_k}_{k=0}^{M−1} is either equal to or larger than the fixed value on the right-hand side of (5.33), which is thus the minimal MSQE that can be achieved.
Optimal Bit Allocation. This minimal MSQE is achieved, or the equality holds, if and only if (5.31) is satisfied, i.e., the bits are allocated in such a way that the MSQEs from all quantizers are equalized.
Note that this result is obtained without dependence on the actual transform matrix used. As long as the transform matrix is orthogonal, the results in this section hold.
The minimal MSQE of the transform coder is therefore

σ²_q(TC) = 10^{−0.1(a+bR)} (Π_{k=0}^{M−1} σ²_{y_k})^{1/M}.   (5.34)
If the source signal is quantized directly as PCM with the same bit rate of R, the MSQE is given by (2.48) and denoted as

σ²_q(PCM) = 10^{−0.1(a+bR)} σ_x².   (5.35)
The coding gain of transform coding over PCM is then

G_TC = σ²_q(PCM) / σ²_q(TC) = σ_x² / (Π_{k=0}^{M−1} σ²_{y_k})^{1/M},   (5.36)

which is the ratio of the source variance σ_x² to the geometric mean of the transform coefficient variances σ²_{y_k}.
Due to the energy conservation property of T (5.14), we have

σ_x² = (1/M) Σ_{k=0}^{M−1} σ²_{x_k} = (1/M) Σ_{k=0}^{M−1} σ²_{y_k},   (5.37)

where σ²_{x_k} is the variance of the kth element of the source vector x. Applying this back to (5.36), we have

G_TC = [ (1/M) Σ_{k=0}^{M−1} σ²_{y_k} ] / (Π_{k=0}^{M−1} σ²_{y_k})^{1/M},   (5.38)
which states that the optimal coding gain of a transform coder is the ratio of the arithmetic to the geometric mean of the transform coefficient variances. Due to the AM-GM inequality (5.28), we always have

G_TC ≥ 1.   (5.39)

Note, however, that this is valid if and only if the optimal bit allocation strategy is deployed. Otherwise, the equality in (5.30) will not hold, and the transform coder will not be able to achieve the minimal MSQE given by (5.34).
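As a numerical sketch of (5.38), the snippet below computes the coding gain for an assumed unit-variance AR(1) source, whose Toeplitz covariance has R(k) = ρ^k, under a DCT-II transform:

```python
import numpy as np

# Coding gain (5.38): AM/GM ratio of transform-coefficient variances for
# an assumed AR(1) source with correlation coefficient rho.
M, rho = 8, 0.95
n = np.arange(M)
Rxx = rho ** np.abs(n[:, None] - n[None, :])   # unit-variance AR(1) covariance

# DCT-II rows: c(k) * sqrt(2/M) * cos(pi*k*(n+0.5)/M), c(0) = 1/sqrt(2)
T = np.sqrt(2.0 / M) * np.cos(np.pi * n[:, None] * (n[None, :] + 0.5) / M)
T[0, :] /= np.sqrt(2.0)

var_y = np.diag(T @ Rxx @ T.T)                 # coefficient variances
gain = var_y.mean() / np.exp(np.mean(np.log(var_y)))   # AM/GM ratio (5.38)
print(f"G_TC = {gain:.2f}  ({10 * np.log10(gain):.2f} dB)")
```

The more correlated the source (ρ closer to 1), the more unequal the coefficient variances become and the larger the gain, in line with the energy-compaction argument below.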
To obtain the optimal bit allocation explicitly, solve the equal-MSQE condition (5.31) with (5.24) for each r_k and impose the rate constraint (5.23). This yields

r_k = R + (10/b) [ log₁₀ σ²_{y_k} − (1/M) Σ_{j=0}^{M−1} log₁₀ σ²_{y_j} ]   (5.42)
    = R + (10/(b log₂10)) log₂ [ σ²_{y_k} / (Π_{j=0}^{M−1} σ²_{y_j})^{1/M} ],   (5.43)

for k = 0, 1, ..., M − 1.
Both (5.42) and (5.43) state that, aside from a global bias determined by the bit rate
R and the geometric mean of the variances of the transform coefficients, bits should
be assigned to a transform coefficient in proportion to the logarithm of its variance.
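A small numerical sketch of the allocation rule (5.42) follows; the coefficient variances and the slope b = 6.02 dB/bit are illustrative assumptions. The resulting allocation meets the rate constraint (5.23) and equalizes the per-quantizer MSQEs, as the optimality condition (5.31) requires:

```python
import numpy as np

# Optimal bit allocation (5.42) for assumed coefficient variances.
var_y = np.array([50.0, 8.0, 2.0, 0.5])   # assumed variances sigma^2_{y_k}
R, b = 3.0, 6.02                          # rate (bits/sample), dB-per-bit slope

gm = np.exp(np.mean(np.log(var_y)))       # geometric mean of the variances
r = R + (10.0 / b) * np.log10(var_y / gm) # bit allocation (5.42)
print("bits per coefficient:", np.round(r, 2))

# Per-quantizer MSQE (5.24), up to the common factor 10^{-0.1a}:
msqe = 10 ** (-0.1 * b * r) * var_y
print("equalized MSQEs:", msqe)           # identical for every k, per (5.31)
```

High-variance coefficients receive more than the average R bits and low-variance ones fewer, with the logarithmic dependence the text describes.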
Consider an extreme case in which the transform compacts all of the block's energy into the first coefficient:

σ²_{y_0} = M σ_x²,  σ²_{y_k} = 0,  k = 1, 2, ..., M − 1.   (5.45)

The zero variances cause the geometric mean to break down completely, so both (5.34) and (5.36) are meaningless. The modified strategy above, however, dictates the following bit allocation:

r_0 = MR,  r_k = 0,  k = 1, 2, ..., M − 1.   (5.46)
The resulting coding gain grows with the block size. For a scenario of M = 1,024 and R = 1 bit per source sample, which is typical in audio coding, this coding gain is 6164.48 dB!
An intuitive explanation for this dramatic improvement is energy compaction and the exponential reduction of quantization error with respect to bit rate. With direct quantization, each sample in the source vector has a variance of σ_x² and is allocated R bits, resulting in a total signal variance of Mσ_x² and a total of MR bits for the whole block. With transform coding, however, this total variance of Mσ_x² for the whole block is compacted into the first coefficient, and all MR bits for the whole block are allocated to it. Since the MSQE is linearly proportional to the variance (see (5.25)), the MSQE for the first coefficient would increase M times due to the M-fold increase in variance, but this increase is spread over the M samples in the block, resulting in no net change. However, the MSQE decreases exponentially with bit rate, so the M-fold increase in bit rate multiplies the MSQE reduction in decibels by a factor of M!
In fact, energy compaction is the key to coding gain in transform coding. Since the transform matrix T is orthogonal, the arithmetic mean is constant for a given signal, no matter how its energy is distributed by the transform to the individual coefficients. This means that the numerator of (5.38) remains the same regardless of the transform matrix T. However, if the transform distributes most of the energy (variance) to a minority of transform coefficients and leaves the balance to the rest, the geometric mean in the denominator of (5.38) becomes extremely small. Consequently, the coding gain becomes extremely large.
The variances of the transform coefficients are determined by the covariance of the source. From (5.2),

y(n) y^T(n) = T x(n) x^T(n) T^T.   (5.50)

Taking expected values on both sides, we obtain the covariance of the transform coefficients

R_yy = T R_xx T^T,   (5.51)

where

R_yy = E[y(n) y^T(n)]   (5.52)

and R_xx is the covariance matrix of the source signal defined in (4.49). As noted there, it is symmetric and Toeplitz.
By definition, the kth diagonal element of R_yy is the variance of the kth transform coefficient:

[R_yy]_{kk} = E[y_k(n) y_k(n)] = σ²_{y_k},   (5.53)
so the geometric mean of {σ²_{y_k}}_{k=0}^{M−1} is determined by the product of the diagonal elements of R_yy:

Π_{k=0}^{M−1} σ²_{y_k} = Π_{k=0}^{M−1} [R_yy]_{kk}.   (5.54)
It is well known that a covariance matrix is positive semidefinite, i.e., its eigenvalues are all real and nonnegative [33]. For practical signals, R_xx may be considered positive definite (no zero eigenvalues). Due to (5.51), R_yy may be considered positive definite as well, so the following (Hadamard) inequality holds [93]:

Π_{k=0}^{M−1} [R_yy]_{kk} ≥ det R_yy.   (5.55)

On the other hand, taking determinants of (5.51) gives

det R_yy = det T · det R_xx · det T^T,   (5.56)

and for an orthogonal transform

|det T| = 1,   (5.58)

which leads to

det R_yy = det R_xx.   (5.59)

Combining this with (5.55),

Π_{k=0}^{M−1} [R_yy]_{kk} ≥ det R_xx,   (5.60)

so the geometric mean of the coefficient variances is bounded from below by its minimum value:

Minimum: (Π_{k=0}^{M−1} σ²_{y_k})^{1/M} = (det R_xx)^{1/M}.   (5.62)

Dropping this back into (5.36), we establish that the maximum coding gain of the optimal transform coder, for a given source signal x(n), is

Maximum: G_TC = σ_x² / (det R_xx)^{1/M}.   (5.63)
As the block size approaches infinity (M → ∞), the limit of (det R_xx)^{1/M} is

lim_{M→∞} (det R_xx)^{1/M} = exp[ (1/2π) ∫_{−π}^{π} ln S_xx(e^{jω}) dω ],   (5.64)

so the maximum coding gain becomes

lim_{M→∞} G_TC = [ (1/2π) ∫_{−π}^{π} S_xx(e^{jω}) dω ] / exp[ (1/2π) ∫_{−π}^{π} ln S_xx(e^{jω}) dω ],   (5.65)

the ratio of the arithmetic to the geometric mean of the power spectrum.
This maximum coding gain is achieved only when the signal statistics, i.e., R_xx, are available. Signal statistics change with time in most applications. This calls for real-time
estimation and adaptation, which is usually expensive. This motivates fixed transforms such as the DFT and DCT. The DFT matrix is defined elementwise by

[T]_{k,n} = W_M^{kn} = e^{−j2πkn/M},   (5.68)

where

W_M = e^{−j2π/M}.   (5.67)

Denoting this matrix as W_M, it satisfies

W_M (W_M^*)^T = M I,   (5.69)

so its inverse is

W⁻¹ = (1/M) (W_M^*)^T.   (5.70)

Note that the scale factor M above is usually disregarded when discussing orthogonal transforms, because it can be adjusted either on the forward or the backward transform side within a particular context. The matrix is also symmetric, that is,

W_M^T = W_M,   (5.71)

so (5.70) becomes

W⁻¹ = (1/M) W_M^*.   (5.72)

Written out, the forward and inverse DFTs are

y_k = Σ_{n=0}^{M−1} x(n) W_M^{kn}   (5.73)

and

x(n) = (1/M) Σ_{k=0}^{M−1} y_k W_M^{−kn}.   (5.74)
[Fig. 5.2 Implied boundary extensions: DFT (top), DCT-II (middle), DCT-IV (bottom)]
5.4.2 DCT
DCT is a family of Fourier-related transforms that use only real transform coefficients [2, 80]. Since the Fourier transform of a real and even signal is real and even,
a DCT operates on real data and is equivalent to a DFT of roughly twice the block
length. Depending on how the beginning and ending block boundaries are handled,
there are eight types of DCTs, and correspondingly eight types of DSTs, but only
two of them are widely used.
The DCT-II is defined by

[T]_{k,n} = c(k) √(2/M) cos[ πk(n + 0.5)/M ],   (5.75)

where

c(k) = { 1/√2,  if k = 0;  1,  otherwise. }   (5.76)
DCT-II tends to deliver the best energy compaction performance in the DCT and
DST family. It achieves this mostly because it uses symmetric boundary conditions
on both sides of its period, as shown in the middle of Fig. 5.2. In particular, DCT-II
extends its boundaries symmetrically on both sides of a period, so the samples can be
considered as periodic with a period of 2M and there is essentially no discontinuity
at both boundaries.
DCT-II for M = 2 was shown to be identical to the KLT for a first-order autoregressive (AR) source [33]. Furthermore, the coding gain of DCT-II for other M is shown to be very close to that of the KLT for such a source with a high correlation coefficient:

ρ = R(1)/R(0) ≈ 1.   (5.78)

Similar results with real speech were also observed in [33].
Since many real-world signals can be modeled as such a source, DCT-II is deployed in many signal coding or processing applications and is sometimes simply called the DCT (its inverse is, of course, called the inverse DCT or IDCT). Two-dimensional DCT-II, which shares these characteristics, has been deployed by many international image and video coding standards, such as JPEG [37], MPEG-1/2/4 [54, 57, 58], and MPEG-4(AVC)/H.264 [61].
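The DCT-II matrix of (5.75)-(5.76) is straightforward to construct and check. The sketch below verifies its orthogonality and shows its energy compaction on a smooth ramp block (an assumed input chosen to mimic a highly correlated signal):

```python
import numpy as np

# Build the M x M DCT-II matrix per (5.75)-(5.76).
M = 16
n = np.arange(M)
T = np.sqrt(2.0 / M) * np.cos(np.pi * n[:, None] * (n[None, :] + 0.5) / M)
T[0, :] /= np.sqrt(2.0)                  # c(0) = 1/sqrt(2)
assert np.allclose(T @ T.T, np.eye(M))   # orthogonality (5.9)

x = np.linspace(-1.0, 1.0, M)            # smooth, highly correlated block
y = T @ x
low = np.sum(y[:4] ** 2) / np.sum(y ** 2)
print(f"fraction of energy in the first 4 of {M} coefficients: {low:.4f}")
assert low > 0.95                        # strong energy compaction
```

Nearly all of the block's energy lands in a handful of low-frequency coefficients, which is precisely the behavior that makes DCT-II attractive for coding.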
The other widely used member of the family is the DCT-IV, defined by

[T]_{k,n} = √(2/M) cos[ π(n + 0.5)(k + 0.5)/M ].   (5.79)
Chapter 6
Subband Coding
Transform coders artificially divide a source signal into blocks, then process and code each block independently of the others. This leads to significant variations between blocks, which may become visible or audible as discontinuities at the block boundaries. Referred to as blocking artifacts or blocking effects, these artifacts may appear as tiles in decoded images or videos that were coded at low bit rates. In audio, blocking artifacts sound like periodic clicking, which many people consider annoying. While the human eye can tolerate a large degree of blocking artifacts, the human ear is extremely intolerant of such periodic clicking. Therefore, transform coding is rarely deployed in audio coding.
One approach to avoiding blocking artifacts is to introduce overlapping between
blocks. For example, a transform coder may be structured in such a way that it
advances only 90% of a block, so that there is a 10% overlap between blocks. Reconstructed samples in the overlapped region can be averaged to smooth out the
discontinuities between blocks, thus avoiding the blocking artifacts.
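A minimal sketch of this averaging (the helper below is hypothetical, not a routine from any coder): each block is decoded independently and the samples shared by adjacent blocks are simply averaged:

```python
def overlap_average(blocks, overlap):
    """Reassemble equal-length blocks that overlap by `overlap` samples,
    averaging the two available reconstructions in each overlapped region."""
    out = list(blocks[0])
    for b in blocks[1:]:
        for i in range(overlap):
            out[-overlap + i] = 0.5 * (out[-overlap + i] + b[i])
        out.extend(b[overlap:])
    return out

# Two decoded blocks that disagree at their shared boundary sample.
a = [1.0] * 10
b = [3.0] * 10
y = overlap_average([a, b], overlap=1)
# The shared sample is smoothed to the average 2.0; total length is 10 + 9 = 19.
assert len(y) == 19 and y[9] == 2.0
```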
The cost of this overlapping is reduced coding performance. Since the number of transform coefficients is equal to the number of source samples in a block, the number of samples overlapping between blocks is the number of extra transform coefficients that need to be quantized and conveyed to the decoder. They consume valuable extra bit resources, thus degrading coding performance.
What is really needed is overlapping in the temporal domain, but no overlapping in the transform or frequency domain. This means the transform matrix T in (5.2) is no longer M × M, but M × N, where N > M. This leads to subband coding (SBC). For example, application of this idea to the DCT leads to the modified discrete cosine transform (MDCT), which has 50% overlapping in the temporal domain, or, equivalently, an M × 2M transform matrix T.
the subband samples to the transform coefficients. This extension is based on a new perspective on transforms, which views the multiplication of a transform matrix with the source vector as filtering of the source vector by a bank of subband filters whose impulse responses are the row vectors of the transform matrix. Allowing these subband filters to be longer than the block size of the transform enables dramatic improvement in energy compaction capability, in addition to the elimination of blocking artifacts.
The kth subband sample for block n may be written as

    y_k(n) = Σ_{m=0}^{M−1} t_{k,m} x_m(n) = Σ_{m=0}^{M−1} t_{k,m} x(nM − m),    (6.1)
where the last step is obtained via (5.1). The above operation obviously can be
considered as filtering the source signal x.n/ by a filter with an impulse response
given by the kth basis vector or basis function tk . Consequently, the whole transform
may be considered as filtering by a bank of filters, called analysis filter banks, with
impulse responses given by the row vectors of the transform matrix T. This is shown
in Fig. 6.1.
Similarly, the inverse transform in (5.10) may be considered as the output from
a bank of filters, called synthesis filter banks, with impulse responses given by the
row vectors of the inverse matrix TT . This is also shown in Fig. 6.1.
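This equivalence is easy to check numerically. The sketch below uses a toy 2 × 2 transform (not one from the text) and computes the subband samples both as block transforms and as filtering by the rows of T followed by M-fold decimation; the block is reversed before the inner product so that the matrix view matches the convolution view:

```python
def convolve(h, x):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hv in enumerate(h):
        for j, xv in enumerate(x):
            y[i + j] += hv * xv
    return y

# A toy 2 x 2 transform; its rows act as analysis impulse responses.
T = [[1.0, 1.0], [1.0, -1.0]]
M = 2
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]

# View 1: transform each block of M samples (newest sample matched first).
blocks = [x[i:i + M] for i in range(0, len(x), M)]
block_out = [[sum(T[k][m] * blk[M - 1 - m] for m in range(M)) for blk in blocks]
             for k in range(M)]

# View 2: filter x by each row of T, then keep every Mth output sample.
filt_out = [convolve(T[k], x)[M - 1::M][:len(blocks)] for k in range(M)]

assert block_out == filt_out
```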
For the suboptimal sinusoidal transforms discussed in Chap. 5, each of the filters in either the analysis or synthesis bank is associated with a basis vector or basis function. The delay chain in front of the transform matrix produces the delayed signals

    u_m(n) = x(n − m),  m = 0, 1, …, M−1.    (6.2)
This ensures that M samples from the source signal are presented simultaneously to the transform matrix. Due to (5.67), except for the scale factor of 1/M, the subband samples for the kth subband are the output of the kth subband filter and are given as

    y_k(n) = Σ_{m=0}^{M−1} u_m(n) W_M^{km},  k = 0, 1, …, M−1,    (6.3)
Substituting (6.2), we have

    y_k(n) = Σ_{m=0}^{M−1} x(n − m) W_M^{km}.    (6.4)
Its Z-transform is

    Y_k(z) = Σ_{m=0}^{M−1} X(z) z^{−m} W_M^{km} = X(z) Σ_{m=0}^{M−1} (z W_M^{−k})^{−m},    (6.5)

so the transfer function of the kth subband filter is

    H_k(z) = Y_k(z)/X(z) = Σ_{m=0}^{M−1} (z W_M^{−k})^{−m}.    (6.6)
Denoting the transfer function for the zeroth subband as

    H(z) = Σ_{m=0}^{M−1} z^{−m},    (6.7)

the transfer functions for all other subbands may be represented by it:

    H_k(z) = H(z W_M^{−k}).    (6.8)
On the unit circle,

    H_k(e^{jω}) = H(e^{j(ω − 2πk/M)}),    (6.9)

which is H(e^{jω}) uniformly shifted by 2πk/M in the frequency domain. Therefore,
H(z) is called the prototype filter and all other subband filters in the DFT bank are built by uniformly shifting or modulating the prototype filter. A filter bank with such a structure is called a modulated filter bank. It is a prominent category of filter banks, most notably because it is amenable to fast implementation.
The magnitude response of the prototype filter (6.7) is

    |H(e^{jω})| = |sin(Mω/2) / sin(ω/2)|,    (6.10)
which is shown at the top of Fig. 6.3 for M = 8. According to (6.9), all other subband filters in the bank are shifted or modulated versions of it and are shown at the bottom of Fig. 6.3.
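The closed form (6.10) can be checked against the defining sum (6.7); a small sketch:

```python
import cmath
import math

def proto_response(M, w):
    """H(e^{jw}) for the DFT-bank prototype H(z) = sum_{m=0}^{M-1} z^{-m}."""
    return sum(cmath.exp(-1j * w * m) for m in range(M))

M = 8
for i in range(1, 50):
    w = i * math.pi / 50.0                      # avoid w = 0 where sin(w/2) = 0
    direct = abs(proto_response(M, w))
    closed = abs(math.sin(M * w / 2.0) / math.sin(w / 2.0))   # (6.10)
    assert abs(direct - closed) < 1e-9
```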
Fig. 6.3 Magnitude responses of a DFT analysis filter bank with M = 8. The top shows the prototype filter. All subband filters in the bank are uniformly shifted or modulated versions of it and are shown at the bottom.
While other transforms, such as the KLT and DCT, may have better energy compaction capability than the DFT, they are ultimately limited by M, the number of samples in the block. It is well known in filter design that a sharp magnitude response with little energy leakage requires long filters [67], so the fundamental limiting factor is the block size M of these transforms or subband filters.
To obtain even better performance, subband filters longer than M need to be
used. To maximize energy compaction, subband filters should have the magnitude
response of an ideal bandpass filter shown in Fig. 6.4, which has no energy leakage
at all and achieves the maximum coding gain (to be proved later). Unfortunately,
such a bandpass filter requires an infinite filter order [68], which is very difficult to
implement in a practical system, so the challenge is to design subband filters that
can optimize coding gain for a given limited order.
Once the order of the subband filters is extended beyond M, overlapping between blocks occurs, providing the additional benefit of mitigating the blocking artifacts discussed at the beginning of this chapter.
It is clear that transforms are a special type of filter bank whose subband filters have order less than M. Their main characteristic is that there is no overlapping between transform blocks. Therefore, transforms are frequently referred to as filter banks and transform coefficients as subband samples. On the other hand, filter banks with subband filters longer than M are also sometimes referred to as transforms or lapped transforms in the literature [49]. One such example is the modified discrete cosine transform (MDCT), whose subband filters are twice as long as the block size.
Fig. 6.5 Maximally decimated filter bank and subband coder. ↓M denotes M-fold decimation and ↑M denotes M-fold expansion. The additive noise model is used to represent quantization in each subband.

The M-fold decimator, denoted by ↓M in Fig. 6.5, retains only one out of every M source samples:

    x_D(n) = x(Mn),    (6.11)
where x(n) is the source sequence and x_D(n) is the decimated sequence. Due to the loss of M − 1 out of every M samples incurred in decimation, it may not be possible to recover x(n) from the decimated x_D(n) because of aliasing [65, 68].
When applied to the analysis bank in Fig. 6.5, the decimator reduces the sample rate of each subband to 1/M of that of the source. Since there are M subbands, the total sample rate for all subbands is still the same as that of the source signal.
The M-fold expander, also referred to as an upsampler or interpolator, passes each source sample through to the output and, after each, inserts M − 1 zeros:

    x_E(n) = { x(n/M),  if n/M is an integer;
               0,       otherwise.    (6.12)
Since all samples from the input are passed through to the output, there is obviously no loss of information. For example, the source can be recovered from the expanded output by an M-fold decimator. As explained in Sect. 6.3, however, expansion causes images in the spectrum, which need to be handled accordingly.
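Both operators are one-liners on list-valued signals; a sketch illustrating that expansion is lossless while decimation is not:

```python
def decimate(x, M):
    """M-fold decimation: x_D(n) = x(Mn)."""
    return x[::M]

def expand(x, M):
    """M-fold expansion: insert M - 1 zeros after each sample."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0] * (M - 1))
    return out

# Expansion loses no information: decimating the expanded signal recovers it.
x = [1, 2, 3, 4]
assert decimate(expand(x, 3), 3) == x
# Decimation alone is lossy: only every Mth sample survives.
assert decimate([1, 2, 3, 4, 5, 6], 3) == [1, 4]
```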
When applied to the synthesis bank in Fig. 6.5, the expander for each subband restores the sample rate of that subband to the original sample rate of the source signal. It is then possible to output a reconstructed sequence at the same sample rate as the source signal.
Coding in the subband domain, or subband coding, is accomplished by attaching a quantizer to the output of each subband filter in the analysis filter bank and
a corresponding inverse quantizer to the input of each subband filter in the synthesis filter bank. The abstraction of this quantization and inverse quantization is the
additive noise model in Fig. 2.3, which is deployed in Fig. 6.5.
The filter bank above has a special characteristic: its sample rate for each subband
is 1=M of that of the source. This happens because the decimation factor is equal to
the number of subbands. Such a subband system is called a maximally decimated
or critically sampled filter bank.
The reconstruction error of such a system may be defined as

    e(n) = x̂(n) − k x(n − d),    (6.13)

where k is a scale factor and d is a delay. For subband coding, this reconstruction error needs to be either exactly or approximately zero, in the absence of quantization, so that the reconstructed signal is a delayed and scaled version of the source signal. This section analyzes decimation and expansion effects to arrive at conditions on analysis and synthesis filters that guarantee zero reconstruction error.
Consider the sequence

    p_M(n) = (1/M) Σ_{m=0}^{M−1} e^{j2πmn/M}.    (6.14)

When n is a multiple of M, every term in the sum is equal to one, so p_M(n) = 1. Otherwise, the formula for geometric series gives

    p_M(n) = (1/M) (1 − e^{j2πn}) / (1 − e^{j2πn/M}),    (6.15)

whose numerator is zero because

    e^{j2πn} = 1.    (6.16)

Therefore, the geometric series (6.14) becomes

    p_M(n) = { 1,  if n is a multiple of M;
               0,  otherwise.    (6.17)
The Z-transform of the decimated sequence is

    X_D(z) = Σ_{n=−∞}^{∞} x_D(n) z^{−n} = Σ_{n=−∞}^{∞} x(Mn) z^{−n}.    (6.18)

Due to the upper half of (6.17), we can multiply the right-hand side of the above equation with p_M(Mn) to get

    X_D(z) = Σ_{n=−∞}^{∞} p_M(Mn) x(Mn) z^{−n}.    (6.19)

Substituting m = Mn and extending the sum over all integers m, which adds only zero terms due to the lower half of (6.17), we obtain

    X_D(z) = Σ_{m=−∞}^{∞} p_M(m) x(m) z^{−m/M},    (6.20)
Substituting (6.14) for p_M(m), we have

    X_D(z) = (1/M) Σ_{m=−∞}^{∞} Σ_{k=0}^{M−1} e^{j2πkm/M} x(m) z^{−m/M}    (6.21)
           = (1/M) Σ_{k=0}^{M−1} Σ_{m=−∞}^{∞} x(m) (z^{1/M} W_M^{k})^{−m}.    (6.22)
Letting X(z) denote the Z-transform of x(m), the equation above can be written as

    X_D(z) = (1/M) Σ_{k=0}^{M−1} X(z^{1/M} W_M^{k}).    (6.23)
On the unit circle, this becomes

    X_D(e^{jω}) = (1/M) Σ_{k=0}^{M−1} X(e^{j(ω − 2πk)/M}).    (6.24)
Fig. 6.6 Stretched spectrum of the source signal and all its shifted aliasing copies (top). Due to the stretching factor of M = 8, their period is also stretched from 2π to 8 × 2π. The spectrum of the decimated signal (bottom) has a period of 2π, but is totally different from that of the original signal.
The approach above is not the only one for aliasing-free decimation; see [82] for details. However, aliasing-free decimation is not the goal of filter bank design. A certain amount of aliasing is usually allowed in some special ways. As long as the aliasing from all decimators in the filter bank cancels completely at the output of the synthesis bank, the reconstructed signal is still aliasing-free. Even if aliasing cannot be canceled completely, proper filter bank design can still keep it small enough to obtain a reconstruction with tolerable error.
Since only the samples at multiples of M are nonzero in the expanded sequence, its Z-transform is

    X_E(z) = Σ_{k=−∞}^{∞} x_E(kM) z^{−kM}    (6.27)
           = Σ_{k=−∞}^{∞} x(k) z^{−kM} = X(z^M).    (6.28)

On the unit circle,

    X_E(e^{jω}) = X(e^{jMω}),    (6.29)

which shows that the effect of expansion is a compression of the spectrum by a factor of M.
Fig. 6.7 The effect of sample rate expansion is frequency compression. The spectrum of the source signal on the top is compressed by a factor of 8 to produce the spectrum of the expanded signal at the bottom. Seven images are shifted into the [0, 2π] region from outside due to this frequency compression.
Applying (6.23) to the output of the kth analysis filter, the decimated subband signal is

    Y_k(z) = (1/M) Σ_{m=0}^{M−1} H_k(z^{1/M} W_M^{m}) X(z^{1/M} W_M^{m}).    (6.30)
The output of the synthesis bank is then

    X̂(z) = Σ_{k=0}^{M−1} F_k(z) Y_k(z^M)
         = (1/M) Σ_{m=0}^{M−1} X(z W_M^{m}) Σ_{k=0}^{M−1} F_k(z) H_k(z W_M^{m})
         = (1/M) X(z) Σ_{k=0}^{M−1} F_k(z) H_k(z)
           + (1/M) Σ_{m=1}^{M−1} X(z W_M^{m}) Σ_{k=0}^{M−1} F_k(z) H_k(z W_M^{m}).    (6.31)
The first term involves only the source signal itself and defines the overall transfer function

    T(z) = (1/M) Σ_{k=0}^{M−1} F_k(z) H_k(z),    (6.32)

while the second term consists of the aliasing terms

    A_m(z) = (1/M) Σ_{k=0}^{M−1} F_k(z) H_k(z W_M^{m}),  m = 1, 2, …, M−1,    (6.33)

so the reconstructed signal may be written as

    X̂(z) = T(z) X(z) + Σ_{m=1}^{M−1} X(z W_M^{m}) A_m(z).    (6.34)
To set the reconstruction error (6.13) to zero, the overall transfer function should be set to a delay and a scale factor:

    T(z) = k z^{−d},    (6.35)

and the aliasing terms should cancel:

    Σ_{m=1}^{M−1} X(z W_M^{m}) A_m(z) = 0.    (6.36)
Consider a filter with the transfer function

    H(z) = Σ_{n=−∞}^{∞} h(n) z^{−n}.    (6.37)
Its terms may be grouped according to n modulo M:

    H(z) = Σ_{n=−∞}^{∞} h(nM) z^{−nM}
           + z^{−1} Σ_{n=−∞}^{∞} h(nM + 1) z^{−nM}
           ⋮
           + z^{−(M−1)} Σ_{n=−∞}^{∞} h(nM + M − 1) z^{−nM}.    (6.38)
Denoting

    p_k(n) = h(nM + k),  0 ≤ k < M,    (6.39)

and

    P_k(z) = Σ_{n=−∞}^{∞} p_k(n) z^{−n},  0 ≤ k < M,    (6.40)

we have

    H(z) = Σ_{k=0}^{M−1} z^{−k} P_k(z^M).    (6.41)
The equation above is called the type-I polyphase representation of H.z/ with
respect to M and its implementation is shown in Fig. 6.8.
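The polyphase decomposition (6.39)–(6.41) can be verified numerically: filtering by H(z) must equal the sum of the delayed branch outputs z^{−k} P_k(z^M). A sketch with an arbitrary 6-tap filter:

```python
def convolve(h, x):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hv in enumerate(h):
        for j, xv in enumerate(x):
            y[i + j] += hv * xv
    return y

def polyphase(h, M):
    """Type-I polyphase components p_k(n) = h(nM + k)."""
    return [h[k::M] for k in range(M)]

def upsample(p, M):
    """Turn p(z) into p(z^M) by inserting M - 1 zero taps."""
    out = []
    for v in p:
        out.extend([v] + [0.0] * (M - 1))
    return out[:len(out) - (M - 1)]

h = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # arbitrary FIR filter
x = [1.0, 0.0, 2.0, -1.0, 3.0, 0.5]     # arbitrary input
M = 2

y_direct = convolve(h, x)               # filtering by H(z)

# Filtering by sum_k z^{-k} P_k(z^M): each branch filters, then is delayed by k.
y_poly = [0.0] * len(y_direct)
for k, p in enumerate(polyphase(h, M)):
    for n, v in enumerate(convolve(upsample(p, M), x)):
        y_poly[n + k] += v

assert all(abs(a - b) < 1e-9 for a, b in zip(y_direct, y_poly))
```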
The type-I polyphase representation in (6.41) may be further written as

    H(z) = p^T(z^M) d(z),    (6.42)

where

    d(z) = [1, z^{−1}, …, z^{−(M−1)}]^T    (6.43)

and

    p(z) = [P_0(z), P_1(z), …, P_{M−1}(z)]^T.    (6.44)

Applying this representation to the kth analysis filter, we have

    H_k(z) = h_k^T(z^M) d(z),    (6.45)

where

    h_k(z) = [P_{k,0}(z), P_{k,1}(z), …, P_{k,M−1}(z)]^T    (6.46)

and P_{k,m}(z) are the type-I polyphase components of H_k(z). The analysis bank may then be represented by
    h(z) = [H_0(z); H_1(z); ⋮; H_{M−1}(z)]
         = [h_0^T(z^M) d(z); h_1^T(z^M) d(z); ⋮; h_{M−1}^T(z^M) d(z)]
         = H(z^M) d(z),    (6.47)
where

    H(z) = [h_0^T(z); h_1^T(z); ⋮; h_{M−1}^T(z)]    (6.48)
is called a polyphase (component) matrix. This leads to the type-I polyphase implementation in Fig. 6.9 for the analysis filter bank in Fig. 6.5.
The type-I representation (6.41) may alternatively be rewritten as

    H(z) = Σ_{n=0}^{M−1} z^{−(M−1−n)} P_{M−1−n}(z^M)    (6.49)
         = Σ_{n=0}^{M−1} z^{−(M−1−n)} Q_n(z^M),    (6.50)

where

    Q_n(z) = P_{M−1−n}(z)    (6.51)

are called the type-II polyphase components. Factoring out the common delay, we have

    H(z) = z^{−(M−1)} Σ_{n=0}^{M−1} z^{n} Q_n(z^M)    (6.52)
         = z^{−(M−1)} d^T(z^{−1}) q(z^M),    (6.53)

where

    q(z) = [Q_0(z), Q_1(z), …, Q_{M−1}(z)]^T    (6.54)

represents the type-II polyphase components.
Similar to the type-I polyphase representation of the analysis filter bank, a synthesis filter bank may be implemented using the type-II polyphase representation. The type-II polyphase components of the kth synthesis subband filter may be denoted as

    f_k(z) = [f_{k,0}(z), f_{k,1}(z), …, f_{k,M−1}(z)]^T,    (6.55)
so that

    F_k(z) = z^{−(M−1)} d^T(z^{−1}) f_k(z^M)    (6.56)

and

    [F_0(z), F_1(z), …, F_{M−1}(z)] = z^{−(M−1)} d^T(z^{−1}) F(z^M),    (6.57)

where

    F(z) = [f_0(z), f_1(z), …, f_{M−1}(z)]    (6.58)

is the synthesis polyphase (component) matrix. This leads to the type-II polyphase implementation in Fig. 6.11.
6.4.2.1 Decimation
The noble identity for decimation is shown in Fig. 6.12 and is proven below using
(6.23)
    Y_1(z) = (1/M) Σ_{k=0}^{M−1} U(z^{1/M} W_M^{k})
           = (1/M) Σ_{k=0}^{M−1} H((z^{1/M} W_M^{k})^M) X(z^{1/M} W_M^{k})
           = [(1/M) Σ_{k=0}^{M−1} X(z^{1/M} W_M^{k})] H(z)
           = Y_2(z),    (6.59)

where (z^{1/M} W_M^{k})^M = z was used.
Applying the noble identity given above to the analysis bank in Fig. 6.9, we can move the decimators on the right side of the analysis polyphase matrix to its left side to arrive at the analysis filter bank in Fig. 6.13. With this new structure, the delay chain presents M source samples in correct succession simultaneously to the decimators. The decimators ensure that the subband filters operate only once for each block of M input samples, generating only one block of M subband samples. The sample rate is thus reduced by M times, but the data now move in parallel. The combination of the delay chain and the decimators essentially accomplishes a series-to-parallel conversion.
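The noble identity itself is easy to confirm numerically; the sketch below uses an arbitrary 3-tap filter and M = 3, both chosen only for illustration:

```python
def convolve(h, x):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hv in enumerate(h):
        for j, xv in enumerate(x):
            y[i + j] += hv * xv
    return y

def upsample_taps(h, M):
    """Turn h(z) into h(z^M) by inserting M - 1 zero taps."""
    out = []
    for v in h:
        out.extend([v] + [0.0] * (M - 1))
    return out[:len(out) - (M - 1)]

M = 3
h = [1.0, -2.0, 0.5]
x = [float(n * n % 7) for n in range(24)]       # arbitrary test input

left = convolve(upsample_taps(h, M), x)[::M]    # H(z^M), then decimate by M
right = convolve(h, x[::M])                     # decimate by M, then H(z)

assert len(left) == len(right)
assert all(abs(a - b) < 1e-9 for a, b in zip(left, right))
```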
6.4.2.2 Interpolation

The noble identity for interpolation is shown in Fig. 6.14 and is easily proven below using (6.28):

    Y_1(z) = U(z^M) = X(z^M) H(z^M) = Y_2(z).    (6.60)
For the synthesis bank in Fig. 6.11, the expanders on the left side of the synthesis polyphase matrix may be moved to its right side to arrive at the filter bank in Fig. 6.15 due to the noble identity just proven. With this structure, the synthesis subband filters operate once for each block of subband samples, whose sample rate was reduced to 1/M of the source sample rate by the analysis filter bank. The expanders increase the sample rate by inserting M − 1 zeros after each subband sample, making the sample rate the same as that of the source signal. The delay chain then delays their outputs in succession to align and interlace the M nonzero subband samples in the time domain so as to form an output stream with the same sample rate as the source. The combination of the expanders and the delay chain essentially accomplishes a parallel-to-series conversion.
For a transform, the analysis polyphase matrix is simply the transform matrix itself:

    H(z) = T.    (6.61)

In other words, the transform matrix is a polyphase matrix of order zero. The delay chain and the decimators in the analysis bank simply serve as a series-to-parallel converter that feeds the source samples to the transform matrix in blocks of M samples. The expanders and the delay chain in the synthesis bank serve as a parallel-to-series converter that interlaces subband samples output from the transform matrix to form an output sequence. The orthogonal condition (5.9) ensures that the filter bank satisfies the PR condition. For this reason, the transform coefficients can be referred to as subband samples.
Since shifting the frequency of any of these ideal bandpass filters by W_M^{m} creates a copy that does not overlap with the original one, we have

    H_k(z) H_k(z W_M^{m}) = F_k(z) H_k(z W_M^{m}) = 0,  for m = 1, 2, …, M−1,    (6.62)

so

    (1/M) Σ_{k=0}^{M−1} F_k(z) H_k(z W_M^{m}) = 0,  m = 1, 2, …, M−1.    (6.63)
Suppose that each subband filter is ideal with the magnitude response

    |H_k(e^{jω})| = { √M,  in the passband;
                      0,   in the stopband,    (6.64)

and similarly for the synthesis filters:

    |F_k(e^{jω})| = |H_k(e^{jω})|.    (6.65)

With the additive noise model, the average distortion of the subband coder is the average of the subband quantization noise variances:

    σ_q² = (1/M) Σ_{k=0}^{M−1} σ_{q_k}².    (6.66)
Let us now consider the analysis bank in Fig. 6.5. Since the decimator retains one sample out of every M source samples, its output has the same variance as its input, which, for the kth subband, is given by

    σ_{y_k}² = (1/2π) ∫_{−π}^{π} |H_k(e^{jω})|² S_xx(e^{jω}) dω.    (6.67)
Using (6.64) and denoting its passband as Ω_k, we can write the equation above as

    σ_{y_k}² = (M/2π) ∫_{Ω_k} S_xx(e^{jω}) dω.    (6.68)
Adding both sides of the equation above for all subbands, we obtain

    Σ_{k=0}^{M−1} σ_{y_k}² = (M/2π) Σ_{k=0}^{M−1} ∫_{Ω_k} S_xx(e^{jω}) dω = (M/2π) ∫_{−π}^{π} S_xx(e^{jω}) dω = M σ_x²,    (6.69)

which leads to

    σ_x² = (1/M) Σ_{k=0}^{M−1} σ_{y_k}².    (6.70)
Since (6.66) and (6.70) are the same as (5.22) and (5.37), respectively, all the derivations in Sect. 5.2 related to coding gain and optimal bit allocation apply to the ideal subband coder as well. In particular, we have the following coding gain:
    G_SBC = σ_x² / (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M}    (6.71)
          = [(1/M) Σ_{k=0}^{M−1} σ_{y_k}²] / (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M}.    (6.72)

As M → ∞, each passband becomes so narrow that S_xx(e^{jω}) is essentially constant over it, so (6.68) becomes

    σ_{y_k}² = (M/2π) |Ω_k| S_xx(e^{jω_k}),    (6.73)

where ω_k is a frequency inside Ω_k. Since the M passbands together cover a frequency range of 2π,

    |Ω_k| = 2π/M,    (6.74)

so

    σ_{y_k}² = S_xx(e^{jω_k}).    (6.75)
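In other words, the gain of (6.71) is the ratio of the arithmetic mean to the geometric mean of the subband variances, which is 1 for a flat variance distribution and grows as the variances spread out; a sketch:

```python
import math

def subband_coding_gain(variances):
    """Ratio of arithmetic to geometric mean of the subband variances."""
    M = len(variances)
    arith = sum(variances) / M
    geo = math.exp(sum(math.log(v) for v in variances) / M)
    return arith / geo

# A flat variance distribution yields no gain; a spread-out one yields gain > 1.
assert abs(subband_coding_gain([1.0, 1.0, 1.0, 1.0]) - 1.0) < 1e-12
assert subband_coding_gain([4.0, 2.0, 1.0, 0.5]) > 1.0
```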
The geometric mean used by the coding gain formula (6.71) may be rewritten as

    (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M} = exp[ln (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M}]
                                   = exp[(1/M) Σ_{k=0}^{M−1} ln σ_{y_k}²].    (6.76)
Substituting (6.75) into it, we have

    (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M} = exp[(1/M) Σ_{k=0}^{M−1} ln S_xx(e^{jω_k})]
                                   = exp[(1/2π) Σ_{k=0}^{M−1} ln S_xx(e^{jω_k}) (2π/M)]
                                   = exp[(1/2π) Σ_{k=0}^{M−1} ln S_xx(e^{jω_k}) |Ω_k|],    (6.77)
where (6.74) is used to obtain the last equation. As M → ∞, the equation above becomes
    lim_{M→∞} (Π_{k=0}^{M−1} σ_{y_k}²)^{1/M} = exp[(1/2π) ∫_{−π}^{π} ln S_xx(e^{jω}) dω]    (6.78)
        = σ_x² exp[(1/2π) ∫_{−π}^{π} ln S_xx(e^{jω}) dω] / σ_x²    (6.79)
        = γ_x² σ_x²,    (6.80)

so that the asymptotic coding gain is

    G_SBC = 1/γ_x²,    (6.81)

where γ_x² is the spectral flatness measure defined in (5.65). Therefore, the ideal subband coder approaches the same asymptotic optimal coding gain as the KLT (see (5.64)).
Chapter 7
Cosine Modulated Filter Banks
Between the KLT transform coder and the ideal subband coder, there are many
subband coders which offer great energy compaction capability with a reasonable implementation cost. Prominent among them are cosine modulated filter banks
(CMFB) whose subband filters are derived from a prototype filter through cosine
modulation.
The first advantage of CMFB is that the implementation cost of both the analysis and synthesis banks is that of the prototype filter plus the overhead associated with cosine modulation. For a CMFB with M bands and N taps per subband filter, the number of operations for the prototype filter is on the order of N and that for the cosine modulation, when implemented using a fast algorithm, is on the order of M log₂ M, so the total number of operations is merely on the order of N + M log₂ M. For comparison, the number of operations for a regular filter bank is on the order of MN.
The second advantage is associated with the design of subband filters. Instead of designing all subband filters in a filter bank independently, which entails optimizing a total of MN coefficients, we only need to optimize the prototype filter with CMFB, which has no more than N coefficients.
Early CMFBs are near-perfect reconstruction systems [6, 8, 50, 64, 81] in which only adjacent-subband aliasing is canceled, so the reconstructed signal at the output of the synthesis filter bank is only approximately equal to a delayed and scaled version of the signal input to the analysis filter bank. The same filter bank structure was later found to be capable of delivering perfect reconstruction if two additional constraints are imposed on the prototype filter [39, 40, 46–48, 77–79].
While the subband filters of the DFT bank are limited to just M taps (see (6.7)), it illustrates the basic idea of modulated filter banks and can be easily extended to accommodate subband filters with more than M taps through polyphase representation.
Even if the prototype filter of an extended DFT filter bank is real-valued, the modulated subband filters are generally complex-valued because the DFT modulation is complex-valued. Consequently, the subband samples of an extended DFT filter bank are complex-valued and are thus not amenable to subband coding. To obtain a real-valued modulated filter bank, the idea that extends the DFT to the DCT is followed: a 2M DFT is used to modulate a real-valued prototype filter to produce 2M complex subband filters, and then subband filters symmetric with respect to the zero frequency are combined to obtain real-valued ones. This leads to the CMFB.
There is a small practical issue, as shown in Fig. 6.3, where the magnitude response of the prototype filter in the DFT bank is split at the zero frequency, leaving half of its bandwidth in the positive frequency domain and the other half in the negative frequency domain. Due to the periodicity of the DFT spectrum, it appears to have two subbands in the real frequency domain, one starting at the zero frequency and the other ending at 2π. This is very different from the other modulated subbands, which are not split. This issue may be easily addressed by shifting the subband filters by π/2M. Afterwards, the subband filters whose center frequencies are symmetric with respect to the zero frequency are combined together to construct a real subband filter.
Using the polyphase representation (6.41) of the prototype filter, the modulation (6.8) gives

    H_k(z) = Σ_{m=0}^{M−1} (z W_M^{−k})^{−m} P_m((z W_M^{−k})^M)
           = Σ_{m=0}^{M−1} W_M^{km} z^{−m} P_m(z^M),  for k = 1, 2, …, M−1,    (7.1)
where P_m(z^M) is the mth type-I polyphase component of H(z) with respect to M. Note that a subscript M is attached to W to emphasize that it is for an M-fold DFT:

    W_M = e^{−j2π/M}.    (7.2)

Equation (7.1) leads to the implementation structure shown in Fig. 7.1.
Fig. 7.1 Extension of the DFT analysis filter bank to accommodate subband filters with more than M taps. {P_m(z^M)}_{m=0}^{M−1} are the type-I polyphase components of the prototype filter H(z) with respect to M. Note that the M subscript for the DFT matrix W_M is included to emphasize that each of its elements is a power of W_M and that it is an M × M matrix.
If the polyphase components of the prototype filter are trivial, i.e.,

    P_m(z^M) = 1,  m = 0, 1, …, M−1,    (7.3)

then the filter bank in Fig. 7.1 degenerates to the DFT bank in Fig. 6.2.
There is obviously no explicit restriction on the length of the prototype filter in (7.1), so a generic N-tap FIR filter can be assumed:

    H(z) = Σ_{n=0}^{N−1} h(n) z^{−n}.    (7.4)
nD0
This extension enables longer prototype filters which can offer much better energy
compaction capability than the 13 dB achieved by the DFT bank in Fig. 6.3.
Fig. 7.2 2M-DFT analysis filter bank. {P_m(z^{2M})}_{m=0}^{2M−1} are the type-I polyphase components of the prototype filter H(z) with respect to 2M and W_{2M} is the 2M × 2M DFT matrix.
Modulating the prototype filter with a 2M-fold DFT produces 2M subband filters:

    H_k(z) = Σ_{m=0}^{2M−1} W_{2M}^{km} z^{−m} P_m(z^{2M}),  k = 0, 1, …, 2M−1,    (7.5)

where

    W_{2M} = e^{−j2π/2M} = e^{−jπ/M},    (7.6)

and P_m(z^{2M}) is the mth type-I polyphase component of H(z) with respect to 2M. The implementation structure for such a filter bank is shown in Fig. 7.2.
The magnitude responses of the above 2M DFT bank are shown in Fig. 7.3 for M = 8. The prototype filter is again given by (6.7), now with 2M = 16 taps, and its magnitude response is shown at the top of the figure. There are 2M = 16 subband filters, whose magnitude responses are shown at the bottom of the figure.
Since a filter with only real-valued coefficients has a frequency response that is conjugate-symmetric with respect to the zero frequency [67], each pair of subband filters in the above 2M DFT bank satisfying this condition can be combined to form a subband filter with real-valued coefficients. Since the frequency responses of both H_0(z), which is the prototype filter, and H_8(z) are themselves conjugate-symmetric with respect to the zero frequency, their coefficients are already real-valued and they cannot be combined with any other subband filters. The remaining subband filters, H_1(z) to H_7(z) and H_9(z) to H_15(z), can be combined to form a total of (2M − 2)/2 = 7 real-valued subband filters. These combined subband filters, plus H_0(z) and H_8(z), give us a total of 7 + 2 = 9 combined subband filters. Since the frequency response of the prototype filter H(z) is split at the zero frequency and that of H_8(z) is split at π and −π, their bandwidth is only half of that of the other combined subband filters, as shown at the bottom of Fig. 7.3. This results in a situation where two subbands have half bandwidth and the remaining subbands have full bandwidth. While this type of filter bank with unequal subband bandwidths can be made to work (see [77], for example), it is rather awkward for practical subband coding.
Fig. 7.3 Magnitude responses of the prototype filter of a 2M DFT filter bank (top) and of all its subband filters (bottom).
Shifting all the subband filters in (7.5) by π/2M to the right amounts to replacing k with k + 0.5:

    H_k(z) = Σ_{m=0}^{2M−1} (z W_{2M}^{−(k+0.5)})^{−m} P_m((z W_{2M}^{−(k+0.5)})^{2M})
           = Σ_{m=0}^{2M−1} W_{2M}^{(k+0.5)m} z^{−m} P_m(−z^{2M}),  k = 0, 1, …, 2M−1,    (7.7)

where W_{2M}^{−M} = −1 was used. Equation (7.7) can be implemented using the structure shown in Fig. 7.4.
Figure 7.5 shows the magnitude responses of the filter bank above using the prototype filter given in (6.7) with 2M = 16 taps. They are the same magnitude responses given in Fig. 7.3, except for a frequency shift of π/2M. Now all subbands have the same bandwidth.
Fig. 7.4 2M-DFT analysis filter bank with an additional frequency shift of π/2M to the right. {P_m(z^{2M})}_{m=0}^{2M−1} are the type-I polyphase components of H(z) with respect to 2M and W_{2M} is the 2M × 2M DFT matrix.
Fig. 7.5 Magnitude response of a prototype filter shifted by π/2M to the right (top). Magnitude responses of all subband filters modulated from such a filter using a 2M DFT (bottom).
7.1.4 CMFB
From Fig. 7.5, it is obvious that the frequency response of the kth subband filter is conjugate-symmetric to that of the (2M − 1 − k)th subband filter with respect to the zero frequency (they are images of each other), so they are candidate pairs for combination into a real filter. Let us drop (7.4) into the first equation of (7.7) to obtain
    H_k(z) = Σ_{n=0}^{N−1} h(n) (z W_{2M}^{−(k+0.5)})^{−n}
           = Σ_{n=0}^{N−1} h(n) W_{2M}^{(k+0.5)n} z^{−n}    (7.8)

and

    H_{2M−1−k}(z) = Σ_{n=0}^{N−1} h(n) (z W_{2M}^{−(2M−1−k+0.5)})^{−n}
                  = Σ_{n=0}^{N−1} h(n) (z W_{2M}^{k+0.5})^{−n}
                  = Σ_{n=0}^{N−1} h(n) W_{2M}^{−(k+0.5)n} z^{−n}.    (7.9)
Comparing the two equations above, we can see that the coefficients of H_k(z) and H_{2M−1−k}(z) are conjugates of each other, so the combined filter will have real coefficients.
When the pair above are actually combined, they are weighted by a unit-magnitude constant c_k and its conjugate:

    A_k(z) = c_k H_k(z) + c_k* H_{2M−1−k}(z).    (7.10)

The prototype filter is constrained to have linear phase:

    h(n) = h(N − 1 − n),    (7.11)

and the synthesis filters are chosen to be mirror images of the analysis filters:

    s_k(n) = a_k(N − 1 − n),    (7.12)
where ak .n/ and sk .n/ are the impulse responses of the kth analysis and synthesis
subband filters, respectively. After much derivation [93], we arrive at the following
cosine modulated analysis filter:
    a_k(n) = 2h(n) cos[(π/M)(k + 0.5)(n − (N − 1)/2) + φ_k],  for k = 0, 1, …, M−1,    (7.13)

where the phase

    φ_k = (−1)^k π/4.    (7.14)
The cosine modulated synthesis filter can be obtained from the analysis filter (7.13) using the mirror image relation (7.12):

    s_k(n) = 2h(n) cos[(π/M)(k + 0.5)(n − (N − 1)/2) − φ_k],  for k = 0, 1, …, M−1.    (7.15)
The system of analysis and synthesis banks above completely eliminates phase
distortion, so the overall transfer function T .z/ defined in (6.32) is linear phase. But
amplitude distortion remains. Therefore, the CMFB is a nonperfect reconstruction
system and is sometimes called pseudo QMF (quadrature mirror filter).
Part of the amplitude distortion comes from incomplete aliasing cancellation: aliasing from only adjacent subbands is canceled, not from all subbands. Note that, even though the linear phase condition (7.11) is imposed on the prototype filter, the analysis and synthesis filters generally do not have linear phase.
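A sketch that generates the analysis and synthesis banks from (7.13)–(7.15) using a toy symmetric prototype (not an optimized design) and checks the mirror-image relation (7.12):

```python
import math

def cmfb_filters(h, M):
    """Pseudo-QMF analysis/synthesis impulse responses from a prototype h."""
    N = len(h)
    analysis, synthesis = [], []
    for k in range(M):
        phase = ((-1) ** k) * math.pi / 4.0        # (7.14)
        arg = lambda n: math.pi / M * (k + 0.5) * (n - (N - 1) / 2.0)
        analysis.append([2.0 * h[n] * math.cos(arg(n) + phase) for n in range(N)])
        synthesis.append([2.0 * h[n] * math.cos(arg(n) - phase) for n in range(N)])
    return analysis, synthesis

# A toy linear-phase prototype, h(n) = h(N-1-n); not an optimized design.
h = [0.1, 0.3, 0.5, 0.5, 0.3, 0.1]
A, S = cmfb_filters(h, M=2)

# Mirror-image relation (7.12): s_k(n) = a_k(N-1-n).
N = len(h)
for a, s in zip(A, S):
    assert all(abs(s[n] - a[N - 1 - n]) < 1e-12 for n in range(N))
```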
To minimize amplitude distortion, the prototype filter should satisfy the following flatness condition:

    |H(e^{jω})|² + |H(e^{j(ω − π/M)})|² = 1,  for ω ∈ [0, π/M].    (7.16)

This condition can be enforced through minimizing the following cost function:

    Φ₁(H(z)) = ∫₀^{π/M} [|H(e^{jω})|² + |H(e^{j(ω − π/M)})|² − 1]² dω.    (7.17)
Good stopband attenuation may be encouraged by also minimizing the stopband energy of the prototype filter:

    Φ₂(H(z)) = ∫_{π/2M + ε}^{π} |H(e^{jω})|² dω,    (7.18)

where ε controls the transition bandwidth and should be adjusted for a particular application.
Now both amplitude distortion and stopband attenuation can be optimized by

    min_{h(n)}  Φ₁(H(z)) + α Φ₂(H(z)),  subject to (7.11),    (7.19)

where α controls the trade-off between amplitude distortion and stopband attenuation. See [76] for standard optimization procedures that can be applied.
The tilde above a transfer function denotes paraconjugation, which conjugates the coefficients and replaces z with z^{−1}:

    H̃(z) = H_*(z^{−1}).    (7.20)

In terms of the type-I polyphase components of the prototype filter with respect to 2M, perfect reconstruction requires the following power complementary condition [93]:

    P̃_k(z) P_k(z) + P̃_{M+k}(z) P_{M+k}(z) = α,  k = 0, 1, …, M−1,    (7.21)

where α is a constant. For example, if

    H(z) = a + b z^{−1} + c z^{−2},    (7.22)

then

    H̃(z) = a* + b* z + c* z².    (7.23)
The pair of polyphase components P_k(z) and P_{M+k}(z) may be stacked into a 2 × 1 system

    𝐏_k(z) = [P_k(z), P_{M+k}(z)]^T,  k = 0, 1, …, M−1,    (7.27)

so that the power complementary condition (7.21) becomes

    𝐏̃_k(z) 𝐏_k(z) = α,    (7.28)

which means that the 2 × 1 system 𝐏_k(z) is paraunitary. Therefore, the power complementary condition (7.21) is equivalent to the condition that 𝐏_k(z) is paraunitary.
In general terms, an m × n rational transfer matrix or system H(z) is called paraunitary if

    H̃(z) H(z) = I_n,    (7.29)

where I_n is the n × n unit matrix. It is obviously necessary that m ≥ n; otherwise, the rank of H(z) would be less than n. If m = n, or the transfer matrix is square, the transfer system is further referred to as unitary.
The Givens rotation

    G(θ) = [cos θ, sin θ; −sin θ, cos θ]    (7.30)

satisfies

    G^T(θ) G(θ) = [cos θ, −sin θ; sin θ, cos θ] [cos θ, sin θ; −sin θ, cos θ] = I₂,    (7.31)

so it is unitary.
A geometric interpretation of the Givens rotation is that it rotates an input vector clockwise by θ. In particular, if an input A = [r cos φ, r sin φ]^T with an angle of φ is rotated clockwise by an angle of θ, the output vector has an angle of φ − θ and is given by

    [r cos(φ − θ); r sin(φ − θ)] = [r cos φ cos θ + r sin φ sin θ; r sin φ cos θ − r cos φ sin θ] = G(θ) [r cos φ; r sin φ].    (7.32)
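Both the unitarity (7.31) and the rotation interpretation (7.32) are quick to check numerically; a sketch with arbitrary angles:

```python
import math

def givens(theta):
    """2 x 2 Givens rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

G = givens(0.7)
Gt = [[G[j][i] for j in range(2)] for i in range(2)]       # transpose

# G^T G = I, so the rotation is unitary.
I2 = [[sum(Gt[i][k] * G[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
assert all(abs(I2[i][j] - (1.0 if i == j else 0.0)) < 1e-12
           for i in range(2) for j in range(2))

# Rotating a vector of angle phi clockwise by theta yields angle phi - theta.
r, phi, theta = 2.0, 1.1, 0.7
v = [r * math.cos(phi), r * math.sin(phi)]
w = [sum(givens(theta)[i][k] * v[k] for k in range(2)) for i in range(2)]
assert abs(w[0] - r * math.cos(phi - theta)) < 1e-12
assert abs(w[1] - r * math.sin(phi - theta)) < 1e-12
```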
Another elementary paraunitary building block is

    D(z) = [1, 0; 0, z^{−1}],    (7.33)

which is a simple 2 × 2 delay system and is shown in Fig. 7.7. It is paraunitary because

    D̃(z) D(z) = [1, 0; 0, z] [1, 0; 0, z^{−1}] = I₂.    (7.34)
A 2 × 1 paraunitary system of order zero is given by the rotation vector

    R(θ) = [cos θ; sin θ],    (7.35)

because

    R^T(θ) R(θ) = [cos θ, sin θ] [cos θ; sin θ] = I₁.    (7.36)
A cascade of two paraunitary systems is still paraunitary because

    (H₁(z) H₂(z))̃ (H₁(z) H₂(z)) = H̃₂(z) H̃₁(z) H₁(z) H₂(z) = H̃₂(z) H₂(z) = I.    (7.37)

The result above obviously can be extended to include a cascade of any number of paraunitary systems.
Using this property, we can build more complex 2 × 1 paraunitary systems of arbitrary order by cascading the elementary paraunitary transfer matrices discussed above. One such example is the lattice structure shown in Fig. 7.9. It has N − 1 delay subsystems and N − 1 Givens rotations. Its transfer function may be written as

    P(z) = [Π_{n=N−1}^{1} G(θ_n) D(z)] R(θ₀).    (7.38)

It has a parameter set of {θ_n}_{n=0}^{N−1} and N − 1 delay units, so it represents a 2 × 1 real-coefficient FIR system of order N − 1. It was shown that any 2 × 1 real-coefficient FIR paraunitary system of order N − 1 may be factorized by such a lattice structure [93].
Applying this factorization to each of the 2 × 1 systems in (7.27), we have

    𝐏_k(z) = [Π_{n=m−1}^{1} G(θ_n^k) D(z)] R(θ₀^k),  k = 0, 1, …, M−1.    (7.39)

The parameter set

    {θ_n^k}_{n=0}^{m−1}    (7.40)

can be arbitrarily adjusted to minimize (7.26) without violating the power complementary condition. Since there are M such systems, the total number of free parameters θ_n^k that can be varied for optimization is reduced to mM from 2mM.
To see this, let us first represent the linear phase condition (7.11) in the Z-transform domain:

    H(z) = Σ_{n=0}^{N−1} h(n) z^{−n}
         = z^{−(N−1)} Σ_{m=0}^{N−1} h(N − 1 − m) z^{m}
         = z^{−(N−1)} Σ_{m=0}^{N−1} h(m) z^{m}
         = z^{−(N−1)} H̃(z).    (7.41)
Consider the type-I polyphase representation of the prototype filter with respect to 2M:

    H(z) = Σ_{k=0}^{2M−1} z^{−k} P_k(z^{2M}).    (7.42)
Substituting it into the right-hand side of (7.41), we have

    H(z) = z^{−(N−1)} H̃(z) = Σ_{k=0}^{2M−1} z^{−(N−1−k)} P̃_k(z^{2M})
         = Σ_{k=0}^{2M−1} z^{−(N+k−2M)} P̃_{2M−1−k}(z^{2M})     [substituting n = 2M − 1 − k]
         = Σ_{k=0}^{2M−1} z^{−k} z^{−2M(m−1)} P̃_{2M−1−k}(z^{2M})     [using N = 2mM]    (7.43)
         = Σ_{k=0}^{2M−1} z^{−k} Q̂_k(z^{2M}),    (7.44)

where

    Q̂_k(z) = z^{−(m−1)} P̃_{2M−1−k}(z).    (7.45)

Comparing (7.44) with (7.42) term by term, the linear phase condition requires Q̂_k(z) = P_k(z), or

    P̃_k(z) = z^{m−1} P_{2M−1−k}(z).    (7.46)
Each pair from the following two sets of polyphase components:

    P₀(z), P₁(z), …, P_{M/2−1}(z)    (7.47)

and

    P_M(z), P_{M+1}(z), …, P_{3M/2−1}(z)    (7.48)

can be used to form the 2 × 1 system in (7.27), which in turn can be represented by the lattice structure in (7.39) with a parameter set given in (7.40). Since each parameter set has m free parameters, the total number of free parameters is mM/2.

The remaining polyphase components can be derived from the two sets above using the linear phase condition (7.46). In particular, the sets of polyphase components in (7.47) and (7.48) determine the following sets:

    P_{2M−1}(z), P_{2M−2}(z), …, P_{3M/2}(z)    (7.49)

and

    P_{M−1}(z), P_{M−2}(z), …, P_{M/2}(z),    (7.50)

respectively.
7.4.3.2 Odd M

When M is odd, or (M − 1)/2 is an integer, the scheme above is still valid, except for P_{(M−1)/2} and P_{M+(M−1)/2}. In particular, each pair of the following two sets of polyphase components:

    P₀(z), P₁(z), …, P_{(M−1)/2−1}(z)    (7.51)

and

    P_M(z), P_{M+1}(z), …, P_{M+(M−1)/2−1}(z)    (7.52)

can be used to form the 2 × 1 system in (7.27), which in turn can be represented by the lattice structure in (7.39) with a parameter set given in (7.40). Since each parameter set has m free parameters, the total number of free parameters is m(M − 1)/2.

The linear phase condition (7.46) in turn causes (7.51) and (7.52) to determine

    P_{2M−1}(z), P_{2M−2}(z), …, P_{(3M+1)/2}(z)    (7.53)

and

    P_{M−1}(z), P_{M−2}(z), …, P_{(M+1)/2}(z),    (7.54)

respectively.
Apparently, both P.M 1/=2 and PM C.M 1/=2 are missing from the lists above.
The underlying reason is that both the power-complementary and linear-phase conditions apply to them simultaneously. In particular, the linear phase condition (7.46)
requires
(7.55)
PQ.M 1/=2 .z/ D zm1 P.3M 1/=2 .z/:
This causes the power complementary condition (7.21) to become
2PQ.M 1/=2 .z/P.M 1/=2 .z/ D ;
which leads to
P.M 1/=2 .z/ D
p
0:5z ;
(7.56)
(7.57)
where is a delay. Since H.z/ is low-pass with a cutoff frequency of =2M , it can
be shown that the only acceptable choice of is [93]
(
D
m1
2 ;
m
;
2
if m is oddI
if m is even:
(7.58)
Since is a scale factor for the whole the system, P.M 1/=2 is now completely
determined. In addition, P.3M 1/=2 .z/ can be derived from it using (7.55):
P.3M 1/=2 .z/ D z.m1/ PQ.M 1/=2 .z/ D
p
0:5z.m1/ :
(7.59)
7.5.1 Even m
When m is even, the analysis filter bank, represented by the following vector:

    a(z) = [A_0(z), A_1(z), ..., A_{M−1}(z)]^T,    (7.60)

may be written as [93]

    a(z) = √M D_c C [ I+J   I−J ] [ P_0(z^{2M})   0
                                    0             P_1(z^{2M}) ] [ d(z)
                                                                  z^{−M} d(z) ],    (7.61)

where C is the M×M DCT-IV matrix,

    [D_c]_{kk} = cos[(k + 0.5)mπ],  k = 0, 1, ..., M−1,    (7.62)

is a diagonal matrix,

    J = [ 0 0 ⋯ 0 1
          0 0 ⋯ 1 0
          ⋮       ⋮
          0 1 ⋯ 0 0
          1 0 ⋯ 0 0 ]    (7.63)

is the reversal matrix,

    [P_0(z^{2M})]_{kk} = P_k(z^{2M}),  k = 0, 1, ..., M−1,    (7.64)

is the diagonal matrix consisting of the initial M polyphase components of the prototype filter P(z), and

    [P_1(z^{2M})]_{kk} = P_{M+k}(z^{2M}),  k = 0, 1, ..., M−1,    (7.65)

is a diagonal matrix consisting of the last M polyphase components of the prototype filter P(z).
For an even m, (7.62) becomes

    [D_c]_{kk} = cos(0.5mπ) = (−1)^{0.5m},  k = 0, 1, ..., M−1,    (7.66)

so

    D_c = (−1)^{0.5m} I.    (7.67)

Therefore, the analysis bank may be written as

    a(z) = (−1)^{0.5m} √M C [ I+J   I−J ] [ P_0(z^{2M})   0
                                            0             P_1(z^{2M}) ] [ d(z)
                                                                          z^{−M} d(z) ].    (7.68)
The filter bank above may be implemented using the structure shown in Fig. 7.10.

Fig. 7.10 Cosine modulated analysis filter bank implemented using DCT-IV. The prototype filter has a length of N = 2mM with an even m. P_k(z^{2M}) is the kth polyphase component of the prototype filter with respect to 2M. A scale factor of (−1)^{0.5m}√M is omitted

It is obvious that the major computational burdens are the polyphase filtering, which entails operations on the order of N, and the M×M DCT-IV, which needs operations on the order of M log₂ M when implemented using a fast algorithm, so the total number of operations is on the order of N + M log₂ M.
The synthesis filter bank is obtained from the analysis bank using (7.12), which becomes

    S_k(z) = z^{−(N−1)} Ã_k(z)    (7.69)

due to (7.41). Denoting

    s^T(z) = [S_0(z), S_1(z), ..., S_{M−1}(z)],    (7.70)

we have

    s^T(z) = z^{−(N−1)} ã^T(z)    (7.71)
           = z^{−(N−1)} [ d̃^T(z)   z^M d̃^T(z) ] [ P̃_0(z^{2M})   0
                                                   0              P̃_1(z^{2M}) ] [ I+J
                                                                                   I−J ] C √M (−1)^{0.5m}    (7.72)
           = z^{−(M−1)} [ z^{−M} d̃^T(z)   d̃^T(z) ] [ J P_1(z^{2M}) J   0
                                                       0                 J P_0(z^{2M}) J ] [ I+J
                                                                                             I−J ] C √M (−1)^{0.5m},    (7.73)

where the last step uses (7.46) together with the following property of the reversal matrix:

    J diag{x_0, x_1, ..., x_{M−1}} J = diag{x_{M−1}, ..., x_1, x_0}.    (7.74)

The synthesis bank may be implemented using the structure shown in Fig. 7.11.

Fig. 7.11 Cosine modulated synthesis filter bank implemented using DCT-IV. The prototype filter has a length of N = 2mM with an even m. P_k(z^{2M}) is the kth polyphase component of the prototype filter with respect to 2M. A scale factor of (−1)^{0.5m}√M is omitted
7.5.2 Odd m
When m is odd, both the analysis and synthesis banks are essentially the same as when m is even, with only minor differences. In particular, the analysis bank is given by [93]

    a(z) = √M D_s C [ I+J   I−J ] [ P_0(z^{2M})   0
                                    0             P_1(z^{2M}) ] [ d(z)
                                                                  z^{−M} d(z) ],    (7.76)

where

    [D_s]_{kk} = sin[(k + 0.5)mπ] = (−1)^{(m−1)/2} (−1)^k,  k = 0, 1, ..., M−1,    (7.77)

is a diagonal matrix with alternating 1 and −1 on the diagonal. Since this matrix only changes the signs of alternating subband samples, and these sign changes are reversed upon input to the synthesis bank, its implementation can be omitted in both the analysis and synthesis banks.
Similar to the case with even m, the corresponding synthesis filter bank may be obtained as

    s^T(z) = z^{−(M−1)} [ z^{−M} d̃^T(z)   d̃^T(z) ] [ J P_1(z^{2M}) J   0
                                                       0                 J P_0(z^{2M}) J ] [ I+J
                                                                                             I−J ] C D_s √M.    (7.78)
The analysis and synthesis banks can be implemented by the structures shown in Figs. 7.12 and 7.13, respectively.

Fig. 7.12 Cosine modulated analysis filter bank implemented using DCT-IV. The prototype filter has a length of N = 2mM with an odd m. P_k(z^{2M}) is the kth polyphase component of the prototype filter with respect to 2M. The diagonal matrix D_s only negates alternating subbands, hence can be omitted together with that in the synthesis bank. A scale factor of √M is omitted
Fig. 7.13 Cosine modulated synthesis filter bank implemented using DCT-IV. The prototype filter has a length of N = 2mM with an odd m. P_k(z^{2M}) is the kth polyphase component of the prototype filter with respect to 2M. The diagonal matrix D_s only negates alternating subbands, hence can be omitted together with that in the analysis bank. A scale factor of √M is omitted
    k = 0, 1, ..., 2M − 1,    (7.79)
which makes it intuitive to understand the filter bank. For example, the power-complementary condition (7.21) becomes

    h²(n) + h²(M + n) = 1,  n = 0, 1, ..., M − 1.    (7.80)

Also, the polyphase filtering stage in both Figs. 7.12 and 7.13 reduces to simply applying the prototype filter coefficients, so the prototype filter is often referred to as the window function or simply the window. Since the block size is M and the window size is 2M, there is a half-window, or one-block, overlap between blocks, as shown in Fig. 7.15.
The window function can be designed using the procedure discussed in Sect. 7.4. Due to m = 1, the lattice structure (7.39) degenerates into the rotation vector (7.35), so the design problem is much simpler.
A window widely used in audio coding is the following half-sine window, or simply sine window:

    h(n) = sin[(n + 0.5)π/(2M)].    (7.81)
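As a quick numerical check (a Python sketch rather than the book's Matlab), the sine window satisfies both the power-complementary condition (7.80) and the symmetry imposed by the linear phase condition:

```python
import numpy as np

M = 64
n = np.arange(2 * M)
h = np.sin((n + 0.5) * np.pi / (2 * M))   # sine window (7.81)

# power-complementary condition (7.80): h^2(n) + h^2(M + n) = 1
assert np.allclose(h[:M] ** 2 + h[M:] ** 2, 1.0)

# linear phase (symmetry): h(n) = h(2M - 1 - n)
assert np.allclose(h, h[::-1])
```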
7.6.2 MDCT

The widely used MDCT is actually given by the following synthesis filter:

    s_k(n) = cos[ (π/M)(k + 0.5)(n + (M+1)/2) ],  for k = 0, 1, ..., M − 1.    (7.83)

(7.84)
Table 7.1 Phases of the MDCT filters compared with those of the filters given by (7.14), for k = 4n, 4n + 1, 4n + 2, and 4n + 3; the entries are ±π/4, with some pairs differing by π

Fig. 7.14 Implementation of forward MDCT as application of window function and then calculation of DCT-IV. The block C represents the DCT-IV matrix
It differs from the θ_k in (7.14) for some k by π, as shown in Table 7.1. A phase difference of π causes the cosine function to switch its sign, so some of the analysis and synthesis filters have negative values when compared with those given by (7.14). This is equivalent to using a different D_s in the analysis and synthesis banks of Sect. 7.5.2. As stated there, this kind of sign change is insignificant as long as it is complemented in both the analysis and synthesis banks.
Fig. 7.15 Implementation of MDCT as application of an overlapping window function and then calculation of DCT-IV

Figure 7.14 implements the forward MDCT by applying the two halves of the window function directly. In particular, the input to the DCT block may be expressed as
    u_n = −x(M/2 + n) h(3M/2 + n) − x(M/2 − 1 − n) h(3M/2 − 1 − n)    (7.85)

and

    u_{n+M/2} = x(n − M) h(n) − x(−1 − n) h(M − 1 − n)    (7.86)

for n = 0, 1, ..., M/2 − 1, respectively. For both equations above, the current block is considered as consisting of samples x(0), x(1), ..., x(M−1) and the past block as x(−M), x(−M+1), ..., x(−1), which essentially amounts to a delay line.
The first half of the inputs to the DCT-IV block, namely u_n in (7.85), is obviously obtained by applying the second half of the window function to the current block of input data. The second half, namely u_{n+M/2} in (7.86), is calculated by applying the first half of the window function to the previous block of data. This constitutes an overlap with the previous block of data. Therefore, the implementation of MDCT may be considered as application of an overlapping window function and then calculation of DCT-IV, as shown in Fig. 7.15.
The following is the Matlab code for implementing MDCT:
function [y] = mdct(x, n0, M, h)
%
% [y] = mdct(x, n0, M, h)
%
% x: Input array. M samples before n0 are considered as the
%
delay line
% n0: Start of a block of new data
% M: Block size
% h: Window function
% y: MDCT coefficients
%
% Here is an example for generating the sine win
% n = 0:(2*M-1);
% h = sin((n+0.5)*0.5*pi/M);
%
140
% Convert to DCT4
for n=0:(M/2-1)
u(n+1) = - x(n0+M/2+n+1)*h(3*M/2+n+1)
- x(n0+M/2-1-n+1)*h(3*M/2-1-n+1);
u(n+M/2+1) = x(n0+n-M+1)*h(n+1)
- x(n0+-1-n+1)*h(M-1-n+1);
end
%
% DCT4, you can use any DCT4 subroutines
y=dct4(u);
The inverse of Fig. 7.14 is shown in Fig. 7.16, which does not utilize the symmetry property of the window function imposed by the linear phase condition. The output of the synthesis bank may be expressed as

    x(n) = xd(n) + u_{M/2+n} h(n),
    xd(n) = −u_{M/2−1−n} h(M + n),
        for n = 0, 1, ..., M/2 − 1;    (7.87)

and

    x(n) = xd(n) − u_{3M/2−1−n} h(n),
    xd(n) = −u_{n−M/2} h(M + n),
        for n = M/2, M/2 + 1, ..., M − 1,    (7.88)

where xd(n) denotes the term computed from the DCT-IV output u of the previous block and held in the delay line.

Fig. 7.16 Implementation of backward MDCT as calculation of DCT-IV and then application of window function. The block C represents the DCT-IV matrix
If Figs. 7.14 and 7.16 were followed strictly, we could end up with delay lines of half the length at the cost of an increased complexity for swapping variables.
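The overlap-add behavior can be illustrated with a short numerical sketch (Python/NumPy rather than the book's Matlab). It implements the MDCT directly from the modulation kernel cos[(π/M)(k + 0.5)(n + (M+1)/2)] instead of the folding and DCT-IV structures of Figs. 7.14 and 7.16, and the normalization factor 2/M in the inverse is a choice of this sketch; with the sine window, time-domain aliasing cancels wherever two frames overlap:

```python
import numpy as np

def sine_window(M):
    n = np.arange(2 * M)
    return np.sin((n + 0.5) * np.pi / (2 * M))      # sine window (7.81)

def kernel(M):
    # modulation kernel cos[(pi/M)(k + 0.5)(n + (M+1)/2)], cf. (7.83)
    n = np.arange(2 * M)[None, :]
    k = np.arange(M)[:, None]
    return np.cos(np.pi / M * (k + 0.5) * (n + 0.5 + M / 2))

def mdct(frame):                  # 2M windowed samples -> M coefficients
    return kernel(len(frame) // 2) @ frame

def imdct(coeffs):                # M coefficients -> 2M aliased samples
    M = len(coeffs)
    return (2.0 / M) * (kernel(M).T @ coeffs)

M = 8
h = sine_window(M)
rng = np.random.default_rng(0)
x = rng.standard_normal(6 * M)

# analysis: 50%-overlapped windowed frames, one block (M samples) apart
coeffs = [mdct(h * x[t * M:(t + 2) * M]) for t in range(5)]

# synthesis: window the IMDCT output and overlap-add
y = np.zeros_like(x)
for t, c in enumerate(coeffs):
    y[t * M:(t + 2) * M] += h * imdct(c)

# time-domain aliasing cancels wherever two frames overlap
assert np.allclose(y[M:5 * M], x[M:5 * M])
```

The first and last half frames are not reconstructed because each is covered by only one window, which is why a practical coder primes the delay line before the first block.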
Part IV
Entropy Coding
Chapter 8
Let us consider a 2-bit quantizer that represents quantized values using the following
set of quantization indexes:
    {0, 1, 2, 3}.    (8.1)
Each quantization index given above is called a source symbol, or simply a symbol,
and the set is called a symbol set. When applied to quantize a sequence of input
samples, the quantizer produces a sequence of quantization indexes, such as the
following:
    1, 2, 1, 0, 1, 2, 1, 2, 1, 0, 1, 2, 2, 1, 2, 1, 2, 3, 2, 1, 2, 1, 1, 2, 1, 0, 1, 2, 1, 2.    (8.2)
If each quantization index is conveyed with two bits, this sequence of 30 indexes requires 60 bits, for an average of

    60/30 = 2 bits/symbol.    (8.3)
Table 8.1 A fixed-length code

Symbol    Codeword
0         00
1         01
2         10
3         11
As an example, the first two columns of Table 8.2 form such a variable-length codebook, built using the unary numeral system. Assuming that the sequence is iid and that the number of times a symbol appears in the sequence accurately reflects its frequency of occurrence, the probability distribution may be estimated as

    p(0) = 3/30,  p(1) = 14/30,  p(2) = 12/30,  p(3) = 1/30.    (8.4)
With this unary codebook, the most frequently occurring symbol 1 is coded with one bit and the least frequently occurring symbol 3 is coded with four bits. The average codeword length is

    L = (3/30)·3 + (14/30)·1 + (12/30)·2 + (1/30)·4 = 1.7 bits,

which is 0.3 bits better than the fixed-length code in Table 8.1.
In general, an information source emits a source sequence

    x_0, x_1, x_2, ...,    (8.5)

whose samples are drawn from a symbol set

    {s_0, s_1, ..., s_{M−1}}    (8.6)

consisting of M source symbols. The symbols in the source sequence (8.5) are often assumed to be independently and identically distributed (iid) random variables with a probability distribution of

    p(s_m) = p_m,  m = 0, 1, ..., M − 1.    (8.7)

Encoding maps the source sequence into a codeword sequence

    c_0, c_1, c_2, ...,    (8.8)

whose codewords are drawn from a set

    C = {c_0, c_1, ..., c_{M−1}},    (8.9)
called a codebook or simply code. The goal is to find a codebook that minimizes
the average codeword length without loss of information.
The codewords are often represented using binary numerals, in which case the
resultant sequence of codewords, called a codeword sequence or simply a code
sequence, is referred to as a bit stream. While other radices of representation, such as hexadecimal, can also be used, binary is assumed in this book without loss of generality.
This coding process has to be lossless in the sense that the complete source
sequence can be recovered or decoded from the received codeword sequence without any error or loss of information, so it is called lossless compression coding or
simply lossless coding.
In what amounts to the symbol code approach to lossless compression coding, a one-to-one mapping

    s_m ↔ c_m,  for m = 0, 1, ..., M − 1,    (8.10)

is established between each source symbol s_m in the symbol set and a codeword c_m in the codebook, and then deployed to encode the source sequence or decode the codeword sequence symbol by symbol. The codebooks in Tables 8.1 and 8.2 are symbol codes.
Let l(c_m) denote the codeword length of codeword c_m in codebook (8.9); then the average codeword length per source symbol of the code sequence (8.8) is

    L = Σ_{m=0}^{M−1} p(s_m) l(c_m).    (8.11)

Due to the symbol code mapping (8.10), the equation above becomes

    L = Σ_{m=0}^{M−1} p(c_m) l(c_m),    (8.12)

where p(c_m) = p(s_m) due to the one-to-one mapping.
Apparently, any symbol set, and consequently any source sequence, can be represented or coded using a binary numeral system with

    L = ceil(log₂ M) bits,    (8.13)

where the function ceil(x) returns the smallest integer no less than x. This results in a fixed-length codebook, or fixed-length code, in which each codeword is coded with an L-bit binary numeral, or is said to have a codeword length of L bits. This
fixed-length binary codebook is considered as the baseline code for an information source and is used by PCM in (2.30). The performance of a variable-length code may be assessed by the compression ratio

    R = L_0/L,    (8.14)

where L_0 and L are the average codeword lengths of the fixed-length code and the variable-length code, respectively. For the example unary code in Table 8.2, the compression ratio is

    R = 2/1.7 ≈ 1.176.
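These figures are easy to verify with a short computation (a Python sketch using the frequencies and codeword lengths of Table 8.2):

```python
lengths = [3, 1, 2, 4]      # unary codeword lengths (Table 8.2)
freqs = [3, 14, 12, 1]      # symbol frequencies in sequence (8.2)
total = sum(freqs)

L = sum(f * l for f, l in zip(freqs, lengths)) / total   # average length (8.12)
L0 = 2                                                   # fixed-length baseline (8.13)
R = L0 / L                                               # compression ratio (8.14)

assert abs(L - 1.7) < 1e-12
assert abs(R - 1.176) < 1e-3
```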
8.2 Entropy
In pursuit of a codebook that delivers an average codeword length as low as possible,
it is critical to know if there exists a minimal average codeword length and, if it
exists, what it is. Due to (8.12), the average codeword length is weighted by the
probability distribution of the given information source, so it can be expected that
the answer is dependent on this probability model. In fact, it was discovered by
Claude E. Shannon, an electrical engineer at Bell Labs, in 1948 that this minimum is
the entropy of the information source which is solely determined by the probability
distribution [85, 86].
8.2.1 Entropy
When a message X from an information source is received by the receiver and turns out to be symbol s_m, the associated self-information is

    I(X = s_m) = −log p(s_m).    (8.15)

The average information per symbol over all messages emitted by the information source obviously depends on the probability that each symbol occurs and is thus given by

    H(X) = −Σ_{m=0}^{M−1} p(s_m) log p(s_m).    (8.16)
This is called entropy and is the minimal average codeword length for the given
information source (to be proved later).
The unit of entropy is determined by the logarithmic base. The bit, based on the
binary logarithm (log2 ), is the most commonly used unit. Other units include the
nat, based on the natural logarithm (loge ), and the hartley, based on the common
logarithm (log10 ). Due to
logarithm (log₁₀). Due to

    log_a x = log₂ x / log₂ a,

conversion between these units is simple and straightforward, so the binary logarithm (log₂) is always assumed in this book unless stated otherwise.
The use of the logarithm as a measure of information makes sense intuitively. Let us first note that the function in (8.15) is a decreasing function of the probability p(s_m) and equals zero when p(s_m) = 1. This means that:
A less likely event carries more information because the amount of surprise is larger. When the receiver knows that an event is sure to happen, i.e., p(X = s_m) = 1, before receiving the message X, the event of receiving X to discover that X = s_m carries no information at all, so the self-information (8.15) is zero.
More bits need to be allocated to encode less likely events because they carry more information. This is consistent with our strategy for codeword assignment: assigning longer codewords to less frequently occurring symbols. Ideally, the length of the codeword assigned to encode a symbol should be its self-information. If this were done, the average codeword length would be the same as the entropy. Partly because of this, variable-length coding is often referred to as entropy coding.
To view entropy from another intuitive perspective, let us suppose that we received two messages (symbols) from the source: X_i and X_j. Since the source is iid, we have p(X_i, X_j) = p(X_i) p(X_j). Consequently, the self-information carried by the two messages is

    I(X_i, X_j) = −log p(X_i, X_j)
                = −log p(X_i) − log p(X_j)
                = I(X_i) + I(X_j).    (8.17)

In other words, two independent messages together carry the sum of the information that each message carries: the number of bits to code two messages should be the sum of those for coding each message individually.
In addition to the intuitive perspectives outlined above, there are other considerations which ensure that the choice of using logarithm for entropy is not arbitrary.
See [85, 86] or [83] for details.
150
As an example, let us calculate the entropy for the source sequence (8.2):

    H(X) = −(3/30) log₂(3/30) − (14/30) log₂(14/30) − (12/30) log₂(12/30) − (1/30) log₂(1/30)
         ≈ 1.5376 bits.
Compared with this value of entropy, the average codeword length of 1.7 bits
achieved by the unary code in Table 8.2 is quite impressive.
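Both numbers may be reproduced with a few lines of Python (a sketch; the frequencies are those of the example sequence (8.2)):

```python
import math

counts = {0: 3, 1: 14, 2: 12, 3: 1}          # frequencies in sequence (8.2)
total = sum(counts.values())
p = {s: c / total for s, c in counts.items()}

H = -sum(p[s] * math.log2(p[s]) for s in p)  # entropy (8.16)
lengths = {0: 3, 1: 1, 2: 2, 3: 4}           # unary codeword lengths (Table 8.2)
L = sum(p[s] * lengths[s] for s in p)        # average codeword length (8.12)

assert abs(H - 1.5376) < 1e-4
assert abs(L - 1.7) < 1e-9
assert H <= L
```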
One approach to exploiting the correlation in the sequence is prediction. If each symbol is predicted by its predecessor (with an initial prediction of zero) and the prediction error is coded instead, the errors of the example sequence (8.2) are distributed as

    p(−1) = 13/30,  p(0) = 2/30,  p(1) = 15/30.

Its entropy is

    H(X) = −(13/30) log₂(13/30) − (2/30) log₂(2/30) − (15/30) log₂(15/30)
         ≈ 1.2833 bits,
which is 1.5376 − 1.2833 = 0.2543 bits less than the entropy obtained with the false iid assumption.
Another approach to exploiting the correlation in the example sequence (8.2) is to borrow the idea from vector quantization and consider the symbol sequence as a sequence of vectors or block symbols:

    (1,2), (1,0), (1,2), (1,2), (1,0), (1,2), (2,1),
    (2,1), (2,3), (2,1), (2,1), (1,2), (1,0), (1,2), (1,2)

with an alphabet of

    {(2,3), (1,0), (2,1), (1,2)}.

The probabilities of occurrence for all block symbols or vectors are

    p((2,3)) = 1/15,  p((1,0)) = 3/15,  p((2,1)) = 4/15,  p((1,2)) = 7/15.
Its entropy is

    H(X) = −(1/15) log₂(1/15) − (3/15) log₂(3/15) − (4/15) log₂(4/15) − (7/15) log₂(7/15)
         ≈ 1.7465 bits per block symbol,

or about 0.8732 bits per source symbol, a further improvement over the prediction model. The three data models considered above (iid, prediction, and block coding) thus yield successively lower entropies for the same example sequence.
Table 8.2 The unary code, with symbol frequencies in sequence (8.2)

Symbol    Codeword    Frequency    Codeword length
0         110         3            3
1         0           14           1
2         10          12           2
3         1110        1            4
Fixed-length codes such as the one in Table 8.1 are also uniquely decodable because all codewords are of the same length and unique in the codebook. To decode a sequence of symbols coded with a fixed-length code of n bits, one can cut the sequence into blocks of n bits each, extract the bits in each block, and look up the codebook to find the symbol it represents.
The unique decodability imposes limits on codeword lengths. In particular, McMillan's inequality states that the codeword lengths of any uniquely decodable binary codebook (8.9) must satisfy

    Σ_{m=0}^{M−1} 2^{−l(c_m)} ≤ 1,    (8.18)

and that, conversely, a uniquely decodable code exists for any set of codeword lengths satisfying this inequality, where l(c_m) is again the length of codeword c_m. See [43] for a proof.
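McMillan's inequality is mechanical to check (a Python sketch; the helper name is arbitrary):

```python
def kraft_sum(lengths):
    # left-hand side of (8.18) for a list of codeword lengths
    return sum(2.0 ** -l for l in lengths)

assert kraft_sum([2, 2, 2, 2]) == 1.0    # fixed-length code of Table 8.1
assert kraft_sum([3, 1, 2, 4]) <= 1.0    # unary code of Table 8.2
assert kraft_sum([1, 2, 3]) <= 1.0       # codeword lengths of Table 8.5
```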
Given the source sequence (8.5) and the probability distribution (8.7), there are many uniquely decodable codes satisfying the requirement for decodability (8.18). But these codes are not equal because their average codeword lengths may be different. Among all of them, the one that produces the minimum average codeword length

    L_opt = min_{{l(c_m)}} Σ_{m=0}^{M−1} p(c_m) l(c_m)    (8.19)

is the optimal code.
Table 8.5

Symbol    Codebook A    Codebook B
0         0             0
1         01            01
2         011           11
For Codebook A, except upon reception of {011}, which the receiver can immediately decide is codeword {011}, the decoder has to wait until the reception of {0}, the start of the next source symbol. Therefore, the decoding delay may be more than one codeword.
For Codebook B, the decoding delay may be as long as the whole sequence. For example, the codeword sequence {01111} may decode as {0,2,2}. However, when another {1} is subsequently received to give a codeword sequence of {011111}, the interpretation of the initial five bits ({01111}) becomes totally different because the codeword sequence {011111} now decodes as {1,2,2}. The decoder cannot make the decision until it sees the end of the sequence. The primary reason for the delayed decision is that the codeword {0} is a prefix of codeword {01}. When the receiver sees {0}, it cannot decide whether it is codeword {0} or just the first bit of codeword {01}. To resolve this, it has to wait until the end of the sequence and work backward.
Apparently, whether or not a codeword is a prefix of another codeword is critical to whether it is instantaneously decodable. A codebook in which no codeword is a prefix of any other codeword is referred to as a prefix code or, more precisely, a prefix-free code. The unary code in Table 8.2 and the fixed-length code in Table 8.1 are both prefix codes and are instantaneously decodable. The two uniquely decodable codes in Table 8.5 are not prefix codes and are not instantaneously decodable.
In fact, this association is not a coincidence: a codebook is instantaneously decodable if and only if it is a prefix-free code. To see this, let us assume that there is a codeword in an instantaneously decodable code that is a prefix of at least one other codeword. This codeword is obviously not instantaneously decodable, as shown by Codebook A and Codebook B in the above example. Therefore, an instantaneously decodable code has to be a prefix-free code.
On the other hand, all codewords of a prefix-free code can be decoded instantaneously upon reception because there is no ambiguity with any other codeword in the codebook.
Fig. 8.2 Binary trees for the two noninstantaneous codes. Since the left tree (for Codebook A) takes codewords from internal nodes {0} and {01}, and the right tree (for Codebook B) from {0}, both are not prefix-free codes. The branches are not labeled due to the convention that left branches represent 0 and right branches represent 1
Figure 8.2 shows the trees for the two noninstantaneous codes given in Table 8.5. Since Codebook A takes codewords from the internal nodes {0} and {01}, and Codebook B from {0}, both are not prefix-free codes. Note that the branches are not labeled because both trees follow the convention that left branches represent 0 and right branches represent 1. This convention is followed throughout this book unless stated otherwise.
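The prefix-free property itself is easy to test mechanically: if any codeword is a prefix of another, it is also a prefix of its immediate successor in lexicographic order, so checking adjacent sorted pairs suffices (a Python sketch; the codewords are those of Tables 8.1, 8.2, and 8.5):

```python
def is_prefix_free(codebook):
    # True if no codeword is a prefix of another (distinct) codeword
    words = sorted(codebook)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

assert is_prefix_free(["110", "0", "10", "1110"])   # unary code, Table 8.2
assert is_prefix_free(["00", "01", "10", "11"])     # fixed-length code, Table 8.1
assert not is_prefix_free(["0", "01", "011"])       # Codebook A
assert not is_prefix_free(["0", "01", "11"])        # Codebook B
```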
    Σ_{m=0}^{M−1} 2^{−l(c_m)} ≤ 1.    (8.20)

Denote this sum by

    K = Σ_{m=0}^{M−1} 2^{−l(c_m)} ≤ 1.    (8.21)

Consequently, we have

    L = Σ_{m=0}^{M−1} p(c_m) l(c_m)
      = −Σ_{m=0}^{M−1} p(c_m) log₂ 2^{−l(c_m)}
      ≥ −Σ_{m=0}^{M−1} p(c_m) log₂ [2^{−l(c_m)}/K]        (since K ≤ 1)
      = Σ_{m=0}^{M−1} p(c_m) log₂ [1/p(c_m)] + Σ_{m=0}^{M−1} p(c_m) log₂ [p(c_m) 2^{l(c_m)} K]
      = H − Σ_{m=0}^{M−1} p(c_m) log₂ [1/(p(c_m) 2^{l(c_m)} K)].

Due to

    log₂ x ≤ (1/ln 2)(x − 1),  for all x > 0,

the last sum is bounded by

    Σ_{m=0}^{M−1} p(c_m) log₂ [1/(p(c_m) 2^{l(c_m)} K)]
      ≤ (1/ln 2) Σ_{m=0}^{M−1} p(c_m) [1/(p(c_m) 2^{l(c_m)} K) − 1]
      = (1/ln 2) [(1/K) Σ_{m=0}^{M−1} 2^{−l(c_m)} − Σ_{m=0}^{M−1} p(c_m)]
      = (1/ln 2)(1 − 1)
      = 0.

Therefore, we have

    L ≥ H.    (8.22)
The upper bound can be established by choosing the codeword lengths

    l(c_m) = ceil(−log₂ p(c_m)).    (8.23)

These lengths satisfy (8.18):

    Σ_{m=0}^{M−1} 2^{−l(c_m)} = Σ_{m=0}^{M−1} 2^{−ceil(−log₂ p(c_m))}
                              ≤ Σ_{m=0}^{M−1} 2^{log₂ p(c_m)}
                              = Σ_{m=0}^{M−1} p(c_m)
                              = 1,    (8.24)

so a uniquely decodable code with these codeword lengths exists. Its average codeword length is bounded by

    L = Σ_{m=0}^{M−1} p(c_m) ceil(−log₂ p(c_m))
      < Σ_{m=0}^{M−1} p(c_m) [1 − log₂ p(c_m)]
      = 1 + H(X).    (8.25)
Combining this and the entropy lower bound (8.22), we obtain the following bounds for the optimal codeword length:

    H(X) ≤ L_opt < 1 + H(X).    (8.26)
Consider coding n source symbols jointly as a block symbol

    (s_{m_0}, s_{m_1}, ..., s_{m_{n−1}}).    (8.28)

Since the source sequence (8.5) is assumed to be iid, the probability distribution for a block symbol is

    p(s_{m_0}, s_{m_1}, ..., s_{m_{n−1}}) = p(s_{m_0}) p(s_{m_1}) ⋯ p(s_{m_{n−1}}).    (8.29)

The entropy of the block symbols is therefore

    H_n = −Σ_{m_0=0}^{M−1} Σ_{m_1=0}^{M−1} ⋯ Σ_{m_{n−1}=0}^{M−1} p(s_{m_0}, ..., s_{m_{n−1}}) log₂ p(s_{m_0}, ..., s_{m_{n−1}})
        = n H(X).    (8.30)

Applying (8.26) to the block symbols gives

    n H(X) ≤ L^n_opt < 1 + n H(X),    (8.31)

where L^n_opt is the optimal codeword length for the block codebook that is used to code the sequence of block symbols. The average codeword length per source symbol is obviously

    L_opt = L^n_opt / n.    (8.32)

Dividing (8.31) by n, we obtain

    H(X) ≤ L_opt < (1/n) + H(X),    (8.33)

so the optimal average codeword length per source symbol approaches the entropy as the block size n increases.
Chapter 9
Huffman Coding
Now that we are assured that, given a probability distribution, if there is an optimal uniquely decodable code, there is a prefix-free code with the same average codeword length, the next step is the construction of such optimal prefix-free codes. Huffman's algorithm [29], developed by David A. Huffman in 1952 when he was a Ph.D. student at MIT, is just such a simple algorithm.
Figure 9.1 illustrates the construction for the probability distribution in (8.4). The resulting average codeword length is

    L = (14/30)·1 + (12/30)·2 + (3/30)·3 + (1/30)·3 = 50/30 ≈ 1.6667 bits.
Fig. 9.1 Steps involved in constructing a Huffman codebook for the probability distribution in
(8.4)
This is better than the 1.7 bits achieved using the unary code given in Table 8.2, but still larger than the entropy of 1.5376 bits.
To present Huffman's algorithm for an arbitrary information source, let us consider the symbol set (8.6) with a probability distribution (8.7). Huffman's algorithm generates the codebook (8.9) with the following recursive procedure:

1. Let m_0, m_1, ..., m_{M−1} be a permutation of 0, 1, ..., M−1 for which

    p_{m_0} ≥ p_{m_1} ≥ ⋯ ≥ p_{m_{M−1}}.    (9.1)

(9.2)

2. Merge the two least probable symbols s_{m_{M−2}} and s_{m_{M−1}} into a new metasymbol s′ with the associated probability of

    p′ = p_{m_{M−2}} + p_{m_{M−1}}.    (9.3)

(9.4)

(9.5)

3. Repeat step 1 until the number of symbols is two. Then return with the following codebook:

    {0, 1}.    (9.6)
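The recursive procedure above can be sketched in a few lines (a Python illustration using a heap; which branch receives 0 and which receives 1 is an arbitrary choice of this sketch):

```python
import heapq
from itertools import count

def huffman_code(probs):
    # Build a Huffman codebook {symbol: codeword} for symbols 0..M-1.
    tiebreak = count()                 # avoids comparing the symbol lists
    heap = [(p, next(tiebreak), [m]) for m, p in enumerate(probs)]
    heapq.heapify(heap)
    code = {m: "" for m in range(len(probs))}
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)    # two least probable (meta)symbols
        p2, _, s2 = heapq.heappop(heap)
        for m in s1:                       # grow each member's codeword by one bit
            code[m] = "0" + code[m]
        for m in s2:
            code[m] = "1" + code[m]
        heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
    return code

probs = [3/30, 14/30, 12/30, 1/30]         # distribution (8.4)
code = huffman_code(probs)
avg = sum(p * len(code[m]) for m, p in enumerate(probs))
assert abs(avg - 50/30) < 1e-12            # about 1.6667 bits, as computed above
```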
(9.7)
(9.8)
9.2 Optimality
Huffman's algorithm produces optimal codes. This can be proved by induction on the number of source symbols M. In particular, for M = 2, the codebook produced by Huffman's algorithm is obviously optimal: we cannot do any better than using one bit to code each symbol. For M > 2, let us assume that Huffman's algorithm produces an optimal code for a symbol set of size M − 1; we now prove that it produces an optimal code for a symbol set of size M.
Fig. 9.3 Removal of the internal node 1 in the tree on the top produces the tree at the bottom, which has a shorter average codeword length
with a shorter average codeword length. Application of this procedure to the two least probable symbols in a codebook ensures that they always have the same codeword length. Otherwise, the longest codeword must be from at least one internal node which grows only one branch.
Third, if the codewords for the two least probable symbols do not grow from the same last internal node, the last internal node for the least probable symbol must grow another codeword whose length is the same as that of the second least probable symbol. Due to the same codeword length, these two codewords can be swapped with no impact on the average codeword length. Consequently, the codewords for the two least probable symbols can always grow from the same last internal node. In other words, the two codewords can always be of the form c′0 and c′1.
Without loss of generality, assume the symbol set

    {s_0, s_1, ..., s_{M−1}}    (9.9)

is ordered such that

    p_m ≥ p_{M−2} ≥ p_{M−1},  for all 0 ≤ m ≤ M − 3.    (9.10)

Then the last two symbols can be merged to give the following symbol set:

    {s_0, s_1, ..., s_{M−3}, s′},    (9.11)

whose probability distribution is

    p_0, p_1, ..., p_{M−3}, p′,    (9.12)

where

    p′ = p_{M−2} + p_{M−1}.    (9.13)
Applying Huffman's recursive procedure in Sect. 9.1 to the symbol set (9.11) produces a Huffman codebook with an average codeword length of L_{M−1}. By the induction hypothesis, this Huffman codebook is optimal.
The last step of the Huffman procedure grows the last codeword in symbol set (9.11) into two codewords by attaching one bit (0 and 1) to its end to produce a codebook of size M for symbol set (9.9). This additional bit is added with a probability of p_{M−2} + p_{M−1}, so the average codeword length for the new Huffman codebook is

    L_M = L_{M−1} + p_{M−2} + p_{M−1}.    (9.14)

Suppose that there were another instantaneous codebook for the symbol set in (8.6) with an average codeword length L̂_M that is less than L_M:

    L̂_M < L_M.    (9.15)
As shown in Sect. 9.2.1, this codebook can be modified so that the codewords for the two least probable symbols s_{M−2} and s_{M−1} have the forms c′0 and c′1, while keeping the average codeword length the same or less. This means the symbol set is permuted to have the form given in (9.9) with the corresponding probability distribution given by (9.10). This codebook can be used to produce another codebook of size M − 1 for symbol set (9.11) by keeping the codewords for {s_0, s_1, ..., s_{M−3}} the same and encoding the last symbol s′ using c′. Let us denote its average codeword length as L̂_{M−1}. Following the same argument as that which leads to (9.14), we can establish

    L̂_M = L̂_{M−1} + p_{M−2} + p_{M−1}.    (9.16)
Subtracting (9.16) from (9.14), we have

    L_M − L̂_M = L_{M−1} − L̂_{M−1}.    (9.17)

Since the size-(M−1) Huffman codebook is optimal by the induction hypothesis,

    L_{M−1} ≤ L̂_{M−1},    (9.18)

so that

    L̂_M − L_M ≥ 0,    (9.19)

or

    L̂_M ≥ L_M,    (9.20)

which contradicts the supposition in (9.15). Therefore, Huffman's algorithm produces an optimal codebook for M as well.
To summarize, it was proven above that, if Huffman's algorithm produces an optimal codebook for a symbol set of size M − 1, it produces an optimal codebook for a symbol set of size M. Since it produces the optimal codebook for M = 2, it produces optimal codebooks for any M.
It obviously has an average codeword length of one bit, regardless of the underlying probability distribution and entropy. The more skewed the probability distribution is, the smaller the entropy is, hence the less efficient the Huffman code is. As an example, let us consider the probability distribution of p_1 = 0.1 and p_2 = 0.9. It results in an entropy of

    H = −0.1 log₂(0.1) − 0.9 log₂(0.9) ≈ 0.4690 bits,

which is obviously much smaller than the one bit delivered by the Huffman code.
Coding two symbols jointly as one block symbol gives the probability distribution in Table 9.1 and the Huffman code in Fig. 9.4. The average codeword length per source symbol is

    L = (1/2)(0.01·3 + 0.09·3 + 0.09·2 + 0.81·1) = 0.645 bits,

which is significantly smaller than the one bit achieved by the M = 2 Huffman code.
Table 9.1 Probability distribution when two symbols are coded together as one block symbol

Symbol    Probability
00        0.01
01        0.09
10        0.09
11        0.81

Fig. 9.4 Huffman code when two symbols are coded together as one block symbol
Table 9.2 Probability distribution when three symbols are coded together as one block symbol

Symbol    Probability
000       0.001
001       0.009
010       0.009
011       0.081
100       0.009
101       0.081
110       0.081
111       0.729
Fig. 9.5 Huffman code when three symbols are coded together as one block symbol

This can be further improved by coding three symbols as one block symbol. This gives us the probability distribution in Table 9.2 and the Huffman code in Fig. 9.5. The average codeword length is

    L = (1/3)(0.001·7 + 0.009·7 + 0.009·6 + 0.081·4 + 0.009·5 + 0.081·3 + 0.081·2 + 0.729·1)
      = 0.5423 bits,
which is more than 0.1 bit better than coding two symbols as a block symbol and
closer to the entropy.
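The numbers above can be checked directly (a Python sketch; the codeword lengths are read off the Huffman codes of Figs. 9.4 and 9.5):

```python
import math

H = -(0.1 * math.log2(0.1) + 0.9 * math.log2(0.9))    # entropy per source symbol

# two symbols per block (Table 9.1, lengths from Fig. 9.4)
L2 = (0.01 * 3 + 0.09 * 3 + 0.09 * 2 + 0.81 * 1) / 2
# three symbols per block (Table 9.2, lengths from Fig. 9.5)
L3 = (0.001 * 7 + 0.009 * 7 + 0.009 * 6 + 0.081 * 4
      + 0.009 * 5 + 0.081 * 3 + 0.081 * 2 + 0.729 * 1) / 3

assert abs(H - 0.4690) < 1e-4
assert abs(L2 - 0.645) < 1e-9
assert abs(L3 - 0.5423) < 1e-4
assert H < L3 < L2 < 1.0     # larger blocks move the rate toward the entropy
```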
    s_0 = B − B_1 M,
    s_1 = B_1 − B_2 M,
    ⋮
    s_{n−2} = B_{n−2} − B_{n−1} M,
    s_{n−1} = B_{n−1} − B_n M,    (9.21)

where the B_i are obtained by successive integer divisions:

    B_1 = B/M,
    B_2 = B_1/M,
    B_3 = B_2/M,
    ⋮
    B_n = B_{n−1}/M.    (9.22)
Without loss of generality, let us represent a symbol set by its indexes starting from zero, so that each symbol in the symbol set corresponds to a nonnegative integer x. This x can be represented as

    x = qM + r,    (9.23)

where M is the maximum value of the reduced symbol set {0, 1, ..., M}, q is the quotient, and r is the remainder. Once M is agreed upon by the encoder and decoder, only q and r need to be conveyed to the decoder.
Usually r is encoded using a Huffman codebook and q by other means. One simple approach to encoding q is to represent it by repeating the symbol M q times. In this way a single Huffman codebook can be used to encode a symbol set, no matter how large it is.
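The quotient-remainder representation (9.23), with the symbol M used as an escape, can be sketched as follows (a Python illustration; the function names are hypothetical):

```python
def escape_encode(x, M):
    # split x = q*M + r into q escape symbols M followed by the remainder r
    q, r = divmod(x, M)
    return [M] * q + [r]

def escape_decode(symbols, M):
    q = 0
    for s in symbols:
        if s == M:          # escape symbol: one more multiple of M
            q += 1
        else:               # the remainder r terminates the run
            return q * M + s

assert escape_encode(15, 7) == [7, 7, 1]        # 15 = 2*7 + 1
for x in range(200):
    assert escape_decode(escape_encode(x, 7), 7) == x
```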
A codeword may be decoded from the bit stream by the following procedure:

1. n = 1.
2. Unpack one bit from the bit stream.
3. Concatenate the bit to the previously unpacked bits to form a word with n bits.
4. Search the codewords of n bits in the Huffman codebook.
5. Stop if a codeword is found to be equal to the unpacked word.
6. n = n + 1 and go back to step 2.
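The procedure amounts to growing a word one bit at a time until it matches a codeword, which is unambiguous for a prefix-free code (a Python sketch; the codewords are the unary code of Table 8.2):

```python
def huffman_decode(bits, code):
    # code: {symbol: codeword}; bits: a string of '0'/'1'
    inverse = {c: s for s, c in code.items()}
    symbols, word = [], ""
    for b in bits:                 # steps 2-3: unpack one bit, grow the word
        word += b
        if word in inverse:        # steps 4-5: stop when a codeword matches
            symbols.append(inverse[word])
            word = ""              # restart with n = 1 for the next codeword
    return symbols

code = {0: "110", 1: "0", 2: "10", 3: "1110"}
bits = "".join(code[s] for s in [1, 2, 1, 0, 1, 2, 3])
assert huffman_decode(bits, code) == [1, 2, 1, 0, 1, 2, 3]
```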
Part V
Audio Coding
While presented from the perspective of audio coding, the chapters in the previous parts cover theoretical aspects of coding technology that can be applied to the coding of signals in general. The chapters in this part are devoted to the coding of audio signals. In particular, Chap. 10 covers perceptual models, which determine which parts of the source signal are inaudible (perceptually irrelevant) and thus can be removed. Chapter 11 addresses the resolution challenge posed by the frequent interruption of otherwise quasi-stationary audio signals by transients. Chapter 12 deals with widely used methods for joint channel coding as well as the coding of low-frequency effect (LFE) channels. Chapter 13 covers a few practical issues frequently encountered in the development of audio coding algorithms. Chapter 14 is devoted to performance assessment of audio coding algorithms, and Chap. 15 presents the dynamic resolution adaptation (DRA) audio coding standard as an example to illustrate how to integrate the technologies described in this book into a practical audio coding algorithm.
Chapter 10
Perceptual Model
Although data model and quantization have been discussed in detail in the earlier chapters as the tools for effectively removing perceptual irrelevance, a question still remains as to which part of the source signal is perceptually irrelevant. Feasible
answers to this question obviously depend on the underlying application. For audio
coding, perceptual irrelevance is ultimately determined by the human ear, so perceptual models need to be built that mimic the human auditory system so as to indicate
to an audio coder which parts of the source audio signal are perceptually irrelevant,
hence can be removed without audible artifacts.
When a quantizer removes perceptual irrelevance, it essentially substitutes quantization noise for perceptually irrelevant parts of the source signal, so the quantization process should be properly controlled to ensure that quantization noise is not
audible.
Quantization noise is not audible if its power is below the sensitivity threshold
of the human ear. This threshold is very low in an absolutely quiet environment
(threshold in quiet), but becomes significantly elevated in the presence of other
sounds due to masking. Masking is a phenomenon where a strong sound makes
a weak sound less audible or even completely inaudible when the power of the
weak sound is below a certain threshold jointly determined by the characteristics of
both sounds.
Quantization noise may be masked by signal components that occur simultaneously with the signal component being quantized. This is called simultaneous
masking and is exploited most extensively in audio coding. Quantization noise may
also be masked by signal components that are ahead of and/or behind it. This is
called temporal masking.
The task of the perceptual model is to exploit the threshold in quiet and simultaneous/temporal masking to come up with an estimate of the global masking threshold, which is a function of frequency and time. The audio coder can then adjust its quantization process in such a way that all quantization noise is below this threshold to ensure that it is not audible.
The sound pressure level (SPL) of a sound wave with sound pressure p is defined as

l = 20 log10 (p / p0) dB, (10.1)

where p0 = 2 × 10^−5 Pa is a reference level. This reference level corresponds to the best hearing sensitivity of an average listener at around 1,000 Hz.
Another description of sound waves is the sound intensity I which is defined
as the sound power per unit area. For a spherical or plane progressive wave, sound
intensity I is proportional to the square of sound pressure, so sound intensity level
is related to the sound pressure level by
l = 20 log10 (p / p0) = 10 log10 (I / I0) dB, (10.2)

where I0 = 10^−12 W/m² is the reference level. Due to this relationship, sound pressure level and sound intensity level are identical on the logarithmic scale.
When a sound signal is considered as a wide-sense stationary random process, its spectrum or intensity density level is defined as

L(f) = P(f) / I0. (10.3)
[Fig. 10.1: threshold in quiet (dB SPL) versus frequency (Hz), on a logarithmic frequency axis from about 10 Hz to 10 kHz]
An approximation of the threshold in quiet is

Tq(f) = 3.64 (f/1,000)^−0.8 − 6.5 e^{−0.6 (f/1,000 − 3.3)²} + 10^−3 (f/1,000)^4 dB. (10.4)
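The threshold-in-quiet curve can be evaluated directly. In this sketch the middle exponential term, which produces the sensitivity dip near 3-4 kHz, follows the widely used Terhardt fit; treat the exact coefficients as assumptions where the printed equation is ambiguous:

```python
import math

# Sketch of the threshold-in-quiet approximation (10.4); f is in Hz and
# the result is in dB SPL.
def threshold_in_quiet(f):
    k = f / 1000.0
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)

# The ear is most sensitive around 3-4 kHz, where the curve dips below
# 0 dB SPL; sensitivity drops sharply toward low frequencies.
print(round(threshold_in_quiet(1000.0), 2))
print(threshold_in_quiet(3300.0) < threshold_in_quiet(100.0))
```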
Fig. 10.2 Major steps involved in the conversion of sound waves into neural signals in the
human ear
Fig. 10.3 Instantaneous displacement of the basilar membrane for an input sound wave consisting of three tones (400, 1,600, and 6,400 Hz) presented to the oval window of the basilar membrane. Note that the three excitations do not appear simultaneously because the wave needs time to travel along the basilar membrane
An observation from Fig. 10.3 is that the auditory filters are continuously placed along the length of the basilar membrane and are activated in response to the frequency components of the input sound wave. If the frequency components of the sound wave are close to each other, these auditory filters overlap significantly. There is, of course, no decimation in the continuous-time world of neurons. As will be discussed later in this chapter, the frequency responses of these auditory filters are asymmetric, nonlinear, and level-dependent, with nonuniform bandwidths that increase with frequency. Therefore, auditory filters are very different from the discretely placed, almost nonoverlapping, and often maximally decimated subband filters that we are familiar with.
A popular model of the auditory filters is the gammatone filter, whose impulse response is

g(t) = A t^{n−1} e^{−2πBt} cos(2π fc t + φ), (10.5)

where fc is the center frequency, φ the phase, A the amplitude, n the filter's order, t the time, and B the filter's bandwidth. Figure 10.4 shows the magnitude response of the gammatone filter with fc = 1,000 Hz, n = 4, and B determined by (10.7).
A widely used model that captures the asymmetric nature of auditory filters is the rounded exponential filter, denoted as roex(pl, pu), whose power spectrum is given by [73]

W(f) = { (1 + pl g) e^{−pl g}, f ≤ fc; (1 + pu g) e^{−pu g}, f > fc }, with g = |f − fc| / fc, (10.6)
Fig. 10.4 Magnitude response of a gammatone filter that models the auditory filters
Fig. 10.5 Power spectrum of the roex(pl, pu) filter (fc = 1,000 Hz, pl = 40, and pu = 10) that models auditory filters

where fc represents the center frequency, pl determines the slope of the filter below fc, and pu determines the slope of the filter above fc. Figure 10.5 shows the power spectrum of this filter for fc = 1,000 Hz, pl = 40, and pu = 10.
The roex(pl, pu) filter with pl = pu was used to estimate the equivalent rectangular bandwidth (ERB) of the auditory filters [53, 70]. When the order of the gammatone filter is in the range 3-5, the shape of its magnitude characteristic is very similar to that of the roex(pl, pu) filter with pl = pu [72]. In particular, when the order n = 4, the suggested bandwidth for the gammatone filter is

B = 1.019 ERB. (10.7)
The Bark scale maps frequency to the critical band number and may be approximated by

z(f) = 13 arctan(0.00076 f) + 3.5 arctan((f/7,500)²) Bark, (10.8)

and is shown in Fig. 10.6. In fact, one Bark approximately corresponds to a distance of about 1.3 mm along the basilar membrane [102]. The Bark scale is apparently neither linear nor logarithmic with respect to the linear frequency scale.
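As a rough numerical companion to Fig. 10.6, the following sketch uses one common closed-form Hz-to-Bark approximation, z(f) = 13 arctan(0.00076 f) + 3.5 arctan((f/7,500)²); the specific formula is an assumption here:

```python
import math

# One common closed-form approximation of the Bark scale (an assumption
# for this sketch); f is in Hz, the result in Bark.
def hz_to_bark(f):
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

# Roughly linear below ~500 Hz, roughly logarithmic above.
print(round(hz_to_bark(1000.0), 2))
print(hz_to_bark(100.0) < hz_to_bark(1000.0) < hz_to_bark(10000.0))
```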
The Bark scale was proposed by Eberhard Zwicker in 1961 [101] and named in
memory of Heinrich Barkhausen who introduced the phon, a scale for loudness
as perceived by the human ear.
Fig. 10.6 Relationships of the Bark scale with respect to the linear frequency scale (top) and the
logarithmic frequency scale (bottom)
The critical bandwidth can be estimated by presenting tones to listeners and measuring the threshold in quiet [102]. For example, to estimate the critical bandwidth near 1,000 Hz, the threshold in quiet is measured by placing more and more equally powered tones starting from 920 Hz with a 20 Hz frequency increment. Figure 10.7 shows that the measured threshold in quiet, which is the total power of all tones, remains at about 3 dB when the number of tones increases from one to eight and begins to increase afterwards. This indicates that the critical bandwidth near 1,000 Hz is about eight tones, or 160 Hz, starting at 920 Hz and ending at 920 + 160 = 1,080 Hz.
To see this, let us denote the power of each tone as σ² and the total number of tones as n. Before n reaches eight, the power of each tone is

σ² = 10^0.3 / n (10.9)

to maintain a total power of 3 dB in the passband. When n reaches eight, the power of each tone is

σ² = 10^0.3 / 8. (10.10)
When n is more than eight, only the first eight tones fall into the passband and the others are filtered out by the auditory subband filter. Therefore, the power for each tone given in (10.10) needs to be maintained to keep a total power of 3 dB in the passband. Consequently, the total power of all tones is
Fig. 10.7 The measurement of critical bandwidth near 1,000 Hz by placing more and more uniformly spaced and equally powered tones starting from 920 Hz with a 20 Hz increment. The total power of all tones, indicated by the cross, remains constant when the number of tones increases from one to eight and begins to increase afterwards. This indicates that the critical bandwidth near 1,000 Hz is about eight tones, or 160 Hz, starting at 920 Hz and ending at 920 + 160 = 1,080 Hz. This is so because tones after the eighth are filtered out by the auditory filter, so they do not contribute to the power in the passband. The addition of more tones causes an increase in total power, but the power perceived in the passband remains constant
P = (10^0.3 / 8) n. (10.11)

To summarize, the total power of all tones as a function of the number of tones is

P = { 10^0.3, n ≤ 8; (10^0.3 / 8) n, otherwise. (10.12)
Fig. 10.8 Measurement of critical bandwidth with masking in frequency gap. A test signal is
placed at the center frequency f0 where the critical bandwidth is to be measured and two masking
signals of equal power are placed at equal distance from the test signal. In (a), the test signal is
a narrow-band noise and the two maskers are tones. In (b), two narrow-band noises are used as
maskers to mask the tone in the center. The masked threshold versus the frequency separation
between the maskers is shown in (c). The frequency separation where the masked threshold begins
to drop off may be considered as the critical bandwidth
by the maskers. In order for the test signal to become audible, its power has to be raised above a certain level, called the masked threshold or sometimes the masking threshold.
When the frequency separation between the maskers is within the critical bandwidth, all of their power falls into the critical band where the test signal resides, and the total masking power is the sum of that of the two maskers. In this case, no matter how wide the frequency separation is, the total power in the critical band is constant, so a constant masked threshold is expected. As the separation becomes wider than the critical bandwidth, parts of the powers of the maskers begin to fall out of the critical band and are hence filtered out by the auditory filter. This causes the total masking power in the critical band to decrease, so the maskers' ability to mask the test signal becomes weaker, causing the masked threshold to decrease accordingly. This is summarized in the curve of the masked threshold versus the frequency separation shown in Fig. 10.8c. This curve is flat or constant for small frequency separations and begins to fall off when the separation becomes larger than a certain value, which may be considered as the critical bandwidth.
Data from many subjective tests were collected to produce the critical bandwidths listed in Table 10.1 and shown in Fig. 10.9 [101, 102]. Here the lowest critical band is considered to have a frequency range between 0 and 100 Hz, which includes the inaudible frequency range between 0 and 20 Hz. Some authors may choose to assume that the lowest critical band has a frequency range between 20 and 100 Hz.
Table 10.1 Critical bands

Band number   Upper frequency boundary (Hz)   Critical bandwidth (Hz)
1             100                             100
2             200                             100
3             300                             100
4             400                             100
5             510                             110
6             630                             120
7             770                             140
8             920                             150
9             1,080                           160
10            1,270                           190
11            1,480                           210
12            1,720                           240
13            2,000                           280
14            2,320                           320
15            2,700                           380
16            3,150                           450
17            3,700                           550
18            4,400                           700
19            5,300                           900
20            6,400                           1,100
21            7,700                           1,300
22            9,500                           1,800
23            12,000                          2,500
24            15,500                          3,500

[Fig. 10.9: critical bandwidth (Hz) versus critical band number]
Many audio applications, such as CD and DVD, deploy sample rates that allow
for a frequency range higher than the maximum 15,500 Hz given in Table 10.1.
For these applications, one more critical band may be added, which starts from
15,500 Hz and ends with half of the sample rate.
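A critical band number can be looked up from the upper band edges of Table 10.1 with a simple binary search; the function name and the 1-based numbering convention are choices made for this sketch:

```python
import bisect

# Upper band edges (Hz) from Table 10.1.
EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
         2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500,
         12000, 15500]

def critical_band(f):
    """Return the 1-based critical band number containing frequency f."""
    if f > EDGES[-1]:
        raise ValueError("above the last tabulated band edge")
    return bisect.bisect_left(EDGES, f) + 1

print(critical_band(1000))   # band 9 (920-1,080 Hz)
print(critical_band(50))     # band 1 (0-100 Hz)
```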
The critical bandwidths listed in Table 10.1 might give the wrong impression that critical bands are discrete and nonoverlapping. To emphasize that critical bands are continuously placed on the frequency scale, an analytic expression is usually more useful. One such approximation is given below:
fG = 25 + 75 [1 + 1.4 (f/1,000)²]^0.69 Hz. (10.13)
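Equation (10.13) can be checked against the tone experiment described earlier; near 1 kHz it should give a bandwidth close to the 160 Hz found there:

```python
# Sketch of the analytic critical-bandwidth approximation (10.13),
# fG = 25 + 75 [1 + 1.4 (f/1000)^2]^0.69 Hz.
def critical_bandwidth(f):
    return 25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69

# Near 1 kHz the result is close to the 160 Hz measured with tones;
# the bandwidth grows rapidly with center frequency.
print(round(critical_bandwidth(1000.0)))
print(critical_bandwidth(100.0) < critical_bandwidth(5000.0))
```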
The critical band level is the total intensity within a critical band z, obtained by integrating the intensity density level over the band:

L(z) = ∫_{f∈z} L(f) df. (10.14)

For tones, the critical band level is obviously the same as the tone's intensity level. For noise, the critical band level is the total of the noise intensity density levels within the critical band.
[Figure: auditory filter bandwidth (Hz) versus center frequency (Hz) for the traditional critical band (CB) and the ERB]
The ERB of the auditory filters is commonly approximated by

ERB(f) = 24.7 (4.37 f/1,000 + 1) Hz, (10.15)

and the corresponding ERB-rate scale, which maps frequency to the number of ERBs below it, is

ERBS(f) = 21.4 log10 (4.37 f/1,000 + 1). (10.16)
The masking capability of a masker over a maskee may be characterized by the signal-to-mask ratio (SMR)

SMR = σ²_Masker / σ²_Maskee. (10.17)

In order for a masker to completely mask a maskee, the SMR has to pass a certain threshold, called the SMR threshold:

T_SMR = min{SMR | the maskee is inaudible}. (10.18)
(10.19)
Depending on whether the masker and the maskee are tone-like or noise-like, there are four types of masking:

Maskee \ Masker   Tone                    Noise
Tone              Tone masks tone (TMT)   Noise masks tone (NMT)
Noise             Tone masks noise (TMN)  Noise masks noise (NMN)
Masking type   SMR threshold (dB)
TMT            15
TMN            21-28
NMN            26
NMT            2-6
The masking index, the amount by which the masked threshold lies below the masker's critical band level, is approximately

I_TMN(z) = −(14.5 + z) dB (10.20)

for tone-masking-noise and

I_NMT(z) = −K dB (10.21)

for noise-masking-tone, where K is a constant.
If the masking index at critical band ze due to the masker is I(ze), the masked threshold at critical band ze is

LT(zr, ze) = I(ze) L(zr) SF(zr, ze), (10.23)

where SF(zr, ze) is a spreading function that models the spread of masking from the masker band zr to the maskee band ze. A widely used model is Schroeder's spreading function

SF(Δz) = 15.81 + 7.5 (Δz + 0.474) − 17.5 √(1 + (Δz + 0.474)²) dB, (10.24)

where

Δz = zr − ze Bark. (10.25)

A modified version adds a dip term:

SF(Δz) = 15.81 + 7.5 (Δz + 0.474) − 17.5 √(1 + (Δz + 0.474)²) + 8 min{0, (KΔz − 0.5)² − 2 (KΔz − 0.5)} dB, (10.26)
Fig. 10.13 Comparison of Schroeder's, modified Schroeder's, and MPEG Psychoacoustic Model 2 spreading functions. The dips in modified Schroeder's and the MPEG model near the top are intended to model additional nonlinear effects in the auditory system
which introduces a dip near the top (see Fig. 10.13) that is intended to model additional nonlinear effects in the auditory system as reported in [102]. This function is used in MPEG Psychoacoustic Model 2 [60] with the following parameter:

K = { 3, Δz < 0; 1.5, otherwise. (10.27)
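The spreading functions (10.24) and (10.26)-(10.27) can be sketched as follows; dz denotes the Bark-scale separation and the returned values are in dB:

```python
import math

# Sketch of Schroeder's spreading function (10.24) and the modified
# version with a dip (10.26), using the K parameter of (10.27).
def schroeder(dz):
    return (15.81 + 7.5 * (dz + 0.474)
            - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2))

def schroeder_with_dip(dz):
    k = 3.0 if dz < 0 else 1.5                 # K per (10.27)
    u = k * dz - 0.5
    return schroeder(dz) + 8.0 * min(0.0, u * u - 2.0 * u)

print(round(schroeder(0.0), 3))                 # close to 0 dB at the peak
print(schroeder(-3.0) < schroeder(3.0))         # the lower skirt is steeper
print(schroeder_with_dip(2.0 / 3.0) < schroeder(2.0 / 3.0))  # the dip
```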
The three models above are independent of the SPL or critical band level of the masker. While simple, this is not a realistic reflection of the auditory system. A model that accounts for level dependency is the spreading function used in MPEG Psychoacoustic Model 1, given below:

SF(Δz, Lr) = { 17 (Δz + 1) − (0.4 Lr + 6),     −3 ≤ Δz < −1;
               (0.4 Lr + 6) Δz,                −1 ≤ Δz < 0;
               −17 Δz,                          0 ≤ Δz < 1;
               −(Δz − 1)(17 − 0.15 Lr) − 17,    1 ≤ Δz < 8;
               0,                               otherwise } dB, (10.28)
where Lr is the critical band level of the masker [55]. This spreading function is shown in Fig. 10.14 for masker critical band levels of 20, 40, 60, 80, and 100 dB, respectively. It apparently delivers increased masking for higher masker SPL on both sides of the masking curve to match the nonlinear masking properties of the auditory system [102].
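A sketch of the level-dependent spreading function (10.28); returning -inf outside the modeled −3 to 8 Bark range is a choice made here for convenience, not part of the model:

```python
# Sketch of the level-dependent spreading function (10.28) of MPEG
# Psychoacoustic Model 1; dz is the Bark separation and Lr the masker
# critical band level in dB.
def sf_mpeg1(dz, Lr):
    if -3.0 <= dz < -1.0:
        return 17.0 * (dz + 1.0) - (0.4 * Lr + 6.0)
    if -1.0 <= dz < 0.0:
        return (0.4 * Lr + 6.0) * dz
    if 0.0 <= dz < 1.0:
        return -17.0 * dz
    if 1.0 <= dz < 8.0:
        return -(dz - 1.0) * (17.0 - 0.15 * Lr) - 17.0
    return float("-inf")   # outside the modeled range: treat as no masking

# A louder masker flattens the upper skirt, widening the masked region:
print(sf_mpeg1(3.0, 100.0) > sf_mpeg1(3.0, 20.0))
```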
Fig. 10.14 The spreading function of MPEG Psychoacoustic Model 1 for masker critical band
levels at 20, 40, 60, 80, and 100 dB, respectively. Increased masking is provided for higher masker
critical band levels on both sides of the masking curve
Terhardt's spreading function additionally accounts for the dependence of the upper masking slope on the masker frequency:

SF(Δz, Lr, f) = { 27 Δz, Δz < 0; −(24 + 230/f − 0.2 Lr) Δz, otherwise } dB, (10.29)

where f is the masker frequency in hertz. Figure 10.15 shows this model at Lr = 60 dB and for f = 100, 200, and 1,000 Hz, respectively.
Fig. 10.15 Terhardt's spreading function at a masker critical band level of Lr = 60 dB and masker frequencies of f = 100, 200, and 1,000 Hz, respectively. It shows reduced masking slope as the masker frequency increases
be dominated by the stronger one. When they are equal, how do the two masking effects add up? If it were intensity addition, a 3 dB gain would be expected. If it were sound pressure addition, a gain of 6 dB would be expected.
An experiment using a tone masker placed at a low critical band and a critical-band-wide noise masker placed at a high critical band to mask a tone maskee at a critical band in between (they are not in the same critical band) indicates a masking effect gain of 12 dB when the two maskers are of equal power. Even when one is much weaker than the other, a gain between 6 and 8 dB is still observed. Therefore, the addition of masking effects is stronger than sound pressure addition and much stronger than intensity addition [102].
When the experiment is performed within the same critical band, however, the gain of the masking effect is only 3 dB. This correlates well with intensity addition.
In practical audio coding systems, intensity addition is often performed for simplicity, so the total masked threshold is calculated using the following formula:

LT(ze) = I(ze) Σ_{all zr} L(zr) SF(zr, ze). (10.30)
Since the threshold in quiet establishes the absolute minimal masked threshold, the global masked threshold curve is

LG(ze) = max{LT(ze), LQ(ze)}, (10.31)
where LQ(ze) is the critical band level that represents the threshold in quiet. A conservative approach to establishing this critical band level is to use the minimal threshold in the whole critical band:

LQ(ze) = Δf(ze) min_{f∈ze} Tq(f), (10.32)

where Δf(ze) is the bandwidth of critical band ze.
Fig. 10.16 Schematic drawing illustrating temporal masking. The masked threshold is indicated
by the solid line
where σ²_ek is again the MSQE for subband k. This critical band level of quantization noise may be normalized by the masked threshold of the same critical band using the following noise-to-mask ratio (NMR):

NMR(z) = σ²_q(z) / LG(z). (10.34)
Quantization noise for each critical band normalized in this way may be considered as equal in terms of its contribution to the perceptually meaningful total distortion. For this reason, NMR can be viewed as the variance of perceptual quantization error. Consequently, the total average perceptual quantization error becomes

σ²_p = (1/Z) Σ_{all z} NMR(z), (10.35)
where Z is the number of critical bands. Note that only critical bands with NMR ≥ 1 need to be considered because the other ones are completely inaudible and thus have no contribution to σ²_p.
Comparing the formula above with (5.22) and (6.66), we know that, if subbands are replaced by critical bands and quantization noise by perceptual quantization noise NMR(z), the derivation of optimal bit allocation and coding gain in Sect. 5.2 applies. So the optimal bit allocation strategy becomes:
Allocate bits to individual critical bands so that the NMRs for all critical bands are equalized.
Since NMR(z) ≤ 1 for a critical band z means that quantization noise in the critical band is completely masked, the bit allocation strategy should ensure that NMR(z) ≤ 1 for all critical bands if the bit resources are abundant enough.
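The boxed strategy can be illustrated with a greedy sketch that repeatedly grants one bit to the critical band whose NMR is currently largest, assuming each additional bit cuts quantization noise power by a factor of 4 (about 6 dB); the band noise powers and masked thresholds are illustrative numbers:

```python
# Greedy bit allocation equalizing the noise-to-mask ratios (NMR).
def allocate_bits(noise_power, masked_threshold, total_bits):
    bits = [0] * len(noise_power)
    noise = list(noise_power)
    for _ in range(total_bits):
        nmr = [n / m for n, m in zip(noise, masked_threshold)]
        k = nmr.index(max(nmr))   # band with the worst NMR
        bits[k] += 1
        noise[k] /= 4.0           # one more bit: ~6 dB less noise
    return bits, [n / m for n, m in zip(noise, masked_threshold)]

bits, nmr = allocate_bits([16.0, 4.0, 1.0], [1.0, 1.0, 1.0], 6)
print(bits, nmr)   # the NMRs end up equalized across the bands
```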
The SMR threshold actually delivered by the global masked threshold in critical band z is

T_SMR(z) = σ²_Masker(z) / LG(z), (10.36)

where σ²_Masker(z) is the power of the masker in critical band z. This ratio should be the same in either the DFT or the subband domain, so it can be used to obtain the masked threshold in the subband domain:

L'G(z) = σ'²_Masker(z) / T_SMR(z), (10.37)

where σ'²_Masker(z) is the signal power in the subband domain within critical band z.
Suppose that each critical band is quantized to achieve a common noise-to-mask ratio NMR0, i.e.,

σ²_q(z) = LG(z) NMR0, (10.38)

according to (10.34). Since the normalization of quantization error for all subbands within a critical band by the masked threshold also means that all subbands are quantized together with the same quantization step size, the MSQE for each subband is the same:

σ²_ek = σ²_q(z) / k(z) = LG(z) NMR0 / k(z), for all k ∈ z, (10.39)
where k(z) represents the number of subbands within critical band z. Therefore, the SNR for a subband in critical band z is

SNR(k) = σ²_yk / σ²_ek = σ²_yk k(z) / (LG(z) NMR0), for all k ∈ z. (10.40)
Applying the bit allocation formula with quantizer parameters a and b, the number of bits for subband k is

r(k) = (10 / (b log2 10)) log2 (σ²_yk k(z) / (LG(z) NMR0)) − a/b, for all k ∈ z, (10.41)

so the total number of bits assigned to all subbands within critical band z is

r(z) = Σ_{k∈z} [ (10 / (b log2 10)) log2 (σ²_yk k(z) / (LG(z) NMR0)) − a/b ]. (10.42)
Summing over all critical bands and dividing by the total number of subbands M, the average bit rate is

R = (1/M) Σ_{all z} Σ_{k∈z} [ (10 / (b log2 10)) log2 (σ²_yk k(z) / (LG(z) NMR0)) − a/b ]. (10.43)
For perceptual transparency, the quantization noise in each critical band must be below the masked threshold, i.e.,

NMR0 ≤ 1. (10.44)

A smaller NMR0 means a lower quantization noise level relative to the masked threshold, so it requires more bits to be allocated to critical bands. Therefore, setting NMR0 = 1 requires the least number of bits and at the same time ensures that quantization noise is just at the masked threshold. This leads to the following minimum average bit rate:
R = (1/M) Σ_{all z} Σ_{k∈z} [ (10 / (b log2 10)) log2 (σ²_yk k(z) / LG(z)) − a/b ]. (10.45)
The derivation above does not assume any quantization scheme because a general set of parameters a and b has been used. If uniform quantization is used and subband samples are assumed to have the matching uniform distribution, a and b are determined by (2.45) and (2.44), respectively, so the above minimum average bit rate becomes

R = (1/M) Σ_{all z} Σ_{k∈z} 0.5 log2 (σ²_yk k(z) / LG(z)). (10.46)
To avoid negative values that the logarithmic function may produce, we add one to its argument to arrive at

R = (1/M) Σ_{all z} Σ_{k∈z} 0.5 log2 (1 + σ²_yk k(z) / LG(z)). (10.47)
The tonality of the spectral components in a critical band may be estimated from the spectral flatness measure (SFM):

T(z) = min( SFM_dB(z) / SFM_dB,max , 1 ), (10.48)

where SFM_dB(z) is the spectral flatness measure of critical band z in decibels and SFM_dB,max = −60 dB. Since the spectral flatness measure is always positive and less than one, its decibel value is always negative. Therefore, the tonality index is always positive. It is also limited to less than one in the equation above.
The tonality index indicates that the spectral components in critical band z are T(z)-degree tone-like and (1 − T(z))-degree noise-like, so it is used to weight the masking indices given in (10.20) and (10.21), respectively, to give a total masking index of

I(z) = T(z) I_TMN(z) + (1 − T(z)) I_NMT(z)
     = −(14.5 + z) T(z) − K (1 − T(z)) dB. (10.49)
The power of the masker in critical band z may be obtained by summing the power spectrum over the band:

σ²_y(z) = Σ_{k∈z} P(k). (10.50)
Using (10.30) with a masking spreading function, the cumulative masking effect for each critical band can be obtained as

LT(z) = I(z) Σ_{all zr} σ²_y(zr) SF(zr, z), for all z. (10.51)

Denoting

E(z) = Σ_{all zr} σ²_y(zr) SF(zr, z), (10.52)

we obtain the total masked threshold

LT(z) = 10 log10 E(z) + I(z)
      = 10 log10 E(z) − (14.5 + z) T(z) − K (1 − T(z)) dB. (10.53)
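The chain (10.50)-(10.53) can be sketched end to end; the 3×3 spreading matrix, the K value, and the use of the 1-based band index as the Bark value are all illustrative assumptions:

```python
import math

# Toy sketch of (10.50)-(10.53): masker band powers are spread across
# critical bands with a linear-domain spreading matrix sf[zr][ze], then
# the tonality-weighted masking index of (10.49) is applied in dB.
def masked_threshold(band_power, sf, tonality, K=5.0):
    Z = len(band_power)
    thresholds = []
    for ze in range(Z):
        # E(z): spread masker energy accumulated in band ze, per (10.52)
        E = sum(band_power[zr] * sf[zr][ze] for zr in range(Z))
        # masking index per (10.49); z taken as the 1-based band number
        I = -(14.5 + ze + 1) * tonality[ze] - K * (1.0 - tonality[ze])
        thresholds.append(10.0 * math.log10(E) + I)   # (10.53), in dB
    return thresholds

sf = [[1.0, 0.1, 0.01],
      [0.1, 1.0, 0.1],
      [0.01, 0.1, 1.0]]     # toy symmetric spreading matrix
lt = masked_threshold([100.0, 10.0, 1.0], sf, [1.0, 0.0, 0.5])
print([round(t, 1) for t in lt])
```

Note how the fully tonal band (tonality 1.0) receives a much larger downshift than the fully noise-like band, reflecting that tones are the weaker maskers.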
Chapter 11
Transients
Fig. 11.1 An episode of quasistationary audio signal (top) and its MDCT spectra of 1,024 (middle)
and 128 (bottom) subbands
energy compaction. It can be expected that the closer together the frequency components are, the more subbands are needed to resolve them.
A filter bank with a large number of subbands, conveniently referred to as a long filter bank in this book, is able to deliver this advantage because using a large number of subbands to represent the full frequency range, as determined by the sample rate, means that each subband is allocated a small frequency range, hence the frequency resolution is finer. Also, such a filter bank has a long prototype filter that covers a large number of time samples, hence it is able to resolve minute signal variations with frequency.
Fig. 11.2 A transient that interrupts quasistationary episodes of an audio signal (top) and its
MDCT spectra of 1,024 (middle) and 128 (bottom) subbands, respectively
spectral flatness measure and poor coding gain, according to (5.65) and (6.81). As long as the prototype filter covers a transient attack, this spectral flattening effect is reflected in the whole block of subband samples. For overlapping filter banks, this affects multiple blocks of subband samples whose prototype filters cover the transient. For a long filter bank with a long prototype filter, this flattening effect affects a large number of subband samples, resulting in low coding gain for a large number of subband samples. Using a short filter bank with a short prototype filter, however, helps to isolate this spectral flattening effect to a smaller number of subband samples. Before and after those affected subband samples, the coding gain may go back to normal. Therefore, applying a short filter bank to cover the transients improves the overall coding gain.
Fig. 11.3 Pre-echo artifacts produced by a long 1,024-subband MDCT. The top figure shows the error between the reconstructed signal and the original, which is, therefore, the quantization noise. The bottom shows the reconstructed signal alone; the quantization noise before the transient attack is clearly visible and audible
the original signal. When looking at the reconstructed signal alone (see bottom of Fig. 11.3), however, the quantization noise is not visible after the transient attack because it is visually masked by the signal. For the ear, it is also not audible due to simultaneous and postmasking. Before the transient attack, however, it is clearly visible. For the ear, it is also very audible and frequently very annoying because it is supposed to be quiet before the transient attack (see the original signal at the top of Fig. 11.2). This frequently annoying quantization noise that occurs before the transient attack is called a pre-echo artifact.
One approach to mitigating pre-echo artifacts is to use a short filter bank whose fine time localization or resolution would help at least limit the extent to which the artifacts appear. For example, a short 128-subband MDCT is used to process the same transient signal to give the reconstructed signal and the quantization noise in Fig. 11.4. The quantization noise that occurs before the transient attack is still visible, but is much shorter, less than 5 ms in fact. As discussed in Sect. 10.5, the period of strong premasking may be considered as long as 5 ms, so the short pretransient quantization noise is unlikely to be audible.
For a 128-subband MDCT, the window size is 256 samples, which covers a period of 256/44.1 ≈ 5.8 ms for an audio signal of 44.1 kHz sample rate. In
Fig. 11.4 Pre-echo artifacts produced by a short 128-subband MDCT. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but may be inaudible because it could be shorter than the premasking period, which may last as long as 5 ms
order for significant quantization noise to build up, the MDCT window must cover a significant amount of signal energy, so the number of input samples after the transient attack that are still covered by the MDCT window must be significant. This means that the number of input samples before the transient attack is much smaller than 256, so the period for pre-echo artifacts is much shorter than 5.8 ms and thus is very likely to be masked by premasking. Therefore, a 128-subband MDCT is likely to suppress most pre-echo artifacts.
Let h(t) be a prototype filter with unit energy

∫_{−∞}^{∞} |h(t)|² dt = 1 (11.1)

and H(f) its Fourier transform. The dispersion about zero in the time and frequency domains may be defined by

Dt = ∫_{−∞}^{∞} t² |h(t)|² dt (11.2)

and

Df = ∫_{−∞}^{∞} f² |H(f)|² df, (11.3)

respectively. It is obvious that they represent the energy concentration of h(t) and H(f) toward zero in the time and frequency domains, respectively, hence their respective time and frequency resolutions. The Fourier uncertainty principle states that [75]

Dt Df ≥ 1 / (16π²). (11.4)

The equality is attained only in the case that h(t) is a Gaussian function.
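The uncertainty bound (11.4) can be verified numerically for the Gaussian case, where Dt = 1/(4a) and Df = a/(4π²) in closed form:

```python
import math

# Numerical check of (11.4) for the unit-energy Gaussian
# h(t) = (2a/pi)^(1/4) exp(-a t^2), which attains equality:
# Dt = 1/(4a), Df = a/(4 pi^2), so Dt*Df = 1/(16 pi^2).
a = 2.0
dt = 1e-3
energy = Dt = 0.0
t = -10.0
while t < 10.0:
    h2 = math.sqrt(2.0 * a / math.pi) * math.exp(-2.0 * a * t * t)
    energy += h2 * dt          # Riemann sum of |h(t)|^2
    Dt += t * t * h2 * dt      # Riemann sum of t^2 |h(t)|^2, per (11.2)
    t += dt
Df = a / (4.0 * math.pi ** 2)  # closed form for the Gaussian, per (11.3)
print(round(energy, 4))        # ~1.0: unit energy, per (11.1)
print(abs(Dt * Df - 1.0 / (16.0 * math.pi ** 2)) < 1e-6)
```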
good time resolution helps to isolate the transient attack and limit pre-echo artifacts. For quasistationary episodes, the second stage is deployed to further decompose the subband samples from the first stage so that a much better frequency resolution is delivered. It is desirable that the first-stage filter banks have good time resolution while the second-stage transforms have good frequency resolution.
A variation of this scheme is to replace the transform in the second stage with linear prediction, as shown in Fig. 11.6. The linear prediction in each subband is switched on whenever the prediction gain is large enough and off otherwise. This approach is deployed by DTS Coherent Acoustics, where the first stage is a 32-band CMFB with a 512-tap prototype filter [88]. The DTS scheme suffers from the poor time resolution of the first filter bank stage because 512 taps translate into 512/44.1 ≈ 11.6 ms for a sample rate of 44.1 kHz, far longer than the effective premasking period of no more than 5 ms.
A more involved but computationally inexpensive scheme is to switch the number of subbands of an MDCT in such a way that a smaller number of subbands is deployed to code transients and a larger number of subbands to code quasistationary episodes. It seems to have become the dominant scheme in audio coding for the adaptation of time-frequency resolution with time, as can be seen in Table 11.1. This switched MDCT is often cascaded to the output of a CMFB in some audio coding algorithms. For example, it is deployed by MPEG-1&2 Layer III [55, 56], whose first stage is a 32-band CMFB with a prototype filter of 512 taps and whose second stage is an MDCT that switches between 6 and 18 subbands. In this
Table 11.1 Numbers of subbands (short/long) used by switched MDCTs in audio coding algorithms:
128/512
128/256
32/128 and 32/256
128/1,024
6/18
128/1,024
64, 128, 256, 512, 1,024, 2,048, 4,096, or 8,192
64, 128, 256, 512, 1,024, or 2,048
128/1,024
Fig. 11.7 The PR conditions only apply to two window halves that operate on the current block
of input samples, so the second half of the right window can have other shapes that satisfy the PR
conditions on the next block
An inspection of the MDCT operation in Fig. 7.15, however, indicates that each block of the input samples is operated on by only two window functions. This is illustrated in Fig. 11.7, where the middle block marked by the dotted lines is considered as the current block. The first half of the left window, denoted as hL(n), operates on the previous block and is thus not related to the current block. Similarly, the second half of the right window, denoted as hR(n), operates on the next block and is thus not related to the current block, either. Only the second half of the left window hL(n) and the first half of the right window hR(n) operate on the current block. To ensure perfect reconstruction for the current block of samples, only the window halves covering it need to be constrained by the PR conditions (7.11) and (7.80). Therefore, they can be rewritten as

hR(n) = hL(2M − 1 − n) (11.5)

and

h²L(M + n) + h²R(n) = 1 (11.6)

for n = 0, 1, ..., M − 1, respectively.
for −M ≤ m ≤ M − 1, (11.7)

and

hR(n) = h̄L(M − 1 − n). (11.8)

(11.9)
This, however, does not restrict the selection of the second half of the right window, which can be totally different from the first half, thus enabling switching to a different window.
Fig. 11.8 Window functions produced from a set of 2,048-tap and 256-tap sine windows. Note that the length of WS_S2S is only 256 taps
WS_S2S: h_{S_S2S}(n) = { h^m_L(n), 0 ≤ n < m; h^m_R(n), m ≤ n < 2m } (11.10)

and

WL_L2L: h_{L_L2L}(n) = { h^M_L(n), 0 ≤ n < M; h^M_R(n), M ≤ n < 2M }. (11.11)
The transitional windows can then be expressed in terms of the long and short window halves as follows:

WL_L2S: h_{L_L2S}(n) = { h^M_L(n),               0 ≤ n < M;
                         1,                       M ≤ n < 3M/2 − m/2;
                         h^m_R(n − 3M/2 + m/2),   3M/2 − m/2 ≤ n < 3M/2 + m/2;
                         0,                       3M/2 + m/2 ≤ n < 2M. (11.12)
WL_S2L: h_{L_S2L}(n) = { 0,                       0 ≤ n < M/2 − m/2;
                         h^m_L(n − M/2 + m/2),    M/2 − m/2 ≤ n < M/2 + m/2;
                         1,                       M/2 + m/2 ≤ n < M;
                         h^M_R(n),                M ≤ n < 2M. (11.13)

WL_S2S: h_{L_S2S}(n) = { 0,                       0 ≤ n < M/2 − m/2;
                         h^m_L(n − M/2 + m/2),    M/2 − m/2 ≤ n < M/2 + m/2;
                         1,                       M/2 + m/2 ≤ n < 3M/2 − m/2;
                         h^m_R(n − 3M/2 + m/2),   3M/2 − m/2 ≤ n < 3M/2 + m/2;
                         0,                       3M/2 + m/2 ≤ n < 2M. (11.14)
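The window construction (11.10)-(11.14) and the overlap conditions can be checked with sine windows at toy sizes (M = 8, m = 2; real codecs use sizes like 1,024 and 128):

```python
import math

# Sketch of the switched-window construction using sine windows. The
# overlap condition w_prev(M+n)^2 + w_next(n)^2 = 1 is then verified on
# the overlapping regions.
def sine_window(M):
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]

M, m = 8, 2
hM, hm = sine_window(M), sine_window(m)

# WL_L2S, per (11.12): long left half, flat top, short right half, zeros.
w_l2s = (hM[:M] + [1.0] * (M // 2 - m // 2) + hm[m:]
         + [0.0] * (M // 2 - m // 2))

# Long overlap: second half of WL_L2L against first half of WL_L2S.
ok_long = all(abs(hM[M + n] ** 2 + w_l2s[n] ** 2 - 1.0) < 1e-12
              for n in range(M))

# Short overlap: the m-sample short right half of WL_L2S against the
# first half of the following short window WS_S2S.
start = 3 * M // 2 - m // 2
ok_short = all(abs(w_l2s[start + n] ** 2 + hm[n] ** 2 - 1.0) < 1e-12
               for n in range(m))

print(ok_long, ok_short)   # both overlap conditions hold
```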
Figure 11.9 shows some window switching examples using the primary and transitional windows in Fig. 11.8. Transitional windows are placed back to back in the second and third rows, and eight short windows are placed in the last row.
As shown in Fig. 11.8, the long-to-short transitional windows WL_L2S and WL_S2S provide for a short window half to be placed in the middle of their second half. In the coordinates of the long transitional window, the short window must be
placed within [3M/2 − m/2, 3M/2 + m/2) (see (11.12) and (11.14), respectively). After 3M/2 + m/2, the long transitional window does not impose any constraint because its window values have become zero, so other methods can be used to represent the samples beyond 3M/2 + m/2.
The same argument applies to the first half of the short-to-long transitional windows WL_S2L and WL_S2S as well (see Fig. 11.8), leading to no constraint placed on samples before M/2 − m/2, so other signal representation methods can be accommodated for the samples before M/2 − m/2.
MDCT with the short window function WS S2S is a simple method for representing those samples. Due to the overlap of m samples between the short windows and the transitional windows, M/m short windows need to be used to represent the samples between the long-to-short (WL L2S and WL S2S) and short-to-long (WL S2L and WL S2S) transitional windows; see the last row of Fig. 11.9, for example.
These M/m short windows amount to a total of M samples, the same as the block length of the long windows, so this window switching scheme is amenable to maintaining a constant frame size, which is highly desirable for convenient real-time processing of signals. For the sake of convenience, a long block may sometimes be referred to as the frame in the remainder of this book.
Under such a scheme, the short window represents the fine time resolution mode
and the long windows, including the transitional ones, represent the fine frequency
resolution mode, thus amounting to a double-resolution switched MDCT.
If a constant frame size is forgone, other numbers of short windows can be used. This typically involves more sophisticated window sequencing and buffer management.
If the window size of the short window is set to zero, i.e., m = 0, there is no need for any form of short window. The unconstrained regions before M/2 and after 3M/2 are open to any kind of representation method. For example, a DCT-IV or wavelet transform may be deployed to represent samples in those regions.
Table 11.2 Possible window switching scenarios

Current window half: WL X2S, WL X2L, WS S2S, WS S2S
To place short windows in the middle of the next frame, we need a look-ahead interval of up to half a frame, which causes additional coding delay. This look-ahead interval of half a frame is also necessary for other window switching situations, such as the transition from short windows to long windows (see the last row of Fig. 11.9 again).
If there is a transient in the transient detection interval, the short windows have to be placed to cover it. Since the short windows only cover the second half of the current frame, the first half of the current frame needs to be covered by a WL X2S long window. The short windows also cover the first half of the next frame, whose second half is covered by a WL S2X long window. This completes the transition from a long window to short windows and back to a long window. Table 11.2 summarizes the possible window transition scenarios between two frames.
The decoder needs to know exactly what window the encoder used to generate the current block of MDCT coefficients in order to use the same window to perform the inverse MDCT, so information about window sequencing needs to be included in the bit stream. This can be done using three bits to convey a window index that identifies the windows in (11.10)–(11.14), shown in Fig. 11.8. If window WL S2S is forbidden, only two bits are needed to convey the window index.
11.3.3 Implementation
The double-resolution switched IMDCT is widely used in a variety of audio coding standards. Its implementation is straightforward and is fully illustrated in the third row of Fig. 11.9. In particular, starting from the second window in that row (the WL L2S window), the IMDCT may be implemented in the following steps:
1. Copy the (M − m)/2 samples, where WL L2S equals one, from the delay line directly to the output;
2. Do M/(2m) short IMDCTs and put all results to the output because all of them belong to the current frame;
3. Do another IMDCT, put the first m/2 samples of its result to the output and store the remaining m/2 samples in the delay line, because the first half belongs to the current frame and the rest to the next frame;
4. Do M/(2m) − 1 short IMDCTs and put all results in the delay line because all of them belong to the next frame;
5. Clear the remaining (M + m)/2 samples of the delay line to zero because the last WS S2S has ended.
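The delay-line bookkeeping behind these steps is ordinary windowed overlap-add. A minimal Python sketch, simplified to a single window size (the function name is ours):

```python
import numpy as np

def overlap_add_frames(frames, window):
    # Generic windowed overlap-add with 50% overlap and a delay line --
    # the bookkeeping behind the step-by-step procedure above.
    N = len(window)              # 2M taps
    M = N // 2
    delay = np.zeros(M)          # holds the second half of the previous block
    out = []
    for block in frames:         # each block: 2M time-domain samples
        x = block * window       # synthesis windowing
        out.append(delay + x[:M])  # first half overlaps the delay line
        delay = x[M:].copy()       # second half becomes the new delay line
    return np.concatenate(out)
```

With a sine window, overlapping squared-window halves sum to one, which is why the overlap-add reconstructs the signal.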
On the decoder side, the regular decoder shown in Fig. 4.4 is used to reconstruct the
MDCT coefficients for the block with a transient.
The first theoretical justification for TNS is the spectral flattening effect of transients. For the short window that covers a transient attack, the resultant MDCT coefficients are close to flat and thus amenable to linear prediction. As discussed in Chap. 4, the resultant prediction gain may be considered the same as the coding gain.
The second theoretical justification is that an open-loop DPCM shapes the spectrum of quantization noise toward the spectral envelope of the input signal (see Sect. 4.5.2). For TNS, the input signal is the MDCT coefficients and their spectrum is the time-domain samples covered by the short window, so the quantization noise of the MDCT coefficients in the time domain is shaped toward the envelope of the time-domain samples. This means that more quantization noise is placed after the transient attack and less noise before it. Therefore, there is less likelihood of pre-echo artifacts.
As an example, let us apply the TNS method to the MDCT block that covers
the transient attack in the audio signal in Fig. 11.2 with a predictor order of 2. The
prediction coefficients are obtained from the autocorrelation matrix using (4.63).
The quantization noise and the reconstructed signal are shown in Fig. 11.10. While
the quantization noise before the transient attack is still visible, it is significantly
shorter than that of the regular short window shown in Fig. 11.4.
Fig. 11.10 Pre-echo artifacts for TNS. The top figure shows the quantization noise and the bottom
the reconstructed signal. The quantization noise before the transient attack is still visible, but is
significantly shorter than that of the regular short window. However, the concentration of quantization noise in a short period of time (top) elevates the noise intensity significantly and hence may
become audible
WS B2B:
$$h^S_{B2B}(n) = \begin{cases} 0, & 0 \le n < m/2 - B/2;\\ h^B_L(n), & m/2 - B/2 \le n < m/2 + B/2;\\ 1, & m/2 + B/2 \le n < 3m/2 - B/2;\\ h^B_R(n), & 3m/2 - B/2 \le n < 3m/2 + B/2;\\ 0, & 3m/2 + B/2 \le n < 2m. \end{cases}\quad(11.15)$$
[Fig. 11.11 panels, 0–2000 taps: WB_B2B, WS_S2S, WS_S2B, WS_B2S, WS_B2B, WL_L2L, WL_L2S, WL_S2L, WL_S2S, WL_L2B, WL_B2L, WL_B2B, WL_S2B, WL_B2S]
Fig. 11.11 Window functions for a transient-localized 1024/128-subband MDCT. The first window is plotted using a dashed line to emphasize that it is a virtual window (not directly used). Its
length is 64 taps. All windows are built using the sine window as the model window
Its nominal length is the same as the regular short window WS S2S, but its effective length is only
$$(3m/2 + B/2) - (m/2 - B/2) = m + B\quad(11.16)$$
because its leading and trailing (m − B)/2 taps are zero, respectively. Its overlap with its neighboring windows is only B.
As an example, the WB B2B virtual window at the top of Fig. 11.11 may be used in combination with the short window WS S2S in the same figure to build the brief window shown as the fifth window in Fig. 11.11. Its effective length of 128 + 32 = 160 taps is significantly shorter than the 256 taps of the short window, so it should provide better transient localization.
For a sample rate of 44.1 kHz, this corresponds to 160/44.1 ≈ 3.6 ms. Compared with the 5.8-ms length of the regular short window, this amounts to an improvement of 2.2 ms, or about 37.5%. This improvement is critical for pre-echo control because the 3.6-ms length of the brief window is well within the 5-ms range of premasking, while the 5.8-ms length of the regular short window is not.
Figure 11.12 shows the quantization noise achieved by this brief window for a piece of real audio signal. Its pre-echo artifacts are obviously shorter and weaker than those with the regular short window (see Fig. 11.4) but longer and weaker than those delivered by TNS (see Fig. 11.10). One may argue that TNS might deliver better pre-echo control due to its significantly shorter but more powerful pre-echo artifacts. Because premasking often lasts up to 5 ms, however, pre-echo artifacts significantly shorter than the premasking period are most likely inaudible and thus irrelevant. Therefore, the simple brief window approach to pre-echo control serves the purpose well.
Fig. 11.12 Pre-echo artifacts for transient localized MDCT. The top figure shows the quantization
noise and the bottom the reconstructed signal. The quantization noise before the transient attack is
still visible, but is remarkably shorter than that of the regular short window
Fig. 11.13 Window sequence examples. The top sequence is for the conventional method which
does not use the brief window. The brief window is used to cover blocks with transients for the
sequences in the rest of the figure. The second and the third sequences are for a transient occurring
in the first and the third blocks, respectively. Two transients occur in the first and sixth blocks in
the last sequence
WS X2S, WS X2B;
WL L2L, WL L2S, WL L2B, WL S2L, WL S2S, WL S2B, WL B2L, WL B2S, WL B2B;
WL L2X, WS S2X, WS B2X; WL L2X, WS S2X, WS B2X; WL L2X, WS S2X, WS B2X

Pretransient: WL L2B, WL S2B, WL B2B, WS S2B, WS B2B
Transient: WS B2B
Posttransient: WL B2L, WL B2S, WL B2B, WS B2S, WS B2B
Table 11.5 Determination of the first half of the first window in a frame with detected transients

Last window in previous frame | First window in current frame
WL X2S, WS X2S, WS B2B
WS S2X, WS S2X, WS B2X
WL S2X, WS S2X, WS B2X
Label: WS B2B, WS B2S, WS S2B, WS S2S

Window index | Window label
0  | WS S2S
1  | WL L2L
2  | WL L2S
3  | WL S2L
4  | WL S2S
5  | WS B2B
6  | WS S2B
7  | WS B2S
8  | WL L2B
9  | WL B2L
10 | WL B2B
11 | WL S2B
12 | WL B2S
complete set of window labels shown in Table 11.8. The total number of window labels is now 13, requiring 4 bits to transmit the index to the decoder.
A pseudo C++ function is given in Sect. 15.4.5 that illustrates how to obtain short window sequencing from this window index and transient location information.
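For illustration, the index-to-label mapping of Table 11.8 and the required bit count can be sketched in Python (the label strings are our rendering of the table entries):

```python
import math

# Window labels in the order of Table 11.8 (indexes 0 through 12).
WINDOW_LABELS = [
    "WS_S2S", "WL_L2L", "WL_L2S", "WL_S2L", "WL_S2S",
    "WS_B2B", "WS_S2B", "WS_B2S", "WL_L2B", "WL_B2L",
    "WL_B2B", "WL_S2B", "WL_B2S",
]

def index_bits(n_labels):
    # Bits needed to transmit a window index identifying one of n_labels windows.
    return math.ceil(math.log2(n_labels))
```

With 13 labels, `index_bits` confirms that 4 bits are needed; with only 8 windows, 3 bits would suffice.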
Frame type | Windows
Quasistationary | A long window
Smooth transient | A multiple of medium windows
Transient | A multiple of short and medium windows
[Fig. 11.14 panels, 0–2000 taps: WS_S2S, WM_M2M, WM_M2S, WM_S2M, WM_S2S, WL_L2L, WL_L2M, WL_M2L, WL_M2M, WL_L2S, WL_S2L, WL_S2S, WL_M2S, WL_S2M]
Fig. 11.14 Window functions produced for a 1024/256/64 triple-resolution switched MDCT using
the sine window as the model window
cope with transients. Since it can be much shorter than WS B2B, it can deliver much better time localization. Also, it is smoother than WS B2B, so it should have a better frequency resolution property.
Fig. 11.15 Window sequence examples for a 1024/256/64 triple-resolution switched MDCT. The
all medium window frame on the top is suitable for handling slow transient frames. For the rest of
the figure, the short window can be used to better localize transients and the medium window to
achieve better coding gains for the remainder of a frame with detected transients
Comparing Figs. 11.15 and 11.13, we notice that they are very similar; the only essential difference is that the WS B2B window in Fig. 11.13 is replaced by four WS S2S windows in Fig. 11.15. Therefore, the window sequencing procedures are similar and are not discussed here.
However, the addition of another resolution mode means that the resultant audio
coding algorithm will become much more complex because each resolution mode
usually requires its own sets of critical bands, quantizers, and entropy codes. See
Chap. 13 for more explanation.
$$y(n) = \sum_{k} h(k)\, x(n - k).\quad(11.17)$$

(11.18)

$$E(k) = \sum_{i=0}^{L-1} y^2(kL + i)\quad(11.19)$$
For reduced computational load, the following L1 metric is also a good choice:
$$E(k) = \sum_{i=0}^{L-1} |y(kL + i)|.\quad(11.20)$$
Other sophisticated norms, such as perceptual entropy [34], can also be used.
At this point, a transient detection decision can be made based on the variations of the metric or energy among the blocks. As a simple example, let us first calculate
$$E_{max} = \max_{0 \le k < K} E(k)\quad(11.21)$$
and
$$E_{min} = \min_{0 \le k < K} E(k).\quad(11.22)$$
A decision indicator
$$T = \begin{cases} 1, & \text{if there is a transient;}\\ 0, & \text{otherwise,} \end{cases}\quad(11.24)$$
may then be formed with threshold tests ((11.23) and (11.25)) that compare $E_{max}$ against $E_{min}$ and compare the maximum metric variation $D_{max}$ against the average level
$$\frac{E_{max} + E_{min}}{2}.\quad(11.26)$$
The $D_{max}$ used above is the maximum absolute metric difference defined below:
$$D_{max} = \max_{0 < k < K} |E(k) - E(k-1)|.\quad(11.27)$$
If none of the conditions above is satisfied, a decision cannot be made for now. Instead, we need to move on to the secondary stage of the decision, which focuses on eliminating large preattack and postattack peaks.
In particular, let us denote
$$k_{max} = \{k \mid E(k) = E_{max},\ 0 \le k < K\},\quad(11.28)$$
which is the index of the block that has the maximum metric or energy. It may be considered the block where the transient attack occurs.
To find the preattack peak, search backward from kmax for the block where the metric begins to increase:

for (k = kmax - 1; k > 0; k--) {
    if (E[k-1] > E[k]) {
        break;
    }
}
PreK = k - 1;
The preattack peak is
$$PreE_{max} = \max_{0 \le k < PreK} E(k).\quad(11.29)$$
Similarly, searching forward from kmax yields PostK, and the postattack peak is
$$PostE_{max} = \max_{PostK \le k < K} E(k).\quad(11.30)$$
For the final decision, declare that the current frame contains transients if
$$E_{max} > k_3 \max\{PreE_{max},\ PostE_{max}\},\quad(11.31)$$
where $k_3$ is a tunable parameter.
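Putting the two stages together, a compact Python sketch of such a detector follows. The L1 metric of (11.20) is used, k1 and k3 are illustrative threshold values, and the peak searches are simplified variants of the loops above:

```python
import numpy as np

def detect_transient(x, L=64, k1=8.0, k3=2.0):
    # Two-stage transient detector sketch; k1 and k3 are illustrative tunables.
    K = len(x) // L
    E = np.array([np.abs(x[k * L:(k + 1) * L]).sum() for k in range(K)])  # (11.20)
    Emax, Emin = E.max(), E.min()
    if Emax > k1 * max(Emin, 1e-12):     # primary stage: large dynamic swing
        kmax = int(np.argmax(E))         # attack block, cf. (11.28)
        k = kmax - 1                     # backward search for preattack peak
        while k > 0 and E[k - 1] <= E[k]:
            k -= 1
        pre = E[:k].max() if k > 0 else 0.0            # cf. (11.29)
        k = kmax + 1                     # forward search for postattack peak
        while k < K - 1 and E[k + 1] <= E[k]:
            k += 1
        post = E[k + 1:].max() if k + 1 < K else 0.0   # cf. (11.30)
        return bool(Emax > k3 * max(pre, post, 1e-12))  # final test, cf. (11.31)
    return False
```

A frame with a sharp attack over a quiet background triggers the detector, while a steady signal does not.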
Chapter 12
Joint Channel Coding
Multichannel audio signals or programs, including the most widely used stereo and 5.1 surround sounds, are considered as consisting of discrete channels. Since a multichannel signal is intended for the reproduction of a coherent sound field, there is strong correlation between its discrete channels. This inter-channel correlation can obviously be exploited to reduce bit rate.
On the receiving end, the human auditory system relies on many cues in the audio signal to achieve sound localization, and the processing involved is very complex. However, many psychoacoustic experiments have consistently indicated that some components of the audio signal are either insignificant or even irrelevant for sound localization and thus can be removed for bit rate reduction.
Surround sounds usually include one or more special channels, called low-frequency effects or LFE channels, which are specifically intended for deep and low-pitched sounds with a frequency range from 3 to 120 Hz. The significantly reduced bandwidth presents a great opportunity for reducing bit rate.
It is obvious that a great deal of bit rate reduction can be achieved by jointly coding all channels of an audio signal through the exploitation of inter-channel redundancy and irrelevancy. Unfortunately, joint channel coding has not reached the same level of sophistication and effectiveness as intra-channel coding. Therefore, only a few widely used and simple methods are covered in this chapter. See [25] to explore further.
$$S = 0.5(L + R)\quad(12.1)$$

and difference

$$D = 0.5(L - R)\quad(12.2)$$
channels. A mono receiver can process the sum signal only, so the listener can hear
both left and right channels in a single loudspeaker. A stereo receiver, however, can
decode the left channel by
$$L = S + D\quad(12.3)$$
and the right channel by
$$R = S - D,\quad(12.4)$$
respectively, so the listener can enjoy stereo sound reproduction. The sum signal is
also referred to as the main signal and the difference signal as the side signal, so this
technology is often called main/side stereo coding.
From the perspective of audio coding, this old technology obviously provides a means for exploiting the strong correlation between stereo channels. In particular, if the correlation between the left and right channels is strongly positive, the difference channel becomes very weak and thus needs fewer bits to encode. If the correlation is strongly negative, the sum channel becomes weak and thus needs fewer bits to encode. However, if the correlation between the left and right channels is weak, both the sum and difference signals are strong, and there is not much coding gain. Also, if the left or right channel is much stronger than the other, sum/difference coding is unlikely to provide any coding gain either. Therefore, the encoder needs to dynamically decide whether or not sum/difference coding is deployed and indicate the decision to the decoder.
Instead of the time domain approach used by stereo FM radio and TV broadcasting, sum/difference coding in audio coding is mostly performed in the subband
domain. In addition, the decision as to whether sum/difference coding is deployed
for a frame is tailored to each critical band. In other words, there is a sum/difference
coding decision for each critical band [35].
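The transform pair and a simple per-band decision can be sketched in Python. The correlation-threshold rule and its value are hypothetical tuning choices, not from the text:

```python
import numpy as np

def ms_encode(left, right):
    # Sum/difference (main/side) transform per (12.1)-(12.2).
    return 0.5 * (left + right), 0.5 * (left - right)

def ms_decode(s, d):
    # Inverse per (12.3)-(12.4): L = S + D, R = S - D.
    return s + d, s - d

def band_ms_decision(left, right, thresh=0.6):
    # Per-band decision sketch: deploy sum/difference coding when the
    # normalized inter-channel correlation is strong, positively or
    # negatively. The threshold 0.6 is a hypothetical tuning value.
    c = np.dot(left, right)
    denom = np.sqrt(np.dot(left, left) * np.dot(right, right)) + 1e-12
    return abs(c / denom) > thresh
```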
Joint intensity coding is, of course, performed on a critical band basis, so only critical band z is considered in the following discussion. Since sound localization is dominated by inter-aural amplitude differences only at high frequencies, higher than 4–5 kHz, the critical band z should be selected in such a way that its lower frequency bound is above 4–5 kHz. All critical bands higher than this can be considered for joint intensity coding.
To illustrate the basic idea of and the steps typically involved in joint intensity
coding, let us suppose that there are K channels that can be jointly coded and denote
the nth subband sample from the kth channel as X.k; n/. The first step of joint
intensity coding is to calculate the power or intensity of all subband samples in
critical band z for each channel:
$$\sigma_k^2 = \sum_{n \in z} X^2(k, n), \quad 0 \le k < K.\quad(12.5)$$
At the second step, all subband samples in critical band z are joined together to form a joint channel:
$$J(n) = \sum_{0 \le k < K} X(k, n), \quad n \in z.\quad(12.6)$$
Without loss of generality, let us suppose that these joint subband samples are embedded in the zth critical band of the zeroth channel. These joint subband samples need to be adjusted so that the power of the zeroth channel in critical band z is unchanged:
$$X(0, n) = \sqrt{\frac{\sigma_0^2}{\sum_{n \in z} J^2(n)}}\; J(n), \quad n \in z.\quad(12.7)$$
At this moment, these joint subband samples can be coded as a normal critical band of the zeroth channel. All subband samples in critical band z in the other channels are discarded, thus achieving significant bit savings.
At the third step, the steering vector for the joined channels is calculated:
$$\theta_k = \sqrt{\frac{\sigma_k^2}{\sigma_0^2}}, \quad 0 < k < K.\quad(12.8)$$
They are subsequently quantized and embedded into the bit stream as side information. This completes the encoder side of joint intensity coding.
At the decoder side, the joint subband samples for critical band z embedded as part of the zeroth channel are first decoded and reconstructed as $\hat{X}(0, n)$. Then the steering vectors are unpacked from the bit stream and reconstructed as $\hat{\theta}_k,\ k = 1, 2, \ldots, K - 1$. Finally, the subband samples of all joined channels can be reconstructed from the joint channel using the steering vector as follows:
$$\hat{X}(k, n) = \hat{\theta}_k \hat{X}(0, n), \quad n \in z \text{ and } 0 < k < K.\quad(12.9)$$
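The encoder and decoder steps (12.5)-(12.9) can be sketched for one critical band as follows; the array layout and the small stabilizing constants are our choices:

```python
import numpy as np

def intensity_encode(X):
    # X: (K, N) subband samples of one critical band for K joined channels.
    power = np.sum(X ** 2, axis=1)                # sigma_k^2, (12.5)
    J = np.sum(X, axis=0)                         # joint channel, (12.6)
    Jn = np.sqrt(power[0] / (np.sum(J ** 2) + 1e-12)) * J   # power match, (12.7)
    theta = np.sqrt(power[1:] / (power[0] + 1e-12))         # steering, (12.8)
    return Jn, theta

def intensity_decode(Jn, theta):
    # Reconstruct every joined channel from the joint channel, (12.9).
    return np.vstack([Jn] + [t * Jn for t in theta])
```

When the channels are scaled copies of one another, the round trip is exact; for real signals, only the per-band intensities survive, which is the source of the distortion discussed below.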
Since a large chunk of subband samples is discarded, joint intensity coding introduces significant distortion. When properly designed and not aggressively deployed, the distortion may appear as a mild collapse of the stereo image; a listener may complain that the sound field is narrower. When aggressively deployed, a significant collapse of the stereo image is possible.
After subband samples are joined together and then reconstructed using the steering vector at the decoder side, they can no longer cancel out aliasing from other subbands. In other words, the aliasing cancellation property necessary for perfect reconstruction is destroyed. If the stopband attenuation of the prototype filter is high, this may not be a significant problem. For subband coders whose stopband attenuation is not high, however, the aliasing may easily become audible. This is the case for the MDCT, whose stopband attenuation is usually less than 20 dB.
Chapter 13
Implementation Issues
Blocks following a transient are usually similar to the transient block, so a transient may be considered as the start of a quasistationary segment, called a transient segment, each consisting of one or more short blocks. For this reason, transient localization may also be called transient segmentation. Under such a scheme, the first transient segment of a frame starts from the first block of the frame and ends before the first transient block. The second transient segment starts from the first transient block and ends either before the next transient block or with the end of the frame. This procedure continues until the end of the frame. This segmentation of subband samples by blocks based on transient locations and frame boundaries is illustrated in Fig. 13.1.
Within each block, either short or long, the subband samples are segmented along
the frequency axis in such a way that critical bands are approximated to fully utilize
Fig. 13.1 Segmentation of blocks of subband samples within a frame based on transient locations
and frame boundaries
Fig. 13.2 Time–frequency tiling of subband samples in a frame by transient segments in the time domain and critical bands in the frequency (subband) domain. Subband samples in each tile share similar statistical properties along the time axis and are within one critical band along the frequency axis, so they may be considered as one unit for quantization and bit allocation purposes
results from perceptual models. Subband segments thus obtained may be conveniently referred to as critical band segments or simply critical bands.
The subband samples of a frame with a detected transient are, therefore, segmented along the time axis into transient segments and along the frequency axis into critical band segments. Combined, they segment all subband samples in a frame into a time–frequency tiling as shown in Fig. 13.2. All subband samples within each tile share similar statistical properties along the time axis and are within one critical band along the frequency axis, so they should be considered as one group or unit for quantization and bit allocation purposes. Such a tile is called a quantization unit.
For a frame with a long block, the time–frequency tiling is obviously reduced to a tiling of critical bands along the frequency axis only, so a quantization unit consists of the subband samples in one critical band.
Fig. 13.3 All quantization indexes within a quantization unit are coded using the same entropy codebook. The application scopes of the entropy codebooks overlap with those of the quantization units
Fig. 13.4 Statistics-adaptive approach to entropy codebook assignment, which ignores the boundaries of the quantization units and, instead, adaptively matches codebooks to the local statistical properties of the quantization indexes
because most of the indexes in the unit are much smaller and thus are not suitable
for being coded using a large codebook. Using the statistics-adaptive approach, however, the largest quantization index is segmented into codebook segment C as shown
in Fig. 13.4, so it shares a codebook with other large quantization indexes in the segment. Also, all quantization indexes in codebook segment D are small, so a small
codebook can be selected for them, which results in fewer bits for coding these
quantization indexes.
With the fixed assignment approach, only the codebook indexes need to be transferred to the decoder as side information, because the codebook application scopes are the same as the quantization units, which are usually determined by transient segments and critical bands. The statistics-adaptive approach, however, needs to transfer the codebook application scopes to the decoder as side information in addition to the codebook indexes, since they are dynamic and independent of the quantization units. This overhead must be contained in order for the statistics-adaptive approach to offer any advantage, which can be accomplished by imposing a limit on the number of segments, thus restricting the amount of side information.
$$\text{Bits Per Frame} = \frac{\text{Bit Rate}}{\text{Sample Rate}} \times \text{Samples Per Frame}.\quad(13.1)$$
As an example, let us consider a bit rate of 128 kbps and a frame size of 1,024 samples. For a sample rate of 48 kHz, the number of bits assigned to code each frame is
$$\frac{128{,}000}{48{,}000} \times 1{,}024 \approx 2{,}730.$$
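As a sketch, with integer truncation:

```python
def frame_bits(bit_rate, sample_rate, samples_per_frame):
    # Average bit budget per frame, (13.1), truncated to an integer.
    return bit_rate * samples_per_frame // sample_rate
```

For the example above, `frame_bits(128000, 48000, 1024)` gives 2,730 bits.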
While simple, this approach is not a good match for the dynamic nature of audio signals. For example, frames whose audio samples are silent or of small magnitude demand few bits; allocating the same number of bits to such frames means that a significant number of bits are not used and thus wasted. On the other hand, frames with transients tend to demand a large number of bits; allocating the same number of bits to such frames results in an inadequate number of bits, and the distortion is likely to become audible. Therefore, a better approach is to adaptively allocate bits to frames in proportion to their individual demands while maintaining full conformance to the target bit rate.
Adaptive inter-frame bit allocation is a very complex problem that is affected by a variety of factors, such as the type of bit rate constraints (maximal and average bit rates), the size of decoder buffers, and the bit demands of individual audio frames.
Audio decoders usually have very limited buffer sizes, which leads to the possibility of buffer underflow and overflow. Buffer underflow, also called buffer underrun, is a condition that occurs when the input buffer is fed with data at a lower speed than the data is being read from it for a prolonged period of time, so that there is not enough data for the decoder to decode. When this happens, real-time audio playback has to be paused until enough data is fed to the input buffer. Buffer overflow is the opposite condition, where the input buffer is fed with data at a higher speed than the data is being read from it for a prolonged period of time, so that the input buffer is full and can no longer take more data. When this occurs, some of the input data have to be discarded, causing decoding errors.
The bit demands of individual audio frames vary with time. Since it is impossible to predict the bit demand of future frames, optimal bit allocation is very problematic for real-time encoding. For applications where real-time encoding is not required, one can usually make a pilot run of encoding to estimate the bit demand for each frame, thus making inter-frame bit allocation easier.
masking thresholds, or NMR for all quantization units are less than one.
Frame Header
Payloads:
    Audio Data for Normal Channel 0
    Audio Data for Normal Channel 1
    ...
    Audio Data for LFE Channel 0
    Audio Data for LFE Channel 1
    ...
Error Protection Codes
Auxiliary Data
End of Frame
Table 13.2 A basic frame header that assists the synchronization of audio frames and provides a description of the audio signal that is carried by the audio bit stream

Synchronization word
Compressed data size
Sample rate
Number of normal channels
Number of LFE channels
Speaker configuration
The synchronization word signals the start of the frame. Some audio coding
algorithms impose a constraint that the synchronization word cannot appear
anywhere inside a frame for easy synchronization with the frame: once the synchronization word is found, it is the start of the frame and there is no ambiguity. In such
a case, it is unnecessary to use the compressed data size codeword and end of frame
signature.
Forbidding the synchronization word to appear inside a frame imposes a certain degree of inflexibility on the encoding process. To lessen this inflexibility, it is imperative to use a long synchronization word. A good example is 0xffffffff (hexadecimal), which requires only that 32 consecutive binary 1s not appear anywhere inside the frame.
Without the restriction above, a synchronization word might appear in the middle of a frame. If seeking the synchronization word begins in the middle of such a frame, an identification of the synchronization word inside the frame triggers false synchronization. However, the compressed data size codeword, which comes right after the synchronization word, would point to a position in the bit stream where the end of frame signature is absent, so the synchronization cannot be confirmed. Consequently, a new round of the synchronization loop is launched.
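A resynchronization loop of this kind might look as follows in Python. The 16-bit size field and its position are illustrative assumptions, and the next sync word stands in for the end-of-frame signature described above:

```python
def find_frame_start(data, sync=b"\xff\xff\xff\xff"):
    # Resynchronization sketch: scan for the sync word, then confirm it by
    # checking that a 16-bit size field (an illustrative layout) points at
    # the next sync word or at the end of the stream.
    pos = 0
    while True:
        pos = data.find(sync, pos)
        if pos < 0:
            return -1                    # no sync word in the stream
        size_off = pos + len(sync)
        if size_off + 2 <= len(data):
            size = int.from_bytes(data[size_off:size_off + 2], "big")
            end = pos + size
            if size >= len(sync) + 2 and (
                data[end:end + len(sync)] == sync or end == len(data)
            ):
                return pos               # confirmed frame start
        pos += 1                         # false sync; keep scanning
```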
The second role of the frame header is to provide a description of the audio
signal carried by the bit stream. This facilitates the decoding of the compressed
audio frames as well as the playback of the decoded audio. A minimal description
must include the sample rate, the number of normal and LFE channels, and speaker
configuration.
Side Information:
    Window index
    Transient locations
    Joint channel coding
    Quantization step sizes
    Bit allocation
    Entropy codebook application scope 0
    Entropy codebook application scope 1
    ...
    Entropy codebook application scope M − 1
Quantization Indexes
The bits representing the quantization indexes for all coded subband samples may be packed sequentially from low frequency to high frequency. For frames with short windows, they are usually laid out one transient segment after another. When entropy coding is used for the quantization indexes, these bits are obviously placed based on the application scopes of the respective entropy codebooks.
Data that are not part of the compressed audio data and are user-defined may be attached to the audio frame as auxiliary data. The decoder can choose to ignore all of them without interrupting the decoding and playback of the audio signal.
The auxiliary data may even provide a mechanism for future extension of the audio coding algorithm. Data for a coder extension may be attached to the original frame as auxiliary data for the corresponding new decoders to decode. Such an extension in the auxiliary data is simply ignored by older decoders, which do not recognize the extension.
While 16 bits per sample is widely used for delivering audio signals in various digital audio systems, most notably the compact disc (CD), the representation of high-fidelity sounds may need more than 20 bits per sample, so 24 bits per sample has been considered more appropriate.
The 24 bits are also amenable to microprocessors that do not implement floating-point arithmetic in hardware and usually have at least 32-bit data width. For such processors, the 24 bits for audio signals may be placed in the lower 24 bits of a 32-bit word, leaving the upper 8 bits as headroom for coping with overflow, which may occur during a series of accumulation operations.
Any noninteger parameters that the decoder may use, such as the quantization step sizes, must be converted to a 24-bit integer representation.
Division, which is extremely expensive on such processors, should be strictly disallowed.
Modern microprocessors usually come with a small amount of fast on-chip internal memory and a large amount of slow off-chip external memory. If the footprint of the decoding algorithm fits within the fast internal memory, decoding is significantly accelerated. Software development time and reliability may also improve if managing accesses to the external memory is avoided. The memory footprint may be reduced by using small entropy and quantization codebooks, as well as a small audio frame size.
A fixed-point number with f fractional bits represents a real value through an underlying b-bit integer x as

    x / 2^f.    (13.2)

Since the leftmost bit is the sign bit, the dynamic range of this fixed-point number is

    [ -2^(b-1) / 2^f,  (2^(b-1) - 1) / 2^f ].    (13.3)
There are various notations for representing the word length and radix point of a binary fixed-point type. The one most widely used in the signal processing community is the Q notation Qm.f and its abbreviation Qf, which assumes a default value of m = 0. For example, Q0.31 describes a type with 0 integer bits and 31 fractional bits stored as a 32-bit two's complement integer, which may be further abbreviated to Q31. The 24 bits per sample suggested above for representing variables and parameters in an audio coder may be designated as Q8.23 if implemented using a 32-bit integer (b = 32). Using a 24-bit integer, it is Q0.23, or simply Q23.
To add or subtract two fixed-point numbers, it is sufficient to add or subtract the underlying integers:

    x / 2^f ± y / 2^f = (x ± y) / 2^f,    (13.4)

which is exactly the same as integer addition or subtraction. Both operations may overflow, but do not underflow.
For multiplication, however, the result needs to be rescaled:

    (x / 2^f) (y / 2^f) = (x y / 2^f) / 2^f,    (13.5)

so the integer product x y must be divided by 2^f, i.e., shifted right by f bits. To round to the nearest value instead of truncating, a rounding factor of

    ( 1 << (f-1) )    // Rounding factor

may be added to the product before the shift.
Chapter 14
Quality Evaluation
Impairment description          Grade
Imperceptible                   5.0
Perceptible, but not annoying   4.0
Slightly annoying               3.0
Annoying                        2.0
Very annoying                   1.0
Chapter 15
Fig. 15.1 DRA encoder. The solid lines indicate the flow of audio data and the dashed lines the flow of control parameters. The dashed boxes indicate optional modules that do not have to be invoked

to the adoption of the algorithm in the marketplace. Decoder cost is a key factor affecting consumer purchase decisions, and power consumption is becoming a prominent issue in an era of mobile devices and global warming.
When the DRA audio coding algorithm was conceived, the design goal was to deliver high coding efficiency and high audio quality (if the bit rate allows) with minimal decoder complexity. To accomplish this, a principle was set that only necessary modules would be used, that each module would be implemented with minimal impact on decoder complexity, and that a 24-bit signal path would be maintained when the bit rate allows. This results in the encoder and decoder shown in Figs. 15.1 and 15.2, respectively.
15.2 Architecture
Transient localized MDCT (TLM), discussed in Sect. 11.5, is deployed as a simple approach to optimizing the coding gain for both quasistationary and transient episodes of audio signals. The numbers of taps for the long, short, and virtual windows are 2,048, 256, and 64, respectively. According to (11.16), the effective length of the resultant brief window is 160 taps, corresponding to a period of a mere 3.6 ms at a sample rate of 44.1 kHz. This is well within the roughly 5 ms that strong premasking typically lasts.

Fig. 15.2 DRA decoder. The solid lines indicate the flow of audio data and the dashed lines the flow of control parameters. The dashed boxes indicate optional modules that do not have to be invoked
The implementation cost of the inverse TLM is essentially the same as that of the regular switched IMDCT because both can be accomplished using the same software or hardware implementation. A pseudo C++ implementation example is given in Sect. 11.5.4. The only additional cost of the inverse TLM over the regular switched IMDCT is more complex window sequencing when in short MDCT mode. Fortunately, this window sequencing involves a very limited number of assignment and logic operations, as can be seen in the pseudo C++ code in Sect. 15.4.5.
Only the midtread uniform quantization discussed in Sect. 2.3.2 is deployed to quantize all MDCT coefficients, so the implementation code for inverse quantization is minimal, involving only an integer multiplication of the quantization index with the quantization step size (see (2.22)).
The logarithmic companding discussed in Sect. 2.4.2.2 is used to quantize the quantization step sizes with a step size of 0.2, resulting in a total of 116 quantized values that cover a dynamic range of 24 bits for the quantization indexes of MDCT coefficients. The quantized step size values are further converted into Q.23 fractional format for convenient and exact implementation on fixed-point microprocessors. The resultant quantization table is given in Table A.1.
All MDCT coefficients in one quantization unit (see Sect. 13.1.2) share one quantization step size. The critical bands that define the quantization units along the
frequency axis are given in various tables in Sect. A.2 for long and short windows
and for a variety of sample rates.
Statistical codebook assignment, discussed in Sect. 13.2.2, is used to improve the efficiency of Huffman coding with a small impact on decoding complexity, but with a potentially significant overhead due to the bits needed to transfer the codebook application scopes. This overhead can, however, be properly managed by the encoder so that the efficiency gain significantly exceeds it.
The module in the decoder that reconstructs the number of quantization units involves a simple search for the lowest critical band that accommodates the highest (in frequency) nonzero quantization index (see Sect. 15.3.3.4 for details), so its implementation cost is very low.
The optional sum/difference coding and joint intensity coding modules implement the simplest methods discussed in Chap. 12 so as to incur the least decoding cost while reaping the gains of joint channel coding. The decisions for sum/difference coding and the steering vectors for joint intensity coding are individually based on quantization units. The quantization of the steering vectors for joint intensity coding reuses the quantization step size table given in Table A.1 with a bias, to provide for the fact that the elements of these steering vectors may be larger as well as smaller than one, corresponding to the possibility that the intensities of the joint channels may be stronger or weaker than that of the source channel.
As usual, the transient detection, perceptual model, and global bit allocation modules are not specified by the DRA standard and can be implemented by any method that provides data in formats suitable for DRA encoding.
them from carrying sound imaging information, so joint channel coding does not need to be deployed. An LFE channel therefore has the simpler bit stream structure shown in Table 15.3.
The end of frame signature is placed right after the audio data and before the auxiliary data so that the decoder can immediately check whether the end of frame signature is confirmed after all audio data are decoded. This enables error checking without the burden of handling auxiliary data.
so the decoding of a DRA bit stream should start with a search for this synchronization codeword, as illustrated by the following pseudo C++ code:

SeekSyncWord()
{
    while ( Unpack(16) != 0x7FFF ) {
        continue;
    }
}

where

    Unpack(nNumBits)

is a function that unpacks nNumBits bits from the bit stream. For the code above, nNumBits = 16. Note that the seeking loop above may become an infinite loop if the input stream does not contain nSyncWord and thus may cause the decoder to hang. This must be properly taken care of by aborting the loop and relaunching the SeekSyncWord() routine.
Since nSyncWord = 0x7FFF is not prohibited from appearing in the middle of a DRA frame, the SeekSyncWord() function listed above may end up with a false nSyncWord. A function that properly handles this situation should skip nNumWord (see Table 15.4) 32-bit codewords after this nSyncWord and then seek and confirm the next nSyncWord before declaring synchronization. The following pseudo C++ code illustrates this procedure:
Sync()
{
    // Seek the first nSyncWord
    while ( Unpack(16) != 0x7FFF ) {
        continue;
    }
    // Unpack frame header type (0=short, 1=long)
    nFrameHeaderType = Unpack(1);
    // Number of 32-bit codewords in the frame
    nBits4NumWord = (nFrameHeaderType==0) ? 10 : 13;
    nNumWord = Unpack(nBits4NumWord);
    // Skip all audio data in the frame
    Unpack(nNumWord*32-nBits4NumWord-1);
    // Seek and confirm the second nSyncWord
    while ( Unpack(16) != 0x7FFF ) {
        continue;
    }
}
If the input stream is not a valid DRA stream, nSyncWord may never appear, causing the two search loops to hang forever. This situation must be taken care of by aborting the loops and relaunching the Sync() routine.
Table 15.4 DRA frame header. If there is a slash in the Number of bits column, such as L/R, L is the number of bits for a short frame header and R is for a long frame header. A value of zero means that the codeword is absent in the frame

DRA codeword       Number of bits   Explanation
nSyncWord          16               Synchronization codeword (0x7FFF)
nFrameHeaderType   1                Type of frame header: 0 = short, 1 = long
nNumWord           10/13            Number of 32-bit codewords in the frame
nNumBlocksPerFrm   2                Number of MDCT blocks per frame, transferred
                                    as a power of two
nSampleRateIndex   4                Sample rate index
nNumNormalCh       3/6              Number of normal channels
nNumLfeCh          1/2              Number of LFE channels
bAuxChCfg          1                Flag indicating whether the speaker
                                    configuration is attached to the end of the
                                    frame as the first part of the auxiliary data
bUseSumDiff        1/0              Decision for sum/difference coding.
                                    Transferred only for a short frame header and
                                    when nNumNormalCh > 1
bUseJIC            1/0              Decision for joint intensity coding:
                                    0 = not deployed, 1 = deployed. Transferred
                                    only for a short frame header and when
                                    nNumNormalCh > 1
nJicCb             5/0              Joint intensity coding is deployed for all
                                    critical bands starting from nJicCb+1 and
                                    ending with the last active critical band.
                                    Transferred only for a short frame header and
                                    if bUseJIC == 1
nFrameHeaderType = 0 indicates a short frame header and nFrameHeaderType = 1 a long one.
Short frames are intended for lower bit rate DRA streams, whose number of channels is limited to no more than 8.1 surround sound and for which joint channel coding may optionally be enabled. Long frames are intended for higher bit rate DRA streams, which may support up to 64.3 surround sound and for which joint channel coding is disallowed.
The DRA frame header is described in Table 15.4 and its unpacking from the bit stream is delineated by the following pseudo C++ code:
UnpackFrameHeader()
{
    // Verify sync codeword (16 bits)
    nSyncWord = Unpack(16);
    if ( nSyncWord != 0x7FFF ) {
        throw("Sync codeword not verified.");
    }
    // Type of frame header (0=short, 1=long)
    nFrameHeaderType = Unpack(1);
    // Number of 32-bit codewords in the frame
    if ( nFrameHeaderType == 0 ) {
        // 10 bits for short frame header
        nNumWord = Unpack(10);
    }
    else {
        // 13 bits for long frame header
        nNumWord = Unpack(13);
    }
    // Number of MDCT blocks per frame
    nNumBlocksPerFrm = 1<<Unpack(2);
    // Sample rate index
    nSampleRateIndex = Unpack(4);
    // Number of normal and LFE channels
    if ( nFrameHeaderType == 0 ) {
        // Short frame header
        nNumNormalCh = Unpack(3)+1;
        nNumLfeCh = Unpack(1);
    }
    else {
        // Long frame header
        nNumNormalCh = Unpack(6)+1;
        nNumLfeCh = Unpack(2);
    }
    // Flag indicating if speaker configuration is attached to the
    // end of the frame as the first part of auxiliary data
    bAuxChCfg = Unpack(1);
    // Joint channel coding
    if ( nFrameHeaderType == 0 ) {
        // Present only for short frame headers
        if ( nNumNormalCh>1 ) {
            // Present only for multichannel audio
            // Decision for sum/difference coding
            bUseSumDiff = Unpack(1);
            // Decision for joint intensity coding
            bUseJIC = Unpack(1);
        }
        else {
            // Not present if there is only one channel
            bUseSumDiff = 0;
            bUseJIC = 0;
        }
        // The start critical band that JIC is deployed
        if ( bUseJIC == 1 ) {
            // Present only when JIC is enabled
            nJicCb = Unpack(5)+1;
        }
        else {
            nJicCb = 0;
        }
    }
    else {
        // Joint channel coding is not deployed for long frame
        // headers
        bUseSumDiff = 0;
        bUseJIC = 0;
        nJicCb = 0;
    }
}
nNumNormalCh   Speaker configuration
1              F
2              FL, FR
3              FL, FR, FC
4              FL, FR, SL, SR
5              FL, FR, SL, SR, FC
6              FL, FR, SL, SR, BC, FC
7              FL, FR, SL, SR, BL, BR, FC
8              FL, FR, SL, SR, BL, BR, FC, overhead

Speaker configuration   nNumNormalCh   nNumLfeCh
Mono                    1              0
Stereo                  2              0
2.1                     2              1
3.1                     3              1
5.1                     5              1
6.1                     6              1
7.1                     7              1

Ordinal number   Normal channel
0                Front left
1                Front right
2                Surround left
3                Surround right
4                Back left
5                Back right
6                Front center
7                Overhead
configurations. The placement of the center channel toward the tail is to facilitate joint channel coding by letting left and right channel pairs start from the first channel. If any of these channels is not present, the channels below move up to take its place. Any speaker configuration beyond Table 15.8 needs to be specified in the auxiliary data by setting bAuxChCfg to one.
Inside each audio channel, data components are placed as shown in Table 15.2 for normal channels and in Table 15.3 for LFE channels. All these data components are delineated below.
Table 15.9 DRA codewords for window sequencing. The number of bits is represented as L/S, where L is the number of bits for long MDCT windows and S is for short windows. A value of zero means that the codeword is absent in the frame

DRA codeword          Number of bits   Explanation
nWinIndex             4/4              Window index
nNumSegments          0/2              Number of transient segments. Not
                                       transferred if a long window is used
nNumBlocksInSegment   0/varied         Number of short MDCT blocks in each
                                       transient segment. There are
                                       (nNumSegments-1) such codewords in the
                                       bit stream. The length of the last
                                       transient segment is not transferred but
                                       inferred from nNumBlocksPerFrm, the
                                       number of short MDCT blocks in a frame.
                                       Not transferred if a long window is used
                                       or nNumSegments = 1
Index   Length   Codeword
0       2        0
1       2        3
2       3        3
3       3        2
4       4        9
5       3        5
6       4        8
When the function above is called, it should be passed a pointer to the Huffman codebook given in Table 15.10, which is further passed to the Huffman decoding function HuffDec().
Table 15.11 DRA codewords used to represent the Huffman codebook assignment for a transient segment identified by nSegment

DRA codeword                    Number of bits   Explanation
anHSNumBands[nSegment]          4                Number of Huffman codebooks used
                                                 to encode the quantization
                                                 indexes in the transient segment
                                                 identified by nSegment
mnHSBandEdge[nSegment][nBand]   variable         Application scope for the
                                                 nBand-th Huffman codebook. nBand
                                                 goes from 0 to
                                                 anHSNumBands[nSegment]-1
mnHS[nSegment][nBand]           variable         Selection index for the
                                                 corresponding nBand-th Huffman
                                                 codebook. nBand goes from 0 to
                                                 anHSNumBands[nSegment]-1
The Huffman codebooks in Tables A.28 and A.29 are enough to cover any possible application scope values used by the DRA standard.
Each unit of application scope encoded in the bit stream corresponds to four quantization indexes, so the decoded value should be multiplied by four to get the application scope in terms of quantization indexes.
The difference between the codebook selection index for the nBand-th application scope and that of its predecessor,

    nDiff = mnHS[nSegment][nBand] - mnHS[nSegment][nBand-1],

is encoded using one of the Huffman codebooks given in the last row of Table 15.12, depending on the presence or absence of transients in the frame, except for the first one in a transient segment, which is packed directly into the bit stream using four bits. Therefore, on the decoder side, the selection index is obtained as

    mnHS[nSegment][nBand] = nDiff + mnHS[nSegment][nBand-1].
The unpacking of the codebook assignment is illustrated by the pseudo C++ code given below. Proper Huffman codebooks specified in Table 15.12 should be passed to this function depending on the presence or absence of transients in the frame. The HuffDecRecursive() function used in the code implements the recursive procedure given in the last paragraph of Sect. 9.4.

UnpackCodebookAssignment(
    pCodebook4Scope,    // pointer to a Huffman codebook for
                        // decoding application scopes
    pCodebook4Selection // pointer to a Huffman codebook for
                        // decoding selection indexes
)
{
    // The left boundary of the first application scope is
    // always zero
    nLeftBoundary = 0;
    // Loop through transient segments to unpack application
    // scopes
Table 15.12 Huffman codebooks for codebook assignment (Tables A.32–A.51; Tables A.43–A.51 apply when transients are present)
out of this range. Any such value must be moved back into the range by taking the remainder against 16.
Sign bits for all nonzero quantization indexes coded above follow immediately. One bit is packed for each nonzero quantization index, and a negative sign is indicated by the value zero.
The unpacking of all quantization indexes is delineated by the following pseudo C++ code:
UnpackQIndex(
    int anQIndexBooks[9],    // pointers to the Huffman codebooks
                             // used to encode the quantization
                             // indexes
    int *pQuotientWidthBook  // Huffman codebook used to encode
                             // the widths of the quotients
)
{
    // Reset the history value for the number of bits for packing
    // quotients
    nLastQuotientWidth = 0;
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Initialize the start of the application scope to the
        // first quantization index in the current transient
        // segment. Since each transient segment starts at the
        // anEdgeSegment[nSegment]-th short MDCT block and each
        // short MDCT block consists of 128 samples, we have
        nStart = anEdgeSegment[nSegment]*128;
        // Loop through all Huffman codebooks in the transient
        // segment
        for (nBand=0; nBand<anHSNumBands[nSegment]; nBand++) {
            // Establish the end of the application scope by adding
            // the width of the application scope to its start
            nEnd = nStart + mnHSBandEdge[nSegment][nBand];
            // Load the selection index for the Huffman codebook
            // corresponding to the application scope
            nHSelect = mnHS[nSegment][nBand];
            // Consider different types of Huffman codebooks
            if ( nHSelect==0 ) {
                // No Huffman codebook is assigned. All quantization
                // indexes are zero, so set them as such
                for (nBin=nStart; nBin<nEnd; nBin++) {
                    anQIndex[nBin] = 0;
                }
            }
            else {
                // The pointers to the Huffman codebooks are stored in
                // array anQIndexBooks[9], so this is the right way to
There are obviously anMaxActCb[nSegment] quantization units in the transient segment indexed by nSegment, each of which has a quantization step size that needs to be conveyed to the decoder, so the number of corresponding quantization indexes is also anMaxActCb[nSegment]. The differences between the indexes of successive quantization units are encoded using one of the Huffman codebooks indicated in Table 15.15.
Table 15.15 Huffman codebooks for quantization step size indexes: Tables A.52 and A.53
Reconstructing the indexes from the difference values encoded in the bit stream may generate values outside the valid range of [0, 115]. Any such value must be moved back into this range by taking the remainder. The unpacking operation is delineated by the following pseudo C++ code:
UnpackQStepIndex(pQStepBook)
// pQStepBook: Huffman codebook used to encode the differences
//             between quantization step size indexes
{
    // Reset the quantization step size index history
    nLastQStepIndex = 0;
    // Loop through the transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Loop through the active critical bands
        for (nBand=0; nBand<anMaxActCb[nSegment]; nBand++) {
            // Decode the difference from the bit stream
            nQStepIndex = HuffDec(pQStepBook);
            // Add it to the history
            nLastQStepIndex += nQStepIndex;
            // Take the remainder if out of [0,115]
            nLastQStepIndex = nLastQStepIndex % 116;
            // Store the index
            mnQStepIndex[nSegment][nBand] = nLastQStepIndex;
        }
    }
}
If joint intensity coding is not deployed (nJicCb = 0, see Table 15.4), sum/difference coding may be deployed up to the larger of the maximal active critical bands of the sum and difference channels. If joint intensity coding is deployed (nJicCb > 0), this number is limited by nJicCb.
Sum/difference coding may be turned off for a whole transient segment, and this decision is conveyed to the decoder by one bit per transient segment. If it is turned on, there is a one-bit decision for each critical band, or quantization unit, in the transient segment. The unpacking of these decisions is delineated by the following pseudo C++ code:
UnpackSumDff(anMaxActCb4Sum[])
// anMaxActCb4Sum: Maximal active critical bands for the sum
//                 (left) channel
// Note: This function is run only for the difference
//       (right) channel
{
    // Loop through the transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Get the larger of the maximal active critical bands of
        // both sum and difference channels
        nMaxCb =
            __max(anMaxActCb4Sum[nSegment], anMaxActCb[nSegment]);
        // nMaxCb identifies the highest critical band that
        // sum/difference coding is deployed only if joint intensity
        // coding is not used (nJicCb=0).
        if ( nJicCb>0 ) {
            // Otherwise, sum/difference coding is deployed to
            // critical bands before joint intensity coding, so we
            // need to get the smaller of nMaxCb and nJicCb
            nMaxCb = __min(nJicCb, nMaxCb);
        }
        // If it is deployed, unpack the relevant decisions
        if ( nMaxCb>0 ) {
            // Sum/difference coding may be turned off for the whole
            // transient segment, so need to unpack this decision
            anSumDffAllOff[nSegment] = Unpack(1);
            //
            if ( anSumDffAllOff[nSegment] == 0 ) {
                // Not turned off completely, unpack the decision for
                // each quantization unit
                for (nBand=0; nBand<nMaxCb; nBand++) {
                    mnSumDffOn[nSegment][nBand] = Unpack(1);
                }
            }
            else {
                // Turned off completely, so turn off the decision for
                // each quantization unit
                for (nBand=0; nBand<nMaxCb; nBand++) {
                    mnSumDffOn[nSegment][nBand] = 0;
                }
            }
        }
    }
}
window sequencing for LFE channels is, therefore, very easy and can be delineated by the following pseudo C++ code:

SetupLfeWinSequencing()
{
    // The left edge of the first transient segment is
    anEdgeSegment[0] = 0;
    //
    if ( nNumBlocksPerFrm == 8 ) {
        // The frame size is equal to the long MDCT block size, so
        // the long window is always used
        nWinIndex = WL_L2L;
        // It is always a quasi-stationary frame so there is only
        // one transient segment
        nNumSegment = 1;
        // whose length is equal to one long MDCT block
        anEdgeSegment[1] = 1;
    }
    else {
        // The frame is shorter than the long MDCT block, so the
        // short window is always used
        nWinIndex = WS_S2S;
        // It is always a quasi-stationary frame so there is only
        // one transient segment
        nNumSegment = 1;
        // whose length is equal to the frame size
        anEdgeSegment[1] = nNumBlocksPerFrm;
    }
}
        SetupLfeWinSequencing();
        // Unpack codebook assignment
        UnpackCodebookAssignment();
        // Unpack quantization indexes
        UnpackQIndex();
        // Reconstruct the maximal active critical bands
        ReconstructMaxActCb();
        // Unpack indexes for quantization step sizes
        UnpackQStepIndex();
    }
    // Check for error (end of frame signature)
    CheckError();
    // Unpack auxiliary data
    UnpackAuxiliaryData();
}
15.4 Decoding

After a DRA frame is unpacked, all components are made ready for reconstructing the PCM samples for all channels in the frame. The reconstruction involves the following steps, each of which is described in this section:

- Inverse quantization
- Sum/difference decoding
- Joint intensity decoding
- De-interleaving
- Inverse MDCT
The bias applied to the quantized JIC scale factors is

    rJicScaleBias = 1 / aunStepSize[57] = 1/2,702

to make up for the need that JIC scale factors can be either larger or smaller than one.
Joint intensity decoding may be delineated by the following pseudo C++ code:

DecJIC(
    srcCh,            // the source channel
    aunStepSize[116], // the quantization step size table
                      // (Table A.1)
    *pnCBEdge         // pointer to the table of critical band
                      // edges
)
{
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Point to the first quantization index in the current
        // transient segment.
        nBin0 = anEdgeSegment[nSegment]*128;
        // The number of short MDCT blocks in a transient
        // segment is
        nNumBlocks =
            anEdgeSegment[nSegment+1] - anEdgeSegment[nSegment];
        // Loop through active critical bands that are JIC coded
        for (nBand=nJicCb; nBand<srcCh.anMaxCb[nSegment]; nBand++) {
            // The quantization unit starts from
            nStart = nBin0 + nNumBlocks * pnCBEdge[nBand-1];
            // and ends with
            nEnd = nBin0 + nNumBlocks * pnCBEdge[nBand];
            // Load the index to the scale factor
            nQStepSelect = mnQStepIndex[nSegment][nBand];
            // Look up the step size table for the JIC scale factor
            rStepSize = aunStepSize[nQStepSelect];
            // Loop through all quantization indexes in the
            // quantization unit
            for (nBin=nStart; nBin<nEnd; nBin++) {
                // The reconstructed MDCT coefficient in the source
                // channel is multiplied by the scale factor and then
                // the bias to generate the MDCT coefficient in the
                // joint channel. The order of multiplication is
                // critical to prevent overflow in an integer
                // implementation
                afBinReconst[nBin] =
                    (rStepSize*srcCh.afBinReconst[nBin]) * rJicScaleBias;
            }
        }
    }
}
DecSumDff(
    LeftCh,   // the sum (left) channel
    *pnCBEdge // pointer to the table of critical band edges
)
{
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Point to the first quantization index in the current
        // transient segment.
        nBin0 = anEdgeSegment[nSegment]*128;
        // The number of short MDCT blocks in a transient
        // segment is
        nNumBlocks =
            anEdgeSegment[nSegment+1] - anEdgeSegment[nSegment];
        // Get the larger of the maximal active critical bands of
        // both sum and difference channels
        nMaxCb =
            __max(LeftCh.anMaxActCb[nSegment], anMaxActCb[nSegment]);
        // nMaxCb identifies the highest critical band that
        // sum/difference coding is deployed only if joint intensity
        // coding is not used (nJicCb=0).
        if ( nJicCb>0 ) {
            // Otherwise, sum/difference coding is deployed to
            // critical bands before joint intensity coding, so we
            // need to get the smaller of nMaxCb and nJicCb
            nMaxCb = __min(nJicCb, nMaxCb);
        }
        // Loop through active critical bands that are
        // sum/difference coded
        for (nBand=0; nBand<nMaxCb; nBand++) {
            if ( mnSumDffOn[nSegment][nBand] == 1 ) {
                // Sum/difference coding is turned on.
                // The quantization unit starts from
                nStart = nBin0 + nNumBlocks * pnCBEdge[nBand-1];
                // and ends with
                nEnd = nBin0 + nNumBlocks * pnCBEdge[nBand];
                // Loop through all quantization indexes
                for (nBin=nStart; nBin<nEnd; nBin++) {
                    // Left channel
                    rL = LeftCh.afBinReconst[nBin] + afBinReconst[nBin];
                    // Right channel
                    rR = LeftCh.afBinReconst[nBin] - afBinReconst[nBin];
                    // Store the decoded left and right channels
                    LeftCh.afBinReconst[nBin] = rL;
                    afBinReconst[nBin] = rR;
                }
            }
        }
    }
}
15.4.4 De-Interleaving
For a frame in which the long MDCT is deployed, the 1,024 MDCT coefficients are
naturally placed from low frequency to high frequency.
For a frame in which the short MDCT is deployed, the MDCT coefficients are
said to be naturally placed if the 128 coefficients from the first short MDCT block
are placed from low frequency to high frequency, then the 128 coefficients from the
second block, from the third block, and until the last block of the frame. Under this
placement scheme, MDCT coefficients corresponding to the same frequency from
succeeding MDCT blocks are 128 coefficients apart.
To facilitate quantization and entropy coding, however, the DRA encoder interleaves these coefficients within each transient segment based on frequency: the MDCT coefficients of all MDCT blocks within the transient segment that correspond to the same frequency are placed right next to each other in their respective time order. These groups are then laid out from low frequency to high frequency. This interleaving operation is applied to each transient segment in the frame. Under this placement scheme, MDCT coefficients from the same MDCT block with succeeding frequencies are nNumBlocks coefficients apart within each transient segment, where nNumBlocks is the number of short MDCT blocks in the transient segment.
When it is time to do the inverse short MDCT, this interleaving must be reverted. The following pseudo C++ code illustrates how this is accomplished:

DeInterleave()
{
    // Index to the naturally ordered coefficient
    p = 0;
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Point to the first coefficient in the current transient
        // segment.
        nBin0 = anEdgeSegment[nSegment]*128;
        // The number of short MDCT blocks in a transient segment is
        nNumBlocks =
            anEdgeSegment[nSegment+1] - anEdgeSegment[nSegment];
        // Loop through all blocks in the transient segment
        for (nBlock=0; nBlock<nNumBlocks; nBlock++) {
            // Index to the interleaved coefficient is set to point to
            // the first coefficient of the nBlock-th short MDCT block
            q = nBin0 + nBlock;
            // Loop through all coefficients in one short MDCT block
            for (n=0; n<128; n++) {
                // Copy the coefficient
                afBinNatural[p] = afBinInterleaved[q];
                // The next interleaved coefficient is nNumBlocks apart
                q += nNumBlocks;
                // Increment the index for the natural order
                p++;
            }
(
    int nWinIndex,   // window index unpacked from the bit stream
    int anWinShort[] // window indexes assigned to the short MDCT
                     // blocks in the frame
)
{
    int nSegment, nBlock;
    // First MDCT block
    if ( nWinIndex==WS_S2S || nWinIndex==WS_S2B ) {
        // The first half of nWinIndex unpacked from the bit stream
        // indicates that there is no transient in the first MDCT
        // block, so set its window index to WS_S2S
        anWinShort[0] = WS_S2S;
        // Make sure that the first half of this window matches the
        // second half of the last window in the preceding frame
        switch ( nWinLast ) {
        case WS_B2B:
            // The second half of the last window is "brief", so
            // the first half must be changed to "brief" also
            anWinShort[0] = WS_B2S;
            break;
        case WL_L2S:
        case WL_S2S:
        case WL_B2S:
        case WS_S2S:
        case WS_B2S:
            // The second half of the last window is "short",
            // which matches the first half of the current one, so
        }
        break;
    //
    default:
        // The only other case, which is WS_S2B, should not happen
        // by itself, so flag an error
        throw("WS_S2B should not happen by itself.");
    }
}
Table 15.16 Grades for ITU-R BS.1116 compliant subjective listening tests

Lab      Date      Stereo @ 128 kbps   5.1 @ 320 kbps   5.1 @ 384 kbps
NTICRT   08/2004   -                   4.6              4.7
SLDST    10/2004   4.2                 4.0              -
SLDST    01/2005   4.1                 4.2              -
SLDST    07/2005   4.7                 4.5              -
SLDST    08/2006   4.6                 4.9              -
was only interested in surround sound, so DRA was tested at 384 kbps as well as 320 kbps. This test was conducted in comparison with two major international codecs, and DRA came out as the clear winner.
The test results from SLDST are more consistent because the same group of engineers performed the tests in the same lab and the listener pool was largely unchanged. Each test was more difficult than the last because the test administrators and many listeners had learned the characteristics of DRA-processed audio and knew where to look for distortions. The gradually increasing test grades reflect continuous improvement to the encoder, especially to its perceptual model and transient detection unit.
Appendix A
Large Tables
Table A.1 Quantization step sizes aunStepSize[0] through aunStepSize[115], listed five per row; the entry in the row labeled i under column +k is aunStepSize[i+k]

  i        +0        +1        +2        +3        +4
  0         1         1         1         2         2
  5         2         2         3         3         3
 10         4         5         5         6         7
 15         8         9        11        12        14
 20        16        18        21        24        28
 25        32        37        42        49        56
 30        64        74        84        97       111
 35       128       147       169       194       223
 40       256       294       338       388       446
 45       512       588       676       776       891
 50      1024      1176      1351      1552      1783
 55      2048      2353      2702      3104      3566
 60      4096      4705      5405      6208      7132
 65      8192      9410     10809     12417     14263
 70     16384     18820     21619     24834     28526
 75     32768     37641     43238     49667     57052
 80     65536     75281     86475     99334    114105
 85    131072    150562    172951    198668    228210
 90    262144    301124    345901    397336    456419
 95    524288    602249    691802    794672    912838
100   1048576   1204498   1383604   1589344   1825677
105   2097152   2408995   2767209   3178688   3651354
110   4194304   4817990   5534417   6357376   7302707
115   8388607
Critical bands for long MDCT at 8000 Hz sample rate
Critical Bands   Upper Edge (Subbands)   Upper Edge (Hz)
0                26                      101.563
1                53                      207.031
2                80                      312.5
3                108                     421.875
4                138                     539.063
5                170                     664.063
6                204                     796.875
7                241                     941.406
8                282                     1101.56
9                328                     1281.25
10               381                     1488.28
11               443                     1730.47
12               517                     2019.53
13               606                     2367.19
14               715                     2792.97
15               848                     3312.5
16               1010                    3945.31
17               1024                    4000

Critical bands for short MDCT at 8000 Hz sample rate
Critical Bands   Upper Edge (Subbands)   Upper Edge (Hz)
0                4                       125
1                8                       250
2                12                      375
3                16                      500
4                20                      625
5                25                      781.25
6                30                      937.5
7                36                      1125
8                42                      1312.5
9                49                      1531.25
10               57                      1781.25
11               67                      2093.75
12               79                      2468.75
13               94                      2937.5
14               112                     3500
15               128                     4000

Critical bands for long MDCT at 11025 Hz sample rate
Critical Bands   Upper Edge (Subbands)   Upper Edge (Hz)
0                19                      102.283
1                39                      209.949
2                59                      317.615
3                80                      430.664
4                102                     549.097
5                125                     672.913
6                150                     807.495
7                177                     952.844
8                207                     1114.34
9                241                     1297.38
10               280                     1507.32
11               326                     1754.96
12               380                     2045.65
13               446                     2400.95
14               526                     2831.62
15               625                     3364.56
16               745                     4010.56
17               888                     4780.37
18               1024                    5512.5

Critical bands for short MDCT at 11025 Hz sample rate
Critical Bands   Upper Edge (Subbands)   Upper Edge (Hz)
0                4                       172.266
1                8                       344.531
2                12                      516.797
3                16                      689.063
4                20                      861.328
5                24                      1033.59
6                28                      1205.86
7                33                      1421.19
8                39                      1679.59
9                46                      1981.05
10               54                      2325.59
11               64                      2756.25
12               76                      3273.05
13               91                      3919.04
14               109                     4694.24
15               128                     5512.5
Table A.10 Critical bands for long MDCT at 22050 Hz sample rate

Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
 0                 10                      107.666
 1                 20                      215.332
 2                 30                      322.998
 3                 41                      441.431
 4                 52                      559.863
 5                 64                      689.063
 6                 77                      829.028
 7                 91                      979.761
 8                107                     1152.03
 9                125                     1345.83
10                146                     1571.92
11                170                     1830.32
12                199                     2142.55
13                234                     2519.38
14                277                     2982.35
15                330                     3552.98
16                394                     4242.04
17                469                     5049.54
18                557                     5997
19                661                     7116.72
20                790                     8505.62
21                968                    10422.1
22               1024                    11025
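A table like A.10 is typically consulted in the reverse direction at run time: given an MDCT subband, find the critical band that contains it. The following is a minimal sketch of such a lookup (the function name is illustrative and not from the book; the book's exact edge convention, inclusive vs. exclusive, may differ):

```python
import bisect

# "Upper Edge (Subbands)" column of Table A.10 (22050 Hz, long MDCT).
A10_UPPER_EDGES = [10, 20, 30, 41, 52, 64, 77, 91, 107, 125, 146, 170,
                   199, 234, 277, 330, 394, 469, 557, 661, 790, 968, 1024]

def critical_band_of(subband: int) -> int:
    """Index of the critical band containing `subband`.

    Treats each listed upper edge as inclusive, so band b covers the
    subbands (previous_edge, edge].
    """
    return bisect.bisect_left(A10_UPPER_EDGES, subband)

print(critical_band_of(100))   # subband 100 falls in critical band 8
```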
[Two further critical band tables (captions lost): bodies lost in extraction.]
Table A.18 Critical bands for long MDCT at 48000 Hz sample rate

Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
 0                  5                      117.188
 1                 10                      234.375
 2                 15                      351.563
 3                 20                      468.75
 4                 25                      585.938
 5                 31                      726.563
 6                 37                      867.188
 7                 44                     1031.25
 8                 52                     1218.75
 9                 61                     1429.69
10                 71                     1664.06
11                 83                     1945.31
12                 98                     2296.88
13                116                     2718.75
14                138                     3234.38
15                165                     3867.19
16                197                     4617.19
17                235                     5507.81
18                279                     6539.06
19                332                     7781.25
20                401                     9398.44
21                503                    11789.1
22                692                    16218.8
23               1024                    24000
[Further critical band tables (captions lost): bodies lost in extraction.]
Table A.26 Critical bands for long MDCT at 192000 Hz sample rate

Critical Band   Upper Edge (Subbands)   Upper Edge (Hz)
 0                  4                      375
 1                  8                      750
 2                 12                     1125
 3                 16                     1500
 4                 20                     1875
 5                 24                     2250
 6                 29                     2718.75
 7                 35                     3281.25
 8                 42                     3937.5
 9                 51                     4781.25
10                 61                     5718.75
11                 73                     6843.75
12                 87                     8156.25
13                106                     9937.5
14                136                    12750
15                197                    18468.8
16                465                    43593.8
17               1024                    96000
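In all of these tables the Hz column follows directly from the subband column: the long MDCT divides the band from 0 to fs/2 into 1024 subbands, so each subband is (fs/2)/1024 Hz wide and an upper edge at subband n lies at n(fs/2)/1024 Hz. A minimal sketch of this relationship (the function name is illustrative):

```python
def upper_edge_hz(subband_index: int, sample_rate: float) -> float:
    """Frequency of a critical band's upper edge, given its subband index,
    for a 1024-subband long MDCT covering 0 to sample_rate/2."""
    return subband_index * (sample_rate / 2.0) / 1024.0

# Spot checks against the tables: Table A.10 (22050 Hz), band 0 ends at
# subband 10; Table A.26 (192000 Hz), band 0 ends at subband 4.
print(round(upper_edge_hz(10, 22050), 3))   # 107.666
print(round(upper_edge_hz(4, 192000), 3))   # 375.0
```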
[A further critical band table (caption lost): body lost in extraction.]
Table A.29 Huffman codebook for application scopes for frames
with detected transients

Index   Length   Codeword
 0        4        5
 1        3        1
 2        3        0
 3        3        6
 4        4        7
 5        4        E
 6        4        F
 7        4        A
 8        4        9
 9        4        8
10        5        9
11        5        17
12        6        1B
13        6        19
14        6        18
15        6        10
16        6        2C
17        7        5A
18        7        34
19        7        5B
20        8        6A
21        8        6B
22        9        8D
23        8        45
24       10        11D
25        9        89
26        9        8C
27       11        223
28       10        110
29       10        11C
30       11        222
31        9        8F
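A (length, codeword) table such as A.29 can be turned into a decoder by reading bits one at a time and testing the accumulated pattern against the table at each length; since a Huffman code is prefix-free, the first match is the symbol. The sketch below uses only the first four entries of Table A.29 and assumes MSB-first bit packing; the book's actual bit order and reference decoder may differ:

```python
# First four entries of Table A.29, as (index, length, codeword).
TABLE_A29_HEAD = [(0, 4, 0x5), (1, 3, 0x1), (2, 3, 0x0), (3, 3, 0x6)]

# Map a length-qualified bit pattern to its symbol index.
DECODE = {(length, code): index for index, length, code in TABLE_A29_HEAD}

def decode_bits(bits: str) -> list:
    """Greedily match successively longer prefixes against the codebook."""
    out, acc, length = [], 0, 0
    for b in bits:
        acc = (acc << 1) | int(b)   # shift the next bit in, MSB first
        length += 1
        if (length, acc) in DECODE:
            out.append(DECODE[(length, acc)])
            acc, length = 0, 0      # restart for the next codeword
    return out

# "001" -> index 1, "0101" -> index 0, "000" -> index 2
print(decode_bits("0010101000"))  # [1, 0, 2]
```

Greedy matching is safe here only because the code is prefix-free: no shorter codeword can be a prefix of a longer one, so a match at length L can never pre-empt a longer valid codeword.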
[Huffman codebook tables (captions lost): bodies lost in extraction.]
Table A.36 (continued)

Index   Length   Codeword
64        9        F1
65        8        7A
66        7        33
67        7        3F
68        7        2B
69        8        6D
70        9        DF
71       10        1BC
72       11        31F
73       10        1BD
74        9        80
75        8        AC
76        8        5E
77        9        FB
78        9        58
79       10        18E
80       11        3CA
Table A.37 Codebook 82

Index   Length   Codeword
 0       13        62D
 1       13        A4
 2       13        BCA
 3       12        53
 4       12        50
 5       11        6E0
 6       11        753
 7       10        307
 8       10        3A8
 9       11        328
10       11        2F5
11       11        7B1
12       12        7B7
13       12        51
14       12        CEE
15       13        19DB
16       14        33B4
17       13        E0B
18       12        CEC
19       12        F61
20       12        552
21       11        77D
22       10        210
23       10        368
24       10        1F
25        9        13C
Table A.37 (continued)
[Table body lost in extraction.]
Table A.38
[Table body lost in extraction.]
Table A.39 (continued)
[Table body lost in extraction.]
Table A.41 Codebook 1271
[Table body lost in extraction.]
Table A.42
[Table body lost in extraction.]
Table A.45 (continued)

Index   Length   Codeword
63       10        2D9
64        9        141
65        9        E6
66        7        58
67        6        25
68        8        7E
69        8        A5
70        9        D1
71       10        70
72       12        687
73       11        34A
74       10        1FC
75        9        D3
76        8        1D
77        9        167
78       11        3FB
79       10        1CE
80       11        34B
Table A.46 Codebook 82

Index   Length   Codeword
 0       12        413
 1       13        824
 2       13        825
 3       13        83A
 4       11        203
 5       11        7BE
 6       12        410
 7       10        3DC
 8        9        112
 9       10        2A0
10       10        3DD
11       10        2A1
12       13        83B
13       12        411
14       13        838
15       12        416
16       13        839
17       12        417
18       11        7BF
19       12        414
20       12        415
21       11        7BC
22       12        46A
23       10        3D2
24       10        3D3
25        8        A0
Table A.46 (continued)
[Table body lost in extraction.]
Table A.48 Codebook 311
[Table body lost in extraction.]
Table A.49 (continued)

Index   Length   Codeword
117      10        35F
118      10        35C
119      10        35D
120      11        3CE
121      12        48F
122      11        3CF
123      10        1A5
124      11        246
125      10        122
126      10        1E6
Table A.50 Codebook 1271

Index   Length   Codeword
 0       10        52
 1        9        2E
 2       10        53
 3       10        50
 4       10        51
 5       10        56
 6       10        57
 7       10        54
 8       10        55
 9       10        3AA
10       10        3AB
11       10        3A8
12        9        2F
13       10        3A9
14        9        2C
15        9        2D
16       10        3AE
17       10        3AF
18        9        22
19       10        3AC
20       10        3AD
21       10        3A2
22       10        3A3
23        9        23
24       10        3A0
25       10        3A1
26        9        20
27       10        3A6
28       10        3A7
29       10        3A4
30        9        F4
Table A.50 (continued)
[Table body lost in extraction.]
Table A.51
[Table body lost in extraction.]
Table A.53 (continued)
[Table body lost in extraction.]
Index
compact representation, 5
Companding (defined), 39
companding (defined), 40
compressed data size codeword (defined), 243
compression ratio, 148, 255
compression ratio (defined), 6
compressor (defined), 39
cosine modulated analysis filter (defined), 121
cosine modulated filter banks, 11
cosine modulated synthesis filter (defined), 122
critical band (defined), 179
critical band level (defined), 184
critical band rate (defined), 179
critical band rate scale (defined), 184
critical band segments (defined), 238
critical bands (defined), 238
critical bandwidth, 179, 182
critical bandwidth (defined), 179
critically sampled (defined), 97
DAC (defined), 23
data modeling, 9
decimator (defined), 96
decision boundaries (defined), 21
decision cells, 45
decision interval, 19
decision intervals, 47
decision intervals (defined), 21
decision regions, 45
decision regions (defined), 47
decoding delay (defined), 154
decorrelation (defined), 85
delay chain (defined), 93
DFT, 11
DFT (defined), 86
differential pulse code modulation (defined),
59
digital audio (compression) coding (defined), 5
digital-to-analog conversion (defined), 23
Discrete Fourier Transform, 11
Discrete Fourier Transform (defined), 86
Double blind principle (defined), 253
double buffering (defined), 236
double-blind, triple-stimulus with hidden
reference (defined), 254
double-resolution switched MDCT, 223
double-resolution switched MDCT (defined),
213
downsampler (defined), 96
DPCM, 69
DPCM (defined), 59
DRA (defined), 255
Dynamic Resolution Adaptation (defined), 255
filter banks, 11
fine quantization, 69, 71, 111
fine quantization (defined), 62
fixed length, 145
fixed-length code (defined), 145, 148
fixed-length codebook (defined), 145, 148
fixed-length codes, 28
fixed-point (defined), 247
fixed-point arithmetic (defined), 246
flattening effect, 202
forward quantization (defined), 21
Forward Quantization., 19
Fourier uncertainty principle, 12
Fourier uncertainty principle (defined), 204
frame, 213
frame-based processing, 236
frequency coefficients (defined), 86
frequency domain (defined), 86
frequency resolution (defined), 200
frequency transforms (defined), 86
gammatone filter (defined), 177
geometric mean (defined), 77
Givens rotation (defined), 125
global bit allocation, 13
granular error (defined), 29
granular noise (defined), 29
half-sine window (defined), 137
Heisenberg uncertainty principle, 12
Heisenberg uncertainty principle (defined),
204
Huffman code, 9
Huffman code (defined), 161
hybrid filter bank (defined), 205
IDCT, 89
ideal bandpass filter (defined), 95
ideal subband coder (defined), 110
iid, 146
iid (defined), 147
images (defined), 101
independently and identically distributed
(defined), 147
input-output map of a quantizer (defined), 21
instantaneously decodable (defined), 153
intensity density level, 184
intensity density level (defined), 174
inter-aural amplitude differences (defined), 232
inter-aural time differences (defined), 232
inter-frame bit allocation, 290
inter-frame bit allocation (defined), 241
internal node (defined), 154
interpolator (defined), 97
intra-frame bit allocation (defined), 241
Inverse Quantization, 20
inverse quantization, 281
inverse quantization (defined), 21
inverse transform (defined), 74
JIC, 282
JIC (defined), 232
joint channel coding (defined), 231
joint intensity coding, 14, 258, 259, 278, 282
joint intensity coding (defined), 232
lapped transforms, 11
lattice structure (defined), 126
LBG algorithm (defined), 49
leaf (defined), 154
Levinson-Durbin recursion (defined), 63
LFE, 231
LFE (defined), 4, 234
linear phase condition, 207
linear phase condition (defined), 121
Linear prediction, 51
linear prediction, 10
linear prediction coding (defined), 55
linear predictor (defined), 53
linear transform (defined), 73
Lloyd-Max quantizer (defined), 36
loading factor (defined), 29
Logarithmic companding, 257
long filter bank (defined), 200
look-ahead, 213
lossless (defined), 6, 147
lossless coding (defined), 147
lossless compression coding (defined), 147
lossy (defined), 6
low frequency effect, 231
low-frequency effect (defined), 4
Low-frequency effects (defined), 234
LPC (defined), 55
LPC decoder (defined), 56
LPC encoder (defined), 55
LPC reconstruction filter, 58, 69, 71
LPC reconstruction filter (defined), 56
modulating (defined), 94
monaural sound (defined), 4
MSQE, 47
MSQE (defined), 22
multiplexer, 14
ping-pong buffering (defined), 236
polyphase (component) matrix (defined), 105
polyphase components, 106
polyphase implementation, 107
polyphase representation, 106
postmasking (defined), 193
power-complementary conditions, 207
PR (defined), 103
pre-echo artifacts (defined), 203
prediction coefficients (defined), 53
prediction error, 10, 51, 53
prediction error (defined), 55
prediction filter (defined), 53
prediction gain (defined), 55
prediction residue (defined), 55
predictor order (defined), 53
prefix (defined), 153
prefix code (defined), 154
prefix-free code (defined), 154
premasking, 257
premasking (defined), 193
primary windows (defined), 210
probability density function (defined), 21
prototype filter (defined), 94
pseudo QMF (quadrature mirror filter)
(defined), 122
Pulse-code modulation, 3
pulse-code modulation (defined), 24
quantization, 8, 19
quantization distortion (defined), 21
quantization error, 8
quantization error (defined), 21
quantization error accumulation, 59
quantization index, 19
quantization index (defined), 21
quantization noise (defined), 21
quantization noise accumulation (defined), 57
quantization step size, 8, 24, 275
quantization step size (defined), 24
quantization table, 21
quantization table (defined), 19
quantization unit, 257, 275, 281
quantization unit (defined), 238
quantized value, 20
quantized values (defined), 21
quasistationary (defined), 199
re-quantize (defined), 24
re-quantizing (defined), 17
reconstructed vector (defined), 47
reconstruction error (defined), 57
representative values (defined), 21
representative vector (defined), 47
representative vectors, 45
residue, 10, 53
reversal or anti-diagonal matrix (defined), 131
rotation vector (defined), 126
rounded exponential filter (defined), 177
rounding function (defined), 26
sample rate, 3
sample rate (defined), 19
sample rate compressor (defined), 96
sampling period (defined), 19
sampling theorem, 3
sampling theorem (defined), 19
SBC (defined), 91
scalar quantization (defined), 21
scalar quantizer, 8
self-information (defined), 148
Shannon's noiseless coding theorem, 167
siblings (defined), 163
side information (defined), 244
signal-to-mask ratio (defined), 186
signal-to-noise ratio (defined), 23, 252
simultaneous masking (defined), 193
sine window (defined), 137
SMR, 252
SMR (defined), 186
SMR threshold, 195
SMR threshold (defined), 186
SNR (defined), 252
sound intensity level, 184
sound intensity level (defined), 174
sound pressure level (defined), 174
source sequence, 145
source sequence (defined), 147
source signal, 5
source symbol, 145
source symbols (defined), 146
spectral envelope (defined), 69
spectral flatness measure, 113
spectrum flatness measure (defined), 85
SPL (defined), 174
spread of masking (defined), 188
SQ, 8
SQ (defined), 17, 21
SNR (defined), 23
statistic-adaptive codebook assignment, 268
statistical redundancy, 9
statistical redundancy (defined), 6
steering vector (defined), 233
stereo (defined), 4
stopband attenuation, 94
subband (defined), 93
subband coding, 91
subband coding (defined), 97
subband filter (defined), 93
subband filters (defined), 93
subband samples (defined), 93
subbands (defined), 93
subjective listening test, 290
sum/difference coding, 14, 258, 259, 276
sum/difference coding (defined), 231
sum/difference decoding, 283
surround sound (defined), 4
switched MDCT (defined), 207
switched-window MDCT (defined), 207
symbol, 145
symbol code (defined), 147
symbol set, 145
symbol set (defined), 146
synchronization codeword (defined), 243
synthesis filter banks (defined), 92
TC (defined), 73
TDAC (defined), 136
temporal masking (defined), 193
temporal noise shaping (defined), 215
temporal-frequency analysis, 69
test set (defined), 49
test signal (defined), 181
the DCT, 89
the inverse DCT, 89
THQ (defined), 174
threshold in quiet (defined), 174
time-frequency analysis (defined), 86
time-frequency resolution (defined), 204
time-frequency tiling (defined), 238
time-domain aliasing cancellation (defined), 136
TLM, 256
TLM (defined), 220
TNS (defined), 215
tonal frequency components, 199
training set (defined), 48
transform (defined), 73
transform coding (defined), 73
transform coefficients (defined), 73
transformation matrix (defined), 73
transformed block (defined), 73
transient attack (defined), 202
transient detection interval, 227
transient detection interval (defined), 213
transient function (defined), 228
transient-localized MDCT, 256
transient segment (defined), 237
transient segmentation (defined), 237
transient-detection blocks (defined), 227
transient-localized MDCT (defined), 220
transitional windows (defined), 210
transparent (defined), 254
triple-resolution switched MDCT (defined), 224
truncate function (defined), 27
type-I polyphase (component) vector (defined), 105
type-I polyphase component (defined), 104
type-I polyphase representation (defined), 104
type-II DCT (defined), 88
type-II polyphase components (defined), 106
type-II polyphase representation (defined), 106
type-IV DCT (defined), 89
unary code, 9, 152, 162
unary codebook, 146
unary numeral system, 146, 152
unequal error protection (defined), 245
uniform quantization, 24
uniform quantizer (defined), 24
uniform scalar quantization, 8
unique decodability (defined), 152
unitary (defined), 86, 124
upsampler (defined), 97
variable-length codebooks (defined), 145
variable-length codes (defined), 145
vector quantization, 8
vector quantization (defined), 43
vector-quantized, 47
virtual window (defined), 217
virtual windows, 256
Voronoi regions (defined), 48
VQ, 8
VQ (defined), 17, 43
VQ codebook, 43
VQ codebook (defined), 47
VQ quantization error (defined), 47
water-filling, 81
water-filling algorithm (defined), 242
whitening filter (defined), 67
Wiener-Hopf equations, 67
Wiener-Hopf equations (defined), 63
window (defined), 137
window function (defined), 137