1. Introduction
Advances in digital audio technology are fueled by two sources: hardware developments and
new signal processing techniques. When processors dissipated tens of watts of power and
memory densities were on the order of kilobits per square inch, portable playback devices like
MP3 players were not possible. Now, however, power dissipation, memory densities, and
processor speeds have improved by several orders of magnitude.
This paper introduces digital audio signal compression, a technique essential to the
implementation of many digital audio applications. Digital audio signal compression is the
removal of redundant or otherwise irrelevant information from a digital audio signal, a process
that is useful for conserving both transmission bandwidth and storage space. We begin by
defining some useful terminology. We then present a typical “encoder” (as compression
algorithms are often called) and explain how it functions. Finally, we consider some standards that
employ digital audio signal compression, and discuss the future of the field.
2. Terminology
This paper focuses on audio compression techniques, which differ from those used in speech
compression. Speech compression uses a model of the human vocal tract to express a particular
signal in a compressed format. This technique is not usually applied in the field of audio
compression due to the vast array of sounds that can be generated – models that represent audio
generation would be too complex to implement. So instead of modeling the source of sounds,
modern audio compression models the receiver, i.e., the human ear.
When we speak of compression, we must distinguish between two different types: lossless
and lossy. Lossless compression retains all the information in a given signal, i.e., a decoder can
perfectly reconstruct a compressed signal. In contrast, lossy compression eliminates information
from the original signal. As a result, a reconstructed signal may differ from the original. With
audio signals, the differences between the original and reconstructed signals only matter if they
are detectable by the human ear. As we will explore shortly, audio compression employs both
lossy and lossless techniques.
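The distinction can be made concrete with a minimal sketch. It assumes a toy list of sample values and uses Python's standard `zlib` module as a stand-in for a lossless coder and coarse rounding as a stand-in for a lossy one; neither is what an audio encoder actually uses, and the function names are hypothetical.

```python
import zlib


def lossless_round_trip(samples: bytes) -> bytes:
    """Compress then decompress with zlib; the result equals the input exactly."""
    return zlib.decompress(zlib.compress(samples))


def lossy_round_trip(samples: list[int], step: int) -> list[int]:
    """Round every sample to a multiple of `step`; detail finer than `step` is lost."""
    return [step * round(s / step) for s in samples]


pcm = [0, 3, 7, 12, 18, 25, 31, 36]  # toy sample values (assumption)

restored = lossless_round_trip(bytes(pcm))
approximated = lossy_round_trip(pcm, step=8)

print(list(restored) == pcm)   # lossless: identical to the original
print(approximated == pcm)     # lossy: differs from the original
```

The lossy path can never recover the discarded detail; whether that matters depends, as noted above, on whether the ear can detect the difference.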
Figure 1 shows a generic encoder or “compressor” that takes blocks of sampled audio signal
as its input. These blocks typically consist of between 500 and 1500 samples per channel,
depending on the encoder specification. For example, the MPEG-1 layer III (MP3) specification
takes 576 samples per channel per input block. The output is a compressed representation of the
input block (a “frame”) that can be transmitted or stored for subsequent decoding.
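As a rough illustration of the blocking step, the sketch below splits one channel into 576-sample blocks. The 44.1 kHz sampling rate and the zero-padding of the final block are assumptions made for the example, and `split_into_blocks` is a hypothetical helper, not part of any specification.

```python
def split_into_blocks(samples, block_size=576):
    """Split one channel into consecutive blocks, zero-padding the last one."""
    blocks = []
    for start in range(0, len(samples), block_size):
        block = list(samples[start:start + block_size])
        block += [0] * (block_size - len(block))  # pad a short final block
        blocks.append(block)
    return blocks


SAMPLE_RATE_HZ = 44_100  # assumed CD sampling rate

one_second = [0] * SAMPLE_RATE_HZ          # one second of silence, one channel
blocks = split_into_blocks(one_second)
block_duration_ms = 1000 * 576 / SAMPLE_RATE_HZ

print(len(blocks), round(block_duration_ms, 2))
```

At this assumed rate each block covers roughly 13 ms of audio, which is why an encoder handles dozens of blocks per second of input.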
No matter what you do, your ears are always working. They are constantly detecting,
deciphering and analyzing sounds and communicating them to the brain. In a comparatively tiny
area of our body the ear is performing many highly technical and intricate functions. There are
three distinct portions to the ear: the outer ear, containing the fleshy skin and the canal that leads
to the middle ear; the middle ear, containing the three smallest bones in the human body, the
malleus, incus and stapes (commonly called the hammer, anvil and stirrup); and the inner ear,
made up of a cluster of three semicircular canals and the snail-shaped cochlea. Let’s take a look
at them one at a time…
Scientists cannot fully explain just how the signals are transmitted to the brain. They do
know that the signals sent by all the hair cells are about the same in duration and strength. This
has led them to believe that it is not the content of the signals but rather the signals themselves
that convey some sort of message to the brain.
Our ears, so often taken for granted, thus are a marvel of intricacy and design that leaves
anything that man can produce in the shade as a cheap imitation. Your hearing can never be
replaced. Don’t take it for granted.
5. Psychoacoustics
How do we reduce the size of the input data? The basic idea is to eliminate information that
is inaudible to the ear. This type of compression is often referred to as perceptual encoding. To
help determine what can and cannot be heard, compression algorithms rely on the field of
psychoacoustics, i.e., the study of human sound perception. Waves vibrating at different
frequencies manifest themselves differently, all the way from the astronomically slow pulsations
of the universe itself to the inconceivably fast vibration of matter (and beyond). Somewhere in
between these extremes are wavelengths that are perceptible to human beings as light and sound.
Just beyond the realms of light and sound are sub- and ultrasonic vibration, the infrared and
ultraviolet light spectra, and zillions of other frequencies imperceptible to humans (such as radio
and microwave). Our sense organs are tuned only to very narrow bandwidths of vibration in the
overall picture. In fact, even our own musical instruments create many vibrational frequencies
that are imperceptible to our ears. Frequencies are typically described in units called Hertz (Hz),
which translates simply as "cycles per second." In general, humans cannot hear frequencies
below 20Hz (20 cycles per second), nor above 20kHz (20,000 cycles per second), as shown in
Figure 2.
While hearing capacities vary from one individual to the next, it's generally true that humans
perceive midrange frequencies more strongly than high and low frequencies,[2] and that
sensitivity to higher frequencies diminishes with age and prolonged exposure to loud volumes. In
fact, by the time we're adults, most of us can't hear much of anything above 16kHz (although
women tend to preserve the ability to hear higher frequencies later into life than do men). The
most sensitive range of hearing for most people lies between 2kHz and 4kHz, a level probably
evolutionarily related to the normal range of the human voice, which runs roughly from 500Hz to
2kHz.
Specifically, audio compression algorithms exploit the conditions under which signal
characteristics obscure or mask each other. This phenomenon occurs in three different ways:
threshold cut-off, frequency masking and temporal masking. The remainder of this section
explains the nature of these concepts; subsequent sections explain how they are typically applied
to audio signal compression.
Threshold Cut-off
The human ear detects sounds as a local variation in air pressure measured as the Sound
Pressure Level (SPL). If variations in the SPL are below a certain threshold in amplitude, the ear
cannot detect them. This threshold, shown in Figure 3, is a function of the sound’s frequency.
Notice in Figure 3 that because the lowest-frequency component is below the threshold, it will
not be heard.
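A threshold curve of this kind can be approximated numerically. The sketch below uses Terhardt's widely quoted approximation of the threshold in quiet, which is an assumption brought in for illustration and not necessarily the curve plotted in Figure 3; the function names are hypothetical.

```python
import math


def threshold_in_quiet_db(freq_hz: float) -> float:
    """Approximate hearing threshold in dB SPL (Terhardt's formula, assumed here)."""
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


def is_audible(freq_hz: float, level_db_spl: float) -> bool:
    """A component is kept only if it rises above the threshold at its frequency."""
    return level_db_spl > threshold_in_quiet_db(freq_hz)


for freq in (50, 1000, 3300, 15000):
    print(freq, round(threshold_in_quiet_db(freq), 1))

print(is_audible(50, 20.0))    # a quiet low-frequency component
print(is_audible(3000, 20.0))  # the same level in the ear's most sensitive range
```

Under this approximation a component at a given level can be inaudible at 50 Hz yet clearly audible near 3 kHz, which matches the frequency dependence described above.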
Frequency Masking
Even if a signal component exceeds the hearing threshold, it may still be masked by louder
components that are near it in frequency. This phenomenon is known as frequency masking or
simultaneous masking. Each component in a signal can cast a “shadow” over neighbouring
components. If the neighbouring components are covered by this shadow, they will not be heard.
The effective result is that one component, the masker, shifts the hearing threshold. Figure 4
shows a situation in which this occurs.
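The "shadow" can be sketched as a raised threshold that falls off with distance from the masker. The example below measures distance on the Bark scale using the Zwicker and Terhardt approximation; the slopes (27 dB per Bark below the masker, 10 dB per Bark above) and the 10 dB offset are illustrative assumptions, not values from any particular codec.

```python
import math


def hz_to_bark(freq_hz: float) -> float:
    """Map frequency to the Bark critical-band scale (Zwicker/Terhardt approximation)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def masked_threshold_db(masker_hz, masker_db, probe_hz,
                        lower_slope=27.0, upper_slope=10.0, offset_db=10.0):
    """Threshold the masker imposes at probe_hz (triangular shape, assumed slopes)."""
    distance = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = upper_slope if distance >= 0 else lower_slope
    return masker_db - offset_db - slope * abs(distance)


def is_masked(masker_hz, masker_db, probe_hz, probe_db):
    """True if the probe component falls under the masker's shadow."""
    return probe_db < masked_threshold_db(masker_hz, masker_db, probe_hz)


# An 80 dB tone at 1 kHz against two 40 dB neighbours.
print(is_masked(1000, 80.0, 1100, 40.0))  # close in frequency
print(is_masked(1000, 80.0, 4000, 40.0))  # far away in frequency
```

The asymmetric slopes reflect the common observation that a masker hides higher-frequency neighbours more readily than lower-frequency ones.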
Temporal Masking
Just as tones cast shadows on their neighbors in the frequency domain, a sudden increase in
volume can mask quieter sounds that are temporally close. This phenomenon is known as
temporal masking. Interestingly, sounds that occur both after and before the volume increase can
be masked! Figure 5 illustrates a typical temporal masking scenario: events below the indicated
threshold will not be heard. The idea behind temporal masking is that humans also have trouble
hearing distinct sounds that are close to one another in time. For example, if a loud sound and a
quiet sound are played simultaneously, you won't be able to hear the quiet sound. If, however,
there is sufficient delay between the two sounds, you will hear the second, quieter sound. The
key to the success of temporal masking is in determining (quantifying) the length of time
between the two tones at which the second tone becomes audible, i.e., significant enough to keep
it in the bitstream rather than throwing it away. This distance, or threshold, turns out to be around
five milliseconds when working with pure tones, though it varies up and down in accordance
with different audio passages.
6. Spectral Analysis
Of the three masking phenomena explained above, two are best described in the frequency
domain. Thus, a frequency domain representation, also called the “spectrum” of a signal, is a
useful tool for analyzing the signal’s frequency characteristics and determining thresholds. There
are several different techniques for converting a finite time sequence into its spectral
representation, and these typically fall into one of two categories: transforms and filter banks.
Transforms calculate the spectrum of their inputs in terms of a set of basis sequences; e.g., the
Fourier Transform uses basis sequences that are complex exponentials. Filter banks apply
several different band pass filters to the input. Typically the result is several time sequences,
each of which corresponds to a particular frequency band. Taking the spectrum of a signal has
two purposes:
• To derive the masking thresholds in order to determine which portion of the signal
can be dropped.
• To generate a representation of the signal to which the masking threshold can be
applied.
Some compression schemes use different techniques for these two tasks.
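The transform route can be illustrated with a direct discrete Fourier transform, written out so that the complex-exponential basis sequences are visible. This is a minimal sketch on a synthetic 64-sample tone (an assumption for the example); a fast Fourier transform computes the same result more efficiently.

```python
import cmath
import math


def dft(x):
    """Direct DFT: project x onto complex-exponential basis sequences."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


N = 64
TONE_BIN = 5  # the test tone completes 5 cycles in the block (assumption)
tone = [math.cos(2 * math.pi * TONE_BIN * t / N) for t in range(N)]

spectrum = dft(tone)
# Search only the non-redundant half of the spectrum of a real signal.
peak_bin = max(range(N // 2 + 1), key=lambda k: abs(spectrum[k]))

print(peak_bin, round(abs(spectrum[peak_bin]), 3))
```

The energy of the tone collects in a single bin, which is what makes per-frequency thresholds easy to apply in this representation.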
The most popular transform in signal processing is the Fast Fourier Transform (FFT).
Given a finite time sequence, the FFT produces a complex-valued frequency domain representation.
Encoders often use FFTs as a first step toward determining masking thresholds. Another popular
transform is the Discrete Cosine Transform (DCT), which outputs a real-valued frequency domain
representation. Both the FFT and the DCT suffer from distortion when transforms are taken from
contiguous blocks of time data. To solve this problem, inputs and outputs can be overlapped and
windowed in such a way that, in the absence of lossy compression techniques, entire time signals can
be perfectly reconstructed. For this reason, most transform-based encoding schemes employ an
overlapped and windowed DCT known as the Modified Discrete Cosine Transform (MDCT).
Some compression algorithms that use the MDCT are MPEG-1 layer-III, MPEG-2 AAC, and
Dolby AC-3.
Filter banks pass a block of time samples through several band pass filters to generate
different signals corresponding to different sub-bands in frequency. After filtering, masking
thresholds can be applied to each sub-band. Two popular filter bank structures are the poly-phase
filter bank and the wavelet filter bank. The poly-phase filter bank uses parallel band pass filters of
equal width whose outputs are down-sampled to create one (shorter) signal per sub-band. In the
absence of lossy compression techniques, a decoder can achieve perfect reconstruction by up-
sampling, filtering, and adding each sub-band. This type of structure is used in all of the MPEG-1
audio encoders.
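The perfect-reconstruction property of the overlapped, windowed MDCT described above can be checked numerically. The sketch below is a direct (slow) implementation using a sine window and 50% overlap; the block size, window choice, zero-padding at the signal edges, and function names are assumptions made for the example, not details of any standard.

```python
import math


def sine_window(length):
    """Sine window; satisfies w[n]^2 + w[n + length/2]^2 = 1."""
    return [math.sin(math.pi / length * (n + 0.5)) for n in range(length)]


def mdct(block):
    """2N windowed time samples -> N real coefficients."""
    n_half = len(block) // 2
    shift = 0.5 + n_half / 2
    return [sum(block[n] * math.cos(math.pi / n_half * (n + shift) * (k + 0.5))
                for n in range(2 * n_half))
            for k in range(n_half)]


def imdct(coeffs):
    """N coefficients -> 2N time samples that still contain time-domain aliasing."""
    n_half = len(coeffs)
    shift = 0.5 + n_half / 2
    return [2.0 / n_half * sum(coeffs[k] * math.cos(math.pi / n_half * (n + shift) * (k + 0.5))
                               for k in range(n_half))
            for n in range(2 * n_half)]


def mdct_round_trip(signal, n_half):
    """Analyse and resynthesise with 50% overlap-add; aliasing cancels between blocks."""
    if len(signal) % n_half:
        raise ValueError("signal length must be a multiple of n_half")
    window = sine_window(2 * n_half)
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        block = [padded[start + n] * window[n] for n in range(2 * n_half)]
        rebuilt = imdct(mdct(block))
        for n in range(2 * n_half):
            output[start + n] += rebuilt[n] * window[n]
    return output[n_half:n_half + len(signal)]


signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
rebuilt = mdct_round_trip(signal, n_half=8)
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))
print(max_error < 1e-9)
```

Each block of 2N samples yields only N coefficients, yet the overlap between neighbouring blocks lets the decoder cancel the aliasing and recover the input, provided nothing is discarded in between.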
The purpose of this section is to discuss some existing standards in digital audio compression, in
particular the MPEG-1 layer III. Features of interest for each standard include which compression
techniques are used, special details or unique characteristics, and target applications.
7.1.1 History
In 1987, the Fraunhofer IIS started to work on perceptual audio coding in the framework of
the EUREKA project EU147, Digital Audio Broadcasting (DAB). In a joint cooperation with the
University of Erlangen (Prof. Dieter Seitzer), the Fraunhofer IIS finally devised a very powerful
algorithm that is standardized as ISO-MPEG Audio Layer-3 (IS 11172-3 and IS 13818-3).
Without data reduction, you end up with more than 1.4 Mbit to represent just one second of stereo music in CD quality. By
using MPEG audio coding, you may shrink down the original sound data from a CD by a factor of
12, without losing sound quality. Basically, this is realized by perceptual coding techniques
addressing the perception of sound waves by the human ear.
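The 1.4 Mbit figure follows from the standard CD parameters, which are assumed here because they are not spelled out in the text: 44.1 kHz sampling, 16 bits per sample, two channels.

```python
SAMPLE_RATE_HZ = 44_100   # CD sampling rate (assumed)
BITS_PER_SAMPLE = 16      # CD sample resolution (assumed)
CHANNELS = 2              # stereo

cd_bitrate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
compressed_bitrate_bps = cd_bitrate_bps / 12  # the factor of 12 quoted above

print(cd_bitrate_bps)                        # bits per second, uncompressed
print(round(compressed_bitrate_bps / 1000))  # kbit/s after a 12:1 reduction
```

A 12:1 reduction lands a little under 120 kbit/s, close to the 128 kbps rate commonly used for MP3 encoding.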
By exploiting stereo effects and by limiting the audio bandwidth, the coding schemes may
achieve an acceptable sound quality at even lower bit rates. MPEG Layer-3 is the most powerful
member of the MPEG audio coding family. For a given sound quality level, it requires the lowest bit
rate or for a given bit rate, it achieves the highest sound quality.
MP3 uses two compression techniques to achieve its size reduction ratios over uncompressed
audio: one lossy and one lossless. First it throws away what humans can't hear anyway (or at least it
makes acceptable compromises), and then it encodes the redundancies to achieve further
compression. However, it's the first part of the process that does most of the grunt work and requires
most of the complexity.
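The lossless second stage can be illustrated with a generic Huffman coder. This is a textbook construction built on Python's `heapq`, offered as a sketch only: MP3 itself relies on predefined code tables rather than building a tree per file, and the quantized values below are made up for the example.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code that gives frequent symbols shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie_breaker, {symbol: code_so_far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes_a.items()}
        merged.update({s: "1" + c for s, c in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    inverse = {bit_string: sym for sym, bit_string in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix property: first match is a full symbol
            decoded.append(inverse[current])
            current = ""
    return decoded


# Hypothetical quantized values: mostly zeros, as is typical after quantization.
quantized = [0, 0, 0, 0, 1, 0, -1, 0, 0, 2, 0, 0, 1, 0, 0, 0]
code = huffman_code(quantized)
bits = encode(quantized, code)

print(len(bits), "bits instead of", 2 * len(quantized))
print(decode(bits, code) == quantized)
```

Nothing is discarded at this stage: decoding returns exactly the values that went in, only stored in fewer bits.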
Perceptual codecs are highly complex beasts, and all of them work a little differently.
However, the general principles of perceptual coding remain the same from one codec to the next. In
brief, the MP3 encoding process can be subdivided into a handful of discrete tasks (not necessarily
in this order):
• Break the signal into smaller component pieces called "frames," each typically lasting a
fraction of a second. You can think of frames much as you would the frames in a movie film.
• Analyze the signal to determine its "spectral energy distribution." In other words, on the
entire spectrum of audible frequencies, find out how the bits will need to be distributed to
best account for the audio to be encoded. Because different portions of the frequency
spectrum are most efficiently encoded via slight variants of the same algorithm, this step
breaks the signal into sub-bands, which can be processed independently for optimal results
(but note that all sub-bands use the same algorithm; they just allocate the number of bits differently,
as determined by the encoder).
• The encoding bitrate is taken into account, and the maximum number of bits that can be
allocated to each frame is calculated. For instance, if you're encoding at 128 kbps, you have
an upper limit on how much data can be stored in each frame (unless you're encoding with
variable bitrates, but we'll get to that later). This step determines how much of the available
audio data will be stored, and how much will be left on the cutting room floor.
• The frequency spread for each frame is compared to mathematical models of human
psychoacoustics, which are stored in the codec as a reference table. From this model, it can
be determined which frequencies need to be rendered accurately, since they'll be perceptible
to humans, and which ones can be dropped or allocated fewer bits, since we wouldn't be able
to hear them anyway. Why store data that can't be heard?
• The bitstream is run through the process of "Huffman coding," which compresses redundant
information throughout the sample. The Huffman coding does not work with a
psychoacoustic model, but achieves additional compression via more traditional means.
Thus, you can see the entire MP3 encoding process as a two-pass system: First you run all of
the psychoacoustic models, discarding data in the process, and then you compress what's left
to shrink the storage space required by any redundancies. This second step, the Huffman
coding, does not discard any data; it just lets you store what's left in a smaller amount of
space.
• The collection of frames is assembled into a serial bitstream, with header information
preceding each data frame. The headers contain instructional "meta-data" specific to that
frame.
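The per-frame bit limit in the third step above can be worked out directly. The sketch assumes the MPEG-1 Layer III convention of 1152 samples per frame (two 576-sample granules) and the usual frame-length rule of 144 * bitrate / sample rate bytes plus an optional padding byte; these figures come from the standard rather than from the text, and `frame_bytes` is a hypothetical helper.

```python
SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III frame length in samples (assumed)


def frame_bytes(bitrate_bps: int, sample_rate_hz: int, padding: int = 0) -> int:
    """Bytes available to one frame at a constant bitrate (header included)."""
    return (SAMPLES_PER_FRAME // 8) * bitrate_bps // sample_rate_hz + padding


def frame_duration_ms(sample_rate_hz: int) -> float:
    """Playing time covered by one frame."""
    return 1000 * SAMPLES_PER_FRAME / sample_rate_hz


size = frame_bytes(128_000, 44_100)
print(size, "bytes per frame,", size * 8, "bits")
print(round(frame_duration_ms(44_100), 2), "ms per frame")
```

Because 144 * 128000 / 44100 is not a whole number, a constant-bitrate stream alternates between unpadded and padded frames to hold the average rate.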
Along the way, many other factors enter into the equation, often as the result of options
chosen prior to beginning the encoding. In addition, algorithms for the encoding of an individual
frame often rely on the results of an encoding for the frames that precede or follow it. The entire
process usually includes some degree of simultaneity; the preceding steps are not necessarily run in
order.
* Fraunhofer IIS uses a non-ISO extension of MPEG Layer-3 for enhanced performance
(“MPEG 2.5”)
Filter Bank
The filter bank used in MPEG Layer-3 is a hybrid filter bank which consists of a poly-phase
filter bank and a Modified Discrete Cosine Transform (MDCT). This hybrid form was chosen for
reasons of compatibility to its predecessors, Layer-1 and Layer-2.
Perceptual Model
The perceptual model mainly determines the quality of a given encoder implementation. It uses
either a separate filter bank or combines the calculation of energy values (for the masking
calculations) and the main filter bank. The output of the perceptual model consists of values for the
masking threshold or the allowed noise for each coder partition. If the quantization noise can be kept
below the masking threshold, then the compression results should be indistinguishable from the
original signal.
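The link between allowed noise and quantization can be sketched with the standard uniform-quantizer model, in which the average noise power is about step^2 / 12. The model, the sample data, and the function names are assumptions for illustration; a real encoder works per coder partition with a non-uniform quantizer.

```python
import math


def max_step_for_noise(allowed_noise_power: float) -> float:
    """Largest uniform step whose modelled noise (step^2 / 12) meets the allowance."""
    return math.sqrt(12.0 * allowed_noise_power)


def quantize(values, step):
    """Round each value to the nearest multiple of step."""
    return [step * round(v / step) for v in values]


def noise_power(original, quantized):
    """Mean squared quantization error."""
    return sum((a - b) ** 2 for a, b in zip(original, quantized)) / len(original)


allowed = 4.0  # hypothetical allowed noise power for one partition
step = max_step_for_noise(allowed)

coefficients = [100.0 * math.sin(0.37 * n) for n in range(2000)]  # toy data
measured = noise_power(coefficients, quantize(coefficients, step))

print(round(step, 3), round(measured, 3))
```

The step^2 / 12 figure is an average that holds when the signal is large compared with the step; the hard guarantee is only that no single error exceeds half a step.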
Joint Stereo
Joint stereo coding takes advantage of the fact that both channels of a stereo channel pair contain
largely the same information. These stereophonic irrelevancies and redundancies are exploited to reduce
the total bit rate. Joint stereo is used in cases where only low bit rates are available but stereo signals
are desired.
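One common form of joint stereo is mid/side coding, sketched below with a simple sum-and-difference matrix. The scaling by 1/2 and the sample values are assumptions for the example; the point is only that similar channels produce a small side signal that is cheap to encode.

```python
def to_mid_side(left, right):
    """Replace left/right with their average (mid) and half-difference (side)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def from_mid_side(mid, side):
    """Invert the matrix: left = mid + side, right = mid - side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Two nearly identical channels (hypothetical sample values).
left = [100, 102, 98, 101]
right = [99, 103, 97, 100]

mid, side = to_mid_side(left, right)
print(side)  # small values compared with the channels themselves
print(from_mid_side(mid, side) == (left, right))
```

The transform itself loses nothing; the saving comes from spending far fewer bits on the low-energy side channel.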
gain to result in a larger quantization step sizes until the resulting bit demand for
Huffman coding is small enough.
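The loop this sentence describes, enlarging the quantization step until the coded block fits its bit budget, can be sketched as follows. The bit-cost function is a crude stand-in for a real Huffman table lookup, and the step growth factor of 2^(1/4) per iteration, the coefficients, and the budget are all assumptions for illustration.

```python
def bits_needed(quantized):
    """Crude stand-in for Huffman bit demand: one sign bit plus magnitude bits."""
    return sum(1 + abs(v).bit_length() for v in quantized)


def rate_loop(coefficients, bit_budget, step=1.0, growth=2 ** 0.25):
    """Grow the quantization step until the block fits within bit_budget."""
    if bit_budget < len(coefficients):
        raise ValueError("budget too small even for an all-zero block")
    while True:
        quantized = [int(round(c / step)) for c in coefficients]
        if bits_needed(quantized) <= bit_budget:
            return step, quantized
        step *= growth  # larger step -> smaller quantized values -> fewer bits


coefficients = [40.0, -23.5, 12.2, 5.1, -3.3, 1.2, 0.4, -0.1]  # toy spectrum
step, quantized = rate_loop(coefficients, bit_budget=24)

print(round(step, 3), quantized, bits_needed(quantized))
```

Larger steps shrink the quantized values, and smaller values receive shorter codes, so the loop always terminates once enough precision has been given up.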
The great bulk of the work in the MP3 system as a whole is placed on the encoding process.
Since one typically plays files more frequently than one encodes them, this makes sense. Decoders
do not need to store or work with a model of human psychoacoustic principles, nor do they require a
bit allocation procedure. All the MP3 player has to worry about is examining the bitstream of header
and data frames for spectral components and the side information stored alongside them, and then
reconstructing this information to create an audio signal. The player is nothing but an (often) fancy
interface onto your collection of MP3 files and playlists and your sound card, encapsulating the
relatively straightforward rules of decoding the MP3 bitstream format.
While there are measurable differences in the efficiency, and audible differences in the
quality, of various MP3 decoders, the differences are largely negligible on computer hardware
manufactured in the last few years. That's not to say that decoders just sit in the background
consuming no resources. In fact, on some machines and some operating systems you'll notice a slight
(or even pronounced) sluggishness in other operations while your player is running. This is
particularly true on operating systems that don't feature a finely grained threading model, such as
MacOS and most versions of Windows. Linux and, to an even greater extent, BeOS are largely
exempt from MP3 skipping problems, given decent hardware. And of course, if you're listening to
MP3 audio streamed over the Internet, you'll get skipping problems if you don't have enough
bandwidth to handle the bitrate/sampling frequency of the stream.
Some MP3 decoders chew up more CPU time than others, but the differences between them
in terms of efficiency are not as great as the differences between their feature sets, or between the
efficiency of various encoders. Choosing an MP3 player becomes a question of cost, extensibility,
audio quality, and appearance.
Today's music technologies have turned passive listeners into active participants who can
capture, record, transform, edit, and save their music in a variety of digital formats. An emerging
technology that can significantly reduce the size of digital music files while maintaining their
original sound quality is mp3PRO.
A coding scheme for compressing audio signals, MPEG reduces the size of audio files
using three coding schemes or layers. The third layer, commonly known as MP3, uses audio coding
and psychoacoustic compression to remove the information or sounds that can't be perceived by the
human ear. The size of the original sound recording is subsequently reduced by a factor of 12
without sacrificing sound quality.
Music compressed with MP3 is very similar to the original. However, when you start to
reduce the bit rate — thereby reducing the file size — the music begins to sound dull. In addition, a
3-minute, satisfactory quality MP3 song takes about 15 minutes to download using a 56K modem.
The solution is spectral band replication (SBR). Developed by Coding Technologies, this
technology maintains the sound quality of a digital music file while reducing the bit rate. The
resulting audio format, known as mp3PRO, is composed of two components: the mp3 part for the
low frequencies and the SBR or "PRO" part for the high frequencies. Since the "PRO" part requires
only a few kbps, the format could be designed so that it remains compatible with the original mp3
format. In addition, existing mp3 players can be used to play mp3PRO files. They simply ignore the
PRO part.
It takes tremendous computing power to encode and decode music, especially to MP3 or
mp3PRO files. The speed at which these tasks can be accomplished is tied to both the speed of a
processor and whether the application being used is optimized for a specific processor.
9. Conclusion
By eliminating audio information that the human ear cannot detect, modern audio
coding standards are able to compress a typical 1.4 Mbps signal by a factor of about twelve.
This is done by employing several different methodologies, including noise allocation
techniques based on psychoacoustic models.
Future goals for the field of audio compression are quite broad. Several initiatives are
focused on establishing a format for digital encryption (watermarking) to protect copyrighted
audio content. Improvements in psychoacoustic models are expected to drive bit rates lower.
Finally, entirely new avenues are being explored in an effort to compress audio based on how
it is produced rather than how it is perceived. This last approach was integral in the
development of the MPEG-4 standard.
References