
Speech-Coding Techniques

Chapter 3
Introduction
 Efficient speech-coding techniques
 Advantages for VoIP
 Digital streams of ones and zeros
 The lower the bandwidth, the lower the quality
 RTP payload types
 Processing power
  Better quality (for a given bandwidth) requires a
more complex algorithm
 A balance between quality and cost

Internet Telephony 3-2


Voice Quality
 Bandwidth is easily quantified
 Voice quality is subjective
 MOS, Mean Opinion Score
 ITU-T Recommendation P.800
 Excellent – 5
 Good – 4
 Fair – 3
 Poor – 2
 Bad – 1
 A minimum of 30 people
  Listen to voice samples or take part in conversations



 P.800 recommendations
 The selection of participants
 The test environment
 Explanations to listeners
 Analysis of results
 Toll quality
 A MOS of 4.0 or higher
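The MOS arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names and the sample ratings are made up, and only the 1 to 5 scale, the 30-listener minimum and the 4.0 toll-quality threshold come from the slides.

```python
from statistics import mean

# Values taken from the slides; the helper names are hypothetical.
MIN_LISTENERS = 30
TOLL_QUALITY_MOS = 4.0

def mean_opinion_score(ratings):
    """Average integer ratings on the 1 (Bad) to 5 (Excellent) scale."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("need at least 30 ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)

def is_toll_quality(mos):
    """Toll quality means a MOS of 4.0 or higher."""
    return mos >= TOLL_QUALITY_MOS

# Made-up panel of 30 listeners: 10 Excellent, 15 Good, 5 Fair.
ratings = [5] * 10 + [4] * 15 + [3] * 5
mos = mean_opinion_score(ratings)
print(round(mos, 2), is_toll_quality(mos))
```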



 Subjective and objective quality-testing
techniques
 PSQM – Perceptual Speech Quality
Measurement
 ITU-T P.861
 faithfully represent human judgement and
perception
  algorithmic comparison between the output signal
and a known input
 type of speaker, loudness, delay, active/silence
frames, clipping, environmental noise



A Little About Speech
 Speech
 Air pushed from the lungs past the vocal cords
and along the vocal tract
 The basic vibrations – vocal cords
 The sound is altered by the disposition of the
vocal tract (tongue and mouth)
 Model the vocal tract as a filter
 The shape changes relatively slowly
 The vibrations at the vocal cords
 The excitation signal



Speech sounds
 Voiced sound
 The vocal cords vibrate open and close
 Interrupt the air flow
  Quasi-periodic pulses of air
 The rate of the opening and closing – the pitch
 A high degree of periodicity at the pitch period
 2-20 ms
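The periodicity of voiced sound is what a pitch estimator exploits. The sketch below is a naive autocorrelation search over lags corresponding to the 2-20 ms range; the function name, the 8 kHz rate and the synthetic test tone are assumptions for illustration, not part of the slides. Real pitch trackers are considerably more careful (for example about peaks at multiples of the true period).

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed telephony sampling rate

def estimate_pitch_period(samples, min_ms=2.0, max_ms=20.0):
    """Return the lag (in samples) with the largest autocorrelation,
    searching only lags that correspond to a 2-20 ms pitch period."""
    min_lag = int(SAMPLE_RATE * min_ms / 1000)   # 16 samples
    max_lag = int(SAMPLE_RATE * max_ms / 1000)   # 160 samples
    span = len(samples) - max_lag                # same number of terms per lag
    if span <= 0:
        raise ValueError("need more than max_lag samples")
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[n] * samples[n + lag] for n in range(span))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Synthetic stand-in for voiced speech: an 80 Hz tone (12.5 ms period), 70 ms long.
tone = [math.sin(2 * math.pi * 80 * n / SAMPLE_RATE) for n in range(560)]
lag = estimate_pitch_period(tone)
print(lag, "samples ->", SAMPLE_RATE / lag, "Hz")
```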



  Figure: voiced speech and its power spectral density



 Unvoiced sounds
 Forcing air at high velocities through a constriction
 The glottis is held open
 Noise-like turbulence
 Show little long-term periodicity
 Short-term correlations still present



  Figure: unvoiced speech and its power spectral density



 Plosive sounds
 A complete closure in the vocal tract
 Air pressure is built up and released suddenly
 A vast array of sounds
 The speech signal is relatively predictable over
time
 The reduction of transmission bandwidth can be
significant



Voice Sampling
 A-to-D
 discrete samples of the waveform and represent
each sample by some number of bits
  A signal can be reconstructed if it is sampled at a
minimum of twice its maximum frequency
 Human speech
 300-3800 Hz
 8000 samples per second
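The sampling arithmetic can be checked with a short Python sketch. The function names are made up for illustration; the 8-bit and 12-bit sample sizes are the ones quoted elsewhere in these slides.

```python
def nyquist_rate(max_freq_hz):
    """Minimum sampling rate needed to reconstruct a band-limited signal."""
    return 2 * max_freq_hz

def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of a stream of fixed-size samples, in bit/s."""
    return sample_rate_hz * bits_per_sample

print(nyquist_rate(3800))      # 7600, so 8000 samples/s leaves some margin
print(pcm_bit_rate(8000, 8))   # 64000
print(pcm_bit_rate(8000, 12))  # 96000
```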

Each sample is encoded into an 8-bit PCM code word (e.g. 01100101) => 8000 x 8 bit/s = 64 kbit/s

Quantization
  How many bits are used to represent each sample
 Quantization noise
  The difference between the actual level of the
input analog signal and its quantized value
 More bits to reduce
 Diminishing returns
 Uniform quantization levels
 Louder talkers sound better
 11.2/11 v.s. 2.2/2



 Non-uniform quantization
 Smaller quantization steps at smaller signal levels
 Spread signal-to-noise ratio more evenly
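One common way to get smaller steps at small signal levels is companding: compress the sample, quantize uniformly, then expand. The sketch below uses the continuous mu-law formula with mu = 255; it is an illustration only and not the segmented 8-bit encoding that G.711 actually specifies. The function names and the test amplitude are assumptions.

```python
import math

MU = 255  # mu-law parameter; assumed value for this sketch

def mu_law_compress(x):
    """Map x in [-1, 1] to [-1, 1], stretching small amplitudes."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

def code_sample(x, companded, levels=256):
    """Quantize x with 256 uniform levels, optionally inside a compander."""
    step = 2.0 / levels
    y = mu_law_compress(x) if companded else x
    y_q = step * round(y / step)
    return mu_law_expand(y_q) if companded else y_q

quiet = 0.01  # a low-level sample, full scale being 1.0
err_uniform = abs(code_sample(quiet, False) - quiet) / quiet
err_companded = abs(code_sample(quiet, True) - quiet) / quiet
print(f"uniform: {err_uniform:.1%}  companded: {err_companded:.1%}")
```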



DTX and Comfort Noise
 DTX is Discontinuous Transmission
 Voice activity detector (VAD) detects if there is
active speech or not.
 When there is no active speech different DTX
procedures can be used:
 No Transmission at all
 Comfort Noise (CN) using RFC 3389
  Codec built-in CN, such as the AMR SID (Silence Descriptor)
 Frequency of Comfort Noise packets varies but
is usually some fraction of normal packet rate
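The interplay of VAD and DTX can be sketched with a toy energy detector. Everything here is an assumption for illustration: the energy threshold, the comfort-noise interval and the function names are invented, and real VADs (such as the one in G.729 Annex B) look at several parameters rather than energy alone.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def dtx_decisions(frames, threshold=1e-4, cn_interval=8):
    """Yield 'SPEECH', 'CN' or 'NONE' for each frame.

    threshold and cn_interval are illustrative values, not from any standard.
    """
    silent_run = 0
    for frame in frames:
        if frame_energy(frame) >= threshold:
            silent_run = 0
            yield "SPEECH"
        else:
            # Send a comfort-noise update when silence starts, then once
            # every cn_interval silent frames; otherwise send nothing.
            yield "CN" if silent_run % cn_interval == 0 else "NONE"
            silent_run += 1

# Two active frames followed by ten silent ones (160 samples each).
frames = [[0.5] * 160] * 2 + [[0.0] * 160] * 10
decisions = list(dtx_decisions(frames))
print(decisions)
```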



Types of Speech Coders
 Waveform codecs
 Sample and code
 High-quality and not complex
 Large amount of bandwidth
  Source codecs (vocoders)
 Match the incoming signal to a math model
 Linear-predictive filter model of the vocal tract
 A voiced/unvoiced flag for the excitation
 The information is sent rather than the signal
 Low bit rates, but sounds synthetic
 Higher bit rates do not improve much



 Hybrid codecs
 Attempt to provide the best of both
 Perform a degree of waveform matching
 Utilize the sound production model
 Quite good quality at low bit rate



G.711
 The most commonplace codec
 Used in circuit-switched telephone network
 PCM, Pulse-Code Modulation
 If uniform quantization
  12 bits * 8 k samples/s = 96 kbps
 Non-uniform quantization
 64 kbps DS0 rate
 mu-law
 North America
 A-law
 Other countries, a little friendlier to lower signal levels
 An MOS of about 4.3



DPCM
 DPCM, Differential PCM
  Only transmit the difference between the predicted value and
the actual value
  Voice changes relatively slowly
  It is possible to predict the value of a sample based on the
values of previous samples
  The receiver performs the same prediction
 The simplest form
 No prediction
 No algorithmic delay
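The simplest form of DPCM can be sketched as sending each sample's difference from the previous one. This is a minimal illustration with made-up function names and sample values; it leaves the differences unquantized, so it shows only the structure and not the bit savings.

```python
def dpcm_encode(samples):
    """Send each sample as its difference from the previous sample."""
    previous = 0
    diffs = []
    for s in samples:
        diffs.append(s - previous)
        previous = s
    return diffs

def dpcm_decode(diffs):
    """Rebuild the samples by accumulating the received differences."""
    previous = 0
    samples = []
    for d in diffs:
        previous += d
        samples.append(previous)
    return samples

pcm = [100, 104, 109, 111, 110, 106]
diffs = dpcm_encode(pcm)
print(diffs)   # [100, 4, 5, 2, -1, -4]
print(dpcm_decode(diffs) == pcm)
```

After the first value the differences are small, which is why they can be coded with fewer bits than the samples themselves.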



ADPCM

 ADPCM, Adaptive DPCM


 Predicts sample values based on
 Past samples
 Factoring in some knowledge of how speech varies over
time
 The error is quantized and transmitted
 Fewer bits required
 G.721
 32 kbps
 G.726
 A-law/mu-law PCM -> 16, 24, 32, 40 kbps
 An MOS of about 4.0 at 32 kbps
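The idea of quantizing the prediction error with an adapting step size can be sketched as follows. This is a toy coder, not the G.721 or G.726 algorithm: the predictor is simply the previous reconstructed sample, and the step-size rule, the constants and the function names are all assumptions. With 4 bits per sample at 8000 samples/s it would correspond to 32 kbps.

```python
def _update(predicted, step, code, levels):
    """State update shared by encoder and decoder so they stay in sync."""
    predicted += code * step
    # Grow the step after large codes, shrink it after small ones.
    if abs(code) >= levels // 2:
        step *= 1.5
    else:
        step = max(1.0, step * 0.9)
    return predicted, step

def adpcm_encode(samples, bits=4, step=4.0):
    levels = 2 ** (bits - 1)   # codes lie in [-levels, levels - 1]
    predicted, codes = 0.0, []
    for s in samples:
        code = round((s - predicted) / step)
        code = max(-levels, min(levels - 1, code))
        codes.append(code)
        predicted, step = _update(predicted, step, code, levels)
    return codes

def adpcm_decode(codes, bits=4, step=4.0):
    levels = 2 ** (bits - 1)
    predicted, out = 0.0, []
    for code in codes:
        predicted, step = _update(predicted, step, code, levels)
        out.append(predicted)
    return out

samples = [0, 10, 30, 60, 90, 100, 95, 80]   # made-up input
codes = adpcm_encode(samples)
decoded = adpcm_decode(codes)
print(codes)
print([round(v, 1) for v in decoded])
```

Because the decoder runs the same update as the encoder, only the 4-bit codes need to be transmitted.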



Analysis-by-Synthesis (AbS) Codecs
 Hybrid codec
 Fill the gap between waveform and source codecs
 The most successful and commonly used
 Time-domain AbS codecs
 Not a simple two-state, voiced/unvoiced
 Different excitation signals are attempted
 Closest to the original waveform is selected
 MPE, Multi-Pulse Excited
 RPE, Regular-Pulse Excited
 CELP, Code-Excited Linear Predictive



G.728 LD-CELP
 CELP codecs
 A filter; its characteristics change over time
 A codebook of acoustic vectors
  A vector = a set of elements representing various
characteristics of the excitation
 Transmit
 Filter coefficients, gain, a pointer to the vector chosen
 Low Delay CELP
 Backward-adaptive coder
 Use previous samples to determine filter coefficients
 Operates on five samples at a time
 Delay < 1 ms
 Only the pointer is transmitted



 1024 vectors in the code book
 10-bit pointer (index)
 16 kbps
 LD-CELP encoder
 Minimize a frequency-weighted mean-square error
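The codebook search and the resulting bit rate can be sketched in Python. This is a simplified illustration: the codebook below is filled with random placeholder vectors, and the search minimizes a plain squared error on the vector itself, whereas the real encoder passes candidates through the synthesis filter and uses a frequency-weighted error. Names such as best_index are hypothetical.

```python
import random

VECTOR_LEN = 5         # samples per vector (from the slides)
CODEBOOK_SIZE = 1024   # vectors in the codebook (from the slides)
SAMPLE_RATE = 8000     # Hz

random.seed(0)
# Placeholder codebook; the real G.728 codebook is a designed table.
codebook = [[random.uniform(-1, 1) for _ in range(VECTOR_LEN)]
            for _ in range(CODEBOOK_SIZE)]

def best_index(target):
    """Return the index of the codebook vector closest to target."""
    def squared_error(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, target))
    return min(range(CODEBOOK_SIZE), key=lambda i: squared_error(codebook[i]))

index = best_index(codebook[37])                     # an exact match exists
bits_per_index = (CODEBOOK_SIZE - 1).bit_length()    # 10 bits
bit_rate = SAMPLE_RATE // VECTOR_LEN * bits_per_index
vector_ms = 1000 * VECTOR_LEN / SAMPLE_RATE          # duration of one vector
print(index, bits_per_index, bit_rate, vector_ms)
```

One 10-bit index per five samples gives 1600 indices per second, which is where the 16 kbps figure comes from; a five-sample vector lasts 0.625 ms, consistent with the sub-millisecond delay.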



 LD-CELP decoder

 An MOS score of about 3.9


 One-quarter of G.711 bandwidth



G.723.1 ACELP
 6.3 or 5.3 kbps
 Both mandatory
 Can change from one to another during a
conversation
 The coder
 A band-limited input speech signal
  Sampled at 8 kHz, 16-bit uniform PCM quantization
 Operate on blocks of 240 samples at a time
 A look-ahead of 7.5 ms
 A total algorithmic delay of 37.5 ms + other delays
 A high-pass filter to remove any DC component
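The algorithmic delay is the frame duration plus the look-ahead. A minimal sketch, with a hypothetical function name, checks the G.723.1 figure and the G.729 figure quoted later in these slides:

```python
def algorithmic_delay_ms(frame_samples, lookahead_ms, sample_rate_hz=8000):
    """Frame duration in milliseconds plus look-ahead."""
    return 1000 * frame_samples / sample_rate_hz + lookahead_ms

print(algorithmic_delay_ms(240, 7.5))  # G.723.1: 30 ms frame + 7.5 ms = 37.5
print(algorithmic_delay_ms(80, 5))     # G.729:   10 ms frame + 5 ms   = 15.0
```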



 Various operations to determine the appropriate
filter coefficients
 5.3 kbps, Algebraic Code-Excited Linear Prediction
 6.3 kbps, Multi-pulse Maximum Likelihood
Quantization
 The transmission
  Linear prediction coefficients
 Gain parameters
 Excitation codebook index
 24-octet frames at 6.3 kbps, 20-octet frames at 5.3 kbps



 G.723.1 Annex A
  Silence Insertion Descriptor (SID) frames of size
four octets
 The two lsbs of the first octet
  00: 6.3 kbps, 24 octets/frame
  01: 5.3 kbps, 20 octets/frame
  10: SID frame, 4 octets/frame
 An MOS of about 3.8
  At least 37.5 ms delay
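A receiver can tell the frame type, and therefore the frame length, from the two least significant bits of the first octet. The sketch below is illustrative only: the table and function names are made up, and the value 11 is not described in these slides, so it is rejected here.

```python
# Two LSBs of the first octet -> (description, frame size in octets)
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}

def frame_info(first_octet):
    """Classify a G.723.1 frame from its first octet."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        raise ValueError("frame type 11 is not covered in the slides")
    return FRAME_TYPES[code]

print(frame_info(0b10110100))  # ('6.3 kbps speech', 24)
print(frame_info(0b00000110))  # ('SID', 4)
```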



G.729
 8 kbps
  Input frames of 10 ms, 80 samples at an 8 kHz
sampling rate
 5 ms look-ahead
 Algorithmic delay of 15 ms
 An 80-bit frame for 10 ms of speech
 A complex codec
  G.729A (Annex A), with a number of simplifications
  Same frame structure
  G.729 and G.729A encoders and decoders can interoperate
 Slightly lower quality



  G.729B (Annex B)
 VAD, Voice Activity Detection
 Based on analysis of several parameters of the input
  The current frame plus the two preceding frames
 DTX, Discontinuous Transmission
 Send nothing or send an SID frame
 SID frame contains information to generate comfort
noise
 CNG, Comfort Noise Generation
 G.729, an MOS of about 4.0
 G.729A an MOS of about 3.7



 G.729 Annex D
 a lower-rate extension
  6.4 kbps; 10 ms speech frames, 64 bits/frame
  MOS comparable to that of 6.3 kbps G.723.1
 G.729 Annex E
 a higher bit rate enhancement
  the linear prediction filter of G.729 has 10 coefficients
  that of G.729 Annex E has 30 coefficients
 the codebook of G.729 has 35 bits
 that of G.729 Annex E has 44 bits
 118 bits/frame; 11.8 kbps
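The bit rates of the G.729 family follow directly from the number of bits in each 10 ms frame. A small sketch, with a hypothetical function name:

```python
def bit_rate_kbps(bits_per_frame, frame_ms):
    """Bits per millisecond is numerically the same as kbit/s."""
    return bits_per_frame / frame_ms

for name, bits in (("G.729", 80), ("G.729 Annex D", 64), ("G.729 Annex E", 118)):
    print(name, bit_rate_kbps(bits, 10), "kbps")
```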



Other Codecs
 CDMA QCELP defined in IS-733
 Variable-rate coder
 Two most common rates
 The high rate, 13.3 kbps
 A lower rate, 6.2 kbps
 Silence suppression
 For use with RTP, RFC 2658



 GSM Enhanced Full-Rate (EFR)
 GSM 06.60
 An enhanced version of GSM Full-Rate
 ACELP-based codec
 The same bit rate and the same overall packing
structure
 12.2 kbps
 Support discontinuous transmission
 For use with RTP, RFC 1890



 GSM Adaptive Multi-Rate (AMR) codec
 20 ms coding delay
 Eight different modes
 4.75 kbps to 12.2 kbps
 12.2 kbps, GSM EFR
 7.4 kbps, IS-641 (TDMA cellular systems)
 Change the mode at any time
 Offer discontinuous transmission
  The SID (Silence Descriptor) is sent every 8th frame
and is 5 bytes in size
 The coding choice of many 3G wireless networks



 The MOS values are for laboratory conditions
 G.711 does not deal with lost packets
 G.729 can accommodate a lost frame by
interpolating from previous frames
  But this causes errors in subsequent speech frames
 Processing Power
 G.728 or G.729, 40 MIPS
 G.726 10 MIPS



iLBC
 a FREE codec for robust VoIP
  13.33 kbps with an encoding frame length of 30 ms,
or 15.20 kbps with a frame length of 20 ms
  Computational complexity in the range of G.729A



Speex
 Open-source patent-free speech codec
 CELP (code-excited linear prediction) codec
 operating modes:
 narrowband (8 kHz sampling rate)
 2.15 – 24.6 kb/s
 delay of 30 ms
 wideband (16 kHz sampling rate)
 4-44.2 kb/s
 delay of 34 ms
 ultra-wideband (32 kHz sampling rate)
 intensity stereo encoding
 variable bit rate (VBR) possible
 voice activity detection (VAD)



 Cascaded Codecs
 E.g., G.711 stream -> G.729 encoder/decoder
 Might not even come close to G.729
  Each coder only generates an approximation of
the incoming signal
 Audio samples
  http://www.cs.columbia.edu/~hgs/audio/codecs.html



Effects of packetization
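One effect of packetization that is easy to quantify is header overhead. The sketch below assumes 40 bytes of headers per packet (20-byte IPv4, 8-byte UDP, 12-byte RTP, with no options or extensions) and ignores link-layer framing; those assumptions and the function name are not taken from the slide.

```python
HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP; an assumption of this sketch

def ip_bandwidth_kbps(codec_kbps, packet_ms):
    """Bandwidth at the IP layer for one direction of a voice stream."""
    payload_bytes = codec_kbps * packet_ms / 8   # kbit/s * ms = bits
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + HEADER_BYTES) * 8 * packets_per_second / 1000

print(ip_bandwidth_kbps(64, 20))  # G.711, 20 ms packets
print(ip_bandwidth_kbps(8, 20))   # G.729, 20 ms packets
print(ip_bandwidth_kbps(8, 40))   # G.729, 40 ms packets: less overhead, more delay
```

The lower the codec bit rate, the larger the share of bandwidth taken by headers, so longer packets save bandwidth at the cost of added delay.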



Tones, Signal, and DTMF Digits
 The hybrid codecs are optimized for human
speech
 Other data may need to be transmitted
 Tones: fax tones, dialing tone, busy tone
 DTMF digits for two-stage dialing or voice-mail
 G.711 is OK
  G.723.1 and G.729 can make them unintelligible
 The ingress gateway needs to intercept
 The tones and DTMF digits
 Use an external signaling system



 Easy at the start of a call
 Difficult in the middle of a call
 Encode the tones differently from the speech
 Send them along the same media path
 An RTP packet provides the name of the tone and the
duration
 Or, a dynamic RTP profile; an RTP packet containing the
frequency, volume and the duration
 RFC 2198
 An RTP payload format for redundant audio data
 Sending both types of RTP payload



 RTP Payload Format for DTMF Digits
 An Internet Draft
 Both methods described before
 A large number of tones and events
 DTMF digits, a busy tone, a congestion tone, a ringing
tone, etc.
 The named events
 E: the end of the tone, R: reserved



 Payload format
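The slide's figure is not reproduced here. As a sketch, the named-event payload can be packed as four octets, assuming the layout later standardized in RFC 2833: an 8-bit event code, the E bit, the R bit, a 6-bit volume and a 16-bit duration in timestamp units. The function names are hypothetical.

```python
import struct

def pack_event(event, end, volume, duration):
    """Build a 4-octet named-event payload (layout assumed from RFC 2833)."""
    if not (0 <= event <= 255 and 0 <= volume <= 63 and 0 <= duration <= 0xFFFF):
        raise ValueError("field out of range")
    # E bit in the top position, R bit left at 0, then the 6-bit volume.
    flags = (0x80 if end else 0) | volume
    return struct.pack("!BBH", event, flags, duration)

def unpack_event(payload):
    """Return (event, end, volume, duration) from a 4-octet payload."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return event, bool(flags & 0x80), flags & 0x3F, duration

# DTMF digit 5, final packet of the tone, volume 10, 800 timestamp units
# (100 ms at an 8000 Hz clock).
payload = pack_event(5, True, 10, 800)
print(payload.hex())          # 058a0320
print(unpack_event(payload))  # (5, True, 10, 800)
```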

