
Speech Coders for Wireless Communication

Digital representation of the speech waveform


x(t) --> [Sampler] --> x(n) = x(nT) --> [Quantizer] --> x^(n)

continuous-time,       discrete-time,          discrete-time,
continuous-amplitude   continuous-amplitude    discrete-amplitude

Courtesy: Communication Networks Research (CNR) Lab. EECS, KAIST

Three acoustic signals

                        Telephone speech   Wideband speech   Wideband audio
  Frequency range       300-3,400 Hz*      50-7,000 Hz       10-20,000 Hz
  Sampling rate         8 kHz              16 kHz            48 kHz
  PCM bits per sample   8                  14                16
  PCM bit rate          64 kb/s            224 kb/s          768 kb/s

* Bandwidth in Europe; 200-3,200 Hz in the United States and Japan
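The PCM bit rates in the table follow directly from sampling rate times bits per sample; a quick check in plain Python, with the values taken from the table above:

```python
# PCM bit rate = sampling rate (Hz) * bits per sample.
# Values are taken from the table above.
signals = {
    "telephone speech": (8_000, 8),    # 300-3,400 Hz band
    "wideband speech":  (16_000, 14),  # 50-7,000 Hz band
    "wideband audio":   (48_000, 16),  # 10-20,000 Hz band
}

rates_kbps = {name: fs * bits / 1000 for name, (fs, bits) in signals.items()}
print(rates_kbps)  # {'telephone speech': 64.0, 'wideband speech': 224.0, 'wideband audio': 768.0}
```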

Frequency response of telephone transmission channel

[Figure: frequency response of the telephone transmission channel]

[Figure: bandwidth comparison — telephone speech 0-4 kHz, wideband speech 0-7 kHz, CD-quality music 0-20 kHz]

Talk --> A/D --> Encoder (compress, record/store) --> Storage --> Decoder (decompress, play) --> D/A --> Listen

Source coding techniques

Waveform coding
  The reconstructed signal matches the original signal as closely as possible. Robust over a wide range of speakers and noisy environments.

Vocoder
  Parametric coding based on the quasi-stationary model of speech production.

Hybrid coding
  A combination of waveform coding and the vocoder approach.

Hybrid coders

Multi-Pulse Excitation (MPE)
  Efficient at medium bit rates. A sequence of nonuniformly spaced pulses serves as the excitation signal; the pulse amplitudes and positions are the excitation parameters.

Regular-Pulse Excitation (RPE)
  Efficient at medium bit rates. A sequence of uniformly spaced pulses serves as the excitation signal; the position of the first pulse within a vector and the pulse amplitudes are the excitation parameters.

Code-Excited Linear Prediction (CELP)
  Efficient at low bit rates (below 8 kbps). A codebook of excitation sequences; the two key issues are the design and the search of the codebook.

[Figure: Examples of excitations — a) multipulse: pulses with amplitudes g_k at nonuniform positions n_k; b) regular-pulse: uniformly spaced pulses; c) code-excited linear prediction: a codebook of N codevectors, with 2^M = N (M = transmission bits)]

Speech Compression Standards

  64   kbps      μ-law/A-law PCM (CCITT G.711)
  64   kbps      7 kHz Subband/ADPCM (CCITT G.722)
  32   kbps      ADPCM (CCITT G.721)
  16   kbps      Low-Delay CELP (CCITT G.728)
  13   kbps      RPE-LTP (GSM 06.10)
  12.2 kbps      ACELP (GSM 06.60)
  13   kbps      QCELP (US CDMA cellular)
  8    kbps      QCELP (US CDMA cellular)
  8    kbps      VSELP (US TDMA cellular)
  8    kbps      CS-ACELP (ITU G.729)
  6.7  kbps      VSELP (Japan digital cellular)
  6.4  kbps      IMBE (Inmarsat voice coding standard)
  5.3 & 6.3 kbps TrueSpeech coder (ITU G.723.1)
  4.8  kbps      CELP (Fed. Standard 1016, STU-3)
  2.4  kbps      LPC (Fed. Standard 1015, LPC-10e)

Performance of a speech codec

  Speech quality (SNR/SEGSNR, MOS, etc.)
  Bit rate (bits per second)
  Complexity (MIPS)
  Coding delay (msec)

Requirements of a speech codec for digital cellular

  More channel capacity
  Noise immunity
  Encryption
  Reasonable complexity and encoding delay

Vocoders

Anatomy of Speech Organs:

The source of most speech is the larynx. It contains two folds of tissue called the vocal folds or vocal cords, which can open and shut like a pair of fans. The gap between the vocal cords is called the glottis; as air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow. This process is known as phonation. The frequency of vibration determines the pitch of the voice: for a male it is typically in the range 50-200 Hz; for a female it can be up to 500 Hz.

[Figure: Glottal pulse — amplitude vs. time (ms), showing the opening phase, closing phase, and closure; period = 12.5 ms, so fundamental frequency = 1/0.0125 = 80 Hz. After Rosenberg, JASA 49, 1971]

[Figure: Spectrum of the glottal pulse — intensity vs. frequency (Hz); harmonics spaced at 80 Hz, corresponding to a pitch period of 12.5 ms]

[Figure: Spectrum of the glottal pulse filtered by the vocal tract — the same 80 Hz harmonics, now shaped by the vocal-tract resonances]

[Figure: vocal-tract spectra for the vowels /ee/, /ar/, /uu/]

Properties of Speech in Brief

Vowels ("oo" in blue, "o" in spot, "ee" in key, "e" in again)
  Quasi-periodic; relatively high signal power.

Consonants ("s" in spot, "k" in key)
  Non-periodic (random); relatively low signal power.

"Wrong"       -> /r/ /o/ /ng/
"Moving"      -> /m/ /uu/ /v/ /i/ /ng/
"Southampton" -> /s/ /ou/ /th/ /aa/ /m/ /p/ /t/ /a/ /n/

Digital speech model

A basic digital model for speech production:

  periodic signal gen. --+
                         +--> x Gain --> linear time-variant filter --> speech
  random signal gen.  ---+

Vocoder

Send three kinds of information to the receiver:
  (1) whether the signal is voiced or unvoiced,
  (2) if it is voiced, the period of the excitation signal,
  (3) the parameters of the prediction filter.

Vocoder

[Diagram: encoder — voice classification, pitch recognition, determination of the filter coefficients; decoder — excitation signal generator driving a digital filter]

LPC Introduction

These speech coders are called vocoders (voice coders).

Basic idea:
  Estimate parameters -> Encode parameters -> Transmit parameters -> Decode parameters -> Synthesize speech

They usually provide more bandwidth compression than is possible with waveform coding (2400-9600 bps).

Generalities

  LP model
  Parameter estimation
  Typical memory requirements

LP Model

  Impulse generator (pitch period) --+  voiced
                                     +--> V/UV switch --> x Gain --> all-pole filter --> speech signal
  White-noise generator ------------+  unvoiced

The all-pole filter combines the glottal filter, the vocal-tract filter, and the lip-radiation filter.
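A minimal numerical sketch of the model above — an impulse-train (voiced) or white-noise (unvoiced) excitation, scaled by a gain and passed through an all-pole filter. The filter coefficient and pitch value here are illustrative only, not taken from any standard:

```python
import numpy as np

def lp_synthesize(a, excitation, gain=1.0):
    """All-pole synthesis: s(n) = gain*e(n) + sum_k a[k] * s(n-1-k)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        s[n] = gain * excitation[n] + past
    return s

fs = 8000
n = np.arange(160)                                  # one 20 ms frame at 8 kHz
voiced = np.where(n % 100 == 0, 1.0, 0.0)           # 80 Hz impulse train
unvoiced = np.random.default_rng(0).standard_normal(len(n))  # white noise

a = [0.5]                                           # toy 1st-order all-pole filter
speech_voiced = lp_synthesize(a, voiced)
speech_unvoiced = lp_synthesize(a, unvoiced, gain=0.1)
```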

Parameter Estimation

For each frame:
  estimate the LP coefficients (a_i)
  estimate the gain
  estimate the type of excitation (voiced or unvoiced)
  estimate the pitch

V/UV Estimation

Several methods:
  energy of the signal
  zero-crossing rate
  autocorrelation coefficient

Speech Measurements (1)

Zero-crossing rate

Log energy:

  Es = 10 log10( (1/N) * sum_{n=0}^{N-1} s^2(n) )

Normalized autocorrelation coefficient:

  C1 = sum_{n=1}^{N-1} s(n) s(n-1) / sqrt( sum_{n=1}^{N-1} s^2(n) * sum_{n=0}^{N-2} s^2(n) )
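The three measurements can be sketched in a few lines of Python (NumPy assumed); the test signals below are a synthetic tone and noise standing in for voiced and unvoiced frames:

```python
import numpy as np

def zero_crossing_rate(s):
    # Fraction of adjacent sample pairs whose signs differ.
    return float(np.mean(np.sign(s[1:]) != np.sign(s[:-1])))

def log_energy(s):
    # Es = 10 * log10( (1/N) * sum s^2(n) )
    return 10 * np.log10(np.mean(s ** 2))

def c1(s):
    # Normalized autocorrelation coefficient at lag 1.
    num = np.sum(s[1:] * s[:-1])
    den = np.sqrt(np.sum(s[1:] ** 2) * np.sum(s[:-1] ** 2))
    return float(num / den)

fs = 8000
t = np.arange(240) / fs                       # 30 ms frame
voiced = np.sin(2 * np.pi * 100 * t)          # periodic, "voiced-like"
unvoiced = np.random.default_rng(0).standard_normal(len(t))

# Voiced frames: low ZCR, C1 near 1.  Unvoiced frames: high ZCR, C1 near 0.
```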

[Figure: comparison between actual data and voiced/unvoiced/silence (V/U/S) determination results]

Pitch Detection

Voiced sounds:
  produced by forcing air through the glottis
  vocal cords oscillate and modulate the air flow into quasi-periodic pulses
  the pulses excite resonances in the remainder of the vocal tract
  different sounds are produced as muscles change the shape of the vocal tract
  resonant frequencies = formant frequencies
  fundamental frequency (pitch) = rate of the pulses

Pitch Detection

[Figure: short sections of voiced speech (quasi-periodic) and unvoiced speech (noise-like) — amplitude vs. sample number]

Time-domain pitch estimation

  A well-studied area.
  Variations of the fundamental frequency are evident in the waveform.
  Time-domain speech processing should therefore be capable of detecting the pitch frequency.

[Figure: voiced speech waveform — amplitude vs. sample number]

Pitch Period Estimation Using the Autocorrelation Function

Periodic signals have a periodic autocorrelation function:

  R_n(k) = sum_{m=0}^{N-1-k} [x(n+m) w(m)] [x(n+m+k) w'(k+m)]

Basic problem in choosing the window length: speech changes over time (favoring small N), but the window must span at least two periods of the waveform.

Approaches:
  choose the window to catch the longest period
  adaptive N
  use a modified short-time autocorrelation function
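A sketch of autocorrelation-based pitch estimation (NumPy; a Hamming window plays the role of w(m), and the lag search is restricted to the 50-500 Hz pitch range mentioned earlier):

```python
import numpy as np

def acf_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate the pitch period as the lag of the autocorrelation peak."""
    w = frame * np.hamming(len(frame))                  # windowed section
    acf = np.correlate(w, w, mode="full")[len(w) - 1:]  # R(k) for k >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)             # lag search range
    k = lo + int(np.argmax(acf[lo:hi]))
    return k, fs / k                                    # lag (samples), F0 (Hz)

fs = 8000
t = np.arange(600) / fs
frame = np.sin(2 * np.pi * 80 * t)   # 80 Hz test tone standing in for voiced speech
lag, f0 = acf_pitch(frame, fs)       # expect a lag near 100 samples (~80 Hz)
```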

Pitch Period Estimation Using the Autocorrelation Function (Cont'd)

The autocorrelation representation retains too much of the information in the speech signal, so the autocorrelation function has many peaks.

[Figure: autocorrelation function of a voiced frame, showing many peaks]

Spectrum-flattening techniques

  Remove the effects of the vocal-tract transfer function.
  Center clipping — a nonlinear transformation whose clipping value depends on the maximum amplitude — leaves a strong peak at the pitch frequency.

[Figure: speech waveform, its center-clipped version, and the resulting autocorrelation with a strong peak at the pitch period]
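Center clipping itself is a simple transform; a sketch, with the clipping threshold set to 30% of the peak amplitude (an illustrative choice, not prescribed by the text):

```python
import numpy as np

def center_clip(s, ratio=0.3):
    """Zero all samples within +/- c of zero (c = ratio * max|s|)
    and shift the remaining samples toward zero by c."""
    c = ratio * np.max(np.abs(s))
    out = np.zeros_like(s, dtype=float)
    out[s > c] = s[s > c] - c
    out[s < -c] = s[s < -c] + c
    return out

s = np.array([0.1, 0.5, -0.2, 1.0, -0.9, 0.05])
clipped = center_clip(s)   # threshold c = 0.3; small samples become 0
```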

Fundamental Frequency

F0 estimation (Hess): determining the main period in a quasi-periodic waveform, usually via the autocorrelation function and the average magnitude difference function (AMDF):

  AMDF_t(m) = (1/N_p) * sum_n | s_t(n) - s_t(n-m) |,   0 <= n - m, n <= L - 1

where L is the frame length and N_p is the number of point pairs. A peak in the ACF and a valley in the AMDF indicate F0.
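An AMDF sketch in NumPy; for a periodic frame, the deepest valley (ignoring very small lags) falls at the pitch period:

```python
import numpy as np

def amdf(frame, max_lag):
    """Average magnitude difference function over lags 1..max_lag."""
    vals = np.zeros(max_lag + 1)
    for m in range(1, max_lag + 1):
        vals[m] = np.mean(np.abs(frame[m:] - frame[:-m]))
    return vals

fs = 8000
t = np.arange(400) / fs
frame = np.sin(2 * np.pi * 80 * t)      # 80 Hz tone, period = 100 samples
d = amdf(frame, 160)
period = 20 + int(np.argmin(d[20:]))    # skip the trivially small lags near 0
f0 = fs / period
```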

Typical Memory Requirements

  Pitch coefficient (6 bits)
  Gain (5 bits)
  Model parameters:
    LP coefficients (8-10 bits): small changes in the LP coefficients produce large changes in the pole positions, and the distortion is large when |r_k| is near 1.
    Reflection coefficients (6 bits): instead, use a non-linear transformation of the reflection coefficients that expands the scale near |r_k| = 1 — the Log-Area Ratio.
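The log-area-ratio transform referred to above is commonly written LAR = log((1 + r)/(1 - r)); a sketch with its inverse (base-10 logarithms here, an illustrative choice):

```python
import math

def lar(r):
    """Log-area ratio of a reflection coefficient r, |r| < 1.
    Nearly linear around r = 0; expands the scale as |r| -> 1."""
    return math.log10((1 + r) / (1 - r))

def lar_inverse(g):
    t = 10.0 ** g
    return (t - 1) / (t + 1)

# Equal steps in r near |r| = 1 map to ever larger LAR steps:
steps = [lar(0.9), lar(0.99), lar(0.999)]
```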

The main difference among LP vocoders is how the source of excitation is calculated.

LPC-10

Encoder:
  Speech signal -> ADC (8 kHz) -> window (180 samples) -> AMDF and zero-crossing analysis (pitch, voicing); LP analysis (covariance method) -> non-linear warping of the reflection coefficients to LARs.
  Transmitted parameters: pitch period (7 bits), voiced/unvoiced switch (1 bit), gain (5 bits), 10 reflection coefficients coded as LARs (5 bits for one and 4 bits for the others).

Decoder:
  The pitch period (7 bits) drives an impulse generator; the voiced/unvoiced switch (1 bit) selects between the impulse generator and a white-noise generator; the excitation is scaled by the gain (5 bits) and passed through the synthesis filter 1/A(z), built from the 10 reflection coefficients, to produce the synthesized speech signal.

RELP

A simple vocoder offers poor sound quality and is usually unsatisfactory.

An improvement is to use the prediction error, rather than the periodic pulse train (for voiced signals) or random noise (for unvoiced signals), to excite the digital filter that reproduces the speech. The prediction error is also called the residual. This scheme is called Residual-Excited Linear Prediction (RELP) coding.

RELP

[Diagram: RELP encoder — the speech passes through the inverse filter A(z), whose coefficients come from LP analysis; the residual is quantized and sent to the encoder]
RELP

RELP follows essentially the same idea as DPCM, but in RELP the speech signal is divided into blocks (20 ms/block) and the optimum linear predictor is designed for each block. For each block, both the filter coefficients and the prediction error are sent to the receiver. In DPCM, by contrast, the predictor is fixed or adaptive, and only the prediction error is sent to the receiver.

Modeling of the prediction error

Within each block (frame) of speech, the prediction error may itself be correlated. To decorrelate it, each frame is further divided into 4 sub-frames (5 ms). The prediction error u(n) is then modelled as

  u(n) = h * u(n - M) + e(n)

where M (40 <= M <= 120) is called the lag and h is called the gain. Both are determined by the least mean-square error (LMSE) principle.
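Finding M and h by the LMSE principle amounts to a search over candidate lags, with the optimal gain for each lag available in closed form. A sketch (NumPy; the test signal is synthetic, with a built-in period of 80 samples):

```python
import numpy as np

def ltp_search(u, lag_min=40, lag_max=120):
    """Minimize sum (u(n) - h*u(n-M))^2 over the lag M and gain h."""
    best_M, best_h, best_err = lag_min, 0.0, np.inf
    for M in range(lag_min, lag_max + 1):
        cur, past = u[M:], u[:-M]
        denom = float(np.dot(past, past))
        if denom == 0.0:
            continue
        h = float(np.dot(cur, past)) / denom    # closed-form optimal gain
        err = float(np.sum((cur - h * past) ** 2))
        if err < best_err:
            best_M, best_h, best_err = M, h, err
    return best_M, best_h

rng = np.random.default_rng(1)
u = np.zeros(240)
u[:80] = rng.standard_normal(80)
for n in range(80, 240):                        # long-term correlation at lag 80
    u[n] = 0.9 * u[n - 80] + 0.05 * rng.standard_normal()

M, h = ltp_search(u)    # expect M = 80 and h near 0.9
```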

Long-term prediction

The decorrelation of the prediction error is called long-term prediction.

[Diagram: s(n) -> inverse filter A(z) (coefficients from LP analysis) -> u(n) -> long-term prediction -> e(n) -> encoder]

RPE-LTP

RPE-LTP has been adopted as the speech coding method in the GSM 06.10 standard.

[Diagram: short-term analysis filter (with coefficient determination) -> long-term prediction -> regular-pulse selection and coding]
RPE-LTP

Speech is sampled at 8 kHz and quantized to 8 bits/sample. The speech signal is pre-processed to remove any DC component and to pre-emphasize the high-frequency components, partly compensating for their low energy. The signal is then divided into frames (20 ms, 160 samples). An eighth-order optimum linear predictor is designed using the Schur algorithm. The reflection coefficients (related to the filter coefficients) are nonlinearly mapped to another set of values called log-area ratios (LARs).

RPE-LTP

The 8 LAR parameters are quantized using 6, 6, 5, 5, 4, 4, 3, 3 bits, for a total of 36 bits for the LARs (i.e. for the filter coefficients).

The frame is filtered using this filter, producing u(n). u(n) is then divided into 4 sub-frames (5 ms, 40 samples each). Long-term prediction is performed for each sub-frame: the lag M is quantized to 7 bits and the gain h to 2 bits. Long-term prediction produces e(n).

RPE-LTP

e(n) is down-sampled by a factor of 3. For each sub-frame there are 4 candidate down-sampling patterns, so 2 bits are needed to specify the pattern used. The down-sampled e(n) has 13 samples: the maximum is quantized to 6 bits, and the others are normalised and represented by 3 bits each.

So in each sub-frame, e(n) is represented by 6 + 13*3 = 45 bits; over the 4 sub-frames of a frame this is 4*45 = 180 bits.

This method is called regular-pulse excitation (RPE).
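The grid selection can be sketched as follows — build the 4 candidate down-sampled sequences (offsets 0-3, every 3rd sample, 13 samples each) and keep the one with the most energy. This is a simplified stand-in for the actual GSM 06.10 procedure, which weights the residual before selection:

```python
import numpy as np

def rpe_grid_select(e):
    """Pick the highest-energy of 4 decimation grids for a 40-sample
    sub-frame residual e(n); returns the grid index and its 13 samples."""
    grids = [e[k:k + 39:3] for k in range(4)]      # offsets 0..3, step 3
    energies = [float(np.sum(g ** 2)) for g in grids]
    k = int(np.argmax(energies))
    return k, grids[k]

e = np.zeros(40)
e[1::3] = 1.0            # all the residual energy lies on offset-1 positions
k, pulses = rpe_grid_select(e)
```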

13 kbps RPE-LTP coder

[Diagram: Encoder — input signal -> preprocessing -> short-term LPC analysis (reflection coefficients, 36 bits / 20 ms) -> short-term analysis filter -> LTP analysis (LTP parameters) -> RPE grid selection and coding (RPE parameters, 13 pulses / 5 ms), with a local RPE grid decoder and synthesis filter 1/A(z) in the feedback loop.
Decoder — reflection coefficients, LTP parameters, and RPE parameters -> RPE grid decoding -> synthesis filter -> short-term synthesis -> postprocessing -> output signal]

RPE-LTP

Summary

  8 LAR coefficients                    36 bits
  For each sub-frame:
    pattern code                         2
    lag                                  7
    gain                                 2
    regular pulses                 6 + 39 = 45
    total                               56
  4 sub-frames                   4 * 56 = 224
  Total per frame               224 + 36 = 260
  Bit rate             260 bits / 20 ms = 13 kbps
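The bit budget above can be checked with trivial arithmetic:

```python
# RPE-LTP (GSM 06.10) bit budget, per the summary above.
lar_bits = 6 + 6 + 5 + 5 + 4 + 4 + 3 + 3          # 8 LAR coefficients -> 36
per_subframe = 2 + 7 + 2 + (6 + 13 * 3)           # grid, lag, gain, pulses -> 56
frame_bits = lar_bits + 4 * per_subframe          # 4 sub-frames -> 260
bit_rate_bps = frame_bits / 0.020                 # one frame every 20 ms
print(lar_bits, per_subframe, frame_bits, bit_rate_bps)  # 36 56 260 13000.0
```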
