A/D conversion chain:
- Sampler: x(n) = x(nT) converts the continuous-time, continuous-amplitude signal into a discrete-time, continuous-amplitude one.
- Quantizer: converts the discrete-time, continuous-amplitude signal x(n) into a discrete-time, discrete-amplitude one.
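The sampler/quantizer chain can be sketched numerically. This is a minimal illustration; the tone frequency, amplitude, bit depth, and input range are our own assumptions, not values from the slides:

```python
import numpy as np

fs = 8000                                   # sampling rate (Hz), telephone-band speech
T = 1.0 / fs                                # sampling period
n = np.arange(160)                          # one 20 ms frame of sample indices
x = 0.9 * np.sin(2 * np.pi * 100 * n * T)   # x(n) = x(nT): discrete-time, continuous-amplitude

bits = 8
step = 2.0 / 2 ** bits                      # uniform quantizer step over the range [-1, 1)
xq = np.round(x / step) * step              # discrete-time, discrete-amplitude

max_err = np.max(np.abs(x - xq))            # quantization error, bounded by step/2
```

Since the signal stays inside the quantizer range, the error of a uniform mid-rise quantizer never exceeds half a step.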
[Figure: speech communication chain: Talk → A/D → D/A → Listen. Telephone speech occupies 0 Hz to 4 kHz; wideband speech and wideband audio extend beyond this band.]
Waveform coding
The reconstructed signal matches the original signal as closely as possible. Robust over a wide range of speakers and noisy environments.
Parametric coding based on the quasi-stationary model of speech production.
Vocoder
Hybrid coding
Hybrid coders
Multi-Pulse Excitation
- Multi-pulse excitation: efficient at medium bit rates. A sequence of non-uniformly spaced pulses serves as the excitation signal; the amplitudes and positions are the excitation parameters.
- Regular-pulse excitation: efficient at medium bit rates. A sequence of uniformly spaced pulses serves as the excitation signal; the position of the first pulse within a vector and the amplitudes are the excitation parameters.
- Code-excited excitation (CELP): efficient at low bit rates (below 8 kbps). A codebook of excitation sequences is used; the two key issues are the design and the search of the codebook.
[Figure: examples of excitations: a) multi-pulse (non-uniformly spaced pulses at positions n_k with gains g_k), b) regular-pulse (uniformly spaced pulses with gains g_k), c) Code-excited Linear Prediction (a codebook of N codevectors, each scaled by a gain).]
Communication Networks Research (CNR) Lab., EECS, KAIST
- 64 kbps: μ-law/A-law PCM (CCITT G.711)
- 64 kbps: 7 kHz subband/ADPCM (CCITT G.722)
- 32 kbps: ADPCM (CCITT G.721)
- 16 kbps: low-delay CELP (CCITT G.728)
- 13 kbps: RPE-LTP (GSM 06.10)
- 13 kbps: ACELP (GSM 06.60)
- 13 kbps: QCELP (US CDMA cellular)
- 8 kbps: QCELP (US CDMA cellular)
- 8 kbps: VSELP (US TDMA cellular)
- 8 kbps: CS-ACELP (ITU G.729)
- 6.7 kbps: VSELP (Japan digital cellular)
- 6.4 kbps: IMBE (Inmarsat voice coding standard)
- 5.3 & 6.4 kbps: TrueSpeech coder (ITU G.723)
- 4.8 kbps: CELP (Fed. Standard 1016, STU-III)
- 2.4 kbps: LPC (Fed. Standard 1015, LPC-10E)
Coder attributes:
- Speech quality (SNR/SEGSNR, MOS, etc.)
- Bit rate (bits per second)
- Complexity (MIPS)
- Coding delay (msec)

Benefits of speech coding:
- More channel capacity
- Noise immunity
- Encryption
- Reasonable complexity and encoding delay
Vocoders
[Figure: glottal pulse waveform, amplitude vs. time (ms), showing the closure phase (Rosenberg, JASA 49, 1971).]
[Figure: speech spectrum, intensity vs. frequency (Hz). Harmonics of the spectrum are spaced at 80 Hz, corresponding to a pitch period of 12.5 ms.]
[Figure: spectra of the vowels /ee/, /ar/, /uu/.]
[Figure: unvoiced excitation: random signal generator multiplied by a gain.]
Vocoder
The transmitted parameters are: (1) whether the signal is voiced or unvoiced; (2) if voiced, the period of the excitation signal; (3) the parameters of the prediction filter.
Vocoder
Encoder/Decoder
LPC Introduction
These speech coders are called vocoders (voice coders). Basic idea:
Estimate parameters → Encode parameters → Transmit parameters → Decode parameters → Synthesize speech
They usually provide more bandwidth compression than is possible with waveform coding (2400–9600 bps).
Generalities
LP model:
[Figure: excitation source (impulse generator at the pitch period for voiced frames, noise generator for unvoiced frames) → voiced/unvoiced switch → gain → all-pole filter → speech signal.]
Parameter Estimation
Estimate the LP coefficients (a_i), the gain, the type of excitation (voiced or unvoiced), and the pitch.
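The LP coefficients and gain can be estimated in several ways; a common choice (not mandated by the slides) is the autocorrelation method with the Levinson-Durbin recursion, which also yields the reflection coefficients used later for quantization:

```python
import numpy as np

def lp_analysis(frame, order=10):
    """Autocorrelation method + Levinson-Durbin recursion.
    Returns A(z) coefficients [1, a_1, ..., a_p], the reflection
    coefficients, and a gain estimate from the residual energy."""
    N = len(frame)
    r = np.array([frame[:N - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    E = r[0]                       # prediction-error energy
    refl = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / E               # i-th reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        refl.append(k)
        E *= 1.0 - k * k
    return a, refl, np.sqrt(E)

# Synthetic AR(2) check: s(n) = 0.9 s(n-1) - 0.2 s(n-2) + e(n)
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
s = np.zeros(4000)
for t in range(2, 4000):
    s[t] = 0.9 * s[t - 1] - 0.2 * s[t - 2] + e[t]
a, refl, gain = lp_analysis(s, order=2)
```

For this synthetic signal the recovered coefficients approach a_1 = -0.9 and a_2 = 0.2, the A(z) coefficients of the generating filter.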
V/UV Estimation
Frame energy:

Es = 10 log10( (1/N) Σ_{n=0}^{N−1} s^2(n) )

Normalized first autocorrelation coefficient:

C1 = Σ_{n=1}^{N−1} s(n) s(n−1) / sqrt( Σ_{n=1}^{N−1} s^2(n) · Σ_{n=0}^{N−2} s^2(n) )
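The two voicing features, frame energy in dB and the normalized first autocorrelation coefficient, can be computed directly. A minimal sketch; the example signals and any decision threshold are our own assumptions:

```python
import numpy as np

def vuv_features(s):
    """Frame energy Es (dB) and normalized first autocorrelation C1.
    Voiced frames tend to have high energy and C1 close to 1;
    unvoiced frames have lower energy and C1 near 0."""
    N = len(s)
    Es = 10.0 * np.log10(np.sum(s ** 2) / N + 1e-12)
    num = s[1:] @ s[:-1]
    den = np.sqrt(np.sum(s[1:] ** 2) * np.sum(s[:-1] ** 2)) + 1e-12
    return Es, num / den

# Surrogate frames: a tone stands in for voiced speech, white noise for unvoiced.
fs = 8000
t = np.arange(2000) / fs
_, c1_voiced = vuv_features(np.sin(2 * np.pi * 100 * t))
_, c1_unvoiced = vuv_features(np.random.default_rng(0).standard_normal(2000))
```

For a 100 Hz tone at 8 kHz, C1 is close to cos(2π·100/8000) ≈ 0.997, while white noise gives C1 near zero.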
Pitch Detection
Voiced sounds
Produced by forcing air through the glottis. The vocal cords oscillate and modulate the air flow into quasi-periodic pulses. These pulses excite resonances in the remainder of the vocal tract, and different sounds are produced as muscles change the vocal tract's shape. The resonant frequencies are called formant frequencies; the rate of the pulses is the fundamental frequency, or pitch.
Pitch Detection
[Figure: short sections of voiced speech (quasi-periodic) and unvoiced speech (noise-like), amplitude vs. sample number.]
Short-time autocorrelation:

R_n(k) = Σ_{m=0}^{N−1−k} s(n+m) s(n+m+k)
Basic problem in choosing the window length: speech changes over time (which favors a small N), yet the window must contain at least two periods of the waveform. Approaches: choose the window to catch the longest expected period; make N adaptive; use a modified short-time autocorrelation function.
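A basic autocorrelation pitch detector follows directly from the definition above: compute R(k) over the frame and take the lag with the largest peak inside a plausible pitch range. The 50-400 Hz search bounds and the square-wave test signal are our own assumptions:

```python
import numpy as np

def pitch_autocorr(frame, fs, f_min=50.0, f_max=400.0):
    """Pitch estimate: lag of the largest autocorrelation peak
    inside the candidate pitch range."""
    N = len(frame)
    k_lo = int(fs / f_max)
    k_hi = min(int(fs / f_min), N - 1)
    r = np.array([frame[:N - k] @ frame[k:] for k in range(k_hi + 1)])
    k_best = k_lo + int(np.argmax(r[k_lo:k_hi + 1]))
    return fs / k_best

# Pulse-like surrogate for voiced speech: 80 Hz square wave, 50 ms window
# (at least two full periods, per the windowing discussion above).
fs = 8000
n = np.arange(400)
frame = np.sign(np.sin(2 * np.pi * 80 * n / fs))
f0 = pitch_autocorr(frame, fs)
```

The signal's period is exactly 100 samples, so the detector returns 80 Hz.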
The auto-correlation representation retains too much of the information in the speech signal, so the auto-correlation function has many peaks besides the one at the pitch lag.
To remove the effects of the vocal tract transfer function, apply center clipping: a nonlinear transformation whose clipping value depends on the maximum amplitude. The result is a strong peak at the pitch frequency.
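Center clipping can be sketched in a few lines. The clipping fraction alpha = 0.3 is a typical choice, our assumption rather than a value from the slides:

```python
import numpy as np

def center_clip(s, alpha=0.3):
    """Samples within +/- alpha*max|s| are set to zero; the rest are
    shifted toward zero by the clipping level. This flattens the formant
    structure so the autocorrelation peak at the pitch lag stands out."""
    c = alpha * np.max(np.abs(s))
    return np.where(s > c, s - c, np.where(s < -c, s + c, 0.0))

s = np.array([0.1, 0.5, -0.2, -1.0, 0.05, 0.8])
y = center_clip(s, alpha=0.3)   # clipping level c = 0.3
```

Small samples vanish entirely while large ones keep their sign and relative size, which is exactly what suppresses the vocal-tract detail.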
[Figure: center-clipped speech and its autocorrelation vs. sample number, showing a strong peak at the pitch lag.]
Fundamental Frequency
Usually estimated using the autocorrelation function and the average magnitude difference function (AMDF).
AMDF_t(m) = (1/Np) Σ_n | s_t(n) − s_t(n−m) |,  0 ≤ n−m ≤ L−1

where L is the frame length and Np is the number of point pairs. (A peak in the ACF and a valley in the AMDF indicate F0.)
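An AMDF-based pitch estimate looks for the deepest valley rather than the highest peak. A minimal sketch; the 50-400 Hz search bounds and the square-wave test signal are our own assumptions:

```python
import numpy as np

def amdf(frame, m):
    """AMDF(m) = (1/Np) * sum |s(n) - s(n-m)| over the Np valid pairs."""
    return np.mean(np.abs(frame[m:] - frame[:-m]))

def pitch_amdf(frame, fs, f_min=50.0, f_max=400.0):
    """Pick the lag with the deepest AMDF valley inside the pitch range."""
    k_lo, k_hi = int(fs / f_max), int(fs / f_min)
    vals = [amdf(frame, m) for m in range(k_lo, k_hi + 1)]
    return fs / (k_lo + int(np.argmin(vals)))

fs = 8000
n = np.arange(400)
frame = np.sign(np.sin(2 * np.pi * 80 * n / fs))  # surrogate, period 100 samples
f0 = pitch_amdf(frame, fs)
```

For this exactly periodic signal the AMDF is zero at the true period of 100 samples, so the estimate is 80 Hz.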
Small changes in the LPC coefficients result in large changes in the pole positions, and if |rk| is near 1 the quantization distortion is large. The reflection coefficients are therefore represented through a non-linear transformation that expands the scale near |rk| = 1.

Log-area ratio: g_k = log( (1 + r_k) / (1 − r_k) )
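The log-area ratio and its inverse are one-liners (sign conventions for the LAR vary in the literature; the form used here is one common choice):

```python
import numpy as np

def lar(r):
    """Log-area ratio g = log((1 + r) / (1 - r)) of a reflection
    coefficient r, |r| < 1. Expands the scale as |r| approaches 1,
    so uniform quantization of g is far less damaging than
    quantizing r directly."""
    r = np.asarray(r, dtype=float)
    return np.log((1.0 + r) / (1.0 - r))

def inv_lar(g):
    """Inverse mapping back to a reflection coefficient: r = tanh(g/2)."""
    return np.tanh(np.asarray(g, dtype=float) / 2.0)

r = np.array([0.0, 0.5, 0.98, 0.99])
g = lar(r)
```

The scale expansion is visible near the unit circle: the step from r = 0.98 to 0.99 moves g much farther than the step from r = 0.50 to 0.51.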
The main difference among LP vocoders is in how the excitation source is computed.
LPC-10
Encoder: speech signal → ADC (8 kHz) → window (180 samples) → AMDF and zero-crossing analysis (pitch and voicing) and LP analysis (covariance method) → non-linear warping of the coefficients.

Transmitted parameters: voiced/unvoiced switch (1 bit); pitch frequency (7 bits); gain (4 bits); 10 reflection coefficients (5 bits for one and 4 bits for the others), the first ones coded as LAR coefficients (4 bits and 5 bits).

Decoder: the channel decoder reconstructs the parameters and drives the synthesis filter 1/A(z).
RELP
An improvement is to use the prediction error, rather than a periodic pulse train (for voiced signals) or random noise (for unvoiced signals), to excite the digital filter that reproduces the speech. The prediction error is also called the residual.
This scheme is called Residual Excited Linear Prediction (RELP) coding.
RELP
[Figure: RELP encoder: the residual is quantized and encoded.]
RELP
RELP follows essentially the same idea as DPCM. However, in RELP the speech signal is divided into blocks (20ms/block).
The optimum linear predictor is designed for each block. For each block, the filter coefficients and the prediction error should be sent to the receiver.
In DPCM, the predictor can be fixed or adaptive. Only the prediction error is sent to the receiver.
Within each block of the speech signal (a frame), the prediction error may still be correlated. To decorrelate it, each frame is further divided into 4 sub-frames (5 ms each). The prediction error u(n) is then modelled as
u(n) = h·u(n−M) + e(n)

where M (40 ≤ M ≤ 120) is called the lag and h the gain; both are determined using the least-mean-square-error (LMSE) criterion.
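The LMSE search for M and h can be sketched directly: for each candidate lag the optimal gain has a closed form, and the lag giving the smallest residual energy wins. The synthetic test signal is our own construction:

```python
import numpy as np

def ltp_params(u, lag_min=40, lag_max=120):
    """Long-term predictor u(n) = h*u(n-M) + e(n). For a fixed lag M the
    LMSE-optimal gain is h = sum u(n)u(n-M) / sum u(n-M)^2; we return the
    (M, h) pair minimizing the residual energy. The 40 <= M <= 120 range
    is from the text. (Each lag uses a slightly different number of valid
    samples; adequate for a sketch.)"""
    best_M, best_h, best_err = lag_min, 0.0, np.inf
    for M in range(lag_min, lag_max + 1):
        x, y = u[M:], u[:-M]          # u(n) and u(n-M)
        denom = y @ y
        if denom == 0.0:
            continue
        h = (x @ y) / denom
        err = np.sum((x - h * y) ** 2)
        if err < best_err:
            best_M, best_h, best_err = M, h, err
    return best_M, best_h

# Synthetic residual with a known long-term lag of 60 and gain 0.8:
rng = np.random.default_rng(1)
u = np.zeros(600)
e = rng.standard_normal(600)
for t in range(600):
    u[t] = (0.8 * u[t - 60] if t >= 60 else 0.0) + e[t]
M, h = ltp_params(u)
```

The search recovers the generating lag M = 60 with a gain estimate near 0.8.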
Long-term prediction
[Figure: long-term prediction: U(z) → long-term predictor → e(n) → encoder.]
RPE-LTP
RPE-LTP has been adopted as the speech coding method of the GSM 06.10 standard.
[Figure: RPE-LTP encoder: filter-coefficient determination, long-term prediction digital filter, and regular-pulse selection and coding.]
RPE-LTP
Speech is sampled at 8 kHz and quantized to 8 bits/sample. The speech signal is pre-processed to remove any DC component and to pre-emphasize the high-frequency components, partly compensating for their low energy. The signal is then divided into frames (20 ms, 160 samples). An eighth-order optimum linear predictor is designed using the Schur algorithm. The reflection coefficients (related to the filter coefficients) are non-linearly mapped to another set of values called log-area ratios (LAR).
RPE-LTP
So a total of 36 bits for the LAR (or for the filter coefficients).
The frame is filtered with this filter to produce u(n), which is then divided into 4 sub-frames (5 ms each, 40 samples). Long-term prediction is performed for each sub-frame: the lag M is quantized to 7 bits and the gain h to 2 bits. Long-term prediction produces e(n).
RPE-LTP
e(n) is down-sampled by a factor of 3. For each sub-frame there are 4 candidate down-sampling patterns, so 2 bits specify the pattern used. The down-sampled e(n) has 13 samples: the maximum is quantized to 6 bits, and the others are normalized and then represented by 3 bits each.

So in each sub-frame e(n) is represented by 6 + 13×3 = 45 bits; a frame has 4 sub-frames, giving 4 × 45 = 180 bits.
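Adding the LTP lag/gain bits and the grid-selection bits to these RPE bits, plus the 36 LAR bits per frame, reproduces the 13 kbps GSM full-rate figure. A quick tally:

```python
# RPE-LTP bit budget per 20 ms frame (160 samples), from the figures above.
lar_bits = 36                  # log-area ratios, once per frame
lag_bits = 7                   # long-term lag M, per 5 ms sub-frame
gain_bits = 2                  # long-term gain h, per sub-frame
grid_bits = 2                  # which down-sampling pattern was used
rpe_bits = 6 + 13 * 3          # block maximum + 13 normalized samples = 45

per_subframe = lag_bits + gain_bits + grid_bits + rpe_bits   # 56 bits
per_frame = lar_bits + 4 * per_subframe                      # 260 bits per 20 ms
bit_rate = per_frame / 0.020                                 # 13000 bit/s
```

260 bits every 20 ms is exactly the 13 kbps listed for GSM 06.10 in the standards table earlier.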
[Figure: RPE-LTP encoder/decoder: input signal → preprocessing → short-term prediction residual → RPE and LTP parameter extraction → synthesis filter 1/A(z) → postprocessing → output signal.]
RPE-LTP
Summary
LAR coefficients: 36 bits per frame
Per sub-frame: lag 7 bits, gain 2 bits, grid 2 bits, regular pulses 45 bits (56 bits total)
4 sub-frames: 4 × 56 = 224 bits
Total for one frame: 36 + 224 = 260 bits per 20 ms
Bit rate: 13 kbps