
SPEECH AND AUDIO SIGNAL PROCESSING (MTEC-331) Anith M Thomas

1327011
Assignment-2
The following techniques are useful in estimating formant frequencies and in estimating the glottal
waveform for voiced speech.
1. PITCH SYNCHRONOUS SPECTRUM ANALYSIS
2. POLE-ZERO ANALYSIS
3. PITCH SYNCHRONOUS ESTIMATION OF THE GLOTTAL WAVE
The basic idea of analysis-by-synthesis is the following. First it is assumed that we begin with the
speech waveform or some other representation of the speech signal, such as the time-dependent Fourier
Transform. Then some form of the speech production model is assumed. This model (Terminal Analog,
Vocal Tract etc.) has a number of parameters which can be adjusted to produce different speech sounds.
From the model we can derive a representation that is of the same form as the
representation of the speech signal. Then by varying the parameters of the model in a systematic way, we
can attempt to find a set of parameters that causes the model to match the speech signal with minimum
error. When such a match is found, the parameters of the model are assumed to be the parameters of
the speech signal. Let us examine each of the above techniques in detail.
PITCH SYNCHRONOUS SPECTRUM ANALYSIS:
The digital model for voiced speech assumes that a short segment of voiced speech is identical to
the same length segment from the periodic sequence
$$\tilde{s}(n) = \sum_{r=-\infty}^{\infty} h_v(n + rN_p) \qquad (1)$$
where $h_v(n)$ represents the convolution of the vocal tract impulse response, $v(n)$, with the glottal pulse
$g(n)$ and the impulse response of the radiation load, $r(n)$. That is,
$$h_v(n) = r(n) * v(n) * g(n) \qquad (2)$$
The quantity $N_p$ is the pitch period in samples. The radiation effects, which basically appear as a
differentiation at low frequencies, are adequately modelled for most purposes by a simple first difference,
for which the z-Transform representation is
$$R(z) = 1 - z^{-1} \qquad (3)$$
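The vocal tract is characterized by a transfer function of the form
$$V(z) = \frac{A}{\prod_{k=1}^{M}\left(1 - 2e^{-\alpha_k T}\cos(2\pi F_k T)\,z^{-1} + e^{-2\alpha_k T}z^{-2}\right)} \qquad (4)$$
where $F_k$ is the frequency of the $k$-th formant, $\alpha_k$ sets its bandwidth, $T$ is the sampling period, and the
number of poles included, $2M$, depends upon the sampling frequency of the input data. The glottal
pulse is of finite duration, implying that the z-Transform of $g(n)$ is a polynomial in $z^{-1}$ of the form
$$G(z) = \sum_{n=0}^{L_g} g(n)\,z^{-n} = g(0)\prod_{k=1}^{L_g}\left(1 - b_k z^{-1}\right) \qquad (5)$$
where $L_g$ is less than $N_p$ and the $b_k$ are the zeros of $G(z)$. From (2) we observe that the z-Transform of
$h_v(n)$ is
$$H_v(z) = R(z) \cdot V(z) \cdot G(z) \qquad (6)$$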
The corresponding Fourier transform is
$$H_v(e^{j\omega}) = R(e^{j\omega}) \cdot V(e^{j\omega}) \cdot G(e^{j\omega}) \qquad (7)$$
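The model spectrum of (7) can be evaluated numerically. The following is a minimal sketch, assuming NumPy; the function name `model_spectrum`, its argument names, and the test values (one formant at 500 Hz, a decay constant of pi times 60 Hz, 8 kHz sampling) are illustrative choices and do not come from the assignment.

```python
import numpy as np

def model_spectrum(omega, formants_hz, alphas, g, fs, gain=1.0):
    """Evaluate H_v(e^{jw}) = R * V * G on the unit circle, following (3) to (7).

    omega       : normalized frequencies in rad/sample
    formants_hz : formant frequencies F_k in Hz
    alphas      : decay constants alpha_k in 1/s (hypothetical input format)
    g           : glottal pulse samples g(0), ..., g(L_g)
    fs          : sampling frequency in Hz, so that T = 1 / fs
    """
    T = 1.0 / fs
    zinv = np.exp(-1j * np.asarray(omega, dtype=float))   # z^{-1} on the unit circle
    R = 1.0 - zinv                                        # radiation load, eq. (3)
    V = np.full(zinv.shape, gain, dtype=complex)          # numerator constant A
    for F, alpha in zip(formants_hz, alphas):             # one pole pair per formant, eq. (4)
        V = V / (1.0
                 - 2.0 * np.exp(-alpha * T) * np.cos(2.0 * np.pi * F * T) * zinv
                 + np.exp(-2.0 * alpha * T) * zinv ** 2)
    # G(z) = sum_n g(n) z^{-n}, eq. (5); polyval expects the highest power first
    G = np.polyval(np.asarray(g, dtype=float)[::-1], zinv)
    return R * V * G                                      # eq. (7)

# Illustrative check: with g(n) = unit sample, |H_v| / |R| is the vocal tract magnitude
fs = 8000.0
omega = np.linspace(0.01, np.pi, 4000)
H = model_spectrum(omega, formants_hz=[500.0], alphas=[np.pi * 60.0], g=[1.0], fs=fs)
vocal_tract_mag = np.abs(H) / np.abs(1.0 - np.exp(-1j * omega))
peak_hz = omega[np.argmax(vocal_tract_mag)] * fs / (2.0 * np.pi)   # lands close to 500 Hz
```

Because the first difference in (3) has a zero at $z = 1$, the model spectrum is exactly zero at $\omega = 0$ whatever the other parameters are.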
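The Fourier Transform of the periodic signal $\tilde{s}(n)$ will consist of very sharp spectral lines at multiples of
the fundamental frequency.
The periodic signal $\tilde{s}(n)$ can be represented by a Fourier series of the form
$$\tilde{s}(n) = \frac{1}{N_p}\sum_{k=0}^{N_p-1} \tilde{S}(k)\,e^{j2\pi kn/N_p} \qquad (8)$$
where
$$\tilde{S}(k) = H_v(e^{j2\pi k/N_p}) \qquad (9)$$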
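By substituting (8) and (9) into the time-dependent Fourier Transform
$$S_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} w(n-m)\,\tilde{s}(m)\,e^{-j\omega m} \qquad (10)$$
it is easily shown that
$$S_n(e^{j\omega}) = \frac{1}{N_p}\sum_{k=0}^{N_p-1} H_v(e^{j2\pi k/N_p})\,W_n(e^{j(\omega - 2\pi k/N_p)}) \qquad (11)$$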
where $W_n(e^{j\omega})$ is the Fourier transform of the shifted analysis window $w(n-m)$, taken over $m$. We have seen that the
character of the time-dependent Fourier Transform is strongly dependent on the length and shape of the
analysis window. We notice from equation (11) that even though $\tilde{s}(n)$ is periodic, $S_n(e^{j\omega})$ is a function of the window
position $n$. By isolating one period of the periodic signal we can compute samples of $H_v(e^{j\omega})$ at $N_p$ equally
spaced frequencies. When one period of voiced speech is used in place of $x(n)$ in
$$X(k) = X(e^{j2\pi k/N_p}) = \sum_{n=0}^{N_p-1} x(n)\,e^{-j2\pi kn/N_p}, \quad 0 \le k \le N_p - 1 \qquad (12)$$
the resulting time-dependent Fourier Transform is termed a pitch synchronous time-dependent Fourier
Transform. In general, this approach to voiced speech analysis is called Pitch Synchronous Analysis.
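The claim that the DFT of one period gives samples of the model spectrum at $N_p$ equally spaced frequencies can be checked numerically. The sketch below assumes NumPy and uses a made-up decaying sinusoid as a stand-in for $h_v(n)$; the helper names `periodic_extension` and `dtft` and all numeric values are hypothetical.

```python
import numpy as np

def periodic_extension(h, Np):
    """One period (n = 0..Np-1) of s~(n) = sum_r h(n + r*Np) for a causal, finite h."""
    n_periods = -(-len(h) // Np)                  # ceil(len(h) / Np)
    padded = np.zeros(n_periods * Np)
    padded[:len(h)] = h
    return padded.reshape(n_periods, Np).sum(axis=0)

def dtft(h, omega):
    """Fourier transform of a finite sequence h(0..L-1) at the frequencies omega."""
    n = np.arange(len(h))
    return np.exp(-1j * np.outer(omega, n)) @ h

Np = 80                                           # assumed pitch period (100 Hz at 8 kHz)
n = np.arange(400)
h = (0.97 ** n) * np.sin(2.0 * np.pi * 500.0 / 8000.0 * n)   # toy impulse response, longer than Np

period = periodic_extension(h, Np)                # one isolated pitch period
X = np.fft.fft(period)                            # pitch synchronous spectrum, eq. (12)
H_samples = dtft(h, 2.0 * np.pi * np.arange(Np) / Np)        # samples of the spectrum of h
max_err = np.max(np.abs(X - H_samples))           # agreement up to rounding error
```

The impulse response here spans five pitch periods, yet the $N_p$-point DFT of a single period still reproduces the spectrum samples, because the time-domain overlap in (1) corresponds exactly to sampling in frequency.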


POLE-ZERO ANALYSIS:
With the Pitch Synchronous Spectrum method as a starting point, other
researchers used an iterative procedure to estimate the parameters of the speech model. They used an
analog model for the transfer functions of the radiation load, the vocal tract and the glottal pulse. This
necessitated a higher-pole correction factor, which probably would not have been necessary had equations
(3), (4), (5) and (6) been used. The basic approach remains the same regardless of the particular functional
form of the speech model.
The parameters of $H_v(e^{j\omega})$ can be determined by an iterative approximation process.
Researchers guessed a set of parameters for $H_v(e^{j\omega})$, computed its values at the frequencies $2\pi k/N_p$ and then
evaluated an error function of the form
$$E = \sum_{k} Q(k)\left[\log|H_v(e^{j2\pi k/N_p})| - \log|X(e^{j2\pi k/N_p})|\right]^2 \qquad (13)$$
where $Q(k)$ is a weighting function on the spectrum and $X(e^{j2\pi k/N_p})$ is the pitch synchronous spectrum
of the speech signal. The parameters were adjusted in a systematic way so as to minimize the error
function. When the error is minimized, the resulting values of the poles of $V(z)$ are taken as estimates of
the formant frequencies. The zero locations give information about the glottal wave.
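The iterative matching can be illustrated with a deliberately reduced example. This sketch assumes NumPy, keeps only a single vocal tract resonance in the model, and replaces the systematic parameter adjustment with a plain grid search over one formant frequency; it is not the procedure used by the original researchers. The "measured" spectrum is synthetic, and all names and values are hypothetical.

```python
import numpy as np

def resonator_spectrum(omega, F, alpha, fs):
    """One pole pair of eq. (4), evaluated at normalized frequencies omega."""
    T = 1.0 / fs
    zinv = np.exp(-1j * omega)
    return 1.0 / (1.0
                  - 2.0 * np.exp(-alpha * T) * np.cos(2.0 * np.pi * F * T) * zinv
                  + np.exp(-2.0 * alpha * T) * zinv ** 2)

def log_spectral_error(model, target, weights):
    """Weighted squared difference of log magnitudes, in the spirit of eq. (13)."""
    d = np.log(np.abs(model)) - np.log(np.abs(target))
    return float(np.sum(weights * d ** 2))

fs, Np = 8000.0, 80
omega = 2.0 * np.pi * np.arange(1, Np // 2) / Np       # analysis frequencies 2*pi*k/Np, k >= 1
alpha = np.pi * 80.0                                   # assumed known decay constant

# Synthetic stand-in for the pitch synchronous spectrum: a resonance at 700 Hz
target = resonator_spectrum(omega, 700.0, alpha, fs)
weights = np.ones_like(omega)                          # flat weighting function

# Analysis-by-synthesis loop: try each candidate formant, keep the best match
candidates = np.arange(300.0, 1200.0, 10.0)
errors = [log_spectral_error(resonator_spectrum(omega, F, alpha, fs), target, weights)
          for F in candidates]
best_F = float(candidates[int(np.argmin(errors))])     # formant estimate in Hz
```

A practical version would search over all formant frequencies, bandwidths and glottal zeros at once, typically with a gradient or coordinate search, since a full grid grows quickly with the number of parameters.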
PITCH SYNCHRONOUS ESTIMATION OF THE GLOTTAL WAVE:
The previous methods were primarily concerned with the distribution of zeros of the
approximations, and attempts were made to relate the spacing and arrangement of the zeros to the shape
of the glottal pulse. Later modifications to the methods helped in obtaining estimates of the glottal
pulse. In this case, the model was of the form
$$\tilde{H}_v(z) = \tilde{G}(z)\,V(z)\,R(z) \qquad (14)$$
where the glottal wave contribution to the spectrum was initially modelled by a fixed
transfer function, whose equivalent digital form would be
$$\tilde{G}(z) = \frac{1}{(1 - e^{-aT}z^{-1})(1 - e^{-bT}z^{-1})} \qquad (15)$$
with fixed constants $a$ and $b$.
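Again the parameters of $V(z)$ were varied systematically to minimize a similar error criterion. The
resulting pole locations serve as an estimate of the formant frequencies. To obtain the glottal wave shape
for the particular period of speech being analyzed, researchers computed the quantity
$$G(k) = \frac{X(e^{j2\pi k/N_p})}{V(e^{j2\pi k/N_p})\,R(e^{j2\pi k/N_p})}, \quad 0 \le k \le N_p - 1 \qquad (16)$$
where $X(e^{j2\pi k/N_p})$ is the pitch synchronous spectrum of that period.
The values of $G(k)$ for $0 \le k \le N_p - 1$ are used as Fourier coefficients of the glottal pulse, $g(n)$, which is
computed using the inverse DFT
$$g(n) = \frac{1}{N_p}\sum_{k=0}^{N_p-1} G(k)\,e^{j2\pi kn/N_p} \qquad (17)$$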
This is feasible since $g(n)$ is a finite duration pulse, even though the $N_p$ samples of $H_v(e^{j\omega})$ that are
obtained by pitch synchronous analysis would in general not be adequate to completely specify $h_v(n)$,
which is in general longer than $N_p$. Thus with the aid of a model for speech production it is possible to extract
the component of the convolution which is of finite duration.
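The recovery of a finite-duration pulse by spectral division and an inverse DFT, as in (16) and (17), can be sketched on synthetic data. The example assumes NumPy, a single known formant, and a Hann-shaped test pulse; all values are hypothetical. One further assumption is labeled in the code: the radiation load is taken as the leaky difference $1 - \rho z^{-1}$ with $\rho = 0.98$, because the exact first difference of (3) is zero at $k = 0$ and would make the division in (16) undefined there.

```python
import numpy as np

fs, Np = 8000.0, 80                     # assumed sampling rate and pitch period
T = 1.0 / fs
rho = 0.98                              # ASSUMPTION: leaky difference instead of eq. (3)
a1 = -2.0 * np.exp(-np.pi * 80.0 * T) * np.cos(2.0 * np.pi * 700.0 * T)
a2 = np.exp(-2.0 * np.pi * 80.0 * T)    # V(z) = 1 / (1 + a1 z^-1 + a2 z^-2)

g_true = np.hanning(32)[1:-1]           # synthetic glottal pulse, 30 samples < Np

# Build h_v(n) = r(n) * v(n) * g(n): radiation first, then the all-pole vocal tract
L = 25 * Np                             # long enough for the resonance to die out
x = np.zeros(L)
x[:len(g_true)] = g_true
x = x - rho * np.concatenate(([0.0], x[:-1]))
h = np.zeros(L)
for i in range(L):
    h[i] = x[i]
    if i >= 1:
        h[i] -= a1 * h[i - 1]
    if i >= 2:
        h[i] -= a2 * h[i - 2]

period = h.reshape(-1, Np).sum(axis=0)  # one period of the periodic signal, eq. (1)
X = np.fft.fft(period)                  # pitch synchronous spectrum, eq. (12)

zinv = np.exp(-2j * np.pi * np.arange(Np) / Np)
V = 1.0 / (1.0 + a1 * zinv + a2 * zinv ** 2)
R = 1.0 - rho * zinv
G = X / (V * R)                         # eq. (16)
g_est = np.fft.ifft(G).real             # eq. (17)
```

In this idealized setting `g_est` reproduces the pulse to rounding error, even though the overall impulse response is much longer than one pitch period. With real speech the quality of the estimate depends on how well $V(z)$ and the radiation model fit the signal.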
This technique for estimating the glottal pulse is sensitive to the model that is assumed. In cases
where the model fits the speech signal well, as in the steady state vowels, the results are excellent. In
other situations a more complex model is required. Another factor that affected the results was the way
in which the pitch period was isolated in the speech waveform. In general it is very unlikely that the exact
point of glottal opening and closing would occur at a sampling instant.
