
Alexandria University

Faculty of Engineering
Electrical Engineering Department
Communications & Electronics Section

B.Sc. Electrical Communications


Academic Year 2006 - 2007

Under the supervision of


Dr. Noha O. Korany
Preface

We believe that good seeds, even if planted in fertile soil, won't grow without care and attention.
We also believe that plants won't grow overnight, and that in order to enjoy what we have planted we have to be patient until they are fully grown. And when they are fully grown and beautiful, we have to protect them by watering them continuously, never abandoning them to be covered by the dust that destroys their shine and beauty and leads them to wither and die.

We are not agricultural engineers, nor is our project concerned with agriculture; we simply wanted to take a moment to express our deep gratitude, thanks and appreciation to our dear Dr. Noha O. Korany, who supported us greatly and helped us achieve what we have achieved in this project. She was always encouraging us to discover our own capabilities, she never obliged us to do anything we did not want to do, and she put us on the right track and drove us toward a real start of a professional life in the near future.
The project, for us, was not just an ordinary college task that has to be submitted by a deadline; we looked forward to every weekly meeting with great enthusiasm, to discuss general topics and events with an extremely open-minded, highly committed and respectful person like our dear Dr. Noha O. Korany.

Now we guess that you have almost understood the analogy of the plants we used at the beginning.
We were the seeds on their way to growing, the project was the fertile soil, and those seeds would not have grown properly without the care and attention of Dr. Noha O. Korany.

Dr. Noha,
Words cannot tell how much we are grateful to you

Project’s Participants
Abstract

Selected topics on acoustics-communication:

Topic 1: Audiology
Deals with the hearing process

Students: Menatollah Mostafa Aly


Hanaà Mohamed El Borollosy
Manal Khalaf

Topic 2: Acoustical Simulation of Room


Deals with the modeling of sound propagation in rooms

Students: Héba Mohamed Noweir


Mona Abdel Kader Mohamed
Mona Zarif Shenouda

Topic 3: Noise Control


Deals with environmental noise

Students: Hanaà Khamis Mohamed


Nermine Mohamed Ahmed

Topic 4: Speech Technology


Deals with speech analysis/synthesis and speaker identification

Students: Ahmed Mohamed Hamido


Beshoy Kamel Ibrahim Ghaly
Contents

Page
Topic 1: AUDIOLOGY……………………………………………………… 1
CHAPTER 1: HEARING PROCESS………………………………………… 2
1.1 Structure of the Ear………………………………………………… 2
1.2 How the Ear works………………………………………………… 4
1.3 Computational Models of Ear Functions…………………………... 5

CHAPTER 2: FUNDAMENTAL PROPERTIES OF HEARING……………… 13
2.1 Thresholds…………………………………………………………. 13
2.2 Equal Loudness Level Contours…………………………………… 14
2.3 Critical Bandwidth………………………………………………… 14
2.4 Masking……………………………………………………………. 15
2.5 Beat and Combination Tones……………………………………… 15

CHAPTER 3: MODELS FOR HEARING AIDS……………………………… 17
3.1 History of Development of Hearing Aids…………………………. 17
3.2 First Model (Digital Hearing Aids for Moderate Hearing Losses)... 18
3.2.1 Features of Real Time Binaural Hearing Aid……………………… 18
3.2.2 Speech Processing Algorithm……………………………………... 18
3.2.2.1 Interaural Time Delay and Timer.………………………………… 18
3.2.2.2 Frequency Shaping.………………………………………………... 19
3.2.2.3 Adaptive Noise Cancellation using LMS.…………………………. 20
3.2.2.4 Amplitude Compression…………………………………………… 25
3.3 Second Model (A Method of Treatment for Sensorineural Hearing Impairment)…… 26
3.3.1 The Conceptual Prosthetic System Architecture..…………………. 26
3.3.2 Human Temporal Bone Vibration…………………………………. 28
3.3.3 Design Guideline for Optimum Accelerometer…………………… 31
3.3.4 Conclusion…………………………………………………………. 31

Topic 2: ACOUSTICAL SIMULATION OF ROOM………………………… 32
CHAPTER 1: GEOMETRICAL ACOUSTICS………………………………… 33
1.1 Introduction………………………………………………………... 33
1.2 Sound Behavior……………………………………………………. 35
1.3 Geometrical Room Acoustics……………………………………… 36
1.3.1 The Reflection of Sound Rays……………………………………... 36
1.3.2 Sound Reflections in Rooms………………………………………. 38
1.3.3 Room Reverberation……………………………………………….. 39
1.4 Room Acoustical Parameters & Objective Measures……………... 40
1.4.1 Reverberation Time………………………………………………... 40
1.4.2 Early Decay Time………………………………………………….. 40
1.4.3 Clarity and Definition……………………………………………… 41
1.4.4 Lateral Fraction and Bass Ratio…………………………………… 41
1.4.5 Speech Transmission Index………………………………………... 41

CHAPTER 2: ARTIFICIAL REVERBERATION……………………………… 42
2.1 Introduction………………………………………………………... 42
2.2 Shortcomings of Electronic Reverberators………………………… 42
2.3 Realizing Natural Sounding Artificial Reverberation……………... 43
2.3.1 Comb Filter………………………………………………………… 43
2.3.2 All-pass Filter……………………………………………………… 45
2.3.3 Combined Comb and All-pass Filters……………………………... 47
2.4 Ambiophonic Reverberation………………………………………. 49

CHAPTER 3: SPATIALIZATION……………………………………………… 50
3.1 Introduction………………………………………………………... 50
3.2 Two-Dimensional Amplitude Panning…………………………….. 51
3.2.1 Trigonometric Formulation………………………………………... 52
3.2.2 Vector Base Formulation…………………………………………... 54
3.2.3 Two-Dimensional VBAP for More Than Two Loudspeakers…….. 55
3.2.4 Implementing 2D VBAP for More Than Two Loudspeakers……... 56

Topic 3: NOISE CONTROL…………………………………………………… 57
CHAPTER 1: SOUND ABSORPTION………………………………………… 58
1.1 Absorption Coefficient...…………………………………………... 58
1.2 Measurement of Absorption Coefficient of the different materials.. 58
1.2.1 Procedures…………………………………………………………. 59
1.2.2 Laboratory Measurements of Absorption Coefficient…………….. 59
1.3 Sound Absorption by Vibrating or Perforated Boundaries………... 62

CHAPTER 2: SOUND TRANSMISSION……………………………………… 65
2.1 Transmission Coefficient………………………………………….. 65
2.2 Transmission loss………………………………………………….. 65
2.3 Sound Transmission Class STC…………………………………… 65
2.3.1 Determination of STC……………………………………………... 66
2.3.2 Laboratory Measurements of STC………………………………… 67
2.4 Controlling Sound Transmission through Concrete Block Walls…. 69
2.4.1 Single-Leaf Concrete Block Walls………………………………… 69
2.4.2 Double-Leaf Concrete Block Walls……………………………….. 71
2.5 Noise Reduction…………………………………………………… 71
2.5.1 Noise Reduction Determination Method…………………………... 71
2.5.2 The Noise Reduction Determinations of Some Absorbent Materials… 72
2.6 The Performance of Some Absorbent Materials…………………… 73

Topic 4: SPEECH TECHNOLOGY…………………………………………… 75
CHAPTER 1: SPEECH PRODUCTION……………………………………… 76
1.1 Introduction…………………………………………………........... 76
1.2 The human vocal apparatus………………………………………... 76
1.2.1 Breathing………………………………………………………....... 77
1.2.2 The larynx………………………………………………………….. 77
1.2.3 The vocal tract……………………………………………………... 79
1.3 Speech sounds……………………………………………………... 79
1.3.1 Phonemic representation…………………………………………... 79
1.3.2 Voiced, unvoiced and plosive sounds……………………………... 80
1.4 Acoustics of speech production……………………………………. 80
1.4.1 Formant frequencies……………………………………………….. 80
1.5 Perception………………………………………………………….. 82
1.5.1 Pitch and loudness…………………………………………………. 82
1.5.2 Loudness perception……………………………………………….. 82

CHAPTER 2: PROPERTIES OF SPEECH SIGNALS IN TIME DOMAIN…… 83
2.1 Introduction………………………………………………………... 83
2.2 Time-Dependent Processing of Speech……………………………. 83
2.3 Short-Time Average Zero-Crossing Rate………………………….. 84
2.4 Pitch period estimation…………………………………………….. 86
2.4.1 The Autocorrelation Method………………………………………. 86
2.4.2 Average magnitude difference function…………………………… 89

CHAPTER 3: SPEECH REPRESENTATION IN FREQUENCY DOMAIN…… 91
3.1 Introduction………………………………………………………... 91
3.2 Formant analysis of speech………………………………………... 91
3.3 Formant frequency extraction……………………………………... 91
3.3.1 Spectrum scanning and peak-picking method……………………... 92
3.3.2 Spectrum scanning………………………………………………… 92
3.3.3 Peak-Picking Method……………………………………………… 92

CHAPTER 4: SPEECH CODING……………………………………………… 93
4.1 Introduction………………………………………………………... 93
4.2 Overview of speech coding………………………………………... 93
4.3 Classification of speech coding……………………………………. 94
4.4 Linear Predictive Coding (LPC)…………………………………… 97
4.4.1 Basic Principles……………………………………………………. 97
4.4.2 The LPC filter……………………………………………………… 99
4.4.3 Problems in LPC model…………………………………………… 100
4.5 Basic Principles of Linear Predictive Analysis……………………. 101
4.5.1 The autocorrelation method………………………………………... 104
4.5.2 The covariance method……………………………………………. 106

CHAPTER 5: APPLICATIONS………………………………………………… 108
5.1 Speech synthesis…………………………………………………… 108
5.1.1 Formant – frequency Extraction…………………………………… 108
5.1.2 LPC..……………………………..………………………………… 114
5.2 Speaker identification using LPC………………………………….. 119
5.3 Introduction to VOIP………………………………………………. 122
5.3.1 VoIP Standards…………………………………………………….. 122
5.3.2 System architecture……...………………………………………… 123
5.3.3 Coding technique in VOIP systems……………………………….. 124
5.3.4 Introduction to G.727……………………………………………… 125
5.3.5 Introduction to G.729 and G.723.1………………………………… 127
CHAPTER 1
HEARING PROCESS

Introduction:
- The human ear can respond to frequencies from 20 Hz up to 20 kHz.
- It is more than a sensitive, broadband receiver.
- It acts as a frequency analyzer of impressive selectivity.
- It is one of the most delicate mechanical structures in the human body.

1.1 The structure of the ear:

Figure(1.1) The main structure of the Ear

It consists of 3 main parts: outer, middle and inner ear.

1-the outer ear:


- The visible portion of the ear.
- It collects the sound and sends it to the ear drum via the ear canal.
It contains:
1. Pinna:
- It serves as a horn, collecting sound in the auditory canal.
2. Auditory canal:
- It is a straight tube about 0.8 cm in diameter and 2.8 cm long.
- It is closed by the ear drum.
3. Ear drum:
- A small membrane that separates the outer ear from the middle ear
(considered the entrance to the middle ear).
- Shaped like a flattened cone.
- Quite flexible in the center and attached around the edge to the end of the canal.

2- The middle ear:
- It houses a chain of three bones (hammer, incus and stapes).
- It contains 3 ossicles (bones); the ear drum is connected to the first (hammer), which
communicates with the last (stapes) through the middle one (incus).
- These bones are set into motion by the movement of the ear drum.
- It is also connected to the throat via the "Eustachian tube".
- There is a collection of muscles and ligaments that controls the lever ratio of the
system. For intense sounds, the muscles controlling the motion of the bones change their
tension to reduce the amplitude of motion of the stapes, thereby protecting the inner ear
from damage. (N.B.: this offers no protection from sudden impulsive sounds.)

3- The inner ear:


- contains the tiny nerve endings for balance and hearing.
- Also contains a very unique fluid that becomes set in motion by the movement of the
oval window.
-the tiny nerve endings are then stimulated (activated) and each sends a message or an
impulse to the brain.

The inner ear has 3 main parts:


The vestibule, semi-circular canals and the cochlea.

• The vestibule:
- connects with the middle ear through two openings, the oval window and the round window
(both prevent the fluid from escaping the inner ear).

• The semi-circular canals:


- provide a sense of balance.

• The tube of cochlea:


- is divided by partitions into the upper gallery, the duct and the lower gallery.
- The ends of the galleries are connected to the oval and the round windows; the other ends
are connected to the apex of the cochlea.

The Duct:
- The duct is filled with endolymph (potassium-rich, related to the intracellular fluid throughout the
body), while the galleries are filled with perilymph (sodium-rich, similar to the spinal fluid).
- It also contains membranes, one of which is called the "basilar membrane". (On top of this
membrane is the organ of Corti, which contains 4 rows of hair cells.)

1.2 How the ear works:

Figure (1.2) The Hearing process

1- When the ear is exposed to a pure tone, sound waves are collected by the outer ear
and funneled through the ear canal to the ear drum.
2- Sound waves cause the ear drum to vibrate.
3- The motion of the ear drum is transmitted and amplified by the 3 bones of the middle
ear to the oval window of the inner ear, creating a fluid disturbance that travels in the
upper gallery toward the apex, into the lower gallery, and then propagates in the lower
gallery to the round window, which acts as a pressure-release termination.

NOTES
The basilar membrane is driven into highly damped motion whose peak amplitude
increases slowly with distance away from the stapes, reaches a maximum, and then
diminishes rapidly toward the apex.

1.3 Computational Models of Ear Functions

A computational model has been derived to describe basilar membrane displacement in


response to an arbitrary sound pressure at the ear drum.

In this simplified schematic of the peripheral ear:

p(t) ──► [ Middle Ear: G(s) ] ──► x(t) ──► [ Basilar Membrane: F_L(s) ] ──► y_l(t)

Figure (1.3) Schematic Diagram of Peripheral Ear

$$\frac{Y_L(s)}{P(s)} = \frac{X(s)}{P(s)}\cdot\frac{Y_L(s)}{X(s)} = G(s)\,F_L(s)$$

Where:
• p(t) is the sound pressure at the ear drum.
• x(t) is the equivalent linear displacement of the stapes.
• y_l(t) is the linear displacement of the basilar membrane at a distance l
from the stapes.

The Desired Objective:

Is an analytical approximation to the relations among these quantities.

It is convenient to obtain it in two steps:

• The first is to approximate the middle ear transmission, that is, the relation between
x(t) and p(t).
• The second is to approximate the transmission from the stapes to a specified point l on
the membrane.

The approximating functions are specified as frequency-domain (Laplace-transform)
functions G(s) and F_l(s).

Conditions for Approximation:


The functions G(s) and Fl(s) must be fitted to available physiological data,
therefore:

• If the ear is assumed to be mechanically passive and linear over
the frequency and amplitude ranges of interest, rational functions
of frequency can be used to approximate the physiological data.

• Because the model is an input-output analog, the response of one
point does not require explicit computation of the activity at the
other points. One therefore has the freedom to calculate the
displacement y_l(t) for as many, or as few, values of l as are
desired.

1. Basilar Membrane Model:


The physiological data upon which the form of F_l(s) is based are those of
von Békésy.

If the curves are normalized with respect to the frequency of the maximum
response, one can find that:

 They are approximately constant percentage bandwidth


responses as shown in figure 1.4(A).
• Also, the phase data suggest a component which is
approximately a simple delay, and whose value is inversely
proportional to the frequency of peak response. That is, low-
frequency points on the membrane (nearer to the apex) exhibit
more delay than high-frequency (basal) points, as shown in figure
1.4(B), where at low frequencies the phase shift is large and at
high frequencies the phase shift is small.

One function which provides a reasonable fit to von Békésy's results is:


$$F_L(s) = C_1\,\beta_L^{4}\left(\frac{2000\pi\,\beta_L}{\beta_L+2000\pi}\right)^{0.8}\cdot\frac{s+\varepsilon_L}{s+\beta_L}\cdot\left[\frac{1}{(s+\alpha_L)^{2}+\beta_L^{2}}\right]^{2}\cdot e^{-3\pi s/4\beta_L}$$
Where:

• s = σ + jω is the complex frequency.
• βL = 2αL is the radian frequency to which the point a distance l from the
stapes responds maximally.
• C1 is a real constant that gives the proper absolute value of displacement.
• e^(−3πs/4βL) is a delay factor of 3π/4βL seconds which brings the phase delay
of the model into line with the phase measured on the human ear.
• (2000πβL/(βL+2000π))^0.8 · βL^4 is an amplitude factor which matches the variations
in peak response with resonant frequency (βL).
• εL/βL = 0.1 to 0.0, depending upon the desired fit to the response at low
frequencies.
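To make the model concrete, here is a minimal numerical sketch (not from the original text) that evaluates F_L(jω) for a few of the characteristic frequencies shown in figure 1.4; C1 = 1 and εL = 0.1·βL are assumed values of the free constants:

```python
import numpy as np

def F_L(w, beta_L, C1=1.0, eps_ratio=0.1):
    """Basilar-membrane point response F_L(jw) of the model above.

    beta_L : radian frequency of maximal response at the point l.
    C1 and eps_ratio are free constants of the fit (assumed values here).
    """
    s = 1j * w
    alpha_L = beta_L / 2.0                      # beta_L = 2 * alpha_L
    eps_L = eps_ratio * beta_L                  # epsilon_L / beta_L in [0, 0.1]
    amp = C1 * beta_L**4 * (2000*np.pi*beta_L / (beta_L + 2000*np.pi))**0.8
    zero = (s + eps_L) / (s + beta_L)
    poles = (1.0 / ((s + alpha_L)**2 + beta_L**2))**2
    delay = np.exp(-3*np.pi*s / (4*beta_L))
    return amp * zero * poles * delay

# Normalized magnitude/phase curves for a few characteristic frequencies (Hz)
for f_peak in (50, 2500, 5000, 10000):
    beta = 2*np.pi*f_peak
    w = beta * np.logspace(-1, 1, 200)          # w/beta from 0.1 to 10, as in Figure 1.4
    H = F_L(w, beta)
    print(f_peak, np.abs(H).max())
```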

[Plots: (A) magnitude |F_L(jω)| and (B) phase of F_L(jω) versus normalized frequency ω/β, for β = 2π·50, 2π·2500, 2π·5000 and 2π·10000.]

Figure (1.4) Amplitude & Phase Response of the Basilar Membrane Model F_L(s)

The membrane response at any point is therefore approximated in terms of the
poles and zeros of the rational function F_L(s).

The resonant properties of the membrane are approximately constant-percentage-bandwidth
(constant-Q).

The real and imaginary parts of the critical frequencies can therefore be related
by a constant factor, namely βL = 2αL; the imaginary part of the pole frequency (βL)
completely describes the model and the characteristics of the membrane at a place l-
distance from the stapes.

The real-frequency response of the model is obtained by letting s = jω.


The inverse Laplace transform of F_l(s) (shown in figure 1.5) is the displacement
response of the membrane to an impulse of displacement by the stapes; the result of the
inverse transformation is found to be:
$$f_l(t) = C_1\left(\frac{2000\pi}{\beta_L+2000\pi}\right)^{0.8}\beta_L^{\,1+r}\Big\{\,[0.033 + 0.36\,\beta_L(t-T)]\,e^{-\beta_L(t-T)/2}\sin\beta_L(t-T)$$
$$\qquad\qquad + [0.575 - 0.320\,\beta_L(t-T)]\,e^{-\beta_L(t-T)/2}\cos\beta_L(t-T) - 0.575\,e^{-\beta_L(t-T)}\Big\}$$

for t ≥ T and εL/βL = 0.1, where the delay T = 3π/4βL.

From this relation we find that as the frequency increases, the delay of the
maximum response decreases, as shown in figure 1.5.

[Plots: the inverse transform of F_L(s) versus time, for β = 2π·50, 2π·150, 2π·1500, 2π·5000, 2π·10000 and 2π·20000.]

Figure (1.5) The Basilar Membrane Response to an Impulse of Stapes Displacement, for several characteristic frequencies

We also note from the equation for F_L(s) that βL depends on l, the distance from
the stapes to the point of maximum response of the membrane, according to the relation

$$35 - l = 7.5\,\log_{10}\!\left(\frac{\beta_L}{2\pi\cdot 20}\right)$$

with l in mm. So as the frequency increases, l decreases; this is expected because the
higher-frequency points of the basilar membrane, which respond quickly and maximally, lie
nearer to the stapes (as shown in figure 1.6).
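A small numerical check of this place-frequency relation (assuming the logarithm is base 10 and l is in millimetres):

```python
import numpy as np

def place_of_max_response(f_hz):
    """Distance l (mm from the stapes) of maximal response for a tone of f_hz,
    from 35 - l = 7.5*log10(beta_L / (2*pi*20)) with beta_L = 2*pi*f."""
    return 35.0 - 7.5*np.log10(f_hz/20.0)

for f in (20, 200, 2000, 20000):
    print(f"{f:>6} Hz -> l = {place_of_max_response(f):5.1f} mm")
# 20 Hz maps to the apical end (l = 35 mm); 20 kHz maps toward the base (l = 12.5 mm).
```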
[Plot: l (in mm) versus βL (in Hz), falling from about 40 mm at low frequencies to about 10 mm at the high-frequency end.]

Figure (1.6) The relation between βL and l, the distance from the stapes of the point that responds maximally

2. Middle Ear Transmission:


To account for middle ear transmission, an analytical specification is necessary
of the stapes displacement produced by a given sound pressure at the ear drum.

Quantitative physioacoustical data on the operation of the human middle ear are
sparse.

All agree that the middle ear transmission is a low-pass function, as shown in
figure 1.7.

An approximating Function of 3rd Degree of Middle Ear Transmission:

$$G(s) = \frac{C_0}{(s+a)\left[(s+a)^2 + b^2\right]}$$
• Where C0 is a positive real constant.

• One might take C0 = a(a² + b²) so that the low-frequency transmission of
G(s) is unity, with the pole frequencies of G(s) related according to
b = 2a = 2π(1500) rad/sec.
[Plot: |G| versus frequency in cycles per second, 0 to 7000.]

Figure (1.7) Functional approximation of the middle ear transmission

 The inverse transform of G(s) is the displacement response of the stapes to


an impulse of pressure at the ear drum, given by:

$$g(t) = C_0\,\frac{e^{-at}}{b}\,(1-\cos bt) = \frac{C_0\,e^{-bt/2}}{b}\,(1-\cos bt)$$
For this middle ear function the response is seen to be heavily damped: both the
displacement of the stapes and its velocity decay rapidly, as shown in the next two figures.
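A minimal numerical sketch of the middle-ear approximation (assuming C0 = a(a² + b²) and b = 2a = 2π·1500 rad/s, as above):

```python
import numpy as np

a = np.pi * 1500.0
b = 2.0 * a
C0 = a * (a**2 + b**2)              # makes the low-frequency transmission unity

def G(w):
    """Frequency response G(jw) of the middle-ear model."""
    s = 1j * w
    return C0 / ((s + a) * ((s + a)**2 + b**2))

def g(t):
    """Impulse response of stapes displacement (as given in the text)."""
    return C0 * np.exp(-b*t/2.0) * (1.0 - np.cos(b*t)) / b

f = np.linspace(0.0, 7000.0, 8)      # Hz, the range of Figure 1.7
print(np.abs(G(2*np.pi*f)))          # low-pass: about 1 at DC, falling with frequency
print(g(np.linspace(0, 2e-3, 5)))    # heavily damped displacement response
```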

[Plot: g(t) versus time.]

Figure (1.8) Displacement of stapes to an impulse of pressure on the ear drum

The time derivative of the stapes displacement is:

$$\dot g(t) = \frac{C_0\,e^{-bt/2}}{2}\,(2\sin bt + \cos bt - 1)$$
[Plot: stapes velocity versus time.]

Figure (1.9) Velocity of stapes to an impulse of pressure on the ear drum

3. Combined Response of Middle ear and Basilar Membrane:


The combined response of the models for the middle ear and basilar membrane is:

H_l(s) = G(s)·F_l(s)

h_l(t) = g(t) * f_l(t)

The combined response of G(s) and FL (s) in the frequency domain is simply the sum of
the individual curves for amplitude (in dB) and phase (in radians).

When the inverse transform is calculated, the result has the form:

$$h_l(\tau) = A\,e^{-b\tau/2} + B\,e^{-b\tau/2}\left(\cos b\tau - \tfrac{1}{2}\sin b\tau\right) + C\,e^{-b\tau/2}\sin b\tau$$
$$\qquad + D\,e^{-\eta b\tau} + E\,e^{-\eta b\tau/2}\sin\eta b\tau + F\,\eta b\tau\,e^{-\eta b\tau/2}\sin b\tau$$
$$\qquad + G\,e^{-\eta b\tau/2}\cos\eta b\tau + H\,\eta b\tau\,e^{-\eta b\tau/2}\cos\eta b\tau,\qquad \text{for }\tau \ge 0$$

Where A, B, C, D, E, F, G, H are all real numbers which are functions of βL
and b; τ = t − T; T = 3π/4βL; η = βL/b; βL = 2αL; b = 2a; ξL = 0.

The form of the impulse response is thus seen to depend upon the parameter (η=
βL /b)

Values of η <1.0 refer to apical (low frequency) membrane points whose


frequency of maximal response is less than the critical frequency of the middle ear.
For these points, the middle ear transmission is essentially constant with
frequency. (As shown in figure 1.10 (a,b,c))

On the other hand, values of η > 1.0 refer to basal (high frequency) points which
respond maximally at frequencies greater than the critical frequency of the middle ear.

For these points, the middle ear transmission is highly dependent upon
frequency and would be expected to influence the membrane displacement strongly
(as shown in figure 1.10 (d, e, f)).
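A short numerical sketch of the combined response H_l(jω) = G(jω)·F_L(jω) for an apical (η < 1) and a basal (η > 1) point, using the same assumed constants as in the earlier sketches (C1 = 1, εL = 0.1·βL, b = 2a = 2π·1500 rad/s):

```python
import numpy as np

a = np.pi * 1500.0
b = 2.0 * a
C0 = a * (a**2 + b**2)

def G(w):                                       # middle-ear transmission
    s = 1j * w
    return C0 / ((s + a) * ((s + a)**2 + b**2))

def F_L(w, beta_L, C1=1.0):                     # basilar-membrane point response
    s = 1j * w
    alpha_L, eps_L = beta_L / 2.0, 0.1 * beta_L
    amp = C1 * beta_L**4 * (2000*np.pi*beta_L / (beta_L + 2000*np.pi))**0.8
    return (amp * (s + eps_L) / (s + beta_L)
            / ((s + alpha_L)**2 + beta_L**2)**2
            * np.exp(-3*np.pi*s / (4*beta_L)))

w = 2*np.pi*np.logspace(1, 4.3, 2000)           # 10 Hz to about 20 kHz
for f_peak in (150, 5000):                      # apical (eta < 1) vs basal (eta > 1)
    beta = 2*np.pi*f_peak
    H = G(w) * F_L(w, beta)                     # dB magnitudes of G and F_L simply add
    f_max = w[np.argmax(np.abs(H))] / (2*np.pi)
    print(f"eta = {beta/b:.2f}: |H_l| peaks near {f_max:.0f} Hz")
```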
[Plots: h(t) versus time for B = 2π·50, 2π·150, 2π·1200 (panels a, b, c) and B = 2π·3000, 2π·5000, 2π·10000 (panels d, e, f).]

Figure (1.10) The response of the ear for the different frequency ranges

Note:

As already indicated by the impulse responses:

 The response of apical points on the membrane is given essentially by


FL(s), while for basal points the response is considerably influenced by
the middle-ear transmission G (s).

 Concerning the latter point, it may be noted that at frequencies appreciably
less than its peak response frequency, the membrane function F_L(ω)
behaves as a differentiator. Because the middle ear transmission begins to
diminish in amplitude at frequencies above about 1500 cps, the
membrane displacement in the basal region is roughly the time
derivative of the stapes displacement (as shown in figure (1.10)).

 The waveform of the impulse response along the basal part of the
membrane is therefore approximately constant in shape (as shown in
figure (1.4)).

 Along the apical part, however, the impulse response oscillates more
slowly (in time) as the apex is approached (as shown in figure 1.9).

CHAPTER 2
FUNDAMENTAL PROPERTIES OF HEARING

The five main properties we are going to discuss are:
1) Thresholds
2) Equal loudness level
3) Critical bandwidth
4) Masking
5) Beats and combination tones

2.1 Thresholds:
The threshold of audibility is the minimum perceptible of L1 of a tone that can be
detected at each frequency over the entire range of ear. The tone should have duration
of 1S.a representative threshold of audibility for a young undamaged ear is shown as
the lowest
Curve in figure.

Figure (2.1) Threshold of audibility & free field, equal loudness level contour

The frequency of maximum sensitivity is near 4 kHz. At high frequencies the threshold
rises rapidly to a cutoff. It is in this higher-frequency region that the greatest
variability is observed among different listeners, particularly if they are over 30 years of age.
The cutoff frequency for a young person may be as high as 20 kHz or even 25 kHz, but
people over 40 or 50 years of age with typical hearing can seldom hear frequencies near or
above 15 kHz. In the range below 1 kHz, the threshold is usually independent of the age
of the listener.
As the intensity of the incident acoustic wave is increased, the sound grows louder and
eventually produces a tickling sensation; this occurs at an intensity level of about
120 dB and is called the threshold of feeling. The tickling sensation becomes one of pain
at about 140 dB.
Since the ear responds relatively slowly to loud sound by reducing the lever action of
the middle ear, the threshold of audibility shifts upwards under exposure; the amount of
shift depends on the intensity and the duration of the sound. After the sound is removed,
the threshold of hearing will begin to return toward its original value, and if the ear fully
recovers its original threshold it has experienced a temporary threshold shift (TTS). The
amount of time required for a complete recovery increases with increasing intensity and
duration of the sound. If the exposure is long enough or the intensity high enough, the
recovery of the ear is not complete: the threshold never returns to its original value, and a
permanent threshold shift (PTS) has occurred.
It is important to realize that the damage leading to PTS occurs in the inner ear, where the
hair cells are damaged.
It is important to realize that the damage leading to PTS occurs in the inner ear, the hair
cells are damaged.
Also of importance are differential thresholds, one of which is the differential threshold
for intensity. If two tones of almost identical frequency are sounded together,
one tone much weaker than the other, the resultant signal is indistinguishable from a
single frequency whose amplitude fluctuates slightly and sinusoidally.
The amount of fluctuation that the ear can just barely detect, when converted into the
difference in intensity between the stronger and the weaker portions, determines the
differential threshold. As might be expected, the values depend on frequency, the number of
beats per second, and intensity level.
Generally the greatest sensitivity to intensity changes is found for about 3 beats per
second. Sensitivity decreases at the frequency extremes, particularly at low frequencies, but
the effect diminishes with increasing sound levels.
For sounds more than 40 dB above threshold, the ear is sensitive to intensity-level
fluctuations of less than 2 dB at the frequency extremes and less than about 1 dB between 100
and 1000 Hz.
Other differential thresholds involve the ability to discriminate between two sequential
signals of nearly the same frequency.
The frequency difference required to make the discrimination is termed the difference limen.

2.2 Equal loudness level contour:


Experiments in which listeners judge when two tones of different frequency, sounded
alternately, are equally loud provide contours as a function of frequency. As seen in
figure 2.1, high- and low-frequency tones require greater values of L1 to sound as loud as
those in the mid-frequency range. The curves resulting from such comparisons are
labeled by the L1 they have at 1 kHz. Each curve is an equal loudness level contour and
expresses the loudness level LN in phons, which is assigned to all tones whose L1 falls on
the contour. Thus LN = L1 for a 1 kHz tone, regardless of its level; however, a 4 kHz tone
with L1 = 90 dB has a loudness level LN = 70 phons, as does a 4 kHz tone with L1 = 61
dB. The curves become straighter at higher loudness levels, and LN and L1 become more
similar at all frequencies.

2.3 Critical bandwidth:


If a subject listens to a sample of noise with a tone present, the tone cannot be detected
until its L1 exceeds a value that depends on the amount of noise present.
It is found that the masking of a tone by a broadband noise is independent of the noise
bandwidth until the bandwidth becomes smaller than some critical value that depends on
the frequency of the tone.
In this task the ear appears to act like a collection of parallel filters, each with its own
bandwidth, and the detection of a tone requires that its level exceed the noise level in its
particular band by some detection threshold.

In early experiments it was assumed that the signal must equal the noise for detection to
occur (DT = 0). On this basis, and assuming that the sensitivity of the ear is constant
across each bandwidth Wcr, it follows that Wcr = S/N1, where S is the signal power and N1
the noise power per Hz. The bandwidths measured this way are now termed the critical
ratios.
Later experiments based on the perceived loudness of noise have yielded critical
bandwidths Wcb larger than the critical ratios. In some of these experiments, the
loudness of a band of noise is observed as a function of bandwidth while the overall
noise level is held constant. For noise bandwidths less than the critical value, the loudness
remains constant, but when the bandwidth exceeds the critical bandwidth, the loudness
increases.

2.4 Masking:
This is the increase of the threshold of audibility in the presence of noise. First consider the
masking of one pure tone by another. The subject is exposed to a single tone of fixed
frequency and L1, and then asked to detect another tone of different frequency and
level. Analysis yields the threshold shift, the increase in L1 of the masked tone above its
value for the threshold of audibility before it can be detected. Figure 2.2
gives representative results for masking frequencies of 400 and 2000 Hz. The frequency
range over which there is appreciable masking increases with the L1 of the masker, the
increase being greater for frequencies above that of the masker. This is to be expected
because the region of the basilar membrane excited into appreciable motion for
moderate values of L1 extends from the maximum further toward the stapes than toward the
apex.

Figure (2.2) Masking of one pure tone by another (The Abscissa is the frequency of the masked
tone)

2.5 Beats and combination tones:


Let two tones of similar frequency F1 and F2 and of equal L1 be presented to one ear (or
both ears).
When the two frequencies are very close together, the ear perceives a tone of single
frequency Fc= (F1 +F2)/2 fluctuating in intensity at the beat frequency FB=abs (F1 -F2).

As the frequency interval between the two tones increases, the sensation of beating
changes to throbbing and then to roughness; the roughness gradually diminishes and the sound
becomes smoother, finally resolving into two separate tones. For frequencies falling in
the midrange of hearing, the transition from beats to throbbing occurs at about 5 to 10
beats per second, and this turns into roughness at about 15 to 30 beats per second. These
transitions occur at higher beat frequencies as the frequencies of the primary tones are
increased.
The transition to separate tones occurs when the frequency interval has increased to about
the critical bandwidth.
None of this occurs if each tone is presented to a different ear. When each ear is
exposed to a separate tone, the combined sound does not exhibit intensity fluctuations;
this kind of beating is absent. This suggests that the beats arise because the two tones
generate overlapping regions of excitation on the basilar membrane, and it is not until
these regions become separated by the distance corresponding to the critical bandwidth
that they can be sensed separately by the ear. When the tones are presented one in each
ear, each basilar membrane is excited separately and these effects do not occur.
If the two tones are separated far enough and are of sufficient loudness, combination tones
can be detected. These combination tones are not present in the original sound but are
manufactured by the ear. There is a collection of possible combination tones whose
frequencies are various sums and differences of the original frequencies F1 and F2:

Fnm = |m F2 ± n F1|,   n, m = 1, 2, 3, ...

Only a few of these frequencies will be sensed. One of the easiest to detect is the
difference frequency |F1 − F2|.
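A small numerical illustration of these relations (the tone pair below is an arbitrary example):

```python
# Beat and combination-tone frequencies for an example tone pair (Hz)
F1, F2 = 440.0, 446.0
Fc = (F1 + F2) / 2.0            # perceived tone when the two are very close
FB = abs(F1 - F2)               # beat frequency

print(f"center {Fc} Hz, beats at {FB} Hz")

# A few combination tones F_nm = |m*F2 +/- n*F1| manufactured by the ear
for m in (1, 2):
    for n in (1, 2):
        print(m, n, abs(m*F2 - n*F1), m*F2 + n*F1)
```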

CHAPTER 3
MODELS OF HEARING AIDS

3.1 History of development of the hearing aids

Types of hearing impairment (weakness):


Basically, there are 3 types of hearing impairments: conductive, sensorineural and
mixed.
A- Conductive: since the outer ear and the middle ear are involved in conducting
sound, a problem located in these areas is considered a conductive hearing
impairment. It may be corrected or partially corrected with surgery and/or medication.
Amplification or the use of hearing aids may also be an option.
B-sensorineural: a problem associated with the inner ear is considered a
sensorineural hearing impairment. Generally, this type of hearing impairment is the
result of damage or degeneration to the tiny nerve endings. It is usually not correctable
with surgery or medication. The use of amplification is typically the choice of
treatment.
C-mixed: If both of these types of hearing impairment occur at the same time, the
result is a mixed hearing impairment.

The technology of hearing aids:

Analog vs. Digital:


All hearing aids, whether analog or digital, are designed to increase the loudness of
sounds reaching the ear drum so that the hearing-impaired person can better
understand speech. To accomplish this, three basic components are required:

1- A microphone to gather acoustic energy (sound waves in the air) and convert it to
electrical energy.
2- An amplifier to increase the strength of the electrical energy.
3- A receiver, which converts the electrical energy back into acoustic energy (sound
waves).

The main advantage of analog hearing aids is their accuracy in sound reproduction
with low noise and distortion.

However, they were large in size and had high power requirements. To overcome this,
ASIC-based designs were developed, but these had the disadvantage of increasing the cost
about five times. The programmable DSP approach was then introduced; its advantages
were reduced cost and improved sound quality, but its disadvantage is that the digital
hearing aid amplifies all sound, including noise.
To overcome this, a real-time binaural digital hearing aid platform is used
(TM3205000).

Now we will discuss the details of the hearing aid for moderate hearing loss.

3.2 First Model


Digital Hearing Aid for Conductive Impairment

3.2.1 Features of Real Time Binaural Hearing Aid:


1) Real time binaural hearing aid (of type fixed point signal processor).
[Note: there are two types of processors, fixed-point and floating-point chips. Floating-point
chips make it easier to implement DSP algorithms, since quantization effects are negligible
for most applications, while fixed-point chips are smaller in size and have lower power
requirements; of course, their algorithms have to be analyzed for the effect of quantization
noise on performance.]

2) It samples the two input microphone signals at a rate of 32 kHz/channel (hearing aid
bandwidth = 10 kHz) and drives a stereo headphone output (required power source = 1.8 V).

3) It can be developed further by reducing the power-source voltage to 1 V and reducing the
MIPS needed for the final implementation of the hearing aid.

3.2.2 Speech Processing Algorithm


The Hearing Loss
Hearing loss is characterized by reduced sensitivity to sound, varying with the signal level and
frequency. To overcome it, a system is needed to:
1) Separate high and low frequencies.
2) Improve speech comprehension.
3) Provide listening comfort.

Speech Processing Algorithm:


1) Frequency shaping.
2) Adaptive noise reduction.
3) Multi channel amplitude compression.
4) Interaural time delay.
5) Timer
The first three algorithms will be discussed later in detail.
But first, a small comment on the last two algorithms, to understand their important
function.

3.2.2.1 The timer:


Function: it is used to switch off the drive of the left and right earpieces in a mutually
exclusive fashion, for a fraction of a second at a time.
Users: persons with severe hearing loss.
Advantage: overcoming the problem of fatigue due to high-gain amplification without
affecting the hearing aid performance.

The Interaural time delay:
Function: used to provide a delay to the signal going to one ear with respect to the signal going
to the second ear, on a frequency-selective basis.
Uses: it is provided on the theory that if a person has a differential hearing loss, then in
addition to compensating the gain of the signals going to the two ears, there must be provision
for compensating the delay between the signals received by the two ears.

Now, we are going to discuss the main three points of our study.

Figure (3.1) Digital signal Processor

3.2.2.2 Frequency Shaping:


It is:
1) A binaural equalizer consisting of two banks of band-pass filters (one bank per ear).
2) It covers frequencies from DC to 16 kHz.
3) The filters used (i) have linear phase, (ii) provide high isolation between bands, and
(iii) are FIR (finite impulse response) filters.
[Typically there are 50 filters, and typically each has 50-200 taps to increase the
shaping precision.]

The therapist selects:

1) The number of band-pass filters.
2) The cut-off frequencies.
3) The critical frequencies.
4) The isolation between bands.

Once the filters are selected with all the previous characteristics, the therapist fits the
spectral magnitude to the subject's hearing loss by adjusting the gain of each filter.
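As an illustration of this stage (not the actual implementation described in the text), a bank of linear-phase FIR band-pass filters with therapist-set gains might look like the following sketch; the band edges, gains, number of taps and sampling rate are assumed example values:

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 32_000                                   # Hz, sampling rate per channel
edges = [(100, 500), (500, 2000), (2000, 6000), (6000, 12000)]   # example band edges (Hz)
gains_db = [0.0, 10.0, 20.0, 15.0]            # example per-band gains set by the therapist

# 129-tap linear-phase band-pass FIR filters, one per band
bank = [firwin(numtaps=129, cutoff=band, pass_zero=False, fs=fs) for band in edges]

def frequency_shaping(x):
    """Sum of gain-weighted band-pass outputs (one ear)."""
    y = np.zeros(len(x), dtype=float)
    for h, g_db in zip(bank, gains_db):
        y += 10**(g_db/20.0) * lfilter(h, [1.0], x)
    return y

x = np.random.randn(fs)                       # 1 s of test input
y = frequency_shaping(x)
```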

3.2.2.3 The Noise Cancellation using LMS Algorithm:

This method is a feedback system, and it depends on two operations that we describe now.
According to the LMS algorithm figure (figure 3.2), the tap weights are initially zero, so
nothing comes out of the transversal filter and the error e(1) simply equals d(1) [the
desired signal, the one with no noise]. This error is used to adapt the tap weights applied to
the input signal u(n); the filter then produces an estimate of the desired signal, and a new
error is generated by comparing this estimate with the actual value of the desired signal
(which we already know), and that error is fed back again to adapt the weights for the
successive samples. Producing the estimate is called the filtering process.
Adjusting the tap weights from the generated error is called the adaptive process.
So the function of the transversal filter is the filtering process, and the adaptive control
of the tap weights is the adaptive process.

So this was the basis of the LMS. Now we will see how the LMS algorithm really works.
First, we use the least-mean-square equation to show how the tap weights of the adaptive
weight-control mechanism are updated. The gradient of the mean-squared error is:

$$\nabla J(n) = -2\,\mathbf{p} + 2\,\mathbf{R}\,\mathbf{w}(n)$$

The simplest choice of estimators for R and p is to use instantaneous estimates based on
the sample values of the tap inputs:

$$\hat{\mathbf{R}}(n) = \mathbf{u}(n)\,\mathbf{u}^{H}(n),\qquad \hat{\mathbf{p}}(n) = \mathbf{u}(n)\,d^{*}(n)$$

And so the first equation becomes:

$$\hat{\nabla} J(n) = -2\,\mathbf{u}(n)\,d^{*}(n) + 2\,\mathbf{u}(n)\,\mathbf{u}^{H}(n)\,\mathbf{w}(n)$$

This estimate may be viewed as the gradient operator applied to the
instantaneous squared error. By substituting this estimate into the steepest-descent
equation, we get a new recursive relation for updating the tap-weight vector:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\,\mathbf{u}(n)\left[d^{*}(n) - \mathbf{u}^{H}(n)\,\mathbf{w}(n)\right]$$

Figure (3.2) The LMS Algorithm

So the summary of the figure is that we have:

1. Filter output (the estimate of the desired signal):
   y(n) = w^H(n) u(n)   (2)
2. Estimation error:
   e(n) = d(n) − y(n)   (3)
3. Tap-weight adaptation:
   w(n+1) = w(n) + μ u(n) [d*(n) − u^H(n) w(n)]   (4)

We notice that the error estimation is based on the current estimate of the tap-weight
vector w(n), and that the second term in equation (4) represents the correction that is
applied to the current estimate of the tap weights.
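As a concrete illustration (not part of the original text), here is a minimal real-valued LMS sketch of equations (2)-(4); the filter length and step size are arbitrary example values:

```python
import numpy as np

def lms(u, d, M=32, mu=0.01):
    """Least-mean-square adaptive filter (real-valued signals assumed).

    u : reference/tap-input samples, d : desired-response samples.
    Returns the filter output y, the error e, and the final tap weights w.
    """
    w = np.zeros(M)
    y = np.zeros(len(u))
    e = np.zeros(len(u))
    for n in range(M - 1, len(u)):
        u_n = u[n - M + 1:n + 1][::-1]   # tap-input vector [u(n), ..., u(n-M+1)]
        y[n] = w @ u_n                   # (2) filter output
        e[n] = d[n] - y[n]               # (3) estimation error
        w = w + mu * u_n * e[n]          # (4) tap-weight adaptation
    return y, e, w
```

In the noise canceller described next, u would be the reference (interference) input and d the primary input, so that the error e becomes the cleaned signal.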

That was just an introduction on how the LMS Algorithm works in general and the
basic block diagram.

Adaptive Noise Cancellation applied to a sinusoidal interference:


The traditional method of suppressing a sinusoidal interference corrupting an information-
bearing signal is to use a fixed notch filter tuned to the frequency of the interference.
The adaptive noise canceller using the LMS algorithm has two important characteristics:
1. The canceller behaves as an adaptive notch filter whose null point is determined by
the angular frequency ωo of the sinusoidal interference. Hence it is tunable and the
tuning frequency moves with ωo.
2. The notch in the frequency response can be made very sharp precisely at the
frequency ωo of the sinusoidal interference (by choosing a small enough value for the
step-size parameter μ), and this will be proved later.
Now we will discuss the noise canceller in detail: it is a dual-input adaptive noise
canceller.
As shown in figure (3.3), the primary input supplies an information-bearing signal plus a
sinusoidal interference; the reference input supplies the sinusoidal interference. For the
adaptive filter, we may use a transversal filter whose tap weights are adapted by means of
the LMS algorithm. The filter uses the reference input to provide an estimate of the
sinusoidal interference contained in the primary signal. Thus, by subtracting the adaptive
filter output from the primary input, the effect of the sinusoidal interference is diminished.

Figure (3.3) adaptive noise canceller

So as shown in figure (3.2) the adaptive noise canceller consists of:

1. The primary input:

d(n) = s(n) + A0 cos(ωo n + φ0)

where s(n) is the information-bearing signal, A0 is the amplitude of the sinusoidal
interference, ωo is the normalized angular frequency, and φ0 is the phase.

2. Reference input:

u(n) = A cos(ωo n + φ)

In the LMS algorithm, the tap-weight update is described by the following equations:

$$y(n) = \sum_{i=0}^{M-1}\hat w_i(n)\,u(n-i)$$

$$e(n) = d(n) - y(n)$$

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\,\mathbf{u}(n)\left[d^{*}(n) - \mathbf{u}^{H}(n)\,\mathbf{w}(n)\right]$$

where M is the total number of tap weights in the transversal filter.
With a sinusoidal excitation as the input of interest, we restructure the block
diagram of the adaptive noise canceller of figure (3.3). According to this new
representation, we may lump the sinusoidal input u(n), the transversal filter, and the
weight-update equation of the LMS algorithm into a single open-loop system defined
by the transfer function G(z), as shown in figure (3.4).

Figure (3.4) Equivalent Model in Z-domain

Where:

$$G(z) = \frac{Y(z)}{E(z)}$$

Y(z) and E(z) being the z-transforms of the filter output y(n) and the estimation error
e(n), respectively. Given E(z), our task is to find Y(z), and therefore G(z). To do so we
use the signal-flow-graph representation of figure (3.4). In this diagram, we have
singled out the i-th tap weight for specific attention. The corresponding value of the tap
input is
$$u(n-i) = A\cos[\omega_o(n-i)+\phi] = \frac{A}{2}\left[e^{\,j(\omega_o n+\phi_i)} + e^{-j(\omega_o n+\phi_i)}\right],\qquad \phi_i = \phi - i\,\omega_o$$

Then, taking the z-transform of the product u(n−i)e(n):

$$Z[u(n-i)\,e(n)] = \frac{A}{2}\,e^{\,j\phi_i}\,Z\!\left[e(n)\,e^{\,j\omega_o n}\right] + \frac{A}{2}\,e^{-j\phi_i}\,Z\!\left[e(n)\,e^{-j\omega_o n}\right]
= \frac{A}{2}\,e^{\,j\phi_i}\,E\!\left(z e^{-j\omega_o}\right) + \frac{A}{2}\,e^{-j\phi_i}\,E\!\left(z e^{\,j\omega_o}\right)$$

Then, taking the z-transform of the tap-weight update, equation (4):

$$z\,\hat W_i(z) = \hat W_i(z) + \mu\,Z[u(n-i)\,e(n)]$$

$$\hat W_i(z) = \frac{\mu A}{2}\cdot\frac{1}{z-1}\left[e^{\,j\phi_i}\,E\!\left(z e^{-j\omega_o}\right) + e^{-j\phi_i}\,E\!\left(z e^{\,j\omega_o}\right)\right]$$
Then, since

$$y(n) = \frac{A}{2}\sum_{i=0}^{M-1}\hat w_i(n)\left[e^{\,j(\omega_o n+\phi_i)} + e^{-j(\omega_o n+\phi_i)}\right],$$

taking the z-transform gives

$$Y(z) = \frac{A}{2}\sum_{i=0}^{M-1}\left[e^{\,j\phi_i}\,\hat W_i\!\left(z e^{-j\omega_o}\right) + e^{-j\phi_i}\,\hat W_i\!\left(z e^{\,j\omega_o}\right)\right]$$
We find that it consists of two components:

1) A time-invariant component:

$$\frac{\mu M A^2}{4}\left(\frac{1}{z e^{-j\omega_o}-1} + \frac{1}{z e^{\,j\omega_o}-1}\right)$$

2) A time-varying component proportional to:

$$\beta(\omega_o, M) = \frac{\sin(M\omega_o)}{\sin\omega_o}$$

Since the value of M is large, we have

$$\frac{\beta(\omega_o, M)}{M} = \frac{\sin(M\omega_o)}{M\sin\omega_o} \approx 0,$$

so the time-varying component can be neglected.
Thus Y(z) is:

$$Y(z) = E(z)\,\frac{\mu M A^2}{4}\left(\frac{1}{z e^{-j\omega_o}-1} + \frac{1}{z e^{\,j\omega_o}-1}\right)$$

Thus the open-loop transfer function G(z) is:

$$G(z) = \frac{Y(z)}{E(z)} = \frac{\mu M A^2}{4}\left(\frac{1}{z e^{-j\omega_o}-1} + \frac{1}{z e^{\,j\omega_o}-1}\right) = \frac{\mu M A^2}{2}\cdot\frac{z\cos\omega_o - 1}{z^2 - 2z\cos\omega_o + 1}$$
The adaptive filter has a null point determined by the angular frequency ωo of the
sinusoidal interference, since G(z) has a zero at z = 1/cos ωo; this proves the 1st
characteristic. Then from G(z) we get H(z), the transfer function of the closed-loop
feedback system:

$$H(z) = \frac{E(z)}{D(z)} = \frac{1}{1+G(z)}$$

where E(z) is the z-transform of the system output e(n), and D(z) is the z-transform of
the system input d(n). Substituting G(z),

$$H(z) = \frac{z^2 - 2z\cos\omega_o + 1}{z^2 - 2\left(1-\mu M A^2/4\right)z\cos\omega_o + \left(1-\mu M A^2/2\right)}$$
So this is the transfer function of a second-order digital notch filter with a notch
at the normalized angular frequency ωo. We also find that the poles of H(z) lie
inside the unit circle, which means that the adaptive filter is stable [as needed for real-
time practical use].
The zeros of H(z) lie on the unit circle, which means that the adaptive
noise canceller has a notch of infinite depth at the frequency ωo. The sharpness of the
notch is determined by the closeness of the poles of H(z) to its zeros; measuring this by the
3-dB bandwidth, we find

$$B = \frac{\mu M A^2}{2}$$

The smaller we therefore make μ, the smaller B is, and therefore the sharper the notch is,
and so finally we satisfy the 2nd characteristic too.
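A quick numerical check of this result (illustrative values only; μ, M, A and the interference frequency are not taken from the text):

```python
import numpy as np

mu, M, A = 0.005, 32, 1.0
f0, fs = 1000.0, 16000.0
w0 = 2*np.pi*f0/fs                        # normalized angular frequency of the interference

def H(z):
    """Closed-form notch transfer function H(z) derived above."""
    num = z**2 - 2*z*np.cos(w0) + 1
    den = z**2 - 2*(1 - mu*M*A**2/4)*z*np.cos(w0) + (1 - mu*M*A**2/2)
    return num/den

f = np.linspace(1.0, fs/2, 20000)
mag = np.abs(H(np.exp(1j*2*np.pi*f/fs)))
print("depth at f0:", np.abs(H(np.exp(1j*w0))))           # ~0: infinite-depth notch
bw_pred = mu*M*A**2/2 * fs/(2*np.pi)                       # predicted 3-dB bandwidth, Hz
below = f[mag < 1/np.sqrt(2)]
bw_meas = below.max() - below.min() if below.size else 0.0
print(f"3-dB bandwidth: predicted {bw_pred:.1f} Hz, measured {bw_meas:.1f} Hz")
```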

3.2.2.4 Amplitude Compression:


Speech amplitude compression is essentially the task of controlling the overall gain of a
speech amplification system. It essentially “maps” the dynamic range of the acoustic
environment to the restricted dynamic range of the hearing impaired listener.
Amplitude compression is achieved by applying a gain of less than one to a signal
whenever its power exceeds a predetermined threshold.
As long as the input power P_in to the compressor is less than the input threshold P_th, no
compression takes place and the output is equal to the input. When the input power
exceeds the threshold value P_th, a gain less than one is applied to the signal. Once
amplitude compression is being applied, if the attenuated input power exceeds a
specified saturation power P_sat, the output power is held at a constant level.
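A minimal sketch of this compression law (the threshold, compression ratio and saturation level below are assumed example values, not ones from the text):

```python
# Static input/output level mapping of the compressor described above
P_TH_DB, RATIO, P_SAT_DB = 60.0, 3.0, 90.0   # example threshold, ratio, ceiling (dB)

def compress_level(p_in_db):
    """Map an input power level (dB) to an output power level (dB)."""
    if p_in_db <= P_TH_DB:                      # below threshold: no compression
        return p_in_db
    out = P_TH_DB + (p_in_db - P_TH_DB)/RATIO   # effective gain < 1 above threshold
    return min(out, P_SAT_DB)                   # saturation: constant output level

for level in (40, 60, 75, 90, 120):
    print(level, "->", compress_level(level))
```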

Now we will discuss the implanted hearing aid: the hearing aid for severe hearing
loss.

3.3 Second Model
A method of treatment for a sensorineural hearing impairment

Introduction:
- Cochlear implants, which are implanted through a surgical procedure, are taking
hearing technology to a new level.
- The best candidates for cochlear implants are individuals with profound hearing
loss to both ears who have not received much benefit from traditional hearing
aids and are of good general health.
- Children as young as 14 months have been successfully implanted.

3.3.1 The conceptual prosthetic system architecture is:

Figure (3.6) Architecture of proposed fully implanted cochlear prosthetic system

Processing:
• The proposed middle ear sound sensor, based on accelerometer operating
principle, can be attached to the "umbo" to convert the umbo vibration to an
electrical signal representing the input acoustic information.
• This electrical signal can be further processed by the cochlear implant speech
processor, which is followed by a stimulator to drive cochlear electrodes.
• The speech processor, stimulator, power management and control unit,
rechargeable battery and radio frequency (RF) coil will be housed in a
biocompatible package located under the skin to form a wireless network with
external adaptive control and battery charging system.
• Wireless communication between the implant and external system is essential
for post-implant programming of the speech processor.

Tuning process:
- After implant it is necessary for the patient to go through a tuning procedure for
speech processor optimization so that the cochlear implant can function properly.
- In this tuning procedure, an audiologist will present different auditory stimuli
consisting of basic sounds or words to a patient.
- The acoustic information will be detected by the implanted accelerometer and
converted into an electrical signal.
- The speech processor will then process the signal and filter it into a group of outputs,
which represent the acoustic information in an array corresponding to individual
cochlear electrode bandwidth.
- Then an array of biphasic current pulses with proper amplitude and duty cycle will be
delivered to stimulate the electrodes located inside the cochlea along the auditory
nerve.
- This excitation activates neurotransmitters, which travel to the brain for sound
reception.
- The patient will then provide feedback in terms of speech reception quality to the
audiologist.
- To achieve the optimal performance for the active implant-human interface network,
the audiologist will adaptively tune the speech processor through the RF-coils-based
wireless link.

Re-charging battery:
A- Through the same link, an intelligent power management network for extending the
battery longevity and ensuring patient safety can also be implemented between the
implanted rechargeable battery and external powering electronics.
B- An external coil loop worn in a headset can transmit RF power across the skin to the
receiving loop, and active monitoring and control of incident RF power can be realized.
C- Upon completion of battery charging, the communication and control unit can send
out a wireless command to turn off the external powering system.

Accelerometer's position misalignment:


- The umbo vibrates with the largest vibration amplitude in response to auditory inputs.
- Measurements of ossicular bone vibration can be performed by using a Laser Doppler
Vibrometer (LDV).
- Due to the umbo's curved surface it is possible that the device may become
misaligned anterior or posterior to the long process of the hammer (malleus) during
attachment, which could potentially degrade sensitivity.
- Therefore, it is necessary to investigate and characterize the umbo vibration response
along different axes.
- the direction perpendicular to the tympanic membrane is defined as the "primary axis"
of the umbo, and the vector parallel to the tympanic membrane plane and perpendicular
to the long process of the malleus is defined as the "secondary axis" of the umbo.

Figure (3.7) umbo primary and secondary axes

Accelerometer's design:
To achieve the optimum sensor design, we've to investigate human temporal bone
vibration characterization.

3.3.2 Human temporal bone vibration characterization:

A- Human temporal bone preparation:


- Four temporal bones were used to study the vibration characteristics of the umbo.
- All temporal bones were individually inspected under a microscope to verify an intact
(previously uninjured) tympanic membrane, ear canal, and ossicular bone structure.
- Any bone with evidence of structural damage due to the middle ear cavity exposure
was not used.
Temporal bones were then sequentially opened in two stages:


a- A simple mastoidectomy (surgical removal) with a facial recess approach.
b- After the initial opening of the middle ear cavity, the temporal bone was further
opened in a 2nd stage of drilling.
- In this stage, the facial recess was widened such that full access was gained into the
middle ear.
- The drilling proceeded until the tympanic membrane could be visualized.
- 2 pieces of <1 mm^2 > reflective material were placed as targets for the umbo
primary and secondary axes characterization.

B- Temporal bone experimental setup and procedures:

Figure (3.8) Schematic of temporal bone setup

- A temporal bone under examination was placed in a weighted temporal bone


holder.
- An insert earphone driven by a waveform generator presented pure tones within
the audible spectrum to the tympanic membrane.
- A probe microphone was positioned approx. 4mm from the tympanic membrane
to monitor the input sound pressure level.
- A LDV exhibiting a velocity resolution of 5micro.m/s over the frequency range
of DC to 50 kHz was used to measure the ossicles's vibrational characteristics.
- The laser was focused onto the reflective targets attached to the primary and
secondary axes of the umbo.
- Acceleration of the umbo along the primary and the secondary axes was
measured in the frequency range of 250 Hz to 10 kHz with input tones between
70 dB and 100 dB SPL in increments of 5 dB.

C- Measurements results with respect to primary and secondary axes:


- The acceleration frequency response along the primary axis is nearly identical
to that of the secondary axis.
- Although the frequency trends are very similar, a 20% increase in the
acceleration amplitude is measured on the umbo along the primary axis
compared to the secondary axis.
- While the difference between the two axes measurements is small, it occurs
with approx. equal magnitude in all bones at all sound levels.

Therefore, any potential misalignment in sensor placement will have a minimal impact
on the output signal amplitude because of the similar acceleration amplitude response,
and also negligible frequency distortion due to the similar frequency response.

Figure (3.9)

Figure (3.10)

- The vibration acceleration frequency response in the direction perpendicular to the


tympanic membrane increases with:

a- A slope of 40 dB per decade below 1 kHz.


b- And with a slope of about 20 dB per decade from 1 kHz to 4 kHz.
c- Above 4 kHz the acceleration signal remains relatively flat.

- Throughout the measurement frequency range, the vibration acceleration is a linear
function of the input sound pressure level (SPL), with a slope of 20 dB per decade.
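As a rough illustration only, the measured trends can be collected into a simple piecewise model (the absolute scale and the reference SPL below are assumptions, not measured values):

```python
import numpy as np

def umbo_accel_db(f_hz, spl_db, ref_spl_db=80.0):
    """Piecewise sketch of umbo acceleration (dB, normalized to 0 dB at 1 kHz):
    +40 dB/decade below 1 kHz, +20 dB/decade from 1 to 4 kHz, flat above 4 kHz,
    and linear in SPL (1 dB output per 1 dB SPL)."""
    f = np.asarray(f_hz, dtype=float)
    shape = np.where(f < 1e3, 40*np.log10(f/1e3),
             np.where(f < 4e3, 20*np.log10(f/1e3), 20*np.log10(4.0)))
    return shape + (spl_db - ref_spl_db)

print(umbo_accel_db([250, 500, 1000, 4000, 8000], spl_db=80))
```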

3.3.3 Design Guideline for optimum accelerometer:

Figure (3.11)

- The previous measurement results can serve as design guideline to help define the
specifications for the prototype accelerometer.
- Audiologists report that audible speech is primarily focused between 500 Hz and 8
kHz, and that the loudness of quiet conversation is approx. 55 dB SPL.
- Within the audible speech spectrum, 500 Hz has the lowest acceleration response, and
thus it is the most difficult for detection.
- today's cochlear implants have multiple channels and electrodes to provide an
appropriate stimulus to the correct location within the cochlea, at 500 Hz, the electrode
channel bandwidth is on the order of 200 Hz.

Therefore:
1- To detect sounds at 55dB SPL at 500 Hz, an accelerometer with a sensitivity of 50
micro.g/Hz^1/2 and a bandwidth of 10 kHz is needed.
2- The total device mass is another important design consideration,
- The mass of the umbo and long process of the malleus is about 20 -- 25mg.
- adding a mass greater than 20 mg can potentially result in a significant damping effect
on the frequency response of the middle ear ossicular chain. Therefore, the total mass
of the packaged sensing system needs to be kept below 20 mg.

3.3.4 Conclusion:
An accelerometer with reduced package mass (below 20 mg) and improved
performance, achieving a sensitivity of 50 µ.g/(Hz)½, and bandwidth of 10 kHz, would
be needed to satisfy the requirements for normal conversation detection.

_________________________________________________________________________
Page 31 Topic 1 – Audiology
_________________________________________________________________________
Page 32 Topic 2 – Acoustical Simulation of Room
CHAPTER 1
GEOMETRICAL ACOUSTICS

1.1 Introduction
What is Room Acoustics?

Room Acoustics describe how sound behaves in an enclosed space.

The sound behavior in a room depends significantly on the ratio of the frequency (or
the wavelength) of the sound to the size of the room. Therefore, the audible spectrum
can be divided into four regions (zones) illustrated in the following drawing (for
rectangular room):

1. The first zone is below the frequency that has a wavelength of twice the longest
length of the room. In this zone sound behaves very much like changes in static
air pressure.

2. Above that zone, until the frequency is approximately 11,250 (RT60 / V) ½,


wavelengths are comparable to the dimensions of the room, and so room
resonances dominate.

3. The third region which extends approximately 2 octaves is a transition to the


fourth zone.

4. In the fourth zone, sounds behave like rays of light bouncing around the room.

The first 3 zones constitute what we call “Physical Acoustics” and the fourth zone
is the target of our project is called “Geometrical Acoustics”.

_________________________________________________________________________
Page 33 Topic 2 – Acoustical Simulation of Room
What is Acoustical Simulation?

Acoustical Simulation is a technique that assists the acoustical consultants in the


evaluation of room acoustics or the performance of the sound systems. This acoustical
program can simulate the sound as it would be heard after the project is built. This is
also called auralization.

Human beings hear things by virtue of pressure waves impinging on their eardrums,
OK? All of the information that we need to know about the sound (such as volume,
frequency content, direction, etc.) is contained in those pressure waves (or so we
believe). What an auralization system tries to do is to fool your brain into thinking that
you're listening to a sound source in an acoustical space (i.e., room) that you're not in.
How it does this is to take the original sound source and alter the frequency spectrum
according to both:

1. How the room affects the wave


2. And how your head/ears affect the wave.

Once this is done, you play back two signals (one for each ear) through, for example,
headphones and listen. Hopefully, you get the sense that what you are listening to is
what you would actually hear if you were really in the room with the source. Simple
enough, right?

• We will discuss the simulation of Room Acoustics in the following 3 chapters as


follows:

Acoustical Simulation
of Room

Chapter 1 Chapter 2 Chapter 3

Geometrical Room Artificial Spatial


Acoustics Reverberation Impression

_________________________________________________________________________
Page 34 Topic 2 – Acoustical Simulation of Room
1.2 Sound Behavior
Consider a sound source situated within a bounded space. Sound waves will propagate
away from the source until they encounter one of the room's boundaries where, in
general, some of the energy will be absorbed, some transmitted and the rest reflected
back into the room.

Sound arriving at a particular receiving point within a room can be considered in two
distinct parts. The first part is the sound that travels directly from the sound source to
the receiving point itself. This is known as the direct sound field and is independent of
room shape and materials, but dependant upon the distance between source and
receiver.

After the arrival of the direct sound, reflections from room surfaces begin to arrive.
These form the indirect sound field which is independent of the source/receiver
distance but greatly dependant on room properties.

The growth and Decay of sound:


When a source begins generating sound within a room, the sound intensity measured at
a particular point will increase suddenly with the arrival of the direct sound and will
continue to increase in a series of small increments as indirect reflections begin to
contribute to the total sound level. Eventually equilibrium will be reached where the
sound energy absorbed by the room surfaces is equal to the energy being radiated by
the source. This is because the absorption of most building materials is proportional to
sound intensity, as the sound level increases, so too does the absorption.

If the sound source is abruptly switched off, the sound intensity at any point will not
suddenly disappear, but will fade away gradually as the indirect sound field begins to
die off and reflections get weaker. The rate of this decay is a function of room shape
and the amount/position of absorbent material. The decay in highly absorbent rooms
will not take very long at all, whilst in large reflective rooms, this can take quite a long
time.

(Figure1.1)Reverberant Decay of sound in a small absorbent enclosure.

This gradual decay of sound energy is known as reverberation and, as a result of this
proportional relationship between absorption and sound intensity, it is exponential as a
function of time. If the sound pressure level (in dB) of a decaying reverberant field is

_________________________________________________________________________
Page 35 Topic 2 – Acoustical Simulation of Room
graphed against time, one obtains a reverberation curve which is usually fairly straight,
although the exact form depends upon many factors including the frequency spectrum
of the sound and the shape of the room.

1.3 Geometrical Room Acoustics


In geometrical room acoustics, the concept of a wave is of minor importance; it is
replaced instead by the concept of a sound ray. The latter is an idealization just as much
as the plane wave.

Sound Ray:
As in geometrical optics, we mean by a sound ray a small portion of a spherical wave
with vanishing aperture, which originates from a certain point. It has a well-defined
direction of propagation and is subject to the same laws of propagation as a light ray,
apart from the different propagation velocity, of these laws; only the law of reflection is
of importance in room acoustics. But the finite velocity of propagation must be
considered in all circumstances, since it is responsible for many important effects such
as reverberation, echoes and so on.

Diffraction phenomena are neglected in geometrical room acoustics, since propagation


in straight lines is its main postulate. Likewise, interference is not considered, i.e. if
several sound field components are superimposed, their mutual phase relations are not
taken into account; instead, simply their energy densities or their intensities are added.
This simplified procedure is permissible if the different components are 'incoherent'
with respect to each other.

1.3.1 The Reflection of Sound Rays


If a sound ray strikes a plane surface, it is usually reflected from it. This process takes
place according to the reflection law well-known in optics.

The Law of Reflection :


It states that the ray during reflection remains in the plane including the incident ray
and the normal to the surface, and that the angle between the incident ray and reflection
ray is halved by the normal to the wall.

Since the lateral extension of a sound ray is vanishingly small, the reflection law is
valid for any part of a plane no matter how small. Therefore it can be applied equally
well to the construction of the reflection of an extended ray bundle from a curved
surface by imagining each ray in turn to be reflected from the tangential plane which it
strikes.

_________________________________________________________________________
Page 36 Topic 2 – Acoustical Simulation of Room
The mirror source concept:
The reflection of sound ray originating from a certain point can be illustrated by the
construction of a mirror source, provided that the reflection surface is plane (see figure
1.2) at some distance from the reflecting plane, there is a sound source A. we are
interested in the sound transmission to another point B. it takes place along the direct
path AB on the one hand (direct sound) and on the other, by reflection from the wall.
To find the path of the reflected ray we make A` the mirror image of A, connect A` to
B and A to the point of intersection of A`B with the plane.

B○

○ ○
A A'

(Figure 1.2) Construction of a mirror source.

Once we have constructed the mirror source A` associated with a given original source
A, we can disregard the wall altogether, the effect of which is now replaced by that of
the mirror source.
Of course, we must assume that the mirror emits exactly the same sound signal as the
original sound and that its directional characteristics are symmetrical to that of A. if the
extension of the reflecting wall is finite, then we must restrict the directions of emission
of A` accordingly. Usually, not all the energy striking a wall is reflected from it; part of
the energy is absorbed by the wall (or it is transmitted to the other side which amounts
to the same thing as far as the reflected fraction is concerned).

The absorption coefficient of the wall α:


The fraction of sound energy (or intensity) which is not reflected is characterized by the
absorption coefficient α of the wall, which defined as the ratio of the non-reflected to
the incident intensity. It depends generally, on the angle of incidence and, of course, on
the frequencies which are contained in the incident sound. Thus, the reflected ray
generally has a different power spectrum and a lower total intensity than the incident
one.

_________________________________________________________________________
Page 37 Topic 2 – Acoustical Simulation of Room
1.3.2 Sound Reflections in Rooms
Suppose we follow a sound ray originating from a sound source on its way through a
closed room. Then we find that it is reflected not once, but many times, from the walls,
the ceiling and perhaps also from the floor. This succession of reflections continues
until the ray arrives at a perfectly absorbent surface. But even if there is no perfectly
absorbent area in our enclosure, the energy carried by the ray will become vanishingly
small after some time, because during its free propagation in air as well as with each
reflection a certain part of it is lost by absorption. If the room is bounded by plane
surface, it may be advantageous to find the paths of the sound rays by constructing the
mirror source.

Let us examine, in a room of arbitrary shape, the position of a sound and a point of
observation. We assume the sound source to emit at a certain time a very short sound
pulse with equal intensity in all directions. This pulse will reach the observation point
(see figure 1.3) not only by the direct path, but also via numerous partly signal, partly
multiple reflections, of which only a few are indicated in (figure 1.3) the total sound
field is thus composed of the 'direct sound' and of many 'reflections'.

In the following we use the term 'reflection' with a two-fold meaning: first to indicate
the process of reflecting sound from a wall and secondly as the name for a sound
component which has been reflected.

These reflections reach the observer from various directions, moreover their strengths
may be quite different and finally they are delayed with respect to the direct sound by
different times, corresponding to the total path length they have covered until they
reach the observation point.

(Figure 1.3) Direct sound and a few reflected components in a room

Thus, each reflection must be characterized by three quantities: its direction, its relative
strength and its relative time of arrival, i.e. its delay time.
The sum total of the reflections arriving at a certain point after emission of the original
sound pulse is the reverberation of the room, measured at or calculated for that point.

_________________________________________________________________________
Page 38 Topic 2 – Acoustical Simulation of Room
1.3.3 Room Reverberation

The temporal Distribution of Reflections

Since an enumeration of the great number of reflections, of their strengths, their


directions and their delay times would not be very illustrative and would yield much
more information on the sound field than is meaningful because of our limited hearing
abilities, we shall again prefer statistical methods in what follows.

If we mark the arrival times of the various reflections by perpendicular dashes over a
horizontal time axis and choose the heights of the dashes proportional to the relative
strengths of reflections, i.e. to the coefficients An we obtain what is frequently called a
'reflection diagram' or 'echogram'. It contains all significant information on the
temporal structure of the sound field at a certain room point. In (figure 1.4) the
reflection diagram of a rectangular room with dimensions 40m × 25m × 8m is plotted.
After the direct sound ,arriving at t=0, the first strong reflections occur at first
sporadically, later their temporal density increases rapidly, however; at the same time
the reflections carry less and less energy.
10 dB

0 50 100 150ms
t

(Figure 1.4) Reflection diagram for certain positions of sound source and
receiver in a rectangular room of 40 m x 25 m x 8 m. Abscissa is the delay
time of a reflection, ordinate its level, both with respect to the direct sound
arriving at t = 0.

_________________________________________________________________________
Page 39 Topic 2 – Acoustical Simulation of Room
As we shall see later in more detail, the role of the first isolated reflections with respect
to our subjective hearing impression is quite different from that of the very numerous
weak reflections arriving at later times, which merge into what we perceive
subjectively as reverberation. Thus, we can consider the reverberation of a room not
only as the common effect of free decaying vibrational modes, but also as the sum total
of all reflections-except the very first ones. The reverberation time, that is the time in
which the total energy falls to one millionth of its initial value, is thus:

V Eq. (1.1)
T = 0.163
4mV − S ln(1 − α )

If we use the value of the sound velocity in air and express the volume V in m3 and the
wall area S in m2. We have by rather simple geometric considerations the most
important formula of room acoustics, which relates the reverberation time that is the
most characteristics figure with respect to the acoustics of a room, to its geometrical
data and to the absorption coefficient of its walls. We have assumed more or less tacitly
that the latter is the same for all wall portions and that it does not depend on the angle
at which a wall is struck by the sound rays.

1.4 Room Acoustical Parameters & Objective Measures

1.4.1 Reverberation Time


Sabine carried out a considerable amount of research in this area and arrived at an
empirical relationship between the volume of an auditorium, the amount of absorptive
material within it and a quantity which he called the Reverberation Time (RT).
As defined by Sabine, the RT is the time taken for a continuous sound within a room to
decay by 60 dB after being abruptly switched off and is given by;
(0.161V )
RT =
A

Where:
V is the volume of the enclosure (m3) and A is the total absorption coefficient (a) of
each material used within the enclosure (Sabine).
The term A is calculated as the sum of the surface area (in m2) times the absorption
coefficient (a) of each material used within the enclosure.

1.4.2 Early Decay Time


The reverberation time, as mentioned above, refers to the time taken for the reverberant
component of an enclosure to fall by 60 dB after the source is abruptly switched off. In
an ideal enclosure this decay is exponential, resulting in a straight line when graphed
against Sound Level.

_________________________________________________________________________
Page 40 Topic 2 – Acoustical Simulation of Room
Research “Kuttruff 1973”, however; shows that this is not always the case. It has shown
that this exponential decay is the initial portion of the sound decay curve process which
is responsible for our subjective impression of reverberation as the later portion is
usually masked by new sounds. To account for this, the Early Decay Time (EDT) is
used. This is measured in the same way as the normal reverberation time but over only
the first 10 - 15 dB of decay, depending on the work being referenced.

1.4.3 Clarity & Definition


Clarity and Definition refer to the ease with which individual sounds can be
distinguished from within a general audible stream. This stream of sound may take
many forms; a conversation, a passage of music, a shouted warning, the whirring of
machinery, whatever.
In more detailed words we can define the Definition and Clarity as follows:

• Definition:

The definition is a measure of the clarity of speech, specifying how good a speaker can
be understood at a given position from the listener for example.
• Clarity:
Clarity is a comparable measure for the clarity of music. It is used for large rooms like
chamber music and concert halls. It is not appropriate for an evaluation of music
reproduced with loudspeakers in a recording studio or listening room.

1.4.4 Lateral Fraction & Bass Ratio


Lateral fraction and Bass ratio are two interesting room acoustical parameters. The
lateral fraction is related to the spatial impression made by the music in a concert hall.
The bass ratio describes the warmth of the music.

1.4.5 Speech Transmission Index


Speech Transmission Index, short STI is a measure of intelligibility of speech whose
value varies from 0 (completely unintelligible) to 1 (perfect intelligibility).The
understanding of speech, the intelligibility is directly dependent of the background
noise level, of the reverberation time, and of the size of the room. The STI is calculated
from acoustical measurements of speech and noise.

Another standard defines a method for computing a physical measure that is highly
correlated with the intelligibility of speech; this measure is called the Speech
Intelligibility Index, or SII.

_________________________________________________________________________
Page 41 Topic 2 – Acoustical Simulation of Room
CHAPTER 2
ARTIFICIAL REVERBERATION

2.1 Introduction

Reverberation:
Reverberation is the result of many reflections of sound in a room. For any sound,
there’s a direct path to reach the ear of the audience but it’s not the only path; sound
waves may take slightly longer paths by reflecting off walls and ceiling before arriving
to our ears.
This sound will arrive later than the direct sound; and weaker because of the absorption
of walls of sound energy. It may also reflect again before arriving to our ears.
So these delays and attenuations in sound waves are called “Reverberation”.
Summary:
Reverberation occurs when copies of the audio signal reach the ear with different
delays and amplitudes. It depends on the room geometry and its occupants.
Reverberation is a natural phenomenon.

Artificial Reverberation:
Artificial Reverberation is added to sound signals requiring additional reverberation for
optimum listening enjoyment.

Our Aim:
To generate an artificial reverberation which is indistinguishable from the natural
reverberation of real rooms.

2.2 Shortcomings of Electronic Reverberators


The main defects of artificial electronic reverberators:
1) Coloration:
It is the change in timbre of many sounds. It occurs because the amplitude-frequency
responses of electronic reverberators are not flat; in fact they deviate from a flat
response so much, particularly if only little direct (unreverberated) sound is mixed with
the artificially reverberated signal.

2) Fluttering:
Fluttering of reverberated sound occurs because the echo density* is too low compared
to the echo density of real room, especially for short transients.
*Echo density: the number of echoes per second at the output of the reverberator for a
single pulse at the input.

_________________________________________________________________________
Page 42 Topic 2 – Acoustical Simulation of Room
How to avoid the above degradations in artificial reverberators?
The problem of coloration can be solved by making an artificial reverberator with a flat
amplitude-frequency response. This can be achieved by passing all frequency
components equally by means of an all-pass filter which will be described below.

Concerning the problem of low echo density, it was found that for a flutter-free
reverberation; approximately 1000 echoes per second are required. Unfortunately, echo
densities of 1000 per second are not easily achieved practically by one dimensional
delay device.
Many researchers have suggested multiple feedbacks to produce a higher echo density.
However, multiple feedbacks have severe stability problems. Also, it leads to non flat
frequency responses and non-exponential decay characteristics.
We can simply treat the problem of low echo density by having a basic reverberating
unit that can be connected in series any number of times.
Now the question that must tackle our minds is why this simple remedy of echo density
problem has not been used from the beginning?
In fact, the answer is quite simple because the existing reverberators have highly
irregular frequency responses.
Here we have to mention that if the basic reverberator unit has a flat frequency
response. The series connection of any number of them will have a flat response too.
Conclusion:
All-pass reverberators (Reverberators with flat frequency response) can remove the two
main defects in artificial reverberators which are: Coloration and Fluttering.

2.3 Realizing Natural Sounding Artificial Reverberation

2.3.1 Comb Filter


What is Comb filter?
In signal processing, a comb filter adds a delayed version of a signal to itself, causing
constructive and destructive interference. The frequency response of a comb filter
consists of a series of regularly-spaced spikes, giving the appearance of a comb.
Comb filters exist in two different forms, feed-forward and feedback; the names refer to
the direction in which signals are delayed before they are added to the input.

Structure of Comb filter:

IN

+ Delay, τ
OUT
Feedback
Gain, g

_________________________________________________________________________
Page 43 Topic 2 – Acoustical Simulation of Room
Impulse Response of Comb filter:

Frequency Response of Comb filter:

Analysis:
The impulse response of Comb Filter is given by the following equation:
h (t) = δ (t-τ) + g δ (t-2τ) + g2 δ (t-3τ) + g3 δ (t-4τ) + . . . Eq. (2.1)
Where:
δ (t- τ) is the time response of a simple echo produced by a delay line.
g is the gain that must be less than one to guarantee the stability of the filter.

By taking Fourier Transform of Eq. (2.1), we obtain the following spectrum:

_________________________________________________________________________
Page 44 Topic 2 – Acoustical Simulation of Room
H(ω) = e-jωτ + g e-2jωτ + g2 e-3jωτ + g3 e-4jωτ + . . . Eq. (2.2)

Eq. (2.2) can be rewritten as:

e − j ωτ
H (ω ) = j ωτ Eq. (2.3)
1 − ge −

The amplitude response will be:


1 Eq. (2.4)
H (ω ) =
2
1 + g − 2 g cos ωτ
Where:
ω = 2πn/τ , n = 0, 1, 2, 3, … and 0 ‹ g ‹ 1
1 Eq. (2.5) 1 Eq. (2.6)
H max = H min =
1− g , 1+ g

Disadvantage of Comb Filter:


The Amplitude Response of Comb Filter has periodic maxima and minima like shown
in the above figure. These peaks and valleys are responsible of the unwanted “colored”
quality which accompanies the reverberated sound.

2.3.2 All-pass Filter


What is All-pass filter?
It is a filter that passes all frequencies equally. In other words, the amplitude response
of an all-pass filter is 1 at each frequency, while the phase response can be arbitrary.

Structure of All-pass filter:


All pass filters can be implemented in many different ways, however we will focus on
“Schroeder All-pass” which is implemented by cascading a feedback comb filter with a
feed-forward comb filter.

-g
IN + OUT

+ Delay, τ Gain, 1-g2

Gain, g

_________________________________________________________________________
Page 45 Topic 2 – Acoustical Simulation of Room
Impulse Response of All-pass filter:

Frequency Response of All-pass filter:

Analysis:
The impulse response of the above All-pass Filter is given by the following equation:
h (t) = -g δ(t) + (1- g2)[δ (t-τ) + g δ (t-2τ) + . . .] Eq. (2.7)

By taking Fourier Transform of Eq. (2.7), we obtain the following spectrum:

 e − j ωτ 
H ( ω ) = − g + (1 − g 2 )  − j ωτ  Eq. (2.8)
 1 − ge 

Eq. (2.8) can be rewritten as:


( e − j ωτ − g )
H (ω ) = Eq. (2.9)
1 − ge − j ωτ

_________________________________________________________________________
Page 46 Topic 2 – Acoustical Simulation of Room
Or
 1 − ge j ωτ 
H ( ω ) = e − j ωτ  − j ωτ  Eq. (2.10)
 1 − ge 

The amplitude response will be:


H (ω ) = 1 Eq. (2.11)

Advantage of All-pass Filter:


The All-pass filter obtained from two combs has made a marked improvement in the
quality of the reverberated sound and we have successfully reached a perfectly
“colorless” quality.

Conclusion:
By implementing the all-pass reverberator discussed above, we can say that we possess
a basic reverberating unit that passes all frequencies with equal gain and thus avoids the
problem of sound coloration. Moreover, if we connect in series any desired number of
such units; we can increase the echo density.

2.3.3 Combined C omb and All-pass Filters

As we discussed in previous sections, connecting several units of all-pass reverberators


have lead to a complete elimination of both coloration and fluttering and we have
successfully reached our aim of creating an artificial reverberator that is
indistinguishable of real rooms.

Now, we can go deeper towards a more sophisticated artificial reverberator having


more subtle characteristics of natural reverberation like:

1. Mixing of direct and reverberated sounds:


By changing the position of the direct input sound added to the reverberated sound,
we can change the ratio of direct sound energy to reverberant sound energy without
producing non-exponential decays; because, non-exponential decays in real rooms
actually point to a lack of spatial “diffusion” impression of sound.

2. Introduction of time gap between the direct sound and reverberation:


This can be done by means of the delay (τ). In real rooms, the time gap as well as
the ratio of direct-to-reverberated sound depends on the positions of sound source
and listener.

3. A dependence of the reverberation time on frequency:


In order to add more realism to the artificial reverberation, we can make the
reverberation time a function of frequency (i.e. for low frequencies, enlarge the

_________________________________________________________________________
Page 47 Topic 2 – Acoustical Simulation of Room
reverberation time). This can be done by adding a simple RC circuit in the feedback
loop of each all-pass reverberator.

 The following block diagram describes an artificial reverberator that allows a


wide choice of specifications such as mixing ratios, delay of reverberated sound
and kind of decay.

A complete circuit diagram of an advanced artificial reverberator:

+ τ g7 + OUT

1
g1
-g5 -g6
+ τ2
1-g52 1-g62
+ τ + + τ +
g2
5 6
+
IN g5 g6
+ τ3

g3

+ τ4

g4

Description:
Here, as we can see we used a set of 4 comb filters connected in parallel; however,
comb filters have irregular frequency responses, the human ear cannot distinguish
between a flat response and an irregular response of a room that fluctuates about
approximately 10 dB. Studies have shown that such irregularities are unnoticed
when the density of peaks and valleys is high enough and this is the case used in
our 4 comb filters.
Several all-pass reverberators are connected in series with the comb filters to
increase the echo density.

_________________________________________________________________________
Page 48 Topic 2 – Acoustical Simulation of Room
2.4 Ambiophonic Reverberation
We can achieve a highly diffuse reverberation by adding one more modification to our
artificial reverberator; that is to make it “Ambiophonic”.
In order to create the spatially diffuse character of real reverberation, we need to
generate several different reverberated signals and to feed them into a number of
loudspeakers distributed around the listener.
1

C1 -1

MATRIX
C2 -1
IN
A1 A2
C3 -1

C4 -1

16

Description:
• The schema illustrates an Ambiophonic installation where the order of comb filters
and all-pass filters has been inverted compared with the diagram of section (2.5)
• The outputs and the inverted outputs of the 4 comb filters are connected to a
resistance matrix which forms up to 16 different combinations of the comb filter
outputs. Each combination uses each comb filter output or its negative exactly once.
• The matrix outputs are supplied to power amplifiers and loudspeakers distributed
around the listening area.

 More techniques allowing the creation of spatial impression are discussed in


details in next chapter.

_________________________________________________________________________
Page 49 Topic 2 – Acoustical Simulation of Room
CHAPTER 3
SPATIALIZATION

3.1 Introduction

The acoustical sound field around us is very complex. Direct sounds, reflections, and
refractions arrive at the listener's ears, which then analyzes incoming sounds and
connects them mentally to sound sources. Spatial hearing is an important part of the
surrounding world.

The perception of the direction of the sound source relies heavily on the two main
localization cues: Interaural level difference (ILD) and Interaural time difference (ITD).
These frequency-dependent differences occur when the sound arrives at the listener's
ears after having traveled paths of different lengths or being shadowed differently by
the listener's head. In addition to ILD and ITD some other cues, such as spectral
coloring, are used by humans in sound source localization.

Bringing a virtual three-dimensional sound field to a listening situation is one goal of


the research in the audio reproduction field. The first recordings were monophonic;
they created point like sound fields. A big step was two-channel stereophonic
reproduction, with which the sound field was enlarged to a line between two
loudspeakers. Two-channel stereophony is still the most used reproduction method in
domestic and professional equipment.

Various attempts to enlarge the sound field have been proposed. Horizontal-only
(pantophonic) sound fields have been created with various numbers of loudspeakers
and with various systems of encoding and decoding and matrixing. In most systems the
loudspeakers are situated in a two-dimensional (horizontal) plane. Some attempts to
produce periphonic (full-sphere) sound fields with three-dimensional loudspeaker
placement exist, such as holophony or three dimensional Ambisonics.

Periphonic sound fields can be produced in two-channel loudspeaker or headphone


listening by filtering the sound material with digital models of the free-field transfer
functions between the listener's ear canal and the desired place of the sound source
[head-related transfer functions (HRTFs)]. The spectral information of the direction of
the sound source is thus added to the signal emanating from the loudspeaker. The
system, however, has quite strict boundary conditions, which limits its use.

In most systems the positions of the loudspeakers are fixed. In the Ambisonics systems,
the number and placement of the loudspeakers may be variable. However, the best
possible localization accuracy is achieved with orthogonal loudspeakers is greater; the
accuracy is not improved appreciably.
A natural improvement would be a virtual sound source positioning system that would
be independent of the loudspeaker arrangement and could produce virtual sound
sources with maximum accuracy using the current loudspeaker configuration.

_________________________________________________________________________
Page 50 Topic 2 – Acoustical Simulation of Room
The vector base amplitude panning (VBAP) is a new approach to the problem. The
approach enables the use of an unlimited number of loudspeakers in an arbitrary two-
or three-dimensional placement around the listener. The loudspeakers are required to be
nearly equidistant from the listener, and the listener room is assumed to be not
reverberant. Multiple moving or stationary sounds can be positioned in any direction in
the sound field spanned by the loudspeakers.

In VBAP the amplitude panning method is reformulated with vectors and vector bases.
The reformulation leads to simple equations for amplitude panning, and the use of
vectors makes the panning methods computationally efficient.

3.2 Two-Dimensional Amplitude Panning


Two-dimensional amplitude panning, also known as intensity panning, is the most
popular panning method. The applications range from small domestic stereophonic
amplifiers to professional mixers. The method is, however, an approximation of real-
source localization.

In the simple amplitude panning method two loudspeakers radiate coherent signals,
which may have different amplitudes. The listener perceives an illusion of a single
auditory event (virtual sound source, phantom sound source), which can be placed on a
two-dimensional sector defined by locations of the loudspeakers and the listener by
controlling the signal amplitudes of the loudspeakers. A typical loudspeaker
configuration is illustrated in figure 3.1.

_________________________________________________________________________
Page 51 Topic 2 – Acoustical Simulation of Room
(Figure 3.1) Two-channel stereophonic configuration.

Two loudspeakers are positioned symmetrically with respect to the median plane.
Amplitudes of the signals are controlled with gain factors g1 and g2 respectively. The
loudspeakers are typically positioned at φ 0 = 30ο angles.

The direction of the virtual source is dependent on the relation of the amplitudes of the
emanating signals. If the virtual source is moving and its loudness should be constant,
the gain factors that control the channel levels have to be normalized. The sound power
can be set to a constant value C, whereby the following approximation can be stated:

g12 + g 22 = C. Eq. (3.1)

The parameter C > 0 can be considered a volume control of the virtual source. The
perception of the distance of the virtual source depends within some limits on C ̶ the
louder the sound, the closer it is located. To control the distance accurately, some
psycho-acoustical phenomena should be taken into account, and some other sound
elements should be added, such as reflections and reverberations.

When the distance of the virtual source is left unattended, the virtual source can be
placed on an arc between the loudspeakers, the radius of which is defined by the
distance between the listener and the loudspeakers. The arc is called the active arc.
In the ideal panning process only the direction where the virtual source should appear is
defined and the panning tool performs the gain factor calculation. In the next two
subsections some different ways of calculating the factors will be presented.

3.2.1 Trigonometric Formulation


The directional perception of a virtual sound source produced by amplitude panning
follows approximately the stereophonic law of sines originally proposed by Blumlein
and reformulated in phasor form Bauer,

sin ϕ g − g2
= 1 . Eq. (3.2)
sin ϕ 0 g1 + g 2

Where 0ο < φ0 < 90ο , - φ0 ≤ φ ≤ φ0, and g1, g2 ∈[0, 1]. In Eq. (3.2) φ represents the
angle between the x axis and the direction of the virtual source; ± φ0 is the angle
between the x axis and the loudspeakers. This equation is valid if the listener's head is

_________________________________________________________________________
Page 52 Topic 2 – Acoustical Simulation of Room
pointing directly forward. If the listener turns his or her head following the virtual
source, the tangent law is more correct,

tan ϕ g − g2
= 1 . Eq. (3.3)
tan ϕ 0 g1 + g 2
Where 0ο < φ0 < 90ο , - φ0 ≤ φ ≤ φ0, and g1, g2 ∈[0, 1]. Eqs. (3.2) and (3.3) have been
calculated with the assumption that the incoming sound is different only in magnitude,
which is valid for frequencies below 500-600 Hz. When keeping the sound power level
constant, the gain factors can be solved using Eqs. (3.2) and (3.1) or using Eqs. (3.3)
and (3.1). The slight difference between Eqs. (3.2) and (3.3) means that the rotation of
the head causes small movements of the virtual sources. However, in subjective tests it
was shown that this effect is negligible.

Some kind of amplitude panning method is used in the Ambisonics encoding system. In
pantophonic Ambisonics the entire sound field is decoded to three channels using a
modified amplitude panning method. Two of the channels, X and Y, contain the
components of the sound on the x axis and the y axis, respectively. The third, W,
contains a monophonic mix of the sound material. The signal to be stored on the
channel is calculated by multiplying the input signal samples by the channel-specific
gain factor. The gain factors gx, gy, and gw are formulated as

g x = cos θ . Eq. (3.4)


g y = sin θ . Eq. (3.5)
g w = 0.707. Eq. (3.6)
Where Ө is the azimuth angle of the virtual sound source, and r is the distance of virtual
source as illustrated in figure (3.2).

(Figure 3.2) Coordinate system of two-dimensional Ambisonics system.

This method differs from the standard amplitude panning method in that the gain
factors gx and gy may have negative values. The negative values imply that the signal is
stored on the recorder in antiphase when compared with the monophonic mix in the W
channel. When the decoded sound field is encoded, the antiphase signals on a channel
are applied to the loudspeakers in a negative direction of the respective axis. The
decoding stage is performed with matrixing equations. In the equations some additions
or subtractions are performed between the signal samples on the W channel and on the
X and Y channels. Equations for various loudspeaker configurations can be formulated.

_________________________________________________________________________
Page 53 Topic 2 – Acoustical Simulation of Room
The absolute values of the gain factors used in two-dimensional Ambisonics satisfy the
tangent law [Eq. (3.3)], which the reader may verify, for example, for values of 0ο< Ө <
90ο, by setting Ө = φ0 + φ, φ0 = 45ο, g2 = gx, and g1 = gy, and by substituting Eqs. (3.4)
and (3.5) into the relation (gy – gx)/ (gy+gx).

3.2.2 Vector Base Formulation


In the two-dimensional VBAP method, the two-channel stereophonic loudspeaker
configuration is reformulated as a two-dimensional vector base. The base is defined by
unit-length vectors l1 = [l11 l12 ] and l 2 = [l 21 l 22 ] , which are pointing toward
T T

loudspeakers 1 and 2, respectively, as seen in figure (3.3). The super script T denotes
the matrix transposition. The unit-length vector p = [ p1 p 2 ] , which points toward the
T

virtual source, can be treated as a linear combination of loudspeaker vectors,

p = g1l1 + g 2 l 2 . Eq. (3.7)

(Figure 3.3) Stereophonic configuration formulated with vectors.

In Eq. (3.7) g1 and g2 are gain factors, which can be treated as nonnegative scalar
variables. We may write the equation in matrix form,

p T = gL12 Eq. (3.8)

where g = [g1 g 2 ] and L12 = [l1 l 2 ] . This equation can be solved if L12
T −1
exists,
−1
l l12 
g = p L = [ p 1 p 2 ] 11
T −1
12  . Eq. (3.9)
 l 21 l 22 
−1 −1 −1
The inverse matrix L12 satisfies L12 L12 = I, where I is the identity matrix. L12 exists
ο ο
when φ0 ≠ 0 and φ0 ≠ 90 , both problem cases corresponding to quite uninteresting
stereophonic loudspeaker placements. For such cases the one-dimensional VBAP can
be formulated.

_________________________________________________________________________
Page 54 Topic 2 – Acoustical Simulation of Room
Gain factors g1 and g2 calculated using Eq. (3.9) satisfy the tangent law of Eq. (3.3).
When the loudspeaker base is orthogonal, φ0 = 45ο, the gain factors are also equivalent
to those calculated for the Ambisonics encoding system, with the exception that the
gain factors in Ambisonics may have negative values. In such cases, however, the
absolute values of the factors are equal.

When φ0 ≠ 45ο, the gain factors have to be normalized using the equation

Cg
g scaled = . Eq. (3.10)
g + g 22
2
1

Now gain factors g scaled satisfy Eq. (3.1).

3.2.3 Two-Dimensional VBAP for More Than Two


Loudspeakers
In many existing audio systems there are more than two loudspeakers in the horizontal
plane, such as in Dolby surround systems. Such systems can also be reformulated with
vector bases. A set of loudspeaker pairs is selected from the system, and the signal is
applied at any one time to only on pair. Thus the loudspeaker system consists of many
vector bases competing among themselves. Each loudspeaker may belong to two pairs.
In figure (3.4) a loudspeaker system in which the two dimensional VBAP can be
applied is illustrated. A system for virtual source positioning, which similarly uses only
two loudspeakers at any one time, has been implemented in an existing theater.

(Figure 3.4) Two-dimensional VBAP with five loudspeakers.

The virtual source can be produced by the loudspeaker on the active arc of which the
virtual source is located. Thus the sound field that can be produced with VBAP is a
union of the active arcs of the available loudspeaker bases. In two-dimensional cases

_________________________________________________________________________
Page 55 Topic 2 – Acoustical Simulation of Room
the best way of choose the loudspeaker bases is to let the adjacent loudspeakers from
them. In the loudspeaker system illustrated in figure (3.4) the selected bases would
be L12 , L23 , L34 , L45 , and L51 . The active arcs of the bases are thus nonoverlapping.

The use of the nonoverlapping active arcs provides continuously changing gain factors
when moving virtual sources are applied. When the sound moves from one pair to
another, the gain factor of the loudspeaker, which is not used after the change, becomes
gradually zero before the change-over point.

The fact that all other loudspeakers except the selected pair are idle may seem a waste
of resources. In this way, however, good localization accuracies can be achieved for the
principal sound, whereas the other loudspeakers may produce reflections and
reverberation as well as other elements.

3.2.4 Implementing Two-Dimensional VBAP for More Than


Two Loudspeakers
A digital panning tool that performs the panning process is now considered. Sufficient
hardware consists of a signal processor that can perform input and output with multiple
analog-to-digital (A/D) and digital-to-analog (D/A) converters and has enough
processing power for the computation needed. The tool has to include also a user
interface.

When the tool is initialized, the directions of the loudspeakers are measured relative to
the best listening position and loudspeaker pairs are formed from adjacent
loudspeakers. L−nm1 matrices are calculated for each pair and stored in the memory of the
panning system.

During the run time the system performs the following steps in an infinite loop:
• New direction vectors p(1,….,n) are defined.
• The right pairs are selected.
• The new gain factors are calculated.
• The old gain factors are cross faded to new ones and the loudspeaker bases are
changed if necessary.

The pair can be selected by calculating unscaled gain factors with Eq. (3.9) using all
selected vector bases, and by selecting the base that does not produce any negative
factors. In practice it is recommended to choose the pair with the highest smallest
factor, because a lack of numerical accuracy during calculation may produce slightly
negative gain factors in some cases. The negative factor must be set to zero before
normalization.

_________________________________________________________________________
Page 56 Topic 2 – Acoustical Simulation of Room
_________________________________________________________________________
Page 57 Topic 3 – Noise Control
CHAPTER 1
SOUND ABSORPTION

1.1 Absorption Coefficient α:-


It is a measure of the relative amount of energy that will be absorbed when a sound hits
a surface. Absorption coefficients are always a value ranging from 0 to 1 that when
multiplied by the surface areas in question yield a percentage of sound that will be
absorbed by that surface.
It is a measure of the relative amount of sound energy absorbed by that material when a
sound strikes its surface.

Figure1 absorption coefficient at a room

1.2 Measurement of the absorption coefficient of the different


materials

To have a standing wave (incident and reflected waves) inside a tube we should put
speaker with certain frequency and amplitude at one end of a tube and rigid (absorber
material) on the other end.
Using the microphone inside the tube useful to change the length of the standing wave
and to detect the positions of maximum and minimum values.

Figure2 A tube of a standing wave

_________________________________________________________________________
Page 58 Topic 3 – Noise Control
1.2.1 Procedures:
1- Adjust the speaker at certain frequency and amplitude then put the material which
needed to be measured its absorption coefficient at one end of the tube.

2- Move the microphone inside the tube which connects to the speaker at the other end
of the tube using the oscilloscope we can detect maximum minimum values Vmax &
Vmin.

3- Change the frequency for each


material (500, 1000, 2000 & 4000
Hz) and find Vmax & Vmin each
time.

σ =Vmax/Vmin

α = 4* σ /( σ+1)^2

Where α: absorption coefficient

Then calculate the average


absorption coefficient.

4- Repeat the experiment for


different materials.

1.2.2 Laboratory Measurements of Absorption Coefficient


By Appling the above steps for determination of the noise reduction on some absorbed
materials to get the best performance for the isolate the sound. We had this:

Open cell foam

Frequency Hz Absorption Coefficient α


500 0.395
1000 0.89
2000 0.34
4000 0.28

α average=0.383

_________________________________________________________________________
Page 59 Topic 3 – Noise Control
Carpet underlay

Frequency Hz Absorption Coefficient α


500 0.135
1000 0.25
2000 0.96
4000 0.49

α average=0.554

Mineral wool slab

Frequency Hz Absorption Coefficient α


500 0.75
1000 0.785
2000 0.84
4000 0.94

α average=0.813

Mineral wool slab (low density)

Frequency Hz Absorption Coefficient α


500 0.284
1000 0.64
2000 0.78
4000 0.947

α average=0.875

_________________________________________________________________________
Page 60 Topic 3 – Noise Control
_________________________________________________________________________
Page 61 Topic 3 – Noise Control
1.3 Sound Absorption by Vibrating or Perforated Boundaries
For the acoustics of a room it does not make any difference whether the apparent
absorption of a wall is physically brought about by dissipated processes, by conversion
of sound energy into heat, or by part of the energy penetrating through the wall into the
outer space. In this respect an open window is a very effective absorber, since it acts as
a sink for all the arriving sound energy.

A less trivial case is that of a wall or some part of a wall forced by a sound field into
vibration with substantial amplitude. Then a part of the wall's vibrational energy is re-
radiated into the outer space. This part is withdrawn from the incident sound energy,
viewed from the interior of the room. Thus the effect is the same as if it were really
absorbed it can therefore also be described by an absorption coefficient. In practice this
sort of absorption occurs with doors, windows, light partition wall, suspended ceiling,
circus tents and similar walls.

_________________________________________________________________________
Page 62 Topic 3 – Noise Control
p1
p3

p2

Figure3 pressure acting on a layer with mass M

This process, which may be quite involved especially for oblique sound incidence, is
very important in all problems of sound insulation. From the viewpoint of room
acoustics, it is sufficient, however, to restrict discussions to the simplest case of a plane
sound wave impinging perpendicularly onto the wall, whose dynamic properties are
completely characterized by its mass inertia. Then we need not consider the
propagation of bending waves on the wall.

Let us denote the sound pressures of the incident and the reflected waves on the surface
of a wall by p1 and p2, and the sound pressure of the transmitted wave by p3. The total
pressure acting on the wall is then p1+p2-p3. it is balanced by the inertial force iωMυ
where M denotes the mass per unit area of the wall and υ the velocity of the wall
vibrations. This velocity is equal to the particle velocity of the wave radiated from the
rare side, for which p3=ρcυ holds. Therefore we have p1+p2- ρcυ = iωMυ, from which
we obtain:
Z=iωM+ ρc
For the wall impedance
α= (1+ (ωM/2 ρc)²)¯ ¹~ (2 ρc/ ωM)²
Where c: air velocity
ρ: density of air
ω= 2*pi*f
f: resonance frequency

This simplification is permissible for the case frequently encountered in practice in


which the characteristic impedance of air is small compared with the mass reactance of
the wall. Thus the absorption becomes noticeable only at low frequencies.

At a frequency at 100 Hz the absorption coefficient of a glass pane with 4 mm


thickness is as low as 0.02 approximately. For oblique or random incidence this value is
a bit higher due to the better matching between the air and the glass pane, it is still very
low. Nevertheless, the increase in absorption with decreasing frequency has the effect
that rooms with many windows sometimes should 'crisp' since the reverberation at low
frequency is not as long as it would be in the same room without windows.
The absorption caused by the vibrations of normal and single walls and ceiling is thus
very low. Matters are different for double and multiple walls provided that the partition
on the side of the room under consideration is mounted in such away that vibrations are
not hindered and provided that it is not too heavy. Because of the inter action between
the leaves and the enclosed volume of air such a system behaves as a resonance system.

_________________________________________________________________________
Page 63 Topic 3 – Noise Control
It is a fact of great practical interest that a rigid perforated plate or panel has essentially
the same properties as a mass-loaded wall or foil. Each hole in a plate may be
considered as a short tube or channel with length b, the mass of air contained on it,
divided by the cross section is ρb.

Because of the contraction of the air stream passing through the hole, the air vibrates
with a greater velocity than that in the sound wave remote from the wall, and hence the
inertial forces of the air included in the hole are increased. The increase is given by the
ratio s2/s1, where s1 is the area of the hole and s2 is the plate area per hole. Hence the
equivalent mass of the perforated panel per unit area is M=ρb`/σ with σ=s1/s2

The geometrical tube length b has been replaced by the effective length b` which is
some times larger than b, b`=b+2δb

The correction term 2δb known as the end correction accounts for the fact that the
streamlines cannot contract or diverge abruptly but only gradually when interning or
leaving a hole. For circular apertures with radius (a) and with relatively large lateral
distances it is given by δb=0.8 a.

_________________________________________________________________________
Page 64 Topic 3 – Noise Control
CHAPTER 2
SOUND TRANSMISSION

2.1 Transmission Coefficient:


Definition: It is the unit less ratio of transmitted to incident sound energy, which ranges
from the ideal limits of 1 to 0. The limit of 1 is practically possible since a transmission
coefficient of 1 implies that all of the sound energy is transmitted through a partition.
This would be the case for an open window or door, where the sound energy has no
obstruction to its path. The other extreme of ‘0’ (implying no sound transmission), how
ever, is not a practical value since some sound will always travel through a partition.

2.2 Transmission loss:


We can define it as the amount of sound reduced by a partition between a sound source
and receiver. The less sound energy transmitted, the higher the transmission loss. In
other words, the greater the TL the better the wall is at reducing noise.
It is the principal descriptor for sound insulation, which is a decibel level based on the
transmission loss (TL) and is based on the logarithm of the mathematical reciprocal of
(1 divided by) the transmission coefficient. Since the logarithm of 1 is 0, the condition
in which the transmission coefficient is 1 translates to a TL of 0. This concurs with the
notion that an open air space in wall allows the free passage of sound .the practical
upper limit of TL is roughly 70 dB.TL is frequency dependent ,typical partitions have
TL values that increase with increasing frequency it is represented by a single number it
is called STC.

T.L = 10 log (1/ τ) dB

=1
τ= all sound is transmitted T.L =0

=0
τ= No sound is transmitted T.L= ∞
=0
τ=
= 0.2
τ= 20 % of incident sound is transmitted

2.3 Sound transmission class STC:


This is a single-number rating for TL that takes the entire frequency into account,
which can be used to compare the acoustical isolation of different barrier materials or
partition constructions over the human speech frequency range of roughly 125 to 4000
HZ. The STC number is determined from Transmission Loss values .The standard test
method requires minimum room volumes for the test to be correct at low frequencies.

_________________________________________________________________________
Page 65 Topic 3 – Noise Control
Due to the definition of the rating, STC is only applicable to air-borne sound, and is a
poor guideline for construction to contain mechanical equipment noise, music,
transportation noise, or any other noise source which is more heavily weighted in the
low frequencies than speech. The standard thus only considers frequencies above 125
Hz. Composite partitions composed of elements such as doors and windows as well as
walls will tend to have an STC close to the value of the lowest STC of any component.

In laboratory tests, the STC rating of that particular wall section varies from STC 47 to
STC 51. Any cracks in the wall or holes for electrical or mechanical servicing will
further reduce the actual result; rigid connections between wall surfaces can also
seriously degrade the wall performance. The higher the target STC, the more critical
are the sealing and structural isolation requirements. The builder's best options for
getting a satisfactory STC result are to specify partitions with a laboratory rating of
STC 54 or better.

2.3.1 Determination of STC


To determine the STC value for the material, at the first we should get the transmission
loss TL due to using the absorption material, so to determine it; we will use two test
rooms: a ''source'' room and a ''received'' room. The source room will contain a full-
range test loudspeaker and by adjusting the sound-level meter (measuring device) we
can determine it as following.

The first step


The sound transmitted should be in the band frequency that we can use STC to be a
good guide for the barrier materials from 125Hz to 4000Hz and generate it on 1/3
octave as shown in table 2.1 Then by using the sound level meter we can measure the
transmitted sound level in decibel from the source room.

Low cutoff Central frequency Upper cutoff frequency


frequency(Hz) (Hz) (Hz)
112 125 141
178 200 224
224 250 282
355 400 447
447 500 562
708 800 891
891 1000 1122
1413 1600 1778
1778 2000 2239
2239 2500 2818
3548 4000 4467

Table 2-1 1/3 octave bands

_________________________________________________________________________
Page 66 Topic 3 – Noise Control
The next step

Is to put the material on the separated wall between two rooms and sound transmission
between the rooms is measured again in the received room. The sound level from the
''receiver'' is subtracted from the sound level from ''source''. The resulting difference is
the transmission loss or ''TL.'‘

Now we can use this value of TL to determine the sound transmission class STC value
graphical or by numerical procedure.

By graphical procedure:
The TL is plotted on a graph of 1/3-octave band center frequency versus level (in dB).
To get the STC, the measured curve is compared to a reference STC curve.

By numerical procedure:
The numerical procedure for determining the STC is simpler than the graphical one. It
consists of the following steps.

Step 1: Record the TL at each one-third octave frequency in column 2.

Step 2: Write the adjustment factor against each frequency in column 3. These factors
are based on the STC contour, taking the adjustment as zero at 500 Hz.

Step 3: Add the TL and the adjustment factor and enter the result in column 4; this is
the adjusted TL (TLadj).

Step 4: Note the smallest TLadj value in column 4 and circle it.

Step 5: Add eight to the smallest TLadj; this is the first trial STC (STCtrial). Write this
value in column 5.

Step 6: Subtract STCtrial from TLadj, i.e. determine (TLadj - STCtrial). If this value is
positive, enter zero in column 5; if negative, enter the actual value.
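The following MATLAB sketch reproduces this numerical procedure for the four bands measured in Section 2.3.2 below; the TL values and adjustment factors are taken from Tables 2-2 and 2-3, and the script is a simplified illustration rather than the full STC contour fit of the standard:

% Simplified STC trial calculation following steps 1-6 above.
% TL values are the single-layer cork measurements from Table 2-2 (illustrative).
f      = [500 1000 2000 4000];        % 1/3-octave band centre frequencies (Hz)
TL     = [22  17.7 16.3 15];          % measured transmission loss (dB)
adj    = [0   -3   -4   -4];          % adjustment factors relative to 500 Hz
TLadj  = TL + adj;                    % step 3: adjusted TL
STCtrial = min(TLadj) + 8;            % steps 4-5: trial STC
deficit  = TLadj - STCtrial;          % step 6
deficit(deficit > 0) = 0;             % positive differences are entered as zero
fprintf('Trial STC = %.0f\n', STCtrial);
disp([f; TL; adj; TLadj; deficit]');  % one row per band, as in Table 2-3

Running the same lines with the double-layer values of Table 2-4 (TL = [22.3 22.5 23 23.1]) gives STCtrial = 27, matching Table 2-5.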

2.3.2 Laboratory measurements of STC:-

Our experiment has two parts.

Note that the two parts of the experiment must be carried out under the same
conditions in order to obtain correct and comparable results.

Single layer of cork:

The absorbing material has a thickness of 2.5 cm (1 in).

Frequency (Hz)   Transmitted sound level (dB)   Received sound level (dB)   Transmission loss TL (dB)
500              85                             63                          22
1000             85.8                           68.1                        17.7
2000             86.6                           70.3                        16.3
4000             87                             72                          15

Table 2-2 The TL due to a single layer of cork

Frequency (Hz)   TL (dB)   Adjustment factor   Adjusted TL (TLadj)   Trial STC (STCtrial = 19)
500              22        0                   22                    0
1000             17.7      -3                  14.7                  -4.3
2000             16.3      -4                  12.3                  -6.7
4000             15        -4                  11                    -8

Table 2-3 The calculation of STC for a single layer of cork

Double layer of cork:

We use the same absorbing material with a thickness of 5 cm (2 in).

Frequency (Hz)   Transmitted sound level (dB)   Received sound level (dB)   Transmission loss TL (dB)
500              85                             62.7                        22.3
1000             85.8                           63.3                        22.5
2000             86.7                           63.7                        23
4000             87                             63.9                        23.1

Table 2-4 The TL due to a double layer of cork

Frequency (Hz)   TL (dB)   Adjustment factor   Adjusted TL (TLadj)   Trial STC (STCtrial = 27)
500              22.3      0                   22.3                  -4.7
1000             22.5      -3                  19.5                  -7.5
2000             23        -4                  19                    -8
4000             23.1      -4                  19.1                  -7.9

Table 2-5 The calculation of STC for a double layer of cork

2.4 Controlling Sound Transmission through Concrete Block Walls

We discuss the various factors that affect sound transmission through different types of
concrete block walls, including single-leaf walls and double-leaf walls. Knowledge of
these factors will assist construction practitioners to design and build walls with high
levels of acoustic performance economically.

Concrete block walls are commonly used to separate dwelling units from each other
and to enclose mechanical equipment rooms in both residential and office buildings
because of their inherent mass and stiffness. Neither of these fundamental properties
(mass and stiffness) can be altered by users. However, there are additional factors that
need to be considered in building high quality walls.

2.4.1 Single-Leaf Concrete Block Walls

Mass per Unit Area
For single-leaf walls, the most important determinant of sound transmission class
(STC) is the mass per unit area: the higher it is the better. A single-leaf concrete block
wall is heavy enough to provide STC ratings of about 45 to 55 when the block surface
is sealed with paint or plaster.

Figure 2-1 shows measured STC ratings for single-leaf concrete block walls from a
number of published sources. The considerable scatter demonstrates that, while
important, block weight is not the only factor that determines the STC for this type of
wall. In the absence of measured data, the regression line in Figure 2-1 can be used to
estimate the STC from the block weight. Alternatively, Table 2-6 provides
representative STC values for 50% solid blocks that have been sealed on at least one
side. It shows the relatively modest effects of significant increases in wall thickness on
STC.

Adding materials such as sand or grout to the cores of the blocks simply increases the
weight; the increase in STC can be estimated from Figure 2-1.

Figure 2-1 also shows that using heavier block to get an STC rating of much more than
50 leads to impracticably heavy constructions.

Figure 2-1 Effect of block weight on STC for single- layer concrete block walls.

Wall thickness (mm)   Lightweight block STC   Normal-weight block STC
90 43 44
140 44 46
190 46 48
240 47 49
290 49 51

Table 2-6 STC ratings for 50% solid normal- weight and lightweight block walls sealed on at
least one side

Effect of Porosity
When the concrete block is porous, sealing the surface with plaster or block sealer
significantly improves the sound insulation; the more porous the block, the greater the
improvement. Improvements of 5 to 10 STC points, or even more, are not uncommon
for some lightweight block walls after sealing. Conversely, normal-weight blocks
usually show little or no improvement after sealing. This improvement in STC in
lightweight blocks is related to the increased airflow resistivity of these blocks.

2.4.2 Double-Leaf Concrete Block Walls
In principle, double-leaf masonry walls can provide excellent sound insulation. They
appear to meet the prescription for an ideal double wall: two independent, heavy layers
separated by an air space. In practice, constructing two block walls that are not solidly
connected somewhere is very difficult. There is always some transmission of sound
energy along the wire ties, the floor, the ceiling and the walls abutting the periphery of
the double-leaf wall, and through other parts of the structure. This transmitted energy,
known as flanking transmission, seriously impairs the effectiveness of the sound
insulation. Flexible ties and physical breaks in the floor, ceiling and abutting walls are
needed to reduce it.

Even if such measures are considered in the design, mortar droppings or other debris
can bridge the gap between the layers and increase sound transmission. Such errors are
usually concealed and impossible to rectify after the wall has been built.
A double-leaf wall that was expected to attain an STC of more than 70 provided an
STC of only 60 because of mortar droppings that connected the two leaves of the wall.

In general, the greater the T.L, the better the performance of the block wall. It is
possible to construct high-quality walls that meet the most acoustically demanding
situations; this can be achieved by adjusting the mass of the wall and the amount of
sound-absorbing material.

2.5 Noise reduction

Noise reduction at work areas inside production facilities is essential not only to
conserve employees' hearing, but also to help them accomplish their work efficiently.
To achieve noise reduction between two rooms, the wall (or floor) separating them
must transmit only a small fraction of the sound energy that strikes it. This can be
achieved by using a good absorbing material on the separating wall, so that less sound
energy is transmitted and the transmission loss is higher. In other words, the greater the
TL, the better the wall is at reducing noise. The relation between the noise reduction of
the wall and the transmission loss is shown below.

N. R = T.L + 10 log (A/ S)

where A is the total sound absorption of the receiving room (m^2) and S is the surface area of the common partition (m^2).
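A one-line MATLAB check of this relation (the values of TL, A and S below are illustrative, not measured):

% Noise reduction between two rooms: N.R = T.L + 10*log10(A/S)
TL = 30;        % transmission loss of the partition (dB), illustrative value
A  = 20;        % total absorption of the receiving room (m^2), illustrative value
S  = 10;        % area of the common partition (m^2), illustrative value
NR = TL + 10 * log10(A / S);
fprintf('Noise reduction = %.1f dB\n', NR);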

2.5.1 Determination of noise reduction:-

To determine the noise reduction for the different materials, we obtain the difference
between the sound levels measured with and without each material, following these
steps.

1. Generate noise using a noise source; in our experiment a DC motor is used, as
shown in Figure 2-2.

Figure 2-2

2. Prepare the measurement equipment (internal microphone, external microphone,
and sound level meter).

3. Put the absorbing material on the cover and fix it on the noise source as shown in
Figure 2-3.

Figure 2-3

4. Using the internal microphone, the external microphone and the sound level
meter, measure the internal and external sound levels.
5. Subtracting the two sound levels gives the noise reduction of the material used.

2.5.2 The laboratory measurements of the noise reduction:-


By applying the above steps to determine the noise reduction of several absorbing
materials, in order to find which gives the best sound isolation, we obtained the
following results.

Open cell foam


The sound level without the absorbing material = 79.5 dBA.

The sound level with the absorbing material = 74.5 dBA.

So the noise reduction obtained by using open cell foam as the absorbing material is

N.R = 79.5 - 74.5 = 5 dBA.

Carpet underlay

The sound level without the absorbing material = 79.5 dBA.

The sound level with the absorbing material = 73.5 dBA.

So the noise reduction obtained by using carpet underlay as the absorbing material is

N.R = 79.5 - 73.5 = 6 dBA.

Mineral wool slab (low density)

The sound level without the absorbing material = 79.5 dBA.

The sound level with the absorbing material = 72.5 dBA.

So the noise reduction obtained by using mineral wool slab (low density) as the
absorbing material is

N.R = 79.5 - 72.5 = 7 dBA.

Mineral wool slab (high density)

The sound level without the absorbing material = 79.5 dBA.

The sound level with the absorbing material = 72 dBA.

So the noise reduction obtained by using mineral wool slab (high density) as the
absorbing material is

N.R = 79.5 - 72 = 7.5 dBA.

2.6 The performance of some absorbed materials:


Table 2-7 lists the materials together with their absorption coefficients and measured
noise reductions, in order to identify the absorbing material with the best performance
and to see the relation between the absorption coefficient and the noise reduction value
for the same material.

Absorbing material                  Absorption coefficient   Noise reduction
Open cell foam                      0.383                    5 dBA
Carpet underlay                     0.554                    6 dBA
Mineral wool slab (low density)     0.813                    7 dBA
Mineral wool slab (high density)    0.875                    7.5 dBA

Table 2-7 Absorption coefficient and noise reduction of the tested materials

From the results shown, we conclude that, for the tested absorbing materials, a material
with a high absorption coefficient gives good insulation because it yields a high noise
reduction value; of the four materials above, the best one to use is the mineral wool
slab (high density).

Topic 4 – Speech Technology
CHAPTER 1
SPEECH PRODUCTION

1.1 Introduction
In this chapter, we take a look at the physiological and acoustic aspects of speech
production and of speech perception, which will help to prepare the ground for later
chapters on the electronic processing of speech signals.
The human apparatus concerned with speech production and perception is complex and
uses many important organs—the lungs, mouth, nose, ears and their controlling
muscles and the brain.
When we consider that most of these organs serve other purposes such as breathing or
eating it is remarkable that this apparatus has developed to enable us to make such a
wide variety of easily distinguishable speech utterances.

1.2 The human vocal apparatus


Speech sounds are produced when breath is exhaled from the lungs and causes either a
vibration of the vocal cords (for vowels) or turbulence at some point of constriction in
the vocal tract (for consonants). The sounds are affected by the shape of the vocal tract
which influences the harmonics produced. The way in which the vocal cords are
vibrated, the shape of the vocal tract or the site of constriction can all be varied in order
to produce the range of speech sounds with which we are familiar. Figure 1.1 shows the
main human articulatory apparatus.

Figure (1.1) shows the main human articulatory apparatus.

1.2.1 Breathing
The use of exhaled breath is essential to the production of speech. In quiet breathing, of
which we are not normally aware, inhalation is achieved by increasing the volume of
the lungs by lowering the diaphragm and expanding the rib-cage. This reduces the air
pressure in the lungs which causes air from outside at higher pressure to enter the lungs.
Expiration is achieved by relaxing the muscles used in inspiration so that the volume of
the lungs is reduced due to the elastic recoil of the muscles, the reverse movement of
the rib-cage and gravity, thus increasing air pressure in the lungs and forcing air out.
The form of expiration achieved by relaxing the inspiratory muscles cannot be
controlled sufficiently to achieve speech or singing. For these activities, the inspiratory
muscles are used during exhalation to control lung pressure and prevent the lungs from
collapsing suddenly; when the volume is reduced below that obtained by elastic recoil,
expiratory muscles are used. Variations in speech intensity needed, for example, to
stress certain words are achieved by varying the pressure in the lungs; in this respect
speech differs from the production of a note sung at constant intensity.

1.2.2 The larynx


There are two main methods by which speech sounds are produced. In the first, called
voicing, the vocal cords located in the larynx are vibrated at a constant frequency by the
air pressure from the lungs. The second gives rise to unvoiced sounds produced by
turbulent flow of air at a constriction at one of a number of possible sites in the vocal
tract. A schematic view of the larynx is shown in Fig. 1.2.

Figure (1.2) A schematic view of the larynx and vocal cord

The vocal cords are at rest when open. Their tension and elasticity can be varied; they
can be made thicker or thinner, shorter or longer and then can be either closed, open
wide or held in some position between.

When the vocal cords are held together for voicing they are pushed open for each
glottal pulse by the air pressure from the lungs; closing is due to the cords’ natural
elasticity and to a sudden drop in pressure between the cords (the Bernoulli principle).
Considered in vertical cross-section the cords do not open and close uniformly, but
open and close in a rippling movement from bottom to top as shown in Fig. 1.2.

The frequency of vibration is determined by the tension exerted by the muscles, the
mass and the length of the cords. Men have cords between 17 and 24mm in length;
those of women are between 13 and 17mm. The average fundamental or voicing
frequency (the frequency of the glottal pulses) for men is about 125 Hz, for women
about 200 Hz and for children more than 300 Hz.

When the vocal cords vibrate harmonics are produced at multiples of the fundamental
frequency; the amplitude of the harmonics decreases with increasing frequency. Figure
1.3 shows the range of human voice.

Figure (1.3) Frequency range of the human voice

1.2.3 The vocal tract
For both voiced and unvoiced speech, the sound that is radiated from the speaker's face is a
modification of the original vibration, shaped by the resonances of the vocal tract. The
oral tract is highly mobile and the position of the tongue, pharynx, palate, lips and jaw
will all affect the speech sounds made which we hear as radiation from the lips or
nostrils. The nasal tract is immobile, but can be coupled in to form part of the vocal
tract depending on the position of the velum. Combined voiced and unvoiced sounds
can also be produced, as voiced consonants.
The major speech articulators are shown in Fig. 1.1. When the velum is closed the oral
and pharyngeal cavities combine to form the voice resonator. The tongue can move
both up and down and forward and back, thus altering the shape of the vocal tract; it
can also be used to constrict the tract for the production of consonants. By moving the
lips outward the length of the vocal tract can be increased. The nasal cavity is coupled
in when the velum is opened for sounds such as /m/, in ‘hum’; here the vocal tract is
closed at the lips and acts as a side branch resonator.

1.3 Speech sounds


The smallest element of speech sound which indicates a difference in meaning is called
a phoneme, and is written between slashes as, for example, /p/ in ‘pan’. About 40
phonemes are sufficient to discriminate between all the sounds made in British English;
other languages may use different phoneme sets.

1.3.1 Phonemic representation


The following table shows the International Phonetic Alphabet representation.

1.3.2 Voiced, unvoiced and plosive sounds
As we have seen, voiced sounds, for example the vowel sounds /a /,/e/ and /I/, are
generated by vibration of the vocal cords which are stretched across the top of the
trachea. The pressure of air flow from the lungs causes the vocal cords to vibrate. The
fundamental pitch of the voicing is determined by the air flow, but mainly by the
tension exerted on the cords.
Unvoiced sounds are produced by frication caused by turbulence of air at a constriction
in the vocal tract. The nature of the sound is determined by the site of the constriction
and the position of the articulators (e.g. the tongue or the lips). Examples of unvoiced
sounds are /f/, /s/ or /sh/. Mixed voiced and unvoiced sounds occur where frication and
voicing are simultaneous. For example, if voicing is added to the /f/ sound it becomes
/v/; if added to /sh/ it becomes /zh/ as in 'azure'.
Silence occurs within speech, but in fluent speech it does not occur between words
where one might expect it. It most commonly occurs just before the stop in a plosive
sound. The duration of these silences is of the order of 30 to 50 ms.

1.4 Acoustics of speech production


The vibration of the vocal cords in voicing produces sound at a sequence of
frequencies, the natural harmonics, each of which is a multiple of the fundamental
frequency. Our ears will judge the pitch of the sound from the fundamental frequency.
The remaining frequencies have reducing amplitude. However, we never hear this
combination of frequencies because as the sound waves pass through the vocal tract it
resonates well at some frequencies and not so well at others and the strength of the
harmonics that we hear are the result of the change due to these resonances.

1.4.1 Formant frequencies


The resonances in the oral and nasal tract are not fixed, but change because of the
movement of the speech articulators described above. For any position the tract
responds to some of the basic and harmonic frequencies produced by the vocal cords
better than to others. For a particular position of the speech articulators, the lowest
resonance is called the first formant frequency (f1), the next the second formant
frequency (f2), and so on.
The formant frequencies for each of the vowel sounds are quite distinct but for each
vowel sound generally have similar values regardless of who is speaking.
For example, for a fundamental frequency of 100 Hz, harmonics will be produced at
200, 300, 400, 500 Hz, etc. For the vowel /ee/ as in ‘he’ typical values for f1 and f2 are
300 and 2100Hz respectively. For the vowel /ar/ as in ‘hard’ the typical values of the
corresponding formant frequencies are about 700 and 900Hz.
The fundamental frequency will vary depending on the person speaking, mood and
emphasis, but it is the magnitude and relationship of the formant frequencies which
make each voiced sound easily recognizable.

A common approach to understanding speech production and the processing of speech
signals is to use a source filter model of the vocal tract. The model is usually
implemented in electronic form but has also been implemented mechanically. In the
electronic form an input signal is produced either by a pulse generator offering a
harmonic rich repetitive waveform or a broadband noise signal is generated digitally by
means if a pseudorandom binary sequence generator. The input signal is passed through
a filter which has the same characteristics as the vocal tract.
The parameters of the filter can clearly not be kept constant, but must be varied to
correspond to the modification of the vocal tract made by movement of the speech
articulators. The filter thus has time- variant parameters; in practice the rate of variation
is slow, with parameters being updated at intervals of between 5 and 25 ms.
The pitch of the voiced excitation is subject to control as is the amplitude of the output,
in order to provide a fairly close approximation to real speech. The pitch period may
vary from 20ms for a deep-voiced male to 2 ms for a high-pitched child or female.

Figure (1.4) source filter model of speech

In the case of the pulse waveform it will consist of a regular pattern of lines which are
spaced apart by the pitch frequency. For a noise waveform considered as the
summation of a large number of randomly arriving impulses the distribution will
approximate to a continuous function. In both cases the energy distribution decreases
with an increase of frequency, but there are significant levels up to 15 to 20 kHz.
Frequency shaping is provided by the filter characteristic which is applied to the signal
in the frequency domain. Typically the filter characteristic will consist of a curve where
the various resonances of the vocal tract appear as peaks or poles of transmission. The
frequency at which these poles occur represents the formant frequencies and will
change for the various speech sounds.
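To make the source-filter idea concrete, the following MATLAB sketch (the formant frequencies, bandwidths and pitch are illustrative values, not taken from the project) excites a simple two-resonance all-pole filter with an impulse train for a voiced sound and with white noise for an unvoiced one:

% Minimal source-filter demonstration (illustrative formant values)
Fs  = 8000;                          % sampling frequency (Hz)
F0  = 100;                           % pitch of the voiced source (Hz)
dur = 0.5;  N = round(dur*Fs);       % half a second of signal

% Voiced source: impulse train at the pitch period
source_v = zeros(N,1);
source_v(1:round(Fs/F0):N) = 1;

% Unvoiced source: white noise
source_u = randn(N,1);

% Two-formant vocal-tract filter: resonances near 500 Hz and 1500 Hz (illustrative)
formants = [500 1500];               % formant frequencies (Hz)
bw       = [80 120];                 % formant bandwidths (Hz)
a = 1;
for k = 1:length(formants)
    r  = exp(-pi*bw(k)/Fs);                  % pole radius from bandwidth
    th = 2*pi*formants(k)/Fs;                % pole angle from frequency
    a  = conv(a, [1 -2*r*cos(th) r^2]);      % cascade the resonator sections
end

voiced   = filter(1, a, source_v);   % "vowel-like" output
unvoiced = filter(1, a, source_u);   % "fricative-like" output
plot((0:N-1)/Fs, voiced); xlabel('Time (s)'); title('Synthetic voiced sound');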

1.5 Perception
1.5.1 Pitch and loudness
The pitch at which speech is produced depends on many factors such as the frequency
of excitation of the vocal cords, the size of the voice box or larynx and the length of the
vocal cords. Pitch also varies within words to give more emphasis to certain syllables.
The loudness of speech will generally depend on the circumstances, such as the
emotions of the speaker. Variations in loudness are produced by the muscles of the
larynx which allow a greater flow of air, thus producing the ‘sore throat’ feeling when
the voice has to be raised for a period to overcome noise. Loudness is also affected by
the flow of air from the lungs, which is the principal means of control in singing.

1.5.2 Loudness perception


The sensitivity of the human ear is not the same for tones of all frequencies. It is most
sensitive to frequencies in the range 1000 to 4000 Hz. Low- and high-frequency sounds
require a higher intensity sound to be just audible, and our concept of ‘loudness’ also
varies with frequency.

CHAPTER 2
PROPERTIES OF SPEECH SIGNALS IN TIME DOMAIN

2.1 Introduction
We are now beginning to see how digital signal processing methods can be applied to
speech signals. Our goal in processing the speech signal is to obtain a more convenient
or more useful representation of the information carried by the speech signal.
In this chapter we shall be interested in a set of processing techniques that are
reasonably termed time-domain methods. By this we mean simply that the processing
methods involve the waveform of the speech signal directly.
Some examples of representations of the speech signal in terms of time-domain
measurements include average zero-crossing rate, energy, and the autocorrelation
function. Such representations are attractive because the required digital processing is
simple to implement, and, in spite of this simplicity, the resulting representations
provide a useful basis for estimating important features of the speech signal.

2.2Time-Dependent Processing of Speech


A sequence of samples (8000 samples/sec) representing a typical speech signal is
shown in Figure 2.1. It is evident from this figure that the properties of the speech
signal change with time. For example, the excitation changes between voiced and
unvoiced speech, there is significant variation in the peak amplitude of the signal, and
there is considerable variation of fundamental frequency within voiced regions.
The fact that these variations are so evident in a waveform plot suggests that simple
time-domain processing techniques should be capable of providing useful
representations of such signal features as intensity, excitation mode, pitch, and possibly
even vocal tract parameters such as formant frequencies.
The underlying assumption in most speech processing schemes is that the properties of
the speech signal change relatively slowly with time. This assumption leads to a variety
of ‘short-time’ processing methods in which short segments of the speech signal are
isolated and processed as if they were short segments from a sustained sound with fixed
properties. This is repeated (usually periodically) as often as desired. Often these short
segments, which are sometimes called analysis frame, overlap one another. Two of
these methods are short-time average zero-crossing rate and the short-time
autocorrelation function.

2.3 Short-Time Average Zero-Crossing Rate
In the context of discrete-time signals, a zero-crossing is said to occur if successive
samples have different algebraic signs. The rate at which zero crossings occur is a
simple measure of the frequency content of a signal. This is particularly true of
narrowband signals. For example, a sinusoidal signal of frequency F0 sampled at a rate
Fs has Fs/F0 samples per cycle of the sine wave. Each cycle has two zero crossings so
that the long-time average rate of zero-crossings is Z = 2F0 /Fs crossings/sample;
Thus, the average zero-crossing rate gives a reasonable way to estimate the frequency
of a sine wave.
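A quick numerical check of this relation in MATLAB (the tone frequency and sampling rate are illustrative):

% Verify Z = 2*F0/Fs crossings/sample for a sampled sinusoid
Fs = 8000;  F0 = 200;                        % sampling rate and tone frequency (Hz)
n  = 0:Fs-1;                                 % one second of samples
x  = sin(2*pi*F0*n/Fs + 0.123);              % small phase offset avoids exact zeros
zc = sum(abs(diff(sign(x))) > 0);            % count sign changes
fprintf('Measured: %.4f, predicted 2*F0/Fs = %.4f crossings/sample\n', ...
        zc/length(x), 2*F0/Fs);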
Speech signals are broadband signals and the interpretation of average zero-crossing
rate is therefore much less precise. However, rough estimates of spectral properties can
be obtained using a representation based on the short- time average zero-crossing rate.
Let us see how the short-time average zero-crossing rate applies to speech signals. The
model for speech production suggests that the energy of voiced speech is concentrated
below about 3 kHz because of the spectrum fall-off introduced by the glottal wave,
whereas for unvoiced speech, most of the energy is found at higher frequencies. Since
high frequencies imply high zero- crossing rates, and low frequencies imply low zero-
crossing rates, there is a strong correlation between zero-crossing rate and energy
distribution with frequency. A reasonable generalization is that if the zero-crossing rate
is high, the speech signal is unvoiced, while if the zero-crossing rate is low, the speech
signal is voiced.
Figure 2.11 shows a histogram of average zero-crossing rates (averaged over 10 msec)
for both voiced and unvoiced speech. Note that a Gaussian curve provides a reasonably
good fit to each distribution. The mean short-time average zero-crossing rate is 49 per
10 msec for unvoiced and 14 per 10 msec for voiced. Clearly the two distributions
overlap so that an unequivocal voiced/unvoiced decision is not possible based on short-
time average zero- crossing rate alone. Nevertheless, such a representation is quite
useful in making this distinction.

Figure (2.1) Distribution of zero-crossings for unvoiced and voiced speech.

An appropriate definition of the short-time average zero-crossing rate is:

Zn = Σ (m = -∞ to ∞) |sgn[x(m)] - sgn[x(m-1)]| w(n - m)        (2.1)

where

sgn[x(n)] = 1,    x(n) ≥ 0
          = -1,   x(n) < 0                                      (2.2)

and

w(n) = 1/(2N),    0 ≤ n ≤ N - 1
     = 0,         otherwise                                     (2.3)
This can be computed with a simple MATLAB script. This measure allows
discrimination between voiced and unvoiced regions of speech, or between speech and
silence; unvoiced speech has, in general, a higher zero-crossing rate. The signals in the
graphs are normalized.
% speechSignal, t (time axis), winLen and winOverlap are assumed to have been
% defined earlier; STAzerocross is the project helper that computes the
% short-time average zero-crossing rate of Eq. (2.1).
wRect = rectwin(winLen);                 % rectangular analysis window
ZCR = STAzerocross(speechSignal, wRect, winOverlap);

subplot(1,1,1);
plot(t, speechSignal/max(abs(speechSignal)));
title('speech: He took me by surprise');

hold on;
delay = (winLen - 1)/2;                  % centre the ZCR curve on its window

plot(t(delay+1:end-delay), ZCR/max(ZCR), 'r');
xlabel('Time (sec)');
legend('Speech', 'Average Zero Crossing Rate');
hold off;
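Because STAzerocross is a helper written for this project, a self-contained sketch of the same short-time computation, following Eqs. (2.1)-(2.3) directly, is given below (the signal vector x, the 30 ms window and the 10 ms hop are assumptions for illustration):

% Self-contained short-time average zero-crossing rate, Eqs. (2.1)-(2.3)
% x is assumed to be a speech vector sampled at Fs = 8000 Hz (illustrative).
N    = 240;                               % window length (30 ms at 8 kHz)
hop  = 80;                                % window shift (10 ms)
sgnx = sign(x);  sgnx(sgnx == 0) = 1;     % sgn[.] as defined in Eq. (2.2)
d    = abs(diff(sgnx));                   % |sgn[x(m)] - sgn[x(m-1)]|
nFrames = floor((length(d) - N)/hop) + 1;
ZCR = zeros(nFrames, 1);
for i = 1:nFrames
    seg    = d((i-1)*hop + (1:N));        % rectangular window w(n) = 1/(2N)
    ZCR(i) = sum(seg) / (2*N);            % crossings per sample in this frame
end
plot(ZCR); xlabel('Frame index'); ylabel('Zero-crossing rate');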

2.4 Pitch period estimation
One of the most important parameters in speech analysis, synthesis, and coding
applications is the fundamental frequency, or pitch, of voiced speech.
Pitch frequency is directly related to the speaker and sets the unique characteristic of a
person.

Voicing is generated when the airflow from the lungs is periodically interrupted by
movement of the vocal cords. The time between successive vocal cords openings is
called the fundamental period, or pitch period.

For men, the possible pitch frequency range is usually found somewhere between 50
and 250 Hz, while for women the range usually falls between 120 and 500 Hz. In terms
of period, the range for males is 4 to 20 ms, while for females it is 2 to 8 ms.

The pitch period must be estimated for every frame. By comparing a frame with past
samples, it is possible to identify the interval over which the signal repeats itself,
resulting in an estimate of the actual pitch period. Note that the estimation procedure
makes sense only for voiced frames; meaningless results are obtained for unvoiced
frames due to their random nature.

Design of a pitch period estimation algorithm is a complex undertaking due to the lack
of perfect periodicity, interference from the formants of the vocal tract, uncertainty in
the starting instant of a voiced segment, and other real-world elements such as noise and
echo. In practice, pitch period estimation is implemented as a trade-off between
computational complexity and performance. Many techniques have been proposed for
the estimation of the pitch period and only two methods are included here.

2.4.1The Autocorrelation Method


Assume we want to perform the estimation on the signal s[n], with n being the time
index. We consider the frame that ends at time instant m, where the length of the frame
is equal to N. Then the autocorrelation value

R[l, m] = Σ (n = m-N+1 to m) s[n] s[n-l]

reflects the similarity between the frame s[n], n = m-N+1 to m, and its time-shifted
version s[n-l], where l is a positive integer representing a time lag. The range of lag is
selected so that it covers a wide range of pitch period values. For instance, for l = 20 to
147 (2.5 to 18.3 ms), the possible pitch frequency values range from 54.4 to 400 Hz at
an 8 kHz sampling rate. This range of l is applicable for most speakers and can be
encoded using 7 bits, since there are 2^7 = 128 values of pitch period.

By calculating the autocorrelation values for the entire range of lag, it is possible to find
the value of lag associated with the highest autocorrelation; this lag represents the pitch
period estimate, since, in theory, autocorrelation is maximized when the lag is equal to
the pitch period.
The method is summarized with the following pseudo code:

PITCH (m, N)
1. peak ← 0
2. for l ← 20 to 150
3.     autocorrelation ← 0
4.     for n ← m-N+1 to m
5.         autocorrelation ← autocorrelation + s[n] s[n-l]
6.     if autocorrelation > peak
7.         peak ← autocorrelation
8.         lag ← l
9. return lag

It is important to mention that the speech signal is often low-pass filtered before being
used as input for pitch period estimation. Since the fundamental frequency associated
with voicing is located in the low-frequency region (below about 500 Hz), low-pass
filtering eliminates the interfering high-frequency components as well as out-of-band
noise, leading to a more accurate estimate.
MATLAB Example
[s, Fs, bits] = wavread('sample6');       % read the test utterance used in the project
autoc = xcorr(s, 'unbiased');             % full (unbiased) autocorrelation sequence
plot(autoc)
x = s(1000:1320);                         % a voiced portion used for pitch estimation
plot(x)
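The xcorr call above only returns the autocorrelation sequence; a minimal sketch of the lag search described in the pseudocode, applied to the voiced segment x (an 8 kHz sampling rate and the lag range 20 to 150 are assumed), could look like this:

% Pick the lag with the largest autocorrelation in the plausible pitch range
Fs   = 8000;                             % sampling rate assumed for the recording
lags = 20:150;                           % candidate pitch periods in samples
peak = -inf;  bestLag = lags(1);
for l = lags
    r = sum(x(l+1:end) .* x(1:end-l));   % autocorrelation of the frame at lag l
    if r > peak
        peak = r;  bestLag = l;
    end
end
fprintf('Estimated pitch period: %d samples (%.1f Hz)\n', bestLag, Fs/bestLag);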

Figure 2.3 A voiced portion of a speech waveform used in pitch period estimation

Figure 2.4 Autocorrelation values obtained from the waveform of figure 2.3

Note: a drawback of the autocorrelation method is the need for multiplications, which
are relatively expensive to implement.

To overcome this problem, the average magnitude difference function can be used, as
described in the next section.

2.4.2 Average magnitude difference function
The magnitude difference function is defined by

MDF[l, m] = Σ (n = m-N+1 to m) |s[n] - s[n-l]|

For short segments of voiced speech it is reasonable to expect that s[n] - s[n-l] is
small for l = 0, ±T, ±2T, ... with T being the signal's period.
Thus, by computing the magnitude difference function over the lag range of interest,
one can estimate the period by locating the lag value associated with the minimum
magnitude difference.

Note that no products are needed for the implementation of the present method. The
following pseudo code summarizes the procedure:

PITCH_MD (m, N)
1. min ← ∞
2. for l ← 20 to 150
3.     mdf ← 0
4.     for n ← m-N+1 to m
5.         mdf ← mdf + |s[n] - s[n-l]|
6.     if mdf < min
7.         min ← mdf
8.         lag ← l
9. return lag

MATLAB Example
[s, Fs, bits] = wavread('sample6');
x = s(1000:1320);                          % the same voiced segment as before
for k = 1:240                              % candidate lags in samples
    amdf(k) = 0;
    for n = 1:240-k+1
        amdf(k) = amdf(k) + abs(x(n) - x(n+k-1));
    end
    amdf(k) = amdf(k)/(240-k+1);           % average magnitude difference at lag k
end
plot(amdf)

Figure 2.5 Magnitude difference values obtained from the waveform of Figure 2.3

CHAPTER 3
SPEECH REPRESENTATION IN FREQUENCY DOMAIN

3.1 Introduction
As discussed in chapter 1 the vibration of the vocal cords in voicing produces sound at
a sequence of frequencies, the natural harmonics, each of which is a multiple of the
fundamental frequency. Our ears will judge the pitch of the sound from the
fundamental frequency. The smallest element of speech sound that indicates a
difference in meaning is a phoneme. The formant frequencies for each phoneme are
quite distinct, but for a given phoneme they usually have similar values regardless of
who is speaking.
The fundamental frequency will vary depending on the person speaking, mood, and
emphasis, but it is the relationship of the formant frequencies which makes each voiced
sound easily recognizable.
In this chapter we will concentrate on the formant analysis of the speech signal, and the
extraction of the formant frequencies of different speech sounds.

3.2 Formant analysis of speech


Formant analysis of speech can be considered a special case of speech analysis. The
objective is to determine the complex natural frequencies of the vocal mechanism as
they change over time. The changes are conditioned by the articulatory deformations of
the vocal tract. One approach to such analysis is to consider how the modes are
exhibited in the short-time spectrum of the signal. As an initial illustration, the temporal
courses of the first three speech formants are traced using various methods for formant
frequency extraction.

3.3 Formant frequency extraction


In its simplest visualization, the voiced excitation of a vocal resonance is analogous to
the excitation of a single-tuned circuit by brief periodic pulses. The output is a damped
sinusoid repeated at the pulse rate. The envelope of the amplitude spectrum has a
maximum at a frequency equal essentially to the imaginary part of the complex pole
frequency.
The formant frequency might be measured either by measuring the axis-crossing rate of
the time waveform, or by measuring the frequency of the peak in the spectral envelope.
The resonances of the vocal tract are multiple. The output time waveform is therefore a
superposition of damped sinusoids and the amplitude spectrum generally exhibits
multiple peaks. If the individual resonances can be suitably isolated by appropriate
filtering, the axis-crossing measures, the spectral maxima and the spectral moments
might all be useful indications of formant frequency. There are several methods for
formant frequency extraction; one of them is the spectrum scanning and peak-picking
method.

3.3.1 Spectrum scanning and peak-picking method
One approach to real-time automatic formant tracking is simply the detection and
measurement of prominences in the short-time amplitude spectrum. At least two
methods of this type have been designed and implemented. One is based upon locating
points of zero slopes, and the other is the detection of local spectral maxima by
magnitude comparison.

3.3.2 Spectrum scanning


In this method a short-time amplitude spectrum is first produced by a set of bandpass
filters, rectifiers and integrators. The outputs of the filter channels are scanned rapidly
(on the order of 100 times per second) by a sample-and-hold circuit. This produces a
time function which is a step-wise representation of the short-time spectrum at a
number (36 in this instance) of frequency values. For each scan, the time function is
differentiated and binary-scaled to produce pulses marking the maxima of the spectrum.
The marking pulses are directed into separate channels by a counter where they sample
a sweep voltage produced at the scanning rate. The sampled voltages are proportional
to the frequencies of the respective spectral maxima and are held during the remainder
of the scan. The resulting step wise voltages are subsequently smoothed by low-pass
filtering.

3.3.3 peak-picking method


The second method segments the short-time spectrum into frequency ranges that ideally
contain a single formant. The frequency of the spectral maximum within each segment
is then measured. In the simplest form the segment boundaries are fixed. However,
additional control circuitry can automatically adjust the boundaries so that the
frequency range of a given segment is contingent upon the frequency of the next lower
formant. The normalizing circuit clamps the spectral segment either in terms of its peak
value or its mean value. The maxima of each segment are selected at a rapid rate, for
example 100 times per second, and a voltage proportional to the frequency of the
selected channel is delivered to the output. The selections can be time-phased so that
the boundary adjustments of the spectral segments are made sequentially and are set
according to the measured position of the next lower formant.
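As a rough software analogue of the peak-picking idea, the sketch below searches for the largest spectral peak inside fixed formant segments of one frame's magnitude spectrum (the segment boundaries, the FFT length and the variable frame are illustrative assumptions, not the analogue circuitry described above):

% Crude peak-picking formant estimate on one analysis frame.
% frame is assumed to be a short voiced speech segment sampled at Fs = 8000 Hz.
Fs  = 8000;
L   = length(frame);
w   = 0.54 - 0.46*cos(2*pi*(0:L-1)'/(L-1));    % Hamming window written out explicitly
N   = 512;                                     % FFT length
mag = abs(fft(frame(:) .* w, N));
f   = (0:N-1) * Fs / N;                        % frequency axis in Hz
bands = [200 900; 900 2500; 2500 3500];        % fixed F1/F2/F3 search segments (illustrative)
formant = zeros(1, size(bands,1));
for k = 1:size(bands,1)
    idx = find(f >= bands(k,1) & f <= bands(k,2));
    [pk, p] = max(mag(idx));                   % largest spectral peak inside the segment
    formant(k) = f(idx(p));
end
disp(formant)                                  % estimated F1, F2, F3 in Hz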

CHAPTER 4
SPEECH CODING

4.1 Introduction
In general speech coding is a procedure to represent a digitized speech signal using as
few bits as possible, maintaining at the same time a reasonable level of speech quality.
A not so popular name having the same meaning is speech compression.
Speech coding has matured to the point where it now constitutes an important
application area of signal processing. Due to the increasing demand for speech
communication, speech coding technology has received augmenting levels of interest
from the research, standardization, and business communities. Advances in
microelectronics and the wide availability of low-cost programmable processors and
dedicated chips have enabled rapid technology transfer from research to product
development; this encourages the research community to investigate alternative
schemes for speech coding, with the objectives of overcoming deficiencies and
limitations. The standardization community pursues the establishment of standard
speech coding methods for various applications that will be widely accepted and
implemented by the industry. The business communities capitalize on the ever-
increasing demand and opportunities in the consumer, corporate, and network
environments for speech processing products.
Speech coding is performed using numerous steps or operations specified as an
algorithm. An algorithm is any well-defined computational procedure that takes some
value, or set of values, as input and produces some value, or set of values, as output. An
algorithm is thus a sequence of computational steps that transform the input into the
output. Many signal processing problems, including speech coding, can be formulated
as a well-specified computational problem; hence, a particular coding scheme can be
defined as an algorithm. In general, an algorithm is specified as a set of instructions for
a task. With these instructions, a computer or processor can execute them so as to
complete the coding task. The instructions can also be translated into the structure of a
digital circuit, carrying out the computation directly at the hardware level.

4.2 Overview of speech coding


Structure of speech coding system

Figure (4.1) shows the block diagram of speech coding system.


• Most speech coding systems were designed to support telecommunication in the band
from 300 to 3400 Hz, i.e. a bandwidth Fm ≈ 4 kHz.
• According to the Nyquist theorem, to avoid aliasing the sampling frequency Fs must
satisfy Fs ≥ 2·Fm; here Fs = 8 kHz is used.
• 8 kHz is commonly selected as the standard sampling frequency for speech signals; to
convert the analog samples to a digital format, 8 bits/sample are used:
Bit rate = 8 kHz × 8 bits/sample = 64 kbps
• This rate is what the source encoder attempts to reduce.
• The channel encoder provides error protection to the bit stream before transmission
over the communication channel.
• In some vocoders the source encoding and channel encoding are done in a single step.

4.3 Classification of speech coding


According to coding techniques
1. Waveform coders
An attempt is made to preserve the original shape of the signal waveform, and hence
the resultant codes can generally be applied to any signal source. These coders are
better suited for high bit-rate coding, since performance drops sharply with decreasing
bit-rate. In practice, these coders work best at a bit-rate of 32 kbps and higher.
Signal to noise ratio can be utilized to measure the quality of waveform coders. Some
examples of this class include various kinds of pulse code modulation and adaptive
differential PCM (ADPCM)
Waveform coding is applicable to traditional voice networks and voice over ATM.
Two processes are required to digitize an analog signal, as follows:
Sampling. This discretizes the signal in time.
Quantizing. This discretizes the signal in amplitude.

Figure (4.2) quantization levels
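A minimal sketch of these two steps using the 8 kHz / 8 bits-per-sample figures quoted earlier (uniform quantization is assumed here for simplicity; telephone PCM actually uses logarithmic A-law or u-law companding):

% Sampling and uniform quantization of a test tone
Fs   = 8000;                          % sampling frequency (Hz)
t    = (0:Fs-1)/Fs;                   % one second of sampling instants
x    = 0.9*sin(2*pi*440*t);           % "analog" signal, here a 440 Hz tone
bits = 8;                             % bits per sample
step = 2/2^bits;                      % quantizer step size for inputs in [-1, 1)
xq   = step*floor(x/step) + step/2;   % mid-rise uniform quantizer
bitRate = Fs * bits;                  % 8000 * 8 = 64000 bits per second
fprintf('Bit rate = %d bps, SNR = %.1f dB\n', bitRate, ...
        10*log10(sum(x.^2)/sum((x - xq).^2)));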

2. Parametric coders
Within the framework of parametric coders, the speech signal is assumed to be
generated from a model, which is controlled by some parameters. During encoding,
parameters of the model are estimated from the input speech signal, with the parameters
transmitted as the encoded bit-stream. This type of coder makes no attempt to preserve
the original shape of the waveform and hence SNR is a useless quality measure.
Perceptual quality of the decoded speech is directly related to the accuracy and
sophistication of the underlying model. Due to this limitation, the coder is signal
specific, having poor performance for non-speech signals. There are several proposed
models in the literature. The most successful, however, is based on linear prediction. In
this approach, the human speech production mechanism is summarized using a time-
varying filter, with the coefficients of the filter found using the linear prediction
analysis procedure.

3. Hybrid coders
As its name implies, a hybrid coder combines the strength of a waveform coder with
that of a parametric coder. Like a parametric coder, it relies on a speech production
model; during encoding, parametric of the model are located. Additional parameters of
the model are optimized in such a way that decoded speech is as close as possible to the
original waveform, with the closeness often measured by a perceptually weighted error
signal. As in waveform coders, an attempt is made to match the original signal with the
decoded signal in the time domain.

This class dominates the medium bit-rate coders, with the code-excited linear
prediction (CELP) algorithm and its variants the most outstanding representatives.
From a technical perspective, the difference between a hybrid coder and a parametric
coder is that the former attempts to quantize or represent the excitation signal to the
speech production model, which is transmitted as part of the encoder bit-stream. The
latter, however, achieves low bit-rate by discarding all detail information of the
excitation signals; only coarse parameters are extracted.
A hybrid coder tends to behave like a waveform coder for high bit-rate, and like a
parametric coder at low bit-rate, with fair to good quality for medium bit-rate.

4. Single-mode and multimode coders

Single-mode coders are those that apply a specific, fixed encoding mechanism at all
times, leading to a constant bit-rate for the encoded bit-stream. Examples of such
coders are pulse code modulation (PCM) and regular-pulse-excited long-term
prediction (RPE-LTP).

Multimode coders were invented to take advantage of the dynamic nature of the speech
signal, and to adapt to the time-varying network conditions. In this configuration, one
of several distinct coding modes is selected, with the selection done either by the source
itself or by switching that obeys external commands in response to network needs or
channel conditions.
According to bit rate
1. High >15 kbps
2. Medium 5 to 15 kbps
3. Low 2 to 5 kbps
4. Very low <2 kbps

LPC has an output rate of 2.4 kbps, a reduction of more than 53 times with respect to
the input.
Its classification:
• low bit rate
• parametric coder

Since LPC offers a good quality vs. bit rate trade-off, it is the most commonly used
coding technique in various applications, such as:

Applications
FS1015 LPC (1984) - to provide secure communication in military applications
TIA IS54 VSELP (1989) - for TDMA
ETSI AMR ACELP (1999) - for UMTS (Universal Mobile Telecommunication System)
in 3GPP (3rd Generation Partnership Project)
VoIP - voice over IP
So our focus will be linear predictive coding (LPC).

4.4 Linear Predictive Coding (LPC)
Linear Predictive Coding (LPC)

• One of the most powerful speech analysis techniques


• One of the most useful methods for encoding good quality speech at a low bit
rate.
• It provides extremely accurate estimates of speech parameters

4.4.1 Basic Principles

A. Physical Model:

LPC starts with the assumption that


The speech signal is produced by a buzzer at the end of a tube.
The glottis (the space between the vocal cords) produces the buzz, which is
characterized by its intensity (loudness) and frequency (pitch).
The vocal tract (the throat and mouth) forms the tube, which is characterized by its
resonances (formants).

B. Mathematical Model:

LPC analyzes the speech signal by estimating the formants, removing their effects from
the speech signal, and estimating the intensity and frequency of the remaining buzz.
The process of removing the formants is called inverse filtering, and the remaining
signal is called the residue. The numbers which describe the formants and the residue
can be stored or transmitted somewhere else.

LPC synthesizes the speech signal by reversing the process: use the residue to create a
source signal, use the formants to create a filter (which represents the tube), and run the
source through the filter, resulting in speech. Because speech signals vary with time,
this process is done on short chunks of the speech signal, which are called frames.
Usually 30 to 50 frames per second give intelligible speech with good compression.

Figure (4.3) Block diagram of simplified model for speech production.

(Figure labels: vocal tract = LPC filter; air = innovations; vocal cord vibration = voiced excitation; vocal cord vibration period = pitch period; fricatives and plosives = unvoiced excitation; air volume = gain.)

4.4.2 The LPC filter

H(z) = S(z)/U(z) = G / (1 - Σ (k=1 to p) ak z^-k)

This is equivalent to saying that the input-output relationship of the filter is given by
the linear difference equation:

s(n) = Σ (k=1 to p) ak s(n-k) + G u(n)

The LPC model can be represented in vector form by a parameter vector A containing
the filter coefficients together with the gain and excitation parameters. A changes every
20 ms or so. At a sampling rate of 8000 samples/sec, 20 ms is equivalent to 160 samples.

• The digital speech signal is divided into frames of size 20 ms. There are
50 frames/second.

Thus the 160 values of s in a frame are compactly represented by the 13 values of A.

• There's almost no perceptual difference in S if:


o For Voiced Sounds (V): the impulse train is shifted (insensitive
to phase change).
o For Unvoiced Sounds (UV): a different white noise sequence is
used.

LPC Synthesis: Given A, generate S

LPC Analysis: Given S, find the best A
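A minimal MATLAB sketch of analysis and synthesis for one frame, using the autocorrelation formulation described in Section 4.5 (the frame vector s and the order p = 10 are assumptions; the normal equations are solved directly instead of with the Levinson-Durbin recursion):

% LPC analysis and synthesis of a single 20 ms frame (160 samples at 8 kHz)
% s is assumed to be a column vector containing one voiced frame.
p = 10;                                   % predictor order
N = length(s);
r = zeros(p+1, 1);
for k = 0:p                               % short-time autocorrelation r(0)..r(p)
    r(k+1) = sum(s(1:N-k) .* s(k+1:N));
end
R = toeplitz(r(1:p));                     % autocorrelation (Toeplitz) matrix
a = R \ r(2:p+1);                         % predictor coefficients alpha_k
G = sqrt(r(1) - a' * r(2:p+1));           % gain matching the residual energy

% Analysis: inverse filtering A(z) gives the prediction error (residual)
e = filter([1; -a], 1, s);

% Synthesis: excite G/A(z) with an impulse train at an assumed 100 Hz pitch
Fs = 8000;  pitch = round(Fs/100);
u = zeros(N, 1);  u(1:pitch:N) = 1;
sHat = filter(G, [1; -a], u);             % reconstructed (vocoded) frame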


4.4.3 Problems in LPC model

Problem: the tube isn't just a tube

It may seem surprising that the signal can be characterized by such a simple linear
predictor. It turns out that, in order for this to work, the tube must not have any side
branches.
(In mathematical terms, side branches introduce zeros, which require much more
complex equations.)
For ordinary vowels, the vocal tract is well represented by a single tube. However, for
nasal sounds, the nose cavity forms a side branch. Theoretically, therefore, nasal sounds
require a different and more complicated algorithm. In practice, this difference is partly
ignored and partly dealt with during the encoding of the residue.

Encoding the Source

If the predictor coefficients are accurate, and everything else works right, the speech
signal can be inverse filtered by the predictor, and the result will be the pure source
(buzz). For such a signal, it's fairly easy to extract the frequency and amplitude and
encode them.

However, some consonants are produced with turbulent airflow, resulting in a hissy
sound (fricatives and stop consonants). Fortunately, the predictor equation doesn't care
if the sound source is periodic (buzz) or chaotic (hiss).

This means that for each frame, the LPC encoder must decide if the sound source is
buzz or hiss; if buzz, estimate the frequency; in either case, estimate the intensity; and
encode the information so that the decoder can undo all these steps. This is how LPC-
10e, the algorithm described in federal standard 1015, works: it uses one number to
represent the frequency of the buzz, and the number 0 is understood to represent hiss.
LPC-10e provides intelligible speech transmission at 2400 bits per second.


Problem: the buzz isn't just buzz


Unfortunately, things are not so simple. One reason is that there are speech sounds
which are made with a combination of buzz and hiss sources (for example, the initial
consonants in "this zoo" and the middle consonant in "azure"). Speech sounds like this
will not be reproduced accurately by a simple LPC encoder.

Another problem is that, inevitably, any inaccuracy in the estimation of the formants
means that more speech information gets left in the residue. The aspects of nasal
sounds that don't match the LPC model (as discussed above), for example, will end up
in the residue.

There are other aspects of the speech sound that don't match the LPC model; side
branches introduced by the tongue positions of some consonants, and tracheal (lung)
resonances are some examples.

Therefore, the residue contains important information about how the speech should
sound, and LPC synthesis without this information will result in poor quality speech.
For the best quality results, we could just send the residue signal, and the LPC synthesis
would sound great. Unfortunately, the whole idea of this technique is to compress the
speech signal, and the residue signal takes just as many bits as the original speech
signal, so this would not provide any compression.

Encoding the Residue

Various attempts have been made to encode the residue signal in an efficient way,
providing better quality speech than LPC-10e without increasing the bit rate too much.
The most successful methods use a codebook, a table of typical residue signals, which
is set up by the system designers. In operation, the analyzer compares the residue to all
the entries in the codebook, chooses the entry which is the closest match, and just sends
the code for that entry. The synthesizer receives this code, retrieves the corresponding
residue from the codebook, and uses that to excite the formant filter. Schemes of this
kind are called Code Excited Linear Prediction (CELP).

4.5 Basic Principles of Linear Predictive Analysis


The particular form of this model that is appropriate for the discussion of linear
predictive analysis is depicted in Fig. 4.3. In this case, the composite spectrum effects
of radiation, vocal tract, and glottal excitation are represented by a time-varying digital
filter whose steady-state system function is of the form
H(z) = S(z)/U(z) = G / (1 - Σ (k=1 to p) ak z^-k)      (4.1)

This system is excited by an impulse train for voiced speech or a random noise
sequence for unvoiced speech. Thus, the parameters of this model are:
Voiced/unvoiced classification, pitch period for voiced speech, gain parameter G, and
the coefficients {ak} of the digital filter. These parameters, of course, all vary slowly
with time.
The pitch period and voiced/unvoiced classification can be estimated using one of the
many methods. This simplified all-pole model is a natural representation of non- nasal
voiced sounds, but for nasals and fricative sounds, the detailed acoustic theory calls for
both poles and zeros in the vocal tract transfer function. We shall see, however, that if
the order p is high enough, the all-pole model provides a good representation for almost
all the sounds of speech.

The major advantage of this model is that the gain parameter, G, and the filter
coefficients {ak} can be estimated in a very straightforward and computationally
efficient manner by the method of linear predictive analysis.
For the system of Fig. 4.3, the speech samples s(n) are related to the excitation u(n) by
the simple difference equation
s(n) = ∑_{k=1}^{p} a_k s(n−k) + G u(n)    (4.2)

A linear predictor with prediction coefficients α_k is defined as a system whose output is

s̃(n) = ∑_{k=1}^{p} α_k s(n−k)    (4.3)

Such systems are used to reduce the variance of the difference signal in differential quantization schemes. The system function of a pth-order linear predictor is the polynomial

P(z) = ∑_{k=1}^{p} α_k z^{−k}    (4.4)

The prediction error, e(n), is defined as


e(n) = s(n) − s̃(n) = s(n) − ∑_{k=1}^{p} α_k s(n−k)    (4.5)

From Eq. (4.5) it can be seen that the prediction error sequence is the output of a
system whose transfer function is
A(z) = 1 − ∑_{k=1}^{p} α_k z^{−k}    (4.6)

It can be seen by comparing Eqs. (4.2) and (4.5) that if the speech signal obeys the model of Eq. (4.2) exactly, and if α_k = a_k, then e(n) = Gu(n).
Thus, the prediction error filter, A(z), will be an inverse filter for the system H(z) of Eq. (4.1), i.e.,

H(z) = G / A(z)    (4.7)
The basic problem of linear prediction analysis is to determine a set of predictor
coefficients {α_k} directly from the speech signal in such a manner as to obtain a good
estimate of the spectral properties of the speech signal through the use of Eq. (4.7).
Because of the time-varying nature of the speech signal the predictor coefficients must
be estimated from short segments of the speech signal. The basic approach is to find a
set of predictor coefficients that will minimize the mean-squared prediction error over a
short segment of the speech waveform. The resulting parameters are then assumed to be
the parameters of the system function, H(z), in the model for speech production.

That this approach will lead to useful results may not be immediately obvious, but it
can be justified in several ways. First, recall that if α_k = a_k, then e(n) = Gu(n). For
voiced speech this means that e(n) would consist of a train of impulses; i.e., e(n) would
be small most of the time. Thus, finding ak’s that minimize prediction error seems
consistent with this observation. A second motivation for this approach follows from
the fact that if a signal is generated by Eq. (4.2) with non-time-varying coefficients and
excited either by a single impulse or by a stationary white noise input, then it can be
shown that the predictor coefficients that result from minimizing the mean squared
prediction error (over all time) are identical to the coefficients of Eq. (4.2). A third very
pragmatic justification for using the minimum mean-squared prediction error as a basis
for estimating the model parameters is that this approach leads to a set of linear
equations that can be efficiently solved to obtain the predictor parameters. More
importantly the resulting parameters comprise a very useful and accurate representation
of the speech signal as we shall see in this chapter.
The short-time average prediction error is defined as

E_n = ∑_m e_n^2(m)    (4.8)
    = ∑_m ( s_n(m) − s̃_n(m) )^2    (4.9)
    = ∑_m ( s_n(m) − ∑_{k=1}^{p} α_k s_n(m−k) )^2    (4.10)
where s_n(m) is a segment of speech that has been selected in the vicinity of sample n, i.e.,

s_n(m) = s(m + n)    (4.11)
The range of summation in Eqs. (4.8)-(4.10) is temporarily left unspecified, but since
we wish to develop a short-time analysis technique, the sum will always be over a finite
interval. Also note that to obtain an average we should divide by the length of the
speech segment. However, this constant is irrelevant to the set of linear equations that
we will obtain and therefore is omitted. We can find the values of α_k that minimize E_n in Eq. (4.10) by setting ∂E_n/∂α_i = 0 for i = 1, 2, ..., p, thereby obtaining the equations

∑_m s_n(m−i) s_n(m) = ∑_{k=1}^{p} α̂_k ∑_m s_n(m−i) s_n(m−k),    1 ≤ i ≤ p    (4.12)

where α̂_k are the values of α_k that minimize E_n. (Since α̂_k is unique, we will drop the caret and use the notation α_k to denote the values that minimize E_n.) If we define

φ_n(i, k) = ∑_m s_n(m−i) s_n(m−k)    (4.13)

then Eq. (4.12) can be written more compactly as

∑_{k=1}^{p} α_k φ_n(i, k) = φ_n(i, 0),    i = 1, 2, ..., p    (4.14)

This set of p equations in p unknowns can be solved in an efficient manner for the unknown predictor coefficients {α_k} that minimize the average squared prediction error for the segment s_n(m). Using Eqs. (4.10) and (4.12), the minimum mean-squared prediction error can be shown to be

E_n = ∑_m s_n^2(m) − ∑_{k=1}^{p} α_k ∑_m s_n(m) s_n(m−k)    (4.15)

and using Eq. (4.14) we can express E_n as

E_n = φ_n(0, 0) − ∑_{k=1}^{p} α_k φ_n(0, k)    (4.16)

Thus the total minimum error consists of a fixed component, and a component which
depends on the predictor coefficients.
To solve for the optimum predictor coefficients, we must first compute the quantities φ_n(i, k) for 1 ≤ i ≤ p and 0 ≤ k ≤ p. Once this is done we only have to solve Eq. (4.14) to obtain the α_k's. Thus, in principle, linear prediction analysis is very straightforward. However, the details of the computation of φ_n(i, k) and the subsequent solution of the equations are somewhat intricate and require further discussion.
So far we have not explicitly indicated the limits on the sums in Eqs. (4.8)-(4.10) and in
Eq. (4.12); however it should be emphasized that the limits on the sum in Eq. (4.12) are
identical to the limits assumed for the mean squared prediction error in Eqs. (4.8)-
(4.10). As we have stated, if we wish to develop a short-time analysis procedure, the
limits must be over a finite interval. There are two basic approaches to this question,
and we shall see below that two methods for linear predictive analysis emerge out of a
consideration of the limits of summation and the definition of the waveform segment
sn(m).

4.5.1 The autocorrelation method


One approach to determining the limits on the sums in Eqs. (4.8)-(4.10) and Eq. (4.12) is to assume that the waveform segment s_n(m) is identically zero outside the interval 0 ≤ m ≤ N − 1. This can be conveniently expressed as

s_n(m) = s(m + n) w(m)    (4.17)

where w(m) is a finite-length window (e.g., a Hamming window) that is identically zero outside the interval 0 ≤ m ≤ N − 1.

The effect of this assumption on the question of limits of summation for the expressions
for En can be seen by considering Eq. (4.5). Clearly, if sn(m) is nonzero only
for 0 ≤ m ≤ N − 1 , then the corresponding prediction error, en(m), for a pth order
predictor will be nonzero over the interval 0 ≤ m ≤ N − 1 + p . Thus, for this case En is
properly expressed as
E_n = ∑_{m=0}^{N+p−1} e_n^2(m)    (4.18)

Alternatively, we could have simply indicated that the sum should be over all nonzero values by summing from −∞ to +∞.
Returning to Eq. (4.5), it can be seen that the prediction error is likely to be large at the beginning of the interval (specifically 0 ≤ m ≤ p − 1) because we are trying to predict the signal from samples that have arbitrarily been set to zero. Likewise the error can be large at the end of the interval (specifically N ≤ m ≤ N + p − 1) because we are trying to predict zero from samples that are nonzero. For this reason, a window which tapers the segment s_n(m) to zero is generally used for w(m) in Eq. (4.17).
The limits on the expression for φ_n(i, k) in Eq. (4.13) are identical to those of Eq. (4.18). However, because s_n(m) is identically zero outside the interval 0 ≤ m ≤ N − 1, it is simple to show that

φ_n(i, k) = ∑_{m=0}^{N+p−1} s_n(m−i) s_n(m−k),    1 ≤ i ≤ p, 0 ≤ k ≤ p    (4.19a)

can be expressed as
φ_n(i, k) = ∑_{m=0}^{N−1−(i−k)} s_n(m) s_n(m + i − k),    1 ≤ i ≤ p, 0 ≤ k ≤ p    (4.19b)

Furthermore, it can be seen that in this case φ_n(i, k) is identical to the short-time autocorrelation function evaluated at (i − k). That is,

φ_n(i, k) = R_n(i − k)    (4.20)

where

R_n(k) = ∑_{m=0}^{N−1−k} s_n(m) s_n(m + k)    (4.21)

Since R_n(k) is an even function, it follows that

φ_n(i, k) = R_n(|i − k|),    i = 1, 2, ..., p;  k = 0, 1, ..., p    (4.22)

Therefore Eq. (4.14) can be expressed as

∑_{k=1}^{p} α_k R_n(|i − k|) = R_n(i),    1 ≤ i ≤ p    (4.23)

Similarly, the minimum mean-squared prediction error of Eq. (4.16) takes the form

E_n = R_n(0) − ∑_{k=1}^{p} α_k R_n(k)    (4.24)

The set of equations given by Eq. (4.23) can be expressed in matrix form as

| R_n(0)      R_n(1)      R_n(2)      ...   R_n(p−1) |  | α_1 |     | R_n(1) |
| R_n(1)      R_n(0)      R_n(1)      ...   R_n(p−2) |  | α_2 |     | R_n(2) |
| R_n(2)      R_n(1)      R_n(0)      ...   R_n(p−3) |  | α_3 |  =  | R_n(3) |    (4.25)
| ...         ...         ...         ...   ...      |  | ... |     | ...    |
| R_n(p−1)    R_n(p−2)    R_n(p−3)    ...   R_n(0)   |  | α_p |     | R_n(p) |

The p×p matrix of autocorrelation values is a Toeplitz matrix; i.e., it is symmetric and
all the elements along a given diagonal are equal. This special property will be
exploited in Section 4.3 to obtain an efficient algorithm for the solution of Eq. (4.23).
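A compact MATLAB sketch of the autocorrelation method, written directly from Eqs. (4.17), (4.21) and (4.25), is given below; the function name and the assumption that s is a row vector long enough to cover the frame are ours (hamming requires the Signal Processing Toolbox), and MATLAB's built-in lpc and levinson functions perform essentially the same computation more efficiently:

% Sketch of the autocorrelation method for one frame (assumed names).
% s : speech samples (row vector), n : starting sample, N : frame length, p : predictor order
function alpha = lpc_autocorrelation(s, n, N, p)
    sn = s(n+1 : n+N) .* hamming(N).';      % s_n(m) = s(m+n) w(m), Eq. (4.17)
    R = zeros(1, p+1);
    for k = 0:p                              % short-time autocorrelation, Eq. (4.21)
        R(k+1) = sum(sn(1:N-k) .* sn(1+k:N));
    end
    % Normal equations of Eq. (4.25): a symmetric Toeplitz system.  The backslash
    % solve below is the direct route; the Levinson-Durbin recursion (levinson)
    % exploits the Toeplitz structure to solve it with far fewer operations.
    alpha = toeplitz(R(1:p)) \ R(2:p+1).';   % predictor coefficients alpha_1 ... alpha_p
end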

4.5.2 The covariance method


The second basic approach to defining the speech segment s_n(m) and the limits on the sums is to fix the interval over which the mean-squared error is computed and then consider the effect on the computation of φ_n(i, k). That is, if we define
E_n = ∑_{m=0}^{N−1} e_n^2(m)    (4.26)

then φ_n(i, k) becomes

φ_n(i, k) = ∑_{m=0}^{N−1} s_n(m−i) s_n(m−k),    1 ≤ i ≤ p, 0 ≤ k ≤ p    (4.27)
In this case, if we change the index of summation we can express φ_n(i, k) as either

φ_n(i, k) = ∑_{m=−i}^{N−i−1} s_n(m) s_n(m + i − k),    1 ≤ i ≤ p, 0 ≤ k ≤ p    (4.28a)

or

φ_n(i, k) = ∑_{m=−k}^{N−k−1} s_n(m) s_n(m + k − i),    1 ≤ i ≤ p, 0 ≤ k ≤ p    (4.28b)

Although the equations look very similar to Eq. (4.19b), we see that the limits of summation are not the same. Equations (4.28) call for values of s_n(m) outside the interval 0 ≤ m ≤ N − 1. Indeed, evaluating φ_n(i, k) for all of the required values of i and k requires that we use values of s_n(m) in the interval −p ≤ m ≤ N − 1. If we are to be consistent with the limits on E_n in Eq. (4.26), then we have no choice but to supply the required values. In this case it does not make sense to taper the segment of speech to zero at the ends as in the autocorrelation method, since the necessary values are made available from outside the interval 0 ≤ m ≤ N − 1. This approach is very similar to what was called the modified autocorrelation function in Chapter 2; it leads to a function which is not a true autocorrelation function but, rather, the cross-correlation between two very similar, but not identical, finite-length segments of the speech wave. Although the differences between Eq. (4.28) and Eq. (4.19b) appear to be minor computational details, the set of equations

∑_{k=1}^{p} α_k φ_n(i, k) = φ_n(i, 0),    i = 1, 2, ..., p    (4.29a)

has significantly different properties that strongly affect the method of solution and the properties of the resulting optimum predictor.
In matrix form these equations become

| φ_n(1,1)    φ_n(1,2)    φ_n(1,3)    ...   φ_n(1,p) |  | α_1 |     | φ_n(1,0) |
| φ_n(2,1)    φ_n(2,2)    φ_n(2,3)    ...   φ_n(2,p) |  | α_2 |     | φ_n(2,0) |
| φ_n(3,1)    φ_n(3,2)    φ_n(3,3)    ...   φ_n(3,p) |  | α_3 |  =  | φ_n(3,0) |    (4.29b)
| ...         ...         ...         ...   ...      |  | ... |     | ...      |
| φ_n(p,1)    φ_n(p,2)    φ_n(p,3)    ...   φ_n(p,p) |  | α_p |     | φ_n(p,0) |
In this case, since φ_n(i, k) = φ_n(k, i) (see Eq. (4.28)), the p×p matrix of correlation-like values is symmetric but not Toeplitz. Indeed, it can be seen that the diagonal elements are related by the equation

φ_n(i+1, k+1) = φ_n(i, k) + s_n(−i−1) s_n(−k−1) − s_n(N−1−i) s_n(N−1−k)    (4.30)
The method of analysis based upon this method of computation of φ_n(i, k) has come to be known as the covariance method because the matrix of values {φ_n(i, k)} has the properties of a covariance matrix.
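For comparison, a corresponding MATLAB sketch of the covariance method, following Eqs. (4.26)-(4.29b), could look like this; again the names and the assumption that p samples before the frame are available are ours:

% Sketch of the covariance method for one frame (assumed names).
% s : speech samples (row vector), n : starting sample, N : frame length, p : order
function alpha = lpc_covariance(s, n, N, p)
    % s_n(m) is needed for -p <= m <= N-1, i.e. p samples before the frame as well
    sn = s(n+1-p : n+N);                     % shifted so that sn(m+p+1) = s_n(m)
    phi = zeros(p+1, p+1);
    m = 0:N-1;
    for i = 0:p
        for k = 0:p                          % phi_n(i,k) of Eq. (4.27)
            phi(i+1, k+1) = sum(sn(m-i+p+1) .* sn(m-k+p+1));
        end
    end
    % Eq. (4.29b): the matrix is symmetric but not Toeplitz, so a general
    % (typically Cholesky-based) solver is used instead of Levinson-Durbin.
    alpha = phi(2:p+1, 2:p+1) \ phi(2:p+1, 1);   % alpha_1 ... alpha_p
end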

CHAPTER 5
APPLICATIONS

5.1 Speech synthesis


Speech analysis is that portion of voice processing that converts speech to digital forms suitable for storage on computer systems and transmission over digital (data or telecommunications) networks.
Speech synthesis is that portion of voice processing that reconverts speech data from a digital form to a form suitable for human usage. These functions are essentially the inverse of speech analysis.
Speech analysis processes are also called digital speech encoding (or coding), and speech synthesis is also called speech decoding. The objective of any speech-coding scheme is to produce a string of voice codes of minimum data rate, so that a synthesizer can reconstruct an accurate facsimile of the original speech in an effective manner, while optimizing the transmission (or storage) medium.

5.1.1 Formant-frequency extraction


We use the spectrum-scanning and peak-picking method for the extraction of the formant frequencies.
• We use voice signals of one phoneme and divide each signal into frames of 20 ms each.
• From the spectrum of each frame we extract the pitch and the first three formant frequencies characterizing this phoneme.

Figure (5.1) frames of 20ms


Figure (5.2) spectrum of frame

Figure (5.3) zoom to show fundamental frequency

• The range of the fundamental frequency (f0) is between 150 and 250 Hz for male speakers and between 250 and 400 Hz for female speakers.
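A rough MATLAB sketch of this frame-by-frame peak picking is shown below; it is only an illustration of the idea, with the window length, FFT size and the "largest peaks" criterion for the formants all being our own simplifying assumptions:

% Rough sketch: pick the pitch and the first formant peaks from one 20 ms frame.
% x : speech samples (column vector), fs : sampling frequency in Hz (e.g. 8000)
frameLen = round(0.020 * fs);               % 20 ms frame
frame = x(1:frameLen) .* hamming(frameLen); % windowed frame
X = abs(fft(frame, 1024));                  % magnitude spectrum
X = X(1:512);
f = (0:511) * fs / 1024;                    % frequency axis of the first half
% Pitch: strongest peak inside the expected f0 range stated above (150-400 Hz)
band = find(f >= 150 & f <= 400);
[~, i0] = max(X(band));
f0 = f(band(i0));
% Formants: crudely taken here as the largest remaining spectral peaks
[~, locs] = findpeaks(X, 'SortStr', 'descend');
formants = sort(f(locs(1:3)));              % rough estimates of the first three formants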

• Then we extract the part of the frame containing the highest energy content (the part around the three formant frequencies) and we neglect the rest of the frame; we call this extracted part "info".
• In this way we reduce the amount of signal that we need to send.
• We try to reconstruct the voice signal by using the extracted part.

• At the receiver, we reconstruct the voice signal using the smallest part of the frame that still gives a moderately good quality and allows for voice recognition.
• We try to create a synthetic frame using the pitch and the first three formant frequencies.
• By using this synthetic frame, we create a synthesized voice signal.

Figure (5.4) the part of the frame containing the highest energy content

Figure (5.5) "Info" (the previous figure) in the time domain

- The first synthetic frame is reconstructed by convolving "info" with deltas at the pitch period.

Figure (5.6) Deltas at pitch period


The synthesized signal is built, but it has poor quality.

Figure (5.7) first synthesis frame

- The second synthetic frame is reconstructed by convolving "info" with triangles whose vertices lie at the pitch period.

Figure (5.8) Triangles with vertices at the pitch period


The synthesized signal is built and has good quality.

Figure (5.9) Second synthesis frame


- The third synthetic frame is reconstructed by convolving "info" with Hamming windows at the pitch period.

Figure (5.10) Hamming windows at pitch period

The synthesized signal is built and has good quality.

Figure (5.11) Third synthesis frame
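The three excitation trains used above (deltas, triangles and Hamming windows at the pitch period) can be generated and convolved with the "info" excerpt roughly as in the following MATLAB sketch; the pulse width of 11 samples is an arbitrary choice of ours:

% Sketch: rebuild a synthetic frame by convolving "info" with a pulse train.
% info : high-energy excerpt (row vector); pitchPeriod, frameLen in samples (assumed names)
train = zeros(1, frameLen);
train(1:pitchPeriod:frameLen) = 1;               % (a) deltas at the pitch period
triTrain = conv(train, triang(11).', 'same');    % (b) triangles with vertices at the pitch period
hamTrain = conv(train, hamming(11).', 'same');   % (c) Hamming windows at the pitch period
synthFrame = conv(info, train);                  % use triTrain or hamTrain for the other variants
synthFrame = synthFrame(1:frameLen);             % keep one frame length of output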


5.1.2 LPC
We use the LPC (linear predictive coding) technique to synthesize the voice signal.
• We use voice signals of one phoneme and divide each signal into frames of 20 ms each.
• We determine the parameters of each frame (here we take 20 parameters only) using the LPC filter in MATLAB.
• These are the parameters that we send from the transmitter.

Figure (5.12) frame of 20 ms

• Note that when the number of parameters is increased, the quality increases too.

H(z) = S(z)/U(z) = G / (1 − ∑_{k=1}^{p} a_k z^{−k})

u(n): glottal pulse (innovation)

• We obtain the glottal pulse from the previous equation, as sketched below.
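A minimal MATLAB sketch of this step, assuming one 20 ms frame stored in a column vector called frame, is:

% Sketch: LPC parameters and glottal pulse (residual) for one frame (assumed names).
p = 20;                          % number of LPC parameters, as stated above
a = lpc(frame, p);               % prediction polynomial a = [1, -a_1, ..., -a_p]
u = filter(a, 1, frame);         % inverse filtering A(z): the residual, i.e. the glottal pulse
frameHat = filter(1, a, u);      % resynthesis through H(z) = 1/A(z) recovers the frame

Sending u(n) unchanged gives essentially transparent quality but no compression, which is why the reduced excitations below are tried instead.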

Figure (5.13) glottal pulse of frame fig (5.10)

• We can transmit this glottal pulse and reconstruct the signal at the receiver.
• At the receiver, we reconstruct the voice signal using the LPC parameters and the glottal pulse for voice recognition.
• However, we need to reduce the amount of data that we send, so we use the following approaches.

- The first synthetic frame is reconstructed using the part of the glottal pulse containing the highest energy content.

Figure (5.14) part of the glottal pulse containing the highest energy content

The synthesized signal has very good quality, as shown.

Figure (5.15) First synthesis frame


However, we need to transmit parameters only, so we generate a periodic excitation at the receiver and synthesize the signal from it.

- Second approach: generate triangles at the pitch period.

Figure (5.16) Triangles with vertices at the pitch period

The synthesized signal has medium quality, as shown.

Figure (5.17) Second synthesis frame

- Third approach: generate Hamming windows at the pitch period.

Figure (5.18) Hamming windows at pitch period

The synthesized signal has medium quality, as shown.

Figure (5.19) Third synthesis frame

5.2 Speaker identification using LPC
Speaker recognition is the task of recognizing people from their voices. Such systems
extract features from speech, model them and use them to identify the person from
his/her voice.
In this application we first inspect which part of the LPC analysis is more effective in speaker identification: the LPC parameters or the glottal pulse.

Part 1
a. We take two voice samples from two different speakers, both samples being of the
same phoneme, preferably a male sample and a female sample to emphasize the
difference in perception.
b. We pass each sample through an LPC filter getting the LPC parameters, and the
glottal pulse for both speakers.
c. The next step is to swap the glottal pulse of the two speakers and reconstruct the
voice signal of each speaker using his LPC parameters and the glottal pulse of
the other speaker.
d. After the reconstruction of the new voice signals we will find that the glottal
pulse is the effective parameter in voice recognition.
e. The next figure shows that the LPC parameters of both speakers have very close
values, which supports our conclusion.

Figure (5.20) LPC parameters from two different speakers

f. While comparing the LPC parameters did not show a big difference between the
two speakers, the comparison between the glottal pulses of the two speakers
showed a much bigger difference, confirming our conclusion that the glottal
pulse has a much greater weight in the identification of the speaker. This is
shown in the next figure.

Figure (5.21) glottal pulses from different speakers

Part 2
The second portion of our application handles the identification part after
concluding that the glottal pulse is the effective parameter.
a. First we construct a code book containing voice samples from different
speakers but for the same phoneme.
b. Secondly, we take a voice sample from one of the speakers for the same
phoneme used in the code book construction, and we consider this signal
as our input signal for which the identification of the speaker needs to be made.
c. The identification process is done using the distortion measure technique.
d. Distortion measure = ∑ (S_n − S_i)^2, where S_n is one of the samples saved in
the code book (n = 1, 2, ..., N), N is the number of samples saved in the
code book (the number of speakers), and S_i is the input signal.
e. When we find the code book signal with the least distortion, we identify the
speaker as the speaker of that signal S_n saved in the code book.
f. We can also use an input signal from a speaker not found in the code book; in
this case the program will calculate the distortion measure between this
signal and the signals saved in the code book and choose the speaker from
the code book with the least distortion measure. A sketch of this
identification step is given below.
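The identification step of items (c)-(e) can be sketched in MATLAB as follows (variable names are our own assumptions):

% Sketch of the distortion-measure identification (assumed names).
% codebook : N-by-L matrix, one stored voice sample per speaker (same phoneme)
% Si       : 1-by-L input signal from the unknown speaker
N = size(codebook, 1);
D = zeros(N, 1);
for n = 1:N
    D(n) = sum((codebook(n, :) - Si) .^ 2);  % distortion measure of item (d)
end
[~, speakerIndex] = min(D);                  % speaker with the least distortion, item (e)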

5.3 Introduction to VOIP
VoIP (voice over IP - that is, voice delivered using the Internet Protocol) is a term used
in IP telephony for a set of facilities for managing the delivery of voice information
using the Internet Protocol (IP). In general, this means sending voice information in
digital form in discrete packets rather than in the traditional circuit-committed protocols
of the public switched telephone network (PSTN). A major advantage of VoIP and
Internet telephony is that it avoids the tolls charged by ordinary telephone service.

VoIP, now used somewhat generally, derives from the VoIP Forum, an effort by major
equipment providers, to promote the use of ITU-T H.323, the standard for sending
voice (audio) and video using IP on the public Internet and within an intranet. The
Forum also promotes the use of directory service standards so that users can locate
other users and the use of touch-tone signals for automatic call distribution and voice
mail.

In addition to IP, VoIP uses the real-time protocol (RTP) to help ensure that packets get
delivered in a timely way. Using public networks, it is currently difficult to guarantee
Quality of Service (QoS). Better service is possible with private networks managed by
an enterprise or by an Internet telephony service provider (ITSP).

A technique used by at least one equipment manufacturer, Adir Technologies (formerly Netspeak), to help ensure faster packet delivery is to use ping to contact all possible network gateway computers that have access to the public network and choose the fastest path before establishing a Transmission Control Protocol (TCP) sockets connection with the other end.

Using VoIP, an enterprise positions a "VoIP device" at a gateway. The gateway receives packetized voice transmissions from users within the company and then routes them to other parts of its intranet (local area or wide area network) or, using a T-carrier system or E-carrier interface, sends them over the public switched telephone network.

5.3.1 VoIP Standards


• ITU-T H.320 Standards for Video Conferencing,
• H.323 ITU Standards
• H.324 ITU Standards
• VPIM Technical Specification

5.3.2 System architecture

Figure (5.22) Overview of VOIP network

5.3.3 Coding technique in VOIP systems
Codecs are software drivers that are used to encode the speech in a compact enough form that it can be sent in real time across the Internet using the available bandwidth.
Codecs are not something that VOIP users normally need to worry about, as the VOIP
clients at each end of the connection negotiate between them which one to use.

VOIP software or hardware may give you the option to specify the codecs you prefer to
use. This allows you to make a choice between voice quality and network bandwidth
usage, which might be necessary if you want to allow multiple simultaneous calls to be
held using an ordinary broadband connection. Your selection is unlikely to make any
noticeable difference when talking to PSTN users, because the lowest bandwidth part
of the connection will always limit the quality achievable, but VOIP-to-VOIP calls
using a broadband Internet connection are capable of delivering much better quality
than the plain old telephone system.

A broadband connection is desirable to use VOIP, though it is certainly possible to use it over a dial-up modem connection if a low-bandwidth, low-fidelity codec is chosen. The table below lists some commonly used codecs.

Codec      Algorithm                                                                Bit rate (Kbps)   Bandwidth (Kbps)
G.711      Pulse Code Modulation (PCM)                                              64                87.2
G.722      Adaptive Pulse Code Modulation (ADPCM)                                   48                66.8
G.726      Adaptive Differential Pulse Code Modulation (ADPCM)                      32                55.2
G.728      Low-Delay Code Excited Linear Prediction (LD-CELP)                       16                31.5
iLBC       Internet Low Bit-rate Codec (ILBC)                                       15                27.7
G.727      Embedded ADPCM                                                           16                16
G.729a     Conjugate Structure Algebraic-Code Excited Linear Prediction (CS-CELP)   8                 31.2
G.723.1a   MP-MLQ                                                                   6.4               21.9
G.723.1    ACELP                                                                    5.3               20.8

The bit rate is an approximate indication of voice quality or fidelity; however, it is only approximate. Codecs that use pulse code modulation all give high fidelity, and you will detect little or no difference between any of them. The G.728 codec will give much better quality than the only nominally lower-rate GSM codec, because the algorithm it uses is much more sophisticated. However, the GSM codec uses less computing power, and so will run on simpler devices.

The bandwidth gives an indication of how much of the capacity of your broadband
Internet connection will be consumed by each VOIP call. The bandwidth usage is not
directly proportional to the bit rate, and will depend on factors such as the protocol
used. Each chunk of voice data is contained within a UDP packet with headers and

other information. This adds a network overhead of some 15 - 25Kbit/s, more than
doubling the bandwidth used in some cases. However, most VOIP implementations use
silence detection, so that no data at all is transmitted when nothing is being said.
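As a rough, back-of-the-envelope illustration of why the bandwidth figures in the table exceed the raw codec bit rates, the following MATLAB calculation assumes 20 ms of speech per packet and typical IPv4/UDP/RTP/Ethernet header sizes (these packet-size assumptions are ours, not taken from the table):

% Rough per-call bandwidth estimate for an 8 kbps codec such as G.729 (assumptions above).
bitRate    = 8000;                       % codec bit rate in bit/s
packetTime = 0.020;                      % 20 ms of speech per packet
payload    = bitRate * packetTime / 8;   % 20 bytes of voice data per packet
headers    = 20 + 8 + 12 + 18;           % IPv4 + UDP + RTP + Ethernet framing, in bytes
packetsPerSecond = 1 / packetTime;       % 50 packets per second
bandwidth  = (payload + headers) * 8 * packetsPerSecond;   % 31200 bit/s, about 31.2 Kbps

This is consistent with the 31.2 Kbps listed for G.729a in the table above.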

Insufficient bandwidth can result in interruptions to the audio if VOIP uses the same
Internet connection as other users who may be downloading files or listening to music.
For this reason, it is desirable to enable the Quality of Service "QoS" option in the
TCP/IP Properties of any computer running a software VOIP client, and to use a router
with QoS support for your Internet connection. This will ensure that your VOIP traffic
will be guaranteed a slice of the available bandwidth so that call quality does not suffer
due to other heavy Internet usage.

5.3.4 Introduction to G.727


ITU-T Recommendation G.727 contains the specification of an embedded adaptive differential pulse code modulation (ADPCM) algorithm with 5, 4, 3 and 2 bits per sample (i.e., at rates of 40, 32, 24, and 16 kbps).

The following characteristics are recommended for the conversion of 64-kbps A-law or µ-law PCM channels to or from variable-rate embedded ADPCM channels. The Recommendation defines the transcoding law when the source signal is a pulse code modulation signal at a rate of 64 kbps developed from voice-frequency analog signals as specified in ITU-T G.711. Figure 5.23 shows a simplified block diagram of the encoder and the decoder.

Figure 5.23 Simplified block diagram of the encoder and the decoder

Embedded ADPCM Algorithms
Embedded ADPCM algorithms are variable-bit-rate coding algorithms with the capability of bit dropping outside the encoder and decoder blocks. They consist of a series of algorithms such that the decision levels of the lower-rate quantizers are subsets of those of the quantizer at the highest rate. This allows bit reductions at any point in the network without the need for coordination between the transmitter and the receiver. In contrast, the decision levels of conventional ADPCM algorithms, such as those in Recommendation G.726, are not subsets of one another; therefore, the transmitter must inform the receiver of the coding rate and the encoding algorithm.
Embedded algorithms can accommodate the unpredictable and bursty characteristics of traffic patterns that require congestion relief. This might be the case in IP-like networks, or in ATM networks with early packet discard. Because congestion relief may occur after the encoding is performed, embedded coding is different from variable-rate coding, where the encoder and decoder must use the same number of bits in each sample. In both cases, however, the decoder must be told the number of bits to use in each sample.
Embedded algorithms produce code words that contain enhancement bits and core bits. The feed-forward (FF) path utilizes enhancement and core bits, while the feedback (FB) path uses core bits only. The inverse quantizer and the predictor of both the encoder and the decoder use the core bits. With this structure, enhancement bits can be discarded or dropped during network congestion. However, the number of core bits in the FB paths of both the encoder and decoder must remain the same to avoid mistracking.

The four embedded ADPCM rates are 40, 32, 24, and 16 kbps, where the decision levels for the 32-, 24-, and 16-kbps quantizers are subsets of those for the 40-kbps quantizer. Embedded ADPCM algorithms are referred to by (x, y) pairs, where x refers to the FF (enhancement and core) ADPCM bits and y refers to the FB (core) ADPCM bits. For example, if y is set to 2 bits, (3, 2) represents the 24-kbps embedded algorithm and (2, 2) the 16-kbps algorithm; the bit rate is never less than 16 kbps because the minimum number of core bits is 2. Simplified block diagrams of both the embedded ADPCM encoder and decoder are shown in Figure 5.23.
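A toy numeric illustration of the core/enhancement split (our own example, not the actual G.727 bitstream handling) is given below for a (4, 2) codeword:

% Toy illustration of dropping enhancement bits from an embedded (4,2) ADPCM codeword.
codeword = 13;                                     % example 4-bit FF codeword (binary 1101)
ffBits   = 4;                                      % FF bits = enhancement + core
coreBits = 2;                                      % FB (core) bits
core = bitshift(codeword, -(ffBits - coreBits));   % discard the 2 enhancement bits -> binary 11
% A congested network node can drop the enhancement bits in exactly this way, because
% the inverse quantizer and predictor are driven by the core bits alone.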

5.3.5 Introduction to G.729 and G.723.1

G.729 and G.723.1 can be differentiated in the manner in which they form the excitation signal (e.g., ACELP) and partition the excitation space (the algebraic codebook), although both assume that all pulses have the same amplitudes and that the sign information will be transmitted. The two vocoders also show major differences in terms of delay.

Differentiations
G.729 has excitation frames of 5 ms and allows four pulses to be selected. The 40-sample frame is partitioned into four subsets. (Figure 5.24 below summarizes the parameters of these low-bit-rate vocoding methods.)

Parameter                  G.729    G.729A    G.723.1
Bit rate (kbps)            8        8         5.3-6.3
Frame size (ms)            10       10        30
Subframe size (ms)         5        5         7.5
Algorithmic delay (ms)     15       15        37.5
MIPS                       20       10        14-20
RAM (bytes)                5.2 k    4 k       4.4 k
Quality                    good     good      good

Figure 5.24 Parameters for the new vocoders

The first three subsets have eight possible locations for pulses; the fourth has sixteen. One pulse must be chosen from each subset. This is a four-pulse ACELP excitation codebook method (see figure 5.2).
G.723.1 has excitation frames of 7.5 ms and also uses a four-pulse ACELP excitation codebook for the 5.3-kbps mode. For the 6.3-kbps mode, a multipulse maximum likelihood quantizer (MP-MLQ) is employed. Here the frame positions are grouped into even-numbered and odd-numbered subsets. A sequential multipulse search is used for a fixed number of pulses from the even subset (either five or six, depending on whether the frame itself is odd- or even-numbered); a similar search is repeated for the odd-numbered subset. Then the set resulting in the lowest total distortion is selected for the excitation (1).
At the decoder stage, the linear prediction coder (LPC) information and the adaptive and fixed codebook information are demultiplexed and then used to reconstruct the output signal. An adaptive postfilter is used. In the case of the G.723.1 vocoder, the long-term (LT) postfilter is applied to the excitation signal before it is passed through the LPC synthesis filter and the short-term (ST) postfilter.

G.723.1
G.723.1 specifies a coded representation that can be used for compressing the speech or other audio signal component of multimedia services at a very low bit rate. In the design of this coder, the principal application considered by the Study Group was very low bit-rate visual telephony as part of the overall H.324 family of standards.

This coder has two bit rates associated with it, 5.3 and 6.3 kbps. The higher bit rate gives greater quality. The lower bit rate gives good quality and provides system designers with additional flexibility. Both rates are a mandatory part of the encoder and decoder. It is possible to switch between the two rates at any 30-ms frame boundary. An option for variable-rate operation using discontinuous transmission and noise fill during non-speech intervals is also possible.
The G.723.1 coder was optimized to represent speech with high quality at the stated rates, using a limited amount of complexity. Music and other audio signals are not represented as faithfully as speech, but can be compressed and decompressed using this coder.
The G.723.1 coder encodes speech or other audio signals in 30-ms frames. In addition, there is a look-ahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. All additional delay in the implementation and operation of this coder is due to the following:
1- Actual time spent processing the data in the encoder and decoder.
2- Transmission time on the communication link.
3- Additional buffering delay for the multiplexing protocol.

Encoder / Decoder
The G.723.1 coder is designed to operate on a digital signal obtained by first performing telephone-bandwidth filtering (Recommendation G.712) of the analog input, then sampling at 8000 Hz, and then converting to 16-bit linear PCM for the input to the encoder.
The output of the decoder is converted back to analog by similar means.
Other input/output characteristics, such as those specified by Recommendation G.711 for 64-kbps PCM data, should be converted to 16-bit linear PCM before encoding, or from 16-bit linear PCM to the appropriate format after decoding.
The coder is based on the principles of linear prediction analysis-by-synthesis coding and attempts to minimize a perceptually weighted error signal. The encoder operates on blocks (frames) of 240 samples each, which is equal to 30 ms at an 8-kHz sampling rate.
Each block is first high-pass filtered to remove the DC component and then divided into four subframes of 60 samples each. For every subframe, a tenth-order linear prediction coder (LPC) filter is computed using the unprocessed input signal. The LPC filter for the last subframe is quantized using a predictive split vector quantizer (PSVQ). The quantized LPC coefficients are used to construct the short-term perceptual weighting filter, which is used to filter the entire frame and to obtain the perceptually weighted speech signal.
For every two subframes (120 samples), the open-loop pitch period, L_OL, is computed using the weighted speech signal. This pitch estimation is performed on blocks of 120 samples. The pitch period is searched in the range from 18 to 142 samples.
From this point, the speech is processed on a basis of 60 samples per subframe.
Using the estimated pitch period computed previously, a harmonic noise shaping filter is constructed. The combination of the LPC synthesis filter, the formant perceptual weighting filter, and the harmonic noise shaping filter is used to create an impulse response.
The impulse response is then used for further computations.
Using the open-loop pitch estimate L_OL and the impulse response, a closed-loop pitch predictor is computed. A fifth-order pitch predictor is used. The pitch period is
computed as a small differential value around the open-loop pitch estimate. The contribution of the pitch predictor is then subtracted from the initial target vector. Both the pitch period and the differential values are transmitted to the decoder.
Finally, the nonperiodic component of the excitation is approximated. For the high bit rate, multipulse maximum likelihood quantization (MP-MLQ) excitation is used, and for the low bit rate, an algebraic code excitation is used.
The block diagram of the encoder is shown in Figure 5.25. Its main blocks are:
• Framer
• High-pass filter
• LPC analysis
• Line spectral pair (LSP) quantizer
• LSP interpolation
• Formant perceptual weighting filter
• Pitch estimation
• Subframe processing
• Harmonic noise shaping
• Impulse response calculator
• Zero-input response and ringing subtraction
• Pitch predictor
• High-rate excitation (MP-MLQ)
• Excitation decoder
• Pitch information decoding

Figure (5.25) The block diagram of the encoder

References

Topic 1: Audiology

[1] Fundamentals of Acoustics, by Kinsler, Lawrence; Sanders, James V.; Frey, Austin R.
[2] Adaptive Filters, by Haykin, Simon.
[3] (http://www.freehearingtest.com).
[4] (http://www. healthline.com).
[5] (http://www.youcanhear4less.com).
[6] Texas Instruments (http://www.ti.com/).

Topic 2: Acoustical Simulation of Room

[1] Kutruff, K. H. (1991) Room Acoustics, Elsevier Science Publishers, Essex.


[2] (http://home.tir.com/~ms/roomacoustics/).
[3] (http://audiolab.uwaterloo.ca/~bradg/auralization.html).
[4] (http://www.acoustics.hut.fi/~riitta/.reverb/).
[5] (http://www.music.mcgill.ca/~gary/).
[6] (http://www.mcsquared.com/).
[7] (http://www.owlnet.rice.edu/~elec431/projects97/).
[8] J. Audio Eng. Soc., Vol.45, No.6, 1997 June

Topic 3: Noise Control

[1] Architectural Acoustics, McGraw Hill Inc., New York, 1988, p.18
[2] Source: U.S. Dept. of Commerce / National Bureau of Standards.
Handbook 119, July, 1976: Quieting: A Practical Guide to Noise Control; Page 61.
[3] Kutruff, K. H. (1991) Room Acoustics, Elsevier Science Publishers, Essex.
[4] (http://www.STC ratings.com).

Topic 4: Speech Technology

[1] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ.
[2] C. J. Weinstein, A Linear Predictive Vocoder with Voice Excitation, Proc. EASCON.
[3] Daniel Minoli and Emma Minoli, Delivering Voice over IP Networks, Wiley Computer Publishing, John Wiley & Sons, Inc.
[4] (http://www.data-compression.com/speech.html).
[5] (http://www.otolith.com/otolith/olt/lpc.html).
