
MODELING OF A 'WHAT YOU HEAR IS WHAT YOU SPEAK' (WYHIWYS) DEVICE

Karthik Mahesh Varadarajan

Automation and Control Institute


TU Wien, Austria.
kv@acin.tuwien.ac.at
ABSTRACT

Humans at times find it difficult to recognize their own voices. This is specifically true when a person's voice is recorded and played back to him: he usually finds a marked difference between the recording and the way he feels his voice sounds. In this paper, the various possible physical and physiological reasons behind this phenomenon are expounded. Based on the conclusions, a 'What You Hear is What You Speak' (WYHIWYS) device is modeled from signal and audio processing principles. Such a device is expected to work as a complement to devices such as bone conduction microphones and auditory aids, besides helping analyze various psychoacoustic effects. The device is also expected to benefit orators, vocal artists and ventriloquists in better understanding public perception of their voices. Results of the modeling are presented with respect to psychoacoustic evaluation.

Index Terms— Psychoacoustics, audio, filtering, speech, human auditory system

1. INTRODUCTION

Sound from a human as heard by an external entity, be it a person or a recording device, varies in a number of characteristics from what the speaker believes is the tone he produces. In other words, when a person's voice is recorded and played back to him, he finds a marked difference between the recording and the way he feels his voice sounds. This is noticed even when the recording is made at very high sampling rates with high fidelity microphone systems. One can often come across people who find their voice to be reasonably loud, whereas to a listener the person's voice may seem barely audible. This might be due to the fact that a speaker perceives his voice to differ from the voice heard by an external entity. In short, the sound from a speaker undergoes some transformation before the speaker hears it back. Further, it is possible that there are additional components in the sound heard by the speaker when compared to that heard by a listener.

Furthermore, most people find it difficult to understand their own voice when having food. If the sound emitted by the speaker were the same as that heard by him, the speech should remain intelligible to the speaker. There must therefore be an additional factor that changes the nature of the speech heard by the speaker. One possibility is the coupling between the speaker's vocal tract (mouth and nasal passage) and the ears, as a result of which the speaker's voice changes in character before reaching the ear drum or the cochlea.

Bone conduction has been used with varying degrees of success in the military and in the design of auditory aids. It works on the principle of transmitting a sound wave directly to the cochlea through the cranial skeletal structure, thereby bypassing the ear drum. There have been several studies on skeletal sound transmission aiding research in bone conduction. The vibrational response of the skull to sound has been found to be complex, comprising multiple resonances and anti-resonances, including an antiresonance below 1 kHz in the ipsilateral transmission path that leads to a lateralization of perception on the contralateral side for some frequencies [7]. Also, while hearing by air conduction is limited to about 20 kHz, hearing by bone conduction extends to at least 100 kHz. Though the frequency spectrum of the live human head in the voice range is relatively flat (compounded by resonances and anti-resonances), there is a gradual fall in amplitude over this range, resulting in absorption of high frequencies [7]. This result is also evidenced by Dunlap et al. [8], who note that propagation of high frequencies across the skull is primarily through the cranial vault, and that the thick base bones do not support high frequency transmission from one side of the skull to the other. This is, however, in contradiction to the response of a dry skull [6]. In the proposed system, we use a combination of these effects to model sound perception at the cochlea.

2. SURVEY OF EGO-SPEECH PSYCHOACOUSTICS

A small survey was conducted to identify the extent of the above observations. Twenty-five people of similar age range were selected and asked the following questions: 1) whether they perceived their voice to be intelligible while experiencing circumstances in which listeners felt they were barely audible; 2) whether they had difficulty identifying their own recorded voice; 3) whether they found their own voice unintelligible when having food or engaging in similar activity that uses the nasal or vocal tracts. The results of the survey were consistent with the observations. While twenty-one of the twenty-five respondents answered in the positive to the second query, twenty replied in the positive to query three. A relatively small number of people, six, responded positively to the first query, with most replying in the negative. This nonetheless showed that the first observation does occur, though it is not very common (Figure 1).

Figure 1. Survey Results
Figure 2. Main lobe of speech energy



The aim of this paper is to look into the possible reasons for these effects and to alter the speech recorded from a speaker in such a way that it resembles the nature of the speech which the speaker perceives it to be. This subsumes subjectivity of results and variation from speaker to speaker.

3. ANALYSIS OF SOUND PERCEPTION DISCREPANCY

3.1 Backward Directed Waves and Reverberant Fields:
As sound travels from the vocal passage (consisting of the mouth and the nose), the main lobe, where the energy in the speech is concentrated, is directed towards the front. Hence, the sound that reaches the ears of a speaker is largely from low amplitude backward directed waves generated by the ripple effect. The energy of the sound waves reaching the ears is at 90 degrees to that of the main lobe. Thus, the direct wave sound heard by a speaker is low in amplitude compared to that heard by a listener directly in front of the speaker. But generally the sound pressure level (SPL) of the sound heard by a speaker is much higher than that heard by a listener; hence this cannot be the main source of sound reaching a speaker's ears, and there must be an alternative source. Also, the reverberant fields created by enclosures around the speaker might add to the level of sound heard by the speaker. But the sound level heard by a speaker is loud even when there are no enclosures, so this cannot be a dominant factor either. Nevertheless, it can be taken into account in the modeling.
3.2 Nature of Speech and Skeletal Vibrations
The mechanism by which the human body produces speech varies from one sound segment to another. Typical segments are periods of time in the speech time series in which the signal is largely stationary; these in turn correspond to phonemes of the speech time series. In speech processing, speech is classified according to phonation, by the manner of articulation, or by the place of articulation [1, 2]. Phonation refers to the utilization or non-utilization of the laryngeal tract for the production of sound segments. Accordingly, speech segments can be classified into voiced and unvoiced sounds. Voiced sounds use the larynx in the production of the sound: the larynx is a resonating column that generates sound waves at a fundamental frequency, the pitch (whose reciprocal is the pitch period), while various columns in the larynx produce other harmonics. The other type of speech segment is the unvoiced sound. In this case, the vocal cords and the larynx do not come into play; instead, sounds are produced by the rapid passage of air through the vocal tract. Since these sounds have no fundamental frequency of vibration, the spectrum of such a segment is largely flat, resembling noise. Since unvoiced sounds are produced directly in various parts of the oral cavity and reach the listener directly, they can contain larger energy than their equivalent voiced sounds. In the case of the English language, the phonemes can be classified into voiced and unvoiced sounds as follows [1, 3]: 1) Voiced: all vowels, [b], [d], [g], [dʒ], [v], [ð], [z], [ʒ], [m], [n], [ŋ], [l], [r], [w], [j]; 2) Unvoiced: [p], [t], [k], [tʃ], [f], [θ], [s], [ʃ].
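For implementation purposes, this phonation split can be captured in a simple lookup. The sketch below (Python) is a minimal rendering of the lists above; the ASCII stand-ins for the IPA symbols and the simplified vowel set are illustrative assumptions, not part of the original classification in [1, 3].

```python
# Voiced/unvoiced lookup for English phonemes, following the lists
# above [1, 3]. ASCII stand-ins replace the IPA symbols (e.g. "dZ"
# for [dʒ]); the vowel set is simplified for illustration.
VOICED_CONSONANTS = {"b", "d", "g", "dZ", "v", "dh", "z", "zh",
                     "m", "n", "ng", "l", "r", "w", "j"}
UNVOICED_CONSONANTS = {"p", "t", "k", "tS", "f", "th", "s", "sh"}
VOWELS = {"a", "e", "i", "o", "u"}  # all vowels are voiced

def is_voiced(phoneme: str) -> bool:
    """True when the larynx participates in producing the phoneme."""
    return phoneme in VOICED_CONSONANTS or phoneme in VOWELS
```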

3.2.1 Effect of vocal tract vibrations and attenuation effects of the tissues
The vibrations created during the generation of voiced sounds cause the entire skeletal system to vibrate. Due to the direct coupling of the vocal tract with the skull, the skull vibrates at the frequency of the vocal tract. This is a significant contributor to the sound heard by the human ear while speaking. Also, the tissues and blood are not perfect solids and do not absorb the vibrations completely; instead, they vibrate along with the vocal tract, and these vibrations travel through them to the cochlea. Besides, vibrations directly coupled to the ear might also add to the vibrations of the ear drum, thereby reinforcing the sound waves or creating constructive interference. In general, high frequencies are attenuated more easily than low frequencies, because at the same energy levels the absorption characteristics of typical natural structures are better at high frequencies than at low frequencies; these components thus lose their energy quickly compared to low frequency components. Hence the combination of the coupling of vocal tract vibrations (chiefly low frequency periodic components) and the rapid attenuation of high frequencies (if any are present near the ear canal) results in the sound components reaching the ear, when one speaks, being largely composed of low frequency voiced segments with heavily attenuated high frequency unvoiced segments.

3.2.2 Effects of cancellous bones
Further, a good percentage of the human body is composed of bones called cancellous bones. These bones have a high surface area [4]. They are porous bone structures and contain red bone marrow that engenders red blood cells (RBCs). Since these bones are porous, they can contribute to the vibrations by forming resonating or oscillating structures where blood is trapped. These vibrations can possibly enhance the low frequency components that are coupled to the ear through the skeletal vibrations. While these low frequency sound waves are directly coupled to the ear, the indirect sound from the mouth occurs at a delay with respect to the skeletal vibrations.

3.3 The Lossy Coupler: The Eustachian Tube
There exists a small tube called the Eustachian tube (Figure 4) that connects the ears with the nose and the throat. This tube closes when a person speaks or during food intake [5]. Since the Eustachian tube closes when a person speaks, the sound from the vocal tract should not be transmitted to the ear passage. However, the Eustachian tube end might not provide perfect isolation, as a result of which some coupling of energy from the vocal tract to the ear passage takes place. This adds to the voiced sound signals coupled from the bone vibrations to the ear passage. Other possible contributors include feedback from the sound produced by the ear drum and coupling of nasals directly through the Eustachian tube.

Figure 3. Representation image of sound paths for unvoiced and voiced segments
Figure 4. Eustachian Tube [Source: www.agen.ufl.edu/]
These effects need to be modeled in order for the recorded sound from a speaker to resemble, in perception, the sound heard by the speaker while speaking.

4. WYHIWYS MODELING

4.1 Low Pass Filter Model:
The skeletal structure responds mainly to the low frequency components of the sound and attenuates high frequencies. Also, the bones as well as the tissues absorb the high frequencies from the vocal tract. As a result, the skeletal system can be modeled as a low pass filter that passes only the low frequency components of the speech signal and augments the original speech heard by the speaker from the indirect backward directed waves. Two possibilities exist with regard to the low pass filter action.

One possibility is that the cutoff frequency of the low pass filter lies in a relatively high frequency region of the spectrum. Since the frequency and time domains are duals, such a filter resolves the signal well in the time domain; in short, the resolution is good enough in the time domain to distinguish between two separate sound waves. In that case, when the direct and the indirect sound signals augment each other, the ear would hear the two sounds distinctly, since the ear itself has a low pass filter structure with a high cutoff frequency. By observation, this possibility is untrue, because we do not hear the direct and the indirect sounds discretely; we hear them as one single frame of sound. The other possibility is that the cutoff frequency of the filter is relatively low, in which case the signals from the skeletal vibrations and from the indirect sound appear integrated to the ear when they overlap. This integration is a natural characteristic of a low pass filter. The augmentation of the two sounds with minimal delay would then produce the desired sound characteristic: a 'deepening' of the low frequency content of the speech, giving it more intensity and hence more energy. This assumption appears reasonable. A good estimate of the cutoff frequency is therefore something greater than half the entire frequency range; the normalized cutoff frequency to start with was chosen as 0.6.

As with any filter design problem, there are two alternatives to be analyzed: FIR (Finite Impulse Response) filters and IIR (Infinite Impulse Response) filters. FIR filters are the simplest and are represented naturally by the convolution coefficients, i.e. the multiplicative factors at each delay. Since they use a limited number of coefficients, these filters are quite stable. However, if the effects are to be pronounced over long delay times, these filters become unsuitable, as the number of filter coefficients required becomes very large and, along with it, the computation. As the number of delay elements required to represent the system increases, it is advisable to use an IIR filter structure for the model. An IIR filter achieves identical or better frequency response than an FIR implementation with the same number of coefficients. However, it is not totally advantageous: it might oscillate owing to the infinite impulse response that results from the feedback through its poles. Since the IIR impulse response is infinite, the obtained time signal is not time limited, and one or more memory elements are required for the implementation. Hence a trade-off between the two designs becomes necessary. In nature, most systems, if not all, have infinite impulse responses, so the required model is expected to perform better with an IIR implementation. Also, most systems found in nature are low pass systems. Hence such a model agrees well with our assumptions.

The modeling was started with an FIR filter implementation with a reasonably small order (say 15). A window-based approach was used for its simplicity. It would be ideal to window the speech signal in the time domain with a flat infinite window, since it has an impulse as its frequency response, and convolving the two spectra returns the original signal. Since this is not practical to implement, we need a window that has its energy compacted near zero frequency, in other words one with its maximum energy concentrated in its main lobe. The Bartlett, Hamming and Hanning windows give the same main lobe width of 8π/M; of these, the Hamming window gives the best side-lobe attenuation of -41 dB. The Blackman window gives better attenuation of -57 dB, but has a main lobe width of 12π/M [9, 10]. Hence a trade-off has to be made between the Blackman and Hamming windows. In this paper, the Hamming window is used in preference to the Blackman window so that the energy remains compacted. Convolving with the input speech sequence gave the filtered sequence, which was then added numerically to the original sequence to give the combined effect of the skeletal and indirect sounds. This approach essentially mimics the function of a voiced-unvoiced detector. The order of the filter was increased in accordance with the user response: the order was incremented by five each time the user felt the response more closely resembled the sound perceived while talking. The cutoff frequency of the filter was also varied per user feedback on the 'naturalness' of the perceived sound. The initial cutoff frequency was chosen to be 0.6 (normalized) and was varied in either direction till the sound obtained was recognized as close to natural. Also, the initial gain of the system was chosen to be 1.0 and gradually increased until naturalness of the speech was obtained. While the variation in cutoff frequency and gain at each step was a factor of 0.1, the order of the filter was increased by 5.
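A minimal sketch of this stage, assuming NumPy/SciPy (the paper does not name an implementation): firwin realizes the window design method, shaping the ideal low-pass response with a Hamming window, and the filtered 'skeletal' component is added back to the recording. The function name, and the defaults taken from the initial values above, are illustrative.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def wyhiwys_fir(speech, order=15, cutoff=0.6, gain=1.0):
    """FIR low-pass model of the skeletal path (Section 4.1 sketch).

    speech -- 1-D float array of recorded samples
    cutoff -- normalized cutoff, 1.0 = Nyquist (initial value 0.6)
    gain   -- emphasis applied to the low-frequency skeletal component
    """
    # Window design method: ideal low-pass impulse response shaped
    # by a Hamming window.
    taps = firwin(order + 1, cutoff, window="hamming")
    skeletal = lfilter(taps, [1.0], speech)  # convolve with the speech
    return speech + gain * skeletal          # skeletal + indirect sound
```

In the subjective loop described above, order would then be raised in steps of 5, and cutoff and gain in steps of 0.1, until the subject reports the output as natural.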
As the order of the filter was increased beyond a certain value, it was determined that implementation difficulties would be encountered. As a result, the filter modeling required a change to an IIR (Infinite Impulse Response) structure. This threshold order was chosen to be 45. The initial value chosen for the IIR filter order was 5; it was incremented in steps of 1 till 'naturalness' was obtained. The IIR filter was implemented by adding poles to the filter. A second method was also tested on the IIR model: the use of the voiced-unvoiced detector.
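The paper implements the IIR variant by adding poles until the output sounds natural, without naming a design method; the sketch below substitutes a Butterworth low-pass as a stand-in all-pole design, and the listener-feedback call is hypothetical.

```python
from scipy.signal import butter, lfilter

def wyhiwys_iir(speech, order=5, cutoff=0.6, gain=1.0):
    """IIR low-pass model; a Butterworth design stands in for the
    unspecified pole-placement procedure of the paper."""
    b, a = butter(order, cutoff)  # low-pass, cutoff normalized to Nyquist
    return speech + gain * lfilter(b, a, speech)

# Tuning sketch: increment the order in steps of 1, as in the paper,
# until the listener judges the result natural.
# listener_says_natural() is a hypothetical feedback call.
# order = 5
# while not listener_says_natural(wyhiwys_iir(speech, order)):
#     order += 1
```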
4.2 Voiced-Unvoiced Detector:
There are numerous ways to detect voiced and unvoiced portions in a segment of speech. Speech signals are found to be stationary for periods of about 20-25 ms typically [1]; in such a region, their statistical characteristics can be approximated as more or less constant. The speech signal was fragmented into segments of 20-25 ms, and voiced-unvoiced classification was carried out using a combination of metrics such as LPC, MFCC, zero crossings and energies. Unvoiced sounds have lower amplitudes and envelopes; they have lower energies but high zero-crossing rates. The unvoiced segments were detected and attenuated, and the speech sequence was passed through the IIR filter. The gain of the filter was kept adaptable to ensure consistency at the segment boundaries and obtain the necessary smoothing effect. An equivalent unvoiced-only sequence was obtained by a similar technique and its gain was kept constant.
The gain of the first sequence was always kept greater than one, since the voiced segments were to be emphasized while the unvoiced sounds were kept at the recorded signal level. According to the replies of the test subjects, the gain of the filter, the cutoff frequencies and the order of the filter were varied (the order was increased monotonically) to obtain a natural response. Equiripple IIR filter training was done while ensuring stability of the system. The poles obtained by this method correspond to the resonant states of the skull.
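Pulling Section 4.2 together, the sketch below frames the signal at 20 ms and classifies each frame from short-time energy and zero-crossing rate only; the LPC and MFCC features used in the paper are omitted, and the thresholds and attenuation factor are illustrative.

```python
import numpy as np

def voiced_unvoiced_mask(speech, fs=44100, frame_ms=20,
                         energy_thresh=1e-3, zcr_thresh=0.15):
    """Frame-wise voiced/unvoiced decision (True = voiced).
    speech is a 1-D float array; thresholds are illustrative."""
    n = int(fs * frame_ms / 1000)  # samples per frame (882 at 44.1 kHz)
    mask = []
    for i in range(0, len(speech) - n + 1, n):
        f = speech[i:i + n]
        energy = np.mean(f ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2  # zero-crossing rate
        # Voiced: high energy, few crossings; unvoiced: the reverse.
        mask.append(energy > energy_thresh and zcr < zcr_thresh)
    return np.array(mask)

def attenuate_unvoiced(speech, mask, fs=44100, frame_ms=20, atten=0.5):
    """Scale down frames flagged unvoiced before the IIR stage."""
    n = int(fs * frame_ms / 1000)
    out = speech.astype(float).copy()
    for k, voiced in enumerate(mask):
        if not voiced:
            out[k * n:(k + 1) * n] *= atten
    return out
```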
4.3 Design of Lossy Coupler:
The modeling of the Eustachian tube was done by keeping the gain of the sequence adaptive. The simplest way to incorporate the model is to vary the gains in the previous two stages.
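One way to read this is as per-frame gain smoothing: the stage gains track a target value gradually rather than jumping, which keeps segment boundaries consistent. The one-pole smoother below is a sketch under that reading; the smoothing constant is an assumption.

```python
import numpy as np

def smooth_gains(frame_gains, alpha=0.5):
    """One-pole smoothing of per-frame gains, modeling the lossy
    Eustachian coupling as an adaptive rather than fixed gain.
    alpha (0..1) controls how slowly the gain follows its target."""
    targets = np.asarray(frame_gains, dtype=float)
    out = np.empty_like(targets)
    g = targets[0]
    for k, target in enumerate(targets):
        g = alpha * g + (1.0 - alpha) * target  # move toward the target
        out[k] = g
    return out
```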
4.4 Experimental Conditions
STFT was employed for the spectral analysis and modification. A PC-compatible sampling frequency of 44.1 kHz was used for recording. The microphone used was a KOSS CS 95 echo-cancellation microphone with a frequency response of 30-16,000 Hz, stereophone impedance of 32 ohms, sensitivity of 91 dB SPL/1 mW and distortion less than 1%. At this sampling frequency, a stationarity period of 20-25 ms corresponds to 882-1102 samples per segment, necessitating a 1024-point DFT to avoid aliasing.
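A sketch of the STFT analysis/modification stage under these conditions, assuming SciPy's stft/istft: the 1024-sample window sits inside the 20-25 ms stationarity range, and the low-frequency gain of 1.75 mirrors the mean in Table I rather than a value fixed by the paper.

```python
import numpy as np
from scipy.signal import stft, istft

FS = 44100      # recording sampling rate used in the experiments
NPERSEG = 1024  # ~23 ms, within the 20-25 ms stationarity window

def spectral_modify(speech, gain_lf=1.75, cutoff=0.6):
    """Emphasize STFT bins below the normalized cutoff, then resynthesize."""
    f, t, Z = stft(speech, fs=FS, nperseg=NPERSEG)
    Z[f < cutoff * (FS / 2), :] *= gain_lf  # low-frequency emphasis
    _, y = istft(Z, fs=FS, nperseg=NPERSEG)
    return y
```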
From Figure 5, the spectral characteristics of each time segment (the message used here is 'this is a test message for the audio project') can be identified. It can be seen clearly that the voiced segments /t/, /au/, etc. are emphasized, while the unvoiced segments /h/, /k/, etc. are limited. These results were obtained using an FIR filter of length 35, cutoff frequency of 0.67 and gain of 1.3. Similarly, a number of filter structures were implemented in the sequence mentioned earlier (Table 1).

Figure 5. (Clockwise) Sample test message, original spectrogram and modified spectrogram. Note the emphasis of low frequency components and the effects of adaptive gain change.
TABLE I
COMPARISON OF WYHIWYS FILTERS

Subject      IIR Order   FIR Order   Cutoff Frequency (Fc)   Gain
Subject 1        5           -               0.5             2.1
Subject 2       10           -               0.6             1.9
Subject 3        6           -               0.4             2.3
Subject 4        7           -               0.7             1.3
Subject 5        -          35               0.6             2.0
Subject 6        8           -               0.5             2.0
Subject 7        9           -               0.5             1.8
Subject 8        -          35               0.7             1.2
Subject 9        6           -               0.6             1.7
Subject 10       -          30               0.6             1.5
Subject 11      10           -               0.8             1.2
Subject 12       -          30               0.7             1.4
Subject 13       -          35               0.6             1.6
Subject 14       6           -               0.4             2.9
Subject 15       7           -               0.7             1.4
MEAN           7.4          33               0.5933          1.7533

It can be seen that while there are no clear trends as far as the order of the filters is concerned, the mean cutoff frequency was estimated to be 0.5933, or about 0.6, which is in line with the initial guess value. The mean IIR filter order required was 7.4, close to the initial guess of 5. The mean FIR filter order required was 33, close to the initial guess of 25. The mean gain of the filters used was 1.7533; this can be stated as the amount of emphasis required for the low frequencies.

5. FUTURE WORK

The model currently used does not incorporate resonances and anti-resonances. The modeling can be improved to incorporate other known physiological effects. Furthermore, the sample population was too small to draw definitive conclusions. It would also be ideal to replace the subjective, perception-based evaluation with an objective metric. Higher order models are envisioned for the future. Future advancements could also include multiple coupling and the usage of adaptive filters. This would make the model more accurate.

The current model has succeeded in demonstrating a first prototype of a WYHIWYS device. Study of the relation between filter parameters and the bone density of individuals is a natural extension of this work.

6. REFERENCES

[1] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, 'Acoustic Phonetics', Pearson Education.
[2] http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
[3] T. Quatieri, Discrete-Time Speech Signal Processing, Pearson Education.
[4] http://www.orthovita.com/patient_info/bonehealth.html
[5] 'Normal Functioning of the Ear', Brown University, http://biomed.brown.edu/Courses/BI108/BI108_2001_Groups/Cochlear_Implants/normalfunction.html
[6] S. Stenfelt, B. Hakansson and A. Tjellstrom, 'Vibration characteristics of bone conducted sound in vitro', J. Acoust. Soc. Am., 107:422-431, 2000.
[7] Z. Cai, D. G. Richards, M. L. Lenhardt and A. G. Madsen, 'Response of Human Skull to Bone-Conducted Sound in the Audiometric-Ultrasonic Range', Int. Tinnitus J., 8(1):3-8, 2002.
[8] S. A. Dunlap, M. L. Lenhardt and A. M. Clarke, 'Human skull vibratory patterns in audiometric and supersonic ranges', Otolaryngology – Head and Neck Surgery, 99:389-391, 1988.
[9] J. O. Smith, Physical Audio Signal Processing, http://ccrma.stanford.edu/~jos/pasp/, 2010.
[10] J. O. Smith, Spectral Audio Signal Processing, http://ccrma.stanford.edu/~jos/sasp/, Mar. 2010.
