
The self-to-other ratio applied as a phonation detector for voice accumulation
Svante Granqvist, Royal Institute of Technology, KTH, Stockholm, Sweden
E-mail: svante@speech.kth.se

Abstract
Binaural microphones were utilised to detect phonation in a human subject
(figure 1). This detection was used to cut the audio waveform into two
(actually four) separate channels: one for voiced segments and one for the
background noise. Given that the subject's own voice is almost always louder
than the background noise at the ears of the subject, the channel with the
voiced segments can be used for extraction of speaking level, F0 and
phonation time. The Background channel can be used to estimate the
background noise level. The method has previously been used as part of a
voice accumulator in other studies [Södersten et al. 2001].

Example
This experimental recording was made in a laboratory environment where
speech-shaped noise was played back through loudspeakers and a female
speaker wore the microphones during a conversation with the author. The
resulting level curves and switched audio can be seen in figures 2 and 3.

Figure 1. The two microphones are attached near the ears of the subject

Figure 2. Example: switched levels. Note the peaks in the S/O ratio channel

Web
This poster and sound samples are also
available on the web:
http://www.speech.kth.se/~svante/aura

Figure 3. Example: switched audio

References
Ternström S. (1994) Hearing myself with others: Sound levels in
choral performance measured with separation of one's own
voice from the rest of the choir. J Voice 1994;8(4):293-302.
Södersten M., Hammarberg B., Granqvist S., Szabo A. (2001)
Vocal behaviour and vocal loading factors for pre-school
teachers at work studied with binaural DAT-recordings.
Submitted for publication.

Computer program
The binaural stereo recordings are used as input to the computer program
Aura (figure 4). The signals are processed and a number of channels can
be selected to appear in the output files. The output files can contain either
switched audio or switched level curves.

Signal processing
From the two microphone signals, five level signals are derived (figure 5):
1. The level at the left microphone (L level)
2. The level at the right microphone (R level)
3. The level of the difference signal (L-R level)
4. The level of the sum signal (L+R level)
5. The S/O ratio [Ternström, 1994], which is the
difference between channels 3 and 4.

Figure 4. The computer program Aura, which implements the method.

The sum and difference channels are high-pass filtered at 1 kHz before
level extraction (see the High-pass filter section below).
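
The five level signals listed above can be sketched roughly as follows. This is an illustrative sketch, not the Aura implementation: the first-order high-pass filter, the non-overlapping 441-sample frames, and the function names `high_pass`, `frame_levels_db` and `five_levels` are all assumptions, and the S/O ratio is taken here as the L+R level minus the L-R level in dB.

```python
import math

def high_pass(x, fs, fc=1000.0):
    """First-order RC high-pass filter (assumed design; Aura's filter is not specified)."""
    rc = 1.0 / (2.0 * math.pi * fc)
    a = rc / (rc + 1.0 / fs)
    y, prev_x, prev_y = [], 0.0, 0.0
    for s in x:
        prev_y = a * (prev_y + s - prev_x)
        prev_x = s
        y.append(prev_y)
    return y

def frame_levels_db(x, frame_len):
    """RMS level in dB (re. amplitude 1.0) for consecutive non-overlapping frames."""
    levels = []
    for i in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[i:i + frame_len]
        power = sum(s * s for s in frame) / frame_len
        levels.append(10.0 * math.log10(power + 1e-12))  # small offset avoids log(0)
    return levels

def five_levels(left, right, fs, frame_len=441):
    """Return the L, R, L-R, L+R and S/O level curves for a stereo recording."""
    # Only the sum and difference signals are high-pass filtered before level extraction.
    total = high_pass([l + r for l, r in zip(left, right)], fs)
    diff = high_pass([l - r for l, r in zip(left, right)], fs)
    levels = {
        "L": frame_levels_db(left, frame_len),
        "R": frame_levels_db(right, frame_len),
        "L-R": frame_levels_db(diff, frame_len),
        "L+R": frame_levels_db(total, frame_len),
    }
    # Assumed sign convention: sum level minus difference level, in dB.
    levels["S/O"] = [s - d for s, d in zip(levels["L+R"], levels["L-R"])]
    return levels
```

With this convention, a signal that is identical at both microphones (such as the wearer's own voice) gives a large positive S/O value, while independent noise at the two microphones gives a value near 0 dB.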
Normally, the level in the S/O ratio channel correlates strongly with the
instances of phonation (see figure 2) and can thus be used as a control
signal for the switching of audio and level signals. Two separate thresholds
are applied to control the Self and Other switching. Typically, the Self
signal will contain the voiced portions of the recording, with all pauses
and unvoiced segments removed, whereas the Other signal will contain these
pauses and unvoiced segments. There are, however, instances when improved
control is needed. This is achieved in the post-processing blocks to the
right in figure 5. The most important feature is the construction of a
Background control signal from the Other control signal (figure 6). Using
this control signal, rather than the Other control signal, the output is
further cleaned of the subject's voice. This is extremely important for
estimation of the background noise level. Similarly, a Talk channel can be
derived by including short pauses and unvoiced segments (figure 7).
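
One way to picture the switching and post-processing is the sketch below. The exact steps of figures 6 and 7 are not given in the text, so the rules used here are assumptions: Background is formed by trimming a guard interval from both ends of each Other segment and discarding what is too short, and Talk is formed by bridging short gaps between Self segments. The names `threshold_mask`, `runs`, `background_mask` and `talk_mask`, and all parameter values, are hypothetical.

```python
def threshold_mask(so_db, threshold_db, above=True):
    """Per-frame control signal: True where the S/O level is above (or below) the threshold."""
    return [(v > threshold_db) if above else (v < threshold_db) for v in so_db]

def runs(mask):
    """Return (start, end) index pairs, end exclusive, of consecutive True frames."""
    out, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(mask)))
    return out

def background_mask(other, guard=2, min_len=3):
    """Assumed rule: shrink each Other segment by `guard` frames at both ends
    and keep it only if at least `min_len` frames remain."""
    out = [False] * len(other)
    for start, end in runs(other):
        s, e = start + guard, end - guard
        if e - s >= min_len:
            for i in range(s, e):
                out[i] = True
    return out

def talk_mask(self_mask, max_gap=3):
    """Assumed rule: fill gaps of at most `max_gap` frames between Self segments."""
    out = list(self_mask)
    segments = runs(self_mask)
    for (_, end1), (start2, _) in zip(segments, segments[1:]):
        if start2 - end1 <= max_gap:
            for i in range(end1, start2):
                out[i] = True
    return out
```

In this sketch the Self and Other control signals would come from two calls to `threshold_mask` with separate thresholds, for example `threshold_mask(so, 10.0)` for Self and `threshold_mask(so, 5.0, above=False)` for Other; the threshold values are placeholders, not values from the poster.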
Figure 5. Schematic of the signal processing in the computer program Aura.

High-pass filter
The fundamental idea of the method is that sound from ambient sources
arrives uncorrelated at the two microphones, so that the levels of the sum
and difference signals will be approximately equal. However, for
low-frequency sounds, the signals will appear in phase because the
wavelength is large compared to the distance between the microphones, and
will thus be misinterpreted as voicing from the subject. The 1 kHz high-pass
filter reduces this effect and thus improves the accuracy of the switching.
The need for the high-pass filter was verified with the following
experiment. A subject was positioned in the diffuse field from two
loudspeakers in a standard laboratory environment. The subject was then
rotated 360 degrees, and long-time average spectra (LTAS) were used to
analyse the spectral properties of the Self and Other channels. The results
confirm a rise in the level of the S/O ratio at low frequencies (figure 8),
even though the subject did not phonate during the experiment.

Figure 6. The steps to derive a Background channel from the Other channel
by modifying the instances of switching
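
The low-frequency effect can be illustrated with a small calculation. For a single plane wave reaching the two microphones with a time difference tau, the sum and difference amplitudes are 2|cos(pi f tau)| and 2|sin(pi f tau)|, so their level difference grows as the frequency falls. A single plane wave is only a crude stand-in for a diffuse field, and the microphone spacing (0.17 m), speed of sound (343 m/s), arrival angle and the function name `so_ratio_plane_wave_db` are assumptions made for this sketch.

```python
import math

def so_ratio_plane_wave_db(freq_hz, spacing_m=0.17, angle_deg=90.0, c=343.0):
    """Sum level minus difference level (dB) for one plane wave.

    angle_deg is measured from straight ahead; 90 means the wave travels
    along the line through the two microphones (largest time difference).
    """
    delay = spacing_m * math.sin(math.radians(angle_deg)) / c
    half_phase = math.pi * freq_hz * delay          # omega * tau / 2
    sum_amp = abs(2.0 * math.cos(half_phase))       # |1 + exp(-j*omega*tau)|
    diff_amp = abs(2.0 * math.sin(half_phase))      # |1 - exp(-j*omega*tau)|
    # Small offsets avoid division by zero and log(0) at exact nulls.
    return 20.0 * math.log10((sum_amp + 1e-12) / (diff_amp + 1e-12))

if __name__ == "__main__":
    for f in (50, 100, 200, 500, 1000, 2000):
        print(f, "Hz:", round(so_ratio_plane_wave_db(f), 1), "dB")
```

Under these assumptions the value is strongly positive well below about 500 Hz, which is consistent with ambient low-frequency sound being mistaken for the wearer's own voice unless it is filtered out.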

Figure 7. The steps to derive a Talk channel from the Self channel by
modifying the instances of switching

Figure 8. A diffuse field yields a high S/O ratio at low frequencies even
though no phonation occurs. The consequences of this effect are reduced by
applying a high-pass filter to the signals. (Plot titled "Diffuse field, no
phonation": Diff [dB], Sum [dB] and S/O ratio [dB] against frequency, with
the level axis running from -20 to 60 and the frequency axis from 10 to
10000; annotation: "Raised S/O ratio level at low frequencies".)