
FORENSICALLY INSPIRED APPROACHES TO AUTOMATIC SPEAKER RECOGNITION


K. J. Han, M. K. Omar, J. Pelecanos, C. Pendus, S. Yaman, W. Zhu
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
{kjhan,mkomar,jwpeleca,cpendus,syaman,zhuwe}@us.ibm.com

ABSTRACT
This paper presents ongoing research leveraging forensic methods for automatic speaker recognition. Some of the methods forensic scientists employ include identifying speaker-distinctive audio segments and comparing these segments using features such as pitch, formants, and other information. Other approaches involve performing a phonetic analysis to recognize idiolectal attributes, and an implicit analysis of speaker demographics.
Inspired by these forensic phonetic approaches, we target three threads of work: hot-spot analysis, speaker style and pronunciation modelling, and demographics analysis. As a result of this work we show that a phonetic analysis conditioned on select speech events (or hot-spots) can outperform a phonetic analysis performed over all speech without conditioning. In the area of pronunciation modelling, one set of results demonstrates significantly improved robustness by exploiting phonetic structure in an automatic speech recognition system. For demographics analysis, we present state-of-the-art results of systems capable of detecting dialect, non-nativeness and native language.
Index Terms: Forensics, speaker verification, hot-spot, pronunciation modelling, demographics.
1. INTRODUCTION
For a number of years, the speaker recognition community has examined how to model the general patterns of short-term spectral coefficients, prosody and complementary high-level information. The underlying phenomena were typically modelled somewhat as a black box. There may be a benefit in breaking the speech down into distinctive events and modelling their detail in a manner similar to the approaches used by some forensic phoneticians. The aim of our ongoing work is to develop techniques motivated by the procedures performed in the forensics community and to apply them to improve automatic speaker recognition. We partition these techniques into three main areas: hot-spot analysis, pronunciation modelling and demographics analysis.
Hot-Spot/Landmark Analysis: Some forensic approaches compare the properties of particular words and phone sequences [1, 2, 3].
Similarly, we can begin to include information from this style of
analysis in an automated system. This involves identifying segments
in an utterance that are considered discriminative and comparing
them with like segments in other utterances. The long-term goal is
to investigate which events are important (including a natural expansion to events longer than single words), how to model rare events, and how they should be matched when event recognition errors and intrinsic speaker and speech variation (such as co-articulation) effects are considered.

[Footnote: This research was funded by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory (ARL). All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of IARPA, the ODNI, or the U.S. Government.]
Speaker Style and Pronunciation Modelling: Forensic phoneticians often encompass high-level speech information implicitly as part of their analysis. They have also considered features such as pitch, formants and related information.
Here we investigate methods for incorporating high-level information in the speaker modelling process. We also include a brief analysis of some of the features used in forensic and pathological voice
analysis.
Demographic Analysis and Speaker Profiling: One of the first things forensic practitioners examine (often implicitly) is a speaker's dialect. Not surprisingly, forensic scientists have used it to provide significant evidence in court [2]. One important example by Labov [4] was the case of Paul Prinzivalli, who worked for Pan Am as a cargo handler. He was suspected of making a telephone bomb threat to his employer. According to phoneticians working for the defense, the voice from the recorded telephone threat was characteristic of a New England accent, while Prinzivalli's accent was distinctively that of a New Yorker. Based on this and other evidence, Prinzivalli was later acquitted. Accordingly, we have developed demographics-based detectors to assist in the problem of automatic voice comparison.
The three research threads proposed here represent steps toward
encompassing techniques performed by phoneticians in forensic
voice comparison casework. Correspondingly, Section 2 discusses
progress on work related to hot-spot analysis, Section 3 details
research on speaker style and pronunciation modelling, Section 4
refers to the demographics work, while Section 5 overviews system
combination and score conditioning efforts.
2. HOT-SPOT ANALYSIS
One approach that forensic phoneticians have followed for comparing voices is to locate important phonetic events (hot-spots) and to extract appropriate speech measurements accordingly [1, 2]. Along these lines, we examine two approaches: word and phone sequence conditioned acoustic comparison, followed by a related phone sequence analysis methodology.
2.1. Word/Phone-Sequence Conditioned Acoustic Comparison
There are several previous works employing speech candidate selection methods for automatic voice comparison. For example, Sturim [5] used the features from a small set of keywords to produce a Gaussian Mixture Model (GMM). Similarly, Boakye [6] utilized a small keyword set and modelled the features from each keyword using a Hidden Markov Model (HMM). Weber [7] examined the use of a large vocabulary automatic speech recognition system to perform speaker recognition modelling. More recent work by Bocklet [8] demonstrated the utility of modelling specific syllabic events for speaker recognition.

[Fig. 1. Plot of minimum DCF (x10^3) versus the relative quantity of speech.]
The ongoing goal of this work is to include additional forensic acoustic-phonetic knowledge in the modelling process. Another goal is to generalize the current English-centric approaches to other languages. With this aim, we compare an English language focused system conditioned on words to one conditioned on phone sequences (which may be more efficiently adapted to other languages). We also incorporate recent high dimensional kernel dot-product scoring and session variability compensation methods.
We begin by investigating the performance of words modelled
using HMMs as part of a high dimensional symmetric supervector kernel derived in a similar manner to Campbell [9] through a
Kullback-Leibler distance approximation. Here the supervectors
are formed from the concatenation of the individual (currently English) word HMM supervectors. The result is determined from a
dot-product of the supervector representations followed by ZT-score
normalization. Nuisance Attribute Projection (NAP) is also applied [9] with the directions determined by calculating the Within
Class Covariance (WCC) [10]. To target the goal of being language
independent, we also analyze the performance of phone bi-grams
and phone tri-grams as part of a supervector kernel. For both the
words and phone sequences we select the most popular events by
count.
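The scoring step just described reduces to linear algebra once the supervectors are formed. Below is a minimal sketch, assuming word-HMM mean supervectors have already been extracted and KL-scaled; the function names, the nuisance rank k=40, and the plain eigendecomposition of the within-class covariance are illustrative choices, not the exact recipe of [9, 10].

```python
import numpy as np

def train_nap_projection(supervectors, speaker_ids, k=40):
    """Estimate a NAP projection removing the top-k within-class
    (session variability) directions, in the spirit of [9, 10].
    `supervectors`: (N, D) array; `speaker_ids`: length-N labels."""
    X = np.asarray(supervectors, dtype=np.float64)
    ids = np.asarray(speaker_ids)
    # Within-class scatter: sessions centered around their speaker mean.
    centered = np.vstack([
        X[ids == s] - X[ids == s].mean(axis=0)
        for s in np.unique(ids)
    ])
    wcc = centered.T @ centered / len(centered)
    # The top-k eigenvectors span the assumed nuisance (session) subspace.
    _, eigvecs = np.linalg.eigh(wcc)   # eigenvalues ascending
    U = eigvecs[:, -k:]                # (D, k)
    return np.eye(X.shape[1]) - U @ U.T  # projection P = I - U U^T

def kernel_score(sv_enroll, sv_test, P):
    """Dot-product score between two NAP-compensated supervectors;
    ZT-norm would then be applied to scores like this one."""
    return float((P @ sv_enroll) @ (P @ sv_test))
```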
Figure 1 presents a plot of the speaker recognition performance versus the relative quantity of speech used (measured using minimum DCF [11] on the male subset of Task 8 in the NIST 2008 Speaker Recognition Evaluation). It shows that the performances of the three hot-spot systems track each other relatively well. This suggests that phonetic sequences may potentially be used as word-level proxy events and that such models may lend themselves to more language-independent modelling. We also present a result for the Phonetically Inspired UBM (PI-UBM) system (see Section 3.1) for contrast. While the performance statistics of the hot-spot systems are not as good as those of the PI-UBM setup, there are some small (however not significant) gains observed using fusion. These systems will need to be evaluated on data sets with more trials to draw stronger conclusions. With system combination in mind, this type of kernel, which is built from a limited set of discrete events (words and phone sequences), can be extended to use only select events that are complementary to other state-of-the-art systems.
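For reference, the minimum DCF metric used throughout this paper sweeps a decision threshold over the trial scores and takes the lowest detection cost. A sketch follows; the default cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are our assumption of the SRE'08-style operating point, not something stated in this paper.

```python
import numpy as np

def min_dcf(target_scores, impostor_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum detection cost over all thresholds (NIST SRE-style [11])."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    labels = labels[np.argsort(scores)]
    n_tar, n_imp = len(target_scores), len(impostor_scores)
    # Sweep the threshold across sorted scores: at cut i, the lowest i
    # trials are rejected and the rest accepted.
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tar])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / n_imp])
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return float(dcf.min())
```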
2.2. Word Conditioned Phone Sequence Analysis
Another approach that we are currently exploring for incorporating more information from a forensic phonetic viewpoint is word-conditioned phone sequence analysis. This aims to capture idiolectal differences in pronunciation between speakers by analyzing word-specific phone sequence realizations. Forensic phoneticians can identify a speaker's dialect based on allophonic rules for commonly occurring transformations from general English to dialectal forms [2, 12] (e.g., pen /pEn/ could be pronounced as either [pEn] or [pIn], which can relate to a speaker's geographical background). Correspondingly, we develop a system that can capture speaker-dependent attributes from speech through phonetic analysis given keywords.
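One simple way to realize such word-conditioned phone statistics is to key each phone n-gram by the word it occurred in, so the same n-gram in different words yields distinct features. The sketch below is a hypothetical illustration; the alignment format, function name, and an SVM back end are our assumptions, not the exact pipeline of [13, 14].

```python
from collections import Counter

def keyword_phone_ngrams(aligned_words, n=2, keywords=None):
    """Bag of phone n-grams, conditioned on which keyword they occur in.
    `aligned_words`: list of (word, [phones]) pairs from a phone decoder
    alignment; `keywords`: the keyword shortlist to condition on."""
    counts = Counter()
    for word, phones in aligned_words:
        if keywords is not None and word not in keywords:
            continue
        for i in range(len(phones) - n + 1):
            # Key the n-gram by the word so that, e.g., the pen -> [pIn]
            # realization is a distinct feature from ten -> [tIn].
            counts[(word, tuple(phones[i:i + n]))] += 1
    return counts

# Hypothetical usage: count vectors like this would feed an SVM verifier.
utt = [("pen", ["P", "IH", "N"]), ("ten", ["T", "EH", "N"])]
print(keyword_phone_ngrams(utt, n=2, keywords={"pen", "ten"}))
```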
[Fig. 2. Plot of minimum DCF (x10^3) versus the number of keywords used for a word conditioned phonetic system, comparing the word conditioned PSV against the baseline PSV without word conditioning.]

Figure 2 compares the baseline phonetic speaker verification system [13, 14] and our word-conditioned extension using the 50 most frequent keywords. The results are presented in terms of minDCF (x10^3) for NIST's 2008 SRE core Task 8 data set (male subset). This initial result on word conditioning suggests that our forensic-style approach can handle speaker-specific phonetic information more effectively (18.3% relative improvement at 50 keywords). This result could be enhanced if we utilized more keywords or selected events that are more speaker distinctive.

3. SPEAKER STYLE & PRONUNCIATION MODELLING


In this section we introduce methods of incorporating additional phonetic and speaker information in the modelling process. We also overview the use of features other than standard cepstral coefficients for voice comparison.

3.1. Exploiting Phonetic and Speaker Information in UBM Training

With the goal of encompassing additional phonetic and speaker information in the modelling process, this work investigates two criteria for training a Universal Background Model (UBM).
In the first, the phonetically inspired UBM (PIUBM) is estimated directly from the acoustic models of an Automatic Speech Recognition (ASR) system by using K-means clustering. A symmetric variant of the Kullback-Leibler (KL) distance between two Gaussian components is used as the distance measure in the K-means algorithm to cluster the ASR acoustic model (250K Gaussians) down to a UBM of 1024 Gaussian components. The details of the algorithm and the update equations are provided in [15].
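To make the clustering step concrete, the sketch below pairs the symmetric KL distance between diagonal-covariance Gaussians with a plain K-means loop. It ignores mixture weights, pools moments with equal weighting, and uses random initialization, so it is only a rough stand-in for the algorithm detailed in [15]; a practical run on 250K Gaussians would also need vectorization.

```python
import numpy as np

def sym_kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two
    diagonal-covariance Gaussians; arguments are 1-D arrays."""
    d2 = (mu_p - mu_q) ** 2
    return 0.5 * np.sum((var_p + d2) / var_q + (var_q + d2) / var_p - 2.0)

def cluster_gaussians(mus, variances, n_clusters=1024, n_iter=10, seed=0):
    """K-means-style clustering of ASR Gaussians ((N, D) mean and
    variance arrays) into UBM components."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(len(mus), n_clusters, replace=False)
    c_mu, c_var = mus[centers].copy(), variances[centers].copy()
    for _ in range(n_iter):
        # Assign each Gaussian to its nearest center under symmetric KL.
        assign = np.array([
            np.argmin([sym_kl_diag_gauss(mus[i], variances[i],
                                         c_mu[k], c_var[k])
                       for k in range(n_clusters)])
            for i in range(len(mus))
        ])
        # Update each center by pooling the moments of its members.
        for k in range(n_clusters):
            m = assign == k
            if m.any():
                c_mu[k] = mus[m].mean(axis=0)
                c_var[k] = (variances[m] + mus[m] ** 2).mean(axis=0) \
                           - c_mu[k] ** 2
    return c_mu, c_var
```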
In the second approach, we add a discriminative regularization (DR) term to the log-likelihood objective function to reduce the value of the impostor scores and increase the value of the target scores. The parameters of the UBM are updated using an EM-like algorithm [15] to maximize the regularized maximum likelihood objective function.
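The exact regularizer is specified in [15]; schematically, and purely as our assumed form, the criterion can be pictured as a log-likelihood plus a term that rewards target-trial scores and penalizes impostor-trial scores:

```latex
\mathcal{F}(\lambda) = \sum_{i} \log p(X_i \mid \lambda)
  + \alpha \Big( \sum_{(e,t) \in \mathcal{T}} s_\lambda(e,t)
  - \sum_{(e,t) \in \mathcal{I}} s_\lambda(e,t) \Big)
```

Here lambda denotes the UBM parameters, s_lambda(e,t) the verification score for an enrollment/test pair, T and I the target and impostor trial sets, and alpha the regularization weight; the EM-like procedure then maximizes F.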
The two previously discussed methods for training the UBM parameters were evaluated on the English tasks of the core condition of the NIST 2008 Speaker Recognition Evaluation [11]. The development data set consists of 13770 utterances from the Switchboard II Phase III, NIST 2004, NIST 2006, and NIST 2008 development databases.
The first three systems in Table 1 are: the baseline system using the 36-dimensional MFCC-based speaker recognition frontend, a system with a UBM trained on the ASR FMPE frontend using the EM algorithm, and the PIUBM, which uses a UBM generated from the ASR acoustic models. The 1024 Gaussian component, maximum-likelihood estimated UBMs for both the baseline and the ASR frontend systems are trained using the whole 13770-utterance development set. As shown in Table 1, the two systems which use the ASR frontend features outperform the baseline system on the Int-Tel, Tel-Mic, and Int-Int-S tasks. The results in Table 1 also show that the PIUBM system outperforms the other two systems significantly on the Tel-US and Tel-Eng tasks.
For the discriminative regularization system, only the NIST 2008 development data is used for training the UBM parameters. As shown in Table 1, significant gains in EER and minimum DCF are obtained by using the discriminative regularization objective function to train the UBM parameters on tasks which have interview or microphone data. As shown in Table 1, significant improvements on all tasks are obtained by combining the PIUBM and DR systems.
3.2. Voice Source and Formant Features
From the point of view of the source-filter model of speech production, the frequency spectrum is in part a function of both the voice source and the vocal tract (along with channel and other artifacts). Formant trajectories, related to the filter component of speech production, are used by some forensic phoneticians in their analysis. The voice source, which plays an important role in voice quality, also produces significant cues about the physical traits of the speaker. State-of-the-art speaker recognition systems mainly use features that are more closely related to short-time spectral band-energies or envelopes, for example MFCCs/LPCCs. It has also been shown that prosody (voice source related) contains speaker information [16, 17].
In this overview, we discuss our results on the use of vocal
source and formant features in a speaker recognition task. In particular, we propose using additional voice source features such as jitter and shimmer, previously known to be useful for measuring pathological voice quality [18]. In our system, the voice source features
include: Log energy, Spectral tilt, Voicing level, Log pitch, Jitter, and Shimmer. Jitter and shimmer features are computed for each frame from a surrounding window of 100 ms in duration. The filter-related features we used are formant frequencies and their bandwidth values extracted using the Praat voice analysis software.
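Jitter and shimmer are simple period-to-period perturbation statistics. The sketch below implements the common "local" variants; the exact variant and pitch tracker used in our system are not specified here, so treat this as one plausible reading.

```python
import numpy as np

def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, normalized by the mean period."""
    p = np.asarray(periods, dtype=np.float64)
    return float(np.mean(np.abs(np.diff(p))) / np.mean(p))

def local_shimmer(amplitudes):
    """Local shimmer: the same statistic applied to the peak amplitudes
    of consecutive glottal cycles."""
    a = np.asarray(amplitudes, dtype=np.float64)
    return float(np.mean(np.abs(np.diff(a))) / np.mean(a))

# Hypothetical per-frame use: pool the glottal cycles detected inside a
# 100 ms window centered on the frame, as described above.
periods = [0.0102, 0.0100, 0.0104, 0.0101]   # seconds per glottal cycle
amps = [0.81, 0.78, 0.84, 0.80]              # cycle peak amplitudes
print(local_jitter(periods), local_shimmer(amps))
```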
We evaluated the use of these features on Tasks 6, 7 and 8 from
the NIST 2008 core evaluation. The minimum DCF performance
statistics of both the voice source and the formant systems are about
two to four times worse than the MFCC baseline system. However,
linear score fusion of the voice source and formant features provided consistent gains, indicating a certain degree of independence
between these two feature sets. An oracle fusion result achieved up
to 40% relative improvement over either system alone. We also observed some gains on English trials for the combination of the voice
source features and the baseline MFCC features. We note that no
consistent gains were observed when combining the formant features and the MFCC features.
4. DEMOGRAPHICS: DIALECT, NON-NATIVENESS AND
NATIVE LANGUAGE DETECTION
Demographic information has been used by forensic scientists in
a court of law. Correspondingly, we have developed three types
of GMM based detectors targeted at improving automatic speaker
recognition: one to detect American English dialects, one to detect
non-native English speakers, and the last to detect the native language of the speaker.
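As a generic stand-in for these detectors, the sketch below trains a per-class GMM and a background GMM on acoustic frames and scores a recording by the average per-frame log-likelihood ratio. The use of scikit-learn, the component count, and plain ML training (rather than MAP adaptation) are our simplifications of the systems in [19, 20].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_detector(class_frames, background_frames,
                   n_components=256, seed=0):
    """Fit a GMM for one demographic class (e.g., a dialect) and a
    background GMM; both inputs are (num_frames, feat_dim) arrays."""
    gmm = GaussianMixture(n_components, covariance_type="diag",
                          random_state=seed).fit(class_frames)
    ubm = GaussianMixture(n_components, covariance_type="diag",
                          random_state=seed).fit(background_frames)
    return gmm, ubm

def detection_score(frames, gmm, ubm):
    """Average per-frame log-likelihood ratio: class model vs. background."""
    return float(np.mean(gmm.score_samples(frames)
                         - ubm.score_samples(frames)))
```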
In our experiments for American English dialect detection, we used 20900 recordings from 10670 speakers in the English Fisher database. Two-thirds of this data is used for training the system, and one-third for testing. Dialectal classes are determined based on the location where the speaker was raised. We used eight dialectal classes, namely North, South, West, Midland, New England, Canada, English-speaking countries, and Other countries. We achieved an average detection accuracy of 53.9%.
For the non-native speaker detection task, we used the same data set and setup specified in [19, 20]. The PIUBM system, described earlier, reduced the average EER to 9.5% [20]. This result outperforms the best published result on this database by 23.4% relative [19]. For native language detection, we use the Fisher and the CSLU Foreign Accented English (CSLU-FAE) databases for training and testing. We considered the 25 native languages available in the two sets. The result on the CSLU-FAE database, in terms of detection error, outperforms the best published result on this database by 35.8% relative [21, 20].
In the speaker verification experiments, we trained an SVM classifier to generate the final scores. The input to this SVM classifier is a vector of three sets of values: the baseline speaker verification system score, the absolute difference between the individual detection scores of the enrollment and the test recordings, and the sum of the detection scores of the two recordings. The detailed results in [20] show small but not statistically significant improvements from using the demographic detection scores.
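The input vector just described can be assembled directly; the sketch below shows the construction and a linear SVM trained on labeled trials. The toy numbers and the scikit-learn back end are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def combination_features(baseline_score, det_enroll, det_test):
    """SVM input: baseline score, per-detector |enroll - test|
    differences, and per-detector sums, as described above."""
    det_enroll = np.asarray(det_enroll, dtype=np.float64)
    det_test = np.asarray(det_test, dtype=np.float64)
    return np.concatenate([[baseline_score],
                           np.abs(det_enroll - det_test),
                           det_enroll + det_test])

# Hypothetical training on labeled trials (1 = target, 0 = impostor).
X = np.array([combination_features(2.1, [0.3, -1.0], [0.2, -0.8]),
              combination_features(-0.5, [0.9, 0.1], [-1.2, 0.6])])
y = np.array([1, 0])
clf = SVC(kernel="linear").fit(X, y)
print(clf.decision_function(X))   # final verification scores
```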
5. SYSTEM COMBINATION AND CONDITIONING
There is an ongoing trend in the forensic sciences toward using the log-likelihood ratio (LLR) as the correct measure of evidence [1, 2].

Table 1. Results on the NIST 2008 English core condition tasks comparing the baseline with systems utilizing ASR features and the PIUBM. Performance is reported as minDCF (x10^3) with EER (%) in parentheses.

System        | Int-Int-All | Int-Int-S | Int-Int-D  | Int-Tel     | Tel-Mic    | Tel-Eng    | Tel-US
Baseline      | 23.9 (4.6)  | 2.0 (0.8) | 24.2 (4.6) | 37.5 (10.3) | 28.8 (7.4) | 15.6 (3.5) | 15.4 (4.4)
ASR Frontend  | 21.4 (4.8)  | 0.7 (0.4) | 22.2 (5.0) | 31.8 (7.6)  | 21.8 (6.4) | 16.4 (3.4) | 15.4 (4.4)
PIUBM         | 23.4 (5.3)  | 1.7 (0.3) | 24.5 (5.5) | 30.7 (8.6)  | 22.1 (6.7) | 12.7 (2.7) | 11.6 (3.0)
DR            | 15.9 (2.7)  | 1.9 (0.6) | 16.1 (2.8) | 29.1 (7.1)  | 25.6 (7.2) | 14.0 (3.4) | 14.1 (4.1)
Combination   | 13.7 (2.7)  | 1.3 (0.3) | 14.2 (2.7) | 20.3 (5.1)  | 15.5 (4.7) | 9.7 (2.1)  | 9.3 (2.1)
When there are multiple sets of attributes compared across conversations, consolidating the corresponding likelihood ratios is a difficult problem. Many successful automatic speaker verification systems rely on a combination of various subsystems [22, 23]. Our recent research investigates a discriminative approach that boosts the performance of weighted system combination by exploiting information in reduced-dimensional representations of GMM supervectors. In our experiments, we obtained a minimum DCF of 0.046 for Task 6 of the NIST 2008 SRE using only 400-dimensional supervector representations. Using these reduced-dimensional representations in addition to the original scores obtained with high-dimensional supervectors, we obtained an improvement of 10% relative to using only the original scores.
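The dimensionality reduction used is not specified above; as one plausible instantiation, the sketch below projects supervectors onto their top principal directions via an economy SVD.

```python
import numpy as np

def reduce_supervectors(supervectors, dim=400):
    """Project (N, D) GMM supervectors onto their top `dim` principal
    directions. PCA here is our assumption, not the method used in the
    experiments; `dim` must not exceed min(N, D)."""
    X = np.asarray(supervectors, dtype=np.float64)
    Xc = X - X.mean(axis=0)
    # Economy SVD: rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T
```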
Furthermore, each of these systems is prone to severe enroll/test mismatch problems due to speech signal conditions, including environment-related factors (such as whether the recordings are highly noise contaminated and whether the recordings are telephone speech or microphone speech) and speaker-characteristic factors (such as age, nativeness, and dialect). Our recent work has indicated that fusion strategies that apply a different score normalization for different language conditions (for instance, considering whether the enrollment and test speech are of the same language or not) are very effective. Our experiments on Task 6 of the NIST 2008 SRE demonstrate that a language-conditioned fusion strategy reduced the minDCF to 0.024, a competitive result on this task.
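One minimal realization of such condition-dependent normalization is to z-normalize scores separately within each enroll/test language condition before fusion. In practice the statistics would come from development data, and the actual strategy in our experiments is more elaborate; this is only a sketch.

```python
import numpy as np

def condition_normalize(scores, conditions):
    """Z-normalize scores separately per condition label (e.g.,
    'same-language' vs. 'cross-language' enroll/test pairs)."""
    scores = np.asarray(scores, dtype=np.float64)
    out = np.empty_like(scores)
    for c in set(conditions):
        idx = np.array([ci == c for ci in conditions])
        # Normalization statistics computed within the condition only.
        out[idx] = (scores[idx] - scores[idx].mean()) / scores[idx].std()
    return out
```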
6. CONCLUSIONS
We presented ongoing work encompassing information from forensic phonetic approaches for improving automatic voice comparison. We also overviewed our automatic nativeness, dialect/accent, and native language detection systems, which are inspired by the role demographic information has implicitly played in forensics. The importance of alternative features, language conditioning and system combination was also discussed. Early results show that some of the forensically inspired approaches yield improved results over conventional methods. The early work on hot-spot analysis demonstrated that better performance can be achieved when particular speech events are selected and modelled, in contrast to using general models that utilize all data. The work encompassing phonetic information extracted from automatic speech recognition systems (the PI-UBM) and speaker information using discriminative training also showed marked improvements over the corresponding baseline.

7. REFERENCES

[1] G. Morrison, "Forensic voice comparison," Expert Evidence, 2010.
[2] P. Rose, Forensic Speaker Identification, Taylor-Francis, 2002.
[3] H. Hollien, Forensic Voice Identification, Academic Press, 2002.
[4] W. Labov and W. Harris, "Addressing social issues through linguistic evidence," Language and the Law, 1994.
[5] D. Sturim, et al., "Speaker verification using text-constrained Gaussian mixture models," ICASSP, 2002.
[6] K. Boakye and B. Peskin, "Text-constrained speaker recognition on a text-independent task," Odyssey, 2004.
[7] F. Weber, B. Peskin, M. Newman, A. Corrada-Emmanuel, and L. Gillick, "Speaker recognition on single- and multispeaker data," Digital Signal Processing, vol. 10, no. 1/2/3, 2000.
[8] T. Bocklet and E. Shriberg, "Speaker recognition using syllable-based constraints for cepstral frame selection," IEEE ICASSP, 2009.
[9] W. Campbell, D. Sturim, D. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," ICASSP, 2006.
[10] A. Hatch, S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," International Conference on Spoken Language Processing, 2006.
[11] National Institute of Standards and Technology, "NIST speech group website," http://www.nist.gov/speech, 2008.
[12] R. Schwartz, et al., "Construction of a phonotactic dialect corpus using semiautomatic annotation," Interspeech, 2007.
[13] W. Campbell, J. Campbell, D. Reynolds, D. Jones, and T. Leek, "Phonetic speaker recognition with support vector machines," IEEE NIPS, 2003.
[14] A. Hatch, B. Peskin, and A. Stolcke, "Improved phonetic speaker recognition using lattice decoding," IEEE ICASSP, 2005.
[15] M. Omar and J. Pelecanos, "Training universal background models for speaker recognition," IEEE Odyssey, 2010.
[16] A. Adami, et al., "Modeling prosodic dynamics for speaker recognition," ICASSP, 2003.
[17] S. Kajarekar, L. Ferrer, K. Sonmez, and J. Zheng, "Modeling NERFs for speaker recognition," IEEE Odyssey, 2004.
[18] M. Farrus, J. Hernando, and P. Ejarque, "Jitter and shimmer measurements for speaker recognition," Eurospeech, 2007.
[19] E. Shriberg, et al., "Detecting non-native speech using speaker recognition approaches," IEEE Speaker Odyssey, 2008.
[20] M. Omar and J. Pelecanos, "A novel approach to detecting non-native speakers and their native language," ICASSP, 2010.
[21] G. Choueiter, G. Zweig, and P. Nguyen, "An empirical study of automatic accent classification," ICASSP, 2008.
[22] N. Brummer, "Focal bilinear toolkit," http://sites.google.com/site/nikobrummer/focalbilinear, 2010.
[23] L. Ferrer, et al., "System combination using auxiliary information for speaker verification," ICASSP, 2008.
