discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/220732612
6 authors, including:
Jason Pelecanos
Weizhong Zhu
IBM
IBM Research
ABSTRACT
This paper presents ongoing research leveraging forensic methods for automatic speaker recognition. Some of the methods forensic
scientists employ include identifying speaker distinctive audio segments and comparing these segments using features such as pitch,
formant, and other information. Other approaches have also involved
performing a phonetic analysis to recognize idiolectal attributes, and
an implicit analysis of the demographics of speakers.
Inspired by these forensic phonetic approaches, we target three
threads of work: hot-spot analysis, speaker style and pronunciation
modelling, and demographics analysis. As a result of this work we
show that a phonetic analysis conditioned on select speech events
(or hot-spots) can outperform a phonetic analysis performed over all
speech without conditioning. In the area of pronunciation modelling,
one set of results demonstrates significantly improved robustness by
exploiting phonetic structure in an automatic speech recognition system. For demographics analysis, we present state-of-the-art results
of systems capable of detecting dialect, non-nativeness and native
language.
Index Terms: Forensics, speaker verification, hot-spot, pronunciation modelling, demographics.
1. INTRODUCTION
For a number of years, the speaker recognition community has examined how to model the general patterns of short-term spectral coefficients, prosody and complementary high-level information. The
underlying phenomena were typically modelled somewhat as a black
box. There may be a benefit in breaking down the speech into distinctive events and modelling their detail in a similar manner to the
approaches used by some forensic phoneticians. The aim of our ongoing work is to develop techniques motivated by the procedures
performed in the forensics community and to include them for the benefit of improving automatic speaker recognition. We partition these
techniques into three main areas: hot-spot analysis, pronunciation
modelling and demographics analysis.
Hot-Spot/Landmark Analysis: Some forensic approaches compare the properties of particular words and phone sequences [1, 2, 3].
Similarly, we can begin to include information from this style of
analysis in an automated system. This involves identifying segments
in an utterance that are considered discriminative and comparing
them with like segments in other utterances. The long-term goal is
to investigate which events are important (including a natural expansion to events longer than single words), how to model rare events,
and how they should be matched when event recognition errors and intrinsic speaker and speech variation effects (such as co-articulation)
are considered.
This research was funded by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA),
through the Army Research Laboratory (ARL). All statements of fact, opinion or conclusions contained herein are those of the authors and should not be
construed as representing the official views or policies of IARPA, the ODNI,
or the U.S. Government.
Speaker Style and Pronunciation Modelling: It is observed that
forensic phoneticians often encompass high-level speech information implicitly as part of their analysis. Additionally, they have also
considered features such as pitch, formant and related information.
Here we investigate methods for incorporating high-level information in the speaker modelling process. We also include a brief analysis of some of the features used in forensic and pathological voice
analysis.
Demographic Analysis and Speaker Profiling: One of the first
things forensic practitioners examine (often implicitly) is a speaker's
dialect. Not surprisingly, forensic scientists have used it to provide significant evidence in court [2]. One important example by
Labov [4] was the case of Paul Prinzivalli, who worked for Pan Am
as a cargo handler. He was suspected of making a telephone bomb
threat to his employer. According to phoneticians working for the
defense, the voice from the recorded telephone threat was characteristic of a New England accent, while Prinzivalli's accent was distinctively New Yorker. Based on this and other evidence, Prinzivalli
was later acquitted. Accordingly, we have developed demographics-based
detectors to assist in the problem of automatic voice comparison.
The three research threads proposed here represent steps toward
encompassing techniques performed by phoneticians in forensic
voice comparison casework. Correspondingly, Section 2 discusses
progress on work related to hot-spot analysis, Section 3 details
research on speaker style and pronunciation modelling, Section 4
refers to the demographics work, while Section 5 overviews system
combination and score conditioning efforts.
2. HOT-SPOT ANALYSIS
One approach that forensic phoneticians have followed for comparing voices is to locate important phonetic events (hot-spots) and to
extract appropriate speech measurements accordingly [1, 2]. Along
these lines, we examine two approaches: word and phone-sequence
conditioned acoustic comparison, followed by a related phone sequence analysis methodology.
2.1. Word/Phone-Sequence Conditioned Acoustic Comparison
There are several previous works employing speech candidate selection methods for automatic voice comparison. For example,
Sturim [5] used the features from a small set of keywords to produce
a Gaussian Mixture Model (GMM). Similarly, Boakye [6] utilized
a small keyword set and modelled the features from each keyword
using a Hidden Markov Model (HMM). Weber [7] examined the
use of a large vocabulary automatic speech recognition system to
perform speaker recognition modelling. More recent work by Bocklet [8] demonstrated the utility of modelling specific syllabic events
for speaker recognition.
ICASSP 2011
The ongoing goal of this work is to include additional forensic acoustic-phonetic knowledge in the modelling process. Another
goal is to generalize the current English-centric approaches to other
languages. With this aim, we compare an English language focused
system conditioned on words to one conditioned on phone sequences
(which may be more efficiently adapted to other languages). We also
incorporate recent high dimensional kernel dot-product scoring and
session variability compensation methods.
We begin by investigating the performance of words modelled
using HMMs as part of a high dimensional symmetric supervector kernel derived in a similar manner to Campbell [9] through a
Kullback-Leibler distance approximation. Here the supervectors
are formed from the concatenation of the individual (currently English) word HMM supervectors. The result is determined from a
dot-product of the supervector representations followed by ZT-score
normalization. Nuisance Attribute Projection (NAP) is also applied [9] with the directions determined by calculating the Within
Class Covariance (WCC) [10]. To target the goal of being language
independent, we also analyze the performance of phone bi-grams
and phone tri-grams as part of a supervector kernel. For both the
words and phone sequences we select the most popular events by
count.
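The scoring steps described above can be illustrated with a minimal sketch; it is not the implementation used in this work. It assumes NumPy, supervectors that have already been formed by concatenating the per-word (or per-phone-sequence) HMM supervectors, and hypothetical function names. NAP directions are taken as the leading eigenvectors of a within-class covariance estimate, and the score is the dot product of the compensated supervectors. The ZT-score normalization step is omitted.

```python
import numpy as np


def within_class_covariance(supervectors, speaker_ids):
    """Covariance of supervectors around their own speaker mean.

    Assumption: the dimension is small enough to hold a full matrix;
    real supervectors would need a low-rank formulation instead.
    """
    x = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)
    centered = np.empty_like(x)
    for spk in np.unique(ids):
        mask = ids == spk
        centered[mask] = x[mask] - x[mask].mean(axis=0)
    return centered.T @ centered / len(x)


def nap_directions(wcc, num_directions):
    """Leading eigenvectors of the within-class covariance (nuisance subspace)."""
    _, eigvecs = np.linalg.eigh(wcc)  # eigenvalues are returned in ascending order
    return eigvecs[:, -num_directions:]  # shape (dim, num_directions)


def apply_nap(x, directions):
    """Remove the nuisance subspace: x - U U^T x."""
    x = np.asarray(x, dtype=float)
    return x - (x @ directions) @ directions.T


def kernel_score(enrol, test, directions):
    """Dot product of the two NAP-compensated supervectors."""
    return float(apply_nap(enrol, directions) @ apply_nap(test, directions))
```

In such a sketch, the number of NAP directions is a tuning parameter, and the per-event supervectors would be concatenated (for example with `np.concatenate`) in a fixed event order before scoring.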
Figure 1 presents a plot of the speaker recognition performance
versus the relative quantity of speech used (measured using minimum DCF [11] on the male subset of Task 8 in the NIST 2008
Speaker Recognition Evaluation). It shows that the performance curves of
the three hot-spot systems track one another relatively well. This suggests
that phonetic sequences may potentially be used as word level proxy
events and that such models may lend themselves to more language-independent modelling. We also present a result of the Phonetically Inspired UBM (PI-UBM) system (see Section 3.1) for contrast. Although
the hot-spot systems do not perform as well
as the PI-UBM setup, fusion yields some small (though not significant)
gains. These systems will need to be evaluated on data sets with more trials to infer stronger conclusions. With
system combination in mind, this type of kernel, which is built from
a limited set of discrete events (word and phone sequences), can be
extended to use only select events that are complementary to other
state-of-the-art systems.
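For readers unfamiliar with the performance measures used here, the following is a minimal sketch of how minimum DCF and EER can be computed from target and non-target trial scores. It is not the official scoring tool. The function names are hypothetical, and the default cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are an assumption based on the NIST 2008 evaluation plan; the cost is left unnormalized.

```python
import numpy as np


def miss_and_false_alarm_rates(target_scores, nontarget_scores):
    """Miss and false-alarm rates at every candidate threshold.

    A trial is accepted when its score is >= the threshold.
    """
    tar = np.sort(np.asarray(target_scores, dtype=float))
    non = np.sort(np.asarray(nontarget_scores, dtype=float))
    # Every observed score is a candidate threshold, plus +inf (reject all).
    thresholds = np.append(np.unique(np.concatenate((tar, non))), np.inf)
    p_miss = np.searchsorted(tar, thresholds, side="left") / len(tar)
    p_fa = 1.0 - np.searchsorted(non, thresholds, side="left") / len(non)
    return p_miss, p_fa


def min_dcf(target_scores, nontarget_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum (unnormalized) detection cost over all thresholds."""
    p_miss, p_fa = miss_and_false_alarm_rates(target_scores, nontarget_scores)
    cost = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    return float(cost.min())


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: operating point where miss and false-alarm rates are closest."""
    p_miss, p_fa = miss_and_false_alarm_rates(target_scores, nontarget_scores)
    i = int(np.argmin(np.abs(p_miss - p_fa)))
    return float((p_miss[i] + p_fa[i]) / 2.0)
```

Multiplying the value returned by `min_dcf` by 10^3 gives the scale on which minDCF is reported in this paper.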
2.2. Word Conditioned Phone Sequence Analysis
Another approach that we are currently exploring for incorporating more information from a forensic phonetic viewpoint is word
conditioned phone sequence analysis. This aims to capture idiolectal differences in pronunciation between speakers by analyzing
word-specific phone sequence realizations. Forensic phoneticians
can identify a speaker's dialect based on allophonic rules for commonly occurring transformations from general English to dialectal
forms [2, 12] (e.g., "pen" /pEn/ could be pronounced as either [pEn]
or [pIn], which can relate to a speaker's geographical background).
Correspondingly, we develop a system that can capture speaker-dependent attributes from speech through phonetic analysis given
keywords.
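To make the idea of word-conditioned phone sequence analysis concrete, the following is a minimal sketch and not the system evaluated in this paper. It assumes that a recognizer has already produced (word, phone sequence) pairs, counts how each keyword is realized, and scores a test utterance with an average log-likelihood ratio between a speaker model and a background model. All function names and the add-alpha smoothing are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def pronunciation_counts(aligned_tokens, keywords):
    """Count phone-sequence realizations for each keyword.

    aligned_tokens: iterable of (word, phones) pairs, e.g. ("pen", ("p", "I", "n")).
    """
    counts = defaultdict(Counter)
    for word, phones in aligned_tokens:
        if word in keywords:
            counts[word][tuple(phones)] += 1
    return counts


def smoothed_prob(counter, realization, num_variants, alpha=1.0):
    """Add-alpha smoothed relative frequency of one realization."""
    total = sum(counter.values())
    return (counter[realization] + alpha) / (total + alpha * num_variants)


def llr_score(test_tokens, speaker_counts, background_counts, keywords, alpha=1.0):
    """Average log-likelihood ratio (speaker vs. background) over keyword tokens."""
    total, num_tokens = 0.0, 0
    for word, phones in test_tokens:
        if word not in keywords:
            continue
        realization = tuple(phones)
        spk = speaker_counts.get(word, Counter())
        bkg = background_counts.get(word, Counter())
        # Variants seen in either model, plus the one being scored.
        num_variants = len(set(spk) | set(bkg) | {realization})
        p_spk = smoothed_prob(spk, realization, num_variants, alpha)
        p_bkg = smoothed_prob(bkg, realization, num_variants, alpha)
        total += math.log(p_spk / p_bkg)
        num_tokens += 1
    return total / num_tokens if num_tokens else 0.0
```

With the "pen" example above, a speaker who mostly produces [pIn] would receive a positive score on a test token realized as [pIn] and a negative score on one realized as [pEn], provided the background population uses both forms.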
Figure 2 compares the baseline phonetic speaker verification
system [13, 14] and our word-conditioned extension using the 50
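[Figure 2 (plot not recoverable): minDCF (×10^3) versus the number of keywords used for word conditioning (10 to 50), comparing the word conditioned PSV system with the baseline PSV system without word conditioning.]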
Performance: minDCF (×10^3), with EER (%) in parentheses
System        Int-Int-All  Int-Int-S  Int-Int-D   Int-Tel      Tel-Mic     Tel-Eng     Tel-US
Baseline      23.9 (4.6)   2.0 (0.8)  24.2 (4.6)  37.5 (10.3)  28.8 (7.4)  15.6 (3.5)  15.4 (4.4)
ASR Frontend  21.4 (4.8)   0.7 (0.4)  22.2 (5.0)  31.8 (7.6)   21.8 (6.4)  16.4 (3.4)  15.4 (4.4)
PI-UBM        23.4 (5.3)   1.7 (0.3)  24.5 (5.5)  30.7 (8.6)   22.1 (6.7)  12.7 (2.7)  11.6 (3.0)
DR            15.9 (2.7)   1.9 (0.6)  16.1 (2.8)  29.1 (7.1)   25.6 (7.2)  14.0 (3.4)  14.1 (4.1)
Combination   13.7 (2.7)   1.3 (0.3)  14.2 (2.7)  20.3 (5.1)   15.5 (4.7)  9.7 (2.1)   9.3 (2.1)
Table 1. Results on the NIST 2008 English core condition tasks comparing the baseline with systems utilizing ASR features and the PI-UBM.
When there are multiple sets of attributes compared across conversations, consolidating the corresponding likelihood ratios is a difficult
problem. Many successful automatic speaker verification systems
rely on a combination of various systems [22, 23]. Our recent research investigates a discriminative approach that boosts the performance of weighted system combination by exploiting information in
reduced dimensional representations of GMM supervectors. In our
experiments, we obtained a minimum DCF of 0.046 for Task 6 of the
NIST 2008 SRE using only 400 dimensional supervector representations. Using these reduced dimensional representations in addition
to the original scores obtained with high dimensional supervectors,
we obtained an improvement of 10% relative to using only the original scores.
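As background for the weighted system combination mentioned above, the following is a minimal sketch of linear score fusion with weights trained by logistic regression. It is a generic illustration and not the discriminative approach investigated in this work; the function names, learning rate and iteration count are assumptions. Each row of `scores` holds the scores that the individual systems assign to one trial.

```python
import numpy as np


def train_fusion(scores, labels, learning_rate=0.1, num_iters=2000):
    """Fit fusion weights w and offset b by gradient descent on the logistic loss.

    scores: array of shape (num_trials, num_systems).
    labels: 1 for target trials, 0 for non-target trials.
    """
    x = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(num_iters):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted target posterior
        grad = p - y                            # derivative of loss w.r.t. the logit
        w -= learning_rate * (x.T @ grad) / len(y)
        b -= learning_rate * grad.mean()
    return w, b


def fuse(scores, w, b):
    """Weighted sum of the per-system scores plus an offset."""
    return np.asarray(scores, dtype=float) @ w + b
```

In a sketch like this, additional side information (for example, components of a reduced dimensional supervector representation) could in principle be appended as extra columns of `scores`, although how best to do so is beyond this illustration.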
Furthermore, each of these systems is prone to severe enroll/test mismatch problems due to speech signal conditions, including environment-related factors (such as whether the recordings
are highly noise contaminated and whether the recordings are telephone speech or microphone speech) and speaker-characteristic
factors (such as age, nativeness, and dialect). Our recent work has
indicated that fusion strategies that apply a different score normalization for different language conditions (for instance, considering
whether or not the enrollment and test speech are in the same language) are
very effective. Our experiments on Task 6 of the NIST 2008 SRE
demonstrate that a language-conditioned fusion strategy reduced the
minDCF to 0.024, a competitive result on this task.
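The idea of condition-dependent score normalization can be sketched as follows. This is a minimal illustration under stated assumptions and not the fusion strategy used in our experiments: non-target (impostor) development scores are grouped by a language condition label, and each trial score is standardized with the statistics of its own condition. The function names and the condition labels "same" and "cross" are hypothetical.

```python
import statistics
from collections import defaultdict


def fit_condition_stats(dev_trials, min_std=1e-6):
    """Mean and standard deviation of non-target scores per condition.

    dev_trials: iterable of (score, condition, is_target) tuples.
    """
    by_condition = defaultdict(list)
    for score, condition, is_target in dev_trials:
        if not is_target:
            by_condition[condition].append(score)
    return {
        condition: (statistics.fmean(values),
                    max(statistics.pstdev(values), min_std))
        for condition, values in by_condition.items()
    }


def normalize(score, condition, stats):
    """Standardize a score with the statistics of its own condition."""
    mean, std = stats[condition]
    return (score - mean) / std
```

After such a normalization, non-target scores from different language conditions share a common scale, so a single decision threshold or a subsequent fusion stage can be applied across conditions.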
6. CONCLUSIONS
We presented ongoing work incorporating information from forensic phonetic approaches for improving automatic voice comparison.
We also overviewed our automatic nativeness, dialect/accent, and
native language detection systems, which are inspired by the role
demographic information has implicitly played in forensics. The
importance of alternative features, language conditioning and system combination was also discussed. Early results show that some
of the forensically inspired approaches yield improved results over
conventional methods. The early work on hot-spot analysis demonstrated that better performance can be achieved when particular
speech events are selected and modelled, in contrast to using general
models that utilize all data. The work incorporating phonetic information
extracted from automatic speech recognition systems
(the PI-UBM) and speaker information using discriminative training
also showed marked improvements over the corresponding baseline.
7. REFERENCES
[1] G. Morrison, Forensic voice comparison, Expert Evidence, 2010.
[2] P. Rose, Forensic Speaker Identification, Taylor & Francis, 2002.
[3] H. Hollien, Forensic Voice Identification, Academic Press, 2002.