
An Unsupervised Deep Domain Adaptation Approach for Robust Speech Recognition

Sining Sun, Binbin Zhang, Lei Xie, Yanning Zhang


School of Computer Science, Northwestern Polytechnical University, Xi’an, China

Abstract
This paper addresses the robust speech recognition problem as a domain adaptation task. Specifically, we introduce an unsupervised deep domain adaptation (DDA) approach to acoustic modeling in order to eliminate the training-testing mismatch that is common in real-world use of speech recognition. Under a multi-task learning framework, the approach jointly learns two discriminative classifiers using one deep neural network (DNN). As the main task, a label predictor predicts phoneme labels and is used during training and at test time. As the second task, a domain classifier discriminates between the source and the target domains during training. The network is optimized by minimizing the loss of the label classifier while maximizing the loss of the domain classifier at the same time. The proposed approach is easy to implement by modifying a common feed-forward network. Moreover, this unsupervised approach only needs labeled training data from the source domain and some unlabeled raw data of the new domain. Speech recognition experiments on noise/channel distortion and domain shift confirm the effectiveness of the proposed approach. For instance, on the Aurora-4 corpus, compared with the acoustic model trained only on clean data, the DDA approach achieves a relative word error rate (WER) reduction of 37.8%.
Keywords: domain adaptation, robust speech recognition, deep neural network, deep learning

1. Introduction

The increasing availability of multimedia big data, including various genres of speech, is fostering a new wave of multimedia analytics that aims to effectively access the content and pull meaning from the data. Automatic speech recognition (ASR), which transcribes speech into text, serves as a necessary preprocessing step for multimedia analytics. With the help of big data, supercomputing infrastructure and deep learning [1], speech recognition accuracy has been dramatically improved over the past years [2]. Besides the Gaussian mixture model - hidden Markov model (GMM-HMM) architecture that dominated acoustic modeling in speech recognition for many years, artificial neural networks have historically been used as an alternative model, but with limited success [3]. Only recently has the neural network re-emerged as an effective tool for acoustic modeling, because of the power of big data and effective learning methods [4]. The DNN-HMM architecture has come to the central stage in speech recognition [5, 2], replacing the GMM-HMM architecture. We have witnessed the success of various types of (deep) neural networks (DNNs) not only in speech recognition, but also in visual data processing, data mining and other areas [6, 7, 8, 9, 10].

Speech is typical big data: not only large in volume, but also noisy and heterogeneous. In practice, we desire a robust speech recognizer that is able to handle noisy data. For many machine learning tasks, including ASR, we usually assume that the training data and the testing data follow the same probability distribution. However, real-world applications often fail to meet this hypothesis [5, 11]. In speech recognition, both the GMM-HMM and DNN-HMM systems are Bayesian classifiers by nature. Theoretical investigation has shown that the training-testing mismatch notoriously leads to increased errors in Bayesian classification [11]. There are many causes of the mismatch, such as environmental noises, channel distortions [12] and room reverberation [13, 14]. To improve the environmental robustness of a speech recognizer, a common and efficient approach is multi-condition training [15], which uses the contaminated noisy data, together with the clean data, in acoustic model training. But it is impossible to cover all kinds of real-world conditions, and the mismatch still exists. Therefore, environmental robustness remains a big unsolved challenge. On the other hand, real-world speech data is heterogeneous. Speech in different domains, e.g., broadcast news, lectures, meeting recordings and conversations, has different characteristics. This causes another mismatch that apparently decreases speech recognition performance [14].

In order to eliminate the training-testing mismatch, a large number of robust speech recognition methods have been proposed, which in general fall into two categories: feature-space approaches and model-space approaches [16, 17]. Most approaches need some prior knowledge about the mismatch. For example, noise characteristics have to be known beforehand, or clean-noisy speech pairs 1 are needed [18]. Model adaptation is a typical model-space approach that is quite useful for noise robustness. The acoustic model, e.g., GMM-HMMs, is adapted using the new data either in a supervised manner [19] or an unsupervised manner [20]. For feature-space approaches, it is common to append information about the speaker, environment and noise, such as i-vectors [21], to the acoustic features.

1 Noisy speech may be generated manually by adding noises into clean speech.

In this paper, we regard the robust speech recognition problem as a domain adaptation (DA) task [22]. Learning a discriminative classifier in the presence of a mismatch between training and testing distributions is known as domain adaptation. The essence of domain adaptation and robust speech recognition is identical, that is, to eliminate the mismatch between the training data and the test data. We find that speech features follow different distributions if they come from different domains (such as clean and noisy speech conditions [16], or data sets with different genres). Specifically, if we train a DNN acoustic model using clean speech, the feature distributions of clean and noisy speech derived from this acoustic model are significantly different. Hence we would like to embed the domain information during acoustic model training in order to obtain a "domain-invariant feature extractor".

Our work is inspired by a recent DNN based unsupervised domain adaptation approach for image classification [23]. This deep domain adaptation (DDA) approach combines domain adaptation and deep feature learning within a single training process. Specifically, under a multi-task learning framework, the approach jointly learns one feature extractor and two discriminative classifiers using one single DNN: the feature extractor is trained to extract domain-invariant and classification-discriminative features; the label predictor predicts class labels and is used both during training and testing; a domain predictor discriminates between the source and the target domains during training. In order to obtain domain-invariant and classification-discriminative features, the feature extractor sub-network is optimized by minimizing the loss of the label predictor and maximizing the loss of the domain predictor at the same time, which is achieved by a special objective function defined later. The parameters of the two predictor sub-networks are optimized to minimize their losses on the training set. Compared with other unsupervised adaptation approaches, the DDA approach is easy to implement by simply augmenting a common feed-forward network with a few standard layers and a simple new gradient reversal layer. Moreover, this approach only needs the labeled training data from the source domain and some unlabeled raw data of the new domain. Experiments show that the DDA approach outperforms previous state-of-the-art image classification approaches on several popular datasets [23].

In this study, we introduce the DDA approach to robust speech recognition. Applying DDA to speech recognition is not trivial, because speech recognition is a more challenging task than image classification. We elaborate some of the major challenges as follows.

• The large number of labels: In the typical image classification tasks in [23], the number of classes is only dozens. In contrast, in speech recognition, the class labels are thousands of senones (i.e., phoneme states). The effectiveness of DDA on a large-scale classification task like speech recognition deserves an intensive study.
• Decoding: Compared with image classification, speech recognition is a rather complicated task involving frame-level classification (classifying each speech frame into senone labels) and decoding (a Viterbi search over a large graph based on the classified frame labels). An accuracy gain in frame-level classification may not ensure a consistent accuracy gain at the word level [24].

• Deeper networks: The neural networks in speech recognition usually have many hidden layers in order to learn highly nonlinear and discriminative features which are robust to irrelevant variabilities.

To bridge the gap, in this paper, we study how to integrate DDA into acoustic modeling and present a systematic analysis of the performance of DDA in robust speech recognition. Our study shows that the DDA approach can significantly boost speech recognition performance in both noise/channel distortion and domain-shift conditions.

The rest of this paper is structured as follows. Section 2 surveys the related work. Section 3 presents the framework of deep domain adaptation and studies how to use it in the speech recognition task. Experimental settings and results are discussed in Sections 4, 5 and 6, and finally conclusions are drawn in Section 7.

2. Related work

As we just mentioned, robust speech recognition methods can be classified into two categories: feature-space approaches and model-space approaches [16, 17]. Compared with model-space approaches, feature-space approaches do not need to modify or retrain the acoustic model. Instead, various operations can be performed on the acoustic features to improve their robustness to noise (or other distortions). Model-space approaches, rather than focusing on the modification of features, adjust the acoustic model parameters to match the testing data.

2.1. Traditional methods

In the feature space, feature normalization is the most straightforward strategy to eliminate the training-testing mismatch. Popular strategies include cepstral mean subtraction (CMS) [25], cepstral mean and variance normalization (CMVN) [26] and histogram equalization (HEQ) [27]. Obviously, speech enhancement methods [28, 29] can also be adopted to remove the noise before speech recognition, but the unavoidable distortions in the enhanced speech may cause another, new mismatch problem.
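To make the feature-space idea concrete, the following sketch applies per-utterance cepstral mean and variance normalization to a matrix of cepstral features. It is a minimal illustration of CMVN under assumed feature dimensions, not the exact implementation evaluated in [26]; the code examples in this paper are written in Python.

    import numpy as np

    def cmvn(features, eps=1e-8):
        # features: (num_frames, num_coeffs) cepstral features of one utterance.
        # Normalize each coefficient to zero mean and unit variance.
        mean = features.mean(axis=0, keepdims=True)
        std = features.std(axis=0, keepdims=True)
        return (features - mean) / (std + eps)

    # Toy example: a 300-frame "utterance" of 13-dim MFCCs.
    utterance = np.random.randn(300, 13) * 5.0 + 2.0
    normalized = cmvn(utterance)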
Rather than updating the features, the acoustic model parameters can be compensated to match the testing conditions. A simple example of updating the models is to re-train them with the new data; or, more popular, adding a variety of noise samples to the clean training data, known as multi-style or multi-condition training [17, 15]. However, due to the unpredictable nature of real-world noise, it is impossible to account for all noise conditions that may be encountered. Thus adaptive and predictive methods have been proposed in the model space. The adaptive methods update the model parameters when sufficient corrupted speech data are available. Popular methods include maximum a posteriori re-estimation (MAP) [30] and maximum likelihood linear regression (MLLR) [31]. In the predictive methods, a noise model is combined with the clean speech models to provide a corrupted speech acoustic model using some model of the acoustic environment. Parallel model combination (PMC) [32] and vector Taylor series (VTS) [33] fall into this category.

2.2. DNN based methods

Compared with GMMs, DNNs have an outstanding non-linear learning ability, which makes the DNN a more robust acoustic model. Hence the DNN-HMM architecture is inherently noise robust to some extent compared with GMM-HMM [17]. However, merely relying on this non-linear learning ability is not enough to solve the mismatch problem. Recently, many methods in the feature and model spaces have been proposed to make DNN-HMM more robust to mismatched test data. In order to account for the mismatch, many useful auxiliary features, reflecting environmental noise and speaker information [17, 18], are combined with the acoustic features as the DNN input. Neural networks can also be used as a speech enhancement tool. In [34, 35], a denoising autoencoder (DAE) is adopted to reconstruct clean speech features from noisy ones. This kind of method needs stereo data, i.e., clean speech and corresponding noisy speech, to train the denoising DNN. The DNN feature enhancement and the DNN acoustic model can also be trained jointly [36, 37]. Multi-task training is another popular strategy to improve the robustness of the acoustic model [38].
By adding one or more auxiliary output layers to the DNN and optimizing several tasks (e.g., main task: prediction of senone labels; side task: denoising) at the same time, the network gains more robustness [38].
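This multi-task idea can be sketched as a network with a shared trunk and two output heads, one predicting senones and one reconstructing clean features. The layer sizes, activation functions and the loss weight below are illustrative assumptions rather than the configuration used in [38].

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskAM(nn.Module):
        # Shared hidden layers with a senone head (main task) and a
        # denoising head (auxiliary task).
        def __init__(self, feat_dim=440, num_senones=2000, hidden=1024):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(feat_dim, hidden), nn.Sigmoid(),
                nn.Linear(hidden, hidden), nn.Sigmoid())
            self.senone_head = nn.Linear(hidden, num_senones)
            self.denoise_head = nn.Linear(hidden, feat_dim)

        def forward(self, x):
            h = self.trunk(x)
            return self.senone_head(h), self.denoise_head(h)

    model = MultiTaskAM()
    noisy = torch.randn(8, 440)              # noisy input frames
    clean = torch.randn(8, 440)              # clean targets for the side task
    senones = torch.randint(0, 2000, (8,))   # senone labels for the main task
    logits, recon = model(noisy)
    loss = F.cross_entropy(logits, senones) + 0.1 * F.mse_loss(recon, clean)
    loss.backward()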
3. Deep Domain Adaptation for Robust ASR

3.1. The Model

We treat the training-testing mismatch problem as a domain adaptation task, bridging the target (testing) and the source (training) domains. The main purpose of deep domain adaptation (DDA) [23] is to embed the domain information into the process of representation learning, so that the final classification decisions are made based on features that are both discriminative and invariant to the changes of domains. This means that the representation learned by the DNN classifier has the same or very similar distributions in the source and the target domains.

Assume that the neural network model works with input samples x ∈ X and certain labels y ∈ Y, where X and Y are the input space and the output space, respectively. Here in speech recognition, x and y are framewise acoustic features and senones (phoneme states), respectively. There are two distributions S(x, y) and T(x, y) on X ⊗ Y, which are referred to as the source distribution (for training) and the target distribution (for testing); both distributions are assumed complicated and unknown. Due to domain shift, S and T are similar but different.

In the training-testing mismatch scenario, we train the model with S(x, y), but we test the model with data drawn from the distribution T(x, y). However, we can access many training samples {x1, x2, ..., xN} from the source domain and the target domain according to the marginal distributions S(x) and T(x). Denote by di ([0, 1] or [1, 0]) the domain label of the i-th sample, which indicates whether xi comes from the source domain (xi ∼ S(x) if di = [1, 0]) or from the target domain (xi ∼ T(x) if di = [0, 1]).

The unsupervised deep domain adaptation architecture [23] is depicted in Figure 1. The architecture is simply based on a feed-forward neural network. But different from a common one, this network has two output layers, which produce the main class label y ∈ Y and the domain label d ∈ {[0, 1], [1, 0]}. Specifically, the model is decomposed into three parts that perform different mappings: a feature extractor Gf, a label predictor Gy and a domain predictor Gd. More formally, the mapping functions are:

    f = G_f(x; \Theta_f),   (1)
    y = G_y(f; \Theta_y),   (2)
    d = G_d(f; \Theta_d),   (3)

where Θf, Θy, Θd are the parameters of the network (in Figure 1) and f is a D-dimensional feature vector. Our aim is to jointly train Gf, Gy and Gd. Specifically, we want to seek Θf that minimizes the label prediction loss and maximizes the domain classification loss at the same time. Maximizing the domain classification loss actually makes the feature distributions of the two domains as similar as possible. Meanwhile, in order to ensure reliable domain classification, Θd has to make the mapping Gd perform well on domain classification. This leads to the loss function of the network:

    E(\Theta_f, \Theta_y, \Theta_d) = \sum_{i=1,\ldots,N;\ d_i=[1,0]} L_y(G_y(G_f(x_i; \Theta_f); \Theta_y), y_i) - \lambda \sum_{i=1,\ldots,N} L_d(G_d(G_f(x_i; \Theta_f); \Theta_d), d_i)
                                    = \sum_{i=1,\ldots,N;\ d_i=[1,0]} L_y^i(\Theta_f, \Theta_y) - \lambda \sum_{i=1,\ldots,N} L_d^i(\Theta_f, \Theta_d),   (4)

where Ly(·, ·) and Ld(·, ·) are the loss functions for the label and domain predictors, respectively, while L_y^i(·, ·) and L_d^i(·, ·) denote the losses of the i-th training sample. The loss functions can be the cross entropy or the mean square error, depending on the task. λ is a positive hyper-parameter used to trade off the two losses in practice. Similar loss functions are commonly used in many other machine learning tasks [39, 40, 41].

3.2. Optimization

According to the loss function derived in Section 3.1, we can optimize the DDA network using an approach similar to stochastic gradient descent (SGD) [42]. The aim of the optimization is to seek the optimized parameters such that:

    (\hat{\Theta}_f, \hat{\Theta}_y) = \arg\min_{\Theta_f, \Theta_y} E(\Theta_f, \Theta_y, \Theta_d),   (5)
[Diagram: the input vector x is mapped by the feature extractor Gf(x; Θf) to the feature vector f, which feeds both the label predictor Gy(f; Θy) with loss Ly and the domain predictor Gd(f; Θd) with loss Ld.]

Figure 1: Unsupervised deep domain adaptation architecture.

    \hat{\Theta}_d = \arg\max_{\Theta_d} E(\Theta_f, \Theta_y, \Theta_d).   (6)
Although Θd is optimized by maximizing Eq. (4), this is equivalent to minimizing the second term of Eq. (4), so Θd still ensures good performance of the domain predictor. Θf is optimized by minimizing the first term and maximizing the second term (because of the minus sign). This training strategy keeps the features extracted by the neural network domain-invariant and classification-discriminative. Under the multi-task learning framework, the following equations are used to update the parameters:

    \Theta_f \leftarrow \Theta_f - \mu \Big( \frac{\partial L_y^i}{\partial \Theta_f} - \lambda \frac{\partial L_d^i}{\partial \Theta_f} \Big),   (7)

    \Theta_d \leftarrow \Theta_d - \mu \frac{\partial L_d^i}{\partial \Theta_d},   (8)

    \Theta_y \leftarrow \Theta_y - \mu \frac{\partial L_y^i}{\partial \Theta_y},   (9)

where µ is the step size.
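The updates (7)-(9) can be realized with a gradient reversal layer: an identity mapping in the forward pass whose backward pass multiplies the incoming gradient by -λ, so that a single backward pass through the summed losses reproduces the minus sign of Eq. (4). The sketch below is a minimal PyTorch illustration of Figure 1 and of one SGD step on a mixed mini-batch; the layer sizes, λ, learning rate and the use of integer domain indices (0 = source, 1 = target) instead of the one-hot labels [1, 0]/[0, 1] are simplifying assumptions, not our exact training configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.autograd import Function

    class GradReverse(Function):
        # Identity in the forward pass; multiplies the gradient by -lambda
        # in the backward pass (the gradient reversal layer).
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lam * grad_output, None

    class DDANet(nn.Module):
        def __init__(self, feat_dim=1320, num_senones=2000, hidden=1024, lam=0.45):
            super().__init__()
            self.lam = lam
            self.Gf = nn.Sequential(                  # feature extractor Gf(x; Θf)
                nn.Linear(feat_dim, hidden), nn.Sigmoid(),
                nn.Linear(hidden, hidden), nn.Sigmoid())
            self.Gy = nn.Linear(hidden, num_senones)  # label predictor Gy(f; Θy)
            self.Gd = nn.Linear(hidden, 2)            # domain predictor Gd(f; Θd)

        def forward(self, x):
            f = self.Gf(x)
            y_logits = self.Gy(f)
            d_logits = self.Gd(GradReverse.apply(f, self.lam))
            return y_logits, d_logits

    net = DDANet()
    opt = torch.optim.SGD(net.parameters(), lr=0.01)

    # One step on a mixed mini-batch: source frames carry senone labels,
    # target frames contribute only to the domain loss (cf. Eq. (4)).
    x = torch.randn(16, 1320)
    senone = torch.randint(0, 2000, (16,))
    domain = torch.cat([torch.zeros(8, dtype=torch.long), torch.ones(8, dtype=torch.long)])
    is_source = domain == 0

    y_logits, d_logits = net(x)
    label_loss = F.cross_entropy(y_logits[is_source], senone[is_source])
    domain_loss = F.cross_entropy(d_logits, domain)
    (label_loss + domain_loss).backward()   # the reversal layer flips the domain gradient inside Gf
    opt.step()

With the reversal layer in place, minimizing label_loss + domain_loss updates Θy and Θd according to Eqs. (9) and (8), while Θf receives the combined gradient of Eq. (7).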
3.3. Applying DDA to Speech Recognition

State-of-the-art ASR systems are Bayesian classifiers by nature. A typical speech recognition system can be formulated with a simple equation:

    \hat{W} = \arg\max_{W \in L} P(X|W) P(W),   (10)

where W = {w1, w2, . . . } is a possible word sequence in language L, X = {x1, x2, . . . } is the observation sequence with frame-level acoustic features x, P(X|W) is the acoustic model and P(W) is the language model. Therefore speech recognition (or decoding) is to find the optimal word sequence Ŵ that maximizes the joint acoustic and language probabilities.

As for the language model, a word-level N-gram model [43], trained from a large set of textual data, is usually used. The acoustic model is often built at the fine-grained phoneme (subword) level and trained from labelled speech data with transcripts. The distribution of speech data is complex and speech production is apparently a dynamic process.
[Diagram of the pipeline. Acoustic model training stage: the training samples x pass through the feature extractor; a GMM-HMM acoustic model is trained and used for alignment, producing triphone state (senone) labels which, together with domain labels, are used to train the DNN-HMM acoustic model. Decoding stage: a WFST decoding graph is composed from the acoustic model, the context-dependency transducer, the lexicon and the language model, and the test samples are decoded into word sequence hypotheses.]

Figure 2: The DDA approach used for robust ASR.

Traditionally, hidden Markov models (HMMs) are used to model this dynamic process in a phoneme through state transitions, while Gaussian mixture models (GMMs) are used to depict the distribution of speech data at the HMM state level (sub-phoneme, or so-called senone). This is the so-called GMM-HMM architecture. In practice, context-dependent models, e.g., triphones, are used to model the important coarticulation phenomenon in speech production. Recently, neural networks have re-emerged as a powerful acoustic modeling tool with superior performance [5, 2], replacing GMMs to depict the distribution of speech data, namely the DNN-HMM architecture. For either GMM-HMM or DNN-HMM, if the distributions of the training data and the test data differ, the error of the Bayesian classifier will increase [11]. Hence in this study, we use the unsupervised deep domain adaptation (DDA) strategy to adjust the acoustic model during training. Our purpose is to let the DNN acoustic model learn similar distributions for the training data and the test data, which may increase the robustness of the Bayesian classifier.

Figure 2 shows how to use the DDA strategy in speech recognition. A speech recognition system is composed of an acoustic model training stage2 and a testing stage. In the acoustic model training stage, the first step is to extract acoustic features (represented by the input vector x in Figure 2), such as MFCC or FBank, for the training speech samples. Then the acoustic feature sequences are used to train triphone GMM-HMM acoustic models, whose tied triphone states are the so-called senones. The GMM-HMM models are just used to perform forced alignment of the training samples, obtaining labelled training samples (each speech frame with its corresponding senone label). With the pairwise frame-label data, a DNN acoustic model is then learned that classifies the input frame-level acoustic vector into a senone label. In this process, we can use the DDA approach to learn the senone label classifier and the domain classifier at the same time, using the labelled training data and some of the unlabelled raw data from the testing domain. At the test stage, the domain predictor is discarded and we only use the senone predictor as the acoustic model.

2 A language model is also needed, but its training is out of the scope of this paper.

Given the predicted senone label scores, a speech recognizer still needs a decoder to obtain the best word sequence. As we mentioned at the beginning of this section, decoding involves not only an acoustic model, but also a language model. The acoustic score and the language score are combined in the decoding process for the decision on the final word sequence. Here we use a weighted finite-state transducer (WFST) [24] based static decoder to do the combination. In order to compose the decoding WFST, apart from the acoustic model and the language model, a lexicon and the context dependency are also needed [24, 44].
Using the compose operation of WFSTs, the representations at different levels are integrated into just one WFST graph, which maps HMM states to words. For efficiency reasons, token passing [45] and beam search algorithms are often applied in the decoding process.
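The decision rule of Eq. (10) can be written down directly when a list of candidate word sequences is available; the sketch below combines the two scores in the log domain with a language model weight. The scoring callables are placeholders, and a real recognizer searches the composed WFST graph rather than an explicit hypothesis list.

    import math

    def rescore(hypotheses, acoustic_logprob, lm_logprob, lm_weight=1.0):
        # Pick the word sequence maximizing log P(X|W) + lm_weight * log P(W),
        # i.e. the decision rule of Eq. (10) in the log domain.
        best, best_score = None, -math.inf
        for words in hypotheses:
            score = acoustic_logprob(words) + lm_weight * lm_logprob(words)
            if score > best_score:
                best, best_score = words, score
        return best, best_score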
4. Experiments for Noise/Channel Robustness

We evaluate the noise robustness of DDA on Aurora-4 [15], a popular corpus for robust ASR research. Aurora-4 is designed to verify the effectiveness of robust ASR methods on a medium-vocabulary continuous speech recognition task. There are two different training conditions: (1) the clean training condition, which includes 7138 utterances recorded with the primary microphone without any added noise or distortion; and (2) the multi-condition training condition, which includes the same 7138 utterances, but with one half of the data recorded by the primary microphone and the other half recorded with the second microphone; all are contaminated with six types of added noises at 10-20 dB SNR. In order to investigate different noise/channel distortion conditions, the Aurora-4 test set is composed of four subsets.

• Subset A (Clean): 330 clean utterances without any noises or distortions, recorded with the primary microphone;

• Subset B (Noise): 330 × 6 utterances, obtained by corrupting Subset A with six different noises;

• Subset C (Channel distortion): 330 utterances, same as Subset A, but recorded with the second microphone, without any added noises;

• Subset D (Noise + Channel distortion): 330 × 6 utterances, obtained by corrupting Subset C with six different noises.

All the speech files are sampled at 16 kHz and quantized to 16 bits.

4.1. Clean condition training with multi-condition testing

This experiment is designed to evaluate the robustness of the DDA approach in a mismatched training-testing condition: the acoustic model is trained using clean speech while tested in multiple conditions with contaminated speech. Specifically, we use the clean-condition training set of Aurora-4, which includes 7138 utterances, to train a triphone GMM-HMM acoustic model. The acoustic feature is 39-dim MFCC. Then the GMM-HMM acoustic model is used to align the training data to obtain the triphone state (senone) labels.

After that, two different DNN-HMM acoustic models are trained: the conventional DNN-HMM model trained with a standard feed-forward network and the new DNN-HMM model trained using the DDA approach in Figure 1. For clarity, they are named Clean-DNN-HMM and DDA-DNN-HMM, respectively. The Clean-DNN-HMM model is trained using all the 7138 clean-condition training utterances, as a baseline model. The training data of DDA-DNN-HMM consists of two parts: 7138 clean-condition utterances with senone labels and 3000 multi-condition utterances without senone labels. The clean-condition utterances are used to train the whole network (Gf, Gy, Gd) while the multi-condition utterances are used to train the feature extractor and the domain classifier (Gf, Gd). Because the data from the target domain does not have senone labels, we randomly generate senone labels for the target-domain data in order to train the model in a uniform framework. Specifically, we use a binary flag to control whether the errors of the current frame are used to optimize the feature extractor and the senone label predictor. If the current frame comes from the target domain, the senone predictor errors are discarded. As for the domain predictor, we also have two domain labels to predict. Although there are various kinds of noises in our training data, we do not distinguish them, because we do not want to use too much prior knowledge of the data. Hence, for simplicity, there are just two class labels to predict (clean and noise).
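The per-frame flag described above can be implemented as a mask that zeroes out the senone loss for target-domain frames, while the domain loss is kept for every frame. The following sketch (with assumed tensor shapes, and integer domain indices as in the earlier sketch) illustrates the masking rather than our exact training code.

    import torch
    import torch.nn.functional as F

    def frame_losses(y_logits, d_logits, senone, domain):
        # senone: per-frame senone labels (random placeholders for target frames).
        # domain: 0 for clean/source frames, 1 for noisy/target frames.
        flag = (domain == 0).float()                      # 1 = use the senone error
        label_loss = F.cross_entropy(y_logits, senone, reduction='none')
        label_loss = (flag * label_loss).sum() / flag.sum().clamp(min=1.0)
        domain_loss = F.cross_entropy(d_logits, domain)   # every frame has a domain label
        return label_loss, domain_loss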
For the two DNN-HMM systems, the input layer is a context window of 11 frames of 40-dim FBANK with delta and acceleration coefficients (40 × 3 × 11). The Gf part of the network has 6 hidden layers with 1024 units in each layer. We also compare our approach with a state-of-the-art approach, DNN-PP [35]. Two DNNs are used in this approach [35]: a speech enhancement DNN and an acoustic model DNN. The first DNN, a pre-processor for denoising, is trained with clean-noisy speech pairs. All the training data, including clean and noisy samples, go through the first DNN and are then used for DNN acoustic model (the second DNN) training. Apart from these experiments, we also experiment with a semi-supervised method for comparison.
Table 1: Experimental results for clean condition training with multi-condition testing on Aurora-4 in terms of WER (Word Error Rate). The hyper-parameter λ = 0.45 for DDA-DNN-HMM.

    Model               A      B      C      D      Avg.
    Clean-DNN-HMM       3.36   29.74  21.02  50.73  36.22
    DDA-DNN-HMM         3.24   14.52  17.82  34.55  22.53
    Semi-Ada-DNN-HMM    4.13   17.55  15.67  37.73  25.11
    DNN-PP [35]         5.1    12.0   10.5   29.0   18.7

For the target-domain data, we do not have senone labels. Hence we first decode the unlabeled target data using the Clean-DNN-HMM model and obtain its senone labels. Please note that the resulting senone labels inevitably contain errors. The adapted model, namely Semi-Ada-DNN-HMM, is then obtained by fine-tuning the Clean-DNN-HMM acoustic model using these labels. The Semi-Ada-DNN-HMM model is used to test the target-domain test data.

Table 1 shows the experimental results. From the results, we notice that the Clean-DNN-HMM model, which is trained using clean data, performs badly under noisy and channel mismatch conditions. The word error rate sharply increases from 3.36% to 50.73% when the system encounters both noise and channel distortions. Meanwhile, we clearly observe that the DDA-DNN-HMM model consistently reduces the word error rates for all testing subsets. Especially for the most challenging condition, i.e., subset D (with both noise and channel distortion), the WER is significantly reduced from 50.73% to 34.55%. On average, DDA-DNN-HMM achieves a relative WER reduction of 37.8% (from 36.22% to 22.53%). Our approach is even better than the Semi-Ada-DNN-HMM model, because of the inevitably wrong senone labels used for model fine-tuning in the semi-supervised approach. The average WER of DDA-DNN-HMM is even close to that of DNN-PP [35], a method that needs pairwise clean-noisy data for front-end speech enhancement.

4.2. Impact of Hyper-parameters

We also investigate the impact of the hyper-parameter λ, the position of the feature representation layer f and the amount of adaptation data. Their impacts are depicted in Figure 3. Figure 3 (a) shows how λ affects the average WER. When λ = 0, the DDA-DNN-HMM model becomes the Clean-DNN-HMM model, in which the domain predictor is not working. We can see that the WER goes down with the increase of λ and the lowest WER is achieved when λ = 0.45. On the contrary, when we set λ to a value below zero, the WER increases. This is because the domain difference is enlarged when λ is set to a negative value, as seen in Eq. (6). Another factor which may affect the DDA-DNN-HMM acoustic model is the position where we place the feature layer f. If we regard Gf and Gy as one whole network and move the feature representation layer from the top (near the softmax layer of Gy) downwards (toward the input of Gf), we find that the WER increases, as shown in Figure 3 (b). Figure 3 (c) shows the relationship between the WER and the amount of adaptation data. We find that the performance improves with the increase of adaptation data, but beyond 4000 adaptation utterances the performance gain becomes very small.
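The position of the feature layer f can be exposed as a constructor argument that selects which hidden activation of the Gf/Gy stack is branched off to the domain predictor. The sketch below uses assumed layer sizes and omits the gradient reversal wrapper from the earlier sketch for brevity; it shows one possible way to parameterize this choice, not our exact implementation.

    import torch.nn as nn

    class DDANetAtLayer(nn.Module):
        # The domain branch taps the activations of hidden layer
        # `feature_layer` (counted from the input) instead of the top layer.
        def __init__(self, feat_dim=1320, num_senones=2000, hidden=1024,
                     num_hidden=6, feature_layer=6):
            super().__init__()
            dims = [feat_dim] + [hidden] * num_hidden
            self.hidden_layers = nn.ModuleList(
                nn.Sequential(nn.Linear(dims[i], dims[i + 1]), nn.Sigmoid())
                for i in range(num_hidden))
            self.feature_layer = feature_layer
            self.softmax_layer = nn.Linear(hidden, num_senones)    # top of Gy
            self.domain_predictor = nn.Linear(hidden, 2)           # Gd

        def forward(self, x):
            h, f = x, None
            for i, layer in enumerate(self.hidden_layers, start=1):
                h = layer(h)
                if i == self.feature_layer:
                    f = h          # representation fed to the domain branch
            return self.softmax_layer(h), self.domain_predictor(f)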
4.3. Multi-condition training with surprise noise testing

As we pointed out in Section 2, multi-condition training is an effective approach to improve the robustness of an ASR system. This is achieved by training the acoustic model using contaminated speech, so that the distributions of the training data and the test data become identical or similar. However, in the real world, multi-condition training cannot cover all types of contamination (noise or channel distortion). We carry out an experiment to check whether the DDA approach still works when the multi-condition trained ASR system encounters surprise types of noise. In the experiment, the test data is derived by adding three kinds of new noise to the clean test data at 5-10 dB SNR 3. The multi-condition DNN-HMM, denoted as MultiCon-DNN-HMM, is trained only using the multi-condition training data from Aurora-4. The DDA-DNN-HMM is trained using the multi-condition training data and 3000 noisy utterances corrupted by the three new noises. The network is the same as that in Section 4.1. Results are summarized in Table 2. We notice that multi-condition training is quite effective and the WER of MultiCon-DNN-HMM is significantly decreased compared with the Clean-DNN-HMM in Table 1. But with the DDA approach, the WER is further reduced from 8.22% to 7.45%, a relative WER reduction of 9.36%.

3 These three types of noise are from another noise dataset and they are totally different from the noises in Aurora-4.
[Three panels, each with WER (%) on the vertical axis: (a) Impact of λ, with λ ranging from -0.4 to 0.4; (b) Impact of feature layer position, with layer index from 4 to 6; (c) Impact of adaptation data size, with the number of adaptation utterances from 0 to 7000.]
Figure 3: Relationship between WER and (a) hyper-parameter λ, (b) position of feature representation layer and (c) the
amount of adaptation data. For comparison, the blue dotted line represents the WER of Clean-DNN-HMM.

Table 2: Experimental results for multi-condition training with surprise noise testing on Aurora-4.

    Model                WER (%)
    MultiCon-DNN-HMM     8.22
    DDA-DNN-HMM          7.45

Table 3: Experimental results for domain shift. The DDA-DNN-HMM acoustic model is trained using 80h WSJ labelled data and 30h LibriSpeech unlabelled data.

    Model          WER (%)
    Baseline       31.19
    DDA-DNN-HMM    29.40

5. Experiments for Domain Shift

As we discussed in Section 1, real-world speech is heterogeneous, with different genres. We test the proposed DDA approach to see whether it shows robustness when the speech recognizer is used in another domain.

In this experiment, we regard the WSJ [46] and LibriSpeech [47] corpora as data from different "domains". The WSJ0 and WSJ1 corpora4 consist primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal news text. WSJ0 includes a 5000-word text while WSJ1 includes a 20000-word text. Each utterance was recorded on two channels: a high-quality "primary" microphone (a head-mounted, noise-cancelling Sennheiser HMD410) and an additional microphone (a desk-mounted Crown or other). The total duration of WSJ0 and WSJ1 is about 80 hours. LibriSpeech is a 1000-hour corpus derived from audiobooks that are part of the LibriVox project. The WSJ and LibriSpeech corpora can be used to train large vocabulary continuous speech recognition (LVCSR) acoustic models.

4 The WSJ corpus contains WSJ0 and WSJ1.

We first train a GMM-HMM acoustic model according to the configuration in [47], resulting in 3414 senones. Then, we train a DNN-HMM acoustic model using the 80-hour WSJ data as a baseline system. After that, we train a DDA-DNN-HMM acoustic model using the 80-hour WSJ data (with senone labels) and 40 hours of adaptation data from LibriSpeech (without senone labels) taken from the 500-hour "train-other-500" subset. The DNN has the same topology as the one used in Section 4. We use the 5.4-hour LibriSpeech "test-other" set for testing. Table 3 shows the results on this test set. We can see that a relative WER reduction of about 6.9% is achieved when the DDA approach is used. This confirms that the proposed approach shows robustness to domain shift.

6. Analysis

As we mentioned in Section 3.1, our purpose is to learn domain-invariant feature representations which have the same or similar distributions in the source and the target domains.
[Two scatter plots, "Feature Representation of Clean-DNN-HMM" (top) and "Feature Representation of DDA-DNN-HMM" (bottom), each plotting Dim2 against Dim1 for clean speech and noisy speech frames.]
Figure 4: Comparison of learned feature representations of Clean-DNN-HMM and DDA-DNN-HMM. The top figure is obtained
by feeding the clean and corresponding noisy speech to the Clean-DNN-HMM acoustic model described in Section 4.1. The
bottom figure is obtained by feeding the same clean and noisy speech to the DDA-DNN-HMM acoustic model. We only visualize
two dimensions for clarity.

In our experiments, we regard the training data as the source domain and the test data as the target domain. The learned feature representations, denoted as f in Figure 1, can be visualized for analysis. The dimension of this representation is 1024 in our model and we randomly choose two dimensions to visualize. To this end, we feed some clean speech frames and the corresponding noisy speech frames to the Clean-DNN-HMM and DDA-DNN-HMM models discussed in Section 4.1, respectively, and the two feature dimensions are plotted in Figure 4. From the top figure in Figure 4, it is obvious that the representations of clean speech (denoted as red points) and noisy speech (denoted as green points) obtained by the Clean-DNN-HMM acoustic model have very different distributions, which shows the mismatch between the training and test data. In contrast, this difference in distributions clearly becomes smaller for DDA-DNN-HMM, in which the deep domain adaptation approach effectively narrows the training-testing mismatch.
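This visualization can be reproduced along the following lines: forward clean and noisy frames through an acoustic model, take the activations of the feature layer f, and scatter-plot two randomly chosen dimensions. The feature extraction callable below is a placeholder for whichever model is being inspected.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_feature_scatter(extract_f, clean_frames, noisy_frames, seed=0):
        # extract_f: placeholder callable mapping a batch of input frames to the
        # (num_frames, 1024) activations of the feature layer f.
        rng = np.random.RandomState(seed)
        dims = rng.choice(1024, size=2, replace=False)    # two random dimensions
        f_clean = extract_f(clean_frames)[:, dims]
        f_noisy = extract_f(noisy_frames)[:, dims]
        plt.scatter(f_clean[:, 0], f_clean[:, 1], c='r', s=4, label='Clean Speech')
        plt.scatter(f_noisy[:, 0], f_noisy[:, 1], c='g', s=4, label='Noisy Speech')
        plt.xlabel('Dim1')
        plt.ylabel('Dim2')
        plt.legend()
        plt.show()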
7. Conclusion

In this paper, we have addressed the training-testing mismatch problem in speech recognition using an unsupervised deep domain adaptation approach. Through a multi-task learning framework, a deep neural network feature extractor is learned by minimizing the loss of the phoneme classifier (main task) while maximizing the loss of the domain classifier (secondary task). Specifically, during acoustic model training, the domain classifier tries to eliminate the differences in data distribution between the source and the target domains. This approach significantly improves the performance of the DNN acoustic model using some unlabeled data from the new domain. When evaluated in the "clean condition training and multi-condition testing" scenario on the Aurora-4 corpus, the proposed approach decreases the word error rate from 36.22% to 22.53%, a relative error reduction of 37.8%. In the domain shift experiment, the approach achieves a 6.9% relative word error rate reduction. Analysis shows that the performance gain comes from the elimination of the mismatches between the distributions of the training and testing data. In future work, we plan to implement the domain adaptation approach in convolutional neural networks (CNNs) [48] and recurrent neural networks (RNNs) [49], which have shown superior performance in speech recognition.
We also want to investigate the performance of treating different types of noises as different domains in the DDA framework. We notice that a recent multi-task learning (MTL) approach shares a similar idea with our proposed DDA approach. In [50], an MTL approach is proposed to simultaneously predict the class label and the clean speech from the noisy speech input. We plan to experimentally compare the DDA approach with this MTL approach in our future work.

Acknowledgements

We would like to thank Yaroslav Ganin for the constructive discussions when performing this study.

References

[1] L. Deng, D. Yu, Deep learning: methods and applications, Tech. rep. (May 2014).
[2] G. Hinton, L. Deng, D. Yu, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, G. Dahl, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition, IEEE Signal Processing Magazine 29 (6) (2012) 82–97.
[3] E. Trentin, M. Gori, A survey of hybrid ANN/HMM models for automatic speech recognition, Neurocomputing 37 (1) (2001) 91–126.
[4] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (7) (2006) 1527–1554.
[5] G. E. Dahl, T. N. Sainath, G. E. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 8609–8613.
[6] M. Xin, H. Zhang, H. Wang, M. Sun, D. Yuan, Arch: Adaptive recurrent-convolutional hybrid networks for long-term action recognition, Neurocomputing 178 (2016) 87–102.
[7] P. Miao, Y. Shen, Y. Li, L. Bao, Finite-time recurrent neural networks for solving nonlinear optimization problems and their application, Neurocomputing.
[8] M. S. Ali, S. Saravanan, Robust finite-time H∞ control for a class of uncertain switched neural networks of neutral-type with distributed time varying delays, Neurocomputing.
[9] N. Nedjah, F. M. G. França, M. De Gregorio, L. de Macedo Mourelle, Weightless neural systems, Neurocomputing.
[10] P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, H. Hao, Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification, Neurocomputing 174 (2016) 806–814.
[11] T. Virtanen, R. Singh, B. Raj, Techniques for Noise Robustness in Automatic Speech Recognition, John Wiley & Sons, 2012.
[12] J. Li, L. Deng, Y. Gong, R. Haeb-Umbach, An overview of noise-robust automatic speech recognition, Audio, Speech, and Language Processing, IEEE/ACM Transactions on 22 (4) (2014) 745–777.
[13] V. Peddinti, G. Chen, D. Povey, S. Khudanpur, Reverberation robust acoustic modeling using i-vectors with time delay neural networks, in: Proceedings of INTERSPEECH, ISCA.
[14] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, T. Yoshioka, A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research, EURASIP Journal on Advances in Signal Processing 2016 (1) (2016) 1–19.
[15] H.-G. Hirsch, D. Pearce, The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in: ASR2000 - Automatic Speech Recognition: Challenges for the New Millennium, ISCA Tutorial and Research Workshop (ITRW), 2000.
[16] B. Li, Noise-robust speech recognition using deep neural network, Ph.D. thesis, National University of Singapore (2014).
[17] M. L. Seltzer, D. Yu, Y. Wang, An investigation of deep neural networks for noise robust speech recognition, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 7398–7402.
[18] Y. Qian, T. Tan, D. Yu, An investigation into using parallel data for far-field speech recognition.
[19] Y. Estève, P. Deléglise, Adaptation and discriminative training of acoustic models, Techniques for Noise Robustness in Automatic Speech Recognition (2012) 283–310.
[20] U. Remes, K. J. Palomaki, M. Kurimo, Robust automatic speech recognition using acoustic model adaptation prior to missing feature reconstruction, in: Signal Processing Conference, 2009 17th European, IEEE, 2009, pp. 535–539.
[21] V. Gupta, P. Kenny, P. Ouellet, T. Stafylakis, I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription, in: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, IEEE, 2014, pp. 6334–6338.
[22] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. W. Vaughan, A theory of learning from different domains, Machine Learning 79 (1-2) (2010) 151–175.
[23] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), JMLR Workshop and Conference Proceedings, 2015, pp. 1180–1189.
[24] M. Mohri, F. Pereira, M. Riley, Weighted finite-state transducers in speech recognition, Computer Speech & Language 16 (1) (2002) 69–88.
[25] M. Westphal, The use of cepstral means in conversational speech recognition, in: EUROSPEECH, 1997.
[26] S. Molau, F. Hilger, H. Ney, Feature space normalization in adverse acoustic conditions, in: Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP'03). 2003 IEEE International Conference on, Vol. 1, IEEE, 2003, pp. I-656.
[27] F. Hilger, H. Ney, Quantile based histogram equalization for noise robust large vocabulary speech recognition, Audio, Speech, and Language Processing, IEEE Transactions on 14 (3) (2006) 845–854.
[28] S. F. Boll, Suppression of acoustic noise in speech using spectral subtraction, Acoustics, Speech and Signal Processing, IEEE Transactions on 27 (2) (1979) 113–120.
[29] J. Koehler, N. Morgan, H. Hermansky, H. G. Hirsch, G. Tong, Integrating RASTA-PLP into speech recognition, in: Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, Vol. 1, IEEE, 1994, pp. I-421.
[30] J.-L. Gauvain, C.-H. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, Speech and Audio Processing, IEEE Transactions on 2 (2) (1994) 291–298.
[31] M. J. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech & Language 12 (2) (1998) 75–98.
[32] M. Gales, S. Young, Parallel model combination for speech recognition in noise, University of Cambridge, Department of Engineering, 1993.
[33] P. J. Moreno, B. Raj, R. M. Stern, A vector Taylor series approach for environment-independent speech recognition, in: Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, Vol. 2, IEEE, 1996, pp. 733–736.
[34] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, A. Y. Ng, Recurrent neural networks for noise reduction in robust ASR, in: INTERSPEECH, 2012, pp. 22–25.
[35] J. Du, Q. Wang, T. Gao, Y. Xu, L.-R. Dai, C.-H. Lee, Robust speech recognition with speech enhanced deep neural networks, in: INTERSPEECH, 2014, pp. 616–620.
[36] T. Gao, J. Du, L.-R. Dai, C.-H. Lee, Joint training of front-end and back-end deep neural networks for robust speech recognition, in: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, IEEE, 2015, pp. 4375–4379.
[37] K. Lee, S. J. Kang, W. H. Kang, N. S. Kim, Two-stage noise aware training using asymmetric deep denoising autoencoder.
[38] Y. Qian, M. Yin, Y. You, K. Yu, Multi-task joint-learning of deep neural networks for robust speech recognition, in: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE, 2015, pp. 310–316.
[39] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M. S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48.
[40] C. Hong, J. Yu, J. Wan, D. Tao, M. Wang, Multimodal deep autoencoder for human pose recovery, IEEE Transactions on Image Processing 24 (2015) 5659–5670.
[41] C. Hong, J. Yu, J. You, X. Chen, D. Tao, Multi-view ensemble manifold regularization for 3D object recognition, Information Sciences 320 (2015) 395–405.
[42] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT'2010, Springer, 2010, pp. 177–186.
[43] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, J. C. Lai, Class-based n-gram models of natural language, Computational Linguistics 18 (4) (1992) 467–479.
[44] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., The Kaldi speech recognition toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584, IEEE Signal Processing Society, 2011.
[45] S. J. Young, N. Russell, J. Thornton, Token passing: a simple conceptual model for connected speech recognition systems, Cambridge University Engineering Department, Cambridge, UK, 1989.
[46] D. B. Paul, J. M. Baker, The design for the Wall Street Journal-based CSR corpus, in: Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, 1992, pp. 357–362.
[47] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, IEEE, 2015.
[48] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, G. Penn, Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, in: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, IEEE, 2012, pp. 4277–4280.
[49] A. Graves, A.-r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 6645–6649.
[50] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, M. Bacchiani, Neural network adaptive beamforming for robust multichannel speech recognition, in: Proc. Interspeech, 2016.
