Robust Speech Recognition on Intelligent Mobile
Devices with Dual-Microphone

Doctoral Thesis

Author: Ivan Lopez Espejo
Supervisors: Antonio M. Peinado and Angel M. Gomez

Ph.D. Program in Information and Communication Technologies
Dept. of Signal Theory, Telematics and Communications
University of Granada

Granada, 22nd September 2017
PART I
Introduction, Motivation and Objectives
ASR in Noisy Conditions
Overview

Automatic speech recognition (ASR) is a mature technology under controlled
conditions.

There is a gap in performance between humans and machines due to the
mismatch between the training and testing conditions of the ASR system:
- Intra-speaker variability (mood, presence of illness...)
- Inter-speaker variability
- Transmission channel
- Reverberation
- ...
- Background (additive) noise

While human beings exhibit a high degree of robustness against noise when
recognizing speech, noise can make ASR systems unusable.
ASR in Noisy Conditions
ASR fundamentals

Front-end: feature extraction, e.g. MFCCs (X, features).
Back-end: speech decoding (W, transcription).

- p(X|W) is the acoustic score
- P(W) is the language score

EXAMPLE: if acoustic models are trained with clean speech data and we try to
recognize noisy speech data, the mismatch will cause a wrong transcription.
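The two scores above combine in the standard MAP decoding rule, which the
front-/back-end split implies even though the slide does not write it out:

```latex
\hat{W} = \operatorname*{arg\,max}_{W} \; p(X \mid W)\, P(W)
```

Training/testing mismatch corrupts the acoustic score p(X|W), which is why a
clean-trained model mis-ranks hypotheses on noisy X.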
ASR in Noisy Conditions
Speech distortion modeling

We consider the classical linear speech distortion model...

Time domain: y(m) = h(m) * x(m) + n(m)   (* denotes convolution)
Power spectral domain: |Y(f,t)|^2 = |H(f,t)|^2 |X(f,t)|^2 + |N(f,t)|^2

[Figure: histograms (relative frequency) of noisy speech and noise power
spectral values.]
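The additive power-spectral model above holds on average because the
speech-noise cross terms vanish for uncorrelated signals. A minimal numeric
sanity check, with white synthetic signals standing in for speech and noise
(illustrative stand-ins, not data from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
M, F = 4000, 256            # number of frames, frame length
h = np.array([1.0, 0.5])    # short channel impulse response

# White stand-ins for speech x and additive noise n (frame-wise)
x = rng.standard_normal((M, F))
n = 0.5 * rng.standard_normal((M, F))
# Per-frame linear filtering plus noise: y = h * x + n
y = np.stack([np.convolve(xi, h)[:F] for xi in x]) + n

X = np.fft.rfft(x, axis=1)
N = np.fft.rfft(n, axis=1)
Y = np.fft.rfft(y, axis=1)
H = np.fft.rfft(h, F)

# |Y|^2 ~= |H|^2 |X|^2 + |N|^2 holds on average (cross terms average out)
lhs = np.mean(np.abs(Y) ** 2, axis=0)
rhs = np.mean(np.abs(H) ** 2 * np.abs(X) ** 2 + np.abs(N) ** 2, axis=0)
rel_err = np.max(np.abs(lhs - rhs) / rhs)
print(f"max per-bin relative error: {rel_err:.3f}")
```

With a few thousand frames, the per-bin relative error drops to the percent
level, as expected from the 1/sqrt(M) averaging of the cross terms.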
ASR in Noisy Conditions
Approaches to strengthen ASR systems

A possible taxonomy of the noise-robust methods for ASR:

Feature-space approaches
- Noise-robust features (RASTA-PLP, TANDEM...), normalization of
  statistical moments of the features (CMN, HEQ...) and feature
  enhancement (spectral subtraction, Wiener filtering, AFE...)

Model-based approaches
- Model adaptation (CMLLR...) and adaptive training (fNAT, SAT...)

Compensation with explicit distortion modeling
- Model adaptation or feature compensation (VTS...)

Missing-data approaches
- Ignoring unreliable elements during recognition (marginalization,
  SFD...) and data imputation (TGI...)

More approaches: stereo data learning-based techniques (SPLICE, DNNs...),
exemplar-based techniques (NMF...), etc.
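Of the feature-space approaches listed above, cepstral mean normalization
(CMN) is the simplest to make concrete: a stationary convolutional channel
becomes an additive constant in the cepstral domain, so subtracting the
per-utterance mean removes it. A minimal sketch on synthetic features (toy
data, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 200, 13                      # frames, cepstral dimension
cep_clean = rng.standard_normal((T, D))
channel = rng.standard_normal(D)    # stationary channel (cepstral-domain bias)
cep_noisy = cep_clean + channel     # convolution -> constant additive offset

def cmn(c):
    """Cepstral mean normalization: remove the per-utterance mean."""
    return c - c.mean(axis=0, keepdims=True)

# After CMN the channel bias cancels exactly: both versions coincide
diff = np.max(np.abs(cmn(cep_noisy) - cmn(cep_clean)))
print(f"max difference after CMN: {diff:.2e}")
```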
ASR on Mobile Devices with Various Sensors
Overview and motivation

Intelligent mobile devices (IMDs), e.g. smartphones or tablets, have
revolutionized the way we live.

ASR has experienced a new upswing (search-by-voice, dictation, voice
control...).

Mobile devices can be used anywhere at any time ⇒ tackling noise is more
important than ever before!

We can take advantage of the small microphone arrays embedded in the latest
IMDs for noise-robust ASR purposes.
ASR on Mobile Devices with Various Sensors
Approaches to strengthen ASR systems

Performance benefits from a combination of single- and multi-channel
techniques.

Spatial filtering (beamforming) is a popular choice ⇒ see the 3rd and 4th
CHiME Challenges.
- Delay-and-sum, MVDR, adaptive array processing (GSC...)

Beamforming shortcomings are partially overcome by post-filtering
(e.g. MVDR + Wiener post-filter = MCWF).
- Low directivity at low frequencies, inaccurate estimation of the
  steering vector, inability to remove noise coming from the look
  direction, etc.

Problem: beamforming exhibits some important constraints when performing on
arrays comprised of a few sensors close to each other (IMDs) ⇒ specific
solutions ⇒ power level difference (PLD).
ASR on Mobile Devices with Various Sensors
Power level difference (PLD)

[Figure: PSDs (dB/Hz) of clean speech and car noise at channels 1 and 2,
over 0-4 kHz, in close-talk and far-talk positions.]

The 2nd sensor is placed in an acoustic shadow ⇒ speech is attenuated at the
2nd mic with respect to the 1st one.

Diffuse noise field ⇒ similar noise PSDs at both mics.
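The two observations above (speech attenuated at mic 2, similar noise PSDs
at both mics) suggest a simple per-bin speech-presence indicator: the
normalized power difference between the channels, which is the form commonly
used in the PLD literature. The sketch below illustrates that idea with toy
numbers; it is not the exact estimator used in the thesis:

```python
import numpy as np

def pld(py1, py2, eps=1e-12):
    """Normalized power level difference between two channels.

    Near 1 where channel 1 dominates (speech present at mic 1);
    near 0 where both channels carry similar power (diffuse noise only).
    """
    return (py1 - py2) / (py1 + py2 + eps)

# Toy power values: a speech-dominated bin vs. a diffuse-noise bin
speech_bin = pld(py1=10.0, py2=1.0)   # speech attenuated at mic 2
noise_bin = pld(py1=2.0, py2=2.0)     # similar noise power at both mics
print(speech_bin, noise_bin)
```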
Objectives of this Thesis

As...
1. Latest IMDs embed small microphone arrays.
2. The performance of beamforming for robust ASR on IMDs is limited.

PART II
Contributions and Results
Summary

1. Multi-Channel Power Spectrum Enhancement
   - Combinatorial Strategy
   - Dual-Channel Spectral Subtraction (DCSS)
   - Power-Minimum Variance Distortionless Response (P-MVDR)
   - Dual-Channel Spectral Weighting (DSW)
   - MMSE-Based Relative Speech Gain Estimation
2. Dual-Channel Vector Taylor Series Feature Compensation
   - Dual-Channel VTS Feature Compensation
   - Calculation of the Posterior Probabilities (VTS_S)
   - Alternative Posterior Computation (VTS_C)
   - Clean Speech Partial Estimate Computation
3. Dual-Channel Deep Learning Techniques
   - Overview
   - DNN-Based Missing-Data Mask Estimation
   - DNN-Based Noise Estimation
4. Results
   - Experimental Framework
   - Power Spectrum Enhancement
   - VTS Feature Compensation
   - Deep Learning-Based Techniques
5. Summary, Conclusions and Future Work
   - Summary and Conclusions
   - Future Work
Multi-Channel Power Spectrum Enhancement

I. Lopez-Espejo, A. M. Peinado, A. M. Gomez and J. A. Gonzalez: Dual-Channel
Spectral Weighting for Robust Speech Recognition in Mobile Devices. Submitted
to Digital Signal Processing (major revision).

I. Lopez-Espejo, A. M. Gomez, J. A. Gonzalez and A. M. Peinado: Feature
Enhancement for Robust Speech Recognition on Smartphones with
Dual-Microphone. In Proceedings of the 22nd European Signal Processing
Conference, September 1-5, Lisbon (Portugal), 2014. Best student paper award.
Combinatorial Strategy

[Figure: block diagram of the combinatorial strategy.]

As in the literature, we optionally consider microphone array pre-processing
(beamforming).

The virtual primary channel has a higher SNR than any other signal from the
IMD.

Dual-Channel Spectral Subtraction (DCSS)

SS is extended to a dual-channel framework from

|Y1(f,t)|^2 = |X1(f,t)|^2 + |N1(f,t)|^2
|Y2(f,t)|^2 = |X2(f,t)|^2 + |N2(f,t)|^2
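Building on the two equations above and on the PLD observations (speech
negligible at mic 2 in close-talk, noise PSDs similar at both mics), a
minimal dual-channel spectral subtraction can use the second channel's power
spectrum as the noise estimate for the first. This is an illustrative sketch
under those assumptions, not the exact DCSS rule from the thesis:

```python
import numpy as np

def dcss(py1, py2, floor=0.01):
    """Dual-channel spectral subtraction sketch.

    Assumes |X2|^2 ~ 0 (acoustic shadow, close-talk) and |N1|^2 ~ |N2|^2
    (diffuse noise), so |Y2|^2 serves as the noise estimate for channel 1.
    """
    px1 = py1 - py2                      # subtract the noise estimate
    return np.maximum(px1, floor * py1)  # spectral floor avoids negatives

# Toy bins: [speech(8) + noise(2), noise-only(2)] at mic 1; noise(2) at mic 2
py1 = np.array([10.0, 2.0])
py2 = np.array([2.0, 2.0])
print(dcss(py1, py2))
```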
Power-Minimum Variance Distortionless Response (P-MVDR)

P-MVDR discards the phase information to overcome the limitations of
classical MVDR beamforming when applied to our dual-channel framework:

|X̂1(f,t)|^2 = w_{f,t}^T ( |Y1(f,t)|^2 , |Y2(f,t)|^2 )^T

subject to w_{f,t}^T (1, A21(f,t))^T = 1   (distortionless constraint)
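The slide states the constraint but not the solution. For a minimum-variance
criterion under a distortionless constraint, the classical closed form is
w = Φ⁻¹d / (dᵀΦ⁻¹d), here applied to the 2-dimensional vector of channel
power spectra with steering vector d = (1, A21)ᵀ. The sketch below shows
that standard solution; the covariance Φ and its toy values are assumptions
of this illustration, not the thesis' estimator:

```python
import numpy as np

def pmvdr_weights(phi, a21):
    """Minimum-variance weights under the distortionless constraint
    w^T d = 1 with d = (1, a21)^T: w = phi^-1 d / (d^T phi^-1 d)."""
    d = np.array([1.0, a21])
    phi_inv_d = np.linalg.solve(phi, d)
    return phi_inv_d / (d @ phi_inv_d)

# Hypothetical 2x2 covariance of the stacked power spectra
phi = np.array([[2.0, 0.5],
                [0.5, 1.0]])
w = pmvdr_weights(phi, a21=0.3)
print(w, w @ np.array([1.0, 0.3]))   # the constraint w^T d = 1 holds
```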
Dual-Channel Spectral Weighting (DSW)
Biased spectral weighting (DSW-B)

As Ŝ_n1(f,t) = S_y2(f,t), the biased Wiener filter results in

H_{1,b}(f,t) = ( S_y1(f,t) - S_y2(f,t) ) / S_y1(f,t)

[Figure: histograms (normalized frequency) of log A21(f) at bin 3643.41 Hz,
anechoic and room conditions.]

The 2nd mic also captures speech (by diffraction, reflections...) ⇒
A21(f,t) is non-zero (|X2(f,t)|^2 = A21(f,t) |X1(f,t)|^2).

A bias correction term is introduced:

H_{1,u}(f,t) = [ 1 / (1 - A21(f,t)) ] · [ ( S_y1(f,t) - S_y2(f,t) ) / S_y1(f,t) ]
             =       B^{-1}(f,t)      ·               H_{1,b}(f,t)

[Figure: H_{1,u}(f,t) versus H_{1,b}(f,t) for A21(f,t) in
{0, 0.2, 0.4, 0.6, 0.8}.]
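A quick numerical check of the bias correction above: with S_y1 = S_x + S_n
and S_y2 = A21·S_x + S_n (equal noise PSDs at both mics), H_{1,u} reduces
exactly to the ideal Wiener gain S_x/(S_x + S_n). A sketch with synthetic
per-bin PSD values (toy numbers, not from the thesis):

```python
import numpy as np

def h_biased(sy1, sy2):
    return (sy1 - sy2) / sy1

def h_unbiased(sy1, sy2, a21):
    return h_biased(sy1, sy2) / (1.0 - a21)   # bias correction B^-1

# Synthetic per-bin PSDs (illustrative values)
sx = np.array([4.0, 1.0, 0.2])
sn = np.array([1.0, 1.0, 1.0])
a21 = 0.3
sy1 = sx + sn              # noisy PSD at mic 1
sy2 = a21 * sx + sn        # speech attenuated by A21 at mic 2, same noise PSD

wiener = sx / (sx + sn)    # ideal Wiener gain
print(h_unbiased(sy1, sy2, a21) - wiener)
```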
Dual-Channel Spectral Weighting (DSW)
Noise equalization (Eq)

The assumption S_n1(f,t) ≈ S_n2(f,t) may not be satisfied even in the
presence of a homogeneous noise field.

Inspired by MVDR beamforming,

ĝ_{f,t} = argmin_{g_{f,t}} E[ ( |N1(f,t)|^2 + std(|N1(f,t)|^2) - g_{f,t}^T ν(f,t) )^2 ]

subject to g_{f,t}^T φ(f,t) = 1   (distortionless constraint)

[Figure: average power (dB) versus frequency bin of |N1(f,t)|^2, |N2(f,t)|^2
and two equalized versions of |N2(f,t)|^2.]
Dual-Channel Spectral Weighting (DSW)
System overview

[Figure: block diagram of the DSW system.]

The spectro-temporal correlation of the speech signal is exploited to refine
the Wiener filter weights:
- Two-dimensional median filtering
- Two-dimensional Gaussian filtering
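The two refinement steps above can be sketched directly with standard image
filters applied to the time-frequency gain matrix. Kernel sizes and the toy
gain matrix below are illustrative assumptions, not the thesis' settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

rng = np.random.default_rng(2)
# Toy Wiener gain matrix (frequency bins x time frames) with an outlier
gains = np.clip(0.7 + 0.05 * rng.standard_normal((23, 100)), 0.0, 1.0)
gains[5, 50] = 0.0   # isolated musical-noise-like outlier

smoothed = median_filter(gains, size=(3, 3))     # removes isolated outliers
smoothed = gaussian_filter(smoothed, sigma=1.0)  # spectro-temporal smoothing
print(gains[5, 50], smoothed[5, 50])             # the outlier is filled in
```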
MMSE-Based Relative Speech Gain Estimation

Some definitions...

A21(f,t) = |A21(f,t)|^2   (power-domain vs. STFT-domain relative speech gain)
a21 = ( A21(0,t), ..., A21(M-1,t) )^T   (in the STFT domain)
a21 = a21^r + j·a21^i

[Figure: histograms of the real and imaginary STFT coefficient values of the
relative speech gain.]
MMSE-Based Relative Speech Gain Estimation

The additive distortion model in the STFT domain:

y1 = x1 + n1
y2 = x2 + n2 = a21 ⊙ x1 + n2   (⊙ is the element-wise product)

Dual-Channel Vector Taylor Series Feature Compensation

I. Lopez-Espejo, A. M. Peinado, A. M. Gomez and J. A. Gonzalez: Dual-Channel
VTS Feature Compensation for Noise-Robust Speech Recognition on Mobile
Devices. IET Signal Processing, 11:17-25, 2017.
Dual-Channel VTS Feature Compensation
MMSE estimation

The log-Mel clean speech features are estimated at every time frame under an
MMSE approach as

x̂1 = E[x1|y] = Σ_{k=1}^{K} P(k|y) E[x1|y,k]
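The estimator above is a posterior-weighted sum over the K components of a
clean-speech GMM. Given per-component posteriors P(k|y) and partial
estimates E[x1|y,k] (filled with toy numbers here), the combination is a
single weighted average:

```python
import numpy as np

rng = np.random.default_rng(3)
K, M = 4, 23                          # GMM components, log-Mel dimension
log_scores = rng.standard_normal(K)   # toy per-component log p(y|k)P(k)

# Posteriors via a stable softmax (normalized Bayes rule)
post = np.exp(log_scores - log_scores.max())
post /= post.sum()

partial = rng.standard_normal((K, M))  # toy E[x1 | y, k] per component
x1_hat = post @ partial                # MMSE: sum_k P(k|y) E[x1|y,k]
print(x1_hat.shape)
```

Since the posteriors are non-negative and sum to one, the estimate is a
convex combination of the partial estimates.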
Calculation of the Posterior Probabilities (VTS_S)

By using the Bayes rule, the posteriors are defined as

P(k|y) = p(y|k)P(k) / Σ_{k'=1}^{K} p(y|k')P(k'),
with p(y|k) = N( y; μ_y^(k), Σ_y^(k) )

The dual-channel model is given by the following stacked vector:

y = ( y1; y2 ) = ( f(x1, a11, n1); f(x1, a21, n2) ),

where y_i = f(x1, a_i1, n_i) = x1 + a_i1 + log( 1_{M,1} + e^{n_i - x1 - a_i1} )

Again, it is assumed that all the variables involved are Gaussian.

First-order VTS expansion of the dual-channel model:

f(x1, a_i1, n_i) ≈ f( μ_x1^(k), μ_a_i1, μ_n_i ) + J_x^(i,k) ( x1 - μ_x1^(k) ) +
                   J_a^(i,k) ( a_i1 - μ_a_i1 ) + J_n^(i,k) ( n_i - μ_n_i )

The required parameters are now easily estimated:

μ_y^(k) ≈ ( f(μ_x1^(k), μ_a11, μ_n1); f(μ_x1^(k), μ_a21, μ_n2) )
Σ_y^(k) ≈ J_x^(k) Σ_x1^(k) (J_x^(k))^T + J_a^(k) Σ_a (J_a^(k))^T + J_n^(k) Σ_n (J_n^(k))^T
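The mismatch function f above has two easy-to-check limits: when the noise
is far below the channel-attenuated speech, y_i ≈ x1 + a_i1, and when the
noise dominates, y_i ≈ n_i. A numeric sketch of f in the scalar case:

```python
import numpy as np

def f(x1, a, n):
    """Log-Mel mismatch function: y = x + a + log(1 + exp(n - x - a))."""
    return x1 + a + np.log1p(np.exp(n - x1 - a))

x1, a = 10.0, -1.0
print(f(x1, a, n=-20.0))   # noise negligible -> close to x1 + a = 9.0
print(f(x1, a, n=30.0))    # noise dominant  -> close to n = 30.0
```

It is this smooth nonlinearity that the first-order VTS expansion
linearizes around each Gaussian component's mean.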
Alternative Posterior Computation (VTS_C)

The 2nd channel is treated in a parallel manner to the 1st one. However,
clean speech will easily be masked by noise at the 2nd channel ⇒ a highly
uncertain relation between the 2nd noisy channel and clean speech.

We have found it more robust to condition the 2nd channel on the 1st
channel.
Clean Speech Partial Estimate Computation

The clean speech partial expected values are defined as

E[x1|y,k] = ∫ x1 p(x1|y,k) dx1

Two different proposals are considered for clean speech partial estimate
computation.

2nd approach (VTS_b): only the information from the 1st channel is used:

E[x1|y,k] ≈ E[x1|y1,k] = y1 - E[ log( 1_{M,1} + e^{n1 - x1} ) | y1, k ]
                              \______________________________________/
                                      E[ g(x1, a11, n1) | y1, k ]
Dual-Channel Deep Learning Techniques

I. Lopez-Espejo, A. M. Peinado, A. M. Gomez and J. M. Martín-Doñas: Deep
Neural Network-Based Noise Estimation for Robust ASR in Dual-Microphone
Smartphones. Lecture Notes in Computer Science, 10077:117-127, 2016. Best
paper award at IberSPEECH'16.

I. Lopez-Espejo, J. A. Gonzalez, A. M. Gomez and A. M. Peinado: A Deep
Neural Network Approach for Missing-Data Mask Estimation on Dual-Microphone
Smartphones: Application to Noise-Robust Speech Recognition. Lecture Notes
in Computer Science (IberSPEECH'14), 8854:119-128, 2014.
Overview

Unlike classical signal processing solutions, a main feature of deep
learning is that no assumptions about the problem to be addressed are
required.

The powerful modeling capabilities of DNNs are applied to complex tasks by
also taking advantage of the dual-channel information:
- Missing-data mask estimation (spectral reconstruction)
- Noise estimation (feature compensation)

Hybrid DNN/signal processing architectures will be extensively and
successfully explored in the near future.
DNN-Based Missing-Data Mask Estimation

[Figure: DNN architecture for mask estimation.]

Features:

Y(t) = ( y(t-L); ...; y(t+L) ),   where   y(t) = ( y1(t); y2(t) )

- Input dim.: dim(Y(t)) = 2M(2L+1)

Target:
- Oracle binary mask vector for y1(t)
- Output dim.: M x 1
- -7 dB SNR threshold
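The input/target construction above can be made concrete: stack both
channels' frames over a ±L context window, and label each of the M bins of
the central frame by thresholding its local SNR at -7 dB. A sketch with
synthetic features (shapes follow the slide; the feature and SNR values are
toy data):

```python
import numpy as np

L, M, T = 2, 23, 50
rng = np.random.default_rng(4)
y1 = rng.standard_normal((T, M))      # noisy log-Mel features, channel 1
y2 = rng.standard_normal((T, M))      # noisy log-Mel features, channel 2
snr1 = rng.uniform(-20, 20, (T, M))   # toy per-bin SNR (dB) at channel 1

def input_vector(t):
    """Stacked dual-channel context window Y(t) = (y(t-L); ...; y(t+L))."""
    frames = [np.concatenate([y1[tau], y2[tau]])
              for tau in range(t - L, t + L + 1)]
    return np.concatenate(frames)

def oracle_mask(t, thr_db=-7.0):
    """Binary mask for y1(t): 1 where the bin's SNR exceeds the threshold."""
    return (snr1[t] > thr_db).astype(float)

Yt, mask = input_vector(10), oracle_mask(10)
print(Yt.shape, mask.shape)   # (2M(2L+1),) and (M,)
```

With M = 23 and L = 2, the input dimension is 2·23·5 = 230, matching the
network sizes reported in the results slides.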
DNN-Based Missing-Data Mask Estimation

Training issues:

[Figure: training configuration details.]
DNN-Based Noise Estimation

[Figure: DNN architecture for noise estimation.]

- A parallel approach is considered for noise estimation.
- MSE criterion for backpropagation learning.
DNN-Based Noise Estimation
Noise-aware training

Noise-aware training (NAT) first appeared to strengthen DNN-based acoustic
modeling for ASR.

We want to imitate linear interpolation noise estimation.

Augmented features:

Y_NAT(t) = ( Y(t); n̂1^(0); n̂1^(1); λ(t) ),

where λ(t) = t/(T-1);  t = 0, 1, ..., T-1

- Input dim.: dim(Y_NAT(t)) = dim(Y(t)) + 4M + 1

Target:
- It is the same! Actual M x 1 noise vector n1(t)
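The augmentation above appends two noise estimates and a time-interpolation
factor λ(t) to each stacked input. A sketch of the construction; taking the
endpoint noise estimates n̂^(0), n̂^(1) from the first and last frames of the
utterance is this sketch's assumption about how the endpoints are obtained:

```python
import numpy as np

L, M, T = 2, 23, 50
rng = np.random.default_rng(5)
Y = rng.standard_normal((T, 2 * M * (2 * L + 1)))  # stacked inputs Y(t)
y = rng.standard_normal((T, 2 * M))                # dual-channel frames

# Endpoint noise estimates (assumed: first/last frames are noise-only)
n_hat_0, n_hat_1 = y[0], y[-1]                     # each of dim 2M

def nat_vector(t):
    """Y_NAT(t) = (Y(t); n_hat_0; n_hat_1; lambda(t)), lambda(t) = t/(T-1)."""
    lam = t / (T - 1)
    return np.concatenate([Y[t], n_hat_0, n_hat_1, [lam]])

v = nat_vector(10)
print(v.shape)   # dim(Y(t)) + 4M + 1
```

With M = 23 and L = 2 this gives 230 + 92 + 1 = 323, matching the "323 with
NAT" input dimension reported in the results slides.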
Results
Experimental Framework
Databases

AURORA2-2C-CT/FT
- Emulates the acquisition of noisy speech by a dual-mic smartphone in
  close-talk (CT) and far-talk (FT) conditions.
- Based on the Aurora-2 corpus (connected digits).
- Test A (bus, babble, car and pedestrian street) and Test B (cafe, street,
  and bus and train stations).
- SNRs: {-5, 0, 5, 10, 15, 20} dB and clean.

CHiME-3
- Tablet with 6 mics: 5 facing forward and 1 facing backwards.
- Based on the speaker-independent medium vocabulary subset of the Wall
  Street Journal corpus.
- Real and simulated noisy speech (BUS, CAF, PED and STR).
Experimental Framework
Feature extraction and back-end

Feature extraction
- Mel-frequency cepstral coefficients, MFCCs (ETSI ES 201 108).
- AURORA2-2C-CT/FT: (13 MFCCs with CMN) + Δ + Δ².
- CHiME-3 (GMMs): (13 MFCCs x 7 frames of context) + LDA + MLLT +
  (fMLLR+SAT).
- CHiME-3 (DNNs): 13 MFCCs x 11 frames of context.

Back-end
- HMMs trained with either clean or multi-style data.
- AURORA2-2C-CT/FT (GMMs): one HMM per digit.
- CHiME-3 (GMMs): 2500 tied triphone HMM states modeled by a total of
  15000 Gaussians.
- CHiME-3 (DNNs): DNN with 7 hidden layers of 2048 neurons each.
Power Spectrum Enhancement
AURORA2-2C-CT/FT results

[Figure: WAcc (%) versus SNR (-5 to 20 dB and clean), multi-style models.
Average WAcc per method:]

Method             Close-talk   Far-talk
Baseline           80.68        81.42
D&S                72.18        77.32
MVDR               77.11        80.91
DCSS               85.22        84.83
P-MVDR             84.90        84.64
DSW-B              86.57        84.77
DSW-U_ED           84.68        83.49
DSW-U_MMSE         86.55        85.37
DSW-(U+Eq)_ED      84.17        83.59
DSW-(U+Eq)_MMSE    87.41        86.60

ED: eigenvalue decomposition-based steering vector computation.

The speech component at the 2nd channel can be safely neglected in
close-talk conditions, but not in far-talk conditions.
Power Spectrum Enhancement
CHiME-3 results

WER (%)
Method               GMM models   DNN models
Baseline, 5th ch.    32.67        34.00
AFE                  21.62        19.14
SMW                  23.77        21.32
Wiener+Int           19.15        18.60
D&S (5 ch.)          23.91        23.05
D&S (6 ch.)          25.98        24.55
MVDR (5 ch.)         21.15        18.93
MVDR (6 ch.)         20.46        18.64
Lefkimmiatis         23.41        21.08
MCNR-like            20.49        19.03
DCSS                 20.41        18.06
P-MVDR               20.08        18.03
DSW-B                21.07        18.71
DSW-U_ED             20.60        18.51
DSW-U_MMSE           20.89        19.41
DSW-(U+Eq)_ED        17.80        16.77
DSW-(U+Eq)_MMSE      17.68        16.73

More single-channel methods: soft-mask weighting (SMW) in the log-Mel
domain; Wiener filtering with post-processing (Wiener+Int). Beamformer
post-filters: multi-channel Wiener post-filter (Lefkimmiatis); MCNR-like.

MVDR beamforming yields a strong dehomogenization of the noise at the
virtual primary and secondary channels.

The substantial improvement comes when bias correction and noise
equalization are applied together.

Relative speech gain prior statistics are derived from estimated
multi-channel clean speech (close-talk mic).
VTS Feature Compensation
AURORA2-2C-CT/FT results

[Figure: WAcc (%) versus SNR, multi-style models. Average WAcc per method:]

Method             Close-talk   Far-talk
Baseline           80.68        81.42
DSW-(U+Eq)_MMSE    87.41        86.60
1-VTS_a            83.86        84.65
2-VTS_Sa           87.10        86.28
1-VTS_b            84.61        85.64
2-VTS_Sb           87.41        86.85
2-VTS_Cb           87.87        87.74

Clean speech partial estimate computation: a (two-channel MMSE approach) and
b (single-channel approach).
Posterior computation: S (stacked approach) and C (conditional approach).
2-VTS exploits a21 and n (spatial information).
2-VTS_C conditions the distortion model at the 2nd channel on the 1st one.

Under clean acoustic modeling (WAcc, %)...
Method             Close-talk   Far-talk
Baseline           63.56        65.53
DSW-(U+Eq)_MMSE    77.20        74.57
2-VTS_Cb           86.20        86.01
VTS Feature Compensation
CHiME-3 results

WER (%)
Method              GMM models   DNN models
Baseline, 5th ch.   32.67        34.00
AFE                 21.62        19.14
MVDR (6 ch.)        20.46        18.64
DSW-(U+Eq)_MMSE     17.68        16.73
1-VTS_b             28.20        21.47

MVDR beamforming is again applied to generate a virtual primary channel.

Poor performance of VTS feature compensation under multi-style acoustic
modeling has already been reported in the literature [1].

Under clean GMM-based acoustic modeling (WER, %)...
Baseline, 5th ch.   MVDR (6 ch.)   1-VTS_b
80.17               46.03          40.37

[1] Fujimoto, M. and T. Nakatani: Feature enhancement based on
generative-discriminative hybrid approach with GMMs and DNNs for noise
robust speech recognition. In Proc. of 40th ICASSP, April 19-24, Brisbane,
Australia, 2015.
VTS Feature Compensation
Power spectrum enhancement as pre-processing: AURORA2-2C-CT/FT results

[Figure: WAcc (%) versus SNR, multi-style models. Average WAcc per method:]

Method                  Close-talk   Far-talk
1-VTS_b                 84.61        85.64
(P-MVDR)+(1-VTS_b)      88.96        88.04
DSW+(1-VTS_b)           89.69        88.23
2-VTS_Cb                87.87        87.74
(P-MVDR)+(2-VTS_Cb)     89.80        88.91
DSW+(2-VTS_Cb)          90.35        88.54

The higher the SNR of the speech data, the higher the recognition accuracy
provided by VTS feature compensation.

When P-MVDR or DSW-(U+Eq)_MMSE is combined with 1-VTS_b, the same spatial
information as in the case of 2-VTS_Cb is being used.

Deep Learning-Based Techniques
Missing-data masks for spectral reconstruction

[Figure: WAcc (%) versus SNR on AURORA2-2C. Average WAcc per method:]

Clean models:
Method          Test set A   Test set B
Baseline        67.63        59.49
AFE             82.24        76.27
TGI+Oracle      95.83        93.05
TGI+(T-SNR)     80.12        72.99
TGI+DNN         86.89        77.83

Multi-style models:
Method          Test set A   Test set B
Baseline        85.07        76.29
AFE             88.13        82.58
TGI+Oracle      95.87        92.71
TGI+(T-SNR)     84.22        77.14
TGI+DNN         88.17        79.87

TGI: Truncated-Gaussian based Imputation. Oracle: oracle masks. T-SNR: masks
from thresholding an estimation of the a priori SNR. DNN: two-channel
DNN-based mask estimation.

Noises of test set B are reserved to evaluate the generalization ability of
the DNN.
Performance tends to saturate for L = 2 and greater values.
2 hidden layers with 460 nodes each. Input and output layers have 230 and 23
nodes, resp. (M = 23).
Deep Learning-Based Techniques
Noise estimates for feature compensation

AURORA2-2C-CT (close-talk)

[Figure: WAcc (%) versus SNR for test sets A and B. Average WAcc per
method:]

Method       Test set A   Test set B
Baseline     85.07        76.29
TGI+DNN      88.17        79.87
DNN1         83.55        73.41
DNN2         90.95        84.13
DNN1_NAT     84.54        76.96
DNN2_NAT     88.21        81.35

DNN1_NAT: DNN1 with NAT. DNN2_NAT: DNN2 with NAT.

Noises of test set B are reserved to evaluate the generalization ability of
the DNN.
5 hidden layers with 512 nodes each. Input and output layers have 230 (323
with NAT) and 23 nodes, resp. (L = 2, M = 23).
NAT improves DNN1 while worsening DNN2.
Deep Learning-Based Techniques
Noise estimates for feature compensation

AURORA2-2C-CT (close-talk)

[Figure: WAcc (%) versus SNR. Average WAcc per method:]

Method      Clean models   Multi-style models
Baseline    63.56          80.68
IMCRA       78.38          82.26
MS          78.54          83.68
Interp.     83.06          84.61
PLDNE       81.05          85.29
DNN2        86.16          87.54

Noise estimation methods for comparison (applied to 1-VTS_b):
- Single-channel methods: improved minima controlled recursive averaging
  (IMCRA), minimum statistics (MS) and linear interpolation (Interp.).
- PLDNE is for dual-mic smartphones (PLD and homogeneous noise field).
Summary, Conclusions and Future Work
Summary and Conclusions

The multi-channel information coming from mobile devices with several
sensors can be exploited for noise-robust ASR purposes.

The best power spectrum enhancement results were achieved by
DSW-(U+Eq)_MMSE on all the corpora.

Our MMSE-based speech gain estimation method provides similar or better
results than an eigenvalue decomposition-based method at a fraction of the
computational complexity.
Summary and Conclusions

Discarding the phase information is beneficial to overcome the limitations
of classical MVDR beamforming when applied to the considered dual-mic
framework.

We showed that dual-channel VTS feature compensation outperforms the
single-channel approach on the AURORA2-2C-CT/FT databases.

For dual-channel VTS feature compensation, modeling the conditional
dependence of the noisy secondary channel given the primary one proved more
robust.

Accurate missing-data masks and noise estimates were obtained by jointly
exploiting the dual-channel noisy information and the powerful modeling
capabilities of DNNs.

The use of the secondary sensor itself can be understood as a more robust
kind of NAT strategy.

Our contributions broadly showed outstanding performance at low SNRs, which
makes them promising techniques for highly noisy environments such as those
where mobile devices might be used.
Future Work

Extension of the different proposals to operate on different mobile devices
with other small microphone array configurations.

With respect to our DNN-based proposals, performing an extensive search
regarding the architecture and training configuration:
- Use of recurrent neural networks (RNNs)
- Additional or different kinds of features

Extension of our DNN-based proposals to deal with a hands-free/far-talk
scenario.

Thank you very much for your attention