Robust Speech Recognition on Intelligent Mobile Devices with Dual-Microphone

Doctoral Thesis

Author: Ivan Lopez Espejo
Supervisors: Antonio M. Peinado and Angel M. Gomez

Ph.D. Program in Information and Communication Technologies
Dept. of Signal Theory, Telematics and Communications
University of Granada

Granada, 22nd September 2017

Outline
PART I: Introduction - Motivation - Objectives
PART II: Summary - Multi-Channel Power Spectrum Enhancement - Dual-Channel VTS Feature Compensation - Dual-Channel Deep Learning Techniques - Results - Summary, Conclusions & Future Work
PART I
Introduction, Motivation and Objectives
ASR in Noisy Conditions
Overview

Automatic speech recognition (ASR) is a mature technology under controlled conditions.

There is a gap in performance between humans and machines due to the mismatch introduced between the training and testing conditions of the ASR system:
- Intra-speaker variability (mood, presence of illness...)
- Inter-speaker variability
- Transmission channel
- Reverberation
- Background (additive) noise
- ...

While human beings exhibit a high degree of robustness against noise when recognizing speech, noise can make ASR systems unusable.
ASR in Noisy Conditions
ASR fundamentals

- Front-end: feature extraction, e.g. MFCCs (X, features).
- Back-end: speech decoding (W, transcription), following $\widehat{W} = \arg\max_W p(X|W)\,P(W)$, where
  - $p(X|W)$ is the acoustic score
  - $P(W)$ is the language score

EXAMPLE: if the acoustic models are trained with clean speech data and we try to recognize noisy speech data, the mismatch will cause a wrong transcription.
ASR in Noisy Conditions
Speech distortion modeling

We consider the classical linear speech distortion model:
- Time domain: $y(m) = h(m) \ast x(m) + n(m)$
- Power spectral domain: $|Y(f,t)|^2 = |H(f,t)|^2\,|X(f,t)|^2 + |N(f,t)|^2$
- Log-Mel domain: $\mathbf{y} = \mathbf{x} + \mathbf{h} + \log\left(\mathbf{1} + e^{\mathbf{n}-\mathbf{x}-\mathbf{h}}\right)$

The statistical distribution of the speech energy is affected in the presence of ambient noise (when h = 0):

[Figure: relative-frequency histograms of clean speech, noisy speech and noise over log-Mel power, for noise levels n = 2, 8 and 14; as the noise level grows, the noisy speech distribution is increasingly dominated by the noise.]
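As a quick numerical illustration of the log-Mel distortion model above, the combination of clean speech and noise can be computed in a few lines of NumPy (a minimal sketch; the function name and array conventions are ours):

```python
import numpy as np

def log_mel_noisy(x, n, h=None):
    """Combine clean speech x, noise n and (optional) channel h in the
    log-Mel domain: y = x + h + log(1 + exp(n - x - h)).

    Since y equals log(e^{x+h} + e^{n}), a numerically stable
    log-add (np.logaddexp) is used instead of a naive exponential."""
    if h is None:
        h = np.zeros_like(x)
    return x + h + np.logaddexp(0.0, n - x - h)
```

At high SNR (x much larger than n) the result collapses to x + h, while at low SNR the noise term dominates, which is exactly the masking behavior the histograms illustrate.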
ASR in Noisy Conditions
Approaches to strengthen ASR systems

A possible taxonomy of the noise-robust methods for ASR:

- Feature-space approaches: noise-robust features (RASTA-PLP, TANDEM...), normalization of statistical moments of the features (CMN, HEQ...) and feature enhancement (spectral subtraction, Wiener filtering, AFE...)
- Model-based approaches: model adaptation (CMLLR...) and adaptive training (fNAT, SAT...)
- Compensation with explicit distortion modeling: model adaptation or feature compensation (VTS...)
- Missing-data approaches: ignoring unreliable elements during recognition (marginalization, SFD...) and data imputation (TGI...)

More approaches: stereo data learning-based techniques (SPLICE, DNNs...), exemplar-based techniques (NMF...), etc.
ASR on Mobile Devices with Various Sensors
Overview and motivation

- Intelligent mobile devices (IMDs), e.g. smartphones or tablets, have revolutionized the way we live.
- ASR has experienced a new upswing (search-by-voice, dictation, voice control...).
- Mobile devices can be used anywhere at any time, so tackling noise is more important than ever before!
- We can take advantage of the small microphone arrays embedded in the latest IMDs for noise-robust ASR purposes.
ASR on Mobile Devices with Various Sensors
Approaches to strengthen ASR systems

- Performance benefits from a combination of single- and multi-channel techniques.
- Spatial filtering (beamforming) is a popular choice (see the 3rd and 4th CHiME Challenges): delay-and-sum, MVDR, adaptive array processing (GSC...).
- Beamforming shortcomings are partially overcome by post-filtering (e.g. MVDR + Wiener post-filter = MCWF): low directivity at low frequencies, inaccurate estimation of the steering vector, inability to remove noise coming from the look direction, etc.

Problem: beamforming exhibits some important constraints when performed on arrays comprised of a few sensors placed close to each other (IMDs), which motivates specific solutions based on the power level difference (PLD).
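For reference, the simplest of the spatial filters mentioned above, delay-and-sum, can be sketched in the STFT domain as follows (a minimal illustration under our own naming conventions, not the thesis implementation):

```python
import numpy as np

def delay_and_sum(stft_channels, delays, freqs):
    """Delay-and-sum beamformer in the STFT domain.

    stft_channels: complex array of shape (C, F, T), one STFT per mic.
    delays: per-channel time delays in seconds (they steer the look direction).
    freqs: the F analysis frequencies in Hz.
    Each channel is phase-aligned and the channels are averaged."""
    C, F, T = stft_channels.shape
    # e^{+j 2 pi f tau_c} compensates a delay of tau_c seconds on channel c
    steer = np.exp(2j * np.pi * np.outer(delays, freqs))   # (C, F)
    aligned = stft_channels * steer[:, :, None]            # (C, F, T)
    return aligned.mean(axis=0)                            # (F, T)
```

With only two closely spaced sensors the phase differences (and hence the achievable directivity) are tiny, which is precisely the limitation the PLD-based methods below sidestep by working with power spectra instead.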
ASR on Mobile Devices with Various Sensors
Power level difference (PLD)

[Figure: power spectral densities (PSD, dB/Hz, 0-4 kHz) of clean speech and car noise at channels 1 and 2, in close-talk and far-talk conditions.]

- The 2nd sensor is placed in an acoustic shadow, so speech is attenuated at the 2nd mic with respect to the 1st one.
- Diffuse noise field: similar noise PSDs at both mics.
Objectives of this Thesis

As...
1 The latest IMDs embed small microphone arrays.
2 The performance of beamforming for robust ASR on IMDs is limited.

Our objectives are...
1 Reviewing the literature on single- and multi-channel noise-robust ASR for mobile devices.
2 Developing a new series of algorithms exploiting a secondary sensor to improve the ASR accuracy on IMDs.
3 Generating new speech resources under a dual-channel mobile device framework for experimental purposes.
4 Evaluating and comparing our developments to draw conclusions in order to make further progress.
PART II
Contributions and Results
Summary

1 Multi-Channel Power Spectrum Enhancement
  - Combinatorial Strategy
  - Dual-Channel Spectral Subtraction (DCSS)
  - Power-Minimum Variance Distortionless Response (P-MVDR)
  - Dual-Channel Spectral Weighting (DSW)
  - MMSE-Based Relative Speech Gain Estimation
2 Dual-Channel Vector Taylor Series Feature Compensation
  - Dual-Channel VTS Feature Compensation
  - Calculation of the Posterior Probabilities (VTS_S)
  - Alternative Posterior Computation (VTS_C)
  - Clean Speech Partial Estimate Computation
3 Dual-Channel Deep Learning Techniques
  - Overview
  - DNN-Based Missing-Data Mask Estimation
  - DNN-Based Noise Estimation
4 Results
  - Experimental Framework
  - Power Spectrum Enhancement
  - VTS Feature Compensation
  - Deep Learning-Based Techniques
5 Summary, Conclusions and Future Work
  - Summary and Conclusions
  - Future Work
Multi-Channel Power Spectrum Enhancement

I. Lopez-Espejo, A. M. Peinado, A. M. Gomez and J. A. Gonzalez: Dual-Channel Spectral Weighting for Robust Speech Recognition in Mobile Devices. Submitted to Digital Signal Processing (major revision).

I. Lopez-Espejo, A. M. Gomez, J. A. Gonzalez and A. M. Peinado: Feature Enhancement for Robust Speech Recognition on Smartphones with Dual-Microphone. In Proceedings of the 22nd European Signal Processing Conference, September 1-5, Lisbon (Portugal), 2014. Best student paper award.
Combinatorial Strategy

- As in the literature, we optionally consider microphone array pre-processing (beamforming).
- The virtual primary channel has a higher SNR than any other signal from the IMD.
- Our contributions behave as post-filters.
Dual-Channel Spectral Subtraction (DCSS)

SS is extended to a dual-channel framework from
  $|Y_1(f,t)|^2 = |X_1(f,t)|^2 + |N_1(f,t)|^2$
  $|Y_2(f,t)|^2 = |X_2(f,t)|^2 + |N_2(f,t)|^2$

We first establish a couple of relations:
  $|X_2(f,t)|^2 = \mathcal{A}_{21}(f,t)\,|X_1(f,t)|^2$
  $|N_1(f,t)|^2 = G_{12}(f,t)\,|N_2(f,t)|^2$, with $G_{12}(f,t) = \frac{\Sigma_{N,f,t}(1,2)}{\Sigma_{N,f,t}(2,2)}$

By combining these equations we get the following dual-channel spectral subtraction estimator:
  $|\widehat{X}_1(f,t)|^2 = \frac{|Y_1(f,t)|^2 - G_{12}(f,t)\,|Y_2(f,t)|^2}{1 - G_{12}(f,t)\,\mathcal{A}_{21}(f,t)}$
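The DCSS estimator above is straightforward to apply per time-frequency bin; a minimal NumPy sketch (names ours, with a power floor added as is customary in classical spectral subtraction):

```python
import numpy as np

def dcss(Y1_pow, Y2_pow, A21, G12, floor=1e-10):
    """Dual-channel spectral subtraction (sketch of the DCSS estimator).

    Y1_pow, Y2_pow: noisy power spectra |Y1|^2, |Y2|^2 (arrays over f, t).
    A21: power-domain relative speech gain of mic 2 w.r.t. mic 1.
    G12: noise cross-power ratio Sigma_N(1,2) / Sigma_N(2,2).
    Negative estimates are floored, as in single-channel SS."""
    num = Y1_pow - G12 * Y2_pow
    den = 1.0 - G12 * A21
    return np.maximum(num / den, floor)
```

For example, with |X1|^2 = 4, A21 = 0.25 (so |X2|^2 = 1), |N2|^2 = 2 and G12 = 0.5 (so |N1|^2 = 1), the observations are |Y1|^2 = 5 and |Y2|^2 = 3, and the estimator recovers (5 - 1.5) / (1 - 0.125) = 4 exactly.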
Power-Minimum Variance Distortionless Response (P-MVDR)

P-MVDR discards the phase information to overcome the limitations of the classical MVDR beamforming when applied to our dual-channel framework:
  $|\widehat{X}_1(f,t)|^2 = \mathbf{w}_{f,t}^{\top} \begin{pmatrix} |Y_1(f,t)|^2 \\ |Y_2(f,t)|^2 \end{pmatrix}$

The optimal weights are estimated through
  $\mathbf{w}_{f,t} = \arg\min_{\mathbf{w}_{f,t}} E\left[\left(\mathbf{w}_{f,t}^{\top}\boldsymbol{\nu}(f,t)\right)^2\right]$
subject to $\mathbf{w}_{f,t}^{\top}\,(1, \mathcal{A}_{21}(f,t))^{\top} = 1$ (distortionless constraint), where $\boldsymbol{\nu}(f,t)$ denotes the dual-channel noise power vector.

The final weighting vector is
  $\mathbf{w}_{f,t} = \frac{\boldsymbol{\Sigma}_N^{-1}(f,t)\,(1, \mathcal{A}_{21}(f,t))^{\top}}{(1, \mathcal{A}_{21}(f,t))\,\boldsymbol{\Sigma}_N^{-1}(f,t)\,(1, \mathcal{A}_{21}(f,t))^{\top}}$
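The closed-form P-MVDR weights have the familiar MVDR structure w = S^{-1} d / (d^T S^{-1} d); a sketch for a single (f, t) bin (all names ours):

```python
import numpy as np

def pmvdr_weights(Sigma_N, d):
    """MVDR-form weights w = Sigma_N^{-1} d / (d^T Sigma_N^{-1} d).

    Sigma_N: 2x2 noise power covariance for one (f, t) bin.
    d: steering-like vector (1, A21) in the power domain."""
    Sd = np.linalg.solve(Sigma_N, d)   # Sigma_N^{-1} d without explicit inverse
    return Sd / (d @ Sd)

def pmvdr_enhance(Y_pow, w):
    """Apply the weights to the stacked noisy power vector (|Y1|^2, |Y2|^2)."""
    return w @ Y_pow
```

The distortionless property w^T d = 1 holds by construction, so the speech power component that lies along d passes through unattenuated while the noise output power is minimized.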
Dual-Channel Spectral Weighting (DSW)
Biased spectral weighting (DSW-B)

DSW is based on Wiener filtering:
  $|\widehat{X}_1(f,t)|^2 = \underbrace{\left(\frac{S_{y_1}(f,t) - S_{n_1}(f,t)}{S_{y_1}(f,t)}\right)^2}_{H_1^2(f,t)} |Y_1(f,t)|^2$

Two assumptions are initially considered:
- The 2nd mic captures no speech: $S_{x_2}(f,t) = 0$
- The existence of a homogeneous noise field: $S_{n_1}(f,t) = S_{n_2}(f,t)$

As $S_{n_1}(f,t) = S_{y_2}(f,t)$ under these assumptions, the biased Wiener filter results in
  $H_{1,b}(f,t) = \frac{S_{y_1}(f,t) - S_{y_2}(f,t)}{S_{y_1}(f,t)}$

While the above assumptions can be acceptable in some specific cases, in general they will not be accurate.
Dual-Channel Spectral Weighting (DSW)
Unbiased spectral weighting (DSW-U)

[Figure: normalized-frequency histograms of log A21(f) at the 1534.88 Hz and 3643.41 Hz bins, in anechoic and room conditions, for close-talk and far-talk.]

- The 2nd mic also captures speech (by diffraction, reflections...): $\mathcal{A}_{21}(f,t)$ is non-zero ($|X_2(f,t)|^2 = \mathcal{A}_{21}(f,t)\,|X_1(f,t)|^2$).
- A bias correction term is introduced:
  $H_{1,u}(f,t) = \underbrace{\frac{1}{1 - \mathcal{A}_{21}(f,t)}}_{B^{-1}(f,t)} \underbrace{\frac{S_{y_1}(f,t) - S_{y_2}(f,t)}{S_{y_1}(f,t)}}_{H_{1,b}(f,t)}$

[Figure: H_{1,u}(f,t) as a function of H_{1,b}(f,t) for A21(f,t) values from 0 to 0.8.]
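The biased and unbiased DSW gains can be computed jointly; a hedged sketch (names ours; the clipping of the gains to [0, 1], customary for Wiener-type filters, is our addition):

```python
import numpy as np

def dsw_gains(Sy1, Sy2, A21=None, eps=1e-10):
    """Biased (DSW-B) and unbiased (DSW-U) dual-channel Wiener gains.

    Sy1, Sy2: noisy PSD estimates at mics 1 and 2.
    A21: power-domain relative speech gain; None reproduces the biased case.
    Gains are clipped to [0, 1]."""
    H_b = (Sy1 - Sy2) / np.maximum(Sy1, eps)
    if A21 is None:
        return np.clip(H_b, 0.0, 1.0)
    H_u = H_b / np.maximum(1.0 - A21, eps)   # bias correction B^{-1}
    return np.clip(H_u, 0.0, 1.0)
```

For instance, with Sy1 = 10 and Sy2 = 4 the biased gain is 0.6, and a relative speech gain A21 = 0.2 raises it to 0.6 / 0.8 = 0.75, compensating the speech leaked into the second channel.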
Dual-Channel Spectral Weighting (DSW)
Noise equalization (Eq)

The assumption $S_{n_1}(f,t) \approx S_{n_2}(f,t)$ may not be satisfied even in the presence of a homogeneous noise field.

The signal at the 2nd channel is transformed to meet this requirement:
  $|\tilde{Y}_2(f,t)|^2 = \underbrace{|\tilde{X}_2(f,t)|^2}_{|X_2(f,t)|^2} + \underbrace{|\tilde{N}_2(f,t)|^2}_{\delta(f,t)\,|N_1(f,t)|^2} = \mathbf{g}_{f,t}^{\top} \begin{pmatrix} |Y_2(f,t)|^2 \\ |Y_1(f,t)|^2 \end{pmatrix}$

Inspired by MVDR beamforming,
  $\mathbf{g}_{f,t} = \arg\min_{\mathbf{g}_{f,t}} E\left[\left(\left(|N_1(f,t)|^2 + \mathrm{std}\left(|N_1(f,t)|^2\right)\right) - \mathbf{g}_{f,t}^{\top}\boldsymbol{\nu}(f,t)\right)^2\right]$
subject to a distortionless constraint of the form $\mathbf{g}_{f,t}^{\top}(\cdot) = 1$.

[Figure: average power (dB) over frequency (0-4 kHz) of |N1(f,t)|^2, |N2(f,t)|^2 and the equalized second-channel noise with delta(f,t) = 1 and delta(f,t) != 1, showing that equalization brings the 2nd-channel noise level to that of the 1st channel.]
Dual-Channel Spectral Weighting (DSW)
System overview

The spectro-temporal correlation of the speech signal is exploited to refine the Wiener filter weights:
- Two-dimensional median filtering
- Two-dimensional Gaussian filtering
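The spectro-temporal refinement can be illustrated with SciPy's 2-D filters (a sketch assuming the gains form an F x T matrix; the filter sizes are illustrative, not the thesis settings):

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

def refine_gains(H, median_size=(3, 3), gaussian_sigma=1.0):
    """Smooth a time-frequency Wiener gain matrix H (F x T) by exploiting
    the spectro-temporal correlation of speech: a 2-D median filter first
    removes isolated outliers, then a 2-D Gaussian filter smooths the
    result; the output is kept in the valid gain range [0, 1]."""
    H_med = median_filter(H, size=median_size)
    H_smooth = gaussian_filter(H_med, sigma=gaussian_sigma)
    return np.clip(H_smooth, 0.0, 1.0)
```

The median stage suppresses isolated musical-noise-like gain spikes, while the Gaussian stage enforces smooth transitions across neighboring frequency bins and frames.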
MMSE-Based Relative Speech Gain Estimation

Some definitions...
- $\mathcal{A}_{21}(f,t) = |A_{21}(f,t)|^2$
- $\mathbf{a}_{21} = (A_{21}(0,t), ..., A_{21}(M-1,t))^{\top}$ (in the STFT domain)
- $\mathbf{a}_{21} = \mathbf{a}_{21}^r + j\,\mathbf{a}_{21}^i$

The MMSE estimate of $\mathbf{a}_{21}$ can be expressed as
  $\hat{\mathbf{a}}_{21} = E[\mathbf{a}_{21}|\mathbf{y}_2] = E[\mathbf{a}_{21}^r|\mathbf{y}_2^r] + j\,E\left[\mathbf{a}_{21}^i|\mathbf{y}_2^i\right] = \hat{\mathbf{a}}_{21}^r + j\,\hat{\mathbf{a}}_{21}^i$

For the real part:
  $\hat{\mathbf{a}}_{21}^r = E[\mathbf{a}_{21}^r|\mathbf{y}_2^r] = \boldsymbol{\mu}_{A_{21}}^r + \boldsymbol{\Sigma}_{A_{21}^r Y_2^r}\,\boldsymbol{\Sigma}_{Y_2^r}^{-1}\left(\mathbf{y}_2^r - \boldsymbol{\mu}_{Y_2}^r\right)$

[Figure: empirical distributions (normalized frequency) of the real and imaginary parts of the A21(f) and Y2(f) STFT coefficients at 449.6 Hz and 2992.2 Hz.]
MMSE-Based Relative Speech Gain Estimation

The additive distortion model in the STFT domain:
  $\mathbf{y}_1 = \mathbf{x}_1 + \mathbf{n}_1$
  $\mathbf{y}_2 = \mathbf{x}_2 + \mathbf{n}_2 = \mathbf{a}_{21} \odot \mathbf{x}_1 + \mathbf{n}_2$  ($\odot$ is the element-wise product)

From the combination of the above expressions we can state
  $\mathbf{y}_2^r = \mathbf{a}_{21}^r \odot (\mathbf{y}_1^r - \mathbf{n}_1^r) - \mathbf{a}_{21}^i \odot \left(\mathbf{y}_1^i - \mathbf{n}_1^i\right) + \mathbf{n}_2^r$

- It is also assumed that the noise variables follow multivariate Gaussian distributions.
- Any linear combination of Gaussian variables follows another Gaussian distribution.
- $\mathbf{y}_2^r$ is linearized by means of a first-order VTS expansion around the mean vectors.
- The parameters of $p(\mathbf{y}_2^r) = \mathcal{N}\left(\mathbf{y}_2^r;\, \boldsymbol{\mu}_{Y_2}^r, \boldsymbol{\Sigma}_{Y_2^r}\right)$ are now easily computed.
- $\boldsymbol{\Sigma}_{A_{21}^r Y_2^r}$ is also derived from this approach.
Dual-Channel Vector Taylor Series Feature Compensation

I. Lopez-Espejo, A. M. Peinado, A. M. Gomez and J. A. Gonzalez: Dual-Channel VTS Feature Compensation for Noise-Robust Speech Recognition on Mobile Devices. IET Signal Processing, 11:17-25, 2017.
Dual-Channel VTS Feature Compensation
MMSE estimation

- VTS feature compensation, in the log-Mel domain, is extended to be performed on a dual-channel framework.
- Clean speech statistics at the 1st channel are modeled by a Gaussian mixture model (GMM):
  $p(\mathbf{x}_1) = \sum_{k=1}^{K} P(k)\, \mathcal{N}\left(\mathbf{x}_1;\, \boldsymbol{\mu}_{x_1}^{(k)}, \boldsymbol{\Sigma}_{x_1}^{(k)}\right)$
- A noisy stacked vector is defined as $\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix}$
- The log-Mel clean speech features are estimated at every time frame under an MMSE approach as
  $\hat{\mathbf{x}}_1 = E[\mathbf{x}_1|\mathbf{y}] = \sum_{k=1}^{K} P(k|\mathbf{y})\, E[\mathbf{x}_1|\mathbf{y}, k]$
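The MMSE combination above reduces to a posterior-weighted average of per-component estimates; a minimal sketch (names ours), including a numerically stable Bayes step for the posteriors:

```python
import numpy as np

def posteriors_from_loglik(loglik, log_priors):
    """P(k|y) via Bayes' rule in the log domain, using a max-shift
    (log-sum-exp trick) for numerical stability."""
    log_post = loglik + log_priors
    log_post -= np.max(log_post)
    p = np.exp(log_post)
    return p / p.sum()

def mmse_estimate(posteriors, partial_estimates):
    """MMSE combination x1_hat = sum_k P(k|y) E[x1|y, k].

    posteriors: array (K,) summing to 1.
    partial_estimates: array (K, M) of per-component clean-speech estimates."""
    return np.asarray(posteriors) @ np.asarray(partial_estimates)   # (M,)
```

The per-component likelihoods and partial estimates themselves come from the VTS-based computations described on the following slides.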
Calculation of the Posterior Probabilities (VTS_S)

By using the Bayes rule, the posteriors are defined as
  $P(k|\mathbf{y}) = \frac{p(\mathbf{y}|k)\,P(k)}{\sum_{k'=1}^{K} p(\mathbf{y}|k')\,P(k')}$, with $p(\mathbf{y}|k) = \mathcal{N}\left(\mathbf{y};\, \boldsymbol{\mu}_y^{(k)}, \boldsymbol{\Sigma}_y^{(k)}\right)$

The dual-channel model is given by the following stacked vector:
  $\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix} = \begin{pmatrix} \mathbf{f}(\mathbf{x}_1, \mathbf{a}_{11}, \mathbf{n}_1) \\ \mathbf{f}(\mathbf{x}_1, \mathbf{a}_{21}, \mathbf{n}_2) \end{pmatrix}$,
where $\mathbf{y}_i = \mathbf{f}(\mathbf{x}_1, \mathbf{a}_{i1}, \mathbf{n}_i) = \mathbf{x}_1 + \mathbf{a}_{i1} + \log\left(\mathbf{1}_{M,1} + e^{\mathbf{n}_i - \mathbf{x}_1 - \mathbf{a}_{i1}}\right)$

Again, it is assumed that all the variables involved are Gaussian.

First-order VTS expansion of the dual-channel model:
  $\mathbf{f}(\mathbf{x}_1, \mathbf{a}_{i1}, \mathbf{n}_i) \approx \mathbf{f}\left(\boldsymbol{\mu}_{x_1}^{(k)}, \boldsymbol{\mu}_{a_{i1}}, \boldsymbol{\mu}_{n_i}\right) + \mathbf{J}_x^{(i,k)}\left(\mathbf{x}_1 - \boldsymbol{\mu}_{x_1}^{(k)}\right) + \mathbf{J}_a^{(i,k)}\left(\mathbf{a}_{i1} - \boldsymbol{\mu}_{a_{i1}}\right) + \mathbf{J}_n^{(i,k)}\left(\mathbf{n}_i - \boldsymbol{\mu}_{n_i}\right)$

The required parameters are now easily estimated:
  $\boldsymbol{\mu}_y^{(k)} \approx \begin{pmatrix} \mathbf{f}\left(\boldsymbol{\mu}_{x_1}^{(k)}, \boldsymbol{\mu}_{a_{11}}, \boldsymbol{\mu}_{n_1}\right) \\ \mathbf{f}\left(\boldsymbol{\mu}_{x_1}^{(k)}, \boldsymbol{\mu}_{a_{21}}, \boldsymbol{\mu}_{n_2}\right) \end{pmatrix}$
  $\boldsymbol{\Sigma}_y^{(k)} \approx \mathbf{J}_x^{(k)}\,\boldsymbol{\Sigma}_{x_1}^{(k)}\,\mathbf{J}_x^{(k)\top} + \mathbf{J}_a^{(k)}\,\boldsymbol{\Sigma}_a\,\mathbf{J}_a^{(k)\top} + \mathbf{J}_n^{(k)}\,\boldsymbol{\Sigma}_n\,\mathbf{J}_n^{(k)\top}$
Alternative Posterior Computation (VTS_C)

- The 2nd channel is treated in a parallel manner to the 1st one. However, clean speech will easily be masked by noise at the 2nd channel, yielding a highly uncertain relation between the 2nd noisy channel and clean speech.
- We have found it more robust to condition the 2nd channel on the 1st channel.

$P(k|\mathbf{y})$ is replaced by
  $P(k|\mathbf{y}_1, \mathbf{y}_2) = \frac{\overbrace{p(\mathbf{y}_1, \mathbf{y}_2|k)}^{p(\mathbf{y}_1|k)\,p(\mathbf{y}_2|\mathbf{y}_1, k)}\,P(k)}{\sum_{k'=1}^{K} p(\mathbf{y}_1, \mathbf{y}_2|k')\,P(k')}$

Similarly, a VTS approach is considered to compute the parameters of
  $p(\mathbf{y}_1|k) = \mathcal{N}\left(\mathbf{y}_1;\, \boldsymbol{\mu}_{y_1}^{(k)}, \boldsymbol{\Sigma}_{y_1}^{(k)}\right)$ and
  $p(\mathbf{y}_2|\mathbf{y}_1, k) = \mathcal{N}\left(\mathbf{y}_2;\, \boldsymbol{\mu}_{y_2|y_1}^{(k)}, \boldsymbol{\Sigma}_{y_2|y_1}^{(k)}\right)$
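Computing the parameters of p(y2|y1, k) relies on standard Gaussian conditioning over the stacked vector; a generic sketch (names ours):

```python
import numpy as np

def conditional_gaussian(mu, Sigma, y1, d1):
    """Parameters of p(y2 | y1) for a joint Gaussian over (y1; y2).

    mu, Sigma: mean and covariance of the stacked vector (y1; y2).
    d1: dimension of y1. Standard conditioning formulas:
      mu_{2|1}    = mu2 + S21 S11^{-1} (y1 - mu1)
      Sigma_{2|1} = S22 - S21 S11^{-1} S12"""
    mu1, mu2 = mu[:d1], mu[d1:]
    S11 = Sigma[:d1, :d1]
    S12 = Sigma[:d1, d1:]
    S21 = Sigma[d1:, :d1]
    S22 = Sigma[d1:, d1:]
    K = S21 @ np.linalg.inv(S11)   # regression matrix of y2 on y1
    return mu2 + K @ (y1 - mu1), S22 - K @ S12
```

When the cross-covariance S21 vanishes, the conditional mean collapses to mu2 and the conditioning carries no information, which is the degenerate case the VTS_C factorization is designed to avoid.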
Clean Speech Partial Estimate Computation

The clean speech partial expected values are defined as
  $E[\mathbf{x}_1|\mathbf{y}, k] = \int \mathbf{x}_1\, p(\mathbf{x}_1|\mathbf{y}, k)\, d\mathbf{x}_1$

Two different proposals are considered for clean speech partial estimate computation.

1st approach (VTS_a): the dual-channel information is exploited:
  $E[\mathbf{x}_1|\mathbf{y}, k] = \boldsymbol{\mu}_{x_1}^{(k)} + \underbrace{\boldsymbol{\Sigma}_{x_1}^{(k)}\,\mathbf{J}_x^{(k)\top}}_{\boldsymbol{\Sigma}_{x_1 y}^{(k)}} \left(\boldsymbol{\Sigma}_y^{(k)}\right)^{-1}\left(\mathbf{y} - \boldsymbol{\mu}_y^{(k)}\right)$

2nd approach (VTS_b): only the information from the 1st channel is used:
  $E[\mathbf{x}_1|\mathbf{y}, k] \approx E[\mathbf{x}_1|\mathbf{y}_1, k] = \mathbf{y}_1 - \underbrace{\log\left(\mathbf{1}_{M,1} + e^{\boldsymbol{\mu}_{n_1} - \boldsymbol{\mu}_{x_1}^{(k)}}\right)}_{E[\mathbf{g}(\mathbf{x}_1, \mathbf{a}_{11}, \mathbf{n}_1)|\mathbf{y}_1, k]}$
Dual-Channel Deep Learning Techniques

I. Lopez-Espejo, A. M. Peinado, A. M. Gomez and J. M. Martín-Doñas: Deep Neural Network-Based Noise Estimation for Robust ASR in Dual-Microphone Smartphones. Lecture Notes in Computer Science, 10077:117-127, 2016. Best paper award at IberSPEECH 2016.

I. Lopez-Espejo, J. A. Gonzalez, A. M. Gomez and A. M. Peinado: A Deep Neural Network Approach for Missing-Data Mask Estimation on Dual-Microphone Smartphones: Application to Noise-Robust Speech Recognition. Lecture Notes in Computer Science (IberSPEECH 2014), 8854:119-128, 2014.
Overview

- Unlike the classical signal processing solutions, a main feature of deep learning is that no assumptions about the problem to be addressed are required.
- The powerful modeling capabilities of DNNs are applied to complex tasks by also taking advantage of the dual-channel information:
  - Missing-data mask estimation (spectral reconstruction)
  - Noise estimation (feature compensation)
- Hybrid DNN/signal processing architectures will be extensively and successfully explored in the near future.
DNN-Based Missing-Data Mask Estimation

Features:
  $\mathbf{Y}(t) = \begin{pmatrix} \mathbf{y}(t-L) \\ \vdots \\ \mathbf{y}(t+L) \end{pmatrix}$, where $\mathbf{y}(t) = \begin{pmatrix} \mathbf{y}_1(t) \\ \mathbf{y}_2(t) \end{pmatrix}$

- Input dim.: dim(Y(t)) = 2M(2L + 1)

Target:
- Oracle binary mask vector for y1(t)
- Output dim.: M x 1
- -7 dB SNR threshold
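Building the context-expanded dual-channel inputs Y(t) can be sketched as follows (names ours; edge frames are clamped, one common convention, since the slides do not specify the border handling):

```python
import numpy as np

def stack_context(y1, y2, L):
    """Build dual-channel context-expanded DNN inputs Y(t).

    y1, y2: (T, M) log-Mel features from mics 1 and 2.
    For each frame t, frames t-L..t+L of the stacked dual-channel vector
    y(t) = [y1(t); y2(t)] are concatenated (edges are clamped), giving
    inputs of dimension 2M(2L + 1)."""
    T, M = y1.shape
    y = np.concatenate([y1, y2], axis=1)                           # (T, 2M)
    idx = np.clip(np.arange(T)[:, None] + np.arange(-L, L + 1), 0, T - 1)
    return y[idx].reshape(T, 2 * M * (2 * L + 1))                  # (T, 2M(2L+1))
```

The same input construction serves both the mask-estimation and the noise-estimation DNNs described here, only the target vector changes.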
DNN-Based Missing-Data Mask Estimation

Training issues:
- Input data are properly normalized
- The DNN is pre-trained by considering each pair of layers as RBMs
- The DNN is trained by using the backpropagation algorithm (cross-entropy criterion)
DNN-Based Noise Estimation

- A parallel approach is considered for noise estimation
- While the same input features Y(t) are used, n1(t) is the M x 1 target vector
- MSE criterion for backpropagation learning
DNN-Based Noise Estimation
Noise-aware training

- Noise-aware training (NAT) first appeared to strengthen DNN-based acoustic modeling for ASR.
- We want to imitate linear interpolation noise estimation.

Augmented features:
  $\mathbf{Y}_{\mathrm{NAT}}(t) = \begin{pmatrix} \mathbf{Y}(t) \\ \hat{\mathbf{n}}_1^{(0)} \\ \hat{\mathbf{n}}_1^{(1)} \\ \vdots \\ \alpha(t) \end{pmatrix}$, where $\alpha(t) = t/(T-1)$; $t = 0, 1, ..., T-1$

- Input dim.: dim(Y_NAT(t)) = dim(Y(t)) + 4M + 1

Target:
- It is the same! The actual M x 1 noise vector n1(t)
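The linear-interpolation noise estimation that NAT imitates can be sketched as follows (assuming, as the augmented feature vector suggests, that n1_hat_0 and n1_hat_1 are noise estimates taken from the beginning and end of the utterance; all names are ours):

```python
import numpy as np

def linear_interp_noise(n_hat_0, n_hat_1, T):
    """Linear-interpolation noise estimate across an utterance:
    n_hat(t) = (1 - alpha(t)) * n_hat_0 + alpha(t) * n_hat_1,
    with alpha(t) = t / (T - 1), so the estimate slides from the
    initial noise estimate to the final one frame by frame."""
    alpha = np.arange(T)[:, None] / (T - 1)             # (T, 1)
    return (1.0 - alpha) * n_hat_0 + alpha * n_hat_1    # (T, M)
```

Feeding the DNN the two endpoint estimates plus alpha(t) gives it exactly the information needed to reproduce (and improve upon) this interpolation.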
Results
Experimental Framework
Databases

AURORA2-2C-CT/FT
- Emulates the acquisition of noisy speech by a dual-mic smartphone in close-talk (CT) and far-talk (FT) conditions.
- Based on the Aurora-2 corpus (connected digits).
- Test A (bus, babble, car and pedestrian street) and Test B (cafe, street, and bus and train stations).
- SNRs: {-5, 0, 5, 10, 15, 20} dB and clean.

CHiME-3
- Tablet with 6 mics: 5 facing forward and 1 facing backwards.
- Based on the speaker-independent medium vocabulary subset of the Wall Street Journal corpus.
- Real and simulated noisy speech (BUS, CAF, PED and STR).
Experimental Framework
Feature extraction and back-end

Feature extraction
- Mel-frequency cepstral coefficients, MFCCs (ETSI ES 201 108).
- AURORA2-2C-CT/FT: (13 MFCCs with CMN) + Δ + ΔΔ.
- CHiME-3 (GMMs): (13 MFCCs x 7 frames of context) + LDA + MLLT + (fMLLR+SAT).
- CHiME-3 (DNNs): 13 MFCCs x 11 frames of context.

Back-end
- HMMs trained with either clean or multi-style data.
- AURORA2-2C-CT/FT (GMMs): one HMM per digit.
- CHiME-3 (GMMs): 2500 tied triphone HMM states modeled by a total of 15000 Gaussians.
- CHiME-3 (DNNs): DNN with 7 hidden layers of 2048 neurons each.
Power Spectrum Enhancement
AURORA2-2C-CT/FT results

[Figure: WAcc (%) vs. SNR (dB) curves under multi-style models. Average WAcc per method:]

  Method             Close-talk   Far-talk
  Baseline           80.68%       81.42%
  D&S                72.18%       77.32%
  MVDR               77.11%       80.91%
  DCSS               85.22%       84.83%
  P-MVDR             84.90%       84.64%
  DSW-B              86.57%       84.77%
  DSW-U_MMSE         86.55%       85.37%
  DSW-(U+Eq)_MMSE    87.41%       86.60%

- ETSI advanced front-end (AFE): close-talk 85.35%, far-talk 86.06%.
- Poor performance of the classical beamforming techniques, especially in close-talk conditions.
- Discarding the phase as used by MVDR beamforming is positive, especially in close-talk conditions.
- Similar performance of DCSS and P-MVDR:
  $\lim_{\mathcal{A}_{21}(f,t) \to 0} w_{f,t}^{(1),\mathrm{DCSS}} = \lim_{\mathcal{A}_{21}(f,t) \to 0} w_{f,t}^{(1),\mathrm{P\text{-}MVDR}} = 1$
  $\lim_{\mathcal{A}_{21}(f,t) \to 0} w_{f,t}^{(2),\mathrm{DCSS}} = \lim_{\mathcal{A}_{21}(f,t) \to 0} w_{f,t}^{(2),\mathrm{P\text{-}MVDR}} = -\frac{\Sigma_{N,f,t}(1,2)}{\Sigma_{N,f,t}(2,2)}$
Power Spectrum Enhancement
AURORA2-2C-CT/FT results

[Figure: WAcc (%) vs. SNR (dB) curves under multi-style models. Average WAcc per method:]

  Method             Close-talk   Far-talk
  Baseline           80.68%       81.42%
  DSW-B              86.57%       84.77%
  DSW-U_ED           84.68%       83.49%
  DSW-U_MMSE         86.55%       85.37%
  DSW-(U+Eq)_ED      84.17%       83.59%
  DSW-(U+Eq)_MMSE    87.41%       86.60%

- ED: eigenvalue decomposition-based steering vector computation.
- The speech component at the 2nd channel can be safely neglected in close-talk conditions, but not in far-talk conditions.
Power Spectrum Enhancement
CHiME-3 results

  WER (%)
  Method                  GMM models   DNN models
  Baseline, 5th ch.       32.67        34.00
  AFE (1)                 21.62        19.14
  SMW (1)                 23.77        21.32
  Wiener+Int (1)          19.15        18.60
  D&S (5 ch.)             23.91        23.05
  D&S (6 ch.)             25.98        24.55
  MVDR (5 ch.)            21.15        18.93
  MVDR (6 ch.)            20.46        18.64
  Lefkimmiatis (1)        23.41        21.08
  MCNR-like (1)           20.49        19.03
  DCSS (1)                20.41        18.06
  P-MVDR (1)              20.08        18.03
  DSW-B (1)               21.07        18.71
  DSW-U_MMSE (1)          20.89        19.41
  DSW-(U+Eq)_MMSE (1)     17.68        16.73

(1) A virtual primary channel is obtained by means of MVDR beamforming.

More single-channel methods: soft-mask weighting (SMW) in the log-Mel domain; Wiener filtering with post-processing (Wiener+Int). Beamformer post-filters: multi-channel Wiener post-filter (Lefkimmiatis); MCNR-like.

- Considering the secondary sensor for D&S yields a drop in performance.
- Percentage changes:
  1 0.36% between MVDR (5 ch.) and MVDR (6 ch.)
  2 2.35% between MVDR (6 ch.) and DSW-(U+Eq)_MMSE
Power Spectrum Enhancement
CHiME-3 results

  WER (%)
  Method               GMM models   DNN models
  Baseline, 5th ch.    32.67        34.00
  MVDR (6 ch.)         20.46        18.64
  DSW-B                21.07        18.71
  DSW-U_ED             20.60        18.51
  DSW-U_MMSE           20.89        19.41
  DSW-(U+Eq)_ED        17.80        16.77
  DSW-(U+Eq)_MMSE      17.68        16.73

- MVDR beamforming yields a strong dehomogenization of the noise at the virtual primary and secondary channels.
- The substantial improvement comes when bias correction and noise equalization are applied together.
- Relative speech gain prior statistics are derived from estimated multi-channel clean speech (close-talk mic).
- Linear time complexity (execution time vs. utterance duration):
  - ED: 27.208 s/s
  - MMSE: 0.008 s/s
41
VTS Feature Compensation
AURORA2-2C-CT/FT results

[Figure: WAcc (%) vs. SNR (dB) under multi-style models; legend values are average WAcc]
                      Close-talk   Far-talk
Baseline              80.68%       81.42%
DSW-(U+Eq)MMSE        87.41%       86.60%
1-VTSa                83.86%       84.65%
2-VTSSa               87.10%       86.28%
1-VTSb                84.61%       85.64%
2-VTSSb               87.41%       86.85%
2-VTSCb               87.87%       87.74%

Clean speech partial estimate computation: a (two-channel MMSE approach) and b (single-channel approach).
Posterior computation: S (stacked approach) and C (conditional approach).
2-VTS exploits a21 and n (spatial information).
2-VTSC conditions the distortion model at the 2nd channel on the 1st one.

Under clean acoustic modeling (WAcc, %)...
Method                Close-talk   Far-talk
Baseline              63.56        65.53
DSW-(U+Eq)MMSE        77.20        74.57
2-VTSCb               86.20        86.01
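For reference, VTS feature compensation builds on the standard log-mel mismatch function relating clean speech x, noise n and the noisy observation y. The sketch below shows the usual single-channel model (channel term omitted); it is not the thesis's dual-channel 2-VTS formulation, which additionally exploits a21 and can condition the secondary channel on the primary one:

```python
import numpy as np

def noisy_logmel(x: np.ndarray, n: np.ndarray) -> np.ndarray:
    """Standard VTS mismatch function in the log-mel domain:
    y = x + log(1 + exp(n - x)), i.e. y = log(e^x + e^n)."""
    return x + np.log1p(np.exp(n - x))

# Example: clean and noise log-mel frames of equal energy (M = 23 channels)
x = np.zeros(23)
n = np.zeros(23)
y = noisy_logmel(x, n)   # every channel equals log(2): noise doubles the power
```

VTS linearizes this nonlinearity around the clean speech GMM means to adapt the model (or, as here, to compensate the features).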
42
VTS Feature Compensation
CHiME-3 results

                           WER (%)
Method                GMM models   DNN models
Baseline, 5th ch.     32.67        34.00
AFE (1)               21.62        19.14
MVDR (6 ch.)          20.46        18.64
DSW-(U+Eq)MMSE (1)    17.68        16.73
1-VTSb (1)            28.20        21.47

(1) MVDR beamforming is again applied to generate a virtual primary channel.

Poor performance of VTS feature compensation under multi-style acoustic modeling has already been reported in the literature [1].

Under clean GMM-based acoustic modeling (WER, %)...
Baseline, 5th ch.   MVDR (6 ch.)   1-VTSb (1)
80.17               46.03          40.37

[1] Fujimoto, M. and T. Nakatani: "Feature enhancement based on generative-discriminative hybrid approach with GMMs and DNNs for noise robust speech recognition," in Proc. of 40th ICASSP, April 19-24, Brisbane, Australia, 2015.
43
VTS Feature Compensation
Power spectrum enhancement as pre-processing

AURORA2-2C-CT/FT results

[Figure: WAcc (%) vs. SNR (dB) under multi-style models; legend values are average WAcc]
                        Close-talk   Far-talk
1-VTSb                  84.61%       85.64%
(P-MVDR)+(1-VTSb)       88.96%       88.04%
DSW+(1-VTSb)            89.69%       88.23%
2-VTSCb                 87.87%       87.74%
(P-MVDR)+(2-VTSCb)      89.80%       88.91%
DSW+(2-VTSCb)           90.35%       88.54%

The higher the SNR of the speech data, the higher the recognition accuracy provided by VTS feature compensation.

When P-MVDR or DSW-(U+Eq)MMSE is combined with 1-VTSb, the same spatial information as in the case of 2-VTSCb is being used.

These are the best results obtained on the AURORA2-2C-CT/FT databases.
44
Deep Learning-Based Techniques
Missing-data masks for spectral reconstruction

AURORA2-2C-CT - close-talk

[Figure: WAcc (%) vs. SNR (dB); legend values are average WAcc]
Clean models:
                 Test set A   Test set B
Baseline         67.63%       59.49%
AFE              82.24%       76.27%
TGI+Oracle       95.83%       93.05%
TGI+(T-SNR)      80.12%       72.99%
TGI+DNN          86.89%       77.83%

Multi-style models:
                 Test set A   Test set B
Baseline         85.07%       76.29%
AFE              88.13%       82.58%
TGI+Oracle       95.87%       92.71%
TGI+(T-SNR)      84.22%       77.14%
TGI+DNN          88.17%       79.87%

TGI: Truncated-Gaussian based Imputation
Oracle: oracle masks
T-SNR: masks from thresholding an estimation of the a priori SNR
DNN: two-channel DNN-based mask estimation

Noises of test set B are reserved to evaluate the generalization ability of the DNN.
Performance tends to saturate for L = 2 and greater values.
2 hidden layers with 460 nodes each. Input and output layers have 230 and 23 nodes, resp. (M = 23).
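The T-SNR baseline above obtains a binary missing-data mask by thresholding an estimate of the a priori SNR; a minimal sketch of that idea (the 0 dB threshold and the input SNR values are illustrative placeholders, not the thesis's settings):

```python
import numpy as np

def tsnr_mask(snr_prior_db: np.ndarray, threshold_db: float = 0.0) -> np.ndarray:
    """Binary missing-data mask: a time-frequency cell is 'reliable' (1)
    when its estimated a priori SNR exceeds the threshold."""
    return (snr_prior_db > threshold_db).astype(np.float64)

# Example: a priori SNR estimates (dB) for a few time-frequency cells
snr = np.array([[12.0, -3.0],
                [ 0.5, -8.0]])
mask = tsnr_mask(snr)   # -> [[1., 0.], [1., 0.]]
```

Cells flagged as unreliable are then reconstructed by the imputation stage (TGI) rather than used directly.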
45
Deep Learning-Based Techniques
Noise estimates for feature compensation

AURORA2-2C-CT - close-talk

[Figure: WAcc (%) vs. SNR (dB); legend values are average WAcc]
Clean models:
              Test set A   Test set B
Baseline      67.63%       59.49%
TGI+DNN       86.89%       77.83%
DNN1          80.13%       69.92%
DNN2          89.76%       82.55%
DNN1^NAT      82.79%       74.74%
DNN2^NAT      86.24%       78.87%

Multi-style models:
              Test set A   Test set B
Baseline      85.07%       76.29%
TGI+DNN       88.17%       79.87%
DNN1          83.55%       73.41%
DNN2          90.95%       84.13%
DNN1^NAT      84.54%       76.96%
DNN2^NAT      88.21%       81.35%

Noise estimates are applied to 1-VTSb.
DNN1: only the 1st channel is used as input
DNN2: the dual channel is used as input (as presented)
DNN1^NAT: DNN1 with NAT
DNN2^NAT: DNN2 with NAT

Noises of test set B are reserved to evaluate the generalization ability of the DNN.
5 hidden layers with 512 nodes each. Input and output layers have 230 (323 with NAT) and 23 nodes, resp. (L = 2, M = 23).
NAT improves DNN1 while worsening DNN2.
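The 230-node input layer above is consistent with stacking both channels' M = 23 log-mel coefficients over a context of 2L + 1 = 5 frames: 2 x 5 x 23 = 230. A sketch of such dual-channel context stacking (an illustrative layout; the exact feature ordering used in the thesis may differ):

```python
import numpy as np

def stack_context(ch1: np.ndarray, ch2: np.ndarray, L: int = 2) -> np.ndarray:
    """Stack dual-channel log-mel frames over a +/-L frame context.

    ch1, ch2: (T, M) feature matrices for the primary/secondary channel.
    Returns a (T, 2*(2L+1)*M) matrix (edges are zero-padded).
    """
    T, M = ch1.shape
    feats = np.concatenate([ch1, ch2], axis=1)      # (T, 2M) per-frame pair
    padded = np.pad(feats, ((L, L), (0, 0)))        # zero-pad the edges
    windows = [padded[t:t + 2 * L + 1].ravel() for t in range(T)]
    return np.stack(windows)                        # (T, 2*(2L+1)*M)

T, M = 10, 23
X = stack_context(np.zeros((T, M)), np.zeros((T, M)))
assert X.shape == (10, 230)   # matches the 230-node input layer for L = 2, M = 23
```

With NAT, a noise estimate is appended to this vector, giving the 323-dimensional input reported above.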
46
Deep Learning-Based Techniques
Noise estimates for feature compensation

AURORA2-2C-CT - close-talk

[Figure: WAcc (%) vs. SNR (dB); legend values are average WAcc]
            Clean models   Multi-style models
Baseline    63.56%         80.68%
IMCRA       78.38%         82.26%
MS          78.54%         83.68%
Interp.     83.06%         84.61%
PLDNE       81.05%         85.29%
DNN2        86.16%         87.54%

Noise estimation methods for comparison (applied to 1-VTSb):
Single-channel methods: improved minima controlled recursive averaging (IMCRA), minimum statistics (MS) and linear interpolation (Interp.).
PLDNE is for dual-mic smartphones (PLD and homogeneous noise field).
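Among the single-channel baselines above, minimum statistics (MS) takes the minimum of the recursively smoothed noisy power within a sliding window as the noise estimate; a bare-bones sketch of that principle, without the optimal smoothing and bias compensation of the full method (alpha and win are illustrative values):

```python
import numpy as np

def min_stats_noise(power: np.ndarray, alpha: float = 0.85, win: int = 50) -> np.ndarray:
    """Bare-bones minimum-statistics noise tracker.

    power: (T, F) noisy power spectrogram.
    Returns a (T, F) noise power estimate: the minimum of the recursively
    smoothed power over the last `win` frames (no bias correction).
    """
    T, F = power.shape
    smoothed = np.empty_like(power)
    smoothed[0] = power[0]
    for t in range(1, T):                  # first-order recursive smoothing
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power[t]
    noise = np.empty_like(power)
    for t in range(T):                     # sliding-window minimum
        noise[t] = smoothed[max(0, t - win + 1):t + 1].min(axis=0)
    return noise
```

Because speech is sparse in time and frequency, the windowed minimum follows the noise floor even through short speech bursts, which is why MS needs no explicit voice activity detection.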
47
Summary, Conclusions and Future Work
48
Summary and Conclusions

The multi-channel information coming from mobile devices with several sensors can be exploited for noise-robust ASR purposes.

We have developed a series of contributions intended to operate in a dual-mic set-up where the performance of classical beamforming is poor.

Designing specific solutions is mandatory to achieve high recognition accuracy on dual-mic mobile devices.

The AURORA2-2C-CT/FT corpora are contributions of this Thesis.

The best power spectrum enhancement results were achieved by DSW-(U+Eq)MMSE on all the corpora.

Our MMSE-based speech gain estimation method provides similar or better results than an eigenvalue decomposition-based method at a fraction of the computational complexity.
49
Summary and Conclusions

Discarding the phase information is beneficial to overcome the limitations of classical MVDR beamforming when applied to the considered dual-mic framework.

We showed the superior performance of the dual-channel over the single-channel VTS feature compensation approach on the AURORA2-2C-CT/FT databases.

For dual-channel VTS feature compensation, modeling the conditional dependence of the noisy secondary channel given the primary one proved to be the more robust choice.

Accurate missing-data masks and noise estimates were obtained by jointly exploiting the dual-channel noisy information and the powerful modeling capabilities of DNNs.

The use of the secondary sensor itself can be understood as a more robust kind of NAT strategy.

Our contributions broadly showed outstanding performance at low SNRs, which makes them promising techniques for highly noisy environments such as those where mobile devices are used.
50
Future Work

Extension of the different proposals to operate on other mobile devices with other small microphone array configurations.

Taking advantage of the multi-channel noisy observation also for clean speech partial estimation in VTS feature compensation.

With respect to our DNN-based proposals, performing an extensive search over the architecture and training configuration:
  - Use of recurrent neural networks (RNNs)
  - Additional or different kinds of features

Extension of our DNN-based proposals to deal with a hands-free/far-talk scenario.
Thank you very much for your attention

Ivan Lopez Espejo
Department of Signal Theory, Telematics and Communications
University of Granada
E-mail: iloes@ugr.es