ABSTRACT
• Maxout networks have brought significant improvements to various speech recognition and computer vision tasks.
• We introduce two new types of generalized maxout units, which we call p-norm and soft-maxout, and find that the p-norm generalization of maxout consistently performs well in our LVCSR task.
• A simple normalization technique was used to prevent instability during training.
Related Works
• The p-norm pooling strategy has been used for learning image features [Kavukcuoglu et al., CVPR 2009] [Boureau et al., ICML 2010] [Sermanet et al., ICPR 2012].
• A recent work [Gulcehre et al., arXiv (Nov. 2013)] has proposed a learned-norm pooling strategy for deep feedforward and recurrent neural networks.
Figure 4. In terms of WER, p-norm generally works well with fewer parameters than the tanh system, and also overtrains more easily.
Figure 1. Tuning the group size and power p.
Non-linearity Types
• The traditional nonlinearities for neural networks were sigmoidal functions (tanh or sigmoid).
• The rectified linear unit (ReLU, which is simply max(x, 0)) has also become popular.
• Maxout: y = max_i x_i
• Soft-maxout: y = log Σ_i exp(x_i)
• p-norm: y = (Σ_i |x_i|^p)^(1/p)
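The three pooling nonlinearities above can be sketched as group-wise reductions over non-overlapping groups of activations. A minimal NumPy illustration (the function names and the example group size of 2 are ours, not from the poster):

```python
import numpy as np

def maxout(x, group_size):
    # Reshape into non-overlapping groups and take the max in each group.
    x = x.reshape(-1, group_size)
    return x.max(axis=1)

def soft_maxout(x, group_size):
    # Smooth approximation of maxout: y = log sum_i exp(x_i) per group.
    x = x.reshape(-1, group_size)
    return np.log(np.exp(x).sum(axis=1))

def pnorm(x, group_size, p=2):
    # y = (sum_i |x_i|^p)^(1/p) per group; p = 2 worked best in our experiments.
    x = x.reshape(-1, group_size)
    return (np.abs(x) ** p).sum(axis=1) ** (1.0 / p)

x = np.array([3.0, -4.0, 1.0, 0.0])
print(maxout(x, 2))       # per-group maxima: [3. 1.]
print(pnorm(x, 2, p=2))   # 2-norm of each group: [5. 1.]
```

Note that p-norm pools the magnitudes, so a group like (3, -4) maps to 5, whereas maxout keeps only the single largest activation.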
Stabilizing Training
• Here, we used a group size G = 10 and a number of groups K = 290 in each case (this was tuned to give about 3 million parameters in the 2-layer case).
• The optimal number of hidden layers seems to be lower for 2-norm (at 3 layers) than for the other nonlinearities (at around 5).
• When using maxout and related nonlinearities, training sometimes failed after many epochs.
• Very large activations started appearing, so that the posteriors output from the softmax layer could be zero.
• We solved this by introducing a renormalization layer after each maxout/p-norm layer.
• This layer divides the whole input vector by its root-mean-square value (renormalizes so that RMS = 1).
• It is a many-to-many function (e.g. 200 inputs to 200 outputs).
• The same renormalization is also applied at test time.
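The renormalization step described above amounts to a simple rescaling of the activation vector; a minimal NumPy sketch (the epsilon guard against division by zero is our addition):

```python
import numpy as np

def renormalize(x, eps=1e-8):
    # Divide the whole activation vector by its root-mean-square value,
    # so the output vector has RMS = 1; a many-to-many map (e.g. 200 -> 200).
    rms = np.sqrt(np.mean(x ** 2))
    return x / (rms + eps)

y = renormalize(np.array([3.0, -4.0]))
print(np.sqrt(np.mean(y ** 2)))  # RMS of the output is 1
```

Because the map is applied identically at training and test time, it caps the scale of activations flowing into the softmax without changing their direction.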
Figure 3. p-norm overtrains more easily than the tanh system, in terms of the objective function (cross-entropy).
• The two bar charts here show the WER and ATWV performance, respectively, of the DNN+SGMM Limited LP systems in all four OP1 languages.
• The WERs are on the respective 10-hour development sets.
• The ATWVs for Bengali and Assamese correspond to NIST IndusDB scoring protocols as of October 2013, while the Zulu and Haitian Creole ATWVs are based on JHU-generated keyword lists and reference alignments (RTTM).
The authors were supported by DARPA BOLT contract No. HR0011-12-C-0015, and IARPA BABEL contract No. W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the
authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL or the U.S. Government.