
PHONEME DISCRIMINATION USING NEURONS WITH SYMMETRIC NONLINEAR RESPONSE OVER A SPECTRAL RANGE

Ondrej Šuch¹,², Ondrej Škvarek², Martin Klimo²

¹ Mathematical Institute of Slovak Academy of Sciences, Ďumbierska 1, Banská Bystrica
² University of Žilina, Univerzitná 8215, Žilina, Slovakia
arXiv:1311.0819v1 [cs.SD] 4 Nov 2013

ABSTRACT

We consider the ability of a very simple feed-forward neural network to discriminate phonemes based on just the relative power spectrum. The network consists of two neurons with symmetric nonlinear response over a spectral range. The output of the neurons is subsequently fed to a comparator. We show that often this is enough to achieve complete separation of data. We compare the performance of the discriminants found with that of more general neurons. Our conclusion is that not much is gained in passing to real-valued weights. More likely, a higher number of neurons and preprocessing of input will yield better discrimination results. The networks considered are directly amenable to hardware (neuromorphic) designs. Other advantages include interpretability, guarantees of performance on unseen data and low Kolmogorov complexity.

Index Terms: phoneme discrimination, feed-forward neural network, neuromorphic hardware, TIMIT, memristor

Research partially supported by grants APVV-0219-12 and VEGA 2/0112/11. Usage of computational facilities of University of Žilina and University of Matej Bel is gratefully acknowledged.

1. INTRODUCTION

Artificial neural networks have emerged as one of the most powerful tools in speech recognition [1], and more generally in machine learning [2, Chapter 5]. If there is a downside to employing neural networks, it is their opaqueness. Much like the human brain which inspired them, it is often not clear why they work so well. It is however possible, and even advisable [3, page 148], to address this opaqueness by adopting design principles that will provide guarantees about their performance in situations that did not occur during training.

In our work we introduce a new class of neural networks that may be used for discrimination between phonemes. They arose in connection with investigation of the computational power of memristor based networks. As such, they use predominantly min and max processing primitives. There are other approaches that use the same processing elements [4], [5], [6], [7], [8], [9] as well as a vast body of research on more general fuzzy logic systems. The difference in our work is that we strived to achieve two invariance properties of the resulting network.

First, we demand that the output of the neural network is balanced, i.e. invariant with respect to loudness. We achieve this requirement by restricting our attention to neural networks carrying out the computation

$$f(s) = f_1(s) - f_2(s), \tag{1}$$

where $s$ represents sound and $f_1$, $f_2$ are nonlinear functions that grow additively with increase in loudness, i.e.

$$f_1(s) - f_1(s') = f_2(s) - f_2(s') = d, \tag{2}$$

if sounds $s$ and $s'$ differ purely by $d$ decibels in loudness.

Secondly, we demand a symmetric response of neurons $f_1$, $f_2$ over a spectral range. Write $s = (s_1, \dots, s_n)$ for the discrete log-periodogram of a sound, i.e. $s_i$ represents the log of power at frequency $(i-1) f_S / (2n)$. A spectral range $r$ is any subsequence $(s_i, s_{i+1}, \dots, s_j)$ with $i \le j$. Symmetric response over a spectral range $r$ of a neuron represented by a $k$-ary function $f$ means that

$$f(r) = f(\sigma(r)) \tag{3}$$

for any permutation $\sigma$ of $k$ elements. This requirement is a strong form of requiring that the response be invariant to small shifts of formant frequencies.

Fig. 1. Sketch of power spectra of three signals with a similar formant frequency.

Consider a family of signals, the spectra of three of which are sketched in Figure 1. Suppose one wants to construct a function with strong response for signals in this family with formants varying between frequencies $\omega_1$ and $\omega_2$. If one restricts oneself to linear forms, there is essentially a
single expression that has uniformly strong response for all signals in the family, namely

$$s_i + s_{i+1} + \dots + s_j. \tag{4}$$

This is not true if one tries expressions from nonlinear algebras, even as simple as the algebra generated by binary min and max functions. For instance, consider the functions

$$\max(s_i, s_{i+1}, \dots, s_j) \tag{5}$$
$$\max{}_2(s_i, s_{i+1}, \dots, s_j) \tag{6}$$

where $\max_2$ denotes the second largest element of the set. Both of these functions have strong and symmetric response uniformly over all signals in the family.

In our work we shall be concerned with induced B-classifiers that determine the class of a phoneme based on comparing $f(s)$ with a threshold $\beta$, and a special subclass of Z-classifiers, for which $\beta = 0$.

2. OPTIMIZATION METHODS

A vast majority of machine learning techniques such as support vector machines, neural networks, or various regressions have a parameter space that forms a Riemannian manifold. Consequently, with these techniques one may use gradient based optimization mechanisms and sometimes even convex optimization. The situation with our class of neural networks is different. We need to find an optimal structure in a discrete parameter space. There are two hurdles that need to be addressed.

First, in the absence of a clever trick, one needs to search the discrete space in a reasonable amount of time. One may opt for a local search whereby an initial network is optimized by small twists, or perhaps for a genetic algorithm. The disadvantage of the former is that the search may end in a suboptimal local minimum, whereas the latter may take a long time to find a good network. In our work we opt for exhaustive search of a polynomially growing family, namely we will consider only functions of the form

$$f(s) = q_1(r_1) - q_2(r_2), \tag{7}$$

where $q_1$, $q_2$ are quantiles (e.g. $\max$, $\max_2$), and $r_1$, $r_2$ are spectral ranges. If one considers $r_i$ of length at most $L$ and the whole spectrum has $N$ power points, then there are $O(N^2 L^4)$ such functions and an exhaustive search is feasible.

Secondly, one needs to define a goodness criterion. Typically this is classification success. However, in the case when multiple discrimination functions achieve perfect separation on training data, a more refined criterion is needed. In this context one needs to distinguish between training Z-classifiers and B-classifiers. For B-classifiers, the usual Fisher discriminant may be used, which is defined as

$$F_B(f) := \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}, \tag{8}$$

where $\mu_i$, $\sigma_i^2$ are means and variances of evaluations of the discriminant function $f$ over the two classes. For Z-classifiers we have maximized the following secondary tie-breaking criterion

$$F_Z(f) := \min\left(\frac{\mu_1^2}{\sigma_1^2}, \frac{\mu_2^2}{\sigma_2^2}\right), \quad \text{if } \operatorname{sign}(\mu_1) \ne \operatorname{sign}(\mu_2). \tag{9}$$

For our data set we used discrete spectra created by A. Buja, W. Stuetzle and M. Maechler and used in work [10]. It is freely available in the ElemStatLearn package of the R statistics software as well as online [11]. The data set contains spectra of five English phonemes computed from the TIMIT database.

We have used custom-built C++ software for results obtained in the next two sections, R for verification, graphing and Nelder-Mead optimization in section 4, and Matlab for computing data in Figure 4.

3. RESULTS

Figure 2 presents the results of classification by Z-classifiers. From the graphs it is clear that in the majority of cases, the discriminant functions we considered are able to completely separate the two classes. Moreover, separation occurs with relatively short spectral ranges, with size at most 12. There are just two cases where separation does not occur, namely discrimination of the pairs aa-ao and dcl-iy.

It is interesting to note that passing from Z-classifiers to more general B-classifiers does not improve results very much, as can be seen in Table 1.

                 Change on
phonemes    train data    test data
aa-ao         0.78 %        1.82 %
aa-dcl        0.09 %        0 %
aa-iy         0 %           0 %
aa-sh         0 %           0 %
ao-dcl        0.08 %        0 %
ao-iy         0 %          -0.17 %
ao-sh         0 %           0 %
dcl-iy        2.48 %        0.99 %
dcl-sh        0 %          -0.48 %
iy-sh         0.07 %        0.19 %

Table 1. Improvement (positive) or worsening (negative) of performance of B-classifiers compared to Z-classifiers

Unlike many other classes of neural networks, the structure (and not only the response) of our networks can be clearly visualized, as seen in Figure 3. In the figure, the horizontal scale indicates the spectral ranges to which the two neurons are sensitive. The vertical scale indicates the maximum length of spectral ranges $r_1$, $r_2$.
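The structure in question is small: each neuron amounts to a spectral range paired with a quantile. As a minimal sketch (not the authors' C++ implementation), the following Python evaluates a discriminant of the form q1(r1) - q2(r2), checks the loudness invariance and the permutation symmetry from the introduction, and counts the candidate neurons. The function names, the 0-based indexing, the made-up spectrum, the values N = 256 and L = 12, and the assumption that any order statistic of a range is an admissible quantile are all illustrative choices rather than details taken from the paper.

```python
from itertools import permutations


def quantile(values, k):
    """Return the k-th largest element (k=1 is max, k=2 is max2)."""
    return sorted(values, reverse=True)[k - 1]


def discriminant(s, range1, k1, range2, k2):
    """Evaluate f(s) = q1(r1) - q2(r2) on a log-power spectrum s.

    range1 and range2 are inclusive (i, j) index pairs, 0-based here.
    """
    i1, j1 = range1
    i2, j2 = range2
    return quantile(s[i1:j1 + 1], k1) - quantile(s[i2:j2 + 1], k2)


def count_neurons(n_points, max_len):
    """Count (range, quantile) choices for one neuron.

    Assumes every order statistic of a range is an admissible quantile.
    """
    return sum(length * (n_points - length + 1)
               for length in range(1, max_len + 1))


# Made-up spectrum, for illustration only.
s = [3.0, 7.5, 6.0, 9.0, 4.5, 2.0, 8.0, 1.0]
base = discriminant(s, (1, 4), 2, (5, 7), 1)

# Loudness invariance: adding d dB to every entry leaves f unchanged.
louder = [x + 6.0 for x in s]
shifted = discriminant(louder, (1, 4), 2, (5, 7), 1)

# Symmetry: permuting the entries inside r1 leaves f unchanged.
symmetric = all(
    discriminant(s[:1] + list(p) + s[5:], (1, 4), 2, (5, 7), 1) == base
    for p in permutations(s[1:5])
)

# Size of the search space for a pair of neurons (hypothetical N and L).
per_neuron = count_neurons(256, 12)
pairs = per_neuron ** 2
```

The count grows like N L^2 per neuron, so pairs of neurons grow like N^2 L^4, which agrees with the bound quoted in Section 2.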
[Figure 2: ten panels, one per phoneme pair (aa-ao, aa-dcl, aa-iy, aa-sh, ao-dcl, ao-iy, ao-sh, dcl-iy, dcl-sh, iy-sh); horizontal axis: maximal width (5 to 30); vertical axis: % correctly classified (80 to 100).]
Fig. 2. Train and test success rates for various pairwise Z-classifiers
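The selection criteria behind these results, the Fisher discriminant (8) and the tie-breaking criterion (9), can be written down in a few lines. The sketch below is an illustration under stated assumptions: it uses the population variance from Python's `statistics` module (the paper does not say whether population or sample variance was used), the function names are hypothetical, and the two lists of discriminant values are invented.

```python
from statistics import mean, pvariance


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fisher_b(values1, values2):
    """Criterion (8): (mu1 - mu2)^2 / (var1 + var2)."""
    return ((mean(values1) - mean(values2)) ** 2
            / (pvariance(values1) + pvariance(values2)))


def fisher_z(values1, values2):
    """Criterion (9): min(mu1^2 / var1, mu2^2 / var2).

    Defined only when the class means have different signs;
    returns None otherwise.
    """
    m1, m2 = mean(values1), mean(values2)
    if sign(m1) == sign(m2):
        return None
    return min(m1 ** 2 / pvariance(values1), m2 ** 2 / pvariance(values2))


# Invented evaluations of a discriminant f over two classes.
class1 = [1.0, 2.0, 3.0]
class2 = [-2.0, -4.0, -6.0]

score_b = fisher_b(class1, class2)
score_z = fisher_z(class1, class2)
```

In an exhaustive search one would presumably rank candidates first by classification success and use these scores only to break ties, as the text describes.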

4. CONTINUOUS NEIGHBORHOOD OF OPTIMAL QUANTILE CLASSIFIERS

One may try to improve the discrimination results by allowing more general functions of a spectral range. So let us suppose that we have spectral ranges $r_1$, $r_2$. Let us write $\operatorname{sort}()$ for the function that orders the elements of its vector argument. The commonly used LDA tries to optimize Fisher's discriminant of functions

$$w_1 \cdot r_1 - w_2 \cdot r_2, \tag{10}$$

with real valued weight vectors $w_1$, $w_2$, whereas in the previous section we optimized separations of expressions

$$q_1(r_1) - q_2(r_2) = b_1 \cdot \operatorname{sort}(r_1) - b_2 \cdot \operatorname{sort}(r_2), \tag{11}$$

with exactly one nonzero entry in both $b_1$ and $b_2$. One may relax this condition on $b_i$ to obtain other classifiers.

Example 1. Consider the B-classifier defined by the function

$$f(s) = \max(s_{62}, s_{63}, \dots, s_{74}) - s_1, \tag{12}$$

with threshold value $\beta = 4.03279$ that decides

$$\begin{cases} \text{phoneme is dcl} & \text{if } f(s) < 4.03279, \\ \text{phoneme is iy} & \text{if } f(s) > 4.03279. \end{cases} \tag{13}$$

We can consider vectors $b_i$ in (11) of the following categories:

(monotone OWA) nonnegative, increasing entries in each $b_i$, with total sum equal to one,
(OWA [13]) nonnegative entries in each $b_i$, with total sum equal to one,

(ordered LDA) arbitrary real valued entries.

Comparison of these methods can be seen in Table 2. All methods in the table use only the spectral component $s_1$ and the spectral range $s_{62}, s_{63}, \dots, s_{72}$. Balanced LDA is a variant of LDA in which the sum of all coefficients is 0. Our conclusion is that not much is gained by passing from discrete structures represented by quantiles to weighted ones, in line with similar research [14].

method           train error   test error   Fisher train score   Fisher test score
quantiles (f)      2.7 %         5.1 %            6.22                 6.54
monotone OWA       3 %           4 %              6.24                 6.5
OWA                3 %           4.5 %            6.28                 6.54
ordered LDA        0.8 %         1.8 %           15                   12.83
balanced LDA       4.2 %         5.7 %            5.69                 5.85
LDA                1.2 %         2.2 %           13.81                12.22

Table 2. Comparison of LDA with ordered methods. See text for explanation of various methods.

[Figure 3: panel ao-iy; horizontal axis: Frequency (Hz), 0 to 8000; vertical axis: spectral range width, 5 to 30.]
Fig. 3. Spectral ranges of optimal neural networks for discrimination ao-iy plotted against increasing spectral range width. The white circle indicates the position of the quantile (the right end is the maximum; the left end and the middle would represent the minimum and the median respectively). More graphs are available in [12, page 83].

5. FUTURE WORK

In this contribution we have opted to present only the simplest results due to limited space. It is clear however that more research is needed for this class of networks to find applications. Let us outline the directions further research may take.

First, one should take into account known psychoacoustic phenomena of human hearing. It may prove advantageous to adjust spectral power to reflect varying sensitivity to varying frequency [15]. Our experiments showed that low frequency power was often crucial for discrimination, and thus it may prove useful to use the Q-transform [16], which provides more data points in lower frequencies compared to the ordinary FFT.

Secondly, it is well known that it is two and sometimes up to four formants that characterize a vowel. It is therefore necessary to consider a more complex set of discrimination functions. In [17] we proposed an algebra whose elements are candidates for describing the structure of more complex networks.

Let us conclude by summarizing the advantages of the proposed networks. By design, they provide guaranteed performance on variations of the training data unseen during training, they are interpretable, and they have very low Kolmogorov complexity, as the example (12) shows.

Fig. 4. Evaluation of the dcl-iy classifier over a Slovak word (IPA: /odiʃla/) superimposed on the PCM signal. Note that the classifier was trained on English data (TIMIT).

Last, but not least, the networks can be easily implemented in hardware, since BJT [18], CMOS [19] and even passive memristor implementations [20] of min, max and comparison operators exist. In this context it is worthwhile to point out that it is the change of spectral content, rather than the absolute spectral content, that can be read off from these discriminants. This is illustrated in Figure 4, where the transition between phonemes is quite strong. One may thus hypothesize that an analog hardware speech recognizer could be based on a silicon cochlea [21], followed by processing by a neural network of the kind described here, whose output would be fed to an adaptive differentiator like that of Delbrück and Mead [22], and finally to a memristive switch [23].
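To make the compactness of the classifier in Example 1 concrete, here is a minimal Python sketch of the rule (12)-(13), together with the ordered weighted form from (11) that reduces to it when the weight vector has a single nonzero entry. This is an illustration, not the authors' code: the function names are hypothetical, the spectra are synthetic, the spectrum length of 256 is an assumption, and returning `None` at exact equality with the threshold is a choice made here because (13) leaves that case undefined.

```python
def classify_dcl_iy(s, threshold=4.03279):
    """Rule (12)-(13): f(s) = max(s_62, ..., s_74) - s_1.

    The text uses 1-based indices, so s_62..s_74 is s[61:74] here.
    """
    f = max(s[61:74]) - s[0]
    if f < threshold:
        return "dcl"
    if f > threshold:
        return "iy"
    return None  # equality is not covered by (13)


def owa(weights, r):
    """Ordered weighted form b . sort(r) from (11), ascending sort."""
    return sum(w * x for w, x in zip(weights, sorted(r)))


def is_monotone_owa(weights, tol=1e-9):
    """Check nonnegative, nondecreasing weights that sum to one."""
    nonnegative = all(w >= 0 for w in weights)
    increasing = all(a <= b for a, b in zip(weights, weights[1:]))
    return nonnegative and increasing and abs(sum(weights) - 1.0) < tol


# Synthetic spectra of assumed length 256, for illustration only.
flat = [10.0] * 256
peaked = list(flat)
peaked[65] = 20.0  # a peak inside the range s_62..s_74

# A weight vector with a single 1 in the last position selects the maximum.
one_hot = [0.0] * 12 + [1.0]
max_via_owa = owa(one_hot, peaked[61:74])
```

The whole classifier is determined by two index ranges, two quantile positions and one threshold, which is the sense in which its description is short.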
References

[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82-97, 2012, ISSN: 1053-5888. DOI: 10.1109/MSP.2012.2205597.
[2] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
[3] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, vol. 9, no. 1, pp. 147-169, 1985, ISSN: 0364-0213. DOI: http://dx.doi.org/10.1016/S0364-0213(85)80012-4. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0364021385800124.
[4] S. Badura, "Learning of fuzzy logic circuits with memory for speech recognition using video," Ph.D. thesis, Žilinská univerzita, 2012.
[5] S. Foltán, "Speech recognition by means of fuzzy logical circuits," Ph.D. thesis, Žilinská univerzita, 2012.
[6] S. Foltán, "Speech recognition by means of fuzzy logical circuits," in 18th international conference on soft computing, MENDEL 2012, Brno, Jun. 27-29, 2012.
[7] S. Foltán and J. Smieško, "Phoneme recognition by application of genetic programming," in Sborník příspěvků z mezinárodní vědecké konference, Mezinárodní konference pro doktorandy a mladé vědecké pracovníky, Jun. 27-29, 2012, pp. 2234-2241.
[8] S. Badura, M. Klimo, and O. Škvarek, "Lip reading using fuzzy logic network with memory," in Application of Information and Communication Technologies (AICT), 6th International Conference, Oct. 2012, pp. 1-4. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6398471&isnumber=6398461.
[9] S. Badura, S. Foltán, and M. Klimo, "Fuzzy logic networks for speech recognition," Communications, Scientific Letters of University of Žilina, pp. 13-18, 2 2013.
[10] T. Hastie, A. Buja, and R. Tibshirani, "Penalized discriminant analysis," The Annals of Statistics, vol. 23, no. 1, pp. 73-102, 1995.
[11] (Dec. 7, 2012). English phonemes, [Online]. Available: http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/phoneme.data.
[12] O. Šuch. (2013). Using min-max circuits for speech recognition, [Online]. Available: http://www.savbb.sk/~ondrejs/Phoneme/book.pdf.
[13] R. Yager, "On ordered weighted averaging aggregation operators in multicriteria decisionmaking," Systems, Man and Cybernetics, IEEE Transactions on, vol. 18, no. 1, pp. 183-190, 1988, ISSN: 0018-9472. DOI: 10.1109/21.87068.
[14] D. Soudry and R. Meir. (2013). Mean Field Bayes Backpropagation: scalable training of multilayer neural networks with binary weights, [Online]. Available: http://arxiv.org/abs/1310.1867.
[15] R. Stern and N. Morgan, "Hearing is believing: biologically inspired methods for robust automatic speech recognition," Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 34-43, 2012, ISSN: 1053-5888. DOI: 10.1109/MSP.2012.2207989.
[16] J. C. Brown, "Calculation of a constant Q spectral transform," The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425-434, 1991. DOI: 10.1121/1.400476. [Online]. Available: http://scitation.aip.org/content/asa/journal/jasa/89/1/10.1121/1.400476.
[17] O. Šuch. (2013). Phoneme discrimination using KS algebra I, [Online]. Available: arxiv.org/abs/1302.6031.
[18] T. Yamakawa, "Fuzzy logic computers and circuits," U.S. Patent 4,875,184, 1987.
[19] I. Baturone, S. Sanchez-Solano, A. Barriga, and J. L. Huertas, "Implementation of CMOS fuzzy controllers as mixed-signal integrated circuits," IEEE J. of Solid State Circuits, vol. 5, no. 1, pp. 1-19, 1997.
[20] M. Klimo and O. Šuch. (2011). Memristors can implement fuzzy logic, [Online]. Available: http://arxiv.org/abs/1110.2074.
[21] B. Wen and K. Boahen, "A silicon cochlea with active coupling," Biomedical Circuits and Systems, IEEE Transactions on, vol. 3, no. 6, pp. 444-455, Dec. 2009, ISSN: 1932-4545. DOI: 10.1109/TBCAS.2009.2027127.
[22] T. Delbrück and C. A. Mead, "Adaptive photoreceptor with wide dynamic range," in 1994 International Symposium on Circuits and Systems, ISCAS'94, 1994, 339-342, vol. 4.
[23] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, "The missing memristor found," Nature, vol. 453, pp. 80-83, 2008, doi:10.1038/nature06932.