2016 3rd International Conference on Computing for Sustainable Global Development, 16th-18th March, 2016
Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)
Chuya China (Bhanja), Tameem S. Choudhury
I. INTRODUCTION
Speech can be described as the "spoken" form of
communication, generated by the valid combination of sounds
particular to a specific language. The sounds are produced
when vowels and consonants blend. An extremely large
number of languages exist, which can be differentiated on the
basis of their terminologies, their vocabularies, the patterns
that arrange their words, and their collections of phrases. The
mechanism of identifying one's words is initiated at the most
primitive level, the acoustics of the spoken word. Once the
aural signal is examined, vocal sounds are further analyzed to
isolate auditory cues and phonetic data. This data is used for
higher-level language processes [1]. Generally, different types
of characteristics are identified from the speech signal, and the
most prominent among them are the prosodic features. Prosody
represents the rhythm, stress and intonation of speech. It
provides diverse characteristics of the talker and the language:
for example, the mental condition of the talker and the type of
the speech (assertion, doubt or instruction). It also tells about
R. H. Laskar, Aniket Pramanik (Email: aniketpramanik@yahoo.co.in)
A Comparative Study of Discriminative Approaches for Classifying Languages into Tonal and Non-Tonal Categories at Syllabic Level
Data given to the classifiers
Once the pitch and the short-term energy of the signal are extracted and modeled
mathematically, they are given to the classifiers as input. Normalization of the
data is always carried out with respect to the global maxima.
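This normalization step can be sketched as follows (a minimal illustration; the function name and the sample contour are ours, not taken from the paper):

```python
import numpy as np

def normalize_global_max(contour):
    """Scale a feature contour by its global maximum so values lie in [0, 1]."""
    peak = np.max(np.abs(contour))
    if peak == 0:
        return contour  # an all-zero contour has nothing to scale
    return contour / peak

# A pitch contour in Hz becomes dimensionless and comparable across speakers;
# the global maximum maps to 1.0.
pitch = np.array([110.0, 220.0, 165.0, 0.0])
print(normalize_global_max(pitch))
```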
The unique aspect of our work is that the speech signal is not
considered as a whole. The speech signal is divided into
syllable segments, assuming a CV (consonant-vowel) structure, as such
syllables are the most dominant. These segments are then analyzed
individually, and the final decision is taken based on the class of the
majority of the constituent syllables. This method of using the
class of the majority of the syllables to determine the tonality of
the language has improved the performance of our classifier:
the accuracy has improved by a significant 4-5% over what was
achieved by analyzing the utterance as a whole, while the time
required has remained roughly the same. The accuracy of
our syllable detection step is therefore extremely important for the
precise implementation of this step.
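The majority-vote decision over syllable segments can be sketched as follows (the label strings are placeholders):

```python
from collections import Counter

def classify_utterance(syllable_labels):
    """Assign the utterance the class held by the majority of its syllables."""
    return Counter(syllable_labels).most_common(1)[0][0]

# Three of the four syllable segments were judged tonal,
# so the utterance as a whole is labeled tonal.
print(classify_utterance(["tonal", "tonal", "non-tonal", "tonal"]))
```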
In Fig. 2, all the steps involved in our work are shown as a flowchart.
Value of k = 5
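The k-NN stage with k = 5 could, for instance, be realized with scikit-learn; the feature vectors below are random placeholders standing in for the modeled pitch and energy features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for syllable-level feature vectors:
# label 0 = non-tonal (cluster near 0), label 1 = tonal (cluster near 1).
X_train = np.vstack([rng.normal(0.0, 0.3, (30, 4)), rng.normal(1.0, 0.3, (30, 4))])
y_train = np.array([0] * 30 + [1] * 30)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5, as stated above
knn.fit(X_train, y_train)

# A vector near the tonal cluster is classified by its 5 nearest neighbours.
print(knn.predict([[0.9, 1.1, 1.0, 0.8]])[0])
```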
B. Artificial Neural Network
A neural network [12] can be defined as a computer
architecture, modeled after the human brain, in which
processors are connected in a manner suggestive of connections
between neurons, and which learns by trial and error. It is commonly
used to approximate unknown mathematical functions. The neurons are
arranged in layers, and each layer has a number of neurons, called
nodes, which are all inter-connected. Every neural network has
an activation function, which determines the output of each
neuron. The input layer is presented with the training data,
which is passed to the hidden layers of neurons where the actual
processing is done. Once the processing is done, the hidden
layers provide their output to the output layer. Many such
forward-propagation runs can be performed to minimize the error.
Advantages include flexible learning, fault tolerance, nonlinearity,
etc. A back-propagation neural network is used, which
works on an error-correcting algorithm. After every forward-propagation
run during the training phase, the output is compared with the target
output. Depending on the error margin, further forward runs are made
after adjusting the weights of the interconnections. The parameters,
as implemented in MATLAB, were:
Output layer neurons = 1
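The network was implemented in MATLAB; a rough Python stand-in for the same back-propagation training loop is sketched below. Everything except the single output neuron (the hidden-layer size, the learning rate, and the synthetic data) is our assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Synthetic stand-ins for syllable-level prosodic features:
# class 0 clustered near the origin, class 1 near (1, 1, 1).
X = np.vstack([rng.normal(0.0, 0.2, (40, 3)), rng.normal(1.0, 0.2, (40, 3))])
y = np.array([0.0] * 40 + [1.0] * 40).reshape(-1, 1)

W1 = rng.normal(0, 0.5, (3, 8)); b1 = np.zeros(8)   # one hidden layer (size assumed)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)   # output layer neurons = 1
lr = 0.5

for _ in range(2000):                    # repeated forward runs minimize the error
    h = sigmoid(X @ W1 + b1)             # forward pass through the hidden layer
    out = sigmoid(h @ W2 + b2)           # forward pass through the output layer
    err = out - y                        # compare output with the target output
    # back-propagate the error and adjust the interconnection weights
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out) / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * (X.T @ d_h) / len(X);   b1 -= lr * d_h.mean(axis=0)

preds = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5
print(f"training accuracy: {np.mean(preds == (y > 0.5)):.2f}")
```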
Table 1 (k-NN classifier)
Class      Total   Correct Detection   Correct Detection %
Class I    19      14                  73.684
Class II   19      17                  89.473

Table 2 (ANN classifier)
Class      Total   Correct Detection   Correct Detection %
Class I    19      15                  78.947
Class II   19      17                  89.473

Table 3 (SVM classifier)
Class      Total   Correct Detection   Correct Detection %
Class I    19      13                  68.421
Class II   19      12                  63.157
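The overall accuracies quoted in the observations below follow from pooling the per-class counts (19 samples per class); each results table can be matched to its classifier by the pooled accuracy it yields:

```python
def overall_accuracy(correct_class1, correct_class2, per_class=19):
    """Pooled accuracy over both classes, as a percentage."""
    return 100.0 * (correct_class1 + correct_class2) / (2 * per_class)

print(round(overall_accuracy(14, 17), 2))  # 81.58 -> k-NN (the text truncates this to 81.57)
print(round(overall_accuracy(15, 17), 2))  # 84.21 -> ANN
print(round(overall_accuracy(13, 12), 2))  # 65.79 -> SVM
```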
VI. OBSERVATIONS
The study of these three different classifiers gave the
following observations. The Artificial Neural Network has the highest
accuracy, with 84.21%. The k-NN classifier follows with 81.57%. The SVM
classifier's performance was the least accurate, with
65.79%. Less variance among the training data may be a
VII. CONCLUSIONS
In this paper, we have studied the performance of different
classifiers, i.e. Artificial Neural Network (ANN), k-Nearest
Neighbor (k-NN) and Support Vector Machine (SVM), in
identifying whether an unknown speech sample belongs to a
tonal or a non-tonal class. The ANN classifier can be said to be the
best of the three. Using a fairly uncomplicated methodology,
good results have been obtained for the ANN and k-NN classifiers.
By pre-classifying the languages into two broad classes, we can
reduce the complexity of automatic language identification: once the
languages have been pre-classified into two categories, the final
step of language identification becomes less time consuming. In our
pre-classification stage we use only prosodic information, whereas
for the identification of a language, acoustic and phonotactic
features are used. This reduces the time spent in the
pre-classification stage.
An automatic Language Identification System can be built using
this pre-classification work as a platform.
REFERENCES
Journal References
[1]. "Implementation of Advanced Communication Aid for People with Severe
Speech Impairment", IOSR Journal of Electronics and Communication
Engineering (IOSR-JECE), e-ISSN: 2278-2834, p-ISSN: 2278-8735,
Volume 9, Issue 2, Ver. III (Mar-Apr. 2014), pp. 61-66.
[2]. Orsucci et al., "Prosody and synchronization in cognitive neuroscience",
EPJ Nonlinear Biomedical Physics, 2013, 1:6.
[3]. Y. K. Muthusamy, R. A. Cole and B. T. Oshika, "The OGI Multi-language
Telephone Speech Corpus", Proceedings of the International Conference
on Spoken Language Processing, Banff, Alberta, Canada, October 1992.
[4]. L. Wang, E. Ambikairajah, E. H. C. Choi, "A novel method for automatic
tonal and non-tonal language classification", in: IEEE International
Conference on Multimedia and Expo, pp. 352-355, 2007.
[5]. Suryakanth V. Gangashetty, C. Chandra Shekhar, B. Yegnanarayana,
"Extraction of fixed dimension patterns from varying duration segments
of consonant-vowel utterances".
[6]. S. R. Mahadeva Prasanna, B. Yegnanarayana, "Detection of Vowel Onset
Point Events using Excitation Information".
[7]. Theodoros Theodorou, Iosif Mporas and Nikos Fakotakis, "Automatic
Sound Classification of Radio Broadcast News", International Journal of
Signal Processing, Image Processing and Pattern Recognition, Vol. 5, No.
1, March 2012.
[8]. A. Y. Ng, M. I. Jordan, "On Discriminative vs. Generative classifiers: A
comparison of logistic regression and naive Bayes".
[9]. Leena Mary, B. Yegnanarayana, "Extraction and representation of
prosodic features for language and speaker recognition", Speech
Communication, vol. 50, pp. 782-796, 2008.
[10]. W. M. Campbell, E. Singer, P. A. Torres-Carrasquillo, D. A. Reynolds,
"Language Recognition with Support Vector Machines", in: Proceedings
Odyssey, pp. 41-44, 2004.
Book References
[1]. David Talkin, "A Robust Algorithm for Pitch Tracking", chapter 14, 1995.
[2]. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India
Private Limited, New Delhi, 2005.