Proceedings of the 10th INDIACom; INDIACom-2016; IEEE Conference ID: 37465

2016 3rd International Conference on Computing for Sustainable Global Development, 16th-18th March, 2016
Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)

A Comparative Study of Discriminative Approaches for Classifying Languages into Tonal and Non-Tonal Categories at Syllabic Level

Biplav Choudhury
NIT Silchar, India
Email id: biplav93@gmail.com

Chuya China (Bhanja)
NIT Silchar, India
Email id: chuya.bhanja@gmail.com

Tameem S. Choudhury
NIT Silchar, India
Email id: salmantameem360@gmail.com

R. H. Laskar
NIT Silchar, India
Email id: rhlaskar@ece.nits.ac.in

Aniket Pramanik
NIT Silchar, India
Email id: aniketpramanik@yahoo.co.in

Abstract: Based on how they use tone to convey meaning, spoken languages can be grouped into two categories: tonal and non-tonal. In tonal languages, pitch is used to distinguish meaning: the sense of a word changes with the pitch or tone in which it is spoken. Both pitch and pitch range are found to be lower for non-tonal languages. A speech signal carries both speaker and language attributes. In tonal and non-tonal language classification, discriminating cues are extracted from the speech signal and fed to different classifiers. This work differs from earlier work in that the speech signal is divided into its constituent syllables before feature extraction, instead of treating the utterance as a whole. This paper analyzes the performance of different classifiers at the syllabic level for identifying tonal and non-tonal languages. In this classification task, the artificial neural network (ANN) outperforms the other classifiers, with an accuracy of 84.21%.
Keywords: Syllable; Vowel Onset Point (VOP); Tonal and non-tonal languages; Artificial Neural Network (ANN); Support Vector Machine (SVM); k-Nearest Neighbor (k-NN).

I. INTRODUCTION
Speech can be described as a spoken form of communication, produced by validly combining the sounds particular to a specific language. These sounds are produced when vowels and consonants blend. A very large number of languages exist, and they can be differentiated by their vocabularies, by the patterns in which words are arranged, and by their collections of phrases. Recognition of spoken words begins at the most primitive level, the acoustics of the spoken word. Once the acoustic signal has been examined, the speech sounds are analyzed further to isolate auditory cues and phonetic information, which is then used by higher-level language processes [1]. Several types of characteristics can be identified from the speech signal, and the most prominent among them are the prosodic features. Prosody represents the rhythm, stress and intonation of speech. It conveys various characteristics of the speaker and of the utterance, for example the emotional state of the speaker and the type of utterance (statement, question or command). It also indicates the presence of irony or sarcasm, emphasis, contrast and stress, and other elements of speech that may not be conveyed by the rules of the language or by the choice of words [2].
Language identification (LID) is the task of automatically identifying a language by analyzing spoken audio in that language. Over the years, different LID techniques have been developed, focusing on different types of features as well as different modeling techniques. Language modeling can follow either a generative or a discriminative approach [8]. Generative classifiers model the joint probability p(x, y) of the features x and the class label y, whereas discriminative classifiers model the conditional probability p(y | x) directly.
Although the performance of every classifier depends on the distribution of the data, in some respects, such as the handling of missing data, discriminative classifiers are preferred over generative classifiers. For classifying languages into tonal and non-tonal categories, either generative modeling techniques (Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Naive Bayes, etc.) or discriminative ones (k-NN, SVM, ANN, etc.) can be used. In this paper, three discriminative classifiers, namely k-NN, SVM and ANN, are compared.
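As a brief illustration of the distinction drawn above, the two decision rules can be written side by side. The symbols are assumptions introduced here for illustration: x denotes the feature vector extracted from a syllable and y the class label (tonal or non-tonal).

```latex
% Generative route: model the joint p(x, y) = p(x | y) p(y), then pick the most probable class
\[
\hat{y}_{\mathrm{gen}} = \arg\max_{y} \, p(x, y) = \arg\max_{y} \, p(x \mid y)\, p(y)
\]
% Discriminative route: model the posterior p(y | x) directly
\[
\hat{y}_{\mathrm{disc}} = \arg\max_{y} \, p(y \mid x)
\]
```

Both rules select the same kind of output, a class label; they differ in which distribution is estimated from the training data.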
The tonal characteristics of a language manifest themselves mainly as pitch, so this classification is based on pitch information. It has been observed that energy information is also essential for discriminating tonal and non-tonal languages. Therefore, existing features such as average pitch, pitch changing speed and pitch changing level [4] are used along with new features such as average energy, energy changing speed and energy changing level. The speech signal is first broken into its constituent syllables; the above features are then extracted from each syllable and given to the classifiers for discriminating between tonal and non-tonal languages.
For our work, the OGI language database has been used, which comprises recorded telephone speech in 11 different languages. The recordings are of actual telephone conversations of native speakers of languages including English, Farsi, French, German, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. Of these, English, Farsi, French, German, Japanese, Korean, Spanish and Tamil are non-tonal, while the remaining two, Mandarin Chinese and Vietnamese, are tonal [3].
The paper is organized as follows: related work in this field is reviewed in Section II; the methodology followed throughout the work is explained in Section III; the discriminative classifiers used are described in Section IV; results are presented in Section V; Section VI lists the observations; and Section VII concludes the paper.

Copy Right INDIACom-2016; ISSN 0973-7529; ISBN 978-93-80544-20-5

II. RELATED WORK
Categorization of languages into tonal and non-tonal classes has previously been attempted at the utterance level. L. Wang et al. [4] modeled pitch by considering the rate of pitch change and the manner in which it changes. Normalization was done using Voiced Duration and Voiced Counter. Using a neural network, the accuracy obtained was 80.6%. L. Mary et al. [9] carried out language and speaker identification at the syllabic level using prosodic features. They segmented continuous speech into syllable-like units by using Vowel Onset Points (VOPs) as anchor points for identifying syllables.

III. METHODOLOGY
The complete process has been split into 5 steps.

Elimination of Unvoiced Part - Speech signals contain many silent portions, which have to be removed in order to identify clean speech segments. Usually, the signal power of the silent parts is lower than that of the voiced parts. To identify the clean speech segments, Signal Energy and Spectral Centroid are used [5]. The steps followed are:
- Signal Energy and Spectral Centroid are calculated as two feature sequences.
- Thresholds are estimated for both sequences.
- Each sequence is compared with its respective threshold.
- Using this thresholding criterion, clean speech segments are identified.

Spotting of Vowel Onset Points (VOPs) - The Vowel Onset Point (VOP) is the instant at which the utterance of a vowel begins in the speech signal. Correct determination of the VOP is important because it is needed to identify consonant-vowel (CV) units in different languages. Of all possible combinations of vowels and consonants, the CV pattern is the most important, as it is the most common pattern across languages, particularly in Indic languages. Vowel onsets can be detected by various methods, based for example on the rising trend of resonance peaks in the amplitude spectrum, pitch and energy characteristics, zero-crossing rate, wavelet transform, neural networks, dynamic time warping and excitation source information [6]. The transition from a consonant to a vowel can be identified by the greater strength of excitation; this is used as a cue to detect vowel onsets in the speech signal. Fig. 1 shows such an example. Generally, the instantaneous energy of the signal is higher for vowels than for voiced consonants. Thus, places with a significant change in signal energy confirm the detection of VOPs and can be used as a cue for locating them. Using a local peak-picking algorithm, the peaks that mark the VOPs are located in the signal energy plot [7].

Fig. 1. Location of a VOP in a speech signal.
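The energy-based cue for VOPs described above can be illustrated with a small sketch. It is not the authors' MATLAB implementation: the function names, the non-overlapping 20 ms frames, the `min_rise` threshold and the synthetic test signal are all assumptions made for illustration. The idea shown is only that a VOP candidate is a local peak in the frame-to-frame rise of short-time energy.

```python
import math

def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame of the signal."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(s * s for s in frame))
    return energies

def energy_rise_peaks(energies, min_rise):
    """Frame indices where the energy increase is a local maximum above min_rise."""
    # rise[i] is the jump in energy from frame i to frame i + 1
    rise = [energies[i + 1] - energies[i] for i in range(len(energies) - 1)]
    peaks = []
    for i in range(1, len(rise) - 1):
        if rise[i] > min_rise and rise[i] >= rise[i - 1] and rise[i] > rise[i + 1]:
            peaks.append(i + 1)  # the louder frame is the VOP candidate
    return peaks

def toy_signal():
    """Hypothetical 8 kHz signal: weak 'consonant' segments followed by strong 'vowel' segments."""
    amplitudes = [0.05] * 480 + [1.0] * 800 + [0.05] * 480 + [1.0] * 800
    return [a * math.sin(2 * math.pi * 200 * n / 8000) for n, a in enumerate(amplitudes)]

FRAME_LEN = 160  # 20 ms at 8 kHz (assumed)
HOP = 160        # non-overlapping frames, chosen only to keep the example simple

energies = short_time_energy(toy_signal(), FRAME_LEN, HOP)
vop_frames = energy_rise_peaks(energies, min_rise=10.0)
print(vop_frames)                     # [3, 11]
print([f * HOP for f in vop_frames])  # onset positions in samples: [480, 1760]
```

On real speech the energy contour is far noisier than in this toy signal, so smoothing and a data-dependent threshold would be needed before peak picking.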

Syllabic segmentation - Once the VOPs have been identified, the speech is divided into its constituent syllables. This can be referred to as the syllabic approach. Syllables are the basic units of sound, generally containing a single vowel surrounded by consonants. The region surrounding a consonant-vowel (CV) utterance gives the required parameters. For this work, the signal around each VOP, from 25 milliseconds before to 40 milliseconds after the VOP, is used to parameterize the CV unit. This 65-millisecond region is covered by 10 overlapping windows of 20 milliseconds each, with a frame shift of 5 milliseconds [9].
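The window layout around a VOP can be checked with a few lines of code. This is a sketch with a hypothetical helper name, assuming the frame shift is 5 milliseconds, which is the reading under which ten 20 ms windows exactly tile the 65 ms region (20 + 9 x 5 = 65).

```python
def cv_windows(vop_ms, before_ms=25, after_ms=40, win_ms=20, shift_ms=5):
    """Return (start, end) times in ms of the analysis windows around one VOP."""
    region_start = vop_ms - before_ms
    region_end = vop_ms + after_ms
    windows = []
    start = region_start
    # Slide the window in steps of shift_ms while it still fits in the region
    while start + win_ms <= region_end:
        windows.append((start, start + win_ms))
        start += shift_ms
    return windows

windows = cv_windows(vop_ms=100)  # hypothetical VOP at 100 ms
print(len(windows))               # 10
print(windows[0], windows[-1])    # (75, 95) (120, 140)
```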
Feature extraction - This study uses pitch and energy as features. Pitch tracking involves tracking the fundamental frequency. The RAPT (Robust Algorithm for Pitch Tracking) [9] algorithm is chosen to calculate the fundamental frequency; it is available as a function in MATLAB. Short-term energy is considered because of the time-varying nature of the speech signal. The parts of the audio signal containing voiced sounds have much higher energy than the silent regions. This distinction based on short-term energy is very useful, and it is used in particular for identifying the silent portions of the speech signal and removing them before further processing. Overall, the features used are the fundamental frequency and the short-term energy. For each syllable, three fundamental-frequency parameters and one energy parameter are obtained, all of which are normalized before being passed on for processing.
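A minimal sketch of how per-syllable parameters of this kind might be computed is given below. The paper does not give formulas, so the definitions here are assumptions chosen for illustration only: "average" is the mean of a contour, "changing speed" is the mean absolute frame-to-frame change, "changing level" is the range of the contour, and normalization divides by the largest magnitude. The pitch values are made up; a pitch tracker such as RAPT would supply them in practice.

```python
def short_term_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def contour_features(contour):
    """(average, changing speed, changing level) of a pitch or energy contour.

    These three definitions are illustrative assumptions, not taken from the paper.
    """
    average = sum(contour) / len(contour)
    steps = [abs(b - a) for a, b in zip(contour, contour[1:])]
    speed = sum(steps) / len(steps) if steps else 0.0
    level = max(contour) - min(contour)
    return average, speed, level

def normalize_by_max(values):
    """Scale values by the largest magnitude so they fall in [-1, 1]."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values] if peak else list(values)

# Hypothetical pitch contour (Hz) over the windows of one syllable
pitch_contour = [100.0, 110.0, 130.0, 120.0]
average, speed, level = contour_features(pitch_contour)
print(average, level)  # 115.0 30.0
print(normalize_by_max([2.0, 4.0, -8.0]))  # [0.25, 0.5, -1.0]
```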

Data given to the classifiers - Once the pitch and the short-term energy of the signal have been extracted and parameterized, they are given to the classifiers as input. The data are always normalized with respect to the global maximum. The distinctive aspect of our work is that the speech signal is not considered as a whole. Instead, it is divided into syllable segments, assuming a CV structure since CV units are the most dominant. These segments are analyzed individually, and the final decision for an utterance is taken from the class assigned to the majority of its constituent syllables. Using the majority class of the syllables to determine the tonality of the language has improved the performance of our classifier: the accuracy is 4-5% higher than that achieved by analyzing the utterance as a whole, while the time required has remained roughly the same. The accuracy of the syllable detection step is extremely important for the precise implementation of this step.
In Fig. 2, all the steps involved in our work are shown as a flowchart.

Fig. 2. Process Flow chart.
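The final decision step, in which an utterance takes the class of the majority of its syllables, can be sketched as follows. The function name and the tie-breaking rule are assumptions; the paper does not say how ties are resolved.

```python
def classify_utterance(syllable_labels, tie_label="non-tonal"):
    """Label an utterance by majority vote over its per-syllable labels.

    tie_label is an assumed fallback for an exact tie.
    """
    if not syllable_labels:
        raise ValueError("at least one syllable label is required")
    tonal = sum(1 for label in syllable_labels if label == "tonal")
    non_tonal = len(syllable_labels) - tonal
    if tonal > non_tonal:
        return "tonal"
    if non_tonal > tonal:
        return "non-tonal"
    return tie_label

# Hypothetical per-syllable decisions for one test file
labels = ["tonal", "non-tonal", "tonal", "tonal", "non-tonal"]
print(classify_utterance(labels))  # tonal
```

Because only the majority matters, a moderate number of misclassified syllables does not change the utterance-level decision.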

IV. CLASSIFIERS UTILIZED
Once the required features are extracted, they are given to the classifiers. Three different types of classifiers are used: k-Nearest Neighbor (k-NN), Artificial Neural Network (ANN) and Support Vector Machine (SVM). A preliminary description of each classifier follows.

A. K Nearest Neighbor Algorithm
The k-NN algorithm is one of the most commonly used classification methods. As the name signifies, it finds the k nearest neighbors of a test sample in the n-dimensional feature space, where n is the number of features. The classifier assigns the test sample to the class to which the majority of its k nearest neighbors belong. k = 1 is the special case in which the test sample takes the class of its single nearest neighbor in the training data. The algorithm is widely used because of its simplicity. Results can be made more precise by weighting the contributions of the neighbors, giving nearer neighbors more weight than farther ones. A requirement of this classifier is that the class of every training sample must be known before test data can be classified; storing these labeled samples can be regarded as the training of the classifier. The parameters used, as available in MATLAB, are given below:
Distance considered = Hamming distance
Value of k = 5
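A compact sketch of the k-NN decision rule with k = 5 is shown below. Note one deliberate difference from the parameters listed above: the sketch uses Euclidean distance, an assumption made here because the toy features are continuous, whereas the paper reports Hamming distance. The training points are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    # Euclidean distance is an assumption of this sketch (the paper lists Hamming)
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set
train_x = [(0.9, 1.0), (1.0, 0.9), (1.1, 1.0), (0.0, 0.1), (0.1, 0.0), (0.0, 0.0)]
train_y = ["tonal", "tonal", "tonal", "non-tonal", "non-tonal", "non-tonal"]

print(knn_predict(train_x, train_y, (0.95, 0.95)))  # tonal
print(knn_predict(train_x, train_y, (0.05, 0.05)))  # non-tonal
```

With two classes, an odd k such as 5 guarantees that the vote cannot end in a tie.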
B. Artificial Neural Network
A neural network [12] can be defined as a computing architecture, modeled after the human brain, in which processing units are connected in a manner suggestive of the connections between neurons and which learns by trial and error. It is commonly used to approximate unknown mathematical functions. The processing units, called neurons or nodes, are arranged in layers and are interconnected. Every neural network has an activation function, which determines the output of each neuron. The input layer is presented with the training data, which is passed to the hidden layer of neurons where the actual processing is done. The hidden layer then provides its output to the output layer. Many such forward propagation runs can be made to minimize the error. Advantages include flexible learning, fault tolerance and non-linearity. A back-propagation neural network, which works on an error-correcting algorithm, is used here. After every forward propagation run during the training phase, the network output is compared with the target output. Depending on the error, the weights of the interconnections are adjusted and further forward runs are made. The parameters, as implemented in MATLAB, were:
Input layer neurons = 4
Hidden layer neurons = 14
Output layer neurons = 1
Hidden (processing) layers = 1
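The 4-14-1 layout listed above can be made concrete with a forward pass. This is only a sketch: the sigmoid activation, the random weight initialization and the input values are assumptions (the paper does not state them), and the back-propagation weight updates used for training are omitted.

```python
import math
import random

def sigmoid(z):
    """Logistic activation (assumed; the paper does not name the activation)."""
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_inputs, n_neurons, rng):
    """One weight row per neuron; the last entry of each row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_neurons)]

def layer_forward(weights, inputs):
    """Weighted sum plus bias, passed through the activation, for each neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
            for row in weights]

def forward(hidden_weights, output_weights, features):
    """Input -> single hidden layer -> single output neuron."""
    hidden = layer_forward(hidden_weights, features)
    return layer_forward(output_weights, hidden)[0]

rng = random.Random(0)
hidden_weights = init_layer(4, 14, rng)   # 4 inputs feeding 14 hidden neurons
output_weights = init_layer(14, 1, rng)   # 14 hidden outputs feeding 1 output neuron

# Hypothetical normalized features of one syllable: three pitch values, one energy value
score = forward(hidden_weights, output_weights, [0.2, 0.5, 0.1, 0.9])
print(0.0 < score < 1.0)  # True
```

A single sigmoid output can be thresholded (for example at 0.5) to choose between the two classes.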


C. Support Vector Machines
The SVM is a kernel-based learning method; in its basic form it is used when the data has only two classes. Classification is done on the basis of a decision boundary that separates the two classes [10]. The SVM classifier chooses the hyperplane that best separates the classes, where the best hyperplane is the one with the largest margin between the classes. In the case of a multilayered SVM classifier, a large number of support vectors is present in the output layer. A Gaussian kernel function is used here. This classifier was also implemented using MATLAB.
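For reference, a Gaussian kernel and the kernelized decision function it plugs into can be sketched as below. The kernel width `sigma`, the support vectors, their coefficients and the bias are hypothetical values chosen for illustration; in a trained SVM they come out of the margin-maximizing optimization, which is not shown.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def svm_decision(query, support_vectors, coefficients, bias, sigma=1.0):
    """Signed score; coefficients[i] stands for alpha_i * y_i of support vector i."""
    score = sum(c * gaussian_kernel(sv, query, sigma)
                for sv, c in zip(support_vectors, coefficients))
    return score + bias

# Hypothetical model: one support vector per class
support_vectors = [(1.0, 1.0), (0.0, 0.0)]
coefficients = [1.0, -1.0]

print(svm_decision((0.9, 0.9), support_vectors, coefficients, bias=0.0) > 0)  # True
print(svm_decision((0.1, 0.1), support_vectors, coefficients, bias=0.0) > 0)  # False
```

The sign of the score gives the predicted class; the kernel equals 1 for identical points and decays toward 0 as they move apart.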
V. RESULTS
This section lists the results obtained. The training data consist of 4800 seconds of speech, of which 2400 seconds are tonal and the remaining 2400 seconds are non-tonal. For testing, 38 samples are taken, comprising 19 tonal and 19 non-tonal language speech files. For the testing, a single file of 600 seconds was made, which included 300 seconds of tonal and 300 seconds of non-tonal language speech. The testing files and the training files were kept separate. Once decisions have been taken at the syllabic level, the final decision for each sample is taken from the class of the majority of its syllables. The accuracy of each classifier is measured as the number of correct decisions out of the 38 samples. The tonal class is referred to as Class I and the non-tonal class as Class II.

1. K Nearest Neighbor Algorithm
Table I illustrates the results obtained using the k-NN algorithm. The accuracy obtained is 81.58%.

TABLE I
            Number of test samples    Correct Detection    Correct Detection %
Class I     19                        14                   73.684
Class II    19                        17                   89.474

2. Artificial Neural Network
The Artificial Neural Network yielded the results shown in Table II. The accuracy obtained is 84.21%.

TABLE II
            Number of test samples    Correct Detection    Correct Detection %
Class I     19                        15                   78.947
Class II    19                        17                   89.474

3. Support Vector Machines
Table III illustrates the results obtained using Support Vector Machines. The accuracy obtained is 65.79%.

TABLE III
            Number of test samples    Correct Detection    Correct Detection %
Class I     19                        13                   68.421
Class II    19                        12                   63.158

VI. OBSERVATIONS
The study of these three different classifiers gave the following observations:
- The Artificial Neural Network has the highest accuracy, 84.21%. The k-NN classifier follows with 81.58%. The SVM classifier was the least accurate, with 65.79%. Low variance among the training data may be a reason for the poor performance of the SVM classifier, since it generally degrades SVM performance. A comparison of the three discriminative approaches is shown in Fig. 3.
- Performance can be increased by increasing the duration of the speech files.
- The k-NN classifier took the least time among all the classifiers.
Figure 3 gives a quantitative comparison of the ability of the classifiers to correctly identify a language as tonal or non-tonal.

Fig. 3. Comparison between k-NN, ANN, SVM.
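The overall accuracies quoted for the three classifiers follow directly from the per-class counts in Tables I to III. The short check below reproduces that arithmetic; the dictionary layout and function names are just one convenient way to hold the numbers, while the counts themselves are those reported in the tables.

```python
SAMPLES_PER_CLASS = 19  # 19 tonal and 19 non-tonal test samples

# Correct detections per class, as reported in Tables I, II and III
correct = {
    "k-NN": {"Class I": 14, "Class II": 17},
    "ANN": {"Class I": 15, "Class II": 17},
    "SVM": {"Class I": 13, "Class II": 12},
}

def class_accuracy(n_correct):
    """Percentage of correctly detected samples within one class."""
    return 100.0 * n_correct / SAMPLES_PER_CLASS

def overall_accuracy(per_class):
    """Percentage of correct decisions over all 38 test samples."""
    total = SAMPLES_PER_CLASS * len(per_class)
    return 100.0 * sum(per_class.values()) / total

for name, per_class in correct.items():
    print(name, round(overall_accuracy(per_class), 2))
# k-NN 81.58
# ANN 84.21
# SVM 65.79
```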

VII. CONCLUSIONS
In this paper, we have studied the performance of different classifiers, namely Artificial Neural Network (ANN), k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM), in identifying whether an unknown speech sample belongs to a tonal or a non-tonal language. Of the three, the ANN classifier performed best. Using a simple methodology, good results have been obtained for the ANN and k-NN classifiers. Pre-classifying languages into two broad classes reduces the complexity of automatic language identification: once the languages have been pre-classified into two categories, the final identification step becomes less time consuming. The pre-classification stage uses only prosodic information, whereas acoustic and phonotactic features are used for identifying the language; this reduces the time taken by the pre-classification stage.
An automatic Language Identification System can be built using this pre-classification work as a platform.

REFERENCES
Journal References
[1]. "Implementation of Advanced Communication Aid for People with Severe Speech Impairment," IOSR Journal of Electronics and Communication Engineering (IOSR-JECE), e-ISSN: 2278-2834, p-ISSN: 2278-8735, Volume 9, Issue 2, Ver. III (Mar - Apr. 2014), pp. 61-66.
[2]. Orsucci et al., "Prosody and synchronization in cognitive neuroscience," EPJ Nonlinear Biomedical Physics, 2013, 1:6.
[3]. Y. K. Muthusamy, R. A. Cole and B. T. Oshika, "The OGI Multi-language Telephone Speech Corpus," Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, October 1992.
[4]. L. Wang, E. Ambikairajah, E. H. C. Choi, "A novel method for automatic tonal and non-tonal language classification," in: IEEE International Conference on Multimedia and Expo, pp. 352-355, 2007.
[5]. Suryakanth V. Gangashetty, C. Chandra Shekhar, B. Yegnanarayana, "Extraction of fixed dimension patterns from varying duration segments of consonant-vowel utterances."
[6]. S. R. Mahadeva Prasanna, B. Yegnanarayana, "Detection of Vowel Onset Point Events using Excitation Information."
[7]. Theodoros Theodorou, Iosif Mporas and Nikos Fakotakis, "Automatic Sound Classification of Radio Broadcast News," International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol. 5, No. 1, March 2012.
[8]. A. Y. Ng, M. I. Jordan, "On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes."
[9]. Leena Mary, B. Yegnanarayana, "Extraction and representation of prosodic features for language and speaker recognition," Speech Communication, vol. 50, pp. 782-796, 2008.
[10]. W. M. Campbell, E. Singer, P. A. Torres-Carrasquillo, D. A. Reynolds, "Language Recognition with Support Vector Machines," in: Proceedings Odyssey, pp. 41-44, 2004.
Book References
[1]. David Talkin, "A Robust Algorithm for Pitch Tracking," chapter 14, 1995.
[2]. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India Private Limited, New Delhi, 2005.
