
Automatic Recognition of Correctly Pronounced English Words using Machine Learning


Ronalyn C. Pedronan, Rizaldy A. Manglal-lan Jr.,
Kristine Joy B. Galasinao, Reychell P. Salvador, and James Patrick A. Acang*
Department of Computer Science, Mariano Marcos State University, City of Batac 2906, Ilocos Norte, Philippines

ARTICLE INFORMATION

Article History:
Received: 21 March 2017
Received in revised form: 9 November 2017
Accepted: 11 December 2017

Keywords: digital signal processing; Hidden Markov Model; Mel Frequency Cepstral Coefficient; pronunciation recognition; speech recognition

*Corresponding author: James Patrick Acang (jamespatrickacang@gmail.com)

Abstract

Speech recognition is a form of human-machine communication where interpreting speech is done by the computer. This research deals with the problem of recognizing the correct pronunciation of English words. In view of using this technology to help in education, the researchers gathered voice samples from middle graders and labelled them based on ground-truth English pronunciation data from Google. The words were based on the current curriculum of the participants and were clustered according to syllable count to see how the model performs as the complexity of the words to be recognized increases. Since there are numerous voice or speech features to consider, the researchers selected three of the known feature extraction techniques for evaluation. Results show that the Mel Frequency Cepstral Coefficient with Linear Predictive Coding model has better performance, with high and stable recognition rates, compared to the other models. It was also observed that the model needs only four syllables to reach its optimum 100% recognition rate when recognizing English words. To make the model more robust to noise, an automatic signal segmentation approach is needed to detect the significant components of the signal for analysis.

Introduction

Voice analysis is one of the technological advancements that has been developed in recent years. It has been investigated as a natural form of Human-Machine Communication and is focused on the understanding of speech generation, coding, transmission, and recognition (Anusuya & Katti, 2009). Voice analysis decodes analogue signals into digital signals to be used in computers. Speech recognition is in line with this idea: it is the process of machine interpretation, or understanding, of the voice commands in the spoken words it receives.

There are two subsystems of speech recognition: Automatic Speech Recognition (ASR) and Speech Understanding (SU) (Varshney, 2014). The goal of ASR is to transcribe natural speech, while the goal of SU is to understand the meaning of the transcription.

Voice recognition systems can be classified into two categories: speaker-dependent and speaker-independent (Aurora & Singh, 2012). Speaker-dependent systems work by comparing a whole-word input with a user-supplied pattern, while speaker-independent systems require no training operations. Voice recognition systems perform two fundamental operations: signal modelling and pattern matching. Signal modelling is the process of converting the speech signal into a set of parameters. Pattern matching is the task of finding the parameter set from memory which most closely matches the parameter set obtained from the input speech signal (Ananthi & Dhanalakshmi, 2013).

To ensure high recognition rates, proper selection of features should be considered. Feature enhancement, distribution normalization, and noise-robust feature extraction are often used. Feature enhancement tries to remove the noise from the signal, such as in spectral subtraction (SS) (Muzaffar et al., 2005). Distribution normalization reduces the distribution mismatches between training and test speech, like those presented in cepstral mean subtraction (CMS) (Boll, n.d.) and in Cepstral Mean and Variance Normalization (CMVN) (Furui, 1981). Noise-robust features include improved Mel Frequency Cepstral Coefficients (MFCCs), which are similar to the root-cepstrum (Viikki & Laurila, 1998). One feature extraction method with a melodic cepstral analysis is the MFCC (Sarikaya & Hansen, 2001). It represents the dominant features used in the speech and speaker recognition domains (Adams, 1990). Another feature extraction method is the Fast Fourier Transform (FFT), which is used to compute the spectrum of each windowed sequence. To fuse everything together, voice or speech recognition uses various machine learning models to get the best possible accuracy from the data. Some of these include the Hidden Markov Model (HMM) (Barbu, 2007) and the Artificial Neural Network (ANN) (Shi et al., 2006). Figure 1 presents the HMM, where the y_i are the observed variables and the s_i are the hidden states.

Figure 1. The Hidden Markov Model.

However, there is still a vague space on how pronunciation is represented in the analysis, as this is an inherent or hidden entity (feature) in speech. These speech features are captured and described by feature extraction methods, but because of the diverse number of feature extraction techniques in voice analysis (Krenker et al., n.d.), it is difficult to determine the suitable feature for the domain of recognizing correct pronunciation. Hence, this study aimed to determine the model to recognize correct pronunciation of English words. Specifically, this research aimed to accomplish the following activities: collect voice samples from middle graders to be used in the training and testing of the recognition models; gather ground-truth word pronunciation data to be used as the basis in labelling the collected middle-grader voice samples; build three recognition models from three known voice features; and evaluate the recognition models based on sensitivity, specificity, and accuracy metrics.

In the field of education, proper pronunciation of words, especially English words, is necessary as English is widely used in communication (Bagge & Donica, 2001). In this area, pronunciation recognition can be applied as a bridge in learning languages with the aid of an individual expert or the computer itself. In fact, the Philippines uses various dialects, hence an application powered by a pronunciation recognizer can be very useful in learning these languages. Since the platform is pronunciation driven, its application can extend from local to international language learning (Alsulaiman et al., 2011; Lyu et al., 2014; Nitta et al., n.d.). This recognizer can also be implemented in voice-operated user interfaces that can be used in instruction, providing an extraordinary learning environment to students. This simply shows that the recognizer has unending potential to be effective, from simple text-to-speech to complex applications.

Literature Review

The field of Digital Signal Processing (DSP) has drastically and significantly improved since its conception (Anusuya & Katti, 2009; Aurora & Singh, 2012; Muzaffar et al., 2005; Furui, 1981; Viikki & Laurila, 1998; Sarikaya & Hansen, 2001). A manifestation of this is the development of DSP-driven applications like Google Voice Search (Hispanicallyspeakingnews, 2011), VLingo (Maurice, 2013), and the Siri Assistant (Ludwig, n.d.). The vision of the researchers is to exploit these capabilities and to apply them in education for students.

In speech recognition, speech features are very important. In fact, Thakare (n.d.) emphasized that speech signals carry all the auditory information, and compared speech feature extraction methods that can effectively be used in various speech recognition domains. He highlighted in this work that the Mel Frequency Cepstral Coefficient feature reduces the frequency information of the speech signal into a small number of coefficients which are relatively fast to compute. He also stressed that the Fast Fourier Transform feature is good because of its linearity in the frequency domain, as it does not discard or distort information in any anticipatory manner. Moreover, he highlighted the Linear Predictive Coding feature, in that it reduces error rates found in difficult conditions.

The fast and effective performance of the Mel Frequency Cepstral Coefficient was exploited in the work of Daphal and Jagtap (2012). They reported that the said feature has a significant impact in their classifier. The same strategy was employed in the work of Sapijaszko and Michael (2012), where they experimented with this feature with frame algorithms. In the present research, Linear Predictive Coding was also considered to enhance the recognition rates.

Noise is also a factor that degrades the performance of a recognizer. In the work of Jarng (2011), noise is included as part of the subject. In that research he asserted that the Mel Frequency Cepstral Coefficient is resilient to noise, but that including new parameters can dramatically improve recognition.

All these works have focused on the recognition of speech itself. The researchers have extended this by considering a characteristic of the speech, which is pronunciation.

Methodology

In machine learning, data is required to develop models that capture the behaviour of the subject. Similarly, in pronunciation recognition, voice samples are needed to construct these models for investigation.

The Speech Recognition Process

Speech recognition is the process where speech signals are used to create recognizers for speech (Mastin, 2011). The speech signals are usually processed in digital representation as they undergo a series of processes. A typical speech recognition process includes data gathering, feature extraction and model construction, and analysis, as shown in Figure 2.

Figure 2. The speech recognition process (speech signal → pre-processing → feature extraction → training and testing datasets → model construction → model comparison and testing → speech recognition model).

Data gathering includes the process of collecting voice samples. The gathered data are processed and divided into a testing and a training set. The training set is used to build the recognition models, while the test set is used to analyse the models. This process involves the extraction of features that are relevant for classification, which is common to both phases.

In the training phase, the parameters of the classification model are estimated using a large number of class examples (training data). During the testing or recognition phase, the features of the test pattern (test speech data) are matched with the trained model of each and every class. The test pattern is declared to belong to the model that matches it best (Kale et al., n.d.). The analysis stage deals with suitable models for further analysis and evaluation. The recognition rates of the different feature extraction methods are shown in the Results and Discussion section.

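To make this matching step concrete, the following minimal sketch (ours, not the authors' code) assigns a test pattern to the class whose trained model scores it highest; it assumes one trained model per word class exposing a log-likelihood `score` method, as in, for example, the `hmmlearn` library:

```python
def classify(test_features, models):
    """Declare the test pattern to belong to the best-matching class model.

    `models` maps a class label to a trained model exposing a log-likelihood
    score(features) method (e.g., an hmmlearn HMM); `test_features` is a
    (frames x coefficients) array extracted from one utterance.
    """
    scores = {label: model.score(test_features) for label, model in models.items()}
    return max(scores, key=scores.get)
```
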
The speech signal serves as input for the speech recognition process. Pre-processing describes any type of processing performed on raw data to prepare it for another processing procedure. Here, the data are transformed into a format that can be more easily and effectively processed. When the data input to an algorithm is too large to be processed, it can be transformed into a reduced set of features; this process is called feature extraction. The extracted features are expected to contain the relevant information from the input data so that the desired task can be performed using this reduced representation instead of the complete initial data. In a dataset, a training set is used to build a model, while a test (validation) set is used to validate the models built; data points in the training set are excluded from the test set. To construct a model, accurate labelling and tagging of the dataset are needed. Analysis is the process where the test data, together with the model, are evaluated to get the recognition rate and to describe how the model behaves. Moreover, the output is the result produced by the model from the input data.

Data Acquisition

To enable and determine the correct pronunciation of each word, we used ground-truth word recordings from Google (Barrett, 2006). These were used to tag or label the gathered voice samples for the model construction. Manual tagging was utilized to ensure that the collected voice samples were labelled correctly. In view of using this technology in education, we selected 300 middle graders from Ilocos Norte, Philippines to gather voice samples. Two hundred voice samples per word were gathered for correctly pronounced words, half of them from males and the other half from females, to capture the properties of the voice of both genders. Fifty percent (50%) of these, with each gender equally represented, constituted the training set, and the other fifty percent (50%) the testing set. Each student was requested to speak each word five times while being recorded, to capture the pronunciation characteristics of the signal.

Since the researchers are concerned with pronunciation, we also considered mispronounced words. To take mispronounced words into consideration, the other 100 students were asked to provide mispronounced samples. The same gender arrangement as for the correctly pronounced words was followed. This provided 50 males and 50 females who contributed mispronounced voice samples for the testing set.

The researchers also selected and clustered the word samples by difficulty (syllables) based on the current curriculum of the participants. Fifteen (15) words per cluster were considered. The researchers used five clusters, for one-syllable to five-syllable words, to model and capture the complexity of the word being pronounced; that is, easy for the one-syllable words and difficult for the five-syllable words. The words are shown in Table 1.

Table 1
List of Words per Cluster

1 Syllable 2 Syllables 3 Syllables 4 Syllables 5 Syllables


ache compare abattoir acquaintance civilization
aide complete accurate advantageous economical
aisle compose aerospace anaesthesia enthusiasm
arm congress antimissile annexation inconceivable
art connect assemblage beneficial inexhaustible
ash conscious circuitous caricature inextricably
awed consent clandestine catastrophic investigation
badge corsage combatant choreography itinerary
bait cottage credulous clairvoyance pronunciation
balm council disputant diminutive unavoidable
beach country fractional dirigible unconquerable
beat couple icicle exigency university
beige courage negligee inclination vocabulary
bend cousin pathetic interlining
biped cover posthumous lamentable

A key ingredient of an accurate recognizer in this domain is the proper selection of features that can effectively capture the characteristics of the word pronunciation. Here, the researchers considered three of the commonly known feature extraction techniques in the literature (Karpagachelvi et al., 2010). These features include the Mel Frequency Cepstral Coefficients (MFCCs), the full Fast Fourier Transform (FFT), and the Mel Frequency Cepstral Coefficients with Linear Predictive Coding (MFCC + LPC).

MFCC is the dominant feature used in speech recognition systems, such as systems that can automatically recognize speech spoken into a computer. MFCCs are also common in speaker recognition, which is the task of recognizing people from their voices (Vimala & Radha, 2012).

Noise sensitivity is one of the considerations for choosing this method: MFCCs are not very robust in the presence of additive noise, so it is common to normalize their values in speech recognition systems to lessen the influence of noise (Li et al., 2013). The job of the MFCC is to accurately represent the phoneme being produced.

The following formula for the delta coefficients was used (Mermelstein, 1980):

$$d_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2\sum_{n=1}^{N} n^{2}}$$

where $d_t$ is the delta coefficient of frame $t$, computed in terms of the static coefficients $c_{t-N}$ to $c_{t+N}$, and $N$ is the analysis window. A typical value for $N$ is 2.

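As an illustration (a minimal sketch of ours, not the authors' code), the delta formula can be implemented directly over a (frames × coefficients) array of static MFCCs:

```python
import numpy as np

def delta(static, N=2):
    """Delta coefficients d_t from static MFCCs arranged as (frames x coeffs),
    following d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)."""
    padded = np.pad(static, ((N, N), (0, 0)), mode="edge")  # replicate edge frames
    denom = 2 * sum(n ** 2 for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(static.shape[0])
    ])
```
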
FFT is the traditional technique to analyze the frequency spectrum of the signal in speech recognition (Ernawan et al., 2011). As compared to methods exploiting knowledge about the human auditory system, the full FFT spectrum carries relatively more information about the speech signal. The logarithm of the FFT spectrum is also often used to model loudness perception. The full FFT formula is defined as:

$$X_k = \sum_{j=0}^{N-1} x_j \, e^{-2\pi i j k / N}, \qquad k = 0, 1, \dots, N-1$$

where $x_j$ is the signal sample at time index $j$, $i$ is the imaginary number $\sqrt{-1}$, and $X_k$ is a vector of $N$ values at frequency index $k$ corresponding to the magnitudes of the sine waves resulting from the decomposition of the time-indexed signal.

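For illustration, the magnitude spectrum of a windowed frame can be computed with NumPy's FFT routine; a minimal sketch (the window choice is our assumption):

```python
import numpy as np

def log_magnitude_spectrum(frame):
    """Full FFT of one speech frame: |X_k| for k = 0..N-1, with the
    logarithm often used to model loudness perception."""
    windowed = frame * np.hamming(len(frame))   # taper the frame edges
    magnitudes = np.abs(np.fft.fft(windowed))   # N frequency-bin magnitudes
    return np.log(magnitudes + 1e-10)           # small offset avoids log(0)
```
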
LPC is one of the most powerful speech analysis techniques and is a useful method for encoding quality speech at a low bit rate (Speech, n.d.). However, LPC cannot stand alone; it only serves as an enhancer to feature extraction. LPC analysis is usually most appropriate for modelling non-nasalized vowels, which are periodic.

MFCC is a representation of the speech signal as a linear cosine transform of its log power spectrum on a nonlinear Mel scale of frequency, as shown in the following formula (Anand & Meher, n.d.):

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

In this formula, $f$ signifies the frequency of the speech signal in Hz. The discrete cosine transform (DCT) is finally applied to convert the log Mel spectrum into the time domain. The result of this conversion is called the Mel Frequency Cepstral Coefficients.

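In practice the whole MFCC chain (Mel-scaled filterbank over the power spectrum, logarithm, then DCT) is available in standard libraries. A minimal sketch using `librosa`; the file name and parameter values are illustrative assumptions, not the authors' configuration:

```python
import librosa

# Load one single-word voice sample; the 16 kHz rate is an assumption.
signal, sr = librosa.load("word_sample.wav", sr=16000)

# Mel-scaled log power spectrum followed by a DCT; 13 coefficients per frame.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

# Delta (temporal difference) features over the MFCC trajectories.
deltas = librosa.feature.delta(mfccs)
print(mfccs.shape, deltas.shape)  # (13, frames) each
```
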
When Linear Prediction Analysis was developed, its basis was the prediction of the current sample as a linear combination of past samples, where $p$ is the order of prediction:

$$\hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k)$$

Here the $a_k$ are the linear prediction coefficients and $s(n)$ is the pre-processed signal, as shown in the formula above. Then, the prediction error is defined as:

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k \, s(n-k)$$

The primary objective of this method is to minimize the total prediction error $E = \sum_{n} e^{2}(n)$ and thereby find the linear prediction coefficients.

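A minimal sketch of estimating the prediction coefficients and the prediction error with `librosa`'s LPC routine (which fits the coefficients by Burg's method rather than the autocorrelation method); the input file and the order p = 12 are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter
import librosa

signal, _ = librosa.load("word_sample.wav", sr=16000)  # illustrative input
p = 12                                                 # prediction order (assumed)

# librosa.lpc returns the error-filter polynomial [1, -a_1, ..., -a_p],
# so FIR-filtering the signal with it yields the prediction error e(n).
a = librosa.lpc(signal, order=p)
e = lfilter(a, [1.0], signal)      # e(n) = s(n) - sum_k a_k * s(n - k)
E = np.sum(e ** 2)                 # total squared prediction error
```
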
Three models were constructed, one for each of these feature extraction methods, from the same training dataset, and each was evaluated with the metrics using the test dataset. These were built using the HMM, since it offers faster training compared to the other models (Sak et al., 2015). A minimal HMM configuration was applied: five states, a number commonly used in speech recognition, were utilized (Rabiner, 1989). To estimate the parameters, the researchers used the Baum-Welch algorithm, a variant of the well-known Expectation-Maximization algorithm (Baum et al., 1970; Dempster et al., 1977).

Analysis and Performance Comparison

The collected data were analysed together with the methods to determine which of them would be tagged as the best performer. Here, the models were created based on three different features, and the training data were used to compose these models. Correctly pronounced words were gathered and properly labelled; five (5) speech samples per word, for each student sample, were used in the training. The labelling is based on the ground-truth data from Google (Li, n.d.). This is to capture the pronunciation details of the voice.

Since the researchers also gathered and properly labelled mispronounced words, the recognition of mispronounced versus properly pronounced words is achievable. Since it is assumed that the words for each test are known, the classification is binary; that is, a word is either correctly pronounced or not.

These models were analysed using the test data and were compared based on accuracy, specificity, and sensitivity. The formulas of the metrics are the following:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

TP refers to the number of true positives, or the correctly recognized correctly pronounced words, while TN refers to the number of true negatives, or the correctly recognized mispronounced words. FP and FN refer to false positives and false negatives, respectively, which count the misclassified word pronunciations: FP is the number of misclassified mispronounced words, while FN is the number of misclassified correctly pronounced words. Sensitivity, or the true positive rate, measures the capacity of the model in recognizing correctly pronounced words, while specificity, or the true negative rate, does so for the mispronounced terms. Moreover, accuracy determines how good the model is at separating correctly pronounced from mispronounced words.

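These metrics translate directly into code; a minimal sketch:

```python
def evaluate(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy from the confusion counts."""
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```
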
Results and Discussion

In speech recognition systems, training datasets are used to create the models for analysis, while the test dataset is needed to determine the accuracy, sensitivity, and specificity of the different models. In this case, since the researchers used three feature extraction methods, namely MFCC, FFT, and MFCC+LPC, three models were compared with the metrics.

The researchers used codes to identify the model and the metric in each box plot. The prefix (the code before the dash) represents the model being compared: ML represents the MFCC and LPC (MFCC+LPC) model, M represents the MFCC model, and F represents the full FFT model. The suffix (the code after the dash) represents the metric shown in each plot: A, Se, and Sp represent accuracy, sensitivity, and specificity, respectively. The (+) symbols are the outliers, referring to values that are numerically distant from the rest of the data points.

It was observed from the data that the models have the same behaviour in each of the clusters. For example, most of the models suffer in the 1-syllable group due to the limited signal to work on: MFCC+LPC misclassifies the words ache, arm, and beat; MFCC misclassifies the words badge, bait, and balm; and FFT misclassifies the words ache, aide, and biped. The models have increasing accuracy, and a decreasing number of outliers, as the number of syllables is increased.

Figure 3. Performance Comparison of the Models.

Hence, the researchers consolidated or merged the data from each word cluster to have a better view of their performance. Consolidating the data from the results of the one- to five-syllable words, using the three models, helps picture the overall performance of each model in each metric, as shown in Figure 3.

In Figure 3, we can observe that MFCC+LPC performs better than the other models, although all of them have high recognition rates. MFCC+LPC has better accuracy than MFCC. This conforms to the observation of MFCC+LPC in its four-syllable performance: it needs only four syllables to be optimum.

Figure 4. MFCC+LPC Performance.

Most of the correctly pronounced words were correctly identified by the model, since most of the data points lie in the 0.9 to 1.0 range. It can also be observed that the MFCC+LPC model has the same performance as MFCC in recognizing mispronounced and correctly pronounced words, having recognition rates (specificity and sensitivity, respectively) of 0.9 or higher. Though this performance is similar to MFCC, MFCC+LPC is better since it can recognize words with fewer syllables, as its performance is optimum from four syllables, as shown in Figure 4.

The labels used in Figure 4 are the codes of the metrics with the number of syllables of the word: Acc is for accuracy, Sen is for sensitivity, and Spe is for specificity. It can be observed that the model improves as the difficulty of the word being recognized increases. This is apparent since MFCC+LPC works best on longer signal samples. Also, it needs only four syllables to be optimum. This behaviour was not observed in the two other models, since they need as many as five syllables to be at their optimum performance.

Conclusion and Recommendation

In this research, the researchers investigated the problem of recognizing correct pronunciation of English words. Three models were created with different features, namely MFCC+LPC, MFCC, and FFT, using the HMM. Results show that the model with the MFCC+LPC feature works best in recognizing correctly pronounced from mispronounced English words. This is manifested in the high recognition rates of the model on the three different metrics defined. Also, the model is better in recognizing correct pronunciation of words with fewer syllables, as the model is at its optimum from four syllables. This implies that the model has higher recognition rates on words with fewer syllables, thereby making the MFCC+LPC model the more stable and more suitable model for pronunciation recognition.

Though the MFCC+LPC model has outstanding performance, it suffers when the input data is continuous; this means that the speech or voice sample may contain one or more words to analyse. In this research, the model was designed to receive single-word voice input, hence noise and other sounds in the background could be a problem. To make the model more robust, an approach that could extract the significant segments of the signal, like extracting only the important words for analysis, is needed to enhance the model. Real-time continuous signal processing may also be a good improvement, to make interaction more natural to the users in the educational context.

█ █ █

References

Adams, R. E. (1990). Sourcebook of automatic identification and data collection. New York: Van Nostrand Reinhold.

Alsulaiman, M., Muhammad, G., & Ali, Z. (2011). Comparison of voice features for Arabic speech recognition. 2011 Sixth International Conference on Digital Information Management. doi:10.1109/icdim.2011.6093369.

Anand, D., & Meher, P. (n.d.). Combined LPC and MFCC Features based technique for Isolated Speech Recognition. Hyderabad, India.

Ananthi, S., & Dhanalakshmi, P. (2013). Speech Recognition System and Isolated Word Recognition based on Hidden Markov Model (HMM) for Hearing Impaired. International Journal of Computer Applications, 73.

Anusuya, M., & Katti, S. (2009). Speech Recognition by Machine: A Review. Proceedings of the International Conference on Computer Applications, 6.

Aurora, S., & Singh, R. (2012). Automatic Speech Recognition: A Review. International Journal of Computer Applications, 60.

Bagge, N., & Donica, C. (2001). Final Project: Text Independent Speaker Recognition. ELEC 301 Signals and Systems Group Projects.

Barbu, T. (2007). A supervised text-independent speaker recognition approach. In Proceedings of the 12th International Conference on Computer, Electrical and Systems Science, and Engineering, CESSE 2007, 22, 444-448.

Barrett, G. (2006). Official Dictionary of Unofficial English. Retrieved September 15, 2015, from http://www.waywordradio.org/Official_Dictionary_of_Unofficial_English-Grant-Barrett-0071458042.pdf.

Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics, 41, 164-171.

Boll, S. (n.d.). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-27, 113-120.

Daphal, S., & Jagtap, S. (2012). DSP Based Improved Speech Recognition System. International Conference on Communication, Information & Computing Technology (ICCICT).

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1-38.

Ernawan, F., Abu, N., & Suryana, N. (2011). Spectrum Analysis of Speech Recognition via Discrete Tchebichef Transform. SPIE Digital Library. Retrieved September 15, 2015, from http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1197978.

Furui, S. (1981). Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(2), 254-272.

Hispanicallyspeakingnews. (2011). Google's Voice Search Sends Hunter Around the World. Hispanically Speaking News. Retrieved September 4, 2015, from http://www.hispanicallyspeakingnews.com/latino-daily-news/details/googles-voice-search-sends-hunter-around-the-world/9520/.

Jarng, S. (2011). HMM Voice Recognition Algorithm Coding. International Conference on Information Science and Applications.

Kale, K., Mehrotra, S., & Manza, R. (n.d.). Computer Vision and Information Technology: Advances and Applications.

Karpagachelvi, S., Arthanari, M., & Sivakumar, M. (2010). ECG Feature Extraction Techniques - A Survey Approach (Vol. 8, Ser. 1).

Krenker, A., Bester, J., & Kos, A. (n.d.). Introduction to the Artificial Neural Networks. Slovenia: University of Ljubljana.

Li, J. et al. (n.d.). An Overview of Noise-Robust Automatic Speech Recognition. Retrieved September 15, 2015, from https://www.lsv.uni-saarland.de/fileadmin/publications/non_articles/an_overview_of_noise_robust_automatic_speech.pdf.

Ludwig, S. (n.d.). Siri Assistant 1.0 for iPhone. Retrieved September 4, 2015, from http://www.pcmag.com/article2/0,2817,2358823,00.asp.

Lyu, M., Xiong, C., & Zhang, Q. (2014). Electromyography (EMG)-based Chinese voice command recognition. 2014 IEEE International Conference on Information and Automation (ICIA). doi:10.1109/icinfa.2014.6932784.

Mastin, L. (2011). Language Issues: English as a Global Language. Retrieved September 15, 2015, from http://www.thehistoryofenglish.com/issues_global.html.

Maurice. (2013). 5 Ways to get Siri Alternatives for Android Phones. Retrieved September 4, 2015, from http://www.tipsotricks.com/2013/03/5-best-siri-alternatives-for-android-phones.html.

Mermelstein, D. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357-366.

Muzaffar, F., Mohsin, B., Naz, F., & Jawed, F. (2005). DSP Implementation of Voice Recognition Using Dynamic Time Warping Algorithm. 2005 Student Conference on Engineering Sciences and Technology, 1.

Nitta, T., Murata, T., Tsuboi, H., Takeda, K., Kawada, T., & Watanabe, S. (n.d.). Development of Japanese voice-activated word processor using isolated monosyllable recognition. ICASSP 82. IEEE International Conference on Acoustics, Speech, and Signal Processing. doi:10.1109/icassp.1982.1171875.

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition.

Sak, H. et al. (2015). Google Voice Search: Faster and More Accurate. Retrieved September 15, 2015, from http://googleresearch.blogspot.com/2015/09/google-voice-search-faster-and-more.html.

Sapijaszko, V., & Michael, W. (2012). An overview of recent window based feature extraction algorithms for speaker recognition. IEEE 55th International Midwest Symposium on Circuits and Systems (MWSCAS).

Sarikaya, R., & Hansen, J. (2001). Analysis of the root-cepstrum for acoustic modeling and fast decoding in speech recognition. Proc. Eurospeech'01. Aalborg, Denmark.

Shi, Z., Shimohara, K., & Feng, D. (2006). Intelligent information processing III. IFIP TC12 International Conference on Intelligent Information Processing (IIP 2006).

Speech. (n.d.). Retrieved September 15, 2015, from https://www.uic.edu/classes/ece/ece434/chapter_file/Chapter5_files/Speech.htm.

Thakare, V. (n.d.). Techniques for Feature Extraction In Speech Recognition System: A Comparative Study. Retrieved September 4, 2015, from http://arxiv.org/abs/1305.1145#.

Varshney, N. (2014). Embedded Speech Recognition System. International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, 3.

Viikki, O., & Laurila, K. (1998). Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Communication, 25(1-3), 133-147.

Vimala, C., & Radha, V. (2012). A Review on Speech Recognition Challenges and Approaches. World of Computer Science and Information Technology Journal (WCSIT), 2(1), 1-7.
