
Digital Speech Processing Final Project

Speaker Recognition
R94921035 R94921131

ABSTRACT

In this report we provide a brief overview of speaker recognition and introduce some modern speaker verification systems based on support vector machines and Gaussian mixture models. The underlying principles of these systems are shown, and the operating details of each component are described. The structure of this report is as follows: Section 1 provides an overview of speaker recognition, drawing on [1] and [5]; Section 2 describes the general model of a speaker verification system; the concepts of GMM and SVM are introduced in Sections 3 and 4, respectively; an adapted-GMM model proposed in [2] and sequence discriminant SVMs from [3] are also presented in the corresponding sections; the concept of combining GMM and SVM proposed in [4] is shown in Section 5. Section 6 concludes the report.

1. OVERVIEW

1.1. Information revealed in speech
The most basic level of information that can be retrieved from digital speech processing is the meaning of the words. Beyond that, the speaker's emotion and gender can also be extracted. Moreover, because different people exhibit different speech characteristics, the identity of the speaker can be recognized.

1.2. What is Speaker Recognition?
The field of speaker recognition can be divided into two main areas.
Speaker Identification (closed-set): given a known set of speakers, determine the identity of the input voice.
Speaker Verification (open-set): determine whether the speaker is indeed the one he or she claims to be.

Depending on whether the text is constrained, there are two kinds of input speech.
Text-dependent: The content of the speech is known to the system, based on the assumption that the speaker is cooperative. This type is easier to implement and achieves higher accuracy.
Text-independent: The content of the speech is unknown; in other words, the system has no prior knowledge of it. This type is harder to implement but offers higher flexibility.

Fig. 1 Areas related to speaker recognition

1.3. Applications
Some representative applications are as follows. Speaker recognition is most pervasively used in the banking and telecommunication areas, for example in access control and transaction authentication. It is also used for law enforcement, including activity-area restriction and forensics. Moreover, it can lead to better personalization of services.

1.4. Performance
A detection error tradeoff (DET) plot is used to exhibit the error rates of a model. By comparing the equal error rate (EER) points of different models, their relative performance can be evaluated.
Equal error rate (EER) point: the point on the curve where FA (false accept rate in percentage) equals FR (false reject rate in percentage).
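To make the EER concrete, the following minimal Python sketch (not from the report; the array names and synthetic scores are illustrative) sweeps a threshold over verification scores and picks the point where FA and FR are closest:

import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Return the EER given scores for true-speaker and impostor trials."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    fr = np.array([(target_scores < t).mean() for t in thresholds])    # false reject rate
    fa = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accept rate
    idx = np.argmin(np.abs(fa - fr))   # threshold where FA and FR are closest
    return 0.5 * (fa[idx] + fr[idx])

# toy example with synthetic scores
rng = np.random.default_rng(0)
eer = equal_error_rate(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))
print(f"EER = {100 * eer:.2f}%")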

Fig. 2 Performance range of speaker verification

Figure 2 depicts the range of speaker verification performance. The meanings of the four points are as follows. (1) Text-dependent, using combination-lock phrases. (2) Text-dependent, using 10-digit strings. (3) Text-independent, using conversational speech. (4) Text-independent, using read sentences. As a result, the more constraints there are, the better the performance is.

1.5. Advantages and disadvantages
Since speech is a natural message expressed by humans, its capture is easy and unobtrusive. The hardware used, such as telephones and wireless networks, is low-cost and ubiquitous, so no extra infrastructure is required. However, several sources of uncertainty can increase the difficulty of recognition. Speech, as a behavioral signal, is altered by different health or mental conditions, while the I/O devices and channels also introduce variability. At the same time, the mobility of cell phones introduces more uncontrolled and time-varying environments.

1.6. Future trends
Nowadays, only low-level features are broadly used in automatic speaker recognition systems. Since humans use several levels of information to recognize speakers from speech alone, more effort should be made to exploit higher-level information, such as idiolect (i.e., word usage) and prosodic measures. Also, factors that are environment-dependent and hardware-dependent should be eliminated to improve real-world robustness.

2. SPEAKER VERIFICATION MODEL

Fig. 3 Speaker verification model

Figure 3 shows the basic model of speaker verification. It can be divided into four parts, briefly described below.

2.1. Front-end processing
The first step of front-end processing is speech detection, which finds the non-speech portions and removes them. The next step is feature extraction (for example, LPC, FFT, MFCC), in which features containing speaker information are extracted for later use. Finally, channel compensation is applied to remove the impact of different channels.

2.2. Speaker modeling
An ideal speaker model should have three attributes: a theoretical underpinning, the ability to generalize to new data, and a parsimonious representation in both size and computation. The selection of the model is largely application-oriented. A large number of models have been proposed; the most prevalent ones are listed below.
Template Matching is a text-dependent method. The feature-vector sequences of a fixed phrase are compared using DTW, and a match score representing the similarity between training and testing data is computed.

Nearest Neighbor computes the match score from the k nearest neighbors, based on the accumulated distance between the test feature vectors and the training vectors.
Neural Network acts as a classifier that discriminates between the modeled speaker and all others. However, training is so costly that this method is seldom used.
HMM is a statistical model. Multi-state left-to-right HMMs are used for text-dependent applications, while the Gaussian mixture model, a single-state HMM, is used for text-independent applications. HMMs are currently regarded as the best way to model speakers.

2.3. Imposter Modeling
An imposter model can minimize non-speaker-related variability by normalizing the likelihood-ratio scores. Generally, there are two approaches to representing imposter models.
Likelihood Sets (Background Sets): a collection of other speaker models. For each speaker, a specific imposter model is constructed from the models of all non-claimant speakers.
Universal Background Model: a single speaker-independent model shared by all speakers. In addition to requiring less storage, it usually provides better performance.

2.4. Likelihood Ratio Test
The last step in speaker verification is decision making. Given a segment of speech Y and a hypothesized speaker S, the speaker verification test is to determine whether the speech was spoken by S. The basic goal of the speaker verification system is to compute the two likelihood values for the hypotheses:
H0: Y is from the hypothesized speaker S.
H1: Y is not from the hypothesized speaker S.
If p(Y|H0) / p(Y|H1) is larger than or equal to the threshold A, H0 is accepted; if it is smaller than A, H0 is rejected. However, tuning the decision threshold is troublesome in speaker verification, mainly because of score variability between trials. Score normalization has been introduced explicitly to cope with score variability and to make speaker-independent threshold tuning easier.
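As a concrete illustration of the decision rule, here is a minimal Python sketch. It assumes the speaker and background models expose a per-frame log-likelihood method such as scikit-learn's score_samples; the function and parameter names are illustrative, not from the reviewed papers.

import numpy as np

def verify(features, speaker_gmm, background_gmm, log_threshold):
    """Accept H0 if the length-normalized log-likelihood ratio exceeds log A.

    features: (T, D) array of feature vectors for utterance Y.
    speaker_gmm / background_gmm: models with a score_samples(X) method that
    returns per-frame log-likelihoods (e.g. sklearn.mixture.GaussianMixture).
    """
    llr = np.mean(speaker_gmm.score_samples(features)        # log p(Y | H0) per frame
                  - background_gmm.score_samples(features))   # log p(Y | H1) per frame
    return llr >= log_threshold   # compare the averaged log ratio with the threshold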

3. GAUSSIAN MIXTURE MODELS

3.1. What is Gaussian Mixture Model (GMM)?
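For reference, the basic form referred to below is the standard GMM likelihood (as in [2]): a weighted sum of multivariate Gaussian densities,

p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, p_i(x), \qquad p_i(x) = \mathcal{N}(x; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{M} w_i = 1,

where \lambda = \{w_i, \mu_i, \Sigma_i\} denotes the mixture weights, mean vectors, and covariance matrices.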

The equation above is the basic form of a GMM.

3.2. Using GMM in Speaker Verification
For computational efficiency, the covariance matrices are usually assumed to be diagonal. This does not lose generality, because the modeling capability of a full-covariance GMM can equally be achieved by a diagonal-covariance GMM of higher order; moreover, diagonal-covariance GMMs have been observed to outperform full-covariance ones. At the training stage, the EM (expectation-maximization) algorithm is used to estimate the parameters of the GMM. The Gaussian components can be considered to model the underlying broad phonetic sounds that characterize a person's voice. Being computationally inexpensive, well understood, and insensitive to temporal aspects are obvious advantages of using GMMs in speaker verification. However, this insensitivity to time may discard information that is time-related.

3.3. Speaker Verification Using Adapted Gaussian Mixture Models
This paper [2] presents the major elements of the GMM-UBM system used for high-accuracy speaker recognition. The system consists of several elements, including the likelihood ratio test, GMMs for the likelihood functions, the UBM representing imposters, the adaptation procedure, a handset detector, and score normalization.

3.3.1. Training the Universal Background Model (UBM)
Figure 4(a) shows a training process in which all data are pooled to train a single UBM with the EM algorithm. To avoid biasing the model, the pooled data should be balanced over the subpopulations. Figure 4(b) shows another training approach: EM is performed for each subpopulation in turn, so that each produces a specific UBM. After the weights of the subpopulation UBMs are normalized, the final UBM is produced by simple agglomeration.
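A minimal sketch of UBM training with diagonal covariances using scikit-learn; the pooling and balancing of the background data are assumed to be done beforehand, and the file name and model size are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

# pooled, balanced background features: shape (num_frames, feature_dim)
background_features = np.load("ubm_features.npy")

ubm = GaussianMixture(
    n_components=512,         # a model size in the range discussed in Section 3.3.4
    covariance_type="diag",   # diagonal covariances for efficiency
    max_iter=100,
)
ubm.fit(background_features)  # EM estimation of weights, means, and variances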

Fig. 4 Two approaches to modeling the UBM

Obviously, the method shown in (b) is much more convenient, and it is also much more flexible, since the ratio of the subpopulations can be manipulated at the last step.

3.3.2. Adaptation of the Speaker Model
After the UBM is constructed, MAP (maximum a posteriori) estimation is used to adapt the UBM to represent the model of a specific speaker. The formulas are as follows:
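The mean-adaptation equations from [2] are (weights and variances are adapted analogously; r is the relevance factor):

\Pr(i \mid x_t) = \frac{w_i \, p_i(x_t)}{\sum_{j=1}^{M} w_j \, p_j(x_t)}, \qquad n_i = \sum_{t=1}^{T} \Pr(i \mid x_t), \qquad E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t) \, x_t

\hat{\mu}_i = \alpha_i \, E_i(x) + (1 - \alpha_i) \, \mu_i, \qquad \alpha_i = \frac{n_i}{n_i + r}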

In Figure 5(a), the points marked by x are the training data of a specific speaker, and the circles are the Gaussian components of the original UBM. The result of adaptation is shown in Figure 5(b). Since the adaptation should be data-dependent, an adaptation coefficient is added for each Gaussian: the larger the portion of new data assigned to a Gaussian, the stronger the impact of the new data on its parameters, while Gaussians that see little data remain close to the UBM.
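A minimal Python sketch of the mean-only adaptation described above, assuming a diagonal-covariance UBM trained as in the earlier sketch; the relevance factor value is illustrative.

import numpy as np

def map_adapt_means(ubm, features, relevance=16.0):
    """Adapt UBM means toward one speaker's data (mean-only MAP adaptation)."""
    post = ubm.predict_proba(features)           # Pr(i | x_t), shape (T, M)
    n_i = post.sum(axis=0)                       # soft frame counts per mixture
    e_i = post.T @ features / np.maximum(n_i[:, None], 1e-10)  # data means E_i(x)
    alpha = n_i / (n_i + relevance)              # data-dependent adaptation coefficients
    return alpha[:, None] * e_i + (1.0 - alpha[:, None]) * ubm.means_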

Fig. 5 Two steps in model adaptation

The speaker models are derived by updating the well-trained parameters of the UBM via adaptation. This provides a tighter coupling between the speaker model and the UBM, which not only produces better performance than decoupled models but also allows for a fast-scoring technique. Because of the data-dependent property, not all Gaussians are modified; for each speaker, only the changed components are recorded. Only very few mixtures contribute to the final score, because a single feature vector is close to only a few components of a GMM, so the likelihood values can be approximated by evaluating only the nearest few components. This method effectively improves performance, since it is not affected by unseen acoustic events.

3.3.3. Handset Score Normalization: HNORM
To alleviate the significant performance degradation caused by handset variability in speaker recognition, linear channel effects are removed by channel compensation. Since this is hard to do in the signal domain, it is performed in the score domain instead. HNORM is a handset compensation technique that operates in the score domain.
Handset detectors are simple maximum-likelihood classifiers built from handset-dependent GMMs. They are applied as a pre-stage of HNORM to label a speech segment as CARB (carbon-button microphone handset) or ELEC (electret microphone handset) based on the likelihood values.
Handset score normalization: per-handset means and variances describe how a hypothesized speaker's model responds to CARB- and ELEC-type speech, one set for each handset type. The log-likelihood-ratio score is then normalized using the following formula, in which HS(X) is the handset label for X:
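The normalization takes the usual z-norm form used in [2]: the raw score S(X) is shifted and scaled by the impostor-score statistics estimated for the detected handset,

S_{\mathrm{HNORM}}(X) = \frac{S(X) - \mu\bigl(HS(X)\bigr)}{\sigma\bigl(HS(X)\bigr)}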

Fig. 6 Example of HNORM normalization

Figure 6 demonstrates the result of normalization using HNORM. After HNORM is performed, the non-speaker scores are distributed according to the standard normal N(0,1), so a single detection threshold can be used, which leads to better performance.

3.3.4. Experiment Results
The detection error tradeoff (DET) curve is drawn by sweeping the likelihood threshold. In Figure 7, the effect of different model sizes is examined; the results show that the gain from adding mixtures diminishes at around 512 mixtures.

Fig. 7 DET curves for different mixture sizes

A plot comparing the performance of the GMM-UBM system with and without HNORM is presented in Figure 8. The result provides clear evidence for the benefit of HNORM.

Fig. 8 The impact of using HNORM

3.3.5. Future Work
The system should be refined for mismatched conditions; that is, it should be able to handle differences between the training and testing environments. Besides, more effort should be made to incorporate higher-level information.

4. SUPPORT VECTOR MACHINES

4.1. What is Support Vector Machine (SVM)?
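The discriminant function referred to below has the standard form (see [4], [6]):

f(x) = \sum_{i=1}^{N} \alpha_i \, t_i \, K(x, x_i) + d, \qquad \sum_{i=1}^{N} \alpha_i t_i = 0, \quad \alpha_i > 0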

An SVM is a two-class classifier constructed from sums of a kernel function K(·,·). In the formula shown above, the t_i are the ideal outputs, with value 1 or -1 associated with class 0 or class 1, respectively, and the x_i are the support vectors. The value of f(x) is compared to a given threshold, and the classification is made according to whether it is above or below the threshold. In its basic form, the SVM is a binary linear classifier. The following example (Figure 9) demonstrates the oriented classification line in eight different cases. All points on one side are assigned to class 1, while all points on the other side are assigned to class 0. Note that the orientation is specified by the arrow.

Fig. 9 Linear classifier with orientation

As shown in Figure 10 below, for a set of linearly separable data, more than one solution for a discriminative classifier may exist. A maximum-margin criterion is met when an SVM is applied as the classifier; that is, the separating hyperplane is chosen so as to maximize the margin between the two classes. Note that the points lying on the boundaries (the circled ones) are the support vectors.

Fig. 10 Margin and support vectors

The extension to nonlinear boundaries is achieved through the use of kernel functions, which implicitly define a nonlinear mapping from the input space to an SVM feature space. Linear classification techniques are then applied in this potentially high-dimensional space. For example, an SVM with a Gaussian RBF kernel can classify an arbitrarily large number of training points. The idea is indicated in Figure 11.
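A minimal scikit-learn sketch of a kernel SVM with a Gaussian RBF kernel; this is a toy two-class example, not tied to the paper's data, and the parameter values are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy data: both classes centered at the origin with different spreads,
# so no linear boundary separates them well
X0 = rng.normal(0.0, 0.5, size=(100, 2))
X1 = rng.normal(0.0, 2.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)   # Gaussian RBF kernel
clf.fit(X, y)
print("number of support vectors:", clf.support_vectors_.shape[0])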

Fig. 11 RBF kernel

Therefore, the core issue in implementing an SVM is to find an appropriate kernel function for a particular application. The SVM training process then models the boundary between the classes.

4.2. Using SVM in Speaker Verification
The use of SVMs in speaker verification systems is increasingly popular. Speaker verification is inherently a two-class problem: its purpose is to answer a yes-no question, namely whether the speaker is the one he or she claims to be. SVM classifiers are suitable for separating rather complex regions between two classes through an optimal, nonlinear decision boundary. Moreover, an SVM has many desirable properties, including the ability to classify sparse data without over-training and to make nonlinear decisions via kernel functions. However, its weakness is that it cannot directly handle the temporal structure of speech signals.

4.3. Speaker Verification Using Sequence Discriminant Support Vector Machines
This paper [3] presents a text-independent speaker verification system based on SVMs. Using the score-space kernel approach, the SVMs allow direct classification of whole sequences. Two normalization techniques, a whitening step and spherical normalization, are also addressed.

4.3.1. Sequence Discrimination
Current state-of-the-art speaker verification systems, such as those based on GMMs and HMMs, usually operate at the frame level. For instance, the overall sequence score in a GMM-based system is obtained by averaging the likelihoods of the individual frames in the sequence. However, speaker verification is actually concerned with sequence discrimination: the frame-level approach may inadvertently discard relevant information and may become inefficient when the number of training frames is large. To carry out sequence discrimination, a score-space kernel, a special kind of kernel that maps a complete sequence onto a single point in a high-dimensional space, is used.

4.3.2. Score-Space Kernel
Discriminative classification of sequences is difficult because sequences have different lengths. Fortunately, score-space kernels enable SVMs to classify whole sequences by exploiting a set of parametric generative models, such as GMMs. In this approach, a variable-length sequence is mapped onto a single point in a fixed-dimension space, the score-space. The original problem thus becomes a sparse-data problem, which suits the SVM scheme, and the mapping of each sequence to a feature space can be interpreted as an SVM kernel. It should be noted that the dimensionality of the score-space is equal to the total number of parameters in the generative models, which usually amounts to several thousand.

4.3.3. Normalization
SVMs are not invariant to linear transformations in feature space, so normalization of the feature vectors is desirable. The two stages of normalization are as follows.
(1) Whitening: the score-vector components are normalized to zero mean and unit variance; that is, the basis vectors of the score-space are mapped to an orthonormal set.
(2) Spherical normalization: as illustrated in Figure 12 below, spherical normalization is a transformation that maps each feature vector onto the surface of a unit hypersphere embedded in a space with one more dimension than the feature vector itself. Spherical normalization reduces the ill-conditioning caused by the large dynamic range, and it also tackles the variability in the dynamic range of the elements.
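A minimal sketch of the two normalization stages, assuming the score-space vectors are stacked in a matrix. The spherical step shown here is a simple variant that appends a constant coordinate before projecting onto the unit hypersphere; it is an illustration, not the exact projection used in [3].

import numpy as np

def whiten(scores, eps=1e-10):
    """Normalize each score-space dimension to zero mean and unit variance."""
    mean = scores.mean(axis=0)
    std = scores.std(axis=0) + eps
    return (scores - mean) / std

def spherical_normalize(scores):
    """Map each vector onto a unit hypersphere in a space with one extra dimension."""
    extended = np.hstack([scores, np.ones((scores.shape[0], 1))])  # append a constant coordinate
    return extended / np.linalg.norm(extended, axis=1, keepdims=True)

# scores: (num_sequences, score_space_dim); normalized = spherical_normalize(whiten(scores))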

Fig. 12 Spherical normalization

There are many different projection methods; Figure 13 visualizes three of them. (a) The orthographic projection is limited, since the input data are restricted to lie within a small finite region directly beneath the hemisphere. (b) Although the stereographic projection does not suffer from the restriction in (a), points at positive and negative infinity map to the same point on the hypersphere. (c) The modified stereographic projection does not suffer from the problem in (b).

Fig. 13 Three projection methods

4.3.4. Experiment Results
Compared to the GMM likelihood-ratio baseline, the SVM approach without spherical normalization reduced the average equal error rate by a relative 9%. With spherical normalization, a much greater 34% relative reduction in the average equal error rate was achieved.

Fig. 14 DET curves

5. GMM + SVM

5.1. Basic idea
Gaussian mixture models have proven extremely successful for text-independent speaker recognition. Recently, the concept of the GMM supervector has been proposed in order to compensate for speaker and channel variability. On the other hand, SVMs are an effective new method for speaker recognition. A novel scheme that combines these two promising techniques is introduced in the next section.

5.2. GMM supervectors
In this paper [4], two novel SVM kernels based on the GMM supervector concept are presented.

5.2.1. GMM supervectors
An exciting area of recent work in GMM speaker recognition is the use of latent factor analysis to compensate for speaker and channel variability. The GMM supervector can be used to characterize the speaker and the channel using the eigenvoice and eigenchannel methods, respectively. As illustrated in Figure 15, a GMM supervector consists of the stacked means of the mixture components. Given a speaker utterance, GMM-UBM training is performed by MAP adaptation of the means (discussed in Section 3.3).
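A minimal sketch of building a GMM supervector from a MAP-adapted model (means adapted as in Section 3.3.2). The square-root weight and covariance scaling shown here is chosen so that a plain inner product between two such supervectors matches the linear kernel discussed in the next subsection; the function name is illustrative.

import numpy as np

def gmm_supervector(adapted_means, ubm_weights, ubm_diag_covars):
    """Stack the MAP-adapted mixture means of one utterance into a long vector.

    adapted_means: (M, D) adapted means; ubm_weights: (M,) mixture weights;
    ubm_diag_covars: (M, D) diagonal covariances shared with the UBM.
    """
    scaled = np.sqrt(ubm_weights)[:, None] * adapted_means / np.sqrt(ubm_diag_covars)
    return scaled.reshape(-1)    # supervector of length M * D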

Fig. 15 The procedure of generating a GMM supervector

5.2.2. Kernels
Suppose there are two utterances utt_a and utt_b, with MAP-trained GMMs g_a and g_b, respectively.
GMM supervector linear kernel: the natural distance between the two utterances is the KL divergence:
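D(g_a \,\|\, g_b) = \int g_a(x) \log \frac{g_a(x)}{g_b(x)} \, dx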

However, the KL divergence does not satisfy the Mercer condition required of an SVM kernel, and thus it cannot be used directly in an SVM system. Instead, an approximation (an upper bound on the divergence) is used:

where

The inequality means that the divergence is small if the distance between the adapted means m_a and m_b is small. From this bound, the resulting kernel is:
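With mean-only adaptation and the UBM weights w_i and diagonal covariances Σ_i shared by both models, the GMM supervector linear kernel takes the form (following [4]):

K(\mathrm{utt}_a, \mathrm{utt}_b) = \sum_{i=1}^{M} \left( \sqrt{w_i}\,\Sigma_i^{-1/2} \mu_i^{a} \right)^{\top} \left( \sqrt{w_i}\,\Sigma_i^{-1/2} \mu_i^{b} \right)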

GMM L2 inner product kernel: this kernel is motivated by the inner product in function space.

By assuming that the means of different mixture components are far apart, the approximation formula shown below can be obtained.

5.2.3. Experiment Results
Both the equal error rate (EER) and the minimum decision cost function value (minDCF) are provided as evaluation indices. The DCF is calculated as follows:
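\mathrm{DCF} = C_{\mathrm{miss}} \, P_{\mathrm{miss}\mid\mathrm{target}} \, P(\mathrm{target}) + C_{\mathrm{FA}} \, P_{\mathrm{FA}\mid\mathrm{nontarget}} \, \bigl(1 - P(\mathrm{target})\bigr)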

where Cmiss = 10, CFA = 1, and P(target) = 0.01

Fig. 16 DET curves for comparison

Performance was found to be competitive with a standard GMM-UBM system with adaptation, while the GMM supervector SVM has considerably lower computational complexity. Furthermore, it is well suited for the application of new channel compensation techniques.

5.2.4. Future Work
It would be reasonable to apply SVM techniques to channel compensation for further enhancement. Moreover, effort should be made to extend the approach to HMM MAP adaptation.

6. CONCLUSION
In this report, we digest not only the basic ideas but also recent research associated with speaker recognition and verification technology. After examining these cutting-edge issues in depth, we understand the overall systems from both theoretical and practical perspectives. Furthermore, the paper-reading stage helped broaden our knowledge of speech processing, while the stages of understanding and writing strengthened our abilities and will be beneficial for pursuing integration with other research areas.

REFERENCES
[1] D. A. Reynolds, "An overview of automatic speaker recognition technology," in Proc. ICASSP, pp. 4072-4075, 2002.
[2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.
[3] V. Wan and S. Renals, "Speaker verification using sequence discriminant support vector machines," IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, pp. 203-210, Mar. 2005.
[4] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, pp. 308-311, May 2006.
[5] J. P. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, pp. 1437-1462, Sep. 1997.
[6] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 1998.

APPENDIX: Work Contribution
Sections 1, 2, 3
Sections 4, 5, 6
