Speaker recognition

Sadaoki Furui (2008), Scholarpedia, 3(4):3715. doi:10.4249/scholarpedia.3715, revision #64889

Curator: Sadaoki Furui

Speaker recognition is the process of automatically recognizing who is speaking by using the speaker-specific information included in speech waves to verify identities being claimed by people accessing systems; that is, it enables access control of various services by voice (Furui, 1991, 1997, 2000). Applicable services include voice dialing, banking over a telephone network, telephone shopping, database access services, information and reservation services, voice mail, security control for confidential information, and remote access to computers. Another important application of speaker recognition technology is as a forensics tool.


Principles of Speaker Recognition

General Principles and Applications

Speaker identity is correlated with physiological and behavioral characteristics of the speech production system of an individual speaker. These characteristics derive from both the spectral envelope (vocal tract characteristics) and the supra-segmental features (voice source characteristics) of speech. The most commonly used short-term spectral measurements are cepstral coefficients and their regression coefficients. As for the regression coefficients, typically, the first- and second-order coefficients, that is, derivatives of the time functions of cepstral coefficients, are extracted at every frame period to represent spectral dynamics. These regression coefficients are respectively referred to as the delta-cepstral and delta-delta-cepstral coefficients.
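The regression coefficients described above can be illustrated with a minimal sketch. It assumes the cepstral coefficients have already been computed for each frame; the window half-width K = 2 and the repetition of the first and last frames at the edges are illustrative choices, not values taken from this article.

```python
def delta(frames, K=2):
    """First-order regression (delta) coefficients over a +/-K frame window.

    frames: list of per-frame coefficient lists (e.g. cepstral coefficients).
    Frames beyond the ends are replaced by the first/last frame (an assumption).
    """
    T = len(frames)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, K + 1):
                ahead = frames[min(t + k, T - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


# Toy "cepstra": two coefficients that grow linearly with the frame index.
cepstra = [[float(t), 2.0 * t] for t in range(6)]
delta_cepstra = delta(cepstra)              # first-order (delta-cepstral)
delta_delta_cepstra = delta(delta_cepstra)  # second-order (delta-delta-cepstral)
```

Applying the same regression to the delta sequence gives the second-order coefficients; away from the edges, the delta of a linear trajectory equals its slope.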

Speaker Identification and Verification

Speaker recognition can be classified into speaker identification and speaker verification. Speaker identification is the process of determining from which of the registered speakers a given utterance comes. Speaker verification is the process of accepting or rejecting the identity claimed by a speaker. Most of the applications in which voice is used to confirm the identity of a speaker are classified as speaker verification.

In the speaker identification task, a speech utterance from an unknown speaker is analyzed and compared with speech models of known speakers. The unknown speaker is identified as the speaker whose model best matches the input utterance. In speaker verification, an identity is claimed by an unknown speaker, and an utterance of this unknown speaker is compared with a model for the speaker whose identity is being claimed. If the match is good enough, that is, above a threshold, the identity claim is accepted. A high threshold makes it difficult for impostors to be accepted by the system, but with the risk of falsely rejecting valid users. Conversely, a low threshold enables valid users to be accepted consistently, but with the risk of accepting impostors. To set the threshold at the desired level of customer rejection (false rejection) and impostor acceptance (false acceptance), data showing distributions of customer and impostor scores are necessary.
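The threshold trade-off can be made concrete with a small sketch. The customer and impostor scores below are invented for illustration, and the convention that a claim is accepted when the score is at or above the threshold is an assumption.

```python
def error_rates(customer_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    A claim is accepted when its score is >= threshold.
    """
    fr = sum(s < threshold for s in customer_scores) / len(customer_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fr, fa


# Hypothetical match scores (higher means a better match).
customers = [2.1, 1.7, 2.5, 0.9, 1.8]
impostors = [0.2, 1.0, -0.5, 0.7, 1.9]

for thr in (0.5, 1.5, 2.0):
    fr, fa = error_rates(customers, impostors, thr)
    print(f"threshold={thr}: false rejection={fr:.1f}, false acceptance={fa:.1f}")
```

Raising the threshold lowers false acceptance at the cost of more false rejections, which is why both score distributions are needed to choose an operating point.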

The fundamental difference between identification and verification is the number of decision alternatives. In identification, the number of decision alternatives is equal to the size of the population, whereas in verification there are only two choices, acceptance or rejection, regardless of the population size. Therefore, speaker identification performance decreases as the size of the population increases, whereas speaker verification performance approaches a constant independent of the size of the population, unless the distribution of physical characteristics of speakers is extremely biased.

There is also a case called “open set” identification, in which a reference model for an unknown speaker may not exist. In this case, an additional decision alternative, “the unknown does not match any of the models”, is required. Verification can be considered a special case of the “open set” identification mode in which the known population size is one. In either verification or identification, an additional threshold test can be applied to determine whether the match is sufficiently close to accept the decision, or if not, to ask for a new trial.
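As a rough sketch of the open-set decision, the best-matching model is selected as in closed-set identification and then subjected to an additional threshold test. The speaker names, scores, and threshold values are hypothetical.

```python
def identify_open_set(scores, threshold):
    """scores: dict mapping registered speaker -> similarity score.

    Returns the best-matching speaker, or None when even the best match
    falls below the threshold ("the unknown does not match any model").
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


scores = {"alice": 0.82, "bob": 0.40}          # hypothetical similarities
accepted = identify_open_set(scores, 0.5)      # best match passes the test
rejected = identify_open_set(scores, 0.9)      # no model is close enough
```

With a single registered speaker in `scores`, the same function reduces to a verification decision.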

The effectiveness of speaker verification systems can be evaluated by using the receiver operating characteristics (ROC) curve adopted from psychophysics. The ROC curve is obtained by assigning two probabilities, the probability of correct acceptance (1 − false rejection rate) and the probability of incorrect acceptance (false acceptance rate), to the vertical and horizontal axes respectively, and varying the decision threshold. The detection error trade-off (DET) curve is also used, in which false rejection and false acceptance rates are assigned to the vertical and horizontal axes respectively. The error curve is usually plotted on a normal deviate scale. With this scale, a speaker recognition system whose true speaker and impostor scores are Gaussians with the same variance will result in a linear curve with a slope equal to −1. The DET curve representation is therefore more easily readable than the ROC curve and allows for a comparison of the system’s performance over a wide range of operating conditions.
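The straight-line property of the DET curve can be checked numerically. The sketch below assumes true-speaker and impostor scores that are Gaussian with equal variance (the means 2.0 and 0.0 and the unit variance are arbitrary) and maps the two error rates onto the normal deviate scale with the inverse standard normal distribution function.

```python
from statistics import NormalDist

std_normal = NormalDist()                  # used for the normal deviate scale
target = NormalDist(mu=2.0, sigma=1.0)     # assumed true-speaker scores
impostor = NormalDist(mu=0.0, sigma=1.0)   # assumed impostor scores


def det_point(threshold):
    """Return (x, y) = deviates of (false acceptance, false rejection)."""
    frr = target.cdf(threshold)            # true speakers below threshold
    far = 1.0 - impostor.cdf(threshold)    # impostors at or above threshold
    return std_normal.inv_cdf(far), std_normal.inv_cdf(frr)


points = [det_point(th) for th in (0.0, 0.5, 1.0, 1.5, 2.0)]
slopes = [(y2 - y1) / (x2 - x1)
          for (x1, y1), (x2, y2) in zip(points, points[1:])]
print(slopes)   # each segment has slope close to -1
```

With unequal variances the curve remains a straight line on this scale, but its slope is no longer −1.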

The equal-error rate (EER) is a commonly accepted overall measure of system performance. It corresponds to the threshold at which the false acceptance rate is equal to the false rejection rate.
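A simple way to estimate the EER from a finite set of scores is to sweep the threshold over the observed score values and take the point where the two error rates are closest. This is only a sketch: the scores are invented, and averaging the two rates at the closest point is one of several common conventions.

```python
def equal_error_rate(customer_scores, impostor_scores):
    """Return (eer, threshold) where false acceptance ~= false rejection."""
    best = None
    for thr in sorted(set(customer_scores) | set(impostor_scores)):
        frr = sum(s < thr for s in customer_scores) / len(customer_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Hypothetical match scores (higher means a better match).
customers = [2.1, 1.7, 2.5, 0.9, 1.8]
impostors = [0.2, 1.0, -0.5, 0.7, 1.9]
eer, eer_threshold = equal_error_rate(customers, impostors)
```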

Text-Dependent, Text-Independent and Text-Prompted Methods

Speaker recognition methods can also be divided into text-dependent (fixed passwords) and text-independent (no specified passwords) methods. The former require the speaker to provide utterances of key words or sentences, the same text being used for both training and recognition, whereas the latter do not rely on a specific text being spoken. The text-dependent methods are usually based on template/model-sequence-matching techniques in which the time axes of an input speech sample and reference templates or reference models of the registered speakers are aligned, and the similarities between them are accumulated from the beginning to the end of the utterance. Since this method can directly exploit voice individuality associated with each phoneme or syllable, it generally achieves higher recognition performance than the text-independent method.

There are several applications, such as forensics and surveillance applications, in which predetermined key words cannot be used. Moreover, human beings can recognize speakers irrespective of the content of the utterance. Therefore, text-independent methods have attracted more attention. Another advantage of text-independent recognition is that it can be done sequentially, until a desired significance level is reached, without the annoyance of the speaker having to repeat key words again and again.

Both text-dependent and text-independent methods have a serious weakness: these security systems can easily be circumvented, because someone can play back the recorded voice of a registered speaker uttering key words or sentences into the microphone and be accepted as the registered speaker. Another problem is that people often dislike text-dependent systems because they do not want to utter their identification number, such as their social security number, within the hearing of other people. To cope with these problems, some methods use a small set of words, such as digits, as key words, and each user is prompted to utter a given sequence of key words which is randomly chosen every time the system is used. Yet even this method is not reliable enough, since it can be circumvented with advanced electronic recording equipment that can reproduce key words in a requested order. Therefore, a text-prompted speaker recognition method has been proposed in which password sentences are completely changed every time.

Text-Dependent Speaker Recognition Methods

Text-dependent speaker recognition methods can be classified into DTW (dynamic time warping) or HMM (hidden Markov model) based methods.

DTW-Based Methods

In this approach, each utterance is represented by a sequence of feature vectors, generally, short-term spectral feature vectors, and the trial-to-trial timing variation of utterances of the same text is normalized by aligning the analyzed feature vector sequence of a test utterance to the template feature vector sequence using a DTW algorithm. The overall distance between the test utterance and the template is used for the recognition decision. When multiple templates are used to represent spectral variation, distances between the test utterance and the templates are averaged and then used to make the decision. The DTW approach has trouble modeling the statistical variation in spectral features.
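The alignment step can be sketched with a basic dynamic-programming recursion. The Euclidean local distance, the symmetric step pattern, and the normalization by the summed sequence lengths are assumptions made for this illustration; practical systems typically add slope constraints and endpoint relaxation.

```python
import math


def dtw_distance(test, template):
    """Length-normalized DTW distance between two feature-vector sequences."""
    n, m = len(test), len(template)
    inf = float("inf")
    # acc[i][j]: minimum accumulated cost aligning test[:i] with template[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(test[i - 1], template[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j],      # test frame repeated
                                   acc[i][j - 1],      # template frame repeated
                                   acc[i - 1][j - 1])  # both advance
    return acc[n][m] / (n + m)


def average_distance(test, templates):
    """Average the DTW distance over multiple templates of one speaker."""
    return sum(dtw_distance(test, t) for t in templates) / len(templates)


template = [[0.0], [1.0], [2.0]]
slower = [[0.0], [0.0], [1.0], [2.0]]     # same pattern, different timing
different = [[2.0], [2.0], [0.0]]
```

Here `slower` aligns to `template` at zero cost despite its different length, whereas `different` does not.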

HMM-Based Methods

An HMM can efficiently model the statistical variation in spectral features. Therefore, HMM-based methods have achieved significantly better recognition accuracies than DTW-based methods.

Text-Independent Speaker Recognition Methods

In text-independent speaker recognition, generally the words or sentences used in recognition trials cannot be predicted. Since it is impossible to model or match speech events at the word or sentence level, the following four kinds of methods have been investigated.

Long-Term-Statistics-Based Methods

Long-term sample statistics of various spectral features, such as the mean and variance of spectral features over a series of utterances, have been used. Long-term spectral averages are extreme condensations of the spectral characteristics of a speaker's utterances and, as such, lack the discriminating power of the sequences of short-term spectral features used as models in text-dependent methods.

VQ-Based Methods

A set of short-term training feature vectors of a speaker can be used directly to represent the essential characteristics of that speaker. However, such a direct representation is impractical when the number of training vectors is large, since the memory and amount of computation required become prohibitively large. Therefore, attempts have been made to find efficient ways of compressing the training data using vector quantization (VQ) techniques.

In this method, VQ codebooks, consisting of a small number of representative feature vectors, are used as an efficient means of characterizing speaker-specific features. In the recognition stage, an input utterance is vector-quantized by using the codebook of each reference speaker; the VQ distortion accumulated over the entire input utterance is used for making the recognition determination.
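The recognition stage can be sketched as follows, assuming the codebooks have already been trained (for example by a clustering algorithm, not shown). The squared Euclidean distortion measure, the division by the number of frames, and the toy codebooks are illustrative assumptions.

```python
def vq_distortion(frames, codebook):
    """Average distortion of an utterance quantized with one codebook."""
    total = 0.0
    for x in frames:
        # Each frame is mapped to its nearest code vector.
        total += min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook)
    return total / len(frames)


def identify(frames, codebooks):
    """Return the reference speaker whose codebook gives least distortion."""
    return min(codebooks, key=lambda spk: vq_distortion(frames, codebooks[spk]))


# Hypothetical two-entry codebooks for two registered speakers.
codebooks = {
    "A": [[0.0, 0.0], [1.0, 1.0]],
    "B": [[5.0, 5.0], [6.0, 6.0]],
}
utterance = [[0.1, 0.0], [0.9, 1.1], [0.2, -0.1]]
best_speaker = identify(utterance, codebooks)
```

For verification, the distortion for the claimed speaker's codebook would be compared with a threshold instead of being minimized over speakers.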

In contrast with the memoryless (frame-by-frame) VQ-based method, non-memoryless source coding algorithms have also been studied using a segment (matrix) quantization technique. The advantage of a segment quantization codebook over a VQ codebook representation is its characterization of the sequential nature of speech events. A segment modeling procedure for constructing a set of representative time normalized segments called “filler templates” has been proposed. The procedure, a combination of K-means clustering and dynamic programming time alignment, provides a means for handling temporal variation.

Ergodic-HMM-Based Methods

The basic structure is the same as the VQ-based method, but in this method an ergodic HMM is used instead of a VQ codebook. Over a long timescale, the temporal variation in speech signal parameters is represented by stochastic Markovian transitions between states. This method uses a multiple-state ergodic HMM (i.e., all possible transitions between states are allowed) to classify speech segments into one of the broad phonetic categories corresponding to the HMM states. The automatically obtained categories are often characterized as strong voicing, silence, nasal/liquid, stop burst/post silence, frication, etc.

The VQ-based method has been compared with the discrete/continuous ergodic HMM-based method, particularly from the viewpoint of robustness against utterance variations. It was found that the continuous ergodic HMM method is far superior to the discrete ergodic HMM method and that the continuous ergodic HMM method is as robust as the VQ-based method when enough training data is available. However, when little data is available, the VQ-based method is more robust than the continuous HMM method. Speaker identification rates using the continuous HMM were investigated as a function of the number of states and mixtures. It was shown that the speaker recognition rates were strongly correlated with the total number of mixtures, irrespective of the number of states. This means that using information on transitions between different states is ineffective for text-independent speaker recognition.

A technique based on maximum likelihood estimation of a Gaussian mixture model (GMM) representation of speaker identity is one of the most popular methods. This method corresponds to the single-state continuous ergodic HMM. Gaussian mixtures are noted for their robustness as a parametric model and for their ability to form smooth estimates of rather arbitrary underlying densities.
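Scoring an utterance against a speaker GMM can be sketched as below. The sketch assumes diagonal covariance matrices and model parameters that have already been estimated; the maximum likelihood training itself (typically done with the EM algorithm) is omitted, and the toy parameters are invented.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        component_logs = []
        for w, mu, var in zip(weights, means, variances):
            lp = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                lp += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
            component_logs.append(lp)
        # Log-sum-exp over mixture components for numerical stability.
        peak = max(component_logs)
        total += peak + math.log(sum(math.exp(c - peak) for c in component_logs))
    return total / len(frames)


# Hypothetical two-component, one-dimensional speaker models.
speaker_a = dict(weights=[0.5, 0.5], means=[[0.0], [1.0]], variances=[[1.0], [1.0]])
speaker_b = dict(weights=[0.5, 0.5], means=[[4.0], [5.0]], variances=[[1.0], [1.0]])
frames = [[0.2], [0.8], [0.5]]
score_a = gmm_log_likelihood(frames, **speaker_a)
score_b = gmm_log_likelihood(frames, **speaker_b)
```

Because frames are scored independently and no state transitions are involved, the model behaves as the single-state case described above.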

The VQ-based method can be regarded as a special (degenerate) case of a single-state HMM with a distortion measure being used as the observation probability.

Speech-Recognition-Based Methods

The VQ- and HMM-based methods can be regarded as methods that use phoneme-class-dependent speaker characteristics contained in short-term spectral features through implicit phoneme-class recognition. In other words, phoneme-classes and speakers are simultaneously recognized in these methods. On the other hand, in the speech-recognition-based methods, phonemes or phoneme-classes are explicitly recognized, and then each phoneme/phoneme-class segment in the input speech is compared with speaker models or templates corresponding to that phoneme/phoneme-class.

A five-state ergodic linear predictive HMM for broad phonetic categorization has been investigated. In this method, after frames that belong to particular phonetic categories have been identified, feature selection is performed. In the training phase, reference templates are generated and verification thresholds are computed for each phonetic category. In the verification phase, after phonetic categorization, a comparison with the reference template for each particular category provides a verification score for that category. The final verification score is a weighted linear combination of the scores for each category. The weights are chosen to reflect the effectiveness of particular categories of phonemes in discriminating between speakers and are adjusted to maximize the verification performance. Experimental results showed that verification accuracy can be considerably improved by this category-dependent weighted linear combination method.

A speaker verification system using 4-digit phrases has also been tested in actual field conditions with a banking application, where input speech was segmented into individual digits using a speaker-independent HMM. The frames within the word boundaries for a digit were compared with the corresponding speaker-specific HMM digit model and the Viterbi likelihood score was computed. This was done for each of the digits making up the input utterance. The verification score was defined to be the average normalized log-likelihood score over all the digits in the utterance.

A large vocabulary speech recognition system has also been used for speaker verification. With this approach a set of speaker-independent phoneme models were adapted to each speaker. Speaker verification consisted of two stages. First, speaker-independent speech recognition was run on each of the test utterances to obtain phoneme segmentation. In the second stage, the segments were scored against the adapted models for a particular target speaker. The scores were normalized by those with speaker-independent models. The system was evaluated using the 1995 NIST-administered speaker verification database, which consists of data taken from the Switchboard corpus. The results showed that this method did not out-perform Gaussian mixture models.

Text-Prompted Speaker Recognition

In this method, key sentences are completely changed every time. The system accepts the input utterance only when it determines that the registered speaker uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance the sentence they will be prompted to say. This method not only accurately recognizes speakers, but can also reject an utterance whose text differs from the prompted text, even if it is uttered by a registered speaker. Thus, a recorded and played back voice can be correctly rejected.

This method uses speaker-specific phoneme models as basic acoustic units. One of the major issues in this method is how to properly create these speaker-specific phoneme models when using training utterances of a limited size. The phoneme models are represented by Gaussian-mixture continuous HMMs or tied-mixture HMMs, and they are made by adapting speaker-independent phoneme models to each speaker's voice.

In the recognition stage, the system concatenates the phoneme models of each registered speaker to create a sentence HMM, according to the prompted text. Then the likelihood of the input speech against the sentence model is calculated and used for speaker verification.

High-level Speaker Recognition

High-level features such as word idiolect, pronunciation, phone usage, prosody, etc. have also been successfully used in text-independent speaker verification. Typically, high-level-feature recognition systems produce a sequence of symbols from the acoustic signal and then perform recognition using the frequency and co-occurrence of symbols. In an idiolect approach, word unigrams and bigrams from manually transcribed conversations are used to characterize a particular speaker in a traditional target/background likelihood ratio framework. The use of support vector machines for performing the speaker verification task based on phone and word sequences obtained using phone recognizers has been proposed. The benefit of these features was demonstrated in the “NIST extended data” task for speaker verification; with enough conversational data, a recognition system can become “familiar” with a speaker and achieve excellent accuracy. The corpus was a combination of phases 2 and 3 of the Switchboard-2 corpora. Each training utterance in the corpus consisted of a conversation side that was nominally of length 5 minutes (approximately 2.5 minutes of speech) recorded over a land-line telephone. Speaker models were trained using 1 to 16 conversation sides. These methods need utterances at least several minutes long, much longer than those used in conventional speaker recognition methods.

Normalization and Adaptation Techniques

How can we normalize intra-speaker variation of likelihood (similarity) values in speaker verification? The most significant factor affecting automatic speaker recognition performance is variation in signal characteristics from trial to trial (inter-session variability, or variability over time). Variations arise from the speaker him/herself, from differences in recording and transmission conditions, and from noise. Speakers cannot repeat an utterance precisely the same way from trial to trial. It is well known that samples of the same utterance recorded in one session are much more highly correlated than tokens recorded in separate sessions. There are also long term trends in voices.

It is important for speaker recognition systems to accommodate these variations. Adaptation of the reference model as well as the verification threshold for each speaker is indispensable to maintaining a high recognition accuracy over a long period. In order to compensate for the variations, two types of normalization techniques have been tried: one in the parameter domain, and the other in the distance/similarity domain. The latter technique uses the likelihood ratio or a posteriori probability. To adapt HMMs for noisy conditions, various techniques, including the HMM composition (PMC: parallel model combination) method, have proved successful.

Parameter-Domain Normalization

As one typical normalization technique in the parameter domain, spectral equalization, the so-called “blind equalization” method, has been confirmed to be effective in reducing linear channel effects and long-term spectral variation. This method is especially effective for text-dependent speaker recognition applications using sufficiently long utterances. In this method, cepstral coefficients are averaged over the duration of an entire utterance, and the averaged values are subtracted from the cepstral coefficients of each frame (CMS; cepstral mean subtraction). This method can compensate fairly well for additive variation in the log spectral domain. However, it unavoidably removes some text-dependent and speaker-specific features, so it is inappropriate for short utterances in speaker recognition applications. It has also been shown that time derivatives of cepstral coefficients (delta-cepstral coefficients) are resistant to linear channel mismatches between training and testing.
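Cepstral mean subtraction itself is short enough to sketch directly. The example assumes per-frame cepstral vectors are available and models a linear channel as a constant offset added to every frame in the cepstral domain; the numerical values are invented.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the utterance-level mean from every frame's coefficients."""
    n_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
channel_offset = [0.5, -1.0]    # hypothetical constant channel effect
distorted = [[c + o for c, o in zip(f, channel_offset)] for f in clean]

normalized_clean = cepstral_mean_subtraction(clean)
normalized_distorted = cepstral_mean_subtraction(distorted)
```

Both normalized sequences coincide, which shows how the constant offset is removed; the speaker's own long-term average is removed along with it, which is the loss of speaker-specific information noted above.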

Likelihood Normalization

A normalization method for likelihood (similarity or distance) values that uses a likelihood ratio has been proposed. The likelihood ratio is the ratio of the conditional probability of the observed measurements of the utterance given that the claimed identity is correct, to the conditional probability of the observed measurements given that the speaker is an impostor (normalization term). Generally, a positive log-likelihood ratio indicates a valid claim, whereas a negative value indicates an impostor. The likelihood ratio normalization approximates optimal scoring in Bayes’ sense.

This normalization method is, however, unrealistic because conditional probabilities must be calculated for all the reference speakers, which requires large computational cost. Therefore, a set of speakers, “cohort speakers”, who are representative of the population distribution near the claimed speaker has been chosen for calculating the normalization term. Another way of choosing the cohort speaker set is to use speakers who are typical of the general population. It was reported that a randomly selected, gender-balanced background speaker population outperformed a population near the claimed speaker.

A normalization method based on a posteriori probability has also been proposed. The difference between the normalization method based on the likelihood ratio and that based on a posteriori probability is whether or not the claimed speaker is included in the impostor speaker set for normalization; the cohort speaker set in the likelihood-ratio-based method does not include the claimed speaker, whereas the normalization term for the a posteriori-probability-based method is calculated by using a set of speakers including the claimed speaker. Experimental results indicate that both normalization methods almost equally improve speaker separability and reduce the need for speaker-dependent or text-dependent thresholding, compared with scoring using only the model of the claimed speaker.

A method in which the normalization term is approximated by the likelihood for a world model representing the population in general has also been proposed. This method has an advantage in that the computational cost for calculating the normalization term is much smaller than the original method since it does not need to sum the likelihood values for cohort speakers. A method based on tied-mixture HMMs in which the world model is made as a pooled mixture model representing the parameter distribution for all the registered speakers has been proposed. The use of a single background model for calculating the normalization term has become the predominant approach used in speaker verification systems.
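The three normalization terms discussed in this section can be compared in a small sketch operating on log-likelihoods that are assumed to have been computed elsewhere. Averaging the cohort likelihoods and assuming equal prior probabilities in the a posteriori variant are illustrative choices; published systems differ in these details.

```python
import math


def log_sum_exp(log_values):
    peak = max(log_values)
    return peak + math.log(sum(math.exp(v - peak) for v in log_values))


def cohort_normalized_score(claimed_ll, cohort_lls):
    """Log-likelihood ratio; the cohort excludes the claimed speaker."""
    return claimed_ll - (log_sum_exp(cohort_lls) - math.log(len(cohort_lls)))


def posterior_normalized_score(claimed_ll, cohort_lls):
    """Log a posteriori probability; the claimed speaker is in the set."""
    return claimed_ll - log_sum_exp(cohort_lls + [claimed_ll])


def world_normalized_score(claimed_ll, world_ll):
    """Single world (background) model in place of the cohort sum."""
    return claimed_ll - world_ll


# Hypothetical utterance log-likelihoods.
claimed = -10.0
cohort = [-12.0, -12.0]
world = -11.5
accept = cohort_normalized_score(claimed, cohort) > 0.0
```

The world-model variant needs one extra likelihood evaluation per trial regardless of how many speakers are registered.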

Since these normalization methods neglect absolute deviation between the claimed speaker's model and the input speech, they cannot differentiate highly dissimilar speakers. It has been reported that a multilayer network decision algorithm makes effective use of the relative and absolute scores obtained from the matching algorithm.

A family of normalization techniques has been proposed, in which the scores are normalized by subtracting the mean and then dividing by standard deviation, both terms having been estimated from the (pseudo) impostor score distribution. Different possibilities are available for computing the impostor score distribution: Znorm, Hnorm, Tnorm, Htnorm, Cnorm and Dnorm (Bimbot et al., 2004). The state-of-the-art text-independent speaker verification techniques associate one or more parameterization level normalization approaches (CMS, feature variance normalization, feature warping, etc.) with world model normalization and one or more score normalizations.
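The common form of these score normalizations can be sketched in a few lines. The variants differ mainly in where the impostor scores come from: in Znorm they are obtained by scoring the target model against impostor utterances, and in Tnorm by scoring the test utterance against a set of impostor models. The scores below are invented, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def normalize_score(raw_score, impostor_scores):
    """Subtract the impostor mean and divide by the impostor std deviation."""
    return (raw_score - mean(impostor_scores)) / stdev(impostor_scores)


# Znorm-style use: impostor utterances scored against the target model.
model_vs_impostor_utterances = [0.0, 1.0, 2.0]
z_score = normalize_score(3.0, model_vs_impostor_utterances)

# Tnorm-style use: the test utterance scored against impostor models.
test_vs_impostor_models = [0.5, 1.5, 1.0, 1.0]
t_score = normalize_score(3.0, test_vs_impostor_models)
```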

Updating Models and A Priori Threshold for Speaker Verification

How to update speaker models to cope with the gradual changes in people’s voices is an important issue. Since we cannot ask every user to utter many utterances across many different sessions in real situations, it is necessary to build each speaker model based on a small amount of data collected in a few sessions, and then the model must be updated using speech data collected when the system is used.

How to set the a priori decision threshold for speaker verification is another important issue. In most laboratory speaker recognition experiments, the threshold is set a posteriori to the system’s equal error rate (EER). Since the threshold cannot be set a posteriori in real situations, we have to have practical ways to set the threshold before verification. It must be set according to the relative importance of the two errors, which depends on the application.

These two problems are intrinsically related to each other. Methods for updating reference templates and the threshold in DTW-based speaker verification were proposed. An optimum threshold was estimated based on the distribution of overall distances between each speaker’s reference template and a set of utterances of other speakers (interspeaker distances). The interspeaker distance distribution was approximated by a normal distribution, and the threshold was calculated by the linear combination of its mean value and standard deviation. The intraspeaker distance distribution was not taken into account in the calculation, mainly because it is difficult to obtain stable estimates of the intraspeaker distance distribution from small numbers of training utterances. The reference template for each speaker was updated by averaging new utterances and the present template after time registration. These methods have been extended and applied to text-independent and text-prompted speaker verification using HMMs.
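A rough sketch of the two steps follows. The combination weights `w_mean` and `w_std`, the update weight, and the distance values are hypothetical placeholders, since the article does not give the actual coefficients; the utterance passed to the update function is assumed to have already been time-aligned to the template.

```python
from statistics import mean, pstdev


def a_priori_threshold(interspeaker_distances, w_mean=1.0, w_std=-1.0):
    """Linear combination of the mean and std of interspeaker distances.

    Scores are distances, so a claim is accepted when its distance falls
    below the threshold; the weights here are illustrative only.
    """
    return (w_mean * mean(interspeaker_distances)
            + w_std * pstdev(interspeaker_distances))


def update_template(template, aligned_utterance, weight=0.5):
    """Average the present template with a new, time-aligned utterance."""
    return [[(1.0 - weight) * t + weight * u for t, u in zip(tf, uf)]
            for tf, uf in zip(template, aligned_utterance)]


threshold = a_priori_threshold([4.0, 6.0])
new_template = update_template([[0.0, 2.0]], [[2.0, 4.0]])
```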

Model-based Compensation Techniques

Various model-based compensation techniques for “mismatch factors” including channel, additive noise, linguistic content and intra-speaker variation have recently been proposed (e.g., Fauve et al., 2007; Yin et al., 2007). Key developments include support vector machines (SVMs), associated nuisance attribute projection compensation (NAP) and factor analysis (FA). They have been shown to provide significant improvements in GMM-based text-independent speaker verification. These approaches involve estimating the variability from a large database in which each speaker is recorded across multiple sessions. The underlying hypothesis is that a low-dimensional “session variability” subspace exists with only limited overlap on speaker-specific information.

The goal of NAP is to project out a subspace from the original expanded space, where information has been affected by nuisance effects. This is performed by learning on a background set of recordings, without explicit labeling, from many different speakers’ recordings. The most straightforward approach is to use the difference between a given session and the mean across sessions for each speaker. This information is pooled across speakers to form a combined matrix. An eigen problem is solved on the corresponding covariance matrix to find the dimensions of high variability for the pooled set. The resulting vectors are used in an SVM framework. FA shares characteristics with NAP, and it operates on generative models with traditional statistical approaches, such as EM, to model intersession variability.
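The NAP procedure can be illustrated with a deliberately small sketch that removes only a single nuisance direction. Instead of a full eigen decomposition, it uses power iteration to find the dominant eigenvector of the pooled within-speaker scatter; this substitution, the two-dimensional toy vectors, and the fixed iteration count are assumptions made to keep the example self-contained.

```python
import math


def nuisance_direction(sessions_by_speaker, iterations=200):
    """Dominant direction of within-speaker (session) variability."""
    diffs = []
    for sessions in sessions_by_speaker:
        dim = len(sessions[0])
        mu = [sum(s[d] for s in sessions) / len(sessions) for d in range(dim)]
        # Difference between each session and the speaker's session mean.
        diffs.extend([[s[d] - mu[d] for d in range(dim)] for s in sessions])
    dim = len(diffs[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Multiply u by the pooled scatter matrix: sum_i d_i (d_i . u).
        v = [0.0] * dim
        for d in diffs:
            proj = sum(a * b for a, b in zip(d, u))
            for k in range(dim):
                v[k] += d[k] * proj
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    return u


def remove_nuisance(x, u):
    """Project x onto the complement of the unit nuisance direction u."""
    proj = sum(a * b for a, b in zip(x, u))
    return [a - proj * b for a, b in zip(x, u)]


# Toy data: sessions of each speaker differ only along the first axis.
background = [
    [[1.0, 5.0], [3.0, 5.0]],      # speaker 1, two sessions
    [[-2.0, -4.0], [0.0, -4.0]],   # speaker 2, two sessions
]
u = nuisance_direction(background)
session_1 = remove_nuisance([1.0, 5.0], u)
session_2 = remove_nuisance([3.0, 5.0], u)
```

After projection the two sessions of speaker 1 coincide while the speaker-distinguishing second coordinate is preserved. A practical system would remove several eigenvectors and operate on much higher-dimensional vectors.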


References

  • Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., Merlin, T., Ortega-Garcia, J., Petrovska-Delacretaz, D. and Reynolds, D. A. (2004) “A Tutorial on Text-Independent Speaker Verification,” EURASIP Journ. on Applied Signal Processing, pp. 430-451.
  • Fauve, B. G. B., Matrouf, D., Scheffer, N. and Bonastre, J.-F. (2007) “State-of-the-Art Performance in Text-Independent Speaker Verification through Open-Source Software,” IEEE Trans. on Audio, Speech, and Language Process., 15, 7, pp. 1960-1968.
  • Furui, S. (1991) “Speaker-Independent and Speaker-Adaptive Recognition Techniques,” in Furui, S. and Sondhi, M. M. (Eds.) Advances in Speech Signal Processing, New York: Marcel Dekker, pp. 597-622.
  • Furui, S. (1997) “Recent Advances in Speaker Recognition,” Proc. First Int. Conf. Audio- and Video-based Biometric Person Authentication, Crans-Montana, Switzerland, pp. 237-252.
  • Furui, S. (2000) Digital Speech Processing, Synthesis, and Recognition, 2nd Edition, New York: Marcel Dekker.
  • Yin, S.-C., Rose, R. and Kenny, P. (2007) “A Joint Factor Analysis Approach to Progressive Model Adaptation in Text-Independent Speaker Verification,” IEEE Trans. on Audio, Speech, and Language Process., 15, 7, pp. 1999-2010.

See Also

Auditory Scene Analysis, Biometric Authentication, Pattern Recognition, Speaker Variability Analysis, Speech Recognition
