Speaker verification systems recognize an individual by verifying a claim of identity provided by the individual through an analysis of spoken utterances. In the context of a telecommunications system, for example, speaker verification may be employed to verify the identity of a caller who is charging a call to a credit or calling card. Generally, these systems operate by comparing extracted features of an utterance received from an individual who claims a certain identity to one or more prototypes of speech based on (i.e., "trained" by) utterances which have been previously provided by the identified person.
One significant problem which is frequently encountered in speaker verification systems in the telecommunications context is that a person who has trained a verification system does not always "sound the same" when undertaking a verification trial. Changes in a person's "sound" over time may be caused by, for example, changes in the characteristics of the telecommunications channel carrying the person's voice signals. These changes may be caused by no more than the use of different telephones for the training process and the verification trial. Naturally (and unfortunately), such changes often substantially degrade verification system performance. In fact, because of sensitivity to changing channel characteristics, or even to a speaker's loudness level, verification system performance may degrade to unacceptable levels.
More specifically, speaker recognition systems typically create a speaker-dependent hidden Markov model (HMM) for a given individual whose identity is to be verified, by performing training based on data often collected in a single enrollment session. The HMM, therefore, matches the probability density function ("pdf") of the training data perfectly. In a subsequent verification session, however, test data may be collected through a different telephone channel and handset. (Data collected during a training process will be referred to herein as "training data" or "training speech data," whereas data obtained during a verification session will be referred to herein as "test data" or "test speech data." In addition, the terms "training information" or "training speech information" will be used to denote information based on the training data, such as, for example, models.) Since the acoustic conditions may be different between the enrollment session and the verification session, a stochastic mismatch may occur between the set of test data and the set of data which was used to train the HMM. Speaker recognition performance is degraded by such a mismatch.
Mathematically, the above-described mismatch can be represented as a linear transform in the cepstral domain:

$$y = Ax + b, \tag{1}$$
where x is a vector of the cepstral frame of a test utterance; A and b are the matrix and vector which, if properly estimated for the given test utterance, can be applied as shown to eliminate the mismatch; and y is the resultant transformed vector which matches the training data. (See, e.g., R. J. Mammone et al., "Robust Speaker Recognition," IEEE Signal Processing Magazine, vol. 13, pp. 58-71, September 1996.) Geometrically, b represents a translation of the test data and A represents both a scale and a rotation thereof. (Note that when A is diagonal, it represents only a scaling operation.)
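By way of illustration only, the transform of Equation (1) can be sketched in a few lines of Python. The function names, the two-dimensional frames, and the values of A and b below are hypothetical and chosen for easy arithmetic; they are not taken from any cited reference, and actual cepstral frames would have a higher dimension.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine_transform(A: Matrix, b: Vector, x: Vector) -> Vector:
    """Return y = A x + b for a single cepstral frame x."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(A, b)
    ]


def transform_utterance(A: Matrix, b: Vector, frames: List[Vector]) -> List[Vector]:
    """Apply the same A and b to every frame of a test utterance."""
    return [affine_transform(A, b, x) for x in frames]


# Hypothetical 2-dimensional "cepstral" frames (illustrative values only).
frames = [[1.0, 2.0], [3.0, 4.0]]

# Diagonal A: each coefficient is scaled independently, then translated by b.
A_diag = [[2.0, 0.0], [0.0, 0.5]]
b = [1.0, -1.0]
transformed = transform_utterance(A_diag, b, frames)

# Non-diagonal A: a 90-degree rotation, with no translation.
A_rot = [[0.0, -1.0], [1.0, 0.0]]
rotated = affine_transform(A_rot, [0.0, 0.0], [1.0, 0.0])
```

With the diagonal matrix, each output coefficient depends only on the corresponding input coefficient, which is the scaling-only case noted above; the rotation matrix mixes coefficients across dimensions.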
Prior art speaker verification systems have been limited in their ability to handle stochastic mismatch. For example, cepstral mean subtraction has often been used for handling stochastic mismatch in both speaker and speech recognition applications. Viewed with respect to Equation (1), this technique essentially estimates b and assumes A to be an identity matrix. For example, in A. E. Rosenberg et al., "Cepstral Channel Normalization Techniques for HMM-based Speaker Verification," Proc. of Int. Conf. on Spoken Language Processing, pp. 1835-1838, 1994, the vector b was estimated by long term averaging, short term averaging, and a maximum likelihood (ML) approach. In A. Sankar et al., "A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition," IEEE Trans. on Speech and Audio Processing, vol. 4, pp. 190-202, May 1996, maximum likelihood approaches were used to estimate b, a diagonal A, and model parameters for HMMs for purposes of stochastic matching. Recently, a least-squares solution of the linear transform parameters (i.e., A and b) was briefly introduced in R. J. Mammone et al., "Robust Speaker Recognition," cited above. However, none of the prior art approaches to the stochastic mismatch problem has provided an efficient technique to adequately match the overall distribution of the test data with that of the training data based on a generalized linear transform.
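For illustration, cepstral mean subtraction as characterized above (A fixed to the identity matrix, only b estimated) can be sketched as follows. This is a minimal sketch that assumes the long term averaging variant, with b taken as the negative of the mean over all frames of the test utterance; the function names and frame values are hypothetical, and the sketch is not the specific procedure of any cited reference.

```python
from typing import List

Vector = List[float]


def cepstral_mean(frames: List[Vector]) -> Vector:
    """Long-term average of each cepstral coefficient over the utterance."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def cepstral_mean_subtraction(frames: List[Vector]) -> List[Vector]:
    """Apply y = x + b with A = I and b = -(utterance mean)."""
    b = [-m for m in cepstral_mean(frames)]
    return [[x_d + b_d for x_d, b_d in zip(x, b)] for x in frames]


# Hypothetical 2-dimensional test frames (illustrative values only).
test_frames = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_subtraction(test_frames)
```

Because only a translation is applied, the normalized frames have zero mean but retain the spread and orientation of the original test data, which is why this approach cannot by itself compensate for a mismatch in scale or rotation.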