The present invention relates generally to speaker recognition systems and, more particularly, to methods and apparatus for performing pattern-specific maximum likelihood transformations in accordance with such speaker recognition systems.
The goal of speaker recognition is to identify the properties of speech data, i.e., a sequence of data samples recorded over some transducer, that are, ideally, uniquely modified by different speakers. Speaker recognition may be "text-dependent," where recognition depends on the accuracy of the actual words spoken, or "text-independent," where recognition does not depend on such accuracy. Typically, in text-independent speaker recognition, training data may be limited, as in applications requiring on-line enrollment. For this case, diagonal Gaussian mixture models on R^n (n-dimensional cepstral coefficients) form appropriate data models. Given test vectors, the measure of match between a model and the data is given by the likelihood of the data with respect to the model. Efficiency is also gained by requiring the Gaussians to have diagonal covariances, which saves model storage space and allows efficient calculation of likelihood-based discriminant functions. Effectively, this is a method of model selection based on the covariance structure in localized regions of the data space.
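By way of illustration only, the likelihood-based measure of match described above may be sketched as follows. The function names and the list-based data layout are assumptions made for this example and do not appear in the original disclosure; the sketch simply evaluates the average per-frame log-likelihood of test vectors under a Gaussian mixture model with diagonal covariances.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one vector under a diagonal GMM.

    Uses the log-sum-exp trick so that very small component
    likelihoods do not underflow.
    """
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood, used here as the match score."""
    total = sum(gmm_log_likelihood(x, weights, means, variances) for x in frames)
    return total / len(frames)
```

Because each covariance is diagonal, only n variances per component are stored and each component is evaluated in O(n) operations, which is the storage and computation saving noted above.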
However, when modeling data with diagonal Gaussians, it can be shown that the unmodified cepstral feature space is sub-optimal, from the point of view of classification, in comparison to one which can be obtained by an invertible linear transformation (see, e.g., R. A. Gopinath, "Maximum Likelihood Modeling with Gaussian Distributions For Classification," Proc. ICASSP'98, the disclosure of which is incorporated by reference herein). Techniques of feature and model space adaptation have been developed that allow information to be gained indirectly about covariance structure in localized regions of the data space, see, e.g., the above-referenced R. A. Gopinath article; and M. J. F. Gales, "Semi-Tied Covariance Matrices," Proc. ICASSP'98, the disclosure of which is incorporated by reference herein. The adaptation involves finding transformations of feature space regions that allow efficient and accurate modeling of data. Moreover, related (and even unrelated) regions can be tied together during the adaptation to overcome the lack of training data to some extent. Related techniques, such as MLLR (Maximum Likelihood Linear Regression), have been used for speaker and environment adaptive training of HMMs (Hidden Markov Models) in LVCSR (Large Vocabulary Continuous Speech Recognition) systems, see, e.g., L. Polymenakos, P. Olsen, D. Kanevsky, R. A. Gopinath, P. S. Gopalakrishnan and S. Chen, "Transcription Of Broadcast News - Some Recent Improvements To IBM's LVCSR System," Proc. ICASSP'98, the disclosure of which is incorporated by reference herein.
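For illustration, the effect of an invertible linear transformation on a diagonal Gaussian likelihood may be written as follows. The notation (A, x, mu, sigma, T) is introduced here for the example and is not taken from the original disclosure; the expression is the standard change-of-variables form, given as a sketch for a single Gaussian rather than as the exact objective used by the cited references.

```latex
\begin{align}
  y &= A x, \qquad A \in \mathbb{R}^{n \times n},\ \det A \neq 0 \\
  \log p(x \mid A, \mu, \sigma^2)
    &= \log \lvert \det A \rvert
       - \frac{1}{2} \sum_{i=1}^{n}
         \left[ \log\!\left( 2\pi\sigma_i^2 \right)
              + \frac{\left( (A x)_i - \mu_i \right)^2}{\sigma_i^2} \right] \\
  \hat{A} &= \arg\max_{A} \sum_{t=1}^{T} \log p(x_t \mid A, \mu, \sigma^2)
\end{align}
```

The term log|det A| accounts for the change of volume under the transformation; without it, the likelihood could be inflated arbitrarily by shrinking the transformed features toward the mean.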
It would be highly desirable to provide a classification technique which exploits this indirect information about covariance structure and the fact that such information can be estimated with a small amount of data.
The present invention provides such a classification technique, which exploits the indirect information about covariance structure, mentioned above, and the fact that such information can be estimated with a small amount of data. In accordance with pattern-specific maximum likelihood transformation (PSMLT) methodologies of the invention, speaker models in the recognition system database are characterized either solely by the information evident in a feature space transformation or by this transformation in conjunction with a Gaussian mixture model. Discriminant functions are provided for both cases.
More specifically, the present invention provides acoustic feature transformations to model the voice print of speakers in either a text-dependent or text-independent mode. Each transformation maximizes the likelihood of the speaker training data with respect to the resulting voice-print model in the new feature space. Speakers are recognized (i.e., identified, verified or classified) by appropriate comparison of the likelihood of the testing data in each transformed feature space and/or by directly comparing transformation matrices obtained during enrollment and testing. The technique's effectiveness is illustrated in both verification and identification tasks for the telephony environment.
It is to be appreciated that, although presented in the particular case of speaker recognition, the principle of pattern-specific maximum likelihood transformations can be extended to a large number of pattern matching problems and, in particular, to other biometrics besides speech, e.g., face recognition and fingerprints.
Thus, in one aspect of the invention, a method for use in recognizing a provider of an input pattern-based signal may comprise the following steps. First, one or more feature vectors are extracted from an input pattern-based signal of an enrolling provider, the input pattern-based signal representing training data provided by the provider. A background model is then adapted using the one or more training feature vectors. Next, a training data transformation is generated using the adapted model and the one or more training feature vectors, the training data transformation and the adapted model comprising a pattern-specific representation of the enrolling provider of the input pattern-based signal. The steps of performing the training feature vector extraction, background model adaptation and training data transformation generation are preferably performed on one or more other enrolling providers of input pattern-based signals.
After such a training mode, recognition may be performed in the following manner. First, one or more feature vectors are extracted from an input pattern-based signal provided by a real-time provider, the input pattern-based signal representing test data provided by the provider. Scores are then computed for the one or more test feature vectors based on the respective training data transformations and adapted background models of at least a portion of the enrolled providers. Lastly, a recognition result is obtained based on the computed scores.
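By way of example only, the recognition mode just described may be sketched as follows. This is a minimal illustration under stated assumptions, not the discriminant function of the invention: each enrolled provider is assumed to be stored as a pair (transformation matrix, diagonal Gaussian mixture model), the score is assumed to be the average per-frame log-likelihood of the transformed test vectors including the log-determinant term, and the names `score`, `identify` and `enrolled` are hypothetical.

```python
import math


def mat_vec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def log_abs_det(a):
    """log|det a| via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    n = len(m)
    total = 0.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[pivot][k] == 0.0:
            raise ValueError("transformation must be invertible")
        m[k], m[pivot] = m[pivot], m[k]
        total += math.log(abs(m[k][k]))
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n):
                m[r][c] -= factor * m[k][c]
    return total


def gmm_log_likelihood(y, gmm):
    """Log-likelihood of y under a diagonal GMM given as (weight, mean, var) triples."""
    terms = []
    for w, mean, var in gmm:
        quad = sum(
            math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v
            for yi, m, v in zip(y, mean, var)
        )
        terms.append(math.log(w) - 0.5 * quad)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def score(frames, transform, gmm):
    """Average log-likelihood of the test frames in the provider's transformed space."""
    jacobian = log_abs_det(transform)  # assumed part of the score; see lead-in
    total = sum(
        jacobian + gmm_log_likelihood(mat_vec(transform, x), gmm) for x in frames
    )
    return total / len(frames)


def identify(frames, enrolled):
    """Return the best-scoring enrolled provider and all scores."""
    scores = {name: score(frames, a, gmm) for name, (a, gmm) in enrolled.items()}
    return max(scores, key=scores.get), scores
```

For verification rather than identification, the same score for the claimed provider could instead be compared against a threshold or against the scores of competing models; the choice of decision rule is left open in this sketch.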
In another aspect of the invention, a method for use in recognizing a provider of an input pattern-based signal may comprise the following steps. First, one or more feature vectors are extracted from an input pattern-based signal of an enrolling provider, the input pattern-based signal representing training data provided by the provider. A training data transformation is then generated using an unadapted model and the one or more training feature vectors, the training data transformation comprising a pattern-specific representation of the enrolling provider of the input pattern-based signal. The steps of performing the training feature vector extraction and training data transformation generation are preferably performed on one or more other enrolling providers of input pattern-based signals.
After such a training mode, recognition may be performed in the following manner. First, one or more feature vectors are extracted from an input pattern-based signal provided by a real-time provider, the input pattern-based signal representing test data provided by the provider. A test data transformation is then generated using an unadapted model and the one or more test feature vectors, the test data transformation comprising a pattern-specific representation of the real-time provider of the input pattern-based signal. Next, scores are computed for the one or more test feature vectors by comparing the test data transformation with the respective training data transformations of at least a portion of the enrolled providers. Lastly, a recognition result is obtained based on the computed scores.
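As a purely illustrative sketch of the comparison step, the test data transformation may be matched against the enrolled transformations using a matrix distance. The passage above does not specify the comparison measure; the Frobenius distance used here is a placeholder assumption chosen for simplicity, and the function names are hypothetical.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def closest_enrolled(test_transform, enrolled_transforms):
    """Score each enrolled provider by negative distance (higher is better).

    The distance measure is an assumption made for this example only.
    """
    scores = {
        name: -frobenius_distance(test_transform, t)
        for name, t in enrolled_transforms.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Any other measure of similarity between transformations could be substituted for `frobenius_distance` without changing the surrounding logic.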