The present invention relates generally to pattern recognition, which includes automated speech and speaker recognition. In particular, it relates to a computer-implemented data processing method for measuring distance between collections of audio feature distributions or finite mixture models.
In automated speech recognition, input speech is analyzed in small time frames and the audio content of each time frame is characterized by what is known as a feature vector. A feature vector is essentially a set of N audio features associated with that frame. Such audio features are typically the different spectral or cepstral parameters corresponding to the audio of that frame. In an attempt to recognize a spoken word or phoneme, test data consisting of a feature vector or feature vector sequence is compared to models (prototypes) of the sound of known vocabulary words or phonemes. These comparisons are performed using a distance measure, which is a measure of closeness between a pair of elements under consideration. Thus, a given feature vector or feature vector sequence is recognized as the phoneme or word whose prototype lies the shortest distance away.
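The nearest-prototype comparison described above can be illustrated with a minimal sketch. It assumes a Euclidean distance and a single feature vector per prototype; neither assumption comes from the text, which leaves the distance measure and the form of the prototypes open. The function name `nearest_prototype`, the phoneme labels, and the three-dimensional feature values are hypothetical and chosen only for illustration.

```python
import math


def nearest_prototype(feature_vector, prototypes):
    """Return (label, distance) of the prototype closest to feature_vector.

    Assumption: closeness is measured with Euclidean distance; a practical
    recognizer may use a different distance measure.
    """
    best_label, best_dist = None, math.inf
    for label, proto in prototypes.items():
        dist = math.dist(feature_vector, proto)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Hypothetical prototypes: one N=3 feature vector per known phoneme.
prototypes = {
    "aa": [1.0, 0.0, 0.0],
    "iy": [0.0, 1.0, 0.0],
    "uw": [0.0, 0.0, 1.0],
}

# Hypothetical test frame; it is recognized as the phoneme whose
# prototype is the shortest distance away.
frame = [0.1, 0.9, 0.2]
label, dist = nearest_prototype(frame, prototypes)
print(label, round(dist, 4))
```

With these made-up values the frame is assigned to the "iy" prototype, since that prototype is closer than the other two.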
In a typical speech recognition system, a different speaker model is developed for each speaker using the system. Prior to using the system for the first time, a speaker is prompted to utter a predetermined sequence of words or sentences to thereby supply training data to the system. The training data is employed to develop a speaker-dependent model containing a set of user-specific prototypes. During subsequent use of the system, the user typically needs to first register his/her identity. The user's speech is then compared only to the corresponding prototypes. An obvious drawback of this technique is that it cannot practically recognize speech within, for example, a conference of many speakers, because registering the speaker prior to each utterance is impractical. Hence, there is a need for a practical method to implement automatic speaker recognition. Also, in a general use environment, it is desirable to eliminate the necessity of collecting training data for new users.