The present invention relates generally to speech recognition and more particularly to speaker adaptation whereby the parameters of a speech recognition model are revised to better recognize the speech of a new speaker.
Speech recognition systems may be speaker dependent or speaker independent. Speaker dependent systems are trained to understand what a single individual says, by being given a large number of examples of words uttered by that individual (these examples are called the "training data"). Speaker dependent systems tend to be very accurate for the individual they are trained on, and inaccurate for everybody else. Speaker independent systems are designed to be used by anybody who speaks the language of the application; typically, they are trained on data from many different people. The error rate for a speaker independent system, carrying out recognition on a speaker not in the training data, is roughly two to three times higher than the error rate for a comparable speaker dependent system carrying out recognition on the speaker it is trained on.
In an effort to improve performance, many speech recognition systems include facilities for performing speaker adaptation, whereby the speech recognition system is adjusted during use to reduce the error rate. There are basically three speaker adaptation approaches described in the current technical literature. These are:
(1) Speaker normalization (also called "transformation")--observations of the digitized signal generated by the new speaker (feature vectors) are transformed to resemble more closely observations from a reference speaker, for whom a speaker dependent system has been trained. In some instances the transformation is in the opposite direction: a reference pattern is transformed to resemble the data from the new speaker more closely.
(2) Speaker clustering--observations of the new speaker are used to select a cluster of training speakers; each cluster is associated with a complete set of Hidden Markov Models (HMMs) trained only on the speakers in this cluster. Once the cluster most suitable for the speaker has been chosen, recognition is carried out using only HMMs from this cluster.
(3) Model adaptation--certain HMM parameters are updated to reflect aspects of the adaptation data. The two most popular model adaptation techniques are maximum a posteriori estimation (MAP) and maximum likelihood linear regression (MLLR).
While each of these adaptation techniques has proven to be beneficial, none is without some drawback. Generally speaking, the more effective adaptation techniques tend to require significant computational resources and also require a significant training effort on the part of the individual speaker.
The present invention brings an entirely new technique with which to carry out speaker and environment adaptation. The technique enables an initially speaker independent recognition system to quickly attain a performance level on new speakers and new acoustic environments that approaches that of speaker dependent systems, without requiring large amounts of training data for each new speaker. We call our technique "eigenvoice adaptation." The technique employs an offline step in which a large collection of speaker dependent models is analyzed by principal component analysis (PCA), yielding a set of eigenvectors that we call "eigenvoice vectors" or "eigenvoices." This offline step is fairly computationally intensive, although it has to be performed only once. After that, each time the speech recognition system is used, it carries out a computationally inexpensive operation on adaptation data obtained from the new speaker, to obtain a vector in the space spanned by the eigenvoices. This new vector gives the adapted model for the new speaker.
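The offline step and the form of the adapted model can be illustrated with a minimal sketch. It assumes that each speaker dependent model has been flattened into one fixed-length parameter vector (called a "supervector" here for convenience), and it uses NumPy's singular value decomposition as one common way to carry out PCA. The function names, array sizes, and random data are hypothetical and serve only as an illustration.

```python
import numpy as np

def build_eigenvoices(supervectors, num_eigenvoices):
    """Offline step: PCA over one parameter vector per training speaker.

    supervectors: array of shape (num_speakers, num_params). Assumption:
    each row concatenates the parameters of one speaker dependent model.
    Returns the mean vector and the leading principal directions.
    """
    X = np.asarray(supervectors, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centered data: the rows of vt are orthonormal principal
    # directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:num_eigenvoices]

def adapted_supervector(mean, eigenvoices, weights):
    """A point in eigenvoice space: the mean plus a weighted sum of eigenvoices."""
    return mean + np.asarray(weights, dtype=float) @ eigenvoices

# Hypothetical data: 20 training speakers, 60 model parameters each.
rng = np.random.default_rng(0)
train = rng.normal(size=(20, 60))
mean, eigenvoices = build_eigenvoices(train, num_eigenvoices=5)

# A new speaker is then described by only 5 weights instead of 60 parameters.
model = adapted_supervector(mean, eigenvoices, [0.5, -1.0, 0.0, 0.2, 0.1])
```

In this sketch the costly decomposition runs once over all training speakers, while describing a new speaker only requires choosing the short weight vector.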
More specifically, the present invention employs a maximum likelihood technique for placing the new vector into the space spanned by the eigenvoices. The maximum likelihood technique involves constructing an auxiliary function based on the observation data from the new speaker and also based on the knowledge of how the Hidden Markov Models are constructed. Using this auxiliary function, a maximum likelihood vector is obtained by taking derivatives and finding the local maxima. This maximum likelihood vector is thus inherently constrained within the space spanned by the eigenvoices and represents the optimal representation within that space for the new speaker given the available input speech data.
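One way to write such an auxiliary function is sketched here, under the assumptions that the HMM output densities are Gaussian and that only the Gaussian means are adapted. All symbols are introduced for illustration only: $o_t$ is the observation at time $t$, $\gamma_m^{(s)}(t)$ is the occupation probability of mixture Gaussian $m$ of state $s$ at time $t$, $C_m^{(s)}$ is its covariance, $\bar{e}_m^{(s)}(j)$ is the part of eigenvoice $j$ corresponding to that Gaussian, and $w_1, \dots, w_E$ are the weights being estimated.

```latex
\begin{align}
\hat{\mu}_m^{(s)} &= \sum_{j=1}^{E} w_j \, \bar{e}_m^{(s)}(j) \\
Q(w) &= -\frac{1}{2} \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t)
        \left(o_t - \hat{\mu}_m^{(s)}\right)^{\top} \left(C_m^{(s)}\right)^{-1}
        \left(o_t - \hat{\mu}_m^{(s)}\right) + \text{const} \\
\frac{\partial Q}{\partial w_e} = 0 \;&\Longrightarrow\;
\sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t) \,
  \bar{e}_m^{(s)}(e)^{\top} \left(C_m^{(s)}\right)^{-1} o_t
= \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t)
  \sum_{j=1}^{E} w_j \, \bar{e}_m^{(s)}(j)^{\top} \left(C_m^{(s)}\right)^{-1} \bar{e}_m^{(s)}(e),
\qquad e = 1, \dots, E
\end{align}
```

Under these assumptions, setting the derivatives to zero gives $E$ equations that are linear in the $E$ unknown weights.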
The maximum likelihood technique employed by the invention offers a number of important advantages. First, the adapted model constructed from the maximum likelihood vector always generates the optimal set of HMM models, given the quantity of observation data. Second, although the maximum likelihood technique involves some computation, the computational burden is quite low because the eigenvoice representation dramatically reduces the number of parameters needed to describe a person's speech. Whereas typical Hidden Markov Model representations involve thousands of floating point number parameters, the eigenvoice representation of the invention requires far fewer parameters; a typical embodiment might employ 25-100 parameters to represent a given speaker, although the system will work with even fewer parameters than these. Computational burden is also significantly reduced with the present invention because the eigenvectors are orthogonal, allowing the maximum likelihood computation to be performed by solving a set of linear equations that a computer can calculate quite readily.
Third, the observation data does not have to include examples of each and every sound unit prescribed by the Hidden Markov Models. Thus, the maximum likelihood technique will work even if data for some of the sound units are missing. In contrast, placing the new speaker's parameters in eigenspace using a projection operation requires the speaker to utter at least one example of each and every sound unit prescribed by the Hidden Markov Models. In practical terms, the maximum likelihood technique will allow construction of a robust adapted model based on a very short, and potentially incomplete, training session. The technique thus lends itself to speaker and environment adaptation applications where a large quantity of adaptation data may not be available. For example, the technique would work well in a speech-enabled interactive marketing system where the new speaker responds by telephone to system navigation prompts and the system adapts to the new speaker automatically as the speaker proceeds to navigate through the system to place an order.
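The contrast between maximum likelihood placement and projection when some sound units are never observed can be illustrated with a small sketch. It rests on several simplifying assumptions that are not taken from the text above: one Gaussian with diagonal covariance per sound unit, only the means adapted, a known assignment of frames to units, and noise-free observations so that the result is easy to check. For the projection comparison, unobserved units are filled in with the mean model, which is just one hypothetical workaround. All names and sizes are illustrative.

```python
import numpy as np

def ml_eigenvoice_weights(mean, eigenvoices, inv_var, counts, obs_sums):
    """Solve the linear system A w = b for the eigenvoice weights.

    mean:        (M, n) mean model, one row per Gaussian (sound unit)
    eigenvoices: (E, M, n) eigenvoices, split per Gaussian
    inv_var:     (M, n) diagonal inverse variances
    counts:      (M,) occupancy of each Gaussian (0 if never observed)
    obs_sums:    (M, n) occupancy-weighted sums of observation vectors
    """
    # A[j, k] = sum_m counts[m] * e_j[m]^T C_m^{-1} e_k[m]
    weighted = eigenvoices * (inv_var * counts[:, None])
    A = np.einsum('jmn,kmn->jk', weighted, eigenvoices)
    # b[j] = sum_m e_j[m]^T C_m^{-1} (obs_sums[m] - counts[m] * mean[m])
    resid = obs_sums - counts[:, None] * mean
    b = np.einsum('jmn,mn->j', eigenvoices * inv_var, resid)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
M, n, E = 10, 3, 3                      # 10 sound units, 3-dim features, 3 eigenvoices
q, _ = np.linalg.qr(rng.normal(size=(M * n, E)))
eigenvoices = q.T.reshape(E, M, n)      # orthonormal eigenvoices
mean = rng.normal(size=(M, n))
inv_var = rng.uniform(0.5, 2.0, size=(M, n))

w_true = np.array([1.5, -0.7, 0.3])
speaker_means = mean + np.einsum('e,emn->mn', w_true, eigenvoices)

# The new speaker utters only 4 of the 10 sound units (5 frames each).
counts = np.array([5, 5, 5, 5, 0, 0, 0, 0, 0, 0], dtype=float)
obs_sums = counts[:, None] * speaker_means   # noise-free frames for clarity

w_ml = ml_eigenvoice_weights(mean, eigenvoices, inv_var, counts, obs_sums)

# Projection needs a complete parameter vector; here the unseen units are
# filled with the mean model (a hypothetical workaround), which biases it.
filled = np.where(counts[:, None] > 0, speaker_means, mean)
w_proj = eigenvoices.reshape(E, -1) @ (filled - mean).ravel()

print("true      :", w_true)
print("ML        :", w_ml)
print("projection:", w_proj)
```

In this noise-free setting the maximum likelihood weights match the true weights even though six units were never observed, whereas the filled-in projection does not. With real, noisy adaptation data the maximum likelihood estimate would only approximate the true weights.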
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.