The present invention relates generally to speech recognition and more particularly to speaker adaptation whereby the parameters of a speech recognition model are revised to better recognize the speech of a new speaker.
Speech recognition systems may be speaker dependent or speaker independent. Speaker dependent systems are trained to understand what a single individual says, by being given a large number of examples of words uttered by that individual (these examples are called the "training data"). Speaker dependent systems tend to be very accurate for the individual they are trained on, and inaccurate for everybody else. Speaker independent systems are designed to be used by anybody who speaks the language of the application; typically, they are trained on data from many different people. The error rate for a speaker independent system, carrying out recognition on a speaker not in the training data, is roughly two to three times higher than the error rate for a comparable speaker dependent system carrying out recognition on the speaker it is trained on.
In an effort to improve performance, many speech recognition systems include facilities for performing speaker adaptation, whereby the speech recognition system is adjusted during use to reduce the error rate. There are basically three speaker adaptation approaches described in the current technical literature. These are:
(1) Speaker normalization (also called "transformation")--observations of the digitized signal generated by the new speaker (feature vectors) are transformed to resemble more closely observations from a reference speaker, for whom a speaker dependent system has been trained. In some instances the transformation is in the opposite direction: a reference pattern is transformed to resemble the data from the new speaker more closely.
(2) Speaker clustering--observations of the new speaker are used to select a cluster of training speakers; each cluster is associated with a complete set of Hidden Markov Models (HMMs) trained only on the speakers in this cluster. Once the cluster most suitable for the speaker has been chosen, recognition is carried out using only HMMs from this cluster.
(3) Model adaptation--certain HMM parameters are updated to reflect aspects of the adaptation data. The two most popular model adaptation techniques are maximum a posteriori estimation (MAP) and maximum likelihood linear regression (MLLR).
While each of these adaptation techniques has proven to be beneficial, none is without some drawback. Generally speaking, the more effective adaptation techniques tend to require significant computational resources and also require a significant training effort on the part of the individual speaker.
The present invention provides an entirely new technique for carrying out speaker and environment adaptation. The technique enables an initially speaker independent recognition system to quickly attain, on new speakers and in new acoustic environments, a performance level that approaches that of speaker dependent systems, without requiring large amounts of training data for each new speaker. We call our technique "eigenvoice adaptation." The technique employs an offline step in which a large collection of speaker dependent models is analyzed by principal component analysis (PCA), yielding a set of eigenvectors that we call "eigenvoice vectors" or "eigenvoices." This offline step is fairly computationally intensive, although it has to be performed only once. After that, each time the speech recognition system is used, it carries out a computationally inexpensive operation on adaptation data obtained from the new speaker, to obtain a vector in the space spanned by the eigenvoices. This new vector gives the adapted model for the new speaker.
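The offline step can be illustrated with a minimal sketch. It rests on assumptions that are not taken from this specification: each speaker dependent model is assumed to have been flattened into a fixed-length vector of parameters (one row per training speaker), the data are synthetic, and PCA is computed with NumPy's singular value decomposition. The function name build_eigenvoices and the parameter k are hypothetical and chosen only for illustration.

```python
import numpy as np

def build_eigenvoices(speaker_vectors, k):
    """Run PCA on a (T, D) array holding one model vector per training speaker.

    Assumption: every speaker dependent model has been flattened into a
    length-D parameter vector. Returns the mean vector, the top-k
    eigenvoices as rows of a (k, D) array, and the variance each explains.
    """
    X = np.asarray(speaker_vectors, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are orthonormal principal directions, ordered by
    # decreasing singular value (decreasing variance explained).
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (X.shape[0] - 1)
    return mean, vt[:k], variances[:k]

# Synthetic stand-in for T = 20 training speakers with D = 50 parameters each.
rng = np.random.default_rng(0)
speaker_vectors = rng.normal(size=(20, 50))
mean, eigenvoices, variances = build_eigenvoices(speaker_vectors, k=5)
```

Only the mean and the k retained eigenvoices need to be stored for later use; the expensive decomposition is not repeated when a new speaker arrives.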
In contrast to model adaptation techniques such as MAP and MLLR, the invention performs most of its expensive computation offline, in the PCA step. This allows the invention to perform speaker or environment adaptation quite quickly and with little computational expense as the recognition system is being used.
Part of the power of the invention derives from the eigenvoice representation of the collective set of training speakers and of the new individual speaker for which the recognition system is being adapted. In other words, the eigenspace developed during the PCA step represents the collective speech traits of all the training speakers. The individual eigenvectors that define this n-dimensional space are mutually uncorrelated (orthogonal) and are listed in order of importance for explaining variation in the data. Our experience has shown that the highest order eigenvector in this array may represent a male-female dimension. When this eigenvector receives a positive weight, the speaker is likely to be male; when it receives a negative weight, the speaker is likely to be female. However, it should be understood that the individual eigenvectors are not assigned a priori to any physical differences among speakers. Rather, the eigenvectors derive entirely from the training data when PCA is performed upon it.
As a new speaker uses the speech recognizer during adaptation, the model output parameters are constrained to be a linear combination of the previously determined eigenvoices. In other words, the speaker dependent model being trained on the new speaker must lie within the eigenvoice space previously defined by the training speakers. This is a comparatively inexpensive computational operation. The technique quickly generates a good speaker dependent model even if only a small quantity of adaptation speech is used. The technique thus lends itself to speaker and environment adaptation applications where a large quantity of adaptation data may not be available. For example, the technique would work well in a speech-enabled interactive marketing system where the new speaker responds by telephone to system navigation prompts and the system adapts to the new speaker automatically as the speaker proceeds to navigate through the system to place an order.
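The constraint described above can be sketched in a few lines. This is an illustration under stated assumptions, not the estimator used by the invention: the weights are found here by ordinary least squares, the adaptation data are assumed to yield estimates for only a subset of the model parameters, and a random orthonormal basis stands in for eigenvoices obtained from PCA. The function adapt and all variable names are hypothetical.

```python
import numpy as np

def adapt(mean, eigenvoices, observed_idx, observed_values):
    """Fit eigenvoice weights by least squares from partially observed parameters.

    Assumption: adaptation speech gives estimates only for the parameters
    at observed_idx. The returned model has all D parameters and is
    constrained to lie in the space spanned by the eigenvoices.
    """
    A = eigenvoices[:, observed_idx].T         # (n_observed, K)
    b = observed_values - mean[observed_idx]   # (n_observed,)
    weights, *_ = np.linalg.lstsq(A, b, rcond=None)
    adapted_model = mean + weights @ eigenvoices
    return weights, adapted_model

# Synthetic setup: K = 3 orthonormal vectors of dimension D = 40 stand in
# for eigenvoices; the "new speaker" is an exact combination of them.
rng = np.random.default_rng(1)
D, K = 40, 3
q, _ = np.linalg.qr(rng.normal(size=(D, K)))
eigenvoices = q.T
mean = rng.normal(size=D)
true_weights = np.array([2.0, -1.0, 0.5])
speaker = mean + true_weights @ eigenvoices

# Only 10 of the 40 parameters are observed from the adaptation data.
observed_idx = np.arange(10)
weights, adapted_model = adapt(mean, eigenvoices, observed_idx, speaker[observed_idx])
```

Because only K weights are estimated rather than all D parameters, a small amount of adaptation data is enough to determine a complete model, which is the property the paragraph above relies on.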
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.