The present invention relates generally to speech recognition and more particularly to speaker adaptation whereby the parameters of a speech recognition model are revised to better recognize the speech of a new speaker.
Speech recognition systems may be speaker dependent or speaker independent. Speaker dependent systems are trained to understand what a single individual says, by being given a large number of examples of words uttered by that individual (these examples are called the "training data"). Speaker dependent systems tend to be very accurate for the individual they are trained on, and inaccurate for everybody else. Speaker independent systems are designed to be used by anybody who speaks the language of the application; typically, they are trained on data from many different people. The error rate for a speaker independent system, carrying out recognition on a speaker not in the training data, is roughly two to three times higher than the error rate for a comparable speaker dependent system carrying out recognition on the speaker it is trained on.
In an effort to improve performance, many speech recognition systems include facilities for performing speaker adaptation, whereby the speech recognition system is adjusted during use to reduce the error rate. There are basically three speaker adaptation approaches described in the current technical literature. These are:
(1) Speaker normalization (also called "transformation"): feature vectors derived from the digitized signal generated by the new speaker are transformed to resemble more closely observations from a reference speaker, for whom a speaker dependent system has been trained. In some instances the transformation is in the opposite direction: a reference pattern is transformed to resemble the data from the new speaker more closely.
(2) Speaker clustering: observations of the new speaker are used to select a cluster of training speakers; each cluster is associated with a complete set of Hidden Markov Models (HMMs) trained only on the speakers in this cluster. Once the cluster most suitable for the speaker has been chosen, recognition is carried out using only HMMs from this cluster.
(3) Model adaptation: certain HMM parameters are updated to reflect aspects of the adaptation data. The two most popular model adaptation techniques are maximum a posteriori estimation (MAP) and maximum likelihood linear regression (MLLR).
While each of these adaptation techniques has proven to be beneficial, none is without some drawback. Generally speaking, the more effective adaptation techniques tend to require significant computational resources and also require a significant training effort on the part of the individual speaker.
The present invention provides an entirely new technique with which to carry out speaker normalization and speaker and environment adaptation. The technique enables an initially speaker independent recognition system to quickly attain a performance level on new speakers and new acoustic environments that approaches that of speaker dependent systems, without requiring large amounts of training data for each new speaker. We call our technique "eigenvoice adaptation." We have discovered that eigenvoice adaptation can be applied in a variety of different contexts, as will be illustrated herein through some specific examples.
In general, eigenvoice adaptation involves an advantageous dimensionality reduction that can greatly improve the speed and efficiency at which speaker and environment adaptation is performed. Dimensionality reduction refers to a mapping of high-dimensional space onto low-dimensional space. A variety of different techniques may be used to effect dimensionality reduction. These include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Factor Analysis (FA), Singular Value Decomposition (SVD) and other transformations that apply reduction criteria based on variance.
Unlike other adaptation techniques described in the literature, our eigenvoice adaptation techniques apply dimensionality reduction to a set of complete speaker models in order to find basis vectors spanning the space of these speaker models. By way of illustration, a large collection of speaker models is analyzed in an offline step using dimensionality reduction to yield a set of eigenvectors that we call "eigenvoice vectors" or "eigenvoices." This offline step is fairly computationally intensive, although it has to be performed only once. After that, each time the speech recognition system is used, it carries out a computationally inexpensive operation on adaptation data obtained from the new speaker, to obtain a vector in the space spanned by the eigenvoices. This new vector gives the adapted model for the new speaker.
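By way of illustration only, the offline and online steps just described may be sketched as follows. The sketch assumes that each speaker model has already been flattened into a fixed-length supervector, uses Principal Component Analysis (computed through a singular value decomposition) as the dimensionality reduction technique, and places the new speaker by simple geometric projection. The function names, the toy dimensions, and the random data are hypothetical and are not taken from the specification.

```python
import numpy as np

def train_eigenvoices(supervectors, k):
    """Offline step: PCA over speaker supervectors (one row per training speaker)."""
    mean = supervectors.mean(axis=0)
    centered = supervectors - mean
    # Rows of vt are orthonormal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project_speaker(supervector, mean, eigenvoices):
    """Online step: coordinates in eigenspace and the reconstructed supervector."""
    weights = eigenvoices @ (supervector - mean)   # one dot product per eigenvoice
    adapted = mean + weights @ eigenvoices         # point constrained to eigenspace
    return weights, adapted

# Toy data (assumption): 20 training speakers whose 50-dimensional
# supervectors lie close to a 3-dimensional subspace.
rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 50))
training = rng.normal(size=(20, 3)) @ basis + 0.01 * rng.normal(size=(20, 50))

mean, eigenvoices = train_eigenvoices(training, k=3)
new_speaker = rng.normal(size=3) @ basis
weights, adapted = project_speaker(new_speaker, mean, eigenvoices)
```

In this sketch the expensive decomposition runs once over the training set, while each new speaker costs only a handful of dot products.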
Part of the power of the invention derives from the eigenvoice representation of the collective set of training speakers and of the new individual speaker for which the recognition system is being adapted. In other words, the eigenspace developed during the dimensionality reduction step represents the collective speech traits of all the training speakers. The individual eigenvectors that define this n-dimensional space each contain different information and may be represented, for example, as members of an ordered list or array.
Computational burden is significantly reduced with the present invention because the eigenvectors are orthogonal, allowing subsequent computations to be performed by solving a set of linear equations that a computer can calculate quite readily.
Placing a new speaker within eigenspace can be accomplished in a number of different ways. Although simple geometric projection can be used to place the new speaker into eigenspace, we have developed an improved technique that we call Maximum Likelihood Eigenvoice Decomposition (MLED) for placing the new vector into the space spanned by the eigenvoices. The maximum likelihood technique involves constructing a probability function based on the observation data from the new speaker and also based on the knowledge of how the Hidden Markov Models are constructed. Using this probability function, a maximum likelihood vector is obtained by taking derivatives and finding the local maxima. This maximum likelihood vector is thus inherently constrained within the space spanned by the eigenvoices and is a good representation within that space for the new speaker given the available input speech data.
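A simplified sketch of the maximum likelihood placement is given below under several stated assumptions: each state is modeled by a single Gaussian with diagonal covariance, each observation frame has already been aligned to exactly one state, and the state means are a pure linear combination of the eigenvoices (any mean supervector is taken to have been folded into the basis). Under these assumptions, setting the derivative of the log-likelihood with respect to each eigenvoice weight to zero yields a set of linear equations. The function name and toy data are hypothetical.

```python
import numpy as np

def mled_weights(frames, states, eigenvoices, inv_var):
    """Solve the linear system obtained by zeroing the likelihood gradient.

    frames:      (T, D) observation vectors from the new speaker
    states:      (T,)   index of the state each frame is aligned to
    eigenvoices: (K, S, D) eigenvoices, reshaped per state
    inv_var:     (S, D) inverse diagonal variances of the state Gaussians
    """
    K = eigenvoices.shape[0]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for o, s in zip(frames, states):
        E = eigenvoices[:, s, :]      # (K, D) slice of each eigenvoice for state s
        Ew = E * inv_var[s]           # weight each dimension by inverse variance
        A += Ew @ E.T
        b += Ew @ o
    return np.linalg.solve(A, b)

# Toy data (assumption): 2 eigenvoices, 4 states, 3-dimensional features.
rng = np.random.default_rng(1)
K, S, D = 2, 4, 3
eigenvoices = rng.normal(size=(K, S, D))
inv_var = np.ones((S, D))
true_w = np.array([0.7, -1.2])
means = np.tensordot(true_w, eigenvoices, axes=1)   # (S, D) true state means

# The adaptation data covers only states 0 and 1.
states = np.array([0, 1, 1, 0, 1])
frames = means[states]

est_w = mled_weights(frames, states, eigenvoices, inv_var)
adapted_means = np.tensordot(est_w, eigenvoices, axes=1)  # includes unseen states
```

Because the estimate is constrained to the space spanned by the eigenvoices, the sketch recovers means for states 2 and 3 even though the new speaker supplied no frames for them.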
Our eigenvoice adaptation techniques give superior results when a good training set of accurate speaker-dependent models is used as the basis for dimensionality reduction. Therefore, according to one aspect of the invention the speaker-dependent models may be obtained and enhanced prior to dimensionality reduction using auxiliary adaptation techniques. Such techniques include Maximum A Posteriori estimation (MAP) and other transformation-based approaches, such as Maximum Likelihood Linear Regression (MLLR).
According to another aspect of the invention, the eigenvoice adaptation technique is applied to develop an initial adapted model and this model is then further improved using auxiliary adaptation techniques, such as those described above. Often the best results may be obtained by applying the MLED technique first and then one of these auxiliary adaptation techniques.
The eigenvoice adaptation techniques discussed so far have involved dimensionality reduction applied to a collective set of training speakers. Yet another aspect of the invention involves application of dimensionality reduction to the set of transformation matrices resulting from a transformation-based adaptation technique such as MLLR. In this approach, each training speaker is used to estimate a set of transformation matrices from a speaker-independent model (using MLLR, for example). The set of transformation matrices for each training speaker is then vectorized (turned into a high-dimensional supervector). A dimensionality reduction technique is then applied to the set of supervectors to yield a low-dimensional set of eigenvectors we call "eigentransform vectors" or "eigentransforms."
To adapt to a new speaker quickly, the system assumes the new speaker's transformation matrices are located in the subspace spanned by the eigentransforms, estimates their position in that subspace from the adaptation data, and then applies the resulting transforms to the speaker independent model.
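The eigentransform approach may be sketched as follows, again by way of illustration only. The sketch assumes a single transformation matrix per speaker of shape D x (D+1) that acts on the extended mean vector [1, mean], uses PCA as the dimensionality reduction technique, and takes the new speaker's eigentransform weights as given rather than estimating them. All names and toy values are hypothetical.

```python
import numpy as np

def train_eigentransforms(transforms, k):
    """Vectorize each speaker's transform and run PCA over the supervectors."""
    n = transforms.shape[0]
    supervectors = transforms.reshape(n, -1)        # (N, D * (D + 1))
    mean = supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(supervectors - mean, full_matrices=False)
    return mean, vt[:k]

def adapt_means(si_means, weights, mean, eigentransforms):
    """Rebuild a transform from eigentransform weights and apply it."""
    d = si_means.shape[1]
    W = (mean + weights @ eigentransforms).reshape(d, d + 1)
    ones = np.ones((si_means.shape[0], 1))
    extended = np.hstack([ones, si_means])          # rows are [1, mean]
    return extended @ W.T                           # adapted means, shape (M, D)

# Toy data (assumption): 5 training speakers whose transforms are a
# speaker-specific bias column followed by the identity matrix.
rng = np.random.default_rng(2)
D, N = 3, 5
biases = rng.normal(size=(N, D, 1))
identity = np.tile(np.eye(D), (N, 1, 1))
transforms = np.concatenate([biases, identity], axis=2)   # (N, D, D + 1)

mean, eigentransforms = train_eigentransforms(transforms, k=2)
si_means = rng.normal(size=(4, D))                  # speaker independent means
adapted = adapt_means(si_means, np.zeros(2), mean, eigentransforms)
```

With all weights set to zero the sketch applies the average training transform; nonzero weights move the transform along the eigentransform directions.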
The dimensionality-reducing jump into eigenspace affords considerable flexibility and computational economy. We have found, for example, that statistical processing techniques may be applied in the low-dimensional eigenspace itself. Therefore, in accordance with another aspect of the invention, a statistical process such as Bayesian estimation may be performed in eigenspace as a way of better locating where to place a new speaker within eigenspace. Prior knowledge (from the training speakers, for example) about what areas of speaker space are densely or thinly populated is used to refine estimates of where to locate the new speaker within eigenspace.
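One possible form of such estimation in eigenspace is sketched below. It assumes a Gaussian prior over eigenspace coordinates (for example, the sample mean and covariance of the training speakers' coordinates) and a Gaussian likelihood summarized by a matrix A and vector b, such that the maximum likelihood weights alone would solve A w = b. Under those assumptions the maximum a posteriori weights solve a slightly larger linear system. The function name and numeric values are hypothetical, and this is only one of many ways a Bayesian estimate could be formed.

```python
import numpy as np

def map_weights(A, b, prior_mean, prior_cov):
    """MAP estimate of eigenspace coordinates under a Gaussian prior.

    Maximizes  -0.5 * (w'Aw - 2 b'w)  -  0.5 * (w - m)' P (w - m),
    where P is the inverse of prior_cov and m is prior_mean.
    """
    prior_prec = np.linalg.inv(prior_cov)
    return np.linalg.solve(A + prior_prec, b + prior_prec @ prior_mean)

# Toy statistics (assumption): evidence from the adaptation data alone
# would place the speaker at ml_w.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
ml_w = np.array([1.0, -2.0])
b = A @ ml_w

prior_mean = np.zeros(2)
prior_cov = np.eye(2)
map_w = map_weights(A, b, prior_mean, prior_cov)

# With no adaptation data the estimate falls back to the prior mean;
# with abundant data it approaches the maximum likelihood estimate.
no_data_w = map_weights(np.zeros((2, 2)), np.zeros(2), np.array([0.5, 0.5]), prior_cov)
much_data_w = map_weights(1e6 * A, 1e6 * b, prior_mean, prior_cov)
```

The prior pulls the estimate toward densely populated regions of speaker space when adaptation data is scarce, and its influence fades as more data accumulates.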
In practical terms, the eigenvoice adaptation techniques described here will allow construction of a robust adapted model based on a very short, and potentially incomplete, training session. These techniques thus lend themselves to speaker and environment adaptation applications where a large quantity of adaptation data may not be available. For example, the techniques would work well in a speech-enabled interactive marketing system where the new speaker responds by telephone to system navigation prompts and the system adapts to the new speaker automatically as the speaker proceeds to navigate through the system to place an order.
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.