The present invention relates to speech recognition and more particularly to a speech recognition system which transforms data to improve discrimination and which further allows input from different sources to be transformed onto a common vector space.
The goal of automatic speech recognition (ASR) systems is to determine the lexical identity of spoken utterances. The recognition process, also referred to as classification, begins with the conversion of the acoustical signal into a stream of spectral vectors or frames which describe important characteristics of the signal at specified times. Classification is attempted by first creating reference models which describe some aspect of the behavior of sequences of spectral frames corresponding to different words. A wide variety of models have been developed, e.g., Hidden Markov Models (HMM), but they all share the property that they describe the temporal characteristics of spectra typical of particular words. The sequence of spectra arising from an unknown or input utterance is compared to such models, and the success with which different models predict the behavior of the input frames determines the putative identity of the utterance.
A variety of spectral descriptions have been developed, e.g., filter-banks and linear predictive coding (LPC) spectra, and most share the property that they describe the signal by estimating the energy in different frequency ranges. While speech sounds have characteristic spectral behavior, the classification process is made difficult by the presence of a number of sources of variability. Marked spectral inconsistency can arise, for example, due to the presence of additive background noise, the inconsistency with which a speaker produces the same utterance on different occasions, the dialectal and physiological differences between speakers, including differences due to speaker gender, as well as the differences due to different microphones, preprocessing methods and the acoustic environment.
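By way of illustration only, the conversion of a signal into frames of per-channel energy may be sketched as follows. The function name `filterbank_frames`, the frame length, the hop size, the Hann window and the equal-width channels are all assumptions made for this sketch; practical front ends commonly use perceptually spaced channels and further processing.

```python
import numpy as np

def filterbank_frames(signal, frame_len=256, hop=128, n_channels=8):
    """Return an array of shape (n_frames, n_channels) of band energies.

    Hypothetical helper: equal-width bands over the FFT power spectrum,
    chosen for simplicity rather than taken from any particular system.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    n_bins = frame_len // 2 + 1
    # Integer bin boundaries dividing the spectrum into n_channels bands.
    edges = np.linspace(0, n_bins, n_channels + 1).astype(int)
    frames = np.empty((n_frames, n_channels))
    for i in range(n_frames):
        chunk = signal[i * hop : i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(chunk)) ** 2
        for ch in range(n_channels):
            frames[i, ch] = power[edges[ch] : edges[ch + 1]].sum()
    return frames

# Synthetic input: one second of a 1200 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1200.0 * t)
spectral_frames = filterbank_frames(tone)

# With 8 equal bands over 0-4 kHz, each band spans roughly 500 Hz, so the
# tone's energy is expected to fall in the band with index 2.
print(spectral_frames.shape, spectral_frames[0].argmax())
```

Each row of `spectral_frames` is one spectral vector of the kind discussed above; successive rows form the stream of frames presented to the classifier.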
A separate problem with typical spectral processing is one of efficiency. The components (or channels) of most spectral vectors are not independent and thus the inclusion of the whole vector is in some sense redundant. Since the cost of much subsequent processing is usually influenced by the number of spectral components, the channel correlation can be problematic. The problem is compounded if the basic spectral stream is supplemented by other vector descriptors of the signal, for example various spectral time-derivatives.
Given the joint problems of intrinsic variability and inefficiency of spectral representations, it would be desirable to apply some modification to the spectral vectors so as to minimize the effects of variability on classification accuracy while at the same time improving efficiency of the representation.
It has proved difficult to discover a transformation of this type which would maximize word recognition accuracy. However, methods do exist which improve classification performance on the frame level. The best known such method is Fisher's Discriminant Analysis, also known as Linear Discriminant Analysis (LDA). LDA assumes that a collection of spectral vectors is labeled with a class identity. Each frame-level recognition class is thus associated with a collection of exemplar vectors. It is possible to concisely describe the separability of the classes by deriving an average within-class scatter matrix and a between-class scatter matrix from the class distributions. The goal of the LDA process is then to obtain a matrix transformation which maximizes the quotient of the between-class to within-class scatter matrix determinants, and thus to minimize the within-class variance and maximize the between-class separation.
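The two scatter matrices and the determinant quotient described above may be sketched as follows. This is a minimal illustration under stated assumptions: the names `scatter_matrices` and `fisher_ratio` are hypothetical, the matrices are normalized by the total number of frames, and the data are synthetic two-channel frames rather than real speech.

```python
import numpy as np

def scatter_matrices(frames, labels):
    """Return (within-class, between-class) scatter for labeled frames."""
    frames = np.asarray(frames, dtype=float)
    labels = np.asarray(labels)
    global_mean = frames.mean(axis=0)
    dim = frames.shape[1]
    s_w = np.zeros((dim, dim))
    s_b = np.zeros((dim, dim))
    for c in np.unique(labels):
        members = frames[labels == c]
        class_mean = members.mean(axis=0)
        centered = members - class_mean
        s_w += centered.T @ centered            # spread around the class mean
        offset = (class_mean - global_mean)[:, None]
        s_b += len(members) * (offset @ offset.T)  # spread of the class means
    n = len(frames)
    return s_w / n, s_b / n

def fisher_ratio(s_w, s_b, transform):
    """Quotient of between- to within-class determinants after projection."""
    num = np.linalg.det(transform.T @ s_b @ transform)
    den = np.linalg.det(transform.T @ s_w @ transform)
    return num / den

# Synthetic frames: three classes separated along channel 0, with large
# class-independent variability along channel 1.
rng = np.random.default_rng(0)
class_means = np.array([[-3.0, 0.0], [0.0, 0.0], [3.0, 0.0]])
frames = np.vstack([rng.normal(m, [0.5, 3.0], size=(200, 2)) for m in class_means])
labels = np.repeat([0, 1, 2], 200)

s_w, s_b = scatter_matrices(frames, labels)
ratio_x = fisher_ratio(s_w, s_b, np.array([[1.0], [0.0]]))  # project on channel 0
ratio_y = fisher_ratio(s_w, s_b, np.array([[0.0], [1.0]]))  # project on channel 1
print(ratio_x > ratio_y)
```

The quotient is only informative for a transform that reduces dimensionality, since for a square invertible transform the determinants of the transform cancel and the quotient is unchanged. In this sketch the projection onto channel 0 scores far higher than the projection onto channel 1, which is the behavior the criterion is intended to reward.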
The LDA transformation consists of a rotation and scaling of the spectral space onto a new set of axes which not only maximize class discriminability but which are also orthogonal and can be ordered according to their contribution to discrimination. The ordering permits the discarding of dimensions which do not contribute significantly to frame-level recognition. This generally improves performance by eliminating variability in the signal, as well as improving the efficiency of the representation.
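The ordering and discarding of dimensions may be illustrated by the following sketch. It assumes one common way of computing the LDA axes, namely whitening the within-class scatter by a Cholesky factor and then diagonalizing the between-class scatter; the function names and the synthetic four-channel data are hypothetical and are not taken from any particular recognizer.

```python
import numpy as np

def scatter_matrices(frames, labels):
    """Return (within-class, between-class) scatter for labeled frames."""
    global_mean = frames.mean(axis=0)
    dim = frames.shape[1]
    s_w = np.zeros((dim, dim))
    s_b = np.zeros((dim, dim))
    for c in np.unique(labels):
        members = frames[labels == c]
        class_mean = members.mean(axis=0)
        centered = members - class_mean
        s_w += centered.T @ centered
        offset = (class_mean - global_mean)[:, None]
        s_b += len(members) * (offset @ offset.T)
    return s_w / len(frames), s_b / len(frames)

def lda_axes(s_w, s_b):
    """Return (axes, gains), ordered by decreasing discrimination gain."""
    chol = np.linalg.cholesky(s_w)          # s_w = chol @ chol.T
    inv_chol = np.linalg.inv(chol)
    # Between-class scatter expressed in the whitened space (symmetric).
    whitened_s_b = inv_chol @ s_b @ inv_chol.T
    gains, vectors = np.linalg.eigh(whitened_s_b)
    order = np.argsort(gains)[::-1]         # largest gain first
    return inv_chol.T @ vectors[:, order], gains[order]

# Synthetic frames: three classes in four channels. The class means differ
# only in the first two channels, so at most two axes can discriminate.
rng = np.random.default_rng(0)
class_means = np.array([[0.0, 0.0, 0.0, 0.0],
                        [4.0, 0.0, 0.0, 0.0],
                        [0.0, 3.0, 0.0, 0.0]])
frames = np.vstack([rng.normal(m, 1.0, size=(100, 4)) for m in class_means])
labels = np.repeat([0, 1, 2], 100)

s_w, s_b = scatter_matrices(frames, labels)
all_axes, eigvals = lda_axes(s_w, s_b)

n_keep = 2                                  # discard the low-gain dimensions
axes = all_axes[:, :n_keep]
projected = frames @ axes                   # reduced frames, shape (300, 2)
print(np.round(eigvals, 3))
```

In this sketch the axes are scaled so that the within-class scatter becomes the identity matrix in the new space, and the gains of the discarded dimensions are numerically zero because three classes can support at most two discriminating directions.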
The LDA method has been applied with some success to the preprocessing of spectral frames in speech recognition. It has been found that it is frequently possible to improve word-level recognition performance in this manner even though the transform is trained at the frame level.
There are, however, significant problems with the basic LDA procedure. First, the proper definition of frame-level classes is not obvious, given that ultimately the classes of concern are at the word or even sentence level. Second, and more important, is the limitation that LDA can perform only one linear transformation on the signal. The consequence is that a single set of new axes is obtained for the whole data space. However, in typical ASR applications both the training and testing speech data can arise from a number of different sources (e.g., different microphones, speakers of different gender, etc.), whose effect on the speech data can be approximated as a separate rotation and scaling of some underlying but directly inaccessible, pure acoustic manifestation. This means that if considered separately, the data from each distinct source would give rise to a different and incompatible LDA transformation. A single average linear transform which would attempt to accommodate all the sources would produce poor results.
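The source effect described above may be illustrated with a small synthetic experiment. The sketch assumes two classes of "pure" two-channel frames, a first source which leaves them unchanged, and a second source which applies a hypothetical rotation of 60 degrees and a scaling of 1.5; none of these values is drawn from measured speech data.

```python
import numpy as np

def top_lda_axis(frames, labels):
    """Return (unit axis, gain) of the most discriminating LDA direction."""
    global_mean = frames.mean(axis=0)
    dim = frames.shape[1]
    s_w = np.zeros((dim, dim))
    s_b = np.zeros((dim, dim))
    for c in np.unique(labels):
        members = frames[labels == c]
        class_mean = members.mean(axis=0)
        centered = members - class_mean
        s_w += centered.T @ centered
        offset = (class_mean - global_mean)[:, None]
        s_b += len(members) * (offset @ offset.T)
    # Whiten the within-class scatter, then diagonalize the between-class scatter.
    inv_chol = np.linalg.inv(np.linalg.cholesky(s_w))
    gains, vectors = np.linalg.eigh(inv_chol @ s_b @ inv_chol.T)
    axis = inv_chol.T @ vectors[:, -1]      # eigh sorts ascending; take the last
    return axis / np.linalg.norm(axis), gains[-1]

rng = np.random.default_rng(0)
n = 500
class_means = np.array([[-2.0, 0.0], [2.0, 0.0]])
labels = np.repeat([0, 1], n)

def draw_pure_frames():
    """Draw frames of the underlying, source-independent representation."""
    return np.vstack([rng.normal(m, 0.5, size=(n, 2)) for m in class_means])

theta = np.deg2rad(60.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

source_a = draw_pure_frames()                       # source A: unchanged
source_b = 1.5 * draw_pure_frames() @ rotation.T    # source B: rotated and scaled

axis_a, gain_a = top_lda_axis(source_a, labels)
axis_b, gain_b = top_lda_axis(source_b, labels)
pooled = np.vstack([source_a, source_b])
axis_p, gain_p = top_lda_axis(pooled, np.concatenate([labels, labels]))

print(np.round([gain_a, gain_b, gain_p], 2))
print(np.round(abs(axis_a @ axis_b), 2))
```

Under these assumptions each source, taken alone, yields a similar gain but a different best axis, while the single axis fitted to the pooled data achieves a markedly lower gain than either source-specific axis.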
A particular example of a source effect is the difference due to speaker gender. Speech spectra produced by male and female speakers pronouncing the same phonetic target can differ considerably due to the distinct physiology of the vocal tract. It is known that gender differences in spectra involve a frequency shift, but the relationship is thought to be quite complex and has not yet been adequately characterized. Including the spectra from both genders in the LDA frame-level classes considerably increases the within-class variance and class overlap and makes it more difficult to find an effective set of new axes.
A similar problem may arise if portions of the training and testing data have been processed with dissimilar front ends, including filter-banks and microphones, as well as dissimilar acoustic environments (e.g., quiet vs. noisy).
Improved performance can be achieved if the heterogeneous data are kept separate and a specific LDA transform is obtained for each set. This solution, however, is not acceptable for two reasons. First, doing so effectively reduces the amount of data available for training. Recognition performance is closely tied to the quantity of training data and reducing this amount is undesirable. Second, training source-specific transforms separately would mean that the data from each source would then be mapped onto a unique and mutually incompatible set of output dimensions, with the consequence that a separate set of reference models would have to be produced for every condition. In a large vocabulary system this may entail a great deal of storage space and unacceptable delays due to having to reload reference models if the source changes.
Among the several objects of the present invention it may be noted the provision of a speech recognition system providing increased accuracy of recognition; the provision of such a system which accommodates input from different sources, e.g., male and female speakers; the provision of such a system which employs a vocabulary of models which are relatively compact; the provision of such a system in which vocabulary models are trained using data from a variety of sources; the provision of such a system which is highly reliable and which is of relatively simple and inexpensive implementation. Other objects and features will be in part apparent and in part pointed out hereinafter.