1. Field of the Invention
The present invention relates to pattern recognition, and more particularly relates to an apparatus and method suitable for robust pattern recognition, such as, for example, robust recognition of speech, handwriting or optical characters under challenging conditions.
2. Brief Description of the Prior Art
In the field of pattern recognition, different types of feature vectors may be used to model a given set of observation vectors. For example, in the field of speech recognition, different types of spectral-based feature vectors can be employed, such as mel-frequency cepstral vectors, perceptual linear predictive coefficients (PLP), maximum-likelihood based linearly transformed features, formant-based features, and the like. The acoustic models which are used to model these different feature spaces may produce different types of decoding errors, and their accuracy for classifying vowels, fricatives, and other consonants may vary. Furthermore, the type of signal processing scheme which is used (e.g., LDA, PLP, Cepstra, factor analysis, transformed features, etc.) may determine the robustness of these models under varying noise conditions. Similar comments would apply to other types of pattern recognition problems, such as handwriting recognition or optical character recognition under conditions which are challenging for those types of recognition.
In the past, in the field of speech recognition, multi-scale systems have been explored in which each stream operates on a different time window. Such multi-scale systems have been discussed in the paper “Using Multiple Time Scales in a Multi-Stream Speech Recognition System” as authored by S. Dupont et al., and presented at Eurospeech '97, held in Greece in September 1997 (proceedings pages 3–6). In the paper “Data-derived Non-linear Mapping for Feature Extraction in HMM,” as set forth in the Proceedings of the Workshop on Automatic Speech Recognition and Understanding held in Colorado in December 1999, authors H. Hermansky et al. trained an MLP to map the feature spaces to the log-likelihoods of phonemes, and the combination scheme involved the averaging of the features prior to orthogonalization. In the NIST-based ROVER scheme, a voting mechanism is used after an initial decoding pass to combine the best output from each model. In the paper “Heterogeneous Measurements and Multiple Classifiers for Speech Recognition,” by A. Halberstadt et al., presented at ICSLP '98 (Sydney, Australia 1998), a hierarchical architecture for combining classifiers for speech recognition was presented.
Selection of acoustic features for robust speech recognition has been the subject of research for several years. In the past, algorithms which use feature vectors from multiple frequency bands, or employ techniques to switch between multiple feature streams, have been reported in the literature to handle robustness under different acoustic conditions. The former approach is discussed in a paper by K. Paliwal, entitled “Spectral Subband Centroid Features for Speech Recognition,” presented at ICASSP '98 in Seattle, Wash., May 1998 (proceedings pages 617–20). The latter approach is set forth in a paper by L. Jiang entitled “Unified Decoding and Feature Representation for Improved Speech Recognition,” which was presented at Eurospeech '99 in Budapest, 1999 (proceedings pages 1331–34).
In order to increase speech recognition accuracy, the use of information content in features extracted from Bark-spaced multiple critical frequency bands of speech has been proposed in the aforementioned paper by Paliwal, and in the paper by H. Hermansky et al. entitled “Tandem Connectionist Feature Extraction for Conventional HMM Systems” as presented at ICASSP 2000 in Istanbul, Turkey in May 2000. Typically, most of these feature streams contain complementary information, and an efficient combination of these streams would not only result in increased recognition accuracy, but would also serve as a technique to select the feature stream that best represents the acoustics at a given time frame or segment. The overall performance of the final acoustic model, which is a combination of acoustic models based on several feature spaces, depends on how well the error patterns from these streams complement one another and how much redundant information they possess. This is further discussed in the paper by H. Bourlard entitled “Non-stationary Multi-Channel (Multi Stream) Processing Towards Robust and Adaptive ASR,” at pages 1–10 of the Proceedings of the Workshop on Robust Methods for Speech Recognition in Adverse Conditions, which was held in Finland in 1999. In some cases, even when the performance of one of the streams is not so robust or is far worse than the best system, it may contain hidden characteristic information that becomes more valuable when the two streams are merged.
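By way of illustration only, the general idea of combining streams whose errors differ can be sketched as a weighted sum of per-class log-likelihoods at a single frame. The sketch below is a generic multi-stream combination and is not asserted to be the scheme of any of the cited papers; the function names, the class labels ("aa", "s"), the stream names and the likelihood values are all hypothetical.

```python
import math

def combine_streams(stream_loglikes, weights):
    """Weighted sum of per-class log-likelihoods across feature streams.

    stream_loglikes: list of dicts, one per stream, mapping class -> log-likelihood
    weights: one non-negative weight per stream (normalized here to sum to 1)
    """
    if len(stream_loglikes) != len(weights):
        raise ValueError("need exactly one weight per stream")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]
    classes = stream_loglikes[0].keys()
    return {
        c: sum(w * s[c] for w, s in zip(norm, stream_loglikes))
        for c in classes
    }

def best_class(scores):
    """Return the class with the highest combined score."""
    return max(scores, key=scores.get)

# Hypothetical frame-level likelihoods from two streams that disagree.
mfcc_stream = {"aa": math.log(0.6), "s": math.log(0.4)}
plp_stream = {"aa": math.log(0.2), "s": math.log(0.8)}

combined = combine_streams([mfcc_stream, plp_stream], [0.5, 0.5])
decision = best_class(combined)
```

With equal weights the decision in this made-up example follows the more confident second stream; setting the weights to [1.0, 0.0] reduces the combination to the first stream alone, which shows how the weights act as a soft stream selector.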
The various prior art schemes may substantially increase the computational load during decoding, may not optimally combine the different feature vectors, or may not select the best from among multiple feature vectors.
In view of the foregoing, there is a need in the prior art for an apparatus and method for robust pattern recognition which permits computationally efficient combination of multiple feature spaces. Furthermore, it would be desirable if such apparatus and method could provide both a weighted, normalized maximum likelihood combination scheme and a rank-based maximum likelihood combination scheme.