The present invention relates generally to signal processing systems and methods, and more particularly to signal classification systems.
Automatic signal recognition, such as automatic speech recognition (ASR), by computer is a particularly difficult task. Despite an intensive world-wide research effort for over forty years, existing ASR technology still has many limitations. Moderate success has been achieved for controlled environment, small vocabulary, limited scope applications. Moving beyond these limited applications is difficult because of the complexity of the ASR process.
In the ASR process for a large vocabulary system, the speech input begins as a thought in the speaker's mind and is converted into an acoustic wave by the speaker's vocal apparatus. This acoustic wave enters the ASR machine through a transducer/converter that changes the acoustic wave from pressure variations into a representative stream of numbers for subsequent computer processing. This number stream is grouped into successive time intervals or segments (typically 10-20 milliseconds). A feature extraction procedure is applied to each interval. The features are a set of parameters that describe the characteristics of the interval. Their exact definition depends upon the particular ASR method. The features can be used to classify the intervals into subword units, usually phonemes. A classification procedure is applied to the resulting sequence of subword units to produce words for the text output. This is the general ASR procedure; specific systems vary in the features and classification methods used.
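The grouping and feature extraction steps described above can be illustrated with a minimal sketch. It assumes non-overlapping 10 millisecond intervals and two toy features (log energy and zero-crossing count) chosen only for illustration; the function names `frame_signal` and `frame_features` are hypothetical, and practical ASR systems typically use richer spectral features.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=10):
    """Group a stream of samples into successive, non-overlapping intervals."""
    frame_len = int(sample_rate * frame_ms / 1000)
    # A trailing partial interval is dropped for simplicity (an assumption).
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_features(frame):
    """Compute two toy features for one interval: log energy and zero crossings."""
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-12)  # small offset avoids log(0) on silence
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return (log_energy, zero_crossings)

# Illustrative input: 100 ms of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]

frames = frame_signal(samples, sample_rate, frame_ms=10)
features = [frame_features(f) for f in frames]
```

Each entry of `features` describes one interval; a later classification stage would map these per-interval descriptions to subword units.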
The variation in speakers' acoustic production compounds the classification complexity. Different speakers pronounce sounds differently and at different voice pitches. Even the same sound spoken by the same speaker will vary from instance to instance. In addition, a transducer (such as a microphone) captures and adds to the signal other sources besides the speaker, such as room noise, room echo, equipment noise, and other speakers. Grouping the data into time intervals for feature analysis assumes that the signal is stationary throughout the interval, with changes occurring only at the boundaries. This is not strictly true; in fact, the validity of the assumption varies with the type of speech sound. This assumption causes variation in the feature extraction process. Since speech is a continuous process, breaking up the sounds into a finite number of subword units will also contribute phonological variation. There is no simple, direct, consistent relationship between the spoken word input and the analysis entities used to identify it.
Generally, there have been three approaches to ASR: acoustic-phonetic, pattern recognition, and artificial intelligence (Fundamentals of Speech Recognition, L. Rabiner and B. H. Juang, Prentice-Hall, Inc., 1993, p. 42). The acoustic-phonetic approach attempts to identify and use features that directly identify phonemes. The features are used to segment and label the speech signal and directly produce a phoneme stream. This approach assumes that a feature set exists such that definitive rules can be developed and applied to accurately identify the phonemes in the speech signal and therefore determine the words with a high degree of certainty. Variance in the speech signal fatally weakens this assumption.
The pattern recognition (or pattern matching) approach has been the most successful to date. The features are usually based upon a spectral analysis of speech wave segments. Reference patterns are created for each of the recognition units, usually several for each unit to cover variation. The reference patterns are either templates or some type of statistical model such as a Hidden Markov Model (HMM). An unknown speech segment can be classified by its "closest" reference pattern. Specific implementations differ in use of models versus templates, type of recognition unit, reference pattern creation methods, and classification (or pattern recognition) methods.
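Classification by the "closest" reference pattern can be sketched as a nearest-template search. The sketch below assumes fixed-length feature vectors and Euclidean distance as the closeness measure; the phoneme labels and template values are hypothetical, and real systems may use other distance measures or statistical models instead of templates.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(feature_vector, templates):
    """Return (label, distance) of the closest reference template.

    templates is a list of (label, vector) pairs; a recognition unit may
    appear several times to cover variation.
    """
    best_label, best_dist = None, float("inf")
    for label, ref in templates:
        d = euclidean(feature_vector, ref)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

# Hypothetical reference patterns: two templates per recognition unit.
templates = [
    ("aa", [1.0, 0.2]),
    ("aa", [1.2, 0.1]),
    ("iy", [0.2, 1.1]),
    ("iy", [0.1, 0.9]),
]

label, dist = classify([0.3, 1.0], templates)
```

Storing several templates per unit is what lets the search tolerate some of the speaker and instance variation discussed earlier.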
Pattern matching ASR systems integrate knowledge from several sources prior to making the final output decision. Many systems use a language model. A language model improves recognition by providing additional constraints at the word level: word pair probabilities (bigrams), word triplet probabilities (trigrams), allowable phrases, most likely responses, and so on, depending on the application. Knowledge sources can be integrated either bottom up or top down. In the bottom up approach, lower level processes precede higher level processes, with the language model applied at the final step. In the top down method, the language model generates word hypotheses and matches them against the input speech signal.
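The word pair (bigram) constraint mentioned above can be illustrated with a minimal sketch that estimates bigram probabilities from counts and uses them to score candidate phrases. The tiny training corpus, the `<s>` start marker, and the function names are assumptions made for illustration; no smoothing is applied, so unseen word pairs receive a probability of zero.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # "<s>" marks the phrase start
        for prev, cur in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return context_counts, pair_counts

def bigram_prob(prev, cur, context_counts, pair_counts):
    """Maximum-likelihood estimate of P(cur | prev), unsmoothed."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, cur)] / context_counts[prev]

def phrase_score(phrase, context_counts, pair_counts):
    """Product of bigram probabilities over the phrase."""
    words = ["<s>"] + phrase.split()
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        score *= bigram_prob(prev, cur, context_counts, pair_counts)
    return score

# Hypothetical training text.
corpus = ["recognize speech", "recognize speech now", "wreck a nice beach"]
context_counts, pair_counts = train_bigrams(corpus)

likely = phrase_score("recognize speech", context_counts, pair_counts)
unlikely = phrase_score("recognize beach", context_counts, pair_counts)
```

In a bottom up system such scores would rank word sequences produced by earlier stages; in a top down system they would guide which word hypotheses are generated first.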
The best performing large vocabulary systems to date are top down pattern matchers that use HMMs with Gaussian mixture output distributions to model phonemes. Processing begins when an entire phrase is input. A language model is used to generate candidate phrases. The canonical phonetic pronunciation of each candidate phrase is modeled by connected HMM phonetic models that produce a sequence of feature probability distributions. These distributions are compared to the features of the input speech phrase and the most likely candidate phrase is selected for output. High performance on large vocabularies requires large amounts of computational capacity in both memory and time; real time speech recognition is not currently possible on a desktop system without significant performance compromises. Other drawbacks include sensitivity to the amount of training data, sensitivity of reference patterns to speaking environment and transmission channel characteristics, and non-use of specific speech knowledge.
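The candidate selection step described above can be sketched in a highly simplified form. The sketch assumes one-dimensional features, a single Gaussian per state rather than a Gaussian mixture, a left-to-right state topology with equal stay and advance probabilities, and a best-path (Viterbi) score in place of a full likelihood; the candidate names and parameter values are hypothetical, and the language model prior is omitted.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_log_score(observations, states, log_stay, log_next):
    """Best-path log score through a left-to-right HMM.

    states is a list of (mean, var) pairs. The path must start in the
    first state and end in the last state.
    """
    n = len(states)
    neg_inf = float("-inf")
    score = [neg_inf] * n
    score[0] = log_gauss(observations[0], *states[0])
    for x in observations[1:]:
        new = [neg_inf] * n
        for j in range(n):
            stay = score[j] + log_stay
            move = score[j - 1] + log_next if j > 0 else neg_inf
            best = max(stay, move)
            if best > neg_inf:
                new[j] = best + log_gauss(x, *states[j])
        score = new
    return score[-1]

# Two hypothetical candidate models, each a chain of two states.
candidates = {
    "low_high": [(0.0, 1.0), (5.0, 1.0)],
    "high_low": [(5.0, 1.0), (0.0, 1.0)],
}
observations = [0.1, -0.2, 4.8, 5.1]
log_half = math.log(0.5)

scores = {
    name: viterbi_log_score(observations, states, log_half, log_half)
    for name, states in candidates.items()
}
selected = max(scores, key=scores.get)
```

Even in this reduced form, every candidate must be scored against every input interval, which hints at why large vocabularies demand substantial memory and time.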
Artificial intelligence is a collection of implementation techniques rather than a separate ASR approach. These techniques are generally of two types: expert systems and neural networks. Expert systems provide a systematic method to integrate various knowledge sources through the development and application of rules. They are best suited for the acoustic-phonetic approach. Neural networks were originally developed to model interactions within the brain. They come in many varieties, but all are pattern recognizers that require training to determine network parameter values. They can model non-linear relationships and generalize, that is, classify patterns not in the training data. Neural networks have been successfully used in ASR to classify both phonemes and words.
There is, therefore, a need for a signal processing and classification system that achieves increased performance in time, accuracy, and overall effectiveness. Moreover, there is a need for a signal processing and classification system that provides highly accurate, real-time, speaker independent voice recognition on a desktop computer.