The present invention relates to speech recognition, and more particularly, to systems for and methods of transforming an acoustical speech signal into a linguistic stream of phonetics, words and other speech components.
In general, speech recognition is a multi-layered process. Typically, a speech recognition system analyzes a raw acoustic waveform over time, and applies some complex algorithm to extract a stream of linguistic units (e.g., the phonetics, words, etc.). The term “stream” may also be referred to herein as a “sequence” or a “series,” and the term “linguistic units” may also be referred to herein as “phonetic estimates.” The term “acoustic waveform” may also be referred to herein as “acoustic signal,” “audio signal,” or “audio waveform.” A speech recognition system may further apply various sources of linguistic constraints, so that the utterance may be finally interpreted within a practical context.
Of all of the processes and associated technologies used for speech recognition, the transformation of an acoustic signal to a linguistic stream has been the most difficult, and remains the technology gatekeeper for practical applications. The problem is essentially one of pattern recognition, and shares many of the challenges of handwriting recognition, OCR and other visual recognition technologies. The process that transforms an acoustic signal to a linguistic stream is referred to herein as the “core speech recognizer”.
There have been three primary strategies for approaching the problem of realizing the core speech recognizer: (1) the statistical approach, (2) the feature approach and (3) the perceptual or bio-modeling approach. Each approach is summarized below.
(1) Statistical Recognition
The statistical recognition approach involves first reducing the incoming data stream to its essential, most basic components, then applying algorithms to examine thousands, or in some cases millions, of statistical hypotheses to find the most likely spoken word-string. The framework used most commonly (and nearly universally) is known as Hidden Markov Modeling (hereinafter referred to as “HMM”).
(2) Recognition by Linguistic Features
This approach is based on the idea that the study of linguistics has accumulated a vast body of knowledge about the acoustic features that correspond to the phonetics of human language. Once these features are characterized and estimated, a system can integrate them statistically to derive the best guess as to the underlying spoken utterance.
The feature approach has not been very successful. However, the Jupiter system at MIT has successfully combined the statistical method with a feature-based front end. While this class of recognition system remains in an experimental stage, it performs well in limited domains.
(3) Biomodeling Human Perception: Partial Approaches
Humans are the only example we have of a working, efficient speech recognizer. Thus, it makes sense to try to mimic how the human brain recognizes speech. This “bio-modeling” approach may be the most challenging, as there is no definitive scientific knowledge of how humans recognize speech.
One approach to bio-modeling has been to use what is known about the inner ear, and design preprocessors based on physiological analogs. The preprocessors may be used to modify the raw acoustic signal to form a modified signal, which may then be fed into standard pattern recognizers. This approach has yielded some limited success, primarily with regard to noise immunity.
Artificial Neural Nets (hereinafter referred to as “ANNs”) fit somewhat into this category as well. ANNs have become a significant field of research, and provide a class of pattern recognition algorithms that have been applied to a growing set of problems. ANNs emphasize the enormous connectivity that is found in the brain.
HMM: The Standard Prior Art Technology
The essence of the HMM idea is to assume that speech is ideally a sequence of particular and discrete states, but that the incoming raw acoustic data provides only a distorted and fuzzy representation of these pristine states. Hence the word “hidden” in “Hidden Markov Modeling.” For example, we know that speech is a series of discrete words, but the representation of that speech within the acoustic signal may be corrupted by noise, or the words may not have been clearly spoken.
Speech comprises a collection of phrases; each phrase includes a series of words, and each word includes components called phonemes, which are the consonants and vowels. Thus, a hierarchy of states may be used to describe speech. At the lowest level, for the smallest linguistic unit chosen, the sub-states are the actual acoustic data. Thus, if an HMM system builds up the most likely representation of the speech from bottom to top, with each sub-part or super-part helping to improve the probabilities of the others, the system should be able to simply read off the word and phrase content at the top level.
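By way of illustration only, the search for the most likely sequence of hidden states may be sketched with the well-known Viterbi algorithm. The sketch below is a minimal example under stated assumptions and is not drawn from any particular prior art system: the two phoneme states (“s” and “iy”), the two observation labels (“hiss” and “tone”), and all probability values are hypothetical and chosen only to make the computation easy to follow.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        key: to_log(value) if isinstance(value, dict) else math.log(value)
        for key, value in table.items()
    }


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden state path and its log-probability."""
    # best[t][s] holds the log-probability of the best path ending in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s] holds the predecessor of s on that best path.
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: two phoneme states and two discrete acoustic labels.
states = ["s", "iy"]
log_start = to_log({"s": 0.8, "iy": 0.2})
log_trans = to_log({"s": {"s": 0.6, "iy": 0.4}, "iy": {"s": 0.1, "iy": 0.9}})
log_emit = to_log({
    "s": {"hiss": 0.9, "tone": 0.1},
    "iy": {"hiss": 0.2, "tone": 0.8},
})

observations = ["hiss", "hiss", "tone", "tone"]
path, log_prob = viterbi(observations, states, log_start, log_trans, log_emit)
print(path, math.exp(log_prob))
```

Log-probabilities are used so that products of many small probabilities do not underflow on longer utterances. In this toy model the observations are already discrete labels.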
The real incoming acoustic signal is continuous, however, and does not exist in discrete states. The first solution to this problem was to use a clustering algorithm to find some reasonable states that encompass the range of input signals, and assign a given datum to the nearest one. This was called VQ, or Vector Quantization. VQ worked to a limited extent, but it turned out to be much better to assign each datum only a probability of belonging to a given state, leaving open that it might instead belong to some other state or states with some probability. This algorithm goes by the name of Continuous Density HMM.
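The difference between the hard assignment of VQ and the probabilistic assignment described above may be sketched as follows. This is a simplified illustration under stated assumptions: the features are one-dimensional, each state is modeled by a single Gaussian (practical systems typically use mixtures over multi-dimensional feature vectors), and the means, standard deviations, priors and the sample value are all hypothetical.

```python
import math


def nearest_codeword(x, codebook):
    """Hard VQ: return the index of the codeword closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_assignment(x, means, stds, priors):
    """Return the probability that x belongs to each state (sums to 1)."""
    weighted = [p * gaussian_pdf(x, m, s) for m, s, p in zip(means, stds, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical two-state model; the state means double as the VQ codebook.
means = [1.0, 3.0]
stds = [1.0, 1.0]
priors = [0.5, 0.5]

x = 1.9  # a datum lying close to the boundary between the two states
hard = nearest_codeword(x, means)
soft = soft_assignment(x, means, stds, priors)
print(hard, soft)
```

For a datum near the boundary between two states, the hard assignment discards the fact that the second state is almost as plausible as the first, whereas the soft assignment retains that information for the later stages of the recognizer.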
Continuous Density HMM is now the most widely used algorithm. There are many choices for how to implement this algorithm, and a particular implementation may utilize any number of preprocessors, and may be embedded into a complex system.
The HMM approach allows a large latitude for choosing states and hierarchies of states. There is a design trade-off between using phonetics or words as the base level. Words are less flexible, require more training data, and are context dependent, but they can be much more accurate. Phonetics allows either a large vocabulary or sets of small and dynamic vocabularies. There is also a trade-off between speaker-dependent (i.e., speaker-adaptive) systems, which are appropriate for dictation, and speaker-independent systems, which are required for telephone transactions. Since individuals speak differently, HMM needs to use a large number of states to reflect the variation in the way words are spoken across the user population. A disadvantage to prior art systems that use HMM is a fundamental trade-off between functionality for (1) many words or (2) many people.
Challenges to Automatic Speech Recognition (ASR)
A publicly accessible recognition system must maintain its accuracy for a high percentage of the user population.
“Human adaptation to different speakers, speaking styles, speaking rates, etc., is almost momentarily [i.e. instantaneous]. However, most so-called adaptive speech recognizers need sizable chunks of speech to adapt.” (Pols, Louis C. W., Flexible, robust, and efficient human speech recognition, Institute of Phonetic Sciences, University of Amsterdam, Proceedings 21 (1997), 1-10)
Variation in users includes age, gender, accent, dialect, behavior, motivation, and conversational strategy.
A publicly accessible speech recognition system must also be robust with respect to variations in the acoustical environment. One definition of environmental robustness of speech recognition is maintaining a high level of recognition accuracy in difficult and dynamically-varying acoustical environments. For telephone transactions, variations in the acoustical environment may be caused by variations in the telephone itself, the transmission of the voice over the physical media, and the background acoustical environment of the user.
“Natural, hands-free interaction with computers is currently one of the great unfulfilled promises of automatic speech recognition (ASR), in part because ASR systems cannot reliably recognize speech under everyday, reverberant conditions that pose no problems for most human listeners.” (Brian E. D. Kingsbury, Perceptually-inspired signal processing strategies for robust speech recognition in reverberant environments, PhD thesis, UC Berkeley, 1998.)
In many respects, adverse effects on the acoustic signal to be recognized are getting worse with new communications technology. Speaker-phone use is becoming more common, which increases the noise and the effect of room acoustics on the signal. The speech signal may be degraded by radio transmission on portable or cellular phones. Speech compression on wire-line and cellular networks, and increasingly, on IP-telephony (i.e., voice-over-IP), also degrades the signal. Other sources of background noise include noise in the car, office noise, other people talking, and TV and radio.
“One of the key challenges in ASR research is the sensitivity of ASR systems to real-world levels of acoustic interference in the speech input. Ideally, a machine recognition system's accuracy should degrade in the presence of acoustic interference in the same way a human listener's would: gradually, gracefully and predictably. This is not true in practice. Tests on different state-of-the-art ASR systems carried out over a broad range of different vocabularies and acoustic conditions show that automatic recognizers typically commit at least ten times more errors than human listeners.” (Brian E. D. Kingsbury, Perceptually-inspired signal processing strategies for robust speech recognition in reverberant environments, PhD thesis, UC Berkeley, 1998.)
“While a lot of progress has been made during the last years in the field of Automatic Speech recognition (ASR), one of the main remaining problems is that of robustness. Typically, state-of-the-art ASR systems work very efficiently in well-defined environments, e.g. for clean speech or known noise conditions. However, their performance degrades drastically under different conditions. Many approaches have been developed to circumvent this problem, ranging from noise cancellation to system adaptation techniques.” (K. Weber. Multiple time scale feature combination towards robust speech recognition. Konvens, 5. Konferenz zur Verarbeitung natürlicher Sprache, (to appear), 2000. IDIAP{RR 00-22 7})
Changes Needed to Optimize ASR
The ability of an ASR system to integrate information on many time scales may be important.
“Evidence from psychoacoustics and phonology suggests that humans use the syllable as a basic perceptual unit. Nonetheless, the explicit use of such long time-span units is comparatively unusual in automatic speech recognition systems for English.” (S. L. Wu, B. E. D. Kingsbury, N. Morgan, and S. Greenberg, Incorporating information from syllable-length time scales into automatic speech recognition, ICASSP, pages 721-724, 1998.)
The ability to generalize to new conditions of distortion and noise would be of great importance:
“The recognition accuracy of current automatic speech recognition (ASR) systems deteriorates in the presence of signal distortions caused by the background noise and the transmission channel. Improvement in the recognition accuracy in such environments is usually obtained by re-training the systems or adaptation with data from the new testing environment.” (S. Sharma, Multi-Stream Approach To Robust Speech Recognition, OGI Ph.D. Thesis, April 1999, Portland, USA.)
It may be important to integrate information from many different aspects or features of the acoustic signal:
“One of the biggest distinctions between machine recognition and human perception, is the flexible multi-feature approach taken by humans versus the fixed and limited feature approach by pattern recognition machines.” (Pols, Louis C. W., Flexible, robust, and efficient human speech recognition, Institute of Phonetic Sciences, University of Amsterdam, Proceedings 21 (1997), 1-10.)
Or again:
“Human listeners generally do not rely on one or a few properties of a specific speech signal only, but use various features that can be partly absent (‘trading relations’), a speech recognizer generally is not that flexible. Humans can also quickly adapt to new conditions, like a variable speaking rate, telephone quality speech, or somebody having a cold, using pipe speech, or having a heavy accent. This implies that our internal references apparently are not fixed, as they are in most recognizers, but are highly adaptive.” (Pols, Louis C. W., Flexible, robust, and efficient human speech recognition, Institute of Phonetic Sciences, University of Amsterdam, Proceedings 21 (1997), 1-10.)
“However, if progress is to be made against the remaining difficult problems [of ASR], new approaches will most likely be necessary.” (Herve Bourlard, Hynek Hermansky, Nelson Morgan, Towards increasing speech recognition error rates, Speech Communication 18, pp. 205-231, 1996.)
It is an object of the present invention to substantially overcome the above-identified disadvantages and drawbacks of the prior art.