The ability to vocally converse with a computer is a grand and worthy goal of hundreds of researchers, universities and institutions all over the world. Such a capability is widely expected to revolutionize communications, learning, commerce, government services and many other activities by making the complexities of technology transparent to the user. In order to converse, the computer must first recognize what words are being said by the human user and then must determine the likely meaning of those words and formulate meaningful and appropriate ongoing responses to the user. The invention herein addresses the recognition aspect of the overall speech understanding problem.
It is well known that the human vocal system can be roughly approximated as a source driving a digital (or analog) filter; see, e.g., M. Al-Akaidi, “Simulation model of the vocal tract filter for speech synthesis”, Simulation, Vol. 67, No. 4, pp. 241–246 (October 1996). The source is the larynx and vocal cords, and the filter is the set of resonant acoustic cavities and/or resonant surfaces created and modified by the many movable portions (articulators) of the throat, tongue, mouth/throat surfaces, lips and nasal cavity. These include the lips, mandible, tongue, velum and pharynx. In essence, the source creates one or both of a quasi-periodic vibration (voiced sounds) and a white noise (unvoiced sounds), and the many vocal articulators modify that excitation in accordance with the vowels, consonants or phonemes being expressed. In general, the frequencies between 600 and 4,000 hertz contain the bulk of the necessary acoustic information for human speech perception (B. Bergeron, “Using an intraural microphone interface for improved speech recognition”, Collegiate Microcomputer, Vol. 8, No. 3, pp. 231–238 (August 1990)), but there is some human-hearable information all the way up to 10,000 hertz or so and some important information below 600 hertz. The variable resonances of the human vocal tract are referred to as formants and are indicated as F1, F2 . . . The lower frequency formants F1 and F2 are usually in the range of 250 to 3,000 hertz and contain a major portion of human-hearable information about many articulated sounds and phonemes. Although the formants are principal features of human speech, they are by far not the only features, and even the formants themselves dynamically change frequency and amplitude, depending on context, speaking rate, and mood. Indeed, only experts have been able to manually determine what a person has said based on a printout of the spectrogram of the utterance, and even this analysis contains best guesses.
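By way of illustration only, the source-filter approximation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and not a description of the invention or of any cited model: the function names, the 8,000 hertz sample rate, the 100 hertz pitch, and the formant center frequencies and bandwidths (700 and 1,200 hertz) are all hypothetical values chosen for the example. Each formant is modeled as a simple two-pole digital resonator, and the resonators are cascaded.

```python
import math
import random


def impulse_train(n_samples, sample_rate, pitch_hz):
    """Quasi-periodic 'voiced' source: one unit pulse per pitch period."""
    period = max(1, round(sample_rate / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def white_noise(n_samples, seed=0):
    """'Unvoiced' source: uniform white noise in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator(signal, sample_rate, center_hz, bandwidth_hz):
    """Two-pole digital resonator standing in for one formant (assumed model)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    gain = 1.0 - a1 - a2  # normalizes the filter to unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(source, sample_rate, formants):
    """Pass the excitation through a cascade of formant resonators."""
    out = list(source)
    for center_hz, bandwidth_hz in formants:
        out = resonator(out, sample_rate, center_hz, bandwidth_hz)
    return out


def magnitude_at(signal, sample_rate, freq_hz):
    """Magnitude of the discrete Fourier transform at a single frequency."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    re = sum(x * math.cos(w * i) for i, x in enumerate(signal))
    im = sum(x * math.sin(w * i) for i, x in enumerate(signal))
    return math.hypot(re, im)


if __name__ == "__main__":
    rate = 8000
    formants = [(700.0, 80.0), (1200.0, 90.0)]  # hypothetical F1 and F2
    voiced = synthesize(impulse_train(rate, rate, 100.0), rate, formants)
    # Harmonics near a formant come out stronger than those far from it.
    print(magnitude_at(voiced, rate, 700.0), magnitude_at(voiced, rate, 300.0))
```

Substituting `white_noise(...)` for `impulse_train(...)` gives the unvoiced case. A fixed cascade of this kind ignores the continual movement of the articulators noted above, so it illustrates the approximation only.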
Thus, automated speech recognition is one of the grand problems in linguistic and speech sciences. In fact, only the recent application of trainable stochastic (statistics-based) models using fast microprocessors (e.g., 200 MHz or higher) has resulted in 1998's introduction of inexpensive continuous speech (CS) software products. In the stochastic models used in such software, referred to as Hidden Markov Models (HMMs), the statistics of varying enunciation and temporal delivery are captured in oral training sessions and made available as models for the internal search engine(s).
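As a hedged illustration of how such a model may be scored, the following sketch applies the standard forward algorithm to two toy HMMs and selects the one that better explains an observation sequence. Everything specific here is an assumption made for the example: the two word labels, the two-state left-to-right topology, the three quantized observation symbols (0, 1, 2), and all probabilities. Commercial recognizers use far larger models whose parameters are estimated from training data, and nothing below is taken from any particular product.

```python
def forward_likelihood(observations, start_p, trans_p, emit_p):
    """Return P(observations | model) using the forward algorithm."""
    if not observations:
        return 1.0
    n_states = len(start_p)
    # Initialization: probability of starting in each state and emitting
    # the first symbol.
    alpha = [start_p[s] * emit_p[s][observations[0]] for s in range(n_states)]
    # Induction: fold in one observation at a time.
    for obs in observations[1:]:
        alpha = [
            emit_p[s][obs] * sum(alpha[p] * trans_p[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, left-to-right word models. Each model is a tuple of
# (start probabilities, transition matrix, emission matrix).
TRANSITIONS = [[0.6, 0.4], [0.0, 1.0]]
MODELS = {
    "yes": ([1.0, 0.0], TRANSITIONS, [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "no": ([1.0, 0.0], TRANSITIONS, [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]),
}


def recognize(observations, models=MODELS):
    """Pick the word whose model assigns the sequence the highest likelihood."""
    return max(models, key=lambda word: forward_likelihood(observations, *models[word]))


if __name__ == "__main__":
    print(recognize([0, 0, 1, 1]))
```

In this toy setting the "training" that the text describes corresponds to choosing the numbers in the matrices; the search step corresponds to comparing likelihoods across candidate models.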
Major challenges to the progress of speech recognition software and systems development have historically been that (a) continuous speech (CS) is very much more difficult to recognize than single isolated-word speech and (b) different speakers have very different voice patterns from each other. The former is primarily because in continuous speech, we pronounce and enunciate words depending on their context, our moods, our stress state, and on the speed with which we speak. The latter is because of physiological, anatomical, age, sex, regional-accent, and other differences. Furthermore, another major problem has been how to reproducibly get the sound (natural speech) into the recognition system without loss or distortion of the information it contains. It turns out that the type and positioning of the microphone(s) or pickups one uses are critical. Head-mounted oral microphones, and the exact positioning thereof, have been particularly thorny problems despite their superior frequency response. Some attempts to use ear pickup microphones (see, e.g., Bergeron, supra) have shown fair results despite the known poorer passage of high frequency content through the bones of the skull. This result speaks volumes about the positioning difficulties of mouth microphones, which should give substantially superior performance given their known broader frequency content.
Recently, two companies, IBM and Dragon Systems, have offered commercial PC-based software products (IBM ViaVoice™ and Dragon NaturallySpeaking™) that can recognize continuous speech with fair accuracy after the user conducts carefully designed mandatory training or “enrollment” sessions with the software. Even with such enrollment, the accuracy is approximately 95% under controlled conditions involving careful microphone placement and minimal or no background noise. If, during use, there are other speakers in the room having separate conversations (or there are reverberant echoes present), then numerous irritating recognition errors can result. Likewise, if the user moves the vendor-recommended directional or noise-canceling microphone away from, or too far from, a position directly in front of the lips, or speaks too softly, then the accuracy goes down precipitously. It is no wonder that speech recognition software is not yet significantly utilized in mission-critical applications.
The inventors herein address the general lack of robustness described above in a manner such that accuracy during speaking can be improved, training (enrollment) can be a more robust, if not continuous, improvement process, and one may speak softly and indeed even “mouth words” without significant audible sound generation, yet retain recognition performance. Finally, the inventors have also devised a means for nearby and/or conversing speakers using voice-recognition systems to have their systems automatically adapted to avoid operational interference with each other. Such interference has been of serious concern when trying to insert voice recognition capabilities into a busy office area wherein numerous interfering (overheard) conversations cannot easily be avoided.
The additional and more reproducible artificial excitations of the invention may also be used to increase the acoustic uniqueness of utterances, thus speeding up speech recognition processing for a given recognition-accuracy requirement. Such a speedup could, for example, be realized from the reduction in the number of candidate utterances needing software comparison. In fact, such reductions in utterance identification possibilities also improve recognition accuracy, as there are fewer incorrect conclusions to be made.
Utterance or speech recognition practiced using the invention may have any purpose including, but not limited to: (1) talking to, commanding or conversing with a local or remote computer, computer-containing products, telephony products or speech-conversant products (or with other persons using them); (2) talking to or commanding a local or remote system that converts recognized speech or commands to recorded or printed text or to programmed actions of any sort (e.g., voice-mail interactive menus, computer-game control systems); (3) talking to another person or persons, locally or remotely located, wherein one's recognized speech is presented to the other party as text or as a synthesized voice (possibly in his/her different language); (4) talking to or commanding any device (or connected person) discreetly or in apparent silence; (5) user identification or validation wherein security is increased over prior-art speech fingerprinting systems due to the additional information available in the speech signal or even the ability to manipulate artificial excitations unbeknownst to the user; (6) allowing multiple equipped speakers to each have their own speech recognized free of interference from the other audible speakers (regardless of their remote locations or collocation); (7) adapting a user's “speech” output to obtain better recognition-processing performance, as by adding individually customized artificial content for a given speaker and making that content portable if not network-available. (This could also eliminate or minimize retraining of new recognition systems by new users.)