The present invention is directed generally to voice recognition, and, more particularly, to a means and method for enhancing or replacing the natural excitation of a living body's vocal tract by artificial excitation means.
The ability to vocally converse with a computer is a grand and worthy goal of hundreds of researchers, universities and institutions all over the world. Such a capability is widely expected to revolutionize communications, learning, commerce, government services and many other activities by making the complexities of technology transparent to the user. In order to converse, the computer must first recognize what words are being said by the human user and then must determine the likely meaning of those words and formulate meaningful and appropriate ongoing responses to the user. The invention herein addresses the recognition aspect of the overall speech understanding problem.
It is well known that the human vocal system can be roughly approximated as a source driving a digital (or analog) filter; see, e.g., M. Al-Akaidi, “Simulation model of the vocal tract filter for speech synthesis”, Simulation, Vol. 67, No. 4, pp. 241-246 (October 1996). The source is the larynx and vocal cords, and the filter is the set of resonant acoustic cavities and/or resonant surfaces created and modified by the many movable portions (articulators) of the throat, tongue, mouth and throat surfaces, lips and nasal cavity. These articulators include the lips, mandible, tongue, velum and pharynx. In essence, the source creates one or both of a quasi-periodic vibration (voiced sounds) or white noise (unvoiced sounds), and the many vocal articulators modify that excitation in accordance with the vowels, consonants or phonemes being expressed. In general, the frequencies between 600 and 4,000 hertz contain the bulk of the acoustic information necessary for human speech perception (B. Bergeron, “Using an intraural microphone interface for improved speech recognition”, Collegiate Microcomputer, Vol. 8, No. 3, pp. 231-238 (August 1990)), but there is some human-hearable information up to roughly 10,000 hertz and some important information below 600 hertz. The variable resonances of the human vocal tract are referred to as formants and are denoted F1, F2, and so on. In general, the lower-frequency formants F1 and F2 lie in the range of 250 to 3,000 hertz and carry a major portion of the human-hearable information about many articulated sounds and phonemes. Although the formants are principal features of human speech, they are by no means the only features, and even the formants themselves dynamically change frequency and amplitude depending on context, speaking rate and mood.
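The source-filter model described above can be illustrated with a minimal sketch: a quasi-periodic impulse train (the "source") is passed through two second-order resonators standing in for formants F1 and F2 (the "filter"). The sample rate, pitch and formant values below are invented for illustration only.

```python
import numpy as np

fs = 16000                      # sample rate (Hz)
f0 = 120                        # pitch of the voiced source (Hz)
t = np.arange(int(0.5 * fs))    # 0.5 s of samples

# Source: a quasi-periodic impulse train models voiced excitation;
# white noise would model unvoiced (aspirated) excitation.
source = np.zeros_like(t, dtype=float)
source[::fs // f0] = 1.0

def resonator(x, freq, bw, fs):
    """Second-order IIR resonance at `freq` Hz with bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a1, a2 = -2 * r * np.cos(theta), r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] - a1 * (y[n - 1] if n >= 1 else 0.0) \
                    - a2 * (y[n - 2] if n >= 2 else 0.0)
    return y

# Filter: cascade two resonators approximating an /a/-like F1 and F2.
speech = resonator(resonator(source, 700, 100, fs), 1200, 120, fs)
```

Varying the resonator center frequencies over time mimics the action of the articulators moving the formants.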
Indeed, only experts have been able to manually determine what a person has said from a printout of the spectrogram of the utterance, and even this analysis involves best guesses. Thus, automated speech recognition is one of the grand problems of the linguistic and speech sciences. In fact, only the recent application of trainable stochastic (statistics-based) models running on fast microprocessors (e.g., 200 MHz or higher) resulted in 1998's introduction of inexpensive continuous-speech (CS) software products. In the stochastic models used in such software, referred to as Hidden Markov Models (HMMs), the statistics of varying enunciation and temporal delivery are captured in oral training sessions and made available as models for the internal search engine(s).
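The HMM decoding step mentioned above can be sketched with a toy example: a Viterbi search over a two-state model with three discrete acoustic symbols. The states, symbols and probabilities below are invented purely for illustration and bear no relation to any real recognizer's models.

```python
import numpy as np

# Toy HMM: two hidden "phoneme" states, three discrete acoustic symbols.
# All probabilities here are invented for illustration only.
states = ["S1", "S2"]
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],    # P(symbol | S1)
                 [0.1, 0.3, 0.6]])   # P(symbol | S2)

def viterbi(obs):
    """Most likely hidden-state sequence for an observed symbol sequence."""
    T = len(obs)
    delta = np.zeros((T, 2))          # best path probability ending in state j
    psi = np.zeros((T, 2), dtype=int) # best predecessor state
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        for j in range(2):
            scores = delta[t - 1] * trans[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * emit[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):     # backtrace
        path.append(int(psi[t, path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi([0, 1, 2]))  # → ['S1', 'S1', 'S2']
```

A real recognizer replaces the discrete symbols with acoustic feature vectors and searches over thousands of context-dependent states, but the dynamic-programming principle is the same.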
Historically, the major challenges to progress in speech-recognition software and systems have been that (a) continuous speech (CS) is much more difficult to recognize than isolated-word speech and (b) different speakers have very different voice patterns. The former is primarily because, in continuous speech, we pronounce and enunciate words depending on their context, our moods, our stress state, and the speed with which we speak. The latter arises from physiological, age, sex, anatomical, regional-accent and other differences. Furthermore, another major problem has been how to reproducibly get the sound (natural speech) into the recognition system without loss or distortion of the information it contains. It turns out that the type and positioning of the microphone(s) or pickups one uses are critical. Head-mounted oral microphones, and their exact positioning, have been a particularly thorny problem despite their superior frequency response. Some attempts to use ear-pickup microphones (see, e.g., Bergeron, supra) have shown fair results despite the known poorer passage of high-frequency content through the bones of the skull. This result underscores the positioning difficulties of mouth microphones, which should otherwise give substantially superior performance given their broader frequency content.
Recently, two companies, IBM and Dragon Systems, have offered commercial PC-based software products (IBM ViaVoice™ and Dragon Naturally Speaking™) that can recognize continuous speech with fair accuracy after the user conducts carefully designed mandatory training or “enrollment” sessions with the software. Even with such enrollment, accuracy is approximately 95% under controlled conditions involving careful microphone placement and minimal or no background noise. If, during use, there are other speakers in the room having separate conversations (or there are reverberant echoes present), then numerous irritating recognition errors can result. Likewise, if the user moves the vendor-recommended directional or noise-canceling microphone away, or too far, from directly in front of the lips, or speaks too softly, then accuracy drops precipitously. It is no wonder that speech-recognition software is not yet significantly utilized in mission-critical applications.
The inventors herein address the general lack of robustness described above in a manner such that accuracy during speaking can be improved, training (enrollment) can become a more robust, if not continuous, improvement process, and one may speak softly or even “mouth words” without significant audible sound generation, yet retain recognition performance. Finally, the inventors have also devised a means whereby nearby and/or conversing speakers using voice-recognition systems automatically have their systems adapted to purposefully avoid operational interference with one another. This aspect has been of serious concern when trying to introduce voice-recognition capabilities into a busy office area wherein numerous interfering (overheard) conversations cannot easily be avoided.
The additional and more reproducible artificial excitations of the invention may also be used to increase the acoustic uniqueness of utterances, thus speeding up speech-recognition processing for a given recognition-accuracy requirement. Such a speedup could, for example, be realized from the reduction in the number of candidate utterances requiring software comparison. In fact, such reductions in the number of possible utterance identifications also improve recognition accuracy, as there are fewer incorrect conclusions to be made.
Utterance or speech recognition practiced using the invention may have any purpose including, but not limited to: (1) talking to, commanding or conversing with local or remote computers, computer-containing products, telephony products or speech-conversant products (or with other persons using them); (2) talking to or commanding a local or remote system that converts recognized speech or commands to recorded or printed text or to programmed actions of any sort (e.g., voice-mail interactive menus, computer-game control systems); (3) talking to another person or persons, locally or remotely located, wherein one's recognized speech is presented to the other party as text or as a synthesized voice (possibly in his/her different language); (4) talking to or commanding any device (or connected person) discreetly or in apparent silence; (5) user identification or validation, wherein security is increased over prior-art speech-fingerprinting systems due to the additional information available in the speech signal, or even the ability to manipulate artificial excitations without the user's awareness; (6) allowing multiple equipped speakers to each have their own speech recognized free of interference from the other audible speakers (regardless of their remote location or collocation); (7) adapting a user's “speech” output to obtain better recognition-processing performance, as by adding individually customized artificial content for a given speaker and making that content portable if not network-available. (This could also eliminate or minimize retraining of new recognition systems by new users.)
In accordance with the present invention, a means and method are disclosed for enhancing or replacing the natural excitation of the human vocal tract by artificial excitation means, wherein the artificially created acoustics present additional spectral, temporal or phase data useful for (1) enhancing the machine-recognition robustness of audible speech or (2) enabling more robust machine recognition of relatively inaudible mouthed or whispered speech. The artificial excitation may be arranged to be audible or inaudible, may be designed to be non-interfering with another user's similar means, may be used in one or both of a vocal content-enhancement mode or a complementary vocal tract-probing mode, and may be used for the recognition of audible or inaudible continuous speech or isolated spoken commands.
Specifically, an artificial acoustic excitation means is provided for acoustic coupling into a functional vocal tract, working in cooperation with a speech-recognition system, wherein the artificial excitation coupling characteristics provide information useful to the identification of speech by the system.
The present invention extends the performance and applicability of speech-recognition in the following ways:
(1) Improves speech-recognition accuracy and/or speed for audible speech;
(2) Eliminates recognition-interference (accuracy degradation) due to competing speakers or voices, (e.g., as in a busy office with many independent speakers);
(3) Newly allows for voice-recognition of silent or mouthed/whispered speech (e.g., for discretely interfacing with speech-based products and devices); and
(4) Improves security for speech-based user-identification or user-validation.
In essence, the human vocal tract is artificially excited, directly or indirectly, to produce sound excitations, which are articulated by the speaker. These sounds, because they are artificially excited, have far more latitude than the familiar naturally excited voiced and aspirated human sounds. For example, they may or may not be audible, may excite natural vocal articulators (audibly or inaudibly) and/or may excite new articulators (audibly or inaudibly).
Artificially excited “speech” output may be superimposed on normal speech to increase the raw characteristic information content. Artificially excited output may be relatively or completely inaudible, thus also allowing for good recognition accuracy while whispering or even mouthing words. Artificial content may help discern between competing speakers thus equipped, whether they are talking to each other or are in separate cubicles. Artificial content may also serve as a user voiceprint.
Systems taking advantage of this technology may be used for continuous speech or command-style discrete speech. Such systems may be trained using one or both of natural speech and artificial speech.
The artificial excitations may incorporate any of several features including: (a) broadband excitation, (b) narrow band excitation(s) such as a harmonic frequency of a natural formant, (c) multiple tones wherein the tones phase-interact with articulation (natural speech hearing does not significantly involve phase), (d) excitations which are delivered (or processed) only as a function of the success of ongoing natural speech recognition, and (e) excitations which are feedback-optimized for each speaker.
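The first three excitation classes above can be sketched as simple signal generators; the band limits, tone frequencies and phases below are invented for illustration and are not prescribed values from the invention.

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs          # 1 s of samples at 16 kHz

# (a) broadband excitation: white noise spanning the vocal-tract band
broadband = np.random.default_rng(0).standard_normal(fs)

# (b) narrowband excitation: a single tone placed, e.g., near a
# harmonic of a natural formant
narrowband = np.sin(2 * np.pi * 2400 * t)

# (c) multi-tone excitation: closely spaced tones whose relative
# phases can interact as the articulators move
multitone = sum(np.sin(2 * np.pi * f * t + ph)
                for f, ph in [(2000, 0.0),
                              (2050, np.pi / 3),
                              (2100, np.pi / 2)])
```

Classes (d) and (e), in contrast, are control strategies rather than signal shapes: the generator output would be gated or re-tuned based on feedback from the ongoing recognition results.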
The user need not be aware of the added acoustic information nor of its processing.
Consumer/business products incorporating the technology may include computers, PCs, office-wide systems, PDAs, terminals, telephones, games, or any speech-conversant, speech-controlled or sound-controlled appliance or product. With the discreet, inaudible option, such products could be used in public with relative privacy. Additional police, military and surveillance products are likely.