The drive to make human-computer interaction more user-friendly and the desire to support human-human communication and cooperation by computer have led researchers to develop machines capable of recognizing and understanding spoken input. Speech recognition systems have been built which provide good results when speech is spoken audibly with normal vocal effort in relatively quiet environments.
Speech recognition systems perform much worse when recognizing barely audible speech, i.e. whispered speech or speech uttered with very low vocal effort. That is because the typical (human and computer) speech perception mechanism requires the speech signal to be transmitted through air and perceived by capturing the resulting air pressure changes. Consequently, the speech signal must be above certain decibel thresholds to have a significant impact on the surrounding air particles.
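The decibel scale mentioned above relates a measured RMS sound pressure to a fixed reference pressure. The following is a minimal sketch of that relation, assuming the conventional reference of 20 micropascals for sound in air; the function name `spl_db` and the example pressures are illustrative assumptions and are not taken from any system described here.

```python
import math

# Conventional reference RMS pressure for sound in air, in pascals (20 micropascals).
P_REF = 20e-6


def spl_db(p_rms):
    """Return the sound pressure level in dB for an RMS pressure given in pascals."""
    if p_rms <= 0:
        raise ValueError("RMS pressure must be positive")
    return 20.0 * math.log10(p_rms / P_REF)


if __name__ == "__main__":
    # Hypothetical pressures: each tenfold drop in pressure lowers the level by 20 dB.
    for p in (0.02, 0.002, 0.0002):
        print(f"{p} Pa -> {spl_db(p):.1f} dB")
```

Because the scale is logarithmic, a signal whose pressure amplitude is ten times smaller sits 20 dB lower, which is why very low vocal effort quickly falls below what an air-transmission microphone can usefully capture.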
Furthermore, speech recognition systems based on this mechanism cannot recognize non-audible or silently mouthed speech, since no air pressure changes can be measured for this kind of speech. As a result, applications which rely on confidential user input, such as password-restricted access systems, are difficult to drive with speech. Also, any kind of confidential conversation, for example a phone call in a public place, is in danger of being overheard. In general, speech-based human-machine interaction or human-human communication cannot be pursued without being audible to bystanders and thus potentially disturbing to the surroundings.
Speech recognition systems also perform much worse when recognizing air-transmitted speech signals uttered in a noisy environment, such as in a car, in an airport, or in a crowded restaurant. That is because any non-speech sound and noise in the speaker's environment is transmitted through the same surrounding air and thus overlaps with the relevant speech signal. Current solutions such as close-speaking and directional microphones rely on audible speech. Furthermore, solutions such as microphone arrays are expensive and less mobile. Although algorithms such as noise cancellation, beamforming, and source separation are under heavy investigation, speech recognition performance in the presence of noise is still suboptimal.
Thus, the need arises to provide mechanisms to overcome the limitations of speech recognition applications due to the problem of confidentiality and/or disturbance, and the lack of robustness in noisy environments.
Based on the various types of speech recognition systems currently sold in the marketplace, such systems may be divided into five types: recognition engines, command & control systems, dictation systems, special purpose speech-driven applications, and speech translation systems.
Recognition engines are primarily software that takes a spoken utterance as input and produces a text hypothesis as output. The utterance has to be audibly spoken, since speech is usually captured with an air-transmission microphone. Recognition engines may be either discrete-word or continuous-speech systems. With discrete-word systems, the recognizer can only recognize a single word at a time, so the user must pause after each word until the recognizer produces the output for the last word spoken.
Command & Control recognition systems are generally used for single-word or few-word commands that affect and control some system by speech. Most have a small, fixed vocabulary of between ten and one hundred commands that can be recognized in any one situation. No Command & Control system can be used in a silent, confidential speech-driven mode. Instead, the use of such a system either disturbs the public or exposes all audible commands to being overheard by bystanders, a particularly serious problem for verification, authentication, and access control systems.
In contrast to Command & Control systems, dictation recognition systems must handle a very large vocabulary, on the order of tens or hundreds of thousands of possible words. For dictation systems, which are useful for applications such as SMS text messaging or email, the lack of robustness in noisy environments is a major challenge. There is no continuous-speech, large-vocabulary recognition engine or dictation system that handles speech recorded by myoelectric signals, which remain unaffected by surrounding noise.
There are also commercially available special purpose speech-driven applications. Such applications have been developed for particular purposes such as pocket-size personal digital assistants. In such scenarios, the user may carry the device all day and use it continuously to take notes, generate reports, and complete forms. Particularly prominent are the medical and legal fields. Here, the filed information is often confidential. Furthermore, the constant usage of such carry-along systems may be disturbing to coworkers.
A speech translation system recognizes speech input spoken in one language, translates the recognized words into another language, and speaks the translated words aloud. This application gives users a personal translator at their disposal. However, if translation is performed instantly, the listener is confronted with two interfering speech sources: the original speech in the speaker's mother tongue and the audible translated speech. A translation system that handles speech recorded by myoelectric signals would allow for processing of silently mouthed speech and thus would no longer confuse listeners. There are no speech translation systems that handle speech recorded by myoelectric signals.
Another translation application is the Phraselator, which is currently used by the Department of Defense. It primarily consists of a microphone, an automatic speech recognition module, a language translation module, and a synthesizer with loudspeaker. The system is routinely used in military environments, which are typically very noisy. The system's performance is known to degrade dramatically in the presence of noise. The performance can be somewhat improved when directional microphones are applied and the user holds the microphone close to the mouth of the speaker. However, not even the best directional microphones give satisfactory performance in noisy military environments. Thus, there is a great need for an apparatus that is immune to surrounding noise in military and other environments.
Despite the various benefits a conventional speech-driven interface provides to humans, there are three major drawbacks. First, the audible (i.e. acoustic) speech signal prohibits a confidential conversation with or through a device. Also, talking can be extremely disturbing to others, especially in libraries or during meetings. Second, speech recognition performance degrades drastically in adverse environmental conditions such as in restaurants, cars, or trains. Acoustic model adaptation can compensate for these effects to some degree. However, the pervasive nature of mobile phones challenges this approach. Performance is also poor when sound production is limited, such as underwater. Third, conventional speech-driven interfaces cannot be used by speech-handicapped people, for example those without vocal cords.
Conventional speech recognition systems routinely use air-conducting microphones to convert speech into electric signals. These microphones pick up the acoustic signal traveling through air away from the sound source in any direction. If several sound sources exist, such as in noisy environments, directional microphones may allow separate sound sources to be distinguished. However, no such microphone can differentiate between sounds coming from human voices and sounds coming from surrounding noise. Reverberation, distance from the source, and moving sources add to the complexity of overlapping sounds.
Bone-conducting microphones use the fact that the sound vibration during the speaking act provokes vibrations of the bones and the skin of the body, especially of the skull. Although the quality of the bone-conducted signal is not equivalent to that of the air-conducted signal, it carries information that is good enough to reproduce spoken information. Several bone-conducting microphones are available on the market. These are all worn externally, creating an indirect contact with the bone at places like the scalp, ear canal, mastoid bone, throat, tooth, cheek bone, and temples. With the exception of tooth microphones, all bone-conducting microphones have to compensate for the information loss resulting from the presence of skin that stretches between the bones and the sensor. Therefore, the sensors have to apply some pressure, which is very uncomfortable for the wearer, especially for all-day usage. For some users, scalp microphones can lead to headaches, ear-canal microphones to ear infections, and throat microphones may even provoke a strangulation sensation. Tooth microphones interfere with the speaking act and are therefore difficult to apply in speech communication devices.
All of the above microphones require some vocal effort during the speaking act, since the sound transmission relies either on measurable pressure changes caused by moving air particles or on the vibration of the human body due to the vibration of the vocal cords. Consequently, none of these microphones is able to capture silently spoken or mouthed speech, i.e. speech without any vocal effort.
A new alternative is a microphone which picks up electrical potential differences resulting from muscle activity. Speech is produced by muscles of the articulatory apparatus. Thus, when the activity of the articulatory muscles is captured by surface electromyography, the resulting signals contain information relevant to the interpretation of speech. This process works even if the speech is produced silently, i.e. without vocal effort.
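As a rough illustration of how muscle activity becomes visible in a sampled surface electromyography signal, the sketch below computes the root-mean-square amplitude of successive frames. It is a minimal example under stated assumptions, not the method of any system or publication discussed here: the function name `frame_rms`, the frame length, the hop size, and the sample values are all hypothetical.

```python
import math


def frame_rms(samples, frame_len, hop):
    """Return the RMS amplitude of each complete frame of a sampled signal.

    samples   -- sequence of signal values (hypothetical surface EMG samples)
    frame_len -- number of samples per frame (assumed value, chosen by the caller)
    hop       -- number of samples between the starts of consecutive frames
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    rms_values = []
    # Only complete frames are used; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms_values.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms_values


if __name__ == "__main__":
    # Hypothetical signal: a resting segment followed by a burst of activity.
    signal = [0, 0, 0, 0, 3, -3, 3, -3]
    print(frame_rms(signal, frame_len=4, hop=4))
```

In this toy input, the resting frame yields a low value and the active frame a high one, regardless of whether any sound was produced. A practical recognizer would need considerably richer features than frame energy alone.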
In A. D. C. Chan, K. Englehart, B. Hudgins, and D. F. Lovely, “Hidden Markov model classification of myoelectric signals in speech,” Engineering in Medicine and Biology Magazine, IEEE, vol. 21, pp. 143-146, Sept. 2002, the authors showed that the myoelectric signal (MES) from articulatory face muscles contains sufficient information to discriminate words. This holds even when the words are spoken non-audibly, i.e. when no acoustic signal is produced. See C. Jorgensen, D. Lee, and S. Agabon, “Sub auditory speech recognition based on emg/epg signals,” in Proceedings of the International Joint Conference on Neural Networks, 2003.
To date, the practicability of MES-based speech recognition is still limited. First, the surface electrodes require physical contact with the speaker's skin. Second, experiments are still restricted to isolated word recognition. Third, today's systems are far from robust, since they only work when training and test conditions match. Just like conventional speech recognizers, MES-based systems are heavily influenced by speaker dependencies, such as speaking style, speaking rate, and pronunciation idiosyncrasies. Beyond that, the myoelectric signal is affected by even slight changes in electrode position, temperature, or tissue properties. See “Selected topics in surface electromyography for use in the occupational setting: Expert perspective,” Mar. 1992, DHHS(NIOSH) Publication No 91-100. This phenomenon is referred to as “session dependence,” in analogy to the “channel dependence” of a conventional speech recognizer, which results from the microphone quality, the environmental noise, and the signal transmission of the acoustic signal. The loss in performance caused by session dependence in MES-based speech recognition is significantly higher than that resulting from channel conditions in conventional systems. Despite this, only session-dependent MES-based speech recognition systems have been developed so far.