The present invention relates to personal mobile computing devices. More particularly, the present invention relates to an apparatus, system and method for enhancing speech recognition in mobile computing devices.
Mobile devices are small electronic computing devices sometimes referred to as personal digital assistants (PDAs). Many such mobile devices are handheld devices, or palm-size devices, which comfortably fit within the hand. One commercially available mobile device is sold under the trade name HandHeld PC (or H/PC) having software provided by Microsoft Corporation of Redmond, Wash.
Generally, the mobile device includes a processor, random access memory (RAM), an input device such as a keyboard, and a display, wherein the keyboard can be integrated with the display, such as in a touch-sensitive display. A communication interface is optionally provided and is commonly used to communicate with a desktop computer. A replaceable or rechargeable battery powers the mobile device. Optionally, the mobile device can receive power from an external power source that overrides or recharges the built-in battery, such as a suitable AC or DC adapter, or a powered docking cradle.
In one common application, the mobile device is used in conjunction with the desktop computer. For example, the user of the mobile device may also have access to, and use, a desktop computer at work or at home. The user typically runs the same types of applications on both the desktop computer and on the mobile device. Thus, it is quite advantageous for the mobile device to be designed to be coupled to the desktop computer to exchange information with, and share information with, the desktop computer.
As the mobile computing device market continues to grow, new developments can be expected. For example, mobile devices can be integrated with cellular or digital wireless communication technology to provide a mobile computing device which also functions as a mobile telephone. Thus, cellular or digital wireless communication technology can provide the communication link between the mobile device and the desktop (or other) computer. Further, speech recognition can be used to record data or to control functions of one or both of the mobile computing device and the desktop computer, with the user speaking into a microphone on the mobile device and with signals being transmitted to the desktop computer based upon the speech detected by the microphone.
Several problems arise when attempting to perform speech recognition, at the desktop computer, of words spoken into a remote microphone such as a microphone positioned on a mobile device. First, the signal-to-noise ratio of the speech signals provided by the microphone drops as the distance between the microphone and the user's mouth increases. With a typical mobile device being held in a user's palm up to a foot from the user's mouth, the resulting signal-to-noise ratio drop may be a significant speech recognition obstacle. Also, noise generated within the mobile device further lowers the signal-to-noise ratio of the speech signals, because the sources of that internal noise are in close proximity to the microphone, which is typically positioned on a housing of the mobile device. Second, due to bandwidth limitations of digital and other wireless communication networks, the speech signals received at the desktop computer will be of lower quality than speech signals from a desktop microphone. Thus, with different desktop and telephony bandwidths, speech recognition results will vary when using a mobile computing device microphone instead of a desktop microphone.
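The effect of microphone distance on signal-to-noise ratio can be illustrated with a short calculation. The following is only a minimal sketch under stated assumptions, not a model taken from this disclosure: it assumes a free-field point source, so that received speech power falls off with the square of the distance, and it assumes the noise power at the microphone stays constant. The function names and the 30 dB reference value are hypothetical and chosen for illustration.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Return the signal-to-noise ratio in decibels from linear power values."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def snr_at_distance(snr_ref_db: float, ref_distance: float, distance: float) -> float:
    """Estimate the SNR at `distance`, given the SNR measured at `ref_distance`.

    Assumption (not from the source text): free-field point source, so speech
    power scales as 1/d^2 while noise power is unchanged. Under that assumption
    the SNR falls by 20*log10(d / d_ref) dB, about 6 dB per doubling of distance.
    """
    if ref_distance <= 0 or distance <= 0:
        raise ValueError("distances must be positive")
    return snr_ref_db - 20.0 * math.log10(distance / ref_distance)


if __name__ == "__main__":
    # Hypothetical figures: 30 dB SNR with the microphone 1 inch from the mouth.
    for inches in (1, 2, 6, 12):
        print(f"{inches:>2} in: {snr_at_distance(30.0, 1.0, inches):.1f} dB")
```

Under these assumptions, moving the microphone from one inch to one foot from the mouth lowers the signal-to-noise ratio by roughly 21.6 dB, leaving about 8.4 dB in this hypothetical case. Real rooms, reverberation, and microphone directivity will change the exact figures.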
The aforementioned problems are not limited to speech recognition performed at a desktop computer. In many different speech recognition applications, it is very important, and can be critical, to have a clear and consistent audio input representing the speech to be recognized provided to the automatic speech recognition system. Two categories of noise which tend to corrupt the audio input to the speech recognition system are ambient noise and noise generated from background speech. There has been extensive work done in developing noise cancellation techniques in order to cancel ambient noise from the audio input. Some techniques are already commercially available in audio processing software, or integrated in digital microphones, such as universal serial bus (USB) microphones.
Dealing with noise related to background speech has been more problematic. This can arise in a variety of different, noisy environments. For example, where the speaker of interest is talking in a crowd, or among other people, a conventional microphone often picks up the speech of speakers other than the speaker of interest. Basically, in any environment in which other persons are talking, the audio signal generated from the speaker of interest can be compromised.
One prior solution for dealing with background speech is to provide an on/off switch on the cord of a headset or on a handset. The on/off switch has been referred to as a “push-to-talk” button, and the user is required to push the button prior to speaking. When the user pushes the button, it generates a button signal. The button signal indicates to the speech recognition system that the speaker of interest is speaking, or is about to speak. However, some usability studies have shown that this type of system is neither satisfactory to users nor desired by them. Thus, incorporating this type of feature in a mobile device may produce unsatisfactory results.
In addition, there has been work done in attempting to separate background speakers picked up by microphones from the speaker of interest (or foreground speaker). This has worked reasonably well in clean office environments, but has proven insufficient in highly noisy environments.
In yet another prior technique, a signal from a standard microphone has been combined with a signal from a throat microphone. The throat microphone registers laryngeal behavior indirectly by measuring the change in electrical impedance across the throat during speaking. The signal generated by the throat microphone was combined with the signal from the conventional microphone, and models were generated that modeled the spectral content of the combined signals.
An algorithm was used to map the noisy, combined standard and throat microphone signal features to a clean standard microphone feature. This was estimated using probabilistic optimum filtering. However, while the throat microphone is quite immune to background noise, the spectral content of the throat microphone signal is quite limited. Therefore, using it to map to a clean estimated feature vector was not highly accurate. This technique is described in greater detail in Frankco et al., COMBINING HETEROGENEOUS SENSORS WITH STANDARD MICROPHONES FOR NOISY ROBUST RECOGNITION, Presentation at the DARPA ROAR Workshop, Orlando, Fla. (2001). In addition, wearing a throat microphone is an added inconvenience to the user.