A speech-based user interface acquires speech input from a user for further processing. Typically, the speech acquired by the interface is processed by an automatic speech recognition system (ASR). Ideally, the interface responds only to the user speech that is specifically directed at the interface, but not to any other sounds.
This requires that the interface recognizes when it is being addressed, and only responds at that time. When the interface does accept speech from the user, the interface must acquire and process the entire audio signal for the speech. The interface must also determine precisely the start and the end of the speech, and not process signals significantly before the start of the speech and after the end of the speech. Failure to satisfy these requirements can cause incorrect or spurious speech recognition.
A number of speech-based user interfaces are known. These can be roughly categorized as follows.
Push-to-Talk
With this type of interface, the user must press a button only for the duration of the speech. Thus, the start and end of speech signals are precisely known, and the speech is only processed while the button is pressed.
Hit-to-Talk
Here, the user briefly presses a button to indicate the start of the speech. It is the responsibility of the interface to determine where the speech ends. As with the push-to-talk interface, the hit-to-talk interface attempts to ensure that speech is processed only after the button has been pressed.
However, there are a number of situations where the use of a button may be impossible, inconvenient, or simply unnatural, for example, any situation where the user's hands are otherwise occupied, the user is physically impaired, or the interface precludes the inclusion of a button. Therefore, hands-free interfaces have been developed.
Hands-Free
With hands-free speech-based interfaces, the interface itself determines when speech starts and ends.
Of the three types of interface, the hands-free interface is arguably the most natural, because the interface does not require an express signal to initiate or terminate processing of the speech. In most conventional hands-free interfaces, only the audio signal acquired by the primary sensor, i.e., the microphone, is analyzed to make start-of-speech and end-of-speech decisions.
However, the hands-free interface is the most difficult to implement because it is difficult to determine automatically when the interface is being addressed by the user, and when the speech starts and ends. This problem becomes particularly difficult when the interface operates in a noisy or reverberant environment, or in an environment where there is additional unrelated speech.
One conventional solution uses “attention words.” The attention words are intended to indicate expressly the start and/or end of the speech. Another solution analyzes an energy profile of the audio signal. Processing begins when there is a sudden increase in the energy, and stops when the energy decreases. However, this solution can fail in a noisy environment, or an environment with background speech.
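The energy-profile approach described above can be sketched as follows. This is a minimal illustration, not the method of any particular interface; the frame length and the on/off thresholds are assumed values that a real system would adapt to the ambient noise floor.

```python
import numpy as np

def detect_speech_by_energy(signal, frame_len=256, on_thresh=0.01, off_thresh=0.005):
    """Return (start_frame, end_frame) of the first region where the
    per-frame energy rises above on_thresh and later falls below off_thresh.
    Thresholds are illustrative; fixed thresholds fail in noisy environments."""
    n_frames = len(signal) // frame_len
    energies = [np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                for i in range(n_frames)]
    start = end = None
    for i, e in enumerate(energies):
        if start is None and e > on_thresh:
            start = i          # sudden increase in energy: begin processing
        elif start is not None and e < off_thresh:
            end = i            # energy drops back down: stop processing
            break
    return start, end
```

Because the decision depends only on signal energy, any loud background sound (including unrelated speech) will trigger the detector, which is the failure mode noted above.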
A zero-crossing rate of the audio signal can also be used. Zero crossings occur when the speech signal changes between positive and negative values. When the energy and the zero-crossing rate are at predetermined levels, speech is probably present.
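The zero-crossing component of such a detector can be sketched as below. The function simply measures how often adjacent samples differ in sign; low-frequency voiced speech yields a low rate, while broadband noise or unvoiced fricatives yield a high rate. The threshold levels mentioned above would be chosen empirically.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ,
    i.e., how often the waveform crosses zero within the frame."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))
```

For example, a low-frequency sine wave has a rate near zero, whereas white noise has a rate near 0.5, so the rate helps separate voiced speech from noise-like signals when combined with an energy measure.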
Another class of solutions uses secondary sensors to acquire secondary measurements of the speech signal, such as glottal electromagnetic sensors (GEMS), physiological microphones (P-mics), bone conduction sensors, and electroglottographs. However, all of the above secondary sensors need to be mounted on the user of the interface. This can be inconvenient in any situation where it is difficult to forward the secondary signal to the interface. That is, the user may need to be ‘tethered’ to the interface.
An ideal secondary sensor for a hands-free, speech-based interface should be able to operate at a distance from the user. Video cameras could be used as effective far-field sensors for detecting speech. Video images can be used for face detection and tracking, and to determine when the user is speaking. However, cameras are expensive, and detecting faces and recognizing moving lips is tedious, difficult and error prone.
Another secondary sensor uses the Doppler effect. An ultrasonic transmitter and receiver are deployed at a distance from the user. A transmitted ultrasonic signal is reflected by the face of the user. As the user speaks, parts of the face move, which changes the frequency of the reflected signal. Measurements obtained from the secondary sensor are used in conjunction with the audio signal acquired by the primary sensor to detect when the user speaks.
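The magnitude of the frequency change can be estimated from the standard Doppler relation; because the signal travels to the face and back, the reflection doubles the one-way shift, giving approximately 2·f·v/c. The carrier frequency and facial velocity below are illustrative assumptions, not values from any particular sensor.

```python
def doppler_shift(f_tx, v, c=343.0):
    """Approximate frequency shift (Hz) of an ultrasonic tone of
    frequency f_tx reflected off a surface moving toward the sensor
    at velocity v (m/s); c is the speed of sound in air.
    Reflection doubles the one-way Doppler shift: shift ~ 2*f_tx*v/c."""
    return f_tx * 2.0 * v / c
```

For an assumed 40 kHz carrier and facial motion of 0.1 m/s, the shift is on the order of 23 Hz, so the articulatory movement modulates a narrow band around the carrier.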
In addition to being usable at a distance from the user, the Doppler sensor differs from conventional secondary sensors in another, crucial way. The measurements provided by conventional secondary sensors are usually linearly related to the speech signal itself. The GEMS sensor provides measurements of the excitation function of the vocal tract. The signals acquired by P-mics, throat microphones, and bone-conduction microphones are essentially filtered versions of the speech signal itself.
In contrast, the signal acquired by the Doppler sensor is not linearly related to the speech signal. Rather, the signal expresses information related to the movement of the face while speaking. The relationship between facial movement and the speech is not obvious, and certainly not linear.
However, the Doppler-based interface uses a support vector machine (SVM) to classify the audio signal as speech or non-speech. The classifier must first be trained off-line on joint speech and Doppler recordings. Consequently, the performance of the classifier is highly dependent on the training data used. Different speakers may articulate speech in different ways, e.g., depending on gender, age, and linguistic class. Therefore, it may be difficult to train the Doppler-based secondary sensor for a broad class of users. In addition, that interface requires both a speech signal and the Doppler signal for speech activity detection.
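The training dependence criticized above can be illustrated with a minimal sketch using the scikit-learn library. The two-dimensional features here (e.g., audio energy and Doppler-band activity) and their distributions are entirely synthetic assumptions for illustration; a real system would extract features from joint speech and Doppler recordings.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic training data: each frame is a (audio-energy, Doppler-activity)
# feature pair. Speech frames cluster high, non-speech frames cluster low.
rng = np.random.default_rng(0)
speech = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(50, 2))
nonspeech = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(50, 2))

X = np.vstack([speech, nonspeech])
y = np.array([1] * 50 + [0] * 50)   # 1 = speech, 0 = non-speech

# Off-line training step: the resulting decision boundary is fixed by
# this data, which is why the classifier generalizes poorly to speakers
# unlike those in the training set.
clf = SVC(kernel="rbf").fit(X, y)
```

Any frame is then labeled by `clf.predict`, but only according to the boundary learned from the training recordings, which motivates the training-free approach sought below.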
Therefore, it is desired to provide a speech activity sensor that does not require training of a classifier. It is also desired to detect speech only from the Doppler signal, without using any part of the concomitant audio signal. Then, as an advantage, the detection process can be independent of background “noise,” be it speech or any other spurious sounds.