1. Field of the Invention
This invention relates to voice activity detection.
2. Related Art
There are many automated systems that depend on the detection of speech for operation, for instance automated speech systems and cellular radio coding systems. Such systems monitor transmission paths from users' equipment for the occurrence of speech and, on the occurrence of speech, take appropriate action. Unfortunately transmission paths are rarely free from noise. Systems which are arranged simply to detect activity on the path may therefore incorrectly take action if there is noise present.
The usual noise that is present is line noise (i.e. noise that is present irrespective of whether or not a signal is being transmitted) and background noise from a telephone conversation, such as a dog barking, the sound of the television, the noise of a car's engine etc.
Another source of noise in communications systems is echo. For instance, echoes in a public switched telephone network (PSTN) are essentially caused by electrical and/or acoustic coupling, e.g. at the four-wire to two-wire interface of a conventional exchange box, or by the acoustic coupling in a telephone handset from earpiece to microphone. The acoustic echo is time variant during a call due to the variation of the airpath, i.e. the talker altering the position of their head between the microphone and the loudspeaker. Similarly, in telephone kiosks, the interior of the kiosk has a limited damping characteristic and is reverberant, which results in resonant behaviour. Again, this causes the acoustic echo path to vary if the talker moves around the kiosk or indeed with any air movement. Acoustic echo is becoming a more important issue due to the increased use of hands-free telephones. The effect of the overall echo or reflection path is to attenuate, delay and filter a signal.
The echo path is dependent on the line, switching route and phone type. This means that the transfer function of the reflection path can vary between calls since any of the line, switching route and the handset may change from call to call as different switch gear will be selected to make the connection.
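By way of illustration only, the attenuating, delaying and filtering effect of a reflection path can be sketched as follows. The function name, the one-pole low-pass filter and all parameter values are assumptions chosen for the example; a real echo path has a transfer function that depends on the line, switching route and phone type.

```python
def apply_echo_path(signal, attenuation=0.3, delay=4, smoothing=0.5):
    """Return a toy echo of `signal`: filtered, attenuated, then delayed.

    All parameter values are hypothetical; they do not describe a real line.
    """
    # A one-pole low-pass filter stands in for the frequency shaping of the path.
    filtered = []
    state = 0.0
    for x in signal:
        state = smoothing * state + (1.0 - smoothing) * x
        filtered.append(state)

    # Attenuate and delay; the output has the same length as the input.
    echo = [0.0] * len(signal)
    for n in range(delay, len(signal)):
        echo[n] = attenuation * filtered[n - delay]
    return echo


# A unit impulse makes the delay and attenuation easy to see.
impulse = [1.0] + [0.0] * 9
echo = apply_echo_path(impulse)
```

Changing `attenuation`, `delay` or `smoothing` between calls mimics the way the reflection path varies from one connection to the next.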
Various techniques are known to improve the echo control in human-to-human speech communications systems. There are three main techniques. Firstly, insertion losses may be added into the talker's transmission path to reduce the level of the outgoing signal. However, the insertion losses may cause the received signal to become intolerably low for the listener. Secondly, echo suppressors operate on the principle of detecting signal levels in the transmitting and receiving paths and then comparing the levels to determine how to operate switchable insertion loss pads. A high attenuation is placed in the transmit path when speech is detected on the receive path. Echo suppressors are usually used on longer delay connections, such as international telephony links, where suitable fixed insertion losses would be insufficient.
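The echo suppressor principle described above can be sketched, under simplifying assumptions, as a per-frame comparison of levels. The threshold, the loss factor and the function names are hypothetical, and a practical suppressor would also need hangover timing and double-talk handling, which are omitted here.

```python
def frame_level(frame):
    """Mean-square level of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def suppress_echo(transmit_frame, receive_frame,
                  receive_threshold=0.01, loss=0.1):
    """Switch a loss pad into the transmit path when the receive path is active.

    `receive_threshold` and `loss` are illustrative values only.
    """
    rx = frame_level(receive_frame)
    tx = frame_level(transmit_frame)
    # Treat the receive path as carrying speech when its level exceeds the
    # threshold and is at least as high as the transmit level.
    if rx > receive_threshold and rx >= tx:
        return [loss * x for x in transmit_frame]
    return list(transmit_frame)


# Receive path active: the transmit frame is attenuated.
attenuated = suppress_echo([0.1] * 4, [0.5] * 4)
# Receive path silent: the transmit frame passes unchanged.
unchanged = suppress_echo([0.1] * 4, [0.0] * 4)
```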
Echo cancellers are voice operated devices which use adaptive signal processing to reduce or eliminate echoes by estimating an echo path transfer function. An outgoing signal is fed into the device, which applies its model of the echo path, and the resulting output signal is subtracted from the received signal. Provided that the model is representative of the real echo path, the echo should theoretically be cancelled. However, echo cancellers suffer from stability problems and are computationally expensive. Echo cancellers are also very sensitive to noise bursts during training.
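As a minimal sketch of the adaptive approach, the following uses a normalised least-mean-squares (NLMS) filter. NLMS is chosen here purely for illustration; the text above does not name a particular adaptation algorithm. The tap count, step size and the simulated echo path (a delay of three samples with a gain of 0.4) are assumptions.

```python
import random


def nlms_echo_canceller(outgoing, received, taps=8, step=0.5, eps=1e-8):
    """Estimate the echo of `outgoing` and subtract it from `received`.

    Returns the residual signal and the final filter weights.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent outgoing samples, newest first
    residual = []
    for x, d in zip(outgoing, received):
        history = [x] + history[:-1]
        # Echo estimate from the current model of the echo path.
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate
        residual.append(error)
        # Normalised LMS update of the model.
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
    return residual, weights


random.seed(0)
outgoing = [random.uniform(-1.0, 1.0) for _ in range(2000)]
# Hypothetical echo path: delay of 3 samples, gain of 0.4, no near-end speech.
received = [0.0] * 3 + [0.4 * x for x in outgoing[:-3]]
residual, weights = nlms_echo_canceller(outgoing, received)
```

In this noise-free simulation the weight at index 3 approaches 0.4 and the residual decays towards zero. A noise burst or near-end speech in `received` during adaptation would disturb the weights, which is the sensitivity noted above.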
One example of an automated speech system is the telephone answering machine, which records messages left by a caller. Generally, when a user calls up an automated speech system, a prompt is played to the user, and the prompt usually requires a reply. Thus an outgoing signal from the speech system is passed along a transmission line to the loudspeaker of a user's telephone. The user then provides a response to the prompt, which is passed to the speech system, which then takes appropriate action.
It has been proposed that allowing a caller to an automated speech system to interrupt outgoing prompts from the system greatly enhances the usability of the system for those callers who are familiar with the dialogue of the system. This facility is often termed "barge in" or "over-ridable guidance".
If a user speaks during a prompt, the spoken words may be preceded or corrupted by an echo of the outgoing prompt. Essentially isolated clean vocabulary utterances from the user are transformed into embedded vocabulary utterances (in which the vocabulary word is contaminated with additional sounds). In automated speech systems which involve automated speech recognition, because of the limitations of current speech recognition technology, this results in a reduction in recognition performance.
If a user has never used the service provided by the automated speech system, the user will need to hear the prompts provided by the speech generator in their entirety. However, once a user has become familiar with the service and the information that is required at each stage, the user may wish to provide the required response before the prompt has finished. If a speech recogniser or recording means is turned off until the prompt is finished, no attempt will be made to recognise a user's early response. If, on the other hand, the speech recogniser or recording means is turned on all the time, the input would include both the echo of the outgoing prompt and the response provided by the user. Such a signal would be unlikely to be recognisable by a speech recogniser. Voice activity detectors (VADs) have therefore been developed to detect voice activity on the path.
Known voice activity detectors rely on generating an estimate of the noise in an incoming signal and comparing the incoming signal with the estimate, which is either fixed or updated during periods of non-speech. Examples of such voice activated systems are described in U.S. Pat. No. 5,155,760 and U.S. Pat. No. 4,410,763.
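The general principle of such detectors can be sketched as follows. This is not the method of either cited patent; it is a simple energy-threshold example in which the margin, the smoothing factor and the use of the first frame as the initial noise estimate are all assumptions.

```python
def detect_voice(frames, margin=4.0, smoothing=0.9):
    """Return one speech/non-speech decision per frame.

    A frame is flagged as speech when its energy exceeds `margin` times the
    running noise estimate; the estimate is updated only on non-speech frames.
    """
    if not frames:
        return []

    def energy(frame):
        return sum(x * x for x in frame) / len(frame)

    # Assumption: the first frame contains noise only.
    noise = energy(frames[0])
    decisions = []
    for frame in frames:
        e = energy(frame)
        is_speech = e > margin * noise
        decisions.append(is_speech)
        if not is_speech:
            noise = smoothing * noise + (1.0 - smoothing) * e
    return decisions


quiet = [0.01, -0.01, 0.01, -0.01]
loud = [0.5, -0.5, 0.5, -0.5]
decisions = detect_voice([quiet, quiet, loud, loud, quiet])
```

Setting `smoothing` to 1.0 freezes the estimate at its initial value, which corresponds to the fixed-estimate case. A detector of this kind responds to any sufficiently energetic input, including an echo of an outgoing prompt.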
Voice activity detectors are used to detect speech in the incoming signal, and to interrupt the outgoing prompt and turn on the recogniser when such speech is detected. A user will hear a clipped prompt. This is satisfactory if the user has barged in. If, however, the voice activity detector has incorrectly detected speech, the user will hear a clipped prompt and have no instructions on how to proceed with the system. This is clearly undesirable.
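The barge-in behaviour just described can be illustrated with a short sketch. The function name and the representation of the prompt as a list of frames paired with detector decisions are hypothetical.

```python
def play_prompt_with_barge_in(prompt_frames, vad_decisions):
    """Return the frames actually played and whether the recogniser was turned on.

    Playback stops at the first frame for which the detector reports speech.
    """
    played = []
    for frame, speech_detected in zip(prompt_frames, vad_decisions):
        if speech_detected:
            # Prompt is clipped here and the recogniser is turned on.
            return played, True
        played.append(frame)
    return played, False


prompt = ["please", "say", "your", "account", "number"]
played, recogniser_on = play_prompt_with_barge_in(
    prompt, [False, False, True, False, False])
```

The outcome is the same whether the detection at the third frame was a genuine barge-in or a false detection; in the latter case the user hears only the clipped prompt.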