Voice activity detection (VAD) is the art of detecting the presence of voice activity, generally human speech, in audio signals. Voice activity detection is used in a wide range of systems that handle audio signals, for example systems dealing with telecommunication, speech recognition, speaker verification, speaker identification, speaker segmentation, voice recording, noise suppression and others. In a telecommunication system, voice activity detection can be used to apply different sampling rates based on the level of voice activity detected, for example to raise or reduce the bandwidth depending on whether an audio segment contains human speech. A speaker verification/identification system can be simplified by limiting processing to audio segments containing speech. A noise suppression system can use voice activity detection to compare segments with speech activity against segments without speech activity. In voice recording systems, voice activity detection can be used to reduce the required storage space by limiting the recording to meaningful information (e.g. segments with speech activity).
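As a rough illustration of how a system might limit processing or recording to segments with speech activity, the following Python sketch labels fixed-length frames by their short-term energy. It is a deliberately simple energy-threshold detector and is not taken from any system described here. The function names, the 160-sample frame length (20 ms at an assumed 8 kHz sampling rate) and the -30 dB threshold are all illustrative assumptions.

```python
import math

def frame_energy_db(frame):
    """Mean energy of a frame in dB relative to full scale (floored at -100 dB)."""
    if not frame:
        return -100.0
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq) if mean_sq > 1e-10 else -100.0

def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Label each non-overlapping frame True (active) or False (inactive).

    frame_len and threshold_db are assumed values chosen for illustration;
    samples are expected to be floats in the range [-1.0, 1.0].
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags

# Two silent frames followed by two frames of a 200 Hz tone (a stand-in for speech).
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
flags = energy_vad(silence + tone)
```

A recorder built on such a detector could store only the frames flagged True. Energy alone cannot tell speech from other loud sounds, nor one speaker from another, which is the limitation the remainder of this section is concerned with.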
Many voice-controlled systems and/or applications are intended to receive voices from a single person or a single group of people, and would function better if they actually received only the voice or voices of the intended people, for example:
1. Speaker verification systems, such as those used by banks to authenticate a customer;
2. Voice-activated appliances, which are trained to recognize specific voices and/or commands; and
3. Telephone tapping devices, which are interested in recording voices of specific people.
Likewise, in telephone conversations, any background noise or voices of other people not participating in the conversation can be considered noise, for example:
1. When talking on a speakerphone with other people talking in the background;
2. When talking on a public telephone on a noisy street;
3. When talking on a mobile telephone in a noisy environment;
4. In a call center with many agents speaking to different callers in the same room;
5. When talking on the telephone and not wanting the party on the other end to identify the speaker's location, for example when a loudspeaker is giving announcements in the background;
6. When conducting a conference call in a closed room and a person who is not participating in the conversation enters the room to deliver a verbal message to one of the participants.
Some systems attempt to transfer voice and eliminate noise in order to improve efficiency in dealing with the signal. In some cases more sophisticated input devices (e.g. extra microphones and/or sensors) are used in order to help differentiate between different speakers and/or noise.
U.S. patent application publication No. 2005/0033572, published Feb. 10, 2005, the disclosure of which is incorporated herein by reference, describes an apparatus and method of a voice recognition system for an audio-visual system. The system receives reflected sounds from an audio-visual system, noise and a user's voice, and is configured to isolate the user's voice and compare it to voice patterns that belong to at least one model.
Japanese patent No. 11-154998 from Jun. 8, 1999, the disclosure of which is incorporated herein by reference, describes registering a voice print of a speaker; then, during transmission, a microphone collects a signal comprising the speaker's voice and ambient noise. The signal is input to a comparing filter that extracts the voice of the speaker from the signal by comparing it to the registered voice print.
There is, however, a basic problem in implementing a system as suggested in the Japanese patent. In a system that determines whether a specific audio signal is voice and whether it matches the voice pattern of a specific speaker, statistical methods are used, which provide a probability level of conformity. The determination is not an absolute process in which a real-time signal is passed through a processor that instantaneously provides a clean output signal including only the speech of a specific speaker. Rather, the determination requires statistical analysis of each part of the evolving audio signal to determine whether or not that part contains the specific speaker's voice. In some cases, further evaluation of the evolving audio signal may reverse a previous determination; for example, an audio segment which was initially determined to probably belong to a specific speaker may later be determined not to belong to that speaker, or vice versa. Generally, instantaneous transfer of the audio will introduce a high level of error in the output signal, leading to portions of the speech being cut off or to transfer of a large portion of the background noise. In contrast, the greater the delay introduced before providing a determination, the more accurate the decision tends to be; however, providing a determination with a delay of more than a small amount (e.g. more than 100 milliseconds) will result in a conversation of unacceptable quality.
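The trade-off between delay and accuracy can be made concrete with a small Python sketch. It assumes each frame already has a score between 0 and 1 expressing the probability that it contains the intended speaker's voice; how such scores are produced is outside the sketch. The function names, the 20 ms frame duration, the 0.5 threshold and the simple averaging rule are hypothetical choices for illustration, not the method of either cited publication.

```python
def delayed_decisions(scores, lookahead, threshold=0.5):
    """Decide each frame from the average score of a window centred on it.

    The window spans `lookahead` frames before and after frame i, so the
    decision for frame i is only available once frame i + lookahead has
    arrived. With lookahead=0 the decision is instantaneous.
    """
    decisions = []
    for i in range(len(scores)):
        lo = max(0, i - lookahead)
        window = scores[lo:i + lookahead + 1]  # shorter at the stream edges
        decisions.append(sum(window) / len(window) >= threshold)
    return decisions

def max_lookahead_frames(budget_ms=100, frame_ms=20):
    """Largest lookahead (in frames) that stays within the delay budget."""
    return budget_ms // frame_ms

# Hypothetical per-frame scores: the speaker talks for six frames, but
# frame 2 scores low (for example, a brief burst of noise).
scores = [0.9, 0.8, 0.3, 0.9, 0.85, 0.8, 0.1, 0.1, 0.2, 0.1]

instant = delayed_decisions(scores, lookahead=0)  # frame 2 is cut off
delayed = delayed_decisions(scores, lookahead=1)  # frame 2 is retained
```

With no lookahead, the low-scoring frame in the middle of the utterance is rejected and the speech is clipped. Waiting for a single additional frame reverses that determination, at the cost of one frame of added delay. Under the assumed 20 ms frames, a 100 millisecond budget leaves room for at most five frames of lookahead.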