Voice activity detection (VAD) is a technique for detecting voice activity in a signal. It is also known as speech activity detection or simply speech detection. A VAD apparatus detects, in communication channels, the presence or absence of voice activity, also referred to as active signals, such as speech or music. Networks can thus decide to compress the transmission bandwidth in periods where active signals are absent, or perform other processing depending on whether an active signal is present. In VAD, a feature parameter or a set of feature parameters extracted from an input audio signal is compared to corresponding threshold values in order to determine whether the input audio signal is an active signal.
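The threshold comparison described above can be illustrated with a minimal sketch. It assumes a single feature parameter (frame energy in dB) and a fixed threshold; the function names, the 160-sample frame length, and the -40 dB threshold are hypothetical choices made for illustration and are not taken from any particular detector.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of one frame in dB, floored to avoid log(0)."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(energy, 1e-12))

def is_active(frame, threshold_db=-40.0):
    """Compare the feature parameter to its threshold (assumed value)."""
    return frame_energy_db(frame) > threshold_db

# Two illustrative 160-sample frames (20 ms at an assumed 8 kHz sampling rate).
silence = [0.0001] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(160)]

print(is_active(silence), is_active(tone))
```

A practical detector would typically use several feature parameters and adapt its thresholds to the estimated background noise rather than rely on one fixed value.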
Many parameters have been proposed for VAD. In general, energy-based parameters are known to provide good performance. Thus, in recent years, sub-band signal-to-noise ratio (SNR) based parameters, as one kind of energy-based parameter, have been widely used for VAD. However, no matter what feature parameter or feature parameters a voice activity detector uses, such parameters exhibit a weak speech characteristic at the offsets of speech bursts, thus increasing the possibility of mis-detecting speech offsets.
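As a rough illustration of a sub-band SNR based parameter, the sketch below computes a per-band SNR in dB from sub-band energies and a background-noise estimate, then averages the bands. The averaging, the clipping of negative SNRs to zero, and all names are assumptions made for this example; actual detectors define and weight their sub-band parameters in their own ways.

```python
import math

def subband_snr_db(band_energies, noise_energies, floor=1e-12):
    """Per-band SNR in dB; energies are floored to avoid division by zero."""
    return [
        10.0 * math.log10(max(e, floor) / max(n, floor))
        for e, n in zip(band_energies, noise_energies)
    ]

def snr_feature(band_energies, noise_energies):
    """Average sub-band SNR with negative values clipped to 0 (assumption)."""
    snrs = subband_snr_db(band_energies, noise_energies)
    return sum(max(s, 0.0) for s in snrs) / len(snrs)

noise = [1.0, 1.0, 1.0, 1.0]            # assumed noise estimate per band
mid_burst = [100.0, 100.0, 10.0, 10.0]  # strong speech energy
offset = [2.0, 1.5, 1.0, 1.0]           # decaying energy at a speech offset

print(snr_feature(mid_burst, noise), snr_feature(offset, noise))
```

With these illustrative numbers the offset frame yields a much smaller feature value than the mid-burst frame, which is the weak speech characteristic at offsets that makes mis-detection more likely.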
Usually, in order to ensure correct detection of speech offsets, a conventional voice activity detector performs some special processing at speech offsets. A conventional way to do this is to apply a “hard” hangover to the VAD decision at speech offsets, wherein a first group of frames detected as inactive by the voice activity detector at the speech offsets is forced to be active. Another possibility is to apply a “soft” hangover to the VAD decision at the speech offsets. In applying a soft hangover, the VAD decision threshold at the speech offsets is adjusted to favour speech detection for the first several offset frames of the audio signal. Accordingly, in such a conventional voice activity detector, when the input signal is not in a speech offset state, the VAD decision is made in the normal way, while in an offset state the VAD decision is made in a way favouring speech detection.
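The two hangover schemes can be contrasted with a short sketch. It is an illustration under stated assumptions, not the procedure of any standardized detector: the hangover length of three frames, the normal and relaxed thresholds, and the function names are all hypothetical.

```python
def hard_hangover(raw_decisions, hangover_frames=3):
    """Force the first `hangover_frames` inactive frames after speech to active."""
    out, remaining = [], 0
    for active in raw_decisions:
        if active:
            remaining = hangover_frames  # re-arm the hangover counter
            out.append(True)
        elif remaining > 0:
            remaining -= 1               # inactive frame forced to active
            out.append(True)
        else:
            out.append(False)
    return out

def soft_hangover(features, threshold=10.0, relaxed_threshold=6.0,
                  hangover_frames=3):
    """Use a relaxed threshold for the first few frames after speech."""
    out, remaining = [], 0
    for value in features:
        current = relaxed_threshold if remaining > 0 else threshold
        out.append(value > current)
        if value > threshold:
            remaining = hangover_frames  # normal detection re-arms the counter
        elif remaining > 0:
            remaining -= 1               # one offset frame consumed
    return out

features = [12, 11, 8, 7, 5, 8, 8]       # assumed per-frame feature values
raw = [f > 10.0 for f in features]       # decisions without any hangover

print(hard_hangover(raw))
print(soft_hangover(features))
```

With these values the hard hangover marks the fifth frame (feature value 5) as active regardless of its content, whereas the soft hangover still evaluates each offset frame and rejects it.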
Although applying a hard hangover to ensure correct detection of speech offsets can reduce the possibility of mis-detection at speech offsets, the hard hangover scheme lacks efficiency. Many truly inactive frames may be unnecessarily forced to be active, thus decreasing the overall VAD performance. On the other hand, although a soft hangover scheme, as used for instance by the ITU-T (International Telecommunication Union Telecommunication Standardization Sector) G.718 standardized voice activity detector, improves the hangover efficiency, the VAD performance can still be improved.