Speech signals transmitted by speech communication devices are often corrupted to some extent by noise, which interferes with and degrades the performance of coding, detection and recognition algorithms.
A variety of different voice activity detectors and detection methods have been developed in order to detect speech periods in input signals which comprise both speech and noise components. Such devices and methods have application in areas such as speech coding, speech enhancement and speech recognition.
The simplest form of voice activity detection is an energy based method in which the power of an input signal is assessed in order to determine if speech is present (i.e. an increase in energy indicates the presence of speech). Such a technique works well where the signal to noise ratio is high but becomes increasingly unreliable as the signal to noise ratio decreases.
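By way of illustration only, an energy based detector of this kind might be sketched as follows. The function names, the frame length, the 6 dB margin and the assumption that the first few frames contain noise only are all hypothetical choices made for this sketch and are not taken from any particular prior art device.

```python
import math


def frame_energies_db(samples, frame_len):
    """Return the mean power of each non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small offset avoids log10(0) on silent frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def energy_vad(samples, frame_len=160, margin_db=6.0, noise_frames=5):
    """Flag frames whose energy exceeds an estimated noise floor.

    Assumption for this sketch: the first `noise_frames` frames contain
    noise only, so their mean energy serves as the noise floor.
    """
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        return []
    n = min(noise_frames, len(energies))
    floor_db = sum(energies[:n]) / n
    return [e > floor_db + margin_db for e in energies]
```

With a high signal to noise ratio the margin separates the two classes cleanly. As the noise floor rises towards the speech level, the same margin begins to miss speech frames or to flag noise frames, which is the unreliability referred to above.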
A voice activity detection method based on the use of a statistical model is described in “A Statistical Model Based Voice Activity Detection” by Sohn et al [IEEE Signal Processing Letters Vol 6, No 1, January 1999]. The statistical model described uses a model for noise and speech to calculate a likelihood ratio (LR) statistic (where LR=[probability speech is present]/[probability speech is absent]). The LR statistic so calculated is then compared to a threshold value in order to decide whether the input signal (or section thereof) under analysis contains speech.
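A likelihood ratio test of this general kind might be sketched as follows. The sketch assumes a Gaussian model for the speech and noise spectral coefficients, which gives the per-bin log likelihood ratio as γξ/(1+ξ) − log(1+ξ), where γ is the a posteriori signal to noise ratio and ξ is the a priori signal to noise ratio. The form of the model, the function names and the threshold of zero are assumptions made for illustration; the passage above does not specify them.

```python
import math


def log_likelihood_ratio(power_spectrum, noise_power, speech_power):
    """Mean per-bin log likelihood ratio under an assumed Gaussian model.

    power_spectrum: observed power in each frequency bin of one frame
    noise_power:    existing noise power estimate for each bin
    speech_power:   assumed speech power (variance) for each bin
    """
    total = 0.0
    for x, n, s in zip(power_spectrum, noise_power, speech_power):
        gamma = x / n  # a posteriori signal to noise ratio
        xi = s / n     # a priori signal to noise ratio
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(power_spectrum)


def speech_present(power_spectrum, noise_power, speech_power, threshold=0.0):
    """Compare the log likelihood ratio with a threshold (hypothetical value)."""
    llr = log_likelihood_ratio(power_spectrum, noise_power, speech_power)
    return llr > threshold
```

Note that the noise power estimate is an input to the calculation, so the quality of the decision depends directly on the quality of that estimate.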
The Sohn et al technique was modified in “Improved Voice Activity Detection Based on a Smoothed Statistical Likelihood Ratio” by Cho et al, In Proceedings of ICASSP, Salt Lake City, USA, vol. 2, pp 737-740, May 2001. The modified version of the technique proposes the use of a smoothed likelihood ratio (SLR) in order to alleviate detection errors that might otherwise be encountered at speech offset regions.
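One plausible form of such smoothing is a first-order recursive average of the log likelihood ratio across frames, sketched below. The recursion, the function name and the smoothing factor `kappa` are assumptions made for illustration and should not be read as the exact definition used in the cited paper.

```python
def smooth_log_lr(log_lrs, kappa=0.9):
    """Recursively smooth a sequence of per-frame log likelihood ratios.

    Each output is kappa * previous_output + (1 - kappa) * current_input;
    the first output is simply the first input.
    """
    smoothed = []
    prev = None
    for value in log_lrs:
        prev = value if prev is None else kappa * prev + (1.0 - kappa) * value
        smoothed.append(prev)
    return smoothed
```

For example, with `kappa=0.5` the sequence `[4.0, 4.0, -2.0]` is smoothed to `[4.0, 4.0, 1.0]`. The raw value falls below a threshold of zero in the final frame, whereas the smoothed value remains above it, so a low-energy frame at the end of an utterance is less likely to be rejected abruptly.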
In order to calculate LR (or SLR), the above statistical methods both require an existing noise power estimate. This noise estimate is itself obtained using the LR/SLR values calculated for previous analysis frames.
There thus exists a feedback mechanism within the above described statistical methods in which the likelihood ratio is calculated using an existing noise estimate which is in turn calculated using a previously derived likelihood ratio value. Such a feedback mechanism can result in an accumulation of errors which impacts upon the overall performance of the system.
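The feedback mechanism can be illustrated with the following single-band sketch, in which the noise estimate is updated only in frames that the likelihood ratio test classifies as noise. The hard decision, the fixed a priori signal to noise ratio `xi`, the smoothing factor `alpha` and the function name are simplifying assumptions made for this sketch, not features of the cited methods.

```python
import math


def run_feedback_vad(frame_powers, noise_init, xi=4.0, alpha=0.9, threshold=0.0):
    """Run a likelihood ratio detector whose noise estimate depends on its own decisions.

    Returns the per-frame decisions and the noise estimate after each frame.
    """
    noise = noise_init
    decisions = []
    noise_track = []
    for power in frame_powers:
        gamma = power / noise  # uses the existing noise estimate
        log_lr = gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
        is_speech = log_lr > threshold
        if not is_speech:
            # The noise estimate is refreshed only when the detector
            # itself decides that the frame contains no speech.
            noise = alpha * noise + (1.0 - alpha) * power
        decisions.append(is_speech)
        noise_track.append(noise)
    return decisions, noise_track
```

If the true noise power rises abruptly, for example from 1.0 to 20.0, every subsequent frame is classified as speech on the basis of the stale estimate, the estimate is therefore never updated, and the error persists indefinitely. This is one way in which errors can accumulate in such a loop.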
As noted above, the likelihood ratio that is calculated is compared to a threshold value in order to decide if speech is present. However, the likelihood ratios calculated in the above techniques can vary over a range of the order of 60 dB or more. If there are large variations in the noise in the input signal then the threshold value may become an inaccurate indicator of the presence of speech and system performance may decrease.
It is therefore an object of the present invention to provide a voice activity detection method and apparatus that substantially overcomes or mitigates the above-mentioned problems with the prior art.