Technical Field
The present invention relates generally to speech recognition and, in particular, to a multi-pass speech activity detection strategy to improve automatic speech recognition.
Description of the Related Art
Speech activity detection (SAD) is a first step in automatic speech recognition (ASR) tasks. This step identifies the regions in the audio signal that include speech. The identified regions are then “decoded” by a speech recognition engine to produce word sequences corresponding to the acoustic signal. While the performance of ASR is often reasonably good in matched acoustic conditions, it degrades significantly in the presence of previously unseen noise. One potential reason for these degradations is the inclusion of noise or noisy regions in the estimation of feature statistics and transforms from the audio signal prior to decoding. For example, the means and variances for mean-variance normalization of acoustic features used with various acoustic models can change significantly if music or non-speech acoustic events such as door bangs are included in the data used for the estimation of these feature statistics. Thus, there is a need for improved speech activity detection.
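By way of illustration only, the effect of non-speech frames on mean-variance normalization can be sketched with a small numeric example. The feature values, the function names `mvn_stats` and `normalize`, and the use of a single one-dimensional feature track are assumptions made for this sketch and are not taken from any particular ASR system.

```python
def mvn_stats(values):
    """Return the (mean, population variance) of a one-dimensional feature track."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def normalize(values, mean, variance):
    """Apply mean-variance normalization using the supplied statistics."""
    std = variance ** 0.5
    return [(v - mean) / std for v in values]


# Hypothetical feature values (for example, a log-energy track).
speech_frames = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0]
noise_frames = [9.0, 10.0, 11.0]  # assumed non-speech event, e.g. a door bang

# Statistics estimated from speech only versus speech plus non-speech frames.
clean_mean, clean_var = mvn_stats(speech_frames)
mixed_mean, mixed_var = mvn_stats(speech_frames + noise_frames)

# The same speech frames, normalized with each set of statistics.
speech_clean = normalize(speech_frames, clean_mean, clean_var)
speech_mixed = normalize(speech_frames, mixed_mean, mixed_var)

print("speech-only stats:", clean_mean, clean_var)
print("mixed stats:      ", mixed_mean, mixed_var)
print("mean of normalized speech (speech-only stats):", sum(speech_clean) / len(speech_clean))
print("mean of normalized speech (mixed stats):      ", sum(speech_mixed) / len(speech_mixed))
```

In this sketch, the speech frames are centered at zero when the statistics are estimated from speech alone, whereas they are shifted and compressed when the non-speech frames are included in the estimate. An acoustic model would therefore see the same speech frames with a different distribution in the second case.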