1. Field of the Invention
The present invention relates to speech recognition, and more particularly to a system and method for discriminating speech from silence using a log-likelihood ratio and pitch.
2. Description of the Related Art
Voice activity detection (VAD) is an integral and significant part of a variety of speech processing systems, including speech coding, speech recognition, and hands-free telephony. For example, in wireless voice communication, a VAD device can be incorporated to switch off the transmitter during the absence of speech to preserve power, or to enable variable bit rate coding to enhance capacity by minimizing interference. Likewise, in speech recognition applications, the detection of voice (and/or silence) can be used to indicate a possible switch between dictation and command-and-control (C&C) modes without explicit intervention.
For the design of VAD, efficiency, accuracy, and robustness are among the most important considerations. Many VAD schemes have been proposed and used in different speech applications. Based on their operating mechanism, they can be categorized into a threshold-comparison approach and a recognition-based approach. The advantages and disadvantages of each are briefly discussed as follows.
The underlying basis of a threshold-comparison VAD scheme is that it extracts selected features or quantities from the input signal and then compares these values with thresholds. (See, e.g., K. El-Maleh and P. Kabal, "Comparison of Voice Activity Detection Algorithms for Wireless Personal Communications Systems", Proc. IEEE Canadian Conference on Electrical and Computer Engineering, pp. 470-473, May 1997; L. R. Rabiner, et al., "Application of an LPC Distance Measure to the Voiced-Unvoiced-Silence Detection Problem," IEEE Trans. on ASSP, vol. ASSP-25, no. 4, pp. 338-343, August 1977; and M. Rangoussi and G. Carayannis, "Higher Order Statistics Based Gaussianity Test Applied to On-line Speech Processing," In Proc. of the IEEE Asilomar Conf., pp. 303-307, 1995.) These thresholds are usually estimated from noise-only periods and updated dynamically.
Many early detection schemes used features like short-term energy, zero crossing, autocorrelation coefficients, pitch, and LPC coefficients (See, e.g., L. R. Rabiner, et al. as cited above). VAD schemes in modern systems in wireless communication, such as GSM (global system for mobile communications) and CDMA (code division multiple access), apply adaptive filtering, sub-band energy comparison (See, e.g., K. El-Maleh and P. Kabal as cited above), and/or high-order statistics (See, e.g., M. Rangoussi and G. Carayannis as cited above).
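By way of illustration only, a threshold-comparison scheme of the kind described above can be sketched as follows. The sketch uses short-term energy as the single feature. The function names, the margin factor, the adaptation rate, and the assumption that the first few frames are noise-only are hypothetical choices made for this example and are not taken from the cited references.

```python
def short_term_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def energy_vad(frames, noise_frames=2, margin=3.0, adapt=0.9):
    """Threshold-comparison VAD sketch.

    Assumption: the first `noise_frames` frames contain noise only and are
    used to initialize the noise-level estimate.
    """
    noise_level = sum(short_term_energy(f) for f in frames[:noise_frames]) / noise_frames
    decisions = []
    for frame in frames:
        energy = short_term_energy(frame)
        # Compare the feature with a threshold derived from the noise estimate.
        is_speech = energy > margin * noise_level
        if not is_speech:
            # Update the noise estimate dynamically, only during detected noise.
            noise_level = adapt * noise_level + (1.0 - adapt) * energy
        decisions.append(is_speech)
    return decisions


quiet = [0.1, -0.1, 0.1, -0.1]
loud = [1.0, -1.0, 1.0, -1.0]
decisions = energy_vad([quiet, quiet, loud, quiet])
```

The dependence of such a scheme on an empirically chosen margin and on a slowly varying noise estimate is visible directly in the parameters of the sketch.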
A major advantage of the threshold-comparison VAD approach is efficiency, as the selected features are computationally inexpensive. Also, such schemes can achieve good performance in high-SNR environments. However, all of these techniques rely on either empirically determined thresholds (fixed or dynamically updated), the assumption that background noise is stationary, or the assumption of a symmetric distribution process. Therefore, there are two issues to be addressed: robustness in threshold estimation and adaptation, and the ability to handle non-stationary and transient noises (See, e.g., S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, No. 2, pp. 113-120, April 1979).
For recognition-based VAD, the recent advances in speech recognition technology have enabled its widespread use in speech processing applications. The discrimination of speech from background silence can be accomplished using speech recognition systems. In the recognition-based approach, very accurate detection of speech/noise activities can be achieved with the use of prior knowledge of text contents.
However, this recognition-based operation may be too expensive for computation-sensitive applications, and therefore, it is mainly used for off-line applications with sufficient resources. Furthermore, it is language-specific and the quality highly depends on the availability of prior knowledge of text. Therefore, this kind of approach needs special consideration for the issues of computational resources and language-dependency.
Therefore, a need exists for a system and method which overcomes the deficiencies of the prior art, for example, the lack of robustness in threshold estimation and adaptation, the lack of the ability to handle non-stationary and transient noises, and language-dependency. A further need exists for a model-based system and method for speech/silence detection using cepstrum and pitch.
A system and method for voice activity detection, in accordance with the invention, includes the steps of training speech/noise Gaussian models by inputting data including frames of speech and noise, and deciding if the frames of the input data include speech or noise by employing a log-likelihood ratio test statistic and pitch. The frames of the input data are tagged based on the log-likelihood ratio test statistic and pitch characteristics of the input data as being most likely noise or most likely speech. The tags are counted in a plurality of frames to determine if the input data is speech or noise.
In other methods, the step of deciding if the frames of the input data include speech or noise by employing a log-likelihood ratio test (LLRT) statistic may include the steps of determining a first probability that a given frame of the input data is noise, determining a second probability that the given frame of the input data is speech, and determining an LLRT statistic by subtracting the logarithm of the first probability from the logarithm of the second probability. The step of determining a first probability may include the step of comparing the given frame to a model of Gaussian mixtures for noise. The step of determining a second probability may include the step of comparing the given frame to a model of Gaussian mixtures for speech.
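By way of illustration only, the computation of the LLRT statistic from two Gaussian mixture models can be sketched as follows. The sketch assumes diagonal-covariance mixtures, represents each model as a hypothetical tuple of (weights, means, variances), and assumes the sign convention in which a positive value favors speech. None of these choices is mandated by the description above.

```python
import math


def log_gmm_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance Gaussian mixture."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    # Log-sum-exp over mixture components for numerical stability.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def llrt(x, speech_gmm, noise_gmm):
    """Assumed convention: log P(x | speech) - log P(x | noise)."""
    return log_gmm_likelihood(x, *speech_gmm) - log_gmm_likelihood(x, *noise_gmm)


# Toy single-component models with unit variance (illustrative values only).
speech_gmm = ([1.0], [[1.0]], [[1.0]])
noise_gmm = ([1.0], [[0.0]], [[1.0]])
score_near_speech = llrt([1.0], speech_gmm, noise_gmm)
score_near_noise = llrt([0.0], speech_gmm, noise_gmm)
```

In this toy case a frame lying at the speech mean yields a positive statistic and a frame lying at the noise mean yields a negative one.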
In still other methods, the step of tagging the frames of the input data based on the log-likelihood ratio test statistic and pitch characteristics may include the step of tagging the frames according to an equation Tag(t)=f(LLRT, pitch), where Tag(t)=1 when a hypothesis that a given frame is noise is rejected and Tag(t)=0 when a hypothesis that a given frame is speech is rejected. The step of counting the tags in a plurality of frames to determine if the input data is speech or noise may include the step of providing a smoothing window of N frames to provide a normalized cumulative count between adjacent frames of the N frames and to smooth transitions between noise and speech frames. The step of providing a smoothing window of N frames may include the formula: w(t)=exp(−αt), where w(t) is the smoothing window, t is time, and α is a decay constant. The step of providing a smoothing window of N frames may include the formula: w(t)=1/N, where w(t) is the smoothing window, and t is time. The step of providing a smoothing window of N frames may include w(t)=1 for t=0 and otherwise w(t)=0, where w(t) is the smoothing window, and t is time. The step of counting the tags may include the steps of comparing a normalized cumulative count to a first threshold and a second threshold; if the normalized cumulative count is above or equal to the first threshold and the current tag is most likely speech, the input data is speech, and if the normalized cumulative count is below the second threshold and the current tag is most likely noise, the input data is noise. The methods may be performed by a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform the method steps.
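By way of illustration only, the three smoothing windows and the two-threshold decision can be sketched as follows. The sketch takes the per-frame tags as given, since the function f(LLRT, pitch) is not specified here. The threshold values, the initial state, and the rule that the previous decision is retained when neither condition is met are assumptions made for this example.

```python
import math


def exponential_window(n, alpha):
    """w(t) = exp(-alpha * t) for t = 0 .. n-1."""
    return [math.exp(-alpha * t) for t in range(n)]


def rectangular_window(n):
    """w(t) = 1/N for every t."""
    return [1.0 / n] * n


def impulse_window(n):
    """w(t) = 1 for t = 0 and 0 otherwise."""
    return [1.0 if t == 0 else 0.0 for t in range(n)]


def smoothed_count(tags, window):
    """Normalized weighted count of speech tags; tags[-1] is the current frame (t = 0)."""
    recent = tags[-len(window):][::-1]  # most recent frame first
    used = window[:len(recent)]
    return sum(w * tag for w, tag in zip(used, recent)) / sum(used)


def decide(tags, window, first_threshold=0.6, second_threshold=0.4, initial="noise"):
    """Two-threshold decision; the previous state is kept when neither rule fires (assumption)."""
    state = initial
    decisions = []
    for i in range(len(tags)):
        count = smoothed_count(tags[: i + 1], window)
        if count >= first_threshold and tags[i] == 1:
            state = "speech"
        elif count < second_threshold and tags[i] == 0:
            state = "noise"
        decisions.append(state)
    return decisions


tags = [1, 1, 1, 0, 1, 0, 0, 0]
smoothed = decide(tags, rectangular_window(4))
unsmoothed = decide([0, 1, 0], impulse_window(3))
```

With the rectangular window, the isolated noise tag at the fourth frame does not change the decision, whereas the impulse window follows each tag directly, which corresponds to no smoothing.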
A method for training voice activity detection systems, in accordance with the invention, includes the steps of inputting training data, the training data including both noise and speech, aligning the training data in a forced alignment mode to identify speech and noise portions of the training data, labeling the speech portions and the noise portions, clustering the noise portions to achieve noise Gaussian mixture densities to be employed as noise models, and clustering the speech portions to achieve speech Gaussian mixture densities to be employed as speech models.
The methods may be performed by a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform the method steps. The step of aligning the training data in a forced alignment mode to identify speech and noise portions of the training data may be performed by employing a speech decoder. The step of clustering the noise portions may include clustering the noise portions in accordance with a plurality of noise ambient environments.
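By way of illustration only, the clustering of labeled portions into Gaussian mixture densities can be sketched as follows. The sketch assumes that forced alignment has already produced a "speech" or "noise" label for each frame, and it substitutes plain k-means clustering with diagonal covariances, since the clustering procedure is not specified here. The function names, the centroid initialization, and the variance floor are hypothetical, and each class is assumed to contain at least k frames.

```python
def assign(frames, centroids):
    """Assign each frame to its nearest centroid (squared Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for f in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centroids]
        clusters[dists.index(min(dists))].append(f)
    return clusters


def fit_gmm(frames, k, iterations=20, var_floor=1e-3):
    """Cluster frames with k-means, then read off mixture weights, means, and variances."""
    step = max(1, len(frames) // k)
    centroids = [list(frames[i * step]) for i in range(k)]  # arbitrary initialization
    for _ in range(iterations):
        clusters = assign(frames, centroids)
        centroids = [
            [sum(col) / len(m) for col in zip(*m)] if m else c
            for m, c in zip(clusters, centroids)
        ]
    clusters = assign(frames, centroids)
    weights, means, variances = [], [], []
    for m, c in zip(clusters, centroids):
        if not m:
            continue  # drop empty clusters
        weights.append(len(m) / len(frames))
        means.append(c)
        variances.append([
            max(var_floor, sum((x - mu) ** 2 for x in col) / len(m))
            for col, mu in zip(zip(*m), c)
        ])
    return weights, means, variances


def train_models(frames, labels, k=2):
    """Build one mixture per class from frames labeled by a prior alignment step."""
    speech = [f for f, lab in zip(frames, labels) if lab == "speech"]
    noise = [f for f, lab in zip(frames, labels) if lab == "noise"]
    return fit_gmm(speech, k), fit_gmm(noise, k)


# Toy one-dimensional features (illustrative values only).
frames = [[0.0], [0.1], [5.0], [5.1], [10.0], [10.2], [20.0], [20.2]]
labels = ["noise"] * 4 + ["speech"] * 4
speech_gmm, noise_gmm = train_models(frames, labels, k=2)
```

Clustering the noise portions separately for each ambient environment could be approximated in this sketch by calling fit_gmm once per environment and pooling the resulting components, although that extension is not shown.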