Signal detection is an important element in many applications. Examples include, but are not restricted to, speech detection, speech recognition, speech coding, noise adaptation, speech enhancement, microphone arrays and echo cancellation. In some instances, a simple frame-level decision (e.g., yes or no) of whether a desired signal is present or absent is sufficient for the application. However, even with simple decisions, decision-making criteria or requirements can vary from application to application and/or, within an application, based on current circumstances. For example, with source localization, it is typically important to employ a system that mitigates false positives, or false detections (e.g., classifying a noise-only frame as a speech frame), whereas in speech coding a high speech detection rate (e.g., a high rate of true positives) at the cost of an increased number of false positives is commonly acceptable and desirable.
In other instances, a simple determination of whether a desired signal is present or absent is insufficient. With these applications, it is often necessary to estimate a probability of the presence of speech in one or more frames and/or associated time-frequency bins (atoms, units). A threshold can be defined and utilized in connection with the estimated probability to facilitate deciding whether the desired signal is present. An ideal system is one that generates calibrated probabilities that accurately reflect the actual frequency of occurrence of the event (e.g., presence of a desired signal). Such a system can optimally make decisions based on utility theory and combine decisions from independent sources utilizing simple rules. Furthermore, the ideal system should be simple and light on resource consumption.
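To illustrate how a calibrated probability can be turned into an application-specific decision, the following minimal sketch applies an expected-utility rule. The function names and the utility values are hypothetical and chosen only for illustration; they are not taken from any particular system. The sketch assumes that correct decisions are preferred over incorrect ones, so the derived threshold lies between 0 and 1.

```python
def decide_present(p_present, u_tp, u_fp, u_fn, u_tn):
    """Return True when declaring the signal present has the higher expected utility.

    p_present is assumed to be a calibrated probability that the signal is present.
    u_tp, u_fp, u_fn, u_tn are hypothetical utilities of a true positive,
    false positive, false negative and true negative.
    """
    eu_declare = p_present * u_tp + (1.0 - p_present) * u_fp
    eu_reject = p_present * u_fn + (1.0 - p_present) * u_tn
    return eu_declare > eu_reject


def probability_threshold(u_tp, u_fp, u_fn, u_tn):
    """Probability above which declaring 'present' is the better choice.

    Obtained by solving eu_declare > eu_reject for p_present; assumes
    u_tp > u_fn and u_tn > u_fp.
    """
    gain_if_present = u_tp - u_fn
    gain_if_absent = u_tn - u_fp
    return gain_if_absent / (gain_if_present + gain_if_absent)


# Hypothetical utilities: false positives are costly for source localization,
# whereas missed speech is costly for speech coding.
LOCALIZATION = dict(u_tp=1.0, u_fp=-9.0, u_fn=-1.0, u_tn=0.0)
CODING = dict(u_tp=1.0, u_fp=-1.0, u_fn=-9.0, u_tn=0.0)

if __name__ == "__main__":
    p = 0.5  # same estimated probability for both applications
    print("localization declares speech:", decide_present(p, **LOCALIZATION))
    print("coding declares speech:", decide_present(p, **CODING))
```

Under these assumed utilities, the same estimated probability can lead to different decisions in the two applications, because each set of utilities implies a different threshold on the probability.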
Conventionally, many signal detection approaches that detect the presence of a desired signal or estimate its probability at the frame level have been proposed. One popular technique is to utilize a likelihood ratio (LR) test that is based on Gaussian (normal) distribution models. For example, a voice activity detector can be implemented utilizing an LR test. Such a voice activity detector typically employs a short-term spectral representation of the signal. In some implementations of this idea, a smoothed signal-to-noise ratio (SNR) estimate of respective frames can be used as an intermediate representation. Unfortunately, this technique, as well as other LR-based techniques, suffers from the difficulty of threshold selection, and LR scores do not easily translate to true class probabilities. In order to convert LR scores to true class probabilities, additional information, such as prior probabilities of the hypotheses, is required. Furthermore, such techniques typically assume that both the noise and the desired signal (e.g., speech) have normal distributions with zero mean, which can be an overly restrictive assumption. Conventional techniques that attempt to improve LR tests employ larger mixtures of models, which typically are computationally expensive.
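As a simplified illustration of why an LR score alone is not a class probability, the sketch below computes a log-likelihood ratio for a frame of real-valued samples under zero-mean Gaussian models and then converts it to a posterior probability using an assumed prior. The function names, the independent-sample assumption and the known variances are all assumptions made for this example; practical detectors typically operate on spectral coefficients and must estimate these quantities.

```python
import math


def log_likelihood_ratio(frame, noise_var, signal_var):
    """Log LR of 'signal plus noise' versus 'noise only' for one frame.

    Assumes independent zero-mean Gaussian samples with variance noise_var
    under the noise-only hypothesis and noise_var + signal_var otherwise.
    """
    total_var = noise_var + signal_var
    energy = sum(x * x for x in frame)
    n = len(frame)
    return (0.5 * n * math.log(noise_var / total_var)
            + 0.5 * energy * (1.0 / noise_var - 1.0 / total_var))


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def posterior_from_llr(llr, prior_present):
    """Convert a log LR to a posterior probability via Bayes' rule.

    prior_present (strictly between 0 and 1) is the additional information
    that the LR score does not carry.
    """
    prior_log_odds = math.log(prior_present / (1.0 - prior_present))
    return _sigmoid(llr + prior_log_odds)


if __name__ == "__main__":
    frame = [0.8, -1.9, 2.3, -0.4]  # hypothetical samples
    llr = log_likelihood_ratio(frame, noise_var=1.0, signal_var=3.0)
    for prior in (0.1, 0.5, 0.9):
        print(prior, posterior_from_llr(llr, prior))
```

The same LR score maps to different posterior probabilities depending on the assumed prior, which is one reason a fixed threshold on the LR score is difficult to select.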
Some detection systems render desired signal/no desired signal decisions at the frame level (e.g., they estimate a 0/1 indicator function) and smooth the decisions over time to arrive at a crude estimate of the probabilities. Some of these techniques utilize hard and/or soft voting mechanisms on top of the indicator functions estimated at the time-frequency atom level. A technique that is frequently utilized to estimate probabilities is a linear estimation model: ρ=A+BX, where ρ is the probability, X is the input (e.g., one or more LR scores or observed features such as energies), and A and B are the parameters to be estimated. One such probability estimator, even though not explicitly formulated this way, adopts the linear model and utilizes the log of smoothed energy as the input. However, this linear model can render probabilities greater than 1 or less than 0, and the variance of the estimation error depends on the input (e.g., one or more variables).
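The out-of-range behavior of the linear model ρ=A+BX can be seen in a small sketch. The example below fits A and B by ordinary least squares to 0/1 presence labels; the fitting method, the function name and the data values are assumptions made for illustration and are not measurements from any system.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a + b * x for scalar inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical log-energy inputs and 0/1 indicators of signal presence.
log_energy = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
present = [0, 0, 0, 1, 1, 1]

A, B = fit_linear(log_energy, present)
estimates = [A + B * x for x in log_energy]

if __name__ == "__main__":
    for x, rho in zip(log_energy, estimates):
        flag = "" if 0.0 <= rho <= 1.0 else "  <- outside [0, 1]"
        print(f"X={x:.1f}  rho={rho:.3f}{flag}")
```

With this data, the fitted line falls below 0 at the lowest input and rises above 1 at the highest, so the estimates at the extremes cannot be interpreted as probabilities. In addition, a 0/1 label with presence probability ρ has variance ρ(1-ρ), which changes with the input, consistent with the input-dependent error variance noted above.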