The present invention generally relates to an apparatus and a concomitant method for processing a signal having two or more signal components. More particularly, the present invention detects the presence of a desired signal component, e.g., a speech component, in a signal using a decision function that is adaptively updated.
In real world environments, many observed signals are typically composites of a plurality of signal components. For example, if one records an audio signal within a moving vehicle, the measured audio signal may comprise a plurality of signal components, such as audio signals attributed to the tires rolling on the surface of the road, the sound of wind, sounds from other vehicles, speech signals of people within the vehicle and the like. Furthermore, the measured audio signal is non-stationary, since the signal components vary in time as the vehicle is traveling.
In such real world environments, it is often advantageous to detect the presence of a desired signal component, e.g., a speech component in an audio signal. Speech detection has many practical applications, including, but not limited to, voice or command recognition applications. Conventional speech detection methods are usually based on discriminating the total or component-wise signal power. For example, the component-wise signal powers are combined into a predefined ad-hoc decision function, which then generates a decision as to whether the current frame contains speech.
However, there are at least two difficulties associated with ad-hoc decision functions. First, ad-hoc decision functions often require the adjustment of a threshold, which is often suboptimal for a time-varying Signal-to-Noise Ratio (SNR). Second, it has been noted that many ad-hoc decision functions tend to falsely detect speech during long non-speech periods.
Therefore, a need exists in the art for detecting the presence of a desired signal component, e.g., a speech component, in a non-stationary signal using a decision function that is adaptively updated.
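By way of illustration only, the threshold difficulty noted above for ad-hoc decision functions can be sketched as follows. The function names, the frame length of 160 samples, and the threshold of -30 dB are hypothetical values chosen for the example and are not taken from any particular prior art detector.

```python
import math

def frame_power_db(frame):
    """Mean power of one frame in dB (a small floor avoids log of zero)."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)

def adhoc_detect(frame, threshold_db=-30.0):
    """Hypothetical ad-hoc rule: declare speech when the frame power
    exceeds a fixed, hand-tuned threshold."""
    return frame_power_db(frame) > threshold_db

# Constant-amplitude stand-ins for real audio frames (assumed values).
quiet_noise = [0.001] * 160   # about -60 dB
loud_noise = [0.1] * 160      # about -20 dB, e.g., road noise at speed
speech_like = [0.3] * 160     # about -10.5 dB

print(adhoc_detect(quiet_noise))   # below the threshold
print(adhoc_detect(loud_noise))    # above the threshold although it is noise
print(adhoc_detect(speech_like))   # above the threshold
```

In this sketch, a threshold that rejects the quiet noise frame accepts the loud noise frame as speech, which is the kind of false detection that arises when the noise level changes and the threshold does not.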
The present signal processing system detects the presence of a desired signal component by applying a probabilistic description to the classification and tracking of the various signal components (e.g., desired versus non-desired signal components) in an input signal. Namely, an N-component mixture model (e.g., a dual mixture where N=2) is used, where the model densities capture N signal components, e.g., two signal components having speech and non-speech features that were observed in the past, e.g., in past audio frames. Classification of a new frame is then simply a matter of computing the likelihood that the new frame corresponds to each class. In turn, an optimal threshold can be adaptively generated and updated.
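By way of illustration only, the dual mixture approach described above may be sketched as follows. The sketch makes several assumptions that do not come from this description: each frame is reduced to a single scalar feature (a frame power in dB), each class is modeled by one Gaussian density, and the model is updated by nudging the more likely component toward each new frame with a forgetting factor. The class name `TwoClassTracker`, its parameters, and all numeric values are hypothetical.

```python
import math

class TwoClassTracker:
    """Hypothetical dual mixture (N=2) over a scalar frame feature.
    Index 0 is the non-speech class, index 1 is the speech class."""

    def __init__(self, mean_noise, mean_speech, var=25.0, rate=0.05):
        self.mean = [mean_noise, mean_speech]
        self.var = [var, var]
        self.weight = [0.5, 0.5]
        self.rate = rate  # forgetting factor (assumed value)

    def _log_lik(self, x, k):
        # Log of (mixture weight * Gaussian density) for component k.
        return (math.log(self.weight[k])
                - 0.5 * math.log(2.0 * math.pi * self.var[k])
                - 0.5 * (x - self.mean[k]) ** 2 / self.var[k])

    def classify(self, x):
        """Return 1 (speech) or 0 (non-speech), whichever is more likely."""
        return 1 if self._log_lik(x, 1) > self._log_lik(x, 0) else 0

    def update(self, x):
        """Classify x, then move the winning component toward x."""
        k = self.classify(x)
        r = self.rate
        d = x - self.mean[k]
        self.mean[k] += r * d
        # Floor the variance so a component cannot collapse to a point.
        self.var[k] = max((1.0 - r) * self.var[k] + r * d * d, 1.0)
        # Update the weights, keep them away from zero, and renormalize.
        w = [(1.0 - r) * self.weight[j] + (r if j == k else 0.0)
             for j in (0, 1)]
        w = [max(v, 0.01) for v in w]
        total = sum(w)
        self.weight = [v / total for v in w]
        return k

# The noise level is initially assumed to be -60 dB, but the observed
# non-speech frames alternate between -47 dB and -43 dB.
tracker = TwoClassTracker(mean_noise=-60.0, mean_speech=-20.0)
for i in range(200):
    tracker.update(-47.0 if i % 2 == 0 else -43.0)

print(round(tracker.mean[0], 1))   # noise mean has moved toward -45 dB
print(tracker.classify(-45.0))     # a frame at the tracked noise level
print(tracker.classify(-25.0))     # a frame near the speech mean
```

In this sketch no threshold is set by hand. The decision boundary is the feature value at which the two weighted likelihoods are equal, so it shifts whenever the means, variances, or weights are updated from past frames.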