Digital voice processing is used in a number of applications for different purposes. Some of the more common commercial applications involve data compression and encoding, speech recognition, and speech detection. These applications are in demand in fields such as telecommunications, the recording arts and entertainment industry, and security and identification.
Generally all of these applications involve receiving an audio signal, sampling the audio signal to derive a digital representation, extracting overlapping frames of consecutive samples, and then decomposing each frame from its digital time domain representation into (relatively) uncorrelated components. It has been recognized that sampling a voice signal within an order of magnitude of 10 kHz (10,000 samples per second), and providing a frame size that corresponds to a time window within an order of magnitude of 10 milliseconds, may be satisfactory, depending on the specific application. There are many known transforms for decomposing a frame of samples into a plurality of independent components. The most common of these include the frequency-domain transforms such as the Fourier transform and the discrete cosine transform (DCT), wavelet decomposition transforms such as the standard wavelet transform (SWT), and adaptive transforms like the Karhunen-Loève transform (KLT).
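The sampling and framing steps above can be sketched as follows. This is a minimal illustration only; the 10 kHz rate, the 10 ms (100-sample) frame, and the 50% overlap are representative figures consistent with the orders of magnitude mentioned, not prescribed values.

```python
import numpy as np

def extract_frames(samples, frame_len=100, hop=50):
    """Split a 1-D sample array into overlapping frames.

    At a 10 kHz sampling rate, frame_len=100 gives a 10 ms window;
    hop=50 gives 50% overlap. These figures are illustrative.
    """
    n_frames = 1 + (len(samples) - frame_len) // hop
    return np.stack([samples[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

signal = np.random.randn(10_000)   # one second of samples at 10 kHz
frames = extract_frames(signal)
print(frames.shape)                # (199, 100): 199 overlapping frames
```

Each row of the resulting array is one frame, ready to be passed to a decomposition transform such as the DCT.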
The Fourier transform decomposes the samples in the window into frequency components. While the fast Fourier transform (FFT) can be computed quickly, the resulting frequency spectrum has the disadvantage of a predefined, fixed resolution. Decomposition into the time-frequency domain (e.g. by the DCT) provides frequency spectrum information relative to a given time. The DCT in particular is a low-complexity decomposition technique that can provide an excellent basis for producing highly uncorrelated components.
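A DCT decomposition of a single frame can be sketched as below, using SciPy's DCT-II. The frame contents (a 200 Hz tone plus a weaker 1.5 kHz tone) are invented for illustration; the point is that the transform is invertible, so the coefficients fully represent the frame.

```python
import numpy as np
from scipy.fft import dct, idct

# A 10 ms frame at 10 kHz: a 200 Hz tone plus a weaker 1.5 kHz tone
# (illustrative content only)
t = np.arange(100) / 10_000
frame = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 1500 * t)

coeffs = dct(frame, type=2, norm='ortho')    # DCT-II decomposition
recon = idct(coeffs, type=2, norm='ortho')   # exact inverse transform

print(np.allclose(frame, recon))             # True: lossless transform
```

With `norm='ortho'` the transform is orthonormal, so energy is preserved and the inverse reconstructs the frame exactly (to floating-point precision).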
Wavelet decomposition transforms the time domain signal into corresponding wavelets. Wavelets are mathematical functions that are useful for representing discontinuities. Adaptive basis transformations, such as the KLT, continuously adapt the basis functions into which the signal is decomposed, in an effort to maximize the capacity of the basis functions to represent the signal. While these decomposition techniques (and others besides) all provide sufficiently independent components, each has its own computational complexity, and each set of components provides its own accuracy of representation with respect to a given signal domain. Accordingly, each of these decompositions may be useful in different types of applications, for use in different environments.
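The adaptive character of the KLT can be illustrated with a short sketch: the basis is derived from the data itself, as the eigenvectors of the empirical covariance of the frames, and projecting onto that basis yields decorrelated components. The synthetic frames below are an assumption for demonstration; in practice they would come from the framing step described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated frames (rows), standing in for real speech frames
frames = rng.standard_normal((500, 100)) @ (rng.standard_normal((100, 100)) / 10)

# KLT basis: eigenvectors of the empirical covariance of the frames
cov = np.cov(frames, rowvar=False)
eigvals, basis = np.linalg.eigh(cov)         # columns are basis vectors

# Projecting centered frames onto the basis decorrelates the components
coeffs = (frames - frames.mean(axis=0)) @ basis
coeff_cov = np.cov(coeffs, rowvar=False)
off_diag = coeff_cov - np.diag(np.diag(coeff_cov))
print(np.max(np.abs(off_diag)) < 1e-6)       # True: components uncorrelated
```

The adaptivity lies in recomputing this basis as the signal statistics change, which is also the source of the KLT's comparatively high computational cost.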
Removing noise from a noise-contaminated voice signal is a well-known problem in this field. In substantially all applications it is useful to remove noise: noise is not appreciated by telephone users, media users, and the like, and is known to interfere with voice identification. Moreover, the transmission of noise-contaminated data, or the encoding of noise on storage media, is inefficient. The filtering of digital data to remove the noise is therefore widely recognized to be of value.
One feature of audio data that makes filtering difficult is that the voice signal is punctuated with silence. Speakers typically pause between words or sentences, and at other times as required to produce particular sounds, e.g. before a plosive or after a stop. This makes noise filtering difficult because, unless the silent and voice-active intervals are detected, the same filtering function cannot be applied to both without accepting a relatively poor quality of filtering. Typically a voice activity detector (VAD) is used to classify each frame as either voice-active or silent.
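The role of a VAD can be illustrated with a minimal energy-threshold detector. This sketch is one of the simplest possible approaches, shown only to make the frame-classification step concrete; the threshold rule and the factor `k` are illustrative assumptions, not the method contemplated here.

```python
import numpy as np

def energy_vad(frames, noise_frames, k=3.0):
    """Label each frame voice-active (True) or silent (False).

    A minimal energy-threshold detector: a frame is voice-active if
    its mean energy exceeds k times the calibrated noise energy.
    The rule and the factor k are illustrative assumptions.
    """
    noise_energy = np.mean(noise_frames ** 2)
    frame_energy = np.mean(frames ** 2, axis=1)
    return frame_energy > k * noise_energy

rng = np.random.default_rng(1)
noise = rng.standard_normal((20, 100)) * 0.1       # noise-only calibration
silent = rng.standard_normal((5, 100)) * 0.1       # silence (noise only)
voiced = silent + np.sin(np.linspace(0, 60, 100))  # silence plus a tone
labels = energy_vad(np.vstack([silent, voiced]), noise)
print(labels)   # first five False (silence), last five True (voice-active)
```

Real detectors must cope with non-stationary noise and low signal-to-noise ratios, which is precisely where such simple energy rules break down.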
Of course, discriminating between noise and voice at a VAD is not significantly easier than separating noise from the voice signal; the two problems are strongly analogous. Known techniques for accomplishing this are very complex, have low reliability, or both. Prior art methods have typically used a model based on a Gaussian noise distribution and a Gaussian voice distribution, and have applied statistical analysis of the energy distribution to separate the noise from the voice.
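A simplified stand-in for such Gaussian-model schemes is a per-frame log-likelihood ratio test between two zero-mean Gaussian densities with different variances. The known-variance assumption is an idealization made here for illustration; estimating those statistics reliably is much of the difficulty the prior art faces.

```python
import numpy as np

def gaussian_llr(frame, var_noise, var_voice):
    """Log-likelihood ratio of voice vs. noise for one frame, assuming
    zero-mean Gaussian models with known variances (an idealized
    stand-in for the statistical schemes described above)."""
    n = frame.size
    e = np.sum(frame ** 2)
    ll_voice = -0.5 * (n * np.log(2 * np.pi * var_voice) + e / var_voice)
    ll_noise = -0.5 * (n * np.log(2 * np.pi * var_noise) + e / var_noise)
    return ll_voice - ll_noise   # > 0: classify frame as voice-active

rng = np.random.default_rng(2)
noise_frame = rng.standard_normal(100) * 0.1   # variance 0.01 (noise)
voice_frame = rng.standard_normal(100) * 1.0   # variance 1.0 (voice)
print(gaussian_llr(noise_frame, 0.01, 1.0) > 0)   # False: noise
print(gaussian_llr(voice_frame, 0.01, 1.0) > 0)   # True: voice
```

When the true distributions depart from the Gaussian model, or when the variances must be estimated from short noisy frames, the reliability of such tests degrades, which motivates the need stated below.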
What is needed is an efficient and reliable way of separating noise from a noise-contaminated signal, especially noise from a noise-contaminated voice signal.