Over the next few years, speech is expected to become one of the most widely used input modalities for interacting with computer systems. In addition to keystrokes, mouse clicks, and visible body gestures, speech can improve the way that users interact with computerized systems. Processed speech can be recognized to discern what we say, and even who we are. Speech signals are increasingly being used to gain access to computer systems, and to operate the systems using voiced commands and information.
If the speech signals are "clean," and produced in an acoustically pristine environment, then the task of processing the signals to produce good results is relatively straightforward. However, as we use speech to interact with systems in a wider variety of environments, for example, offices, homes, roadside telephones, or for that matter anywhere we can carry a cellular phone, compensating for acoustical differences among these environments becomes a dominant problem in providing robust speech processing.
Generally, two types of effects can cause clean speech to become "dirty." The first effect is distortion of the speech signals themselves. The acoustic environment can distort audio signals in innumerable ways. Signals can unpredictably be delayed, advanced, duplicated to produce echoes, changed in frequency and amplitude, and so forth. In addition, different types of telephones, microphones and communication lines can introduce yet another set of distortions.
The second soiling effect is "noise." Noise is due to additional signals in the speech frequency spectrum that are not part of the original speech. Noise can be introduced by other people talking in the background, office equipment, cars, planes, the wind, and so forth. Thermal noise in the communications channels can also add to the speech signals. The problem of processing dirty speech is compounded by the fact that the distortions and noise can change dynamically over time.
Generally, robust speech processing includes the following steps. In a first step, digitized speech signals are partitioned into time aligned portions (frames) where acoustic features can generally be represented by linear predictive coefficient (LPC) "feature" vectors. In a second step, the vectors can be cleaned up using environmental acoustic data. That is, processes are applied to the vectors representing dirty speech signals so that a substantial amount of the noise and distortion is removed. As measured by statistical comparison methods, the cleaned-up vectors more closely resemble vectors of similar speech produced in a clean environment. Then in a third step, the cleaned feature vectors can be presented to a speech processing engine which determines how the speech is going to be used. Typically, the processing relies on the use of statistical models or neural networks to analyze and identify speech signal patterns.
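The first step above, partitioning digitized speech into frames, can be illustrated with a minimal sketch. The function name `split_into_frames` and the parameters `frame_len` and `hop` are hypothetical, and the use of overlapping frames is an assumption; the text does not specify a frame length or overlap.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Partition a digitized signal into time-aligned frames.

    frame_len: number of samples per frame (assumed value, chosen by the caller).
    hop: number of samples between the starts of consecutive frames;
         hop < frame_len gives overlapping frames (an assumption of this sketch).
    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, last_start + 1, hop)]


if __name__ == "__main__":
    signal = [float(i) for i in range(10)]
    for frame in split_into_frames(signal, frame_len=4, hop=2):
        print(frame)
```

Each resulting frame would then be converted to a feature vector before the clean-up and recognition steps; that conversion is outside the scope of this sketch.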
In an alternative approach, the feature vectors remain dirty. Instead, the pre-stored statistical models or networks which will be used to process the speech are modified to resemble the characteristics of the feature vectors of dirty speech. This way a mismatch between clean and dirty speech, or their representative feature vectors, can be reduced.
By applying the compensation to the processes (or speech processing engines) themselves, instead of to the data, i.e., the feature vectors, the speech analysis can be configured to solve a generalized maximum likelihood problem where the maximization is over both the speech signals and the environmental parameters. Although such generalized processes have improved performance, they tend to be more computationally intensive. Consequently, prior art applications requiring real-time processing of dirty speech signals are more inclined to condition the signal, instead of the processes, leading to less than satisfactory results.
Compensated speech processing has become increasingly sophisticated in recent years. Some of the earliest processes use cepstral mean normalization (CMN) and relative spectral (RASTA) methods. These methods are two versions of the same mean subtraction method. There, the idea is to subtract an estimate of the mean of the measured speech from incoming frames of speech. Classical CMN subtracts the mean representing all of the measured speech from each speech frame, while RASTA subtracts a "lag" estimate of the mean from each frame.
Both the CMN and the RASTA methods compensate directly for differences in channel characteristics, resulting in improved performance. Because both methods use a relatively simple implementation, they are used in many speech processing systems.
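The two mean subtraction variants described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and the "lag" estimate is modeled here as an exponentially weighted running mean with an assumed smoothing factor `alpha`, which stands in for the RASTA filter rather than reproducing it.

```python
from typing import List, Sequence

Frame = Sequence[float]


def cmn(frames: Sequence[Frame]) -> List[List[float]]:
    """Classical CMN: subtract the mean over all frames from every frame."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def lag_mean_subtraction(frames: Sequence[Frame], alpha: float = 0.98) -> List[List[float]]:
    """Subtract a running ("lag") estimate of the mean from each frame.

    alpha is an assumed smoothing factor; the running mean is initialized
    with the first frame and updated before each subtraction.
    """
    out: List[List[float]] = []
    mean: List[float] = []
    for f in frames:
        if not mean:
            mean = list(f)
        else:
            mean = [alpha * m + (1.0 - alpha) * x for m, x in zip(mean, f)]
        out.append([x - m for x, m in zip(f, mean)])
    return out


if __name__ == "__main__":
    clean = [[1.0, 2.0], [3.0, 4.0]]
    # A fixed channel shows up as a constant additive offset in the cepstral domain.
    shifted = [[x + 5.0 for x in f] for f in clean]
    print(cmn(clean))
    print(cmn(shifted))  # the constant offset is removed by the mean subtraction
```

Unlike `cmn`, the running-mean variant does not need the whole utterance before it can produce output, which is the practical difference between the two.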
A second class of efficient compensation methods relies on stereo recordings. One recording is taken with a high performance microphone for which the speech processing system has already been trained, and another recording is taken with a target microphone to be adapted to the system. This approach can be used to provide a boot-strap estimate of speech statistics for retraining. Stereo-pair methods that are based on simultaneous recordings of both the clean and dirty speech are very useful for this problem.
In a probabilistic optimum filtering (POF) method, a vector quantization (VQ) codebook is used. The VQ describes the distribution of mel-frequency cepstral coefficients (MFCC) of clean speech combined with a codeword dependent multi-dimensional transversal filter. The purpose of the filter is to capture temporal correlations between frames of speech displaced in time. POF "learns" the parameters of each frame dependent VQ filter (a matrix) and each environment by minimizing a least-squares error criterion between the predicted and measured speech.
Another known method, Fixed Codeword Dependent Cepstral Normalization (FCDCN), similar to the POF method, also uses a VQ representation for the distribution of the clean speech cepstrum vectors. This method computes codeword dependent correction vectors based on simultaneously recorded speech. As an advantage, this method does not require a modeling of the transformation from clean to dirty speech. However, in order to achieve this advantage, stereo recording is required.
Generally, these speech compensation methods do not make any assumptions about the environment because the effect of the environment on the cepstral vectors is directly modeled using stereo recordings.
In one method, Codeword Dependent Cepstral Normalization (CDCN), the cepstra of clean speech signals are modeled using a mixture of Gaussian distributions where each Gaussian can be represented by its mean and covariance. The CDCN method analytically models the effect of the environment on the distribution of the clean speech cepstra.
In a first step of the method, the values of the environmental parameters (noise and distortion) are estimated to maximize the likelihood of the observed dirty cepstrum vectors. In a second step, a minimum mean squared error (MMSE) estimation is applied to discover the unobserved cepstral vectors of the clean speech given the cepstral vectors of the dirty speech.
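The idea of codeword dependent correction vectors learned from stereo recordings can be sketched as below. This is a deliberately simplified illustration, not the FCDCN method itself: the function names are hypothetical, the codebook is assumed to be given, each stereo pair is assigned to a codeword by its clean vector, and any further dependencies that a full method might use are ignored.

```python
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def sq_dist(a: Vec, b: Vec) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook: Sequence[Vec], v: Vec) -> int:
    """Index of the codeword closest to v."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], v))


def learn_corrections(codebook: Sequence[Vec],
                      stereo_pairs: Sequence[Tuple[Vec, Vec]]) -> List[List[float]]:
    """Average (clean - dirty) over the stereo pairs assigned to each codeword."""
    dim = len(codebook[0])
    sums = [[0.0] * dim for _ in codebook]
    counts = [0] * len(codebook)
    for clean, dirty in stereo_pairs:
        k = nearest(codebook, clean)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += clean[d] - dirty[d]
    # Codewords with no training pairs get a zero correction.
    return [[s / counts[k] if counts[k] else 0.0 for s in sums[k]]
            for k in range(len(codebook))]


def compensate(codebook: Sequence[Vec], corrections: Sequence[Vec], dirty: Vec) -> List[float]:
    """Apply the correction whose corrected vector lands closest to its own codeword."""
    def corrected(k: int) -> List[float]:
        return [z + r for z, r in zip(dirty, corrections[k])]
    best = min(range(len(codebook)), key=lambda k: sq_dist(corrected(k), codebook[k]))
    return corrected(best)


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [10.0, 10.0]]
    pairs = [([0.0, 0.0], [1.0, 1.0]), ([10.0, 10.0], [12.0, 12.0])]
    corrections = learn_corrections(codebook, pairs)
    print(corrections)
    print(compensate(codebook, corrections, [12.0, 12.0]))
```

Note that training needs the clean half of each stereo pair, while compensation at run time sees only the dirty vector, which is why the codeword is chosen by testing each candidate correction.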
The method typically works on a sentence-by-sentence or batch basis, and, therefore, needs fairly long samples (e.g., a couple of seconds) of speech to estimate the environmental parameters. Because of the latencies introduced by the batching process, this method is not well suited for real-time processing of continuous speech signals.
A parallel model combination (PMC) method assumes the same models of the environment as used in the CDCN method. Assuming perfect knowledge of the noise and channel distortion vectors, the method tries to transform the mean vectors and the covariance matrices of the acoustical distributions of hidden Markov models (HMMs) to make the HMMs more similar to an ideal distribution of the cepstra of dirty speech.
Several possible alternative techniques are known to transform the mean vectors and covariance matrices. However, all these variations of the PMC require prior knowledge of noise and channel distortion vectors. The estimation is generally done beforehand using different approximations. Typically, samples of isolated noise are required to adequately estimate the parameters of the PMC. These methods have shown that distortion in the channel affects the mean of the measured speech statistics, and that the effective SNR at a particular frequency controls the covariance of the measured speech.
Using a vector Taylor series (VTS) method for speech compensation, this fact can be exploited to estimate the dirty speech statistics given clean speech statistics. The accuracy of the VTS method depends on the size of the higher order terms of the Taylor series approximation. The higher order terms are controlled by the size of the covariance of the speech statistics.
With VTS, the speech is modeled using a mixture of Gaussian distributions. By modeling the speech as a mixture, the covariance of each individual Gaussian is smaller than the covariance of the entire speech. In order for VTS to work, it can be shown that the mixture model is necessary to solve the maximization step. This is related to the concept of sufficient richness for parameter estimation.
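The way a Taylor series expansion maps clean speech statistics to dirty speech statistics can be illustrated with a scalar, first-order sketch. The environment function used here, y = x + q + log(1 + exp(n - x - q)) in the log-spectral domain, with x the clean speech, q the channel, and n the noise, is an assumption of this example and is not stated in the text above; the function names are hypothetical, and n and q are treated as known, deterministic values.

```python
import math
from typing import Tuple


def dirty_log_spectrum(x: float, n: float, q: float) -> float:
    """Assumed environment model in the log-spectral domain (one frequency band).

    No guard against overflow of exp() for very large (n - x - q); this is a sketch.
    """
    return x + q + math.log1p(math.exp(n - x - q))


def vts_first_order(mu_x: float, var_x: float, n: float, q: float) -> Tuple[float, float]:
    """First-order Taylor approximation of the dirty mean and variance for one Gaussian.

    The model is linearized around the clean mean mu_x:
      mean:     mu_y  ~= f(mu_x)
      variance: var_y ~= (df/dx at mu_x)^2 * var_x
    where df/dx = 1 / (1 + exp(n - x - q)).
    """
    mu_y = dirty_log_spectrum(mu_x, n, q)
    slope = 1.0 / (1.0 + math.exp(n - mu_x - q))
    return mu_y, slope * slope * var_x


if __name__ == "__main__":
    # High SNR: noise far below the speech, so only the channel shifts the mean.
    print(vts_first_order(mu_x=2.0, var_x=4.0, n=-50.0, q=1.0))
    # Noise at the same level as the channel-filtered speech: the variance shrinks.
    print(vts_first_order(mu_x=2.0, var_x=4.0, n=3.0, q=1.0))
```

In this sketch the channel term shifts the mean while the local SNR sets the slope that scales the variance. Because the neglected higher order terms grow with the variance of the Gaussian being transformed, applying the expansion to each narrow mixture component separately keeps the linearization error smaller than applying it to the speech distribution as a whole.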
In summary, the best known compensation methods base their representations for the probability density function p(x) of clean speech feature vectors on a mixture of Gaussian distributions. The methods work in batch mode, i.e., the methods need to "hear" a substantial amount of signal before any processing can be done. The methods usually assume that the environmental parameters are deterministic, and therefore, are not represented by a probability density function. Lastly, the methods do not provide an easy way to estimate the covariance of the noise. This means that the covariance must first be learned by heuristic methods which are not always guaranteed to converge.
It is desired to provide a speech processing system where clean speech signals can naturally be represented. In addition, the system should work as a filter so that continuous speech can be processed as it is received without undue delays. Furthermore, the filter should adapt itself as the environmental parameters that turn clean speech dirty change over time.