1. Field of the Invention (Technical Field)
The present invention relates to speech enhancement methods, apparatuses, and computer software, particularly for noisy environments.
2. Description of Related Art
Note that the following discussion refers to a number of publications by author(s) and year of publication, and that due to recent publication dates certain publications are not to be considered as prior art vis-a-vis the present invention. Discussion of such publications herein is given for more complete background and is not to be construed as an admission that such publications are prior art for patentability determination purposes.
Enhancement of noisy speech remains an active area of research due to the difficulty of the problem. Standard methods such as spectral subtraction and iterative Wiener filtering can increase signal-to-noise ratio (SNR) or improve perceptual evaluation of speech quality (PESQ) scores, but at the expense of other distortions such as musical artifacts. Other methods have recently been proposed, such as the generalized subspace method, which can deal with non-white additive noise. With all of these methods, PESQ can be improved by as much as 0.6 for speech with 10 to 30 dB input SNR. The effectiveness of these methods deteriorates rapidly below 5 dB input SNR.
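As a rough illustration of the first of these standard methods, power spectral subtraction can be sketched as follows. The function names, the flooring constant, and the example spectra are assumptions made for illustration and are not taken from any cited work. The floor that keeps each bin non-negative is also where isolated residual peaks, heard as musical artifacts, tend to originate.

```python
import math

def spectral_subtract(noisy_power, noise_power, floor=0.01):
    """Subtract an estimated noise power spectrum from a noisy power spectrum.

    Bins that would fall below `floor * noisy_power` are clamped to that value
    (an illustrative spectral floor, not a value from the source).
    """
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - n, floor * y) for y, n in zip(noisy_power, noise_power)]

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from total signal and noise power."""
    return 10.0 * math.log10(signal_power / noise_power)

# Hypothetical per-bin power values for one frame.
noisy = [4.0, 9.0, 1.0, 0.5]
noise = [1.0, 1.0, 1.0, 1.0]
enhanced = spectral_subtract(noisy, noise)
```

Bins dominated by noise (the last two above) end up at the floor rather than at a meaningful speech estimate, which is one reason such methods degrade at low input SNR.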
Gaussian Mixture Models (GMMs) of a speaker's mel-frequency cepstral coefficient (MFCC) vectors have been successfully used for over a decade in speaker recognition (SR) systems. Due to the non-deterministic aspects of speech, it is desirable to model each acoustic class with a Gaussian probability density function, since the actual sound produced for the same acoustic class will vary from instance to instance. Since GMMs can model arbitrary distributions, they are well suited to modeling speech for SR systems, whereby each acoustic class is modeled by a single component density.
The use of cepstral- or GMM-based systems for speech enhancement has only recently been investigated. In contrast to most speech enhancement algorithms, which do not require clean speech signals for training, recent research has assumed the availability of a clean speech signal to build user-dependent models to enhance noisy speech.
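The idea that each acoustic class corresponds to one component density can be sketched as a posterior computation over the components of a GMM. The following is a minimal sketch assuming diagonal covariances; the function names and the two-class model are hypothetical and chosen only for illustration.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def class_posteriors(x, weights, means, variances):
    """Posterior probability of each GMM component (acoustic class) given x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(value - peak) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical two-class model over 2-dimensional feature vectors.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
post = class_posteriors([2.9, 3.1], weights, means, variances)
best_class = max(range(len(post)), key=post.__getitem__)
```

Selecting the component with the largest posterior is a Bayesian classification of the feature vector into an acoustic class.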
Kundu et al., “GMM based Bayesian approach to speech enhancement in signal/transform domain”, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pp. 4893-4896, April 2008, build a GMM of vectors containing time-domain samples (speech frames) from a group of speakers during the training stage. In the enhancement stage, the minimum mean-square error (MMSE) estimate of each noisy speech frame is computed, relying on the time-domain independence of the noise and speech. The authors report up to 11 dB improvement in output SNR for low input SNR (−5 to 10 dB) with additive white Gaussian noise.
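The general form of an MMSE estimate under a GMM prior with independent additive white Gaussian noise can be sketched in one dimension as follows. This is the textbook scalar form, offered as an assumption-laden illustration; it is not the frame-vector formulation of Kundu et al., and the function name and parameters are hypothetical.

```python
import math

def gmm_mmse_scalar(y, weights, means, variances, noise_var):
    """MMSE estimate of x from y = x + n, with x ~ GMM and n ~ N(0, noise_var)."""
    # Likelihood of y under each component: N(mean_k, var_k + noise_var),
    # which follows from the independence of speech and noise.
    likes = []
    for w, m, v in zip(weights, means, variances):
        total_var = v + noise_var
        density = math.exp(-0.5 * (y - m) ** 2 / total_var)
        density /= math.sqrt(2.0 * math.pi * total_var)
        likes.append(w * density)
    norm = sum(likes)
    estimate = 0.0
    for like, m, v in zip(likes, means, variances):
        gain = v / (v + noise_var)  # per-component Wiener-style gain
        estimate += (like / norm) * (m + gain * (y - m))
    return estimate
```

With a single zero-mean, unit-variance component and unit noise variance, the estimate is simply half of the observation; with several components, the posterior weights blend the per-component estimates.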
Kundu et al., “Speech Enhancement Using Intra-frame Dependency in DCT Domain”, in Proc. European Signal Processing Conference (EUSIPCO), August 2008, extended their work whereby a discrete cosine transform (DCT) is used to decorrelate the time-domain samples. The decorrelated samples of the speech frame can then be split into subvectors for individual modeling by a GMM. The authors achieved 6-10 dB improvement in output SNR and 0.2-0.8 PESQ improvement for input SNRs of 0 to 10 dB for a variety of noise types.
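The decorrelate-then-split step can be illustrated with an orthonormal DCT-II. This is a minimal sketch; the frame values, the subvector length, and the function names are assumptions for illustration and do not reproduce the cited authors' configuration.

```python
import math

def dct_ii(frame):
    """Orthonormal DCT-II of a list of samples."""
    n = len(frame)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(frame))
        out.append(scale * acc)
    return out

def idct_ii(coeffs):
    """Inverse of dct_ii (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(acc)
    return out

def split_subvectors(coeffs, size):
    """Split a coefficient vector into consecutive subvectors of length `size`."""
    return [coeffs[i:i + size] for i in range(0, len(coeffs), size)]

# Hypothetical 8-sample frame.
frame = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
coeffs = dct_ii(frame)
subvectors = split_subvectors(coeffs, 4)
restored = idct_ii(coeffs)
```

Because the transform is orthonormal, it preserves frame energy and is exactly invertible, so each subvector can be modeled separately and the frame recovered afterwards.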
Mouchtaris et al., “A spectral conversion approach to single-channel speech enhancement”, IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 4, pp. 1180-1193, May 2007, build a GMM of a distribution of vectors containing the line spectral frequencies (LSFs) for the (assumed) jointly Gaussian speech and noisy speech. Enhancement for a separate speaker and noise pair is estimated based on a probabilistic linear transform, and the enhanced LSFs are used to estimate a linear filter for speech synthesis (iterative Wiener or Kalman filter). The authors report an output average segmental SNR value from 3-13 dB for low input SNR (−5 to 10 dB) with additive white Gaussian noise.
Deng et al., “Estimating cepstrum of speech under the presence of noise using a joint prior of static and dynamic features”, IEEE Trans. Speech Audio Process., vol. 12, no. 3, pp. 218-233, May 2004, use MFCCs and Δ-MFCCs to model clean speech, a separate recursive algorithm to estimate the noise, and construct a linearized model of the nonlinear acoustic environment via a truncated Taylor series approximation (using an iterative algorithm to compute the expansion point). Results are measured by improvement in speech recognition accuracy, with word recognition rates between 54% and 99% depending on noise type and SNR.
The present invention provides a two-stage speech enhancement technique which uses GMMs to model the MFCCs from clean and noisy speech. A novel acoustic class mapping matrix (ACMM) allows the invention to probabilistically map the identified acoustic class in the noisy speech to an acoustic class in the underlying clean speech. Finally, the invention uses the identified acoustic classes to estimate the clean MFCC vector. Results show that one can improve PESQ in environments as low as −10 dB input SNR.
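One plausible way to obtain a probabilistic mapping between noisy and clean acoustic classes is to count how often each pair of classes co-occurs on time-aligned frames and to normalize the counts. The sketch below is an assumption made for illustration only: this section does not state how the ACMM is actually constructed, and the function names, the use of hard class labels, and the example label sequences are all hypothetical.

```python
def build_acmm(clean_classes, noisy_classes, num_clean, num_noisy):
    """Estimate P(clean class | noisy class) from time-aligned class labels.

    Returns a matrix indexed as acmm[clean][noisy]; the column of every
    noisy class that was observed sums to 1.
    """
    if len(clean_classes) != len(noisy_classes):
        raise ValueError("label sequences must be time-aligned")
    counts = [[0] * num_noisy for _ in range(num_clean)]
    for c, n in zip(clean_classes, noisy_classes):
        counts[c][n] += 1
    acmm = [[0.0] * num_noisy for _ in range(num_clean)]
    for n in range(num_noisy):
        col_total = sum(counts[c][n] for c in range(num_clean))
        if col_total == 0:
            continue  # noisy class never observed; leave its column at zero
        for c in range(num_clean):
            acmm[c][n] = counts[c][n] / col_total
    return acmm

def map_to_clean(acmm, noisy_class):
    """Most probable clean class for an identified noisy class."""
    column = [row[noisy_class] for row in acmm]
    return max(range(len(column)), key=column.__getitem__)

# Hypothetical per-frame class labels from the clean and noisy GMMs.
clean_labels = [0, 0, 1, 1, 2, 2, 2, 0]
noisy_labels = [1, 1, 0, 0, 2, 2, 0, 1]
acmm = build_acmm(clean_labels, noisy_labels, 3, 3)
```

In this toy example noisy class 0 maps to clean class 1 with probability 2/3 and to clean class 2 with probability 1/3, which shows why the mapping is probabilistic rather than one-to-one.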
Other arguably related references include the following:
A. Acero, U.S. Pat. No. 7,047,047, “Non-Linear Observation Model for Removing Noise from Corrupted Signals”, relates to a speech enhancement system to remove noise from a speech signal. The method estimates the noise, clean speech, and the phase between the clean speech and noise as three hidden variables. The model describing the relationship between these hidden variables is constructed in the log mel-frequency domain. Many assumptions are invoked to allow the determination of closed-form solutions to the conditional probabilities and minimum mean square error (MMSE) estimators for the hidden variables. This system operates in the log mel-frequency domain rather than in the mel-frequency cepstral domain. One of the benefits of the present invention is that it can operate directly in the cepstral domain, allowing for utilization of the excellent acoustic modeling of that particular domain. Acero's system explicitly computes an estimate of the noise signal, whereas the present invention models the perturbation to the clean speech features due to noise. Furthermore, the removal of noise (speech enhancement) in Acero's system uses distinctly different methods. Since the present invention operates in a different feature domain (mel-frequency cepstrum rather than mel-frequency spectrum), it cannot make many of the assumptions of the Acero system. Rather, the invention statistically modifies the MFCCs of the noisy signal. The statistical modification of the MFCCs is based on the target statistics of the GMM of the MFCCs from the clean training speech signal. Finally, the use of the noise-reduced feature vectors for reconstruction of the enhanced speech signal for human listening is not addressed in Acero's system.
A. Acero, U.S. Pat. No. 7,165,026, “Method of Noise Estimation Using Incremental Bayes Learning”, addresses the estimation of noise from a noisy speech signal. The present invention does not rely on an estimate of noise but rather on a model of the perturbations to clean speech due to noise. This patent does not directly address the use of a noise estimate for speech enhancement, but invokes U.S. Pat. No. 7,047,047 (described above) as one example of a methodology to make use of the noise estimate.
M. Akamine, U.S. Patent Pub. No. 2007/0276662, “Feature-Vector Compensating Apparatus, Feature-Vector Compensating Method, and Computer Product”, describes a method for compensating (enhancing) speech in the presence of noise. In particular, the method describes a means to compute compensating vectors for a plurality of noise environments. Given noisy speech, the degree of similarity to each of the known noise environments is computed, and this estimate of the noise environment is used to compensate the noisy feature vector. Moreover, a weighted average of compensated feature vectors can be used. The specific compensating (enhancement) method targeted by this invention is the SPLICE (Stereo-based Piecewise Linear Compensation for Environments) method, which makes use of the Mel-frequency Cepstral Coefficients (MFCCs) as well as delta and delta-delta MFCCs as acoustic feature vectors. Automatic speech recognition and speaker recognition are the specific applications targeted by the invention. The reconstruction of the enhanced speech signal for human listening is not addressed in Akamine's system. The use of the SPLICE method for compensation of the acoustic feature vectors (not covered by this publication but invoked as the targeted method of feature vector compensation) relies on the use of stereo audio recordings. The present invention uses single channel (i.e., one microphone) recordings for enhancement of speech. Furthermore, the SPLICE algorithm computes a piecewise linear approximation for the relationship between noisy speech feature vectors and clean speech feature vectors, invoking assumptions regarding the probability density functions of the feature vectors and the conditional probabilities. 
The present invention estimates the clean speech feature vectors by means of a novel acoustic class mapping matrix relating the individual component densities in the GMM for the clean speech and noisy model (modeling the perturbation of the clean speech cepstral vectors due to noise). The reconstruction of the enhanced speech signal for human listening is not addressed in Akamine's system, but rather this publication is targeting automatic speech or speaker recognition.
M. Akamine, U.S. Patent Pub. No. 2007/0260455, “Feature-Vector Compensating Apparatus, Feature-Vector Compensating Method, and Computer Program Product”, describes a method for compensating (enhancing) speech in the presence of noise. This publication is very similar to the inventor's other publication discussed above. However, this publication uses a Hidden Markov Model (HMM) for a different determination of the sequence of noise environments in each frame than was used in the other publication.
A. Bayya, U.S. Pat. No. 5,963,899, “Method and System for Region Based Filtering of Speech”, describes a speech enhancement system to remove noise from a speech signal. The method divides the noisy signal into short time frames, classifies the underlying sound type, chooses a filter from a predetermined set of filterbanks, and adaptively filters the signal in order to remove noise. The classification of sound type is based on training the system using an artificial neural network (ANN). The above system operates entirely in the time domain and this is stressed in the applications. That is, the system operates on the speech wave itself whereas the present invention extracts mel-frequency cepstral coefficients (MFCCs) from the speech and operates on these. There are many speech enhancement methods that operate in the time domain whereas the present invention is the first to operate in the MFCC domain, which is a much more powerful approach. Although both systems are trained to “recognize” sound types, the methods of training, classification, and definition of “types” are very different. In Bayya's system the sound types are phonemes such as vowels, fricatives, nasals, stops, and glides. The operator of the system must manually segment a clean speech signal into these types and train the ANN on these types ahead of time. The noisy signal is then split up into frames and each frame is classified according to the ANN. In the present invention, one trains a Gaussian Mixture Model (GMM), which is a statistical model and very different from an ANN. The present invention is automatically trained in that one simply presents a user's clean speech signal, a parallel noisy version is automatically created, and the model is trained on both time-aligned signals. The present invention is user-dependent in that the model is trained for a single person who uses the system. Although Bayya's method is trained, their system is user-independent.
The model of the present invention is not based on a few sound types at the level of phoneme but on much finer acoustic classes based on statistics of the Gaussian distribution of these acoustic classes. The present invention preferably uses between 15 and 40 acoustic classes and a Bayesian classifier of MFCCs in order to determine the underlying acoustic class in the noisy signal, which is significantly different from Bayya's invention. Based on the classification by the ANN, Bayya's system then chooses a filterbank and adaptively filters the noisy speech signal. The present invention preferably employs no noise-reduction filters (neither filterbanks nor adaptive filters) but rather statistically modifies the MFCCs of the noisy signal. The statistical modification of the MFCCs is based on the target statistics of the GMM of the MFCCs from the clean training speech signal. Finally, in Bayya's system the enhanced speech signal is “stitched” together by simply overlapping and adding the time-domain speech frames. The present invention employs a more elaborate method of reconstructing the speech signal since it operates in the MFCC domain. The present invention also provides a new method to invert the MFCCs back into the speech waveform based on inverting each of the steps in the MFCC process.
H. Bratt, U.S. Patent Pub. No. 2008/0010065, “Method and Apparatus for Speaker Recognition”, describes a system for speaker recognition (SR), that is, for recognizing a speaker based on their voice signal. This publication does not address enhancing a speech signal, i.e., removing noise for human listening, which is the subject of the present invention.
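The automatic creation of a parallel, time-aligned noisy version of a clean training signal can be sketched as additive mixing at a chosen input SNR. This is a minimal sketch assuming additive noise; the function name, the synthetic sine-wave "speech", the Gaussian noise, and the −10 dB target are illustrative assumptions rather than details given in this section.

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` and add it to `clean` so the mixture has the requested SNR."""
    if len(clean) != len(noise):
        raise ValueError("signals must be the same length")
    clean_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise has zero power")
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(clean, noise)]

rng = random.Random(0)  # fixed seed so the example is repeatable
clean = [math.sin(2.0 * math.pi * 5.0 * t / 200.0) for t in range(200)]
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
noisy = mix_at_snr(clean, noise, snr_db=-10.0)
```

Because the noisy signal is generated sample by sample from the clean one, the two remain time-aligned, so frames extracted at the same positions describe the same underlying sound.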
J. Droppo, U.S. Pat. No. 7,418,383, “Noise Robust Speech Recognition with a Switching Linear Dynamic Model”, describes a method for speech recognition (i.e., speech-to-text) in the presence of noise using Mel-frequency cepstral coefficients as a model of acoustic features and a switching linear dynamic model for the time evolution of speech. The inventors describe a means to model the nonlinear manner in which noise and speech combine in the Mel-frequency cepstral coefficient domain as well as algorithms for reduced computational complexity for determination of the switching linear dynamic model. Since this method specifically targets automatic speech recognition, the reconstruction of the enhanced speech for human listening is not addressed in this patent. This system uses a specific model (Switching Linear Dynamic Model) for the time evolution of speech. The present invention does not invoke any model of the time-evolution of speech. The nonlinear model describing the relationship between clean speech and the noise is different than in the present invention. Firstly, the present invention models the relationship between the clean speech and the noisy signal rather than the relationship between the clean speech and the noise as in Droppo's invention. Secondly, the present invention models the perturbations of the clean feature vectors due to noise in terms of a novel acoustic class mapping matrix based on a probabilistic estimate of the relationship between individual Gaussian mixture components in the clean and noisy speech. Droppo's system estimates the clean speech and noise by invoking assumptions regarding the probability density functions (PDFs) of the speech and noise models, as well as the PDFs of the joint distributions of speech and noise. 
Droppo's system uses the minimum mean square error (MMSE) estimator, which the present invention preferably does not use under the preferred constraints (using the noisy and clean speech rather than the noise and clean speech). Furthermore, Droppo's invention does not address the reconstruction of the enhanced speech for human listening.
B. Frey, U.S. Pat. No. 7,451,083, “Removing Noise from Feature Vectors”, describes a system for speech enhancement, i.e., the removal of noise from a noisy speech signal. Separate Gaussian mixture models (GMMs) are used to model the clean speech, the noise, and the channel distortion. Moreover, the relationship between the observed noisy signal and the clean speech, noise, and channel distortion is modeled via a non-linear relationship. In the training stage, the difference between the computed noisy signal (invoking the non-linear relationship) and the measured noisy signal is computed. An estimate of the clean speech feature vectors given the noisy speech feature vectors is determined by computing the most likely combination of clean speech, noise, and channel distortion given the models (GMMs) previously computed. The difference between the computed noisy signal and the measured noisy signal is used to further refine the estimate of the clean speech feature vector. This patent does not address the use of the enhanced feature vectors for human listening. This system does not enhance speech to improve human listening of the signal as the present invention does nor does it convert the MFCCs back to a speech waveform as required for human listening. In the present invention we also create a GMM of clean speech. In the present invention, however, one does not assume access to the noise (or channel distortion), and thus one does not explicitly model the noise. Rather, one models the noisy speech signal with a separate GMM. One then links the two GMMs (clean and noisy) via a novel mapping matrix thus solving a major problem in how one can relate the two GMMs to each other. In Frey's system, the clean speech, noise, and channel distortion are all estimated by means of computing the most likely combination of speech, noise, and channel distortion (by means of a joint probability density function). 
The present invention also estimates a clean MFCC vector from the noisy one but does not use a maximum likelihood calculation over the combinations of speech and noise. These estimates are used in addition to the nonlinear model of the mixing of speech, noise, and channel distortion to estimate the clean speech feature vectors. The present invention rather uses the probabilistic mapping between noisy and clean acoustic classes (individual GMM component densities) provided by a novel acoustic class mapping matrix and modification of the noisy cepstral vectors to have statistics matching the clean acoustic classes.
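The notion of modifying a noisy cepstral vector so that its statistics match those of a clean acoustic class can be illustrated with a per-dimension shift and scale. This sketch assumes diagonal variances and a single, already-identified pair of noisy and clean classes; it is a simplified illustration under those assumptions, not the estimator of the present invention, and all names and values are hypothetical.

```python
import math

def match_statistics(y, noisy_mean, noisy_var, clean_mean, clean_var):
    """Shift and scale a noisy vector so its class statistics match the clean class."""
    return [
        cm + math.sqrt(cv / nv) * (yi - nm)
        for yi, nm, nv, cm, cv in zip(y, noisy_mean, noisy_var, clean_mean, clean_var)
    ]

# Hypothetical statistics for one noisy class and its mapped clean class.
noisy_mean, noisy_var = [1.0, -2.0], [4.0, 1.0]
clean_mean, clean_var = [0.0, 3.0], [1.0, 9.0]
estimate = match_statistics([3.0, -1.0], noisy_mean, noisy_var, clean_mean, clean_var)
```

A vector sitting one noisy standard deviation above the noisy class mean is moved to one clean standard deviation above the clean class mean, so the transformed vectors inherit the mean and variance of the clean class.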
Y. Gong, U.S. Pat. No. 6,633,842, “Sequential Determination of Utterance Log-Spectral Mean By Maximum a Posteriori Probability Estimation”, describes a system for improving automatic speech recognition (ASR), i.e., speech to text when the speech signal is subject to noise. This patent does not address enhancing a speech signal, i.e., removing noise for human listening. This patent is for a system that modifies a Gaussian Mixture Model (GMM) trained on MFCCs derived from clean speech so that one has a GMM for the noisy speech. To do this, the inventor adds an estimate of the noise power spectrum to the clean speech power spectrum, converts the estimated noisy speech spectrum to MFCC coefficients, and modifies the clean GMM parameters accordingly. The inventor's point of having two GMMs, one for clean speech and one for noisy speech, is to apply a standard statistical estimator equation so that one may estimate the clean speech feature vector. By using an estimate of the clean speech feature vector instead of the actual noisy feature vector, ASR may be improved in noisy environments. The above system creates a new GMM for noisy speech so that it can be used in a machine-based ASR; this system does not enhance speech to improve human listening of the signal nor does it convert the MFCCs back to a speech waveform as required for human listening. In the present invention one also creates a GMM of noisy speech. In the present invention, however, one does not estimate the noise power spectrum but rather creates a noisy speech signal, extracts MFCCs, and builds a GMM from scratch; one does not modify the clean GMM. One then links the two GMMs (clean and noisy) via a novel mapping matrix, thus solving a major problem in how one can relate the two GMMs to each other. The invention also estimates a clean MFCC vector from the noisy one but does not use a conditional estimator.
One cannot assume that the component densities of the GMMs are jointly Gaussian and thus the present invention resorts to a novel, non-standard estimator.
Y. Gong, U.S. Pat. No. 7,062,433, “Method of Speech Recognition with Compensation for Both Channel Distortion and Background Noise”, describes a system for improving automatic speech recognition (ASR), i.e., speech to text when the speech signal is subject to channel distortions and background noise. This patent does not address enhancing a speech signal, i.e., removing noise for human listening. The patent is directed to a system that modifies Hidden Markov Models (HMMs) trained on clean speech. To do this, the inventors add the mean of the MFCCs of the clean training signal to each of the models and subtract the mean of the MFCCs of the estimate of the noise background from each of the models. By doing this, the models are adapted for ASR in noisy environments, thus improving word recognition. The system modifies HMMs (based on clean versus noisy speech) used in a machine-based ASR; this system does not enhance speech to improve human listening of the signal nor does it convert the MFCCs back to a speech waveform as required for human listening. In Gong's work, the models for the ASR system are modified (by simple addition and subtraction of mean vectors) and not the MFCCs themselves as in the present invention. Furthermore, with the present invention direct enhancement of MFCCs includes modifications based on the covariance matrix and weights of component densities of the GMM of the MFCCs and not just the mean vector. In Gong's system, the mean MFCC vector is computed from an estimated signal whereas in the present invention the statistics of the noisy signal are first computed through a training session involving a synthesized noisy signal. In Gong's work there is no training session based on a noisy signal. Finally, in Gong's work there is no description of using the system for enhancement of noisy speech; it is only used for compensating a model in ASR when the signal is noisy.
H. Jung, U.S. Patent Pub. No. 2009/0076813, “Method for Speech Recognition using Uncertainty Information for Sub-bands in Noise Environment and Apparatus Thereof”, describes a system for improving automatic speech recognition (ASR), i.e., speech-to-text in the presence of noise. This publication does not address enhancing a speech signal, i.e., removing noise for human listening. The invention uses sub-bands and weights those frequency bands with less noise more heavily than those with more noise. In doing so, better ASR can be achieved. In this publication, no attempt is made to remove noise or modify models.
S. Kadambe, U.S. Pat. No. 7,457,745, “Method and Apparatus for Fast On-Line Automatic Speaker/Environment Adaptation for Speech/Speaker Recognition in the Presence of Changing Environments”, describes a system for automatic speech recognition (ASR) and speaker recognition (SR) that can operate in an environment where the speech sounds are distorted. The underlying speech models are adapted or modified based on incorporating the parameters of the distortion into the model. By modifying the models, no additional training is required in the noisy environment and ASR/SR accuracy is improved. This system does not enhance speech to improve human listening of the signal as in the present invention nor does it convert the MFCCs back to a speech waveform as required for human listening.
K. Kwak, U.S. Patent Pub. No. 2008/0065380, “On-line Speaker Recognition Method and Apparatus Thereof”, describes a system for speaker recognition (SR), that is, for identifying a person by the voice signal. This publication does not address enhancing a speech signal, i.e., removing noise for human listening. The work contained in this publication is reminiscent of that published by D. Reynolds et al., “Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models,” IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72-83, January 1995. Although the inventors describe using a Wiener filter to remove noise from the signal prior to identification, this publication has nothing to do with removing noise from a speech signal for purposes of enhancing speech for human listening.
E. Marcheret, U.S. Patent Pub. No. 2007/0033042, “Speech Detection Fusing Multi-Class Acoustic-Phonetic, and Energy Features”, describes a method for detection of the presence of speech in a noisy background signal. More specifically, this method involves multiple feature spaces for determination of speech presence, including mel-frequency cepstral coefficients (MFCCs). A separate Gaussian mixture model (GMM) is used to model silence, disfluent sounds, and voiced sounds. A hidden Markov model (HMM) is also used to model the context of the phonemes. This method does not address the enhancement of noisy speech, but only the detection of speech in a noisy signal. In Marcheret's system the sound types are broad phonetic classes such as silence, unvoiced, and voiced phonemes. It is unclear from the publication whether the operator of the system must manually segment speech into silence, unvoiced, and voiced frames for training. Each of these broad phonetic classes is modeled by a separate GMM. In the present invention, one also trains a GMM, but the system is automatically trained in that one simply presents a user's clean speech signal, a parallel noisy version is automatically created, and the model is trained on both time-aligned signals. The model of the present invention is not based on a few sound types at the level of phoneme but on much finer acoustic classes based on statistics of the Gaussian distribution of these acoustic classes. The present invention preferably uses between 15 and 40 acoustic classes. Furthermore, the present invention is not targeted to the detection of speech in a noisy signal but to the enhancement of that noisy speech.
M. Seltzer, U.S. Pat. No. 7,454,338, “Training Wideband Acoustic Models in the Cepstral Domain Using Mixed-Bandwidth Training Data and Extended Vectors for Speech Recognition”, describes a method to compute wideband acoustic models from narrow-band (or mixed narrow- and wide-band) training data. This method is described to operate in both the spectrum and cepstrum; in both embodiments, the method provides a means to estimate the missing high-frequency spectral components induced by use of narrowband (telephone channel) recordings. This method does not address enhancing a speech signal, i.e., removing noise for human listening.
J. Wu, U.S. Patent Pub. No. 2005/0182624, “Method and Apparatus for Constructing a Speech Filter Using Estimates of Clean Speech and Noise”, describes a means to enhance speech in the presence of noise. The clean speech and noise are estimated from the noisy signal and used to define filter gains. These filter gains are used to estimate the clean spectrum from the noisy spectrum. Both Mel-frequency cepstral coefficients and regular cepstral coefficients (no Mel weighting) are addressed as possible acoustic feature vectors. The observed noisy feature vector sequence is used to estimate the noise model (possibly a single Gaussian) in a maximum likelihood sense. The clean speech model is a Gaussian mixture model (GMM). Estimates of the clean speech and noise are determined from the noisy signal with a minimum mean square error (MMSE) estimate. The clean speech and noise estimates (in the cepstral domain) are taken back to the spectral domain. These spectral estimates are smoothed over time and frequency and are used to estimate Wiener filter gains. This Wiener filter is used to filter the original noisy spectral values to generate the spectrum of clean speech. This clean spectrum can be used either to reconstruct the original signal or to generate clean MFCCs for automatic speech recognition. The present invention makes no assumption concerning the noise, but rather models the perturbation of the clean speech due to the noise. Furthermore, Wu's invention estimates the clean speech in the spectral domain by means of a Wiener filter applied to the noisy spectrum. The present invention estimates the clean speech in the cepstrum by a novel acoustic class mapping matrix relating the individual component densities in the GMM for the clean speech and noisy model (modeling the perturbation of the clean speech cepstral vectors due to noise).
One of the benefits of the present invention is that it can operate directly in the cepstral domain, allowing for utilization of the excellent acoustic modeling of that particular domain. While both methods make use of Mel-frequency cepstral coefficients and Gaussian mixture models to model clean speech, this is a commonly accepted means for acoustic modeling, specifically for automatic speech recognition as targeted by Wu's invention. Furthermore, Wu uses the minimum mean square error (MMSE) estimator for clean speech and noise. With the present invention, using the noisy and clean speech rather than the clean speech and noise, one cannot rely on the use of an MMSE estimator for estimation of the clean speech. Rather, one uses knowledge of the relationship between individual component densities in the GMM for both clean and noisy speech to modify the noisy MFCCs to have statistics closer to the anticipated clean speech component density. Finally, while the publication does mention that the clean spectrum estimate can be used to reconstruct speech, specifics of this reconstruction are not addressed. Rather, the focus of Wu's invention appears to be the use of the clean spectrum for subsequent computation of clean MFCCs for use in automated speech recognition. Furthermore, the present invention does not make use of any smoothing over time or frequency as does Wu in his invention.
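For reference, the Wiener gain step attributed above to Wu's system (gains formed from clean speech and noise estimates, then applied to the noisy spectrum) can be sketched in its generic textbook form. The function names and numerical values are hypothetical, the smoothing over time and frequency is omitted, and the sketch is not drawn from Wu's publication.

```python
def wiener_gains(speech_power, noise_power):
    """Per-bin Wiener gain S / (S + N) from speech and noise power estimates."""
    gains = []
    for s, n in zip(speech_power, noise_power):
        total = s + n
        gains.append(s / total if total > 0.0 else 0.0)
    return gains

def apply_gains(noisy_spectrum, gains):
    """Scale each noisy spectral value by its gain."""
    return [g * y for g, y in zip(gains, noisy_spectrum)]

# Hypothetical power estimates and noisy magnitude values for one frame.
speech_est = [9.0, 1.0, 0.0]
noise_est = [1.0, 1.0, 2.0]
gains = wiener_gains(speech_est, noise_est)
filtered = apply_gains([2.0, 2.0, 2.0], gains)
```

Each gain lies between 0 and 1, so bins estimated to be mostly noise are attenuated while bins estimated to be mostly speech pass nearly unchanged; this filtering happens in the spectral domain, in contrast to the cepstral-domain modification described for the present invention.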