1. Technical Field
The present disclosure generally relates to a method and an apparatus for audio signal enhancement in a reverberant environment.
2. Related Art
Reverberation is essentially the multi-path problem of the acoustic signal and occurs in a completely or partially enclosed environment in which acoustic waves trapped in the enclosure repeatedly reflect off the surfaces of the enclosure. When a speech signal is captured by a microphone in a reverberant environment, the captured signal contains not only the direct component of the speech, but may also contain a reverberation component which interferes with the direct component, as well as any background noise component from the environment which may be picked up by the microphone. The background noise component may include white noise, noise of background cooling systems such as cooling fans, clock noise, harmonics of clock noise, and so forth.
While a human ear may be relatively immune to the effects of reverberation, typical automatic speech recognition (ASR) engines are not, as ASR accuracy in a reverberant environment could typically drop by twenty to thirty percent. If a person says “I want to play”, a current ASR engine may have difficulty recognizing the phrase since the effect of “want” may spill into “to”, and the effect of “to” may spill into “play”. If the environment is highly reverberant, the effect of “I want to” may all spill into “play”. While the background noise may be easy to remove, the reverberation may be much more difficult to eliminate, as hundreds of multi-path speech signals could be reflected into a microphone when the speech is continuous. Therefore, various endeavors in the field of speech processing have been made to identify and cancel the effect of reverberation.
One such endeavor is disclosed in a research paper by Bradford W. Gillespie et al. titled “SPEECH DEREVERBERATION VIA MAXIMUM-KURTOSIS SUBBAND ADAPTIVE FILTERING”, which is hereby incorporated by reference for all purposes. In this research paper, the microphone signal is processed using a modulated complex lapped transform (MCLT), and the subband filters are adapted to maximize the kurtosis of the linear prediction (LP) residual of the reconstructed speech. The key concept of this research paper is to control the adaptive subband filters not by a mean-square error criterion, but by a kurtosis metric of the LP residual.
Linear prediction (LP) is a mathematical technique by which the future values of a speech signal could be estimated as a linear function of previous samples. The predicted signal is subtracted from the actual signal in a process of inverse filtering, and the values remaining after the subtraction are referred to as the LP residual or LP residue. The LP residue contains information about the excitation source of speech production. In other words, the LP residue is considered to contain nearly the pure excitation source since the unwanted artifacts of the vocal tract have been removed from it. A paper published in 1975 by John Makhoul titled “LINEAR PREDICTION: A TUTORIAL REVIEW” discloses a technique for modeling and calculating the LP residual and is hereby incorporated by reference.
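As a minimal sketch of the LP residual described above, the following Python code estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion, and then subtracts the prediction from the signal. The function names, the choice of the autocorrelation method, and the predictor order are illustrative assumptions and are not taken from the cited paper.

```python
def autocorrelation(x, max_lag):
    """Return r[0..max_lag], where r[k] = sum_i x[i] * x[i + k]."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def lp_coefficients(x, order):
    """Estimate a[1..order] so that x[n] is predicted by sum_k a[k] * x[n - k].

    Uses the Levinson-Durbin recursion on the autocorrelation sequence.
    Assumes the signal is not all zeros (r[0] > 0).
    """
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]                       # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:]


def lp_residual(x, coeffs):
    """Subtract the linear prediction from each sample (inverse filtering)."""
    residual = []
    for n in range(len(x)):
        predicted = sum(c * x[n - k]
                        for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(x[n] - predicted)
    return residual
```

For a signal that decays geometrically, such as x[n] = 0.9^n, a predictor of this kind recovers a first coefficient close to 0.9, and the residual is close to zero everywhere except at the first sample, which plays the role of the excitation.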
In recent research in the field, the characteristics of the kurtosis of the LP residual have been utilized for removing reverberation. Kurtosis is a measure of the “peak-ness” of the probability distribution of a real-valued random variable. In a similar way to the concept of “skew-ness”, kurtosis characterizes the shape of a probability distribution function (PDF). For example, if the shape of a plotted histogram of a random variable is completely Gaussian, then the random variable would have a kurtosis value, measured as excess kurtosis, equal to zero.
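The kurtosis measure can be illustrated with a short Python sketch. It computes the excess kurtosis, that is, the fourth central moment divided by the squared variance, minus three, so that a Gaussian variable scores approximately zero. The function name is an illustrative assumption.

```python
def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (approximately zero for Gaussian data)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n   # variance
    m4 = sum((s - mean) ** 4 for s in samples) / n   # fourth central moment
    return m4 / (m2 * m2) - 3.0


# A flat, two-valued signal is far less peaked than a Gaussian.
flat = [-1.0, 1.0] * 50
# A sparse, impulsive signal is far more peaked than a Gaussian.
impulsive = [1.0] + [0.0] * 99

print(excess_kurtosis(flat))        # negative
print(excess_kurtosis(impulsive))   # large and positive
```

The sign of the result indicates whether the distribution is flatter (negative) or more peaked (positive) than a Gaussian distribution.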
It has been observed that the probability distribution function (PDF) of the LP residual for clean speech components is super-Gaussian, that is, strongly peaked with a high kurtosis, whereas the corresponding PDF for the reverberated components is approximately Gaussian. Thus, the LP residual for the reverberated segments exhibits higher entropy than that of the clean segments. Therefore, one method could be to utilize the aforementioned characteristics of the kurtosis of the LP residual by developing an adaptive algorithm which maximizes the kurtosis of the LP residual. In other words, a blind de-convolution filter could be searched for that makes the LP residual as far from Gaussian as possible.
This particular method could be characterized as follows. First, reverberant speech is input into an adaptive inverse filter which is intended to remove the effect of reverberation. An LP analysis is then performed on the output of the adaptive inverse filter. Next, the gradient of the kurtosis is calculated based on the output of the LP analysis. The gradient of the kurtosis is then fed back to the adaptive inverse filter to adjust its filter coefficients accordingly. Essentially, this particular method is based on maximizing the kurtosis of the LP residual of the output speech signal.
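The feedback loop described above can be sketched in Python under a simplifying assumption: the input sequence is treated as if it were already an LP residual, so the LP analysis stage is omitted and the kurtosis gradient is taken directly on the output of a time-domain FIR filter. The function names, the batch (rather than sample-by-sample) gradient, the step size, and the unit-norm constraint are illustrative assumptions and do not reproduce the subband implementation of the cited paper.

```python
import math


def fir_filter(h, x):
    """Causal FIR filter: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def kurtosis(y):
    """Excess kurtosis E[y^4] / E[y^2]^2 - 3, assuming y is zero-mean."""
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    return m4 / (m2 * m2) - 3.0


def kurtosis_gradient(h, x):
    """Gradient of kurtosis(fir_filter(h, x)) with respect to each tap of h.

    dJ/dh[k] = 4 * (E[y^2] * E[y^3 x_k] - E[y^4] * E[y x_k]) / E[y^2]^3,
    where x_k is x delayed by k samples.
    """
    y = fir_filter(h, x)
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    grad = []
    for k in range(len(h)):
        e_y3x = sum(y[i] ** 3 * x[i - k] for i in range(k, n)) / n
        e_yx = sum(y[i] * x[i - k] for i in range(k, n)) / n
        grad.append(4.0 * (m2 * e_y3x - m4 * e_yx) / m2 ** 3)
    return grad


def gradient_ascent_step(h, x, step_size):
    """Move the taps uphill in kurtosis, then rescale to unit norm.

    Kurtosis does not depend on the overall gain of the filter, so the
    rescaling only keeps the taps from drifting in magnitude.
    """
    grad = kurtosis_gradient(h, x)
    moved = [hk + step_size * gk for hk, gk in zip(h, grad)]
    norm = math.sqrt(sum(v * v for v in moved))
    return [v / norm for v in moved]
```

Repeating `gradient_ascent_step` over successive blocks of input corresponds to the feedback path in which the kurtosis gradient adjusts the inverse filter coefficients.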
Another approach to removing the effects of reverberation is presented in a research paper by Kshitiz Kumar titled “GAMMATONE SUB-BAND MAGNITUDE-DOMAIN DEREVERBERATION FOR ASR”, which is hereby incorporated by reference for all purposes. This particular method is based on performing non-negative matrix factorization (NMF) processing on an input speech signal in the GammaTone magnitude spectral domain. For this method, reverberated speech is assumed to be the convolution of clean speech and a room response. Therefore, by factoring the reverberated speech into clean speech and a filter under a least-squares error criterion, with the non-negativity and the sparsity of the speech used as constraints, the room response can be estimated iteratively.
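The factorization idea can be sketched for a single channel in Python. The sketch models a non-negative magnitude sequence x as the convolution of a short non-negative filter h and a non-negative sequence s, and refines both with generic multiplicative updates for a least-squares cost with an optional sparsity penalty on s. The update rules, the initialization, and all names and parameters are illustrative assumptions and are not necessarily those used in the cited paper.

```python
def convolve(h, s):
    """Causal convolution truncated to len(s): out[n] = sum_m h[m] * s[n - m]."""
    return [sum(h[m] * s[n - m] for m in range(min(len(h), n + 1)))
            for n in range(len(s))]


def squared_error(x, h, s):
    """Least-squares cost between x and the convolution of h and s."""
    return sum((a - b) ** 2 for a, b in zip(x, convolve(h, s)))


def factor_channel(x, filt_len, iters=100, sparsity=0.0, eps=1e-12):
    """Factor non-negative x into a filter h and a sequence s, both non-negative.

    Each update multiplies the current estimate by a ratio of non-negative
    terms, so h and s can never become negative.
    """
    n = len(x)
    s = [max(v, eps) for v in x]          # start from the observed sequence
    h = [1.0 / filt_len] * filt_len       # start from a flat filter
    for _ in range(iters):
        est = convolve(h, s)
        # Update s: correlate h with the target and with the current estimate.
        s = [s[i]
             * sum(h[m] * x[i + m] for m in range(filt_len) if i + m < n)
             / (sum(h[m] * est[i + m] for m in range(filt_len) if i + m < n)
                + sparsity + eps)
             for i in range(n)]
        est = convolve(h, s)
        # Update h: correlate s with the target and with the current estimate.
        h = [h[m]
             * sum(s[i - m] * x[i] for i in range(m, n))
             / (sum(s[i - m] * est[i] for i in range(m, n)) + eps)
             for m in range(filt_len)]
    return h, s
```

A larger `sparsity` value pushes more entries of s toward zero. The factorization is only determined up to a gain shared between h and s, so a practical system would typically also fix the scale of the filter.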
An NMF processing technique in the GammaTone frequency domain could be explained as follows. Assume that an input speech signal is captured. The input speech signal is first pre-emphasized with a causal filter and is then windowed. Next, FFT analysis is performed on the windowed signal, and a GammaTone transformation is then performed by applying a GammaTone filter to the FFT signal. A GammaTone filter is a linear filter described by an impulse response that is the product of a gamma distribution and a sinusoidal tone, and it is a widely used model of the auditory filters in the auditory system. Next, NMF processing is performed on the signal after the GammaTone transformation, and the NMF decomposition is applied individually to each of the FFT channels. A pseudo-inverse of the GammaTone filter is then applied to the NMF processed signal to obtain the processed Fourier frequency components, and the frequency components can then be converted back to the time domain to obtain the final output speech signal.
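The front-end steps above can be sketched in Python. The sketch covers the pre-emphasis filter, the analysis window, and the GammaTone impulse response as the product of a gamma-shaped envelope and a sinusoidal tone; the FFT, NMF, and pseudo-inverse stages are not shown. The pre-emphasis coefficient of 0.97, the Hamming window, the filter order of 4, and the equivalent rectangular bandwidth (ERB) formula are common choices assumed for illustration and are not specified in the text above.

```python
import math


def pre_emphasis(x, alpha=0.97):
    """Causal first-order high-pass: y[n] = x[n] - alpha * x[n - 1]."""
    return [v - alpha * (x[n - 1] if n > 0 else 0.0) for n, v in enumerate(x)]


def hamming(length):
    """Hamming analysis window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def gammatone_impulse_response(fc_hz, fs_hz, duration_s=0.025, order=4):
    """g(t) = t^(order - 1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), sampled at fs_hz."""
    b = 1.019 * erb(fc_hz)                     # bandwidth parameter (assumed)
    count = int(round(duration_s * fs_hz))
    response = []
    for n in range(count):
        t = n / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    return response


# One windowed, pre-emphasized frame ready for FFT analysis.
frame = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(400)]
windowed = [v * w for v, w in zip(pre_emphasis(frame), hamming(len(frame)))]
```

A bank of such impulse responses at different center frequencies gives one GammaTone channel per center frequency.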