In the present document reference will be made to the following documents:
[1] C. Breithaupt, T. Gerkmann, and R. Martin, “Cepstral smoothing of spectral filter gains for speech enhancement without musical noise,” IEEE Signal Processing Letters, vol. 14, no. 12, pp. 1036-1039, December 2007.
[2] C. Breithaupt, T. Gerkmann, and R. Martin, “A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing,” IEEE ICASSP, pp. 4897-4900, April 2008.
Many successful speech enhancement algorithms work in the short-time discrete Fourier transform (DFT) domain. A drawback of DFT-based speech enhancement algorithms is that they yield unnatural-sounding, structured residual noise, often referred to as musical noise. Musical noise occurs, for example, if in a noise-only signal frame single Fourier coefficients are not attenuated due to estimation errors while all other coefficients are attenuated. The residual isolated spectral peaks in the processed spectrum correspond to sinusoids in the time domain and are perceived as tonal artifacts of one frame duration. Unnatural-sounding residual noise remains a challenge especially when speech enhancement algorithms operate in non-stationary noise environments.
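The link between an isolated spectral peak and a tonal artifact can be illustrated with a minimal sketch. The frame length and bin index below are arbitrary illustrative values, not taken from the cited works; the sketch merely assumes that every DFT coefficient of a frame has been fully attenuated except one.

```python
import numpy as np

frame_len = 256   # illustrative frame length (assumption)
peak_bin = 20     # illustrative index of the single surviving coefficient (assumption)

# One-sided spectrum of a processed frame: all bins suppressed except one.
spectrum = np.zeros(frame_len // 2 + 1, dtype=complex)
spectrum[peak_bin] = 1.0

# Back to the time domain: the isolated peak becomes a sinusoid
# that lasts for exactly one frame.
frame = np.fft.irfft(spectrum, n=frame_len)

# Reference sinusoid at the frequency of the surviving bin.
n = np.arange(frame_len)
expected = (2.0 / frame_len) * np.cos(2.0 * np.pi * peak_bin * n / frame_len)
```

In this sketch `frame` coincides with `expected`, i.e. the surviving coefficient is heard as a short pure tone at the frequency of that bin.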
Recently, a selective temporal smoothing of parameters of speech enhancement algorithms in the cepstral domain has been proposed [1, 2] that reduces residual spectral peaks without affecting the speech signal. In [1] algorithms based on cepstro-temporal smoothing (CTS) are compared to state-of-the-art speech enhancement algorithms by means of listening experiments. It is shown there that CTS yields an output signal of higher quality, especially in babble noise, and that the number of spectral outliers in the processed noise is lower than with state-of-the-art algorithms. It has also been shown in the literature that CTS yields an output signal of increased quality when applied as a post-processor in a speaker separation task. However, due to the non-linear log-transform inherent in the cepstral transform, a temporal smoothing in the cepstral domain yields a certain bias compared to a smoothing in the linear domain. This bias results in an output signal with reduced power. While the reduced signal power has only a minor influence on the results of listening experiments, instrumental measures are often sensitive to a change in signal power. Thus, instrumental measures may indicate a reduced signal quality if CTS is applied, while listening experiments indicate a clear increase in quality.
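The origin of this bias can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the noise power in one frequency bin is exponentially distributed with unit mean (as for the periodogram of complex Gaussian noise); the function name `linear_and_log_means` is hypothetical. Averaging in the log domain and transforming back gives the geometric mean, which by Jensen's inequality never exceeds the arithmetic mean obtained by averaging in the linear domain.

```python
import numpy as np

def linear_and_log_means(power):
    """Return the mean taken in the linear domain and the mean taken
    in the log domain (transformed back with exp)."""
    power = np.asarray(power, dtype=float)
    linear_mean = power.mean()                # smoothing in the linear domain
    log_mean = np.exp(np.log(power).mean())   # smoothing in the log domain
    return linear_mean, log_mean

# Assumption for illustration: exponentially distributed noise power, unit mean.
rng = np.random.default_rng(0)
noise_power = rng.exponential(scale=1.0, size=100_000)

linear_mean, log_mean = linear_and_log_means(noise_power)
bias_db = 10.0 * np.log10(log_mean / linear_mean)  # negative: power is underestimated
```

Under the stated assumption the ratio `log_mean / linear_mean` approaches exp(-γ) ≈ 0.56, with γ the Euler-Mascheroni constant, which corresponds to a power loss of roughly 2.5 dB. The bias arising in an actual CTS scheme depends on the smoothing constants used and will generally differ from this limiting value.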
In [2] CTS is applied to a maximum likelihood estimate of the speech power to replace the well-known decision-directed a priori signal-to-noise ratio (SNR) estimator. It is shown that CTS of the speech power may yield consistent improvements in terms of segmental SNR, noise reduction and speech distortion if a bias correction is applied.
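The general idea of a selective temporal smoothing in the cepstral domain may be sketched as follows. This is a rough illustration under stated assumptions, not the exact procedure of [1] or [2]: the function names `make_alpha` and `cts_step`, the quefrency cutoff `n_envelope`, and the smoothing constants `alpha_low` and `alpha_high` are all hypothetical, and no bias correction is included. The sketch smooths the low cepstral coefficients (coarse spectral envelope) only slightly and the remaining coefficients strongly.

```python
import numpy as np

def make_alpha(n_fft, n_envelope=8, alpha_low=0.2, alpha_high=0.9):
    """Quefrency-dependent smoothing constants (all values are assumptions).
    The vector is symmetric so that the smoothed cepstrum stays symmetric."""
    q = np.arange(n_fft)
    q = np.minimum(q, n_fft - q)  # symmetric quefrency index
    return np.where(q < n_envelope, alpha_low, alpha_high)

def cts_step(power, prev_cepstrum, alpha):
    """One frame of first-order recursive smoothing in the cepstral domain.
    `power` is a strictly positive one-sided power spectrum."""
    n_fft = 2 * (len(power) - 1)
    cepstrum = np.fft.irfft(np.log(power), n=n_fft)            # log-spectrum -> cepstrum
    smoothed = alpha * prev_cepstrum + (1.0 - alpha) * cepstrum
    smoothed_power = np.exp(np.fft.rfft(smoothed).real)        # cepstrum -> power spectrum
    return smoothed_power, smoothed

# Illustration on synthetic noise-only frames (exponentially distributed power).
rng = np.random.default_rng(1)
n_fft, n_frames = 256, 200
raw = rng.exponential(scale=1.0, size=(n_frames, n_fft // 2 + 1))

alpha = make_alpha(n_fft)
state = np.fft.irfft(np.log(raw[0]), n=n_fft)  # initialise with the first frame
out = np.empty_like(raw)
for t in range(n_frames):
    out[t], state = cts_step(raw[t], state, alpha)

# Frame-to-frame fluctuation of the log power before and after smoothing.
raw_var = np.log(raw).var(axis=0).mean()
out_var = np.log(out).var(axis=0).mean()
```

In this sketch `out_var` is smaller than `raw_var`, reflecting the suppression of isolated spectral outliers. Because the recursion operates on log-domain quantities, the smoothed power is also biased towards lower values, which is why a bias correction is relevant when the output is evaluated with power-sensitive instrumental measures.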