In recent years, considerable effort has been devoted to the task of continuous speech recognition (CSR), and significant advances in the state of the art have been achieved. CSR is becoming a preferred user interface for mobile applications, which are often used in “difficult” acoustic environments. One of the main challenges is the estimation and modeling of noise-robust speech features that can improve automatic speech recognition (ASR) performance in noisy environments.
In this context, many methods have been proposed for robust ASR feature extraction. These methods fall into two broad groups: those that extract noise-robust features directly, and those that post-process the extracted features to suppress some of the noise introduced. As an example of the first group, micro-modulation features capture the fine-grained formant frequency variations and are extremely robust to noise. In the second group, it is quite common to post-process features by smoothing, e.g. mean subtraction, variance normalization, and ARMA filtering (MVA) or RASTA filtering, and by feature transformations such as heteroscedastic discriminant analysis (HDA) and/or maximum likelihood linear transform (MLLT). This last scheme in particular is widely adopted in most state-of-the-art large vocabulary conversational speech recognition (LV-CSR) systems.
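To make the MVA post-processing step concrete, the following is a minimal sketch in Python with NumPy, not the implementation used in any particular system. It assumes per-utterance statistics, one commonly cited form of the ARMA smoother (the average of the previous `order` outputs and the current plus next `order` inputs), and unsmoothed edge frames. The function name `mva`, the parameter `order`, and the constant `eps` are hypothetical names chosen for illustration.

```python
import numpy as np

def mva(features, order=2, eps=1e-8):
    """Sketch of MVA post-processing for one utterance.

    features: array of shape (num_frames, num_dims), e.g. cepstral vectors.
    order:    assumed ARMA filter order (number of past/future frames).
    eps:      small constant (an assumption) to avoid division by zero.
    """
    x = np.asarray(features, dtype=float)

    # Mean subtraction and variance normalization, per feature dimension,
    # using statistics computed over the whole utterance.
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

    # ARMA smoothing: each output frame averages the previous `order`
    # already-smoothed frames with the current and next `order` input frames.
    y = x.copy()
    num_frames = x.shape[0]
    for t in range(order, num_frames - order):
        past = y[t - order:t].sum(axis=0)        # autoregressive part
        future = x[t:t + order + 1].sum(axis=0)  # moving-average part
        y[t] = (past + future) / (2 * order + 1)

    # Frames within `order` of either edge are left unsmoothed (assumption).
    return y
```

A call such as `mva(cepstra, order=2)` returns an array of the same shape as its input. With `order=0` the smoother reduces to the identity, so the output is simply the mean- and variance-normalized features.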