Contemporary speaker recognition and speech recognition systems typically employ Mel-frequency cepstrum coefficients (MFCC's) as the feature representation of human speech. MFCC's are usually derived by digitizing human speech and applying a shifting window to obtain short-term frames, each short enough to satisfy the stationary signal assumption. For each such frame, the FFT (Fast Fourier Transform) spectrum is computed, the output energies of a bank of band filters whose center frequencies are Mel-frequency distributed are calculated, and finally a Discrete Cosine Transform (DCT) is applied to produce the MFCC's. There is one vector of MFCC's for each frame.
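The derivation described above can be illustrated with a minimal sketch using only the Python standard library. Several details are assumptions not stated in the text: the Hamming window, the triangular filter shape, the common 2595·log10(1 + f/700) Mel formula, the logarithm taken before the DCT (a conventional step), and all default sizes. The function names are hypothetical, and a naive DFT stands in for the FFT to keep the example self-contained.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames and apply a Hamming window (assumed)."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[samples[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2; an FFT would be used in practice."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with center frequencies equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n) for i, v in enumerate(values))
            for c in range(num_coeffs)]

def mfcc(samples, sample_rate, frame_len=256, hop=128, num_filters=20, num_coeffs=13):
    """Return one MFCC vector per frame."""
    bank = mel_filterbank(num_filters, frame_len, sample_rate)
    features = []
    for frame in frame_signal(samples, frame_len, hop):
        spec = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        log_energies = [math.log(e + 1e-10) for e in energies]  # small offset avoids log(0)
        features.append(dct_ii(log_energies, num_coeffs))
    return features
```

Calling `mfcc(samples, 8000)` on a list of digitized samples would return a list with one 13-element vector per frame, under the assumed defaults.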
MFCC's can be augmented by their first-order and second-order time derivatives, yielding expanded feature vectors, to enhance performance in both speaker recognition and speech recognition. Moreover, each MFCC can be mean-removed, that is, have its mean over the frames subtracted, in order to mitigate, for example, channel distortion.
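The expansion and mean removal described above might be sketched as follows. The regression-style derivative over a window of two frames on each side, the repetition of edge frames at the boundaries, and the function names are all illustrative assumptions; the text does not specify how the derivatives are computed.

```python
def delta(features, width=2):
    """Regression-based time derivative; edge frames are repeated at the boundaries."""
    num_frames = len(features)
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for k in range(1, width + 1):
                ahead = features[min(t + k, num_frames - 1)][d]
                behind = features[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def remove_mean(features):
    """Subtract from each coefficient its mean taken over all frames."""
    num_frames = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / num_frames for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in features]

def expand(features):
    """Append first-order and second-order derivatives to each static vector."""
    d1 = delta(features)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(features, d1, d2)]
```

With 13 static coefficients per frame, `expand` would produce 39-dimensional vectors; whether mean removal is applied before or after expansion is a design choice left open here.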
The above MFCC's, with or without expansion and/or normalization, work best in a quiet environment where training and testing conditions match. For noisy environments, improvements have been achieved by incorporating noise-robust algorithms, such as spectral subtraction.
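As one illustration of the noise-robust processing mentioned above, a basic power-domain spectral subtraction might look like the sketch below. It assumes that the first few frames contain only noise, and the over-subtraction factor, the spectral floor, and the function name are hypothetical choices rather than anything specified in the text.

```python
def spectral_subtraction(power_frames, noise_frames=5, over_subtraction=1.0, floor=0.01):
    """Subtract an average noise power estimate from every frame's power spectrum."""
    num_bins = len(power_frames[0])
    lead = power_frames[:noise_frames]  # assumed to be speech-free
    noise = [sum(f[k] for f in lead) / len(lead) for k in range(num_bins)]
    cleaned = []
    for frame in power_frames:
        # Keep at least a small fraction of the original power so no bin goes negative.
        cleaned.append([max(p - over_subtraction * n, floor * p)
                        for p, n in zip(frame, noise)])
    return cleaned
```

The cleaned power spectra would then pass through the Mel filter bank and DCT in place of the raw spectra.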
Yet no system works optimally in both quiet and noisy environments. For example, a noise-robust system operating in a quiet condition generally yields lower recognition accuracy than a counterpart that is not noise-robust.
Thus, while advancements have been made in computer-assisted speaker/speech recognition during the past several decades, contemporary speaker/speech recognition systems are nevertheless subject to a variety of problems, such as a limited capability to mitigate noise interference and to separate inter-speaker variability from channel distortion.