Different speakers can be identified by the pronunciation features each exhibits when speaking, which makes speaker authentication possible. The article “Speaker recognition using hidden Markov models, dynamic time warping and vector quantization” by K. Yu, J. Mason, J. Oglesby (Vision, Image and Signal Processing, IEE Proceedings, Vol. 142, October 1995, pp. 313-18) introduces three common speaker identification engine technologies: HMM (Hidden Markov Model), DTW (Dynamic Time Warping), and VQ (Vector Quantization).
Speaker authentication usually comprises two phases, enrollment and verification. In the enrollment phase, a speaker template is generated from an utterance containing a password spoken by the speaker (user). In the verification phase, the speaker template is used to determine whether a test utterance contains the same password spoken by the same speaker.
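The two phases can be illustrated with a minimal sketch. It assumes, purely for illustration, that the speaker template is simply the stored sequence of feature frames from the enrollment utterance, that matching uses a plain DTW distance, and that acceptance is a comparison against a fixed threshold. The names `enroll`, `verify`, `dtw_distance`, and `threshold` are hypothetical and are not taken from the cited articles.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best DTW alignment between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Allowed steps: advance in seq_a, advance in seq_b, or both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def enroll(feature_frames):
    """Enrollment (assumed): keep the password utterance's frames as the template."""
    return [list(frame) for frame in feature_frames]


def verify(template, test_frames, threshold):
    """Verification (assumed): accept if the DTW distance is within the threshold."""
    return dtw_distance(template, test_frames) <= threshold


# Toy one-dimensional "features"; real systems would use vectors such as MFCC.
template = enroll([[0.0], [1.0], [2.0]])
slower_repeat = [[0.0], [0.0], [1.0], [2.0]]   # same pattern, spoken more slowly
other_pattern = [[5.0], [6.0], [7.0]]

accepted = verify(template, slower_repeat, threshold=1.0)
rejected = not verify(template, other_pattern, threshold=1.0)
```

In this toy case the slower repetition aligns with the template at zero cost, because DTW absorbs the difference in speaking rate, while the unrelated pattern accumulates a large distance and is rejected.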
In the enrollment phase, the speaker template is generally obtained by training on clean speech data, whereas the speech that actually arrives in the verification phase is noisy. Matching noisy incoming data against a clean template therefore reduces authentication precision.
In essence, matching a test utterance against an enrollment template means comparing the acoustic features of the test utterance with those of the enrollment template. The selection and extraction of acoustic features from an utterance is therefore important to speaker authentication in both the enrollment phase and the verification phase.
The principal task in extracting acoustic features from an utterance is to obtain, from the utterance signal, the basic features that characterize the speaker. The extracted acoustic features should distinguish different speakers effectively while remaining relatively stable across different utterances from the same speaker. The article “Signal Modeling Techniques in Speech Recognition” by J. W. Picone (Proceedings of the IEEE, 1993, 81(9): 1215-1247) introduces MFCC (Mel-Frequency Cepstral Coefficient), an utterance feature that is widely used in speech and speaker recognition. MFCC is an acoustic feature derived from research on the human auditory system. Taking the auditory characteristics of the human ear into account, it maps the spectrum onto a non-linear spectrum based on the Mel-frequency scale and then converts that spectrum to the cepstrum domain, thereby closely simulating human auditory characteristics.
The extraction process of MFCC is as follows. First, the utterance is transformed from the time domain to the frequency domain by a fast Fourier transform (FFT). Next, the logarithmic energy spectrum is convolved with a bank of triangular filters spaced on the Mel scale. Finally, the energy vector formed by the outputs of the respective filters is transformed by a discrete cosine transform (DCT), and the first N coefficients are kept.
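The extraction steps can be sketched for a single frame as follows. This is an illustrative sketch under several assumptions that the text above does not specify: it uses the common Mel mapping 2595·log10(1 + f/700), applies the triangular filters to the power spectrum and then takes the logarithm of each filter output (a widely used ordering), uses an unnormalized DCT-II, and replaces the FFT with a naive DFT so that the example needs only the standard library. The function names and the parameters `num_filters` and `num_coeffs` are hypothetical.

```python
import cmath
import math


def hz_to_mel(f):
    """Assumed Mel mapping: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (an FFT gives the same values)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(s) ** 2)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(max_mel * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate, num_filters=12, num_coeffs=6):
    """Return the first num_coeffs cepstral coefficients of one frame."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log of each filter's output energy; the small floor avoids log(0).
    log_energies = [
        math.log(sum(w * p for w, p in zip(weights, spectrum)) + 1e-10)
        for weights in bank
    ]
    # Unnormalized DCT-II of the log filter-bank energies.
    return [
        sum(e * math.cos(math.pi * n * (m + 0.5) / num_filters)
            for m, e in enumerate(log_energies))
        for n in range(num_coeffs)
    ]


# Usage: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)
```

Note that the filter edges computed by `mel_filterbank` depend only on the sample rate and the number of filters, not on the speaker, so the same filter-bank is applied to every utterance.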
However, MFCC has the shortcoming that it uses a fixed filter-bank rather than an adaptive, speaker-dependent filter-bank. In the verification phase, the distortion measure between a test utterance and a speaker template is often assumed to be a symmetric distance function such as the Euclidean or Mahalanobis distance. Both the fixed filter-bank and the symmetric distance ignore the intrinsic detailed spectral structure of the particular signal or template. This wastes a priori information, especially for a binary decision problem such as text-dependent speaker verification.
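The symmetry of these distortion measures can be made concrete with a short sketch. It assumes a diagonal covariance matrix for the Mahalanobis distance to keep the example brief; the function names and the sample vectors are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance under an assumed diagonal covariance matrix."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


test_vector = [1.0, 2.0, 3.0]      # features of a test utterance (made up)
template = [2.0, 0.0, 3.5]         # features of a speaker template (made up)
variances = [1.0, 4.0, 0.25]       # per-dimension variances (made up)

d_euclid = euclidean(test_vector, template)
d_mahal = mahalanobis_diag(test_vector, template, variances)

# Swapping the arguments changes nothing: neither measure treats the
# template differently from the test utterance.
same_euclid = math.isclose(d_euclid, euclidean(template, test_vector))
same_mahal = math.isclose(d_mahal, mahalanobis_diag(template, test_vector, variances))
```

Because both functions depend only on the squared differences between the two vectors, they return the same value whichever vector is the template, which is the sense in which the measure ignores structure specific to the template.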