By using the pronunciation features that each speaker exhibits when speaking, different speakers can be distinguished from one another, which makes speaker authentication possible. The article “Speaker recognition using hidden Markov models, dynamic time warping and vector quantisation” by K. Yu, J. Mason, and J. Oglesby (Vision, Image and Signal Processing, IEE Proceedings, Vol. 142, October 1995, pp. 313-18) introduces three commonly used speaker identification engine technologies: HMM (Hidden Markov Model), DTW (Dynamic Time Warping), and VQ (Vector Quantization).
Generally, a speaker authentication system includes two phases: enrollment and verification. In the enrollment phase, a speaker template of a speaker (client) is produced from an utterance containing a password spoken by that speaker. In the verification phase, the speaker template is used to determine whether a testing utterance contains the same password spoken by the same speaker. Specifically, a DTW algorithm is usually used in the verification phase to DTW-match the acoustic feature vector sequence of the testing utterance against the speaker template and obtain a matching score. The matching score is then compared with a discriminating threshold obtained in the enrollment phase to determine whether the testing utterance contains the same password spoken by the speaker. In the DTW algorithm, a common way to calculate the global matching score between the acoustic feature vector sequence of a testing utterance and a speaker template is to directly add up all local distances along the optimal matching path. However, matching mistakes during a client trial often produce some large local distances. Because these large local distances inflate the global matching score of a genuine client, they can make it difficult to distinguish clients from impostors.
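For illustration only, the conventional scoring described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any cited system: the Euclidean local distance, the three-way step pattern, and the names `local_distance`, `dtw_score`, and `verify` are all hypothetical choices made for the example, and the threshold is taken as a given input rather than derived from enrollment data.

```python
import math

def local_distance(a, b):
    # Assumed local distance: Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_score(test, template):
    # Global matching score: the sum of local distances along the
    # optimal matching path (lower means a closer match).
    n, m = len(test), len(template)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning test[:i] with template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(test[i - 1], template[j - 1])
            # Assumed step pattern: vertical, horizontal, or diagonal move.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def verify(test, template, threshold):
    # Accept the testing utterance when the score does not exceed the threshold.
    return dtw_score(test, template) <= threshold

# Toy one-dimensional "feature vectors" (hypothetical data).
template = [[0.0], [1.0], [2.0]]
slow_client = [[0.0], [0.0], [1.0], [2.0]]   # same pattern, spoken more slowly
one_bad_frame = [[0.0], [1.0], [5.0]]        # one frame matches poorly

print(dtw_score(slow_client, template))
print(dtw_score(one_bad_frame, template))
print(verify(one_bad_frame, template, threshold=1.0))
```

In this toy case a single poorly matched frame contributes its whole local distance to the global score, which is the effect the passage above identifies as a source of confusion between clients and impostors.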
A speaker verification system based on frame-level verification is proposed in the article “Enhancing the stability of speaker verification with compressed templates” by X. Wen and R. Liu, ISCSLP2002, pp. 111-114 (2002). A fuzzy logic-based speech recognition system is described in the article “Fuzzy logic enhanced symmetric dynamic programming for speech recognition” by P. Mills and J. Bowles, Fuzzy Systems, Proceedings of the Fifth IEEE International Conference on, Vol. 3, pp. 2013-2019 (1996). The idea underlying both methods is to apply a transform to the local distances in the DTW algorithm. However, both methods are sensitive to their parameters and have proved effective only when suitable parameters are set for each template.