Many applications rely on speaker recognition systems to either authenticate a speaker who is purporting to be a specific individual or to identify a speaker from a voice sample. For example, a security application controlling access to a building may authenticate a person requesting to enter the building by collecting a voice sample of the person and a purported identification of the person. Assuming that the true person is authorized to enter the building, the security application compares the input voice sample to previously collected voice samples of the true person to ensure that the person who wants to enter the building is indeed the true person. If so, the security application has authenticated the person and allows the person to enter the building. If not, the security application determines that the person is an imposter and denies entry. As another example, a wiretap application may collect voice samples of a telephone conversation and attempt to use speaker recognition to identify who is speaking. The wiretap application compares the voice sample to previous voice samples of known persons. If a match is found, the wiretap application has identified the speaker as the matching person.
Many speaker recognition techniques have been proposed to authenticate or identify a speaker by comparing a voice sample to a collection of voice samples. These speaker recognition systems can be classified as text-independent or text-dependent. In a text-independent speaker recognition system, a person can say any sequence of words both when training the speaker recognition system and when providing a voice sample for speaker recognition. A text-independent speaker recognition system employs a static analysis in which features (e.g., division of the sample into utterances) extracted from the speech are analyzed independently regardless of sequence. For example, the speaker can say “one two three” or “one three two” and the system will recognize the speaker. Text-independent speaker recognition systems typically use either a Gaussian Mixture Model or Vector Quantization. (See Reynolds, D., et al., “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, 10(1-3), 2000; Soong, F. K., Rosenberg, A. E., Juang, B. H., and Rabiner, L. R., “A Vector Quantization Approach to Speaker Recognition,” AT&T Journal, Vol. 66, pp. 14-26, 1987.)
In a text-dependent speaker recognition system, the system tells the speaker what to say or the speaker knows what to say (e.g., a password). A text-dependent speaker recognition system employs a dynamic analysis in which a sequence of features is analyzed to determine whether it corresponds to the known phrase as previously spoken by the speaker. Text-dependent speaker recognition systems typically use dynamic programming or a hidden Markov model.
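The dynamic-programming approach mentioned above can be illustrated with dynamic time warping (DTW), a commonly used technique for aligning two feature sequences of different lengths. The sketch below is illustrative only: the function names, the Euclidean frame distance, and the one-dimensional toy feature vectors are assumptions and are not taken from any particular system.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Minimum cumulative frame distance over all monotonic alignments."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_b frame repeated
                cost[i][j - 1],      # seq_a frame repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Toy one-dimensional "feature vectors" (hypothetical data).
enrolled = [[1.0], [2.0], [3.0]]
same_order_slower = [[1.0], [1.0], [2.0], [3.0]]
different_order = [[1.0], [3.0], [2.0]]

slow_score = dtw_distance(enrolled, same_order_slower)
reordered_score = dtw_distance(enrolled, different_order)
```

In this toy case the slower utterance with the same ordering aligns to the enrolled sequence at zero cost, while the reordered utterance does not. This sensitivity to sequence is what distinguishes dynamic analysis from static analysis.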
Text-dependent speaker recognition systems typically require more training samples and are more computationally complex than text-independent speaker recognition systems. As a result, text-dependent speaker recognition systems tend to be more accurate, but they support only a very limited vocabulary and sequence of words.
Typical speaker recognition systems have an initial training phase in which voice samples of a speaker are collected, features are extracted, and a model is generated from the extracted features for use in recognition. After a model is generated, a speaker recognition system inputs a target voice sample, extracts features, and compares them to the model or models. A popular set of features is referred to as the Mel Frequency Cepstral Coefficients (“MFCCs”). Typically, 12 or 13 MFCC features are extracted to form a feature vector. A voice sample is typically divided into overlapping frames of 10-20 milliseconds each, and a feature vector is extracted from each frame. Thus, ignoring overlap, a one-second voice sample with a 20 ms frame size will have 50 frames and 50 feature vectors, represented as X(1), X(2), . . . , X(50); overlapping frames yield proportionally more. With static analysis, each feature vector is processed independently of the other feature vectors. With dynamic analysis, each feature vector is processed based on its sequential relationship to the other feature vectors, so a speaker recognition system analyzes how well entire sequences match, which is computationally expensive. To reduce the computational expense, some speaker recognition systems perform a static analysis on the MFCC features of a frame and, to capture the dynamics of the voice sample, a static analysis of the differences between the MFCC features of adjacent frames.
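The frame arithmetic and the adjacent-frame differences described above can be sketched as follows. The 16 kHz sampling rate, the hop sizes, the helper names, and the two-dimensional toy vectors are assumptions chosen for illustration; MFCC extraction itself is not shown.

```python
def frame_starts(num_samples, frame_len, hop):
    """Start index of every full frame that fits in the sample."""
    if num_samples < frame_len:
        return []
    return list(range(0, num_samples - frame_len + 1, hop))


def deltas(feature_vectors):
    """Element-wise difference between each pair of adjacent feature vectors."""
    return [
        [cur - prev for prev, cur in zip(prev_vec, cur_vec)]
        for prev_vec, cur_vec in zip(feature_vectors, feature_vectors[1:])
    ]


SAMPLE_RATE = 16000                   # assumed sampling rate in Hz
FRAME_LEN = SAMPLE_RATE * 20 // 1000  # 20 ms frame = 320 samples

# One second of audio, frames placed back to back (no overlap).
non_overlapping = frame_starts(SAMPLE_RATE, FRAME_LEN, hop=FRAME_LEN)

# Same audio with 50% overlap (hop of 10 ms).
half_overlap = frame_starts(SAMPLE_RATE, FRAME_LEN, hop=FRAME_LEN // 2)

# Toy feature vectors standing in for per-frame MFCCs (hypothetical values).
features = [[1.0, 2.0], [1.5, 1.0], [3.0, 1.0]]
delta_features = deltas(features)
```

Under these assumptions, back-to-back frames give 50 frames per second and 50% overlap gives 99. The difference vectors can then be analyzed statically alongside the original feature vectors.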
Typical speaker recognition systems are either template-based or vector quantization-based. A template-based speaker recognition system extracts features during training and keeps a single template for each feature as a representative of the speaker. Thus, there is one template for each speaker of the training data. During speaker recognition, a feature vector is extracted from the voice sample and compared to all the templates. The speaker recognition system identifies the speaker as the person associated with the template that is closest to the extracted feature vector (e.g., as measured by Euclidean distance).
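A minimal sketch of template matching follows. It assumes, purely for illustration, that each speaker's template is the mean of that speaker's training feature vectors; the text above does not specify how a template is formed. The speaker names and values are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(templates, feature_vector):
    """Return the speaker whose template is closest to the feature vector."""
    return min(templates, key=lambda name: euclidean(templates[name], feature_vector))


# Hypothetical training data: a few feature vectors per speaker.
training = {
    "alice": [[1.0, 2.0], [1.2, 1.8]],
    "bob": [[4.0, 0.5], [3.8, 0.7]],
}

# Training: one template per speaker (assumed here to be the mean vector).
templates = {name: mean_vector(vectors) for name, vectors in training.items()}

# Recognition: compare an extracted feature vector to every template.
best_match = identify(templates, [1.0, 2.1])
```

For authentication rather than identification, the same distance could instead be compared against a threshold for the purported speaker's template. The choice of threshold is outside the scope of this sketch.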
A vector quantization-based speaker recognition system creates a codebook for each speaker during training using standard vector quantization techniques. To generate a codebook, the speaker recognition system collects many voice samples and extracts a sequence of feature vectors for each sample. The speaker recognition system then compresses the dimensionality of the sequences of feature vectors to form code vectors. The speaker recognition system then generates a smaller number of code vectors that are representative of groups of similar sequences of feature vectors. A codebook thus contains fewer code vectors than there are voice samples, and the code vectors have a lower dimensionality than the feature vectors. There is one codebook for each speaker. During speaker recognition, a sequence of feature vectors is extracted from the voice sample and its dimensionality is reduced to generate a code vector. The speaker recognition system then identifies the speaker as the person associated with the codebook that is closest to the code vector.
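The codebook idea can be sketched with the textbook formulation of vector quantization: k-means clustering builds each speaker's codebook, and the speaker whose codebook gives the lowest average quantization distortion is selected. This is an illustrative simplification, not the exact procedure described above. It omits the dimensionality-reduction step, and the function names, deterministic initialization, and toy data are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vector):
    """Index of the code vector closest to the given vector."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vector))


def train_codebook(vectors, k, iterations=10):
    """Build k code vectors with k-means (requires len(vectors) >= k)."""
    # Deterministic initialization: evenly spaced training vectors.
    step = len(vectors) // k
    codebook = [list(vectors[i * step]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(codebook, v)].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the old code vector if its group is empty
                codebook[i] = [sum(col) / len(group) for col in zip(*group)]
    return codebook


def distortion(codebook, vectors):
    """Average squared distance from each vector to its nearest code vector."""
    return sum(sq_dist(codebook[nearest(codebook, v)], v) for v in vectors) / len(vectors)


def identify(codebooks, vectors):
    """Speaker whose codebook quantizes the vectors with the least distortion."""
    return min(codebooks, key=lambda name: distortion(codebooks[name], vectors))


# Hypothetical training feature vectors for two speakers.
training = {
    "alice": [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]],
    "bob": [[5.0, 5.0], [5.1, 4.9], [6.0, 6.0], [6.2, 5.8]],
}
codebooks = {name: train_codebook(vecs, k=2) for name, vecs in training.items()}

best_match = identify(codebooks, [[0.1, 0.1], [1.0, 0.9]])
```

Each codebook here holds two code vectors summarizing four training vectors, which reflects the compression that a codebook provides over storing every training vector.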