Speaker verification is used to determine whether a speaker is who they claim to be, based on a presented sample utterance. For example, an individual might wish to access a bank account over the phone. Speaker verification provides additional security in this case: the individual presents a speech sample to the system, which verifies that the individual is who they claim to be.
Speaker verification consists of two phases: training and testing. In the training phase, utterances spoken by an individual whose identity has already been confirmed (using passwords, etc.) are used to build a reference model. In the testing phase, a sample utterance received from an individual is compared against the reference model associated with the claimed identity.
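The two phases can be illustrated with a minimal sketch. It assumes, purely for illustration, that each utterance has already been converted to a list of numeric feature vectors and that the reference model is a single diagonal Gaussian per speaker, which is far simpler than the models used in practice. The names train_reference_model, average_log_likelihood, verify and THRESHOLD, and all numeric values, are hypothetical.

```python
import math

def train_reference_model(frames):
    """Training phase: fit a per-dimension mean and variance to enrolment frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # Variance floor (assumed value) avoids division by zero for constant features.
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-3)
        for d in range(dims)
    ]
    return means, variances

def average_log_likelihood(frames, model):
    """Testing phase: mean per-frame log-likelihood under the claimed model."""
    means, variances = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def verify(frames, model, threshold):
    """Accept the identity claim if the score reaches the threshold."""
    return average_log_likelihood(frames, model) >= threshold

# Toy 2-dimensional feature vectors (illustrative values only).
enrolment = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [1.1, 2.2]]
reference_models = {"alice": train_reference_model(enrolment)}

genuine_sample = [[1.05, 2.0], [1.0, 1.9]]
impostor_sample = [[3.0, -1.0], [2.8, -0.7]]
THRESHOLD = -5.0  # hypothetical; in practice tuned on held-out data

print(verify(genuine_sample, reference_models["alice"], THRESHOLD))
print(verify(impostor_sample, reference_models["alice"], THRESHOLD))
```

The test sample is scored only against the model stored for the claimed identity, which is what distinguishes verification from identification across all enrolled speakers.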
Currently, the most common methods for speaker verification, especially in the text-independent mode where there are no constraints on the textual content of speech, are based on GMM-UBM (Gaussian Mixture Models-Universal Background Model) and GMM-SVM (Gaussian Mixture Models-Support Vector Machines). Other effective methods for this purpose include approaches based on GMM (Gaussian Mixture Models) alone and on HMM (Hidden Markov Models).
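As a rough illustration of GMM-UBM scoring, the sketch below computes an average log-likelihood ratio between a speaker model and a background model. It assumes diagonal-covariance mixtures stored as (weight, means, variances) tuples and hand-written toy parameters. In real systems the speaker model is commonly obtained by adapting the UBM to the speaker's training data, a step that is not shown here. The function names and values are hypothetical.

```python
import math

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    gmm is a list of (weight, means, variances) tuples, one per component.
    """
    component_logs = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for xi, m, v in zip(x, means, variances):
            log_p += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        component_logs.append(log_p)
    # Log-sum-exp keeps the sum over components numerically stable.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))

def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio: speaker model versus UBM."""
    total = sum(
        gmm_log_likelihood(f, speaker_gmm) - gmm_log_likelihood(f, ubm)
        for f in frames
    )
    return total / len(frames)

# Hypothetical two-component models over 1-dimensional features.
ubm = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
# Speaker model written by hand with shifted means (adaptation step omitted).
speaker_gmm = [(0.5, [0.5], [1.0]), (0.5, [4.5], [1.0])]

target_frames = [[0.6], [4.4], [0.5]]
other_frames = [[-0.6], [3.4], [-0.5]]

print(llr_score(target_frames, speaker_gmm, ubm))  # positive: favours the speaker
print(llr_score(other_frames, speaker_gmm, ubm))   # negative: favours the UBM
```

A positive score indicates that the frames are better explained by the claimed speaker's model than by the general background population. The decision threshold applied to this score is a separate design choice.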
Regardless of the method used, in practice, speaker verification accuracy can be adversely affected by variations in speech characteristics due to additive noise (e.g. background noise). Such variations can cause a mismatch between the training and testing speech material from the same speaker, which in turn can reduce the verification accuracy. For example, if the training phase is performed in a noisy environment, the reference model will reflect that noise, which in turn means that mismatches can occur if the test phase is performed in a quiet environment. The same applies in the opposite case, where the training data is clean but the test data is noisy.
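The effect of additive noise on a feature can be shown with a toy calculation. The sketch below assumes a simple log-energy feature and a hand-picked noise frame that is uncorrelated with the speech frame, so the energies add exactly. Real features and real noise are more complex, and all values are illustrative.

```python
import math

def log_energy(frame):
    """Log of the summed squared samples in one frame."""
    return math.log(sum(s * s for s in frame))

clean = [1.0, 1.0, -1.0, -1.0]   # toy speech frame, energy 4
noise = [0.5, -0.5, 0.5, -0.5]   # toy additive noise, energy 1, uncorrelated with clean
noisy = [s + n for s, n in zip(clean, noise)]

# The noisy frame has energy 5, so the feature shifts by log(5 / 4).
shift = log_energy(noisy) - log_energy(clean)
print(f"log-energy shift caused by noise: {shift:.3f}")
```

A reference model trained on one condition expects features near one value, while the other condition produces features offset by this shift. This offset is the mismatch that lowers the verification score.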
Over the last few years, considerable research has been carried out into methods for minimising the effects of speech variation due to additive noise on speaker verification accuracy. This has resulted in the development of various methods, such as spectral subtraction, Kalman filtering and missing-feature theory. These approaches focus on enhancing the quality of the test material (speech) before the testing process. In other words, they assume that the training material (speech) is always free from any form of degradation or noise.
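Of these, spectral subtraction is the simplest to sketch. The example below works on per-bin magnitude spectra of a single frame and assumes the noise spectrum has been estimated by averaging frames known to contain no speech. The over-subtraction factor alpha, the spectral floor and all numeric values are assumed for illustration, and the function name is hypothetical.

```python
def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from one noisy frame.

    noisy_mag, noise_mag: per-frequency-bin magnitudes of equal length.
    alpha: over-subtraction factor (assumed default).
    floor: fraction of the noisy magnitude kept as a lower bound, so that
           no bin becomes negative (assumed default).
    """
    return [max(y - alpha * n, floor * y) for y, n in zip(noisy_mag, noise_mag)]

# Noise estimate: per-bin average over frames assumed to be speech-free.
noise_frames = [[0.9, 1.1, 1.0], [1.1, 0.9, 1.0]]
noise_estimate = [sum(col) / len(col) for col in zip(*noise_frames)]

noisy_frame = [5.0, 1.5, 0.8]
enhanced = spectral_subtraction(noisy_frame, noise_estimate)
print(enhanced)
```

Only the test material is processed in this way. The reference model is left untouched, which reflects the assumption that the training speech is clean.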
Another method, referred to as data-driven Parallel Model Combination (PMC), has also been proposed. It involves estimating the degradations in the testing and training material, and using these estimates to minimise the mismatch between them. However, the technique requires contaminating both the reference utterance (and hence the model) for the target client and the test utterance for each test trial. Effectively, this requires rebuilding the reference model with each test trial, which is computationally expensive and impractical in a real-world situation.