Speaker verification (SV) is the process of verifying whether an unknown speaker is the person he or she claims to be. Speech verification, or utterance verification (UV), is the process of verifying the claimed content of a spoken utterance (for example, verifying the hypotheses output by an automatic speech recognition system). Both SV and UV technologies have many applications. For example, SV systems can be used in places (e.g., security gates) where access is allowed only to certain registered people. UV systems can be used to enhance speech recognition systems by rejecting unreliable hypotheses, thereby improving the user interface. Sometimes UV is included as a component of a speech recognition system to verify the hypotheses produced by the speech recognition process.
Acoustic model training is a very important process in building any speaker verification system or speech (utterance) verification system. It has been extensively studied over the past two decades, and various methods have been proposed.
Maximum likelihood (ML) estimation is the most widely used parametric estimation method for training acoustic models, largely because of its efficiency. ML assumes that the parameters of the models are fixed but unknown and aims to find the set of parameters that maximizes the likelihood of generating the observed data. The ML training criterion thus attempts to match each model to its corresponding training data so as to maximize the likelihood.
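As a minimal illustration of the ML criterion described above, the following sketch fits a one-dimensional Gaussian to a handful of feature values using the closed-form ML estimates. The single-Gaussian model, the function names, and the sample values are assumptions made purely for illustration; practical acoustic models are typically far richer (e.g., mixtures or hidden Markov models) and are usually trained iteratively.

```python
import math

def fit_gaussian_ml(samples):
    """Closed-form ML estimates of the mean and variance of a 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    # The ML variance estimate divides by n (not n - 1).
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def log_likelihood(samples, mean, var):
    """Total log-likelihood of the samples under N(mean, var)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in samples
    )

# Hypothetical training features for one model (illustrative values only).
features = [1.0, 2.0, 3.0, 4.0]
mean, var = fit_gaussian_ml(features)
best = log_likelihood(features, mean, var)
```

Any other choice of mean or variance gives a lower likelihood on the same data, which is what "matching the model to its training data" amounts to under this criterion.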
Although ML is found to be efficient, discriminative training methods have been shown to produce better models. One example of a discriminative training method is Minimum Verification Error (MVE) training. The MVE training criterion attempts to adjust model parameters so as to minimize the approximate verification errors on the training data. While the above-mentioned methods have both proven to be effective, discriminative training methods for creating more accurate and more robust models are still being pursued.
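To make the notion of "approximate verification errors" more concrete, the sketch below computes one common kind of smooth surrogate: each hard accept/reject error is replaced by a sigmoid of the distance between a verification score and a decision threshold. This is only an assumed form for illustration, not necessarily the exact MVE formulation; the function names, the threshold, the slope `gamma`, and the scores are all hypothetical.

```python
import math

def sigmoid(x, gamma=1.0):
    """Smooth, differentiable stand-in for a 0/1 step function."""
    return 1.0 / (1.0 + math.exp(-gamma * x))

def smoothed_verification_error(target_scores, impostor_scores,
                                threshold=0.0, gamma=1.0):
    """Average smoothed error over true-claim and false-claim trials.

    A true claim is (softly) counted as a false rejection when its score
    falls below the threshold; a false claim is (softly) counted as a
    false acceptance when its score rises above it.
    """
    false_rej = [sigmoid(threshold - s, gamma) for s in target_scores]
    false_acc = [sigmoid(s - threshold, gamma) for s in impostor_scores]
    return (sum(false_rej) + sum(false_acc)) / (len(false_rej) + len(false_acc))

# Hypothetical scores (e.g., log-likelihood ratios) before and after training.
overlapping = smoothed_verification_error([0.5, -0.2], [0.3, -0.4])
separated = smoothed_verification_error([3.0, 2.5], [-2.0, -3.5])
```

Because the surrogate is differentiable in the scores, model parameters could in principle be adjusted by gradient descent to drive it down, which pushes true-claim and false-claim scores apart, in contrast to ML, which fits each model to its own data only.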