The goal of a speaker verification system is to determine if a test utterance is spoken by a speaker having an unknown or alleged identity (i.e., determining whether an unknown voice is from a particular enrolled speaker). The problem is typically formalized by defining a 2-class hypothesis test:

H0: tested speaker is the target speaker,
H1: tested speaker is not the target speaker.  (1)
Let xenr denote the total feature space of the enrolled (enr) speaker (a large number of D-dimensional feature vectors) available for offline training. One approach is then to represent H0 by a model, denoted λenr, that characterizes the hypothesized speaker (the statistics of the feature space xenr). The alternative hypothesis, H1, is represented by the model λubm, which captures the statistics of the space of imposter speakers.
Let x = [x1, x2, ..., xN] be a sequence of N D-dimensional feature vectors extracted from the test utterance. To perform verification, H0 and H1 are tested against the feature sequence x (the test data is matched with each model to calculate a verification score). This is done by calculating the log-likelihoods of x, given the models λ, to construct

$$\Lambda(x) = \log(p(x \mid \lambda_{\mathrm{enr}})) - \log(p(x \mid \lambda_{\mathrm{ubm}})) \qquad (2)$$

where λenr is a model characterizing the hypothesized enrolled speaker and λubm is a Universal Background Model (UBM) characterizing all enrolled speakers. The log-likelihood distance Λ measures how much better the enrolled speaker model scores the test utterance than the UBM does. The hypothesis test can be resolved based on the following relationship:

if Λ(x) > θ, accept H0,
if Λ(x) ≤ θ, accept H1,  (3)

where θ is a threshold level optimized offline.
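As an illustration of the decision rule in (2) and (3), the following minimal Python sketch computes the score Λ(x) from two precomputed log-likelihoods and compares it with the threshold. The function name `verify` and the numeric values in the usage lines are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
def verify(loglik_enr, loglik_ubm, theta):
    """Return (score, accept_h0) for the test in equations (2) and (3).

    loglik_enr: log p(x | lambda_enr), assumed to be computed elsewhere
    loglik_ubm: log p(x | lambda_ubm), assumed to be computed elsewhere
    theta:      decision threshold, assumed to be tuned offline
    """
    score = loglik_enr - loglik_ubm   # Lambda(x), equation (2)
    accept_h0 = score > theta         # equation (3): accept H0 only if Lambda(x) > theta
    return score, accept_h0


# Illustrative values only (hypothetical log-likelihoods and threshold).
score, accepted = verify(loglik_enr=-4210.5, loglik_ubm=-4236.0, theta=10.0)
print(score, accepted)
```

Note that a score exactly equal to θ falls on the H1 side, as in (3).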
Gaussian mixture models (GMMs) are the dominant approach for modeling distributions of feature space in text-independent speaker verification applications. Here λ denotes the weights, mean vectors and covariance matrices of a GMM with K components: $\lambda = \{u_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.
In other words, probability distributions are modeled as a superposition of K components (Gaussian densities) Φk, with weights uk, based on the following equation:
$$\log(p(x \mid \lambda)) = \sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} u_k \, \Phi_k(x_n) \right) \qquad (4)$$

where the summation over n accumulates contributions from the individual feature vectors xn in the test sequence x. The components Φk are determined by a set of means μk and covariances Σk based on the following equation:

$$\Phi_k(x_n) = \frac{\exp\left\{ -\frac{1}{2} (x_n - \mu_k)^{T} \, \Sigma_k^{-1} \, (x_n - \mu_k) \right\}}{(2\pi)^{D/2} \, |\Sigma_k|^{1/2}} \qquad (5)$$
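Equations (4) and (5) can be sketched in pure Python as follows. This is a simplified illustration under two stated assumptions that are not part of the text above: each covariance matrix Σk is taken to be diagonal (so it is stored as a list of per-dimension variances), and the inner sum over components is evaluated with the log-sum-exp trick for numerical stability. The function names and the toy parameters are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of Phi_k(x_n) from equation (5), assuming a diagonal covariance.

    With a diagonal Sigma_k, the quadratic form reduces to a sum of squared
    deviations divided by the variances, and log|Sigma_k| is the sum of the
    log-variances.
    """
    d = len(x)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    log_det = sum(math.log(vi) for vi in var)
    return -0.5 * (quad + d * math.log(2.0 * math.pi) + log_det)


def gmm_log_likelihood(frames, weights, means, variances):
    """Equation (4): sum over frames n of log(sum over k of u_k * Phi_k(x_n)).

    Weights are assumed to be strictly positive and to sum to one.
    """
    total = 0.0
    for x in frames:
        # log(u_k) + log(Phi_k(x_n)) for every component k
        terms = [math.log(u) + log_gaussian_diag(x, m, v)
                 for u, m, v in zip(weights, means, variances)]
        # log-sum-exp: subtract the largest term before exponentiating
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


# Toy model (hypothetical): K = 2 components, D = 2 dimensions, N = 3 frames.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
frames = [[0.1, -0.2], [2.9, 3.4], [1.5, 1.5]]
print(gmm_log_likelihood(frames, weights, means, variances))
```

Calling `gmm_log_likelihood` once with the enrolled-speaker parameters and once with the UBM parameters gives the two terms whose difference is the score Λ(x) in (2).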
In a more general sense, the λenr GMMs for the enrolled speakers can be considered to model the underlying broad phonetic sounds that characterize a person's voice, while the much larger λubm GMM for the space of imposter speakers captures the underlying sound classes in speech. Each enrolled speaker model λenr is simply trained on the audio data available for that particular speaker. The λubm is trained by pooling speech from a large number of enrolled speakers to build a single model, the UBM, which results in one complex model for the imposter space. The λubm GMM can have a large number of components, typically K>1024, compared to about 64 components for the enrolled GMM.
One can distinguish two major classes of speaker verification systems: 1) text-dependent systems, which assume that the person to be recognized is speaking a previously defined text string; and 2) text-independent systems, which do not know what text string is being spoken by the person to be recognized.
Text-dependent systems are more accurate, but their usage is typically limited to security applications because the speaker must vocalize one or more words or phrases from an allowed set. Text-independent speaker verification systems have been used in more types of applications, but are less accurate because they have to model speakers for a large variety of possible phonemes and contexts. This means that a context-independent model can assign a relatively high probability to a feature subspace that is not present in the test utterance, which can offset the speaker verification score of that particular utterance and result in incorrect verification. This problem becomes particularly pronounced in cases where the feature space of the current test utterance is modeled unequally well by the UBM and the speaker model.