Speaker recognition schemes generally include a feature extraction stage followed by a classification stage. The features used in speaker recognition are a transformation of an input speech signal into a compact acoustic representation that contains person-dependent information useful for the identification of the speaker. A classifier uses these features to render a decision as to the speaker identity or to verify the claimed identity of the speaker.
Conventional approaches to classification are based on a universal background model (UBM) estimated using an acoustic Gaussian mixture model (GMM) or phonetically-aware deep neural network (DNN) architecture. Each approach includes computation of “sufficient statistics,” also known as Baum-Welch statistics. (See Dehak et al., “Front-end Factor Analysis for Speaker Verification”, IEEE TASLP, 2011; and Lei et al., “A Novel Scheme for Speaker Recognition Using a Phonetically-aware Deep Neural Network”, IEEE ICASSP, 2014.) In general, a UBM is a model used to represent general, person-independent feature characteristics to be compared against a model of person-specific feature characteristics (e.g., the extracted features noted above). In the case of UBM-GMM, the UBM is a speaker-independent GMM trained with speech samples from a large set of speakers to represent general speech characteristics. The resultant classes are multivariate Gaussian distributions and model the acoustic distribution of the speech sample. In the case of UBM-DNN, the classes are senones and model the phonetic distribution of the speech. The UBM-GMM is trained using the expectation-maximization (EM) procedure, while the UBM-DNN is trained for the task of automatic speech recognition (ASR) to distinguish between the different phonetic units.
The most successful conventional techniques consist of adapting the UBM to every speech utterance using the “total variability” paradigm. The total variability paradigm aims to extract a low-dimensional feature vector known as an “i-vector” that preserves the total information about the speaker and the channel. In the i-vector approach, a low-dimensional subspace called the total variability space is used to estimate both speaker and channel variability. Baum-Welch statistics are first computed over the given UBM to estimate the total variability. The UBM, composed of Gaussian components or senones, is used to extract zero-order, first-order, and second-order Baum-Welch statistics (alternatively referred to as “sufficient statistics”). Zero-order statistics are the posterior probabilities of a short-term (usually 10 ms) feature vector computed using each class of the UBM, whereas first-order and second-order statistics are computed using the posterior probabilities and the feature vector. After applying a channel compensation technique, the resulting i-vector can be considered a voiceprint or voice signature of the speaker.
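The statistics described above can be illustrated with a minimal sketch in Python with NumPy, assuming a UBM-GMM with diagonal covariances. The function name `baum_welch_stats` and its parameter names are hypothetical and chosen for illustration only; the posteriors are accumulated over all frames of an utterance, as is common practice.

```python
import numpy as np

def baum_welch_stats(frames, weights, means, variances):
    """Accumulate Baum-Welch statistics of an utterance over a diagonal GMM.

    frames:    (T, D) short-term feature vectors (e.g., one per 10 ms)
    weights:   (C,)   mixture weights of the UBM (assumed layout)
    means:     (C, D) component means
    variances: (C, D) diagonal covariances
    """
    # Log-density of every frame under every Gaussian component: (T, C)
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    # Posterior probability of each component given each frame
    log_post = np.log(weights) + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)

    n = post.sum(axis=0)         # zero-order statistics, shape (C,)
    f = post.T @ frames          # first-order statistics, shape (C, D)
    s = post.T @ frames ** 2     # second-order statistics (diagonal), shape (C, D)
    return n, f, s

# Toy usage with synthetic data (values are arbitrary, not from the source).
rng = np.random.default_rng(0)
demo_frames = rng.normal(size=(50, 3))
demo_n, demo_f, demo_s = baum_welch_stats(
    demo_frames,
    weights=np.array([0.4, 0.6]),
    means=np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]),
    variances=np.ones((2, 3)),
)
```

Because the posteriors of each frame sum to one, the zero-order statistics sum to the number of frames, and the first-order statistics summed over components equal the sum of the frames.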
Conventional speaker recognition systems are based on i-vectors. Those i-vectors are computed via dimensionality reduction of the first-order statistics through the total variability procedure. Training the total variability matrix consists of maximizing the likelihood over the training data using an iterative EM process. A typical tool-chain 100 of an i-vector based speaker recognition system is illustrated in FIG. 2. The chain may include, for example, a voice activity detector (VAD) 110 to discard the non-speech part of speech sample 50 (the remaining speech portion referred to herein as “net speech”); a feature extractor 120, where features such as Mel-frequency cepstral coefficients (MFCC) 125 are extracted and normalized; and a Baum-Welch statistics extractor 130 that uses a pre-trained UBM 128 to generate first-order GMM statistics 135. An i-vector extractor 140 uses the pre-trained total variability matrix to produce an i-vector 145. Post-processing of the i-vector 145 may employ a whitening transformation 150, length normalization 160, linear discriminant analysis (LDA) and/or within-class covariance normalization (WCCN) 170, and probabilistic linear discriminant analysis (PLDA) 180 that is used for both compensation and scoring. Feature extraction and statistics accumulation may be considered a “front end” of a speaker recognition apparatus or system 100.
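Two of the post-processing steps named above, the whitening transformation and length normalization, can be sketched as follows. This is an illustrative sketch only, assuming whitening is estimated by eigendecomposition of the covariance of a set of training i-vectors; the function names `fit_whitener` and `whiten_and_length_normalize` are hypothetical, and LDA, WCCN, and PLDA are not shown.

```python
import numpy as np

def fit_whitener(train_ivectors):
    """Estimate a whitening transform from training i-vectors of shape (N, R)."""
    mean = train_ivectors.mean(axis=0)
    cov = np.cov(train_ivectors - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Floor the eigenvalues to avoid division by zero for degenerate directions.
    eigvals = np.maximum(eigvals, 1e-12)
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue).
    transform = eigvecs / np.sqrt(eigvals)
    return mean, transform

def whiten_and_length_normalize(ivector, mean, transform):
    """Center and whiten an i-vector, then project it onto the unit sphere."""
    whitened = (ivector - mean) @ transform
    return whitened / np.linalg.norm(whitened)

# Toy usage with synthetic i-vectors (values are arbitrary, not from the source).
rng = np.random.default_rng(0)
demo_train = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
demo_mean, demo_transform = fit_whitener(demo_train)
demo_out = whiten_and_length_normalize(demo_train[0], demo_mean, demo_transform)
```

After whitening, the training i-vectors have approximately identity covariance, and length normalization gives every i-vector unit Euclidean norm before scoring.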
One major problem with the i-vector technique is that it is time-consuming during both training and testing. Some applications of authentication and fraud detection require a near-instantaneous decision, particularly at testing time.
A second major problem with the i-vector technique is that i-vectors are not well suited to variable-duration utterances.
Furthermore, as mentioned above, i-vectors account for total variability of the speech signal including both speaker and channel variability. Therefore, the additional post-processing discussed above is often required to remove the channel effect and to reduce the mismatch between training and testing conditions. Several post-processing techniques have been proposed to solve this problem; relevant ones include whitening, length normalization, LDA, WCCN, and PLDA. However, those modeling techniques are based on mathematical assumptions that usually do not hold when the speaker and/or channel variability is very high. For example, the UBM-GMM approach assumes that the audio features (e.g., MFCC) follow a mixture of multivariate Gaussian distributions, and the total variability formulation for i-vector extraction assumes linearity (e.g., x=m+Tw, where x is the supervector of the speech signal, m is the supervector of the UBM-GMM, T is the matrix spanning the low-dimensional total variability subspace, and w is the low-dimensional i-vector).
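To make the linearity assumption concrete, the following sketch shows how an i-vector may be estimated from Baum-Welch statistics under the model x=m+Tw. It uses the closed-form posterior mean commonly given in the i-vector literature, assuming a standard normal prior on w and diagonal UBM covariances; the function name `extract_ivector` and the array layouts are assumptions made for illustration and are not taken from this text.

```python
import numpy as np

def extract_ivector(n, f, means, variances, T):
    """Posterior-mean estimate of w under the linear model x = m + T w.

    n:         (C,)      zero-order statistics
    f:         (C, D)    first-order statistics
    means:     (C, D)    UBM means (the supervector m, unflattened)
    variances: (C, D)    diagonal UBM covariances
    T:         (C*D, R)  total variability matrix (assumed layout)
    """
    C, D = means.shape
    R = T.shape[1]
    # Center the first-order statistics around the UBM means.
    f_centered = (f - n[:, None] * means).reshape(C * D)
    # Repeat each component count D times to match the supervector layout.
    n_expanded = np.repeat(n, D)
    inv_var = 1.0 / variances.reshape(C * D)
    T_scaled = T * inv_var[:, None]                      # Sigma^{-1} T
    # Posterior precision: I + T^T Sigma^{-1} N T
    precision = np.eye(R) + T.T @ (n_expanded[:, None] * T_scaled)
    # Posterior mean: precision^{-1} T^T Sigma^{-1} f_centered
    return np.linalg.solve(precision, T_scaled.T @ f_centered)
```

When the centered first-order statistics are zero, that is, when the utterance matches the UBM exactly, the estimated i-vector is the zero vector. Any utterance that the linear model cannot represent is projected onto the subspace spanned by T, which is one way the assumption can fail under high variability.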
It is an object of this disclosure to address the problems of relatively slow and inefficient computation of features from sufficient statistics. It is another object of this disclosure to address the problems that result from high speaker and/or channel variability.