Current state-of-the-art approaches to speaker recognition are based on a universal background model (UBM) estimated using either acoustic Gaussian mixture modeling (GMM) (see Douglas A. Reynolds et al., “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, 2000, the entire contents of which are herein incorporated by reference) or a phonetically-aware deep neural network architecture (see Y. Lei et al., “A Novel Scheme for Speaker Recognition Using a Phonetically-Aware Deep Neural Network,” Proceedings of ICASSP 2014, the entire contents of which are herein incorporated by reference). The most successful techniques adapt the UBM to every speech utterance using the total variability paradigm (see N. Dehak et al., “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 4, pp. 788-798, May 2011, the entire contents of which are herein incorporated by reference). The total variability paradigm aims to extract a low-dimensional feature vector, known as an “i-vector,” that preserves the total information about the speaker and the channel. After a channel compensation technique is applied, the resulting i-vector can be considered a voiceprint or voice signature of the speaker.
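For reference, the total variability model of the cited Dehak et al. paper is commonly written as shown below. The symbols follow the notation of that paper and are not defined elsewhere in the present text: M is the utterance-dependent GMM mean supervector, m is the speaker- and channel-independent mean supervector taken from the UBM, T is a low-rank total variability matrix, and w is the i-vector.

```latex
M = m + T w, \qquad w \sim \mathcal{N}(0, I)
```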
The main drawback of such approaches is that, by using only handcrafted features designed to reproduce the human perception system, they tend to discard useful information that is important for recognizing or verifying speakers. Typically, the aforementioned approaches utilize low-level features, such as Mel Frequency Cepstrum Coefficients (MFCCs), and attempt to fit them to a fixed number of Gaussian distributions (commonly 1024 or 2048 Gaussians). This makes it difficult to model complex structures in a feature space where the Gaussian assumption does not necessarily hold.
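As a rough illustration of the Gaussian assumption discussed above, the following sketch scores a single feature frame against a diagonal-covariance GMM and computes the per-component posteriors. It is a minimal example under stated assumptions, not the method of any cited reference: the function names are hypothetical, the covariances are assumed diagonal, and a practical UBM would use on the order of 1024 or 2048 components over MFCC vectors rather than the toy sizes that such a sketch would normally be tried with.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def component_log_terms(x, weights, means, variances):
    """Log of (mixture weight * component density) for every component."""
    return [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one frame under the GMM, using log-sum-exp."""
    terms = component_log_terms(x, weights, means, variances)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given the frame."""
    terms = component_log_terms(x, weights, means, variances)
    total = gmm_log_likelihood(x, weights, means, variances)
    return [math.exp(t - total) for t in terms]
```

Every frame is explained only as a weighted sum of a fixed set of Gaussian components, which is the constraint that limits how well structure in a non-Gaussian feature space can be captured.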