The present invention, in some embodiments thereof, relates to speech processing and, more specifically, but not exclusively, to Gaussian mixture models in speech processing.
Gaussian Mixture Model (GMM) methods use supervectors to represent speech signals in speech processing tasks, such as speaker recognition, language identification, emotion recognition, speaker diarization, and the like. Each supervector is a vector of parameter values, changes in parameter values (first differentials), second differentials of parameter values, interactions between the parameter values, and the like. For example, a parameter may be a Mel-frequency cepstral coefficient (MFCC) used in speech signal processing. As used herein, the term feature refers to an element of a GMM supervector. In a GMM method, a Universal Background Model (UBM) with diagonal covariance matrices may be applied to the training data to extract the supervectors. For example, a UBM with 512 Gaussian mixtures is applied to the MFCC speech features of the training data, and supervectors are extracted. As used herein, the term Gaussian refers to the Gaussian mixtures used in the UBM. GMM supervector methods are also used in other fields, such as image processing, pattern recognition, computer vision, data mining, and the like, and the embodiments of the invention described herein are relevant to these and other applications that use GMM methods.
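By way of a non-limiting illustration, the following Python sketch shows how per-frame parameter values may be stacked with their first and second differentials, and how the resulting feature dimension relates to the supervector size. The function names (deltas, stack_features, supervector_dim) are hypothetical, the differentials are computed as simple frame-to-frame differences rather than the regression windows used in many practical front ends, and the count of 13 MFCCs per frame is an assumption not taken from the description above; only the 512 Gaussians come from the example above.

```python
def deltas(frames):
    """Frame-to-frame differences; the first frame gets zeros (a simplification)."""
    if not frames:
        return []
    dim = len(frames[0])
    out = [[0.0] * dim]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def stack_features(frames):
    """Concatenate static values, first differentials and second differentials."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


def supervector_dim(num_gaussians, feature_dim):
    """Size of a supervector built from one mean vector per Gaussian."""
    return num_gaussians * feature_dim


# Toy frames with two parameters each (real MFCC frames would be longer).
frames = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
stacked = stack_features(frames)

# Assumed front end: 13 MFCCs plus first and second differentials = 39 values.
size = supervector_dim(512, 3 * 13)
```

Under these assumptions, a 512-Gaussian UBM over 39-dimensional features yields a supervector of 19,968 elements.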
In a GMM Nuisance Attribute Projection (NAP) framework, a GMM is adapted from a UBM to data from multiple sessions, such as an enrollment session, a testing session, a development session, a training session, and/or the like, using an iterative Expectation-Maximization (EM) method, a Maximum A Posteriori (MAP) estimation, and/or the like. A projection function is estimated from the training data and is used to compensate for intra-speaker intersession variability, such as channel variability and the like.
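By way of a non-limiting illustration, the following Python sketch shows one possible form of the two steps described above: a mean-only relevance-MAP adaptation of a diagonal-covariance UBM to session data, followed by a projection that removes a nuisance subspace from the resulting supervector. All function names are hypothetical, the relevance factor of 16 is an assumed value, and the nuisance basis is assumed to be orthonormal and supplied by the caller; the estimation of that basis from training data is not shown.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only relevance-MAP adaptation of UBM means to the given frames."""
    n_comp, dim = len(means), len(means[0])
    occupancy = [0.0] * n_comp                         # zeroth-order statistics
    first_order = [[0.0] * dim for _ in range(n_comp)]  # first-order statistics
    for x in frames:
        log_p = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        peak = max(log_p)                               # for numerical stability
        post = [math.exp(lp - peak) for lp in log_p]
        total = sum(post)
        for c in range(n_comp):
            gamma = post[c] / total                     # posterior of component c
            occupancy[c] += gamma
            for d in range(dim):
                first_order[c][d] += gamma * x[d]
    # Interpolate between the data mean and the UBM mean per component.
    return [
        [(first_order[c][d] + relevance * means[c][d]) / (occupancy[c] + relevance)
         for d in range(dim)]
        for c in range(n_comp)
    ]


def supervector(means):
    """Concatenate the component means into a single vector."""
    return [value for mean in means for value in mean]


def nap_project(s, basis):
    """Compute s - U U^T s for an orthonormal nuisance basis U (list of vectors)."""
    out = list(s)
    for u in basis:
        coeff = sum(ui * si for ui, si in zip(u, s))
        out = [oi - coeff * ui for oi, ui in zip(out, u)]
    return out


# Toy one-dimensional UBM with two components and 16 identical frames.
adapted = map_adapt_means([[1.0]] * 16, [0.5, 0.5],
                          [[0.0], [10.0]], [[1.0], [1.0]])
compensated = nap_project(supervector(adapted), [[1.0, 0.0]])
```

In this toy case, the first component receives nearly all of the frames and its mean moves halfway toward the data, while the second component stays essentially at its UBM mean; the projection then zeroes the supervector element along the assumed nuisance direction.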
GMM supervectors are constructed from a concatenation of GMM means and the like, with typical dimensions of between 10,000 and 100,000 elements. GMM supervector methods, such as i-vector extractor training, Joint Factor Analysis (JFA), NAP, and the like, require the estimation of covariance matrices to analyze the data. In order to estimate these covariance matrices, training data may be used to calculate the interdependence of the variability of the different supervector parameters. For example, training a speaker recognition system may use hundreds or even thousands of speech samples from different speakers because of the large size of the supervectors. Sometimes there may be differences between the training data and the evaluated data. For example, when the target data is mismatched to the available training data due to channel mismatch, such as in text-dependent speaker recognition, target data may be collected and used to train the speaker recognition system from scratch, to adapt an already existing speaker recognition system, and the like.
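By way of a non-limiting illustration, the following Python sketch indicates why the size of the supervectors drives the amount of training data: it counts the free parameters of a full covariance matrix for the dimensions mentioned above and computes an unbiased sample covariance for a toy data set. The function names are hypothetical, and the toy vectors are chosen only to show that a covariance estimated from few samples is rank deficient.

```python
def covariance_free_parameters(dim):
    """Number of distinct entries in a symmetric dim x dim covariance matrix."""
    return dim * (dim + 1) // 2


def sample_covariance(vectors):
    """Unbiased sample covariance of a list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    centered = [[v[d] - mean[d] for d in range(dim)] for v in vectors]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


# Parameter counts for the lower and upper supervector sizes.
small = covariance_free_parameters(10_000)
large = covariance_free_parameters(100_000)

# Two samples in two dimensions: the estimate has rank one.
cov = sample_covariance([[1.0, 2.0], [3.0, 6.0]])
determinant = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
```

A full covariance matrix for a 10,000-element supervector has 50,005,000 distinct entries, and the toy estimate from two samples has a zero determinant, which is consistent with the need for many speech samples when the supervectors are large.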