As is known, a speaker recognition system is a device capable of extracting, storing and comparing biometric characteristics of human voice, and of performing, in addition to a recognition function, also a training procedure, which enables storage of voice biometric characteristics of a speaker in appropriate models, commonly referred to as voice-prints. The training procedure is to be carried out for all the speakers concerned and is preliminary to subsequent recognition steps, during which the parameters extracted from an unknown voice sample are compared with those of the voice-prints for producing the recognition result.
Two specific applications of a speaker recognition system are speaker verification and speaker identification. In the case of speaker verification, the purpose of recognition is to confirm or reject a declaration of identity associated with the uttering of a sentence or word. That is, the system must answer the question: “Is the speaker the person he/she claims to be?” In the case of speaker identification, the purpose of recognition is to identify, from a finite set of speakers whose voice-prints are available, the one to which an unknown voice corresponds. In this case, the system must answer the question: “Who does the voice belong to?”
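The difference between the two tasks can be illustrated with a minimal sketch. It assumes that each voice-print has already produced a numeric score for the unknown utterance (higher meaning a better match); the function names, speaker labels, scores and threshold are hypothetical and serve only as an illustration.

```python
def verify(claimed_score: float, threshold: float) -> bool:
    """Speaker verification: accept the identity claim if the score is high enough."""
    return claimed_score >= threshold


def identify(scores: dict[str, float]) -> str:
    """Closed-set speaker identification: return the best-scoring enrolled speaker."""
    return max(scores, key=scores.get)


# Hypothetical scores of one unknown utterance against three voice-prints.
scores = {"speaker_a": -41.2, "speaker_b": -37.9, "speaker_c": -44.5}

# Verification: the speaker claims to be "speaker_a".
accepted = verify(scores["speaker_a"], threshold=-40.0)

# Identification: pick the most likely speaker from the finite enrolled set.
best = identify(scores)
```

Verification thus reduces to a binary decision on a single voice-print, whereas identification compares all available voice-prints.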
A further classification of speaker recognition systems regards the lexical content usable by the recognition system: text-dependent speaker recognition or text-independent speaker recognition. The text-dependent case requires that the lexical content used for verification or identification should correspond to what is uttered for the creation of the voice-print: this situation is typical in voice authentication systems, in which the word or sentence uttered assumes, to all purposes and effects, the connotation of a voice password. The text-independent case does not, instead, set any constraint between the lexical content of training and that of recognition.
Hidden Markov Models (HMMs) are a classic technology used for speech and speaker recognition. In general, a model of this type consists of a certain number of states connected by transition arcs. Associated with each transition is a probability of passing from the origin state to the destination state. In addition, each state can emit symbols from a finite alphabet according to a given probability distribution. A probability density is associated with each state and is defined on a vector of acoustic features extracted from the voice at fixed time intervals (for example, every 10 ms). This vector is generated by an acoustic analysis module (acoustic front-end) and is generally referred to as an observation or feature vector. The symbols emitted, on the basis of the probability density associated with the state, are hence the infinite possible feature vectors. This probability density is given by a mixture of Gaussians in the multidimensional space of the feature vectors. Examples of features widely used for speaker recognition are the Mel-Frequency Cepstrum Coefficients (MFCC), to which first-order time-derivative features are usually added.
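As a rough illustration of how time-derivative features may be appended to the basic features, the following sketch uses simple finite differences along the time axis. This is an assumption made for brevity: many acoustic front-ends compute the derivatives by regression over a window of several frames. The function name and the random input, standing in for real MFCCs, are hypothetical.

```python
import numpy as np


def add_deltas(features: np.ndarray) -> np.ndarray:
    """Append first-order time derivatives to a (frames, coefficients) matrix."""
    # Finite differences along the time axis (assumed stand-in for a
    # regression-based delta computation).
    deltas = np.gradient(features, axis=0)
    return np.hstack([features, deltas])


# Hypothetical input: 100 frames (e.g. one every 10 ms) of 13 MFCCs each.
mfcc = np.random.default_rng(0).normal(size=(100, 13))
observations = add_deltas(mfcc)  # one 26-dimensional feature vector per frame
```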
When Hidden Markov Models are applied to speaker recognition, so-called Gaussian Mixture Models (GMMs) are frequently used in addition to the multi-state HMMs described above. A GMM is a Markov model with a single state and a transition arc towards itself. Generally, the probability density of a GMM is a mixture of multivariate Gaussian distributions with a cardinality of the order of some thousands of Gaussians. Multivariate Gaussian distributions are commonly used to model the multidimensional input feature vectors. In the case of text-independent speaker recognition, GMMs represent the category of models most widely used in the prior art.
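The mixture density of a GMM can be evaluated on a sequence of feature vectors as sketched below. The sketch assumes diagonal covariance matrices, which the text does not specify, and uses a tiny three-component model with random parameters purely for illustration; the function name and all parameter values are hypothetical.

```python
import numpy as np


def gmm_avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D) feature vectors; weights: (M,); means, variances: (M, D).
    """
    frames = np.atleast_2d(np.asarray(frames, dtype=float))
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    diff = frames[:, None, :] - means[None, :, :]                     # (T, M, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)     # (M,)
    log_gauss = log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)  # (T, M)
    log_weighted = np.log(weights) + log_gauss

    # Log-sum-exp over the mixture components, for numerical stability.
    peak = log_weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(frame_ll.mean())


# Hypothetical 3-component GMM over 26-dimensional feature vectors.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.3, 0.2])
means = rng.normal(size=(3, 26))
variances = np.ones((3, 26))
frames = rng.normal(size=(50, 26))
score = gmm_avg_log_likelihood(frames, weights, means, variances)
```

A score of this kind, computed against a voice-print, is what the recognition step compares across models.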
Speaker recognition is performed by creating, during a training step, models adapted to the voice of the speakers concerned and by evaluating, during a subsequent recognition step, the probability with which these models generate the feature vectors extracted from an unknown voice sample. Models adapted to individual speakers, which may be either HMMs or GMMs, are commonly referred to as voice-prints. A description of voice-print training techniques applied to GMMs and of their use for speaker recognition is provided in Reynolds, D. A. et al., Speaker verification using adapted Gaussian mixture models, Digital Signal Processing 10 (2000), pp. 19-41.
One of the main causes of significant performance degradation in automatic speech and speaker recognition is the acoustic mismatch that occurs between training and recognition conditions. In particular, in speaker recognition, errors are due not only to the similarity among voice-prints of different speakers, but also to the intrinsic variability of different utterances of the same speaker. Moreover, performance is heavily affected when a model, trained in certain conditions, is used to recognize a speaker's voice collected via different microphones, channels, and environments. All these mismatching conditions are generally referred to as intersession variability.
Several proposals have been made to counteract the effects of intersession variability in both the feature and the model domains.
A popular technique used to improve the performance of a speaker recognition system by compensating the acoustic features is Feature Mapping, a description of which may be found in D. Reynolds, Channel Robust Speaker Verification via Feature Mapping, in Proc. ICASSP 2003, pp. II-53-6, 2003. In particular, Feature Mapping uses the a priori information of a set of channel-dependent models, trained in known conditions, to map the feature vectors toward a channel-independent feature space. Given an input utterance, the most likely channel-dependent model is first detected and then each feature vector in the utterance is mapped to the channel-independent space based on the Gaussian selected in the channel-dependent GMM. The drawback of this approach is that it requires labeled training data to create the channel-dependent models related to the conditions for which one wants to compensate.
Thus, model-based techniques have recently been proposed that are able to compensate for speaker and channel variations without requiring explicit identification and labeling of different conditions. These techniques share a common background, namely modeling the variability of speaker utterances by constraining it to a low-dimensional eigenspace. Thanks to the reduced dimension of the constrained eigenspace, model-based techniques allow robust intersession compensation even when only a small amount of speaker-dependent data is available.
In general, all the model-based eigenspace techniques construct supervectors from the acoustic models. A supervector is obtained by appending the parameters of all the Gaussians of an HMM/GMM in a single list. Typically, only the Gaussian mean parameters are included in the supervectors. Considering, for instance, a 512-Gaussian GMM modeling 13 MFCC + 13 time-derivative features, a supervector of 512×26=13312 features is generated.
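The construction of a mean supervector can be sketched as follows, using the dimensions of the example above. The function name is hypothetical, and the zero-valued means are placeholders for the means of a trained GMM.

```python
import numpy as np


def build_supervector(means: np.ndarray) -> np.ndarray:
    """Append the mean vectors of all Gaussians into a single list of values."""
    # Row-major flattening: all features of Gaussian 0, then Gaussian 1, etc.
    return np.asarray(means, dtype=float).reshape(-1)


# Placeholder means for a 512-Gaussian GMM over 13 MFCC + 13 derivative features.
means = np.zeros((512, 26))
supervector = build_supervector(means)  # 512 * 26 = 13312 values
```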
The speaker or channel compensation is then performed by applying the following equation:

$$\hat{\mu} = \mu + Ux \qquad (1)$$

where $\mu$ and $\hat{\mu}$ are respectively the uncompensated and compensated supervectors, $Ux$ is a compensation offset, $U$ is a low-rank transformation matrix from the constrained intersession variability subspace to the supervector space, and $x$ is a low-dimensional representation of the intersession variability in the constrained intersession variability subspace.
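Equation (1) amounts to a single matrix-vector product followed by an addition, as the sketch below shows. The supervector dimension follows the 512-Gaussian example given earlier, while the subspace dimension K = 20, the function name and the random values are assumptions chosen only for illustration.

```python
import numpy as np


def compensate(mu: np.ndarray, U: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Apply equation (1): compensated supervector = mu + U x."""
    return mu + U @ x


rng = np.random.default_rng(0)
D, K = 13312, 20             # supervector size; K is an assumed subspace size
mu = rng.normal(size=D)      # uncompensated supervector
U = rng.normal(size=(D, K))  # low-rank transformation matrix (D x K)
x = rng.normal(size=K)       # low-dimensional intersession representation
mu_hat = compensate(mu, U, x)
```

Because K is much smaller than D, only K values need to be estimated per recording, which is what makes the compensation feasible with little data.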
In U.S. Pat. No. 6,327,565, U.S. Pat. No. 6,141,644 and S. Lucey and T. Chen, Improved Speaker Verification Through Probabilistic Subspace Adaptation, Proc. EUROSPEECH-2003, pp. 2021-2024, 2003, the subspace matrix U for speaker compensation is built by collecting a large number of speaker-dependent models of different speakers and applying a linear transformation that reduces the high-dimensional supervectors to a set of basis vectors. Principal Component Analysis (PCA) is usually used to construct the transformation matrix U as a concatenation of the K eigenvectors corresponding to the K largest eigenvalues. The selected eigenvectors are commonly known as eigenspeakers or eigenvoices because every speaker-dependent model can be approximately represented as a linear combination of these basis vectors in the supervector domain.
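A PCA-based construction of U can be sketched as follows. This is a generic PCA computed via singular value decomposition of the centered supervectors, not the specific procedure of any of the cited documents; the function name, the toy dimensions and the random supervectors are hypothetical.

```python
import numpy as np


def eigenvoices(supervectors: np.ndarray, k: int):
    """Return the mean supervector, the (D, k) matrix U and the k largest eigenvalues."""
    S = np.asarray(supervectors, dtype=float)  # (N speakers, D)
    mean = S.mean(axis=0)
    centered = S - mean
    # The right singular vectors of the centered data are the eigenvectors of
    # the sample covariance matrix, already sorted by decreasing eigenvalue.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    U = vt[:k].T
    eigvals = sing[:k] ** 2 / (S.shape[0] - 1)
    return mean, U, eigvals


# Toy data: 10 hypothetical speaker supervectors of dimension 6.
rng = np.random.default_rng(0)
speaker_supervectors = rng.normal(size=(10, 6))
mean_sv, U, eigvals = eigenvoices(speaker_supervectors, k=3)

# Approximate one speaker model as the mean plus a combination of eigenvoices.
coords = U.T @ (speaker_supervectors[0] - mean_sv)
approximation = mean_sv + U @ coords
```

Working with the SVD of the data matrix avoids forming the D×D covariance matrix explicitly, which matters when D is of the order of tens of thousands.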
A similar approach for channel compensation in speaker recognition is proposed in P. Kenny, M. Mihoubi, and P. Dumouchel, New MAP Estimators for Speaker Recognition, Proc. EUROSPEECH-2003, pp. 2964-2967, 2003. In particular, this technique, called eigenchannel MAP in the publication, constructs the constrained eigenspace from a large number of supervectors representing the intra-speaker variability. In order to estimate the eigenchannels, a number of speaker models from a large collection of speakers, and a training set comprising several recordings of each of these speakers, are needed.
In R. Vogt, B. Baker, S. Sridharan (2005): Modelling session variability in text-independent speaker verification, in Proc. INTERSPEECH-2005, 3117-3120, the intersession variability compensation is performed using equation (1). In this case, the transformation matrix $U$ is trained by an expectation maximization (EM) algorithm to represent the types of intra-speaker variations expected between sessions. To this end, the subspace is trained on a database containing a large number of speakers, each with several independently recorded sessions. Moreover, an iterative procedure to estimate the clean speaker supervector ($\mu$ in the equation) is proposed. In the verification step, each target model is compensated on a given test utterance $i$:

$$\hat{\mu}_i(s) = \mu(s) + U x_i(s) \qquad (2)$$
Compensation is performed by first estimating the low-dimensional representation of the intersession variability in recording $i$ of speaker $s$, namely $x_i(s)$, and then compensating the speaker supervector to recording $i$, obtaining the compensated supervector $\hat{\mu}_i(s)$. In particular, compensation is performed by computing the offset $U x_i(s)$ in the supervector space as the projection of the intersession variability vector $x_i(s)$ to the supervector space, through the low-rank transformation matrix $U$, from the constrained intersession variability subspace to the supervector space.
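The two steps just described (estimate $x_i(s)$, then compensate) can be sketched as follows. The estimator used here is a plain ridge-regularized least-squares fit, which is an assumption made for illustration and is not the estimation procedure of the cited publication. It further assumes that a supervector m_i summarizing recording i is available; that supervector, the function names and all dimensions are hypothetical.

```python
import numpy as np


def estimate_session_factor(m_i, mu_s, U, ridge=0.0):
    """Least-squares estimate of x_i(s) from the model m_i ~ mu_s + U x."""
    k = U.shape[1]
    # Normal equations with an optional ridge term (assumed estimator).
    return np.linalg.solve(U.T @ U + ridge * np.eye(k), U.T @ (m_i - mu_s))


def compensate_for_recording(mu_s, U, x_i):
    """Apply equation (2): speaker supervector plus the offset U x_i(s)."""
    return mu_s + U @ x_i


# Synthetic check: build a recording supervector from a known session factor.
rng = np.random.default_rng(0)
D, K = 60, 4
U = rng.normal(size=(D, K))   # low-rank transformation matrix
mu_s = rng.normal(size=D)     # speaker supervector mu(s)
x_true = rng.normal(size=K)
m_i = mu_s + U @ x_true       # hypothetical supervector of recording i

x_i = estimate_session_factor(m_i, mu_s, U)
mu_hat_i = compensate_for_recording(mu_s, U, x_i)
```

In this noise-free synthetic case the estimate recovers the session factor exactly; with real data a non-zero ridge term would keep the estimate stable.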