Speaker recognition systems can be used to confirm or reject a person's claimed identity (speaker verification) and can also be used to determine which of a plurality of known persons is speaking (speaker identification). Such a speaker identification system is open-set if the speaker may not be one of the persons known to the system, or closed-set if the speaker is always in the set of persons known to the system. Such systems may find application in telephone banking and suspect identification, and may generally be used in a security-related context.
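The distinction between verification, closed-set identification and open-set identification may be illustrated by the following minimal sketch. The function names, the score dictionary and the threshold values are hypothetical and chosen for illustration only; how the scores themselves are computed is left open here.

```python
def verify(score, threshold):
    """Speaker verification: accept the claimed identity if the score
    for that identity reaches the (hypothetical) decision threshold."""
    return score >= threshold


def identify(scores, threshold=None):
    """Speaker identification over a dict {speaker_name: score}.

    threshold=None models the closed-set case: the best-scoring known
    speaker is always returned. With a threshold, the open-set case is
    modelled: None is returned when even the best score is too low,
    i.e. the speaker is assumed to be unknown to the system.
    """
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None
    return best


# Illustrative scores only (higher means more similar).
example_scores = {"alice": 0.2, "bob": 0.7}
closed_set_result = identify(example_scores)
open_set_result = identify(example_scores, threshold=0.9)
```

In this sketch the only difference between the two identification modes is whether a rejection option exists.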
Such speaker recognition systems may require the user to say the same lexical content (e.g. the same key phrase) for both the enrolment and the recognition. Such a system is a text-dependent system, offering in some cases additional security because it requires recognizing the identity of the speaker as well as the lexical content of the utterance.
Such recognition systems may also be text-independent, thus not setting any constraint with regard to the lexical content of the enrolment and recognition utterances. Such systems may have the advantage that people may be identified, for example, from common conversations (e.g. everyday conversations), or may be enrolled using such common conversations of which files already exist.
Document US 2008/0312926 A1 discloses automatic text-dependent, language-independent speaker voice-print creation and speaker recognition based on Hidden Markov Models (HMM) and Automatic Speech Recognition (ASR) systems. Document US 2007/0294083 A1 discloses a fast, text-dependent, language-independent method for user authentication by voice based on Dynamic Time Warping (DTW). Document U.S. Pat. No. 6,094,632 discloses a speaker recognition device in which the outputs of the ASR system and the Speaker Identification (SID) system are combined.
Patrick Kenny provides an introduction to speaker verification related methods, in particular an algorithm which may be used in speaker recognition systems, in his article “Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms”.
Another prior art document is “Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification” by N. Dehak et al., in Interspeech, Brighton, UK, September 2009.
It is known to use Hidden Markov Models (HMM) consisting of a set of states which each correspond to a deterministically observable event and which are connected by transition probability arcs. States are defined on a vector of parameters which are extracted from the voice signal. Each state has an associated probability density function (pdf), which models the feature vectors associated with that state. Such a probability density function may for example be a mixture of Gaussian functions (Gaussian Mixture Model, GMM) in the multi-dimensional space of the feature vectors, but other distributions may also be used.
The Hidden Markov Model is defined by three sets of quantities: the transition probabilities Πqq′, which are associated with the arcs and represent the probability of moving from state q to state q′; the initial state distribution Π0q, which gives the initial probability of each state q; and the observation probability distribution λq, which is associated with the state q and may for example be a GMM. The observation probability distributions are in turn defined by a set of parameters depending on the nature of the distributions.
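As an illustration of how these quantities fit together, the following minimal sketch evaluates the log-likelihood of a sequence of feature vectors with the forward recursion. It assumes GMM observation distributions with diagonal covariances; all function names and numeric values are hypothetical and not taken from any of the cited documents.

```python
import math


def slog(p):
    """Logarithm that maps a zero probability to -inf instead of failing."""
    return math.log(p) if p > 0 else -math.inf


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_gmm(x, gmm):
    """Observation log density for one state.
    gmm is a list of (weight, mean, var) mixture components."""
    return logsumexp([slog(w) + log_gauss(x, mu, var) for w, mu, var in gmm])


def forward_loglik(obs, init, trans, gmms):
    """Log-likelihood of obs under an HMM given by the initial state
    probabilities init[q], the transition probabilities trans[q][q2]
    and one observation GMM per state, gmms[q]."""
    n_states = len(init)
    alpha = [slog(init[q]) + log_gmm(obs[0], gmms[q])
             for q in range(n_states)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[p] + slog(trans[p][q])
                            for p in range(n_states)]) + log_gmm(x, gmms[q])
                 for q in range(n_states)]
    return logsumexp(alpha)


# Hypothetical two-state left-to-right model over 2-dimensional features.
init = [1.0, 0.0]
trans = [[0.6, 0.4],
         [0.0, 1.0]]
gmms = [[(1.0, [0.0, 0.0], [1.0, 1.0])],
        [(0.5, [3.0, 3.0], [1.0, 1.0]), (0.5, [4.0, 2.0], [2.0, 2.0])]]
obs = [[0.1, -0.2], [0.3, 0.1], [2.9, 3.2], [3.8, 2.1]]
loglik = forward_loglik(obs, init, trans, gmms)
```

Working in the log domain is a design choice made here to avoid numerical underflow on longer utterances.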
Conventional approaches for using Hidden Markov Models in text-dependent speaker recognition frameworks usually require a transcription of the phrase used. The transcription is needed to build the speaker HMM by applying some kind of HMM adaptation to a starting point model, such as a concatenation of generic HMMs representing units (e.g. phonemes or words) of the audio signal, e.g. the phrase. Examples of such adaptations are Maximum A Posteriori (MAP) adaptation (as disclosed e.g. in J.-L. Gauvain and C.-H. Lee, “Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains”, IEEE Transactions on Speech and Audio Processing, 2(2): 291-298) and Maximum Likelihood Linear Regression (MLLR) (as disclosed e.g. in C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of the parameters of continuous density Hidden Markov Models”), but other adaptations may also be used. In this framework, the generic HMMs are usually called the Universal Background Model (UBM). From this, a score can be computed using a suitable algorithm, for example the Viterbi or forward-backward algorithm, as disclosed e.g. in L. R. Rabiner, “A tutorial on Hidden Markov Models and selected applications in speech recognition”, Proc. of the IEEE 77(2): 257-286, DOI:10.1109/5.18626 [1].
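The idea of adapting a speaker model from a generic starting point model may be illustrated by the following minimal sketch of MAP adaptation of the means of a single GMM. It uses the widely used relevance-factor form of the update, which is not necessarily the exact formulation of the cited article. Scalar features, the function names and the relevance factor of 16 are assumptions made for brevity.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian (scalar features assumed)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Return speaker-adapted means, starting from the generic (UBM) GMM.

    For each component k, the soft count n[k] and the first-order
    statistic s[k] are accumulated from the enrolment data, and the new
    mean is (s[k] + relevance * ubm_mean[k]) / (n[k] + relevance).
    Components that see little data therefore stay close to the UBM.
    """
    n_comp = len(means)
    n = [0.0] * n_comp
    s = [0.0] * n_comp
    for x in data:
        lik = [w * gauss_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        if total == 0.0:
            continue  # guard against underflow for far-away samples
        for k in range(n_comp):
            gamma = lik[k] / total  # posterior of component k given x
            n[k] += gamma
            s[k] += gamma * x
    return [(s[k] + relevance * means[k]) / (n[k] + relevance)
            for k in range(n_comp)]


# Hypothetical UBM with two components and hypothetical enrolment data.
ubm_weights, ubm_means, ubm_vars = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
speaker_means = map_adapt_means([1.0] * 16, ubm_weights, ubm_means, ubm_vars)
```

With these illustrative values, the first mean moves part of the way toward the enrolment data, while the second component, which explains almost none of the data, remains essentially at its UBM value.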
Such generic HMMs usually require supervised training because every unit (e.g. phoneme, word, . . . ) needs to be associated with a certain HMM. Speaker recognition frameworks can accordingly be classified depending on how the transcription is obtained. Possible ways of obtaining such a transcription include prior knowledge, conventional speech recognition systems, or universal speech recognition systems as described for example in US 2008/0312926. However, these approaches generally require supervised training and/or are computationally intensive, require a large amount of memory, are usually language dependent and/or are not very flexible. The classical approaches for text-dependent HMM-based speaker recognition systems may additionally have the disadvantage that the speaker HMM has a direct relation to the transcription, which may be stolen at one or more points of the system.
In classical speaker recognition using HMM adaptation techniques, all the information of the feature vectors is incorporated into the speaker model, even though some information, like for example the channel, is not a typical feature of the speaker and should thus not be included in the speaker model.
For these reasons, classical text-dependent speaker recognition approaches have considerable limitations.
Some of their problems are those described above: the storage of the transcription, or of an estimate of the transcription, of the speaker phrase; the use of a speech recognizer or phonetic decoder, which makes the system use a lot of memory and renders it unsuitable for small devices like tablets or smart phones; and the fact that they do not compensate for the channel or for other negative effects on the speech signal.
Preferably, an improved system may take advantage of the information of the temporal sequence of the feature vectors, which may be extracted from the audio signals, and provide satisfactory performance and accuracy without using a transcription of the utterance, e.g. the speech phrase.