This invention relates to the field of speaker verification and more particularly to a method and apparatus for generating data that is specific to a user and that can be used by a speaker verification system to authenticate the user based on a speech pattern. This invention is applicable to speech activated security systems, such as those controlling access to voice-mail, automated telephone services, automated banking services and voice directed computer applications, among others.
Speaker verification is the process of verifying whether a given speaker is a claimed speaker. The process is based on comparing a verification attempt with a speaker specific speech pattern representative of the claimed speaker and then calculating the likelihood of the verification attempt actually being generated by the claimed speaker. A common approach is to determine the likelihood of the verification attempt being generated by the claimed speaker given the speaker specific speech pattern. Typically, if the calculated likelihood is above a certain threshold then the verification attempt is accepted as being generated by the claimed speaker. Otherwise, the verification attempt is rejected. The level of the threshold depends on a number of factors such as the level of security required and therefore on the level of tolerance for false acceptance or false rejection.
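By way of illustration only, the threshold decision described above, and the trade-off between false acceptance and false rejection, may be sketched as follows. The function names, the score values and the use of a plain list of scores are assumptions made for this example and do not appear in the original text.

```python
from typing import Sequence, Tuple


def accept(score: float, threshold: float) -> bool:
    """Accept the verification attempt only if its score is above the threshold."""
    return score > threshold


def error_rates(
    genuine_scores: Sequence[float],
    impostor_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false acceptance rate, false rejection rate) for one threshold.

    genuine_scores: scores of attempts made by the claimed speaker (hypothetical data).
    impostor_scores: scores of attempts made by other speakers (hypothetical data).
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    false_acceptances = sum(1 for s in impostor_scores if accept(s, threshold))
    false_rejections = sum(1 for s in genuine_scores if not accept(s, threshold))
    return (
        false_acceptances / len(impostor_scores),
        false_rejections / len(genuine_scores),
    )


# Hypothetical scores: raising the threshold trades false acceptances
# for false rejections.
genuine = [2.0, 1.5, 0.4]
impostor = [-1.0, 0.5, -0.2]
low_threshold_rates = error_rates(genuine, impostor, threshold=0.0)
high_threshold_rates = error_rates(genuine, impostor, threshold=1.0)
```

With these made-up scores, the lower threshold accepts one impostor attempt and rejects no genuine attempt, while the higher threshold rejects every impostor attempt at the cost of rejecting one genuine attempt.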
Speaker verification systems can be characterized as being either password non-specific, where the verification is entirely done on the basis of the voice of the speaker, or password specific, where the speaker must utter a specific password in addition to having the proper voice. Password specific speaker verification systems are desirable because an additional level of security is added since the speaker must utter the correct password in addition to having a voice with the correct acoustic properties. In addition, password specific speaker verification systems may be desirable when a given functionality in a system using speaker verification is operatively linked to a given password.
A common approach for improving the speaker verification process is the use of normalizing techniques such as the world normalizing model, the background normalizing model and the cohort normalizing model. The world, background and cohort normalizing models perform verification on the basis of a template representing the claimed speaker, and a template that is independent of the claimed speaker. The template representing the claimed speaker is herein referred to as the speaker specific speech pattern. The template that is independent of the claimed speaker is herein referred to as a normalizing template. In broad terms, normalizing techniques involve computing a likelihood score indicative of a probability that the verification attempt was generated by the claimed speaker and normalizing the likelihood score by a second score, herein referred to as the normalizing score. For additional information on the background, cohort and world normalizing methods, the reader is invited to refer to Gu et al. (1998) “An Implementation and Evaluation of an On-line Speaker Verification System for Field Trials” Proc. ICASSP '98, pp. 125-128 and to Rosenberg et al. (1996) “Speaker Background Models for Connected Digit Password Speaker Verification” Proc. ICASSP '96, pp. 81-84. The contents of these documents are hereby incorporated by reference.
In the cohort normalizing method, the normalizing template is indicative of a template representing the most competitive speaker specific speech pattern selected from a group of speaker specific speech patterns. This is done by scoring the verification attempt against various speaker specific speech patterns in a set of speaker specific speech patterns excluding the speaker specific speech pattern associated to the claimed speaker. The speaker specific speech patterns in the set are indicative of a same password uttered by different speakers. The highest scoring speaker specific speech pattern in the database of speaker specific speech patterns is retained as the most competitive speaker specific speech pattern for use in the normalizing process. The score of the verification attempt on the speaker specific speech pattern associated to the claimed speaker is compared to the score of the verification attempt on the most competitive speaker specific speech pattern in order to determine whether the given speaker is to be accepted as the claimed speaker or not.
Mathematically the cohort normalizing method can be expressed as follows:
$$\log L(O) = \log p(O \mid \lambda_c) - \max_i \{\log p(O \mid \lambda_i)\}$$
where $L(O)$ is the likelihood of a verification attempt observation $O$; $p(O \mid \lambda_c)$ is the probability that the observation $O$ corresponds to the parameters given by $\lambda_c$, representative of the speaker specific speech pattern associated to the claimed speaker; and $p(O \mid \lambda_i)$ is the probability that the observation $O$ corresponds to the parameters given by $\lambda_i$, representative of the i-th speaker specific speech pattern in a set of speaker specific speech patterns other than the one associated to the claimed speaker. The term $\max_i \{\log p(O \mid \lambda_i)\}$ represents the logarithmic likelihood of the most competitive speaker specific speech pattern.
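As a minimal sketch of the expression above, assuming the log-likelihoods of the verification attempt have already been computed by some scoring engine, the cohort normalized score is the claimed speaker's log-likelihood minus the largest cohort log-likelihood. The function name and the numeric values are hypothetical.

```python
from typing import Sequence


def cohort_normalized_score(
    claimed_log_lik: float,
    cohort_log_liks: Sequence[float],
) -> float:
    """Return the claimed speaker's log-likelihood minus that of the most
    competitive cohort pattern.

    claimed_log_lik: log-likelihood of the attempt on the claimed speaker's pattern.
    cohort_log_liks: log-likelihoods of the same attempt on the other speakers'
        patterns; the claimed speaker's own pattern must be excluded.
    """
    if not cohort_log_liks:
        raise ValueError("the cohort must contain at least one pattern")
    most_competitive = max(cohort_log_liks)
    return claimed_log_lik - most_competitive


# Hypothetical log-likelihoods for one verification attempt.
score = cohort_normalized_score(-40.0, [-55.0, -47.5, -60.0])
```

A positive score indicates that the attempt matches the claimed speaker's pattern better than it matches any pattern in the cohort; the score is then compared to the acceptance threshold.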
In the background normalizing method, the normalizing template is derived by combining speaker specific speech models from a set of speech models associated to possible imposters to form a template. The speech models selected to be part of the normalizing template are typically selected on a basis of a similarity measurement. The score of the verification attempt on the speaker specific speech pattern associated to the claimed speaker is compared to the score of the verification attempt on the normalizing template in a manner similar to that described in connection with the cohort normalizing method.
Methods of the type described above require a database of speaker dependent models to create the normalizing template. Performance is closely tied to the contents of the database of speaker specific models. Optimally, the database of speaker specific models should contain the speaker specific models associated to a probable imposter trying to access the system. However, creating a priori a database containing a complete set of speaker specific models is prohibitive.
Another common method is the world normalization model. In this method, instead of performing verification on the basis of the score of the speaker specific speech pattern associated to the claimed speaker and of many possible speaker specific speech patterns, the verification is done on the basis of a speaker specific speech pattern associated to the claimed speaker and of a single world template or speaker independent template. A speaker independent model set generated from a large number of speech samples collected from a large number of speakers uttering a plurality of words is used to generate a speaker independent template representative of an average pronunciation of a specific word by an average speaker. In other words, the speaker independent model set allows creating an approximation of the actual pronunciation of the specific word since the pronunciation was generated from a plurality of uttered words.
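By analogy with the general description of normalizing techniques given earlier, in which a likelihood score is normalized by a normalizing score, the world normalization method may be written in the log domain as shown below. This form is an assumption made for illustration and is not stated in the original text; $\lambda_c$ denotes the parameters of the speaker specific speech pattern associated to the claimed speaker and $\lambda_w$ is a hypothetical symbol for the parameters of the single speaker independent (world) template.

$$\log L(O) = \log p(O \mid \lambda_c) - \log p(O \mid \lambda_w)$$

Under this assumption, only one normalizing score is computed per verification attempt, whereas the cohort method computes one score per speaker specific speech pattern in the set.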
The world normalization method does not require a database of speaker dependent models and is therefore more flexible than the cohort and background model methods. A deficiency of the world normalization model is a lower performance in terms of speaker verification for a given acceptance/rejection threshold since the world normalizing model is an overgeneralization of the pronunciation of the specific word considered.
Consequently, there is a need in the industry for providing a method and apparatus for generating an improved normalizing template for use in a speaker verification system.
In accordance with a broad aspect, the invention provides an apparatus for creating a biased normalizing template suitable for use by a speaker verification system to authenticate a speaker based on a speech pattern. The speech pattern is representative of a first set of speech characteristics. The apparatus comprises an input for receiving an input signal representative of a speech pattern from a given speaker. The apparatus further comprises a processing unit coupled to the input for receiving the input signal. The processing unit is operative for processing the input signal and a first data element representative of a speaker independent normalizing template representative of a second set of speech characteristics to derive a second data element representative of an altered version of the speaker independent normalizing template. The second data element forms a biased normalizing template representative of a third set of speech characteristics, where the third set of speech characteristics is a combination of the first set of speech characteristics and the second set of speech characteristics. The apparatus further comprises an output for releasing an output signal conveying the biased normalizing template suitable for use by a speaker verification system.
In a specific example of implementation, the first set of speech characteristics and the second set of speech characteristics define extremes of a range of speech characteristics, the range of speech characteristics including the third set of speech characteristics.
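One way to picture a third set of speech characteristics lying within a range whose extremes are the first and second sets is a linear interpolation between two sets of parameters. The sketch below is an assumption made purely for illustration: the original text does not state that the combination is a linear interpolation, and the representation of each template as a list of mean vectors, the function name and the `weight` parameter are all hypothetical.

```python
from typing import List, Sequence


def bias_template(
    speaker_independent: Sequence[Sequence[float]],
    speaker_specific: Sequence[Sequence[float]],
    weight: float,
) -> List[List[float]]:
    """Interpolate between two templates given as lists of mean vectors.

    weight = 0.0 returns the speaker independent template unchanged;
    weight = 1.0 returns the speaker specific parameters;
    values in between yield a template biased toward the speaker.
    (Linear interpolation is an illustrative assumption, not the source's method.)
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(speaker_independent) != len(speaker_specific):
        raise ValueError("templates must contain the same number of vectors")
    biased: List[List[float]] = []
    for si_mean, ss_mean in zip(speaker_independent, speaker_specific):
        if len(si_mean) != len(ss_mean):
            raise ValueError("mean vectors must have the same dimension")
        biased.append(
            [(1.0 - weight) * si + weight * ss for si, ss in zip(si_mean, ss_mean)]
        )
    return biased


# Hypothetical one-vector templates with two-dimensional means.
halfway = bias_template([[0.0, 2.0]], [[1.0, 4.0]], weight=0.5)
```

In this sketch the two extremes of the range are recovered at `weight` equal to 0 and 1, and every intermediate value yields a template that combines both sets of characteristics.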
Under this specific example of implementation, the processing unit is further operative for processing the input signal to derive the first data element representative of the speaker independent normalizing template representative of the second set of speech characteristics. More specifically, the processing unit is operative to process the input signal on a basis of a reference speaker independent model set to derive the first data element.
In a specific example, the apparatus is part of a speaker verification system.
An advantage of this invention is that it provides high performance speaker verification without requiring a database of speaker dependent model sets.
In accordance with another broad aspect, the invention provides a method for generating a pair of data elements. The first element is representative of a speaker specific speech pattern and the second element is representative of a biased normalizing template. The pair of data elements is suitable for use in a speaker verification system. The method comprises receiving an audio signal derived from a spoken utterance forming a training token associated with a given speaker. The method also comprises processing the audio signal on a basis of a reference speaker independent model set to derive a speaker independent normalizing template. The method further comprises processing the training token on a basis of the reference speaker independent model set for generating a speaker specific speech pattern. The method also comprises processing the speaker specific speech pattern and the speaker independent normalizing template to derive a biased normalizing template. A signal indicative of the pair of data elements in a format suitable for use by a speaker verification system is then released.
In accordance with another broad aspect, the invention further provides an apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention provides a computer readable medium comprising a program element suitable for execution by a computing apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention further provides a computer readable medium containing a verification database comprising entries generated by the above-described method.
For the purpose of this specification, the expressions “model” and “speech model” are used to designate a mathematical representation of the acoustic properties of a sub-word unit. Modeling sub-word units is well-known in the art to which this invention pertains. Commonly used models include Hidden Markov Models (HMMs) where each sub-word unit is represented by a sequence of states and transitions between the states.
For the purpose of this specification, the expression “template” is used to designate a sequence of models indicative of a word or sequence of words. The expression “template” should be given a broad interpretation to include an electronic representation of the models themselves, a sequence of symbols each symbol being associated to a respective model, a sequence of pointers to memory locations from which the models can be extracted, or any other representation from which a sequence of models can be extracted.
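As a non-limiting sketch of one of the representations mentioned above, a template may be held as a sequence of symbols, each symbol being associated to a respective model in a model set. The class name, the `num_states` field, the symbol names and the state counts below are hypothetical and serve only to illustrate the data structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class SubWordModel:
    """Placeholder for a sub-word unit model, for example an HMM."""
    name: str
    num_states: int  # hypothetical field; a real model would hold its parameters


def resolve_template(
    symbols: Sequence[str],
    model_set: Dict[str, SubWordModel],
) -> List[SubWordModel]:
    """Turn a template given as a sequence of symbols into a sequence of models.

    Raises KeyError if a symbol has no associated model in the model set.
    """
    return [model_set[symbol] for symbol in symbols]


# Hypothetical model set and a template for a word made of four sub-word units.
model_set = {
    "s": SubWordModel("s", 3),
    "ih": SubWordModel("ih", 3),
    "k": SubWordModel("k", 3),
}
template = resolve_template(["s", "ih", "k", "s"], model_set)
```

Storing symbols rather than the models themselves lets several templates share one model set, since a repeated symbol resolves to the same model object.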
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.