This invention relates to the field of speaker verification and more particularly to a method and apparatus for generating certain data that is specific to a user and that can be used by a speaker verification system to authenticate the user based on a speech pattern. This invention is applicable to speech activated security systems such as access to voice-mail, automated telephone services, automated banking services and voice directed computer applications, among others.
Speaker verification is the process of verifying whether a given speaker is a claimed speaker. The basis of this process lies in comparing a verification attempt with a speaker specific speech pattern representative of the claimed speaker and then calculating, given that speaker specific speech pattern, the likelihood that the verification attempt was actually generated by the claimed speaker. Typically, if the calculated likelihood is above a certain threshold, the verification attempt is accepted as being generated by the claimed speaker. Otherwise, the verification attempt is rejected. The level of the threshold depends on a number of factors, such as the level of security required and therefore the level of tolerance for false acceptance or false rejection.
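By way of non-limiting illustration, the threshold decision described above may be sketched as follows. The function names, the use of a single scalar score and the sample scores in the usage note are assumptions made for illustration only and are not taken from any particular speaker verification system.

```python
from typing import Sequence, Tuple


def accept_attempt(score: float, threshold: float) -> bool:
    """Accept the claimed identity only when the score is above the threshold."""
    return score > threshold


def error_rates(
    genuine_scores: Sequence[float],
    impostor_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false acceptance rate, false rejection rate) at a given threshold.

    genuine_scores: scores of attempts made by the claimed speaker.
    impostor_scores: scores of attempts made by other speakers.
    """
    false_accepts = sum(1 for s in impostor_scores if accept_attempt(s, threshold))
    false_rejects = sum(1 for s in genuine_scores if not accept_attempt(s, threshold))
    return (
        false_accepts / len(impostor_scores),
        false_rejects / len(genuine_scores),
    )
```

Calling `error_rates` on the same scores with a higher threshold lowers the false acceptance rate at the cost of a higher false rejection rate, which is the trade-off that the required level of security is meant to settle.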
Commonly, methods for modeling the speaker specific speech pattern make use of Continuous Density Hidden Markov Models (CDHMMs) to model the acoustic characteristics. The ability of a CDHMM to model speech depends on the features, the topology, the parameterization and the training required for the CDHMM. In general, these four factors interact in a complex manner. Conventionally, the speech models are characterized by a sequence of states, the states being linked by uni-directional transitions and self-loop transitions that permit hesitation in a given state. Typically, the parameterization chosen to define a given state is a Gaussian mixture, which is a weighted composition of a number of multivariate Gaussian probability density functions (pdfs). The CDHMMs are trained on a large corpus of speech using an Expectation-Maximization (EM) algorithm.
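By way of non-limiting illustration, the conventional topology and state parameterization described above may be sketched as follows. The sketch assumes diagonal covariance matrices, an equal default split between self-loop and forward transitions, and a final state that only loops on itself; these choices, as well as the function names, are assumptions made for illustration and are not prescribed by the text.

```python
import math
from typing import List, Sequence


def diag_gaussian_log_pdf(
    x: Sequence[float], mean: Sequence[float], var: Sequence[float]
) -> float:
    """Log density of a multivariate Gaussian with diagonal covariance (assumed)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_pdf(
    x: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """Log density of a Gaussian mixture: a weighted sum of Gaussian pdfs."""
    logs = [
        math.log(w) + diag_gaussian_log_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(value - top) for value in logs))


def left_to_right_transitions(n_states: int, self_loop: float = 0.5) -> List[List[float]]:
    """Transition matrix with self-loops and uni-directional forward transitions."""
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            matrix[i][i] = self_loop            # hesitation in the same state
            matrix[i][i + 1] = 1.0 - self_loop  # move on to the next state
        else:
            matrix[i][i] = 1.0                  # final state (assumed absorbing)
    return matrix
```

Each state of such a model would carry its own mixture weights, means and variances, so the number of parameters grows with both the number of states and the number of mixture components per state.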
Typical speaker verification systems are generally required to deliver high performance with a minimal number of enrollment tokens. However, the models commonly used have many more parameters than can be reliably estimated from such limited enrollment data. As a result, for high security systems making use of speaker verification, the amount of enrollment data required is prohibitively high.
Consequently, there is a need in the industry for a method and apparatus providing an improved speaker specific speech pattern suitable for use by a speaker verification system.
In accordance with a broad aspect, the invention provides a method and an apparatus for creating a set of expanded speech models. The apparatus comprises an input for receiving a signal representative of enrollment data and a processing unit coupled to the input. The processing unit is operative for processing the enrollment data to generate a set of simple speech models trained on a basis of the enrollment data, each simple speech model in the set of simple speech models comprising a plurality of states linked by transitions. The processing unit is further operative for generating on a basis of the set of simple speech models a set of expanded speech models, each expanded speech model in the set of expanded speech models comprising a plurality of groups of states. The groups of states are linked to one another by inter-group transitions, the states in a given group of states originating from a single state in the set of simple speech models. The processing unit is further operative for processing the set of expanded speech models on the basis of the enrollment data to condition the inter-group transitions on the basis of the enrollment data. The apparatus further comprises an output for releasing a signal derived from the set of expanded speech models in a format suitable for use by a speech-processing device.
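By way of non-limiting illustration, the relationship between a simple speech model and an expanded speech model may be sketched as follows. The sketch makes several assumptions that are not stated above: every group has the same number of states, inter-group transitions link each group only to the next group, the transitions start out uniform, and the conditioning on enrollment data is approximated by counting transitions along hypothetical state alignments of the enrollment tokens. All names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

State = Tuple[int, int]  # (index of originating simple state, index within group)
Arc = Tuple[State, State]


def expand_model(n_simple_states: int, group_size: int):
    """Replace each simple state by a group of states; link consecutive groups."""
    groups: List[List[State]] = [
        [(g, k) for k in range(group_size)] for g in range(n_simple_states)
    ]
    inter: Dict[Arc, float] = {}
    for g in range(n_simple_states - 1):
        for src in groups[g]:
            for dst in groups[g + 1]:
                inter[(src, dst)] = 1.0 / group_size  # uniform before conditioning
    return groups, inter


def condition_transitions(
    inter: Dict[Arc, float],
    aligned_paths: Sequence[Sequence[State]],
    floor: float = 1e-3,
) -> Dict[Arc, float]:
    """Re-estimate inter-group transition probabilities from enrollment alignments.

    aligned_paths holds one state sequence per enrollment token (assumed to be
    available from an alignment step). The floor keeps unseen arcs non-zero.
    """
    counts: Dict[Arc, float] = defaultdict(float)
    for path in aligned_paths:
        for src, dst in zip(path, path[1:]):
            if (src, dst) in inter:
                counts[(src, dst)] += 1.0
    totals: Dict[State, float] = defaultdict(float)
    for src, dst in inter:
        totals[src] += counts[(src, dst)] + floor
    return {
        (src, dst): (counts[(src, dst)] + floor) / totals[src] for src, dst in inter
    }
```

In this sketch only the transition probabilities are re-estimated, which illustrates one way in which an expanded model might be conditioned with a small number of enrollment tokens.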
Advantageously, the use of a group of states originating from a single state in the set of simple speech models increases the ability to capture variability in a spoken utterance with respect to the simple model.
Another advantage of the present invention is that the set of expanded speech models can be generated with a limited amount of enrollment data.
In a specific example of implementation, the states in a given group of states are linked to one another by intra-group transitions. The processing unit is further operative for processing the set of expanded speech models on the basis of the enrollment data to condition the intra-group transitions on the basis of the enrollment data.
In a specific example, the apparatus is part of a speaker verification system.
In accordance with another broad aspect, the invention provides a method for generating a pair of data elements, namely a first element representative of a speaker independent template and a second element representative of an extended speaker specific speech pattern. The pair of data elements is suitable for use in a speaker verification system. The method comprises receiving an audio signal derived from a spoken utterance forming enrollment data associated with a given speaker. The method further comprises processing the audio signal on a basis of a reference speaker independent model set to derive a speaker independent template. The method further comprises processing the audio signal on a basis of a reference speaker independent model set for generating a speaker specific speech pattern. The speaker specific speech pattern includes a set of simple speech models trained on a basis of the audio signal, each simple speech model in the set of simple speech models comprising a plurality of states linked by transitions. The method further comprises processing the speaker specific speech pattern to derive an extended speaker specific speech pattern. The extended speaker specific speech pattern comprises a set of expanded speech models, each expanded speech model in the set of expanded speech models comprising a plurality of groups of states. The groups of states are linked to one another by inter-group transitions and states in a given group of states originate from a single state in the set of simple speech models. The method further comprises releasing a signal conveying the pair of data elements in a format suitable for use by a speaker verification system.
In accordance with another broad aspect, the invention further provides an apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention provides a computer readable medium comprising a program element suitable for execution by a computing apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention further provides a computer readable medium containing a speaker verification database comprising entries generated by the above-described method.
For the purpose of this specification, the expressions “model” and “speech model” are used to designate a mathematical representation of the acoustic properties of a sub-word unit.
For the purpose of this specification, the expression “template” is used to designate a sequence of models indicative of a word or sequence of words. The expression “template” should be given a broad interpretation to include an electronic representation of the models themselves, a sequence of symbols, each symbol being associated with a respective model, a sequence of pointers to memory locations allowing the models to be extracted, or any other representation allowing a sequence of models to be extracted.
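By way of non-limiting illustration, one of the representations mentioned above, namely a sequence of symbols each associated with a respective model, may be sketched as follows. The class name, the fields of the model record, the symbol names and the model inventory in the usage note are hypothetical and serve only to show how a sequence of models can be extracted from such a representation.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class SpeechModel:
    """Placeholder for a model of a sub-word unit (fields are hypothetical)."""
    unit: str
    n_states: int


def resolve_template(
    template: Sequence[str], inventory: Dict[str, SpeechModel]
) -> List[SpeechModel]:
    """Extract the sequence of models designated by a sequence of symbols."""
    return [inventory[symbol] for symbol in template]
```

For example, with an inventory holding models for the hypothetical symbols `s`, `ih` and `k`, the template `["s", "ih", "k", "s"]` resolves to four models, the first and last being the same model.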