This invention relates to an apparatus and method for selecting one or more cohort models for use in a speaker verification system.
In various circumstances it is desirable to limit access to a particular location or function to one or more authorised individuals. Often an identifying badge or Personal Identification Number (PIN) is utilised for such purposes. Increasingly, efforts have been made to supplement such traditional identifiers with one or more biometric indicators. Fingerprints, retinal patterns, hand shape, and voice have, for example, all been considered in this regard, as all of these characteristics are relatively distinctive to each individual.
In speaker verification systems, the individual person typically speaks a predetermined statement or series of sounds. These sounds are then compared in some way against a previously stored sample of that same person's speech pattern. A sufficiently close match yields a positive verification that the speaker is who he or she claims to be; otherwise there is no such verification.
In one prior art approach, such speaker verification is accomplished by comparing this person's present voice input against both a previously stored model representing that person's speech, and also against one or more so-called cohort models. The cohort models are typically selected from many (typically hundreds) previously stored speech models of other individuals, in order to locate a sub-set of relatively close models by comparing an original speech utterance of the person with the previously stored speech models. The previously stored speech models that are most similar to the original speech utterance are then used as the cohort models, each of which is close, but not equal, to the target individual's actual speech pattern. Upon comparing a claimed person's present speech utterance against both the previously stored model and the cohort models, a determination can be made as to whether the present utterance is more similar to the stored model or to a cohort model. If more similar to a cohort model, a rejection is returned. If, however, the present utterance is closer to the original model, an acceptance can be returned.
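By way of illustration only, the accept/reject decision described above might be sketched as follows. The sketch assumes that the utterance and each model have been reduced to a single feature vector and that closeness is measured by Euclidean distance; neither assumption is taken from the prior art described, and the names `distance` and `verify` are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(utterance, claimed_model, cohort_models):
    """Return True (accept) only if the utterance is closer to the claimed
    speaker's stored model than to every cohort model; otherwise False (reject)."""
    target_distance = distance(utterance, claimed_model)
    nearest_cohort = min(distance(utterance, cohort) for cohort in cohort_models)
    return target_distance < nearest_cohort
```

Because the cohort models are chosen to be close to the target speaker, an impostor whose voice merely resembles the target is likely to land nearer to a cohort model than to the stored model, and so be rejected.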
Using prior art techniques, determining which of the previously stored speech models are most similar to the original speech utterance involves, in effect, running the original speech utterance through each of the stored speech models to determine the most similar, which is a computationally intensive and time-consuming process. When such a facility is first installed in an existing location having numerous employees, the training activity, including a significant amount of time spent determining the cohort models, can at best inconvenience the individual and at worst significantly delay clearance and participation for a significant number of individuals.
The cohort model approach to speaker verification, however, continues to offer significant promise with respect to subsequent robustness, accuracy, and ease of use. A need therefore exists for a way to support cohort model based speaker verification systems while still reducing the amount of time required to select the cohort models for each new person.
In this specification, including the claims, the terms “comprises”, “comprising” or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.
The present invention therefore seeks to provide a method of and system for selecting a cohort model for use in a speaker verification system which overcomes, or at least reduces, the above-mentioned problems of the prior art.
Accordingly, in one aspect, the invention provides a method of selecting at least one cohort model for use in a speaker verification system, the method including the steps of: providing a group of existing speaker models; receiving target speaker voice utterances from a target speaker; digitizing at least portions of the received utterances to provide at least one speech sample; determining a target speaker model from the at least one speech sample; determining at least one similarity value between each of a plurality of the existing speaker models and the target speaker model; and utilising the at least one similarity value to select at least one similar existing speaker model as a cohort model for the target speaker.
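The final two steps of the method above, in which model-to-model similarity values are determined and then used to select cohort models, might be sketched as follows. This is a minimal sketch under stated assumptions: each speaker model is reduced to a single vector, the similarity value is taken to be the negated Euclidean distance (so that a higher value means a closer model), and the names `similarity`, `select_cohorts` and the parameter `k` are hypothetical.

```python
import math


def similarity(model_a, model_b):
    """Hypothetical similarity value: negated Euclidean distance,
    so that a higher value means the two models are closer."""
    return -math.dist(model_a, model_b)


def select_cohorts(target_model, existing_models, k=2):
    """Rank the existing speaker models by similarity to the target
    speaker model and return the names of the k most similar."""
    ranked = sorted(
        existing_models,
        key=lambda name: similarity(existing_models[name], target_model),
        reverse=True,
    )
    return ranked[:k]
```

Note that the comparison is between models, not between the raw utterance and each stored model, which is the point of difference from the prior art approach described earlier.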
In one preferred embodiment, the method of selecting a cohort model further includes the steps of: determining at least one dissimilarity value between at least some of the plurality of the existing speaker models and each cohort model previously selected; and selecting at least one of the existing speaker models which is similar to the target speaker model and dissimilar to the at least one cohort model previously selected as at least one cohort model for the target speaker.
Preferably, each speaker model and cohort model comprises a set of parameters, each parameter representing a characteristic of the speech of the speaker, and the step of determining at least one similarity value between an existing speaker model and the target speaker model comprises the step of comparing the value of at least one of the parameters of the existing speaker model with the value of at least one corresponding parameter of the target speaker model to determine the similarity value.
In one embodiment, the step of determining the dissimilarity value comprises the step of: comparing the value of at least one of the parameters of the existing speaker model with the value of at least one corresponding parameter of the cohort model to determine the dissimilarity value. Preferably, the step of selecting at least one of the existing speaker models which is similar to the target speaker model but dissimilar to the at least one previously selected cohort model involves combining in a predetermined combination the dissimilarity values of two or more previously selected cohort models and selecting at least one of the existing speaker models which has a high similarity value and a high combined dissimilarity value. Conveniently, the predetermined combination can be normalised to the similarity values. One of the parameters is preferably a vector, which can be quantised, representing the frequency response of a time sample of the utterance.
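The selection of further cohort models that are similar to the target speaker model yet dissimilar to the cohort models already selected might be sketched as a greedy loop. The sketch below is illustrative only. It assumes that a higher similarity value means a closer model, that the predetermined combination is an arithmetic mean of the dissimilarity values, and that similarity and combined dissimilarity are simply added to rank candidates; the specification does not fix any of these choices. The function name and the dictionary layout are hypothetical.

```python
def select_diverse_cohorts(similarity, dissimilarity, k):
    """Greedily select up to k cohort models.

    similarity:    dict mapping model name -> similarity value with respect
                   to the target speaker model (assumed: higher is closer).
    dissimilarity: dict mapping (candidate, cohort) name pairs -> dissimilarity
                   value (assumed: higher is more different).
    """
    remaining = set(similarity)
    selected = []
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                # First cohort model: similarity to the target alone.
                return similarity[name]
            # Assumed combination: mean dissimilarity to the cohort
            # models selected so far.
            combined = sum(dissimilarity[(name, c)] for c in selected) / len(selected)
            return similarity[name] + combined

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The effect is that a candidate almost identical to an already selected cohort model is passed over in favour of one that covers a different region around the target speaker.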
Preferably, each parameter of the set of parameters is represented by a vector and the step of determining at least one similarity value between an existing speaker model and the target speaker model includes the steps of: determining at least two vectors for each existing speaker model and for the target speaker model; for each existing speaker model vector, determining the distance in the n-dimensional space between that existing speaker model vector and each target speaker model vector and, for each existing speaker model vector, storing whichever distance has a minimum value; and summing the stored minimum distances to provide the at least one similarity value.
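The steps just described (a minimum distance stored for each existing speaker model vector, followed by a sum) might be sketched as follows. Euclidean distance is assumed as the distance in the n-dimensional space, and the function name `min_distance_sum` is hypothetical.

```python
import math


def min_distance_sum(existing_vectors, target_vectors):
    """For each existing speaker model vector, find the distance to the
    nearest target speaker model vector, then sum those minimum distances."""
    total = 0.0
    for existing_vector in existing_vectors:
        total += min(
            math.dist(existing_vector, target_vector)
            for target_vector in target_vectors
        )
    return total
```

Because the value is a sum of distances, a smaller result under this reading corresponds to a closer pair of models. The cost depends only on the number of vectors in each model and not on the length of any utterance.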
Preferably, the step of determining at least one dissimilarity value between an existing speaker model and a cohort model includes the steps of: determining at least two vectors for each existing speaker model and for the cohort model; for each existing speaker model vector, determining the distance in the n-dimensional space between that existing speaker model vector and each cohort model vector and, for each existing speaker model vector, storing whichever distance has a minimum value; and summing the stored minimum distances to provide the at least one dissimilarity value.
According to a second aspect, the invention provides an apparatus for selecting at least one cohort model for use in a speaker verification system, the apparatus including: a database of existing speaker models; a receiver for receiving target speaker voice utterances from a target speaker; a speech digitizer coupled to the receiver to provide at least one speech sample; a modeller coupled to the speech digitizer for producing and storing a target speaker model from the at least one speech sample; similarity determining means coupled to the database and the modeller for determining at least one similarity value between each of a plurality of the existing speaker models and the target speaker model; storage means coupled to the similarity determining means for storing the similarity values; selection means coupled to the storage means for comparing the similarity values and selecting at least one similar existing speaker model as a cohort model for the target speaker; and a memory coupled to the selection means for storing the cohort model.
In a preferred embodiment, the apparatus further includes dissimilarity determining means coupled to the database and the memory for determining at least one dissimilarity value between at least some of the plurality of the existing speaker models and each cohort model previously selected; wherein the selection means is coupled to the dissimilarity determining means for selecting at least one of the existing speaker models which is similar to the target speaker model and dissimilar to the at least one cohort model previously selected as at least one further cohort model for the target speaker.
Preferably, each speaker model and cohort model comprises a set of parameters, each parameter representing a characteristic of the speech of the speaker, and the similarity determining means comprises a comparator circuit for comparing the value of at least one of the parameters of the existing speaker model with the value of at least one corresponding parameter of the target speaker model to determine the similarity value. The comparator circuit preferably comprises means for storing at least two vectors representing at least two of the parameters in n-dimensional space for each existing speaker model and the target speaker model, means for determining the distance in the n-dimensional space, for each existing speaker model vector, between that existing speaker model vector and each target speaker model vector, means for storing, for each existing speaker model vector, whichever distance to a target speaker model vector has a minimum value, and means for summing the stored minimum distances to provide the at least one similarity value.
The dissimilarity determining means preferably includes a comparator circuit for comparing the value of at least one of the parameters of the existing speaker model with the value of at least one corresponding parameter of the cohort model to determine the dissimilarity value. The comparator circuit preferably comprises means for storing at least two vectors representing at least two of the parameters in n-dimensional space for each existing speaker model and each previously selected cohort model, means for determining the distance in the n-dimensional space, for each existing speaker model vector, between that existing speaker model vector and each cohort model vector, means for storing, for each existing speaker model vector, whichever distance to a cohort model vector has a minimum value, and means for summing the stored minimum distances to provide the at least one dissimilarity value.
In one preferred embodiment, the selection means includes combining means for combining in a predetermined combination the dissimilarity values of two or more previously selected cohort models and the selection means selects at least one of the existing speaker models which has a high similarity value and a high combined dissimilarity value.
Preferably, the combining means includes a normaliser for normalising the predetermined combination to the similarity values.
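One possible reading of the normaliser is that it rescales the combined dissimilarity values so that they span the same numeric range as the similarity values, allowing the two to be weighed against each other. The sketch below takes that reading as an assumption; linear min-max rescaling and the function name `normalise_to` are illustrative choices and are not stated in the specification.

```python
def normalise_to(values, reference):
    """Linearly rescale `values` so that they span the same range as `reference`."""
    low, high = min(values), max(values)
    ref_low, ref_high = min(reference), max(reference)
    if high == low:
        # All values identical: map them to the middle of the reference range.
        return [(ref_low + ref_high) / 2.0 for _ in values]
    scale = (ref_high - ref_low) / (high - low)
    return [ref_low + (value - low) * scale for value in values]
```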