The present invention relates generally to the field of speaker verification systems and more particularly to a method for creating background models for use therewith.
Speaker verification is the process of verifying the identity of a speaker based upon an analysis of a sample of his or her speech using previously saved information. More particularly, speaker verification consists of making a determination as to whether the identity of a speaker is, in fact, the same as an identity being claimed therefor (usually by the speaker himself or herself). Some applications of speaker verification include, for example, access control for a variety of purposes, such as for telephones, computer networks, databases, bank accounts, credit-card funds, automatic teller machines, building or office entry, etc. Automatic verification of a person's identity based upon his or her voice is quite convenient for users, and, moreover, it typically can be implemented in a less costly manner than many other biometric methods such as, for example, fingerprint analysis. In addition, speaker verification is fully non-intrusive, unlike such other biometric methods. For these reasons, speaker verification has recently become of particular importance in mobile and wireless applications.
Typically, speaker verification is performed based upon previously saved information which, at least in part, represents particular vocal characteristics of the speaker whose identity is to be verified. Specifically, the speech signal which results from a speaker's "test" utterance (i.e., an utterance offered for the purpose of verifying the speaker's identity) is analyzed to extract certain acoustic "features" of the speech signal. Then, these features are compared with corresponding features which have been extracted from previously uttered speech spoken by the same individual.
The previously uttered speech which is used for comparison purposes most commonly, but not necessarily, consists of a number of repetitions of the same word or phrase as the one which is to be spoken as the "test" utterance. In any case, the previously uttered speech is referred to as "training" speech, and it is provided to the system as part of an "enrollment" session. If the same word or phrase is used for both the training utterances and the test utterance, the process is referred to as "text dependent" or "fixed phrase" speaker verification. If, on the other hand, the speaker is permitted to use any speech as a test utterance, the process is referred to as "text independent" speaker verification, and operates based solely on the general vocal characteristics of the speaker. The latter approach clearly provides more flexibility, but it is not nearly as robust in terms of verification accuracy as a fixed phrase approach.
Specifically, the speaker's claimed identity is verified (or not), based on the results of a comparison between the features of the speaker's test utterance and those of the training speech. In particular, the previously uttered speech samples are used to produce speech "models" which may, for example, comprise stochastic models such as hidden Markov models (HMMs), well known to those of ordinary skill in the art. (Note that in the case of text independent speaker verification, these models are typically atemporal models, such as, for example, one state HMMs, thereby capturing the general vocal characteristics of the speaker but not the particular selection and ordering of the uttered phonemes.)
The model which is used for comparison with the features extracted from the speech utterance is known as a "speaker dependent" model, since it is generated from training speech of a particular, single speaker. Models which are derived from training speech of a plurality of different speakers are known as "speaker independent" models, and are commonly used, for example, in speech recognition tasks. In its simplest form, speaker verification may be performed by merely comparing the test utterance features against those of the speaker dependent model, determining a "score" representing the quality of the match therebetween, and then making the decision to verify (or not) the claimed identity of the speaker based on a comparison of the score to a predetermined threshold. One common problem with this approach is that it is particularly difficult to set the threshold in a manner which results in a reasonably high level of verification accuracy (i.e., the infrequency with which misverification, whether a false positive or a false negative result, occurs). In particular, the predetermined threshold must be set in a speaker dependent manner: the same threshold that works well for one speaker is not likely to work well for another.
Addressing this problem, it has long since been determined that a substantial increase in verification accuracy can be obtained if a speaker independent "background model" is also compared to and scored against the test utterance, and if the ratio of the scores (i.e., the score from the comparison with the speaker dependent model divided by the score from the comparison with the background model) is compared to a predetermined threshold instead. Moreover, in this case, it is usually possible to choose a single predetermined value for the threshold, used for all speakers to be verified (hereinafter referred to as "customers"), and to obtain a high level of verification accuracy therewith. Both of these advantages of using a background model for comparison purposes result from the effect of doing so on the probability distributions of the resultant scores. In particular, using such a background model increases the separation between the probability distribution of the actual customer scores (i.e., the scores achieved when the person who actually trained the speaker dependent model provides the test utterance) and the probability distribution of imposter scores (i.e., the scores achieved when some other person provides the test utterance). Thus, it is easier to set an appropriate threshold value, and the accuracy of the verification results improves.
Some studies of speaker verification systems using speaker independent background models advocate that the background model should be derived from speakers which have been randomly selected from a speaker independent database. (See, e.g., D. Reynolds, "Speaker Identification and Verification Using Gaussian Mixture Speaker Models," Speech Communication, vol. 17: 1-2, 1995.) Other studies suggest that speakers which are acoustically "close" to the person having the claimed identity (i.e., "cohort" speakers) should be selected for use in generating the background model, since these speakers are representative of the population near the claimed speaker. (See, e.g., A. E. Rosenberg et al., "The Use of Cohort Normalized Scores for Speaker Verification," Proc. Int. Conf. on Spoken Language Processing, Banff, Alberta, Canada, 1992.) By using such a selection of speakers, this latter approach claims to improve the selectivity of the system as against voices which are similar to that of the customer, thereby reducing the false acceptance rate of the system.
Specifically, most state-of-the-art fixed phrase (i.e., text dependent) speaker verification systems verify the identity of the speaker through what is known in the art as a Neyman-Pearson test, based on a normalized likelihood score of a spoken password phrase. (See, e.g., A. L. Higgins et al., "Speaker Verification Using Randomized Phrase Prompting," Digital Signal Processing, 1:89-106, 1991.) If $\lambda_c$ is the customer model (i.e., the speaker dependent model generated from the enrollment session performed by the particular customer), then, given some set of acoustic observations $X$ (i.e., features derived from the test utterance), the normalized score $s_{\mathrm{norm}}(X, \lambda_c)$ is typically computed as being the ratio of the "likelihoods" as follows:
$$s_{\mathrm{norm}}(X, \lambda_c) = \frac{p(X \mid \lambda_c)}{p(X \mid \lambda_B)},$$
where $p(X \mid \lambda)$ is the likelihood of the observations $X$ given the model $\lambda$, and where $\lambda_B$ in particular is a background model. As described above, the customer model is usually a hidden Markov model (HMM) built from repeated utterances of a password phrase spoken by the customer during an enrollment session. This model is usually created either by concatenating phone-based HMMs (familiar to those skilled in the art) for the particular customer, or by directly estimating a whole-phrase HMM. (See, e.g., S. Parthasarathy et al., "General Phrase Speaker Verification Using Sub-Word Background Models and Likelihood-Ratio Scoring," Proc. Int. Conf. on Spoken Language Processing, Philadelphia, 1996.) As also pointed out above, the background model of the prior art is a speaker independent model (e.g., an HMM), that reduces or eliminates the need for determining speaker dependent thresholds. The background model is typically built by concatenating speaker independent phone models of the particular customer's password phrase.
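By way of illustration only, the normalized score described above may be sketched in the log domain, where the ratio of likelihoods becomes a difference of log-likelihoods. The sketch below assumes, purely for brevity, that each model is a single diagonal-covariance Gaussian rather than an HMM; the function names, the `(mean, var)` model representation, and the default threshold are hypothetical and do not come from the cited references.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_likelihood(frames, model):
    """log p(X | model), treating frames as independent.

    `model` is a hypothetical (mean, var) pair standing in for an HMM.
    """
    mean, var = model
    return sum(log_gaussian(x, mean, var) for x in frames)

def normalized_log_score(frames, customer_model, background_model):
    """log s_norm = log p(X | lambda_c) - log p(X | lambda_B)."""
    return (log_likelihood(frames, customer_model)
            - log_likelihood(frames, background_model))

def verify(frames, customer_model, background_model, log_threshold=0.0):
    """Accept the claimed identity if the log ratio exceeds the threshold."""
    score = normalized_log_score(frames, customer_model, background_model)
    return score > log_threshold

# Illustrative values only: a narrow customer model and a broad background model.
customer = ([0.0, 0.0], [1.0, 1.0])
background = ([0.0, 0.0], [4.0, 4.0])
near = [[0.1, -0.2], [0.0, 0.3]]
far = [[5.0, 5.0]]
print(verify(near, customer, background), verify(far, customer, background))
```

Working with log-likelihoods avoids numerical underflow when many frames are scored, and a log threshold of zero corresponds to a likelihood ratio of one.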
In applications where it is desirable to give the customer the freedom to select his or her own password phrase in his or her own language, most prior art systems assume that the phonetic transcription of the customer password phrase is available, which in turn assumes the availability of pre-trained multi-lingual phone models, dictionaries and a set of letter-to-sound rules for the particular language. Because good phone end-points are necessary, a speaker independent phone recognizer might be used, for example, to derive the phone segmentation. The overall architecture of such a speaker verification system can therefore become quite complicated.
Furthermore, having a good set of speaker independent background phone models often necessitates that each model have a large acoustic resolution (that is, a high number of mixture components per state) in order to obtain high quality performance characteristics. (See, e.g., Parthasarathy et al., cited above.) This, in turn, demands a higher level of computation and memory requirements, which may not be desirable for applications running on hand-held devices such as personal digital assistants, palm-top computers or wireless phones. Moreover, there is also an issue of robustness: the background speaker independent phone models provided by the system may exhibit very different acoustic properties from the particular operating condition under which the test phrase is being uttered. As a result, misverification may occur under operating conditions which differ from those which existed at the time the background model data was gathered. For practical purposes, and most particularly in portable applications, these requirements and limitations may not be desirable, and they may create an unreasonable burden on both the customer and the system developer alike. The customer may wish to select a password phrase in any language, and he or she may choose to perform speaker verification with any type of microphone under any set of acoustic conditions.
In accordance with the principles of the present invention, a novel speaker verification method and apparatus are provided which advantageously minimize the constraints on the customer and substantially simplify the system architecture. Specifically, we have realized that it is possible to create and make use of a speaker dependent, rather than a speaker independent, background model, and that, by doing so, many of the advantages of using a background model in a speaker verification process may be obtained without many of the disadvantages thereof. In particular, with the use of such an approach, no training data (i.e., speech) from anyone other than the customer is required, no speaker independent models need to be produced, no a priori knowledge of acoustic rules is required, and no multi-lingual phone models, dictionaries, or letter-to-sound rules are needed. Nonetheless, in accordance with a first illustrative embodiment of the present invention, the customer is free to select any password phrase in any language. Specifically, the speaker verification system in accordance with the present invention may be advantageously built with no speech material or other prior information whatsoever, other than the set of enrollment utterances provided by the customer himself or herself as part of the enrollment session. The net result is a flexible and simple speaker verification system, which nonetheless achieves a performance quality which is appreciably better than that of a system which uses no background model at all.
More specifically, the present invention provides a method and apparatus for verifying a proffered identity of a speaker (e.g., for performing speaker verification) comprising steps or means for (a) comparing features of a speech utterance spoken by the speaker with a first speaker dependent speech model (e.g., an HMM), the first speaker dependent speech model based upon previously provided training speech from a person having said proffered identity, and determining a first score based upon such a comparison; (b) comparing features of the speech utterance spoken by the speaker with a second speaker dependent speech model (e.g., another HMM), the second speaker dependent speech model also based upon the same previously provided training speech from the person having said proffered identity, and determining a second score based upon such a comparison; and (c) verifying the proffered identity of the speaker based upon a value reflecting a differential between the first score and the second score (such as, for example, a ratio of the first score to the second score).
Obviously, it would be pointless if the background model were exactly the same as the customer model, since the scores (i.e., the likelihoods) would always be the same and so the ratio would always be unity. Therefore, in accordance with the principles of the present invention, a background model which differs from the customer model is generated, despite having been generated based on the same training speech (i.e., the same enrollment data) as the customer model.
For example, in accordance with the first illustrative embodiment of the present invention, the background model, like the customer model, is generated from the customer's repeated enrollment utterances of a fixed password phrase, but the background model is created so as to be representative of a more general model than is the customer model: specifically, not so general as to provide no added information when used as a background model, but general enough to be somewhat different from the customer model itself. One illustrative way of achieving this result (in accordance with the first illustrative embodiment) is to produce a background model which, although it is generated from the same enrollment utterances as is the customer model, nonetheless has a cruder acoustic resolution than does the customer model. Such a cruder acoustic resolution may be achieved, for example, by producing an HMM having fewer states than the customer model. Alternatively, cruder acoustic resolution may be achieved by producing an HMM having fewer acoustic parameters than the customer model.
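As a rough sketch of how two models of differing resolution might be derived from the same enrollment utterances, the following example estimates one mean vector per state by uniformly segmenting each utterance, once with many states (customer model) and once with few (background model). The uniform segmentation, the function name, and the state counts are assumptions made for illustration; a practical system would refine such estimates with conventional HMM training, and transition probabilities and variances are omitted here.

```python
def train_state_means(utterances, n_states):
    """Estimate one mean vector per state by uniform segmentation.

    `utterances` is a list of utterances, each a list of feature vectors.
    Every utterance is cut into `n_states` contiguous segments and the frames
    falling into segment s (pooled over all utterances) are averaged.
    """
    if any(len(utt) < n_states for utt in utterances):
        raise ValueError("each utterance needs at least n_states frames")
    dim = len(utterances[0][0])
    sums = [[0.0] * dim for _ in range(n_states)]
    counts = [0] * n_states
    for utt in utterances:
        n_frames = len(utt)
        for s in range(n_states):
            start = s * n_frames // n_states
            end = (s + 1) * n_frames // n_states
            for frame in utt[start:end]:
                for d in range(dim):
                    sums[s][d] += frame[d]
                counts[s] += 1
    return [[total / counts[s] for total in sums[s]] for s in range(n_states)]

# Two toy one-dimensional enrollment utterances (illustrative values only).
enrollment = [
    [[0.0], [2.0], [4.0], [6.0]],
    [[1.0], [3.0], [5.0], [7.0]],
]
customer_means = train_state_means(enrollment, n_states=4)    # finer resolution
background_means = train_state_means(enrollment, n_states=2)  # cruder resolution
print(customer_means)
print(background_means)
```

Both models see exactly the same enrollment frames; only the number of states, and hence the acoustic resolution, differs.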
In addition, and in accordance with other illustrative embodiments of the present invention, a background model for fixed phrase speaker verification may be generated by perturbing the temporal information thereof. For example, a customer model comprising a multi-state HMM may be modified to produce the background model by, for example, reversing the ordering of the HMM states. Alternatively, a background model comprising a multi-state HMM having fewer states than the customer model may be generated (as described above, for example), and then the ordering of the HMM states may be reversed.
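The state reversal described above may be sketched as follows, assuming a left-to-right HMM represented by a list of per-state emission parameters and a list of per-state self-loop probabilities. This representation and the function names are hypothetical; the text does not prescribe how the model is stored.

```python
def left_to_right_transitions(self_loops):
    """Build a left-to-right transition matrix from self-loop probabilities.

    State i either stays in place (probability self_loops[i]) or advances to
    state i + 1. For the final state, the remaining probability mass is the
    exit probability and is not stored in the matrix.
    """
    n = len(self_loops)
    matrix = [[0.0] * n for _ in range(n)]
    for i, stay in enumerate(self_loops):
        matrix[i][i] = stay
        if i + 1 < n:
            matrix[i][i + 1] = 1.0 - stay
    return matrix

def reverse_state_order(emissions, self_loops):
    """Return a background model whose states appear in reverse order.

    The acoustic content of each state is kept unchanged; only the temporal
    ordering is perturbed.
    """
    rev_emissions = list(reversed(emissions))
    rev_loops = list(reversed(self_loops))
    return rev_emissions, rev_loops, left_to_right_transitions(rev_loops)

# Placeholder emission labels stand in for per-state acoustic parameters.
customer_emissions = ["state_a", "state_b", "state_c"]
customer_loops = [0.5, 0.6, 0.7]
bg_emissions, bg_loops, bg_transitions = reverse_state_order(
    customer_emissions, customer_loops
)
print(bg_emissions, bg_loops)
```

Because the reversed model retains the customer's acoustic parameters but expects them in the opposite temporal order, it scores a correctly ordered password phrase differently than the customer model does.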
And in accordance with still other illustrative embodiments of the present invention, a background model for a text independent speaker verification system may also be generated from the same customer enrollment data as is the customer model, but with fewer acoustic parameters. In this manner, although both the customer model and the background model each may comprise single state HMMs (as is typical with text independent speaker verification systems), the background model once again can be constructed so as to have a cruder acoustic resolution than does the customer model.
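For the text independent case, one hypothetical way of obtaining a background model with fewer acoustic parameters is to fit fewer mixture components to the same pooled enrollment frames. The sketch below uses a small deterministic k-means procedure to estimate component means only; the procedure, its initialization, and the component counts are assumptions for illustration, and mixture weights and variances are omitted.

```python
def kmeans_means(frames, n_components, n_iter=10):
    """Estimate `n_components` mean vectors from frames with a tiny k-means."""
    ordered = sorted(frames)
    n = len(ordered)
    # Deterministic initialisation from evenly spaced frames in sorted order.
    centroids = [
        list(ordered[(2 * k + 1) * n // (2 * n_components)])
        for k in range(n_components)
    ]
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for frame in frames:
            nearest = min(
                range(len(centroids)),
                key=lambda k: sum(
                    (a - b) ** 2 for a, b in zip(frame, centroids[k])
                ),
            )
            clusters[nearest].append(frame)
        for k, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Pooled enrollment frames for a single-state model (illustrative values only).
pooled = [[0.0], [0.2], [5.0], [5.2], [10.0], [10.2]]
customer_components = kmeans_means(pooled, n_components=3)    # finer resolution
background_components = kmeans_means(pooled, n_components=1)  # cruder resolution
print(customer_components)
print(background_components)
```

Both single state models are estimated from the same frames; the background model simply summarizes them with fewer components.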