Speech possesses multiple acoustic characteristics which vary greatly between individuals according to such diverse factors as vocal tract size, gender, age, native dialect, education, and idiosyncratic articulator movements. These factors are so specifically correlated to individual speakers that listeners often can readily determine the identity of a recognized speaker within the first few syllables heard. Considerable effort has been expended to develop artificial systems which can similarly determine and verify the identity of a given speaker.
Speaker verification systems may be broadly divided into free-text systems and text-dependent systems. Each type of system has its difficulties. To accommodate free-text passphrases, the storage and match processes must accommodate virtually any utterance. This higher acoustic-phonetic variability requires longer training sessions in order to reliably characterize, or model, a speaker. In addition, free-text systems are not able to model speaker-specific co-articulation effects caused by the limited movements of the speech articulators. Moreover, the ability to accommodate virtually any utterance exists in tension with the ability to discriminate among a wide range of speakers--the greater the vocabulary range, the more challenging it is to simultaneously provide reliable word storage and discriminate among speakers.
Text-dependent systems, on the other hand, permit easier discrimination between multiple speakers. In text-dependent passphrase systems, one or more preselected passphrases are modeled for each individual user. The models reflect both the individual-specific acoustic characteristics and the lexical and syntactic content of the passphrase. In contrast to free-text systems, fairly short utterances (typically, just a few seconds) are adequate for training in text-dependent systems. However, too narrow a scope of acceptable text may make a text-dependent system more vulnerable to replay attack. Text-dependent systems can be further sub-classified as either fixed passphrase systems, where the passphrase is defined at design time, or freely chosen passphrase systems equipped with an online training procedure. The specific techniques utilized correspond generally to the recognized techniques of automatic speech recognition--acoustic templates, hidden Markov models (HMM), artificial neural networks, etc.
Text-prompted approaches with multiple passphrases were introduced in order to enhance security against playback recordings. Each verification session requires a speaker seeking to be verified to speak a different pseudo-random sequence of words for which the system has speaker-dependent models. Thus, the required verification sentence cannot be predicted in advance, inhibiting an unauthorized speaker from pre-recording the speech of an authorized user. With the current state of the art in speech processing, however, it is realistic to imagine a computer system which is equipped with a speech recognition engine, and which has the fixed vocabulary of text segments defined. If a pre-recording of all text fragments of a certain speaker is available to the computer, a speech recognition engine could be used to decode the random combination of text prompted for, and a computer program could assemble the corresponding pre-recorded speech segments. Text-prompted systems also suffer from the same co-articulation problems as free-text systems.
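The prompting step described above can be illustrated with a minimal sketch. It assumes a small fixed vocabulary for which speaker-dependent models already exist; the vocabulary contents, the function name `make_prompt`, and the prompt length are hypothetical and are not taken from any of the referenced systems.

```python
import secrets

# Hypothetical fixed vocabulary; a real system would use the words for
# which it holds speaker-dependent models.
VOCABULARY = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]


def make_prompt(vocabulary, length, rng=None):
    """Return a pseudo-random word sequence for one verification session.

    By default the words are drawn with an operating-system entropy
    source so that the sequence cannot be predicted in advance.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    rng = rng or secrets.SystemRandom()
    return [rng.choice(vocabulary) for _ in range(length)]


if __name__ == "__main__":
    # Each session prompts for a different sequence of vocabulary words.
    print(" ".join(make_prompt(VOCABULARY, 4)))
```

As the paragraph notes, an unpredictable prompt alone does not defeat an attacker who holds recordings of every vocabulary word and can splice them together on demand.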
A method called cohort normalization partially overcomes some problems of text-prompted systems by using likelihood ratio scoring. Cohort normalization is described, for example, in U.S. Pat. No. 5,675,704 to Juang et al. and U.S. Pat. No. 5,687,287 to Gandhi et al., the disclosures of which are hereby incorporated herein by reference. Likelihood ratio scoring requires that the same contexts be represented in the models of the different authorized speakers. Normalizing scores are obtained from individual reference speakers, or from models generated by pooling reference speakers. Models of bona fide registered speakers that are acoustically close to the claimant speaker are used for score normalization.
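Likelihood ratio scoring with a cohort can be sketched as follows. This is an illustrative example only: it assumes the log-likelihood of the utterance under the claimed speaker's model and under each cohort model has already been computed, and it combines the cohort scores by a log-domain average, which is one common choice rather than the method of any particular cited reference. The function names and the default threshold are hypothetical.

```python
import math


def log_mean_exp(log_values):
    """Log of the arithmetic mean of exp(v), computed stably."""
    peak = max(log_values)
    total = sum(math.exp(v - peak) for v in log_values)
    return peak + math.log(total / len(log_values))


def cohort_normalized_score(claimant_loglik, cohort_logliks):
    """Log-likelihood ratio of the claimed speaker against the cohort."""
    return claimant_loglik - log_mean_exp(cohort_logliks)


def verify(claimant_loglik, cohort_logliks, threshold=0.0):
    """Accept the identity claim when the normalized score exceeds the threshold."""
    return cohort_normalized_score(claimant_loglik, cohort_logliks) > threshold


if __name__ == "__main__":
    # The claimed speaker's model fits better than the cohort models: accept.
    print(verify(-100.0, [-110.0, -112.0, -115.0]))
    # The cohort models fit better than the claimed speaker's model: reject.
    print(verify(-100.0, [-90.0, -95.0, -92.0]))
```

Because the score is a difference of log-likelihoods, any effect that shifts the claimant and cohort log-likelihoods by the same amount cancels out of the decision.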
It has been shown that the cohort normalization technique can be viewed as providing a dynamic threshold which partially compensates for trial-to-trial variations. In particular, the use of cohort normalization scores compensates to some extent for microphone mismatch between a training session and subsequent test sessions. Cohort normalization has been successfully introduced in free-text systems as well, where a full acoustic model should be generated from each concurrent speaker. Speaker verification systems using cohort normalization are intrinsically language-dependent, however, and speaker-independent models are not commonly used for normalization purposes, mainly due to the mismatch in model accuracy between the speaker-independent model and the rather poorly trained speaker-dependent models.
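The dynamic-threshold view can be made explicit with a short rearrangement. The notation is assumed here for illustration and does not appear in the cited references: $O$ is the test utterance, $\lambda_c$ the claimed speaker's model, $\lambda_{\mathrm{coh}}$ the cohort (normalizing) model, and $\theta$ a fixed decision threshold.

```latex
\log p(O \mid \lambda_c) - \log p(O \mid \lambda_{\mathrm{coh}}) > \theta
\quad\Longleftrightarrow\quad
\log p(O \mid \lambda_c) > \theta + \log p(O \mid \lambda_{\mathrm{coh}})
```

The right-hand side acts as a threshold that moves with each trial: if a condition such as a different microphone lowers both log-likelihoods by roughly the same amount, the effective threshold drops with them.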
Speaker verification systems have characterized input utterances by use of speaker-specific sub-word size (e.g., phoneme-size) hidden Markov models (HMMs). This approach changes the key text each time the system is used, thereby addressing the problem of replay attack. The speaker-specific sub-word models can be generated by speaker adaptation of speaker-independent models. Speaker-dependent sub-word models are created for each reference speaker. These systems again need extensive training sessions.
The following references are pertinent to the present invention:
Higgins et al., "Speaker Verification Using Randomized Phrase Prompting," Digital Signal Processing, March 1991, pp. 89-106.
A. E. Rosenberg et al., "The Use of Cohort Normalized Scores for Speaker Verification," Proc. 1992 ICSLP, October 1992, pp. 599-602.
F. K. Soong et al., "A Vector Quantisation Approach to Speaker Verification," IEEE 1985, pp. 387-390.
A. E. Rosenberg et al., "Sub-word Unit Talker Verification Using Hidden Markov Models," IEEE 1990, pp. 269-272.
T. Masui et al., "Concatenated Phoneme Models for Text-variable Speaker Recognition," IEEE 1993, pp. 391-394.
J. Kuo et al., "Speaker Set Identification Through Speaker Group Modeling," BAMFF '92.
Each of the foregoing references in its entirety is hereby incorporated herein by reference.