Speakers produce speech by forcing air through the vocal tract while manipulating a set of articulators in a coordinated manner. The articulators include the tongue, lips, jaw, vocal cords, and velum. The vocal tract includes the lungs, throat, mouth, and nasal passages. The speaker-dependent physiological characteristics of the articulators and the vocal tract cause acoustic variability in the rendered speech of different speakers. In addition, the dynamics of the articulatory gestures used during the production of speech vary among speakers.
Taking advantage of these variations, it is an object of the present invention to verify whether a speaker is who she or he claims to be based on his or her spoken utterances, independent of speech content.
Automated speaker verification can be of substantial value in applications where the true identity of individuals is of paramount importance. For example, financial transactions involving credit cards or telephone calling cards are notoriously prone to fraud; the losses to banks and credit card companies alone are estimated to run between five and ten billion dollars annually. Speaker verification can also be used to reduce the unauthorized use of voice communication devices such as cellular telephones.
In a speaker verification system, individuals having known identities supply utterances or speech samples during "training" sessions. A sample of continuous speech signals is digitally analyzed to form a discrete temporal sequence of observation vectors, each of which contains a set of acoustic features. Each observation vector is called a "frame" of speech. The components of a frame are acoustic attributes chosen to represent a discrete portion of the speech signal. The frames of the various individuals can be further processed to create models representing their speech. The models can be stored in a database along with the identities of the corresponding individuals.
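The framing step described above can be sketched as follows. This is a minimal illustration, not the invention's actual feature extraction: practical systems typically use richer acoustic attributes such as cepstral coefficients, and the frame length, step size, and the two attributes computed here (energy and zero-crossing rate) are hypothetical choices.

```python
def frames_from_signal(signal, frame_len=200, step=100):
    """Slice a sampled signal into overlapping frames and compute one
    observation vector per frame. The two components here, frame energy
    and zero-crossing rate, stand in for a real acoustic feature set."""
    observations = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        # Energy: sum of squared sample values over the frame.
        energy = sum(s * s for s in frame)
        # Zero-crossing rate: fraction of adjacent sample pairs that
        # change sign, a crude correlate of spectral content.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        observations.append((energy, crossings / (frame_len - 1)))
    return observations
```

For example, a 400-sample signal framed with a 200-sample window and a 100-sample step yields three overlapping observation vectors.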
Subsequently, the claimed identity of an individual can be verified by having the individual utter a prompted sequence of words or spontaneous speech during a "testing" session. These "validation" or testing speech signals are analyzed and compared with the prestored models corresponding to the "claimed" identity to determine scores. For example, the scores can be expressed as log likelihood scores: score = log p(O|I), where p(O|I) is the likelihood that the observed frames O were produced by the claimed individual I. If the score exceeds a predetermined threshold, it is presumed that the individual is who he or she claims to be.
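The scoring and thresholding step can be sketched as follows, assuming a deliberately simple single-Gaussian speaker model over one-dimensional frames; actual systems use far richer statistical models, and the model parameters and threshold value here are hypothetical.

```python
import math

def log_likelihood(frames, mean, var):
    """Score log p(O|I): the sum of per-frame log densities under a
    hypothetical single-Gaussian speaker model (mean, var)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )

def verify(frames, model, threshold):
    """Accept the claimed identity when the score exceeds the
    predetermined threshold."""
    mean, var = model
    return log_likelihood(frames, mean, var) > threshold
```

Frames close to the model mean produce higher scores than distant ones, so a well-chosen threshold separates the true speaker from most impostors.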
Ideally, under consistent acoustic conditions, it is possible to pose the speaker verification problem as a simple hypothesis test. Unfortunately, acoustic variabilities between the training and testing conditions complicate the problem. For example, microphones used during training can have different acoustic capabilities from those used during testing. There can also be differences in background noise characteristics. In addition, the speech samples used during testing may arrive at a centralized verification site via a telephone network with unpredictable transmission characteristics that are likely to distort the signals. Furthermore, the sampling rates used during training can differ from those used during testing.
All of these factors can increase equal error rates. The equal error rate is the point where the percentage of erroneously rejected correct speakers (false negatives) is equal to the percentage of erroneously accepted impostors (false positives). Systems with lower equal error rates have better performance.
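The equal error rate defined above can be estimated from a set of scores for correct speakers and a set of scores for impostors by sweeping the decision threshold, as in the sketch below; sweeping over the observed scores themselves is one simple way to locate the operating point where the two error rates are closest.

```python
def error_rates(genuine, impostor, threshold):
    """False-rejection and false-acceptance rates at one threshold."""
    frr = sum(s <= threshold for s in genuine) / len(genuine)
    far = sum(s > threshold for s in impostor) / len(impostor)
    return frr, far

def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds drawn from the observed scores and
    return the error rate at the point where the false-rejection and
    false-acceptance rates are closest to equal."""
    best = min(
        (abs(frr - far), (frr + far) / 2)
        for t in sorted(genuine + impostor)
        for frr, far in [error_rates(genuine, impostor, t)]
    )
    return best[1]
```

When the genuine and impostor score distributions are perfectly separable, the equal error rate is zero; the more they overlap, the higher it becomes.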
In the prior art, cohort normalization has been used as a technique to minimize equal error rates. In cohort normalization, the models of each individual speaker are linked to the models of "cohort" individuals. The cohorts can be selected from the pool of all speakers who have "trained" models. Alternatively, the cohort models can be synthesized from the models of several speakers. In the prior art, a small number, typically fewer than ten, "cohort" models are linked to the models of each identified individual. Generally, error rates increase if more cohorts are used.
During testing, the score obtained from the models of the speaker whose identity is claimed is compared with all of the scores derived from the small set of cohort models to produce a set of score differences. The differences are then used as a "normalized" score, for example: normalized score = log p(O|I) - f[log p(O|C_k(I))], where log p(O|C_k(I)) are the scores for the k cohorts linked to the claimed individual. A function f can combine all of the cohort scores during the normalization. The function can be statistical in nature, for example, maximum, average, percentile, median, or mean, or the output of a neural network.
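The normalization formula can be sketched directly. The choice of `max` as the default combining function f below is illustrative only; any of the statistical combinations mentioned (mean, median, and so on) could be passed instead.

```python
def normalized_score(claim_score, cohort_scores, combine=max):
    """Cohort-normalized score: log p(O|I) minus a combining function f
    applied to the cohort scores log p(O|C_k(I)). The default f = max
    is one common statistical choice; others may be substituted."""
    return claim_score - combine(cohort_scores)
```

For example, a claimed-speaker score of -100 against cohort scores of -120 and -110 yields a normalized score of 10 with f = max, or 15 with f = mean.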
Cohort normalization provides thresholds which compensate for acoustic variations in the testing conditions. Determining the difference between the score of a claimed speaker and the scores of cohorts has proven to be very effective; see Rosenberg, DeLong, Lee, Juang, and Soong, "The Use of Cohort Normalized Scores for Speaker Verification," Proc. ICSLP, October 1992, pp. 599-602. There, a factor of five reduction in the error rate is reported for cross-microphone conditions using a set of five cohorts.
In the prior art, a specific speaker's set of cohorts is selected by some metric of "closeness," e.g., a multi-dimensional statistical distance to the speaker's models in the acoustic space, based on the training data. However, it has been observed that low scores for a given utterance for one or more of the selected cohorts can still result in a substantial degradation of system performance.
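The closeness-based selection described above can be sketched as follows, using a simple Euclidean distance between model mean vectors as a stand-in for whatever multi-dimensional statistical distance a particular system employs; representing each model as a plain tuple of coordinates is an assumption made for illustration.

```python
def nearest_cohorts(speaker_model, other_models, k=5):
    """Select the k candidate models closest to the speaker's model.
    Euclidean distance between mean vectors is used here for clarity;
    real systems use a statistical distance between the full models."""
    def dist(m):
        return sum((a - b) ** 2 for a, b in zip(speaker_model, m)) ** 0.5
    return sorted(other_models, key=dist)[:k]
```

The k returned models would then be linked to the speaker's models as the cohort set used during testing.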
To compensate for the lower scores, the score threshold can be set to a larger value so that impostors who are able to beat the lower cohort scores are more likely to be rejected. However, increasing the threshold also increases the likelihood that a valid speaker will be erroneously rejected, i.e., an increase in the equal error rate.
Therefore, there is a need for a cohort selection mechanism which reduces the equal error rate. Reducing the equal error rate increases the likelihood of rejecting impostors, while decreasing the rate of erroneously rejecting individuals who are in fact who they claim to be, even in the presence of variations in the acoustic environment during testing.