A speaker verification system determines whether or not a person claiming an identity previously established within the system is the identified person by comparing a password (which may be multiple speech "words") spoken by the person seeking to be identified at the time of the identification request with previously stored speech containing corresponding "words" entered into the system by the identified person. Such a system is particularly useful as a means for controlling entry/exit with respect to secured environments or to enable access to a secure communications system.
With most existing speaker verification systems, the comparison between the spoken password and the reference speech vocabulary previously entered into the system by the identified speaker is based on a measurement of the Euclidean distance between elements of the password speech and of the reference speech using computer processing of such speech elements which have been converted to digital form. Such comparison may also include a measurement of such distances between elements of the password speech and generic speech elements established as a reference base. If the distance so measured is less than a predetermined threshold value, and, in the case of a verification system using reference speech measurement, less than any of such reference measurements, the speaker is judged to be the identified speaker, and if greater than the threshold value (or one or more of the reference measurements), the speaker is judged to be an impostor. An example of such a speaker verification system is found in U.S. Pat. No. 4,694,493 to Sakoe, entitled Speaker Verification System, issued on Sep. 15, 1987.
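By way of illustration only, the distance-based decision described above may be sketched as follows. The sketch assumes that the password speech and the reference speech have already been converted to equal-length sequences of numeric feature vectors and temporally aligned; the function names and the averaging of per-element distances are hypothetical choices made for the example and are not taken from the cited patent.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distance(test_frames, ref_frames):
    """Average distance between corresponding (already aligned) speech elements."""
    return sum(euclidean(t, r) for t, r in zip(test_frames, ref_frames)) / len(test_frames)

def verify(test_frames, claimed_ref, generic_refs, threshold):
    """Accept only if the distance to the claimed speaker's reference is below
    the threshold AND below the distance to every generic reference."""
    d_claim = mean_distance(test_frames, claimed_ref)
    d_generic = [mean_distance(test_frames, g) for g in generic_refs]
    return d_claim < threshold and all(d_claim < d for d in d_generic)
```

With an empty list of generic references, the decision reduces to the simple threshold comparison.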
A limitation of all speaker verification systems is that an exact match between the password speech and the reference speech is seldom, if ever, achieved. This happens because of naturally occurring differences in a particular speaker's voice between the time of entering the reference speech into the system and of the request for verification, and due to the fact that the process of converting the analog waveform of the speaker's voice into digitized voice components may produce slight variations in such components as between the reference speech and the password speech, even in the absence of variations in waveform of the speaker's voice between the reference speech and the password speech.
This limitation is manifested in two possible errors for the speaker verification system: either a false rejection of the identified speaker or a false acceptance of an impostor. The consequence of such error is managed by a choice of a threshold value to be used as a basis for comparison with the measured distance between the password speech elements and the reference speech elements. A low threshold value can be expected to minimize the likelihood of an impostor being accepted, but will also increase the likelihood that the identified person will be rejected. A high threshold value, on the other hand, will diminish the likelihood of the identified person being rejected, but will increase the likelihood of an impostor being accepted. While the reliability (i.e., avoidance of erroneous results) of such a speaker verification system can be improved by increasing the number of voice components analyzed, this methodology suffers from the parallel constraints of (1) limitations in computer processing power and (2) the human-factor limitation that verification processing time must be very short (likely no more than 15-20 seconds) for acceptance by the user.
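The threshold trade-off described above may be illustrated with a short sketch. The distance values below are invented solely for the example, and the function name is hypothetical; a trial is assumed to be accepted when its measured distance falls below the threshold.

```python
def error_rates(genuine_distances, impostor_distances, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) for a given threshold.

    A trial is accepted when its measured distance is below the threshold.
    """
    false_rejections = sum(d >= threshold for d in genuine_distances)
    false_acceptances = sum(d < threshold for d in impostor_distances)
    return (false_rejections / len(genuine_distances),
            false_acceptances / len(impostor_distances))

# Invented distances, for illustration only.
genuine = [0.8, 1.0, 1.3, 1.9]    # identified speaker's own attempts
impostor = [1.5, 2.2, 2.6, 3.0]   # attempts by other speakers

for threshold in (1.0, 1.6, 2.5):
    print(threshold, error_rates(genuine, impostor, threshold))
```

On these invented values, raising the threshold lowers the false rejection rate while raising the false acceptance rate, which is the trade-off stated above.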
In the quest for a comparative measurement speaker verification system which achieves an acceptable level of security while at the same time minimizing the likelihood of the identified speaker being rejected, randomization techniques have also been used to determine components of the password from a vocabulary of reference "words" entered into the system by the identified speaker. Prior speaker verification systems typically prompted the person seeking to be identified to read a fixed phrase as a password, and compared that spoken password with previous utterances of the same phrase or password by the identified speaker. By using fixed prompts, such systems offered would-be impostors the opportunity to prepare responses (including tape recorded responses) in advance in order to increase their chances of being illegitimately verified. Through the use of test phrases composed at random at the time of verification, and requiring that the word content of the spoken utterance match the prompt, the likelihood of accepting an impostor is significantly reduced. There are so many possible prompts that a would-be impostor has virtually no chance of being prepared with an acceptable response. An example of the use of such randomization techniques in speech verification systems is found in a paper entitled Personal Identity Verification Using Voice presented by Dr. George R. Doddington and printed in Proc. ELECTRO-76, May 11-14, 1976, pp. 22-4, 1-5.
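The randomized prompting described above may be sketched as follows, by way of example only. The vocabulary, the phrase length, and all function names are assumptions made for the illustration and do not come from the cited paper; the recognition of the spoken words is taken as given.

```python
import random

def compose_prompt(vocabulary, n_words, rng=None):
    """Compose a test phrase by drawing words at random from the enrolled vocabulary."""
    rng = rng or random.SystemRandom()  # unpredictable source by default
    return [rng.choice(vocabulary) for _ in range(n_words)]

def prompt_count(vocab_size, n_words):
    """Number of distinct prompts when words may repeat and order matters."""
    return vocab_size ** n_words

def content_matches(prompt, recognized_words):
    """Require that the word content of the utterance match the prompt exactly."""
    return list(prompt) == list(recognized_words)

# Hypothetical enrolled vocabulary, for illustration only.
vocabulary = ["one", "two", "three", "four", "five",
              "six", "seven", "eight", "nine", "zero"]
prompt = compose_prompt(vocabulary, 4)
```

Under these assumptions a ten-word vocabulary and four-word prompts give 10,000 possible phrases, so a response prepared in advance is unlikely to match the prompt actually issued.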
Randomization of the test phrases does, however, introduce a new problem. Words occur in contexts (of surrounding words) that did not occur in the enrollment phrases. The context in which a word is spoken influences its pronunciation through coarticulation, caused by limitations in the movement of the speech articulators. These coarticulations, which have not been incorporated into the verification analysis model, contribute to the measured dissimilarity between the test and enrollment utterances, increasing the likelihood of a false rejection of the identified speaker.
Previous work by the inventor partially overcomes this difficulty by means of a scoring method called likelihood scoring. See A. Higgins, L. Bahler and J. Porter, Speaker Verification Using Randomized Phrase Prompting, Digital Signal Processing 1, 89-106 (1991). The current invention builds upon this previous work, providing a more complete solution to the coarticulation problem. It is to be noted, however, that the invention applies as well to verification using fixed (i.e., non-random) phrase prompts.
Prior speaker verification systems have also commonly used word templates as the basis for matching speech utterances. In these methods, word templates are derived from occurrences of the words spoken during enrollment. As an example of the use of such templates, see U.S. Pat. No. 4,773,093 to Higgins, et al, entitled Text-Independent Speaker Recognition System And Method Based On Acoustic Segment Matching, issued on Sep. 20, 1988, and assigned to ITT Corporation, the assignee herein. In verification, the word templates are temporally aligned with occurrences of the same words in the test phrases and used to derive a distance or dissimilarity score. Two methods of deriving word templates are generally used, which have different problems with respect to coarticulation. In one method, averaged templates are derived by temporally aligning all the enrollment occurrences of each word and averaging the constituent frames. The problem with averaged templates is that the diversity of coarticulations near word boundaries is poorly represented by the average. In the second method, multiple templates for individual word occurrences are extracted from the enrollment phrases (with multiple representations of each word). The problem with multiple templates is that coarticulation influences both the beginning and end of each word, and a prohibitively large number of templates per word would be needed to simultaneously match all possible contexts on both sides.
These problems with word template matching have been alleviated by a recent development in the speaker verification art: a comparison of test phrases with enrollment phrases using individual frames, rather than words, as the atomic units. Frames of the enrollment data are used directly in the comparison, without averaging. Thus, the multiple-template problems are avoided because each frame is effectively a "snapshot" representing a single instant of time. The use of such frames of speech data is described at length in U.S. Pat. No. 4,720,863 to Li, et al, entitled Method and Apparatus For Text-Independent Speaker Recognition, issued on Jan. 19, 1988, and assigned to ITT Corporation, the assignee herein. See also U.S. Pat. No. 4,837,830 to Wrench, Jr. et al. entitled Multiple Parameter Speaker Recognition System And Methods, issued on Jun. 6, 1989, and also assigned to ITT Corporation.
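A frame-level comparison of the general kind described above may be sketched as follows. This is a generic nearest-neighbor illustration and is not asserted to be the method of any of the cited patents; it assumes each frame is a numeric feature vector, and the function names are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two frames (equal-length feature vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_frame_distance(frame, enrollment_frames):
    """Distance from one test frame to the closest enrollment frame."""
    return min(frame_distance(frame, e) for e in enrollment_frames)

def frame_score(test_frames, enrollment_frames):
    """Average nearest-frame distance over all test frames.

    Enrollment frames are used directly, without averaging into templates.
    """
    total = sum(nearest_frame_distance(f, enrollment_frames) for f in test_frames)
    return total / len(test_frames)
```

Because every enrollment frame is retained, a test frame spoken in a new word context can still be matched to whichever enrollment frame it most resembles, rather than to a single averaged template.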
A further improvement in the analysis and comparison of speech data has been developed by L. Bahler and is manifested in his invention called "Speaker Sorter" for which an application is copending under Serial No. 07/699,217, filed May 13, 1991 and is incorporated herein by reference. Bahler teaches the use of a baseline algorithm for speaker recognition which is non-parametric in nature and makes no assumption about the statistical distributions of speech features. The reference data used to characterize a given talker's speech patterns are a large set of speech feature vectors, not a set of estimated distribution parameters. A significant advantage of Bahler's methodology is its use of non-parametric methods, because the further development of parametric methods, toward more complicated distributions which might approximate true speech more accurately, has the difficulty of estimating an increased number of statistical parameters which such models entail.
It is an object of this invention to provide an improved speaker verification system having a low error rate while minimizing verification processing time and/or reducing computer processing power requirements.