The present invention relates generally to biometric attribute validation and, more particularly, to methods and apparatus for correlating a biometric attribute with one or more biometric attribute production features to validate the production of the biometric attribute.
The use of biometric attributes to validate, i.e., identify and/or verify, a person for access to secure applications, systems and/or facilities has increased greatly in the past several years. Some examples of personal biometric attributes that have been used in the validation process include acoustic or speech patterns, fingerprints, and retinal scans, to name only a few. Unfortunately, with the increased use of biometric user validation have come increased attempts to deceive the applications, systems and facilities which employ such validation techniques in order to gain unauthorized access. This is especially true in the case of speech biometrics. Some drawbacks of the use of conventional speech biometric techniques in speaker recognition systems for making a validation decision are described below.
When conventional speaker recognition systems are deployed, it is typically assumed that the application manages to verify that the input utterances originate from a live session with a live speaker to enroll, identify or verify. This assumption extends across modalities from text-constrained (e.g., text-dependent, text-prompted, user selected password) to text-independent and speech biometrics. See, for example, U.S. Pat. No. 5,897,616, issued on Apr. 27, 1999, and entitled “Apparatus and Methods for Speaker Verification/Identification/Classification Employing Non-Acoustic and/or Acoustic Models and Databases,” the disclosure of which is incorporated by reference herein.
However, with the evolution of digital signal processing (DSP) of digital recordings, as well as advances in text-to-speech (TTS) technology and, in particular, in voice fonts, one can no longer be certain whether a live person is generating the submitted sounds. Voice fonts are known to have the potential to provide the capability to play back or synthesize speech sounding like a given individual based on some training data obtained from the individual and/or voice transformation functions. Compare, for example, U.S. patent application identified by Ser. No. 08/821,520 (docket no. YO996-247), filed on Mar. 21, 1997, and entitled “Speech Synthesis Based on Pre-Enrolled Tokens,” the disclosure of which is incorporated by reference herein.
The transition from text-dependent speaker recognition (which is known to be especially vulnerable to recordings) to text-prompted speaker recognition provided somewhat of a solution to the problem. However, even text-prompted speaker recognition does not offer any guarantee against a sophisticated TTS or playback signal processing system. The use of user selected passwords is a proposed extension of the text-prompted speaker recognition concept. However, user selected passwords are easily stolen and used to gain unauthorized access.
Text-independent speaker recognition systems are also essentially defenseless against an efficient TTS/voice font system. Only the use of a conventional text-independent system in the background of a transaction or interaction with a human operator makes it somewhat difficult for a speaker to maintain the flow of the transaction if he uses a TTS/playback system to attempt to fool the recognition system. However, with more sophisticated DSP/TTS capabilities (especially on personal digital assistant or PDA devices), there are no more guarantees with respect to user validation.
The concept of speech biometrics adds a knowledge-based dimension to the recognition process. As is known, see U.S. Pat. No. 5,897,616 and S. Maes, “Conversational Computing,” IBM Pervasive Computing Conference, Yorktown Heights, N.Y., June 1999, speech biometric systems use simultaneous content-based recognition (e.g., answers to random questions) and acoustic-based recognition techniques. However, provided that an imposter has the knowledge, a system using speech biometric techniques is essentially defenseless against such an imposter also using sophisticated voice font capabilities. As long as the imposter is able to follow the flow of the dialog, he will likely be able to gain unauthorized access. However, in the case where the speech biometrics system changes multiple non-trivial questions from one access request to another, it is no easy task for an imposter to possess sufficient knowledge and follow the flow of the dialog in order to gain unauthorized access.
Some attempts have been made at detecting the non-linearities of DSP/coders and loudspeakers to detect usage of such devices attempting to fool the system into believing that the person is actually speaking. However, these techniques are not always reliable when dealing with high quality audio equipment or new and unknown equipment.
The use of synchronized biometrics, e.g., face recognition, local mouth geometry recognition, and lip reading synchronized with utterance recognition and speaker recognition, has been proposed to guarantee that the user does not use a speaker close to his mouth and lips to generate the utterance. See, for example, U.S. patent application identified by Ser. No. 09/067,829 (docket no. YO997-251), filed on Apr. 28, 1998, and entitled “Method and Apparatus for Recognizing Identity of Individuals Employing Synchronized Biometrics,” the disclosure of which is incorporated by reference herein; as well as the above-incorporated U.S. Pat. No. 5,897,616. Although this adds an additional level of security, it may not be completely foolproof against an effective voice font system combined with good lip sync capabilities.
Accordingly, it is clear that a need exists for techniques that can better guarantee that a speaker physically produced a subject utterance. More generally, a need exists for techniques that can better guarantee that a given biometric attribute has been physically produced by the person offering the biometric attribute as his own.
The present invention provides methods and apparatus for validating the production of a biometric attribute that better guarantee that a given biometric attribute has been physically produced by the person offering the biometric attribute as his own.
In one broad aspect of the invention, a method of validating production of a biometric attribute allegedly associated with a user comprises the following steps. A first signal is generated representing data associated with the biometric attribute allegedly received in association with the user. A second signal is also generated representing data associated with at least one feature detected in association with the production of the biometric attribute allegedly received from the user. Then, the first signal and the second signal are compared to determine a temporal correlation level between the biometric attribute and the production feature, wherein the validation of the production of the biometric attribute depends on the correlation level. Accordingly, the invention serves to provide substantial assurance that the biometric attribute offered by the user has been physically generated by the user.
In one embodiment, the biometric attribute is a spoken utterance and the production feature is a physiological effect attributable to the production of the spoken utterance alleged to have been produced by the user, e.g., glottal excitation or vibration. The spoken utterance may be decoded and labeled by a speech recognition system to produce the first signal. For example, a sequence of voiced and unvoiced phones is generated from the spoken utterance. Then, a first data value (e.g., a logic value “1”) is assigned to a voiced phone and a second data value (e.g., a logic value “0”) is assigned to an unvoiced phone. Thus, the first signal represents a sequence of such logic values representing the occurrence of voiced and unvoiced phones from the spoken utterance. In an alternative embodiment, a speaker recognition system may be employed to decode and label the spoken utterance.
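By way of illustration only, the assignment of logic values to decoded phones may be sketched as follows. The phone labels and the `VOICED_PHONES` set are hypothetical (an ARPAbet-style labeling is assumed and the set is deliberately incomplete); a real speech recognition system would supply its own phone inventory and time alignment.

```python
# Hypothetical, incomplete set of voiced phone labels (ARPAbet-style, assumed).
VOICED_PHONES = {
    "aa", "ae", "ah", "eh", "ih", "iy", "uw",  # vowels
    "m", "n", "l", "r",                        # nasals and liquids
    "b", "d", "g", "v", "z",                   # voiced obstruents
}

def voicing_sequence(phones):
    """Map a decoded phone sequence to logic values: 1 = voiced, 0 = unvoiced.

    Any label not in VOICED_PHONES (including silence markers) is treated
    as unvoiced in this sketch.
    """
    return [1 if phone.lower() in VOICED_PHONES else 0 for phone in phones]

# One label per time period is assumed here for simplicity.
first_signal = voicing_sequence(["s", "iy", "t", "ah", "z"])
```

In this sketch each list element stands for one time period; a decoder that reports phone durations would repeat each logic value for the number of periods the phone spans.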
The physiological effect attributable to the production of the spoken utterance alleged to have been produced by the user, e.g., glottal excitation or vibration, may be detected by a speech production detecting system, e.g., a laryngograph device or a radar device. The physiological effect may be represented by a time varying signal. Then, to generate the second signal, the time varying signal may be processed to generate a sequence of data values (logic “1”s and “0”s) representing some characteristic of the signal content. For example, since it is known that a relatively lower level of glottal excitation energy is generally associated with unvoiced speech, while a relatively higher level of glottal excitation energy is generally associated with voiced speech, a mean-square value of the excitation signal may be computed for each time period corresponding to a time period associated with the acoustic sequence, and if the mean-square value exceeds a predetermined relative threshold value, a logic “1” may be assigned to that time period, and a logic “0” otherwise. The “relative threshold value” may represent a fixed fraction of the average mean-square value of the entire signal, as will be explained below. Thus, through the use of a relative threshold value, gain/loss effects are advantageously accounted for over each time period.
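A minimal sketch of the mean-square thresholding described above is given below. The function name, the fixed-length time periods, and the default `fraction` of 0.5 are assumptions made for illustration and are not values taken from the disclosure.

```python
def excitation_sequence(samples, period_len, fraction=0.5):
    """Convert an excitation signal into logic values, one per time period.

    A period is assigned 1 when its mean-square value exceeds a relative
    threshold, i.e., `fraction` times the average mean-square value over
    all periods considered; otherwise it is assigned 0. Trailing samples
    that do not fill a whole period are ignored in this sketch.
    """
    if period_len <= 0:
        raise ValueError("period_len must be positive")
    periods = [
        samples[i:i + period_len]
        for i in range(0, len(samples) - period_len + 1, period_len)
    ]
    if not periods:
        return []
    mean_squares = [sum(x * x for x in p) / period_len for p in periods]
    # Equal-length periods: the mean of the per-period values equals the
    # mean-square value of the whole (truncated) signal.
    threshold = fraction * (sum(mean_squares) / len(mean_squares))
    return [1 if ms > threshold else 0 for ms in mean_squares]

signal = [0.1, -0.1, 1.0, -1.0, 0.0, 0.0, 2.0, 2.0]
second_signal = excitation_sequence(signal, period_len=2)
```

Because the threshold scales with the overall signal energy, multiplying every sample by a constant gain leaves the resulting sequence unchanged, which is the gain/loss insensitivity noted above.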
Thus, in such an embodiment, the comparing operation may comprise a time-aligned comparison of the respective sequences associated with the first signal and the second signal to determine a percentage or relative fraction of matches between the data values representing voiced and unvoiced phones and the data values representing the energy level of the glottal excitation signal over all time periods being considered. The percentage of matches represents the level of correlation. The level of correlation may then be compared to a threshold value. If the level is not less than the threshold value, for example, the production of the biometric attribute is considered validated. That is, the invention provides a substantial assurance that the speech production detecting system is witnessing the actual source of the biometric.
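The time-aligned comparison and threshold test may be sketched as follows. The function names and the default threshold of 0.8 are hypothetical; the disclosure does not fix a particular threshold value.

```python
def match_fraction(first, second):
    """Return the fraction of time periods in which the two sequences agree."""
    if not first or len(first) != len(second):
        raise ValueError("sequences must be non-empty and time-aligned")
    matches = sum(1 for a, b in zip(first, second) if a == b)
    return matches / len(first)

def production_validated(first, second, threshold=0.8):
    """Validate production when the correlation level is not less than the threshold."""
    return match_fraction(first, second) >= threshold

voicing = [1, 1, 0, 0, 1]      # first signal: voiced/unvoiced phones
excitation = [1, 0, 0, 0, 1]   # second signal: high/low excitation energy
level = match_fraction(voicing, excitation)   # 4 of 5 periods agree
accepted = production_validated(voicing, excitation)
```

A played-back or synthesized utterance presented without matching glottal activity would be expected to yield a low match fraction, since the excitation sequence would not track the voicing in the acoustic signal.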
It is to be understood that the speech-based embodiment described above, that is, comparison between the voicing in the acoustic signal and high-energy in the production signal, is not the only manner of determining a correlation level between a spoken utterance and a speech production feature. In one alternative embodiment, the mutual information between the two signals may be used, e.g., see T. M. Cover and J. A. Thomas, “Elements of Information Theory,” 1991. In another embodiment, a two-dimensional contingency table may be used in conjunction with a Chi-Square test to measure the association between the two signals, e.g., see E. S. Keeping, “Introduction to Statistical Inference,” 1962. In general, any statistical measure of correlation or association may be used. In yet another implementation, the pitch/fundamental from the speech waveform and the glottal excitation signal may be directly extracted (e.g., by the speech or speaker recognition/acoustic system and the speech production/non-acoustic system, respectively) and their periodicities compared.
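For the mutual information and Chi-Square alternatives, one possible sketch operating on two binary sequences is shown below. It uses the standard textbook definitions (empirical mutual information in bits and the Pearson Chi-Square statistic for a 2x2 contingency table); the function names are hypothetical, and the choice of a significance level for the Chi-Square test is left open.

```python
import math
from collections import Counter

def contingency_table(first, second):
    """Build a 2x2 table: rows index the first signal, columns the second."""
    if not first or len(first) != len(second):
        raise ValueError("sequences must be non-empty and time-aligned")
    counts = Counter(zip(first, second))
    return [[counts[(a, b)] for b in (0, 1)] for a in (0, 1)]

def mutual_information(table):
    """Empirical mutual information, in bits, between the two signals."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if table[i][j]:
                # p(i,j) * log2( p(i,j) / (p(i) * p(j)) )
                mi += (table[i][j] / n) * math.log2(table[i][j] * n / (rows[i] * cols[j]))
    return mi

def chi_square(table):
    """Pearson Chi-Square statistic for association in the 2x2 table."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

table = contingency_table([1, 1, 0, 0], [1, 1, 0, 0])  # [[2, 0], [0, 2]]
```

For perfectly agreeing, balanced sequences the mutual information reaches 1 bit, while statistically independent sequences give values near zero for both measures; either value could be compared to a threshold in place of the match fraction.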
In yet other approaches, the characteristics to be compared may be the voiced/unvoiced distribution extracted from each signal, or the voiced/unvoiced distribution extracted from the production signal with respect to the energy in the fundamental excitation component of the acoustic signal (e.g., measured by an LPC model as described in S. Furui, “Digital speech processing, synthesis and recognition,” Marcel Dekker, New York, N.Y. 1989). When the voiced/unvoiced distribution from the production signal is employed, for example, the glottal energy contained in the production signal may be directly measured in order to extract the voiced/unvoiced decision.
Accordingly, it is to be appreciated that a key to the outcome of the correlation operation is the degree of temporal coincidence between the signal representing the biometric attribute (the first signal) and the signal representing the biometric attribute production feature (the second signal). The comparison is accomplished by extracting the temporal correlation between characteristics associated with both signals. However, any suitable correlation measure/estimator can be used. Given the inventive teachings herein, one of ordinary skill in the art will realize other implementations that are within the scope of the invention.
It is also to be appreciated that the inventive methodology of validating production of a spoken utterance allegedly associated with a user may be employed in conjunction with a speech biometric recognition system. For example, a speaker recognition system may employ the speech biometric techniques described in the above-incorporated U.S. Pat. No. 5,897,616. These speech biometric results may be used in conjunction with the results obtained via the above-mentioned spoken utterance production validating methodology to provide an overall validation result with regard to the potential user. It is also to be understood that the invention may be used not only for verification, but also for identification. That is, the invention may determine who is speaking out of a set of pre-enrolled users.
It is further to be appreciated that while the above embodiment describes speech as the biometric, the invention is not so limited. That is, the methods and apparatus of the invention are applicable for use in accordance with other biometrics, e.g., fingerprints and retinal scans, to name only a few. Also, a system according to the invention may be configured to validate the production of more than one biometric attribute at a time.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.