The present invention relates to methods and apparatus for providing security with respect to user access of services and/or facilities and, more particularly, to methods and apparatus for providing same employing automatic speech recognition, text-independent speaker identification, natural language understanding techniques and additional dynamic and static features.
In many instances, it is necessary to verify that an individual requesting access to a service or a facility is in fact authorized to access the service or facility. For example, such services may include banking services, telephone services, or home video provision services, while the facilities may be, for example, banks, computer systems, or database systems. In such situations, users typically have to write down, type or key in (e.g., on a keyboard) certain information in order to send an order, make a request, obtain a service, perform a transaction or transmit a message.
Verification or authentication of a customer prior to obtaining access to such services or facilities typically relies on the customer's knowledge of passwords or personal identification numbers (PINs), or on the customer interfacing with a remote operator who verifies the customer's knowledge of information such as name, address, social security number, city or date of birth, mother's maiden name, etc. In some special transactions, handwriting recognition or signature verification is also used.
However, such conventional user verification techniques present many drawbacks. First, information typically used to verify a user's identity may be easily obtained. Any perpetrator who is reasonably prepared to commit fraud usually finds it easy to obtain personal information such as the social security number, mother's maiden name or date of birth of his intended target. Security measures for more complex knowledge-based systems, which require passwords, PINs or knowledge of the last transaction/message provided during the previous service, are also unreliable, mainly because the user is usually unable to remember this information or because many users write the information down, thus making the fraudulent perpetrator's job even easier. For instance, it is known that many unwitting users actually write their PINs on the back of their ATM or smart card.
The shortcomings inherent in the above-discussed security measures have prompted an increasing interest in biometric security technology, i.e., verifying a person's identity by personal biological characteristics. Several biometric approaches are known. However, one disadvantage of biometric approaches, with the exception of voice-based verification, is that they are expensive and cumbersome to implement. This is particularly true for security measures involved in remote transactions, such as internet-based or telephone-based transaction systems.
Voice-based verification systems are especially useful when it is necessary to identify a user who is requesting telephone access to a service/facility but whose telephone is not equipped with the particular pushbutton capability that would allow him to electronically send his identification password. Existing systems which employ voice-based verification utilize only the acoustic characteristics of the utterances spoken by the user. As a result, existing voice identification methods, such as those disclosed in the article: S. Furui, "An Overview of Speaker Recognition", Automatic Speech and Speaker Recognition, Advanced Topics, Kluwer Academic Publisher, edited by C. Lee, F. Soong and K. Paliwal, cannot guarantee a reasonably accurate or fast identification, particularly when the user is calling from a noisy environment or when the user must be identified from among a very large database of speakers (e.g., several million voters). Further, such existing systems are often unable to attain the level of security expected by most service providers. Still further, even when existing voice verification techniques are applied under constrained conditions, verification accuracy becomes unpredictable whenever the constraints are modified, as is required from time to time. Indeed, at the present stage of development of the prior art, it is clear that the properties of voice prints over large populations, especially over the telephone (i.e., land or cellular, analog or digital, with or without speakerphones, with or without background noise, etc.), are not fully understood.
Furthermore, most of the existing voice verification systems are text-dependent or text-prompted, which means that the system knows the script of the utterance repeated by the user once the identity claim is made. In fact, in some systems the identity claim is itself part of the tested utterance; however, this does not change in any significant way the limitations of the conventional approaches. For example, a text-dependent system cannot prevent an intruder from using a pre-recorded tape with a particular speaker's answers recorded thereon in order to breach the system.
Text-independent speaker recognition, such as the technology used in the embodiments presented in the disclosure of U.S. Ser. No. 08/788,471, overcomes many disadvantages of the text-dependent speaker recognition approach discussed above. However, several issues remain with respect to text-independent speaker recognition in and of itself. In many applications, text-independent speaker recognition requires a fast and accurate identification of a user from among a large number of other prospective users. This problem is especially acute when thousands of users must be processed simultaneously within a short time period and their identities have to be verified from a database that stores millions of users' prototype voices.
In order to restrict the number of prospective users to be considered by a speech recognition device and to speed up the recognition process, it has been suggested to use a "fast match" technique on a speaker, as disclosed in the patent application (IBM Docket No. Y0996-189) entitled, "Speaker Recognition Over Large Population with Combined Fast and Detailed Matches", filed on May 6, 1997. While this procedure is significantly faster than a "detailed match" speaker recognition technique, it still requires processing of acoustic prototypes for each user in a database. Such a procedure can still be relatively time consuming and may generate a list of candidate speakers that is too extensive to be processed by the recognition device.
Accordingly, among other things, it would be advantageous to utilize a language model factor similar to what is used in a speech recognition environment, such factor serving to significantly reduce the size of fast match lists and speed up the procedure for selecting candidate speakers (users) from a database. By way of analogy, a fast match technique employed in the speech recognition environment is disclosed in the article by L. R. Bahl et al., "A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition", IEEE Trans. Speech and Audio Proc., Vol. 1, pg. 59-67 (1993).
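By way of illustration only, the following minimal Python sketch shows how a non-acoustic factor, playing a role analogous to a language model score, might shorten a fast match list before any detailed match is attempted. The function names, the data layout, the distance-based fast match score and the way the prior is combined with it are all assumptions made for this example; they are not taken from the referenced applications or from the Bahl et al. article.

```python
import math

def fast_match_score(utterance, prototype):
    """Cheap acoustic score (assumed here): negative squared Euclidean distance."""
    return -sum((u - p) ** 2 for u, p in zip(utterance, prototype))

def shortlist(utterance, speakers, priors, top_n=2, min_prior=1e-3):
    """Return up to top_n candidate speakers for a later detailed match.

    speakers: dict mapping name -> prototype vector (hypothetical layout)
    priors:   dict mapping name -> probability derived from non-acoustic
              evidence (hypothetical stand-in for the language model factor)
    """
    scored = []
    for name, prototype in speakers.items():
        prior = priors.get(name, 0.0)
        if prior < min_prior:
            # Pruned without processing this speaker's acoustic prototype.
            continue
        score = fast_match_score(utterance, prototype) + math.log(prior)
        scored.append((score, name))
    scored.sort(reverse=True)  # highest combined score first
    return [name for _, name in scored[:top_n]]

# Illustrative data only; the vectors and probabilities are made up.
speakers = {
    "alice": [1.0, 0.0],
    "bob": [0.9, 0.1],
    "carol": [0.0, 1.0],
    "dave": [1.0, 0.05],
}
priors = {"alice": 0.5, "bob": 0.4, "carol": 0.1, "dave": 0.0}

result = shortlist([1.0, 0.0], speakers, priors)
print(result)  # ['alice', 'bob']
```

In this sketch the speaker "dave" is removed by the prior alone, so his acoustic prototype is never scored, which is the kind of saving the language model analogy is intended to suggest.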