The technical field of the present invention relates to a method and apparatus for implementing automatic speaker verification, speaker identification, speech recognition and spoken text verification, and more specifically relates to using the same in a converged voice and data Internet Protocol network.
In the past, voice recognition and verification have been accomplished on a limited basis for hardwired and direct comparison configurations. Other attempts have been made to provide security measures incorporating automatic speaker verification, speaker identification, speech recognition and spoken text verification. However, as the security needs of this country expand, and as customers need more and more contact with their banks, financial institutions, and other businesses to conduct transactions over the Internet, it becomes clear that speaker verification and identification can be an important tool for carrying out those transactions. The identification of a speaker can be useful in the financial and banking industry because verification of the person speaking will enable certain transactions to proceed, whereas an imposter would be barred from carrying on any further transactions. Imposters will also be barred from receiving certain information that should only be heard by the customer.
Previously, some speaker verification and identification were carried out by having the customer first enter a “voiceprint” by reciting a phrase over the telephone, which was stored in a databank. When the customer called back at a later date, the customer would say the same phrase, and it would be matched against the original “voiceprint”. If it matched, then the transaction could continue. As with other hardwired, or dedicated, systems, the speaker is well aware that a match is being attempted, so an imposter may be able to use techniques to convince the institution that the imposter is actually the bona fide customer.
However, with the maturing of the Internet and its wide use worldwide, a new phenomenon has emerged: using the Internet Protocol (IP) to transmit voice packets and to establish and maintain a telephone conversation, known as Voice over IP (VoIP). To the end user, a VoIP telephone connection is perceived to be indistinguishable from a traditional Public Switched Telephone Network (PSTN) connection. However, VoIP uses a conceptually different way to transmit the human voice between two points. The voice is processed and transmitted in the same way as Internet data packets. The convergence of voice and data IP networks allows new methods for implementing speech processing algorithms to emerge.
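The idea that voice travels as ordinary data packets can be illustrated with a minimal sketch. It assumes 8 kHz sampling and 20 ms frames, and the names VoicePacket, packetize and reassemble are hypothetical; none of these details come from the text above. A real VoIP stack would also apply a speech codec and a transport protocol such as RTP, neither of which is modeled here.

```python
from typing import List, NamedTuple

SAMPLE_RATE_HZ = 8000  # assumption: narrowband telephony sampling rate
FRAME_MS = 20          # assumption: one packet carries 20 ms of audio


class VoicePacket(NamedTuple):
    """Hypothetical container for one slice of a voice stream."""
    sequence: int       # packet order, so the receiver can restore it
    timestamp: int      # index of the first sample carried by this packet
    payload: List[int]  # raw PCM samples (no codec applied in this sketch)


def packetize(samples: List[int],
              sample_rate_hz: int = SAMPLE_RATE_HZ,
              frame_ms: int = FRAME_MS) -> List[VoicePacket]:
    """Split a stream of samples into fixed-duration packets."""
    frame_len = sample_rate_hz * frame_ms // 1000
    packets = []
    for seq, start in enumerate(range(0, len(samples), frame_len)):
        packets.append(VoicePacket(seq, start, samples[start:start + frame_len]))
    return packets


def reassemble(packets: List[VoicePacket]) -> List[int]:
    """Rebuild the sample stream, tolerating out-of-order arrival."""
    samples: List[int] = []
    for packet in sorted(packets, key=lambda p: p.sequence):
        samples.extend(packet.payload)
    return samples
```

Because each packet is self-describing, any node on the path that can see the packets can in principle hand their payloads to a speech processing algorithm, which is the opportunity the converged network creates.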
Conventional technologies include matching techniques, such as that proposed by U.S. Pat. No. 5,897,616 entitled “Apparatus and Methods for Speaker Verification/Identification/Classification Employing Non-acoustic and/or Acoustic Models and Databases” issued to Kanevsky, et al., dated Apr. 27, 1999, which discloses a process of collecting a sample of the voice of the user as a separate procedure. It is implemented as a separate telephone IVR (Interactive Voice Response) dialog with the possibility of asking the user additional questions and requesting personal information. The apparent goal of the '616 invention is to authorize or grant access to the legitimate user, which forces the speaker to participate in the dialog and to spend additional time for this purpose.
Another example of a conventional hardwired or dedicated matching technique is U.S. Pat. No. 5,937,381 entitled “System for Voice Verification of Telephone Transactions” issued to Huang, et al., on Aug. 10, 1999. This patent discloses a system and a method for verifying the voice of a user conducting a traditional telephone transaction. The system and method include a mechanism for prompting the user to speak in a limited vocabulary consisting of twenty-four combinations of the words “four”, “six”, “seven” and “nine” arranged in a double two-digit number combination. The example uses a traditional telephone-based IVR system and also has the limitation of working with a predefined small vocabulary of password prompts. This limits its scope to operating in only one language and to performing a variation of text-dependent speaker verification.
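The size of that prompt vocabulary can be checked with a short sketch. It assumes that each prompt uses each of the four words exactly once, which gives 4! = 24 orderings and agrees with the count quoted above; the '381 patent's actual construction rule is not stated here, and the name enumerate_prompts and the prompt formatting are hypothetical.

```python
from itertools import permutations

WORDS = ("four", "six", "seven", "nine")


def enumerate_prompts(words=WORDS):
    """List every prompt that uses each word once, grouped as two pairs.

    Assumption: a "double two-digit number combination" means two
    two-digit numbers built from the four words without repetition.
    """
    return [f"{a}-{b}, {c}-{d}" for a, b, c, d in permutations(words)]


PROMPTS = enumerate_prompts()
```

Under this assumption, len(PROMPTS) is 24, and every prompt is drawn from the same four English words, which is why such a scheme stays text-dependent and tied to one language.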
Recent academic research in the field of speaker verification and speaker recognition has attempted to use the established ITU (International Telecommunication Union) compression standards used for VoIP transmission in conjunction with known algorithms for speech processing. For example, in the publication “Speaker Recognition Using G.729 Speech Codec Parameters” by T. F. Quatieri, et al., ICASSP 2000, two approaches are investigated for recovering speaker recognition performance from G.729 parameters. The first is a parametric approach that makes explicit use of G.729 parameters; the second is a nonparametric approach that uses the original MFB paradigm. In a different paper from the same conference entitled “GSM Speech Coding and Speaker Recognition” by L. Besacier, et al., ICASSP 2000, the influence of GSM speech coding on the performance of text-independent speaker recognition is investigated. The three existing GSM speech codec standards were considered, with results showing that good speaker recognition performance can be obtained by extracting the features directly from the encoded bit-stream.
In yet another example of previous matching technologies, U.S. Pat. No. 6,021,387 entitled “Speech Recognition Apparatus for Consumer Electronic Application” issued to Mozer, et al., on Feb. 1, 2000, a spoken word or phrase recognition device is described. The device uses a neural net to perform the recognition task and recognizes an unknown speaker saying a digit from zero through nine with an accuracy of between 94% and 97%. The '387 patent relates to speech recognition in the context of low-cost applications.
With the advent of new technologies over the Internet, these prior art technologies are poorly suited to adaptation for use in the Internet arena. The convergence of voice and data in the IP network, along with the increasing role of VoIP communications, indicates that the human voice will become intrinsically present in the network. This presence can be used to voice-enable new types of enhanced network applications and services.
The present invention seeks to provide an advantage over the prior art by utilizing the presence of the voice information over the Internet in new ways for voice-enabled applications. One of the objects of the present invention is to use the converged voice and data IP network for implementing automatic speaker verification, speaker identification, speech recognition and spoken text verification.