The invention is directed to a system and method for detecting a recorded voice that can be used to determine whether an individual is employing a recording device in an attempt to defraud an automatic speaker recognition ("ASR") system.
1. Field of the Invention.
The invention relates to the fields of digital speech processing and speaker recognition.
2. Description of Related Art
Voice identification and verification systems, sometimes known as automatic speaker recognition ("ASR") systems, attempt to match the voice of the person whose identity is undergoing identification or verification with the voice of a known user enrolled in the system. ASR systems have, in recent years, become quite reliable in recognizing speech samples generated by enrolled users. Accordingly, ASR systems could potentially be employed in a wide variety of applications.
For example, many banks permit customers to transfer money from their accounts over the phone. A bank normally provides a customer with a numeric password that must be entered via a touch-tone phone before the customer can access his/her account. Should that password be stolen, however, an imposter could gain access to the customer's account. Consequently, banks could add a measure of security by employing an ASR system, wherein a customer's voice must be verified before gaining access to his/her account. ASR systems may also be used, among other things, to: protect personal records; provide physical building security by controlling access to doors; and verify the presence of a convict subject to home detention.
ASR systems can be divided generally into two categories: text-dependent and text-independent. A text-dependent ASR system requires that the user speak a specific password or phrase (the "password") to gain access. This password is determined by the system or by the user during enrollment, and the system generates and stores a "voice print" from samples of the user saying his/her particular password. A voice print is a mathematical model generated from certain of the user's speech characteristics exhibited during enrollment. During each subsequent verification attempt, the user is prompted again to speak the password. The system extracts the same speech characteristics from the verification sample and compares them to the voice print generated during enrollment.
In a text-independent ASR system, the system builds a more general model of a user's voice characteristics during enrollment. This usually requires the user to speak several sentences during enrollment rather than a simple password so as to generate a complete set of phonemes on which the model may be based. Verification in a text-independent system can involve active prompting or passive monitoring. In an active-prompting system the user is prompted to state specific words or phrases that are distinct from the words or phrases spoken during enrollment. Such systems check first to ensure that the prompted words were spoken, and, second, to determine whether an authorized user spoke those words. In a passive-monitoring system, the user is expected to speak conversationally after access, and the system monitors the conversation passively until it can determine whether the user is authorized. In either event, verification usually requires the user to speak eight to ten seconds of speech compared with the one to two seconds required in a text-dependent system.
Despite their potential for widespread use, ASR systems have enjoyed only limited application to date. One reason for this is that an imposter can defraud an ASR system by playing a recording of an authorized user's voice. If the recording is of a high enough quality, an ASR system recognizes the recorded voice as that of an authorized user and grants access. A variety of recording devices can be used to defraud ASR systems, including wiretapping and tape recording devices. For example, unauthorized users have bugged public telephones with a tape recording device mounted in the vicinity of the phone booth or in the receiver of the phone. In addition, digital voice or speech files and digital audio tapes of an authorized user can be stolen by an imposter and used to gain unauthorized access to the systems protected by ASR techniques.
Some text-independent systems may inherently avoid this problem. In an active-prompting text-independent system, an imposter will not have advance notice of the phrase required to be spoken during verification and is, therefore, unlikely to have the proper phrase recorded. Further, in a passive-monitoring text-independent system, the imposter is required to have the entire conversation of an authorized user recorded to gain access.
As discussed, however, text-independent systems have drawbacks that make them ill-suited to many applications. For example, active-prompting text-independent systems can be less user-friendly than text-dependent systems. A bank customer is likely to complain of having to speak long phrases to gain access to his/her accounts. In addition, there are many applications in which a user is not expected to speak at all after access, thus making passive-monitoring text-independent systems less useful.
U.S. Pat. No. 5,548,647, entitled "Fixed Text Speaker Verification Method and Apparatus," issued to Naik et al. on Aug. 20, 1996, provides one method for reducing fraudulent access to a text-dependent system. In the disclosed method, an authorized user enrolls using a number of passwords, such as the numbers one through nine. During verification, the user is prompted to speak a random one or several of the passwords. Without advance notice of the specific password required for access, an imposter is less likely to have immediate access to the proper recorded password.
Nevertheless, the method taught by Naik has some drawbacks. For example, an imposter who wiretaps an authorized user's phone may eventually be able to collect recordings of each of the randomly prompted passwords and replay the correct password(s) quickly enough during verification to gain access. Moreover, in some settings, an authorized user may purposefully attempt to defraud the ASR system using a recording of his/her own voice. For example, where a convict is subject to home detention, he/she may record all of the random passwords in his/her own voice. Then, when the ASR system calls to ensure that the convict is in his/her home at a prescribed time, a cohort could play back the correct password and defraud the system.
What is needed is a reliable system and method to detect the use of a recorded voice over a communications channel.
What is needed is a reliable system and method to prevent fraudulent access to ASR-protected systems using the recorded voice of an authorized user.
The method and apparatus of the present invention provide significant improvements over the prior art. The present invention employs a variety of techniques that, alone or in combination, provide a reliable system for detecting the use of a recorded voice over a communications channel. Further, the present invention can be employed to improve the ability of both text-dependent and text-independent ASR systems to detect the fraudulent use of a recorded voice. The present invention provides improved performance over the prior art by employing the following techniques and modules, alone or in combination: (1) analyzing the temporal characteristics of the user's speech; (2) analyzing the characteristics of the channel over which the user's voice is transmitted; (3) training a pattern classifier to recognize the difference between live and recorded speech; and (4) employing an "audio watermark" to detect use of a recording of a previous enrollment or verification attempt.
Most people cannot naturally repeat a word or phrase exactly the same way. Although the human ear may not be able to hear the difference when an individual repeats a particular word, slight changes in the individual's speech are inevitable. In one embodiment, the claimed invention determines whether certain temporal characteristics of a voice sample captured during a verification attempt match closely with characteristics of a voice sample obtained during an enrollment phase or previous verification attempts. If so, the use of a recording is detected.
For example, each speech sample has a particular "pitch contour" (the change in pitch over time). If the pitch contour matches too closely with the pitch contour from a previously stored verification attempt, the system detects the use of a recording and denies verification. Other characteristics that cannot be repeated naturally and, therefore, may be employed in this embodiment include: loudness contour, zero crossings, duration of actual speech content, and verification score from an ASR system.
Each communications channel has unique, detectable characteristics. For example, repeated phone calls made between two stationary phones should always exhibit the same channel characteristics, within certain tolerances. By contrast, a call made from a cellular phone exhibits different channel characteristics. If an imposter records an authorized user's voice over a communications channel (e.g., by wiretapping), he/she also records the channel characteristics of that communication. The present invention utilizes channel characteristics to detect use of a recording in several different applications.
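The "too close a match" test on pitch contours can be illustrated with a minimal sketch. It assumes pitch has already been extracted as a list of per-frame values in Hz; the fixed-length linear resampling, the relative mean-absolute-difference measure, and the `min_distance` threshold are all illustrative assumptions rather than details taken from this description.

```python
def resample(contour, n=50):
    """Linearly interpolate a pitch contour (Hz per frame) to n points."""
    if len(contour) < 2:
        raise ValueError("need at least two pitch values")
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = min(int(pos), len(contour) - 2)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[lo + 1] * frac)
    return out


def contour_distance(a, b, n=50):
    """Mean absolute pitch difference, relative to the mean pitch of a."""
    ra, rb = resample(a, n), resample(b, n)
    mean_abs = sum(abs(x - y) for x, y in zip(ra, rb)) / n
    return mean_abs / (sum(ra) / n)


def looks_like_replay(current, stored_contours, min_distance=0.01):
    """Flag a sample whose contour is nearly identical to a stored one.

    min_distance is a hypothetical tuning value: natural repetitions are
    assumed to differ by more than this, replayed recordings by less.
    """
    return any(contour_distance(current, s) < min_distance
               for s in stored_contours)
```

Called with the contour of the current attempt and the contours stored from enrollment and earlier attempts, `looks_like_replay` returns `True` when any stored contour is matched more closely than natural speech is assumed to allow.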
First, in an application where the system expects the channel characteristics to be identical for each verification (e.g., home detention system), the present invention detects a potential fraud where the channel characteristics of the current call do not closely match the stored characteristics from enrollment.
Second, in an application where a user is prompted to say several random passwords, the present invention detects a recording when the channel characteristics change significantly from one password to another. This indicates that the passwords were recorded at different times, on different channels.
Finally, in an application where initial password verification is followed by either user speech or user-input touch tones, the present invention detects a recording if the channel characteristics detected during password verification do not match the post-verification channel characteristics. This indicates that an imposter used a recording during password verification.
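The three channel checks above share one comparison: does one channel estimate match another within a tolerance? The following sketch assumes each channel estimate is already available as a fixed-length vector (for example, an averaged log spectrum in dB); that representation, the root-mean-square distance, and the `tolerance` value are hypothetical choices made only for illustration.

```python
import math


def channel_distance(a, b):
    """Root-mean-square difference between two channel estimates."""
    if len(a) != len(b):
        raise ValueError("channel estimates must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def channel_mismatch(reference, observed, tolerance=1.0):
    """True when the observed channel differs from the reference.

    Covers the first and third cases: enrollment versus current call,
    or password segment versus post-verification speech or touch tones.
    """
    return channel_distance(reference, observed) > tolerance


def channels_inconsistent(estimates, tolerance=1.0):
    """True when any password's channel differs from the first one.

    Covers the second case: several prompted passwords in one call.
    """
    first = estimates[0]
    return any(channel_mismatch(first, e, tolerance) for e in estimates[1:])
```

In practice the tolerance would have to be tuned to the natural variation of the channel, which the text describes only as "within certain tolerances."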
Another embodiment of the present invention employs a pattern classifier to determine whether a speech sample is live or recorded. Live and recorded speech samples are digitized and converted into a particular format, such as spectral feature vectors. The formatted data is then fed to a pattern classifier, such as a Neural Tree Network ("NTN"), which develops models of live versus recorded speech. The pattern classifier can then be used to make decisions whether a particular speech sample is live or recorded.
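The train-then-classify flow can be sketched with a much simpler stand-in for the pattern classifier. The code below uses a nearest-centroid classifier, not a Neural Tree Network, and assumes each speech sample has already been reduced to a fixed-length feature vector; the function names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def train(live_vectors, recorded_vectors):
    """Build one model (here, a centroid) per class from labeled samples."""
    return {"live": centroid(live_vectors),
            "recorded": centroid(recorded_vectors)}


def classify(model, vector):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], vector))
    return min(model, key=squared_distance)
```

A real system would substitute a stronger classifier and real spectral features; only the overall shape (labeled live and recorded samples in, a live-or-recorded decision out) follows the text.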
"Watermarking" is a technique of imposing a transparent seal of authenticity that cannot be duplicated easily by an imposter. A further embodiment of the present invention stamps an "audio watermark" on each enrollment and verification attempt by an authorized user. For example, in one embodiment a series of DTMF tones is transmitted to the user's telephone immediately after he/she is prompted to speak a password during enrollment and verification attempts. If an enrollment or verification attempt is recorded, the audio watermark is recorded along with the authorized user's voice. An unauthorized user who employs that recording then plays back the audio watermark, which is detected, and the unauthorized user is denied verification.
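Detecting a replayed DTMF watermark comes down to spotting DTMF tone pairs in the incoming audio. The sketch below uses the Goertzel algorithm on the standard DTMF row and column frequencies; the 8 kHz sample rate, the 40 ms segment length, the `min_share` threshold, and the helper `dtmf_tone` (used only to synthesize test input) are assumptions for illustration and are not specified in this description.

```python
import math

ROW_FREQS = (697, 770, 852, 941)        # standard DTMF low group, Hz
COL_FREQS = (1209, 1336, 1477, 1633)    # standard DTMF high group, Hz
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, rate):
    """Squared magnitude of the signal component at freq (Goertzel)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_dtmf(samples, rate=8000, min_share=0.1):
    """Return the DTMF key present in the segment, or None.

    min_share is a hypothetical threshold on tone power normalized by
    total segment energy; both tones of a pair must exceed it.
    """
    energy = sum(x * x for x in samples)
    if energy == 0.0:
        return None
    scale = len(samples) * energy
    rows = [goertzel_power(samples, f, rate) / scale for f in ROW_FREQS]
    cols = [goertzel_power(samples, f, rate) / scale for f in COL_FREQS]
    r = max(range(4), key=rows.__getitem__)
    c = max(range(4), key=cols.__getitem__)
    if rows[r] < min_share or cols[c] < min_share:
        return None
    return KEYPAD[r][c]


def dtmf_tone(key, rate=8000, duration=0.04):
    """Synthesize one DTMF key as a list of samples (test helper)."""
    for r, row in enumerate(KEYPAD):
        if key in row:
            f1, f2 = ROW_FREQS[r], COL_FREQS[row.index(key)]
            break
    else:
        raise ValueError("not a DTMF key: %r" % key)
    n = int(rate * duration)
    return [0.5 * math.sin(2 * math.pi * f1 * t / rate)
            + 0.5 * math.sin(2 * math.pi * f2 * t / rate)
            for t in range(n)]
```

A verifier could run `detect_dtmf` over successive segments of the caller's audio and deny verification if the keys of a previously transmitted watermark appear in the speech the caller sends back.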
While the different embodiments can be employed independently, they can also be joined either serially or in parallel. With such combinations, the decision whether to deny verification can be made either strictly (e.g., if the user fails under one technique, verification is denied); by majority decision (e.g., verification is denied only if the user fails under a preset number of techniques); or by an average of "confidence scores" (e.g., each technique produces a confidence score, and the average of those confidence scores is used to make a decision regarding verification).
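The three combination rules can be sketched as follows. The sketch assumes each technique reports either a pass/fail flag or a confidence score between 0 and 1, where higher means more likely live speech; that score convention and the 0.5 threshold are illustrative assumptions.

```python
def deny_strict(failed):
    """Deny if the user fails under any single technique."""
    return any(failed)


def deny_majority(failed, min_failures):
    """Deny only if the user fails under at least min_failures techniques."""
    return sum(1 for f in failed if f) >= min_failures


def deny_average(scores, threshold=0.5):
    """Deny if the mean confidence score falls below the threshold.

    Scores are assumed to lie in [0, 1], higher meaning more likely live;
    the threshold is a hypothetical tuning value.
    """
    return sum(scores) / len(scores) < threshold
```

For example, with one failure out of three techniques, the strict rule denies verification while a majority rule requiring two failures does not.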
Accordingly, it is an object of the present invention to provide a reliable system and method to detect the use of a recorded voice over a communications channel. It is a further object of the present invention to provide a reliable system and method to prevent fraudulent access to ASR-protected systems using the recorded voice of an authorized user.