To provide security for transactions conducted via voice, e.g., over the telephone, it is often desirable to authenticate the speaker. One existing method for authenticating speakers in a telephone transaction is through a personal identification number (PIN) or telephone PIN (TPIN). While generally referred to as a “PIN,” the identification can be other than a number, e.g., a voiced phrase or an encoded data stream. Where this application refers to “voiced PIN,” “keyed PIN,” “PIN,” “authentication information,” etc., the full range of identification means is implied.
For example, a telephone banking user provides the PIN via voice or telephone keypad in order to inquire as to her account balance. Such an approach to authentication is subject to being compromised, for example, by a third party recording the voiced PIN or decoding the keyed PIN. The recorded or decoded information can then be used for unauthorized access to the account.
One potential solution involves voiceprint authentication, e.g., matching characteristics of a user's voice over the communications channel. Some embodiments of this approach use a training phrase, e.g., “open sesame.” A user repeats the training phrase until sufficient characteristics of the user's voice saying the training phrase have been collected. When executing a transaction, the user speaks the training phrase (also referred to as a “pass phrase”); if the characteristics of the spoken training phrase match the stored characteristics within an acceptable level of confidence, the user is authenticated. This approach is still open to exploitation by recording.
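The enrollment and matching steps described above can be illustrated with a minimal sketch. It assumes each utterance has already been reduced to a fixed-length feature vector; the function names, the use of cosine similarity as the confidence measure, and the 0.95 threshold are all hypothetical choices made for illustration, not a method taken from any particular system.

```python
import math


def enroll(samples):
    """Average several feature vectors (one per repetition of the
    training phrase) into a single stored template."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def cosine_similarity(a, b):
    """Similarity in [-1, 1]; 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def authenticate(template, attempt, threshold=0.95):
    """Accept the attempt if it matches the template within the
    (assumed) confidence threshold."""
    return cosine_similarity(template, attempt) >= threshold


# Three repetitions of the training phrase (made-up feature values).
template = enroll([[1.0, 2.0, 3.0], [1.2, 1.8, 3.1], [0.9, 2.1, 2.9]])
genuine_accepted = authenticate(template, [1.0, 2.0, 3.0])
impostor_accepted = authenticate(template, [3.0, -1.0, 0.5])
```

Note that a faithful recording of a genuine utterance yields the same feature vector as the original and is therefore accepted by such a check, which is the weakness noted above.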
A variation on this approach relies on characteristics of the user's voice that are not specific to training phrases. This variation typically requires a much larger training set; the time required to obtain that training set may serve as a disincentive to enrollment. In addition, the processing resources required are likely much greater for this variation. Further, since the potential for false negatives and false positives is generally greater when the training is not based on a known set of pass phrases, this approach has a major disadvantage with respect to user acceptance.
Approaches have been developed to mitigate the risk of exploitation by record/playback of a speaker's authentication utterances. One such approach involves identifying telltale characteristics and limitations of a playback device (e.g., the absence or presence of special harmonics, modulations, or other special signal characteristics) present in the playback of the illicitly recorded utterance (voice, PIN, or otherwise). This approach would be effective only where the telltale characteristics were present within the bandwidth of the communication channel.
Another approach involves identifying the natural variation between separate instances of a spoken phrase. If such variations are not present, the risk that the utterance or TPIN is a recording is increased. Substantial variation would not be present between a high-fidelity recording and its spoken original, or between separate high-fidelity playbacks of the same recording. Nevertheless, this approach can be defeated, albeit with some technical sophistication, by introducing artificial variations, or in a lower-tech fashion by illicitly recording multiple versions of the spoken phrase.
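A minimal sketch of such a variation check follows. It assumes utterances are represented as fixed-length feature vectors and that the system retains vectors from earlier sessions; the function names, the Euclidean distance measure, and the `min_variation` value are hypothetical and chosen only for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def looks_like_replay(attempt, previous_attempts, min_variation=0.05):
    """Flag the attempt if it is suspiciously close to any earlier
    utterance, i.e. it lacks the natural variation of live speech."""
    return any(distance(attempt, prev) < min_variation
               for prev in previous_attempts)


# Feature vectors from two earlier sessions (made-up values).
history = [[1.00, 2.00, 3.00], [1.10, 1.90, 3.05]]

replayed = looks_like_replay([1.00, 2.00, 3.00], history)  # identical to a past utterance
live = looks_like_replay([1.05, 2.08, 2.95], history)      # differs from both
```

An adversary who perturbs a recording by more than `min_variation`, or who plays back a recording the system has not previously seen, would not be flagged by this check.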
Training on several different user phrases could be used to introduce diversity to the authentication phrase used in any specific transaction, for example by randomly alternating the required authentication response among the different phrases. This diversity could mitigate the risk of false authentication but, as with other approaches, is susceptible to a reasonably persistent adversary who records multiple user authentication sessions. In addition, diversity among authentication phrases requires more training time and hence may reduce user acceptance.
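Random alternation among enrolled phrases, and its exposure to an adversary who has recorded enough sessions, can be sketched as follows. The phrase list and function names are hypothetical; `secrets.choice` from the Python standard library is used here only as one reasonable way to make the selection unpredictable.

```python
import secrets


def pick_challenge(enrolled_phrases):
    """Choose which enrolled phrase the user must speak this session."""
    return secrets.choice(enrolled_phrases)


def adversary_can_answer(challenge, recorded_phrases):
    """True if the adversary already holds a recording of the phrase
    requested in this session."""
    return challenge in recorded_phrases


# Hypothetical set of phrases the user trained on.
enrolled = ["open sesame", "blue horizon", "seven lanterns"]
challenge = pick_challenge(enrolled)

# After recording one session the adversary succeeds only some of the time;
# after recording every phrase, the alternation offers no protection.
partial_recordings = {"open sesame"}
full_recordings = set(enrolled)
always_defeated = all(adversary_can_answer(p, full_recordings) for p in enrolled)
```

Each additional enrolled phrase lowers the chance that a single recording matches the challenge, but it also adds to the training time required of the user.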