1. Field of the Invention
The present invention relates to speaker identity verification methods and systems.
2. Description of the Related Art
Automatic speech recognition (ASR), in general, and speaker identity verification (SIV) applications, in particular, are used in network-based applications to provide secure access to online information or a physical facility. Using an SIV application, a caller may speak into a telephone device to gain access to a secure device via a telephone network. The SIV application verifies the identity of the caller based on his/her speech.
Early network-based SIV systems created a voice profile, or voice print, for a pre-qualified user under a unique personal identification number (PIN). In an initial enrollment or registration session, the system asks the user to record a few utterances of certain texts. Such text-dependent schemes may use 10-digit telephone numbers, special alphanumeric strings having a certain meaning (e.g. “ABC1234”), or public/group passwords (e.g. “Dallas Cowboy”). The recorded speech material, about 10 to 20 seconds in length, is used to construct a voice profile or voice print for the user under a system-wide unique PIN.
In subsequent verification sessions, a caller first makes a speaker identity claim (SIC) using a valid PIN in a proper modality such as voice, touch-tone, or a smart card. The system uses the PIN to initialize an SIV engine based on the previously-created voice profile associated with this PIN. Thereafter, the system asks the caller to speak a few phrases in order to determine if the voice matches the voice profile. This process is known as a two-step process: (a) get a PIN and (b) verify the SIC using additional speech materials.
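The two-step process described above can be sketched as follows. This is only an illustrative outline: the `PROFILES` store, the three-element feature vectors, the cosine-similarity score, and the 0.95 threshold are all assumptions made for the example and are not taken from any particular SIV system.

```python
import math

# Hypothetical store: PIN -> enrolled voice profile (a toy feature vector).
PROFILES = {"4821": [0.9, 0.1, 0.4]}

def cosine(a, b):
    """Cosine similarity, used here as a stand-in for a real voice-matching score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify_two_step(pin, features, threshold=0.95):
    """Step (a): look up the profile claimed by the PIN.
    Step (b): compare features from the additional speech against that profile."""
    profile = PROFILES.get(pin)
    if profile is None:
        return False  # unknown PIN: the identity claim cannot be initialized
    return cosine(profile, features) >= threshold
```

In this sketch, the PIN only selects which profile to test against; acceptance depends entirely on the score computed from the additional speech.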
More recent systems use a one-step process. The system asks the caller to speak his/her PIN. Using an embedded ASR engine, the system first recognizes the PIN that was spoken. Thereafter, the system retrieves the voice profile registered under the PIN, and compares the voice characteristics extracted from the speech (e.g. the spoken PIN) against the claimed voice profile. To prevent an imposter from using a recorded source (e.g. one obtained from secretly taped conversations with an authorized user) to break into this one-step SIV process, some systems of this type generate a sequence of random digits (e.g. “one five two four”) and then ask the caller to say the sequence.
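The random-digit prompt described above might be sketched as follows. The function names, the four-digit default length, and the 0.8 threshold are hypothetical; the ASR transcript and the voice-match score are passed in as plain values because the recognition and scoring engines themselves are outside the scope of the sketch.

```python
import random

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def make_challenge(length=4, rng=None):
    """Return a random digit sequence, e.g. ['one', 'five', 'two', 'four']."""
    rng = rng or random.SystemRandom()  # OS-backed randomness by default
    return [rng.choice(DIGIT_WORDS) for _ in range(length)]

def check_response(challenge, recognized_words, voice_score, threshold=0.8):
    """Accept only if the ASR transcript equals the prompt AND the
    voice-match score (assumed to come from an SIV engine) is high enough."""
    return recognized_words == challenge and voice_score >= threshold
```

Because the prompt changes on every call, a single fixed recording of the authorized user will generally fail the transcript check even when its voice characteristics match.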
With increasingly sophisticated digital recording technology, such as an MP3-enabled device with a telephony interface, it is conceivable that imposters could compose, on demand, a digit sequence using previously-recorded digits spoken by a true speaker. In a Voice-over-Internet-Protocol (VoIP)-based SIV scenario, the imposter could use real-time digital signal processing (DSP) technology to concatenate individual digits recorded earlier to form the required digit sequence and then send a data packet to a remote SIV server.
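To make the weakness concrete, the following toy sketch shows how little logic the concatenation requires once per-digit recordings exist. The `RECORDED` table and its integer "samples" are placeholders invented for illustration; no real audio or network handling is shown.

```python
# Hypothetical clips: digit word -> previously captured samples (toy integers).
RECORDED = {"one": [1, 1], "two": [2, 2], "four": [4, 4], "five": [5, 5]}

def splice(challenge, clips=RECORDED):
    """Concatenate stored per-digit clips in the order the prompt requires.
    Returns None if any prompted digit was never recorded."""
    out = []
    for word in challenge:
        if word not in clips:
            return None
        out.extend(clips[word])
    return out
```

Each spliced segment still carries the true speaker's voice characteristics, so a random digit prompt alone does not rule out this kind of replay.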