Speaker verification is a speech technology employed in a variety of applications that require or benefit from protection against fraudulent or unauthorized access to information and/or secure areas. For example, speaker verification systems may be used to verify the identity of a speaker prior to authorizing the speaker to access sensitive or confidential information and/or to enter a secure area of a building or other locale to which access is limited to authorized personnel. The financial services industry, for instance, may benefit from speaker verification as a means to provide security in its online or telephone banking systems to replace or supplement conventional security schemes such as password protections.
Systems that employ speaker verification typically attempt to verify the claimed identity of a speaker undergoing verification by matching the voice of the speaker with a “voice print” obtained from the person whose identity the speaker is claiming. A voice print refers to any type of model that captures one or more identifying characteristics of a person's voice. Typically, a voice print is obtained at the time a speaker verification system enrolls a user by prompting the user to utter a particular enrollment utterance or utterances to obtain a voice signal from the user. The enrollment utterance may comprise one or more words selected by the system, for example, due to the presence of a variety of vowel, nasal or other sounds in the words that tend to carry information specific to the speaker. The voice signal obtained from the user may then be analyzed to extract characteristic features of the voice signal to form, at least in part, a voice print that models the speech of the enrolled user.
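By way of illustration only, enrollment can be sketched as combining the characteristic features of several enrollment utterances into a single model. The sketch below assumes that each utterance has already been reduced to a fixed-length list of numeric features and that the voice print is simply their element-wise mean. Both are assumptions made for the example, as is the name `build_voice_print`; a voice print may be any type of model.

```python
def build_voice_print(enrollment_features):
    """Average per-utterance feature vectors into one voice print.

    `enrollment_features` is a list of equal-length lists of floats, one
    per enrollment utterance (a hypothetical representation chosen for
    this sketch).
    """
    if not enrollment_features:
        raise ValueError("at least one enrollment utterance is required")
    dim = len(enrollment_features[0])
    if any(len(vec) != dim for vec in enrollment_features):
        raise ValueError("all feature vectors must have the same length")
    count = len(enrollment_features)
    # Element-wise mean across all enrollment utterances.
    return [sum(vec[i] for vec in enrollment_features) / count
            for i in range(dim)]
```

For instance, two enrollment utterances with features `[1.0, 2.0]` and `[3.0, 4.0]` would yield the voice print `[2.0, 3.0]` under this simple averaging scheme.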
Prior to granting access, the speaker verification system may prompt a speaker undergoing verification to utter a challenge utterance to obtain a voice signal to be matched with the voice print of the enrolled user whose identity the speaker is asserting. The term “challenge utterance” refers to one or more words that a speaker verification system prompts a speaker undergoing verification to utter so that the voice characteristics of the speaker can be compared with voice characteristics of the enrolled user (e.g., as modeled by the associated voice print). Based on the similarity between the characteristic features in the voice signal obtained from the speaker and the voice print obtained at enrollment, the speaker verification system can either accept or reject the asserted identity of the speaker.
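The accept/reject decision described above can be sketched as a similarity score compared against a threshold. The sketch assumes that both the voice print and the challenge utterance are represented as equal-length feature vectors, that cosine similarity serves as the similarity measure, and that 0.8 is a suitable threshold. None of these is specified here; they are illustrative choices, as are the function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # No usable signal: treat as no similarity.
    return dot / (norm_a * norm_b)


def verify_speaker(voice_print, challenge_features, threshold=0.8):
    """Accept the asserted identity if similarity meets the threshold.

    The threshold value is a hypothetical default for this sketch.
    """
    return cosine_similarity(voice_print, challenge_features) >= threshold
```

Raising the threshold in such a scheme would be expected to reduce false acceptances at the cost of more false rejections, and lowering it would do the opposite.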
Speaker verification may have significant security advantages over conventional security measures such as passwords, personal identification numbers (PINs), etc. For example, a person's voice may be uniquely tied to the speaker's identity and therefore less susceptible to being obtained via theft and less vulnerable to being discovered by hackers. Despite the security enhancements that speaker verification affords, however, state-of-the-art digital recorders are capable of recording a speaker's voice with enough fidelity to trick conventional speaker verification systems using a technique known as a playback attack.
Perpetrators of playback attacks have devised various schemes to elicit one or more utterances from an enrolled user that include the challenge words for the speaker verification system being attacked. The perpetrator secretly records the utterance(s) and plays back the recording in response to a challenge from the speaker verification system to trick the system into believing that the enrolled user is present and uttering the challenge words. Thus, playback attacks may present a substantial security risk to institutions employing conventional speaker verification systems. Some conventional speaker verification systems have attempted to thwart playback attacks by prompting the user to speak a series of random digits. However, these efforts may not be entirely effective and such conventional systems are still susceptible to playback attacks.
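The random-digit prompting used by some conventional systems might be sketched as follows. The six-digit length, the function names and the prompt wording are assumptions made for the example and are not taken from any particular system.

```python
import secrets


def random_digit_challenge(length=6):
    """Return a list of `length` random digits for the speaker to utter."""
    if length <= 0:
        raise ValueError("length must be positive")
    # secrets.randbelow(10) yields an unpredictable integer in 0..9.
    return [secrets.randbelow(10) for _ in range(length)]


def challenge_prompt(digits):
    """Format the digits as a spoken-challenge prompt (hypothetical wording)."""
    return "Please say: " + " ".join(str(d) for d in digits)
```

Because the vocabulary in such a scheme is limited to ten digits, a perpetrator who has recorded the enrolled user speaking each digit may still be able to assemble a matching response, which is consistent with the susceptibility noted above.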
The accuracy of a speaker verification system may be affected by a number of factors that cause voice signals obtained at enrollment to differ from those obtained during a challenge/response session, even when the voice signals are produced by the same speaker. For example, over time, the characteristics of a person's vocal tract age, resulting in changes in the sound of the person's voice. Thus, voice aging may cause false negatives to occur because a person's voice has aged sufficiently that its characteristics no longer closely match the voice print obtained during enrollment. Other changes that may reduce the accuracy of speaker verification include voice changes brought about by illness (e.g., cold, congestion or chronic illness), differences in the handsets used during enrollment and any subsequent challenge/response session (e.g., differences in cell phone versus land line), ambient noise present during the challenge/response sessions, etc.
Adaptation is a process of updating a voice print over time using voice information obtained from a speaker at one or more times subsequent to enrollment to model any voice changes that might have occurred. For example, a speaker verification system may, from time to time, use a voice signal obtained during a challenge/response session of a speaker who is subsequently verified by the system to incorporate characteristic features of the aged or changed voice into the model (i.e., into the voice print). Such adaptation techniques may allow a voice print to evolve over time to maintain satisfactory recognition accuracy even in the face of changes in the person's voice.
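One simple way to picture adaptation is as a weighted blend of the existing voice print with features from a session in which the speaker was verified. The sketch below assumes feature-vector voice prints and an exponential moving average with an adaptation rate of 0.1. The representation, the update rule, the rate and the name `adapt_voice_print` are all assumptions for the example, not a description of any particular system.

```python
def adapt_voice_print(voice_print, session_features, verified, rate=0.1):
    """Blend verified session features into the voice print.

    Returns a new list; the voice print is left unchanged when the
    speaker was not verified, so that rejected sessions do not
    influence the model.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if len(voice_print) != len(session_features):
        raise ValueError("feature vectors must have the same length")
    if not verified:
        return list(voice_print)
    # Exponential moving average: keep (1 - rate) of the old model and
    # move `rate` of the way toward the newly observed features.
    return [(1.0 - rate) * old + rate * new
            for old, new in zip(voice_print, session_features)]
```

A small rate in such a scheme lets the voice print follow gradual changes such as voice aging while limiting the influence of any single session.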