The present invention relates generally to speech technology and, more particularly, to a system and method for performing speaker verification or speaker identification.
The problem of authentication lies at the heart of nearly every transaction. Millions of people conduct confidential financial transactions over the telephone, such as accessing their bank accounts or using their credit cards. Authentication under current practice is far from foolproof. The parties exchange some form of presumably secret information, such as social security number, mother's maiden name or the like. Clearly, such information can be pirated, resulting in a false authentication.
One aspect of the present invention addresses the foregoing problem by providing a system and method for performing speaker verification. Speaker verification involves determining whether a given voice belongs to a certain speaker (herein called the “client”) or to an impostor (anyone other than the client).
Somewhat related to the problem of speaker verification is the problem of speaker identification. Speaker identification involves matching a given voice to one of a set of known voices. Like speaker verification, speaker identification has a number of attractive applications. For example, a speaker identification system may be used to classify voice mail by speaker for a set of speakers for which voice samples are available. Such capability would allow a computer-implemented telephony system to display on a computer screen the identity of callers who have left messages on the voice mail system.
While the applications for speaker verification and speaker identification are virtually endless, the solution to performing these two tasks has heretofore proven elusive. Recognizing human speech and particularly discriminating the speaker from other speakers is a complex problem. Rarely does a person speak even a single word the same way twice due to how human speech is produced.
Human speech is the product of air under pressure from the lungs being forced through the vocal cords and modulated by the glottis to produce sound waves that then resonate in the oral and nasal cavities before being articulated by the tongue, jaw, teeth and lips. Many factors affect how these sound producing mechanisms inter-operate. The common cold, for example, greatly alters the resonance of the nasal cavity as well as the tonal quality of the vocal cords.
Given the complexity and variability with which humans produce speech, speaker verification and speaker identification are not readily performed by comparing new speech with a previously recorded speech sample. Employing a high similarity threshold, to exclude impostors, may exclude the authentic speaker when he or she has a head cold. On the other hand, employing a low similarity threshold can make the system prone to false verification.
The present invention uses a model-based analytical approach to speaker verification and speaker identification. Models are constructed and trained upon the speech of known client speakers (and possibly in the case of speaker verification also upon the speech of one or more impostors). These speaker models typically employ a multiplicity of parameters (such as Hidden Markov Model or GMM parameters). Rather than using these parameters directly, the parameters are concatenated to form supervectors. These supervectors, one supervector per speaker, represent the entire training data speaker population.
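As an illustration only, the concatenation of model parameters into a supervector might be sketched as follows. The sketch assumes that the parameters being concatenated are per-state mean vectors; the function name `build_supervector` and the toy values are hypothetical and are not taken from the text above.

```python
def build_supervector(state_means):
    """Concatenate per-state mean vectors, in a fixed order, into one flat list."""
    supervector = []
    for mean in state_means:
        supervector.extend(mean)
    return supervector


# Hypothetical speaker models: 3 states, each with a 2-dimensional mean vector.
speaker_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
speaker_b = [[1.5, 2.5], [2.5, 3.5], [5.5, 6.5]]

# One supervector per training speaker.
supervectors = [build_supervector(model) for model in (speaker_a, speaker_b)]
```

The parameters must be concatenated in the same order for every speaker, so that a given position in the supervector always refers to the same model parameter.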
A linear transformation is performed on the supervectors, resulting in a dimensionality reduction that yields a low-dimensional space that we call eigenspace. We call the basis vectors of this eigenspace “eigenvoice” vectors or “eigenvectors”. If desired, the eigenspace can be further dimensionally reduced by discarding some of the eigenvector terms.
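One possible form of such a linear transformation is principal component analysis, sketched below with NumPy. The choice of principal component analysis, the function names `build_eigenspace` and `project`, and the random training data are assumptions made for illustration; the text above specifies only that the transformation is linear and reduces dimensionality.

```python
import numpy as np


def build_eigenspace(supervectors, num_eigenvoices):
    """Return the mean supervector and the leading basis vectors of the eigenspace."""
    X = np.asarray(supervectors, dtype=float)  # shape: (speakers, dimensions)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are orthonormal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Keeping only the first rows discards the lower-variance terms.
    eigenvoices = vt[:num_eigenvoices]
    return mean, eigenvoices


def project(supervector, mean, eigenvoices):
    """Represent one supervector as a point (coordinate vector) in eigenspace."""
    return eigenvoices @ (np.asarray(supervector, dtype=float) - mean)


# Hypothetical training set: 5 speakers, 12-dimensional supervectors.
rng = np.random.default_rng(0)
training = rng.normal(size=(5, 12))

mean, eigenvoices = build_eigenspace(training, num_eigenvoices=3)
coords = project(training[0], mean, eigenvoices)
```

Here the 12-dimensional supervector of the first speaker is reduced to 3 coordinates, one per retained basis vector.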
Next, each of the speakers comprising the training data is represented in eigenspace, either as a point in eigenspace or as a probability distribution in eigenspace. The former is somewhat less precise in that it treats the speech from each speaker as relatively unchanging. The latter reflects that the speech of each speaker will vary from utterance to utterance.
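The difference between the two representations can be sketched as follows. The sketch assumes that each speaker contributes several utterances, each already reduced to eigenspace coordinates, and that the distribution is summarized by a per-dimension mean and variance; both the summary chosen and the function names are illustrative assumptions.

```python
def as_point(utterance_coords):
    """Point representation: average the eigenspace coordinates of all utterances."""
    n = len(utterance_coords)
    dims = len(utterance_coords[0])
    return [sum(u[d] for u in utterance_coords) / n for d in range(dims)]


def as_distribution(utterance_coords):
    """Distribution representation: per-dimension mean and variance."""
    mean = as_point(utterance_coords)
    n = len(utterance_coords)
    variance = [
        sum((u[d] - mean[d]) ** 2 for u in utterance_coords) / n
        for d in range(len(mean))
    ]
    return mean, variance


# Hypothetical eigenspace coordinates of three utterances from one speaker.
utterances = [[1.0, 4.0], [3.0, 4.0], [2.0, 7.0]]

point = as_point(utterances)
mean, variance = as_distribution(utterances)
```

The point representation keeps only the average, whereas the distribution also records how far the speaker's utterances spread around that average.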
Having represented the training data for each speaker in eigenspace, the system may then be used to perform speaker verification or speaker identification.
New speech data is obtained and used to construct a supervector that is then dimensionally reduced and represented in the eigenspace. Speaker verification or speaker identification is then performed by assessing the proximity of the new speech data to prior data in eigenspace. The new speech from the speaker is verified if its corresponding point or distribution within eigenspace is within a threshold proximity to the training data for that client speaker. The system may reject the new speech as inauthentic if it falls closer to an impostor's speech when placed in eigenspace.
Speaker identification is performed in a similar fashion. The new speech data is placed in eigenspace and identified with that training speaker whose point or distribution in eigenspace is closest.
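A minimal sketch of the verification decision follows, assuming the point representation and Euclidean distance as the proximity measure. The distance measure, the function name `verify`, and the threshold values are assumptions; the text above does not fix any of them.

```python
import math


def verify(new_point, client_point, threshold, impostor_points=()):
    """Accept only if the new speech lies within the threshold of the client
    and is closer to the client than to every known impostor."""
    d_client = math.dist(new_point, client_point)
    if d_client > threshold:
        return False
    return all(d_client < math.dist(new_point, imp) for imp in impostor_points)


# Hypothetical two-dimensional eigenspace.
client = [0.0, 0.0]
impostors = [[3.0, 4.0]]

# Close to the client and far from the impostor.
accepted = verify([0.3, 0.4], client, threshold=1.0, impostor_points=impostors)

# Within a loose threshold, but closer to the impostor than to the client.
rejected = verify([2.7, 3.6], client, threshold=5.0, impostor_points=impostors)
```

The second call shows why impostor data can help: a loose threshold alone would have accepted the speech.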
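Identification by closest training speaker might be sketched as follows, again assuming the point representation and Euclidean distance. The speaker names and coordinates are hypothetical.

```python
import math


def identify(new_point, training_points):
    """Return the name of the training speaker whose eigenspace point is closest."""
    return min(
        training_points,
        key=lambda name: math.dist(new_point, training_points[name]),
    )


# Hypothetical eigenspace points for three training speakers.
training_points = {
    "alice": [0.0, 0.0],
    "bob": [5.0, 5.0],
    "carol": [-4.0, 2.0],
}

best_match = identify([4.0, 4.5], training_points)
```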
Assessing proximity between the new speech data and the training data in eigenspace has a number of advantages. First, the eigenspace represents each entire speaker in a concise, low-dimensional way, not merely a selected few features of each speaker. Second, proximity computations performed in eigenspace can be made quite rapidly, as there are typically considerably fewer dimensions to contend with in eigenspace than there are in the original speaker model space or feature vector space. Third, the system does not require that the new speech data include each and every example or utterance that was used to construct the original training data. Through techniques described herein, it is possible to perform dimensionality reduction on a supervector for which some of its components are missing. The resulting point or distribution in eigenspace nevertheless will represent the speaker remarkably well.
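To illustrate how a supervector with missing components can still be placed in eigenspace, the sketch below fits the eigenspace coordinates by ordinary least squares using only the observed components. Least squares is a stand-in chosen for illustration and is not necessarily the technique referred to above; the function name `project_with_missing` and the toy basis are likewise assumptions.

```python
import numpy as np


def project_with_missing(supervector, observed, mean, eigenvoices):
    """Estimate eigenspace coordinates from the observed components only."""
    observed = np.asarray(observed, dtype=bool)
    # Restrict each basis vector to the observed positions: (n_observed, K).
    basis = eigenvoices[:, observed].T
    target = np.asarray(supervector, dtype=float)[observed] - mean[observed]
    coords, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return coords


# Hypothetical eigenspace: 2 orthonormal basis vectors in a 4-dimensional space.
eigenvoices = np.array([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0, 1.0]]) / np.sqrt(2.0)
mean = np.zeros(4)

# Build a full supervector from known coordinates, then hide one component.
true_coords = np.array([2.0, -1.0])
full = mean + true_coords @ eigenvoices
observed = [True, True, False, True]
partial = np.where(observed, full, np.nan)

recovered = project_with_missing(partial, observed, mean, eigenvoices)
```

In this toy case the observed components still determine both coordinates, so the original coordinates are recovered. If too few components are observed, the least-squares problem becomes underdetermined and the estimate is no longer unique.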