1. Field of the Invention
This invention relates generally to speech processing, and more particularly to a system and method for robust speaker verification employing temporal decorrelation.
2. Description of the Related Art
Current systems and methods of speaker voice verification require voice enrollment prior to actual verification usage. During such enrollment, a model of the speech particular to each speaker to be verified is created. This is usually done by gathering speech data from several utterances known to come from a given speaker and then processing the data to form models unique to the speaker. The unique models are stored along with information that identifies the speaker of the models.
During actual verification usage, speakers first claim their identity. The system then requests that the speaker speak an utterance, which is compared to the stored speech models for the speaker with the claimed identity. If the spoken utterance and the speech models agree closely, then the speaker is declared to be the person whose identity was claimed.
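The enrollment and verification flow described above can be illustrated with a minimal sketch. It is not taken from any particular system: the model is assumed to be a simple mean vector, agreement is assumed to be measured by Euclidean distance, and the names `enroll`, `verify`, `models`, and `threshold` are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def enroll(models, speaker_id, enrollment_vectors):
    """Store a model (assumed here: a mean vector) under the speaker's identity."""
    models[speaker_id] = mean_vector(enrollment_vectors)

def verify(models, claimed_id, test_vectors, threshold):
    """Accept the claim if the test utterance is close enough to the stored model."""
    if claimed_id not in models:
        return False  # no enrollment exists for the claimed identity
    reference = models[claimed_id]
    distance = math.dist(reference, mean_vector(test_vectors))
    return distance <= threshold

# Hypothetical two-dimensional speech vectors, for illustration only.
models = {}
enroll(models, "alice", [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
accepted = verify(models, "alice", [[1.1, 2.1], [0.9, 1.9]], threshold=0.5)
rejected = verify(models, "alice", [[3.0, 0.0], [3.2, 0.2]], threshold=0.5)
print(accepted, rejected)
```

In this sketch the threshold sets the trade-off between rejecting true speakers and accepting impostors; its value here is arbitrary.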
Present methods of speech processing measure vectors of speech parameters from an utterance over small periods of time, called frames, during which it is assumed that the acoustical signal is not changing appreciably. Often, these parameter vectors undergo an orthogonalizing linear transformation, or some other transformation, to create statistically uncorrelated speech parameter vectors, also known as speech feature vectors. The resulting parameter or feature vectors can be used to model an individual's speech.
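As an illustration of frame-based measurement, the following sketch splits a sampled signal into short overlapping frames and computes a small parameter vector for each frame. The parameters used here (log energy and zero-crossing count) are stand-ins chosen for brevity, not the parameters of any particular verification system, and the sample rate, frame length, and frame shift are assumed values.

```python
import math

def split_into_frames(samples, frame_length, frame_shift):
    """Cut the signal into fixed-length frames, advancing by frame_shift samples."""
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames

def frame_parameters(frame):
    """Toy parameter vector for one frame: [log energy, zero-crossing count]."""
    energy = sum(s * s for s in frame)
    log_energy = math.log(energy + 1e-10)  # small constant guards against log(0)
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return [log_energy, float(zero_crossings)]

# Assumed values: 8 kHz sampling, 20 ms frames (160 samples), 10 ms shift (80 samples).
sample_rate = 8000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
frames = split_into_frames(samples, frame_length=160, frame_shift=80)
vectors = [frame_parameters(frame) for frame in frames]
print(len(frames), len(vectors[0]))
```

Each frame is short enough that the signal within it can be treated as approximately stationary, which is the assumption stated in the text.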
Currently, some speaker verification systems group together the speech vectors from all frames of a given person's speech and use them to determine average statistical properties of the speech vectors over entire utterances. Sometimes these systems estimate average statistical properties of the distortion of the speech vectors due to different handsets and channels. The average statistical properties are subsequently used to verify the speaker.
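A minimal sketch of this pooling approach follows. It assumes that the "average statistical properties" are the per-dimension mean and variance computed over every frame of every utterance; the function name `pooled_statistics` and the sample data are hypothetical.

```python
def pooled_statistics(utterances):
    """Per-dimension mean and variance over all frames of all utterances.

    `utterances` is a list of utterances, each a list of speech vectors.
    Frame order and utterance boundaries are discarded by the pooling.
    """
    all_vectors = [vector for utterance in utterances for vector in utterance]
    n = len(all_vectors)
    columns = list(zip(*all_vectors))
    means = [sum(column) / n for column in columns]
    variances = [
        sum((x - m) ** 2 for x in column) / n
        for column, m in zip(columns, means)
    ]
    return means, variances

# Two hypothetical utterances with two-dimensional speech vectors.
utterances = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
means, variances = pooled_statistics(utterances)
print(means, variances)
```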
Other speaker verification systems group speech vectors that correspond to the same speech sounds in a process called alignment. Dynamic Time Warping (DTW) and Hidden Markov Modeling (HMM) are among the better-known methods for alignment. The system estimates the statistical properties of the speech vectors corresponding to each group separately. The resulting collection of statistical properties of the groups of speech vectors forms the reference model for the speaker to be verified. Verification systems often separate the collection of statistical properties into multiple models representing individual words, syllables, or phones.
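Alignment by Dynamic Time Warping can be sketched as below. This is a textbook DTW with a Euclidean local distance and the three standard step directions, offered as an assumption about what such an alignment might look like rather than as the method of any specific system. The function name `dtw_align` and the one-dimensional sample vectors are hypothetical.

```python
import math

def dtw_align(reference, test):
    """Return (total cost, path) aligning test frames to reference frames.

    The path is a list of (reference_index, test_index) pairs.
    """
    n, m = len(reference), len(test)
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning the first i reference frames
    # with the first j test frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(reference[i - 1], test[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Trace the lowest-cost path back from the end of both sequences.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path

# The test utterance dwells on the first sound for one extra frame.
reference = [[0.0], [1.0], [2.0]]
test = [[0.0], [0.0], [1.0], [2.0]]
total_cost, path = dtw_align(reference, test)

# Group test frames by the reference frame they were aligned to.
groups = {}
for ref_index, test_index in path:
    groups.setdefault(ref_index, []).append(test[test_index])
print(total_cost, path)
```

Once the frames are grouped in this way, statistics such as a mean and variance can be estimated separately for each group.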
It is important to note that all of these present art systems utilize statistical properties of the speaker's data at the speech vector level. Hence, the systems implicitly assume independence of the statistical properties associated with each group of speech vectors.
One of the problems faced by many speaker verification applications is unavoidable distortion or variation of the speech signal. A distorted speech signal results in distorted speech vectors. If the vectors are considered individually, as current verification systems do, it is difficult to determine whether the speech came from an assumed true speaker or an impostor because of the distortion of the speech vectors. This degrades speaker verification performance.
For example, in telecommunications applications, where one wishes to control access to resources by voice identification over the telephone, use of different telephone handsets and channels often distorts and varies a person's speech. In other applications, such as an automated teller for banking, the use of different microphones causes variation of the speech signal. It is also important to note that with current speaker verification systems, since only one telephone handset or microphone is used at a time, the variation of the speech signal is consistent so long as only that particular handset or microphone is used.
Accordingly, improvements which overcome any or all of these problems are presently desirable.