The present invention relates generally to speaker recognition techniques, and more particularly, to methods and apparatus for determining the identity of a speaker given a speech sample.
A number of techniques have been proposed or suggested for identifying one or more speakers in an audio stream. For example, U.S. patent application Ser. No. 09/345,237, filed Jun. 30, 1999, discloses a method and apparatus that automatically transcribe audio information from an audio source while concurrently identifying speakers in real-time, using an existing enrolled speaker database. While currently available speaker recognition algorithms perform well for many applications, they typically impose computational requirements that exceed the available resources of portable computing devices, such as personal digital assistants (PDAs).
Generally, currently available speaker recognition algorithms employ background and target speaker models that are loaded into memory. Thus, the memory footprint required to store such background and target speaker models typically exceeds the amount of available memory in most portable computing devices. In addition, currently available speaker recognition algorithms have processor requirements that exceed the computational capacity of most portable computing devices.
Furthermore, portable computing devices, such as personal digital assistants, together with the personal and confidential information stored thereon, can be easily lost or stolen due to their small dimensions. Thus, many portable computing devices incorporate one or more access control features to prevent unauthorized users from accessing the data and applications stored thereon. For example, a biometric access control technique can be used to recognize one or more authorized users. Currently, most portable computing devices incorporate a limited number of input devices. For example, while many portable computing devices incorporate a microphone to enable a user to enter information using speech recognition techniques, they generally do not include a camera or a fingerprint imager.
Thus, a need exists for a speaker recognition technique that can operate within the memory and processing constraints of existing portable computing devices. A further need exists for a speaker recognition technique that may be deployed as an access control mechanism on existing portable computing devices.
Generally, the present invention provides a speaker recognition technique that can operate within the memory and processing constraints of existing portable computing devices. The disclosed speaker recognition technique achieves a smaller memory footprint and computational efficiency using single Gaussian models for each enrolled speaker. During an enrollment phase, features are extracted from one or more enrollment utterances from each enrolled speaker, to generate a target speaker model based on a sample covariance matrix. Thereafter, during a recognition phase, features are extracted from one or more test utterances to generate a test utterance model that is also based on the sample covariance matrix. Finally, the speaker recognition technique of the present invention computes a normalized similarity score, referred to as a sphericity ratio, that compares the test utterance model to the target speaker model, as well as a background model.
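The enrollment step described above can be illustrated with a minimal sketch. It assumes the features have already been extracted into an array with one row per frame and one column per feature dimension. The function name `covariance_model`, the 12-dimensional feature size, and the random stand-in data are hypothetical and are not taken from the disclosure. The same routine would apply to a test utterance or to background data.

```python
import numpy as np

def covariance_model(features):
    """Return the sample covariance matrix of a (num_frames, num_dims) array."""
    features = np.asarray(features, dtype=float)
    if features.ndim != 2 or features.shape[0] < 2:
        raise ValueError("expected at least two frames of feature vectors")
    # rowvar=False treats each column as one feature dimension.
    return np.cov(features, rowvar=False)

# Stand-in for features extracted from enrollment utterances (assumed shape).
rng = np.random.default_rng(0)
enrollment_features = rng.normal(size=(500, 12))
target_model = covariance_model(enrollment_features)
```

Under these assumptions each enrolled speaker is stored as a single d-by-d matrix, here 12 by 12, regardless of how much enrollment speech was used, which is consistent with the small memory footprint described above.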
The present invention identifies a speaker given a sample of his or her speech, by computing a similarity score based on the sphericity ratio. The sphericity ratio compares the given speech to the target speaker model and to the background models created during the enrollment phase. Thus, the sphericity ratio incorporates the target, test utterance, and background models. Generally, the sphericity ratio indicates how similar the test utterance speech is to the speech used when the user was enrolled, as represented by the target speaker model, and how dissimilar the test utterance speech is from the background model, thus allowing the speaker to be recognized.
The sphericity ratio score may be expressed as follows:
Score = −trace(CU inv(CS)) / trace(CU inv(CB)),
where CS is the sample covariance matrix of the features extracted from the target speaker training data, CB is the sample covariance matrix of the features extracted from the background data, and CU is the sample covariance matrix of the features extracted from the speech utterance to be recognized. Here trace( ) denotes the matrix trace operator, and inv( ) denotes the matrix inverse operator. The score is invariant to scaling of the features extracted from speech, because a common scale factor applied to the features cancels within each trace term, and this invariance makes recognition more robust. In addition, the present invention is text independent and can recognize a speaker regardless of the words spoken by the speaker. The recognition score generated by the present invention can be used to recognize the identity of the speaker by (i) comparing the score of a given speaker model to scores obtained against other speaker models; or (ii) applying a threshold to the score to verify whether the speaker is the person that the speaker has asserted himself or herself to be.
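The score and the two decision modes can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation. The function names `sphericity_score`, `identify`, and `verify`, the speaker labels, and the threshold value are hypothetical. The sketch also assumes that a higher (less negative) score indicates a closer match, which the negative sign suggests but the text does not state outright.

```python
import numpy as np

def sphericity_score(c_u, c_s, c_b):
    """Score = -trace(CU inv(CS)) / trace(CU inv(CB))."""
    # Solve linear systems rather than forming explicit inverses;
    # trace(inv(C) CU) equals trace(CU inv(C)) by the cyclic property.
    numerator = np.trace(np.linalg.solve(c_s, c_u))
    denominator = np.trace(np.linalg.solve(c_b, c_u))
    return -numerator / denominator

def identify(c_u, target_models, c_b):
    """Mode (i): return the enrolled speaker whose model scores highest."""
    scores = {name: sphericity_score(c_u, c_s, c_b)
              for name, c_s in target_models.items()}
    return max(scores, key=scores.get), scores

def verify(c_u, c_s, c_b, threshold):
    """Mode (ii): accept the claimed identity if the score meets the threshold."""
    return sphericity_score(c_u, c_s, c_b) >= threshold

# Toy 2-dimensional covariance matrices (hypothetical values).
c_b = np.eye(2)
target_models = {"speaker_a": np.diag([1.0, 4.0]),
                 "speaker_b": np.diag([4.0, 1.0])}
c_u = np.diag([1.0, 4.0])  # test utterance resembling speaker_a

best, scores = identify(c_u, target_models, c_b)
accepted = verify(c_u, target_models["speaker_a"], c_b, threshold=-0.5)
```

In this toy case the test utterance scores −0.4 against speaker_a and −0.85 against speaker_b, so speaker_a is selected. Multiplying all three covariance matrices by the same constant leaves the score unchanged, which reflects the scale invariance noted above.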