Speaker recognition is a biometric modality that uses a person's voice for identification purposes. It differs from speech recognition, where the goal is a transcription of the spoken words rather than the identity of the speaker. Before a speaker recognition system can be used, it must go through a training procedure in which speech samples from the user are introduced to the system to build a speaker model. In the testing phase, the speaker's speech is compared against the database of speaker models, and the model that most closely resembles it is identified.
There are two main types of speaker recognition. The first type is text dependent speaker recognition. In this type of speaker recognition, the user or speaker needs to utter a specific keyword or phrase to be accepted by the system. The keyword or phrase can either be a password or a prompted phrase. These types of speaker recognition systems have strong control over user input and need user cooperation to perform the identification process. Most text dependent speaker recognition systems use a Hidden Markov Model (HMM), which provides a statistical representation of the individual's speech. FIG. 1 shows the structure of a typical HMM. The training process requires the user to say the keyword or phrase numerous times to build the statistical model. Another method utilized in text dependent systems is template matching, where a sequence of feature vectors is built from a fixed phrase spoken by the speaker or user to generate a template. The speaker can then be verified by using Dynamic Time Warping (DTW) to measure the similarity between the test phrase and the template.
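The template matching step described above can be illustrated with a minimal sketch of DTW. This is not taken from any particular system: the function name `dtw_distance`, the use of Euclidean distance between feature vectors, and the unconstrained step pattern are assumptions chosen for illustration only.

```python
import math


def dtw_distance(test, template):
    """Cumulative DTW alignment cost between two sequences of feature vectors.

    Assumptions (illustrative only): each sequence is a list of equal-length
    numeric tuples, the local cost is Euclidean distance, and the allowed
    steps are match, insertion, and deletion with no slope constraint.
    """
    n, m = len(test), len(template)
    # cost[i][j] = best cumulative cost aligning test[:i] with template[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(test[i - 1], template[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # test frame repeated against same template frame
                cost[i][j - 1],      # template frame repeated against same test frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

In such a sketch, a test phrase would be accepted when `dtw_distance(test, template)` falls below a threshold; that threshold is a hypothetical tuning parameter and would have to be chosen from enrollment data.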
The second type of speaker recognition system is text independent speaker recognition. Text independent speaker recognition systems allow a user to say anything he or she wants, and the system should then be able to identify the speaker. Because the system has no prior or learned knowledge of what is being spoken, sufficient speech samples from the speaker are needed for accurate recognition. The primary advantage of text independent speaker recognition systems is that the process can be done without user cooperation (e.g., no keyword or phrase need be spoken). Various forms of neural networks can be trained to perform text independent speaker recognition, but the de facto reference method uses a Gaussian Mixture Model (GMM) to represent the speaker model. Usually this GMM is adapted from a Universal Background Model (UBM) using an adaptation method such as maximum a posteriori (MAP) adaptation. In recent years, the Support Vector Machine (SVM) has been found to be one of the most robust classifiers in speaker verification, and its combination with the GMM has successfully increased the accuracy of text independent speaker recognition systems.
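As a rough illustration of adapting a speaker GMM from a UBM, the sketch below applies the commonly used relevance-factor form of MAP adaptation to the component means only. The function name `map_adapt_means`, the diagonal covariances, the mean-only update, and the relevance factor of 16 are assumptions made for this example and are not specified by the text above.

```python
import math


def map_adapt_means(weights, means, variances, frames, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (illustrative sketch).

    weights:   list of K mixture weights of the UBM
    means:     K lists of D per-dimension means
    variances: K lists of D per-dimension variances
    frames:    list of D-dimensional feature vectors from the enrolling speaker
    relevance: assumed relevance factor controlling how far means may move
    """
    k, dim = len(weights), len(means[0])
    counts = [0.0] * k                       # soft frame count per component
    sums = [[0.0] * dim for _ in range(k)]   # posterior-weighted frame sums
    for x in frames:
        # Log of weight times Gaussian density for each component.
        log_p = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll -= 0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
            log_p.append(ll)
        # Normalize in a numerically stable way to get posteriors.
        peak = max(log_p)
        unnorm = [math.exp(lp - peak) for lp in log_p]
        total = sum(unnorm)
        for i in range(k):
            post = unnorm[i] / total
            counts[i] += post
            for d in range(dim):
                sums[i][d] += post * x[d]
    adapted = []
    for i in range(k):
        # alpha near 1 trusts the speaker data, near 0 keeps the UBM mean.
        alpha = counts[i] / (counts[i] + relevance)
        adapted.append([
            alpha * (sums[i][d] / counts[i] if counts[i] > 0 else 0.0)
            + (1.0 - alpha) * means[i][d]
            for d in range(dim)
        ])
    return adapted
```

Components that the speaker's frames rarely occupy stay close to the UBM means, while well-observed components move toward the speaker's data. This is one reason adaptation from a UBM can work with limited enrollment speech.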
There is a need for improved methods and circuits for text independent speaker recognition in low power applications, such as battery powered devices like smartphones.