Speaker identification systems identify a person by analyzing the person's speech. In general, there are three kinds of speaker identification: speaker verification, closed set identification, and open set identification.
A speaker verification system compares a sample of speech from a person who professes to be a particular known speaker to previous samples or models of speech of that known speaker. The speaker verification system verifies the identity of the speaker by determining whether the sample matches the previous samples or models.
A closed set identification system analyzes a sample of speech in relation to the speech of each of a set of known speakers. The system then determines that the speech was produced by the known speaker whose speech most closely matches the sample of speech. Thus, a closed set identification system identifies the single known speaker who is most likely to have produced the sample of speech.
An open set identification system analyzes a sample of speech in relation to the speech of each of a set of known speakers. The system determines for each known speaker whether the sample of speech was likely to have come from that speaker. The system may determine that the sample of speech was likely to have come from more than one of the known speakers, or from none of them.
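The difference between the three kinds of speaker identification can be illustrated as three decision rules applied to match scores. The following is a minimal sketch only: the function names, the example scores, and the use of a single fixed threshold are assumptions made for illustration and are not taken from any particular system.

```python
from typing import Dict, List


def verify(score: float, threshold: float) -> bool:
    """Speaker verification: accept the claimed identity if the match
    score against that one known speaker clears the threshold."""
    return score >= threshold


def identify_closed_set(scores: Dict[str, float]) -> str:
    """Closed set identification: always return the single known speaker
    whose speech most closely matches the sample."""
    if not scores:
        raise ValueError("at least one known speaker is required")
    return max(scores, key=lambda name: scores[name])


def identify_open_set(scores: Dict[str, float], threshold: float) -> List[str]:
    """Open set identification: decide for each known speaker separately;
    the result may contain several speakers or none."""
    return sorted(name for name, s in scores.items() if s >= threshold)


# Hypothetical match scores (higher means a closer match).
example_scores = {"alice": 0.82, "bob": 0.35, "carol": 0.77}

claimed_ok = verify(example_scores["bob"], threshold=0.7)
best_match = identify_closed_set(example_scores)
likely_sources = identify_open_set(example_scores, threshold=0.7)
no_sources = identify_open_set(example_scores, threshold=0.9)
```

With these example scores, the closed set rule names one speaker even when no match is strong, whereas the open set rule can return two speakers at one threshold and an empty list at a stricter one.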
In one approach to speaker identification, referred to as a large vocabulary continuous speech recognition (LVCSR) approach, speech recognition is used to identify the words spoken by the person as the first step in the identification process. A speech recognition system analyzes a person's speech to determine what the person said. In a typical frame-based speech recognition system, a processor divides a signal derived from the speech into a series of digital frames, each of which corresponds to a small time increment of the speech. The processor then compares the digital frames to a set of speech models. The speech models may be speaker-independent models that represent how words are spoken by a variety of speakers. Speech models also may represent phonemes that correspond to portions of words. Phonemes may be subdivided further within the speech model into phoneme nodes, where a phoneme may be represented by, for example, three phoneme nodes. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word. The processor determines what the person said by finding the speech models that correspond best to the digital frames that represent the person's speech.
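The framing step described above can be sketched as follows. This is an illustrative simplification: it splits raw samples into non-overlapping frames of a fixed duration, whereas a practical system would typically compute a feature vector for each frame. The function name, the 10 ms default, and the choice to drop a trailing partial frame are assumptions.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float],
    sample_rate_hz: int,
    frame_ms: float = 10.0,  # assumed frame duration, for illustration only
) -> List[List[float]]:
    """Divide a sampled signal into consecutive, non-overlapping frames,
    each covering a small time increment of the speech."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame duration is too short for this sample rate")
    # A trailing partial frame, if any, is dropped in this sketch.
    last_start = len(samples) - frame_len
    return [
        list(samples[start:start + frame_len])
        for start in range(0, last_start + 1, frame_len)
    ]


# One second of silence sampled at 8 kHz, plus 50 leftover samples.
signal = [0.0] * 8050
frames = split_into_frames(signal, sample_rate_hz=8000)
```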
After using speech recognition to determine the content of the speech, the speaker identification system determines the source of the speech by comparing the recognized speech to speech models for different known speakers. The likelihood that a particular known speaker is the source of the speech is estimated based on the degree to which the recognized speech corresponds to the speech model for the known speaker.
The speech model for a known speaker may be produced, for example, by having the known speaker read from a list of words, or by having the known speaker respond to prompts that ask the speaker to recite certain words. As the known speaker reads from the list of words or responds to the prompts, the known speaker's speech is sampled and the samples are stored along with the identity of the known speaker. Typically, the samples are stored as speaker-adapted models. Though the speaker-independent model used in speech recognition is typically a triphone model that considers the context in which phonemes are spoken, the scoring models used in speaker identification typically are monophone models that do not consider the context. This permits the scoring models to adapt efficiently from a small amount of data.
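The scoring step of the LVCSR approach can be sketched under strong simplifying assumptions: each frame is reduced to a single number, the recognizer has already labeled each frame with a phoneme, and each speaker's monophone scoring model holds one Gaussian per phoneme. None of these details come from the description above; the names and values are hypothetical and serve only to show how recognized speech might be scored against context-independent speaker models.

```python
import math
from typing import Dict, List, Tuple

Gaussian = Tuple[float, float]          # (mean, variance), assumed 1-D
MonophoneModel = Dict[str, Gaussian]    # one Gaussian per phoneme, no context


def log_gaussian(x: float, mean: float, variance: float) -> float:
    """Log density of a one-dimensional Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def score_speaker(
    aligned_frames: List[Tuple[str, float]],
    model: MonophoneModel,
) -> float:
    """Average per-frame log-likelihood of the recognized speech under one
    known speaker's monophone model."""
    if not aligned_frames:
        raise ValueError("no frames to score")
    total = 0.0
    for phoneme, feature in aligned_frames:
        mean, variance = model[phoneme]
        total += log_gaussian(feature, mean, variance)
    return total / len(aligned_frames)


# Hypothetical recognizer output: (phoneme label, frame feature) pairs.
aligned = [("k", 1.0), ("ae", 2.1), ("t", -0.4)]

# Hypothetical speaker-adapted monophone models.
speaker_models: Dict[str, MonophoneModel] = {
    "alice": {"k": (1.0, 1.0), "ae": (2.0, 1.0), "t": (-0.5, 1.0)},
    "bob": {"k": (3.0, 1.0), "ae": (0.0, 1.0), "t": (1.5, 1.0)},
}

speaker_scores = {
    name: score_speaker(aligned, model) for name, model in speaker_models.items()
}
```

Because a monophone model keeps only one set of parameters per phoneme rather than one per phonetic context, it has far fewer parameters to estimate from a known speaker's limited samples.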
In another approach to speaker identification, referred to as a Gaussian mixture model (GMM) approach, each digital frame is compared to a single, speaker-independent mixture model representing all speech and to speaker-adapted mixture models for each of the known speakers. The speaker-independent mixture model is a mixture of approximately 2000 Gaussians and represents all speech without reference to particular phonemes or phoneme nodes. The likelihood that a particular known speaker is the source of the speech is estimated based on the degree to which the digital frames resemble that speaker's adapted model more closely than they resemble the speaker-independent model.
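One common way to express "resembles the speaker-adapted model more closely than the speaker-independent model" is an average log-likelihood ratio over the frames. The sketch below assumes that formulation, uses one-dimensional frames, and uses two-component mixtures in place of the roughly 2000 Gaussians mentioned above. All names and parameter values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance), assumed 1-D


def gmm_log_likelihood(x: float, gmm: Sequence[Component]) -> float:
    """Log density of x under a Gaussian mixture, computed with the
    log-sum-exp trick for numerical stability."""
    logs = [
        math.log(weight)
        - 0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)
        for weight, mean, variance in gmm
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def llr_score(
    frames: Sequence[float],
    speaker_gmm: Sequence[Component],
    background_gmm: Sequence[Component],
) -> float:
    """Average per-frame log-likelihood ratio: positive when the frames fit
    the speaker-adapted model better than the speaker-independent model."""
    if not frames:
        raise ValueError("no frames to score")
    ratios: List[float] = [
        gmm_log_likelihood(x, speaker_gmm) - gmm_log_likelihood(x, background_gmm)
        for x in frames
    ]
    return sum(ratios) / len(ratios)


# Hypothetical speaker-independent ("all speech") mixture: broad components.
background = [(0.5, -1.0, 4.0), (0.5, 1.0, 4.0)]
# Hypothetical speaker-adapted mixture: narrower components near 2.0.
alice = [(0.5, 1.5, 0.5), (0.5, 2.5, 0.5)]

matching_score = llr_score([1.8, 2.2, 2.0], alice, background)
mismatched_score = llr_score([-3.0, -2.5], alice, background)
```

Under this formulation, the score could be compared against a threshold for verification or open set identification, or compared across speakers for closed set identification.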