The present invention relates generally to the field of speaker recognition, which includes speaker verification and speaker identification.
The use of speaker verification systems for security purposes has been growing in recent years. In a conventional speaker verification system, speech samples of known speakers are obtained and used to develop a speaker model for each speaker. Each speaker model typically contains clusters or distributions of audio feature data derived from the associated speech sample. In operation of a speaker verification system, a person (the claimant) wishing to, e.g., access certain data or enter a particular building, claims to be a registered speaker who has previously submitted a speech sample to the system. The verification system prompts the claimant to speak a short phrase or sentence. The speech is recorded, analyzed, and compared to the stored speaker model associated with the claimed identification (ID). If the speech is within a predetermined distance of the corresponding model, the claimant is verified.
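The conventional threshold test described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes each speaker model is reduced to a list of mean feature vectors, and the names `model_distance`, `verify_by_threshold`, and the threshold value are hypothetical.

```python
import math

def model_distance(model_a, model_b):
    """Symmetric average nearest-component distance between two models.

    Assumption: a model is a list of mean vectors, a simplification of
    the clusters or distributions a real speaker model would hold.
    """
    def one_way(src, dst):
        return sum(min(math.dist(u, v) for v in dst) for u in src) / len(src)
    return 0.5 * (one_way(model_a, model_b) + one_way(model_b, model_a))

def verify_by_threshold(test_model, claimed_model, threshold):
    """Accept the claimant if the test speech lies within the preset distance."""
    return model_distance(test_model, claimed_model) <= threshold

# Hypothetical enrolled model and two test utterances.
registered = {"alice": [(0.0, 0.0), (1.0, 1.0)]}
claim_sample = [(0.1, 0.0), (1.0, 0.9)]     # close to the enrolled model
impostor_sample = [(5.0, 5.0), (6.0, 6.0)]  # far from the enrolled model

accepted = verify_by_threshold(claim_sample, registered["alice"], threshold=0.5)
rejected = verify_by_threshold(impostor_sample, registered["alice"], threshold=0.5)
```

In this sketch the decision depends entirely on the chosen threshold, which must be fixed in advance for each deployment.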
Speaker identification systems are also enjoying considerable growth. These systems likewise develop and store speaker models for known speakers based on speech samples. Subsequently, to identify an unknown speaker, his speech is analyzed and compared to the stored models. If the speech closely matches one of the models, the speaker is positively identified. One of the many useful applications for such speaker identification systems is speech recognition. Some speech recognition systems achieve more accurate results by developing unique speech prototypes for each speaker registered with the system. Each unique prototype is used to analyze only the speech of the corresponding person. Thus, when the speech recognition system is faced with the task of recognizing speech of a speaker who has not identified himself, such as in a conference situation, a speaker identification process can be carried out to determine the correct prototype for the recognition operation.
The present disclosure relates to a method for generating a hierarchical speaker model tree. In an illustrative embodiment, a speaker model is generated for each of a number of speakers from which speech samples have been obtained. Each speaker model contains a collection of distributions of audio feature data derived from the speech sample of the associated speaker. The hierarchical speaker model tree is created by merging similar speaker models on a layer-by-layer basis. Each time two or more speaker models are merged, a corresponding parent speaker model is created in the next higher layer of the tree. The tree is useful in applications such as speaker verification and speaker identification.
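The layer-by-layer merging described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the disclosed implementation: each model is a list of mean vectors, similarity is a hypothetical nearest-component distance, models are merged pairwise by greedy closest-pair selection, and a parent model is formed by simply pooling the components of its children. The names `Node`, `build_layer`, and `build_tree` are invented for the example.

```python
import math
from itertools import combinations

class Node:
    """One speaker model in the tree (leaf = enrolled speaker, inner = merged)."""
    def __init__(self, name, components, children=()):
        self.name = name
        self.components = list(components)
        self.children = list(children)

def model_distance(a, b):
    # Assumed similarity measure: symmetric average nearest-component distance.
    def one_way(src, dst):
        return sum(min(math.dist(u, v) for v in dst) for u in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))

def build_layer(nodes):
    """Merge the most similar pairs; each merge creates one parent node."""
    remaining = list(nodes)
    parents = []
    while len(remaining) >= 2:
        a, b = min(
            combinations(remaining, 2),
            key=lambda pair: model_distance(pair[0].components, pair[1].components),
        )
        remaining.remove(a)
        remaining.remove(b)
        # Assumption: the parent model pools the children's components.
        parents.append(Node(f"({a.name}+{b.name})", a.components + b.components, [a, b]))
    parents.extend(remaining)  # an unpaired model is carried up unchanged
    return parents

def build_tree(leaves):
    """Repeat the merge step until a single root model remains."""
    layer = list(leaves)
    while len(layer) > 1:
        layer = build_layer(layer)
    return layer[0]

# Hypothetical enrolled speakers forming two obvious groups.
leaves = [
    Node("A", [(0.0, 0.0)]),
    Node("B", [(0.2, 0.0)]),
    Node("C", [(5.0, 5.0)]),
    Node("D", [(5.3, 5.0)]),
]
root = build_tree(leaves)
```

Here `A` and `B` merge into one parent and `C` and `D` into another in the first layer, and the two parents merge into the root in the second.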
A speaker verification method is disclosed in which a claimed ID is received from a claimant, where the claimed ID represents a speaker corresponding to a particular one of the speaker models. A cohort set of similar speaker models associated with the particular speaker model is established. Then, a speech sample from the claimant is received and a test speaker model is generated therefrom. The test model is compared to all the speaker models of the cohort set, and the claimant is verified only if the particular speaker model is the closest to the test model. False acceptance rates can be reduced by computing one or more complementary speaker models and adding the complementary model(s) to the cohort set for comparison to the test model. In a cumulative complementary model (CCM) approach, one merged complementary model is generated from speaker models outside the original cohort set and is then added to the cohort set. In a graduated complementary model (GCM) approach, a complementary model is defined for each of a number of levels of the tree, and each complementary model is added to the cohort set.
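The cohort-based decision and the CCM variant described above can be sketched as follows. This is a hedged illustration only: the cohort set is supplied explicitly rather than derived from the tree, models are lists of mean vectors, the complementary model is formed by pooling the components of all models outside the cohort, and the names `verify_with_cohort` and `"__ccm__"` are hypothetical. The GCM variant is not shown.

```python
import math

def model_distance(a, b):
    # Assumed similarity measure: symmetric average nearest-component distance.
    def one_way(src, dst):
        return sum(min(math.dist(u, v) for v in dst) for u in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))

def verify_with_cohort(test_model, claimed_id, models, cohort_ids, use_ccm=True):
    """Verify only if the claimed model is the closest candidate to the test model."""
    candidates = {sid: models[sid] for sid in cohort_ids}
    if use_ccm:
        # One merged complementary model built from every model outside the cohort.
        outside = [c for sid, m in models.items() if sid not in cohort_ids for c in m]
        if outside:
            candidates["__ccm__"] = outside
    closest = min(candidates, key=lambda sid: model_distance(test_model, candidates[sid]))
    return closest == claimed_id

# Hypothetical enrolled models; the cohort of "alice" is assumed to be {alice, bob}.
models = {
    "alice": [(0.0, 0.0)],
    "bob": [(4.0, 0.0)],
    "carol": [(-3.0, 0.0)],
    "dave": [(-3.0, 1.0)],
}
cohort = {"alice", "bob"}

genuine = [(0.1, 0.0)]    # resembles alice
impostor = [(-2.5, 0.5)]  # resembles speakers outside the cohort

genuine_ok = verify_with_cohort(genuine, "alice", models, cohort)
impostor_without_ccm = verify_with_cohort(impostor, "alice", models, cohort, use_ccm=False)
impostor_with_ccm = verify_with_cohort(impostor, "alice", models, cohort)
```

In this example the impostor is nearer to `alice` than to `bob`, so the plain cohort test accepts the false claim. With the complementary model added, the impostor is nearest to that model instead and is rejected, while the genuine claimant is still accepted.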