1. Technical Field
The present invention relates generally to speaker verification; and, more particularly, it relates to speaker verification employing a combination of universal cohort modeling and automatic score thresholding.
2. Description of Related Art
Conventional systems employing speaker recognition and other automatic speaker verification (ASV) provide a means to ensure secure access to various facilities. The ability to control the flow of personnel to various portions within a facility, without the intervention of man-occupied stations, is also very desirable for many applications. For example, many businesses use card-controlled access or numerical keypads to control the flow of personnel into various portions of a facility. Facility management, when controlling a single building having a number of businesses occupying various portions of the building, often uses such means of facility access control to monitor and ensure that various portions of the facility are safe and secure from intruders and other unauthorized personnel. Such personnel recognition systems and automatic speaker verification (ASV) systems provide the ability to control the flow of personnel using speech utterances of the personnel. A claimant seeking to pass through such a system verbally submits either a predetermined word or phrase or simply a sample of the claimant's speaking of a randomly selected word or phrase. An authentic claimant is one of the personnel who is authorized to gain access to the facility.
A trend for many of these speaker recognition and other automatic speaker verification (ASV) systems is to employ unsupervised training methods to prepare the speaker verification system to operate properly in real time. However, many of the conventional systems require substantial training and processing resources, including memory, to perform adequately. Within such systems, a claimant provides a speech sample or speech utterance that is scored against a model corresponding to the claimant's claimed identity, and a claimant score is then computed. Two conventional methods are commonly known to those having skill in the art of speaker verification for deciding whether to accept or reject the claimant; that is to say, whether to confirm that the claimant is in fact an authorized member of the personnel of the facility and permit the claimant to pass through the speaker verification system, or to deny the claimant access.
A first conventional method to perform speaker verification compares a score that is derived from the claimant provided utterance to a predetermined threshold level. The claimant is declared to be a true speaker solely upon the determination of whether the claimant's score exceeds the predetermined threshold level. If the claimant's score falls below the predetermined threshold level, the claimant is rejected and denied access through the speaker verification system. Deficiencies in this first conventional method of performing speaker verification are many. Although it has relatively low computational and storage requirements, it is substantially unreliable. A predominant reason for this unreliability is that the method is highly biased to the training data, and it is consequently highly biased to the conditions that existed during its training.
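The fixed-threshold decision described above can be illustrated with a minimal Python sketch. The function name and all score values are hypothetical and chosen only to show how a threshold tuned under one recording condition can reject the same true speaker under another.

```python
def accept_by_fixed_threshold(claimant_score: float, threshold: float) -> bool:
    """Accept the claimant only if the score exceeds the predetermined threshold."""
    return claimant_score > threshold


# Hypothetical log-likelihood style scores (illustrative values only).
THRESHOLD = -45.0            # fixed once, under the training conditions
same_speaker_quiet = -40.0   # true speaker, conditions similar to training
same_speaker_noisy = -48.0   # same true speaker, mismatched conditions

print(accept_by_fixed_threshold(same_speaker_quiet, THRESHOLD))  # True
print(accept_by_fixed_threshold(same_speaker_noisy, THRESHOLD))  # False
```

The second call shows the bias toward training conditions: the speaker has not changed, but the absolute score has shifted below the fixed threshold.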
A second conventional method used to perform speaker verification compares the score that is derived from the claimant's utterance to a plurality of scores that are computed during the speaker verification process, i.e., when the claimant claims to be a true speaker or member of the personnel of the facility, namely, an individual speaker authorized to gain access through the speaker identification system. The plurality of scores to which the claimant's score is compared are generated by scoring the claimant provided utterance against a set of models of known cohort speakers. One difficulty, among others, with using cohort modeling is that the set of cohort models required to perform speaker verification is different for every speaker; consequently, a large amount of processing must be performed to determine the proper cohort model or models for a given claimant. A relatively significant amount of memory is also required to store all of the various cohort models to accommodate all of the speakers of the system. In addition, the method of training the conventional speaker verification system requires access to a relatively large pool of speaker cohort models to select the proper cohort set; the accompanying data storage requirements are typically very large, as described above. A problem for speaker verification systems having relatively constrained memory and processing resources is that their reliability suffers greatly using such conventional methods. The memory management and data processing needs are also great, in that several cohort scores must be computed for proper verification; these cohort scores are in addition to the claimant's score in the instant case.
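The cohort comparison described above can be sketched as follows. The text does not state how the claimant's score and the cohort scores are combined; subtracting the mean cohort score is one common choice and is an assumption here, as are the function name and the score values.

```python
from statistics import mean
from typing import Sequence


def cohort_normalized_score(claimant_score: float,
                            cohort_scores: Sequence[float]) -> float:
    """Return the claimant's score relative to the mean cohort score.

    Assumption: mean subtraction is used for normalization; other rules
    (for example, subtracting the best cohort score) are equally possible.
    """
    if not cohort_scores:
        raise ValueError("at least one cohort score is required")
    return claimant_score - mean(cohort_scores)


# Hypothetical scores: one claimant score plus one extra score per cohort model.
claimant = -42.0
cohort = [-50.0, -48.0, -52.0]
print(cohort_normalized_score(claimant, cohort))  # 8.0
```

Every verification attempt costs one scoring pass per cohort model in addition to the claimant's own score, and each enrolled speaker needs its own cohort set in memory, which is the processing and storage burden noted above.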
Conventional speaker verification systems suffer from relatively large memory requirements, undesirably high complexity, and the unreliability associated with each of the first conventional method and the second conventional method to perform speaker verification.
Further limitations and disadvantages of conventional and traditional systems will become apparent to one of skill in the art through comparison of such systems with the present invention as set forth in the remainder of the present application with reference to the drawings.
Various aspects of the present invention can be found in an integrated speaker training and speaker verification system that generates a speaker model and a speaker authenticity using a speech utterance provided by a claimant. The integrated speaker training and speaker verification system contains a training circuitry, a memory, a pattern classification circuitry, and a decision logic circuitry. The training circuitry generates the speaker model and a speaker threshold using the speech utterance provided by the claimant. The memory stores the speaker model and the speaker threshold corresponding to the speech utterance provided by the claimant. The memory also stores a number of cohort models. The pattern classification circuitry processes the speech utterance provided by the claimant. The speech utterance is scored against a selected cohort model chosen from the number of cohort models and the speaker model. The decision logic circuitry processes the speech utterance provided by the claimant, and the speech utterance is scored against the speaker threshold. The pattern classification circuitry and the decision logic circuitry operate cooperatively to generate a speaker authenticity.
In certain embodiments of the invention, the integrated speaker training and speaker verification system contains an offline cohort model generation circuitry that generates three cohort models. One of the cohort models is generated using speech utterances of male speakers. Another of the cohort models is generated using speech utterances of female speakers. A third of the cohort models is generated using speech utterances of both male and female speakers. The pattern classification circuitry of the integrated speaker training and speaker verification system is any unsupervised classifier. In certain embodiments of the invention, the integrated speaker training and speaker verification system contains a switching circuitry that selects between a training operation and a testing operation. The speech utterance provided by the claimant corresponds to a claimed identity of a user authorized to gain entry through the testing operation of the integrated speaker training and speaker verification system. The decision logic circuitry compares the speaker threshold to a relative score that is calculated during the testing operation of the integrated speaker training and speaker verification system, and the relative score itself is generated using the speech utterance provided by the claimant. The integrated speaker training and speaker verification system contains a pre-processing and feature extraction circuitry that removes silence and extracts a plurality of cepstral features of the speech utterance provided by the claimant. If desired, the speech utterance provided within the integrated speaker training and speaker verification system is a predetermined verification phrase.
Other aspects of the invention can be found in a speaker verification system that generates a speaker authenticity using a speech utterance provided by a claimant. The speaker verification system contains a memory, a pattern classification circuitry, and a decision logic circuitry. The memory stores a plurality of speaker models, a plurality of speaker thresholds, and a plurality of cohort model identification variables. The memory also stores a male cohort model that is generated using a plurality of speech utterances of a plurality of male speakers, a female cohort model that is generated using a plurality of speech utterances of a plurality of female speakers, and a general cohort model that is generated using a plurality of speech utterances of a plurality of female and male speakers. The pattern classification circuitry processes the speech utterance provided by the claimant. The speech utterance is scored against a selected one of the male cohort model, the female cohort model, the general cohort model, and the speaker model. The pattern classification circuitry operates using an unsupervised classifier. The decision logic circuitry processes the speech utterance provided by the claimant to generate a relative score. The relative score is compared against a claimant speaker threshold that is selected from the plurality of speaker thresholds. The pattern classification circuitry and the decision logic circuitry operate cooperatively to generate a speaker authenticity.
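The stored items and the decision flow described above can be sketched in Python. This is an illustration under stated assumptions, not the claimed circuitry: the `Enrollment` record, the `verify` function, and the `Scorer` callable are hypothetical names, and the relative score is assumed to be the speaker model score minus the selected cohort model score, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[Sequence[float]]
# Hypothetical scorer: (model identifier, features) -> score.
Scorer = Callable[[str, Features], float]


@dataclass(frozen=True)
class Enrollment:
    """Per-speaker data held in memory (field names are assumptions)."""
    speaker_model_id: str
    cohort_id: str      # cohort model identification variable: "male", "female", "general"
    threshold: float    # speaker threshold for this claimed identity


def verify(claimed_identity: str, features: Features,
           enrollments: Dict[str, Enrollment], score: Scorer) -> Tuple[bool, float]:
    """Score against the speaker model and its one selected cohort model."""
    record = enrollments[claimed_identity]
    speaker_score = score(record.speaker_model_id, features)
    cohort_score = score(record.cohort_id, features)
    relative_score = speaker_score - cohort_score  # assumed definition
    return relative_score > record.threshold, relative_score
```

Only three shared cohort models are stored, and each verification needs two scoring passes: one against the speaker model and one against the cohort model named by the stored identification variable.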
In certain embodiments of the invention, the unsupervised classifier employs a simplified hidden Markov modeling (HMM) training method. If desired, the speaker verification system is operable at an arbitrary rate type including an equal error rate. In addition, the speech utterance provided by the claimant is of a substantially short duration. In other embodiments of the invention, a relative score is generated when the speech utterance is scored against the selected one of the male cohort model, the female cohort model, the general cohort model, and the speaker model, and the relative score is compared to the claimant speaker threshold. Also, the speech utterance provided by the claimant is a predetermined verification phrase in certain embodiments of the invention. Pre-processing and feature extraction circuitry removes silence and extracts a plurality of cepstral features of the speech utterance provided by the claimant.
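To make the equal error rate operating point concrete, the sketch below finds the threshold at which the false acceptance rate and the false rejection rate are closest. The text does not describe how the threshold is obtained; the search over observed scores, the function names, and the score values are all assumptions for illustration.

```python
from typing import Sequence, Tuple


def error_rates(genuine: Sequence[float], impostor: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false acceptance rate, false rejection rate) at a threshold.

    A claimant is accepted when the score exceeds the threshold.
    """
    far = sum(s > threshold for s in impostor) / len(impostor)
    frr = sum(s <= threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_threshold(genuine: Sequence[float],
                          impostor: Sequence[float]) -> float:
    """Pick the observed score at which FAR and FRR differ the least."""
    def gap(threshold: float) -> float:
        far, frr = error_rates(genuine, impostor, threshold)
        return abs(far - frr)

    candidates = sorted(set(genuine) | set(impostor))
    return min(candidates, key=gap)


# Hypothetical relative scores from true speakers and from impostors.
genuine_scores = [2.0, 3.0, 4.0, 5.0]
impostor_scores = [0.0, 1.0, 2.5, 3.5]
threshold = equal_error_threshold(genuine_scores, impostor_scores)
print(threshold)                                               # 2.5
print(error_rates(genuine_scores, impostor_scores, threshold))  # (0.25, 0.25)
```

Choosing a different operating point only changes the selection rule, for example picking the lowest threshold whose false acceptance rate stays under a target value.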
Other aspects of the invention can be found in a method that performs speaker verification by claiming an identity by recording a speech utterance, pre-processing and feature extracting the speech utterance, scoring the speech utterance against a speaker model and a cohort model, and determining an authenticity of the speech utterance. In certain embodiments of the invention, a relative score is generated during the scoring of the speech utterance against the speaker model and the cohort model. In addition, the pre-processing and feature extraction performed on the speech utterance includes removing silence and extracting a plurality of cepstral features of the speech utterance provided by the claimant. The method that performs speaker verification is operable wherein the speech utterance is of a substantially short duration and wherein the method is performed at an arbitrary rate type including an equal error rate.
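The pre-processing and feature extraction step named above can be sketched as follows. The text does not say how silence is detected or which cepstral analysis is used; the frame-energy test, the `energy_floor` value, the tiny frame length, and the naive real cepstrum are assumptions chosen to keep the example short and dependency-free.

```python
import cmath
import math
from typing import List, Sequence


def remove_silence(samples: Sequence[float], frame_len: int = 4,
                   energy_floor: float = 0.01) -> List[List[float]]:
    """Split into frames and keep those whose mean energy exceeds energy_floor.

    Trailing samples that do not fill a whole frame are dropped.
    """
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = list(samples[start:start + frame_len])
        energy = sum(x * x for x in frame) / frame_len
        if energy > energy_floor:
            kept.append(frame)
    return kept


def real_cepstrum(frame: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Real cepstrum: inverse DFT of the log magnitude spectrum.

    Uses a naive O(N^2) DFT; eps guards against taking log(0).
    """
    n = len(frame)
    log_mag = []
    for k in range(n):
        bin_k = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        log_mag.append(math.log(abs(bin_k) + eps))
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)).real / n
        for t in range(n)
    ]


# Hypothetical utterance: silence, one active frame, then low-level noise.
samples = [0.0] * 4 + [math.e, 0.0, 0.0, 0.0] + [0.001] * 4
frames = remove_silence(samples)                 # only the active frame survives
features = [real_cepstrum(f) for f in frames]    # one cepstral vector per frame
```

The resulting per-frame feature vectors are what would then be scored against the speaker model and the cohort model to form the relative score.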
Other aspects, advantages and novel features of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.