Current speaker recognition systems use different modeling approaches for different levels of text dependency. For example, a Text Independent (TI) system designed to work on free conversational speech would use Gaussian Mixture Models (GMMs), typically adapted from a global GMM referred to as a Universal Background Model (UBM). Alternatively, a Text Constrained (TC) speaker recognition system designed to work only on spoken digits would use a Hidden Markov Model (HMM) as a UBM, and adapt target speaker models that are also HMMs. For a digit system, the HMM may be a word HMM composed of an HMM for every individual digit, as described by Che et al., in “An HMM Approach to Text-Prompted Speaker Verification”, ICASSP, 1996; that is, the digit HMMs are rearranged according to the prior knowledge of the spoken digit string. For a system that expects another (perhaps wider) subset of the spoken language, a phonetic HMM may be trained, which has a more complicated structure and more symbols. Finally, for a pass-phrase Text Dependent (TD) speaker recognition system, a rigid HMM will typically be trained for the target model, or the corresponding phonetic subset will be adapted from a phonetic UBM.
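By way of illustration only, the rearrangement of per-digit word HMMs according to a known digit string can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of Che et al.: the names DIGIT_STATES and compose_digit_string_hmm, the three states per digit, and the self-loop probability of 0.6 are all hypothetical, and only the state topology is shown (the per-state Gaussian mixtures are omitted).

```python
from typing import Dict, List, Tuple

# Hypothetical per-digit word models: each digit is a left-to-right HMM,
# represented here only by its number of emitting states (assumed to be 3).
DIGIT_STATES: Dict[str, int] = {str(d): 3 for d in range(10)}


def compose_digit_string_hmm(
    digits: str, self_loop: float = 0.6
) -> Tuple[List[Tuple[str, int]], List[Dict[int, float]]]:
    """Concatenate per-digit left-to-right HMMs in the order of a known digit string.

    Returns (labels, transitions). labels[i] is (digit, state index within
    that digit's HMM); transitions[i] maps a next-state index to its probability.
    """
    labels: List[Tuple[str, int]] = []
    for d in digits:
        if d not in DIGIT_STATES:
            raise ValueError(f"not a digit: {d!r}")
        labels.extend((d, k) for k in range(DIGIT_STATES[d]))

    n = len(labels)
    transitions: List[Dict[int, float]] = []
    for i in range(n):
        if i + 1 < n:
            # Either stay in the current state or advance to the next one;
            # the last state of one digit advances into the first state of the next.
            transitions.append({i: self_loop, i + 1: 1.0 - self_loop})
        else:
            # The final state of the composite model only loops on itself.
            transitions.append({i: 1.0})
    return labels, transitions


labels, transitions = compose_digit_string_hmm("305")
```

In this sketch, a different prompted digit string simply yields a different ordering of the same per-digit models, which is the sense in which the word HMM is rearranged from prior knowledge of the text.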
All of the above modeling techniques employ Gaussian mixtures for estimating the Probability Density Functions (PDFs) of the speaker features, either as the model itself (GMM) or as the PDFs of individual states in an HMM. However, the current target model training and adaptation techniques for each system are different. GMM training involves determining the Gaussian parameters only. HMM training involves jointly training the Gaussian parameters, the state structure, and the transition probabilities. As a result, separate speaker recognition systems are employed when different levels of text dependency are expected, and users who need to use all of them must enroll in each system independently.
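The common building block described above, a Gaussian mixture used either as the whole model or as the PDF of one HMM state, can be sketched as follows. This is an illustrative sketch only: the class name DiagGaussianMixture, the function gmm_utterance_score, and the choice of diagonal covariances are assumptions not taken from the text.

```python
import math
from typing import List, Sequence


class DiagGaussianMixture:
    """Gaussian mixture with diagonal covariances (an assumed simplification)."""

    def __init__(
        self,
        weights: Sequence[float],
        means: Sequence[Sequence[float]],
        variances: Sequence[Sequence[float]],
    ) -> None:
        if not (len(weights) == len(means) == len(variances)):
            raise ValueError("weights, means and variances must have equal length")
        self.weights = list(weights)
        self.means = [list(m) for m in means]
        self.variances = [list(v) for v in variances]

    def log_pdf(self, x: Sequence[float]) -> float:
        """Log density of one feature vector, computed with log-sum-exp."""
        comps: List[float] = []
        for w, mu, var in zip(self.weights, self.means, self.variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            comps.append(ll)
        mx = max(comps)
        return mx + math.log(sum(math.exp(c - mx) for c in comps))


def gmm_utterance_score(gmm: DiagGaussianMixture, frames: Sequence[Sequence[float]]) -> float:
    """TI-style use: the mixture is the model; score is the mean frame log-likelihood."""
    return sum(gmm.log_pdf(f) for f in frames) / len(frames)


# HMM-style use: each state owns one mixture, and the emission log-probability
# of a frame in state j is hmm_state_pdfs[j].log_pdf(frame).
hmm_state_pdfs = [
    DiagGaussianMixture([1.0], [[0.0]], [[1.0]]),
    DiagGaussianMixture([0.5, 0.5], [[1.0], [3.0]], [[1.0], [1.0]]),
]
```

The same log_pdf routine serves both cases; what differs between the systems is how the mixture parameters, and for an HMM the state structure and transitions, are trained.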
FIG. 1 is a block diagram illustrating a combination TI/conversational and TD/TC speaker recognition system 100, according to the prior art. The system 100 actually consists of two essentially separate systems: a TI/conversational speaker verification system 101 and a TD/TC speaker verification system 102.
The speaker verification systems 101 and 102 share a microphone/telephone (hereinafter “microphone”) 105. The microphone 105 is connected in signal communication with the TI/conversational speaker verification system 101 and with the TD/TC speaker verification system 102. The TI/conversational speaker verification system 101 is also connected in signal communication with a database of TI user voiceprints 110. The TD/TC speaker verification system 102 is also connected in signal communication with a database of TD user voiceprints 120. Outputs of the TI/conversational speaker verification system 101 and the TD/TC speaker verification system 102 are outputs of the combination TI/conversational and TD/TC speaker recognition system 100, and provide speaker recognition decisions and similarity scores.
Accordingly, it would be desirable and highly advantageous to have a single speaker recognition system and method capable of different levels of text dependency.