This invention relates to a speaker recognition system which selects inhibiting cohort reference patterns.
In a manner which will be described later in more detail, conventional speaker recognition techniques have a problem in that recognition accuracy is degraded by differences between enrollment and test conditions, for example, additive noise and line characteristics. In order to resolve this problem, a likelihood ratio normalizing method which uses inhibiting reference patterns has been proposed by, among others, Higgins, Rosenberg, and Matsui. Specifically, a first document is A. Higgins, L. Bahler, and J. Porter: "Speaker Verification Using Randomized Phrase Prompting", Digital Signal Processing, 1, pp. 89-106 (1991). A second document is Aaron E. Rosenberg, Joel DeLong, Chin-Hui Lee, Biing-Hwang Juang, and Frank K. Soong: "The Use of Cohort Normalized Scores for Speaker Verification", ICSLP92, pp. 599-602 (1992). A third document is Tomoko Matsui and Sadaoki Furui: "Speaker Recognition Using Concatenated Phoneme Models", ICSLP92, pp. 603-606 (1992).
Generally, in the likelihood ratio normalizing method, N inhibiting speakers are selected in order of closeness, starting from the speaker whose voice is the closest to the voice of the true speaker. Normalization of the likelihood ratio is then carried out, when distances are calculated at the time of verifying, by subtracting the likelihood of the inhibiting speakers from the likelihood of the true speaker. The likelihood to be subtracted is, for example, the maximum likelihood among the inhibiting speakers or the average likelihood of the inhibiting speakers. Since differences between the environments at the time of recording and at the time of verifying influence both the likelihood of the true speaker and the likelihood of the inhibiting speakers, the influence of these differences can be removed by subtracting the likelihood of the inhibiting speakers from the likelihood of the true speaker.
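The normalization described above can be illustrated by the following minimal sketch. It assumes that all scores are log-likelihoods, so that the subtraction corresponds to a likelihood ratio, and it models an environmental difference as a common additive offset on every score, which is an idealization. The function name `normalized_score`, the parameter `mode`, and all numerical values are hypothetical and do not appear in the cited documents.

```python
from statistics import mean


def normalized_score(target_loglik, inhibiting_logliks, mode="max"):
    """Subtract an inhibiting-speaker statistic from the true speaker's score.

    All scores are assumed to be log-likelihoods (an assumption of this sketch).
    mode="max" subtracts the maximum likelihood among the inhibiting speakers;
    mode="mean" subtracts their average likelihood.
    """
    if not inhibiting_logliks:
        raise ValueError("at least one inhibiting speaker is required")
    if mode == "max":
        inhibiting_stat = max(inhibiting_logliks)
    elif mode == "mean":
        inhibiting_stat = mean(inhibiting_logliks)
    else:
        raise ValueError("mode must be 'max' or 'mean'")
    return target_loglik - inhibiting_stat


# Hypothetical scores obtained under matched recording and verifying conditions.
target = -40.0
inhibiting = [-55.0, -60.0, -52.0]

# Idealized environmental difference: the same offset lowers every score.
offset = -7.5
shifted_target = target + offset
shifted_inhibiting = [score + offset for score in inhibiting]

matched = normalized_score(target, inhibiting)
mismatched = normalized_score(shifted_target, shifted_inhibiting)
print(matched, mismatched)  # 12.0 12.0
```

In this idealized case the raw score of the true speaker changes from -40.0 to -47.5, whereas the normalized score is unchanged, because the common offset appears in both terms of the subtraction.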
As explained in detail in the second document, the method of Rosenberg uses the utterance of the true speaker at the time of recording when calculating the similarities used to select the inhibiting reference patterns. In contrast, as explained in detail in the first and the third documents, the methods of Higgins and Matsui use the utterance of the true speaker at the time of verifying when calculating the similarities used to select the inhibiting reference patterns.
However, since the method of Rosenberg selects the inhibiting speakers at the time of recording, the effect of normalization is decreased when the environments at the time of recording and at the time of verifying are different. Also, since the methods of Higgins and Matsui calculate, at the time of verifying, similarities between each of the inhibiting reference patterns and the utterance of the true speaker, a large amount of processing is required to calculate the similarities against the reference patterns of a large number of speakers. Therefore, the methods of Higgins and Matsui select the inhibiting speakers from a small number of speakers. In this case, it is very difficult to select the inhibiting speakers accurately.
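The difference between selecting the inhibiting speakers at the time of recording and at the time of verifying can be illustrated by the following minimal sketch. It is not the procedure of any of the cited documents. The function names, the two-dimensional feature vectors, and the similarity measure (negative squared Euclidean distance) are assumptions chosen only to keep the example small, and the pool of reference patterns is assumed not to contain the true speaker.

```python
def toy_similarity(utterance, pattern):
    """Hypothetical similarity: negative squared Euclidean distance."""
    return -sum((u - p) ** 2 for u, p in zip(utterance, pattern))


def select_inhibiting_speakers(utterance, reference_patterns, similarity, n):
    """Return the n speaker ids whose reference patterns are closest to the utterance.

    The similarity is evaluated once per speaker in the pool, so the cost of
    one selection grows with the number of pooled speakers.
    """
    scored = [
        (similarity(utterance, pattern), speaker_id)
        for speaker_id, pattern in reference_patterns.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)  # most similar first
    return [speaker_id for _, speaker_id in scored[:n]]


# Hypothetical pool of reference patterns, excluding the true speaker.
pool = {"A": [0.0, 0.0], "B": [1.0, 1.0], "C": [3.0, 3.0], "D": [5.0, 5.0]}

# The same true speaker observed under two different environments.
recording_utterance = [0.4, 0.4]
verifying_utterance = [2.2, 2.2]

chosen_at_recording = select_inhibiting_speakers(
    recording_utterance, pool, toy_similarity, n=2
)
chosen_at_verifying = select_inhibiting_speakers(
    verifying_utterance, pool, toy_similarity, n=2
)
print(chosen_at_recording)  # ['A', 'B']
print(chosen_at_verifying)  # ['C', 'B']
```

In this toy example the speakers selected from the recording utterance differ from those selected from the verifying utterance, which corresponds to the decreased effect of normalization under mismatched environments. Selecting at the time of verifying avoids that mismatch, but requires one similarity calculation per pooled speaker for every verification.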