The present invention relates to a speaker collation apparatus, a method, and a storage medium, and particularly to a speaker collation apparatus, a method, and a storage medium characterized by the manner in which the standard pattern of inhibition speakers is generated.
A major problem in speaker collation is that differences in ambient noise and in line characteristics (environmental differences) between registration and collation decrease the ratio of collation. One approach to this problem is the likelihood normalization method based on the standard pattern of inhibition speakers, proposed by Higgins, Rosenberg, and Matsui et al. Examples are: A. Higgins, L. Bahler, and J. Porter: "Speaker collation using randomized phrase prompting," Digital Signal Processing, 1, pp. 89-106 (1991) (Reference 1); A. E. Rosenberg, Joel Delong, Chin-Hui Lee, Biing-Hwang Juang, Frank K. Soong: "The use of cohort normalized scores for speaker collation," ICSLP 92, pp. 599-602 (1992) (Reference 2); and Tomoko Matsui, Sadaoki Furui: "Speaker adaptation of tied-mixture-based phoneme models for text-prompted speaker recognition," ICASSP 94, pp. 125-128 (1994) (Reference 3).
The likelihood normalization method based on the standard pattern of inhibition speakers normalizes a likelihood by subtracting the likelihood between an inputted voice and the standard pattern of inhibition speakers (the likelihood of inhibition speakers) from the likelihood between the inputted voice and the standard pattern of the identical person (the likelihood of the identical person). Because environmental differences between registration and collation affect both the likelihood of the identical person and the likelihood of inhibition speakers, subtracting the latter from the former yields a likelihood that is not easily affected by environmental differences. Known methods for selecting inhibition speakers are a method that selects inhibition speakers similar to the voice of the identical person in registration and a method that selects inhibition speakers similar to the inputted voice in collation. The former method is described in detail in Reference 2, and the latter in Reference 1 and Reference 3.
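The subtraction described above can be illustrated with a minimal Python sketch. It assumes that the likelihoods are log-likelihoods and that, when several inhibition speakers are used, their likelihoods are combined by a simple mean; neither assumption is stated in the text, and the function name `normalized_likelihood` is hypothetical.

```python
def normalized_likelihood(identical_ll, inhibition_lls):
    """Subtract the likelihood of inhibition speakers from the likelihood
    of the identical person.

    Assumptions (not from the source): the values are log-likelihoods, and
    several inhibition speakers are combined by their arithmetic mean.
    """
    if not inhibition_lls:
        raise ValueError("at least one inhibition speaker is required")
    inhibition_ll = sum(inhibition_lls) / len(inhibition_lls)
    return identical_ll - inhibition_ll


# An environmental offset that lowers every likelihood by the same amount
# cancels in the subtraction, which is the effect the text describes.
clean = normalized_likelihood(-40.0, [-50.0, -54.0])
noisy = normalized_likelihood(-45.0, [-55.0, -59.0])
```

In this toy case `clean` and `noisy` are equal, because the common offset of -5.0 appears in both terms of the subtraction.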
In the likelihood normalization method using the standard pattern of inhibition speakers, a good ratio of collation is obtained when the environmental differences among the registered voice, the collated voice, and the standard pattern of inhibition speakers are as small as possible. The problem is that large environmental differences among them reduce the ratio of collation. To solve this problem, many standard patterns of the candidates of inhibition speakers must be prepared in advance for the respective environments in registration and collation.
However, it is difficult to prepare many standard patterns of the candidates of inhibition speakers for the respective environments. Therefore, a method is required that achieves a good ratio of collation without preparing the standard patterns of the candidates of inhibition speakers for the respective environments.
As a solution for the case of a large environmental difference between the registered voice and the standard pattern of inhibition speakers, a likelihood normalization method has been proposed that adapts the standard pattern of inhibition speakers using the registered voice, acquires the likelihood between the adapted standard pattern of inhibition speakers and the collated voice (the likelihood of inhibition speakers), and subtracts the likelihood of inhibition speakers from the likelihood between the collated voice and the standard pattern of the identical person.
This method reduces environmental differences between the registered voice and the standard pattern of inhibition speakers by adapting the standard pattern of inhibition speakers on the basis of the voice of the identical person in registration. It is effective when inhibition speakers are selected in registration, and is described in detail in Reference 4 by Yamada and Hattori (a method and a system for generating a reducing standard pattern, namely a cohort, in speaker recognition, and a speaker collation apparatus including the system; Japanese Patent Application No. 1997-040102).
It is therefore an object of the present invention to provide a speaker collation apparatus, method, and storage medium capable of acquiring a high ratio of collation, in a method that selects the standard patterns of inhibition speakers in collation, without generating in advance the standard patterns of the candidates of inhibition speakers for many environments.
Other objects of the present invention will become clear as the description proceeds.
According to an aspect of the present invention, there is provided a speaker collation apparatus comprising: an analysis section for converting inputted voice data for collation to a characteristic vector; a storage section of the characteristic vector for storing the characteristic vector converted in said analysis section; a storage section of a standard pattern of candidates of inhibition speakers in which one or more standard patterns of candidates of inhibition speakers have been stored; a selection section for selecting at least one inhibition speaker by calculating a similarity degree between the characteristic vector converted in said analysis section and the standard patterns of respective speakers stored in said storage section of the standard pattern of candidates of inhibition speakers; an adaptation section for acquiring a mapping function from a characteristic vector space of a voice of an inhibition speaker to a characteristic vector space of the inputted voice, using the standard pattern of inhibition speakers selected in said selection section and the characteristic vector stored in said storage section of the characteristic vector, and for adapting the standard pattern of inhibition speakers by using the mapping function acquired; a calculation section of a similarity degree of inhibition speakers for calculating the similarity degree between the characteristic vector stored in said storage section of the characteristic vector and the standard pattern of inhibition speakers adapted in said adaptation section; a storage section of the standard pattern of the identical person in which the registered standard pattern of the identical person has been stored; a calculation section of a similarity degree to the identical person for calculating the similarity degree between the characteristic vector stored in said storage section of the characteristic vector and the standard pattern of the identical person stored in said storage section of the standard pattern of the identical person; a normalization section of the similarity degree for normalizing the similarity degree by using the similarity degree calculated in said calculation section of a similarity degree to the identical person and the similarity degree calculated in said calculation section of a similarity degree of inhibition speakers; a threshold value storage section for storing a threshold value previously determined; and a decision section for deciding the person by using the similarity degree normalized in said normalization section of the similarity degree and the threshold value stored in said threshold value storage section.
The speaker collation apparatus may further comprise a normalization section for normalizing said characteristic vector converted in said analysis section, said standard pattern of candidates of inhibition speakers stored in said storage section of the standard pattern of candidates of inhibition speakers, and said standard pattern of the identical person stored in said storage section of the standard pattern of the identical person.
According to another aspect of the present invention, there is also provided a speaker collation apparatus comprising: an analysis section for converting inputted voice data for collation to a characteristic vector; a storage section of the characteristic vector for storing the characteristic vector converted in said analysis section; a storage section of a standard pattern of candidates of inhibition speakers in which one or more standard patterns of candidates of inhibition speakers have been stored; an adaptation section for acquiring a mapping function from a characteristic vector space of a voice of each speaker to a characteristic vector space of the inputted voice, using all standard patterns of speakers stored in said storage section of the standard pattern of candidates of inhibition speakers and the characteristic vector stored in said storage section of the characteristic vector, and for adapting the standard patterns of the speakers by using the mapping function acquired; a selection section of inhibition speakers for selecting at least one inhibition speaker by calculating a similarity degree between the characteristic vector converted in said analysis section and the standard patterns of speakers adapted in said adaptation section; a calculation section of a similarity degree of inhibition speakers for calculating the similarity degree between the characteristic vector stored in said storage section of the characteristic vector and said standard pattern of inhibition speakers selected in said selection section of inhibition speakers; a storage section of the standard pattern of the identical person in which the registered standard pattern of the identical person has been stored; a calculation section of a similarity degree to the identical person for calculating the similarity degree between the characteristic vector stored in said storage section of the characteristic vector and the standard pattern of the identical person stored in said storage section of the standard pattern of the identical person; a normalization section of the similarity degree for normalizing the similarity degree by using the similarity degree calculated in said calculation section of a similarity degree to the identical person and the similarity degree calculated in said calculation section of a similarity degree of inhibition speakers; a threshold value storage section for storing a threshold value previously determined; and a decision section for deciding the person by using the similarity degree normalized in said normalization section of the similarity degree and the threshold value stored in said threshold value storage section.
The speaker collation apparatus may further comprise a normalization section for normalizing said characteristic vector converted in said analysis section, said standard pattern of candidates of inhibition speakers stored in said storage section of the standard pattern of candidates of inhibition speakers, and said standard pattern of the identical person stored in said storage section of the standard pattern of the identical person.
According to yet another aspect of the present invention, there is provided a method of collating a speaker, said method comprising the steps of: calculating a similarity degree between a characteristic vector acquired from a collated voice and a standard pattern of respective speakers stored in a storage section for a standard pattern of candidates of inhibition speakers; selecting at least one inhibition speaker; acquiring a mapping function from a characteristic vector space of a standard pattern of inhibition speakers to a characteristic vector space of the collated voice; adapting the standard pattern of inhibition speakers by using the mapping function acquired; calculating a likelihood of inhibition speakers based on the likelihood between the adapted standard pattern of inhibition speakers and the collated voice; calculating the likelihood of the identical person based on the likelihood between the standard pattern of the identical person and the collated voice; acquiring a normalized likelihood by subtracting said likelihood of inhibition speakers from said likelihood of the identical person; and deciding the person based on the normalized likelihood.
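The sequence of steps above (score candidates, select, adapt, score again, subtract, decide) can be sketched in Python as follows. This is only an illustration under stated assumptions, none of which come from the text: a standard pattern is taken to be a list of mean vectors scored with unit-variance Gaussians, the mapping function is taken to be a single bias vector that moves the centroid of the pattern onto the centroid of the collated voice, the top-ranked candidates are selected, and the likelihood of inhibition speakers is the mean over the selected speakers. All function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def log_likelihood(frames, means):
    """Average per-frame score against the closest mean of the pattern
    (unit-variance Gaussian log-density up to a constant; an assumption)."""
    total = 0.0
    for x in frames:
        total += max(
            -0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) for mu in means
        )
    return total / len(frames)


def adapt(means, frames):
    """Assumed mapping function: one bias vector shifting the pattern
    centroid onto the centroid of the collated voice."""
    bias = [f - m for f, m in zip(mean_vector(frames), mean_vector(means))]
    return [[mi + bi for mi, bi in zip(mu, bias)] for mu in means]


def verify(frames, identical_means, candidates, n_cohort, threshold):
    """Return (normalized likelihood, accept decision)."""
    # Rank candidates by likelihood against the collated voice and select.
    ranked = sorted(
        candidates,
        key=lambda name: log_likelihood(frames, candidates[name]),
        reverse=True,
    )
    selected = ranked[:n_cohort]
    # Adapt only the selected patterns, then score them again.
    inhibition_ll = sum(
        log_likelihood(frames, adapt(candidates[name], frames))
        for name in selected
    ) / len(selected)
    identical_ll = log_likelihood(frames, identical_means)
    score = identical_ll - inhibition_ll
    return score, score >= threshold
```

A call such as `verify(frames, identical_means, candidates, n_cohort=1, threshold=0.0)` takes `frames` as a list of characteristic vectors and `candidates` as a dictionary from speaker name to standard pattern; the threshold value is application-dependent and is not given in the text.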
The step of selecting at least one inhibition speaker may be carried out by any one of the following selection methods, applied to the candidates ranked in descending order of likelihood: the top N persons, N persons chosen at random, or N persons around the M-th percentile.
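The selection options named above could be sketched in Python as shown below. The reading of the options is an assumption: "top" takes the N candidates with the highest likelihood, "random" draws N candidates without replacement, and "percentile" takes a window of N candidates centred on the M-th percentile position of the ranking. The function `select_inhibition_speakers` and its parameters are hypothetical names.

```python
import random


def select_inhibition_speakers(scores, n, method="top", m_percentile=50.0, rng=None):
    """Select n speakers from `scores`, a dict of speaker name -> likelihood.

    The three methods are one interpretation of the options in the text,
    not a definitive implementation.
    """
    # Rank candidates in descending order of likelihood.
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = min(n, len(ranked))
    if method == "top":
        return ranked[:n]
    if method == "random":
        rng = rng or random.Random()
        return rng.sample(ranked, n)
    if method == "percentile":
        # Position of the M-th percentile in the ranking (0 = best match).
        center = int(round(m_percentile / 100.0 * (len(ranked) - 1)))
        # Window of n candidates around that position, clamped to the list.
        start = max(0, min(center - n // 2, len(ranked) - n))
        return ranked[start:start + n]
    raise ValueError("unknown selection method: " + method)
```

Passing a seeded `random.Random` instance as `rng` makes the random selection reproducible, which can be convenient when comparing ratios of collation across runs.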
According to yet another aspect of the present invention, there is also provided a method of collating a speaker, said method comprising the steps of: normalizing a characteristic vector of an input for collation, a standard pattern of candidates of inhibition speakers, and a standard pattern of the identical person; calculating a likelihood between the normalized standard pattern of candidates of inhibition speakers and the normalized characteristic vector; selecting inhibition speakers; acquiring a mapping function from a characteristic vector space of the standard pattern of the selected inhibition speakers to a characteristic vector space of a collated voice; adapting the standard pattern of inhibition speakers by using the mapping function obtained; calculating a likelihood of inhibition speakers based on the likelihood between the adapted standard pattern of inhibition speakers and the collated voice; calculating the likelihood of the identical person based on the likelihood between the normalized standard pattern of the identical person and the normalized characteristic vector; calculating a normalized likelihood by subtracting said likelihood of inhibition speakers from said likelihood of the identical person; and deciding the person based on the normalized likelihood.
The step of selecting inhibition speakers may be carried out by any one of the following selection methods, applied to the candidates ranked in descending order of likelihood: the top N persons, N persons chosen at random, or N persons around the M-th percentile.
According to yet another aspect of the present invention, there is also provided a method of collating a speaker, said method comprising the steps of: acquiring a mapping function from a characteristic vector space of a standard pattern of each of all candidates of inhibition speakers to a characteristic vector space of a collated voice; adapting the standard pattern of the candidates of inhibition speakers by using the respective mapping functions acquired; calculating a likelihood between the adapted standard pattern of the candidates of inhibition speakers and the characteristic vector of the collated voice; selecting inhibition speakers; calculating a likelihood of inhibition speakers based on the likelihood between the selected standard pattern of inhibition speakers and the collated voice; calculating the likelihood of the identical person based on the likelihood between the standard pattern of the identical person and the collated voice; calculating a normalized likelihood by subtracting said likelihood of inhibition speakers from said likelihood of the identical person; and deciding the person based on the normalized likelihood.
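In this variant every candidate is adapted before selection, so the ranking used for selection can differ from the ranking of the unadapted patterns. The Python sketch below illustrates that effect with scalar characteristic values. It assumes, purely for illustration, that the mapping function is a single bias moving the centroid of a pattern onto the centroid of the collated voice and that patterns are scored with unit-variance Gaussians; the text specifies neither, and all names are hypothetical.

```python
def score(frames, means):
    """Average per-frame score against the closest mean (assumed
    unit-variance Gaussian log-density up to a constant)."""
    return sum(max(-0.5 * (x - m) ** 2 for m in means) for x in frames) / len(frames)


def adapt(means, frames):
    """Assumed mapping function: shift the pattern so that its centroid
    coincides with the centroid of the collated voice."""
    bias = sum(frames) / len(frames) - sum(means) / len(means)
    return [m + bias for m in means]


def select_after_adaptation(frames, candidates, n):
    """Adapt all candidates first, then keep the n best adapted patterns."""
    adapted = {name: adapt(means, frames) for name, means in candidates.items()}
    ranked = sorted(adapted, key=lambda name: score(frames, adapted[name]), reverse=True)
    return [(name, adapted[name]) for name in ranked[:n]]


# Toy data: candidate "a" has the right shape but a large offset, while
# candidate "b" is close to the input without matching its shape.
frames = [4.0, 6.0]
candidates = {"a": [0.0, 2.0], "b": [5.0, 5.0]}
best_unadapted = max(candidates, key=lambda name: score(frames, candidates[name]))
best_adapted = select_after_adaptation(frames, candidates, 1)[0][0]
```

With these toy values `best_unadapted` is `"b"` while `best_adapted` is `"a"`, which shows how adapting before selection can change which inhibition speakers are chosen.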
The step of selecting inhibition speakers may be carried out by any one of the following selection methods, applied to the candidates ranked in descending order of likelihood: the top N persons, N persons chosen at random, or N persons around the M-th percentile.
According to yet another aspect of the present invention, there is also provided a method of collating a speaker, said method comprising the steps of: acquiring a mapping function from a characteristic vector space of normalized standard patterns of all candidates of inhibition speakers to a characteristic vector space of the normalized collated voice; adapting the standard pattern of the candidates of inhibition speakers by using the mapping function acquired; selecting inhibition speakers by acquiring a likelihood between the adapted standard pattern of the candidates of inhibition speakers and the characteristic vector of the collated voice; calculating a likelihood of inhibition speakers based on the likelihood between the selected standard pattern of inhibition speakers and the collated voice; calculating the likelihood of the identical person based on the likelihood between the normalized standard pattern of the identical person and the normalized characteristic vector; calculating a normalized likelihood by subtracting said likelihood of inhibition speakers from said likelihood of the identical person; and deciding the person based on the normalized likelihood.
The step of selecting inhibition speakers may be carried out by any one of the following selection methods, applied to the candidates ranked in descending order of likelihood: the top N persons, N persons chosen at random, or N persons around the M-th percentile.
According to still another aspect of the present invention, there is provided a computer readable memory medium for storing a program of collating a speaker, said program comprising: calculating a similarity degree between a characteristic vector acquired from a collated voice and a standard pattern of respective speakers stored in a storage section for a standard pattern of candidates of inhibition speakers; selecting at least one inhibition speaker; acquiring a mapping function from a characteristic vector space of a standard pattern of inhibition speakers to a characteristic vector space of the collated voice; adapting the standard pattern of inhibition speakers by using the mapping function acquired; calculating a likelihood of inhibition speakers based on the likelihood between the adapted standard pattern of inhibition speakers and the collated voice; calculating the likelihood of the identical person based on the likelihood between the standard pattern of the identical person and the collated voice; acquiring a normalized likelihood by subtracting said likelihood of inhibition speakers from said likelihood of the identical person; and deciding the person based on the normalized likelihood.
According to still another aspect of the present invention, there is provided a computer readable memory medium for storing a program of collating a speaker, said program comprising: normalizing a characteristic vector of an input for collation, a standard pattern of candidates of inhibition speakers, and a standard pattern of the identical person; calculating a likelihood between the normalized standard pattern of candidates of inhibition speakers and the normalized characteristic vector; selecting inhibition speakers; acquiring a mapping function from a characteristic vector space of the standard pattern of the selected inhibition speakers to a characteristic vector space of a collated voice; adapting the standard pattern of inhibition speakers by using the mapping function obtained; calculating a likelihood of inhibition speakers based on the likelihood between the adapted standard pattern of inhibition speakers and the collated voice; calculating the likelihood of the identical person based on the likelihood between the normalized standard pattern of the identical person and the normalized characteristic vector; calculating a normalized likelihood by subtracting said likelihood of inhibition speakers from said likelihood of the identical person; and deciding the person based on the normalized likelihood.
According to still another aspect of the present invention, there is also provided a computer readable memory medium for storing a program of collating a speaker, said program comprising: acquiring a mapping function from a characteristic vector space of a standard pattern of each of all candidates of inhibition speakers to a characteristic vector space of a collated voice; adapting the standard pattern of the candidates of inhibition speakers by using the respective mapping functions acquired; calculating a likelihood between the adapted standard pattern of the candidates of inhibition speakers and the characteristic vector of the collated voice; selecting inhibition speakers; calculating a likelihood of inhibition speakers based on the likelihood between the selected standard pattern of inhibition speakers and the collated voice; calculating the likelihood of the identical person based on the likelihood between the standard pattern of the identical person and the collated voice; calculating a normalized likelihood by subtracting said likelihood of inhibition speakers from said likelihood of the identical person; and deciding the person based on the normalized likelihood.
According to still another aspect of the present invention, there is also provided a computer readable memory medium for storing a program of collating a speaker, said program comprising: acquiring a mapping function from a characteristic vector space of normalized standard patterns of all candidates of inhibition speakers to a characteristic vector space of the normalized collated voice; adapting the standard pattern of the candidates of inhibition speakers by using the mapping function acquired; selecting inhibition speakers by acquiring a likelihood between the adapted standard pattern of the candidates of inhibition speakers and the characteristic vector of the collated voice; calculating a likelihood of inhibition speakers based on the likelihood between the selected standard pattern of inhibition speakers and the collated voice; calculating the likelihood of the identical person based on the likelihood between the normalized standard pattern of the identical person and the normalized characteristic vector; calculating a normalized likelihood by subtracting said likelihood of inhibition speakers from said likelihood of the identical person; and deciding the person based on the normalized likelihood.