An example of a voice data analyzing device has been described in Non-patent Literature 1. The voice data analyzing device of Non-patent Literature 1 learns speaker models, each of which characterizes the voice of one speaker, by use of previously stored voice data of each speaker and the corresponding speaker labels.
For example, the voice data analyzing device learns the speaker model for each of a speaker A (voice data X1, X4, ...), a speaker B (voice data X2, ...), a speaker C (voice data X3, ...), a speaker D (voice data X5, ...), and so on.
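The per-speaker learning described above can be sketched as follows. This is a minimal illustration, assuming each stored utterance has already been reduced to a fixed-length feature vector; a single diagonal Gaussian stands in for the richer speaker models of Non-patent Literature 1, and all names (train_speaker_models, etc.) are illustrative, not taken from the source.

```python
from collections import defaultdict

def train_speaker_models(utterances):
    """utterances: list of (speaker_label, feature_vector) pairs, e.g.
    [("A", x1), ("B", x2), ("C", x3), ("A", x4), ("D", x5), ...].
    Returns {speaker_id: (mean_vector, variance_vector)}."""
    # Group the stored voice data by its speaker label.
    grouped = defaultdict(list)
    for speaker_id, vec in utterances:
        grouped[speaker_id].append(vec)
    # Learn one model per speaker from that speaker's data only.
    models = {}
    for speaker_id, vecs in grouped.items():
        dim = len(vecs[0])
        mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        # Small floor on the variance keeps the model usable even when a
        # speaker has very little data.
        var = [sum((v[d] - mean[d]) ** 2 for v in vecs) / len(vecs) + 1e-6
               for d in range(dim)]
        models[speaker_id] = (mean, var)
    return models
```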
Thereafter, the voice data analyzing device receives unknown voice data X obtained independently of the stored voice data and executes a matching process of calculating the degree of similarity between each of the learned speaker models and the voice data X based on a definitional equation involving factors such as the probability that the speaker model generates the voice data X. In this example, the voice data analyzing device outputs the speaker IDs (identifiers each identifying a speaker, corresponding to the above A, B, C, D, . . . ) of the speaker models whose degrees of similarity are ranked high among all the speaker models or exceed a prescribed threshold value. In another example, speaker matching means 205 receives unknown voice data X and a certain speaker ID (specified speaker ID) as a data pair and executes a matching process of calculating the degree of similarity between the speaker model having the specified speaker ID and the voice data X. A judgment result is then outputted indicating whether the degree of similarity exceeds a prescribed threshold value, that is, whether the voice data X belongs to the speaker of the specified speaker ID.
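The two matching processes above can be sketched as follows. As a hedged simplification, the degree of similarity is taken to be the log-probability that a diagonal-Gaussian speaker model generates voice data X; identify() corresponds to outputting speaker IDs whose similarity exceeds a prescribed threshold (ranked high to low), and verify() corresponds to judging whether X belongs to one specified speaker ID. The function names and the Gaussian model form are assumptions for illustration, not taken from the literature.

```python
import math

def log_likelihood(model, x):
    """Log-probability that a (mean, variance) speaker model generates x."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var[d]) + (x[d] - mean[d]) ** 2 / var[d])
        for d in range(len(mean)))

def identify(models, x, threshold):
    """Speaker IDs whose similarity to x exceeds the threshold, best first."""
    scores = {sid: log_likelihood(m, x) for sid, m in models.items()}
    return sorted((sid for sid in scores if scores[sid] > threshold),
                  key=lambda sid: -scores[sid])

def verify(models, x, specified_id, threshold):
    """Judgment result: does voice data x belong to the specified speaker ID?"""
    return log_likelihood(models[specified_id], x) > threshold
```

In practice the threshold would be calibrated on held-out data; a fixed value is used here purely to keep the sketch self-contained.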
Meanwhile, a speaker feature extraction device has been described in Patent Literature 1, for example. The speaker feature extraction device clusters speakers based on a vocal tract length expansion/contraction coefficient with respect to a standard speaker, and generates a Gaussian mixture distribution-type acoustic model by executing the learning for the set of speakers belonging to each cluster. By calculating the likelihood of an acoustic sample of an inputted speaker with respect to each of the generated acoustic models, the speaker feature extraction device extracts one acoustic model, the one yielding the highest likelihood, as a feature of the inputted speaker.
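The approach described above can be sketched as follows, under loudly stated assumptions: speakers are binned by a scalar vocal tract length expansion/contraction coefficient relative to a standard speaker, one acoustic model is learned per cluster (a single diagonal Gaussian here, standing in for the Gaussian mixture models of Patent Literature 1), and the identity of the highest-likelihood model is extracted as the speaker feature. The bin edges, function names, and model form are all illustrative, not taken from the patent.

```python
import math

def cluster_by_warp(coeffs, edges):
    """coeffs: {speaker_id: warp coefficient w.r.t. the standard speaker};
    edges: sorted bin boundaries. Returns {cluster_index: [speaker_id, ...]}."""
    clusters = {}
    for sid, c in coeffs.items():
        idx = sum(1 for e in edges if c >= e)  # which bin the coefficient falls in
        clusters.setdefault(idx, []).append(sid)
    return clusters

def fit_model(samples):
    """Diagonal Gaussian (mean, variance) over a cluster's pooled samples."""
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / len(samples) for d in range(dim)]
    var = [sum((s[d] - mean[d]) ** 2 for s in samples) / len(samples) + 1e-6
           for d in range(dim)]
    return mean, var

def extract_feature(cluster_models, sample):
    """Cluster index whose acoustic model gives the sample the highest likelihood."""
    def ll(model):
        mean, var = model
        return sum(
            -0.5 * (math.log(2 * math.pi * var[d])
                    + (sample[d] - mean[d]) ** 2 / var[d])
            for d in range(len(mean)))
    return max(cluster_models, key=lambda idx: ll(cluster_models[idx]))
```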