1. Field of the Invention
The present invention relates to a speaker recognition apparatus, a computer program for speaker recognition, and a speaker recognition method, for recognizing a speaker by using personal information contained in a voice wave.
2. Description of the Related Art
A text-dependent speaker recognition apparatus, which recognizes a speaker based on a voice speaking predetermined contents, and a text-independent speaker recognition apparatus, which identifies a speaker based on a voice speaking any contents, have been proposed as speaker recognition apparatuses.
In general, a speaker recognition apparatus converts an input voice wave into an analogue signal, converts the analogue signal into a digital signal, executes a discrete analysis of the digital signal, and then produces a voice feature vector sequence which contains personal information. Here, a cepstrum coefficient is used as the voice feature vector. In a registration mode, the speaker recognition apparatus clusters the voice feature vector sequence into a predetermined number of clusters, for example, thirty-two clusters, and produces a codebook of representative vectors, each of which is the centroid of one cluster (see Furui, “Speech Information Processing”, 1st ed, pp 56-57, Morikita Shuppan Co., Ltd, Japan). In an identification mode, the speaker recognition apparatus calculates, for each voice feature vector, a distance between the voice feature vector sequence produced from the input voice wave and the pre-registered codebook, takes the average of these distances (an average distance), and identifies the speaker based on the average distance.
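The registration and identification steps described above can be sketched as follows. This is only a minimal illustration, not the method of any particular apparatus: the function names `train_codebook` and `average_distance` are hypothetical, the clustering is assumed to be a plain k-means procedure, and the distance is assumed to be Euclidean, none of which is specified in the text.

```python
import math

def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` (Euclidean distance assumed)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

def train_codebook(vectors, num_clusters, iterations=20):
    """Registration mode: cluster the feature vectors and return the cluster centroids."""
    if len(vectors) < num_clusters:
        raise ValueError("need at least as many vectors as clusters")
    # Assumption: start from evenly spaced training vectors (the text gives no rule).
    step = len(vectors) // num_clusters
    codebook = [list(vectors[i * step]) for i in range(num_clusters)]
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest_index(v, codebook)].append(v)
        # Replace each centroid by the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                codebook[i] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook

def average_distance(vectors, codebook):
    """Identification mode: mean distance from each vector to its nearest centroid."""
    total = sum(math.dist(v, codebook[nearest_index(v, codebook)]) for v in vectors)
    return total / len(vectors)

# Toy two-dimensional "feature vectors"; real cepstrum vectors have more dimensions.
training = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
codebook = train_codebook(training, num_clusters=2)
print(codebook)
print(average_distance(training, codebook))
```

With thirty-two clusters, as in the example given in the text, `num_clusters` would be 32 and the training sequence would need to be correspondingly longer.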
In the case where the speaker recognition apparatus is used as a speaker verification apparatus, a distance between the voice feature vector sequence produced from a speaker to be recognized and a codebook with respect to the speaker is calculated, and the distance and a threshold value are compared to execute speaker verification. In the case where the speaker recognition apparatus is used as a speaker identification apparatus, distances between the voice feature vector sequence produced from a speaker to be identified and codebooks of all registered speakers are calculated, and the shortest distance is selected from the plurality of distances corresponding to the registered speakers to execute speaker identification.
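The difference between verification and identification can be sketched as below. The names `verify_speaker` and `identify_speaker`, the Euclidean distance, and the choice to accept when the distance is less than or equal to the threshold are assumptions made for illustration; the text only states that the distance is compared with a threshold value.

```python
import math

def average_distance(vectors, codebook):
    """Mean distance from each feature vector to its nearest codebook entry."""
    return sum(min(math.dist(v, c) for c in codebook) for v in vectors) / len(vectors)

def verify_speaker(vectors, claimed_codebook, threshold):
    """Verification: compare the distance to one speaker's codebook with a threshold."""
    # Assumption: a distance at or below the threshold counts as acceptance.
    return average_distance(vectors, claimed_codebook) <= threshold

def identify_speaker(vectors, codebooks):
    """Identification: pick the registered speaker with the shortest distance."""
    return min(codebooks, key=lambda name: average_distance(vectors, codebooks[name]))

# Hypothetical registered speakers, each with a one-entry codebook for brevity.
codebooks = {"alice": [[0.0, 1.0]], "bob": [[10.0, 11.0]]}
utterance = [[0.0, 0.0], [0.0, 2.0]]

print(identify_speaker(utterance, codebooks))
print(verify_speaker(utterance, codebooks["alice"], threshold=1.5))
print(verify_speaker(utterance, codebooks["bob"], threshold=1.5))
```

Verification needs only the codebook of the claimed speaker, whereas identification must evaluate the codebooks of all registered speakers.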
Currently, a cepstrum coefficient reflecting the shape of the vocal tract, or a pitch indicating the vibrational frequency of the vocal cords, is commonly used as a voice feature amount. This information contains phonological information indicating the contents of speech, and personal information depending on the speaker. When a difference between speakers' voices is calculated as a distance, it is not desirable to compare the dispersion of the phonological information with the dispersion of the personal information, because the dispersion of the phonological information is broader than that of the personal information. Rather, it is desirable to compare the same phonological information. Therefore, in an existing speaker recognition apparatus, approximate normalization by phonemes is executed by clustering the vector dispersion in the observation space, and a speaker distance reflecting personality, which is obtained by comparing approximately the same phonemes, is calculated as a distortion amount.
When clustering the voice feature vector sequence, however, the order to which the voice feature vector should be set is problematic. In general, a large amount of phonological information exists in the low orders, while a large amount of personal information exists in the high orders. Therefore, if the voice feature vector is set to a low order in order to improve the phonological resolving performance when clustering, the speaker resolving performance may be lowered. Conversely, if the voice feature vector is set to a high order in order to raise the speaker resolving performance, the phonological resolving performance may be lowered. This gives rise to a trade-off. Because of this problem, the voice feature vector order is currently set to the most appropriate order as determined experimentally.
Accordingly, an object of the present invention is to eliminate the trade-off relationship between the phonological resolving performance and the speaker resolving performance, and to realize precise speaker recognition.