An example of related art concerning a speech classification method is described in Non-patent Document 1. FIG. 6 is a block diagram showing a configuration example of a speech classification apparatus to which the speech classification method described in Non-patent Document 1 has been applied. The speech classification apparatus shown in FIG. 6 includes a speech storage unit 801, an initialization unit 802, an inter-cluster distance calculation unit 803, a cluster pair integration unit 804, a stop determination unit 805, and a cluster storage unit 806.
The speech classification apparatus shown in FIG. 6 operates as follows. First, the initialization unit 802 collectively reads the speech data (speech signals each extracted to a finite length) stored in the speech storage unit 801, defines as many clusters as there are speech data, and sets an initial classification in which each speech data belongs to its own cluster. Specifically, the initialization unit 802 allocates a unique cluster ID to each speech data, calculates statistics (mean, variance, sufficient statistics, and the like) for each cluster using the speech data to which the same cluster ID has been allocated, and stores the results of the calculation in the cluster storage unit 806.
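The initialization step can be illustrated with a minimal sketch. It is not taken from Non-patent Document 1: the diagonal-Gaussian statistics (frame count, per-dimension sum, per-dimension sum of squares) and the names `ClusterStats`, `stats_of`, and `initialize` are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Frame = List[float]       # one feature vector, e.g. one MFCC frame
Utterance = List[Frame]   # one speech data: a finite-length series of frames


@dataclass
class ClusterStats:
    """Sufficient statistics of a diagonal Gaussian (assumed model)."""
    n: int            # number of frames
    s1: List[float]   # per-dimension sum of the frames
    s2: List[float]   # per-dimension sum of squares of the frames

    def mean(self) -> List[float]:
        return [s / self.n for s in self.s1]

    def variance(self) -> List[float]:
        return [q / self.n - (s / self.n) ** 2
                for s, q in zip(self.s1, self.s2)]


def stats_of(utt: Utterance) -> ClusterStats:
    """Accumulate the statistics of a single speech data."""
    dim = len(utt[0])
    s1 = [sum(f[d] for f in utt) for d in range(dim)]
    s2 = [sum(f[d] ** 2 for f in utt) for d in range(dim)]
    return ClusterStats(len(utt), s1, s2)


def initialize(
    utterances: List[Utterance],
) -> Tuple[Dict[int, int], Dict[int, ClusterStats]]:
    """Give every speech data its own cluster ID and store its statistics."""
    assignment = {i: i for i in range(len(utterances))}         # data -> cluster ID
    store = {i: stats_of(u) for i, u in enumerate(utterances)}  # cluster ID -> stats
    return assignment, store
```

Keeping sums rather than means in the store means that two clusters can later be consolidated by simple addition, without revisiting the underlying frames.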
Next, the inter-cluster distance calculation unit 803 calculates a distance (difference level) between any two clusters, based on the statistics for each cluster stored in the cluster storage unit 806. Then, the cluster pair integration unit 804 selects the cluster pair having the minimum distance calculated by the inter-cluster distance calculation unit 803, and consolidates the cluster pair. Herein, the cluster ID of one cluster of the cluster pair to be consolidated is assigned to all speech data belonging to the other cluster. Then, statistics of the consolidated cluster are recalculated, using the speech data group to which that cluster ID has been allocated, and are stored in the cluster storage unit 806.
The stop determination unit 805 determines the appropriateness of the current classified state, that is, whether or not cluster consolidation is to be further performed, based on a predetermined rule derived from the current statistics for each cluster. When the stop determination unit 805 determines that cluster consolidation should not be further performed (determines the current classified state to be appropriate), the stop determination unit 805 outputs the current classified state as a final result of classification. On the other hand, when the stop determination unit 805 determines that cluster consolidation should be further performed (determines the current classified state not to be appropriate), the inter-cluster distance calculation unit 803 and the cluster pair integration unit 804 each repeat the operation described above, based on the current classified state.
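The loop formed by units 803 to 805 can be sketched as follows. This is an illustration under stated assumptions, not the method of Non-patent Document 1: each speech data is reduced to a single mean vector, the distance is the squared Euclidean distance between cluster means, and the stop rule (hypothetical function `should_stop`) halts once the closest remaining pair is farther apart than `threshold`.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Stats = Tuple[int, Vector]   # (number of speech data, sum of their mean vectors)


def distance(a: Stats, b: Stats) -> float:
    """Assumed distance: squared Euclidean distance between cluster means."""
    (na, sa), (nb, sb) = a, b
    return sum((x / na - y / nb) ** 2 for x, y in zip(sa, sb))


def closest_pair(store: Dict[int, Stats]) -> Tuple[float, int, int]:
    """Inter-cluster distance calculation over every pair of clusters."""
    ids = sorted(store)
    return min((distance(store[p], store[q]), p, q)
               for i, p in enumerate(ids) for q in ids[i + 1:])


def should_stop(store: Dict[int, Stats], threshold: float) -> bool:
    """Hypothetical stop rule: remaining clusters are all far apart."""
    return len(store) < 2 or closest_pair(store)[0] > threshold


def agglomerate(means: List[Vector], threshold: float) -> List[int]:
    # Initialization: one cluster per speech data.
    assignment = list(range(len(means)))
    store: Dict[int, Stats] = {i: (1, list(m)) for i, m in enumerate(means)}
    while len(store) > 1:
        _, keep, drop = closest_pair(store)
        # Cluster pair integration: relabel the members of `drop` with `keep`
        # and recompute the statistics of the consolidated cluster.
        assignment = [keep if c == drop else c for c in assignment]
        (n1, s1), (n2, s2) = store[keep], store.pop(drop)
        store[keep] = (n1 + n2, [x + y for x, y in zip(s1, s2)])
        # Stop determination on the current classified state.
        if should_stop(store, threshold):
            break
    return assignment
```

For example, `agglomerate([[0.0], [0.1], [5.0], [5.2]], 1.0)` groups the first two and the last two speech data. Because the stop check follows each consolidation, as in the description above, at least one consolidation is always performed when two or more speech data are given.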
Such a classification method is referred to as a “shortest distance method”. Further, as a data type of speech data (speech signal), a time series of feature vectors constituted from features that reflect a speaker or an environment, such as Mel-frequency cepstral coefficients (MFCC) often used in a speech recognition system, is employed.
Another example of related art concerning a speech classification method is described in Non-patent Document 2. FIG. 7 is a block diagram showing a configuration example of a speech classification apparatus to which the speech classification method described in Non-patent Document 2 has been applied. The speech classification apparatus shown in FIG. 7 includes a speech input unit 901, a speech-cluster distance calculation unit 902, a cluster number determination unit 903, a speech-cluster consolidation unit 904, and a cluster storage unit 905.
The speech classification apparatus shown in FIG. 7 operates as follows. First, the speech input unit 901 receives sequentially input speech data, and delivers the received speech data one by one to the speech-cluster distance calculation unit 902. Upon reception of one speech data, the speech-cluster distance calculation unit 902 calculates statistics of the one speech data (such as mean, variance, and sufficient statistics). Further, the speech-cluster distance calculation unit 902 refers to the statistics of each of the N clusters already stored in the cluster storage unit 905, and calculates a distance (difference level) between the one speech data and each cluster. The cluster number determination unit 903 selects the cluster whose distance to the input speech data is the smallest. When the value of this minimum distance is larger than a predetermined threshold value, the cluster number determination unit 903 determines the number of the clusters to be N+1. Otherwise, the cluster number determination unit 903 determines the number of the clusters to remain at N.
When the number of the clusters determined by the cluster number determination unit 903 is N+1, the speech-cluster consolidation unit 904 creates a new cluster having the input speech data as its constituent, and stores statistics of the new cluster in the cluster storage unit 905. On the other hand, when the number of the clusters remains at N, the input speech data is consolidated into the cluster that has been selected by the cluster number determination unit 903 as having the minimum distance to the input speech data. The speech-cluster consolidation unit 904 recalculates statistics of this cluster, and stores the statistics in the cluster storage unit 905.
In the speech classification apparatus in this example, at a stage where no speech data has yet been input, that is, where no cluster is present (N=0) in the cluster storage unit 905, the speech-cluster distance calculation unit 902 performs no particular process, and the cluster number determination unit 903 determines the number of clusters to be N+1 (that is, 1). Then, the speech-cluster consolidation unit 904 creates a new cluster having the input speech data as its constituent and stores the new cluster in the cluster storage unit 905.
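The sequential procedure of units 901 to 905 can be sketched as below. The sketch is illustrative only and is not the method of Non-patent Document 2: each speech data is assumed to be summarized by one mean vector, the distance is assumed to be the squared Euclidean distance to the cluster mean, and the class name `OnlineClusterer` is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


class OnlineClusterer:
    """Sequential clustering with a fixed distance threshold (assumed rule)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        # Cluster storage: (number of speech data, sum of their vectors).
        self.clusters: List[Tuple[int, Vector]] = []

    def _distance(self, x: Vector, cluster: Tuple[int, Vector]) -> float:
        n, total = cluster
        return sum((a - t / n) ** 2 for a, t in zip(x, total))

    def add(self, x: Vector) -> int:
        """Classify one speech data and return the index of its cluster."""
        if not self.clusters:                 # N = 0: always create a cluster
            self.clusters.append((1, list(x)))
            return 0
        dists = [self._distance(x, c) for c in self.clusters]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] > self.threshold:      # cluster number becomes N + 1
            self.clusters.append((1, list(x)))
            return len(self.clusters) - 1
        n, total = self.clusters[best]        # cluster number remains N
        self.clusters[best] = (n + 1, [t + a for t, a in zip(total, x)])
        return best
```

Feeding the values 0.0, 0.2, 5.0, 0.1, and 5.3 in that order with a threshold of 1.0 yields two clusters. Each speech data is processed once, on arrival, and the decision made for it is never revisited.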
Patent Document 1 describes a speaker clustering processing apparatus in which clustering is performed using distances between speakers, with the algorithm adopted in the well-known SPLIT method. In this method, the distances between speakers are calculated in advance for all combinations of speakers, and then, with reference to the calculated distances, division is executed starting from the cluster whose sum of the distances between speakers is the largest.
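The selection of the cluster to be divided can be sketched as follows. Only the two steps stated above are shown (the precomputed distance table and the choice of the cluster with the largest sum of distances); the division itself is omitted because its details are not given here. The squared Euclidean distance and the function names are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

DistanceTable = Dict[Tuple[int, int], float]


def pairwise_distances(speakers: Sequence[Sequence[float]]) -> DistanceTable:
    """Compute the distance for every combination of speakers in advance."""
    return {
        (i, j): sum((a - b) ** 2 for a, b in zip(speakers[i], speakers[j]))
        for i, j in combinations(range(len(speakers)), 2)
    }


def cluster_to_split(clusters: List[List[int]], table: DistanceTable) -> int:
    """Return the index of the cluster whose summed distances are largest."""
    def spread(members: List[int]) -> float:
        # Look up each pair in the table; keys are stored with i < j.
        return sum(table[tuple(sorted(pair))]
                   for pair in combinations(members, 2))
    return max(range(len(clusters)), key=lambda k: spread(clusters[k]))
```

A cluster containing a single speaker has a sum of zero and is therefore never preferred over a cluster with two or more distinct speakers.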
Patent Document 1:
JP Patent Kokai Publication No. JP-A-11-175090 (column 0026)
Non-Patent Document 1:
S. S. Chen, E. Eide, M. J. F. Gales, R. A. Gopinath, D. Kanevsky, and P. Olsen, “Automatic Transcription of Broadcast News”, Speech Communication, 2002, Vol. 37, pp. 69-87
Non-Patent Document 2:
D. Liu and F. Kubala, “Online Speaker Clustering”, Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004, Vol. 1, pp. 333-336