1. Field of the Invention
The present invention relates to a speaker clustering apparatus based on feature quantities of a vocal-tract configuration, and to a speech recognition apparatus provided with the speaker clustering apparatus. In particular, it relates to a speaker clustering apparatus for generating hidden Markov models (hereinafter referred to as HMMs) for a plurality of clusters by performing a speaker clustering process based on feature quantities of a vocal-tract configuration obtained from speech waveform data. It further relates to a speech recognition apparatus for recognizing speech by selecting an HMM that is optimum for a speaker targeted for speech recognition from the HMMs of the plurality of clusters generated by the speaker clustering apparatus.
2. Description of the Prior Art
The use of gender-dependent acoustic models is an effective way to improve speech recognition performance. However, since speakers still vary widely in their features even within the same gender, several speaker clustering methods have been proposed for obtaining more detailed speaker cluster models. For example, Japanese Patent Laid-Open Publication No. 7-261785 proposed not only a tree-structured, hierarchical speaker clustering method but also a fast speaker adaptation method based on the selection of speaker clusters defined on the tree structure. The effectiveness of these methods in providing an initial model for speaker adaptation was also disclosed in, for example, Japanese Patent Laid-Open Publication No. 8-110792.
In order to obtain highly efficient speaker clusters by such a speaker clustering method, it is necessary to define an appropriate distance between speakers. In previous work on speaker clustering, acoustic feature quantities, in particular distances between the acoustic models used for recognition, such as speaker-dependent HMMs, have widely been used as the distance between speakers for clustering.
However, speaker clustering that uses distances between the acoustic models used for recognition, such as speaker-dependent HMMs, as in the prior art described above, has the following problems: large amounts of speech waveform data are required to obtain a higher speech recognition rate, a storage unit having a large storage capacity must be provided, and the amount of computation involved in speaker clustering becomes very large. Further, speech recognition using HMMs obtained by speaker clustering with relatively small amounts of speech waveform data has the additional problem that the speech recognition rate remains low.