1. Field of the Invention
The present invention relates to an apparatus, a method, and a program for clustering phonemic models by regarding nodes in a tree structure as clusters.
2. Description of the Related Art
In the field of speech recognition, a commonly used technique represents the acoustic features of input speech by a probability model for each phoneme. These probability models are called “phonemic models”. A phonemic model is obtained by statistically training its parameters using speech data of the corresponding phoneme as pronounced. The accuracy of the phonemic model depends on the speech data used in the training. Therefore, to obtain a highly accurate phonemic model, it is preferable that the training be performed using as much speech data as possible.
However, in some cases, a sufficient amount of speech data cannot be used in the training of the phonemic model. For example, in the case of Southeast Asian languages including Thai, only a small amount of speech data can be utilized for the training of some phonemic models. Therefore, only a less accurate phonemic model can be obtained as compared to Euro-American languages or the like, for which a relatively large amount of speech data can be utilized. In addition, Thai phonemes distinguish between short vowels and long vowels, and for some phonemes less speech data can be utilized for the long vowels than for the short vowels. For such phonemes, the phonemic models of the long vowels have relatively lower accuracy than the phonemic models of the short vowels.
To train, with high accuracy, a phonemic model for which only a small amount of speech data can be used, a well-known method called “adaptive training” is applied. In the adaptive training of a phonemic model A having only a small amount of usable speech data, a phonemic model B different from the phonemic model A is selected, and an initial phonemic model is obtained by performing training using the speech data corresponding to the phonemic models A and B. Parameters of the initial phonemic model are then adaptively updated using the small amount of speech data corresponding to the phonemic model A, thereby training the phonemic model A. In other words, the adaptive training is a technique that adaptively updates the parameters of an initial phonemic model by using speech data corresponding to a phonemic model to be trained (training-target phonemic model), thereby obtaining the training-target phonemic model based on the initial phonemic model.
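The two stages of the adaptive training described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each phonemic model is reduced to a single scalar mean and that the adaptive update is a MAP-style interpolation; the function names, the weight `tau`, and the sample values are hypothetical and are not taken from the related art.

```python
from statistics import fmean

def train_initial_mean(data_a, data_b):
    """Initial model: mean estimated from the pooled speech data of A and B."""
    return fmean(list(data_a) + list(data_b))

def adapt_mean(initial_mean, data_a, tau):
    """MAP-style update (an assumed form of adaptive updating).

    tau is a hypothetical weight: the larger it is, the more the result
    stays close to the initial model instead of the data of A alone.
    """
    n = len(data_a)
    return (tau * initial_mean + sum(data_a)) / (tau + n)

# Toy 1-D "speech data": A is scarce, B is plentiful and similar to A.
data_a = [3.0, 5.0]
data_b = [2.0, 2.0, 2.0, 2.0]

initial = train_initial_mean(data_a, data_b)   # trained on A and B together
adapted = adapt_mean(initial, data_a, tau=2.0) # updated with A only
```

With these values the adapted mean falls between the initial mean and the sample mean of A, which reflects why the quality of the initial phonemic model matters when the data of A is scarce.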
It is known that, in the adaptive training, a highly accurate phonemic model can be obtained even when there is little speech data corresponding to the training-target phonemic model, provided that a phonemic model sufficiently similar to the training-target phonemic model is selected as an initial phonemic model and that the initial phonemic model is itself highly accurate. Therefore, to obtain a highly accurate phonemic model by the adaptive training, it is necessary to obtain, for an arbitrary phonemic model of a certain language, at least one other phonemic model of the same language, or at least one phonemic model of another language, that is similar to the arbitrary phonemic model.
To obtain phonemic models similar to each other, a technique that clusters phonemic models using a tree structure is known. This technique makes it possible to obtain, from among all phonemic models to be clustered (clustering-target phonemic models), at least one phonemic model cluster including a set of phonemic models similar to each other.
For example, Japanese Patent No. 3547349 and “Tree-based state tying for high accuracy acoustic modeling” (S. J. Young et al., Proceedings of the workshop on Human Language Technology, 1994, pp. 307 to 312, FIG. 2) have proposed techniques for clustering phonemic models using a decision tree. In the techniques using a decision tree, questions associated with types of clustering-target phonemic models are applied starting from a root node including all of the clustering-target phonemic models, thereby hierarchically adding new child nodes each including a child set of phonemic models similar to each other, and generating a tree structure composed of nodes each including a set of phonemic models. A set of phonemic models included in a node having no child node (leaf node) in the generated tree structure is obtained as a cluster of phonemic models.
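The decision-tree procedure described above can be sketched as follows. This is a toy illustration only, not the method of the cited documents: each phonemic model is assumed to be a single scalar `mean`, a question is assumed to be a yes/no test on a hypothetical attribute label, and the split criterion (reduction of squared scatter, with a hypothetical threshold `min_gain`) stands in for whatever similarity measure is actually used.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    name: str
    attrs: frozenset  # hypothetical phoneme attributes, e.g. {"vowel", "long"}
    mean: float       # toy 1-D stand-in for the model's acoustic parameters

@dataclass
class Node:
    models: list
    question: Optional[str] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

def scatter(models):
    """Sum of squared deviations of the toy means (lower = more similar)."""
    if not models:
        return 0.0
    mu = sum(m.mean for m in models) / len(models)
    return sum((m.mean - mu) ** 2 for m in models)

def grow(models, questions, min_gain):
    """Apply the best question to a node and recurse into the child nodes."""
    node = Node(models)
    best = None
    for q in questions:
        yes = [m for m in models if q in m.attrs]
        no = [m for m in models if q not in m.attrs]
        if not yes or not no:
            continue  # the question does not divide this node
        gain = scatter(models) - scatter(yes) - scatter(no)
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    if best is not None and best[0] >= min_gain:
        _, node.question, yes, no = best
        node.yes = grow(yes, questions, min_gain)
        node.no = grow(no, questions, min_gain)
    return node

def leaves(node):
    """Each leaf node (no child nodes) yields one cluster of model names."""
    if node.yes is None:
        return [[m.name for m in node.models]]
    return leaves(node.yes) + leaves(node.no)

models = [
    Model("a", frozenset({"vowel"}), 1.0),
    Model("a:", frozenset({"vowel", "long"}), 1.2),
    Model("s", frozenset({"fricative"}), 8.0),
    Model("z", frozenset({"fricative", "voiced"}), 8.3),
]
tree = grow(models, ["vowel", "long", "voiced"], min_gain=0.5)
clusters = leaves(tree)
```

In this toy data the root node is split by the "vowel" question, and neither child node is split further because the remaining questions yield a gain below the threshold, so the two leaf nodes are obtained as clusters.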
A set of phonemic models similar to each other can be obtained by focusing on the cluster of the phonemic models thus obtained. That is, with respect to an arbitrary phonemic model of a certain language, at least one of the phonemic models belonging to the phonemic model cluster that includes the arbitrary phonemic model can be selected as another phonemic model similar to the arbitrary phonemic model.
In the techniques proposed in Japanese Patent No. 3547349 and “Tree-based state tying for high accuracy acoustic modeling”, however, the phonemic models are classified solely on the basis of the similarity of the phonemic models. Therefore, a cluster may be generated that includes only phonemic models for which only a small amount of speech data can be utilized for training. In such a case, the accuracy of the initial phonemic model is low, and therefore an improvement in the accuracy of the phonemic model by the adaptive training cannot be assured.
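The problem described above can be made concrete with a short sketch. The phoneme names, the frame counts, and the threshold `min_frames` are hypothetical values chosen only for illustration; the check simply flags a cluster in which every phonemic model has little usable speech data.

```python
def pick_similar(cluster, target):
    """Any other member of the target's cluster counts as a similar model."""
    return [name for name in cluster if name != target]

def is_data_poor(cluster, frames, min_frames):
    """True when every model in the cluster has little training data."""
    return all(frames[name] < min_frames for name in cluster)

# Hypothetical amounts of usable speech data per phonemic model.
frames = {"a": 9000, "a:": 300, "u:": 250, "o:": 400}

# Clusters as they might result from similarity-only clustering.
clusters = [["a", "a:"], ["u:", "o:"]]

flags = [is_data_poor(c, frames, min_frames=1000) for c in clusters]
partners = pick_similar(clusters[1], "u:")
```

Here the second cluster is flagged: the only model similar to "u:" is "o:", which is itself short of data, so an initial phonemic model trained on that cluster would also be of low accuracy.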