1. Technical Field of the Invention
The present invention relates to speech recognition and, more particularly, to a speech recognition system with a two-level decision tree.
2. Background Art
In large vocabulary continuous speech recognition systems, context-dependent phones, typically triphones, and continuous density hidden Markov models (HMMs) are often used to obtain high-accuracy acoustic models. The huge number of triphones and multivariate Gaussian mixture distributions results in too many parameters in a system. It is difficult to maintain a good balance between the model complexity and the number of parameters that can be robustly estimated from the limited training data. The use of phonetic decision trees provides a good solution to this problem. Decision-tree clustering has two advantages over bottom-up approaches. First, by incorporating the phonetic knowledge of the target language into the tree, it can synthesize unseen models or contexts, which do not appear in the training data but occur during recognition. Second, the splitting procedure of decision trees provides a way of balancing the model complexity against the number of parameters to be robustly estimated.
A phonetic decision tree is a type of classification and regression tree (CART). In decision-tree based acoustic modeling, phonetic decision trees are constructed either for each phone model or for each HMM state of each phone. Since the state-based approach provides a more detailed level of sharing and outperforms the model-based approach, the state-based approach is widely used. The phonetic decision tree is a binary tree in which a yes-no question about the phonetic context is attached to each node. An example question is “Is the phone on the right of the current phone a vowel?” A set of states can be recursively partitioned into subsets according to the answers to the questions at each node when traversing the tree from the root node to its leaf nodes. All states that reach the same leaf node are considered similar and are clustered together. The question set can be either manually pre-defined using linguistic and phonetic knowledge of the language, or automatically generated.
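The traversal described above can be illustrated with a minimal sketch. The phone sets, the two questions, the tree shape, and the leaf names below are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# A triphone context is modeled here as (left phone, right phone).
Context = Tuple[str, str]

# Hypothetical phone classes, for illustration only.
VOWELS = {"aa", "ae", "ah", "eh", "iy", "ow", "uw"}
NASALS = {"m", "n", "ng"}


@dataclass
class Node:
    # Internal nodes carry a yes-no question; leaf nodes carry a leaf_id.
    question: Optional[Callable[[Context], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    leaf_id: Optional[str] = None


def right_is_vowel(ctx: Context) -> bool:
    """Is the phone on the right of the current phone a vowel?"""
    return ctx[1] in VOWELS


def left_is_nasal(ctx: Context) -> bool:
    """Is the phone on the left of the current phone a nasal?"""
    return ctx[0] in NASALS


def classify(root: Node, ctx: Context) -> str:
    """Walk from the root to a leaf, answering one question per node."""
    node = root
    while node.leaf_id is None:
        node = node.yes if node.question(ctx) else node.no
    return node.leaf_id


# A hand-built two-question tree for one HMM state of one base phone.
tree = Node(
    question=right_is_vowel,
    yes=Node(
        question=left_is_nasal,
        yes=Node(leaf_id="tied_state_0"),
        no=Node(leaf_id="tied_state_1"),
    ),
    no=Node(leaf_id="tied_state_2"),
)
```

Because the questions refer to phone classes rather than to individual triphones, a context that never occurred in training still reaches some leaf. For example, `classify(tree, ("zh", "uw"))` answers the same two questions as any other context and is assigned to an existing tied state.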
The tree construction is a top-down, data-driven process based on a one-step greedy tree growing algorithm. The goodness-of-split criterion is based on maximum likelihood (ML) of the training data. Initially all corresponding HMM states of all triphones that share the same basic phone are pooled in the root node and the log-likelihood of the training data is calculated based on the assumption that all the states in the node are tied. This node is then split into two by the question that gives the maximum increase in log-likelihood of the training data when partitioning the states in the node. This process is repeated on each resulting node until the increase falls below a threshold. To ensure that each leaf node has sufficient training data to robustly estimate the parameters of its state, a minimum data count for the leaf node is also applied.
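The greedy procedure can be sketched as follows. This is a simplified illustration, not a definitive implementation: it assumes each state is summarized by a count, a sum, and a sum of squares of its feature vectors, that each node is modeled by a single diagonal-covariance Gaussian, and that counts are hard counts. All names (`StateStats`, `QUESTIONS`, `VAR_FLOOR`, the toy data) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

VAR_FLOOR = 1e-4  # assumed variance floor, guards against log(0)
VOWELS = {"aa", "iy", "uw"}   # hypothetical phone classes
NASALS = {"m", "n", "ng"}


@dataclass
class StateStats:
    left: str            # left-context phone
    right: str           # right-context phone
    count: float         # number of frames aligned to this state
    sum_x: List[float]   # per-dimension sum of features
    sum_xx: List[float]  # per-dimension sum of squared features


def pooled_log_likelihood(states: List[StateStats]) -> float:
    """Log-likelihood of the pooled data under one ML diagonal Gaussian.

    Uses L = -n/2 * (d * (1 + log(2*pi)) + sum_d log(var_d)), which needs
    only the accumulated statistics, not the original training frames.
    """
    n = sum(s.count for s in states)
    if n <= 0:
        return 0.0
    dim = len(states[0].sum_x)
    total = dim * (1.0 + math.log(2.0 * math.pi))
    for d in range(dim):
        mean = sum(s.sum_x[d] for s in states) / n
        var = sum(s.sum_xx[d] for s in states) / n - mean * mean
        total += math.log(max(var, VAR_FLOOR))
    return -0.5 * n * total


Question = Callable[[StateStats], bool]
Split = Tuple[float, str, List[StateStats], List[StateStats]]


def best_split(states: List[StateStats], questions: Dict[str, Question],
               min_count: float) -> Optional[Split]:
    """Return (gain, question name, yes set, no set) with the largest gain."""
    base = pooled_log_likelihood(states)
    best: Optional[Split] = None
    for name, q in questions.items():
        yes = [s for s in states if q(s)]
        no = [s for s in states if not q(s)]
        # Enforce the minimum data count on both children.
        if min(sum(s.count for s in yes), sum(s.count for s in no)) < min_count:
            continue
        gain = pooled_log_likelihood(yes) + pooled_log_likelihood(no) - base
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    return best


def grow(states: List[StateStats], questions: Dict[str, Question],
         min_gain: float, min_count: float) -> List[List[StateStats]]:
    """Split greedily, one step at a time; return the leaf clusters."""
    split = best_split(states, questions, min_count)
    if split is None or split[0] < min_gain:
        return [states]  # this node becomes a leaf (one tied state)
    _, _, yes, no = split
    return (grow(yes, questions, min_gain, min_count)
            + grow(no, questions, min_gain, min_count))


QUESTIONS: Dict[str, Question] = {
    "R_vowel": lambda s: s.right in VOWELS,
    "L_nasal": lambda s: s.left in NASALS,
}

# Toy 1-D statistics: states followed by a vowel cluster near 0, others near 10.
states = [
    StateStats("n", "iy", 3.0, [0.0], [2.0]),
    StateStats("t", "iy", 4.0, [0.0], [2.0]),
    StateStats("n", "t", 3.0, [30.0], [302.0]),
    StateStats("t", "s", 4.0, [40.0], [402.0]),
]
leaves = grow(states, QUESTIONS, min_gain=1.0, min_count=3.0)
```

On this toy data the right-context question yields a much larger likelihood gain than the left-context question, so it is chosen at the root. Further splits fall below `min_gain`, leaving two leaf clusters.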
Although the traditional method provides an effective and efficient way to build a decision tree for continuous density HMMs based on the maximum likelihood criterion, it has several problems. One problem stems from the assumption that the parametric form of the initial unclustered states is based on only single mixture Gaussian distributions. After the tree is built, the clustered states have more training data and the number of Gaussian components in each state is increased by a mixture-splitting procedure until the performance of the model set peaks on a development set. Single Gaussian distributions are used during tree construction because a multiple mixture Gaussian distribution for a tree node needs to be re-estimated from the training data, whereas the parameters of a single mixture Gaussian distribution can be calculated efficiently from the cluster members without re-accessing the original training data. However, the single Gaussian distribution is a very crude representation of the acoustic space of each state, and decision trees based on such initial models may not give good clustering of states. There have been many efforts to address this problem. One approach incorporates a so-called m-level optimal subtree into the traditional tree construction to get a multiple mixture Gaussian distribution parameterization of each node, although each member state still has only a single Gaussian distribution as in the traditional approach. Another approach directly estimates, by making some assumptions, the multiple mixture Gaussian distribution for a tree node from the statistics of the member states, which also have multiple mixture Gaussian distributions. Both of these approaches achieve some improvement. Yet another approach estimates the multiple mixture Gaussian distributions of the unclustered states by using the fixed state alignment provided by a previously trained and accurate model set. However, this approach has not been shown to give any improvement in terms of performance.
Another problem with the standard tree-building process is that construction of an optimal tree is an NP-hard problem. Instead, a sub-optimal one-step greedy algorithm is utilized. To make better decisions at each node split, look-ahead search may be used, yet no improvement is obtained. Many efforts address other aspects of the traditional decision-tree based state-clustering approach, such as applying other goodness-of-split criteria, using cross-validation to automatically determine the size of the trees by pruning back instead of using thresholds that have to be determined by many experiments, and expanding the question set to incorporate more knowledge of the language.