The present invention relates generally to statistical model-based speech recognition systems. More particularly, the invention relates to a system and method for improving the accuracy of acoustic models used by the recognition system while at the same time controlling the number of parameters. The discriminative clustering technique allows robust recognizers of small size to be constructed for resource-limited applications such as in embedded systems and consumer products.
Much of today's automatic speech recognition technology relies upon Hidden Markov Model (HMM) representations of features extracted from digitally recorded speech. A Hidden Markov Model is represented by a set of states, a set of vectors defining transitions between certain pairs of states, probabilities that apply to state-to-state transitions, and further sets of probabilities characterizing observed output symbols and initial conditions. Frequently the probabilities associated with the Hidden Markov Model are represented as Gaussians, each specified by a mean and a variance stored as floating point numbers.
Hidden Markov Models can become quite complex, particularly as the number of states representing each speech unit is increased and as more complex Gaussian mixture density components are used. Complexity is further compounded by the need to have additional sets of models to support context-dependent recognition. For example, to support context-dependent recognition in a recognizer that models phonemes, different sets of Gaussians are typically required to represent the different allophones of each phoneme.
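The growth described above can be illustrated with a short back-of-the-envelope calculation. The following is only a sketch: the function name, the assumption of diagonal-covariance Gaussians with one mixture weight each, and all of the figures in the usage lines (40 phonemes, 39-dimensional features, and so on) are hypothetical values chosen for illustration and do not come from this specification. Transition probabilities are ignored.

```python
def gaussian_parameter_count(num_phonemes, allophones_per_phoneme,
                             states_per_model, mixtures_per_state,
                             feature_dim):
    """Count floating point parameters held in the Gaussians of a model set.

    Assumes (hypothetically) diagonal covariances, so each Gaussian stores
    one mean and one variance per feature dimension plus one mixture weight.
    """
    num_gaussians = (num_phonemes * allophones_per_phoneme
                     * states_per_model * mixtures_per_state)
    return num_gaussians * (2 * feature_dim + 1)


# Context-independent models, one Gaussian per state (illustrative figures).
small = gaussian_parameter_count(40, 1, 3, 1, 39)

# Context-dependent models with Gaussian mixtures (illustrative figures).
large = gaussian_parameter_count(40, 10, 3, 8, 39)

print(small, large)
```

With these illustrative figures, adding allophones and mixture components multiplies the parameter count by a factor of 80, which is the kind of growth a Gaussian-reduction technique aims to control.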
The above complexity carries a price. Recognizers with more sophisticated, and hence more robust, models typically require a large amount of memory and processing power. This places a heavy burden on embedded systems and speech-enabled consumer products, because these typically do not have much memory or processing power to spare. What is needed, therefore, is a technique for reducing the number of Gaussians needed to represent speech, while retaining as much accuracy as possible. For the design of memory-restricted embedded systems and consumer products, the most useful solution would give the system designer control over the total number of parameters used.
The present invention provides a technique for improving modeling power while reducing the number of parameters. In its preferred embodiment, the technique takes a bottom-up approach for defining clusters of Gaussians that are sufficiently close to one another to warrant being merged. The technique begins with as many clusters as there are Gaussians used to represent the states of the Hidden Markov Models. Clusters are then agglomerated, in tree fashion, to minimize the dispersion inside each cluster and to maximize the separation between clusters. The agglomerative process proceeds until the desired number of clusters is reached. The system designer may specify the desired number based on memory footprint and processing architecture. A Lloyd-Max clustering algorithm is then performed to move Gaussians from one cluster to another in order to further decrease the dispersion within clusters.
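The two-stage procedure described above (bottom-up agglomeration to a designer-chosen cluster count, followed by Lloyd-Max style reassignment) might be sketched as follows. This is a minimal illustration under several assumptions that are not taken from this specification: Gaussians are diagonal and stored as (mean list, variance list) pairs; the merge rule simply joins the two clusters whose centers are closest; and the cluster center is a plain moment-matched average used as a stand-in, not the center-estimation equations of the invention. All function names are hypothetical.

```python
import math
from itertools import combinations


def bhattacharyya(g1, g2):
    """Bhattacharyya distance between two diagonal Gaussians (mean, var)."""
    total = 0.0
    for m1, v1, m2, v2 in zip(g1[0], g1[1], g2[0], g2[1]):
        v = 0.5 * (v1 + v2)
        total += 0.125 * (m1 - m2) ** 2 / v
        total += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return total


def center(members):
    """Stand-in cluster center: moment-matched average of the members.

    ASSUMPTION: placeholder only; the invention estimates the center
    differently.
    """
    n, dim = len(members), len(members[0][0])
    mean = [sum(g[0][d] for g in members) / n for d in range(dim)]
    var = [sum(g[1][d] + g[0][d] ** 2 for g in members) / n - mean[d] ** 2
           for d in range(dim)]
    return (mean, var)


def agglomerate(gaussians, target):
    """Start with one cluster per Gaussian; merge until `target` remain."""
    clusters = [[g] for g in gaussians]
    while len(clusters) > target:
        # Simplified merge rule: join the two clusters with closest centers.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: bhattacharyya(center(clusters[p[0]]),
                                               center(clusters[p[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def refine(clusters, max_iters=20):
    """Lloyd-Max style pass: move each Gaussian to its nearest center."""
    for _ in range(max_iters):
        centers = [center(c) for c in clusters]
        moved = [[] for _ in centers]
        for g in (g for c in clusters for g in c):
            k = min(range(len(centers)),
                    key=lambda idx: bhattacharyya(g, centers[idx]))
            moved[k].append(g)
        moved = [c for c in moved if c]  # drop clusters that became empty
        if moved == clusters:
            break
        clusters = moved
    return clusters


# Four hypothetical one-dimensional Gaussians reduced to two clusters.
gaussians = [([0.0], [1.0]), ([0.1], [1.0]), ([10.0], [1.0]), ([10.2], [1.0])]
clusters = refine(agglomerate(gaussians, target=2))
```

The `target` argument plays the role of the designer-specified cluster count. In this sketch, dropping an emptied cluster can leave fewer clusters than requested; a fuller implementation would need a policy for that case.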
Unlike conventional systems that tend to merely average Gaussian mean and variance values together, the method of the present invention employs a set of equations that provides the parameters representative of each cluster (e.g., the centroid) such that the Bhattacharyya distance is minimized inside the cluster. This provides a far better way of estimating the parameters representative of the cluster, because it is consistent with the metric used to associate the Gaussians with the cluster itself. In the preferred implementation, the Bhattacharyya distance is minimized through an iterative procedure that we call the minimum mean Bhattacharyya center algorithm.
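For reference, the standard Bhattacharyya distance between two Gaussians, together with one plausible way of writing the center-estimation objective described above, is given below. The distance formula is the textbook definition. The second expression is only our reading of "minimum mean Bhattacharyya center"; the symbols N, mu-hat, and Sigma-hat are introduced here for illustration, and the specific iterative update equations are not reproduced.

```latex
% Bhattacharyya distance between N(\mu_1, \Sigma_1) and N(\mu_2, \Sigma_2)
D_B = \frac{1}{8} (\mu_1 - \mu_2)^{T} \bar{\Sigma}^{-1} (\mu_1 - \mu_2)
      + \frac{1}{2} \ln \frac{\det \bar{\Sigma}}{\sqrt{\det \Sigma_1 \, \det \Sigma_2}},
\qquad \bar{\Sigma} = \frac{\Sigma_1 + \Sigma_2}{2}

% Assumed form of the cluster center for a cluster of N Gaussians
(\hat{\mu}, \hat{\Sigma}) = \arg\min_{\mu, \Sigma}
      \frac{1}{N} \sum_{i=1}^{N}
      D_B\bigl( \mathcal{N}(\mu_i, \Sigma_i), \; \mathcal{N}(\mu, \Sigma) \bigr)
```

The first term of the distance measures the separation of the means and the second measures the mismatch of the covariances, so a center chosen under this objective accounts for both, whereas a plain average of means and variances does not minimize it.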
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.