The present invention is in the field of methods and devices for probabilistic recognition of physical phenomena. More specifically, the present invention is directed to an improved method for speech recognition using a large set of simple probability functions to model speech units grouped into a limited number of clusters.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This application is filed with paper appendices totaling 21 pages, which are incorporated as part of this application.
This invention relates to speech recognition by logic devices or processes, and more particularly to recognizing speech from a large vocabulary using partially-tied probability function mixtures for model state recognition with a reduced number of clusters and an increased number of probability functions used in each cluster.
This art presumes a basic familiarity with statistics and probability recognition processes, as well as familiarity with the extensive state of the art in recognition systems.
U.S. Pat. No. 5,825,978, METHOD AND APPARATUS FOR SPEECH RECOGNITION USING OPTIMIZED PARTIAL MIXTURE TYING OF HMM STATE FUNCTIONS, provides substantial background information related to the current patent; that patent, and each of its cited references, is incorporated herein by reference. The present invention, however, has applications to other probability recognition paradigms and should not be seen as limited by the above-referenced patent.
Current state-of-the-art Large-Vocabulary Continuous Speech Recognition (LVCSR) systems are typically based on state-clustered hidden Markov models (HMMs). Typically, these systems use thousands of state clusters, each represented by a Gaussian mixture model with a few tens of Gaussians. These systems use HMMs to model triphone speech units. The number of triphones is usually very large. For example, models with 10,000 triphones are common. Because each triphone is usually modeled by at least three HMM states, this results in about 30,000 HMM states. Each state is typically modeled by a Gaussian mixture model (GMM) with a few Gaussians. Thus, the total number of Gaussian parameters can be on the order of hundreds of thousands. Estimating a separate GMM for each triphone state would require a huge amount of training data. However, because training data is usually limited, it is not possible to reliably estimate such a large number of parameters.
In one of the first approaches to robust HMM estimation, called the Tied Mixture (TM) HMM, a single set of Gaussian distributions was shared (or tied) across all the states. [1,2] Because the Gaussians were shared, data could be pooled from different HMM states to train the states robustly. Each state was differentiated by a different mixture weight distribution to these shared Gaussians. The shared Gaussians along with the mixture weights defined the state-dependent GMMs. Because of robust parameter estimation, TM HMMs were found to perform significantly better than "fully continuous" HMMs, where each state used a separate GMM.
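The tied-mixture idea described above can be illustrated with a minimal sketch, offered for clarity only and not as the implementation of any cited system. It assumes diagonal-covariance Gaussians and a toy two-Gaussian codebook; the function names, state names, and numeric values are hypothetical and do not appear in the cited references. Every state scores a feature vector against the same shared Gaussians, and only the mixture weights differ from state to state.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def tied_mixture_loglik(x, codebook, weights):
    """Log-likelihood of x for one state: log sum_k w_k * N(x; mean_k, var_k).

    codebook is the shared list of (mean, var) pairs; weights belong to the state.
    """
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, mean, var)
        for w, (mean, var) in zip(weights, codebook)
        if w > 0.0
    ]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical toy codebook shared by every state (2 Gaussians, 2-dim features).
shared_codebook = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]

# Each state stores only its own mixture weights over the shared Gaussians.
state_weights = {"state_a": [0.9, 0.1], "state_b": [0.1, 0.9]}

frame = [0.2, -0.1]  # hypothetical feature vector
scores = {
    name: tied_mixture_loglik(frame, shared_codebook, w)
    for name, w in state_weights.items()
}
best_state = max(scores, key=scores.get)
```

Because the codebook is shared, the Gaussian densities for a frame need only be evaluated once and can be reused for every state; the sketch recomputes them per state purely to keep the code short.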
To get more detailed models than TM systems, phonetically tied mixture (PTM) systems were proposed. In these systems, a separate Gaussian codebook was shared among all triphone states corresponding to the same base phone. [3]
A further development in the art was state-clustered HMMs [4,5,6], where the amount of tying was decreased further. This represents the state of the art in speech recognition technology up to the time of the present invention. In this approach, the amount of tying is considerably less than in a TM or PTM system. HMM states are clustered according to acoustic similarity. The states in each cluster either share the same GMM [4,5], or only share the same set of Gaussians but use different mixture weights for each state. [6, 7] A small number of Gaussians is used for each cluster, and improved acoustic resolution is achieved by increasing the number of state clusters.
In previous work, state-clustered HMMs were experimentally shown to be superior to TM and PTM HMMs (e.g., see [6]). However, in these previous comparisons, the TM and PTM systems had a total of 256 and 4000 Gaussians, respectively, drastically fewer than the total number of Gaussians in the state-clustered system, which had about 24,000 Gaussians. [6] Other previous work with TM and PTM systems [2,8,9] also appears to have used very few Gaussians in comparison to the number generally used in state-clustered systems.
Systems with small numbers of state clusters have previously been studied, but they were not fully explored, because only a few Gaussians (about 200 to 500) were used in the clusters. This led most practitioners in the art to turn to systems with large numbers of clusters, each having few Gaussians.
What is needed is a speech recognition system or method that has the advantages of conceptually simpler mixture tying systems but gives equal or superior performance to state-clustered systems.
A further understanding of the invention can be had from the detailed discussion of specific embodiments below. For purposes of clarity, this discussion refers to devices, methods, and concepts in terms of specific examples. However, the method of the present invention may operate with a wide variety of types of recognition and recognition systems. In particular, while parts of the discussion refer to recognition models as hidden Markov models (HMM), it should be understood that HMM can refer to any type of recognition model unless the context requires otherwise. Likewise, while parts of the discussion refer to Gaussian mixtures as the mixtures used to model probability functions, it should be understood that other continuous or discrete basic probability functions may be used within the context of the invention.
Furthermore, it is well known in the art that logic systems can include a wide variety of different components and different functions in a modular fashion. Different embodiments of a system can include different mixtures of elements and functions and may group various functions as parts of various elements.
For purposes of clarity, the invention is described in terms of systems that include many different innovative components and innovative combinations of components. No inference should be taken to limit the invention to combinations containing all of the innovative components listed in any illustrative embodiment in this specification.
All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.
In accordance with the invention, speech recognition is provided that takes a new approach to the clustering and tying of models used in speech recognition. While it was previously believed that the best-performing systems would be those that used very large numbers of state clusters (about 2000 or more) and few Gaussians per cluster (16 to 32), the current invention uses very few state clusters (about 40 in some embodiments, up to a few hundred in alternative embodiments) and many Gaussians per state cluster (about 1000). In the present invention, models with far more parameter tying and therefore fewer clusters, like phonetically tied mixture (PTM) models, can give better performance in terms of both recognition accuracy and speed. The present invention can use a conceptually simpler PTM system to achieve faster and more accurate performance than current state-of-the-art state-clustered HMM systems.
Experimental results have shown an improvement of between 5% and 10% in word error rate, while cutting the number of Gaussian distance computations in half, for three different Wall Street Journal (WSJ) test sets, by using a PTM system with 38 phone-class state clusters, as compared to a state-clustered system with 937 state clusters. For both systems, the total number of Gaussians was fixed at about 30,000.
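As a rough worked illustration of the comparison above, the following sketch divides the fixed total of about 30,000 Gaussians evenly across the clusters of each system. The even split is an assumption made only for illustration, since the actual allocation per cluster is not stated here, and the function name is hypothetical.

```python
TOTAL_GAUSSIANS = 30_000  # "about 30,000" for both systems, per the text

def gaussians_per_cluster(total_gaussians, num_clusters):
    """Average Gaussians per cluster, assuming an even split (illustrative only)."""
    return total_gaussians / num_clusters

# Cluster counts taken from the experimental comparison described above.
systems = {"PTM": 38, "state-clustered": 937}

for name, clusters in systems.items():
    average = gaussians_per_cluster(TOTAL_GAUSSIANS, clusters)
    print(f"{name}: {clusters} clusters, roughly {average:.0f} Gaussians per cluster")
```

Under this assumption the PTM system averages on the order of several hundred Gaussians per cluster, while the state-clustered system averages about 32, consistent with the ranges discussed above.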
The invention will be better understood upon reference to the following detailed description, taken in conjunction with the accompanying drawings.