Pattern recognition is a problem which is encountered in many applications today. Two common areas in which pattern recognition problems are found are image and speech processing. In image processing, pattern recognition is used, e.g., to recognize a received image and thereby distinguish it from other possible images. In the speech processing context, pattern recognition techniques are used to, e.g., recognize speech. Numerous other pattern recognition applications exist. Pattern recognition generally involves the use of a statistical model. As will be discussed below, there is a need for improved methods and apparatus for generating and training such statistical models.
Inasmuch as speech recognition is a good example of an application where model generation and training is particularly pertinent, the present invention will be discussed primarily in the context of speech recognition systems. However, the following discussion of the present invention as it applies to speech recognition is intended to facilitate the explanation of the present invention without necessarily limiting the invention to the field of speech recognition. In fact, the methods and apparatus of the present invention may be applied to a wide variety of pattern recognition applications.
Speech recognition involves the identification of words, phrases and/or sounds in speech. It generally involves using a speech recognition system including, e.g., a computer, that analyzes the speech using one or more speech recognition models. In speech recognition, different models can be used to represent different words, phrases or sounds. In order to accurately distinguish between similar sounds, the models need to be detailed enough to reflect the distinctions, e.g., in terms of the modeled features, between the sounds being modeled. A variety of different features and/or signal characteristic information may be included in a model with different model parameters normally being used to represent each of the different modeled features or signal characteristics. Hereinafter the terms signal features and signal characteristics will be used interchangeably.
At recognition time, automatic speech recognition (ASR) systems compare various characteristics detected in a received speech segment to speech characteristic information included in previously generated models. ASR systems are frequently implemented on computers using software comprising a speech recognition engine and a set of speech recognition models. The speech recognition engine is the portion of the ASR software that performs the actual speech recognition task using, e.g., the models which are also part of the ASR software. When the speech recognition engine detects a match between a received speech segment and a model, recognition of the word, phrase or sound corresponding to the model is normally said to have occurred.
The models used by the speech recognition engine may be thought of as the fuel that powers the speech recognition engine. As will be discussed below, because of memory constraints in many applications, it is desirable that the models used by a speech recognition engine be as powerful and compact as possible. That is, given a model size constraint imposed, e.g., due to memory limitations, it is desirable that the models used for speech recognition purposes result in a recognition rate that is as high as possible. Since a fixed amount of data is often required to store each model component or parameter, a model size constraint may be expressed in terms of the maximum permissible number of model components or parameters.
Many automatic speech recognition systems in use today rely on speech recognition engines which are based on a statistical pattern recognition framework. In such systems, each speech sound to be recognized is modeled by a probabilistic distribution. Such a statistical pattern recognition approach relies heavily on accurate probabilistic models for the acoustics. The models represent run time data used by the speech recognition engine. The accuracy of a model is a function of: 1) the amount of data used to train the model, 2) the particular features which are modeled, and 3) the overall number of modeled features (often represented in terms of model components or parameters).
Limiting the size of the training data base and/or the number of features used in a model can adversely impact the accuracy of the model. In addition, a poor selection of the features to be modeled can result in a less accurate model, when a fixed number of features are included in a model, than could be achieved using a better selection of features to be included in the model. In addition, while adding model components will increase the accuracy of a model when there is adequate data to accurately model the components being added, the inclusion of model components representing features for which there is insufficient training data to accurately model the component being added, will normally result in a decrease in the accuracy of a speech recognizer. Each feature that is modeled may require the generation and storage of several parameters as part of the model.
In most cases, a statistical model is based on an assumption with regard to the form of the distribution of the data, e.g., speech characteristics. For example, assumptions may be made that the modeled data is, e.g., Gaussian, Laplacian, etc. in form. Once an assumption as to the form of the data has been made, a model may be represented as a series of parameters associated with the assumed form. For example, a model may be based on the assumption that a speech segment is mixed-Gaussian in nature, i.e., it can be represented by a sum of Gaussian components. In such an embodiment, each Gaussian component is normally represented by two features, e.g., mean and variance. Where models are based on the assumption that the speech being modeled is mixed Gaussian in nature, a model for a speech segment normally represents a sum of Gaussian components, where a parameter, referred to as a coefficient, is used to indicate the contribution of each component to the overall modeled sound. Additional parameters, e.g., mean and variance parameters, are used to specifically define each individual component included in the mixture. Accordingly, assuming that a model is based on the assumption that the sounds being modeled are mixed Gaussian in nature, three parameters will normally be stored for each Gaussian component, i.e., a first parameter representing the component's coefficient, a second parameter representing the component's mean and a third parameter representing the component's variance.
For example, a model $\lambda$ may be represented by the following equation:

$$\lambda = C_0\, g(m_0, v_0) + C_1\, g(m_1, v_1) + \cdots + C_i\, g(m_i, v_i)$$

where:
$\lambda$ represents a mixed Gaussian speech model;
$C_i$ is a parameter representing the coefficient for the $i$-th Gaussian signal component;
$m_i$ is a parameter representing the mean of the $i$-th Gaussian signal component; and
$v_i$ is a parameter representing the variance of the $i$-th Gaussian signal component.
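By way of illustration only, the mixed Gaussian form described above may be evaluated as in the following minimal sketch. It assumes a one-dimensional signal feature for simplicity; the function names `gaussian` and `mixture_density` and the example parameter values are hypothetical and are not taken from any particular ASR system.

```python
import math

def gaussian(x, mean, variance):
    """Density of a one-dimensional Gaussian g(mean, variance) evaluated at x."""
    norm = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / norm

def mixture_density(x, components):
    """Evaluate C_0*g(m_0, v_0) + ... + C_i*g(m_i, v_i) at the point x.

    `components` is a list of (coefficient, mean, variance) triples,
    i.e., the three stored parameters per Gaussian component.
    """
    return sum(c * gaussian(x, m, v) for c, m, v in components)

# Hypothetical two-component model; the values are illustrative only.
model = [(0.7, 0.0, 1.0), (0.3, 4.0, 2.0)]
stored_parameters = 3 * len(model)  # coefficient, mean, variance per component
print(mixture_density(0.5, model), stored_parameters)
```

In this sketch each component contributes exactly three stored values, which is why a limit on model size can be restated as a limit on the number of components.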
Hidden Markov Models are one example of speech recognition models in use today which may employ mixed Gaussian speech models.
As a practical matter, the cost associated with collecting training data limits, in many cases, the amount of data that can be collected and then used for training models at the time an ASR system is initially deployed. In addition, because of the cost of providing memory in ASR systems, in many cases, the size of the memory available for storing ASR software, e.g., the speech recognition engine and the models used for recognition purposes, is limited.
In actual implementations, the models used for speech recognition purposes may be constrained to an amount of memory which is comparable to that dedicated to storing the speech recognition engine. For instance, a simple acoustic model employing a three-state hidden Markov model (HMM) with 16 Gaussian mixture components, for each of 47 context independent phones, may take up roughly 0.5 MB of memory. To implement an ASR system using such a model on a 1 MB digital signal processing (DSP) board would leave the relatively small amount of only 0.5 MB of memory for implementing the recognition engine and all other supporting routines.
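The size estimate above may be approximated with simple arithmetic, as in the following sketch. The component count follows directly from the figures given (47 phones, three states, 16 components). The feature dimension of 26, the use of a separate mean and variance per feature dimension, and the four bytes per parameter are assumptions made here for illustration only; none of them is stated in the example above.

```python
def component_count(phones, states_per_phone, components_per_state):
    """Total number of Gaussian components in the acoustic model."""
    return phones * states_per_phone * components_per_state

def model_size_bytes(phones, states_per_phone, components_per_state,
                     feature_dim, bytes_per_param=4):
    """Approximate storage for the model parameters alone.

    Assumes (hypothetically) one coefficient per component plus one mean
    and one variance per feature dimension, each stored in bytes_per_param.
    """
    params_per_component = 1 + 2 * feature_dim
    n = component_count(phones, states_per_phone, components_per_state)
    return n * params_per_component * bytes_per_param

n_components = component_count(47, 3, 16)
size = model_size_bytes(47, 3, 16, feature_dim=26)  # feature_dim is an assumption
print(n_components, size, round(size / 2**20, 2))
```

Under these assumed values the parameters alone occupy a little under half a megabyte, which is of the same order as the roughly 0.5 MB figure given above.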
In view of the above, it becomes apparent that it is desirable to limit the complexity and thereby the size of ASR software including the speech recognition engine and the speech recognition models.
One known and commonly used approach to reducing the size of an acoustic model is referred to as the tied-mixture or the tied-state method. The tied mixture approach involves setting up an equivalence relationship between HMM parameters in different model states or phoneme contexts. For example, similar phoneme contexts for which there is insufficient training data to accurately train each of the phoneme contexts may be grouped or tied together for modeling purposes. As a result of this grouping operation, the number of independent parameters in a model is reduced. The downside to this approach is a loss in acoustic resolution. In the case where the mixture components selected to be tied together are from two different phonetic classes, this method has the undesirable effect of smearing the boundary between the classes. Such a smearing effect reduces the discriminative power of the resulting model, i.e., the power of the model to distinguish between sounds in the different classes.
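The reduction in independent parameters that results from tying may be illustrated with the following sketch. The representation of a tying as a mapping from state names to shared parameter-set labels, the state names themselves, and the per-state parameter count are all hypothetical and serve only to show the counting.

```python
def independent_parameter_count(tying, params_per_state):
    """Count independent parameters under a given tying.

    `tying` maps each state (or phoneme context) name to the label of the
    parameter set it uses; states sharing a label share their parameters.
    """
    return len(set(tying.values())) * params_per_state

# Hypothetical example: four states, each with 16 components of 3 parameters.
params_per_state = 16 * 3
untied = {"s1": "s1", "s2": "s2", "s3": "s3", "s4": "s4"}
tied = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}

print(independent_parameter_count(untied, params_per_state))  # every state separate
print(independent_parameter_count(tied, params_per_state))    # two shared sets
```

In this sketch states `s1` and `s2` can no longer be distinguished by their parameters once tied, which corresponds to the loss in acoustic resolution noted above.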
Another disadvantage of the known tied mixture approach is that both the recognition engine and the generated models are a function of the tying that is performed. As additional data is collected, e.g., after the initial deployment of an ASR system, more accurate modeling of phonetic classes becomes possible. Unfortunately, when the above tied mixture approach is used, both the models and the speech recognition engine should be updated together when the tied mixtures are modified. The need to modify the speech recognition engine when updating models using the tied mixture method can result in undesirable down time and expense when upgrading speech recognition systems that are in use.
Accordingly, the known tied mixture approach has several disadvantages as compared to systems where tied mixtures are not used.
Two well known model training and parameter selection methods exist. The first of the model training techniques is referred to as an expectation maximization (EM) method. The second training method is referred to as a minimum classification error (MCE) method. Each of the two known training techniques involves generating an updated model having a fixed number of components from an initial input model having the same number of components. The input model used in the known system represents, e.g., an operator's guess as to what the model should look like. In addition to the input model, the known model training techniques generate the output model from a set of training data, a plurality of inequality constraints, and one overall equality constraint. Either of the two known training methods may be used with the tied mixture modeling technique discussed above.
Inasmuch as the known modeling techniques do not reduce the model size during training, known approaches to generating models for ASR systems with limited memory for storing models involve the initial generation and subsequent training of models which are limited in terms of size (the number of model parameters) to that which can fit into the available memory.
Referring now to FIG. 1, there is illustrated an expectation maximization (EM) model training circuit 100. As illustrated, the known circuit 100 receives as its inputs a set of training data, an M component input model, M inequality constraints, and one overall equality constraint, where M is an integer. The M inequality constraints require the parameters, e.g., the coefficients, the means and the variances in the case of a mixed Gaussian model form, of the M components of the generated model to have non-zero values. The one overall equality constraint imposes the requirement that the sum of the coefficients of all the model components equal one.
The inputs to the EM model training circuit 100 are supplied to a LaGrange multiplier 110 which is used to process the input data and generate therefrom an updated M component model by generating a new set of parameter values including, e.g., coefficients, mean, and variance values for each component. The selection of the parameter values is performed automatically as part of the process performed by the LaGrange multiplier. The updated model generated by the LaGrange multiplier 110 is supplied to the likelihood scoring circuit 120. The likelihood scoring circuit 120 also receives the set of training data as an input. Given the updated model, the likelihood scoring circuit 120 generates an estimate of how likely the training data is. This score is used as an indicator of the updated model's accuracy. The updated M component model and the score corresponding thereto are supplied to a feedback decision/model selection (FD/MS) circuit 130. The FD/MS circuit 130 is responsible for supplying the updated M component model as an input to the LaGrange multiplier 110. The updated model is processed and another updated model is generated therefrom and scored. The iterative model updating process is repeated a number of times, e.g., until some stop criterion, such as a steady state scoring result, is reached, at which point the iteration process is stopped. Upon achieving the stop criterion, one of the generated models is output as the M component model generated by the EM model training circuit 100.
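The update, score, and feedback loop described above may be sketched as follows. This is the textbook EM procedure for a one-dimensional Gaussian mixture, offered only as an illustration and not as the implementation of circuit 100. The function names, the variance floor, the tolerance, and the toy training data are assumptions. The coefficient update used here is the standard closed form that results when the sum-to-one equality constraint is enforced with a Lagrange multiplier.

```python
import math

def gaussian(x, mean, variance):
    norm = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / norm

def log_likelihood(data, model):
    """Likelihood score: log probability of the training data given the model."""
    return sum(math.log(sum(c * gaussian(x, m, v) for c, m, v in model))
               for x in data)

def em_update(data, model, var_floor=1e-3):
    """One update step; the output has the same number of components as the input."""
    # Responsibility of each component for each training sample.
    resp = []
    for x in data:
        weighted = [c * gaussian(x, m, v) for c, m, v in model]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    updated = []
    for k in range(len(model)):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        # Coefficients nk / N sum to one; the floor keeps variances non-zero.
        updated.append((nk / len(data), mean, max(var, var_floor)))
    return updated

def train_em(data, model, tol=1e-6, max_iter=200):
    """Repeat update and scoring until the score reaches a steady state."""
    history = [log_likelihood(data, model)]
    for _ in range(max_iter):
        model = em_update(data, model)
        history.append(log_likelihood(data, model))
        if abs(history[-1] - history[-2]) < tol:
            break
    return model, history

# Toy training data and an initial two-component guess (illustrative only).
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
initial = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
final_model, history = train_em(data, initial)
print(final_model, history[-1])
```

Note that the number of components M is fixed by the input model throughout; the loop only adjusts the parameter values.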
FIG. 2 illustrates a known minimum classification error (MCE) model training circuit 200. The MCE model training circuit 200 is similar in operation to the known EM model training circuit 100 in that it performs a model updating step, a scoring step, and a feedback decision/model selection step. The inputs and outputs of the MCE model training circuit 200 are the same as those of the previously described EM model training circuit 100. Notably, both circuits receive M component input models and output M component models. However, the MCE model training circuit 200 uses a generalized probabilistic descent algorithm performed by the circuit 210 to generate updated model parameters as opposed to using a LaGrange multiplier. Also note that the MCE model training circuit 200 uses an empirical error rate scoring circuit 220 to score the data. This is in contrast to the likelihood scoring circuit used by the EM model training circuit 100.
The known model training circuits 100, 200 illustrated in FIGS. 1 and 2 are useful for modifying model parameter values, e.g., model training, where the generated model is to include the same number of components, and thus parameters, as the input model. However, they fail to address the issue of how to determine the minimum number of model components that can be used, and which of a plurality of possible model components should be used, to achieve a desired level of recognition accuracy.
One known method which attempts to address the problem of determining the minimum number of model components and thus the minimum model size, with which a desired level of recognition accuracy can be achieved, involves a repetitive trial and error process. In accordance with this known trial and error technique, the model generation process is initiated using an initial model having the minimum number of components that the person initiating the training process believes may possibly achieve the desired level of recognition accuracy. The initial model is refined using, e.g., one of the above described known modeling techniques. The score of the generated model is then analyzed, and a determination is made as to whether or not the model will achieve the desired level of recognition accuracy.
In the known technique, if the initial model fails to achieve an acceptable degree of recognition accuracy, both the initial and generated models are discarded, and the process is repeated using a new, often operator generated, initial model having more components than the previously generated model. The accuracy of the second generated model is reviewed. If the accuracy of the second model is found to be unacceptable, the process is again repeated using a new initial model having yet more components. The above described steps of model generation, review and discarding of unsatisfactory generated models are repeated until a model which achieves the desired degree of recognition accuracy is obtained.
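The known trial and error procedure described above may be summarized by the following sketch. The callback `train_and_score`, which stands in for generating an initial model of a given size and training it with one of the known techniques, is hypothetical, as are the toy scoring function and the target value used in the example.

```python
def trial_and_error(train_and_score, start_components, target_score,
                    max_components):
    """Grow the model size until the trained model meets the target score.

    `train_and_score(m)` is a hypothetical callback that builds an initial
    m-component model, trains it, and returns (model, score).
    """
    m = start_components
    while m <= max_components:
        model, score = train_and_score(m)
        if score >= target_score:
            return m, model, score
        # Unsatisfactory: discard both models and retry with more components.
        m += 1
    return None  # no acceptable model within the size limit

# Toy stand-in for training: the score simply improves with model size.
def toy_train_and_score(m):
    return "model-%d" % m, 1.0 - 1.0 / (m + 1)

result = trial_and_error(toy_train_and_score, start_components=1,
                         target_score=0.7, max_components=10)
print(result)
```

In this sketch nothing is carried over from one attempt to the next, and the loop gives no guidance as to which components are worth adding; it only increases their number.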
While the known trial and error approach to determining the number of components that should be included in a model will normally result in a model that achieves a desired degree of recognition accuracy, the known technique has several disadvantages.
One significant disadvantage of the known technique is that it does not address the issue of how to select which of the possible signal components will be most useful in accurately modeling the distinguishing characteristics of the data of interest, e.g., the set of training data. The problem of selecting which components should be used in the model may be thought of in terms of efficiently using data to model specific portions of an image or sound.
For example, consider the case of image pattern recognition where the goal is to generate a model for recognizing different types of trees and each model component corresponds to a fixed amount of data representing a different portion of the tree. If the goal were to distinguish between different types of trees, in terms of model data allocation efficiency, it might be more effective to focus in the model on components that characterize the distinguishing features of the leaves as opposed to non-distinguishing features which characterize tree trunks. Using the known trial and error approach to the inclusion of model components, additional components would be added until a desired recognition rate was achieved. However, in order to achieve the desired recognition rate, the known system might include several unnecessary or relatively unimportant components characterizing the trunk of the tree before including the more important components characterizing the relatively distinguishing features of the tree leaves. The issue is further complicated when the adequacy of the training data is taken into account. Since there is no known method to ensure that the collected training data will faithfully reflect the signal characteristics, it is more likely than not that the training data will exhibit some unknown bias that must be guarded against in the training process. This is particularly the case when the set of training data is smaller than is desirable. By progressively increasing the model size in a trial and error fashion, one runs the risk of over-fitting the data without an effective way of detecting it.
A frequently used approach to avoid this problem is to further divide the available data into several subsets, using all but one of them as the training set. The remaining one, usually referred to as the development set, is used as the testing data to assess whether over-training has occurred. Because both the training set and the development set may have their own biases, it is a common practice to rotate the data used for the training and the development set. Unfortunately, there is no known guideline as to how many iterations of such practices should be carried out. Even when using the above discussed techniques, the known trial and error method can over-fit a model. Thus, the known method may waste precious memory space by including in a model relatively unimportant model components, and the resources (e.g., time) needed to obtain adequate models may be quite demanding.
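The rotation of training and development sets described above may be sketched as follows. The interleaved assignment of samples to subsets and the function name `rotate_splits` are assumptions made for illustration; any partition of the available data into subsets would serve.

```python
def rotate_splits(data, num_subsets):
    """Yield (training_set, development_set) pairs, rotating the held-out subset.

    The data is divided into num_subsets subsets; each subset serves once as
    the development set while the remaining subsets form the training set.
    """
    subsets = [data[i::num_subsets] for i in range(num_subsets)]
    for held_out in range(num_subsets):
        development = subsets[held_out]
        training = [x for i, s in enumerate(subsets) if i != held_out for x in s]
        yield training, development

# Illustrative data: ten samples divided into five subsets.
samples = list(range(10))
splits = list(rotate_splits(samples, 5))
for training, development in splits:
    print(len(training), development)
```

Each rotation requires a complete training run, so the cost of the procedure grows with the number of subsets used.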
The known trial and error method for determining the number of components to be included in a model also has the distinct disadvantage of requiring a relatively large amount of human involvement in the model generation process, assuming that the generation of multiple models having different numbers of components is required before obtaining the desired level of recognition accuracy. This is because human involvement is normally required in the known model generation and training methods when new models having a different number of components must be generated to serve as the input to the training circuits illustrated in FIGS. 1 and 2.
In view of the above, it becomes apparent that there is a need for new and improved model training and generation methods. It is desirable that such techniques be capable of being implemented in a more automated manner than the known trial and error techniques. In addition, it is desirable that the models produced by any new methods be compact, i.e., include relatively few components, and still be able to achieve relatively high recognition results. Furthermore, it is desirable that the model generation process allow for future updating of the models, e.g., when additional training data becomes available, without requiring that modifications be made to the search engine before the models can be used.