The invention described herein may be manufactured and used by or for the Government of the United States of America for governmental purposes without the payment of any royalties thereon or therefor.
(1) Field of the Invention
This invention relates to systems and methods for modeling physical phenomena, and more particularly to a system and method for modeling physical phenomena, such as speech, using a class-specific implementation of the Baum-Welch algorithm for estimating the parameters of a class-specific hidden Markov model (HMM).
(2) Description of the Prior Art
By way of example of the state of the art, reference is made to the following papers, which are incorporated herein by reference. Not all of these references may be deemed to be relevant prior art.
P. M. Baggenstoss, "Class-specific features in classification," IEEE Trans. Signal Processing, December 1999.
S. Kay, "Sufficiency, classification, and the class-specific feature theorem," to be published, IEEE Trans. Information Theory.
B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.
L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257-286, February 1989.
L. E. Baum, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Stat., vol. 41, pp. 164-171, 1970.
E. L. Lehmann, Theory of Point Estimation, New York: Wiley, 1983.
S. Kay, Modern Spectral Estimation: Theory and Application, Prentice Hall, 1988.
E. J. Hannan, Multiple Time Series, Wiley, 1970.
M. H. Quenouille, "The joint distribution of serial correlation coefficients," Ann. Math. Stat., vol. 20, pp. 561-571, 1949.
Many systems, e.g., communication, data processing and other information systems, can be described or characterized in terms of a series of transitions through a set of states. Hidden Markov models (HMMs) have found applications in modeling physical phenomena characterized by a finite number of states. Often these states represent distinct physical phenomena. In speech, for example, the human voice is characterized by distinct physical phenomena or modes, e.g., voiced speech, fricatives, stops, and nasal sounds. When applied to speech processing applications, the speech modes or components are first modeled by HMMs using an algorithm to estimate parameters for the HMMs (referred to as the training phase). The trained HMMs can then be used to determine which speech components are present in a speech signal (referred to as the recognition phase).
For the classical hidden Markov model (HMM), all observations are assumed to be realizations of a random statistical model that depends on the Markov state. Although the statistical models, i.e., the observation probability density functions (PDFs), are different for each state, they are defined on the same observation space. The dimension of this observation space must be high enough to capture the information content of the data for all states. The high dimension requires a large number of observations (or training samples) and leads to poor performance with limited amounts of training data.
In speech, for example, a different set of parameters controls the uttered sound during each of the speech modes. Furthermore, a distinct type of signal processing or feature extraction is best suited to estimating the corresponding parameters of each mode. But, since one cannot know a priori which mode is in effect at a given instant of time and cannot change the observation space accordingly, it is necessary to operate in a unified observation space.
This requires a feature set that carries enough information for the estimation of all modes. This in turn leads to dimensionality issues, since there is only a finite amount of data with which to train the observation PDF estimates. In effect, the observation PDFs of each state are represented using a feature set of higher dimension than would be necessary if the other states did not exist. The amount of data required to estimate a PDF is exponentially dependent on feature dimension. Given limitations of computer storage and available data, feature dimensions above a certain point are virtually impossible to characterize accurately. As a result, one may be forced to use a subset of the intended feature set to reduce dimension or else suffer the effects of insufficient training data.
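The exponential dependence can be made concrete with a back-of-the-envelope count; the figure of k = 10 cells per axis below is an assumption for illustration only, not a value from the specification:

```python
# Back-of-the-envelope illustration of the dimensionality problem noted
# above: a histogram-style PDF estimate with k cells per axis needs on
# the order of k**d cells (and at least that many samples) in d dimensions.
k = 10  # assumed cells per feature axis (illustrative)
requirements = {d: k ** d for d in (2, 5, 10)}
# d = 2  ->            100 cells
# d = 5  ->        100,000 cells
# d = 10 -> 10,000,000,000 cells
```

Even halving the dimension, from 10 features to 5, cuts the requirement by a factor of 100,000, which is why modest reductions in feature dimension matter so much.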
Consider a hidden Markov model (HMM) for a process with N states numbered S1 through SN. Let the raw data be denoted X[t], for time steps t=1, 2, . . . , T. The parameters of the HMM, denoted λ, comprise the state transition matrix A={aij}, the state prior probabilities uj, and the state observation densities bj(X), where i and j range from 1 to N. These parameters can be estimated from training data using the Baum-Welch algorithm, as disclosed in the papers by Rabiner and Juang. But, because X[t] is often of high dimension, it may be necessary to reduce the raw data to a set of features z[t]=T(X[t]). We then define a new HMM with the same A and uj but with observations z[t], t=1, 2, . . . , T and the state densities bj(z) (we allow the argument of the density functions to imply the identity of the function, thus bj(X) and bj(z) are distinct).
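The classical quantities named above can be sketched as follows; the state count, raw data dimension, and feature map T() are assumptions made for this example only and are not part of the invention:

```python
import numpy as np

# Illustrative sketch of the classical HMM parameter set described above.
N = 3            # states S1..SN (assumed)
T_steps = 100    # time steps t = 1..T (assumed)
rng = np.random.default_rng(0)

# State transition matrix A = {a_ij}; each row is a probability distribution.
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)

# State prior probabilities u_j.
u = np.full(N, 1.0 / N)

# High-dimensional raw data X[t], reduced to features z[t] = T(X[t]).
X = rng.standard_normal((T_steps, 64))   # raw dimension 64 (assumed)

def T_map(x):
    # Hypothetical feature extractor: two summary statistics per frame.
    return np.array([x.mean(), x.var()])

z = np.array([T_map(X[t]) for t in range(T_steps)])
```

The point of the reduction is visible in the shapes: each 64-dimensional raw frame becomes a 2-dimensional feature vector on which density estimation is tractable.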
This is the approach used in speech processing today, where z[t] are usually a set of cepstral coefficients. If z[t] is of low dimension, it is practical to apply probability density function (PDF) estimation methods such as Gaussian mixtures to estimate the state observation densities. Such PDF estimation methods tend to give poor results above dimensions of about 5 to 10 unless the features are exceptionally "well-behaved," i.e., close to independent or multivariate Gaussian. In human speech, it is doubtful that 5 to 10 features can capture all the relevant information in the data. Traditionally, the choices have been (1) use a smaller but insufficient feature set, (2) use more features and suffer PDF estimation errors, or (3) apply methods of dimensionality reduction. Such methods include linear subspace analysis, projection pursuit, or simply assuming the features are independent (a factorable PDF). All these methods involve assumptions that do not hold in general.
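As an illustration of the kind of Gaussian mixture PDF estimation mentioned above, a two-component mixture can be fitted to a one-dimensional feature by expectation-maximization; the synthetic data, initialization, and iteration count below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic low-dimensional feature samples drawn from two modes (assumed).
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

# Two-component 1-D Gaussian mixture fitted by EM.
w = np.array([0.5, 0.5])          # mixture weights
mu = np.array([-1.0, 1.0])        # component means (initial guess)
var = np.array([1.0, 1.0])        # component variances
for _ in range(200):
    # E-step: responsibility of each component for each sample.
    pdf = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = pdf / pdf.sum(axis=1, keepdims=True)
    # M-step: update weights, means, and variances.
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

In one dimension this works well; the difficulty described in the text is that the same approach degrades rapidly as the feature dimension grows.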
The class-specific method was recently developed as a method of dimensionality reduction in classification, as disclosed in U.S. patent application Ser. No. 09/431,716 entitled "Class Specific Classifier." Unlike other methods of dimension reduction, it is based on sufficient statistics and results in no theoretical loss of performance due to approximation. Because of the exponential relationship between training data size and dimension, even a mere factor of 2 reduction in dimension can result in a significant difference.
Accordingly, one object of the present invention is to reduce the number of data samples needed for training HMMs.
Another object of the present invention is to extend the idea of dimensionality reduction in classification to the problem of HMM modeling when each state of the HMM may have its own minimal sufficient statistic.
A further object of the present invention is to modify the Baum-Welch algorithm used to estimate parameters of class-specific HMMs.
The foregoing objects are attained by the method and system of the present invention. The present invention features a method of training a class-specific hidden Markov model (HMM) used for modeling physical phenomena characterized by a finite number of states. The method comprises the steps of receiving training data forming an observation sequence; estimating parameters of the class-specific HMM from the training data using a modified Baum-Welch algorithm, wherein the modified Baum-Welch algorithm uses likelihood ratios with respect to a common state and based on the sufficient statistics for each state; and storing the parameters of the class-specific HMM for use in processing signals representing the physical phenomena.
The step of estimating parameters of the class-specific HMM preferably includes the step of conducting a plurality of iterations of the Baum-Welch algorithm, wherein each iteration of the Baum-Welch algorithm includes a class-specific forward procedure for calculating forward probabilities, a class-specific backward procedure for calculating backward probabilities, and HMM reestimation formulas for updating the parameters of the class-specific HMM based upon the forward probabilities and the backward probabilities.
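As an illustration only, the forward procedure, backward procedure, and one reestimation formula described above might be sketched as follows. This is a hedged sketch, not the patented implementation: it assumes the per-state likelihood ratios with respect to the common state have already been computed into an array L[t, j], and all variable names are introduced here for the example:

```python
import numpy as np

def class_specific_forward(A, u, L):
    """Forward pass using per-state likelihood ratios L[t, j] (assumed
    precomputed from the sufficient statistic of each state relative to
    a common state)."""
    T_steps, N = L.shape
    alpha = np.zeros((T_steps, N))
    alpha[0] = u * L[0]                       # initialize with state priors
    for t in range(1, T_steps):
        alpha[t] = (alpha[t - 1] @ A) * L[t]  # propagate through A, weight by ratios
    return alpha

def class_specific_backward(A, L):
    """Backward pass matching the forward recursion above."""
    T_steps, N = L.shape
    beta = np.ones((T_steps, N))
    for t in range(T_steps - 2, -1, -1):
        beta[t] = A @ (L[t + 1] * beta[t + 1])
    return beta

def reestimate_transitions(A, alpha, beta, L):
    """One reestimation update of the transition matrix from the
    forward and backward quantities."""
    T_steps, N = L.shape
    xi = np.zeros((N, N))
    for t in range(T_steps - 1):
        x = alpha[t][:, None] * A * (L[t + 1] * beta[t + 1])[None, :]
        xi += x / x.sum()
    return xi / xi.sum(axis=1, keepdims=True)
```

A simple sanity check: with all likelihood ratios equal to one, the forward recursion reduces to pure propagation of state occupancy through A, and the backward variables stay at one.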
The present invention also features a system for training a class-specific hidden Markov model (HMM) used for modeling physical phenomena characterized by a finite number of states. The system includes means for receiving training data represented as an observation sequence; means for estimating parameters of the class-specific HMM from the training data using a modified Baum-Welch algorithm, wherein the modified Baum-Welch algorithm uses likelihood ratios that compare each of the states to a common state; and means for storing the parameters of the class-specific HMM for use in processing signals representing the physical phenomena.
According to one example, the HMMs are used in speech processing and the physical phenomena include modes of speech.