Numerous variations on the standard formulation of Hidden Markov Models (HMMs) have been proposed in the past, such as the Entropic-HMM, Variable-length HMM, Coupled-HMM, Input/Output-HMM, Factorial HMM and Hidden Markov Decision Trees, to cite but a few examples. Each of these approaches attempts to remedy some deficiency of standard HMMs for a particular problem or set of problems at hand. Many of them are directed at modeling data and learning the associated parameters under a Maximum Likelihood (ML) criterion. In most cases, the differences between the modeling techniques lie in the conditional independence assumptions made while modeling the data, which are reflected primarily in the graphical structure of each model.
One process for modeling data involves the Information Bottleneck method, an unsupervised, non-parametric data organization technique. For example, given a joint distribution P(A, B), the method constructs, employing information theoretic principles, a new variable T that extracts partitions, or clusters, over values of A that are informative about B. In particular, consider two random variables X and Q with joint distribution P(X, Q), wherein X is a variable to be compressed with respect to a ‘relevant’ variable Q. The auxiliary variable T introduces a soft partitioning of X through a probabilistic mapping P(T|X), chosen such that the mutual information I(T;X) is minimized (maximum compression) while the relevant information I(T;Q) is maximized. A related approach is the “infomax criterion”, proposed in the neural network community, whereby the goal is to maximize the mutual information between the input and output variables of a neural network.
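As an illustration only, the two quantities traded off by the Information Bottleneck method can be computed for a small discrete example. The sketch below assumes finite alphabets, a joint distribution P(X, Q) stored as a 2-D array, and a mapping P(T|X) stored row-wise; the function names and the toy numbers are hypothetical and are not taken from the method's original description. It only evaluates I(T;X) and I(T;Q) for a given mapping and does not search for an optimal one.

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information (in nats) of a 2-D joint distribution."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_rows = p_joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    p_cols = p_joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    mask = p_joint > 0                            # skip zero cells (0 * log 0 = 0)
    ratio = p_joint[mask] / (p_rows @ p_cols)[mask]
    return float(np.sum(p_joint[mask] * np.log(ratio)))

def bottleneck_terms(p_xq, p_t_given_x):
    """Return (I(T;X), I(T;Q)) for a mapping P(T|X), assuming T depends on Q only through X."""
    p_xq = np.asarray(p_xq, dtype=float)
    p_t_given_x = np.asarray(p_t_given_x, dtype=float)  # shape (n_x, n_t), rows sum to 1
    p_x = p_xq.sum(axis=1)
    p_tx = (p_t_given_x * p_x[:, None]).T   # P(T, X) = P(T|X) P(X)
    p_tq = p_t_given_x.T @ p_xq             # P(T, Q) = sum_x P(T|x) P(x, Q)
    return mutual_information(p_tx), mutual_information(p_tq)

# Toy joint distribution: values 0,1 of X behave alike with respect to Q, as do values 2,3.
p_xq = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])

# A hard two-cluster partition {0,1} and {2,3}, written as a degenerate P(T|X).
grouped = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

i_tx, i_tq = bottleneck_terms(p_xq, grouped)
print("I(T;X) =", i_tx, " I(T;Q) =", i_tq, " I(X;Q) =", mutual_information(p_xq))
```

In this toy case the two-cluster partition keeps all of the information that X carries about Q while reducing I(T;X) below the entropy of X, which is the kind of compression the method seeks.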
Standard HMM algorithms generally perform a joint density estimation of the hidden state and observation random variables. However, in situations involving limited resources (for example, when the associated modeling system has to process a limited amount of data in very high dimensional spaces), or if the goal is to classify or cluster with the learned model, a conditional approach may be superior to a joint density approach. It is noted, however, that these two methods (conditional vs. joint) could be viewed as operating at opposite ends of a processing/performance spectrum, and thus are generally applied independently of one another to solve machine learning problems.
In yet another modeling method, a Maximum Mutual Information Estimation (MMIE) technique has been applied in the area of speech recognition. As is known, MMIE techniques can be employed for estimating the parameters of an HMM in the context of speech recognition, wherein a different HMM is typically learned for each possible class (e.g., one HMM trained for each word in a vocabulary). New waveforms are then classified by computing their likelihood under each of the respective models. The model with the highest likelihood for a given waveform is then selected as identifying a possible candidate. Thus, MMIE attempts to maximize the mutual information between the selection of an HMM (from a related grouping of HMMs) and an observation sequence in order to improve discrimination across the different models. Unfortunately, the MMIE approach requires training multiple models that are known a priori, which can be time consuming and computationally complex, and it is generally not applicable when the states are associated with the class variables.
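The classification step described above, in which a sequence is scored under one HMM per class and the highest-likelihood model is selected, can be sketched as follows. This is a minimal illustration assuming discrete observation symbols and already-trained parameters; it uses the standard scaled forward algorithm for scoring and does not perform MMIE training. The model labels, parameter values and function names are hypothetical.

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """log P(obs | model) for a discrete HMM, using the scaled forward algorithm."""
    alpha = start * emit[:, obs[0]]          # forward variable at the first step
    log_lik = 0.0
    for symbol in obs[1:]:
        scale = alpha.sum()                  # rescale to avoid numerical underflow
        log_lik += np.log(scale)
        alpha = (alpha / scale) @ trans * emit[:, symbol]
    return float(log_lik + np.log(alpha.sum()))

def classify(obs, models):
    """Score obs under every class model and return the best label with all scores."""
    scores = {label: log_likelihood(obs, *params) for label, params in models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical two-state word models, each given as (start, transition, emission).
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
models = {
    "word_a": (start, trans, np.array([[0.9, 0.1], [0.8, 0.2]])),  # favors symbol 0
    "word_b": (start, trans, np.array([[0.1, 0.9], [0.2, 0.8]])),  # favors symbol 1
}

label, scores = classify([0, 0, 1, 0, 0], models)
print(label, scores)
```

Because every class requires its own trained model and every new sequence must be scored against all of them, the cost of this scheme grows with the number of classes, which is consistent with the drawback noted above.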