Hidden Markov models (HMMs) are a class of statistical models used to model discrete time-series data. Problems that naturally give rise to such data include robot navigation, machine vision, and signal processing, and HMMs are at the core of many state-of-the-art algorithms for addressing these problems. In addition, many problems in natural language processing involve time-series data and can be modeled with HMMs, including part-of-speech tagging, topic segmentation, speech recognition, generic entity recognition, and information extraction.
The so-called Markov assumption is a fundamental simplifying assumption that lies behind the efficiency of the algorithms used to train and apply an HMM. Under the Markov assumption, the probability of a given observation in a time series is taken to be a function of only the current state of the process that produced it. This assumption allows modelers to use dynamic programming to set model parameters and perform inference (for a description of the algorithms used, the Viterbi algorithm and the Baum-Welch algorithm, see Rabiner, L. R., “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” in Proceedings of the IEEE, 1989). However, it is demonstrably unjustified in many application domains, including natural language processing problems such as information extraction.
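As an illustration of how the Markov assumption enables dynamic programming, the following is a minimal sketch of Viterbi decoding. It is not taken from the cited tutorial; the function name `viterbi`, the dictionary-based parameter layout, and the toy probabilities in the usage note are assumptions chosen for readability. Because each observation depends only on the current state, the best path ending in a state at time t can be computed from the best paths at time t−1 alone.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability).

    start_p[s], trans_p[r][s] and emit_p[s][o] are plain dictionaries of
    probabilities (a hypothetical layout chosen for this sketch).
    """
    def log(p):
        # Log space avoids numeric underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Markov assumption: only the previous column of the table is needed.
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r].get(s, 0.0))) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace the stored predecessors back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]
```

With two hypothetical states "A" and "B", start probabilities 0.6 and 0.4, transitions A→A 0.7, A→B 0.3, B→A 0.4, B→B 0.6, and emissions A: x 0.5, y 0.4, z 0.1 and B: x 0.1, y 0.3, z 0.6, decoding the sequence x, y, z yields the path A, A, B. The running time grows linearly with sequence length precisely because no state needs to remember anything beyond its immediate predecessor.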
Two basic approaches have been used to mitigate the Markov assumption: topological manipulation and “n-gram” Markov models. By manipulating model topology, it is possible to encode a finite contextual memory into a given model state simply by restricting the set of state paths that feed into it. However, this kind of manual model structuring is unwieldy for anything but small local contexts, and enlarging the model has the negative effect of rendering the statistics kept at individual states sparser and less certain. Examples of this approach can be found in Leek, T., “Information Extraction using Hidden Markov Models,” Master's Thesis, UC San Diego, 1997, and Freitag, D., and McCallum, A. K., “Information Extraction using HMMs and Shrinkage,” AAAI-99 Workshop on Machine Learning for Information Extraction, AAAI Technical Report WS-99-11, 1999.
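One simple way to picture topological manipulation is sketched below. It is an illustrative assumption, not the construction used in the works cited above: each state is split into copies indexed by its predecessor, so that a copy can only be entered along paths that passed through that predecessor. The function name `expand_with_memory` and the dictionary layout are hypothetical.

```python
from itertools import product


def expand_with_memory(states, trans_p):
    """Split each state into (previous, current) copies.

    The copy (r, s) can only be entered from copies whose current state is r,
    so being in (r, s) encodes one step of context. trans_p[r][s] holds the
    original transition probabilities (hypothetical layout for this sketch).
    """
    # Keep only the pairs that the original topology allows.
    expanded_states = [
        (r, s) for r, s in product(states, repeat=2)
        if trans_p[r].get(s, 0.0) > 0
    ]
    expanded_trans = {}
    for (q, r) in expanded_states:
        # From (q, r) the model may move only to copies of the form (r, s).
        expanded_trans[(q, r)] = {
            (r2, s): trans_p[r][s]
            for (r2, s) in expanded_states
            if r2 == r
        }
    return expanded_states, expanded_trans
```

Even one step of memory can square the number of states, and each copy then sees only a fraction of the training data. This is the sparsity cost noted above.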
In an “n-gram” Markov model, the emission distribution at a state is defined over n-grams; at each time step, a state is presumed to emit the current word conditioned on the n−1 previous words. However, this too models only local context and requires special strategies to accommodate sparse statistics. An example of this approach is Bahl, L. R., et al., “A Maximum Likelihood Approach to Continuous Speech Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-5, pp. 179–190, 1983.
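A per-state n-gram emission distribution of this kind might be sketched as follows. The class name `NgramEmission` and its methods are hypothetical. Add-k smoothing is used here only as one simple example of a strategy for sparse statistics; it is an assumption of this sketch and not the technique of the cited reference.

```python
from collections import defaultdict


class NgramEmission:
    """Emission distribution of one state, conditioned on the n-1 previous words."""

    def __init__(self, n, vocab, k=1.0):
        self.n = n
        self.vocab = set(vocab)
        self.k = k  # add-k smoothing constant (illustrative choice)
        # counts[context][word]: how often word followed context at this state
        self.counts = defaultdict(lambda: defaultdict(float))

    def _context(self, history):
        # Keep at most the last n-1 words; a unigram model has no context.
        return tuple(history[-(self.n - 1):]) if self.n > 1 else ()

    def observe(self, words):
        """Accumulate counts from a word sequence attributed to this state."""
        for i, word in enumerate(words):
            self.counts[self._context(words[:i])][word] += 1

    def prob(self, word, history):
        """Smoothed P(word | last n-1 words of history)."""
        seen = self.counts.get(self._context(history), {})
        total = sum(seen.values())
        return (seen.get(word, 0.0) + self.k) / (total + self.k * len(self.vocab))
```

For example, with n = 2 and a three-word vocabulary, after observing "the cat sat" and "the cat cat", the smoothed probability of "cat" following "the" is (2 + 1) / (2 + 3) = 0.6. A context that was never observed falls back to a uniform distribution, which shows how quickly the statistics thin out as n grows.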
Therefore, a method is needed whereby an HMM can be made to obey long-range constraints without sacrificing all the benefits of the Markov assumption.