Everywhere that one wants a computer to make sense of a signal, be it speech, electrocardiograms, engine pressure readings, music, video, sunspots, or text, there is a time-series inference problem. Markov models are the best understood and best performing statistical tool for time-series inference, and are used widely in industry and science. A variant called hidden Markov models is the tool of choice for speech recognition, and is a leading candidate for gene-processing and video-processing applications. As examples of their wide utility, hidden Markov models have been used to translate videos of American Sign Language into speech, judge martial arts moves, and predict the spread of disease.
Markov models are most often used for classification, e.g., to answer such questions as "Is the signal coming off the heart monitor most indicative of a healthy heart, a valve problem, an arrhythmia, or early cardiac arrest?" There are efficient algorithms for training and using Markov models, but these produce suboptimal classifiers, resulting in some degree of error. For some applications, the error is tolerably small or can be reduced, expensively, with very large amounts of training data. However, many applications are not yet feasible because the rates of error are still too high. Optimal models are possible in theory, but mathematical analysis shows that training them can take a very long time, even centuries of computation. By contrast, the algorithms for training suboptimal models run in seconds, minutes, or at most hours.
Markov models and their variants provide a compact representation of how a class of sequences tends to evolve in time. These sequences can be text, speech, audio, video, sunspot data, or any non-random time-series. An important property of Markov models is that they can quickly be trained to model a set of example sequences. Then by comparing new sequences to the model, one can judge whether they belong to the same class as the training sequences. For example, it is possible to train one Markov model on texts written by Shakespeare, another on texts written by Conrad, and use the two to classify novel documents by author.
As mentioned above, a variant called hidden Markov models is used when dealing with continuous data, e.g., sequences of real numbers, that are contaminated with noise. This is typically the case when the data come from a device that measures some physical quantity and returns a stream of numbers, such as a microphone whose output is digitized. For example, speech recognition systems use a hidden Markov model for each word to calculate the most likely sequence of words given acoustic measurements from a microphone.
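As a minimal sketch of how a noisy, real-valued sequence can be scored against a hidden Markov model, the following computes the log-likelihood of a sequence with the standard forward algorithm. The Gaussian emission densities, the function names, and the two toy models ("rising" and "falling") are assumptions made for illustration; they are not taken from the text above. The sketch also assumes all start and transition probabilities are strictly positive.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def hmm_log_likelihood(obs, start, trans, means, stds):
    """Forward algorithm in the log domain.

    obs   : list of real-valued observations
    start : start[i] is the probability of beginning in hidden state i
    trans : trans[i][j] is the probability of moving from state i to state j
    means, stds : per-state Gaussian emission parameters (an assumption)
    """
    n = len(start)
    # alpha[i] = log P(observations so far, current state = i)
    alpha = [math.log(start[i]) + gaussian_logpdf(obs[0], means[i], stds[i])
             for i in range(n)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + math.log(trans[i][j]) for i in range(n)])
            + gaussian_logpdf(x, means[j], stds[j])
            for j in range(n)
        ]
    return logsumexp(alpha)

# Two hypothetical two-state models that differ only in which state emits
# low readings and which emits high readings.
shared = dict(start=[0.9, 0.1], trans=[[0.7, 0.3], [0.1, 0.9]], stds=[1.0, 1.0])
rising = dict(shared, means=[0.0, 5.0])
falling = dict(shared, means=[5.0, 0.0])

obs = [0.2, -0.4, 0.9, 4.6, 5.3, 4.8]  # noisy readings: low, then high
scores = {
    "rising": hmm_log_likelihood(obs, **rising),
    "falling": hmm_log_likelihood(obs, **falling),
}
best = max(scores, key=scores.get)
```

Working in the log domain avoids numerical underflow on long sequences, which is why the sketch sums log-probabilities with `logsumexp` rather than multiplying raw probabilities.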
A Markov model defines a probability distribution over all possible sequences, in which some sequences are more probable and others are less. As used herein, the term training means estimating parameters for this distribution that maximize the probability of the training examples. The resulting parameter set is called the maximum likelihood estimate, or MLE, and there are efficient algorithms for finding it given a training set and an initial guess at the parameter values. If one has two classes of examples, one trains two Markov models, one on each set. To classify a new example, one then asks which model assigns that example the higher probability. A well-known theorem states that if a Markov model is the appropriate model for the process that generated the sequences, then the MLE parameters will yield classification with the lowest rate of error.
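The train-then-compare procedure described above can be sketched for a simple first-order Markov chain over symbols, where the maximum likelihood estimate reduces to relative transition counts. The function names, the toy training strings, and the small smoothing constant `alpha` are assumptions introduced for illustration; the smoothing is a departure from the pure MLE, added only so that unseen transitions do not receive probability zero. The initial-symbol distribution is ignored for brevity.

```python
import math

def train_markov(sequences, alphabet, alpha=1e-3):
    """Estimate first-order transition probabilities from example sequences.

    The pure MLE is the relative frequency of each transition. The small
    alpha is an illustrative smoothing assumption, not part of the MLE.
    """
    counts = {a: {b: alpha for b in alphabet} for a in alphabet}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    model = {}
    for a in alphabet:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in alphabet}
    return model

def log_likelihood(model, seq):
    """Sum of log transition probabilities of seq under the model."""
    return sum(math.log(model[prev][cur]) for prev, cur in zip(seq, seq[1:]))

def classify(models, seq):
    """Return the label whose model assigns seq the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(models[label], seq))

# One model per class, each trained only on its own (toy) examples.
alphabet = "ab"
models = {
    "alternating": train_markov(["abababab", "babababa"], alphabet),
    "repeating": train_markov(["aaaabbbb", "bbbbaaaa"], alphabet),
}

label_1 = classify(models, "ababab")
label_2 = classify(models, "aaabbb")
```

Note that each call to `train_markov` sees only the examples of its own class, which is the property of MLE training that the rest of this discussion turns on.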
In practice, it is rare that a Markov model is a perfect fit. Consequently, there is some built-in error. One way of understanding the problem is that MLE parameters assign high likelihood not just to training examples, but also to a large range of similar sequences, which may include some examples that belong to another class.
By way of illustration, the probability distribution can be visualized as a topographic map, peaking in the middle of the positive examples. With conventionally trained models, it is not uncommon that some examples from one class "o" are assigned a high probability by the Markov model for the other class "x", because the training algorithm shapes the distribution to cover the "x" examples but makes no attempt to avoid the "o" examples. This leads to classification errors. Classifiers are often visualized in terms of decision surfaces. If one maps each sequence onto a point in some high-dimensional space, the decision surface is the set of points that are assigned equal likelihood by the two Markov models. One classifies a sequence by noting which side of the decision surface it lies on. MLE parameters result in decision surfaces that get most of the classifications right, but err on the most unusual examples.
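In symbols, the decision surface and the resulting classification rule might be written as follows. The notation is an assumption introduced here: $s$ denotes a sequence, and $\theta_x$ and $\theta_o$ denote the parameters of the models trained on the "x" and "o" classes.

```latex
\mathcal{D} = \bigl\{\, s : P(s \mid \theta_x) = P(s \mid \theta_o) \,\bigr\},
\qquad
\hat{c}(s) =
\begin{cases}
  x & \text{if } \log P(s \mid \theta_x) - \log P(s \mid \theta_o) > 0,\\
  o & \text{otherwise.}
\end{cases}
```

The sign of the log-likelihood difference indicates which side of the surface $\mathcal{D}$ a sequence lies on.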
Even so, MLE-based Markov model classifiers work well enough to be of scientific and economic value. Moreover, there is a large range of applications and potential applications that can become practical if the classification error can be further reduced. In the commercial realm, low accuracy is currently an impediment to widespread use of Markov-model-based speech recognition systems, visual gesture recognition systems, and industrial process monitoring systems.
This problem is most acute with hidden Markov models. Like Markov models, hidden Markov models are rarely a perfect model for the data, and so the MLE parameters do not necessarily minimize classification error.
As to other approaches, there is a theorem which states that, in the case of infinite training data, the optimal classifier has parameters that maximize a measure known as mutual information. This is called the maximum mutual information estimate (MMIE). However, it is not known whether MMIE parameters are preferable to MLE parameters for finite, and in particular small, amounts of training data. Moreover, training algorithms for MMIE parameters are extremely slow and may not converge to a desirable result on a practical time scale. For this reason, they are not often used, even within the research community. Thus, the MMIE approach does not have a clear advantage over the MLE approach.
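For comparison, the two training criteria can be written side by side. This is one common way of stating them, not necessarily the formulation behind the theorem cited above, and the notation is an assumption: the training set consists of sequences $s_i$ with class labels $c_i$, each class $c$ has a model with parameters $\theta_c$ and prior $P(c)$, and $\Theta$ denotes the parameters of all class models together.

```latex
\theta_c^{\mathrm{MLE}}
  = \arg\max_{\theta_c} \sum_{i \,:\, c_i = c} \log P(s_i \mid \theta_c),
\qquad
\Theta^{\mathrm{MMIE}}
  = \arg\max_{\Theta} \sum_{i} \log
    \frac{P(s_i \mid \theta_{c_i})\, P(c_i)}
         {\sum_{c'} P(s_i \mid \theta_{c'})\, P(c')} .
```

In this form, the MLE sum for class $c$ ranges only over that class's own examples, whereas the denominator of the MMIE criterion involves every class model for every example, which couples the models and makes the optimization harder.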
Remember that MLE parameters are based only on positive examples of the class. One reason why hard out-of-class examples wind up on the wrong side of the decision surface is that the MLE computation never sees them, so there is no way for the probability distribution to avoid them.