This invention relates to determining the probability of data sequences given the most likely underlying cause of the data sequences, and, more particularly, to modeling data sequences as the output of smooth motion through a continuity map (CM) in which each possible item in the data sequence is produced with some probability from each position in the CM. This invention was made with government support under Contract No. W-7405-ENG-36 awarded by the U.S. Department of Energy. The government has certain rights in the invention.
Determining the probability of data sequences is a difficult problem with several applications. For example, current speech recognition algorithms use language models (which estimate the probability of a sequence of words) to increase recognition accuracy by preventing the programs from outputting nonsense sentences. The grammars that are currently used are typically stochastic, meaning that they are used to estimate the probability of a word sequence--a goal of the present invention. For example, in order to determine the probability of the sentence "the cow's horn honked", an algorithm might use stored knowledge about the probability of "cow's" following "the", "horn" following "cow's", and "honked" following "horn". Grammars such as these are called bigram grammars because they use stored information about the probability of two-word sequences.
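The bigram estimate described above can be sketched as follows. This is a minimal illustration, not the method of the present invention: the toy corpus, the function names `train_bigram` and `bigram_probability`, and the choice to omit the probability of the first word are all assumptions made for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count how often each word occurs as a context and how often each word pair occurs."""
    context_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def bigram_probability(words, context_counts, bigram_counts):
    """Multiply the estimates of P(cur | prev) over adjacent word pairs.

    Returns 0.0 if a context word was never seen in training.
    """
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if context_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob

# Hypothetical toy corpus; real grammars are trained on far more text.
corpus = [
    ["the", "cow's", "horn", "curved"],
    ["the", "car's", "horn", "honked"],
    ["the", "truck's", "horn", "honked"],
]
context_counts, bigram_counts = train_bigram(corpus)
p = bigram_probability(["the", "cow's", "horn", "honked"],
                       context_counts, bigram_counts)
print(p)  # (1/3) * 1 * (2/3), about 0.22
```

In this toy corpus the sentence "the cow's horn honked" never occurs, yet the bigram estimate assigns it a probability of about 0.22, because each adjacent pair of words was seen during training.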
Notice that, although cow's horns typically do not honk, a bigram grammar would consider this a reasonable sentence because the word "honked" frequently follows "horn". This problem can be alleviated by finding the probabilities of longer word sequences. A speech recognition algorithm using the probabilities of three-word sequences (trigrams) would be unlikely to output the example sentence because the probability of the sequence "cow's horn honked" is small. Using sequences of four, five, six, or more words should improve recognition even more. While it is theoretically possible to calculate the probabilities of all three-word or four-word sequences, the number of probabilities that have to be estimated increases exponentially with the length of the word sequence: if there are N words in the grammar, then N*N probabilities must be estimated for a bigram grammar, N*N*N probabilities for a trigram grammar, and so on.
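To give a sense of scale, the growth in the number of probabilities can be computed directly. The vocabulary size of 20,000 words below is a hypothetical figure chosen only for illustration, and `ngram_parameter_count` is a name introduced for this sketch.

```python
def ngram_parameter_count(vocabulary_size: int, n: int) -> int:
    """Number of distinct n-word sequences over a vocabulary of the given size (N**n)."""
    return vocabulary_size ** n

VOCABULARY_SIZE = 20_000  # hypothetical vocabulary size, for illustration only

counts = {n: ngram_parameter_count(VOCABULARY_SIZE, n) for n in (2, 3, 4)}
for n, count in counts.items():
    print(f"{n}-word sequences: {count:,}")
```

Under this assumption a bigram grammar already requires 400 million probabilities, and a trigram grammar requires 8 trillion.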
In accordance with the present invention, methods for estimating the likelihood of data sequences are adapted for application to detecting anomalous data sequences, such as fraudulent medical histories or abnormal data from sensors on car engines, nuclear power plants, factories, and the like, indicating a possible malfunction of the sensors or of the engine/power plant/factory. Consider the problem of detecting fraud using large databases of medical histories, in which each patient's medical history is composed of a sequence of medical procedures performed on the patient. Given a large enough collection of medical histories, such as are available at large medical insurance companies or the various medical social services, sufficient statistical information should be present to infer whether the care that is being delivered, and the charges for the care, are normal or are unusual. If a patient's medical history is sufficiently anomalous, the patient's medical history or the physician's record should be further reviewed for evidence of fraud.
As with many problems where estimating the probability of a sequence is required, using the simplest technique for estimating the probability (multiplying the probability of the first symbol by the probability of the second symbol, etc.) would be very inaccurate. One source of inaccuracy is the failure to take into account the order of the symbols. After all, the likelihood of a patient's medical history depends on the order in which the medical treatments are delivered. This can be seen by noting that doctors are likely to perform simple, non-invasive tests before performing more complex and/or invasive tests. Thus, seeing the same set of tests performed in a different order may radically alter an expert's estimate of how probable a medical history is. Since the order of the medical procedures tells a great deal about the probability of the sequence of procedures, many patients who have relatively probable medical histories will look like potential cases of fraud (and thus overburden the fraud detection system) unless the order of the medical procedures is taken into account when evaluating the likelihood of a patient's medical history.
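The order-blindness of the simplest technique can be shown with a short sketch. The procedure names and their probabilities are hypothetical, and exact fractions are used so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import prod

# Hypothetical per-procedure probabilities, for illustration only.
PROCEDURE_PROB = {
    "blood_test": Fraction(1, 2),
    "x_ray": Fraction(3, 10),
    "biopsy": Fraction(1, 5),
}

def order_blind_probability(history):
    """Multiply the individual procedure probabilities, ignoring their order."""
    return prod(PROCEDURE_PROB[p] for p in history)

usual_order = ["blood_test", "x_ray", "biopsy"]     # simple tests first
unusual_order = ["biopsy", "x_ray", "blood_test"]   # invasive test first

print(order_blind_probability(usual_order))    # 3/100
print(order_blind_probability(unusual_order))  # 3/100
```

Both histories receive the same estimate because multiplication does not depend on the order of its factors, so this technique cannot distinguish a plausible ordering of procedures from an implausible one.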
Furthermore, it is often the case that an estimate of the probability of a sequence per se is not wanted, but, rather, the likelihood of a sequence given the most likely underlying cause. To make this idea more concrete, imagine that there are three diseases with the following characteristics:
1. Disease 1 is very common and is always treated with procedure 1A followed by procedure 1B and then procedure 1C.
2. Disease 2 is also common, but is always treated with procedure 2A followed by procedure 2B and then procedure 2C.
3. Disease 3 is very rare and is always treated by procedure 3A followed by procedure 3B and finally procedure 3C.
Now imagine that the following medical history is seen: procedure 1A, procedure 2A, procedure 2C. Although each of the procedures is relatively common, an expert might consider this sequence to be more anomalous than the sequence procedure 3A, procedure 3B, procedure 3C, even though the latter is less probable than the first sequence. This discrepancy between the probability of a sequence and the suspiciousness of the sequence is not unreasonable. After all, if the second sequence is seen, it is assumed that the patient had disease 3 and received the normal treatment for that disease. On the other hand, there is no known disease that would cause a physician to perform the first sequence--making it seem suspicious. So the process of determining whether a sequence is anomalous is often something like this: the fraud investigator guesses the likely cause of the sequence and then decides whether the procedures are probable given the most likely cause. If the investigator can find no likely reason for performing a sequence of procedures, then the sequence is considered suspicious.
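The three-disease example can be sketched as follows to contrast the two ways of scoring a history. This illustrates the investigator's reasoning only and is not the method of the present invention. The disease priors are hypothetical numbers chosen to match "very common", "common", and "very rare", and the function names are introduced for this sketch.

```python
from math import prod

# Hypothetical priors: diseases 1 and 2 are common, disease 3 is very rare.
DISEASE_PRIOR = {"disease1": 0.60, "disease2": 0.39, "disease3": 0.01}

# Each disease is always treated with the same three procedures, in order.
TREATMENT = {
    "disease1": ("1A", "1B", "1C"),
    "disease2": ("2A", "2B", "2C"),
    "disease3": ("3A", "3B", "3C"),
}

def procedure_frequency(procedure):
    """Fraction of all performed procedures that are `procedure`.

    Valid here because every treatment consists of exactly three procedures.
    """
    return sum(DISEASE_PRIOR[d] / len(TREATMENT[d])
               for d in TREATMENT if procedure in TREATMENT[d])

def order_blind_probability(history):
    """Simplest estimate: product of the individual procedure frequencies."""
    return prod(procedure_frequency(p) for p in history)

def likelihood_given_best_cause(history):
    """Return (cause, P(history | cause)) for the cause that best explains the history.

    The cause is None when no disease explains the history at all.
    """
    def conditional(disease):
        return 1.0 if tuple(history) == TREATMENT[disease] else 0.0
    best = max(TREATMENT, key=conditional)
    score = conditional(best)
    return (best, score) if score > 0.0 else (None, 0.0)

mixed = ["1A", "2A", "2C"]  # common procedures, no explaining disease
rare = ["3A", "3B", "3C"]   # rare procedures, fully explained by disease 3

for name, history in (("mixed", mixed), ("rare", rare)):
    print(name, order_blind_probability(history),
          likelihood_given_best_cause(history))
```

Under these assumptions the order-blind estimate rates the mixed history as far more probable than the rare one, while the likelihood given the most likely cause is 1.0 for the rare history (cause: disease 3) and 0.0 for the mixed history, which matches the expert's judgment that the mixed history is the suspicious one.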
Thus, for detecting anomalies/fraud, it would be useful to have a technique for evaluating the likelihood of data sequences that takes into consideration the order of the elements in the sequences, and tries to infer the underlying cause of the sequence. However, as with language models, prior art techniques for determining the likelihood of such sequences require estimating many parameters, and typically do not try to estimate the underlying cause.
The invention described herein addresses these problems. After describing the invention, a preliminary experiment that demonstrates the ability of the invention to detect anomalous data sequences is presented.
Additional objects, advantages and novel features of the invention will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.