1. Field of the Invention
The present invention generally relates to a system and method for causal modeling and outlier detection. More particularly, the present invention relates to a system and a method for causal modeling and outlier detection using a novel sub-class of Bayesian Networks, which is hereinafter called “Reverse Bayesian Trees”, or synonymously a “Reverse Bayesian Forest”.
2. Description of the Related Art
The data available in many application domains often lack label information. Label information indicates whether each datum possesses a certain property of interest, such as, for example, being fraudulent, being faulty, or the like.
Even in cases where some label information is available, the labeled data are usually scarce. Exemplary applications whose data may exhibit this property include fraud detection, tool anomaly detection, network intrusion detection, and the like.
For the reasons stated above, conventional solutions that rely on labeled data are often not applicable in practice. Recently, alternative solutions to this problem, which rely on outlier detection methods that detect and flag unusual patterns in data, have drawn considerable attention. The majority of these methods, however, have relied upon nearest-neighbor type modeling and detection schemes, which are not capable of providing a useful explanation of why the flagged outliers are judged to be outliers.
One promising approach to providing an effective outlier detection method with a useful explanation is to apply modeling methods that use Bayesian networks. (See, for example, Heckerman, David, “A Tutorial on Learning with Bayesian Networks”, in “Learning in Graphical Models”, Jordan, M., editor, MIT Press, Cambridge, Mass., 1999.) A Bayesian network is a directed acyclic graph in which nodes represent variables and arcs represent probabilistic dependency relations among the variables. However, Bayesian network estimation has not been extensively pursued for the purpose of outlier detection, except in isolated cases, because of the high computational complexity required to estimate Bayesian networks. This is in part because a search for a near optimal network structure for given data tends to require an amount of computation that is exponential or more than exponential in the number of variables in question, which is often prohibitive in practical applications. For this reason, applying an estimation procedure to the entire, unrestricted class of Bayesian networks for causal modeling and outlier detection on a large scale data set is not practical, and some restriction must be placed on the class of networks considered in the modeling process.
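By way of illustration only, the growth of the unrestricted search space can be made concrete by counting the directed acyclic graphs that are possible on a given number of labeled variables. The following is a minimal sketch, assuming the standard recurrence for this count (commonly attributed to Robinson); the function name count_dags is hypothetical and is not part of any method described herein.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Count labeled directed acyclic graphs on n nodes.

    Uses the inclusion-exclusion recurrence over the k nodes
    that have no incoming arcs (illustrative helper only).
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Print the number of candidate network structures for 1 to 7 variables.
for n in range(1, 8):
    print(n, count_dags(n))
```

With only five variables there are already 29281 candidate structures, which suggests why an exhaustive search over unrestricted structures quickly becomes impractical as the number of variables grows.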
A representative method achieves computationally efficient modeling and outlier detection by restricting its attention to a sub-class of Bayesian networks called “Chow-Liu expansions” (also known as “dependency trees”). (See, for example, Chow, C., and Liu, C. 1968, “Approximating discrete probability distributions with dependence trees”, IEEE Transactions on Information Theory, 14(11):462-467.) While this method is efficient and has been applied in a number of practical applications, it suffers from the shortcoming that the sub-class is not very expressive. More concretely, it is only able to capture dependency relations between two variables at a time, which in some applications may prove undesirable. This issue is particularly pressing because one of the most effective methods for determining causal dependency relations between variables requires that the relationship among three or more variables be observed.
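By way of illustration only, the limitation of pairwise dependency relations can be seen in a small constructed example in which two independent binary variables X and Y determine a third variable Z as their exclusive-or. The following is a minimal sketch under that assumption; the data and the helper mutual_information are hypothetical and are not taken from the cited reference. Every pair of variables carries zero mutual information, so a model restricted to pairwise relations has nothing to detect, whereas X and Y taken together fully determine Z.

```python
from collections import Counter
from itertools import product
from math import log2

# Hypothetical joint distribution: X and Y are independent uniform bits,
# and Z = X xor Y. Each row is one equally likely outcome (x, y, z).
samples = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]


def mutual_information(rows, a_idx, b_idx):
    """Mutual information in bits between two groups of columns.

    a_idx and b_idx are tuples of column indices; each group is
    treated as a single compound variable.
    """
    n = len(rows)
    a_vals = [tuple(r[i] for i in a_idx) for r in rows]
    b_vals = [tuple(r[i] for i in b_idx) for r in rows]
    p_ab = Counter(zip(a_vals, b_vals))
    p_a = Counter(a_vals)
    p_b = Counter(b_vals)
    return sum(
        (c / n) * log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in p_ab.items()
    )


# Pairwise dependencies: all three are zero for this distribution.
pairwise = {
    "X;Y": mutual_information(samples, (0,), (1,)),
    "X;Z": mutual_information(samples, (0,), (2,)),
    "Y;Z": mutual_information(samples, (1,), (2,)),
}

# Three-variable dependency: (X, Y) jointly carry one full bit about Z.
joint = mutual_information(samples, (0, 1), (2,))

print(pairwise)
print("(X,Y);Z", joint)
```

In this constructed case a dependency tree built from pairwise statistics would treat the three variables as unrelated, even though the three-variable relationship is deterministic.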