With the advent of the Internet, and especially electronic commerce (“e-commerce”) over the Internet, the use of data analysis tools, has increased dramatically. In e-commerce and other Internet and non-Internet applications, databases are generated and maintained that have astronomically large amounts of information. Such information is typically analyzed, or “mined,” to learn additional information regarding customers, users, products, etc. This information allows businesses and other users to better implement their products and/or ideas.
Data mining (also known as Knowledge Discovery in Databases—KDD) has been defined as “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” Data mining can employ machine learning, statistical and/or visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. Generally speaking, humans recognize or translate graphical items more easily than textual ones. Thus, larger amounts of information can be relayed utilizing this means than by other methods. As such, graphical statistical models have proven invaluable in data mining.
A Bayesian network is one type of a graphical statistical model that encodes probabilistic relationships among variables of interest. Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems. More recently, researchers have developed methods for learning Bayesian networks from data. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. First, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Second, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Third, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. And fourth, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the over fitting of data.
Bayesian network statistical model variations include decision trees and decision graphs. A decision tree data structure corresponds generally to an acyclic, undirected graph where nodes are connected to other respective nodes via a single path. The graph is acyclic in that there is no path that both emanates from a vertex and returns to the same vertex, where each edge in the path is traversed only once. A probabilistic decision tree is a decision tree that is used to represent a conditional probability distribution for a target variable given some set of predictor variables. As compared to a table, which is another way to represent a conditional probability distribution when all variables are discrete, a tree is generally a more efficient way of storing probabilities because of its ability to represent equality constraints within a conditional probability distribution.
A decision graph is a further generalization of a decision tree. Similar to a decision tree, a decision graph can represent equality constraints in a conditional probability distribution. In contrast to a decision tree, however, non-root nodes in a decision graph can have more than one parent. This characteristic enables a richer set of relationships to be represented by a decision graph than by a decision tree. For example, relationships between a non-root node and multiple parent nodes can be represented in a decision graph by corresponding edges interconnecting the non-root node with its parent nodes.
Graphical models facilitate probability theory through the utilization of graph theory. This allows for a method of dealing with uncertainty while reducing complexity. The modularity of a graphical model permits representation of complex systems by utilizing less complex elements. The connections and relationships of individual elements are identified by the probability theory, while the elements themselves are constructed by the graph theory. Utilizing graphics also provides a much more intuitive human interface to difficult problems.
Nodes of a probabilistic graphical model represent random variables. Their connectivity can indicate associative qualities such as dependence and independence and the like. If no connectivity (i.e., “arcs”) are present, this represents conditional independence assumptions, providing a representation of joint probability distributions. Graphical models can be “directed” or “undirected” depending on how they are constructed. Undirected graphical models have a more simplistic definition of independence, while directed graphical models are more complex by nature. Bayesian or “Belief” networks (BN) are included in the directed category and are utilized extensively in statistics and artificial intelligence to show causality between elements or “nodes.” They are also highly beneficial in supplying “inferences.” That is, they arc able to infer information based on a posterior probability (i.e., “likelihood”) utilizing Bayes' rule. Thus, for a given outcome, its cause can be probabilistically deduced utilizing a directed graphical model. Inferencing is a very powerful tool that is employed in many facets of society.
Often determining an exact inference requires significant computational power due to the complexity of the inference algorithm. It is helpful to approximate the inference in this case rather than compute the exact inference. Such methods of inference approximations include variational methods, sampling methods, loopy belief propagation, bounded cutest conditioning, and parametric approximation methods and the like. An example of variational methods is mean-field approximation. This method exploits the law of large numbers to approximate large sums of random variables by their means. A variational parameter is introduced for each node after they have been decoupled. These parameters are iteratively updated to minimize cross entropy between approximate and true probability distributions. Thus, updating the parameters becomes representative of inference.