The use of data analysis tools has increased dramatically as society has become more dependent on digital information storage. In e-commerce and other Internet and non-Internet applications, databases are generated and maintained that have astronomically large amounts of information. Such information is typically analyzed, or “mined,” to learn additional information regarding customers, users, products, etc. This information allows businesses and other users to better implement their products and/or ideas.
Data mining is typically the extraction of information from data to gain a new, insightful perspective. Data mining can employ machine learning, statistical and/or visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. Generally speaking, humans recognize or translate graphical items more easily than textual ones. Thus, larger amounts of information can be relayed utilizing this means than by other methods. As such, graphical statistical models have proven invaluable in data mining.
A Bayesian network is one type of a graphical statistical model that encodes probabilistic relationships among variables of interest. Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. Because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. A graphical model, such as a Bayesian network, can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Additionally, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the over fitting of data.
Bayesian network statistical model variations include decision trees and decision graphs. A decision tree data structure corresponds generally to an acyclic, undirected graph where nodes are connected to other respective nodes via a single path. The graph is acyclic in that there is no path that both emanates from a vertex and returns to the same vertex, where each edge in the path is traversed only once. A probabilistic decision tree is a decision tree that is used to represent a conditional probability distribution for a target variable given some set of predictor variables. As compared to a table, which is another way to represent a conditional probability distribution when all variables are discrete, a tree is generally a more efficient way of storing probabilities because of its ability to represent equality constraints within a conditional probability distribution.
A decision graph is a further generalization of a decision tree. Similar to a decision tree, a decision graph can represent equality constraints in a conditional probability distribution. In contrast to a decision tree, however, non-root nodes in a decision graph can have more than one parent. This characteristic enables a richer set of relationships to be represented by a decision graph than by a decision tree. For example, relationships between a non-root node and multiple parent nodes can be represented in a decision graph by corresponding edges interconnecting the non-root node with its parent nodes.
Graphical models facilitate probability theory through the utilization of graph theory. This allows for a method of dealing with uncertainty while reducing complexity. The modularity of a graphical model permits representation of complex systems by utilizing less complex elements. The connections and relationships of individual elements are identified by the probability theory, while the elements themselves are constructed by the graph theory. Utilizing graphics also provides a much more intuitive human interface to difficult problems.
Nodes of a probabilistic graphical model represent random variables. Their connectivity can indicate associative qualities such as dependence and independence and the like. If no connectivity (i.e., “arcs”) is present, this represents conditional independence assumptions, providing a representation of joint probability distributions. Graphical models can be “directed” or “undirected” depending on how they are constructed. Undirected graphical models have a more simplistic definition of independence, while directed graphical models are more complex by nature. Bayesian or “Belief” networks (BN) are included in the directed category and are utilized extensively in statistics and artificial intelligence to show causality between elements or “nodes.” They are also highly beneficial in supplying “inferences.” That is, they are able to infer information based on a posterior probability (i.e., “likelihood”) utilizing Bayes' rule. Thus, for a given outcome, its cause can be probabilistically deduced utilizing a directed graphical model.
A directed acyclic graph (DAG) such as a Bayesian network can also be applied to provide predictions relating to a time series. One such technique is to utilize an autoregressive (AR) method combined with a moving average (MA) method. This is known as “ARMA” or autoregressive, moving average. These types of models are models of persistence in a time series. By understanding the persistence of a time series, predictions can be made regarding future values of that time series. This proves invaluable in economics, business, and industrial arenas. Predicting behavior allows one to adjust parameters if the desired outcome is not the predicted outcome. Thus, for example, a company can predict its stock value based on current financial states and determine if they need to improve on cash reserves, sales, and/or capital investments in order to achieve a desired stock price. This also permits a study of the “influence” of various parameters on future values.