With the advent of the Internet, and especially electronic commerce (“e-commerce”) over the Internet, the use of data analysis tools, has increased dramatically. In e-commerce and other Internet and non-Internet applications, databases are generated and maintained that have astronomically large amounts of information. Such information is typically analyzed, or “mined,” to learn additional information regarding customers, users, products, etc. This information allows businesses and other users to better implement their products and/or ideas.
Data mining (also known as Knowledge Discovery in Databases—KDD) has been defined as “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” Data mining can employ machine learning, statistical and/or visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. Generally speaking, humans recognize or translate graphical items more easily than textual ones. Thus, larger amounts of information can be relayed utilizing this means than by other methods. As such, graphical statistical models have proven invaluable in data mining.
A Bayesian network is a graphical statistical model that encodes probabilistic relationships among variables of interest. Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems. More recently, researchers have developed methods for learning Bayesian networks from data. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. First, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Second, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Third, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. And fourth, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the over fitting of data.
Although the Bayesian network has proven to be a valuable tool for encoding, learning and reasoning about probabilistic relationships, there are other methods of analysis, such as dependency networks. A dependency network, like a Bayesian network, is a graphical representation of probabilistic relationships. This representation can be thought of as a collection of regressions or classifications among variables in a domain that can be combined using the machinery of Gibbs sampling to define a joint distribution for that domain. The dependency network has several advantages and also disadvantages with respect to a Bayesian network. For example, a dependency network is not useful for encoding causal relationships and is difficult to construct using a knowledge-based approach. Nonetheless, there are straightforward and computationally efficient methods for learning both the structure and probabilities of a dependency network from data; and the learned model is quite useful for encoding and displaying predictive (i.e., dependence and independence) relationships. In addition, dependency networks are well suited to the task of predicting preferences—a task often referred to as collaborative filtering.
Other statistical models include decision trees and decision graphs. A decision tree data structure corresponds generally to an acyclic, undirected graph where nodes are connected to other respective nodes via a single path. The graph is acyclic in that there is no path that both emanates from a vertex and returns to the same vertex, where each edge in the path is traversed only once. A probabilistic decision tree is a decision tree that is used to represent a conditional probability distribution for a target variable given some set of predictor variables. As compared to a table, which is another way to represent a conditional probability distribution when all variables are discrete, a tree is generally a more efficient way of storing probabilities because of its ability to represent equality constraints within a conditional probability distribution.
A decision graph is a further generalization of a decision tree. Similar to a decision tree, a decision graph can represent equality constraints in a conditional probability distribution. In contrast to a decision tree, however, non-root nodes in a decision graph can have more than one parent. This characteristic enables a richer set of relationships to be represented by a decision graph than by a decision tree. For example, relationships between a non-root node and multiple parent nodes can be represented in a decision graph by corresponding edges interconnecting the non-root node with its parent nodes.
There are two traditional approaches for constructing statistical models, such as decision trees or decision graphs, namely, a knowledge-based approach and a data-based approach. Using the knowledge-based approach, a person (known as a knowledge engineer) interviews an expert in a given field to obtain the knowledge of the expert about the field of expertise of the expert. The knowledge engineer and expert first determine the distinctions of the world that are important for decision making in the field of the expert. These distinctions correspond to the variables in the domain of interest. For example, if a decision graph is to be used to predict the age of a customer based on the products that customer bought in a store, there would be a variable for “age” and a variable for all relevant products. The knowledge engineer and the expert next determine the structure of the decision graph and the corresponding parameter values that quantify the conditional probability distribution.
In the data-based approach, the knowledge engineer and the expert first determine the variables of the domain. Next, data is accumulated for those variables, and an algorithm is applied that creates one or more decision graphs from this data. The accumulated data comes from real world instances of the domain. That is, real world instances of decision making in a given field.
Typically, the data-based approach is more commonly utilized from a general stand point. Over the last few years, however, the sizes of these databases have been exponentially increasing as the ability to gather data more efficiently increases. This has produced enormous databases that take immense amounts of time to analyze, despite the ever increasing speeds gained in computer processing technology and storage access techniques.