The development and use of data analysis tools has been steadily increasing to help gain a greater understanding of information and relationships between different aspects of information. In e-commerce and other Internet and non-Internet applications, for example, databases are generated and maintained that have large amounts of information. Such information often is analyzed, or “mined,” to learn additional information regarding customers, users, products, etc.
One particular area relating to decision theory in which there is significant amount of research is Bayesian networks. A Bayesian network is a representation of the probabilistic relationships among distinctions about the world. Each distinction, sometimes called a variable, can take on one of a mutually exclusive and exhaustive set of possible states. By way of illustration, a Bayesian network can be represented as a directed graph where the variables correspond to nodes and the relationships (e.g., dependencies) between the nodes correspond to arcs connecting various nodes. When there is an edge between two nodes, the probability distribution of the first node depends upon the value of the second node when the direction of the edge points from the second node to the first node. The absence of edges in a Bayesian network conveys conditional independencies. However, two variables indirectly connected through intermediate variables are said to be conditionally dependent given lack of knowledge of the values (“states”) of the intermediate variables.
The variables in a Bayesian network can be discrete or continuous. A discrete variable is a variable that has a finite or countable number of states, whereas a continuous variable is a variable that has an infinite number of states. An example of a discrete variable is a Boolean variable. Such a variable can assume only one of two states: (e.g., “true” or “false”). An example of a continuous variable is a variable that may assume any real value between −1 and 1. Discrete variables have an associated probability distribution. Continuous variables, however, have an associated probability density function (“density”).
Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems. More recently, researchers have developed methods for learning Bayesian networks from data. The techniques that have been developed are new and still evolving, but they have been shown to be effective for some data-analysis problems.
The learning problem can be considered a classic heuristic-search problem: given a Bayesian network structure, there is a “score” that measures how well that structure fits with the data. The task is to utilize a search algorithm to find a good structure; typically, once a good structure has been identified, it is straight-forward to estimate the corresponding conditional probability distributions. Traditional approaches to learning Bayesian networks usually perform a greedy search though DAG space.
For example, in a conventional approach, the structure of a Bayesian network is a directed acyclic graph (DAG), and at each step of the learning process one considers (1) adding, (2) deleting, or (3) reversing an edge in the DAG. Typically, the process begins with a DAG containing no edges, and greedily performs these three operators until a local maximum is reached. However, many DAGs correspond to the same statistical model. Consider an example domain of two variables X and Y. A Bayesian network with structure X→Y (i.e. a directed edge from X to Y) can model all probability distributions where X and Y are dependent, as can the network structure X→Y. Thus, it is stated that these two models are equivalent. In contrast, a model with no edge between X and Y can only model those distributions where X and Y are independent.