In many application areas where graphical models are used and their structure is learned from data, the end goal is neither prediction nor density estimation. Rather, it is the uncovering of discrete relationships between entities. For example, in computational biology, one can be interested in discovering which proteins within a large set of proteins interact with one another, or which miRNA molecules (i.e., micro ribonucleic acid: single-stranded ribonucleic acid molecules that regulate gene expression) target which mRNA molecules (i.e., messenger ribonucleic acid: molecules of ribonucleic acid encoding a chemical “blueprint” for a protein product). In these problems, relationships can be represented by arcs in a graphical model. Consequently, given a learned model, it can be beneficial to determine how many of the arcs are real, that is, non-spurious.
Previous attempts to address the problem of uncovering discrete relationships between entities have involved computing confidence measures on arcs (and other features) of induced Bayesian networks by using a bootstrap (or parametric bootstrap). By resampling the data and learning a Bayesian network from each sampled data set, such attempts have counted the number of times a given arc occurs and estimated the probability of that arc, $\hat{p}_i$, as the proportion of bootstrap samples in which it is found. Nevertheless, such confidence measures typically do not estimate the number of non-spurious arcs. For example, applying a pathological search algorithm that systematically adds all arcs yields the estimate $\hat{p}_i = 1$ for every arc.
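The bootstrap confidence measure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `bootstrap_arc_frequencies`, `correlation_learner`, `all_arcs_learner`, and `make_toy_data` are hypothetical names introduced here, and the correlation-threshold learner is a toy stand-in for a real Bayesian network structure search, not the method used in the prior attempts.

```python
import random
from itertools import combinations


def bootstrap_arc_frequencies(data, learn_arcs, n_boot=200, seed=0):
    """Return {arc: p_hat}, where p_hat is the fraction of bootstrap
    samples in which the (hypothetical) learner `learn_arcs` found the arc."""
    rng = random.Random(seed)
    n = len(data)
    counts = {}
    for _ in range(n_boot):
        # Resample the rows of the data set with replacement.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        for arc in learn_arcs(sample):
            counts[arc] = counts.get(arc, 0) + 1
    return {arc: c / n_boot for arc, c in counts.items()}


def _pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def correlation_learner(sample, threshold=0.5):
    """Toy stand-in for structure search (an assumption, not a real
    Bayesian network learner): keep pairs with strong correlation."""
    n_vars = len(sample[0])
    cols = [[row[j] for row in sample] for j in range(n_vars)]
    return {
        (i, j)
        for i, j in combinations(range(n_vars), 2)
        if abs(_pearson(cols[i], cols[j])) > threshold
    }


def all_arcs_learner(sample):
    """Pathological learner: returns every possible arc, ignoring the data."""
    return set(combinations(range(len(sample[0])), 2))


def make_toy_data(n=100, seed=1):
    """Synthetic rows (x0, x1, x2): x1 depends on x0, x2 is independent."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        rows.append((x, x + 0.1 * rng.gauss(0, 1), rng.gauss(0, 1)))
    return rows


if __name__ == "__main__":
    data = make_toy_data()
    print(bootstrap_arc_frequencies(data, correlation_learner))
    # The pathological learner assigns a proportion of 1.0 to every arc.
    print(bootstrap_arc_frequencies(data, all_arcs_learner))
```

With the pathological learner, every arc receives a proportion of 1.0 no matter what the data contain, which illustrates why these proportions alone do not indicate how many arcs are non-spurious.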
Other attempts to provide solutions to the foregoing problem of identifying discrete relationships between entities have included using MCMC samples over variable orderings to compute marginal probabilities of arc hypotheses. Although such approaches have characterized the performance of the MCMC method, they typically have not determined whether the exact (or approximated) posterior probabilities are accurate or calibrated, in the sense that hypotheses labeled, for example, 0.4 are true 40% of the time.
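The notion of calibration used above can be made concrete with a small sketch. It assumes a setting, such as a simulation, in which the true status of each arc hypothesis is known; the function name `calibration_table` and the equal-width binning are illustrative choices, not part of the approaches described.

```python
def calibration_table(probs, truths, n_bins=10):
    """Group hypotheses by reported probability and compare, per bin,
    the mean reported probability with the observed fraction that is true.

    Returns a list of (mean_reported, fraction_true, count) tuples,
    one per non-empty bin, in order of increasing probability.
    """
    # Each bin holds: [count, sum of reported probabilities, number true].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for p, is_true in zip(probs, truths):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[k][0] += 1
        bins[k][1] += p
        bins[k][2] += int(bool(is_true))
    return [(s / c, t / c, c) for c, s, t in bins if c > 0]


if __name__ == "__main__":
    # Ten hypotheses labeled 0.4, of which four are actually true:
    # the reported and observed values agree, so this bin is calibrated.
    probs = [0.4] * 10
    truths = [True] * 4 + [False] * 6
    for mean_reported, fraction_true, count in calibration_table(probs, truths):
        print(f"{mean_reported:.2f} reported, {fraction_true:.2f} true, n={count}")
```

A method is well calibrated in this sense when the first two entries of each tuple are close across all bins.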
Other, more recent attempts to reveal discrete relationships between entities have utilized stochastic, greedy structure search algorithms, running such algorithms numerous times (e.g., in excess of 1000) to local optima, and scoring each arc according to the proportion of times the arc appeared across all local optima found. Although such an approach can provide asymptotic guarantees, it nevertheless fails to yield accurate estimates on finite data.
In yet a further attempt to uncover discrete relationships between entities, a frequentist test for edge inclusion in graphical Gaussian models (GGMs) has been developed. The technique provides a reasonable model for null distributions of this test, wherein a score is assigned to each edge based on how much it “hurts” the model when that edge is independently removed (this is assessed with all other possible edges included in the model, i.e., a one-backward-step search for each edge). These scores can then be employed to compute a false discovery rate for a given set of edges. However, in application, such an approach has been found to assign low scores to a vast quantity of real arc hypotheses, resulting in inaccurate estimates of the false discovery rate.