Technical Field
The disclosed subject matter relates generally to identifying associations in large data sets.
Background of the Related Art
Mutual Information
The information-theoretic concepts of entropy and mutual information are well-known. The Shannon entropy of a discrete random variable X is a measure of its information content and is defined by
$$H(X) = -\sum_{i=1}^{B} p(x_i) \log p(x_i) \qquad (0.1)$$
where $x_1, \ldots, x_B$ are the possible values of X.
Given two random variables, X and Y, their mutual information quantifies how much of the information in each variable can be explained by the other and is defined by:
$$I(X;Y) = H(X) + H(Y) - H(X,Y) \qquad (0.2)$$
where H(X,Y) refers to the entropy of the ordered pair (X,Y), and can be re-written directly as:
$$I(X;Y) = \sum_{i,j=1}^{B} p(i,j) \log \frac{p(i,j)}{p_x(i)\, p_y(j)} \qquad (0.3)$$
where p is the joint probability distribution function and $p_x$ and $p_y$ are the marginal probability distribution functions of X and Y respectively. Intuitively, mutual information is a measure of the degree to which knowing the value of either variable gives information about the other.
It has several properties that make it a desirable basis for a measure of association: it is invariant under order-preserving transformations of X and Y, it is always non-negative, and, most importantly, it equals zero precisely when X and Y are statistically independent.
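The definitions above can be checked numerically. The following is a minimal sketch, not part of the disclosed subject matter, that computes the entropy of Equation 0.1 and the mutual information of a small joint probability table in two ways: from the entropies as in Equation 0.2, and as a direct double sum of p(i,j) log[p(i,j)/(p_x(i) p_y(j))]. The function names and the two 2x2 tables are hypothetical, and natural logarithms are assumed.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability array (Equation 0.1)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """Mutual information in nats as a direct double sum over a joint table.

    Rows index the values of X and columns index the values of Y.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal of X, shape (B, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, B)
    mask = joint > 0                       # skip empty cells
    ratio = joint[mask] / (px @ py)[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))

# Hypothetical joint table in which X and Y tend to agree.
dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
# Hypothetical joint table built as a product of marginals (independent).
independent = np.outer([0.5, 0.5], [0.3, 0.7])

mi_direct = mutual_information(dependent)
mi_from_entropies = (entropy(dependent.sum(axis=1))
                     + entropy(dependent.sum(axis=0))
                     - entropy(dependent))
mi_independent = mutual_information(independent)
```

In this sketch the two computations agree for the dependent table, the result is positive, and the independent table scores zero up to floating-point error, matching the properties listed above.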
The challenge in applying mutual information to detect relationships between two continuous variables lies in the fact that it is defined over probability distributions, which must themselves be estimated from a finite set of measurements. Methods for estimating the mutual information of the prior distribution of a set of measurements, called mutual information estimators, have been based on techniques ranging from histograms to kernel density estimation to clustering.
Mutual Information Estimation Using Gridding-Based Methods
Estimating mutual information from empirical data commonly involves two steps: estimating the joint distribution of the data, and then calculating the mutual information of that estimated distribution. Early methods for mutual information estimation impose a grid on the data in question to obtain an estimate of their joint probability distribution as illustrated in FIG. 1 (a). To this distribution they apply Equation 0.3 above to obtain a mutual information score. When formulating an approach to gridding the plot of a set of ordered points, there are two main questions to be considered: how many cells to allow in the grid, and where to place the grid lines. Equispatial partitioning (i.e., the histogram approach) simply chooses fixed numbers of rows and columns and splits the scatter plot into equal-sized rectangles. Equipartitioning (called adaptive partitioning in the literature) chooses fixed numbers of rows and columns and splits the scatter plot while attempting to place the same number of points in each row and column. Alternatively, the more advanced Fraser-Swinney algorithm continues refining its partition until a condition about the equi-distribution of the points in the cells is met. Similarly, algorithms for Bayesian bin density estimation can be used to evaluate the posterior probabilities of all possible binnings in estimating the joint distribution.
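The two simplest gridding schemes described above can be sketched as follows. This is an illustrative sketch only, assuming NumPy: equispatial partitioning is implemented with equally spaced grid lines, and equipartitioning with grid lines at empirical quantiles. The helper names, the synthetic data, and the choice of 8 rows and columns are all assumptions made for the example, and the Fraser-Swinney and Bayesian methods are not shown.

```python
import numpy as np

def mi_from_counts(counts):
    """Mutual information (nats) of the distribution implied by grid cell counts."""
    joint = np.asarray(counts, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def equispatial_edges(values, bins):
    """Grid lines that split the observed range into equal-width intervals."""
    return np.linspace(values.min(), values.max(), bins + 1)

def equipartition_edges(values, bins):
    """Grid lines at empirical quantiles, so each interval holds about the same number of points."""
    return np.quantile(values, np.linspace(0.0, 1.0, bins + 1))

def grid_mi(x, y, bins, edges_fn):
    """Impose a bins-by-bins grid using edges_fn on each axis and score it."""
    counts, _, _ = np.histogram2d(
        x, y, bins=[edges_fn(x, bins), edges_fn(y, bins)])
    return mi_from_counts(counts)

# Hypothetical skewed data, chosen so that the two griddings differ visibly.
rng = np.random.default_rng(0)
x = rng.exponential(size=2000)
y = x + 0.5 * rng.normal(size=2000)

mi_equispatial = grid_mi(x, y, 8, equispatial_edges)
mi_equipartition = grid_mi(x, y, 8, equipartition_edges)

# Points per column under each scheme, for comparison.
cols_equispatial = np.histogram(x, bins=equispatial_edges(x, 8))[0]
cols_equipartition = np.histogram(x, bins=equipartition_edges(x, 8))[0]
```

With the skewed sample used here, the equispatial columns are very unevenly filled while the equipartition columns hold essentially equal numbers of points, which is the distinction the text draws between the two schemes.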
In addition to being heavily impacted by the partitioning method chosen, grid-based approaches are significantly affected by the numbers of rows and columns of the chosen grid (FIG. 1 (b-d)). To illustrate this point, imagine a set of 1000 uniformly distributed points in the x-y plane. If each axis is partitioned in half (making 4 total “boxes”) then each quadrant of the x-y plane will have approximately the same number of points in it, and the estimated mutual information of this distribution is very close to zero. If, on the other hand, each axis is partitioned into 1,000 pieces (making a total of 10⁶ boxes), most of the columns will contain close to one non-empty box and the same is true for most of the rows. This will lead to a very high mutual information score since knowing the column in which a point falls is almost tantamount to knowing the row in which it falls and vice versa. This illustrates a standard dilemma in density estimation: if the number of partitions used is too small, then only very coarse details of the data become visible, but if it is too large, then spurious fine details receive too much emphasis, resulting in artificially high mutual information scores.
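The thought experiment above can be reproduced in a few lines. The following sketch assumes NumPy, an equispatial grid on the unit square, a fixed random seed, and natural logarithms; the function name grid_mi is hypothetical.

```python
import numpy as np

def grid_mi(x, y, bins):
    """Mutual information (nats) estimated from a bins-by-bins equispatial grid on [0, 1]^2."""
    counts, _, _ = np.histogram2d(x, y, bins=bins, range=[[0, 1], [0, 1]])
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # only non-empty boxes contribute
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# 1000 points drawn independently and uniformly, so the true mutual information is zero.
rng = np.random.default_rng(0)
x = rng.uniform(size=1000)
y = rng.uniform(size=1000)

mi_coarse = grid_mi(x, y, 2)     # 4 boxes
mi_fine = grid_mi(x, y, 1000)    # 10**6 boxes
```

Under these assumptions the 2-by-2 grid yields a score near zero, while the 1000-by-1000 grid yields a score that is a large fraction of its maximum possible value of log(1000), even though the two variables are independent.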
When attempting to estimate the mutual information of a distribution given a finite set of points drawn from that distribution, equipartitioning performs significantly better than equispatial partitioning provided the numbers of rows and columns are large enough and the sample size is even larger. However, as discussed above, both of these methods suffer from the fact that the numbers of rows and columns chosen affect the resultant scores. The Fraser-Swinney algorithm does not provide a significant improvement over either method. Finally, in cases where N ≪ M_XY, approaches like Bayesian bin density estimation tend to work well.
Limitations of Mutual Information
Mutual information is a nonparametric statistic that is designed to detect any deviation from statistical independence, regardless of the form of that deviation. Because of this attractive property, mutual information has been used in several settings to measure the strength of associations of specific types. However, it has not been effectively used to compare relationships of different types to each other. This is because the mutual information of different types of functional relationships depends heavily on the specific functions governing those relationships. Simply put, not all functions preserve information to the same extent. For example, a perfect linear relationship has a higher mutual information than a perfect parabolic one because a linear function maps distinct x values to distinct y values, while a parabolic function maps pairs of different x values to the same y value.
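The linear-versus-parabolic comparison can be made concrete with a small discrete stand-in, which is an assumption of this sketch rather than the continuous setting discussed above. For a noiseless function Y = f(X), the pair (X,Y) carries no more information than X alone, so H(X,Y) = H(X) and Equation 0.2 reduces to I(X;Y) = H(Y). The example below takes X uniform on the seven integers from -3 to 3; the specific functions and names are hypothetical.

```python
import numpy as np
from collections import Counter

def entropy_of_values(values):
    """Shannon entropy (nats) of the empirical distribution of a list of values."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

# X is uniform on {-3, ..., 3}; each value listed once has probability 1/7.
xs = list(range(-3, 4))
linear = [2 * x + 1 for x in xs]   # one-to-one: seven distinct y values
parabolic = [x * x for x in xs]    # two-to-one except at x = 0: four distinct y values

# For noiseless Y = f(X): I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y).
mi_linear = entropy_of_values(linear)        # log 7, about 1.946 nats
mi_parabolic = entropy_of_values(parabolic)  # log 7 - (6/7) log 2, about 1.352 nats
```

Both relationships are perfectly noiseless, yet the parabolic one scores lower because the sign of x cannot be recovered from y.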
Mutual information does not give a meaningful measure of the strength of a dependence that can be compared across relationship types. The aim of all of the above mutual information estimation methods is to accurately estimate the mutual information of the “true” distribution governing a set of sample points. However, given that different distributions have inherently different mutual information scores, mutual information, no matter how well estimated for a given set of points, is not sufficient for comparing different distributions. This limitation has prevented the realization of the full potential of mutual information to detect a wide range of relationships.
Finally, not only is the mutual information of different classes of distributions different, but mutual information estimation is sensitive to sample size, making it very difficult even to compare the mutual information scores of the same types of distributions with different sample sizes. This, compounded by the fact that mutual information is unbounded and grows to infinity with increased sample size, makes interpreting mutual information scores extremely difficult in most settings. Mutual information can of course be normalized, but the differences between differential entropy and discrete entropy limit the scope of normalized mutual information variants that employ entropy to discrete random variables.
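One way to see the sample-size sensitivity described above is to score independent data, whose true mutual information is zero, at two sample sizes with the same grid. This is a minimal sketch assuming NumPy and a fixed 10-by-10 equispatial grid; it illustrates the behavior of a simple grid-based estimator only, and the names and sample sizes are hypothetical.

```python
import numpy as np

def grid_mi(x, y, bins):
    """Mutual information (nats) estimated from a bins-by-bins equispatial grid on [0, 1]^2."""
    counts, _, _ = np.histogram2d(x, y, bins=bins, range=[[0, 1], [0, 1]])
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)

def independent_mi(n, bins=10):
    """Score n independent uniform points; any positive result is estimation error."""
    x = rng.uniform(size=n)
    y = rng.uniform(size=n)
    return grid_mi(x, y, bins)

mi_small = independent_mi(100)      # few points per cell
mi_large = independent_mi(10_000)   # many points per cell
```

The same relationship type, scored with the same grid, receives a markedly higher score from the small sample than from the large one, so scores obtained at different sample sizes are not directly comparable.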
In this application, the partiality of a statistic will refer to a tendency of that statistic to give higher scores to certain relationship types (i.e. lack of equitability). A practical example of how the partiality of mutual information affects its ability to agnostically score different types of relationships can be observed using a dataset containing data from the World Health Organization about all the countries in the world (which is explored further below as a detailed example of the application of the disclosed subject matter). FIG. 7(i) shows a quartet of relationships from the WHO dataset. Of these four relationships, the left two appear relatively noiseless while the right two appear more noisy. However, mutual information as estimated by Kraskov et al. assigns the top two relationships similar scores and assigns the bottom two relationships similar scores as well.