Analyzing a very large volume of data is often critical to resolving a particular problem. “Too many reports to study” is an often-heard comment from business analysts who do not have the time to read the many reports available regarding a particular issue. The current state of computer technology, the number of software products, and their ease of use have generally made it relatively easy to generate a great number of reports.
One prior approach involves correlation of a predefined report set. If a user reads a particular subset of reports together, then these reports may be correlated into the same neighborhood. When another user then picks out one of these reports for reading, a recommendation engine recommends the other reports that had previously been correlated with it. Unfortunately, this technique requires starting with a predefined report set and a qualitative correlation rule, and it needs a certain period of time to build up the neighborhoods. Also, this technique cannot provide objective evidence of why a particular report is important and cannot prioritize the various reports by importance. In general, there is no standard, quantitative analysis technique that can be used to analyze dynamically changing reports and produce a repeatable result.
In the field of mathematics, probability, statistics, and information theory are known. For example, conditional probabilities are useful in the analysis of observations. Suppose that the sample space of n independent observations, Ω, is partitioned into the disjoint sets S1 and S2, such that S1∩S2=∅ and S1∪S2=Ω. If the sample point x∈S1, hypothesis H1 is accepted and H2 is rejected. If the sample point x∈S2, hypothesis H2 is accepted and H1 is rejected. The probabilities can be defined as follows.
α=Prob(x∈S1|H2)=p2(S1): the probability of incorrectly accepting hypothesis H1.
1−α=Prob(x∈S2|H2)=p2(S2): the probability of correctly accepting hypothesis H2.
β=Prob(x∈S2|H1)=p1(S2): the probability of incorrectly accepting hypothesis H2.
1−β=Prob(x∈S1|H1)=p1(S1): the probability of correctly accepting hypothesis H1.
Then, if we let S1 be the positive set and S2 be the negative set, α is the false positive probability and β is the false negative probability.
The Kullback-Leibler divergence uses the following parameters.
n: the number of bins.
f1i: the probability of seeing a given sample in bin i of distribution 1. Note that f1i≧0 and Σ(i=1..n) f1i=1.
f2i: the probability of seeing a given sample in the corresponding bin i of distribution 2. Note that f2i≧0 and Σ(i=1..n) f2i=1.
Self-Entropy for fk: −Σ(i=1..n) fki ln(fki), where k∈{1, 2}.
Cross-Entropy in favor of f1 against f2: −Σ(i=1..n) f1i ln(f2i).
Relative-Entropy (Kullback-Leibler divergence) in favor of f1 against f2: Er(f1, f2)=Σ(i=1..n) f1i ln(f1i/f2i).
Cross-Entropy in favor of f1 against f2 is, theoretically, always greater than or equal to the Self-Entropy of f1. Their difference is defined as Relative-Entropy (Cross-Entropy minus Self-Entropy). Relative-Entropy is the information for discrimination. Statistically, it is the expectation of the logarithmic difference between the probabilities f1 and f2. (The expectation is taken with respect to the probabilities f1. This is what “in favor of f1 against f2” means.) The Kullback-Leibler divergence is only defined if, for every bin i, either (f2i>0 and f1i≧0) or (f2i=0 and f1i=0). Note that, by the usual convention, the term f1i ln(f1i/f2i) is taken to be 0 when both f1i and f2i are 0.
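The relationship among the three entropy quantities can be checked numerically. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: the function names (self_entropy, cross_entropy, relative_entropy) and the two four-bin example distributions are hypothetical and chosen only for illustration. Bins with f1i=0 contribute 0 to every sum, and a bin with f1i>0 but f2i=0 is treated as the undefined case and raises an error.

```python
import math


def self_entropy(f):
    """Self-Entropy: -sum(f_i * ln(f_i)); bins with f_i == 0 contribute 0."""
    return -sum(p * math.log(p) for p in f if p > 0)


def cross_entropy(f1, f2):
    """Cross-Entropy in favor of f1 against f2: -sum(f1_i * ln(f2_i))."""
    total = 0.0
    for p, q in zip(f1, f2):
        if p == 0:
            continue  # a bin with f1_i == 0 contributes nothing
        if q == 0:
            raise ValueError("undefined: f2_i == 0 while f1_i > 0")
        total -= p * math.log(q)
    return total


def relative_entropy(f1, f2):
    """Relative-Entropy (Kullback-Leibler divergence): sum(f1_i * ln(f1_i / f2_i))."""
    total = 0.0
    for p, q in zip(f1, f2):
        if p == 0:
            continue  # convention: the term is 0 when f1_i == 0
        if q == 0:
            raise ValueError("undefined: f2_i == 0 while f1_i > 0")
        total += p * math.log(p / q)
    return total


# Hypothetical example distributions over n = 4 bins (each sums to 1).
f1 = [0.5, 0.25, 0.25, 0.0]
f2 = [0.25, 0.25, 0.5, 0.0]

er = relative_entropy(f1, f2)                  # equals 0.25 * ln(2)
gap = cross_entropy(f1, f2) - self_entropy(f1)  # same value, up to rounding
print(er, gap)
```

For these example values, Relative-Entropy equals 0.25 ln(2), which matches Cross-Entropy minus Self-Entropy, and the Relative-Entropy of a distribution against itself is 0.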
The L1 norm of a vector is useful in representing the total sample count of a distribution. Given a vector x=(x1, x2, . . . , xn)T of size n, the L1 norm of x is defined as |x|=Σ(i=1..n)|xi|. Let each element of a vector x represent the corresponding histogram count of a distribution f. The volume of the distribution (i.e., the total count of samples) is then given by the L1 norm, |x|. Let two vectors, x1 and x2, represent the histogram counts of two distributions, f1 and f2, respectively; then the volume ratio for f1 against f2 is defined as |x1|/|x2|.
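The volume and volume ratio definitions can be illustrated with a short Python sketch. The function names (l1_norm, volume_ratio, to_distribution) and the histogram counts are hypothetical and used only for illustration. The normalization step, which divides each count by the volume to obtain bin probabilities, is an assumption about how histogram counts relate to a distribution f and is not stated in the text above.

```python
def l1_norm(x):
    """L1 norm of a vector: the sum of the absolute values of its elements."""
    return sum(abs(v) for v in x)


def volume_ratio(x1, x2):
    """Volume ratio for f1 against f2: |x1| / |x2|."""
    denom = l1_norm(x2)
    if denom == 0:
        raise ValueError("volume of the second histogram is zero")
    return l1_norm(x1) / denom


def to_distribution(x):
    """Assumed normalization: divide each count by the volume to get bin probabilities."""
    total = l1_norm(x)
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    return [v / total for v in x]


# Hypothetical histogram counts for two distributions over the same three bins.
x1 = [30, 10, 20]
x2 = [10, 5, 5]

print(l1_norm(x1))           # volume of the first distribution: 60
print(l1_norm(x2))           # volume of the second distribution: 20
print(volume_ratio(x1, x2))  # 3.0
print(to_distribution(x2))   # [0.5, 0.25, 0.25]
```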
Even though many mathematical techniques are known, none have been applied successfully to the problem of too many reports and how to filter and prioritize them. It would be desirable to develop a technique to automatically ignore unimportant reports while at the same time prioritizing the truly important reports that should be read.