The advent of the modern electronic computer has exacerbated several issues in information analysis. Among these issues, though the list is not complete, are:
    a. The increased demand for the collection and storage of data.
    b. The increased complexity of retrieving data from storage.
    c. The increased complexity of understanding relationships within data.
These areas are becoming even more complex due to an increased rate at which data is being generated and collected in computer databases, or embodied in data streams. This invention deals with facilitating computations for finding and understanding the information content of such data. In particular, this invention is concerned with understanding the patterns that occur within data and their frequency.
Areas in which such data analysis issues arise include:
    1. Search Engines,
    2. Data Mining,
    3. Market Research,
    4. Census/Demographic Analysis,
    5. Fraud/Security Analysis,
    6. Analysis of financial markets,
    7. Bioinformatics data analysis.
Although this list is not complete, it suffices to introduce the type of subject areas in which the frequency of patterns in computerized data plays an important role.
A basic concept in dealing with data is that of a record, comprising a collection of attributes or attribute entries chosen from a set of possibilities. For example, a person may have a medical record containing information about that person's medical history. The collection of attributes from which a record is formed can be large. Furthermore, the number of such records that may be under consideration may also be large. To summarize, there are many records whose entries are from among many possibilities. This is a key reason why computers have exacerbated issues of dealing with data. It is a cycle: the better computers deal with data, the more data they are required to deal with. This cycle has continued in an ever-increasing manner, and has been brought to an unprecedented level recently. There is little chance that this trend will slow down.
Consider a record consisting of information about demographics, a person's medical history, etc. By enumerating the attributes (e.g., by encoding them as integers), one may consider a record to be a collection of numbers representing the attribute entries. For example, RECORD_1 may be represented as {1, 30, 43, 89, 20345} and RECORD_2 as {1, 3, 43, 89, 2235}. It is observable that the “patterns” {{1}, {43}, {89}, {1,43}, {1,89}, {43,89}, {1,43,89}} occurred twice and all other patterns occurred once. Notice that each of these patterns is also a set of attributes. Some patterns involve one attribute, while others involve pairs, triples, etc., of attributes. As the number of records and attributes increases, computing the number of occurrences of patterns consisting of one, two, three or more attributes requires methods beyond casual inspection. The occurrence of single-attribute patterns is easily calculated, and the occurrence of pairs of attributes is also fairly easily calculated. Triple-attribute patterns and higher-order patterns (involving four, five, etc. attributes) are more complex and require more powerful approaches in order to be handled with available computing resources.
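The pattern counts stated for the two example records can be checked with a brute-force sketch. It enumerates every non-empty subset of each record, so its cost grows exponentially with record size; it is meant only to illustrate the counting problem, not the method of the invention. The function name `count_patterns` is an assumption introduced here for illustration.

```python
from collections import Counter
from itertools import combinations


def count_patterns(records):
    """Count how many records contain each non-empty attribute subset."""
    counts = Counter()
    for record in records:
        attrs = sorted(set(record))
        # Enumerate subsets of every size: singles, pairs, triples, ...
        for size in range(1, len(attrs) + 1):
            for pattern in combinations(attrs, size):
                counts[pattern] += 1
    return counts


records = [
    {1, 30, 43, 89, 20345},  # first example record from the text
    {1, 3, 43, 89, 2235},    # second example record from the text
]
counts = count_patterns(records)

# Patterns that occur in both records, ordered by size and then by value.
shared = sorted((p for p, n in counts.items() if n == 2),
                key=lambda p: (len(p), p))
print(shared)
# [(1,), (43,), (89,), (1, 43), (1, 89), (43, 89), (1, 43, 89)]
```

Each five-attribute record already yields 31 patterns, which is why exhaustive enumeration becomes impractical as records grow.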
Furthermore, a record may be more complicated than as expressed above. The attributes may in fact have additional auxiliary attributes. For instance, a date or time-stamp may be included and associated with one or more attributes. We may be interested in patterns that meet some constraint about the auxiliary attributes. For example, one may only be interested in patterns with attributes that have occurred within 30 days of each other. Otherwise, the pattern may have no use to the analysis. In such cases, the full data record collection may be segmented or filtered before analysis to include only records meeting certain criteria. What constitutes an attribute or an auxiliary attribute depends in part on the data and the specifics of what one cares to learn about that data. There are numerous ways in which data is handled and represented, and as such the distinction between an attribute and an auxiliary attribute is dependent on the specific analysis being sought for the record collection.
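A constraint on auxiliary attributes, such as the 30-day window mentioned above, can be sketched as a filter applied while forming patterns. The record layout (attribute code paired with a date), the sample dates, and the function name `pairs_within_window` are all hypothetical and chosen only for illustration.

```python
from datetime import date
from itertools import combinations


def pairs_within_window(entries, max_days=30):
    """Return attribute pairs whose time-stamps differ by at most max_days."""
    pairs = []
    for (attr_a, day_a), (attr_b, day_b) in combinations(sorted(entries), 2):
        if abs((day_a - day_b).days) <= max_days:
            pairs.append((attr_a, attr_b))
    return pairs


# Hypothetical record: (attribute code, date on which it was recorded).
entries = [
    (43, date(2020, 1, 5)),
    (89, date(2020, 1, 20)),
    (1, date(2020, 3, 15)),
]
print(pairs_within_window(entries))  # [(43, 89)]
```

Only the pair (43, 89) survives, because attribute 1 was recorded more than 30 days after the other two.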
The present invention involves both data mining and statistical analysis of data. Although not a precise definition of either discipline, the following statements show that there may be a useful distinction.
    Statistical Analysis: Ideally, one designs and conducts experiments and then tests the validity of hypotheses from data collected. One gains an understanding of the properties of the data from the underlying distributions. The validity of a hypothesis is established from analyzing the distributions. Typically, the hypothesis defines and limits what patterns are of interest in the data and what computations are done on the data.
    Data Mining: In many cases, the data does not represent the outcome of a structured experiment. In such cases, methods that allow for the discovery of patterns anywhere in the data are needed. Methods for determining significant patterns in data are usually referred to as Data Mining. Furthermore, the data from these unstructured experiments tends to be enormous. Data Mining methods typically make or require assumptions in order to control computational complexity.
U.S. Pat. No. 6,389,416 for Depth First Method for Generating Itemsets describes a data mining system and method for generating associations among items in a large database. In particular, it refers to a set {I} comprising all items in a database of transactions. The patent further refers to an association rule between a set of items {X} that is a subset of {I} and another set {Y} that is also a subset of {I}. The association is expressed as {X}→{Y}, which indicates that the presence of items X in the transaction also indicates a strong possibility of the presence of the set of items Y. The patent notes that the measures used to indicate the strength of such an association rule are support and confidence, where the support of the rule {X}→{Y} is the fraction of transactions containing both X and Y. The confidence of the rule {X}→{Y} is the fraction of the transactions containing X that also contain Y. The patent further states:
    In the association rule problem, it is desired to find all rules above a minimum level of support and confidence. The primary concept behind most association rule algorithms is a two phase procedure: In the first phase, all frequent itemsets (or large itemsets) are found. An itemset is “frequent” or large if it satisfies a user-defined minimum support requirement. The second phase uses these frequent itemsets in order to generate all the rules which satisfy the user specified minimum confidence.
Having posited these propositions, the patent goes on to describe a depth first search technique in order to construct a lexicographic tree of itemsets. It claims that this method substantially reduces the CPU time for counting large itemsets. It further describes counting methods for higher level nodes of the lexicographic tree structure and a two phase bucketing procedure for counting support at lower level nodes. The patent asserts that these optimize processing time and database storage utilization.
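The support and confidence measures defined above can be illustrated with a minimal sketch. The transactions and item names below are hypothetical, and the helper names `support` and `confidence` are assumptions introduced here; the sketch follows only the definitions quoted from the patent and does not reproduce its depth first method.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing x that also contain y."""
    x, y = set(x), set(y)
    containing_x = [t for t in transactions if x <= t]
    if not containing_x:
        return 0.0  # rule is undefined when x never occurs; report 0 here
    return sum(y <= t for t in containing_x) / len(containing_x)


# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule {bread} -> {milk}: support counts transactions holding both sets.
rule_support = support(transactions, {"bread", "milk"})    # 2 of 4 = 0.5
rule_conf = confidence(transactions, {"bread"}, {"milk"})  # 2 of 3
print(rule_support, rule_conf)
```

A rule is retained under the two-phase procedure only if both values meet the user-defined minimums, which is the thresholding examined in the example that follows.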
While the teachings of U.S. Pat. No. 6,389,416 may help with efficient use of computing resources when large and complex databases are being analyzed, they still have (at least) the disadvantage that user defined minimum levels of support and confidence are used to determine what itemsets are to be developed and their occurrence counted. This necessarily results in some data being declared a priori not of interest and therefore discarded or not developed. The following example shows the problems with such an approach.
The example uses two criteria commonly applied in analyzing data and discusses how they interact with one another: (1) McNemar's test, and (2) thresholding based on item-set frequency.
We consider McNemar's test because it is common, simple and relevant to our discussion; it is not meant to represent the most general situation. McNemar's test may be used to determine the significance of change when each subject serves as its own control in a “before” and “after” test with nominal or categorical measurements. In plain English, we survey the same people before and after an event, and we record and count the positive and negative opinions.
              After
Before      +      −
+           A      B
−           C      D
This table is a 2×2 contingency table and the McNemar test uses the values in the table to compute the statistic
\[
\chi^2 = \frac{\left( |B - C| - 1 \right)^2}{B + C}.
\]
This value is then used to determine a “P-Value” and ultimately to determine the degree to which the change is considered statistically significant.
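The computation of the statistic and its P-Value can be sketched as follows. The sketch assumes the continuity-corrected form of the statistic and a chi-square distribution with one degree of freedom, for which the upper-tail probability equals erfc(sqrt(x/2)). The function name `mcnemar` is an assumption, and the inputs B = 44 and C = 24 are the off-diagonal counts of Table 2-b1 later in this section.

```python
import math


def mcnemar(b, c):
    """Continuity-corrected McNemar statistic and its p-value (1 df).

    b and c are the off-diagonal counts of the 2x2 table, i.e. the
    subjects whose response changed between the two surveys.
    """
    if b + c == 0:
        raise ValueError("B + C must be positive")
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = mcnemar(44, 24)
print(round(chi2, 3), round(p, 4))  # 5.309 0.0212
```

Only the subjects who changed their response enter the statistic; the diagonal counts A and D do not affect it.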
As noted above, in the data mining of associations, one typically confines the associations analyzed to those having minimum “support” and “confidence” levels. These thresholds are used as criteria for determining whether an association is “interesting”. Such criteria throw away any information that does not occur often enough within the data for which the levels are calculated.
The issue that arises between overly simplified thresholding and criteria like the McNemar test is the following. Suppose we wish to discover interesting associations between attributes of surveyed people and whether their changed responses to a test survey are to be considered statistically significant. If we threshold in a simple data mining manner, then we may not fill out the contingency tables that reflect the most statistically significant changes, and these cases are not considered in our analysis. Throwing away information can lead to missing the most significant changes. In the following, we detail an example to illustrate how this can happen.
We will consider a very simple survey consisting of one question asked before and after a Presidential debate: Which candidate do you prefer, A or B?
Suppose this survey is conducted over a panel of 1,000 people. A standard McNemar test is used to determine whether opinion shifts between surveys are statistically significant. In addition to the survey data, we also have attributional data associated with the people surveyed. For example, we might know the gender, age, parent status, educational level, profession, income, location, etc. Table 1 shows that 300 people preferred candidate A both before and after, 188 people preferred candidate B before but candidate A after, etc.
The values in Table 1 have been chosen so that there is no difference, and hence no statistically significant difference, between before and after the debate as measured across the entire sample. We want to discover sets of attributes that individuals share for which there exist statistically significant differences between before and after the debate. More precisely, we are interested in discovering which attributes result in contingency tables for which the McNemar test shows statistical significance, and we use the McNemar test as a measure of the degree of asymmetry in the tables. This is both a Data Mining problem and a Statistics problem. We wish to use methodology from both disciplines. We do not want to guess a priori which attributes to use for the conditioning; we wish to discover them by use of an efficient method. After an association is discovered, one can decide whether to conduct a specific study to analyze the association further.
TABLE 1
Candidates Approval Ratings
Entire Survey with no conditioning

               After
Before       A      B    Total
A          300    188      488
B          188    324      512
Total      488    512     1000

χ² = 0.003, 1 df, P-Value = 0.9589
Similarly, the values in Table 2-a and Table 2-b show that segmenting on gender does not produce sub-surveys with significant statistical differences from before and after the debate. In fact, there are still no actual differences from before and after the debate. (The values were chosen using 49 and 51 percent of the values of the original table.)
TABLE 2-a
Candidates Approval Ratings
Men Only

               After
Before       A      B    Total
A          147     92      239
B           92    159      251
Total      239    251      490

χ² = 0.005, 1 df, P-Value = 0.9412

TABLE 2-b
Candidates Approval Ratings
Women Only

               After
Before       A      B    Total
A          153     96      249
B           96    165      261
Total      249    261      510

χ² = 0.005, 1 df, P-Value = 0.9425

If we further segment the Women Only table into Women-Single-Parent and Women-Not-Single-Parent, the resulting tables might appear as follows.

TABLE 2-b1
Candidates Approval Ratings
(Women, Single Parent) Only

               After
Before       A      B    Total
A           54     44       98
B           24     58       82
Total       78    102      180

χ² = 5.309, 1 df, P-Value = 0.0212

TABLE 2-b2
Candidates Approval Ratings
(Women, Not Single Parent) Only

               After
Before       A      B    Total
A           99     52      151
B           72    107      179
Total      171    159      330

χ² = 2.911, 1 df, P-Value = 0.0880
Using McNemar's test, the Women-Single-Parent table (2-b1) shows statistically significant differences between before and after the debate. In contrast, the Women-Not-Single-Parent table (2-b2) indicates a difference that is not quite statistically significant.
In an analogous manner, a segmentation of the “Men Only” table based on age might appear as follows:
TABLE 2-a
Candidates Approval Ratings
Men Only

               After
Before       A      B    Total
A          147     92      239
B           92    159      251
Total      239    251      490

χ² = 0.005, 1 df, P-Value = 0.9412

TABLE 2-a1
Candidates Approval Ratings
(Men 18–25 years old) Only

               After
Before       A      B    Total
A           25      8       33
B           23     28       51
Total       48     36       84

χ² = 6.323, 1 df, P-Value = 0.0119

TABLE 2-a2
Candidates Approval Ratings
(Men 26–35 years old) Only

               After
Before       A      B    Total
A           96     70      166
B           50    103      153
Total      146    173      319

χ² = 3.008, 1 df, P-Value = 0.0828

TABLE 2-a3
Candidates Approval Ratings
(Men ≥ 36 years old) Only

               After
Before       A      B    Total
A           26     14       40
B           19     28       47
Total       45     42       87

χ² = 0.485, 1 df, P-Value = 0.4862

In this situation, Table 2-a1 is statistically significant, Table 2-a2 is not quite statistically significant, and Table 2-a3 is not statistically significant.
We summarize the breakdown according to attributes as follows.
    1. Entire Panel: Change not significant
        a. Women Only: Change not significant
            i. Women Single Parent: Change significant
            ii. Women Not Single Parent: Change not quite significant
        b. Men Only: Change not significant
            i. Men 18–25 years old: Change significant
            ii. Men 26–35 years old: Change not quite significant
            iii. Men ≥ 36 years old: Change not significant
The data mining problem of finding statistically significant sub-surveys can be posed as finding attribute sets whose associated p-values (the p-value of the contingency table restricted to the individuals corresponding to the attribute set) fall below a small threshold value. The p-value is the probability, under the hypothesis that there is no change, of observing a test statistic at least as large as the one computed. The smaller the p-value, the stronger the evidence that the change is statistically significant.
Statistically Significant Demographic-based sub-surveys, p-value ≤ 0.1000

       Demographic Set             Table Support    χ²       P-Value    Qual. Measure
1      Men, 18–25 years old        0.084            6.323    0.0119     significant
2      Women, Single Parent        0.180            5.309    0.0212     significant
3      Men, 26–35 years old        0.319            3.008    0.0828     not quite significant
4      Women, Not Single Parent    0.330            2.911    0.0880     not quite significant
...    ...                         ...              ...      ...        ...
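The ranking in the summary table above can be reproduced from the cell counts of Tables 2-a1 through 2-b2. The following is a sketch under the same assumptions as the McNemar formula given earlier (continuity correction, one degree of freedom); the names `mcnemar`, `tables`, and `PANEL_SIZE` are introduced here for illustration, and table support is taken to be the table total divided by the panel size of 1,000.

```python
import math

PANEL_SIZE = 1000


def mcnemar(b, c):
    """Continuity-corrected McNemar statistic and p-value (1 df)."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Cell counts as (A->A, A->B, B->A, B->B), read as before->after.
tables = {
    "Men, 18-25 years old": (25, 8, 23, 28),
    "Men, 26-35 years old": (96, 70, 50, 103),
    "Men, >= 36 years old": (26, 14, 19, 28),
    "Women, Single Parent": (54, 44, 24, 58),
    "Women, Not Single Parent": (99, 52, 72, 107),
}

rows = []
for name, (aa, ab, ba, bb) in tables.items():
    chi2, p = mcnemar(ab, ba)
    table_support = (aa + ab + ba + bb) / PANEL_SIZE
    if p <= 0.1:  # same cut-off as the summary table
        rows.append((name, table_support, round(chi2, 3), round(p, 4)))

rows.sort(key=lambda row: row[3])  # most significant first
for row in rows:
    print(row)
```

Sorting by p-value rather than by support is what brings the smallest group, men aged 18 to 25, to the top of the list.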
The values chosen were constructed to deliberately show that as the support for the contingency tables increases, the statistical significance of change from before and after the debate may actually decrease. This is not a surprise; we exploited the fact that different factions can “average out” over large enough groups and mask the relevant events.
Traditionally, data mining places conditions on whether an “item-set” or data pattern is relevant or of interest. For example, the event (male, 18–25 years old, A before, B after) occurs only 8 times in a thousand. If the item-set support threshold is set to 3% (0.03), Tables 2-a1 and 2-a3 would never be considered. This is fine in the case of Table 2-a3, because the McNemar test would reject the case, but we still lose the case corresponding to Table 2-a1. Given this segmentation, it is only upon lowering the item-set threshold to 0.008 that case 2-a1 would emerge as the most statistically significant change of opinion. Note that the table support for the demographic set is 8.4% of the sample, which is not trivial.
We conclude that even in cases where the entire survey produces no statistical or actual differences in the change of opinion, the underlying distribution can hide important aspects that cancel each other out and are statistically significant as measured by the McNemar test. Furthermore, simple thresholding used in conventional data mining will not discover these events.
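The effect of the item-set threshold can be sketched by checking, for each segmented table, which of its four cells meet a minimum count. This is one possible reading of the thresholding described above, not a definitive one: a table is labeled "missing" when none of its cells meets the threshold, "partial" when only some do, and "complete" when all four do. Thresholds are expressed as counts out of the 1,000-person panel (30 for 3%, 8 for 0.8%), and the names `tables` and `classify` are introduced here for illustration.

```python
# Cell counts as (A->A, A->B, B->A, B->B), read as before->after.
tables = {
    "2-a1": (25, 8, 23, 28),
    "2-a2": (96, 70, 50, 103),
    "2-a3": (26, 14, 19, 28),
    "2-b1": (54, 44, 24, 58),
    "2-b2": (99, 52, 72, 107),
}


def classify(min_count):
    """Label each table by how many of its cells reach min_count."""
    status = {}
    for name, cells in tables.items():
        kept = sum(count >= min_count for count in cells)
        if kept == len(cells):
            status[name] = "complete"
        elif kept > 0:
            status[name] = "partial"
        else:
            status[name] = "missing"
    return status


print(classify(30))  # 3% of 1,000 people
print(classify(8))   # 0.8% of 1,000 people
```

Under this reading, Tables 2-a1 and 2-a3 have no cell reaching 30 and so never appear at the 3% threshold, while every cell of Table 2-a1 is recovered once the threshold drops to 8.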
What is needed in the art is a method and system for more efficiently and completely analyzing computerized data records that contain large numbers of attributes to develop information on complex patterns that may exist within the records.