The present invention relates generally to database technology and more particularly to systems of data mining.
A number of systems for classifying data and performing data mining have been proposed. A number of these techniques involve decision trees, linear classification trees, and association rules. However, each of these techniques has significant disadvantages.
Conventional techniques do not cope well with biased sample data of large volume and high dimensionality, and they tend to ignore weak signals. Thus, a need clearly exists for an improved system of data mining and classifying data.
The aspects of the invention, Classification by Aggregating Emerging Patterns (CAEP), are directed towards a system of extracting rules in the form of emerging patterns (EP) and constructing a classifier from correctly labelled data (e.g. DNA sequences) to decide which category a sample belongs to and/or making a prediction on the sample. The system discovers features or signals that differentiate one category of data from another and builds a system to classify such data.
The system is able to handle biased data that has a large volume and high dimensions, and it does not ignore weak signals. The EPs are associated with supports and the ratios of change in supports. The system is robust in the presence of biased sample data and is scalable in terms of large numbers of samples and in terms of dimensions in practical situations.
An EP is a signal/itemset whose supports increase significantly from one class of data to the next. In other words, it is a differentiating factor between the two classes.
The aggregation, in a decision step, of the differentiating strengths (in terms of supports and ratios of change) of all EPs in some set of discovered EPs that occur in a new case is novel. The cardinality of that set is not bounded before classifier construction.
The normalization by dividing by a base score chosen at some percentile (such as 50%) across the training instances of all classes is novel.
One way to find emerging patterns is based on a border-based representation of very large collections of itemsets, and on processes which derive EPs by performing operations (such as differentials) on some borders. These borders can first be discovered efficiently using the Max-Miner technique, which is scalable in terms of large numbers of tuples and high dimensions in practical situations.
The EPs can be used in the protein translation start-site identification problem. This is an example application of CAEP to datamining in Molecular Biology.
The CAEP classifier (i) extracts emerging patterns (EPs), (ii) uses each of these EPs as a multiple-attribute test, (iii) aggregates the power of individual EPs to get raw scores, and (iv) normalizes the raw scores by dividing them by base scores chosen at a certain percentile of the scores of the training instances. CAEP has nearly equal prediction accuracy on all classes. CAEP is based on a novel border-based representation of very large collections of itemsets. It derives EPs by operating on some borders (which can also be efficiently discovered). It is scalable in terms of large numbers of tuples and high dimensions in practical situations.
In accordance with a first aspect of the invention, there is disclosed a method of classifying data by aggregating emerging patterns in the data using datasets for a plurality of classes using a computer processor. In the method, for each of the classes, an emerging pattern set is mined dependent upon instances of the class and opponent instances dependent upon predetermined growth rate and support thresholds. Aggregate scores of the instances are calculated for all of the classes. Base scores are then determined for each of the classes. For each test instance, the following sub-steps are performed: aggregate and normalized scores of the test instance are calculated for each class; and the class for which the test instance has the largest normalized score is assigned to the test instance.
The method assumes the preparatory step of partitioning an original dataset into a predetermined number of datasets to form the datasets. The predetermined number of datasets is dependent upon the number of classes.
Preferably, the method further includes the step of reducing the number of emerging patterns dependent upon growth rates and supports of the emerging patterns.
Preferably, the mining step includes the following steps: borders of large itemsets are determined using a large-border discovery technique; and supports and growth rates of emerging patterns are determined for the class. Optionally, the large-border discovery technique is the Max-Miner technique.
Optionally, the mining step includes the following steps: two borders of large itemsets (large borders for short) are determined of instances of the class and of the opponent class; and all emerging pattern borders are found using multiple border pairs.
In accordance with a second aspect of the invention, there is disclosed an apparatus having a computer processor for classifying data by aggregating emerging patterns in the data using datasets for a plurality of classes. The apparatus includes:
a device for, for each of the classes, mining an emerging pattern set dependent upon instances of the class and opponent instances dependent upon predetermined growth rate and support thresholds;
a device for calculating aggregate scores of the instances for all of the classes;
a device for determining base scores for each of the classes; and
a device for, for each test instance, performing specified operations, the performing device including:
a device for calculating aggregate and normalized scores of the test instance for each class; and
a device for assigning a specified class to the test instance for which the test instance has a largest normalized score.
In accordance with a third aspect of the invention, there is disclosed a computer program product having a computer readable medium having a computer program recorded therein for classifying data by aggregating emerging patterns in the data using datasets for a plurality of classes. The computer program product includes:
a computer program source code module for, for each of the classes, mining an emerging pattern set dependent upon instances of the class and opponent instances dependent upon predetermined growth rate and support thresholds;
a computer program source code module for calculating aggregate scores of the instances for all of the classes;
a computer program source code module for determining base scores for each of the classes; and
a computer program source code module for, for each test instance, performing specified operations, the computer program source code performing module including:
a computer program source code module for calculating aggregate and normalized scores of the test instance for each class; and
a computer program source code module for assigning a specified class to the test instance for which the test instance has a largest normalized score.
In accordance with a fourth aspect of the invention, there is disclosed a system for extracting emerging patterns from data using a processor. The system includes:
a device for mining emerging patterns for all of a number of categories of the data;
a device for computing aggregate differentiating scores for all samples of the data and the categories; and
a device for computing base scores for the categories.
Preferably, the system further includes a device for extracting the emerging patterns from the mined emerging patterns dependent upon the aggregated differentiating scores and the base scores.
Preferably, the system further includes a device for reducing the number of related emerging patterns. Two emerging patterns are related if one is a sub-pattern or subset of the other. Optionally, the system further includes a device for indicating whether the set of derived emerging patterns is to be reduced, operations of the reducing device being dependent upon the indicating device.
Optionally, the system further includes a device for reproducing in a displayable manner extracted emerging patterns. The device may print or display the extracted emerging patterns.
Optionally, the system further includes:
a device for obtaining samples from different input categories; and
a device for adjustably discretizing the obtained samples.
Preferably, the system further includes a device for storing and managing the obtained samples and/or the discretized samples.
Preferably, the emerging patterns are derived dependent upon one or more predetermined conditions including:
a support level threshold of a pattern in a category;
a growth rate threshold between categories;
a monotonically increasing weighting function for a growth rate;
a score specifying an aggregate differentiating score of a discretized sample and a set of emerging patterns of a category, the score being dependent upon supports and weighted growth rates of emerging patterns in a category; and
a base score on the aggregate differentiating score for each category.
For a threshold on support level, the support level of a pattern (or an itemset) I in a category C is defined as suppC(I) = the percentage of samples in C that exhibit I. If this threshold is not given, a default is preferably used or derived from the input data. For example, the support level threshold can be 1%.
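The support definition above can be illustrated with a minimal sketch. It assumes each discretized sample is represented as a set of items and returns the support as a fraction rather than a percentage; the function name `support` and the toy data are hypothetical.

```python
def support(pattern, samples):
    """Fraction of samples that contain every item of the pattern."""
    if not samples:
        return 0.0
    pattern = frozenset(pattern)
    hits = sum(1 for sample in samples if pattern <= set(sample))
    return hits / len(samples)


# Hypothetical toy category C with four discretized samples.
category_c = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"d"}]

# Two of the four samples contain both "a" and "b".
print(support({"a", "b"}, category_c))  # 0.5
```

With the example threshold of 1%, a pattern would be retained when `support(...)` is at least 0.01.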
For a threshold on growth rate, given two categories Ca and Cb of samples, the growth rate of a pattern (or itemset) I from Cb to Ca is defined as:
\[
\mathrm{growthrate}_{C_b \to C_a}(I) =
\begin{cases}
0, & \text{if } \mathrm{supp}_{C_b}(I) = 0 \text{ and } \mathrm{supp}_{C_a}(I) = 0 \\
\infty, & \text{if } \mathrm{supp}_{C_b}(I) = 0 \text{ but } \mathrm{supp}_{C_a}(I) \neq 0 \\
\mathrm{supp}_{C_a}(I) / \mathrm{supp}_{C_b}(I), & \text{otherwise}
\end{cases}
\]
If this threshold is not given, a default can be chosen or derived from the input training data. For example, the threshold can be 5. Given a category C, its opponent category is C′, which is defined to contain all instances not in category C. A pattern or itemset having a growth rate that exceeds this threshold is considered an emerging pattern from C′ to C, or simply an emerging pattern of category C.
With respect to a function weight(g) for weighting growth rate g (for growth rate greater than 1), this function should be a monotonically increasing function that takes nonnegative values. If this function is not given, a default will be chosen. For example, the function can be weight(g)=g/(g+1).
The function score(s, C) specifies the aggregate differentiating score of a sample s and a set E(C) of emerging patterns of a category C. This function involves (i) the supports of emerging patterns in E(C) and (ii) weighted growth rates of emerging patterns in E(C). If this function is not given, a default will be chosen. For example, the function can be score(s, C) = sum of suppC(I)*weight(growthrate(I)) over all patterns I that appear in s and in E(C).
The base score base_score(C) on the aggregate differentiating score for each category C can be given as a percentile of the range of aggregate differentiating scores of the training samples of that category. If this percentile is not given, a default can be chosen, for example, the 50th percentile.
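The three cases of the growth rate definition, and the threshold test that makes a pattern an emerging pattern, can be sketched as follows. The function names are hypothetical, supports are passed in as already computed fractions, and the default threshold of 5 is the example value given above.

```python
import math


def growth_rate(supp_b, supp_a):
    """Growth rate of a pattern from category Cb to category Ca."""
    if supp_b == 0:
        # Absent from both categories: 0. Absent only from Cb: infinity.
        return 0.0 if supp_a == 0 else math.inf
    return supp_a / supp_b


def is_emerging_pattern(supp_opponent, supp_c, threshold=5.0):
    """True if the growth rate from the opponent category to C exceeds the threshold."""
    return growth_rate(supp_opponent, supp_c) > threshold


print(growth_rate(0.25, 0.5))         # 2.0
print(is_emerging_pattern(0.0, 0.2))  # True (infinite growth rate)
```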
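A minimal sketch of the example defaults for weight(g) and score(s, C) follows. It assumes that each emerging pattern of category C is supplied as a (pattern, support in C, growth rate) triple, and that an infinite growth rate is given weight 1, the limit of g/(g+1); both are illustrative assumptions rather than requirements of the text.

```python
import math


def weight(g):
    """Example default weighting g / (g + 1); infinity maps to 1 (assumption)."""
    return 1.0 if math.isinf(g) else g / (g + 1.0)


def score(sample, emerging_patterns):
    """Sum supp_C(I) * weight(growthrate(I)) over the patterns I contained in the sample."""
    sample = set(sample)
    return sum(
        supp * weight(g)
        for pattern, supp, g in emerging_patterns
        if set(pattern) <= sample
    )


# Hypothetical emerging patterns of a category C: (pattern, supp_C, growth rate).
eps_c = [
    ({"a", "b"}, 0.4, 4.0),   # contributes 0.4 * 0.8
    ({"c"}, 0.2, math.inf),   # contributes 0.2 * 1.0
    ({"d"}, 0.3, 9.0),        # not contained in the sample below
]
print(score({"a", "b", "c"}, eps_c))  # approximately 0.52
```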
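The base score and the normalization it supports can be sketched as below. The text does not fix how the percentile is computed, so this sketch assumes the nearest-rank method over the training scores of one category; the function names and the handling of a zero base score are likewise assumptions.

```python
import math


def base_score(training_scores, percentile=50):
    """Nearest-rank percentile of the training scores of one category (assumed method)."""
    ordered = sorted(training_scores)
    if not ordered:
        raise ValueError("no training scores for this category")
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def normalized_score(raw_score, base):
    """Raw aggregate score divided by the base score of the category."""
    return raw_score / base if base > 0 else 0.0


scores_c = [1.0, 4.0, 2.0, 3.0]       # hypothetical training scores of category C
base = base_score(scores_c)           # 50th percentile
print(base)                           # 2.0
print(normalized_score(3.0, base))    # 1.5
```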
An indication can be given of whether the set of derived emerging patterns should be reduced. If this indication is not given, a default decision can be made. For example, some related emerging patterns may be eliminated unless the reduction leads to poor coverage (i.e. leads to a larger number of zero scores) on training samples.
Preferably, the system further includes a device for storing and managing derived emerging patterns and the one or more conditions for deriving the emerging patterns.
Optionally, the system further includes a device for selecting patterns that cover more training samples and have stronger differentiating power, the pattern selecting device including:
a device for sorting emerging patterns between two categories into a list in decreasing order of growth rate and support;
a device for initializing a set of essential emerging patterns, essE, to contain a first emerging pattern in the list;
a device for, for each next pattern I in the list, ordering the set of essential emerging patterns, the ordering device including:
a device for, for each J in the set of emerging patterns essE such that I is a sub-pattern or subset of J, replacing J by I if either of the following conditions is true:
growthrate C′→C(I) exceeds growthrate C′→C(J), or
suppC(I) greatly exceeds suppC(J) and growthrate C′→C(I) exceeds the threshold on growth rate;
a device for adding I to the set of emerging patterns essE if both of the above conditions are false and I is not a super-pattern or superset of any pattern in the set of emerging patterns essE.
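The selection of essential emerging patterns described above can be sketched as follows. The text does not quantify "greatly exceeds", so the sketch introduces a hypothetical parameter `supp_factor` (support at least that many times larger); the triple representation (pattern, support in C, growth rate) and the function name are also assumptions.

```python
def select_essential(eps, gr_threshold=5.0, supp_factor=2.0):
    """eps: list of (frozenset pattern, supp_C, growth rate). Returns the reduced list."""
    # Sort in decreasing order of growth rate, then support.
    ordered = sorted(eps, key=lambda e: (e[2], e[1]), reverse=True)
    ess = ordered[:1]
    for cand in ordered[1:]:
        i_set, supp_i, gr_i = cand
        replaced = False
        for idx, (j_set, supp_j, gr_j) in enumerate(ess):
            # Only patterns J of which I is a proper sub-pattern are candidates.
            if i_set < j_set and (
                gr_i > gr_j
                or (supp_i > supp_factor * supp_j and gr_i > gr_threshold)
            ):
                ess[idx] = cand
                replaced = True
        if replaced:
            ess = list(dict.fromkeys(ess))  # I may have replaced several J
        elif not any(j_set <= i_set for j_set, _, _ in ess):
            ess.append(cand)  # I is not a super-pattern of anything kept
    return ess


eps = [
    (frozenset("ab"), 0.05, 20.0),
    (frozenset("a"), 0.30, 8.0),    # replaces {a, b}: much larger support
    (frozenset("abc"), 0.04, 8.0),  # dropped: superset of {a}
    (frozenset("d"), 0.10, 6.0),    # kept: unrelated to {a}
]
print([sorted(p) for p, _, _ in select_essential(eps)])  # [['a'], ['d']]
```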
Preferably, the mining device includes a device for manipulating borders. An ordered pair (L, R) is a border if each of L and R is an anti-chain collection of sets, each element of L is a subset of some element of R, and each element of R is a superset of some element of L. The collection of sets represented by such a border (also called its set interval) consists of the sets Y such that Y is a superset of some element of L and a subset of some element of R.
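The border definition can be checked mechanically. The sketch below assumes L and R are given as lists of frozensets; the function names and the example border are hypothetical.

```python
def is_antichain(sets):
    """True if no set in the collection is a proper subset of another."""
    return not any(a < b for a in sets for b in sets)


def is_border(left, right):
    """Check the conditions for the ordered pair (left, right) to be a border."""
    return (
        is_antichain(left)
        and is_antichain(right)
        and all(any(l <= r for r in right) for l in left)
        and all(any(l <= r for l in left) for r in right)
    )


def in_interval(y, left, right):
    """True if y is a superset of some element of left and a subset of some element of right."""
    y = frozenset(y)
    return any(l <= y for l in left) and any(y <= r for r in right)


L = [frozenset({1}), frozenset({2, 3})]
R = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
print(is_border(L, R))            # True
print(in_interval({1, 2}, L, R))  # True
print(in_interval({2, 4}, L, R))  # False
```

The point of the representation is that the pair (L, R) stays small even when the collection of sets it represents is very large.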
Optionally, the mining device includes:
a device for determining two large borders of large itemsets in two categories having predetermined support thresholds;
a device for finding emerging pattern borders using MBD-LLBORDER processing;
a device for enumerating emerging patterns contained in found emerging pattern borders; and
a device for checking through actual supports and growth rates of samples in the two categories.
Given the categories C′ and C of samples, the Max-Miner technique can be used to discover the two large borders of the large itemsets in C′ and C having appropriate support thresholds. Then, the MBD-LLBORDER technique is used to find all the emerging pattern borders. Finally, the emerging patterns contained in these borders are enumerated, and the samples in C′ and C are scanned to check the actual supports and growth rates of those patterns. Assuming that LARGEBORDERd(C′) and LARGEBORDERt(C) have been found for some d and t satisfying t=p*d, where p is an appropriate threshold on growth rate, the MBD-LLBORDER technique is used to find all emerging patterns such that their supports in C exceed t=d*p but their supports in C′ are less than d. The MBD-LLBORDER technique is as follows:
Let LARGEBORDERd(C′) be ({{ }}, {C1, C2, . . . , Cm})
Let LARGEBORDERt(C) be ({{ }}, {D1, D2, . . . , Dn})
MBD-LLBORDER(LARGEBORDERd(C′), LARGEBORDERt(C))
  EPBORDERS ← { };
  for j from 1 to n do
    if some Ci is a superset of Dj then continue;
    {C1′, . . . , Cm′} ← {C1 intersect Dj, . . . , Cm intersect Dj};
    RIGHTBOUND ← the set of all maximal itemsets in {C1′, . . . , Cm′};
    add BORDER-DIFF(({{ }}, {Dj}), ({{ }}, RIGHTBOUND)) into EPBORDERS;
  return EPBORDERS;
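The loop of MBD-LLBORDER can be sketched in Python up to the point where BORDER-DIFF is invoked. The sketch assumes the two large borders are given by their right-hand bounds as collections of itemsets, and it returns the (Dj, RIGHTBOUND) argument pairs instead of calling BORDER-DIFF; the function names are hypothetical.

```python
def maximal(itemsets):
    """Keep only the itemsets that are not proper subsets of another itemset."""
    unique = {frozenset(s) for s in itemsets}
    return [s for s in unique if not any(s < t for t in unique)]


def mbd_llborder_pairs(right_c_prime, right_c):
    """Return the (Dj, RIGHTBOUND) pairs that would be passed to BORDER-DIFF."""
    cs = [frozenset(c) for c in right_c_prime]
    pairs = []
    for d in map(frozenset, right_c):
        # Dj is also large in C', so it contributes no emerging patterns.
        if any(c >= d for c in cs):
            continue
        right_bound = maximal(c & d for c in cs)
        pairs.append((d, right_bound))
    return pairs


# Hypothetical right-hand bounds of LARGEBORDERd(C') and LARGEBORDERt(C).
c_prime = [{2, 3, 5}, {3, 4, 6, 7, 8}, {2, 4, 5, 8, 9}]
c = [{1, 2, 3, 4}, {6, 7, 8}]
for d, rb in mbd_llborder_pairs(c_prime, c):
    print(sorted(d), sorted(sorted(s) for s in rb))
# [1, 2, 3, 4] [[2, 3], [2, 4], [3, 4]]
```

In this toy input the itemset {6, 7, 8} is skipped because it is contained in {3, 4, 6, 7, 8}.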
The function BORDER-DIFF is employed to derive differentials between a pair of borders of special forms: given a pair of borders ({{ }}, {U}) and ({{ }}, R), BORDER-DIFF derives another border (L, {U}) such that the collection of sets represented is exactly those sets represented by ({{ }}, {U}) but not represented by ({{ }}, R). Importantly, BORDER-DIFF achieves this by manipulating only the itemsets in the borders:
BORDER-DIFF(({{ }}, {U}), ({{ }}, {S1, . . . , Sk}))
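The text gives only the signature of BORDER-DIFF. One possible realization, shown here as an assumption rather than as the procedure of the invention, builds the left-hand bound L incrementally: for each Si it extends every itemset in L by one item of U that is not in Si, then keeps only the minimal itemsets. The function names are hypothetical.

```python
def minimal(itemsets):
    """Keep only the itemsets that are not proper supersets of another itemset."""
    unique = set(itemsets)
    return {s for s in unique if not any(t < s for t in unique)}


def border_diff(u, right):
    """Left-hand bound L of the border (L, {U}) for the subsets of U not contained in any Si."""
    u = frozenset(u)
    left = {frozenset()}
    for s in right:
        missing = u - frozenset(s)  # items of U that Si lacks
        # A set escapes Si only if it contains at least one missing item.
        left = minimal({x | {item} for x in left for item in missing})
    return left


result = border_diff({1, 2, 3, 4}, [{3, 4}, {2, 4}, {2, 3}])
print(sorted(sorted(s) for s in result))  # [[1], [2, 3, 4]]
```

In the example, the subsets of {1, 2, 3, 4} that are not contained in {3, 4}, {2, 4} or {2, 3} are those containing item 1, together with {2, 3, 4} itself, which matches the border ({{1}, {2, 3, 4}}, {{1, 2, 3, 4}}).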
In accordance with a fifth aspect of the invention, there is disclosed a system for classifying data using a processor. The system includes:
a device for inputting samples of the data to be classified;
a device for mining emerging patterns for all of a number of categories of the data;
a device for computing aggregate differentiating scores for all samples of the data and the categories;
a device for computing base scores of aggregate differentiating scores for all samples and categories; and
a device for assigning a category to each sample, the category assigned to a sample having a normalized score that is maximal for the sample.
Preferably, the system further includes a device for outputting classification decisions on the samples.
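The category assignment of this aspect reduces to a maximum over normalized scores. The sketch below assumes the raw aggregate scores of one sample and the base scores are given as dictionaries keyed by category, with strictly positive base scores; the names are hypothetical.

```python
def classify(raw_scores, base_scores):
    """Return the category whose normalized score (raw / base) is largest."""
    normalized = {c: raw_scores[c] / base_scores[c] for c in raw_scores}
    return max(normalized, key=normalized.get)


# Hypothetical scores for one sample: the raw score favors "neg",
# but after normalization "pos" wins (1.5 versus 1.0).
raw = {"pos": 3.0, "neg": 4.0}
base = {"pos": 2.0, "neg": 4.0}
print(classify(raw, base))  # pos
```

The example shows why normalization matters when one category tends to produce larger raw scores than another.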
In accordance with a sixth aspect of the invention, there is disclosed a system for ranking and classifying data using a processor. The system includes:
a device for inputting samples of the data to be classified;
a device for mining emerging patterns for all of a number of categories of the data;
a device for computing aggregate differentiating scores for all samples of the data and the categories;
a device for computing base scores of aggregate differentiating scores for all samples and categories; and
a device for ranking each category against each sample by measuring a normalized score for the sample, where the greater the normalized score, the higher the rank of the category for the sample; the normalized score with respect to a category is formed by dividing the aggregate differentiating score by the corresponding base score.
Preferably, the system further includes a device for outputting ranked classification decisions on the samples.
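Ranking differs from classification only in that all categories are returned in order. A minimal sketch, assuming dictionaries of raw and strictly positive base scores keyed by category (hypothetical names and data):

```python
def rank_categories(raw_scores, base_scores):
    """Return (category, normalized score) pairs, highest normalized score first."""
    normalized = {c: raw_scores[c] / base_scores[c] for c in raw_scores}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


raw = {"A": 6.0, "B": 3.0, "C": 1.0}
base = {"A": 4.0, "B": 1.0, "C": 2.0}
print(rank_categories(raw, base))  # [('B', 3.0), ('A', 1.5), ('C', 0.5)]
```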