Advanced database system research faces a major challenge posed by the emergence of massive, complex structural data (e.g., sequences, lattices, trees, graphs and networks), which are encountered in applications such as bio-informatics, geo-informatics and chem-informatics. A particular challenge involves graph classification, i.e., correctly assigning molecules or chemical compounds to various classes, e.g., toxic versus nontoxic, or active versus inactive.
Graphs are the most general form of structural data, and thus are used extensively in chem-informatics and bio-informatics datasets. In chem-informatics, an important task is to infer chemical or biological properties of a molecule from its structure. Similarly, in the drug design process, one of the key steps is the identification of chemical compounds that display a desired and reproducible behavior against a specific biomolecular target. In computer vision and pattern recognition, where graphs are used to represent complex structures such as hand-drawn symbols, three-dimensional objects and medical images, it is also desirable to perform graph classification, for example letter or digit classification and face recognition.
A number of methods have been developed to perform classification on complex structural data. See, for example, A. Inokuchi et al., An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data, In Proc. 2000 European Symposium On The Principle Of Data Mining And Knowledge Discovery (PKDD'00), pgs. 13-23 (2000); M. Deshpande et al., Frequent Substructure-based Approaches for Classifying Chemical Compounds, 17(8) IEEE Trans. On Knowledge And Data Engineering, pgs. 1036-1050 (2005); N. Wale et al., Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification, In Proc. 6th International Conference On Data Mining (ICDM'06), pgs. 678-689 (2006), the disclosures of which are incorporated by reference herein.
While these methods are very useful, they do have important limitations. Specifically, none of the cited methods accommodates the skewed class distribution that is quite common in real graph datasets. If traditional learning methods are applied directly to skewed data, they tend to be biased towards the majority class and to ignore the minority class, since the goal of such methods is to minimize the overall error rate. However, the primary purpose of graph classification is to identify the rare active class among the vast inactive class, and the cost of misclassifying minority examples is usually very high. Therefore, an effective solution to the skewed distribution problem would be desirable.
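By way of illustration only, the following minimal sketch shows why minimizing the error rate can favor a classifier that ignores the minority class. The dataset size, the class counts, the two sets of predictions and the helper name `evaluate` are hypothetical values chosen for this example. They are not taken from the cited methods or from any real dataset.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (error_rate, minority_recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    errors = sum(t != p for t, p in pairs)
    positives = sum(t == positive for t in y_true)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    recall = true_pos / positives if positives else 0.0
    return errors / len(y_true), recall


# Hypothetical skewed dataset: 20 active (1) and 980 inactive (0) compounds.
y_true = [1] * 20 + [0] * 980

# Classifier A always predicts the majority (inactive) class.
pred_a = [0] * 1000

# Classifier B finds 15 of the 20 actives but raises 60 false alarms.
pred_b = [1] * 15 + [0] * 5 + [1] * 60 + [0] * 920

err_a, rec_a = evaluate(y_true, pred_a)
err_b, rec_b = evaluate(y_true, pred_b)

print(f"A: error rate = {err_a:.3f}, minority recall = {rec_a:.2f}")
print(f"B: error rate = {err_b:.3f}, minority recall = {rec_b:.2f}")
```

In this hypothetical setting, classifier A has the lower error rate (0.020 versus 0.065) although it identifies none of the active compounds, whereas classifier B identifies 75% of them. A learner that is guided only by the error rate would therefore prefer classifier A.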