A number of applications provide data in string format, including market basket analysis, customer tracking, DNA (deoxyribonucleic acid) analysis, and text. In many cases it is desirable to find subpatterns having particular formats within these strings. The traditional classification technique is defined as follows: a set of strings is given, each of which is labeled with a class drawn from the set C1 . . . Ck. The correct class label is then determined for a given test data record whose class label is unknown. This traditional technique concerns only the classification of entire strings, and does not address the significantly more complex problem of finding particular classes of substructures within strings.
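By way of illustration only, the traditional whole-string classification setting described above may be sketched as follows. The sketch assumes a nearest neighbor classifier over k-mer (length-k substring) counts; the function names `kmer_profile`, `similarity`, and `classify_string`, the similarity measure, and the sample strings are hypothetical and are not taken from any of the referenced techniques.

```python
from collections import Counter


def kmer_profile(s, k=2):
    """Count every length-k substring of s (assumed feature representation)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def similarity(a, b):
    """Number of k-mers shared by two profiles (multiset intersection size)."""
    return sum((a & b).values())


def classify_string(test, training, k=2):
    """Assign ONE label to the ENTIRE test string.

    training is a list of (string, label) pairs. The label of the most
    similar training string is returned (1-nearest-neighbor).
    """
    profile = kmer_profile(test, k)
    _, label = max(
        training,
        key=lambda pair: similarity(profile, kmer_profile(pair[0], k)),
    )
    return label


if __name__ == "__main__":
    training = [("ACGTACGT", "C1"), ("TTTTGGGG", "C2")]
    print(classify_string("ACGTAC", training))
```

Note that the result is a single class label for the whole record; nothing in it indicates where, or how often, a pattern of interest occurs inside the string.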
In many applications the problem of finding substructures in strings is significantly more important than that of classifying the string itself. For example:
(1) In a genome application different kinds of protein sequences may be sought, which are embedded in very long patterns. A given pattern may contain one or multiple occurrences of such a sequence substructure. The number and types of such occurrences may not even be known a priori.
(2) In a text application it may be desirable to find particular segments or paragraphs which belong to a particular topic or satisfy other conditions that cannot be expressed in closed form, but only in the form of examples.
The examples above can present complex variations in which the same data contains different classes of substructures and in which some classes of substructures are embedded within others. This often occurs in the biological domain. Similarly, in a text article, certain segments of interest may be divided into further subsections of a particular kind. Thus, there may be a hierarchical aspect to the class behavior of the data. This is a generalized classification problem: not only must the correct classification of a given substructure be found, but the exact location and extent of the substructure in the data must also be determined.
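For illustration only, the form of the result sought in the generalized problem may be sketched as follows. Each result carries a location, an extent, and a class label, and results may be nested. The sketch uses naive exact matching against labeled example substrings purely as a stand-in; it is not a proposed solution, and the names `Substructure` and `find_substructures`, as well as the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substructure:
    start: int   # index of the first character of the substructure
    end: int     # index one past the last character (exclusive)
    label: str   # class of the substructure


def find_substructures(test, examples):
    """Report every occurrence of each labeled example inside test.

    examples is a list of (substring, label) pairs. Exact matching is an
    assumption made only to keep the sketch short.
    """
    found = []
    for pattern, label in examples:
        pos = test.find(pattern)
        while pos != -1:
            found.append(Substructure(pos, pos + len(pattern), label))
            pos = test.find(pattern, pos + 1)  # allow overlapping matches
    # Outer (longer) substructures are listed before those nested in them.
    return sorted(found, key=lambda s: (s.start, -s.end))


if __name__ == "__main__":
    examples = [("ACGTACGT", "gene"), ("ACG", "motif")]
    for sub in find_substructures("TTACGTACGTGGACG", examples):
        print(sub)
```

In this sample, two "motif" occurrences fall within the extent of the "gene" occurrence and a third lies outside it, which illustrates both the multiple-occurrence and the hierarchical aspects noted above.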
The standard classification problem for strings has been studied in the computational biology, database, and data mining fields; see, e.g., C. C. Aggarwal, “On Effective Classification of Strings with Wavelets,” ACM KDD Conference, 2002; G. A. Churchill, “Stochastic Models for Heterogeneous DNA Sequences,” Bull. Math Biol, 59, pp. 79–91, 1989; M. Deshpande et al., “Evaluation of Techniques for Classifying Biological Sequences,” Technical report, TR 01–33, University of Minnesota, 2001; S. Subbiah et al., “A Method for Multiple Sequence Alignment with Gaps,” Journal of Molecular Biology, 209, pp. 539–548, 1989; and M. S. Waterman, “Sequence Alignments,” In: Mathematical Methods for DNA Sequences, Waterman M. S. ed. CRC Press, 1989. A good comparative study of the most important classification methods for strings can be found in M. Deshpande et al., “Evaluation of Techniques for Classifying Biological Sequences,” Technical report, TR 01–33, University of Minnesota, 2001. More recently, a wavelet classification method for the string problem was discussed in C. C. Aggarwal, “On Effective Classification of Strings with Wavelets,” ACM KDD Conference, 2002. This technique has been shown to be more effective than the other string classifiers discussed by M. Deshpande et al.
Substructure mining is inherently more difficult than standard classification. Standard classification methods such as rule based methods, decision trees, and nearest neighbor classifiers are designed only for the task of labeling entire records. These methods cannot be easily extended to the generalized substructure classification problem, in which the location and extent of each substructure must be determined in addition to its class.
Thus, there exists a need for techniques which overcome the drawbacks associated with the approaches described above, as well as drawbacks not expressly described above, and which thereby provide more efficient and scalable solutions to the problems associated with string substructure classification.