In recent years, advances in data storage technology have made it possible to store data from real-time transactions. However, such transactions may produce data that grows without limit; such data is commonly referred to as a data stream. There have been recent advances in data stream mining, see, for example, B. Babcock et al., “Models and Issues in Data Stream Systems,” ACM PODS Conference, 2002; P. Domingos et al., “Mining High-Speed Data Streams,” ACM SIGKDD Conference, 2000; J. Feigenbaum et al., “Testing and Spot-Checking of Data Streams,” ACM SODA Conference, 2000; J. Fong et al., “An Approximate Lp-difference Algorithm for Massive Data Streams,” Annual Symposium on Theoretical Aspects in Computer Science, 2000; J. Gehrke et al., “On Computing Correlated Aggregates over Continual Data Streams,” ACM SIGMOD Conference, 2001; S. Guha et al., “Clustering Data Streams,” IEEE FOCS Conference, 2000; L. O'Callaghan et al., “Streaming-Data Algorithms for High-Quality Clustering,” ICDE Conference, 2002; and B-K. Yi et al., “Online Data Mining for Co-Evolving Time Sequences,” ICDE Conference, 2000.
An important data mining problem that has been studied in the context of data streams is that of classification, see, for example, R. Duda et al., “Pattern Classification and Scene Analysis,” Wiley, New York, 1973; J. H. Friedman, “A Recursive Partitioning Decision Rule for Non-Parametric Classifiers,” IEEE Transactions on Computers, C-26, pp. 404-408, 1977; M. Garofalakis et al., “Efficient Algorithms for Constructing Decision Trees with Constraints,” KDD Conference, pp. 335-339, 2000; J. Gehrke et al., “BOAT: Optimistic Decision Tree Construction,” ACM SIGMOD Conference Proceedings, pp. 169-180, 1999; and J. Gehrke et al., “RainForest: A Framework for Fast Decision Tree Construction of Large Data Sets,” VLDB Conference Proceedings, 1998.
Further, research in data stream mining in the context of classification has concentrated on one-pass mining, see, for example, P. Domingos et al., “Mining High-Speed Data Streams,” ACM SIGKDD Conference, 2000; and G. Hulten et al., “Mining Time-Changing Data Streams,” ACM KDD Conference, 2001.
The nature of the underlying changes in the data stream can impose considerable challenges. Previous attempts at stream classification treat the task as a one-pass mining problem, an approach that does not account for the changes that have occurred in the stream. Often, test instances of different classes within a data stream arrive in small bursts at different times. When a static classification model is used for an evolving test data stream, the accuracy of the underlying classification process is likely to drop suddenly when there is a sudden burst of records belonging to a particular class. In such cases, a classification model constructed using a shorter history of data is likely to provide better accuracy. On the other hand, if the stream has been relatively stable over time, then using a longer history for training makes greater sense.
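The trade-off between a short and a long training history can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the method of any cited work: a majority-label predictor stands in for a real classifier, the two label streams are synthetic, and the names `horizon_accuracy`, `bursty`, and `stable` are hypothetical.

```python
from collections import Counter


def horizon_accuracy(labels, horizon=None):
    """Predict each label as the majority label of the preceding `horizon`
    records (the entire history when `horizon` is None); return the accuracy.

    The majority-label rule is an illustrative stand-in for a classifier.
    """
    correct = 0
    for i in range(1, len(labels)):
        start = 0 if horizon is None else max(0, i - horizon)
        history = labels[start:i]
        predicted = Counter(history).most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / (len(labels) - 1)


# Synthetic streams (assumptions for illustration only).
# Bursty: long runs of a single class that switch abruptly.
bursty = (["a"] * 50 + ["b"] * 50) * 2
# Stable: class "a" dominates throughout, with brief runs of "b".
stable = ["a", "a", "a", "a", "b", "b"] * 30

for name, stream in (("bursty", bursty), ("stable", stable)):
    short = horizon_accuracy(stream, horizon=3)
    full = horizon_accuracy(stream, horizon=None)
    print(f"{name}: short horizon = {short:.2f}, entire history = {full:.2f}")
```

In this sketch the short horizon is more accurate on the bursty stream, because it recovers within a few records of each class switch, while the entire history is more accurate on the stable stream, because it is not misled by brief runs of the minority class.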
Methods for incrementally updating the classification model on time-changing data streams have also been proposed, see, for example, G. Hulten et al., “Mining Time-Changing Data Streams,” ACM KDD Conference, 2001. However, since such a model uses the entire history of the data stream, its accuracy cannot be greater than that of the best fixed sliding window model on the data stream. Therefore, a more temporally adaptive philosophy is desirable to improve the effectiveness of the underlying algorithms.