The advent of new computing technologies such as ubiquitous computing, e-commerce, and sensor networks has lead to intensive research on manipulation and use of data from data streams. Although a great deal of information may be gleaned from a data stream, making accurate predictions from the data stream is of particular interest to many users. Traditional methods for searching data streams for particular data (also referred to as “data mining”) have involved the use of algorithms that are tailored to specific aspects of data within the data stream. Unfortunately, with the advancement of data complexity, complications have arisen when applying such traditional methods.
A pervasive challenge with mining a data stream involves the changeable nature of the data. That is, typically, a data stream changes over time exhibiting what is commonly referred to as “concept drift.” Reference may be had to the publications “Mining time-changing data streams” Hulten, G., Spencer, L., and Domingos, P., SIGKDD, ACM Press, San Francisco, Calif., 2001, pp-97-106 and “Mining concept-drifting data streams using ensemble classifiers” Wang, H. SIGKDD, ACM Press, San Francisco, Calif., 2003.
If the nature of the data distribution is static, a subset of the data can be used to create (i.e., learn) a model and use it for all future data. Unfortunately, when the data distribution is constantly changing, the model is quickly outmoded. This means static models must constantly be revised to reflect the current data features. It is not difficult to see why model update incurs a major cost in terms of analyses and time, and typically require complex and comprehensive coding. Models for classifying data streams (referred to as “stream classifiers”) that handle concept drifts can be roughly divided into two categories.
A first category of stream classifiers is known as “incrementally updated classifiers.” One exemplary type of incremental classifier, referred to as using a “CVFDT” approach, uses a single decision tree to model streams with concept drifts. Reference may again be had to the publication “Mining time-changing data streams” Hulten, G., Spencer, L., and Domingos, P., SIGKDD, ACM Press, San Francisco, Calif., 2001, pp-97-106. However, in this approach, even a slight drift of the concept may trigger substantial changes in the tree (e.g., replacing old branches with new branches, re-growing or building alternative sub-trees), which severely compromises learning efficiency. Aside from this undesirable aspect, incremental methods are also hindered by their prediction accuracy. For example, incremental classifiers discard older examples (input data used for model building) at a fixed rate (without regard for any change of the concepts). Thus, the learned model is supported only by the data in a current window—a relatively small amount of data from the data stream. The inherent flaws in incremental classifiers cause large variances in data prediction. Reference may also be had to FIG. 1 and FIG. 2 that depict aspects of challenges encountered when attempting to fit a model using incrementally updated classifiers to a data stream.
In FIG. 1 and FIG. 2, a data stream 100 that includes a series of records 101 is mined by a model using incremental classifiers. Relevant data 110 is defined by a line, depicting an optimum boundary. As can be seen, particularly with regard to FIG. 2, the definition of relevant data provided by the model can easily misclassify data.
A second category of stream classifiers is known as “ensemble classifiers.” Instead of maintaining a single model, the ensemble approach divides the data stream into data chunks having a fixed size and learns a classifier from each of the chunks. Reference may again be had to the publication “Mining concept-drifting data streams using ensemble classifiers” Wang, H. SIGKDD, ACM Press, San Francisco, Calif., 2003. In order to use ensemble classifies to make a prediction, the model used must evaluate all of the classifiers in the ensemble, which is an expensive process. The ensemble approach has high model update cost as i) new models on new data are constantly being learned, whether or not the data contains concept drifts; and, ii) the accuracy of older models are constantly evaluated by applying each of them to new training data. Further, the classifiers are homogeneous and therefore discarded as a whole. This approach introduces a considerable cost in modeling of high-speed data streams. If timely updating of models is not completed because of the high update cost, prediction accuracy drops as a result. This causes problems in data predictions, particularly for applications that handle large volume streaming data at very high speeds. Aspects of data mining using the ensemble approach are depicted in FIG. 3 and FIG. 4.
In FIG. 3 and FIG. 4, the data stream 100 is portioned into fixed size chunks of data 102. The chunks of data 102 are mined by a model using an ensemble of classifiers. Instances of relevant data 110 (identified by the shading in the depiction of the chunks of data 102) are collected. Classification error, depicted in FIG. 4, comes at a high cost to the data mining.
Current approaches for classifying stream data are adapted from algorithms designed for static data, for which monolithic models typically perform adequately. However, dynamic data streams present problems for developing meaningful predictions. For incrementally updated classifiers, the fact that even a small disturbance in the data may bring a complete change to the model indicates that monolithic models are not appropriate for data streams. The ensemble is not semantic-aware. That is, in the face of concept drifts, it is still very costly to tell which components are affected and hence must be replaced, and what new components must be brought in to represent the new concept. What is needed is a model for mining data streams that accounts for drifting concepts within the data and is a computationally low-cost model.
Typically, data mining involves a large volume of data produced at a high-speed. Multiple data streams may be involved, so classification plays an important role in filtering out irrelevant information. Data mining competes with other processing elements for resources (CPU, memory, etc, . . . ). Thus, flexible and efficient data mining systems are needed. New models for mining data streams are increasingly important as traditional classification methods work on static data, and usually require multiple scans of training data in order to build a model.