The scalability of data mining methods is constantly being challenged by real-time production systems that generate tremendous amount of data at unprecedented rates. Examples of such data streams include network event logs, telephone call records, credit card transactional flows, sensoring and surveillance video streams, etc. Other than the huge data volume, streaming data are also characterized by their drifting concepts. In other words, the underlying data generating mechanism, or the concept that we try to learn from the data, is constantly evolving. Knowledge discovery on streaming data is a research topic of growing interest. A need has accordingly been recognized in connection with solving the following problem: given an infinite amount of continuous measurements, how do we model them in order to capture time-evolving trends and patterns in the stream, and make time-critical predictions?
Huge data volume and drifting concepts are not unfamiliar to the data mining community. One of the goals of traditional data mining algorithms is to learn models from large databases with bounded-memory. It has been achieved by several classification methods, including Sprint, BOAT, etc. Nevertheless, the fact that these algorithms require multiple scans of the training data makes them inappropriate in the streaming environment where examples are coming in at a higher rate than they can be repeatedly analyzed.
Incremental or online data mining methods are another option for mining data streams. These methods continuously revise and refine a model by incorporating new data as they arrive. However, in order to guarantee that the model trained incrementally is identical to the model trained in the batch mode, most online algorithms rely on a costly model updating procedure, which sometimes makes the learning even slower than it is in batch mode. Recently, an efficient incremental decision tree algorithm called VFDT was introduced by Domingos et al. (“Mining High-Speed Data Streams”, ACM SIG KDD, 2000).
For streams made up of discrete type of data, Hoeffding bounds guarantee that the output model of VFDT is asymptotically nearly identical to that of a batch learner.
The above mentioned algorithms, including incremental and online methods such as VFDT, all produce a single model that represents the entire data stream. It suffers in prediction accuracy in the presence of concept drifts. This is because the streaming data are not generated by a stationary stochastic process, indeed, the future examples we need to classify may have a very different distribution from the historical data.
Generally, in order to make time-critical predictions, the model learned from streaming data must be able to capture transient patterns in the stream. To do this, a need has been recognized in connection not only with revising a model by incorporating new examples, but also with eliminating the effects of examples representing outdated concepts. Accordingly, a need has been recognized in connection with addressing the challenges of maintaining an accurate and up-to-date classifier for infinite data streams with concept drifts, wherein the challenges include the following:
With regard to accuracy, it is difficult to decide what are the examples that represent outdated concepts, and hence their effects should be excluded from the model. A commonly used approach is to “forget” examples at a constant rate. However, a higher rate would lower the accuracy of the “up-to-date” model as it is supported by a less amount of training data and a lower rate would make the model less sensitive to the current trend and prevent it from discovering transient patterns.
With regard to efficiency, decision trees are constructed in a greedy divide-and-conquer manner, and they are non-stable. Even a slight drift of the underlying concepts may trigger substantial changes (e.g., replacing old branches with new branches, re-growing or building alternative subbranches) in the tree, and severely compromise learning efficiency.
With regard to ease of use, substantial implementation efforts are required to adapt classification methods such as decision trees to handle data streams with drifting concepts in an incremental manner. The usability of this approach is limited as state-of-the-art learning methods cannot be applied directly.