1. Technical Field
The present invention relates to data stream processing and more particularly to a system and method for classifying data streams with scarce and/or skewed training data.
2. Description of the Related Art
The recent growth of e-commerce, sensor networks, and ubiquitous computing has led to massive amounts of data becoming available in stream format. Mining data streams for actionable insights in real time has become an important and challenging task for a wide range of applications. Compared to traditional data mining, mining data streams poses new challenges because data are streaming through instead of being statically available. As the underlying data-generating mechanism evolves over time, so do the data patterns that data mining systems intend to capture. This is known as concept drift in the stream mining literature.
To cope with concept drifts, stream mining systems update their models continuously to track the changes. Moreover, to make time-critical decisions on streaming data of huge volume and high speed, stream mining systems need to update their models efficiently.
There are some naive approaches for handling streams with concept drifts. One is to incrementally maintain a classifier that tracks patterns in the recent training data, which is usually the data in the most recent sliding window. Another is to use the most recent data to evaluate classifiers learned from historical data and create an ensemble of “good” classifiers. Both approaches are subject to the same problem, namely model overfitting, which is known to degrade the accuracy of a classifier.
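By way of illustration only, the ensemble approach described above can be sketched as follows. This is a minimal, hypothetical sketch, not part of the claimed subject matter; the function names (`select_ensemble`, `ensemble_predict`) and the choice of plain majority voting are assumptions introduced here for clarity.

```python
# Hypothetical sketch: score classifiers learned on historical data
# against the most recent window, keep only the best ("good") ones,
# and combine them by majority vote. Names and voting scheme are
# illustrative assumptions, not the specification's method.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Features = Tuple[float, ...]
Instance = Tuple[Features, int]  # (feature vector, class label)


@dataclass
class ScoredClassifier:
    predict: Callable[[Features], int]
    accuracy: float


def select_ensemble(
    classifiers: Sequence[Callable[[Features], int]],
    recent_window: Sequence[Instance],
    k: int,
) -> List[ScoredClassifier]:
    """Evaluate each historical classifier on the most recent window
    and keep the k most accurate ones."""
    scored = []
    for clf in classifiers:
        correct = sum(1 for x, y in recent_window if clf(x) == y)
        scored.append(ScoredClassifier(clf, correct / len(recent_window)))
    scored.sort(key=lambda s: s.accuracy, reverse=True)
    return scored[:k]


def ensemble_predict(ensemble: Sequence[ScoredClassifier], x: Features) -> int:
    """Predict by majority vote over the selected classifiers."""
    votes = [member.predict(x) for member in ensemble]
    return max(set(votes), key=votes.count)
```

Note that the sketch evaluates every candidate classifier on the same recent window; as discussed below, this evaluation step is itself vulnerable when that window is small or biased.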
Overfitting refers to the problem that models are too specific, or too sensitive to the particulars of the training dataset used to build them. The following known issues can lead to model overfitting and have become more prevalent in the data streaming environment: 1) Insufficient training data. In a streaming environment, it is essential to avoid having conflicting concepts in a training dataset. For this purpose, stream classifiers, such as the two approaches discussed above, enforce a constraint by learning models from data in a small window, as small windows are less likely to contain conflicting concepts. However, a small window usually contains only a small number of training instances, so the constraint makes this well-known cause of overfitting more prevalent. 2) Biased training data. Stream data are bursty by nature. A large number of instances may arrive within a very short time, which seems to provide sufficient training data free of conflicting concepts. However, in many real-time applications, stream data arriving within a short time interval tend to be concentrated in parts of the feature space.
For example, a large number of packets arriving in a burst may all have the same source IP (Internet Protocol) address. Models learned from, or validated by, such data will not generalize well to other data.
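The bias in such a burst can be quantified by the fraction of instances sharing the most common value of a feature, such as the source IP address. The following is an illustrative sketch only; the function name `dominant_fraction` is a hypothetical helper introduced here, not part of the specification.

```python
# Hypothetical sketch: measure how concentrated a burst of packets is
# on a single feature value (e.g., one source IP address), as a simple
# proxy for biased training data.
from collections import Counter
from typing import Sequence


def dominant_fraction(values: Sequence[str]) -> float:
    """Return the fraction of instances sharing the most common value."""
    counts = Counter(values)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(values)
```

A burst in which nine of ten packets share one source IP address yields a dominant fraction of 0.9, signaling that the window covers only a narrow region of the feature space.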
In mining static datasets, the problem of overfitting can usually be addressed in two ways. First, enlarge the training dataset to reduce the risk of overfitting caused by insufficient training data. Second, use an evaluation dataset to detect overfitting caused by biased training data: if a classifier's prediction accuracy relies on particular characteristics of the training data (e.g., the source IP addresses of the incoming packets), then the classifier will perform poorly on an evaluation dataset that does not share these idiosyncrasies.
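The second of these detection methods can be sketched as comparing accuracy on the training data with accuracy on independent evaluation data. This is a minimal, hypothetical sketch; the function names and the gap threshold are assumptions introduced here for illustration, not part of the specification.

```python
# Hypothetical sketch: flag overfitting by comparing training-set
# accuracy against accuracy on a separate evaluation set that does not
# share the training data's idiosyncrasies. The 0.2 gap threshold is
# an arbitrary illustrative choice.
from typing import Callable, Sequence, Tuple

Features = Tuple[float, ...]
Instance = Tuple[Features, int]


def accuracy(predict: Callable[[Features], int],
             dataset: Sequence[Instance]) -> float:
    """Fraction of instances the classifier labels correctly."""
    return sum(1 for x, y in dataset if predict(x) == y) / len(dataset)


def is_overfit(predict: Callable[[Features], int],
               train_set: Sequence[Instance],
               eval_set: Sequence[Instance],
               gap_threshold: float = 0.2) -> bool:
    """Flag a classifier whose training accuracy greatly exceeds its
    accuracy on independent evaluation data."""
    return accuracy(predict, train_set) - accuracy(predict, eval_set) > gap_threshold
```

A classifier that memorizes its training instances scores perfectly on the training set but poorly on disjoint evaluation data, producing a large accuracy gap; a classifier that generalizes shows no such gap.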
Unfortunately, these methods are not applicable in the streaming environment. When there are concept drifts, the enlarged portion of the training dataset or the evaluation dataset may come from a different class distribution, which defeats the purpose of reducing overfitting.