1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular, to a computer implemented method and system for processing data streams. Still more particularly, the present invention relates to a computer implemented method, system, and computer usable program code for classifying data streams using high-order models.
2. Description of the Related Art
Stream processing computing applications are applications in which data enters the system as a continuous flow of information that satisfies some restriction on the data. With this type of data, the volume being processed may be too large to store; therefore, applications such as sensor data analysis and network traffic monitoring call for sophisticated real-time processing over dynamic data streams. Examples of stream processing computing applications include video processing, audio processing, streaming databases, and sensor networks.
Classifying data streams is important for various practical purposes. For example, data streams need to be classified in order to detect credit card fraud and network intrusions. Classifying data streams is difficult because of the large volume of data coming into a system at very high speeds. Additionally, the data distribution within the data streams changes constantly over time.
Classification plays an important role in filtering out patterns that are uninteresting or irrelevant to the current classification scheme. Often, classifiers compete with other processing elements for resources, such as processing power, memory, and bandwidth. Some current solutions incrementally update classifiers using models. These models are referred to as decision trees and are repeatedly revised so that the decision tree always represents the current data distribution. Decision trees are unstable data structures. As a result, a slight drift or concept shift may trigger substantial changes to the tree. Concept drift is defined as a change in the underlying class distribution over time. For example, in a classification system for fraud detection, transactions may be classified into two classes: fraudulent or normal. As the spending pattern of a credit card user evolves over time, the sets of transactions classified as normal and as fraudulent should change as well.
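By way of illustration only, the effect of concept drift on a classifier that is not updated may be sketched as follows. The function names, transaction amounts, and spending limits are hypothetical values chosen for the example and do not describe any particular fraud detection system.

```python
# Illustrative only: the amounts and limits below are made-up values.

def classify(amount, limit):
    """Label a transaction by comparing its amount to a spending limit."""
    return "fraudulent" if amount > limit else "normal"


def accuracy(model_limit, true_limit, amounts):
    """Fraction of transactions on which a model using model_limit agrees
    with the labels implied by the user's current true_limit."""
    hits = sum(
        classify(a, model_limit) == classify(a, true_limit) for a in amounts
    )
    return hits / len(amounts)


amounts = [50, 150, 300, 450, 600]

# Before drift: the model's limit matches the user's spending pattern.
before = accuracy(model_limit=100, true_limit=100, amounts=amounts)

# After drift: the user's normal spending has grown, but the model is stale.
after = accuracy(model_limit=100, true_limit=500, amounts=amounts)

print(before, after)
```

In this sketch the stale model agrees with every label before the drift and with only two of the five labels afterward, because three amounts that have become normal for the user are still flagged as fraudulent.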
In another solution, stream processing applications repeatedly learn new independent models from streaming data to grow and remove sub-trees. Decision trees with the highest classification accuracy are selected based on newly arriving data. The learning costs associated with growing and removing decision trees are very high, and accuracy is low. Low accuracy may result from model overfitting due to a lack of training data or from conflicting concepts due to an abundance of training data.
Ensemble classifiers may also be used, with the data streams partitioned into fixed-size data segments. Ensemble classifiers have high costs because a classifier is learned for each new segment. Furthermore, every classifier is evaluated for each test example. The classifiers are homogeneous and are discarded as a whole. As a result, current classification processes for data streams are time consuming and are unable to effectively process high-speed data streams with changing data distributions.
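By way of illustration only, the cost structure of a segment-based ensemble may be sketched as follows. The class names `NearestMean` and `SegmentEnsemble`, the one-dimensional feature, the majority vote, and the `fits` and `evaluations` counters are assumptions introduced for the example and are not taken from any particular prior system.

```python
from collections import Counter, deque
from statistics import mean


class NearestMean:
    """Hypothetical base learner: predicts the label whose mean is closest."""

    def fit(self, xs, ys):
        self.means = {
            label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)
        }
        return self

    def predict(self, x):
        return min(self.means, key=lambda label: abs(x - self.means[label]))


class SegmentEnsemble:
    """Learns one classifier per fixed-size segment of the stream."""

    def __init__(self, segment_size, max_members):
        self.segment_size = segment_size
        # When full, the oldest classifier is discarded as a whole.
        self.members = deque(maxlen=max_members)
        self.buffer = []
        self.fits = 0          # number of classifiers learned so far
        self.evaluations = 0   # number of member evaluations so far

    def observe(self, x, y):
        """Buffer a labeled example; learn a new member per full segment."""
        self.buffer.append((x, y))
        if len(self.buffer) == self.segment_size:
            xs, ys = zip(*self.buffer)
            self.members.append(NearestMean().fit(xs, ys))
            self.fits += 1
            self.buffer = []

    def predict(self, x):
        """Evaluate every member on the test example and take a majority vote."""
        if not self.members:
            raise ValueError("no segment has been learned yet")
        votes = Counter(member.predict(x) for member in self.members)
        self.evaluations += len(self.members)
        return votes.most_common(1)[0][0]


ensemble = SegmentEnsemble(segment_size=2, max_members=2)
for _ in range(3):  # three segments of two labeled examples each
    ensemble.observe(1.0, "normal")
    ensemble.observe(9.0, "fraudulent")

prediction = ensemble.predict(2.0)
print(prediction, ensemble.fits, ensemble.evaluations)
```

In this sketch, learning cost grows with the number of segments, since each segment triggers a new fit, and prediction cost grows with the ensemble size, since every retained member is evaluated for each test example.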