Herebelow, numerals in brackets—[ ]—are keyed to the list of references found towards the end of the instant disclosure.
The scalability and accuracy of data mining methods are constantly being challenged by real-time production systems that continuously generate tremendous amounts of data at an unprecedented rate. Examples of such data streams include security buy-sell transactions, credit card transactions, phone call records, and network event logs. The most important characteristic of streaming data is its evolving pattern: both the underlying true model and the distribution of instances change continuously over time. Streaming data is also characterized by large data volumes. Knowledge discovery on data streams has become a research topic of growing interest [2, 4, 5, 10]. A need has accordingly been recognized in connection with solving the following: given an infinite amount of continuous measurements, how do we model them in order to capture time-evolving trends and patterns in the stream, and make time-critical decisions?
Most previous work on mining data streams concentrates on capturing time-evolving trends and patterns with “labeled” data, focusing on how to detect a change in patterns and how to update the model to reflect such a change. However, one important aspect that is often ignored or unrealistically assumed is the availability of “class labels” for the data stream. Most algorithms make the implicit and impractical assumption that labeled data is readily available. For many applications, however, the class labels are not “immediately” available unless dedicated effort and the associated cost are spent to obtain them right away. Indeed, if the true class labels were readily available, data mining models would not be very useful.
As one example, consider credit card fraud detection. We usually do not know whether a particular transaction is fraudulent until at least one month later, after the account holder receives and reviews the monthly statement. However, if necessary, the true label for a purchase is typically just a phone call away. It is not feasible to verify every transaction, but verifying a small number of suspicious transactions is practical.
As another example, in large organizations, the data mining engine normally runs on a data warehouse, while the real-time data streams are stored, processed and maintained on a separate production server. In most cases, the data on the production server is summarized, de-normalized, cleaned up and transferred to the data warehouse periodically, such as overnight or over the weekend. The true class labels for each transaction are usually kept and maintained in several database tables. It is very hard to provide the true labels to the learner in real time due to volume and quality issues. Nevertheless, the true labels for a small number of transactions can be obtained relatively easily by running a simple query against the database for these transactions.
Due to these considerations, most current applications obtain class labels and update existing models at a preset frequency, usually synchronized with data refresh. In summary, the life cycle of today's stream data mining tends to be: “given labeled data→train initial model→classify data stream→passively given labeled data→re-train model . . . ”. The effectiveness of the algorithm is dictated by some “application-related and static constraints”, resulting in a number of potentially undesirable consequences that contradict the notions of “streaming” and “continuous”. Among these consequences are:
        Possible loss due to neglected pattern drifts: If either the concept or the data distribution drifts rapidly, at an unforeseen rate that the application-related constraints cannot keep up with, the model is likely to be out of date on the data stream, and important business decisions might be missed or mistakenly made.
        Unnecessary model refresh: If there is neither conceptual nor distributional change, periodic passive model refresh and re-validation is a waste of resources.
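By way of a non-limiting illustration, the passive life cycle described above can be sketched in a few lines of Python. The sketch is a toy under stated assumptions and is not taken from any referenced work: the one-dimensional threshold learner, the function names `train_threshold` and `passive_cycle`, the `refresh_every` parameter and the sample data are all hypothetical, chosen only to make both consequences visible.

```python
from typing import List, Tuple

Record = Tuple[float, int]  # (feature value, true class label); hypothetical layout


def train_threshold(labeled: List[Record]) -> float:
    """Toy learner (an assumption): midpoint between the two class means."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def passive_cycle(initial: List[Record], batches: List[List[Record]],
                  refresh_every: int):
    """Classify each batch, re-training only at a preset frequency."""
    threshold = train_threshold(initial)       # train initial model
    errors_per_batch, thresholds = [], []
    pending: List[Record] = []
    for i, batch in enumerate(batches, start=1):
        # Classify the stream with whatever model is currently deployed.
        errors = sum((x > threshold) != bool(y) for x, y in batch)
        errors_per_batch.append(errors)
        pending.extend(batch)                  # labels are handed over passively, later
        if i % refresh_every == 0:             # static, application-related schedule
            threshold = train_threshold(pending)
            thresholds.append(threshold)
            pending = []
    return errors_per_batch, thresholds


old_concept = [(1.0, 0), (3.0, 1)]   # classes separated around 2.0
new_concept = [(3.0, 0), (5.0, 1)]   # after the drift: separated around 4.0
stream = [old_concept] * 2 + [new_concept] * 4

errors, thresholds = passive_cycle(old_concept, stream, refresh_every=2)
print(errors)      # misclassifications per batch
print(thresholds)  # model after each scheduled refresh
```

In this toy run, the first scheduled refresh reproduces the model already in place (an unnecessary refresh), while the two batches that follow the drift are classified with a stale model until the next scheduled refresh arrives (loss due to a neglected pattern drift).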
In view of the foregoing, a general need has been recognized in connection with improving upon the disadvantages and shortcomings presented by known arrangements.