Herebelow, numerals in square brackets, [ ], are keyed to the numbered list of references found towards the end of the disclosure.
During the last two decades, our ability to collect and store data has significantly outpaced our ability to analyze, summarize and extract “knowledge” from the continuous stream of input. Traditional data mining methods that require all data to be held in memory are becoming inadequate. An effective interface between data mining and very large databases essentially requires scalability. The scalability and accuracy of data mining methods are constantly being challenged by real-time production systems that generate tremendous amounts of data continuously at an unprecedented rate. Examples of such data streams include security buy-sell transactions, credit card transactions, phone call records, network event logs, etc.
A very significant characteristic of streaming data is the “evolving pattern”. In other words, both the underlying true model and the distribution of instances evolve and change continuously over time. Streaming data is also characterized by large data volumes. Knowledge discovery on data streams has become a research topic of growing interest. A need has thus been recognized in connection with solving the following problem: given an infinite amount of continuous measurements, how do we model them in order to capture time-evolving trends and patterns in the stream, and make time-critical decisions?
Most recent research on scalable inductive learning over very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans, particularly for decision tree construction. State-of-the-art decision tree algorithms (SPRINT [9], RainForest [5], and later BOAT [6] among others) still scan the data multiple times, and employ rather sophisticated mechanisms in implementation. The most recent work [8] applies the Hoeffding inequality to decision tree learning on streaming data, in which a node is reconstructed if it is statistically necessary. Outside of decision trees, there has not been much research on reducing the number of data scans for other inductive learners. A need has thus been recognized in connection with developing a general approach for a wide range of inductive learning algorithms to scan the dataset less than once (which can be interpreted as “less than one full time” or “less than one time in entirety”), and for the approach to be broadly applicable beyond decision trees to other learners, e.g., rule and naive Bayes learners.
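By way of illustration only, the Hoeffding inequality referred to above in connection with [8] can be sketched numerically. For a random variable with range R, the mean of n independent observations lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability at least 1 - delta. The Python sketch below evaluates this standard bound; it is not taken from [8], and the function names `hoeffding_bound` and `examples_needed` are hypothetical names chosen for this example.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that the mean of n observations is within
    epsilon of the true mean with probability at least 1 - delta."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Return the smallest n for which the bound is at most epsilon."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.ceil(
        value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    )
```

Because the bound shrinks only as the square root of n, halving epsilon requires roughly four times as many examples. This is one way to see why a learner on streaming data may be able to commit to a decision after observing only a portion of the data.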
“Ensemble of classifiers” has been studied as a general approach for scalable learning. Previously proposed meta-learning [2] reduces the number of data scans to two. However, empirical studies have shown that the accuracy of the multiple-model approach is sometimes lower than that of the respective single model. Bagging [1] and boosting [4] are not scalable, since both methods scan the dataset multiple times. In this context, a need has thus been recognized in connection with being able to scan the dataset less than once and to provide higher accuracy than a single classifier.