Numerals appearing in square brackets herebelow—[ ]—are keyed to the list of references found at the end of the disclosure.
Data streams arise with the introduction of new application areas, including ubiquitous computing and electronic commerce. Mining data streams for knowledge discovery is important to many applications, such as fraud detection, intrusion detection, and trend learning. One problem that has long been considered is that of mining closed frequent itemsets on data streams.
Mining frequent itemsets on static datasets has been studied extensively. However, data streams have posed new challenges. First, data streams tend to be continuous, high-speed, and unbounded. Archiving everything from streams is virtually impossible, not to mention mining association rules from them using algorithms that require multiple scans. Second, the data is not stationary; that is, the data distribution in a stream usually changes with time, and very often people are interested in the most recent patterns.
It is thus of great interest to mine itemsets that are currently frequent. One approach is to always focus on frequent itemsets in the most recent window. A similar effect can be achieved by exponentially discounting old itemsets. For the window-based approach, one can immediately come up with two “naïve” methods:
1. Regenerate frequent itemsets from the entire window whenever a new transaction comes into or an old transaction leaves the window.
2. Store every itemset, frequent or not, in a traditional data structure such as the prefix tree, and update its support whenever a new transaction comes into or an old transaction leaves the window.
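By way of illustration only, method 1 may be sketched in Python as follows. The function names, the use of an absolute count for minsup, and the brute-force enumeration of subsets are assumptions made for this sketch and are not taken from the disclosure; the sketch merely shows that every arrival triggers a full recount of the window.

```python
from collections import Counter, deque
from itertools import combinations


def frequent_itemsets(window, minsup):
    """Recount every itemset in the window from scratch (brute force).

    minsup is assumed here to be an absolute transaction count.
    """
    counts = Counter()
    for transaction in window:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= minsup}


def mine_stream_naively(stream, window_size, minsup):
    """Yield the frequent itemsets of the current window after each arrival."""
    window = deque(maxlen=window_size)  # the oldest transaction drops out automatically
    for transaction in stream:
        window.append(frozenset(transaction))
        # Method 1: regenerate everything from the entire window.
        yield frequent_itemsets(window, minsup)


stream = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
results = list(mine_stream_naively(stream, window_size=2, minsup=2))
```

In this small run only the itemset {a} remains frequent once the third transaction displaces the first, yet the whole window is rescanned at every step.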
Clearly, method 1 above is not efficient. In fact, as long as the window size is reasonable, and the concept drift in the stream is not too dramatic, most itemsets do not often change their status (from frequent to non-frequent or from non-frequent to frequent). Thus, instead of regenerating all frequent itemsets every time from the entire window, it may well be reasonable to adopt an incremental approach.
Method 2, as such, is incremental. However, its space requirement makes it infeasible in practice. The prefix tree is often used for mining association rules on static data sets. In a prefix tree, each node n_I represents an itemset I, and each child node of n_I represents an itemset obtained by adding a new item to I. The total number of possible nodes is exponential in the number of distinct items. Due to memory constraints, it is difficult to keep a prefix tree in memory, and disk-based structures make real-time updates costly.
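The space problem of method 2 may be illustrated with the following Python sketch. For brevity it keeps the supports in a flat dictionary rather than in a prefix tree; the class name AllItemsetCounter and its methods are hypothetical. The point shown is that a single transaction of n items touches 2^n − 1 itemsets, each of which must be stored.

```python
from collections import Counter, deque
from itertools import combinations


class AllItemsetCounter:
    """Hypothetical sketch of method 2: track the support of every itemset."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.support = Counter()  # flat stand-in for the prefix tree

    @staticmethod
    def _subsets(transaction):
        """Enumerate every non-empty subset of a transaction."""
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            yield from combinations(items, size)

    def add(self, transaction):
        """Add a transaction; evict the oldest one if the window overflows."""
        self.window.append(frozenset(transaction))
        for itemset in self._subsets(transaction):
            self.support[itemset] += 1
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            for itemset in self._subsets(expired):
                self.support[itemset] -= 1
                if self.support[itemset] == 0:
                    del self.support[itemset]


counter = AllItemsetCounter(window_size=2)
counter.add({"a", "b", "c"})
print(len(counter.support))  # 2**3 - 1 itemsets from one 3-item transaction
```

Updates are incremental, but the number of stored entries grows exponentially with transaction length, which is the memory concern raised above.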
In view of these challenges, one may wish to focus on a dynamically selected set of itemsets that are i) informative enough to answer at any time queries such as “what are the (closed) frequent itemsets in the current window”, and at the same time, ii) small enough so that they can be easily maintained in memory and updated in real time.
A key problem is, of course, what itemsets shall be selected for this purpose? To reduce memory usage, one may be tempted to select, for example, nothing but frequent (or even closed frequent) itemsets. However, if the frequency count of a non-frequent itemset is not monitored, one will never know when it becomes frequent. A naive approach is to monitor all itemsets whose support is above a reduced threshold minsup−ε, so that one will not miss itemsets whose current support is within ε of minsup when they become frequent. This approach is evidently not general enough, since itemsets whose support lies below minsup−ε remain unmonitored.
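The reduced-threshold selection may be sketched as follows. The function name classify_by_threshold, the use of absolute counts, and the sample supports are assumptions made purely for illustration.

```python
def classify_by_threshold(support, minsup, epsilon):
    """Split itemsets into frequent, monitored-but-infrequent, and unmonitored.

    support maps an itemset to its current count in the window (assumed absolute).
    """
    frequent, near_frequent, unmonitored = {}, {}, {}
    for itemset, count in support.items():
        if count >= minsup:
            frequent[itemset] = count
        elif count >= minsup - epsilon:
            near_frequent[itemset] = count  # kept only because of the reduced threshold
        else:
            unmonitored[itemset] = count  # dropped; later growth goes unnoticed
    return frequent, near_frequent, unmonitored


support = {("a",): 6, ("a", "b"): 4, ("c",): 1}
frequent, near_frequent, unmonitored = classify_by_threshold(support, minsup=5, epsilon=2)
```

Here the itemset {c} falls below minsup−ε and is not tracked, so its count is unavailable if it later gains support.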
In view of the foregoing, a need has been recognized in connection with improving upon the inadequacies and shortcomings of prior efforts.