The problem of massive-domain stream classification is one in which each attribute can take on one of a large number of possible values. Such streams often arise in applications such as internet protocol (IP) monitoring, superstore transactions and financial data. In such cases, traditional models for stream classification cannot be used, because the size of the storage required for intermediate computation of the models can increase rapidly with domain size. Furthermore, the one-pass constraint for data stream computation makes the problem even more challenging. At present, there are no known methods for data stream classification in this setting.
In recent years, data streams have become ubiquitous because of the new ways of collecting and processing such data. The problem of mining data streams is especially challenging because of the one-pass constraint on all mining algorithms. A number of surveys on stream mining algorithms are described in Aggarwal C., Data Streams: Models and Algorithms, Springer (2007). A well known problem in the data mining domain is that of classification, see, Quinlan J. R., C4.5: Programs for Machine Learning, Morgan-Kaufmann, Inc. (1993). In the classification problem, a labeled training data set is used in order to supervise the classification of unlabeled data instances.
The problem of massive-domain stream classification is defined as one in which each attribute takes on an extremely large number of possible values. Examples of such domains follow. In internet applications, the number of possible source and destination addresses can be very large. For example, there may be well over 10^8 possible IP-addresses. It is impossible for most current techniques to compute the discriminatory statistics on such a large number of possible values. In fact, the storage space available on most modern desktop computers is not sufficient to explicitly compute the corresponding discriminatory statistics. For the particular case of data streams, the computation of even 1-dimensional discriminatory statistics becomes infeasible.
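To make the space issue concrete, the storage needed for exact per-value discriminatory statistics can be estimated with a back-of-the-envelope calculation. The figures below (10^8 distinct values, 10 classes, 8-byte counters) are illustrative assumptions, not values specified in the text:

```python
# Estimate the memory needed to keep one exact class-frequency counter
# for every (attribute value, class) pair of a massive-domain attribute.

domain_size = 10**8        # e.g. distinct IP addresses (illustrative)
num_classes = 10           # class labels in the training stream (illustrative)
bytes_per_counter = 8      # one 64-bit counter per (value, class) pair

total_bytes = domain_size * num_classes * bytes_per_counter
print(total_bytes / 2**30)  # ~7.45 GiB of counter storage, per attribute
```

Even for a single attribute, the counters alone exceed the main memory of a typical desktop machine of the period, before any higher-order combinations of attributes are considered.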
Many financial transactions, for example those involving credit cards, may be of millions of different types depending upon the location and nature of the transaction. Supermarket transactions are similarly drawn from millions of possibilities. In such cases, the determination of patterns that indicate fraudulent activity or other kinds of classification behavior may become infeasible from a space-efficiency and computational-efficiency perspective. These problems are related not just to the massive-domain size, but also to the speed of the data stream. The problem of massive-domain size naturally occurs in the space of discrete attributes, whereas most known data stream classification methods are designed for the space of continuous attributes. The one-pass restrictions of data stream computation create a further challenge for the computational approach that may be used for discriminatory analysis. Thus, the massive-domain size creates challenges in terms of space requirements, whereas the stream model further restricts the classes of algorithms that may be used to create space-efficient methods. This is illustrated by considering the following types of classification models.
Techniques such as decision trees require the computation of the discriminatory power of each possible attribute value in order to determine how the splits should be constructed. In order to compare the relative behavior of different attribute values, the discriminatory power of different attribute values (or combinations of values) needs to be maintained. Therefore, performing the intermediate computations for such splits may not be practical in terms of either space or running time. Furthermore, the one-pass restriction on data stream computation makes such computation impossible.
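The dependence of split selection on per-value statistics can be sketched as follows. The `split_entropy` helper and the toy records are hypothetical illustrations; the point is that the histogram it maintains has one entry per distinct attribute value, which is exactly what becomes impractical in the massive-domain case:

```python
from collections import defaultdict
from math import log2

def split_entropy(records):
    """Score a candidate split on one attribute.

    records: iterable of (attribute_value, class_label) pairs.
    Returns the class-weighted entropy after splitting on the
    attribute (lower means more discriminatory).
    """
    # value -> class -> count; grows with the number of distinct values.
    hist = defaultdict(lambda: defaultdict(int))
    for value, label in records:
        hist[value][label] += 1
    total = sum(sum(counts.values()) for counts in hist.values())
    weighted = 0.0
    for counts in hist.values():
        n = sum(counts.values())
        ent = -sum((c / n) * log2(c / n) for c in counts.values() if c)
        weighted += (n / total) * ent
    return weighted

data = [("a", 0), ("a", 0), ("b", 1), ("b", 1), ("c", 0), ("c", 1)]
print(split_entropy(data))  # 1/3: only value "c" has an impure histogram
```

With a massive-domain attribute, the `hist` dictionary would need one class histogram per observed value, which is precisely the intermediate state that cannot be maintained under the space constraints described above.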
Techniques such as rule-based classifiers require the determination of combinations of attributes that are relevant to classification. In order to determine these combinations, the intermediate statistics for the candidate rules must be computed. With increasing domain size, it is no longer possible to do this efficiently in terms of either space or running time. Methods such as Bayes classifiers require the computation of probabilistic conditional estimates of class behavior over different combinations of attributes. With increasing domain size, the number of such combinations increases rapidly, and it is no longer possible to perform the computations effectively.
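The counting that even a simple Bayes classifier must perform can be sketched as follows. The `observe` helper and the two toy records are hypothetical; the point is that the conditional count table holds one counter per (attribute, value, class) triple observed, before any combinations of attributes are considered:

```python
from collections import defaultdict

# Minimal multinomial naive Bayes counting. Each training record adds
# one counter per attribute; with massive domains, the number of live
# counters tracks the number of distinct (attribute, value) pairs seen.
class_counts = defaultdict(int)
cond_counts = defaultdict(int)   # (attr_index, value, label) -> count

def observe(attributes, label):
    """Update class and conditional counts for one labeled record."""
    class_counts[label] += 1
    for i, value in enumerate(attributes):
        cond_counts[(i, value, label)] += 1

observe(("1.2.3.4", 80), "normal")
observe(("5.6.7.8", 443), "attack")
print(len(cond_counts))  # 4 counters after two 2-attribute records
```

Estimating class behavior over pairs or larger combinations of attributes multiplies the number of required counters, which is why the computation breaks down as the domain size grows.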
These implementation issues create challenges for classifiers even when the data is not presented in the form of a data stream. The stream setting is more restrictive still, because the one-pass constraint dictates the choice of data structures and algorithms that can be used for the classification problem. All known stream classifiers implicitly assume that the underlying domain size can be handled within modest main memory or storage limitations. One observation is that massive-domain data sets are often noisy, and most combinations of dimensions may have no relationship with the true class label. While the number of relevant combinations may be small enough to be stored within reasonable space limitations, the intermediate computations required to identify them may not be feasible from a space and time perspective. This is because the determination of the most discriminatory patterns requires intermediate computation of statistics for patterns that are not relevant. When combined with the one-pass constraint of data streams, this makes for a very challenging problem.
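As one illustration of the space tension described above, a count-min sketch is a standard hashed-counter structure that fits the one-pass model: it bounds storage at a fixed size regardless of domain size, at the cost of returning overestimates of the true counts. The implementation below is a generic sketch with illustrative width and depth parameters, not necessarily the method contemplated here:

```python
import random

class CountMinSketch:
    """Fixed-size approximate counter for one-pass streams.

    Space is width * depth counters, independent of the domain size;
    estimates are never below the true count.
    """

    def __init__(self, width=2048, depth=5, seed=42):
        rng = random.Random(seed)
        self.width = width
        # One salt per row gives depth independent-looking hash functions.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, salt in zip(self.table, self.salts):
            row[hash((salt, item)) % self.width] += count

    def estimate(self, item):
        # Minimum over rows limits the damage from hash collisions.
        return min(row[hash((salt, item)) % self.width]
                   for row, salt in zip(self.table, self.salts))

cms = CountMinSketch()
cms.add("1.2.3.4#attack", 3)
print(cms.estimate("1.2.3.4#attack"))  # 3: exact here, since the table is nearly empty
```

A structure of this kind can hold approximate discriminatory statistics for attribute values, or for combinations of values, in space that does not grow with the domain size, which is the direction the space constraints above point toward.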