The present disclosure relates generally to data stream classification, and, in particular, to a system and method for resource adaptive classification of data streams.
In recent years, advances in hardware technology have allowed for the automatic and continuous collection of large amounts of data. These continuously growing data sets are referred to as data streams. Data mining is the process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to form a prediction or classification. A data-mining problem is that of classification. The “classification problem” is one in which a large data set (i.e., a training set), consisting of many examples, must be classified. The objective of classification is to develop a classifier based on the examples in the training set. The classification problem has also been widely studied in the context of data streams.
The classification problem faces a number of unique problems in the case of data streams that can be classified in high dimensions because of the exponential number of attribute combinations that can be related to the class variable. In such cases, the large number of potential combinations of attributes creates a natural tradeoff between model incompleteness and computational requirements. For example, each path in a decision tree represents a local subspace for classification purposes. While classifying a test instance, an incorrect decision at a higher level of the tree could lead to a path that defines a poor choice of subspace. The number of possible decision trees varies exponentially with data dimensionality, and each tree may be better suited to a different locality of the data. Many specific characteristics of the test instance cannot be captured during the pre-processing phase on the training data. Therefore, the model is incomplete. When considering computational requirements, a natural solution to this problem is to build multiple decision trees, and construct forests for classification purposes. Often, more robust classifiers are obtained by using majority voting over many groups of decision trees. However, with increasing dimensionality the (time and space) scalability required in the number of trees becomes unmanageable. Furthermore, if the data stream evolves, such a system may significantly degrade for classification purposes.
Similar problems are encountered with the use of rule-based classifiers, typically in the form: (p1. . . pn)q, lazy learning methods, and instance specific learning with nearest neighbor classifiers all of which do not scale well and are usually not designed to optimize the discovery of any subspace of the data in high dimensional cases.