Various embodiments of this disclosure relate to decision trees and, more particularly, to scalable streaming decision tree learning.
Many applications require the processing of big data, which may be at rest or in motion. Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. When the data is in motion, it may need to be processed in real time. One mechanism for processing data is through the use of decision trees.
When a decision tree is used, data can be classified by stepping through the nodes of the tree based on known attributes of the data. At each node, a child node is selected based on the value of an attribute of the data, and this selection process may continue until a leaf of the tree is selected. The leaf may be associated with a value to be assigned as a classification value for the data.
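The traversal described above can be sketched as follows. This is a minimal illustration only: the `Node` class, the `classify` function, and the weather-style attribute names are hypothetical and are not taken from any particular embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    # Hypothetical node layout: interior nodes set `attribute` and `children`,
    # leaves set only `label`.
    attribute: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    label: Optional[str] = None


def classify(node: Node, record: Dict[str, str]) -> str:
    """Step from the root to a leaf, choosing a child by attribute value."""
    while node.label is None:
        # Raises KeyError if the record holds a value the tree has no edge for.
        node = node.children[record[node.attribute]]
    return node.label


# Illustrative tree: test "outlook" first, then "windy" on the sunny branch.
tree = Node(attribute="outlook", children={
    "sunny": Node(attribute="windy", children={
        "yes": Node(label="stay in"),
        "no": Node(label="play"),
    }),
    "rainy": Node(label="stay in"),
})

result = classify(tree, {"outlook": "sunny", "windy": "no"})  # "play"
```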
Decision tree learning is a form of classification learning, used to determine how to classify data. In decision tree learning, a system generates a decision tree that will be used to classify data based on observed attributes. Generation of the decision tree is iterative: at each interior node, the training data reaching that node is split into subsets, with a child node at the root of each subset, based on the value of an attribute associated with that interior node. Each edge leading from the interior node to a child node corresponds to a particular value of the attribute. If no more attributes are available, a node becomes a leaf node in the final decision tree, corresponding to a classification or prediction of a final value for the data.
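One way such a learning procedure might look is sketched below. The text above does not say how the attribute for each interior node is chosen, so this sketch assumes an entropy-based (information gain) criterion in the style of ID3; the function names, the dictionary tree layout, and the sample data are likewise assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def build_tree(records, labels, attributes):
    # Leaf: every label agrees, or no attributes remain. Use the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def remainder(attr):
        # Weighted entropy left after splitting on `attr` (assumed criterion).
        groups = {}
        for rec, lab in zip(records, labels):
            groups.setdefault(rec[attr], []).append(lab)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    # Lowest remaining entropy is equivalent to highest information gain.
    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]

    # One subset, and therefore one child, per observed value of `best`.
    subsets = {}
    for rec, lab in zip(records, labels):
        recs, labs = subsets.setdefault(rec[best], ([], []))
        recs.append(rec)
        labs.append(lab)

    return {"attribute": best,
            "children": {value: build_tree(recs, labs, rest)
                         for value, (recs, labs) in subsets.items()}}


# Illustrative training data.
records = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay in", "stay in"]

learned = build_tree(records, labels, ["outlook", "windy"])
```

In this sample, "outlook" separates the classes perfectly, so the sketch places it at the root and both children become leaves.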
In some systems, parallelism is used to speed up decision tree learning where a large amount of training data is being used as input into generating the decision tree. Specifically, either horizontal or vertical parallelism is used. The data generally includes multiple records, with each record containing multiple attributes, or columns. With vertical parallelism, the set of attributes is divided among available processing elements. In other words, each processing element receives data from multiple records, but the data received by each processing element includes only a subset of the existing attributes. With horizontal parallelism, the set of records is divided among available processing elements. In this case, each processing element receives data from one or more records, including every attribute for the subset of records assigned to that processing element.
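The difference between the two partitioning schemes can be illustrated with a short sketch. The round-robin assignment, the function names, and the sample records below are assumptions chosen for brevity; an actual system could divide attributes or records in any other way.

```python
def vertical_partition(records, attributes, num_workers):
    """Each worker sees every record, restricted to its share of the attributes."""
    shares = [attributes[i::num_workers] for i in range(num_workers)]
    return [[{a: rec[a] for a in share} for rec in records] for share in shares]


def horizontal_partition(records, num_workers):
    """Each worker sees a share of the records, with every attribute intact."""
    return [records[i::num_workers] for i in range(num_workers)]


# Illustrative data: four records with three attributes each.
records = [
    {"age": "young", "income": "high", "owner": "yes"},
    {"age": "old", "income": "low", "owner": "no"},
    {"age": "young", "income": "low", "owner": "no"},
    {"age": "old", "income": "high", "owner": "yes"},
]
attributes = ["age", "income", "owner"]

vertical = vertical_partition(records, attributes, 2)
horizontal = horizontal_partition(records, 2)
```

With two workers, the vertical split gives the first worker the "age" and "owner" columns of all four records and the second worker the "income" column, while the horizontal split gives each worker two complete records.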
Each processing element operates on the data assigned to it. The results of these operations are then aggregated together to complete generation of the decision tree.
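As a rough sketch of the aggregation step under horizontal parallelism, assume that each processing element reports, for its own records, how often each attribute value co-occurs with each class label. The text above does not specify what the per-element results are, so these counts, the function names, and the sample shards are all assumptions. Summing the partial counts then yields the same statistics that a single element would compute over the full data set.

```python
from collections import Counter


def local_counts(records, labels):
    """Per-worker statistic (assumed): counts of (attribute, value, label)."""
    counts = Counter()
    for rec, lab in zip(records, labels):
        for attr, value in rec.items():
            counts[(attr, value, lab)] += 1
    return counts


def aggregate(partials):
    """Combine the workers' partial counts by summing them key by key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two illustrative shards, as if produced by a horizontal split.
shard_a = ([{"outlook": "sunny"}, {"outlook": "rainy"}], ["play", "stay in"])
shard_b = ([{"outlook": "sunny"}, {"outlook": "sunny"}], ["play", "stay in"])

combined = aggregate([local_counts(*shard_a), local_counts(*shard_b)])
```

Because the counts are additive, the order in which partial results arrive does not affect the aggregate in this sketch.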