Decision trees and random forests are classification tools used to categorize data. These trees may classify dataset records and/or predict consequences or events. For example, a decision tree may consider attributes such as temperature, humidity, outlook, and windy to determine whether a given combination of those four attribute values is favorable for a round of golf.
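The golf example can be sketched in a few lines of Python. This is a minimal illustration only: the tree below is hand-built rather than learned from data, and its shape, the attribute values ("sunny", "high", and so on), and the names `GOLF_TREE` and `classify` are assumptions made for this sketch. Temperature appears in the record but is not tested by this particular tree.

```python
# Hypothetical, hand-built tree for illustration; it is not learned from data.
# Internal nodes are (attribute, {value: subtree}) tuples; leaves are booleans
# meaning "favorable for golf" (True) or "not favorable" (False).
GOLF_TREE = ("outlook", {
    "sunny": ("humidity", {"high": False, "normal": True}),
    "overcast": True,
    "rainy": ("windy", {True: False, False: True}),
})


def classify(tree, record):
    """Walk from the root to a leaf, following the record's attribute values."""
    node = tree
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


# One dataset record with the four attributes named in the text.
record = {
    "outlook": "sunny",
    "temperature": "hot",
    "humidity": "high",
    "windy": False,
}
print(classify(GOLF_TREE, record))
```

Classifying a record touches only the attributes on one root-to-leaf path, so the per-record cost depends on the depth of the tree rather than on the number of attributes.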
Traditionally, the amount of memory and processing power required to build or train a decision tree is proportional to the size of the training data and/or the resulting tree. As data size increases, so do the required resources. When dealing with big data, such as a large database, this proportional growth may present significant concerns, since training large trees may become cost-prohibitive. Further, training algorithms may not scale efficiently on a distributed architecture, such as a massively parallel processing (MPP), shared-nothing database.
Additionally, classifying data may be similarly cost-prohibitive. Classifying the dataset may require both the dataset and the decision tree to be loaded into memory, and memory requirements may therefore be proportional to the size of the training dataset and/or the trained tree. Like training algorithms, classification algorithms may not scale efficiently on a distributed architecture.
There is a need, therefore, for an improved method, system, and process for building decision trees and classifying datasets on a distributed architecture.