This disclosure relates generally to data mining using support vector machines.
Support vector machines are useful in providing input to identify trends in existing data and to classify new sets of data for analysis. Generally, a support vector machine can be visualized by plotting data in an n-dimensional space, where n is the number of attributes associated with the item to be classified. However, given a large number of attributes and a large volume of training data, support vector machines can be processor-intensive.
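By way of illustration only, the following minimal sketch shows how an item having n attributes may be treated as a point in an n-dimensional space and classified according to the side of a separating hyperplane on which it falls. The function name `linear_decision`, the weights `w`, the offset `b`, and the sample item are hypothetical values chosen for this example; they are not the output of any training procedure described herein.

```python
from typing import Sequence


def linear_decision(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 according to the side of the hyperplane w.x + b = 0 on which x lies."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of attributes")
    # Dot product of the weight vector and the item's attribute vector, plus the offset.
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical hyperplane and item with n = 3 attributes (illustrative values only).
w = [0.5, -1.0, 2.0]
b = -0.25
item = [1.0, 0.0, 0.5]
label = linear_decision(w, b, item)
```

Evaluating the decision for one item requires work proportional to n, which suggests why a large number of attributes combined with a large volume of training data can become costly.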
Recently, analysts have developed an algorithm known as “Random Forests.” “Random Forests” uses decision trees to classify data. A single decision tree modeled on a large amount of data can be difficult to parse, and hence its classification accuracy is limited. Thus, “Random Forests” utilizes a bootstrap aggregating (bagging) algorithm to randomly generate multiple bootstrapping datasets from a training dataset, and a decision tree is then modeled on each bootstrapping dataset. When each decision tree is modeled, a small fraction of the attributes is randomly selected at each node to determine the split. Because all attributes need to be available for random selection, the whole bootstrapping dataset must be held in memory. Moreover, “Random Forests” has difficulty working with sparse data (e.g., data that contains many zeroes). For example, a dataset formatted as a matrix, with rows as samples and columns as attributes, has to be loaded into memory in its entirety, including every cell whose value is zero. Thus, given a large and sparse dataset, “Random Forests” is space-consuming and, because it models the entire data matrix, also time-consuming. Nor can the modeling be efficiently parallelized on a distributed system such as a computer cluster, because it is time-consuming to transfer a whole bootstrapping dataset between different computer nodes.
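By way of illustration only, the following minimal sketch shows the two sources of randomness described above (sampling rows with replacement to form a bootstrapping dataset, and selecting a small random subset of attributes at a node) and contrasts the number of cells held by a dense matrix with the number of nonzero cells it contains. The helper names `bootstrap_indices`, `candidate_attributes`, and `to_sparse`, the 4-by-4 sample matrix, and the dictionary-based sparse representation are assumptions introduced for this example; the sparse representation is shown only for contrast and is not part of “Random Forests” as described above.

```python
import random
from typing import Dict, List, Tuple


def bootstrap_indices(n_samples: int, rng: random.Random) -> List[int]:
    """Draw n_samples row indices with replacement (one bootstrapping dataset)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_attributes(n_attributes: int, k: int, rng: random.Random) -> List[int]:
    """Randomly pick k distinct attribute indices to consider for the split at one node."""
    return rng.sample(range(n_attributes), k)


def to_sparse(matrix: List[List[float]]) -> Dict[Tuple[int, int], float]:
    """Keep only the nonzero cells, keyed by (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


rng = random.Random(0)  # fixed seed so that repeated runs draw the same samples

# Hypothetical sparse dataset: rows are samples, columns are attributes.
data = [
    [0, 0, 3, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 0],
]

rows = bootstrap_indices(len(data), rng)                    # rows of one bootstrapping dataset
attrs = candidate_attributes(n_attributes=4, k=2, rng=rng)  # attributes tried at one node

dense_cells = sum(len(row) for row in data)  # every cell is stored, including zeroes
sparse_cells = len(to_sparse(data))          # only the nonzero cells are stored
```

In this small example the dense matrix holds 16 cells although only 3 are nonzero. Because any attribute may be drawn by `candidate_attributes` at any node, every column of the bootstrapping dataset has to remain accessible throughout modeling.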