This specification relates to data processing techniques such as data mining.
Data mining is used, for example, to identify attributes of a dataset that are indicative of a particular result and to predict future results based on the identified attributes. As the number of records in a dataset increase, combinations of attributes and attribute values may be used to predict future results. Therefore, the combinations of attributes and attribute values that are indicative of future results can become more complex, such that machine learning techniques may be used to identify combinations of attributes and attribute values that facilitate computation of predicted results.
Machine learning algorithms generate models based on the combinations of attribute values that are identified to be indicative of a specified result. For example, support vector machines generate a hyperplane that maximizes the distance between data records that are identified as belonging to a specified category and data records that are identified as not belonging to the category. The resulting hyperplane can be used to generate a model including a vector of weights, where each weight corresponds to attributes of the data records. In turn, dot products of the vector of weights and attribute values of input data records are used to classify the input data records as belonging or not belonging to the category for which the model was generated.
Decision trees are another machine learning algorithm that can be used to classify data records. Decision trees group a dataset into subsets of records based on the attributes and corresponding attribute values of the dataset. The full dataset and each of the subsets of records are represented by nodes. The nodes representing the dataset and each of the subsets of records can be connected by links that are referred to as branches. The nodes are connected in a hierarchical manner such that a predicted result (e.g., predicted classification) for input data is computed based on sets of rules (i.e., split-points) that define the sets of records that are represented by the nodes. In turn, a node that represents records that have attribute values similar to the input data can be identified from the decision tree and the result corresponding to the identified node can be defined as the predicted result for the input data. Thus, input data records can be classified by identifying nodes that represent the input data records and classifying the data records based on the predicted result corresponding to the node.
Multiple models can be trained to classify a data record based on different input attribute values. Similarly, multiple models can be trained to classify the data record to different categories based on the same and/or different sets of input attribute values. Therefore, multiple classifications may exist for a single data record when multiple models are used to classify the data record.