In data mining, supervised learning is a collection of techniques that are used to build a model from a given set of records, known as a training set, whose class values are known a priori. Once the model is built, it is tested against another set of records with known class values, known as a test set, in order to quantify the quality of the model. It is then used to predict (or score) unknown class values of real-world records. This last stage where the model is used for prediction is termed apply. The traditional applications of such supervised learning techniques include retail target marketing, medical diagnosis, weather prediction, credit approval, customer segmentation, and fraud detection. Based on the application, it is required that the result of the apply operation contain various class values and their probabilities, as well as some attributes that can be used to characterize the input records.
In conventional data mining systems, a user can describe the result of apply operation in the form of a database table. The user is allowed to specify the columns of the table. The output columns include the predicted class value and its probability, and source attributes of the input data. Typically, such conventional data mining systems allow the user to select only the class value with the highest probability. In real-world applications, however, the data miner may want to get several class values and their associated probabilities. For example, a need may arise for a recommendation engine to choose 10 items with the highest probabilities in order to provide the customers with appropriate recommendations. Thus, a need arises for a data mining system that provides a multi-category apply operation that produces output with multiple class values and their associated probabilities.
This technique can also be used for unsupervised models such as clustering models. Clustering analysis identifies clusters embedded in the data where a cluster is a collection of records in the data that are similar to one another. Once clusters are identified from a given set of records, one can get predictions for new records on which cluster each record is likely to belong. Such predictions may be associated with probability, the quality of fit, which describes how well a given record fits in the predicted cluster, and the distance from the center of the predicted cluster.