Many activities in fields ranging from finance to medicine require evaluating a risk and making a decision: whether to make an investment, to begin treating a patient for a given pathology, to judge a document relevant, to reject a product as defective, or simply to choose between two or more possibilities.
A family of computer tools known as “decision support systems” aims to assist decision-making by analysing massive amounts of data. For example, given a large volume of financial data, such a system tries to establish whether an investment should be made.
Methods of supervised classification, located at the interface between mathematics and computer science, classify the cases to be tested by assigning each of them a binary label, such as “good payer” versus “insolvent payer”, “relevant document” versus “irrelevant document”, “healthy” versus “ill” patient, or “conforming” versus “non-conforming” product. However, such methods only attempt to predict the most likely label for a case. The risk of making a mistake is only somewhat reduced.
Methods of scoring/ranking take another approach: their aim is to learn how to sort/order the cases, so that a large number of “good payers” or “relevant documents” appear at the top of the list with maximum likelihood.
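As a rough illustration of what "good payers at the top of the list" means, the sketch below compares two orderings of the same cases by the fraction of positive cases among the first k entries. The function name `top_k_positive_rate`, the case identifiers and the labels are hypothetical and chosen only for this example; how the orderings are learned is not addressed here.

```python
from typing import List, Tuple


def top_k_positive_rate(ranked: List[Tuple[str, int]], k: int) -> float:
    """Fraction of positive cases (label 1) among the first k entries of a ranked list."""
    if not 0 < k <= len(ranked):
        raise ValueError("k must be between 1 and the list length")
    return sum(label for _, label in ranked[:k]) / k


# Hypothetical orderings of the same six cases; 1 = "good payer", 0 = "insolvent payer".
good_order = [("a", 1), ("b", 1), ("c", 1), ("d", 0), ("e", 0), ("f", 0)]
poor_order = [("d", 0), ("a", 1), ("e", 0), ("b", 1), ("f", 0), ("c", 1)]

good_rate = top_k_positive_rate(good_order, 3)
poor_rate = top_k_positive_rate(poor_order, 3)
print(good_rate, poor_rate)
```

An ordering that concentrates the positive cases at the head of the list yields a higher rate for small k, which is the property a ranking method tries to obtain.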
The most natural way of ordering multivariate observations is to use a scoring function, which assigns to each observed case (new or not) a numerical value, a “score”, interpretable as a degree of risk or relevance. The quality of this scoring function is usually assessed using a functional criterion, namely a Receiver Operating Characteristic (ROC) curve, which gives a very visual display of the true positive rate plotted against the false positive rate. The Precision-Recall curve, widely used in information retrieval, is a similar way of visualizing the results of discrimination when the proportion of positively labelled data is very high or very low.
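The ROC curve described above can be sketched from a list of scores and binary labels. The following is a minimal illustration, assuming label 1 marks a positive case and higher scores should rank first; the helper names `roc_points` and `auc` and the sample data are assumptions made for this example only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """Return (false positive rate, true positive rate) pairs, one per distinct score."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    # Sort by decreasing score: lowering the threshold admits cases from the top down.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Cases with tied scores are admitted together and produce a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Point]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores and labels for six cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve, area)
```

In this sample, eight of the nine positive/negative pairs are ordered correctly, so the area under the curve is 8/9. A scoring function that ranked every positive case above every negative one would give an area of 1.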
While many powerful algorithms (CART, SVM, boosting, random forests) for supervised classification of high-dimensional data are already known, no algorithm of the “machine-learning” type has so far been dedicated to the problem of scoring. In most applications, the approach is to use conventional statistical methods, such as (linear) logistic regression or linear discriminant analysis, which rely primarily on manual data preprocessing, such as modelling the distribution of the observations, truncating information, etc. These approaches also face the problem caused by the exponential increase in data volume associated with adding extra dimensions (the “curse of dimensionality”). Besides, binary classification methods sometimes incorrectly use the “distance to the margin” as a score.
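To make the last remark concrete, the sketch below shows what using the "distance to the margin" as a score can look like for a linear classifier: the signed distance of a case to the separating hyperplane is reused to order the cases. The weights `w`, the offset `b` and the sample cases are hypothetical, and the code only illustrates the practice the text criticizes; it is not a recommended scoring method.

```python
import math
from typing import Dict, List, Sequence


def signed_margin_distance(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("w must be a non-zero vector")
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical linear classifier and cases (assumed values, not taken from the text).
w = [3.0, 4.0]
b = -5.0
cases: Dict[str, List[float]] = {"p": [3.0, 4.0], "q": [1.0, 0.5], "r": [0.0, 0.0]}

distances = {name: signed_margin_distance(x, w, b) for name, x in cases.items()}
# Ordering by distance treats a quantity tuned for separating two classes as a ranking score.
ranking = sorted(distances, key=distances.get, reverse=True)
print(distances, ranking)
```

The classifier is fitted to place the boundary well, so the ordering it induces among cases far from that boundary is a by-product rather than something the method optimizes.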
There is consequently a need for an improved method of scoring based on machine learning.