A classification problem learns a rule for classifying samples into a predetermined plurality of classes from a set of samples each known to belong to one of the classes, and predicts the class to which a sample of unknown class belongs by using the learned rule as a prediction model. Among such problems, two-class classification, which classifies a sample set into two classes, is the most basic, and has long been used in structure-activity relationship research and structure-property relationship research; in recent years, it has been attracting attention as a useful technique for testing chemicals for toxicity, etc. Methods for learning rules, i.e., classification methods, include linear discriminant analysis methods, such as the linear learning machine, discriminant analysis, Bayes linear discriminant analysis, SVM (Support Vector Machine), and AdaBoost, and nonlinear discriminant analysis methods, such as Bayes nonlinear discriminant analysis, SVM with a kernel (Support Vector Machine+Kernel), neural networks, KNN (K-Nearest Neighbor), and decision trees.
Generally, in a classification problem, misclassification is unavoidable, and it is difficult to achieve a classification rate of 100%. The term “classification rate” is a measure that indicates how correctly samples whose classes are known have been classified, while “prediction rate” is a measure that indicates how correctly samples whose classes are not known have been classified. Basically, the “prediction rate” does not exceed the “classification rate.” Accordingly, if the “classification rate” is raised, the upper limit of the “prediction rate” rises with it, which means that raising the classification rate makes it possible to improve the prediction rate. Further, from the general characteristics of data analysis, it is well known that as the number of samples used to generate a prediction model increases, the number of misclassified samples also increases and, as a result, the classification rate drops. A misclassification is an instance in which a sample that belongs to class 1, for example, is wrongly classified as a sample belonging to class 2. The major reason for the drop is that as the total number of samples used increases, the absolute number of samples that cause noise in the classification also increases. Unlike statistical techniques, powerful data analysis techniques, such as multivariate analysis or pattern recognition, are susceptible to noise, and in most cases, increasing the number of samples ends up making the data analysis difficult.
As a field that requires high classification/prediction rates, the field of chemical toxicity evaluation is gaining importance from the standpoint of environmental protection. In this field, chemicals are often classified into two classes, a toxic chemical group (class 1) and a nontoxic chemical group (class 2). However, since the factors contributing to the manifestation of toxicity are invariably complex and diverse, misclassification can easily occur, and it is difficult to raise the classification rate with data analysis techniques of the current state of the art.
It should also be noted that, however high the classification rate, a large number of samples means a large number of misclassified samples. For example, when classifying toxic and nontoxic chemicals using a training set of 10000 chemicals, a classification rate of 90% would mean that 1000 chemicals are misclassified, a number large enough to become a problem. Further, in the field of toxicity classification, misclassifying a chemical having no toxicity as a chemical having toxicity (false positive) would not present a serious problem, but, because of the nature of the subject, it would be very dangerous to misclassify a chemical having toxicity as a chemical having no toxicity (false negative), and such a misclassification should be avoided by all means. From this point of view also, it is desirable that the classification rate be increased to 100%.
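The arithmetic in the example above, and the distinction between false positives and false negatives, can be illustrated with a minimal Python sketch. The function names and the toy label lists are hypothetical and serve only as an illustration; the labels follow the convention used here, in which class 1 is the toxic group and class 2 is the nontoxic group.

```python
from typing import Sequence, Tuple


def misclassified_count(n_samples: int, classification_rate: float) -> int:
    """Number of misclassified samples implied by a classification rate."""
    return round(n_samples * (1.0 - classification_rate))


def count_errors(true_labels: Sequence[int],
                 predicted: Sequence[int]) -> Tuple[int, int]:
    """Return (false_positives, false_negatives).

    Assumed convention: class 1 = toxic, class 2 = nontoxic.
    False positive: a nontoxic chemical classified as toxic.
    False negative: a toxic chemical classified as nontoxic.
    """
    pairs = list(zip(true_labels, predicted))
    false_pos = sum(1 for t, p in pairs if t == 2 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 2)
    return false_pos, false_neg


# 10000 chemicals at a 90% classification rate
print(misclassified_count(10000, 0.90))  # 1000

# Toy labels (illustrative only)
true_labels = [1, 1, 2, 2, 1]
predicted = [1, 2, 2, 1, 1]
print(count_errors(true_labels, predicted))  # (1, 1)
```

Counting the two error types separately reflects the point made above that, in toxicity classification, a false negative is far more serious than a false positive.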
While increasing the prediction rate is the final target of a classification problem, it is now recognized that increasing the classification rate is of utmost concern, and various efforts have been made for this purpose. As noted earlier, since the prediction rate does not exceed the classification rate, raising the classification rate raises the upper limit of the prediction rate. Noting this point, the present inventor has proposed a classification method that can achieve a classification rate as close as possible to 100%, i.e., the “K-step Yard sampling method” (hereinafter referred to as the KY method) (Non-patent document 1, PCT/JP-2007/056412).
To briefly describe this method, first a training sample set is constructed using samples known to belong to a first class and samples known to belong to a second class. Then, by performing discriminant analysis on the training sample set, two discriminant functions are generated: a first discriminant function (hereinafter called the AP model) that achieves a high classification rate, for example, a classification rate of substantially 100%, for the first class, and a second discriminant function (hereinafter called the AN model) that achieves a high classification rate, for example, a classification rate of substantially 100%, for the second class. Next, the objective variables of each sample are calculated using the two discriminant functions, the AP model and the AN model, and the samples are divided into those for which the values of the objective variables, i.e., the classification results, match between the two discriminant functions and those for which the results do not match.
Since the AP and AN models provide a classification rate of nearly 100% for the first and second classes, respectively, any sample whose classification results match between the AP and AN models is identified as a correctly classified sample. Accordingly, any such sample is assigned to class 1 or class 2, whichever class the two models agree on. On the other hand, any sample whose classification results do not match between the AP and AN models is assigned to a gray class, i.e., a third class where no class determination is made.
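The assignment rule for a single stage described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the predictions of the AP and AN models are taken as given lists of class labels (how the two discriminant functions are generated is not shown), and the names `GRAY`, `assign_classes`, and `gray_indices` are hypothetical.

```python
from typing import List, Sequence

GRAY = 0  # hypothetical label for the gray class (no class determination)


def assign_classes(ap_pred: Sequence[int], an_pred: Sequence[int]) -> List[int]:
    """Assign each sample to class 1, class 2, or the gray class.

    ap_pred and an_pred hold the class (1 or 2) predicted for each sample
    by the AP model and the AN model, respectively. A sample keeps the
    predicted class when both models agree and goes to GRAY otherwise.
    """
    if len(ap_pred) != len(an_pred):
        raise ValueError("AP and AN predictions must cover the same samples")
    return [a if a == b else GRAY for a, b in zip(ap_pred, an_pred)]


def gray_indices(assigned: Sequence[int]) -> List[int]:
    """Indices of the samples that were assigned to the gray class."""
    return [i for i, c in enumerate(assigned) if c == GRAY]


# Toy predictions for five samples (illustrative values only)
ap_pred = [1, 1, 2, 1, 2]
an_pred = [1, 2, 2, 1, 1]

assigned = assign_classes(ap_pred, an_pred)
print(assigned)                # [1, 0, 2, 1, 0]
print(gray_indices(assigned))  # [1, 4]
```

In this toy input, the second and fifth samples receive different results from the two models and are therefore left undetermined in the gray class.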
When the gray class in the first stage is thus formed, the samples assigned to the gray class are grouped together to form a new sample set. Then, the AP model and the AN model are newly generated for this sample set, and the samples are classified in the same manner as described above. As a result, the gray class in the second stage is formed; thereafter, the gray class in the third stage, the gray class in the fourth stage, etc., are formed in a similar manner. By repeating the gray class formation until the number of samples assigned to the gray class finally decreases to zero, all the samples can be correctly classified into the first and second classes, respectively. That is, a classification rate of 100% is achieved.
Non-patent document 1: “Development of K-step Yard Sampling Method and its Application to ADME-T Predictions,” 34th Structure-Activity Correlation Symposium, November 2006
Non-patent document 2: “Chemical Data Analysis Techniques by Tailor-Made Modeling,” 30th Symposium on Chemical Information, November 2007