A typical example of a classification problem is character recognition. Data analysis for character recognition has reached an extremely high level, and recognition rates of 99% or higher can generally be achieved. By contrast, in two-class classification problems for evaluating chemical toxicity (hereinafter, safety), which have attracted attention in recent years because of environmental concerns and the like, the achievable classification rate and prediction rate are much lower than in character recognition. The best achievable classification rate is usually said to be around 80% to 90%, and the limit of the prediction rate is said to lie in the range of 70% to 80%. The major reasons are that the factors contributing to the manifestation of chemical toxicity are complex and diverse, and that the structural variety among chemicals is enormous. Nevertheless, the safety evaluation of chemicals is an extremely important problem: if a classification or prediction is wrong, and in particular if a toxic chemical is wrongly classified or predicted as safe, the impact on society is serious. It is therefore strongly desired to improve the accuracy of classification and prediction in the safety evaluation of chemicals.
For these reasons, increasing the classification rate is now recognized as an issue of utmost importance in chemical classification problems, and considerable effort has been devoted to it. The present inventor has proposed a method that can achieve classification rates as close as possible to 100%, namely the “K-step Yard sampling method” (hereinafter referred to as the KY method) (refer to patent document 1 and non-patent document 1).
In this method, a training sample set is first divided by discriminant analysis into two groups: a group of samples each of which is clearly known to belong to class 1 or class 2, and a group of samples (gray class samples) for which the class to which each belongs is not clearly known. Each sample in the group whose classes are clearly known is assigned to the class determined by the result of the discriminant analysis. The gray class samples, on the other hand, are used to construct a new training sample set, on which discriminant analysis is performed once again. By repeating this process until the number of gray class samples decreases to zero, the classification rate is finally brought to nearly 100%.
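The staged procedure described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of the cited documents. Two parts are assumptions made only for this sketch: the discriminant is a simple projection onto the difference of the class centroids, and the gray zone of a stage is taken to be the range of discriminant scores spanned by the misclassified samples (extended to include zero). The names `fit_discriminant`, `score`, and `ky_train` are hypothetical.

```python
from typing import List, Optional, Tuple

Vector = List[float]
# One stage: (weights, bias, gray zone (low, high), or None if nothing was misclassified)
Stage = Tuple[Vector, float, Optional[Tuple[float, float]]]


def fit_discriminant(X: List[Vector], y: List[int]) -> Tuple[Vector, float]:
    """Stand-in linear discriminant (assumption): centroid difference, midpoint threshold."""
    dim = len(X[0])
    c1 = [x for x, t in zip(X, y) if t == 1]
    c2 = [x for x, t in zip(X, y) if t == 2]
    mean1 = [sum(x[j] for x in c1) / len(c1) for j in range(dim)]
    mean2 = [sum(x[j] for x in c2) / len(c2) for j in range(dim)]
    w = [a - b for a, b in zip(mean1, mean2)]
    mid = [(a + b) / 2 for a, b in zip(mean1, mean2)]
    bias = -sum(wj * mj for wj, mj in zip(w, mid))
    return w, bias


def score(w: Vector, bias: float, x: Vector) -> float:
    """Discriminant score: positive means class 1, otherwise class 2."""
    return sum(wj * xj for wj, xj in zip(w, x)) + bias


def ky_train(X: List[Vector], y: List[int], max_stages: int = 20) -> List[Stage]:
    """Repeat discriminant analysis on the gray class samples, one stage per pass."""
    stages: List[Stage] = []
    Xs, ys = list(X), list(y)
    while Xs and len(set(ys)) == 2 and len(stages) < max_stages:
        w, bias = fit_discriminant(Xs, ys)
        scores = [score(w, bias, x) for x in Xs]
        wrong = [s for s, t in zip(scores, ys) if (1 if s > 0 else 2) != t]
        if not wrong:
            # No gray class samples remain: this stage classifies everything.
            stages.append((w, bias, None))
            break
        # Assumed gray-zone rule: score range covered by the misclassified samples.
        low, high = min(wrong + [0.0]), max(wrong + [0.0])
        gray = [i for i, s in enumerate(scores) if low <= s <= high]
        stages.append((w, bias, (low, high)))
        if len(gray) == len(Xs):
            break  # no sample left the gray zone, so stop instead of looping
        # The gray class samples form the training sample set of the next stage.
        Xs = [Xs[i] for i in gray]
        ys = [ys[i] for i in gray]
    return stages


# Toy data (invented for illustration): the first feature separates most samples,
# the second feature separates the two that land in the gray zone.
X_demo = [[3, 0], [4, 0], [5, 0], [-0.5, 1], [-3, 0], [-4, 0], [-5, 0], [0.5, -1]]
y_demo = [1, 1, 1, 1, 2, 2, 2, 2]
demo_stages = ky_train(X_demo, y_demo)
print(len(demo_stages), "stages; gray zones:", [g for _, _, g in demo_stages])
```

Each stage keeps its own discriminant together with its gray zone, so the sequence of stages can later be replayed on a sample whose class is unknown. The `max_stages` limit and the no-progress check are safeguards added for the sketch, since nothing in the simplified gray-zone rule guarantees that the gray class shrinks at every stage.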
The plurality of discriminant functions obtained by performing the KY method are used together as a prediction model for predicting the classes of samples for which the property to be classified is unknown. Since this prediction model achieves a classification rate of nearly 100% on the training sample set, a high prediction rate can be expected.
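As an illustration of how a plurality of discriminant functions might be applied in sequence, the following Python sketch passes an unknown sample through the stages in order and assigns a class at the first stage where its score falls outside that stage's gray zone. The stage representation `(weights, bias, gray_zone)`, the function name `predict_staged`, and the numeric values are hypothetical and chosen only for the example; the cited documents may define the gray zone and the stage order differently.

```python
from typing import List, Optional, Tuple

# Hypothetical stage layout: (weights, bias, gray zone (low, high) or None)
Stage = Tuple[List[float], float, Optional[Tuple[float, float]]]


def predict_staged(stages: List[Stage], x: List[float]) -> Optional[int]:
    """Return 1 or 2, or None if the sample stays in the gray zone at every stage."""
    for w, bias, gray in stages:
        s = sum(wj * xj for wj, xj in zip(w, x)) + bias
        in_gray = gray is not None and gray[0] <= s <= gray[1]
        if not in_gray:
            # Clearly classified at this stage: positive score means class 1.
            return 1 if s > 0 else 2
    return None


# Two illustrative stages (values invented for this example).
example_stages: List[Stage] = [
    ([5.75, 0.5], 0.0, (-2.375, 2.375)),  # stage 1 with a gray zone
    ([-1.0, 2.0], 0.0, None),             # final stage, no gray zone
]

for sample in ([4.0, 0.0], [-3.0, 0.0], [-0.5, 1.0], [0.5, -1.0]):
    print(sample, "->", predict_staged(example_stages, sample))
```

The first two samples are decided at stage 1, while the last two fall inside the stage 1 gray zone and are decided by the second discriminant function.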
In recent years, a regulation referred to as REACH has entered into force in the EU, and it is expected that a large amount of data on chemical toxicities will be accumulated as its implementation proceeds. Usually, a prediction model is generated by gathering samples whose property values to be predicted are known, i.e., whose dependent variable values are known, and by performing data analysis on the training sample set constructed using these known samples. The larger the number of samples contained in the training sample set, the higher the reliability of the generated prediction model. Therefore, when new data usable as training samples are accumulated after generating the prediction model, it is desirable to generate a new prediction model using a new training sample set constructed by adding the new data.
However, doing so requires the prediction model to be updated periodically, which takes considerable labor and cost. When a prediction model is constructed by the above-described KY method, the training sample set has to be classified by discriminant analysis in many stages, so generating the prediction model from a single training sample set takes far more labor and expense than the conventional method. Accordingly, if predictions about unknown samples can be made without relying on a prediction model generated from a fixed training sample set, the classes of the unknown samples can be determined very efficiently. Further, in that case, a higher prediction rate can be expected, because the classes of the unknown samples are predicted from a training sample set that is constantly kept up to date by adding new data.
Patent document 1: WO2008/059624
Non-patent document 1: “Development of K-step Yard Sampling Method and its Application to ADME-T Predictions,” 34th Structure-Activity Relationships Symposium, November 2006