In modern society, predicting events in various fields, or predicting the properties of substances, on the basis of data analysis is becoming increasingly important. In particular, with growing awareness of worldwide environmental issues, and from the standpoint of animal welfare, expectations are rising for the practical implementation of techniques that predict the safety (toxicity) of chemical compounds by making full use of IT (Information Technology). However, compound safety (toxicity) prediction is known to be a field in which high prediction rates are very difficult to achieve. It is also a field that demands particularly high prediction rates, because a failed prediction could have serious consequences, not only biologically but also ecologically (environmentally).
A brief description will now be given of the importance of compound safety prediction and of the special nature of this field. The application field of safety prediction is extensive. In pharmaceutical related fields, safety prediction has generally been performed to predict the toxicity of pharmaceutical products and their side effects. The target of safety prediction in this field is the human being (biotoxicity), and if the prediction fails, many people could suffer serious or fatal side effects. In this respect, safety prediction in this field, unlike prediction tasks in other fields, demands particularly strict prediction accuracy.
In recent years, interest has grown in safety prediction in environment related fields. This is because chemical compounds are substances (biotoxic substances) that greatly affect not only human beings but also the environment, i.e., the ecosystem and all types of life that depend on it. Governmental regulations on the environmental safety of chemical compounds are expected to become more stringent than ever around the world.
For example, in the EU, a new regulation named REACH entered into force in June 2007. According to this regulation, every company using a chemical is obliged to evaluate the safety of that chemical and register the result of the evaluation; the scope of the regulation is extended to cover more than 30,000 kinds of existing chemicals (including chemicals whose production has empirically been approved without evaluating their safety). REACH is an unprecedentedly strict regulation as it is applied not only to the manufacturers of chemicals but also to the companies that use manufactured chemicals. Company activities in the EU would not be possible without meeting the requirements set by the regulation.
As noted above, the trend is toward strict restrictions on animal testing of chemicals, and sooner or later animal testing for the development of new drugs will be banned. In the EU, a ban on skin-related animal tests is already scheduled to commence in 2011. In the environmental toxicity evaluation stipulated in the REACH regulation, the number of chemicals to be evaluated is incomparably larger than the number handled in the development of a new drug, so that there arises not only the problem of animal testing but also the problem of the time and cost needed to conduct experiments. The use of IT as an ultra high-speed screening technique that substitutes for experiments is therefore an important issue in reducing the evaluation time and cost, and there is a need to develop prediction techniques that can evaluate safety with the highest possible accuracy without conducting experiments. Since such techniques can simplify the work associated with REACH and other regulations, the authorities concerned recommend their development on a worldwide basis.
While compound safety prediction using IT has been attracting a great deal of attention as described above, sufficiently high prediction accuracy cannot be achieved with the present state-of-the-art prediction techniques. Unlike fields such as character recognition, the structural formulas of compounds are very complex, the variety of compounds is enormous, amounting to tens of millions of kinds, and the factors affecting toxicity are themselves complex. Furthermore, since low prediction rates would pose hazards because of the special nature of this field, very high prediction reliability is important for the practical implementation of prediction techniques. There is therefore a continuing need for a method and apparatus that predict the properties, in particular the safety, of chemical compounds with very high accuracy.
FIG. 12 is a diagram illustrating an overview of a prior art prediction system which predicts the physical/chemical properties of compounds by using statistical techniques. In this system, a training sample set 100 is first prepared by collecting as many compounds as possible that have known values for the property to be predicted (the prediction item). Then, a prediction model 102 is constructed by performing data analysis, such as multivariate analysis or pattern recognition, on the training sample set 100.
In the prediction execution stage, prediction results are obtained by applying the prediction model 102 constructed as described above to each of the compounds A to N (hereinafter called the unknown samples) whose properties are to be predicted. For example, in the case of a discriminant analysis for determining whether a compound has carcinogenicity, the prediction result YES means that it is determined that the compound has carcinogenicity, while NO means that it is determined that the compound does not have carcinogenicity.
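The two stages described for FIG. 12 (constructing a prediction model from the training sample set, then applying it to unknown samples) can be sketched as follows. This is only a minimal illustration under stated assumptions: the nearest-centroid classifier stands in for the unspecified multivariate analysis or pattern recognition, and the descriptor values, the function names build_model and predict, and the sample names are hypothetical.

```python
from math import dist

def build_model(training_set):
    """Construct a nearest-centroid model from (descriptor_vector, label) pairs."""
    sums, counts = {}, {}
    for features, label in training_set:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    # One centroid (mean descriptor vector) per class label.
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(model, features):
    """Return the label whose centroid is closest to the descriptor vector."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical training sample set: two made-up descriptors per compound,
# each with a known YES/NO value for the prediction item.
training_set = [
    ([0.9, 0.8], "YES"),
    ([1.0, 0.7], "YES"),
    ([0.1, 0.2], "NO"),
    ([0.2, 0.1], "NO"),
]
model = build_model(training_set)

# Prediction execution stage: apply the model to each unknown sample.
unknown_samples = {"A": [0.8, 0.9], "B": [0.0, 0.3]}
results = {name: predict(model, x) for name, x in unknown_samples.items()}
```

In this sketch, results maps each unknown sample name to YES or NO, mirroring the discriminant analysis example in the text.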
Various attempts have been made to predict, for example, the toxicity of compounds by using techniques such as those described above, but in reality the expected high prediction rates have not yet been achieved. The prediction rate would normally be calculated from the correctness of the predictions performed on unknown samples, but this would require that the actual effect be verified by animal testing, etc., which is difficult to implement. In actual practice, therefore, one sample is taken from the training sample set as a tentative unknown sample, and the prediction is performed on it by using a prediction model generated from the remaining training samples (the so-called leave-one-out method). This is repeated with each sample in turn taken as the tentative unknown sample, and the degree of accuracy of the prediction model, i.e., the prediction rate, is calculated from the results of these predictions.
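The procedure of holding out one training sample as a tentative unknown sample can be sketched as below. The helper leave_one_out_rate, the toy threshold model (train, predict), and the single-descriptor data are assumptions made for illustration only; any model construction method could be substituted for train and predict.

```python
from statistics import fmean

def leave_one_out_rate(samples, train, predict):
    """Estimate the prediction rate by holding out each sample in turn."""
    correct = 0
    for i, (x, label) in enumerate(samples):
        remaining = samples[:i] + samples[i + 1:]   # training set minus one sample
        model = train(remaining)                    # model never sees the held-out sample
        if predict(model, x) == label:
            correct += 1
    return correct / len(samples)

def train(samples):
    """Toy model: threshold midway between the mean descriptor of each class."""
    yes_mean = fmean(x for x, label in samples if label == "YES")
    no_mean = fmean(x for x, label in samples if label == "NO")
    return (yes_mean + no_mean) / 2

def predict(threshold, x):
    """Assumes (for this toy data) that YES compounds have larger descriptor values."""
    return "YES" if x > threshold else "NO"

# Hypothetical single-descriptor training sample set.
samples = [(1.0, "NO"), (2.0, "NO"), (3.0, "NO"),
           (4.5, "YES"), (7.0, "YES"), (8.0, "YES"), (9.0, "YES")]
rate = leave_one_out_rate(samples, train, predict)
```

With this data the borderline sample at 4.5 is mispredicted when held out, so the estimated prediction rate is 6/7 rather than 1.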
In a prediction system such as that illustrated in FIG. 12, various strategies have been devised to improve the prediction rate. Such strategies include, for example, ingeniously designing a data analysis method for obtaining a prediction model, or classifying, based on various empirical criteria, the large number of compounds forming the training sample set and constructing a prediction model for each resulting class. In the former case, classification methods using such techniques as the linear learning machine, discriminant analysis, Bayes linear discriminant analysis, Bayes nonlinear discriminant analysis, neural networks, SVM, KNN (K-Nearest Neighbor), etc., have been tried on the problem of classifying compounds into two classes, one having toxicity and the other having no toxicity, and recently it has been reported that high classification rates can be obtained with relative ease by the neural network or SVM methods (non-patent document 1).
However, while the classification rate, i.e., the rate at which the training samples themselves are correctly classified, improves with the neural network or SVM methods, the prediction rate drops. This is presumably because such analysis techniques tend to perform classification for the sake of classification while ignoring the chemical factors that underlie the classification. For this reason, the approach that aims to improve the prediction rate by ingeniously designing an analysis technique has not, to date, achieved good results.
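The gap between classification rate and prediction rate can be illustrated with a small sketch. It does not reproduce the neural network or SVM results of non-patent document 1; instead it assumes a 1-nearest-neighbour model, which memorizes its training samples, and a hypothetical one-descriptor data set in which the two classes overlap.

```python
def nearest_neighbour_predict(training, x):
    """Predict the label of x from the closest training sample (1-NN)."""
    return min(training, key=lambda s: abs(s[0] - x))[1]

def classification_rate(samples):
    """Fraction of training samples classified correctly by a model that has seen them."""
    hits = sum(nearest_neighbour_predict(samples, x) == y for x, y in samples)
    return hits / len(samples)

def prediction_rate(samples):
    """Leave-one-out: each sample is predicted by a model that has not seen it."""
    hits = 0
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += nearest_neighbour_predict(rest, x) == y
    return hits / len(samples)

# Hypothetical descriptor values; the NO sample at 62 lies among the YES samples.
samples = [(10, "NO"), (20, "NO"), (30, "NO"), (62, "NO"),
           (50, "YES"), (70, "YES"), (80, "YES"), (90, "YES")]
fit = classification_rate(samples)   # every sample is its own nearest neighbour
loo = prediction_rate(samples)       # held-out samples near the overlap are missed
```

Here the model classifies all of its own training samples correctly (fit is 1.0), yet the held-out estimate is only 0.625, because the samples at 50, 62 and 70 are each mispredicted once removed from the model.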
In the prediction system illustrated in FIG. 12, one prediction model is generated from the training sample set. On the other hand, as described above, an attempt has been made to perform predictions by generating a plurality of prediction models from the training sample set and by applying one or more prediction models to each unknown sample.
FIG. 13 is a diagram providing an overview of such a prediction system. First, the large number of compounds forming the training sample set 100 are classified based on the basic structure or property of the compounds, to generate sub-sample sets 1, 2, and 3. Next, multivariate analysis or pattern recognition is performed on each sub-sample set, generating a prediction model 1 from the sub-sample set 1, a prediction model 2 from the sub-sample set 2, and a prediction model 3 from the sub-sample set 3.
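The construction of one prediction model per sub-sample set, as outlined for FIG. 13, might look like the following sketch. The structure labels ("aromatic", "aliphatic", "halogenated"), the single descriptor, and the toy threshold model are all hypothetical; the source does not specify how the training sample set is classified or which analysis is applied to each sub-sample set.

```python
from collections import defaultdict
from statistics import fmean

def split_into_sub_sample_sets(training_set):
    """Group training samples by their (hypothetical) basic-structure label."""
    subsets = defaultdict(list)
    for sample in training_set:
        subsets[sample["structure"]].append(sample)
    return dict(subsets)

def build_model(sub_sample_set):
    """Toy model: a threshold midway between the class means of one descriptor."""
    toxic = fmean(s["descriptor"] for s in sub_sample_set if s["toxic"])
    safe = fmean(s["descriptor"] for s in sub_sample_set if not s["toxic"])
    return {"threshold": (toxic + safe) / 2, "toxic_above": toxic > safe}

def predict(model, descriptor):
    """Return True (toxic) or False (not toxic) for one descriptor value."""
    return (descriptor > model["threshold"]) == model["toxic_above"]

# Hypothetical training sample set with made-up descriptor values.
training_set = [
    {"structure": "aromatic", "descriptor": 0.2, "toxic": False},
    {"structure": "aromatic", "descriptor": 0.8, "toxic": True},
    {"structure": "aliphatic", "descriptor": 0.3, "toxic": True},
    {"structure": "aliphatic", "descriptor": 0.9, "toxic": False},
    {"structure": "halogenated", "descriptor": 0.1, "toxic": False},
    {"structure": "halogenated", "descriptor": 0.5, "toxic": True},
]
sub_sample_sets = split_into_sub_sample_sets(training_set)
models = {name: build_model(subset) for name, subset in sub_sample_sets.items()}
```

In this toy data the same descriptor value of 0.9 is judged toxic by the "aromatic" model and not toxic by the "aliphatic" model, which shows why the choice of model for a given unknown sample matters.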
In the prediction execution stage, the prediction is performed by applying the plurality of prediction models thus constructed to the unknown samples A to N. The problem here is which of the plurality of prediction models is to be applied, for example, to the unknown sample A. If the correct prediction model is not selected, cases can occur where, for example, applying the prediction model 1 to the unknown sample A yields the result YES while applying the prediction model 2 yields the result NO, and the reliability of the prediction thus degrades. Usually, all of the prediction models are applied to each unknown sample to obtain a plurality of prediction results, and the final prediction result is then determined by taking a majority among them.
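The majority decision among the results of several prediction models can be sketched as follows. The three threshold models and the descriptor values are hypothetical, and returning None on a tied vote is a design choice of this sketch rather than something stated in the source.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common result, or None when the top two results tie."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Hypothetical prediction models 1 to 3: each maps a descriptor value to YES/NO.
models = [
    lambda x: "YES" if x > 0.5 else "NO",
    lambda x: "YES" if x > 0.7 else "NO",
    lambda x: "YES" if x > 0.3 else "NO",
]

def predict_by_majority(models, x):
    """Apply every model to the unknown sample and take the majority result."""
    return majority_vote([model(x) for model in models])

result_a = predict_by_majority(models, 0.6)   # individual votes: YES, NO, YES
```

For the unknown sample with descriptor 0.6, models 1 and 3 answer YES and model 2 answers NO, so the final result is YES even though the models disagree.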
However, even with this method, it is not possible to obtain sufficiently high prediction rates. As a possible solution, a prediction model generated from a sub-sample set containing samples having a structure similar to an unknown sample may be selected as the prediction model for that unknown sample, but since the structures of compounds are complex and diverse, there is not always a significant correlation between the sub-sample set and the unknown sample, and as a result, it is not possible to achieve a high prediction rate.
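Selecting the prediction model whose sub-sample set contains the compound most similar to the unknown sample could be sketched as below. The source does not name a similarity measure; the Tanimoto coefficient over sets of structural features is assumed here, and the feature names, model names, and function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_model(unknown, sub_sample_sets):
    """Pick the sub-sample set containing the member most similar to the unknown."""
    def best_similarity(name):
        return max(tanimoto(unknown, member) for member in sub_sample_sets[name])
    best = max(sub_sample_sets, key=best_similarity)
    return best, best_similarity(best)

# Hypothetical sub-sample sets, each compound reduced to a set of feature names.
sub_sample_sets = {
    "model_1": [{"benzene", "Cl"}, {"benzene", "OH"}],
    "model_2": [{"alkyl", "COOH"}, {"alkyl", "NH2"}],
}

unknown_a = {"benzene", "Cl", "NO2"}
chosen, score = select_model(unknown_a, sub_sample_sets)

# An unknown sample sharing no features with any sub-sample set still gets
# a model assigned, but with a similarity of zero.
fallback, fallback_score = select_model({"furan"}, sub_sample_sets)
```

The second call shows the weakness noted in the text: a model is always selected, but when no sub-sample set correlates with the unknown sample the similarity is zero and the selection carries no real information.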
As described above, with the prediction system illustrated in FIG. 13, the classification rate of the training sample set increases because a plurality of prediction models are constructed, but the system falls short of improving the prediction rate.
Non-patent document 1: Kazutoshi Tanabe, Norihito Ohmori, Shuichiro Ono, Takahiro Suzuki, Takatoshi Matsumoto, Umpei Nagashima, and Hiroyuki Uesaka, “Prediction of Carcinogenicity of Chlorine-containing Organic Compounds by Neural Network,” Comput. Chem. Jpn., Vol. 4, No. 3, pp. 89-100 (2005)