A data analysis technique used to generate a model for predicting a physical, chemical, or physiological property (objective variable) of a sample is generally called a fitting technique when the objective variable is a numerically continuous quantity. Regression analysis is one typical technique used for this purpose. In this technique, regression analysis is performed on samples whose objective variable is known, using one or more suitably selected explanatory variables, and a regression equation that defines the relationship between the objective variable and the explanatory variables is calculated; then, for a sample whose objective variable is unknown, the value of the objective variable is predicted using the regression equation. When more than one explanatory variable is used, the analysis is called multiple regression analysis. Fitting techniques include multiple linear regression, multiple nonlinear regression, PLS (Partial Least Squares), and neural networks, and any of these techniques can be used in the present invention.
The prediction reliability for an unknown sample depends on the goodness of fit of the multiple regression equation calculated using the multiple linear regression technique. The goodness of fit of the multiple regression equation is measured by the value of a correlation coefficient R or a coefficient of determination R². The closer the value is to 1, the better the regression equation, and the closer the value is to 0, the worse the regression equation.
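The two goodness-of-fit measures named above can be sketched as follows. This is a minimal illustration, not part of the described technique: the function names and the sample values are hypothetical, R is computed as the Pearson correlation between measured and calculated values, and R² is computed as 1 - SS_res/SS_tot, which equals the square of R only when the calculated values come from a least-squares fit that includes a constant term.

```python
import math

def correlation_coefficient(measured, calculated):
    """Pearson correlation coefficient R between measured and calculated values."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_c = sum(calculated) / n
    cov = sum((m - mean_m) * (c - mean_c) for m, c in zip(measured, calculated))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    return cov / math.sqrt(var_m * var_c)

def coefficient_of_determination(measured, calculated):
    """R^2 computed as 1 - (residual sum of squares) / (total sum of squares)."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - c) ** 2 for m, c in zip(measured, calculated))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical values, chosen only for illustration.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect = [1.0, 2.0, 3.0, 4.0, 5.0]   # calculated values equal measured values
noisy = [1.5, 1.5, 3.5, 3.5, 5.0]     # calculated values scatter around them

print(correlation_coefficient(measured, perfect))
print(correlation_coefficient(measured, noisy))
print(coefficient_of_determination(measured, noisy))
```

With the perfect prediction both measures reach 1; with the scattered prediction they fall below 1, which is the sense in which a value closer to 1 indicates a better regression equation.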
FIG. 1 depicts the results of the multiple linear regression analysis performed on a certain sample set. The figure depicts the correlation between the measured values and the calculated values (the values calculated using a prediction model) of the objective variable of the samples. The abscissa represents the measured value of the objective variable of each sample, and the ordinate represents the value of the objective variable Y of each sample calculated by a multiple regression equation (prediction model) obtained as a result of the multiple regression analysis. The multiple regression equation in this case is given by the following equation (1).
Y = ±a1·x1 ± a2·x2 ± . . . ± an·xn ± C  (1)
In equation (1), Y indicates the calculated value of the objective variable of each sample, and x1, x2, . . . , xn indicate the values of the explanatory variables; further, a1, a2, . . . , an are coefficients, and C is a constant. By substituting the values of the explanatory variables into the above equation (1) for each sample, the value of the objective variable Y of the sample is calculated. When the value of the objective variable Y calculated by equation (1) coincides with the measured value of the sample, the sample, indicated by an open circle, lies on the regression line Y drawn in FIG. 1. Accordingly, the more closely the samples cluster around the regression line Y, the better the regression equation is judged to be (the higher its reliability). The reliability of the multiple regression equation is determined by the correlation coefficient R. When the correlation coefficient R is 1, all the samples lie on the regression line. FIG. 1 depicts the case where the correlation coefficient R is 0.7.
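As a rough sketch of how the coefficients and the constant of equation (1) might be obtained and then used for prediction, the following fits the equation by ordinary least squares through the normal equations. The function names, the training samples, and the choice of solver are assumptions made for illustration; the ± signs of equation (1) are absorbed into the signs of the fitted coefficients.

```python
def fit_multiple_regression(X, y):
    """Least-squares fit of Y = a1*x1 + ... + an*xn + C.

    X is a list of samples, each a list of n explanatory variable values.
    Returns (coefficients [a1, ..., an], constant C).
    """
    # Append a constant 1 to each sample so that C is estimated like a coefficient.
    rows = [list(x) + [1.0] for x in X]
    p = len(rows[0])
    # Build the normal equations (A^T A) b = A^T y as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))]
           for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [v - factor * w for v, w in zip(aug[k], aug[col])]
    b = [aug[i][p] / aug[i][i] for i in range(p)]
    return b[:-1], b[-1]

def predict(coefficients, constant, x):
    """Substitute explanatory variable values into equation (1)."""
    return sum(a * xi for a, xi in zip(coefficients, x)) + constant

# Hypothetical samples with known objective variable, generated from
# Y = 2*x1 - 3*x2 + 1 so that the fit can be checked by hand.
X_known = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]]
y_known = [3.0, -2.0, 0.0, 2.0, 1.0]

coefficients, constant = fit_multiple_regression(X_known, y_known)
# Predict a sample whose objective variable is unknown.
y_unknown = predict(coefficients, constant, [4.0, 2.0])
print(coefficients, constant, y_unknown)
```

Because the hypothetical samples lie exactly on one equation, the fit recovers it and every sample would sit on the regression line; with real measurements the samples scatter around the line and R drops below 1.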
Generally, when the number of samples is small, the samples can be made to lie on the regression line relatively easily. However, as the number of samples increases, the number of samples classified as noise also increases, making it extremely difficult to distribute all the samples so as to lie on a single regression line. In view of this, when the number of samples is large, an analysis technique is employed that divides the whole sample set into smaller subsets and obtains a regression equation on a subset-by-subset basis. When performing regression analysis on a subset-by-subset basis, how the whole sample set is divided into a plurality of subsets is of utmost importance, and this greatly affects the reliability of the resulting regression equation as well as the predictability. Further, when predicting the objective variable of an unknown sample, selecting which subset's regression equation to use for the prediction is also an important issue, and if the selection is wrong, a totally unreliable prediction result, i.e., a value significantly departing from the actual value, may be generated.
Generally, increasing the reliability of the regression equation is of utmost concern in data analysis. In one technique to achieve this, samples located some distance away from the regression line, i.e., samples whose predicted values greatly differ from the measured values, are removed from the sample set; in practice, this is an important measure for generating a good multiple regression equation. Samples located far away from the regression line are called outlier samples, and the value of the correlation coefficient R can be distinctly improved by removing such samples. A multiple linear regression program generally used to generate a multiple regression equation (prediction model) is designed to automatically generate a multiple regression equation that minimizes the occurrence of such outlier samples.
Accordingly, if the sample set contains even a single sample whose value of the objective variable departs far more widely from the regression line than the other samples, such an outlier sample will exert a significant influence on the generation of a multiple regression equation, and a multiple regression equation greatly affected by it will be generated. In data analysis, therefore, it is common practice to locate and remove such outlier samples from the sample set and to generate a multiple regression equation by using the remaining samples. In this case, the removed outlier samples are classified as noise in the data analysis and will never be used again in the data analysis process. That is, in the data analysis, information relating to the samples removed as outlier samples is discarded. As a result, even if the multiple regression equation thus generated has a high correlation coefficient, the prediction reliability for samples similar or related to the outlier samples decreases, narrowing the application range of the multiple regression equation and greatly affecting its versatility. Accordingly, in multiple regression analysis, it is desired to generate a multiple regression equation yielding a high correlation coefficient, while minimizing the occurrence of such outlier samples.
FIG. 2 is a diagram depicting the correlation between the measured values (abscissas) and calculated values (ordinates) of samples, for illustrating the method for improving the correlation coefficient R by removing outlier samples from the results of multiple regression analysis. In FIG. 2, the outlier samples are indicated at 1; when the multiple regression equation is generated by removing such outlier samples and using only the remaining samples clustering along the regression line 2, the correlation coefficient R improves. However, when the multiple regression equation is improved in this manner, since the information relating to the samples removed as noise is not reflected in the generation of a new multiple regression equation, as described above, the information that the outlier samples have is disregarded.
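The conventional outlier-removal procedure described above can be illustrated with a short sketch. It uses a single explanatory variable for brevity, and the sample values, the residual threshold, and all function names are hypothetical. It shows both effects discussed in the text: R rises after the outlier sample is removed, and the removed sample no longer contributes to the new regression equation.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept (one explanatory variable)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def correlation(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def fit_and_score(samples):
    """Fit a line to (x, measured) pairs; return slope, intercept, and R
    between the measured and the calculated values."""
    x = [s[0] for s in samples]
    measured = [s[1] for s in samples]
    slope, intercept = fit_line(x, measured)
    calculated = [slope * xi + intercept for xi in x]
    return slope, intercept, correlation(measured, calculated)

# Hypothetical (explanatory variable, measured value) pairs; the sample at
# x = 6.0 lies far from the trend followed by the others.
samples = [(1.0, 1.1), (2.0, 2.0), (3.0, 2.9), (4.0, 4.1),
           (5.0, 5.0), (6.0, 12.0), (7.0, 7.1), (8.0, 8.0)]
RESIDUAL_THRESHOLD = 2.0  # assumed cutoff, chosen for this illustration only

slope, intercept, r_before = fit_and_score(samples)
kept = [s for s in samples
        if abs(s[1] - (slope * s[0] + intercept)) <= RESIDUAL_THRESHOLD]
removed = [s for s in samples if s not in kept]  # discarded as noise
_, _, r_after = fit_and_score(kept)

print(removed, r_before, r_after)
```

The second fit is computed from the kept samples only, so whatever the removed sample might have indicated about similar samples is absent from the resulting equation.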
Such a multiple regression equation improvement is effective when the number of samples is relatively small as depicted in FIG. 2, but when the number of samples is large as in the case of FIG. 1, the number of outlier samples relatively increases; therefore, if an analysis is performed by simply taking a sample set, a multiple regression equation will be generated that is far removed from reality and that is close to a local solution lacking universality. As a result, analysis, prediction, etc. of the samples may not be performed with high reliability.
Further, when the purpose of the multiple regression analysis is simply a factor analysis, even the analysis technique that eliminates the outlier samples, such as depicted in FIG. 2, may be effective, but when the main purpose is to make a prediction about a sample whose objective variable is unknown, and when its prediction reliability is important, the above analysis technique is not suitable because its application range is limited due to loss of information.
For example, in the case of a chemical toxicity prediction problem or the like, the number of samples used for the generation of a multiple regression equation often becomes very large, and therefore, it becomes very difficult to obtain a high correlation coefficient. Further, in many cases, the variety of samples is bound to become large, and the proportion of samples eliminated as outlier samples tends to increase; this also makes it difficult to obtain a high correlation coefficient. As a result, even when performing multiple regression analysis on a relatively small number of samples, the prediction becomes extremely difficult. In this way, with the multiple regression technique that eliminates outlier samples and does not reuse them, the prediction reliability of the resulting multiple regression equation greatly drops. There is therefore a need for a novel multiple regression analysis technique that is neither the technique that divides a sample set into a plurality of subsets nor the technique that eliminates the outlier samples.
Many instances of chemical toxicity and pharmacological activity predictions using multiple linear or nonlinear regression analyses have been reported to date (for example, refer to non-patent documents 1 and 2).
Non-patent document 1: Tomohisa Nagamatsu et al., “Antitumor activity molecular design of flavin and 5-deazaflavin analogs and auto dock study of PTK inhibitors,” Proceedings of the 25th Medicinal Chemistry Symposium, 1P-20, pp. 82-83, Nagoya (2006)
Non-patent document 2: Akiko Baba et al., “Structure-activity relationships for the electrophilic reactivities of 1-β-O-Acyl glucuronides,” Proceedings of the 34th Structure-Activity Relationships Symposium, KP20, pp. 123-126, Niigata (2006)