The present invention relates to a method and system for estimating EDR directions in a single-index model, and more particularly to a method, system, and program for estimating EDR directions in a single-index model related to a large number of variables, and a memory medium for storing the program.
In general, one of the objects of statistical analysis of actual phenomena is to find relationships among various characteristics and to make predictions. In such a case, it is common practice to find a relationship from data using regression analysis and to make a prediction on certain variables. For example, linear regression analysis or logistic regression analysis is used to analyze the relationship between a response variable y and an explanatory variable x.
However, the higher the dimension p of the explanatory variable x, the more difficult it is to perform this type of regression analysis. To solve this problem, there have been proposed several methods to reduce the number of dimensions of the explanatory variable x.
For example, referring to the following document 1 (Ker-Chau Li, “Sliced inverse regression for dimension reduction,” Journal of the American Statistical Association, Vol. 86 (414), pp. 316-342, 1991.), Ker-Chau Li proposed SIR (Sliced Inverse Regression).
SIR is a method for determining a subspace of x that is sufficient to describe the response variable y. The subspace so determined is called the EDR (Effective Dimension Reduction) space, and a vector spanning the EDR space is called an EDR direction vector. Using conventional regression analysis, the relationship between the response variable y and the explanatory variable x can then be found in the EDR space, the dimension of which has been reduced.
Referring also to the following document 2 (Ichimura et al., “Optimal Smoothing in Single Index Models,” The Annals of Statistics, Vol. 21, pp. 157-178, 1993.), Hall and Ichimura estimated EDR directions using a smoothing method.
Referring further to the following document 3 (Xia et al., “An adaptive estimation of dimension reduction space,” Journal of the Royal Statistical Society (Series B), Vol. 64, pp. 363-410, 2002.), Xia et al. proposed a technique for estimating the EDR space using a non-linear smoothing method. However, if the number of explanatory variables becomes enormous, the calculations become very difficult.
SIR will be described below with reference to the following equations (1) to (6). In the SIR method, the model indicated by the following equation (1) is assumed:

$$y = f(\beta_1' x, \ldots, \beta_k' x, \varepsilon) \tag{1}$$

In this equation, y represents a response variable, f is an unknown function, ε is a random variable independent of x, and x is a p-dimensional explanatory variable. Further, β1, . . . , βk are p-dimensional unknown coefficient vectors, that is, EDR direction vectors.
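As an illustration of equation (1) in the single-index case (k = 1), the following minimal Python sketch draws data from such a model. The link function sin(u) + ε, the Gaussian explanatory variables, and the names `simulate_single_index`, `noise_sd`, and `seed` are assumptions made only for this example; the model itself leaves f unknown.

```python
import numpy as np

def simulate_single_index(n, beta, noise_sd=0.1, seed=0):
    """Draw n samples from y = f(beta'x, eps).

    The link f(u, eps) = sin(u) + eps is an assumption for illustration;
    equation (1) only requires that f exist, not that it be known.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    x = rng.standard_normal((n, beta.size))   # p-dimensional explanatory variables
    eps = noise_sd * rng.standard_normal(n)   # noise, independent of x
    y = np.sin(x @ beta) + eps                # response depends on x only through beta'x
    return x, y

# Four explanatory variables, of which only the first two affect y.
x, y = simulate_single_index(200, [1.0, -1.0, 0.0, 0.0])
```

Here the one-dimensional EDR space is spanned by the vector (1, -1, 0, 0).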
Using FIGS. 1 and 2, SIR operations will be described below. First, explanatory variables in a data file inputted from an input device 1 are standardized by data standardizing means 24 of a data analyzer 2 (step A1 in FIG. 2):

$$z_i = \Sigma_{xx}^{-1/2}\,[x_i - \bar{x}] \quad (i = 1, \ldots, n) \tag{2}$$

where $\Sigma_{xx}$ and $\bar{x}$ are the variance-covariance matrix and the average of x, respectively.
Then slice average calculating means 22 sorts the response variables y and divides them into H slices $I_1, \ldots, I_H$ (step A2). Then the proportion of response variables belonging to slice $I_k$ is calculated as $\hat{p}_k$ (see the following equation (3)):

$$\hat{p}_k = \frac{1}{n} \sum_{i=1}^{n} \delta_k(y_i), \quad \text{where} \quad \delta_k(y_i) = \begin{cases} 1, & y_i \in I_k, \\ 0, & y_i \notin I_k. \end{cases} \tag{3}$$
Next, using the following equation (4), the mean vector of standardized explanatory variables is calculated for each slice (step A3).

$$m_k = \left[\frac{1}{n\,\hat{p}_k}\right] \sum_{y_i \in I_k} z_i \tag{4}$$
Then, principal component analyzing means 25 carries out a principal component analysis of the mean vectors $m_k$ on a slice basis to determine eigenvectors (step A4).
In this case, the characteristic numbers (eigenvalues) and eigenvectors are those of the matrix V given by the following equation (5):

$$V = \sum_{k=1}^{H} \hat{p}_k\, m_k\, m_k' \tag{5}$$
The data standardizing means 24 extracts K eigenvectors $\eta_k$ (k = 1, . . . , K) in descending order of their characteristic numbers, and uses the following equation (6) to transform them into the original coordinate system (step A5):

$$\beta_k = \Sigma_{xx}^{-1/2}\, \eta_k \tag{6}$$

The EDR direction vectors determined at step A5 are output to an output device 3 (step A6).
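Steps A1 to A5 can be sketched in Python with NumPy as follows. This is a minimal sketch, not the implementation of the data analyzer 2: the function name `sir_directions`, its parameters `n_slices` and `n_directions`, the equal-count slicing rule, and the simulated test data are all assumptions introduced for illustration. It also assumes that there are at least two explanatory variables and fewer variables than samples, so that the variance-covariance matrix is positive definite.

```python
import numpy as np

def sir_directions(x, y, n_slices=10, n_directions=1):
    """Estimate EDR direction vectors by SIR, following equations (2) to (6)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = x.shape

    # Step A1, equation (2): z_i = Sigma_xx^{-1/2} (x_i - x_bar).
    # np.cov uses the n - 1 denominator; this only rescales z.
    x_bar = x.mean(axis=0)
    sigma = np.cov(x, rowvar=False)
    evals, evecs = np.linalg.eigh(sigma)
    sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - x_bar) @ sigma_inv_sqrt

    # Step A2: sort y and split the sample indices into H slices
    # of (nearly) equal size; this slicing rule is an assumption.
    slices = np.array_split(np.argsort(y), n_slices)

    # Steps A3 and A4, equations (3) to (5): V = sum_k p_k m_k m_k'.
    v = np.zeros((p, p))
    for idx in slices:
        p_hat = len(idx) / n          # equation (3)
        m = z[idx].mean(axis=0)       # equation (4): sum of z_i / (n * p_hat)
        v += p_hat * np.outer(m, m)   # equation (5)

    # Step A4: eigenvectors of V, largest characteristic numbers first.
    vals, vecs = np.linalg.eigh(v)
    top = np.argsort(vals)[::-1][:n_directions]
    eta = vecs[:, top]

    # Step A5, equation (6): beta_k = Sigma_xx^{-1/2} eta_k.
    return sigma_inv_sqrt @ eta

# Usage on simulated data with a monotone link (an assumed test case).
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 5))
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
y = np.exp(0.5 * (x @ beta_true)) + 0.1 * rng.standard_normal(2000)
beta_hat = sir_directions(x, y)[:, 0]
cosine = abs(beta_hat @ beta_true) / (
    np.linalg.norm(beta_hat) * np.linalg.norm(beta_true)
)
```

An EDR direction is determined only up to sign and scale, so the estimate is compared with the true vector through the absolute cosine of the angle between them rather than element by element.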
The first problem of the above-mentioned prior art is that SIR is not applicable to data having a large number of variables, such as data from a DNA chip for gene expression analysis or a microarray. In order to standardize data, SIR requires the inverse matrix of the variance-covariance matrix of explanatory variables, and a principal component analysis for estimating EDR direction vectors to determine eigenvectors. However, if the variables are enormous in number, it may be mathematically impossible to determine the inverse matrix of the variance-covariance matrix, or the principal component analysis may take enormous computation time.
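The inverse-matrix difficulty can be checked numerically. In the following minimal sketch, the sizes n = 20 and p = 100 and the Gaussian data are hypothetical values chosen only for illustration. A sample variance-covariance matrix computed from n samples has rank at most n - 1, so when the number of variables p exceeds that, the p × p matrix is singular and has no inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # far more variables than samples (hypothetical sizes)
x = rng.standard_normal((n, p))

sigma = np.cov(x, rowvar=False)      # p x p sample variance-covariance matrix
rank = np.linalg.matrix_rank(sigma)  # cannot exceed n - 1 after centering

# A p x p matrix is invertible only if its rank equals p.
is_invertible = rank == p
```

Because `rank` is at most n - 1 here, `is_invertible` is false and the standardization of equation (2) cannot be carried out as written.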
The second problem is that SIR limits the distribution of explanatory variables to elliptic distributions. Therefore, SIR cannot be applied when explanatory variables are binary.