1. Technical Field
The invention relates to the field of statistical analysis, and, more specifically, to principal component analysis of descriptor vectors.
2. Description of the Related Art
Areas of inquiry in computational biology and chemistry typically characterize an item of interest (e.g., a molecular complex such as a molecule, a compound, or a portion of one or more molecules/compounds, or a property of one or more molecules/compounds) with a vector of descriptors (i.e., a descriptor vector) that represents one or more characteristics of the item. For example, the descriptor vector for a molecular complex may characterize the structure, charge distribution, biological activity, and/or other properties of the molecular complex.
Principal Component Analysis is often applied to select a subset of components of the descriptor vectors associated with a set of items that approximates the data within the set. The selected subset of components is typically used to perform regression analysis and/or correlation analysis on the set of items. Generally, regression analysis and correlation analysis both concern the following questions:
1) Does a statistical relation affording some predictability appear among the items in the set?
2) How strong is the apparent statistical relation, in the sense of the possible predictive ability that the statistical relation affords?
3) Can a rule be formulated for predicting relations among the items in the set, and, if so, how good is this rule?
A more detailed description of Principal Component Analysis together with regression analysis and/or correlation analysis may be found in I. T. Jolliffe, "Principal Component Analysis", Springer-Verlag, New York, 1986, herein incorporated by reference in its entirety.
Typically, Principal Component Analysis generates a correlation matrix based upon the descriptor vectors for a given set of items, identifies the largest principal values of that correlation matrix, and selects those components of the descriptor vector that correspond to the identified principal values.
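The procedure just described can be illustrated with a minimal sketch. It assumes that the "principal values" are the eigenvalues of the correlation matrix and that "selecting components" means projecting the standardized descriptor vectors onto the corresponding eigenvectors. The function name `pca_correlation` and the parameter `k` (number of components kept) are hypothetical and are used only for illustration.

```python
import numpy as np

def pca_correlation(X, k):
    """Illustrative PCA on a correlation matrix (names are hypothetical).

    X : array of shape (n_items, n_descriptors), one descriptor vector per row.
    k : number of leading principal components to keep.
    Returns (eigenvalues, eigenvectors, scores) for the k largest eigenvalues.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each descriptor to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the descriptors (columns of X).
    R = np.corrcoef(X, rowvar=False)
    # R is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    evals, evecs = np.linalg.eigh(R)
    # Indices of the k largest eigenvalues, largest first.
    order = np.argsort(evals)[::-1][:k]
    # Project the standardized data onto the selected eigenvectors.
    scores = Z @ evecs[:, order]
    return evals[order], evecs[:, order], scores
```

In this sketch the sample variance of each column of `scores` equals the corresponding eigenvalue, which is why the largest eigenvalues identify the directions that best approximate the data.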
Although traditional Principal Component Analysis selects components that approximate both the items of interest and the correlation matrix, it may reject information that discriminates between groups of items. There thus remains a need in the art for an improved method of capturing information that optimally discriminates between groups of items.
The problems stated above and the related problems of the prior art are solved with the principles of the present invention, a method and apparatus for mapping components of descriptor vectors to a space that discriminates between groups. The present invention transforms descriptor vectors that characterize molecular complexes partitioned into groups into a space that discriminates between those groups in a well-defined optimal sense. First data is generated that represents differences between the groups of descriptor vectors. Second data is generated that represents variation within the groups of descriptor vectors. A set of component vectors is then identified that maximizes an F-distributed criterion function that measures differences of descriptor vectors between groups relative to variations of descriptor vectors within groups. A statistic is generated for subsets of the component vectors. For each particular subset of component vectors, a probability value for the statistic associated with the particular subset is calculated. The subset with the minimum probability value is selected. Finally, one or more of the descriptor vectors for the molecular complexes are mapped to a space corresponding to the selected subset of component vectors.
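The first steps of this summary can be illustrated with a hedged sketch. It makes several assumptions that are not stated in this section: the first data is taken to be a between-group scatter matrix `B`, the second data a within-group scatter matrix `W`, and the criterion function the ratio of between-group to within-group mean squares along a direction, maximized by solving the generalized eigenproblem B v = λ W v. The function names `discriminant_components` and `map_to_space` are hypothetical. The statistic and probability value used to select a subset are not specified here, so the sketch omits that step and leaves the number of components `k` to the caller.

```python
import numpy as np

def discriminant_components(X, labels):
    """Illustrative sketch (assumed formulation, hypothetical names).

    X      : array of shape (n_items, n_descriptors).
    labels : group label for each row of X.
    Returns (V, f_ratio): columns of V are component vectors sorted by
    decreasing criterion value, f_ratio holds the value for each column.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, p = X.shape
    g = len(groups)
    grand_mean = X.mean(axis=0)
    B = np.zeros((p, p))  # assumed "first data": between-group scatter
    W = np.zeros((p, p))  # assumed "second data": within-group scatter
    for grp in groups:
        Xg = X[labels == grp]
        mean_g = Xg.mean(axis=0)
        d = (mean_g - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)
        C = Xg - mean_g
        W += C.T @ C
    # Whiten with the Cholesky factor W = L L^T (assumes W is positive definite),
    # which turns B v = lambda W v into an ordinary symmetric eigenproblem.
    L = np.linalg.cholesky(W)
    L_inv = np.linalg.inv(L)
    M = L_inv @ B @ L_inv.T
    evals, U = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1]
    evals = evals[order]
    # Back-transform; the columns of V satisfy V^T W V = I.
    V = L_inv.T @ U[:, order]
    # Between-group over within-group mean square along each component vector.
    f_ratio = evals * (n - g) / (g - 1)
    return V, f_ratio

def map_to_space(X, V, k):
    """Map descriptor vectors onto the first k component vectors."""
    return np.asarray(X, dtype=float) @ V[:, :k]
```

With g groups, at most g - 1 of the criterion values are nonzero, so a caller would normally choose `k` no larger than g - 1.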