The micro-array was developed in 1995 to measure gene expression data in a massively parallel fashion. Since that time, the amount of data per experiment has increased significantly (see http://www-binf.bio.uu.nl/~dutilh/gene-networks/thesis.html). In gene exploration, this extensive data is important for assessing genes and their influence on expressed states in organisms. In particular, it is necessary to assess the function and operation of a particular gene, a gene being defined as a prescribed area on a nucleic acid molecule that is necessary for the production of a final expressed phenotype of an individual. On a broader and more complex scale, the interaction network is also of interest because of its influence in regulating higher cellular, physiological and behavioral properties. Attempts have recently been made to reconstruct the precise interaction network, or fragments of it, from large-scale array experiments for a condition-specific database, e.g., melanoma (Bittner et al., 2000). The critical first step in these efforts is to find the smallest subset of predictors that is related, within a desired degree of precision, to an arbitrary target. Based on such a set of predictors, computed for every target of interest, it is possible to find the smallest set that can explain or predict the behavior of any target in terms of expression. In the case of genes, finding the smallest set that predicts a prescribed behavior can be a very complicated and arduous task, given the massive amount of data that results from analyzing a complete organism's genome.
Most important to scientists is the ability to select a minimal cardinality set that can represent a whole set of expressed information. In the pattern recognition literature, this is known as feature selection or dimensionality reduction, depending on the context.
The issue at hand is now a question of mathematics and computation rather than pure biology, and a number of algorithms from other fields of study can be applied to help solve it. From a computational standpoint, the specific problem is to find the best k-tuples (with respect to a given quality function) from a set of n features, for many values of k. One method of finding the best k-tuple predictor subset is to conduct an exhaustive search through all possible k-tuples. Although this approach always leads to the best solution, it becomes intractable for even moderate values of k: the number of k-tuples is the binomial coefficient C(n, k), so for k much smaller than n the computational time grows exponentially with k.
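The exhaustive search described above can be illustrated with a minimal Python sketch. The feature names, the integer weights and the additive `toy_quality` function are hypothetical placeholders chosen only for illustration; a real quality function would measure how well a subset predicts a target's expression level.

```python
from itertools import combinations
from math import comb


def best_k_tuple(features, k, quality):
    """Evaluate every k-element subset and return (best_subset, best_score)."""
    best_subset, best_score = None, float("-inf")
    for subset in combinations(features, k):
        score = quality(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical per-feature weights (assumption, not from the source text).
WEIGHTS = {"g1": 9, "g2": 5, "g3": 4, "g4": 1}


def toy_quality(subset):
    """Toy quality function: the sum of the weights of the chosen features."""
    return sum(WEIGHTS[f] for f in subset)


subset, score = best_k_tuple(sorted(WEIGHTS), 2, toy_quality)
print(subset, score)

# Number of k-tuples that must be evaluated for n = 1000 features.
n_candidates = {k: comb(1000, k) for k in (1, 2, 3, 4)}
print(n_candidates)
```

With n = 1000 features, the number of candidate subsets rises from 1,000 at k = 1 to more than 41 billion at k = 4, which is why the exhaustive approach is practical only for very small k.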
Methods developed for pattern recognition will also be important to bio-informatics. In pattern recognition, machine learning, data mining and their applications to various subject areas, e.g., medical diagnostics, manufacturing test design and image recognition, a similar subset selection problem, known as feature selection, is important. A number of approaches have been proposed to address this problem. These include, but are not limited to, sequential (backward and forward) search techniques, floating search techniques and genetic algorithms. However, methods based on sequential search techniques suffer from the nesting effect, i.e., they overlook good feature sets consisting of features that are individually of poor quality. Floating search methods (Pudil et al., 2000; Somol et al., 2000) attempt to avoid the nesting problem by successively adding the best and removing the worst subsets of features from the candidate set of features. This introduces exponential complexity into the search as the size of a subset grows. A significant drawback of these methods is therefore that they become slow for high-dimensional data sets, as is the case with biological expression data. Genetic algorithms, for their part, do not have well-defined stopping criteria and, in principle, can be exponentially complex.
Most importantly, the above methods and algorithms are intended to be applied in the field of array data processing to enable computationally efficient searches for the smallest subset of features that predicts a target's expression levels within a given level of confidence or accuracy.
It would, therefore, be desirable to develop a method and algorithm that determines “good” solution sets with high probability in linear time with respect to the total number of features in a predictor set. “Sequential forward selection” (SFS) (Bishop, 1997; Pudil et al., 2000; Somol & Pudil, 2000) was developed to add the best new feature (the one that leads to the largest improvement in the value of the quality function) to the current set of features at each successive stage of the algorithm, until the needed number of features is reached. It follows from this construction that SFS suffers from the nesting problem and overlooks better solution sets whose features are individually of mediocre or poor quality. This is one of the shortcomings addressed by the present invention. While “sequential floating forward selection” (SFFS) also addresses the nesting problem, it retains exponential time complexity for large data sets. The second shortcoming that this invention addresses is the exponential time complexity of finding “good” solutions. The proposed method finds a “good” solution set with high probability in linear time with respect to the number of predictors. One of the floating search algorithms, called “oscillating search” (Somol & Pudil, 2000), can also find approximate solutions in linear time. However, the present invention guarantees a solution of equal or better quality while maintaining linear time complexity. In addition, the same generic method or algorithm can be used not only for gene network reconstruction but also for protein data, feature selection for classification, and other biological data that is very large and complex to organize and analyze.
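The SFS procedure and its nesting problem can be illustrated with a minimal Python sketch. This is a sketch of the prior-art greedy technique only, not of the method of the present invention. The feature names, the `BASE` scores and the synergy bonus in `toy_quality` are hypothetical values constructed so that two individually weak features form the best pair.

```python
from itertools import combinations


def sequential_forward_selection(features, k, quality):
    """Greedy SFS: at each stage add the single feature that most improves quality."""
    selected = []
    remaining = list(features)
    while len(selected) < k and remaining:
        # Score each candidate by the quality of the current set plus that candidate.
        best = max(remaining, key=lambda f: quality(tuple(selected) + (f,)))
        selected.append(best)
        remaining.remove(best)
    return tuple(selected)


# Hypothetical individual scores (assumption): "a" is strong alone, "b" and "c" are weak.
BASE = {"a": 5, "b": 1, "c": 1}


def toy_quality(subset):
    """Toy quality: sum of individual scores, plus a bonus when "b" and "c" appear together."""
    score = sum(BASE[f] for f in subset)
    if {"b", "c"} <= set(subset):
        score += 10
    return score


greedy = sequential_forward_selection(["a", "b", "c"], 2, toy_quality)
optimal = max(combinations(["a", "b", "c"], 2), key=toy_quality)

print("SFS:", greedy, toy_quality(greedy))
print("exhaustive:", optimal, toy_quality(optimal))
```

In this toy case SFS commits to "a" in the first stage and can never discard it, so it returns a pair of quality 6, whereas the exhaustive search finds the pair ("b", "c") of quality 12. SFS needs on the order of n·k quality evaluations for n features, which is what makes it fast, and also what makes it blind to subsets of individually poor features.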