Knowledge discovery is the most desirable end product of data collection. Recent advancements in database technology have led to an explosive growth in systems and methods for generating, collecting and storing vast amounts of data. While database technology enables efficient collection and storage of large data sets, the challenge of facilitating human comprehension of the information in this data is growing ever more difficult. With many existing techniques the problem has become unapproachable. Thus, there remains a need for a new generation of automated knowledge discovery tools.
As a specific example, the Human Genome Project has completed sequencing of the human genome. The complete sequence contains a staggering amount of data, with approximately 31,500 genes in the whole genome. The amount of data relevant to the genome must then be multiplied when considering comparative and other analyses that are needed in order to make use of the sequence data. To illustrate, human chromosome 20 alone comprises nearly 60 million base pairs. Genes associated with several diseases have been mapped to chromosome 20, including various autoimmune diseases, certain neurological diseases, type 2 diabetes, several forms of cancer, and more, such that considerable information can be associated with this sequence alone.
One of the more recent advances in determining the functioning parameters of biological systems is the analysis of correlation of genomic information with protein functioning to elucidate the relationship between gene expression, protein function and interaction, and disease states or progression. Proteomics is the study of the group of proteins encoded and regulated by a genome. Genomic activation or expression does not always mean direct changes in protein production levels or activity. Alternative processing of mRNA or post-transcriptional or post-translational regulatory mechanisms may cause the activity of one gene to result in multiple proteins, all of which are slightly different with different migration patterns and biological activities. The human proteome is believed to be 50 to 100 times larger than the human genome. Currently, there are no methods, systems or devices for adequately analyzing the data generated by such biological investigations into the genome and proteome.
In recent years, machine-learning approaches for data analysis have been widely explored for recognizing patterns which, in turn, allow extraction of significant information contained within a large data set that may also include data that consists of nothing more than irrelevant detail. Learning machines comprise algorithms that may be trained to generalize using data with known outcomes. Trained learning machine algorithms may then be applied to predict the outcome in cases of unknown outcome, i.e., to classify the data according to learned patterns. Machine-learning approaches, which include neural networks, hidden Markov models, belief networks and kernel-based classifiers such as support vector machines, are ideally suited for domains characterized by the existence of large amounts of data, noisy patterns and the absence of general theories. Support vector machines are disclosed in U.S. Pat. Nos. 6,128,608 and 6,157,921, both of which are assigned to the assignee of the present application and are incorporated herein by reference.
The quantities introduced to describe the data that is input into a learning machine are typically referred to as “features”, while the original quantities are sometimes referred to as “attributes”. A common problem in classification, and machine learning in general, is the reduction of dimensionality of feature space to overcome the risk of “overfitting”. Data overfitting arises when the number n of features is large, such as the thousands of genes studied in a microarray, and the number of training patterns is comparatively small, such as a few dozen patients. In such situations, one can find a decision function that separates the training data, even a linear decision function, but it will perform poorly on test data. The task of choosing the most suitable representation is known as “feature selection”.
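The overfitting scenario described above can be illustrated with a minimal sketch. The example assumes purely random features and random labels, so there is no real pattern to learn, and it uses a simple perceptron as a stand-in linear classifier; the perceptron, the data sizes and all names are illustrative assumptions rather than part of any disclosed method.

```python
import random

random.seed(0)
N_FEATURES = 200  # many features, e.g., genes on a microarray (assumed size)
N_TRAIN = 20      # few training patterns, e.g., patients (assumed size)
N_TEST = 200      # held-out patterns drawn the same way

def make_data(m, n):
    """Random features with random +/-1 labels: nothing real to learn."""
    X = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    y = [random.choice((-1, 1)) for _ in range(m)]
    return X, y

def predict(w, b, x):
    """Linear decision function sign(w.x + b)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def train_perceptron(X, y, max_epochs=1000):
    """Update weights on each mistake until the training set is separated."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if predict(w, b, xi) != yi:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def accuracy(w, b, X, y):
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)

X_train, y_train = make_data(N_TRAIN, N_FEATURES)
X_test, y_test = make_data(N_TEST, N_FEATURES)
w, b = train_perceptron(X_train, y_train)
train_acc = accuracy(w, b, X_train, y_train)
test_acc = accuracy(w, b, X_test, y_test)
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```

Because the number of features far exceeds the number of training patterns, a separating linear decision function exists even for meaningless labels, while accuracy on the held-out patterns is expected to stay near chance.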
A number of different approaches to feature selection exist, where one seeks to identify the smallest set of features that still conveys the essential information contained in the original attributes. This is known as “dimensionality reduction” and can be very beneficial as both computational and generalization performance can degrade as the number of features grows, a phenomenon sometimes referred to as the “curse of dimensionality.”
Training techniques that use regularization, i.e., restricting the class of admissible solutions, can avoid overfitting the data without requiring space dimensionality reduction. Support Vector Machines (SVMs) use regularization; however, even SVMs can benefit from space dimensionality (feature) reduction.
The problem of feature selection is well known in pattern recognition. In many supervised learning problems, feature selection can be important for a variety of reasons, including generalization performance, running time requirements, and constraints and interpretational issues imposed by the problem itself. Given a particular classification technique, one can select the best subset of features satisfying a given “model selection” criterion by exhaustive enumeration of all subsets of features. However, this method is impractical for large numbers of features, such as thousands of genes, because of the combinatorial explosion of the number of subsets.
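The combinatorial growth noted above can be made concrete with a short calculation. The sketch below only counts candidate subsets and does not perform any selection; the function names and the feature counts chosen are illustrative assumptions.

```python
import math

def count_all_subsets(n_features: int) -> int:
    """Number of non-empty feature subsets an exhaustive search must score."""
    return 2 ** n_features - 1

def count_subsets_of_size(n_features: int, k: int) -> int:
    """Number of subsets containing exactly k of the n features."""
    return math.comb(n_features, k)

for n in (10, 20, 50, 1000):
    digits = len(str(count_all_subsets(n)))
    print(f"{n} features: subset count has {digits} decimal digits")

# Even restricting the search to pairs of features grows quadratically.
print("pairs among 1000 features:", count_subsets_of_size(1000, 2))
```

With ten features an exhaustive search scores 1,023 subsets, whereas with a thousand features the count has several hundred decimal digits, which is why enumeration is not feasible at the scale of a microarray.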
One method of feature reduction is projecting onto the first few principal directions of the data. Using this method, new features are obtained that are linear combinations of the original features. One disadvantage of projection methods is that none of the original input features can be discarded. Preferred methods incorporate pruning techniques that eliminate some of the original input features while retaining a minimum subset of features that yields better classification performance. For design of diagnostic tests, it is of practical importance to be able to select a small subset of genes for cost effectiveness and to permit the relevance of the genes selected to be verified more easily.
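The projection approach described above, and its drawback, can be illustrated with a minimal sketch. The example assumes four synthetic features driven by a single hidden factor and finds the first principal direction by power iteration on the covariance matrix; the data, the coefficients and the function name are hypothetical and chosen only for illustration.

```python
import random

def first_principal_direction(rows, iters=200):
    """Return (feature means, unit vector of largest variance) by power iteration."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(c[i] * c[j] for c in centered) / (m - 1) for j in range(n)]
           for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

random.seed(1)
coeffs = (1.0, 2.0, -1.0, 0.5)  # assumed loading of each feature on the hidden factor
rows = []
for _ in range(50):
    z = random.gauss(0.0, 1.0)
    rows.append([c * z + random.gauss(0.0, 0.1) for c in coeffs])

means, direction = first_principal_direction(rows)

# The new feature is a weighted sum of all four centered original features.
projected = [sum(d * (x - mu) for d, x, mu in zip(direction, r, means))
             for r in rows]
print("weights on original features:", [round(d, 3) for d in direction])
```

Every original feature receives a non-zero weight in the projected feature, so all four measurements would still have to be made for each new sample, in contrast to a pruning method that retains only a subset of the original inputs.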
Accordingly, the need remains for a method for selection of the features to be used by a learning machine for pattern recognition which still minimizes classification error.