1. Field of the Invention
Aspects of the present invention generally relate to methods and systems for training machines to categorize data and/or recognize patterns in data, and to machines and systems relating thereto. More specifically, exemplary aspects of the invention relate to methods and systems for training machines that include providing one or more training data samples encompassing one or more data classes, identifying patterns in the one or more training data samples, providing one or more data samples representing one or more unknown classes of data, identifying patterns in the one or more data samples of unknown class(es), and predicting one or more classes to which the data samples of unknown class(es) belong by comparing patterns identified in said one or more data samples of unknown class with patterns identified in said one or more training data samples. Also provided are tools, systems, and devices, such as support vector machines (SVMs) and other features and methods, software implementing the features and methods, and computers incorporating the features and methods and/or running software, where the features and methods, software, and computers utilize various aspects of the present invention relating to analyzing data.
2. Description of Related Art
Machine learning is the study of how to create computers that learn from experience, and modify their activity based on that learning (as opposed to traditional computers, for which activity typically will not change unless the programmer explicitly changes it). As is known in the art, learning machines comprise software programs, for example, that may be trained to generalize using data with known outcomes. Trained learning machine software programs may then be applied to cases of unknown outcome for prediction. For example, a learning machine may be trained to recognize patterns in data. Learning machines may be trained to solve a wide variety of problems across a variety of disciplines.
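The train-then-predict workflow described above can be illustrated with a minimal sketch. It assumes a nearest-centroid rule, chosen here only because it is short; it is not the method of any reference discussed in this section, and the function names and sample values are hypothetical.

```python
import math


def train_centroids(samples, labels):
    """Learn one 'pattern' per class: the mean vector of its training samples."""
    sums, counts = {}, {}
    for x, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, x):
    """Assign a sample of unknown class to the class with the closest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


# Hypothetical training data with known outcomes.
training_samples = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
training_labels = ["a", "a", "b", "b"]
centroids = train_centroids(training_samples, training_labels)
```

Once trained, `predict(centroids, [2.0, 1.0])` compares the unknown sample with each learned class pattern and returns the label of the closest one.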
Bioinformatics is one example of the use of machine learning techniques to maximize the information that may be derived from biotechnology data. Bioinformatics enables researchers to better manage and utilize vast amounts of biological information. Techniques for large-scale biology have resulted, for example, in enormous gene expression data repositories containing many separate studies by different groups. Researchers are often unable to adequately query this data because they lack proper tools.
Scientists have discovered that the root cause of many serious diseases is change in a person's gene expression resulting, for example, in a cancerous tumor, Multiple Sclerosis, or diabetes, and so methods of searching gene expression data could result in significant progress in developing treatments for these and other conditions. Currently, only a limited number of methods are available for searching across diverse gene expression experiments. These experiments may include, for example, a study of diseased vs. normal tissue, the exposure of tissue culture cells to chemical compounds, or relationships between expression patterns in diseased cells and normal expression in other types of cells. Search methods may be difficult to implement, even where these experiments were conducted using the same type of array format. A machine learning based querying method is needed that could, among other things, facilitate scientific discovery and help identify treatments for a variety of medical conditions by recognizing patterns contained in this data.
BLAST (Basic Local Alignment Search Tool) is an example of a simple searching tool that provides a method for rapidly searching nucleotide and peptide databases. BLAST is a sequence comparison software program, optimized for speed, which is used to search sequence databases for optimal local alignments to a query. Sequence alignments provide a powerful way to compare novel sequences with previously characterized genes, and both functional and evolutionary information can be inferred from well-designed queries and alignments. Since the BLAST software program detects local as well as global alignments, regions of similarity embedded in otherwise unrelated proteins can be detected. BLAST has provided a useful tool for searching nucleotide and peptide databases, but more sophisticated tools are needed to search gene expression data.
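The notion of a local alignment, in which a region of similarity is found inside otherwise unrelated sequences, can be illustrated with a short sketch. The code below is not BLAST, which relies on a faster heuristic search. It is a plain Smith-Waterman dynamic-programming score, and the scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score between a and b.

    Scores are floored at zero, so an alignment may start and end anywhere,
    which is what lets a short shared region stand out.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

For example, the hypothetical sequences "TTTACGTAAA" and "GGGACGTCCC" share only the embedded region "ACGT", and the score reflects that four-letter match rather than the dissimilar flanking regions.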
Among methods of pattern recognition, the support vector machine (SVM) has been proven to have exceptional performance as a classifier, and can tackle a wide range of problems. (Vapnik, Statistical Learning Theory, John Wiley and Sons, Inc. (1998).) SVMs are a form of machine learning invented by Boser, Guyon, and Vapnik. (Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144-152 (1992).) SVMs were first applied to gene expression samples to create trained classifiers (Brown, et al., Proc Natl Acad Sci USA, 97:262-7 (2000); Furey, et al., Bioinformatics, 16:906-14 (2000)).
While SVMs and other machine learning tools have gained popularity in recent years, the methods remain primarily used for microarray analysis and classification, and have not been fully developed to optimize trained machines for searching and querying. Use of SVMs for data mining is largely limited to text, such as for use in examination of protein names (Shi and Campagne, BMC Bioinformatics, 6:88 (2005)). ‘Data mining’ in microarray contexts usually refers to such tasks as retrieving an important gene list from within an experiment (e.g., Frank et al., Bioinformatics, 20:2479-2481 (2004)), not to the querying of a repository for similar and/or distantly related data patterns. Current searching methods still rely largely on annotations, descriptive information, or value ranges for specific fields, and large amounts of data are therefore often not being utilized.
Some existing machine learning techniques utilize SVMs that allow pattern-matching within a single type of experiment, and SVMs have proven superior to many supervised and unsupervised methods at classification based on subtle relationships in data. But SVMs require advancements in generalization, speed, performance and automation to enable their effective use by scientists to query large, varied databases for both exact and merely similar matches.
Existing methods also typically rely heavily upon the end user's knowledge of mathematics: in the current state of the art, the efficient use of machine learning algorithms requires that the end user understand the underlying mathematical techniques in order to optimize results. In particular, the choice of which trained machine is most likely to generalize well to data of unknown category is often a difficult one. Also, determining which features in the data are the most important, and how many and which should be used in training, is normally a complex problem that is challenging even to those well skilled in the art. Accordingly, there remains an unfilled need for tools that can be used directly and effectively by research and biomedical personnel who do not have special training in mathematics.
Various machine learning systems and methods have been developed, including those described below.
U.S. Pat. No. 5,649,068 discloses systems and methods for pattern recognition using support vectors. Decision systems based on the dual representation mathematical principle are described, where the principle permits some decision functions that are weighted sums of predefined functions to be represented as memory-based decision functions. Using this principle, a memory-based decision system with optimum margin is designed, where weights and prototypes of training patterns of a memory-based decision function are determined such that the corresponding dual decision function satisfies the criterion of margin optimality.
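The dual representation principle mentioned above can be illustrated for the simplest case, a linear kernel. The sketch below shows that a decision function written as a weighted sum over input coordinates gives the same value as a memory-based function written as a weighted sum of comparisons against stored training patterns. It is not taken from the cited patent. The coefficients are arbitrary illustrative values rather than an optimum-margin solution, and all names are hypothetical.

```python
def dot(u, v):
    """Inner product, used here as a linear kernel."""
    return sum(a * b for a, b in zip(u, v))


def primal_decision(weights, bias, x):
    """Direct form: weighted sum of the input coordinates."""
    return dot(weights, x) + bias


def dual_decision(alphas, labels, prototypes, bias, x, kernel=dot):
    """Memory-based form: weighted sum of kernel comparisons with stored patterns."""
    return sum(a * y * kernel(p, x)
               for a, y, p in zip(alphas, labels, prototypes)) + bias


def weights_from_dual(alphas, labels, prototypes):
    """Recover the direct-form weights implied by the dual coefficients."""
    dim = len(prototypes[0])
    return [sum(a * y * p[j] for a, y, p in zip(alphas, labels, prototypes))
            for j in range(dim)]


# Arbitrary illustrative values (assumptions, not a trained classifier).
prototypes = [[1.0, 2.0], [-1.0, 0.5]]
labels = [1, -1]
alphas = [0.5, 0.25]
bias = 0.1
x = [2.0, -1.0]

weights = weights_from_dual(alphas, labels, prototypes)
primal_value = primal_decision(weights, bias, x)
dual_value = dual_decision(alphas, labels, prototypes, bias, x)
```

Here `primal_value` and `dual_value` agree up to floating-point rounding. Replacing `dot` with a nonlinear kernel in `dual_decision` is what allows a memory-based system to represent decision surfaces that the direct linear form cannot.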
U.S. Pat. Nos. 6,228,575, 6,924,094, and 7,252,948 disclose methods for making chip-based species identifications and phenotypic characterizations of microorganisms using arrays of oligonucleotides, and assessing differences in hybridization between organisms.
U.S. Pat. No. 6,789,069 discloses methods for using a learning machine to extract information from large amounts of biological data, in which training data and test data are pre-processed in order to add dimensionality or to identify missing or erroneous data points. After training of the learning machine using the training and test data has been confirmed, live data is pre-processed and input into the trained learning machine in order to obtain information. The learning machine may be one or more SVMs.
U.S. Pat. No. 7,062,384 discloses methods of classifying high-dimensional biological data obtained from biological samples, using a least squares-based dimension reduction step followed by a logistic determination step. The methods may be used to make univariate or multivariate classifications.
U.S. Pat. No. 7,117,188 discloses methods for identifying patterns in biological systems using SVMs and Recursive Feature Elimination (RFE). The methods include pre-processing of the data sets by correcting or eliminating missing or erroneous data points. The pre-processing of the data sets may also include adding dimensionality to training data by adding new coordinates to the vector, which may be derived by applying a transformation to the original coordinates.
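The general idea of Recursive Feature Elimination can be sketched as a loop: train a linear classifier, discard the feature with the smallest squared weight, and retrain on the remaining features. The sketch below is not the method of the cited patent. A simple perceptron stands in for the SVM to keep the example self-contained, and the function names and data are hypothetical.

```python
def train_perceptron(X, y, epochs=20):
    """Train a plain perceptron (a stand-in for an SVM); returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified or on the boundary: update
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def recursive_feature_elimination(X, y, n_keep):
    """Return indices of the n_keep features that survive elimination."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, _ = train_perceptron(X_sub, y)
        # Rank features by squared weight and drop the least influential one.
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active


# Hypothetical data: feature 0 separates the classes, feature 1 is constant,
# and feature 2 is small noise.
X = [[2.0, 0.0, 0.1], [3.0, 0.0, -0.1], [-2.0, 0.0, 0.1], [-3.0, 0.0, -0.1]]
y = [1, 1, -1, -1]
selected = recursive_feature_elimination(X, y, n_keep=1)
```

Retraining after each removal is what distinguishes this approach from a one-pass ranking, since the weights of the remaining features can change once a feature is dropped.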
U.S. Pat. No. 7,318,051 discloses methods for feature selection in a learning machine. The methods include a pre-processing step to reduce the quantity of features to be processed. The feature selection methods include RFE, minimizing the number of non-zero parameters, evaluating a cost function to identify a subset of features compatible with constraints imposed by the learning set, unbalanced correlation score, and transductive feature selection. Features remaining after feature selection are used to train the learning machine.
U.S. Publ. Appl. No. 2003/0207278 discloses methods for diagnosing diseases by obtaining high dimensional experimental data, filtering the data, reducing the dimensionality of the data, training a supervised pattern recognition method, ranking and choosing individual data points, and using the data points to determine if unknown data indicates a disease condition, a predilection for a disease, or a prognosis regarding a disease.
U.S. Publ. Appl. No. 2004/0058376 and U.S. Pat. Nos. 6,733,969 and 6,303,301 disclose methods for mapping relationships among genes by parallel monitoring of gene expression. The methods include detecting expression of downstream genes in reference cells and target cells to determine expression patterns, and comparing those expression patterns to detect functional mutations.
U.S. Publ. Appl. No. 2005/0131847 discloses methods for pre-processed feature ranking for an SVM. The features are pre-processed to minimize classification error, and to constrain features used to train the SVM. After training, live data is processed using the SVM.
U.S. Publ. Appl. No. 2005/0287575 discloses methods for improving the determination of the genotype of a biological sequence. The methods include receiving intensity data for probe features on an array, applying filters to the intensity values, applying models to the intensity values to determine the genotype of each feature, and combining the genotype determinations to make a final genotype determination. The reliability of the genotype determination is tested.
U.S. Publ. Appl. No. 2006/0064415 discloses a data mining platform including a plurality of modules, where each module includes an input data component, a data analysis engine for processing the input data, an output component for outputting results of the data analysis, and a web server to access and monitor the modules within the platform and provide communication amongst the modules. Each module processes a different type of data.
U.S. Publ. Appl. No. 2007/0026406 discloses methods for classifying multi-dimensional biological data, which can be used to predict a biological activity or biological state. The methods include providing a plurality of gene expression datasets associated with a first class of compounds having a first biological activity, providing a plurality of gene expression datasets associated with a second class of compounds having a second biological activity, deriving a linear classification rule based on the plurality of gene expression datasets, and applying the linear classification rule to a set of gene expression levels associated with a compound of interest in order to determine whether the compound of interest has the first biological activity or the second biological activity.
U.S. Publ. Appl. No. 2007/0203861 discloses a method for operating a computer as an SVM in order to define a decision surface separating two classes of a training set of vectors. The method includes associating a distance parameter with each vector of the SVM's training set, where the distance parameter indicates the distance of its associated vector from the opposite class.
U.S. Publ. Appl. No. 2007/0276610 discloses a method for classifying genetic data. The method for class prediction is based on identifying a nonlinear system that has been defined for carrying out a given classification task. Information characteristic of exemplars from the classes to be distinguished is used to create training inputs, and the training outputs are representative of the class distinctions to be made. Nonlinear systems are found to approximate the defined input/output relations, and these nonlinear systems are then used to classify new data samples.
These systems and methods provide complex data processing, but rely upon inefficient techniques, such as expansion of training data and selection of optimum machines based on test data sets. Accordingly, there remains a great need in the art for tools and methods for using machine learning techniques to efficiently recognize patterns in data, and/or categorize data based on pattern recognition. The ability of a learning machine to discover knowledge from data is limited by the type of algorithm selected. Accordingly, there is also a need for methods and systems for selecting and/or creating an appropriate algorithm for a learning machine. There remains a need in the art for accurate methods for estimating the success of a trained machine using training data alone, which can be easily automated. Methods are also lacking that allow for the creation of hypothetical patterns and searching data to see if such patterns, or similar patterns, exist in actuality.
Methods, systems and devices are needed to manipulate the information contained in the databases containing data generated by investigations, for example, of proteomics and genomics. Also, methods, systems and devices are needed to integrate information from genomic, proteomic and traditional sources of biological and medical information. Such information is needed for the diagnosis and prognosis of diseases and other changes in biological and other systems.
Further, methods and compositions are also needed for treating diseases and other changes in biological systems that are identified by the trained learning machine. Once patterns or the relationships between the data are identified by the learning machines of the present invention and are used to detect or diagnose a particular disease state, there is a need for diagnostic tests, including gene chips and tests of bodily fluids or bodily changes, and for methods and compositions for treating the condition.