The present invention relates to bioinformatics and its use in methods of characterizing and identifying candidate genes. More particularly, the invention relates to the use of information extraction in the analysis of data from high-throughput gene expression profiling experiments.
High-throughput gene expression profiling techniques, such as those employing DNA microarrays, have become standard and widely used for the identification of drug targets, gene therapy targets and therapeutic protein targets in human medicine (see U.S. Pat. Nos. 5,807,522 and 5,593,839). The early and efficient identification of target genes, otherwise known as “candidate” genes, using these techniques could significantly reduce the overall cost and time taken to develop and market actual products.
The rise in popularity of gene expression profiling, coupled with the increase in complexity of the experiments, has led to a tremendous increase in the amount of information that has to be organized and processed. One experiment alone, for example a time course of a disease process or a comparison of a treated sample with a non-treated control, may provide data on several thousand different genes. Analyzing and storing this data in a meaningful way has become rate limiting for biologists.
Several computational tools have been applied to this problem. For example, computational methods have assigned names to DNA sequences by comparing those sequences with the sequences of named genes in public databases using algorithms such as BLAST (see generally U.S. Pat. No. 6,023,659). DNA sequences have also been assembled and grouped into functional hierarchies by specific algorithms to help investigators interpret gene expression data (U.S. Pat. No. 6,023,659).
More recently, gene expression profiles have been examined using methods that can cross-compare the expression profiles of many thousands of genes across many different experiments (for example Eisen et al., P.N.A.S. 95, 14863-8). These methods employ pattern recognition algorithms to cluster genes with similar expression patterns, facilitating the identification of groups of genes that are co-regulated. Both supervised and unsupervised pattern recognition algorithms can be used for this purpose. Supervised pattern recognition algorithms require a priori knowledge that forms a training set, whereas unsupervised pattern recognition algorithms do not need a priori knowledge and are typically used to discover latent patterns. Many unsupervised clustering methods have been applied to gene expression profile data, including hierarchical clustering, K-means clustering and self-organizing maps (Tamayo et al. PNAS 96:2907-12). Supervised methods, such as support vector machines (M. Brown et al. PNAS 97:262-7), have also been applied.
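The unsupervised clustering described above can be illustrated with a minimal sketch. It is not the method of any cited reference; it assumes average-linkage agglomerative clustering with a distance of one minus the Pearson correlation between profiles. The gene names, expression values and the function names `pearson` and `cluster_genes` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (a constant profile would give
    a zero denominator).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def cluster_genes(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Distance between clusters is the average of (1 - Pearson correlation)
    over all cross-cluster gene pairs (average linkage).
    """
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Find the indices of the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical expression values over four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.1, 2.2, 0.8],
}
result = cluster_genes(profiles, 2)
print(result)
```

In this toy data set, the two rising profiles and the two falling profiles end up in separate clusters, which is the sense in which such methods group genes that may be co-regulated. No training set is supplied, which is what makes the approach unsupervised.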
Once gene expression data has been gathered and analyzed, mostly by computer, researchers typically spend a significant amount of time gathering information from public databases, in particular public literature databases, in order to annotate their genes of interest, increase their confidence in a particular result, and permit the discovery of candidate genes. This annotation is typically performed manually, in part due to a lack of tools to organize and process the enormous amount of public literature that is available for many of these genes. Because of this manual step, the available methods do not allow the efficient identification of candidate genes.
Thus, there is a pressing need for tools that can process, summarize and cross-reference the enormous amounts of public literature, and allow this data to be used in combination with gene expression profiles to aid in discovering candidate genes. Since public literature is making a transition from printed media to digital media in the form of literature databases, an opportunity has emerged for computers to assist in this effort. There have been several attempts to use information extraction (IE) and natural language processing (NLP) methods within the context of biology. For example, protein-protein interactions can be examined using IE approaches (Science (1997) 275(5298):327-334; Proux (1998) Genome Inf. Workshop 9, 72-80; Hishiki et al. (1998) Genome Inf. Workshop 9, 81-90). However, IE and NLP have not yet been used in the context of examining gene expression profiling data to identify candidate genes.
Thus, within the art, there is a need for methods and techniques that can efficiently annotate genes with known information, in particular information from public literature databases regarding relationships between gene functions, and organize this information with gene expression profiles, facilitating the identification of candidate genes.