1. Field of the Invention
The present invention generally relates to a system and method for identifying genes and, more particularly, to a system and method that utilize a database of patterns to identify genes.
2. Description of the Related Art
Gene identification is one of the most important problems in molecular biology and has been receiving increasing attention with the advent of automated large scale sequencing projects. Indeed, more than 70 complete genomes currently exist in the public domain, while the sequencing of many others is currently in progress. Consequently, the automated identification of the protein coding regions in a newly sequenced genome is gaining importance.
Accurate gene prediction is of relevance to many biological applications. For instance, the predicted coding regions can be used to generate probes for a DNA microarray, or to form the basis for knockout experiments. In addition, the candidate proteins that correspond to these predicted genes might be used as new drug targets, and so forth.
Specific attention has been given to the prokaryotic gene identification problem. With the exception of a handful of reported instances in archaeal organisms, splicing generally does not occur in prokaryotes, and thus the problem of gene identification in these organisms is assumed to be simpler than its eukaryotic counterpart. Even so, the available schemes for in silico gene prediction in prokaryotic genomes can be improved further, and increasingly accurate prediction methods are continually sought.
Over the years, a large number of methods have been proposed that address the gene identification problem. These methods can be largely divided into two categories. The first school of thought makes use of the statistics of DNA sequences to determine gene locations. It was observed early on that nucleotide usage exhibits different statistical properties in DNA regions that code for genes than it does in regions that do not: the concept of the CpG island (e.g., see Bird, A., (1987) “CpG islands as gene markers in the vertebrate nucleus”, Trends in Genetics, 3: 342-347) is a demonstration of such a difference in statistical behavior.
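By way of illustration only, the kind of statistical difference described above can be made concrete with a short Python sketch that computes, over fixed-size windows, the G+C fraction and the ratio of observed to expected CpG dinucleotides. The function names, the window size and the step size are assumptions chosen for this example and are not taken from the cited reference.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Returns 0.0 when the ratio is undefined (no C or no G present).
    """
    seq = seq.upper()
    n, c, g = len(seq), seq.count("C"), seq.count("G")
    if n == 0 or c == 0 or g == 0:
        return 0.0
    # Occurrences of "CG" cannot overlap, so str.count counts all of them.
    return seq.count("CG") * n / (c * g)


def window_stats(seq: str, window: int = 200, step: int = 100):
    """Yield (start, gc_content, cpg_obs_exp) for each full window of seq."""
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        yield start, gc_content(chunk), cpg_obs_exp(chunk)


if __name__ == "__main__":
    toy = "CGCGCGCGATATATAT"  # hypothetical toy sequence
    for start, gc, ratio in window_stats(toy, window=8, step=8):
        print(start, gc, ratio)
```

Windows whose two statistics are both elevated relative to the rest of the sequence would be candidates for further inspection; any thresholds applied to these values are a design choice left to the user of such a sketch.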
Among the gene identification methods that make use of this observation, hidden Markov models (HMMs) are probably the most popular. Specifically, HMMs are used in conventional methods such as GLIMMER (e.g., see Delcher, A. L., et al., (1999), “Improved Microbial Gene Identification with GLIMMER”, Nucl. Acids Res., 27(23): 4636-4641; and Salzberg, S. L., et al., (1998) “Microbial Gene Identification Using Interpolated Markov Models”, Nucl. Acids Res., 26(2): 544-548) and GeneMark (Lukashin, A. V., and Borodovsky, M., (1998), “GeneMark.hmm: New Solutions for Gene Identification”, Nucl. Acids Res., 26(4): 1107-1115).
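The general idea behind such Markov-model-based scoring can be sketched as follows. The sketch below is a simplified illustration under stated assumptions: it uses a fixed-order Markov chain with add-one smoothing rather than the interpolated or hidden Markov models of the cited tools, and the names train_markov and log_odds, the model order and the toy training sequences are all hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Return a function prob(context, next_base) estimated from seqs.

    A fixed-order Markov chain with additive smoothing; this is a
    simplification, not the model used by any particular published tool.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            ctx, nxt = s[i - order:i], s[i]
            if nxt in ALPHABET and all(ch in ALPHABET for ch in ctx):
                counts[ctx][nxt] += 1.0

    def prob(ctx, nxt):
        row = counts.get(ctx, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(nxt, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, background, order=2):
    """Sum of log(P_coding / P_background) over the positions of seq."""
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        score += math.log(coding(ctx, nxt) / background(ctx, nxt))
    return score


if __name__ == "__main__":
    # Hypothetical toy training sets standing in for real annotated data.
    coding = train_markov(["ATGGCTGCTGCTGCTGCTGCT"])
    background = train_markov(["ATATATATATATATATATAT"])
    print(log_odds("GCTGCTGCT", coding, background))  # positive score
    print(log_odds("ATATATATA", coding, background))  # negative score
```

A positive score indicates that the sequence resembles the coding training set more than the background set. The dependence of the result on the training sets is apparent from the sketch: a different organism would generally call for different training data.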
The second school of thought advocates a strategy that is based on similarity searches in databases containing genomic information (e.g., see Badger, J. H. and Olsen, G. J., (1999), “CRITICA: Coding Region Identification Tool Invoking Comparative Analysis”, Molecular Biology and Evolution, 16:512-524; Bafna, V., and Huson, D. H., (2000), “The Conserved Exon Method for Gene Finding”, Proc. ISMB '00; Gelfand, M. S., Mironov, A. A., and Pevzner, P., (1996) “Gene Recognition Via Spliced Alignment”, Proc. Natl. Acad. Sci. USA, 93:9061-9066; Gish, W., and States, D. J., (1993) “Identification of Protein Coding Regions by Database Similarity Search”, Nat. Genet., 3:266-272; and Robinson, K., Gilbert, W., and Church, G., (1994) “Large-scale Bacterial Gene Discovery by Similarity Search”, Nat. Genet., 7:205-214). Here one searches in existing databases for either proteins or DNA regions in other genomes that share similarities with candidate proteins corresponding to open reading frames (ORFs) identified in the genome under consideration (e.g., see Burge, C., and Karlin, S., (1998), “Finding the Genes in Genomic DNA”, Current Opinion in Structural Biology, 8:346-354; Burset, M. and Guigo, R., (1996) “Evaluation of Gene Structure Prediction Programs”, Genomics, 34:353-367; Claverie, J. M., (1998), “Computational Methods for Exon Detection”, Molecular Biotechnology, 10:27-48; Claverie, J. M., (1997), “Computational Methods for the Identification of Genes in Vertebrate Genomic Sequences”, Human Molecular Genetics, 6(10):1735-1744; Fickett, J. W., (1996), “The Gene Identification Problem: An Overview for Developers”, Computers Chem., 20(1):103-118; and Fickett, J. W. and Hatzigeorgiou, A. G., (1997), “Eukaryotic Promoter Recognition”, Genome Research, 7: 871-878).
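The first step of such a similarity-based approach, namely enumerating ORFs and deriving the candidate proteins that would subsequently be submitted to a database search, might be sketched as below. This is a minimal illustration and not the procedure of any cited reference: it assumes that only ATG serves as a start codon, uses the standard genetic code, reports coordinates on the strand being scanned, and omits the database search itself. The names find_orfs and translate and the min_codons parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (non-ACGT symbols are kept)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 20):
    """Yield (strand, start, end, dna) for each ATG-to-stop ORF.

    Coordinates are 0-based, half-open, and relative to the scanned strand.
    For each stop codon, the first in-frame ATG after the previous stop is
    used as the start. min_codons counts codons before the stop codon.
    """
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None


def translate(dna: str) -> str:
    """Translate dna codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # hypothetical toy sequence
    for strand, start, end, dna in find_orfs(toy, min_codons=3):
        print(strand, start, end, translate(dna))
```

Each translated candidate would then be compared against a protein or nucleotide database with a search tool of the practitioner's choosing. The value chosen for min_codons directly limits how short a gene such a pipeline can report.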
However, these conventional strategies have shortcomings. Statistical methods such as HMMs can find regions whose statistical behavior is similar to that of the training set used. If no appropriate training set is available, however, one must resort to training sets that are derived through database searches, or simply assume that very long open reading frames code for genes. The statistics of coding regions often differ from organism to organism, and ideally one ought to use HMMs whose parameters are organism-dependent if one wishes to achieve a high prediction rate. That is, one must train HMMs separately for each genome.
It has also been demonstrated that there exist many genes that are statistically distinct from other genes of the same organism, such as genes that are the result of horizontal transfer (e.g., see Kehoe, M. A., Kapur, V., et al., (1996) “Horizontal Gene Transfer Among Group A Streptococci: Implications for Pathogenesis and Epidemiology”, Trends Microbiol., 4(11):436-443; and Nielsen, K. M., Bones, A. M., et al., (1998), “Horizontal Gene Transfer From Transgenic Plants to Terrestrial Bacteria—A Rare Event?”, FEMS Microbiol. Rev., 22(2):79-103). Such cases typically pose challenges to statistical methods.
Finally, short genes (e.g., those encoding fewer than 60-80 amino acids) cannot be predicted easily using statistical methods. Similarity-based methods are more successful in finding short genes, or genes that are statistically different from those in the rest of the organism under consideration, as long as similar genes or proteins already appear in the databases being searched. Additional problems arise if the shared similarity between a candidate gene and its database counterpart is very low. On the other hand, the quality of the answers does not depend on the choice of training sets. Similarity-based methods are also generally better than statistical methods at determining the correct location of genes, which is a desirable property. It is for these reasons that large genome sequencing projects often employ a combination of methods from both schools (e.g., see Fleischmann, R. D., et al., (1995), “Whole-genome Random Sequencing and Assembly of Haemophilus influenzae”, Science, 269: 496-512).