I. Field of the Invention
The present invention generally relates to the field of genomic research and to the use of computational tools for analyzing nucleic acid sequences. More particularly, the present invention relates to computer- and software-based tools for discriminating protein-coding regions from non-coding regions in a sequence, and for analyzing other properties of a nucleic acid sequence.
II. Background and Material Information
Vast amounts of sequence data have been collected in the scientific community as a result of various sequencing efforts. The complete genome sequence is already available for many prokaryotes, for yeast, and for the nematode Caenorhabditis elegans. Further, efforts to sequence the entire human genome continue to place large amounts of sequence data into public databases. In addition, efforts to sequence the genomes of other organisms, such as the mouse, will generate additional sequence data for the scientific community. Recently established commercial ventures also promise to accelerate the pace of genomic sequencing even further (see, for example, PENNISI, E. (1999), "Human Genome: Academic Sequencers Challenge Celera in a Sprint to the Finish," Science, Vol. 283, pp. 1822-1823).
The deluge of sequence data has created a need for automated processes for deriving useful information from sequences. In particular, there is a critical need for computational tools that do not require human intervention and that provide a high processing throughput. For example, in the area of genomic research, there is a strong need for reliable computational tools that provide an automated process for identifying protein-coding genes in deoxyribonucleic acid (DNA) sequences. The ability to accurately distinguish protein-coding regions from non-coding regions in a sequence is a critical aspect of the overall process for developing therapeutic compounds and drugs for treating human diseases.
Several approaches have been proposed for gene detection. One approach for determining the coding potential of DNA sequences has been to look for patterns that distinguish coding sequences from non-coding sequences. Such approaches look for the predominant occurrence of a given base at specific codon positions and/or the presence of other characteristics in order to identify protein-coding regions within a DNA sequence (see, for example, FICKETT, J. W. (1982), "Recognition of Protein Coding Regions in DNA Sequences," Nucleic Acids Research, Vol. 10, pp. 5303-5318). In the approach proposed by FICKETT, preference numbers are computed for all four bases (i.e., Adenine (A), Cytosine (C), Guanine (G), and Thymine (T)) at each of the three codon positions to generate a total of twelve numbers. By establishing the characteristic values of these twelve numbers among coding and non-coding sequences, the FICKETT approach attempts to identify protein-coding regions within a DNA sequence.
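By way of illustration only, the twelve position-specific numbers described above may be sketched as follows. The sketch simply tabulates the relative frequency of each base at each codon position; it is not the published FICKETT statistic, and the function name `position_base_frequencies` and the `frame` parameter are hypothetical names introduced for this example.

```python
from collections import Counter

BASES = "ACGT"

def position_base_frequencies(seq, frame=0):
    """Return a dict mapping (base, codon_position) to a relative frequency.

    codon_position is 1, 2, or 3, counted from the assumed reading frame.
    Characters other than A, C, G, T are ignored. This is a simplified
    illustration, not the statistic published by FICKETT.
    """
    seq = seq.upper()
    counts = [Counter() for _ in range(3)]
    for i in range(frame, len(seq)):
        base = seq[i]
        if base in BASES:
            counts[(i - frame) % 3][base] += 1

    table = {}
    for pos in range(3):
        total = sum(counts[pos].values())
        for base in BASES:
            # Four bases times three positions gives twelve numbers in total.
            table[(base, pos + 1)] = counts[pos][base] / total if total else 0.0
    return table

if __name__ == "__main__":
    for key, value in sorted(position_base_frequencies("ATGGCTGCAAAGTGA").items()):
        print(key, round(value, 3))
```

A classifier built on such a table would still need characteristic values derived from known coding and non-coding sequences, which is the database dependence discussed later in this section.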
Codon preference approaches (see, for example, GRIBSKOV, M., et al., (1984), "The Codon Preference Plot: Graphic Analysis of Protein Coding Sequences and Prediction of Gene Expression," Nucleic Acids Research, Vol. 12, pp. 539-549) typically exploit the fact that species differ in their preference for using certain codons for certain amino acids. Codon usage tables have been prepared for a variety of species. Under this approach, adherence to the usage characteristics of the species in question is considered an indication that the given sequence is a coding region. However, codon preferences relate to the availability of specific transfer ribonucleic acids (tRNAs) in the cell. Therefore, strict adherence to a codon frequency table is much more likely for highly expressed genes than for genes of lower expression levels.
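The general idea of scoring a sequence against a codon usage table may be sketched as shown below. This is a minimal sketch under stated assumptions and does not reproduce the statistic of GRIBSKOV et al.: the usage table is estimated from a caller-supplied reference sequence, a pseudocount is added so that unseen codons do not yield a zero frequency, and the score is the mean log2 ratio of each codon's table frequency to a uniform expectation of 1/64. The names `codon_usage` and `preference_score` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codon_usage(reference, pseudocount=1.0):
    """Estimate codon frequencies from an in-frame reference coding sequence."""
    reference = reference.upper()
    counts = Counter(reference[i:i + 3] for i in range(0, len(reference) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS) + 64 * pseudocount
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def preference_score(seq, usage, frame=0):
    """Mean log2 ratio of table frequency to the uniform expectation (1/64).

    Positive values mean the sequence favors codons that are common in the
    table; codons containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    logs = []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in usage:
            logs.append(math.log2(usage[codon] * 64))
    return sum(logs) / len(logs) if logs else 0.0

if __name__ == "__main__":
    usage = codon_usage("GCTGCTGCTAAA")  # toy reference, for illustration only
    print(preference_score("GCTGCT", usage))
    print(preference_score("CCCCCC", usage))
```

Because the table is estimated from known genes, a score of this kind inherits whatever expression-level bias is present in the reference set.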
Other methods have been proposed for identifying protein-coding regions based on the preferential occurrence of specific "words" in a sequence. For example, one approach has relied upon the preferential occurrence of specific words having two or six nucleotides within coding regions as a means of predicting the coding potential of DNA sequences (see, for example, CLAVERIE, J. M., et al. (1990), "K-tuple Frequency Analysis: From Intron-Exon Discrimination to T-cell Epitope Mapping," Methods in Enzymology, Vol. 183, pp. 237-252).
Neural networks and neural net trained systems have also been proposed for identifying protein-coding regions in sequences. For example, GRAIL uses various sensor algorithms based on past approaches and a modified version of the nucleotide word occurrence approach to predict coding regions (see, for example, UBERBACHER, E. C. and MURAL, R. J. (1991), "Locating Protein-Coding Regions in Human DNA Sequences by Multiple Sensor Neural Network Approach," Proceedings of the National Academy of Sciences, U.S.A., Vol. 88, pp. 11261-11265). In GRAIL, the output from each sensor algorithm is provided as input to a neural network that is trained to discriminate coding from non-coding sequences based on the output from the sensor algorithms. The neural network of GRAIL purportedly provides a higher prediction value than any one of the sensor techniques individually.
While such techniques have been proposed, all of the past approaches share at least two major drawbacks. First, past approaches use a sliding window that is generally insufficient in length to provide a reliable computational tool. For example, past approaches typically use a sliding window that is between 50 and 200 nucleotides in length. A sliding window of this length is too short to support statistically significant conclusions. Further, the use of longer windows in past approaches is not feasible since it reduces the resolution of the technique. Second, all of the above-noted approaches are dependent on statistical data collected from a sequence database. Since any such database, of necessity, contains only known genes, this may imply a systematic bias toward genes of relatively high expression.
In an attempt to address the first drawback, a Fourier transform method has been developed that amplifies any periodicities that are found in order to effectively lengthen the window size (see, for example, YAN, M., et al. (1998), "A New Fourier Transform Approach for Protein Coding Measure Based on the Format of the Z Curve," Bioinformatics, Vol. 14, pp. 685-690). Further, in an attempt to address the second drawback, a technique referred to as GeneScan moves away from dependence on any preexisting databases by relying on a peak signal that is observed at a frequency of 1/3 in the Fourier transform of most coding regions (see, for example, TIWARI, S., et al., (1997), "Prediction of Probable Genes by Fourier Analysis of Genomic Sequences," CABIOS, Vol. 13, pp. 263-270). In GeneScan, the DNA sequence is subjected to a Fourier transform analysis and examined for the presence of a signal that is significantly above the noise level. Any window that shows such a signal is predicted by GeneScan to be a protein-coding region.
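The period-three signal referred to above may be illustrated with the following sketch, which is not the GeneScan implementation. It assumes one common formulation: the sequence is converted into four binary indicator sequences (one per base), the discrete Fourier transform power of each is summed, and the power at frequency 1/3 is compared with the mean power over all nonzero frequencies up to one half. The function names `spectrum_power` and `period3_ratio` are hypothetical, and no decision threshold is implied.

```python
import cmath

def spectrum_power(seq, k):
    """Summed DFT power at index k over the four base indicator sequences."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * k * j / n)
                    for j, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total

def period3_ratio(seq):
    """Ratio of the power at frequency 1/3 to the mean power.

    The sequence is trimmed to a multiple of three so that frequency 1/3
    falls exactly on DFT index n/3. Returns 0.0 when the sequence is too
    short or has no appreciable power at any nonzero frequency.
    """
    seq = seq.upper()
    n = len(seq) - len(seq) % 3
    if n < 3:
        return 0.0
    seq = seq[:n]
    peak = spectrum_power(seq, n // 3)
    powers = [spectrum_power(seq, k) for k in range(1, n // 2 + 1)]
    mean = sum(powers) / len(powers)
    if mean < 1e-9:
        return 0.0
    return peak / mean

if __name__ == "__main__":
    print(period3_ratio("ATG" * 30))  # strong period-three structure
    print(period3_ratio("AC" * 45))   # period-two structure, no 1/3 peak
```

The direct summation used here costs time proportional to the square of the sequence length; a fast Fourier transform would be the usual choice for long sequences.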
While such techniques have been proposed, past approaches still suffer from significant drawbacks. For example, the window size employed in GeneScan (i.e., 99 nucleotides) is likely to be inadequate to observe the peak signal for all sequences. Further, the requirement in GeneScan that the signal be significantly above noise may not be necessary and may impose an unnecessary constraint on the analysis process. Further, and perhaps more importantly, none of the proposed methods eliminates both of the major drawbacks of past approaches and provides a process that has a high degree of reliability for analyzing various types of sequences.
In view of the foregoing, there is presently a need for a process or system for analyzing sequence data that is not dependent on any preexisting databases and that eliminates the use of a sliding window. There is also a need for computational tools that use all available information, and that provide more robust pattern detection capabilities. In addition, there is a need for automated processes or systems for identifying protein-coding regions of a nucleic acid sequence while providing a high level of reliability for various types of species and patterns. Such features would enable researchers and other members of the scientific community to analyze sequence data in a more efficient and reliable manner, and facilitate the process for developing new drugs and remedies for diseases.