The present invention is directed generally to a method and system for identification of genetic information from a polynucleotide sequence. More particularly, it is directed to a method and system for identifying genetic information from a polynucleotide sequence by calculating the rate of change from one base pair to the next in the polynucleotide sequence and then statistically processing the rate of change data to produce genetic information.
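By way of illustration only, the following minimal Python sketch shows one way the rate of change from one base to the next might be computed and then summed over a desired sequence length. The numerical values are assumptions made solely for this example (0 for no change, 1 for a purine-to-purine or pyrimidine-to-pyrimidine change, 2 for a purine-to-pyrimidine change or the reverse); the present description does not prescribe them. The names `change_value`, `base_changes` and `windowed_sums` are likewise hypothetical.

```python
from typing import List

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def change_value(prev_base: str, next_base: str) -> int:
    """Assign a numerical value to the change between two adjacent bases.

    The scoring is an assumption made for illustration only:
    0 = no change, 1 = change within the same chemical class,
    2 = change between purine and pyrimidine.
    """
    for base in (prev_base, next_base):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if prev_base == next_base:
        return 0
    same_class = ({prev_base, next_base} <= PURINES
                  or {prev_base, next_base} <= PYRIMIDINES)
    return 1 if same_class else 2


def base_changes(sequence: str) -> List[int]:
    """Return the change value for every adjacent pair of bases."""
    seq = sequence.upper()
    return [change_value(a, b) for a, b in zip(seq, seq[1:])]


def windowed_sums(changes: List[int], window: int) -> List[int]:
    """Add the change values over each sliding window of the given length."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(changes[i:i + window])
            for i in range(len(changes) - window + 1)]


if __name__ == "__main__":
    changes = base_changes("AAGC")
    print(changes)
    print(windowed_sums(changes, 2))
```

The resulting window sums are the kind of rate of change data that could then be passed to a statistical processing step; that step is not shown here.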
It will be appreciated by one having ordinary skill in the art that, owing to the speed and sophistication of technological innovation, the genomes of an increasing number of organisms are being sequenced. These genetic codes contain more information than has yet been experimentally determined, and identifying information-containing regions in sequenced but otherwise uncharacterized genomic material has been the focus of considerable work in the first half of this decade (Burset, et al., Genomics 34: 353-367). Several methods have already been devised to attack this problem at both the DNA and protein sequence levels. In addition, sequences with known biological structure can be used in homology comparisons with newly sequenced DNA. Computers are the primary tool in this endeavor, and more accurate and reliable computational methods are needed to push this work forward.
Once a polynucleotide sequence is generated, it becomes important to identify protein coding regions, or genes, within that sequence. To this end, there have been several attempts to provide systems that predict the location of protein coding regions within polynucleotide sequences. In the last few years, several exon identification methods, some with gene assembly capabilities, have been developed. These include Markov and Hidden Markov models, e.g. P. Baldi, et al., Proc. Nat. Acad. Sci., 91: 1059-1063; statistical methods, e.g. R. Guigo, et al., J. Mol. Biol. 226: 141-157 (1992); homology, e.g. W. Pearson, et al., Proc. Nat. Acad. Sci., 85: 2444-2448; Fourier analysis, e.g. S. Tiwari, Prediction of Probable Genes by Fourier Analysis of Genomic Sequences, New Delhi preprint 1-22; as well as neural nets, e.g. E. Uberbacher, et al., Proc. Nat. Acad. Sci., 88: 11261-11265; and game theory, e.g. H. Jeffrey, Nucleic Acids Res. 13: 3453-3462. A recent paper by Burset, et al., Genomics 34: 353-367, compared several such methods that predict protein coding regions.
For the most part, the systems described in the prior art attempt to predict the locations of protein coding regions using the probability that a given polynucleotide base will appear next in a sequence, based on known characteristics of protein coding regions such as start codons, stop codons and the like.
However, none of the systems in the prior art analyzes patterns derived by assigning numerical values to the changes between base pairs (i.e., the derivative of the sequence) and then adding those values over a desired sequence length to extract reading frame and directional information. Further, the review article by Burset, et al., Genomics 34: 353-367, points out that currently available gene prediction systems are relatively inaccurate and ineffective, with accuracy percentages as low as 30% in some cases. Therefore, among those having ordinary skill in the field, there is dissatisfaction with currently available products.
What is needed, then, is a method and system for identifying genetic information within a polynucleotide sequence that is accurate and reliable. Such a method and system are presently lacking in the prior art.