Advances in techniques for sequencing long stretches of genomic deoxyribonucleic acid (DNA) have allowed investigators to collect vast amounts of nucleic acid sequence data rapidly. These advances, combined with initiatives to sequence the entire human genome and the genomes of several other species, have created a need for the rapid identification of genes on long stretches of sequenced DNA. Conventional gene location techniques, such as cDNA hybridization, are effective at locating transcribed genes, but are time-consuming and costly.
An alternative for locating genes on DNA that has not otherwise been analyzed for potential coding regions involves using statistical detection methods. Such methods conventionally include using probability models to predict where in a DNA sequence a gene is located. The theoretical nucleic acid sequence probabilities can be determined through analysis of known coding regions in the organism of interest. Once theoretical nucleic acid sequence probabilities are determined, nucleic acid sequences in unannotated regions of DNA in the same or a similar organism can be statistically compared to the theoretical nucleic acid sequence probabilities. If the similarity is sufficient, the investigator is notified that a coding sequence exists. Conventional cloning techniques can then be used to isolate the putative gene and check for transcription.
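The statistical comparison described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the theoretical probabilities are in-frame codon frequencies estimated from known coding regions, that the comparison is a log-likelihood ratio against a uniform background, and that "sufficient similarity" means exceeding a threshold. The function names, the pseudocount, and the threshold value are hypothetical and are not taken from any particular program.

```python
import math
from collections import Counter

BASES = "ACGT"

def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon probabilities from known, in-frame coding sequences."""
    # Start every codon at the pseudocount so unseen codons keep a nonzero probability.
    counts = Counter({a + b + c: pseudocount
                      for a in BASES for b in BASES for c in BASES})
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:  # skips codons containing ambiguous bases such as N
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def log_odds(window, freqs):
    """Log-likelihood ratio of a window: codon model vs. uniform 1/64 background."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in freqs:
            score += math.log(freqs[codon] * 64.0)
    return score

def is_putative_coding(window, freqs, threshold=0.0):
    """Flag the window when its score exceeds a (hypothetical) threshold."""
    return log_odds(window, freqs) > threshold

# Toy training data standing in for known coding regions of the organism of interest.
freqs = codon_frequencies(["ATGGCTGCTGCTAAA"])
flag_similar = is_putative_coding("GCTGCTGCT", freqs)
flag_dissimilar = is_putative_coding("TTTTTTTTT", freqs)
```

In this sketch a window rich in codons seen during training scores above the threshold and would be reported to the investigator, while a window of untrained codons would not. A practical method would use far more training data and a threshold chosen from the score distribution.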
One type of statistical detection method searches DNA by content. In such content-based models, highly conserved regions of DNA that are common to all genes are located. If a conserved region of DNA is found, then the nucleic acid sequence associated with the conserved region can be compared with known genes. Such comparisons, which can be done with nucleic acid sequence comparison programs such as BLAST, are inefficient to run, however, and content-based searches therefore have limited desirability.
A second type of statistical detection method searches DNA by signal. This type of searching involves using probability models to predict whether DNA fragments within a larger nucleic acid sequence are coding. Early searching by signal programs, such as TestCode and Grail, relied on statistical variations within coding regions of DNA, including codon frequency, local nucleic acid sequence composition, codon preference measures, heuristics based on oligonucleotide frequency variations, and measures of nucleic acid sequence complexity.
Beyond simple gene detection, there is also a need for the determination of other coding features, such as the location of intron/exon boundaries in eukaryotic organisms and the location of insertions or deletions. The program GENSCAN (Burge, C. and Karlin, S. (1997) Prediction of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268, 78-94), for example, predicts exon location with local state probabilities based on oligonucleotide usage. GENSCAN, however, also depends on non-local nucleic acid sequence characteristics, which make the program very sensitive to sequencing errors and genes containing alternative splicing strategies.
One statistical model that avoids the problems caused by dependence on non-local nucleic acid sequence characteristics is the inhomogeneous Markov model. An inhomogeneous Markov model depends upon local probabilities, and is therefore not sensitive to sequencing errors or genes with alternative splicing strategies. The inhomogeneous Markov model is “inhomogeneous” because it determines the state probabilities for a given nucleotide in multiple reading frames rather than in a single reading frame. GeneMark, for example, is a computer program that uses the inhomogeneous Markov model to locate genes.
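The idea of determining state probabilities in multiple reading frames can be sketched as follows. This is an illustrative sketch only, not the GeneMark implementation: it assumes a first-order model with one transition table per codon position, equal prior probabilities for the three frames, and no non-coding state. The function names and the pseudocount are hypothetical.

```python
import math

BASES = "ACGT"

def train_periodic_markov(coding_seqs, pseudocount=1.0):
    """Return trans[pos][prev][cur]: one first-order transition table per
    codon position (0, 1, 2) of the current nucleotide."""
    counts = [{p: {c: pseudocount for c in BASES} for p in BASES}
              for _ in range(3)]
    for seq in coding_seqs:  # training sequences are assumed to start in frame
        for i in range(1, len(seq)):
            prev, cur = seq[i - 1], seq[i]
            if prev in BASES and cur in BASES:
                counts[i % 3][prev][cur] += 1
    trans = []
    for table in counts:
        # Normalize each row so that the probabilities of the four successors sum to 1.
        trans.append({p: {c: n / sum(row.values()) for c, n in row.items()}
                      for p, row in table.items()})
    return trans

def frame_log_likelihood(window, trans, frame):
    """Log-likelihood of a window, assuming window[0] sits at codon position `frame`."""
    ll = 0.0
    for i in range(1, len(window)):
        prev, cur = window[i - 1], window[i]
        if prev in BASES and cur in BASES:
            ll += math.log(trans[(i + frame) % 3][prev][cur])
    return ll

def frame_posteriors(window, trans):
    """Posterior probability of each of the three frames under equal priors."""
    lls = [frame_log_likelihood(window, trans, f) for f in range(3)]
    top = max(lls)  # subtract the maximum before exponentiating, for numerical stability
    weights = [math.exp(x - top) for x in lls]
    total = sum(weights)
    return [w / total for w in weights]

# Toy training sequence standing in for known coding regions.
trans = train_periodic_markov(["ATG" + "GCT" * 8])
in_frame = frame_posteriors("GCTGCTGCTGCT", trans)
shifted = frame_posteriors("CTGCTGCTGCTG", trans)
```

Because each window is scored independently from local transition probabilities, a frameshift upstream of a window changes which frame receives the highest posterior in that window but does not invalidate the scores of later windows. In the toy example, dropping the first nucleotide of the window moves the most probable frame from 0 to 1.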
The GeneMark gene prediction algorithm was developed in several steps. A series of three publications demonstrated that inhomogeneous Markov models were useful tools for gene prediction (see Borodovsky, M., Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986) Statistical Patterns in Primary Structures of Functional Regions in the E. Coli Genome: I. Oligonucleotide Frequencies Analysis, Molecular Biology, 20, 826-833; Borodovsky, M., Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986) Statistical Patterns in Primary Structures of Functional Regions in the E. Coli Genome: II. Non-homogeneous Markov Models, Molecular Biology, 20, 833-840; and Borodovsky, M., Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986) Statistical Patterns in Primary Structures of Functional Regions in the E. Coli Genome: III. Computer Recognition of Coding Regions, Molecular Biology, 20, 1145-1150, all of which are herein incorporated by reference in their entirety). The GeneMark method was based on an inhomogeneous Markov model and was described in 1993 (see Borodovsky, M. and McIninch J. (1993) GeneMark, Parallel Gene Recognition for both DNA Strands, Computers & Chemistry, 17, 123-133; and Borodovsky, M. and McIninch J. (1993) BioSystems v30, pp. 161-171, both of which are herein incorporated by reference in their entirety). The capabilities of the GeneMark program were subsequently investigated (see James D. McIninch, Prediction of Protein Coding Regions in Unannotated DNA sequences Using an Inhomogeneous Markov Model of Genetic Information Encoding (1997) (Ph.D. dissertation, Georgia Institute of Technology, on file with the Georgia Institute of Technology Library), which is herein incorporated by reference in its entirety).
Conventional programs using inhomogeneous Markov models, however, are limited to a fixed probabilistic model, and cannot be tailored by the investigator to better suit the nucleic acid sequence under study when information about that nucleic acid sequence is already available. Further, conventional implementations do not allow for the efficient and accurate detection of other nucleic acid sequence features.
What is needed in the art is a method of determining state probabilities for a nucleic acid sequence having some known characteristics, where the method is insensitive to frameshift insertions or deletions, and compatible methods for detecting other nucleic acid sequence features in known or unknown nucleic acid sequences.