Gene expression is the biological process by which a DNA sequence gives rise to a protein. The process involves two steps, namely transcription and translation. Transcription produces a messenger RNA (mRNA) sequence using the DNA sequence as a template. The subsequent process, called translation, synthesizes the protein according to the information coded in the mRNA. In eukaryotes (higher organisms), the region of the DNA coding for a protein is usually not continuous but comprises alternating stretches of introns (non-coding parts) and exons (coding parts that result in the production of a part of the protein). A double-stranded DNA sequence can be read in six reading frames (three on each strand), of which only one contains the gene sequence. Hence, genes cannot generally be read directly from a DNA sequence.
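The six reading frames mentioned above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for illustration. The sketch assumes an upper-case DNA string over the letters A, C, G and T, and it only splits the sequence into codons; it does not attempt to identify introns, exons or genes.

```python
from typing import List

# Watson-Crick complement table for an upper-case DNA alphabet (assumption).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def six_reading_frames(dna: str) -> List[List[str]]:
    """Split a DNA string into codons for each of the six reading frames.

    Frames 0-2 start at offsets 0, 1 and 2 of the given strand;
    frames 3-5 do the same on the reverse-complement strand.
    Incomplete trailing codons are dropped.
    """
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


if __name__ == "__main__":
    for number, codons in enumerate(six_reading_frames("ATGGCCTAA")):
        print(number, codons)
```

Because only one of the six lists corresponds to the actual coding frame, and because introns interrupt that frame in eukaryotes, enumerating frames in this way is only a starting point for gene finding.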
The human genome contains more than 3 billion bases of DNA sequence, of which only 2%–3% is coding sequence. As a consequence of the size of the database, manual searching for genes that code for proteins is not practical. A need thus exists for an automated method of finding genes.
Chris Burge and Samuel Karlin, in a paper entitled “Prediction of Complete Gene Structures in Human Genomic DNA”, Journal of Molecular Biology (1997) 268, pp. 78–94, discuss a probabilistic method to predict sequences which code for proteins (i.e. find gene sequences). However, this method is not optimised for finding a specific gene.
Mikhail S. Gelfand, Andrey A. Mironov, and Pavel A. Pevzner, in a paper entitled “Gene Recognition via Spliced Sequence Alignment”, Proceedings of the National Academy of Sciences (USA), August 1996, Volume 93, pp. 9061–9066, present a technique of finding high-scoring blocks. The blocks are then combined to form a sequence, the weight of which is the optimal alignment score of the sequence with the target sequence. The blocks can be combined in many ways and the complexity of the problem increases with the number of blocks. Moreover, the second stage of finding the optimal alignment score increases the time required for completion of the algorithm. The technique does not take into account the presence of synonyms and consequent effects on the alignment scores.
International Patent Publication No. WO/9966302, published on 23 Dec. 1999, by the MUSC Foundation for Research and Development, and entitled “Recognition of Protein Coding Regions in Genomic DNA Sequences”, describes the use of neural networks to identify coding regions. Disadvantages associated with neural networks include the time necessary to train a network and the fact that information is stored in a form that is not easily understood by humans, which restricts further analysis. In applications where target marker strings change rapidly, neural networks are not the best choice, given the time and effort required in training (both positive and negative samples are necessary).
Ron Shamir, in a lecture handout entitled “Algorithms for Molecular Biology”, Lecture 7, Tel Aviv University, dated “Fall Semester 2001”, discusses general concepts and algorithms relating to gene finding.
Rainer Spang and Martin Vingron, in a paper entitled “Statistics of Large-Scale Sequence Searching”, published in Bioinformatics, Volume 14, No. 3, 1998, pp. 279–284, discuss the statistical significance of scores in the context of a database search.
Samuel Karlin and Stephen F. Altschul, in a paper entitled “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes”, Proceedings of the National Academy of Sciences (USA), March 1990, Volume 87, pp. 2264–2268, present a theory that provides precise numerical formulas for assessing the statistical significance of any region in a sequence with a high aggregate score. The object is to identify whether particular sequence patterns occur simply by chance.
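By way of illustration only, the kind of calculation underlying such significance estimates can be sketched in Python. The sketch assumes the commonly stated form of the Karlin-Altschul statistics, in which the scale parameter λ is the unique positive solution of Σ p(s)·exp(λs) = 1 over the score distribution p(s), and the expected number of chance segments scoring at least S in a comparison of sequences of lengths m and n is approximately K·m·n·exp(−λS). The function names are hypothetical. The constant K is not computed here but supplied by the caller, and the value used in the example is a placeholder rather than a value taken from the cited paper.

```python
import math
from typing import Dict


def karlin_altschul_lambda(score_probs: Dict[int, float],
                           tol: float = 1e-12) -> float:
    """Solve sum_s p(s) * exp(lambda * s) = 1 for lambda > 0 by bisection.

    score_probs maps each possible per-position score to its probability.
    The theory requires a negative expected score and at least one
    positive score; both conditions are checked.
    """
    expected = sum(s * p for s, p in score_probs.items())
    if expected >= 0 or max(score_probs) <= 0:
        raise ValueError("need negative expected score and a positive score")

    def f(lam: float) -> float:
        return sum(p * math.exp(lam * s) for s, p in score_probs.items()) - 1.0

    hi = 1.0
    while f(hi) <= 0:      # grow until the function is positive
        hi *= 2
    lo = hi
    while f(lo) >= 0:      # shrink until the function is negative
        lo /= 2
    while hi - lo > tol:   # bisect on the sign change between lo and hi
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def expected_chance_hits(score: float, m: int, n: int,
                         lam: float, k: float) -> float:
    """Approximate expected number of chance segments scoring >= score."""
    return k * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    # Toy DNA scoring scheme (assumption): +1 for a match, -1 for a
    # mismatch, with all four bases equally likely.
    lam = karlin_altschul_lambda({1: 0.25, -1: 0.75})
    print(lam)
    # k=0.1 is a placeholder value chosen only for illustration.
    print(expected_chance_hits(score=20, m=1000, n=1000, lam=lam, k=0.1))
```

A small expected count indicates that a segment with that score is unlikely to have arisen by chance, which is the question the cited theory addresses.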
In another paper entitled “Applications and statistics for multiple high-scoring segments in molecular sequences”, Proceedings of the National Academy of Sciences (USA), June 1993, Volume 90, pp. 5873–5877, Samuel Karlin and Stephen F. Altschul discuss score-based measures of molecular sequence features as an aid in the study of proteins and DNA. In particular, the paper discusses potential problems encountered when using score-based techniques to identify similar sequences.
In a paper entitled “Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models”, published in the Journal of Computational Biology, Vol. 8, No. 3, 2001, pp. 249–282, Yi-Kuo Yu and Terence Hwa propose a modified “semi-probabilistic” alignment consisting of a hybrid of Smith-Waterman and probabilistic alignment. Specifically, the proposed method uses Hidden Markov Models to predict coding regions, rather than automata, profiles and scores for matching.