Most eukaryotic cells have an operational center called the nucleus, which contains structures called chromosomes. Chemically, chromosomes are formed of deoxyribonucleic acid (DNA) and associated protein molecules. Each chromosome may have tens of thousands of genes. Some genes are referred to as “encoding” (or carrying information for constructing) proteins which are essential in the structuring, functioning and regulating of cells, tissues and organs. Thus, for each organism, the components of the DNA molecules encode much of the information necessary for creating and maintaining life of the organism. See Human Genome Program, U.S. Department of Energy, “Primer on Molecular Genetics”, Washington, D.C., 1992.
The shape of a DNA molecule can be thought of as a twisted ladder. That is, the DNA molecule is formed of two parallel side strands of sugar and phosphate molecules connected by cross pieces (rungs) of nitrogen-containing chemicals called bases. Each long side strand is formed of a particular series of units called nucleotides. Each nucleotide comprises one sugar, one phosphate and one nitrogenous base. The order of the bases in this series (the side strand's series of nucleotides) is called the DNA sequence.
Each rung forms a relatively weak bond between respective bases, one on each side strand. The term “base pairs” refers to the bases at opposite ends of a rung, with one base being on one side strand of the DNA molecule and the other base being on the second side strand of the DNA molecule. Genome size or sequence length is typically stated in terms of number of base pairs.
There are four different bases present in DNA: adenine (A), thymine (T), cytosine (C) and guanine (G). Adenine will pair only with thymine (an A-T pair) and cytosine will pair only with guanine (a C-G pair). A DNA sequence is represented in writing using A's, C's, T's and G's (respective abbreviations for the bases) in corresponding series or character strings. That is, the A's, C's, T's and G's are written in the order of the nucleotides of the subject DNA molecule.
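The pairing rule above means that one strand, written as a character string, determines the other. The following is a minimal sketch of that idea; the function names `complement` and `reverse_complement` are chosen here for illustration and do not come from the source. Reversing the complement reflects the fact that the two strands of a DNA molecule run in opposite directions.

```python
# Pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the base that sits opposite each base of seq, position by position."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the opposite strand written in its own reading direction."""
    return complement(seq)[::-1]
```

For example, `complement("ACTG")` gives the string of partner bases for the sequence ACTG, and `reverse_complement` gives the same partner strand read from its own starting end.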
As previously mentioned, each DNA molecule contains many genes. A gene is a specific sequence of nucleotide bases. These sequences carry the information required for constructing proteins. A protein is a large molecule formed of one or more chains of amino acids in a specific order. The order is determined by the base sequence of nucleotides in the gene coding for the protein. Each protein has a unique function. In a DNA molecule there are protein-coding sequences (genes) called “exons”, and non-coding-function sequences called “introns” interspersed within many genes. The balance of DNA sequences in the genome are other non-coding regions or intergenic regions.
According to the foregoing method of representing genome and DNA sequences, the DNA sequence specifies the genetic instructions required to create a particular organism with its own unique traits and at the same time provides a text (character string) environment in which to study the same.
Biology and biotechnology are undergoing a technological revolution which is transforming research into an information-rich enterprise. Novel technologies such as high-throughput DNA sequencing and DNA microarrays are generating unprecedented amounts of data. A typical bacterial genome sequence comprises several million bases of DNA and contains several thousand genes. Many microbial genomes have been sequenced by the major genome centers, and the total number of such “small” genomes is expected to reach 100 shortly. Substantial progress is being made on sequencing the genomes of higher organisms as well. The genomes of eukaryotes are typically much larger; e.g., the human genome is approximately 3 billion bases long and is expected to contain approximately 100,000 genes.
Gene identification and gene discovery in newly sequenced genomic sequences is one of the most timely computational questions addressed by bioinformatics scientists. Popular gene finding systems include Glimmer, Genemark, Genscan, Genie, GENEWISE, and Grail (See Burge, C. and S. Karlin, “Prediction of complete gene structures in human genomic DNA,” J. Mol. Biol., 268:78-94, 1997; Salzberg, S. et al., “Microbial gene identification using interpolated Markov models,” Nucl. Acids Res., 26(2):544-548, 1998; Xu, Y. et al., “Grail: A multi-agent neural network system for gene identification,” Proc. of the IEEE, 84(10):1544-1552, 1996; Kulp, D. et al., “A generalized hidden Markov model for the recognition of human genes in DNA,” in ISMB-96: Proc. Fourth Intl. Conf. Intelligent Systems for Molecular Biology, pp. 134-141. Menlo Park, Calif., 1996, AAAI Press; Borodovsky, M. and J. D. McIninch, “Genemark: Parallel gene recognition for both DNA strands,” Computers and Chemistry, 17(2):123-133, 1993; and Salzberg, S. et al. eds., Computational Methods in Molecular Biology, Vol. 32 of New Comprehensive Biochemistry, Elsevier Science B.V., Amsterdam, 1998). The annotations produced by gene finding systems have been made available to the public. Such projects include the genomes of over thirty microbial organisms, as well as Malaria, Drosophila, C. elegans, mouse, Human chromosome 22 and others. For instance, Glimmer has been widely used in the analysis of many microbial genomes and has reported over 98% prediction accuracy (See Fraser, C. M. et al., “Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi,” Nature 390(6680):580-586, December 1997). Genie (D. Kulp et al. above) has been deployed in the analysis of the Drosophila genome, and Genscan (C. Burge and S. Karlin above) was used for analysis of human chromosome 22.
In addition to these central projects, a large number of proprietary genome analysis projects using gene-finding systems are in progress at the major bioinformatics centers in drug companies, bioinformatics companies, and other industrial organizations. As a result, a large number of research projects are underway with the goal of improving the performance of such systems, primarily targeting improvements in the accuracy of reported genes. In fact, one of the current controversies involves producing an accurate estimate of the number of genes in the human genome. The current number of genes actually found by the gene finding programs is substantially lower than previous estimates.
Typically, the cellular machinery reads the bases on either strand of an input DNA sequence, but in different directions depending on which strand it is reading. DNA is transcribed into RNA and then translated into proteins using a genetic code, which reads the bases in groups of 3 (called codons) and translates each codon into one amino acid. Amino acids are chained into molecules known as proteins. Levels of gene expression influence levels of protein expression, which in turn influence the particular biological function that the gene encodes.
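The codon-by-codon reading described above can be sketched in a few lines. This is an illustrative simplification rather than part of the described system: it applies the standard genetic code directly to DNA codons (skipping the intermediate RNA step, so T stands in for U), assumes the input is already in frame, and uses the hypothetical function name `translate`.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string into one-letter amino acid codes.

    Reading stops at the first stop codon or when fewer than 3 bases remain.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Here `translate("ATGGCCTAA")` reads the codons ATG, GCC and TAA in turn: methionine, alanine, then stop.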
At a very high level, genes in human DNA and many other organisms have a relatively simple structure. All eukaryotic genes, including human genes, are thought to share a similar layout. This layout adheres to the following “grammar” or pattern: start codon, exon, (intron-exon)^n, stop codon. The start codon is a specific 3-base sequence (e.g., ATG) which signals the beginning of the gene. Exons are the actual genetic material that codes for proteins, as mentioned above. Introns are the spacer segments of DNA whose function is not clearly understood. Finally, a stop codon (e.g., TAA) signals the end of the gene. The notation (intron-exon)^n simply means that there are n alternating intron-exon segments. Gene identification procedures have to take into account other important issues such as poly-A tails, promoters, pseudo-genes, alternative splicing and other features.
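The gene layout pattern above is a regular pattern over segment labels, so it can be checked with a regular expression. The sketch below is only an illustration of the pattern itself: the label strings and the function name `follows_gene_grammar` are assumptions made here, and real gene finders work on the bases rather than on pre-labeled segments.

```python
import re

# start codon, exon, then zero or more (intron, exon) pairs, then stop codon.
GENE_GRAMMAR = re.compile(r"start exon( intron exon)* stop")


def follows_gene_grammar(segments: list[str]) -> bool:
    """Return True if the ordered segment labels match the gene layout pattern."""
    return GENE_GRAMMAR.fullmatch(" ".join(segments)) is not None
```

A single-exon gene corresponds to n = 0, so `["start", "exon", "stop"]` is accepted, while a layout that begins with an intron is rejected.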
The proliferation of gene prediction systems, especially systems that focus on exon prediction, raises the question of whether a careful combination of the predictions made by these systems would produce a significantly improved gene detection system. The present invention systematically builds on the framework for a combination of experts.
The general theory for the combination of experts has drawn significant interest in the machine learning community. The theory and practice of combining experts have been studied in the literature. The choice of a particular way of combining expert predictions depends on the properties of the individual experts and the demands posed by the problem at hand.
Most techniques for combining gene predictions proposed in the past have been rather trivial or have relied on ad hoc combinations of experts. In one prior project, Murakami and Takagi (Murakami, K. and T. Takagi, “Gene recognition by combination of several gene-finding programs,” Bioinformatics, 14(8):665-675, 1998) proposed a system for gene recognition that combines several gene-finding programs. They implemented an AND and OR combination, HIGHEST-method (best individual expert), RULE-method (decisions using sets of expert rules), and an ad hoc BOUNDARY-method. The best of these methods achieved an improvement in general accuracy of 3%-5% over the individual gene finders.
Another similar expert combination scheme based on majority voting was recently used at The Institute for Genomic Research (TIGR) and reported at the 12th International Genome Sequencing Conference, September 2000. However, it only achieved moderate improvements in prediction.
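For orientation, the simplest of the prior combination schemes mentioned above (AND, OR and majority voting) can be sketched as follows. The representation is an assumption made for this example: each expert's output is a list with one 0/1 entry per base, where 1 means "predicted coding". The cited systems may represent and combine predictions differently.

```python
def combine_and(tracks: list[list[int]]) -> list[int]:
    """Mark a position as coding only if every expert marks it."""
    return [int(all(votes)) for votes in zip(*tracks)]


def combine_or(tracks: list[list[int]]) -> list[int]:
    """Mark a position as coding if at least one expert marks it."""
    return [int(any(votes)) for votes in zip(*tracks)]


def combine_majority(tracks: list[list[int]]) -> list[int]:
    """Mark a position as coding if more than half of the experts mark it."""
    return [int(2 * sum(votes) > len(votes)) for votes in zip(*tracks)]


# Three hypothetical experts voting on four positions.
example_tracks = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
```

None of these rules looks at which experts tend to agree with each other or at the label of the neighboring position, which is the gap the combiner described below is meant to address.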
In the present invention, apparatus and method for automated gene prediction operate as follows:
Using a plurality of expert systems (or similar units), gene locations in a subject genomic sequence are preliminarily predicted. Next, using a Bayesian network, the preliminarily predicted gene locations are combined to form a final combined output. The final combined output indicates predicted genes of the subject genomic sequence. The Bayesian network combiner accounts for dependencies between individual expert systems and dependencies between adjacent parts of the subject genomic sequence.
Preferably, the Bayesian network combines the preliminarily predicted gene locations according to
$$Y_t^* = \max_{Y_t} P(Y_t \mid E_1, \ldots, E_n, Y_{t-1}^*)$$
where $t$ is the location in the subject genomic sequence and $E_1, \ldots, E_n$ are the respective predicted gene locations of the individual expert systems, $n$ being the number of expert systems in the plurality.
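One way to read the combination rule above is as a left-to-right decoding: at each location, pick the label that is most probable given all expert predictions at that location and the label already chosen for the previous location. The sketch below illustrates that reading only and is not the invention's implementation. It assumes binary per-base labels (1 for coding), estimates the conditional probability table by simple counting with add-one smoothing on a labeled training sequence, and assumes the position before the sequence is non-coding. The names `fit_cpt` and `decode` are hypothetical.

```python
from collections import defaultdict


def fit_cpt(expert_tracks, labels, alpha=1.0):
    """Estimate P(Y_t | E_1..E_n at t, Y_{t-1}) by counting on labeled data.

    Keying the table on the full tuple of expert votes lets it capture
    dependencies between experts; keying on the previous label captures
    dependencies between adjacent positions.
    """
    counts = defaultdict(lambda: [alpha, alpha])  # smoothed counts for Y=0, Y=1
    prev = 0  # assumption: the position before the sequence is non-coding
    for t, y in enumerate(labels):
        votes = tuple(track[t] for track in expert_tracks)
        counts[(votes, prev)][y] += 1
        prev = y
    return {
        key: (c0 / (c0 + c1), c1 / (c0 + c1))
        for key, (c0, c1) in counts.items()
    }


def decode(expert_tracks, cpt):
    """Choose the most probable label at each position, left to right."""
    prev, combined = 0, []
    for t in range(len(expert_tracks[0])):
        votes = tuple(track[t] for track in expert_tracks)
        # Unseen combinations fall back to an uninformative 50/50 guess.
        p0, p1 = cpt.get((votes, prev), (0.5, 0.5))
        prev = 1 if p1 > p0 else 0
        combined.append(prev)
    return combined


# Toy data: expert 2 disagrees with the true labels at two positions.
train_tracks = [
    [0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
]
train_labels = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
toy_cpt = fit_cpt(train_tracks, train_labels)
```

On this toy data, `decode(train_tracks, toy_cpt)` recovers the training labels even where the two experts disagree, because the table has learned which vote patterns go with which label in each context. A full table over all vote combinations grows exponentially with the number of experts, so a practical Bayesian network would factor it rather than store it whole.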
In accordance with another aspect of the present invention, the preliminarily predicted gene locations and/or predicted genes include exon (or coding region) predictions. Alternatively, the gene locations for predicted genes are indicated by exons and introns (i.e., coding and non-coding regions) of the subject genomic sequence.