The present invention relates generally to the field of genomic analysis, and more particularly to methods for identifying and characterizing genes using microarrays.
A fundamental goal of the human genome project is the identification and characterization of all genes in the genome. Knowledge of the location and structure of these genes has diverse applications, ranging from diagnostics and drug discovery to gene therapy. Genome projects for other species have similar goals and even more diverse applications, including increased food yields from plants and animals, production of industrially important proteins or metabolites, and development of new antimicrobial agents.
In bacterial and fungal genomes where only limited use of mRNA splicing is observed, most genes can be found simply by searching for open-reading-frames in the DNA sequence. Even in these simpler cases, problems are encountered in searching for small genes, reading frames that do not start with the common AUG codon, and genes where translational frameshifting is used to control expression. In addition, finding genes via open-reading-frames is not effective when extensive splicing is seen and single genes can be spread across tens or even hundreds of kilobases of DNA. Moreover, regulatory sequences are often present in the untranslated regions at the ends of mRNAs, yet the reading-frame information is not helpful in locating such sequences.
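By way of illustration only, the open-reading-frame search described above may be sketched as follows. This is a minimal sketch under stated assumptions and not a method disclosed herein: the function name `find_orfs`, the parameter `min_codons`, and the restriction to the forward strand and to ATG start codons (the DNA equivalent of AUG) are hypothetical choices made for the example.

```python
# Illustrative sketch only: a naive forward-strand ORF scan.
# Assumptions (not from the source text): ATG is the only start codon,
# only the three forward reading frames are scanned, and min_codons is
# an arbitrary length cutoff chosen for the example.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) half-open index pairs of ATG-initiated ORFs."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including the ATG itself.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAACCCGGGTAGCC"))
    # The same sequence with a GTG start codon is not reported.
    print(find_orfs("CCGTGAAACCCGGGTAGCC"))
```

A scan of this kind exhibits the limitations noted above: raising `min_codons` to suppress spurious hits also discards small genes, a reading frame beginning with a codon other than ATG is missed, and an exon separated from its neighbors by introns does not appear as a single uninterrupted frame.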
Given a complex eukaryotic genome sequence, there are several known routes to gene discovery. Once the complete genome of an organism has been sequenced, the next step is to identify which regions of the genome are transcribed into mRNAs that code for proteins.
Until now, EST analysis has been the most powerful approach for identifying the transcribed regions of a sequenced genome. The process involves generating a large collection of cDNA clones from one or more tissues or growth conditions (see, e.g., Adams et al., 1991, Science 252:1651-6). The cloned sequences are tested with various sequence comparison algorithms to identify those that are parts of the same gene or represent different genes. Overlapping sequences representing a single gene are then merged to determine the sequence of the full length mRNA. The location of exons, or gene structure, can then be determined by simply mapping the mRNA sequence onto the genomic DNA. However, a major drawback of this approach is that some RNA species are produced at low levels or only in specific cells of an organism. Even with normalization methods to enrich for rarer RNAs, very large numbers of sequences from large numbers of tissues must be generated. Moreover, existing large collections of ESTs are often not uniformly distributed along the length of the gene because of the 3′ bias caused by the oligo(dT)-primed reverse transcriptase (RT). For example, FIG. 2 shows a typical distribution of ESTs along a given human gene. Variants such as SAGE can yield much larger numbers of sequences, but this method only sequences a short region of each gene and relies on appropriately positioned restriction endonuclease cleavage sites (see, e.g., Velculescu et al., 1995, Science 270:484-7). Further, multiple RNA species can be derived from the same gene through differential splicing or other processing steps (see, e.g., Herbert and Rich, 1999, Nat. Genet. 21:265-9), making it difficult to obtain complete collections of mRNAs as full length cDNAs (see, e.g., Strausberg et al., 1999, Science 286:455-7).
Another experimental approach for identifying exons in genomic DNA involves hybridizing labeled mRNA to a microarray containing random genomic fragments. The genomic inserts that hybridize to the labeled mRNA are then sequenced and mapped back onto the chromosomal reference sequence (see, e.g., Stephan et al., 2000, Mol. Genet. Metab. 70:10-18). While this approach has been successful in some cases, any clones will contain both introns and exons, making the procedure undesirable due to the very low resolution of the exon structure. Further, this method requires extensive DNA sequencing, and can only be used for relatively small genomic regions.
Hybrid selection is another experimental method that can be used to identify transcribed regions of genomic DNA (see, e.g., Parimoo et al., 1991, Proc. Natl. Acad. Sci. 88:9623-7). Recent developments have expanded the number of genes that can be tested (see, e.g., Gracia et al., 1999, Genome Res. 7:100-7). However, the clones may only provide data on a small part of a gene.
Gene discovery can also be accomplished by comparing genomic sequences with known sequences from other species, making use of the evolutionary conservation of sequences with important functions (see, e.g., Rogozin et al., 1999, Gene 226:129-37; Hardison et al., 1997, Genome Res. 7:959-66; Nature Genetics Vol. 25 Num. 2, 235-8 (2000)). Such methods may prove successful for genes that are highly conserved, but will fail completely on the genes that are unique to a particular species.
In a similar manner, computer modeling may be used to develop models of gene structure and to scan new sequence data for suspected genes (see, e.g., Uberbacher and Mural, 1991, Proc. Natl. Acad. Sci. 88:11261-5; Snyder and Stormo, 1993, Nuc. Acids Res. 21:607-13). However, computer models will not succeed in identifying classes of genes that do not fit the assumptions of the models. Further, while such computer programs may frequently locate portions of genes, they cannot reliably or accurately predict the overall structure of a gene (see, e.g., Burset and Guigo, 1996, Genomics 34:353-67). Known errors include artifactually joining one gene with a neighboring gene of different function, failing to identify exons, predicting exons that do not exist, predicting the incorrect size of an exon, and splitting a single known gene into separate predicted genes (see, e.g., Reese et al., 2000, Genome Res. 10:483-501). Moreover, computer models are even more unreliable for genes that do not encode proteins, especially for long transcripts.
Thus, there exists a need for a high-throughput method for precisely identifying the location of genes in genomic sequences, especially genes that are transcribed at low levels. There also exists a need for a method of identifying and characterizing all of the elements of genes, especially genes that are spread over large regions of genomic DNA. Further, there exists a need for a method of characterizing the structure of genes without extensive DNA sequencing of ESTs. Even further, there exists a need for a method of correctly predicting the exact protein sequence based on the accurate structure of the gene. The methods and compositions of the present invention fulfill these needs and solve other problems in the prior art.
Discussion or citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention.
The present invention provides methods for identifying and characterizing the regions in genomic sequences that are transcribed into RNA. In particular, the invention provides improved, robust methods for detecting genes through the use of microarrays to analyze the expression state of the genome. Genes which are expressed can be mapped to their respective positions in the genome, and the structure of the gene determined.
The invention is premised, in part, upon the discovery that microarrays consisting of tiled genomic sequences can be employed to precisely identify the locations of expressed genes within the genome. In particular, an RNA or cDNA molecule will hybridize to probes on a microarray corresponding to the locations of the exons of the corresponding gene. Thus, the structure of the gene can be rapidly determined, even if the exons are widely separated in the genome or the gene is expressed at low levels. The invention is also partially premised upon the discovery that high resolution microarrays can be used to accurately determine intron-exon boundaries. Thus, by using the methods of the present invention, an accurate gene structure can be readily ascertained. Further, the methods of the present invention allow for the calculation of the probability that a particular nucleotide in a region of interest is expressed.
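By way of illustration only, the mapping of hybridization signals from tiled probes onto genomic coordinates may be sketched as follows. This is a minimal sketch under stated assumptions and is not asserted to be the method of the invention: the function name `call_expressed_regions`, the tuple layout of the probe data, the fixed intensity threshold, and the example values are all hypothetical. In particular, a simple cutoff is substituted here for the probability calculation referred to above.

```python
# Illustrative sketch only: merging above-threshold tiled probes into
# candidate expressed regions. Assumptions (not from the source text):
# each probe is (genomic_start, length, intensity), coordinates are
# 0-based and half-open, and a single fixed threshold separates
# hybridizing probes from background.


def call_expressed_regions(probes, threshold):
    """Return merged (start, end) intervals covered by probes at or above threshold."""
    hits = sorted(
        (start, start + length)
        for start, length, intensity in probes
        if intensity >= threshold
    )
    regions = []
    for start, end in hits:
        if regions and start <= regions[-1][1]:
            # Overlapping or abutting probe: extend the current region.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Hypothetical intensities for six contiguous 25-nucleotide probes.
    tiled = [
        (0, 25, 50), (25, 25, 900), (50, 25, 1100),
        (75, 25, 40), (100, 25, 30), (125, 25, 950),
    ]
    print(call_expressed_regions(tiled, threshold=500))
```

In this sketch each reported interval is a candidate exon, and the gaps between intervals correspond to introns or untranscribed sequence. The boundaries of an interval can be no finer than the spacing of the probes, which is consistent with the use of high resolution microarrays to determine intron-exon boundaries.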
The present invention offers numerous advantages over the methods outlined above. First, the microarrays of the present invention enable an efficient and comprehensive genome scan that provides much more detailed data than prior art methods. Second, the methods of the present invention allow for the efficient identification of small genes, genes that do not encode proteins, genes that are transcribed at low levels, and untranslated regions of mRNAs encoding proteins. Third, the use of microarrays in the present invention allows the structure of the gene to be determined at the same time as the gene is detected, even if the gene is spread over large regions of the genome. Additional advantages and features of the invention will become apparent to one of skill in the art from the description and claims which follow.