The present invention relates generally to the field of gene and transcript expression and specifically to a method for the serial analysis of a large number of transcripts by identification of a defined region of a transcript which corresponds to a region of an expressed gene.
Determination of the genomic sequence of higher organisms, including humans, is now a real and attainable goal. However, this analysis only represents one level of genetic complexity. The ordered and timely expression of genes represents another level of complexity equally important to the definition and biology of the organism.
The role of sequencing complementary DNA (cDNA), reverse transcribed from mRNA, as part of the human genome project has been debated as proponents of genomic sequencing have argued the difficulty of finding every mRNA expressed in all tissues, cell types, and developmental stages and have pointed out that much valuable information from intronic and intergenic regions, including control and regulatory sequences, will be missed by cDNA sequencing (Report of the Committee on Mapping and Sequencing the Human Genome, National Academy Press, Washington, D.C., 1988). Sequencing of transcribed regions of the genome using cDNA libraries has heretofore been considered unsatisfactory. Libraries of cDNA are believed to be dominated by repetitive elements, mitochondrial genes, ribosomal RNA genes, and other nuclear genes comprising common or housekeeping sequences. It is believed that cDNA libraries do not provide all sequences corresponding to structural and regulatory polypeptides or peptides (Putney, et al., Nature, 302:718, 1983).
Another drawback of standard cDNA cloning is that some mRNAs are abundant while others are rare. The cellular quantities of mRNA from various genes can vary by several orders of magnitude.
Techniques based on cDNA subtraction or differential display can be quite useful for comparing gene expression differences between two cell types (Hedrick, et al., Nature, 308:149, 1984; Liang and Pardee, Science, 257:967, 1992), but provide only a partial analysis, with no direct information regarding abundance of messenger RNA. The expressed sequence tag (EST) approach has been shown to be a valuable tool for gene discovery (Adams, et al., Science, 252:1656, 1991; Adams, et al., Nature, 355:632, 1992; Okubo, et al., Nature Genetics, 2:173, 1992), but like Northern blotting, RNase protection, and reverse transcriptase-polymerase chain reaction (RT-PCR) analysis (Alwine, et al., Proc. Natl. Acad. Sci. U.S.A., 74:5350, 1977; Zinn, et al., Cell, 34:865, 1983; Veres, et al., Science, 237:415, 1987), it evaluates only a limited number of genes at a time. In addition, the EST approach preferably employs nucleotide sequences of 150 base pairs or longer for similarity searches and mapping.
Sequence tagged sites (STSs) (Olson, et al., Science, 245:1434, 1989) have also been utilized to identify genomic markers for the physical mapping of the genome. These short sequences from physically mapped clones represent uniquely identified map positions in the genome. In contrast, the identification of expressed genes relies on expressed sequence tags which are markers for those genes actually transcribed and expressed in vivo.
The restriction enzyme MmeI is a class II restriction endonuclease which is a monomeric protein of 101 kDa. It is derived from Methylophilus methylotrophus. MmeI has a pI of 7.85 and is active in the pH range of 6.5 to 10, with the optimum at 7 to 8. MmeI cleaves DNA 20/18 nucleotides 3′ of the asymmetric recognition sequence (5′-TCCRAC-3′). See Tucholski, et al., Gene, 157:87-92, 1995.
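By way of illustration only, the cleavage geometry described above can be sketched in Python. The function name `mmei_cuts` and the coordinate convention are assumptions introduced for this example and are not part of the cited reference. The sketch scans only the given strand, reads R as A or G, and takes "20/18" to mean that the strand bearing the recognition sequence is cut 20 nucleotides past the site and the complementary strand 18 nucleotides past it.

```python
import re

# MmeI recognition sequence 5'-TCCRAC-3', where R is a purine (A or G).
MMEI_SITE = re.compile(r"TCC[AG]AC")


def mmei_cuts(seq):
    """Return (site_start, top_cut, bottom_cut) for each MmeI site in seq.

    Coordinates are 0-based offsets into seq. top_cut is 20 nt past the end
    of the recognition sequence on the given strand; bottom_cut is 18 nt past
    it on the complementary strand. Sites whose cut point would fall beyond
    the end of seq are skipped. Sites on the opposite strand are NOT scanned
    (a simplification made for this sketch).
    """
    seq = seq.upper()
    cuts = []
    for m in MMEI_SITE.finditer(seq):
        top_cut = m.end() + 20
        bottom_cut = m.end() + 18
        if top_cut <= len(seq):
            cuts.append((m.start(), top_cut, bottom_cut))
    return cuts


if __name__ == "__main__":
    example = "GG" + "TCCGAC" + "A" * 25
    for start, top, bottom in mmei_cuts(example):
        print(start, top, bottom, example[start + 6:top])
```

Under these assumptions, the 20 nucleotides between the end of the recognition sequence and the top-strand cut are the bases released adjacent to the site, which is what makes a comparatively long tag possible.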
There is a need for an improved method which allows rapid, detailed analysis of thousands of expressed genes and/or expressed transcripts for the investigation of a variety of biological applications, particularly for establishing the overall pattern of gene expression in different cell types or in the same cell type under different physiologic or pathologic conditions. Identification of different patterns of expression has several utilities, including the identification of appropriate therapeutic targets, candidate genes for gene therapy (e.g., gene replacement), tissue typing, forensic identification, mapping locations of disease-associated genes, and the identification of diagnostic and prognostic indicator genes. There is a need in the art for more efficient methods of accomplishing these tasks. There is a need in the art for methods of determining correspondence between isolated nucleic acids and genes and/or expressed transcripts identified in genomic databases. There is a need in the art for methods of identifying rare expressed genes not otherwise predicted as well as for identifying non-translated RNA factors. There is a need in the art for additional tools to assist in assigning function to genes identified in the human genome.
The present invention provides a method for the rapid analysis of numerous transcripts in order to identify the overall pattern of transcript expression (transcriptome) in different cell types or in the same cell type under different physiologic, developmental or disease conditions. The method is based on the identification of a “long” nucleotide sequence tag at a defined position in a messenger RNA. The tag is used to identify the corresponding transcript and/or gene from which it was transcribed. By utilizing dimerized tags, each termed a “ditag”, the method of the invention allows elimination of certain types of bias which might occur during cloning and/or amplification and possibly during data evaluation. Concatemerization of these nucleotide sequence tags allows the efficient analysis of transcripts in a serial manner by sequencing multiple tags on a single DNA molecule, for example, a DNA molecule inserted in a vector or in a single clone.
The method described herein is the serial analysis of transcript expression, an approach which allows the analysis of a large number of transcripts. To demonstrate this strategy, cDNA sequence tags were generated from mRNA, randomly paired to form ditags, concatenated, and cloned. Manual sequencing of 1,000 tags revealed a characteristic gene expression pattern. Identification of such patterns is important, for example, diagnostically and therapeutically. Moreover, the use of serial analysis as a transcript discovery tool was documented by the identification and isolation of new pancreatic transcripts corresponding to novel tags. This method provides a broadly applicable means for the quantitative cataloging and comparison of expressed transcripts in a variety of normal, developmental, and disease states. “Long SAGE” or “Long SATE” permits the ready and accurate identification of isolated tags with genomic sequence data.
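The data evaluation step for sequenced concatemers can be illustrated by the following minimal Python sketch. It is not the procedure of the invention. It assumes that ditags within a concatemer are separated by a four-base anchoring site (`CATG` is used here purely as an example), that each ditag is one tag followed by the reverse complement of a second tag, and that a ditag seen more than once reflects amplification bias and is therefore counted a single time. The names `count_tags`, `anchor`, and `tag_len` are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_tags(concatemer, anchor="CATG", tag_len=17):
    """Count tags in one sequenced concatemer (illustrative only).

    The concatemer is split at the assumed anchoring site. Fragments shorter
    than two tags are discarded. A ditag and its reverse complement are
    treated as the same molecule, and each distinct ditag contributes its two
    tags exactly once, so repeated ditags do not inflate the counts.
    """
    fragments = concatemer.upper().split(anchor)
    ditags = [f for f in fragments if len(f) >= 2 * tag_len]
    # Canonical orientation: the lexicographically smaller of the two strands.
    unique_ditags = {min(d, revcomp(d)) for d in ditags}
    counts = Counter()
    for d in unique_ditags:
        counts[d[:tag_len]] += 1             # tag read from the left end
        counts[revcomp(d[-tag_len:])] += 1   # tag read from the right end
    return counts


if __name__ == "__main__":
    # Toy example with 4-base tags; the first ditag appears twice.
    d1 = "AAAC" + revcomp("GGTT")
    d2 = "TTTG" + revcomp("TCGT")
    concatemer = "CATG" + d1 + "CATG" + d2 + "CATG" + d1 + "CATG"
    for tag, n in sorted(count_tags(concatemer, tag_len=4).items()):
        print(tag, n)
```

Tallies of this kind, summed over many clones, give the relative abundance of each tag. The sketch does not handle sequencing errors, ditags of unexpected length, or an anchoring site formed by chance at the junction of two tags.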