The field of genomics has taken rapid strides in recent years. It started with efforts to determine the entire nucleotide sequence of simpler organisms such as viruses and bacteria. As a result, genomic sequences of Haemophilus influenzae (Fleischmann et al., Science 269: 496-512, 1995) and a number of other bacterial strains (Escherichia coli, Mycobacterium tuberculosis, Helicobacter pylori, Campylobacter jejuni, Mycobacterium leprae) are now available (reviewed in Nierman et al., Curr. Opin. Struct. Biol. 10: 343-348, 2000). This was followed by the determination of the complete nucleotide sequences of a number of eukaryotic organisms including budding yeast (Saccharomyces cerevisiae) (Goffeau et al., Science 274: 563-567, 1996), nematode (Caenorhabditis elegans) (C. elegans sequencing consortium, Science 282: 2012-2018, 1998) and fruit fly (Drosophila melanogaster) (Adams et al., Science 287: 2185-2195, 2000). The sequence of the human genome was published in February of 2001 (International Human Genome Sequencing Consortium, Nature, 409:860-921, 2001; Venter et al., Science, 291:1304-1351, 2001). Additionally, some of the ongoing efforts are focused on genome sequencing of agriculturally important plants such as rice (Science 288: 239-240, 2000; Sasaki and Burr, Curr. Opin. Plant Biol. 3: 138-141, 2000) and experimentally critical animal models such as mice (News Focus, Science 288: 248-257, 2000).
The availability of complete genomic sequences of various organisms promises to significantly advance our understanding of various fundamental aspects of biology. It also promises to provide unparalleled applied benefits such as understanding genetic basis of certain diseases, providing new targets for therapeutic intervention, developing a new generation of diagnostic tests, etc. New and improved tools, however, will be needed to harvest and fully realize the potential of genomics research.
Even though the DNA complement or gene complement is identical in various cells in the body of multi-cellular organisms, there are qualitative and quantitative differences in gene expression in various cells. The human genome is estimated to contain about 30,000-40,000 genes; however, only a fraction of these genes are expressed in a given cell (International Human Genome Sequencing Consortium, Nature, 409:860-921, 2001; Venter et al., Science, 291:1304-1351, 2001). Moreover, there are quantitative differences among the expressed genes in various cell types. Although all cells express certain housekeeping genes, each distinct cell type additionally expresses a unique set of genes. Phenotypic differences between cell types are largely determined by the complement of proteins that are uniquely expressed. It is the expression of this unique set of genes and their encoded proteins that constitutes the functional identity of a cell type and distinguishes it from other cell types. Moreover, the complement of genes that are expressed, and their level of expression, vary considerably depending on the developmental stage of a given cell type. Certain genes are specifically activated or repressed during differentiation of a cell. The level of expression also changes during development and differentiation. Qualitative and quantitative changes in gene expression also take place during cell division, e.g. in various phases of the cell cycle. Signal transduction by biologically active molecules such as hormones, growth factors and cytokines often involves modulation of gene expression. Global change in gene expression also plays a determinative role in the process of aging.
In addition to the endogenous or internal factors mentioned above, certain external factors or stimuli, such as environmental factors, also bring about changes in gene expression profiles. Infectious organisms such as bacteria, viruses, fungi and parasites interact with the cells and influence the qualitative and quantitative aspects of gene expression. Thus, the precise complement of genes expressed by a given cell type is influenced by a number of endogenous and exogenous factors. The outcome of these changes is critical for normal cell survival, growth, development and response to the environment. Therefore, it is important to identify, characterize and measure changes in gene expression. The knowledge gained from such analysis will not only further our understanding of basic biology, but it will also allow us to exploit it for various purposes such as diagnosis of infectious and non-infectious diseases, screening to identify and develop new drugs, etc.
Besides the conventional one-by-one gene expression analysis methods such as Northern analysis, RNase protection assays, and real-time PCR (RT-PCR), there are several methods currently available to examine gene expression on a genome-wide scale. These approaches are variously referred to as RNA profiling, differential display, etc. These methods can be broadly divided into three categories: (1) hybridization-based methods such as subtractive hybridization (Koyama et al., Proc. Natl. Acad. Sci. USA 84: 1609-1613, 1987; Zipfel et al., Mol. Cell. Biol. 9: 1041-1048, 1989), microarray, etc., (2) cDNA tags: EST, serial analysis of gene expression (SAGE) (see, e.g. U.S. Pat. Nos. 5,695,937 and 5,866,330), and (3) fragment-size-based methods, often referred to as gel-based methods, where a differential display is generated upon electrophoretic separation of DNA fragments on a gel such as a polyacrylamide gel (described in U.S. Pat. Nos. 5,871,697, 5,459,037, 5,712,126 and PCT publication No. WO 98/51789).
Expressed Sequence Tags (ESTs) are created by partially sequencing (usually in a single pass) randomly chosen gene transcripts that have been converted into cDNA. The concept of ESTs as an alternative to genome sequencing for the rapid determination of the expressed complement of a genome was first introduced by Adams et al. (Science 252: 1651-1656, 1991). The EST approach, combined with the power of high-throughput sequencing, has brought about a revolution (Zweiger and Scott, Curr. Opin. Biotechnol. 8: 684-687, 1997). ESTs often contain enough information to identify the transcript corresponding to the cDNA clone by searching nucleotide and protein databases. Thus, the EST approach provides a valuable tool for the identification and characterization of novel genes. ESTs are also long enough to be reliably used as substrates for microarrays, and are less expensive than, and superior in performance to, oligonucleotides for this purpose. EST-based microarray technology is finding increasing use in global, genome-wide transcript profiling. The number of EST sequences being deposited in databanks is growing exponentially. A sub-database containing EST sequences (dbEST) is available in GenBank. A large number of EST sequences from various organisms are also available in public databanks.
The National Cancer Institute (NCI) launched the first large-scale EST project in 1997, the Cancer Genome Anatomy Program (CGAP), focusing on a single aspect, the comprehensive molecular characterization of human normal, pre-cancerous and malignant cells (Strausberg et al., Trends Genet. 16: 103-106, 2000). Similarly, tissue-specific EST databases have also been created in order to understand the unique physiological functions of a tissue/organ as well as to gain insight into changes in global gene expression associated with various pathological conditions. For instance, a collection of EST sequences derived from human heart tissue of various anatomical, developmental and pathological stages has been established with a view to understanding cardiovascular diseases (Dempsey et al., Mol. Med. Today, 6: 231-237, 2000). The EST profile of an organ typically reflects the unique function of the organ. For example, the pancreas and liver have a large proportion of ESTs corresponding to secreted proteins, whereas the brain and heart have a very small proportion of such ESTs. Similarly, ESTs corresponding to contractile proteins are present in high proportion in the heart, but are absent in the brain, liver or pancreas. A number of EST projects directed at agriculturally important plants such as rice, wheat, maize, soybean, and sorghum are currently underway (Richmond and Somerville, Curr. Opin. Plant Biol., 3: 108-116, 2000).
One of the most fundamental applications of the EST approach has been in the rapid identification of novel gene homologs and orthologs. The successful implementation of this approach is illustrated by the discovery of a number of novel members of the chemokine family (Karp, Trends Biotechnol. 14: 273-279, 1996) and the TNFα and TNFα receptor families (Pan et al., Science 276: 111-113, 1997; Sheridan et al., Science 277:818-821, 1997; Vincenz and Dixit, J. Biol. Chem. 272: 6578-6583, 1997). Major advances in bioinformatic software used for mining the rapidly growing relational databases, including more powerful motif identification and search tools, have accelerated the pace of progress in the use of the EST approach for the identification of new genes (Zweiger and Scott, supra). Availability of a human transcript chromosomal map showing map positions of an increasing number of ESTs has significantly helped accelerate progress in the field of positional cloning of genes, particularly in the identification and characterization of gene mutations or alterations which cause or predispose an individual to a particular disease or trait. Once a genetic locus for a disease is identified by genetic linkage analysis, the presence of ESTs in the region surrounding the locus provides a list of genes likely responsible for the disease. This strategy has been used successfully in the identification of presenilin 2 implicated in Alzheimer's disease (Levy-Lahad et al., Science 269: 970-973 and 973-977, 1995), and mutS homolog 2 implicated in hereditary colon cancer (Fishel et al., Cell 75: 1027-1038, 1993; Papadopoulos et al., Science 263: 1625-1629, 1994).
Another major use of ESTs is in the large-scale monitoring of gene expression of a given cell type under various physiological, pathological or environmental conditions. Not only does this yield valuable information about the fundamental biological processes, but it also promises to provide important targets for therapeutic intervention in various diseases. ESTs in the form of PCR products provide a better and cost effective alternative to the use of oligonucleotides for use in microarray-based gene expression analysis. In this method, a large number of ESTs are gridded on a solid support, such as glass or a polymer, which can be hybridized to fluorescently tagged samples of mRNA or cDNA derived from a given cellular source (Graves, Trends Biotechnol. 17: 127-134, 1999). The hybridization is scored qualitatively and quantitatively by a scanning fluorescent microscope. Since the sequence and location of the immobilized probes is known, the technique provides a rapid and comprehensive analysis of gene expression, and essentially creates a transcript profile of the target cell. In a variation of the theme, a collection of ESTs with unknown sequences can also be used to prepare a microarray, and only the EST clones of relevance (e.g. showing significant change in expression levels) are then sequenced.
The microarray-based gene analysis approach enables working with hundreds of thousands of genes simultaneously rather than one or a few genes at a time. Microarray technology has come at an appropriate time, when entire genomes of humans and other organisms are being worked out. Massive sequence information generated as a result of genome sequencing, particularly human genome sequencing, has created a demand for technologies that provide high throughput and speed. Microarrays fill this unique niche. Most complex physiological processes precede or follow changes in the expression of a large number of genes. Techniques that were available before the advent of microarrays are not suitable for monitoring such large-scale changes in gene expression. DNA microarrays offer the opportunity to perform fast, comprehensive, moderately quantitative analyses on hundreds of thousands of genes simultaneously. A DNA microarray is composed of an ordered set of DNA molecules of known sequences usually arranged in a rectangular configuration in a small space such as 1 cm² in a standard microscope slide format. For example, an array of 200×200 would contain 40,000 spots with each spot corresponding to a probe of known sequence. Such a microarray can potentially be used to simultaneously monitor the expression of 40,000 genes in a given cell type under various conditions. The probes usually take the form of cDNA, ESTs or oligonucleotides. Most preferred are ESTs and oligonucleotides in the range of 30-200 bases long as they provide an ideal substrate for hybridization. There are two approaches to building these microarrays, also known as chips, one involving covalent attachment of pre-synthesized probes, the other involving building or synthesizing probes directly on the chip. The sample or test material usually consists of RNA that has been amplified by PCR. PCR serves the dual purposes of amplifying the starting material as well as allowing introduction of fluorescent tags.
For a detailed discussion of microarray technology, see e.g., Graves, Trends Biotechnol. 17: 127-134, 1999.
High-density microarrays are built by depositing extremely minute quantities of DNA solution at precise locations on an array using high-precision machines, a number of which are available commercially. An alternative approach, pioneered by Packard Instruments, enables deposition of DNA in much the same way that an ink-jet printer deposits spots on paper. High-density DNA microarrays are commercially available from a number of sources such as Affymetrix, Incyte, Mergen, Genemed Molecular Biochemicals, Sequenom, Genomic Solutions, Clontech, Research Genetics, Operon and Stratagene. Currently, labeling for DNA microarray analysis involves fluorescence, which allows multiple independent signals to be read at the same time. This allows simultaneous hybridization of the same chip with two samples labeled with different fluorescent dyes. The calculation of the ratio of fluorescence at each spot allows determination of the relative change in the expression of each gene under two different conditions. For example, comparison between a normal tissue and a corresponding tumor tissue using this approach helps in identifying genes whose expression is significantly altered. Thus, the method offers a particularly powerful tool when the gene expression profile of the same cell is to be compared under two or more conditions. High-resolution scanners with the capability to monitor fluorescence at various wavelengths are commercially available.
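The per-spot ratio calculation described above can be illustrated with a minimal sketch. The spot names, signal values, the background floor and the two-fold cutoff are all hypothetical choices made for illustration; real analyses typically also apply background subtraction and dye normalization, which are omitted here.

```python
import math

def log2_ratios(test, reference, floor=1.0):
    """Return log2(test / reference) for every spot present in `test`.

    `floor` is an assumed lower bound applied to both signals so that a
    spot with no detectable fluorescence does not cause a division by zero.
    """
    ratios = {}
    for spot, test_signal in test.items():
        ref_signal = reference[spot]
        ratios[spot] = math.log2(max(test_signal, floor) / max(ref_signal, floor))
    return ratios

# Hypothetical fluorescence intensities for two dyes read from the same chip.
tumor = {"geneA": 800.0, "geneB": 100.0, "geneC": 250.0}
normal = {"geneA": 200.0, "geneB": 400.0, "geneC": 250.0}

ratios = log2_ratios(tumor, normal)
# Flag spots whose expression differs at least two-fold in either direction.
changed = sorted(spot for spot, r in ratios.items() if abs(r) >= 1.0)
```

Working on a log2 scale makes a four-fold increase (+2) and a four-fold decrease (-2) symmetric, which is why the cutoff is expressed as an absolute value.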
Although the EST approach is a powerful tool in the study of expression and function of genes, a major drawback of this method is the unavoidable redundancy of clones representing highly expressed transcripts and conversely under-representation of transcripts with low expression levels. The problem persists even with normalized and subtractive libraries. In order to sample the low-level transcripts, the sequencing effort must go very deep into the library which involves substantial labor, time, and associated costs. SAGE greatly minimizes the sequencing runs, but generates cDNA tags (13-15 bases) that are too short to be used as gene specific primers or probes. Moreover, SAGE suffers from higher cost and labor intensiveness.
Accordingly, there is a need for further improvements in the techniques used to identify, sequence and characterize novel genes.
The present inventors have discovered a novel method for the production of a nucleic acid library, in particular a non-redundant nucleic acid library, and more particularly a non-redundant, expressed sequence tag (EST) library. Among the several advantages of the present method over the prior art are decreased redundancy resulting in increased speed and decreased cost in library construction, the representation of transcripts independent of their expression level, and the ability to isolate and sequence transcripts without cloning.
Accordingly, one of the several aspects of the invention provides a method for constructing a nucleic acid library comprising obtaining a population of double-stranded cDNA containing a detectable label. In one embodiment, the detectable label is on the 3′ end and the cDNA is produced by isolating polyA mRNA, hybridizing the polyA mRNA to an oligo dT primer with a 5′ label, synthesizing the first cDNA strand by primer extension and synthesizing the second cDNA strand by nick translation. The end-labeled cDNA is divided into portions, for example a first portion and a second portion. The first portion is digested with at least one sequence-specific restriction endonuclease and the second portion is digested with at least one restriction endonuclease having a degenerate recognition sequence. Any number of sequence-specific endonucleases can be used but commonly at least 4 to 6 are used. Endonucleases having 4-base, 5-base, or 6-base recognition sequences can be used as well as combinations of endonucleases having recognition sequences of different lengths. In one embodiment, the at least one endonuclease having a degenerate recognition sequence produces fragments having an unpaired (single-stranded) overhang containing N^m sequences where N is the extent of degeneracy and is an integer between 2 and 4, and m is the number of bases in the unpaired overhang and is generally an integer between 2 and 6. In one embodiment, N^m equals at least 64.
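As a worked illustration of the overhang count, the number of distinct overhang sequences is N raised to the power m when every overhang position has the same degeneracy N. The sketch below enumerates the overhangs for a pattern written in standard IUPAC nucleotide codes; the function name and the example patterns are assumptions for illustration and do not correspond to any particular enzyme.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_overhang(pattern):
    """List every concrete sequence matched by a degenerate overhang pattern."""
    choices = (IUPAC[code] for code in pattern)
    return ["".join(bases) for bases in product(*choices)]

fully_degenerate = expand_overhang("NNN")  # degeneracy 4, three bases: 4**3
twofold = expand_overhang("WW")            # degeneracy 2, two bases: 2**2
print(len(fully_degenerate), len(twofold))  # prints "64 4"
```

A three-base overhang in which every position may be any of the four bases therefore gives the 64 sequences mentioned in the embodiment above.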
After the first digestion, the labeled digestion fragments in each portion are isolated by the detectable label and the isolated fragments subjected to a second digestion. In the second digestion, portion one is digested with the at least one restriction endonuclease having a degenerate recognition sequence previously used to digest portion two, and portion two is digested with the at least one sequence specific endonuclease previously used to digest portion one. Following this second digestion, fragments containing the detectable label are separated and discarded while unlabeled fragments are retained.
The retained unlabeled fragments are then hybridized and ligated to adapters specific for the endonucleases used. In one embodiment, the portions are combined prior to adaptor hybridization and ligation, while in another embodiment, the portions are kept separate. Generally, adapters for all possible fragments produced will be used. Thus, if an endonuclease potentially produces 64 different fragments (e.g. different unpaired overhangs), then 64 different adapters are used for that endonuclease. Ordinarily, the adapters are designed so that they do not have high sequence homology to any known sequences in the population. In one embodiment, adapters are designed so that they have common regions that can be used for primer binding sites. In one embodiment, no more than 10 different primer binding sequences are used while in another embodiment, two different primer binding sequences are used.
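The adapter set described above can be sketched as follows. This is a deliberately simplified single-strand model, not the design used in the method: each adapter is represented as a shared primer-binding region followed by the reverse complement of one overhang. The primer sequence, the three-base overhang length and the function names are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_adapters(overhang_length, primer_site):
    """Map every possible overhang to one adapter sequence.

    Each adapter shares `primer_site` (so a single primer can amplify all
    ligated fragments) and ends in the bases that pair with its overhang.
    """
    adapters = {}
    for bases in product("ACGT", repeat=overhang_length):
        overhang = "".join(bases)
        adapters[overhang] = primer_site + reverse_complement(overhang)
    return adapters

# Hypothetical common primer-binding region, chosen only for illustration.
PRIMER_SITE = "GTACGCAGTCTACGAGAC"
adapters = build_adapters(3, PRIMER_SITE)
print(len(adapters))  # one adapter per possible three-base overhang
```

Sharing the primer-binding region across adapters is what allows a small number of primers to be used in the subsequent amplification, as in the embodiments that use two, or no more than ten, primer-binding sequences.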
Following addition of the adapters, the fragments are amplified, ordinarily by PCR and generally using primer binding sites contained in the adapters. After amplification, the fragments produced are separated and identified. In one embodiment, the fragments produced are identified by separating them on the basis of size. In another embodiment, the amount of each fragment produced is quantified.
Another aspect provides a method for detecting a change in RNA expression in a tissue or cell associated with an internal or external factor. In one embodiment, a population of double-stranded cDNA having an end label is produced from RNA obtained from a first cell or tissue exposed to the internal or external factor of interest and a nucleic acid library constructed using any of the novel methods described herein. The number of fragments and/or quantity of each fragment in the library is then determined to establish the pattern of RNA expression. Next, a population of double-stranded cDNA having an end label is produced from RNA obtained from a second cell or tissue not exposed to the internal or external factor of interest and a nucleic acid library produced as for the first cell or tissue. The number of fragments and/or quantity of each fragment in the library is determined to establish the pattern of RNA expression. The two patterns can then be compared to determine the effect of the external or internal factor on RNA (gene) expression. In one embodiment, data representing the pattern of RNA expression for the cell or tissue not exposed to the internal or external factor is stored on a computer-readable medium.
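A comparison of two expression patterns of this kind might be sketched as below. The representation is an assumption made for illustration: each pattern is a mapping from fragment size (in base pairs) to a measured quantity, and a fragment is reported as changed when its quantity differs by at least a chosen fold threshold. The sizes, quantities and the two-fold threshold are hypothetical.

```python
def compare_patterns(exposed, control, min_fold=2.0):
    """Compare two fragment patterns given as {fragment_size_bp: quantity}.

    Returns fragments seen only in one sample and fragments present in both
    whose quantity differs by at least `min_fold` in either direction.
    """
    report = {"only_exposed": [], "only_control": [], "changed": []}
    for size in sorted(set(exposed) | set(control)):
        a = exposed.get(size, 0.0)
        b = control.get(size, 0.0)
        if a == 0.0 and b == 0.0:
            continue  # fragment not detected in either sample
        if b == 0.0:
            report["only_exposed"].append(size)
        elif a == 0.0:
            report["only_control"].append(size)
        else:
            fold = a / b
            if fold >= min_fold or fold <= 1.0 / min_fold:
                report["changed"].append((size, fold))
    return report

# Hypothetical fragment quantities (arbitrary units) keyed by size in bp.
exposed = {112: 50.0, 187: 400.0, 245: 90.0}
control = {112: 48.0, 187: 100.0, 310: 75.0}
report = compare_patterns(exposed, control)
```

In this example the 187 bp fragment is four-fold more abundant in the exposed sample, the 245 bp fragment appears only in the exposed sample, and the 310 bp fragment appears only in the control. The control pattern could equally be loaded from stored data rather than measured alongside the test sample.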
Still another aspect provides a method for diagnosing a disease, condition, disorder, or predisposition where the disease, condition, disorder or predisposition is associated with a change in RNA expression. In one embodiment, a population of double-stranded cDNA having an end label is produced from RNA obtained from a first cell or tissue known to have the disease, condition, disorder or predisposition of interest and a nucleic acid library constructed using any of the novel methods described herein. The number of fragments and/or quantity of each fragment in the library is then determined to establish the pattern of RNA expression. Next a population of double stranded cDNA having an end label is produced from RNA obtained from a second cell or tissue from a test subject and a nucleic acid library produced as for the first cell or tissue. In one embodiment, the first and second cell or tissue are of the same type. In another embodiment, the first and second cell or tissue are from the same species. The number of fragments and/or quantity of each fragment in the library is determined to establish the pattern of RNA expression. The two patterns can then be compared to diagnose the disease, condition, disorder or predisposition. In one embodiment, data representing the pattern of RNA expression for the cell or tissue known to have the disease, condition, disorder or predisposition of interest is stored on a computer-readable medium.
A further aspect provides a method for determining the physiological or developmental state of a cell or tissue. In one embodiment, a population of double-stranded cDNA having an end label is produced from RNA obtained from a first cell or tissue of a known physiological or developmental state and a nucleic acid library constructed using any of the novel methods described herein. The number of fragments and/or quantity of each fragment in the library is then determined to establish the pattern of RNA expression. Next, a population of double-stranded cDNA having an end label is produced from RNA obtained from a second cell or tissue of an unknown physiological or developmental state and a nucleic acid library produced as for the first cell or tissue. In one embodiment, the first and second cell or tissue are of the same type. In another embodiment, the first and second cell or tissue are from the same species. The number of fragments and/or quantity of each fragment in the library is determined to establish the pattern of RNA expression. The two patterns can then be compared to determine the physiological or developmental state of the cell or tissue. In one embodiment, data representing the pattern of RNA expression for the cell or tissue of a known physiological or developmental state is stored on a computer-readable medium.
Another aspect provides a method for constructing a nucleic acid expression library comprising obtaining a population of double-stranded cDNA molecules that contain a detectable label on their 3′ ends. The population of cDNA molecules is then digested with at least one restriction endonuclease having a degenerate recognition sequence containing at least one degenerate base to produce digestion fragments. The digestion creates single-stranded (unpaired) overhangs or portions in the fragments. These unpaired overhangs contain a region having the formula N^m, where N is the extent of degeneracy and m is the number of bases. The fragments are then hybridized and ligated to a population of adapters, where the adapters are specific for the at least one endonuclease used. In one embodiment, adapters are provided for all possible fragments (different overhangs). In another embodiment, fewer than all possible adapters are used. In one embodiment N^m equals at least 16, while in another embodiment, the adapters contain a primer binding region(s) common to groups of adapters or to all adapters.
After ligation of the adapters, the 3′ end fragments produced are isolated using the detectable label. The 3′ end labeled fragments are then amplified and the amplified fragments are identified. In one embodiment the fragments are amplified by PCR while in a related embodiment, the amplification uses primer binding sites in the adapters. In still another embodiment, the amplified 3′ end fragments are identified by separating them on the basis of size.
Yet a further aspect provides a computer-readable medium having stored thereon executable instructions for performing a method comprising receiving data on RNA expression from a first cell or tissue produced by any of the novel methods disclosed herein and comparing the data from the first cell or tissue with RNA expression data produced from a second cell or tissue using the same method for determining RNA expression as for the first cell or tissue.
Another aspect provides non-redundant expressed sequence tags or probes comprising the fragments or portions thereof produced by the novel methods disclosed herein. Such probes or tags can be used, for example, to determine gene expression by Northern blotting.
Yet another aspect provides microarrays containing fragments or portions thereof produced by the novel methods disclosed herein.