Analysis of complex nucleic acid populations is a common problem in many areas of molecular biology, nowhere more so than in the analysis of patterns of gene expression. Various methods have been developed to allow simultaneous analysis of entire mRNA populations, or their corresponding cDNA populations, in order to understand the observed patterns of gene expression.
The method of “subtractive cloning” (Lee et al., Proc. Nat. Acad. Sci. USA 88, 2825-2829) allows identification of mRNAs, or rather their corresponding cDNAs, that are differentially expressed in two related cell types. One can selectively eliminate cDNAs common to two related cell types by hybridizing cDNAs from a library derived from one cell type to a large excess of mRNA from a related but distinct cell type. mRNAs in the second cell type that are complementary to cDNAs from the first type will form double-stranded hybrids. Various enzymes exist that degrade such double-stranded hybrids, allowing them to be eliminated and thus enriching the remaining population in cDNAs unique to the first cell type. This method yields highly specific comparative information about differences in gene expression between related cell types and has had moderate success in isolating rare cDNAs.
The method of “differential display” (Science 257, 967-971, 1992) sorts mRNAs by using PCR primers to selectively amplify specific subsets of an mRNA population. An mRNA population is primed with a general oligo(dT) primer to amplify one strand and with a specific primer, of perhaps 10 nucleotides or so, to amplify the reverse strand with greater specificity. In this way only mRNAs bearing the second primer sequence are amplified; the longer the second primer, the smaller the proportion of the total cDNA population that is amplified for any given sequence of that length. The resultant amplified sub-population can then be cloned for screening or sequencing, or the fragments can simply be separated on a sequencing gel. Low copy number mRNAs are less likely to be lost in this sort of scheme than in subtractive cloning, and the method is probably more reproducible. Whilst this method is more general than subtractive cloning, time-consuming analysis is required.
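The relationship between primer length and the size of the amplified subset can be illustrated with a rough calculation. The following is a minimal sketch only, assuming exact primer matching, equiprobable and independent bases, and a hypothetical average transcript length of 2000 nucleotides; none of these figures comes from the cited work, and real primers tolerate mismatches.

```python
def fraction_with_site(primer_len: int, transcript_len: int = 2000) -> float:
    """Approximate probability that a random transcript contains at least
    one exact match to one given primer of length primer_len.

    Assumptions (illustrative only): each base is A, C, G or T with
    probability 0.25, and candidate positions are treated as independent.
    """
    positions = transcript_len - primer_len + 1
    if positions <= 0:
        return 0.0
    p_site = 0.25 ** primer_len          # chance of a match at one position
    return 1.0 - (1.0 - p_site) ** positions


if __name__ == "__main__":
    for n in (6, 8, 10, 12):
        print(n, fraction_with_site(n))
```

Under these assumptions each additional primer nucleotide reduces the expected subset roughly fourfold, which is the sense in which a longer second primer gives a smaller amplified proportion.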
The method of “molecular indexing” (PCT/GB93/01452) uses populations of adapter molecules that hybridize to the ambiguous sticky-ends generated by cleavage of a nucleic acid with a type IIs restriction endonuclease in order to categorize the cleavage fragments. Using specifically engineered adapters, one can selectively immobilize, amplify or clone specific subsets of fragments in a manner similar to differential display but with a greater degree of control. Again, time-consuming analysis is required.
The method of Kato (Nucleic Acids Research 12, 3685-3690, 1995) exemplifies the above molecular indexing approach and effects cDNA population analysis by sorting terminal cDNA fragments into sub-populations, followed by selective amplification of specific subsets of cDNA fragments. Sorting is effected by using type IIs restriction endonucleases and adapters. The adapters also carry primer sites, which in conjunction with general oligo(dT) primers allow selective amplification of terminal cDNA fragments as in differential display. The method is possibly more precise than differential display in that it effects greater sorting: only about 100 cDNAs will be present in a given subset, and sorting can be related to specific sequence features rather than relying on primers chosen by trial and error.
The method of “Serial Analysis of Gene Expression” or “SAGE” (Science 270, 484-487, 1995) allows identification of mRNAs, or rather their corresponding cDNAs, that are expressed in a given cell type. The method involves a process for isolating a “tag” from every cDNA in a population using adapters and type IIs restriction endonucleases. A tag is a sample of a cDNA sequence of a fixed number of nucleotides sufficient to identify that cDNA uniquely in the population. Tags are then ligated together to create so-called di-tags, each consisting of two decamers from the pool of cDNA molecules under investigation ligated head-to-head and flanked by two linkers. These di-tags are then amplified using PCR, concatemerized into longer fragments, cloned and sequenced. The method gives quantitative data on gene expression and will readily identify novel cDNAs. This method was invented in 1995, but trials have since shown that the amplification efficiency of different di-tags depends very much upon the sequence of the individual di-tags. In one trial a sevenfold difference between two di-tag sequences was detected after 20 cycles of PCR, even though there was no difference in abundance between these two di-tags in the starting material (NAR 27(18), e22, 1999). This makes SAGE a poor choice if reliable quantitative data are required. The method is also extremely time-consuming in view of the large amount of sequencing required.
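The tag isolation and counting steps described above can be sketched in silico. This is an illustration only and not the protocol of the cited paper: it assumes a hypothetical anchoring site `CATG`, a tag of 10 nucleotides taken immediately 3′ of the 3′-most anchoring site, and cDNA sequences supplied as plain strings. The function names `extract_tag` and `count_tags` are invented for this example.

```python
from collections import Counter
from typing import Iterable, Optional


def extract_tag(cdna: str, anchor: str = "CATG", tag_len: int = 10) -> Optional[str]:
    """Return the tag_len bases immediately 3' of the 3'-most anchor site.

    Returns None if the anchor site is absent or if fewer than tag_len
    bases follow it. The anchor sequence and tag length are assumptions.
    """
    seq = cdna.upper()
    pos = seq.rfind(anchor)              # 3'-most occurrence of the site
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def count_tags(cdnas: Iterable[str]) -> Counter:
    """Count how often each tag occurs in a population of cDNA sequences."""
    tags = (extract_tag(c) for c in cdnas)
    return Counter(t for t in tags if t is not None)


if __name__ == "__main__":
    population = [
        "AAACATGTTTTCATGACGTACGTACGGAAAA",
        "AAACATGTTTTCATGACGTACGTACGGAAAA",
        "GGCATGTTTTTTTTTTAA",
    ]
    print(count_tags(population))
```

Counting tags directly from the starting population, as here, corresponds to the ideal case; the PCR step in the actual method is where the sequence-dependent bias noted above enters.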
The method of “Tandem Arrayed Ligation of Expressed Sequence Tags” or “TAL-EST” (NAR 27(18), e22, 1999) is a modification of SAGE in which the PCR amplification step is replaced by a cloning step. Each analysis then involves two cloning steps. The method is very quantitative and reproducible (P=0.99), but approximately 15% of all genes are invisible in this assay. This means that the expression of 15% of all genes is not detected, regardless of how abundant their mRNA is. Thus TAL-EST is a very labor-intensive and time-consuming technique, and its coverage is only 85% of all genes.
The method of “Total Gene Expression Analysis” or “TOGA” (PNAS 97(5), p. 1976-1981, 2000) makes use of a technique in which the poly(T) tail of the cDNA, along with the sequence 5′ of the poly(T) tail, is ligated into an RNA expression vector. This vector is then linearized and RNA is synthesized in vitro. Gene-specific sequences are then detected and quantified in approximately the same manner as with AFLP. Thus in TOGA, PCR is also used to amplify the products that are analyzed. As for SAGE, the use of PCR before the analysis step jeopardizes the quantitative aspect of the method.
The method of “Massively Parallel Signature Sequencing” or “MPSS” (Nature Biotech. 18, 630-634, 2000) uses a FACS sorting device in the data acquisition process. Like many of the other techniques, MPSS depends heavily upon PCR for amplification of the tags, and hence MPSS is afflicted by all the problems that come from using PCR.
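The quantitative problem with PCR-based methods can be made concrete with simple arithmetic on the sevenfold difference after 20 cycles reported above for two SAGE di-tags. The sketch below assumes idealized exponential amplification with a constant per-cycle yield for each template, which is a simplification; the function names are hypothetical.

```python
def per_cycle_ratio(fold_difference: float, cycles: int) -> float:
    """Per-cycle yield ratio that compounds to fold_difference after
    the given number of cycles, assuming constant exponential growth."""
    return fold_difference ** (1.0 / cycles)


def fold_after(ratio_per_cycle: float, cycles: int) -> float:
    """Fold difference between two templates after the given number of
    cycles, for a constant per-cycle yield ratio."""
    return ratio_per_cycle ** cycles


if __name__ == "__main__":
    r = per_cycle_ratio(7.0, 20)
    print(f"per-cycle ratio: {r:.4f}")
    print(f"fold difference after 30 cycles: {fold_after(r, 30):.1f}")
```

Under this model a per-cycle advantage of only about 10% is enough to produce a sevenfold distortion after 20 cycles, so small sequence-dependent differences in efficiency can dominate the measured abundances.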
Methods involving hybridization grids, chips and arrays are advantageous in that they avoid gel methods for sequencing and are relatively quantitative. They can be performed entirely in solution, and are thus readily automatable. These methods come in two forms.
The first involves immobilization of target nucleic acids to an array of oligonucleotides complementary to the terminal sequences of the target nucleic acid. Immobilization is followed by partial sequencing of those fragments by a single base method, e.g. using type IIs restriction endonucleases and adapters. This particular approach is advocated by Brenner in PCT/US95/12678.
The second form involves arrays of oligonucleotides. Nucleic acids are hybridized as single strands to the array. Hybridization is detected by fluorescently labeling each nucleic acid and determining from where on the grid the fluorescence arises, which identifies the oligonucleotide to which the nucleic acid has bound. The fluorescent labels also give quantitative information about how much nucleic acid has hybridized to a given oligonucleotide. This information, together with knowledge of the relative quantities of individual nucleic acids, should be sufficient to reconstruct the sequences and quantities of the hybridizing population. This approach is advocated by Lehrach in numerous papers, and Nucleic Acids Research 22, 3423 contains a recent discussion. A disadvantage of this approach is that the construction of large arrays of oligonucleotides is extremely technically demanding and expensive. It also remains a major technological challenge to hybridize between 10,000 and 20,000 different cDNA products quantitatively to a gene-chip containing between 25,000 and 100,000 different cDNA probes without a significant amount of mismatch hybridization. Another drawback of DNA array technology is that high quality sequence information is necessary for all the genes used on the array. Still, the technology is relatively easy to use once the arrays have been designed and manufactured.
Additional methods for analyzing and demonstrating differential gene expression have been disclosed in e.g. WO 94/01582; WO 97/10363; WO 97/13877; WO 98/10095; WO 98/15652; WO 98/31380; WO 98/44152; WO 98/48047; WO 99/02725; WO 99/02726; WO 99/02727; WO 99/02728; WO 99/39001; WO 00/53806; U.S. Pat. No. 5,508,169; U.S. Pat. No. 5,658,736; U.S. Pat. No. 6,090,553; and EP 735 144 A1. Reference is also made to Cowan et al. (J. Theor. Biol., 1987, vol. 127, p. 229-245), who disclose breakage of double-stranded DNA due to single-stranded nicking. The nicking activity is not site-specific. Morgan et al. (Biol. Chem., 2000, vol. 361, p. 1123-1125) disclose a characterization of the specific DNA nicking activity of restriction endonuclease N.BstNBI.
None of the above methods are related to a method for obtaining—and optionally analyzing the sequence of—at least one single stranded polynucleotide tag originating at least partly from a biological sample and comprising a consecutive sequence of bases, wherein—prior to sequence analysis or other characterization—no part of the single stranded polynucleotide tag comprises a complementary polynucleotide strand, and wherein preferably all of the bases originate from the biological sample, such as more than 95% of the bases, for example more than 90% of the bases, such as more than 85% of the bases, for example more than 80% of the bases, such as more than 75% of the bases originating from the biological sample.
Furthermore, none of the above methods exploit a cleavage agent, preferably in the form of a site-specific nicking endonuclease capable of i) recognizing a predetermined nucleotide motif comprising complementary nucleotide strands and ii) cleaving only one of said complementary strands in the process of generating at least one single stranded polynucleotide tag.
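The behaviour of a site-specific nicking endonuclease, namely recognition of a double-stranded motif followed by cleavage of only one strand, can be sketched as a simple sequence search. The example is illustrative only. It assumes a recognition motif `GAGTC` with the nick placed four nucleotides 3′ of the motif on the strand carrying it, which is the specificity commonly reported for N.BstNBI but is not stated in this document; both values are parameters and should be checked against the enzyme supplier's data. All function names are hypothetical.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def nick_positions(strand: str, motif: str = "GAGTC", offset: int = 4) -> List[int]:
    """Return 0-based indices i such that the strand is nicked between
    strand[i-1] and strand[i], for every motif occurrence on this strand.

    motif and offset are assumed values, not taken from the source text.
    """
    seq = strand.upper()
    cuts = []
    pos = seq.find(motif)
    while pos != -1:
        cut = pos + len(motif) + offset
        if cut < len(seq):               # a base must exist on both sides
            cuts.append(cut)
        pos = seq.find(motif, pos + 1)
    return cuts


def nick_duplex(top: str) -> Dict[str, List[int]]:
    """Nick positions on each strand of a duplex, each given in that
    strand's own 5'->3' coordinates."""
    return {
        "top": nick_positions(top),
        "bottom": nick_positions(reverse_complement(top)),
    }


if __name__ == "__main__":
    print(nick_duplex("AAGAGTCACGTTTT"))
```

Because each motif occurrence yields a cut on one strand only, the duplex is nicked rather than cleaved through, in contrast to the type IIs restriction endonucleases used in the methods discussed above.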