Parallel assay formats permitting the concurrent (“multiplexed”) analysis of multiple analytes in a single reaction are gaining widespread acceptance in the analysis of proteins and nucleic acids in molecular medicine and biomedical research. Multiplexed formats of nucleic acid analysis, either in solution or in a solid phase format involving arrays of immobilized primers and probes (see, e.g., U. Maskos, E. M. Southern, Nucleic Acids Res. 20, 1679-1684 (1992); S. P. A. Fodor, et al., Science 251, 767-773 (1991)), generally involve the selection of oligonucleotide probes whose specific interaction with designated subsequences within a given set of target sequences of interest reveals the composition of the target at the designated position(s).
Applications of particular practical interest involve multi-step procedures in which, as a first step, a set of original sequences is converted into a selected subset, for example by PCR amplification of selected subsequences of genomic DNA to produce corresponding amplicons, or by reverse transcription of selected subsequences of mRNA to produce corresponding cDNAs. In the simplest such sequence of process steps, a conversion step is followed by a detection step to complete the analysis. In these applications, the reliability of multiplexed nucleic acid analysis critically depends on the specific and preferably exclusive interaction of primers with their respective cognate target subsequences and the specific and preferably exclusive interaction of probes with their respective cognate subsequences within the targets produced in the conversion step. Accordingly, described herein are methods which, given a set of target sequences of interest, allow selection of conversion probes (“primers”) and detection probes so as to minimize the interaction of a given primer or probe with any but its cognate target subsequence.
Multiplexed Expression Profiling—Methods of gene expression analysis have been widely used in connection with target discovery or mapping, in which genes of interest may not be known a priori and a significant risk of error may have to be tolerated. Conversely, in diagnostic applications involving a designated set of genes of interest, the multiple sources of potential error inherent in the aforementioned approaches generally will not be tolerable. The present invention discloses methods of analysis suitable for diagnostic applications as well as target validation and patient profiling.
Known methods for multiplexed expression analysis use either randomly placed short reverse transcription (RT) primers to convert a set of RNAs into a heterogeneous population of cDNAs, or a universal RT primer directed against the polyA tail of the mRNA to produce full-length cDNAs. While these methods obviate the need for design of sequence-specific RT primers, both have significant disadvantages in quantitative expression monitoring, which requires the quantitative determination of cDNA levels in the target mixture as a measure of the levels of expression of the corresponding mRNAs.
The determination of gene expression levels may be performed in a parallel format by employing an array of oligonucleotide capture probes or, in some cases, cDNA molecules disposed on a planar substrate, and contacting the array—under specific conditions permitting formation of probe-target complexes—with a solution containing nucleic acid samples of interest, including mRNAs extracted from a particular tissue, or cDNAs produced from the mRNAs by reverse transcription (RT). Following completion of the complex formation (“hybridization”) step, unbound target molecules are removed, and intensities are recorded from each position within the array, these intensities reflecting the amount of individual probe-target complexes formed during the assay. This pattern is analyzed to obtain information regarding the abundance of mRNAs expressed in the sample.
In a commonly practiced approach to multiplexed expression profiling, mRNA molecules in a sample of interest are first reverse transcribed to produce corresponding cDNAs and are then contacted with an array of oligonucleotide capture probes formed by spotting or by in-situ synthesis. Lockhart et al., U.S. Pat. No. 6,410,229, invoke a complex protocol to produce cRNA, wherein mRNA is reverse transcribed to cDNA, which is in turn transcribed to cRNA under heavy labeling (of one in eight dNTPs on average) and detected on an array of synthesized oligonucleotide probes using a secondary “decoration” step. This is a complex, lengthy and expensive process.
These known methods rely on multiplexed probe-target hybridization, which is known to be lacking in specificity, as the single step of sequence-specific discrimination between, and quantitative determination of, multiple target sequences. Randomly placed RT primers will produce a representative population of cDNAs; that is, one in which each cDNA is represented with equal frequency, only in the limit of infinitely long mRNA molecules. The analysis of a designated set of short mRNAs by random priming generally will produce cDNAs of widely varying lengths for each type of mRNA in the mixture, and this in turn will introduce potentially significant bias in the quantitative determination of cDNA concentration, given that short cDNAs will more readily anneal to immobilized capture probes than will long cDNAs. Further, the production of full-length cDNAs, if in fact full-length RT is successful, provides a large sequence space for potential cross-reactivity between probes and primers, making the results inherently difficult to interpret and unreliable.
Some methods of multiplexed hybridization use long probes in spotted arrays. Note that Agilent EP 1207209 discloses probes of preferred length 10 to 30 nucleotides, and preferably about 25 nucleotides. Long probes may offer an advantage in the generally undesirable situation in which probe adhesion to the substrate randomly obstructs target access to probe sequences of interest, because probe-target complex formation generally will not involve the full length of the probe but rather randomly accessible subsequences of it. However, in a long probe, the probe sequence of interest may itself be obstructed and not accessible.
Differential Gene Expression—Gene expression analysis has been widely used to characterize molecular differences between normal tissue or cells and diseased or otherwise altered tissue or cells, or differences between normal (“wild-type”) and transgenic plants. In accordance with a commonly practiced approach to differential gene expression, a set of cDNA clones is “spotted” onto a planar substrate to form the probe array, which is then contacted with DNA produced from normal and altered sources. DNA from the two sources is differentially labeled to permit the recording of patterns formed by probe-target hybridization in two color channels, thus permitting the determination of expression ratios in normal and altered samples (see, e.g., U.S. Pat. No. 6,110,426 (Stanford University)). The system of two-color fluorescent detection is cumbersome and may lead to errors of detection.
Multiplexed Analysis of Mutations and Polymorphisms—Another well-known method for multiplexed conversion of genomic DNA sequences to a selected set of short DNA subsequences is amplification with sequence-specific primers, as in the example of linear amplification by strand displacement or other methods or geometric amplification by PCR. Following amplification, the amplicons can be analyzed by hybridization detection or by hybridization coupled with elongation detection, using cognate probes. Selection of primers and probes can avoid excessive cross-hybridization and enhance the reliability of the results. The methods described herein also relate to applications that call for amplification followed by detection, as well as to situations calling for the concatenation of multiple conversion and detection steps.
What is desirable in these applications is the selection, for each target, of a matching (“cognate”) probe, that is, a probe with a sequence that is perfectly complementary to one and only one designated subsequence while containing at least one, but preferably several, non-complementary (“mismatched”) positions with respect to all other sequences in the reaction, including other subsequences on the same target strand as the cognate subsequence (see, e.g., “Selection of optimal DNA oligos for gene expression arrays”, Li & Stormo, Bioinformatics 17, 1067-1076 (2001)). To select one among several possible candidate probes, known methods rely on the evaluation of sequence-dependent free energies of the complex (“duplex”) formed between primer or probe and target, the analysis culminating in the evaluation of the thermodynamic stability of the complex in terms of a “melting” temperature (Cantor & Smith, “Genomics”, 2001).
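The cognate-probe criterion above can be illustrated with a minimal Python sketch. It is not the method disclosed herein: it merely counts mismatched positions between a probe's perfect binding site and every other same-length window in a set of targets, ignoring thermodynamics and mismatch position. The function names (`reverse_complement`, `min_mismatches`, `closest_off_target`) and the toy sequences are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

# Watson-Crick pairing rules for DNA.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs perfectly with seq (read 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def min_mismatches(probe: str, target: str, skip_start: int = -1) -> Tuple[int, int]:
    """Slide the probe's perfect binding site along target.

    Returns (fewest mismatches, window start). The window beginning at
    skip_start is ignored, so the cognate site can be excluded. If the
    target is shorter than the probe, (len(probe) + 1, -1) is returned.
    """
    site = reverse_complement(probe)  # target subsequence pairing perfectly with probe
    n = len(site)
    best = (n + 1, -1)
    for start in range(len(target) - n + 1):
        if start == skip_start:
            continue
        mismatches = sum(1 for a, b in zip(site, target[start:start + n]) if a != b)
        if mismatches < best[0]:
            best = (mismatches, start)
    return best


def closest_off_target(probe: str, targets: List[str],
                       cognate_idx: int, cognate_start: int) -> int:
    """Fewest mismatches between the probe and any non-cognate window.

    A larger value means the probe is better separated from everything
    except its cognate subsequence; 0 means a second perfect site exists.
    """
    closest = len(probe) + 1
    for idx, target in enumerate(targets):
        skip = cognate_start if idx == cognate_idx else -1
        mismatches, _ = min_mismatches(probe, target, skip)
        closest = min(closest, mismatches)
    return closest


if __name__ == "__main__":
    # Toy targets (hypothetical); the probe is cognate to targets[0][2:8].
    targets = ["ACGTACGGTCA", "ACGTTCGGACA"]
    probe = reverse_complement(targets[0][2:8])
    print(min_mismatches(probe, targets[0]))
    print(closest_off_target(probe, targets, cognate_idx=0, cognate_start=2))
```

Among several candidate probes for the same target, such a sketch would favor the candidate with the largest `closest_off_target` value; a realistic selection would additionally weigh duplex stability and the position of each mismatch.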
Several available algorithms for primer and probe design have been described which invoke nearest-neighbor (NN) interaction parameters to compute the free energy of a hybridization complex of known sequence whose thermodynamic stability is expressed in the form of a “melting temperature”, Tm; at T=Tm, half of the complex has denatured into its constituent strands. Several commercially available software packages focus on the detailed modeling of probe-target interaction under a wide range of relevant experimental parameters to predict the stability of the complex as well as competing structures such as folded target or probe strands, the latter including certain hairpin configurations. In the majority of commercial primer or probe design tools, the issue of cross-reactivity, critical to the design of multiplexed assays, remains substantially unaddressed.
When sequence homologies are taken into account, this is achieved by pairwise comparison using standard search tools such as BLAST (see, e.g., PrimerSelect (DNAStar), ArrayDesigner 2 (Premier Biosoft)), an approach that not only requires significant time and effort in manually performing pairwise comparisons by “cutting and pasting”, but also fails for long templates (>1 kb), and generally ignores the fact that the position of a mismatch within the primer or probe sequence plays a critical role in determining the actual extent of cross-reactivity. Moreover, the design of conversion probes (“primers”) is treated independently of the design of detection probes, creating a source of unreliability.
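The relation between duplex free energy and melting temperature can be sketched as follows, under the standard two-state assumption for two non-self-complementary strands present at equal concentration. The enthalpy and entropy passed in are taken as given (in practice they would be summed from NN parameters, which are not reproduced here); the function names and the numerical values in the usage lines are hypothetical.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol*K)


def melting_temperature_c(dh_kcal_per_mol: float, ds_cal_per_mol_k: float,
                          total_strand_molar: float) -> float:
    """Two-state Tm in degrees Celsius: Tm = dH / (dS + R ln(C_T / 4)).

    Assumes non-self-complementary strands at equal concentration, with
    C_T the total strand concentration in mol/L.
    """
    if total_strand_molar <= 0:
        raise ValueError("total strand concentration must be positive")
    denom = ds_cal_per_mol_k + R_CAL * math.log(total_strand_molar / 4.0)
    return dh_kcal_per_mol * 1000.0 / denom - 273.15


def duplex_fraction(dh_kcal_per_mol: float, ds_cal_per_mol_k: float,
                    total_strand_molar: float, temp_c: float) -> float:
    """Fraction of strands in duplex form at temp_c under the same model."""
    temp_k = temp_c + 273.15
    dg_cal = dh_kcal_per_mol * 1000.0 - temp_k * ds_cal_per_mol_k
    k_eq = math.exp(-dg_cal / (R_CAL * temp_k))   # association constant
    a = k_eq * total_strand_molar
    # Solve a * (1 - f)^2 = 2 f for the physical root 0 <= f <= 1.
    return ((a + 1.0) - math.sqrt(2.0 * a + 1.0)) / a


if __name__ == "__main__":
    # Hypothetical duplex: dH = -80 kcal/mol, dS = -220 cal/(mol*K), C_T = 1 uM.
    tm = melting_temperature_c(-80.0, -220.0, 1e-6)
    print(round(tm, 1), round(duplex_fraction(-80.0, -220.0, 1e-6, tm), 3))
```

Evaluating `duplex_fraction` at the computed Tm returns one half, consistent with the definition of Tm given above. Note that this single number says nothing about cross-reactivity with non-cognate sequences.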
Design of Unique Sequences: Coding—The issue of selecting a set of unique probe sequences is central to the design of DNA codes, namely sets of equi-length “words” composed of the letters A, T, G and C, for purposes of designing methods of parallel sequencing, storing (“encoding”) information in chemical libraries such as “zip code” oligos (U.S. Pat. No. 5,981,176 to Wallace) or analog (“DNA”) computing. The objective of code design is to find a set of N-letter words (herein also referred to as “N-strings”) wherein any two words differ in at least d positions with respect to the Watson-Crick base pairing rules—that is, words have a Hamming distance of at least d≦N. Generally, codes satisfy additional constraints, for example, the constraint that free energies, computed on the basis of standard nearest-neighbor (NN) interaction parameters (Cantor & Smith, “Genomics”, 2001), fall into a given range.
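The code-design objective described above can be illustrated with a minimal greedy construction in Python. This is a simple lexicographic sketch, not a method taken from the cited references: it compares words directly by Hamming distance and omits the complementarity and free-energy constraints that practical DNA codes also impose. The function names `hamming` and `greedy_code` are hypothetical.

```python
from itertools import product
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def greedy_code(n: int, d: int, alphabet: str = "ACGT") -> List[str]:
    """Build a set of n-letter words with pairwise Hamming distance >= d.

    Words are visited in lexicographic order and kept only if they are far
    enough from every word already accepted. The result is valid but not
    necessarily the largest possible code.
    """
    code: List[str] = []
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        if all(hamming(word, kept) >= d for kept in code):
            code.append(word)
    return code


if __name__ == "__main__":
    print(greedy_code(3, 3))
    print(len(greedy_code(4, 2)))
```

The exhaustive enumeration grows as 4^N, so a sketch of this kind is practical only for short words.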
The methods herein address a different situation: probe sequences must be identified which match a preselected set of target sequences while minimizing unwanted cross-reactions with other than the cognate sequences. In view of the foregoing considerations, it will be desirable, for diagnostic application of gene expression analysis (herein also referred to as multiplexed expression monitoring, or mEM) as well as for related situations involving target amplification, to have flexible and rapid methods by which to produce correlated sets of desirable conversion probes, such as RT primers, and detection probes, such as probes for hybridization-mediated target capture, which enhance the level of reliability.