The present invention relates to methods for detecting and mapping genetic abnormalities associated with various diseases. In particular, it relates to the use of nucleic acid hybridization methods for comparing copy numbers of particular nucleic acid sequences in a collection of sequences relative to the copy number of these sequences in other collections of sequences.
Many genomic and genetic studies are directed to the identification of differences in gene dosage or expression among cell populations for the study and detection of disease. For example, many malignancies involve the gain or loss of DNA sequences resulting in activation of oncogenes or inactivation of tumor suppressor genes. Identification of the genetic events leading to neoplastic transformation and subsequent progression can facilitate efforts to define the biological basis for disease, improve prognostication of therapeutic response, and permit earlier tumor detection.
In addition, perinatal genetic problems frequently result from loss or gain of chromosome segments, as in trisomy 21 or the microdeletion syndromes. Thus, methods of prenatal detection of such abnormalities can be helpful in early diagnosis of disease.
Cytogenetics is the traditional method for detecting amplified or deleted chromosomal regions. The resolution of cytogenetic techniques is limited, however, to regions larger than approximately 10 Mb (approximately the width of a band in Giemsa-stained chromosomes). In complex karyotypes with multiple translocations and other genetic changes, traditional cytogenetic analysis is of little utility because karyotype information cannot be fully interpreted. Furthermore, conventional cytogenetic banding analysis is time consuming, labor intensive, and frequently difficult or impossible because adequate metaphase chromosomes are hard to obtain. In addition, the cytogenetic signatures of gene amplification, namely homogeneously staining regions (HSR) or double minute chromosomes, do not provide any information that contributes to the identification of the sequences that are amplified.
More recent methods permit assessing the amount of a given nucleic acid sequence in a sample using molecular techniques. These methods (e.g., Southern blotting) employ cloned DNA or RNA probes that are hybridized to isolated DNA. Southern blotting and related techniques are effective even if the genome is heavily rearranged so as to eliminate useful karyotype information. However, these methods require use of a probe specific for the sequence to be analyzed. Thus, it is necessary to employ very many individual probes, one at a time, to survey the entire genome of each specimen, if no prior information on particular suspect regions of the genome is available.
Comparative genomic hybridization (CGH) is a more recent approach to detect the presence and identify the location of amplified or deleted sequences. See Kallioniemi et al., Science 258: 818-821 (1992) and WO 93/18186. CGH reveals increases and decreases irrespective of genome rearrangement. In one implementation of CGH, genomic DNA is isolated from normal reference cells, as well as from test cells (e.g., tumor cells). The two nucleic acids are differentially labelled and then hybridized in situ to metaphase chromosomes of a reference cell. The repetitive sequences in both the reference and test DNAs are either removed or their hybridization capacity is reduced by some means. Chromosomal regions in the test cells which are at increased or decreased copy number can be quickly identified by detecting regions where the ratio of signal from the two DNAs is altered. For example, those regions that have been decreased in copy number in the test cells will show relatively lower signal from the test DNA than from the reference DNA, compared to other regions of the genome. Regions that have been increased in copy number in the test cells will show relatively higher signal from the test DNA.
Thus, CGH discovers and maps the location of the sequences with variant copy number without prior knowledge of the sequences. No probes for specific sequences are required and only a single hybridization is required. Where a decrease or an increase in copy number is limited to the loss or gain of one copy of a sequence, the CGH resolution is usually about 5-10 Mb.
New techniques which provide increased sensitivity, more precise localization of chromosomal abnormalities and which can detect differences in levels of gene expression are particularly desirable for the diagnosis of disease. The present invention provides these and other benefits.
The present invention provides methods for quantitatively comparing copy numbers of at least two nucleic acid sequences in a first collection of nucleic acid molecules relative to the copy numbers of those same sequences in a second collection. The method comprises labeling the nucleic acid molecules in the first collection and the nucleic acid molecules in the second collection with first and second labels, respectively. The first and second labels should be distinguishable from each other. The probes thus formed are contacted to a plurality of target elements under conditions such that nucleic acid hybridization to the target elements can occur. The probes can be contacted to the target elements either simultaneously or serially.
Each target element comprises target nucleic acid molecules bound to a solid support. One or more copies of each sequence in a target element may be present. The sequence complexity of the target nucleic acids in the target element is much less than the sequence complexity of the first and second collections of labeled nucleic acids.
The nucleic acids for both the target elements and the probes may be, for example, RNA, DNA, or cDNA. The nucleic acids may be derived from any organism. Usually the nucleic acids in the target elements and the probes are from the same species.
The target elements may be on separate supports, such as a plurality of beads, or an array of target elements may be on a single solid surface, such as a glass microscope slide. The nucleic acid sequences of the target nucleic acids in a target element are those for which comparative copy number information is desired. For example, the sequence of an element may originate from a chromosomal location known to be associated with disease, may be selected to be representative of a chromosomal region whose association with disease is to be tested, or may correspond to genes whose transcription is to be assayed.
After contacting the probes to the target elements, the amount of binding of each probe is measured and the binding ratio is determined for each target element. Typically, the greater the binding ratio at a target element, the greater the copy number ratio of the sequences in the two probes that bind to that element. Thus, comparison of the ratios among target elements permits comparison of copy number ratios of different sequences in the probes.
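The ratio comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the element names, the signal values, the function names, and the choice to normalize by the median ratio are all assumptions introduced here and are not specified by the method itself.

```python
from statistics import median


def binding_ratios(test_signal, reference_signal):
    """Return the test/reference binding ratio for each target element."""
    ratios = {}
    for element, test_value in test_signal.items():
        ref_value = reference_signal[element]
        if ref_value <= 0:
            raise ValueError(f"reference signal for {element!r} must be positive")
        ratios[element] = test_value / ref_value
    return ratios


def normalize_to_median(ratios):
    """Divide every ratio by the median ratio (an assumed normalization step)."""
    center = median(ratios.values())
    return {element: ratio / center for element, ratio in ratios.items()}


# Hypothetical signal intensities measured for the two labels at three elements.
test = {"element_A": 200.0, "element_B": 100.0, "element_C": 50.0}
reference = {"element_A": 100.0, "element_B": 100.0, "element_C": 100.0}

ratios = binding_ratios(test, reference)
relative = normalize_to_median(ratios)
```

In this made-up data set, element_A has a higher ratio than element_B, and element_C has a lower one, which would suggest a relative gain and a relative loss, respectively, of the sequences binding to those elements in the test probe.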
The methods are typically carried out using techniques suitable for fluorescence in situ hybridization. Thus, the first and second labels are usually fluorescent labels.
To inhibit hybridization of repetitive sequences in the probes to the target nucleic acids, unlabeled blocking nucleic acids (e.g., Cot-1 DNA) can be mixed with the probes. Thus, the invention focuses on the analysis of the non-repetitive sequences in a genome.
In a typical embodiment, one collection of probe nucleic acids is prepared from a test cell, cell population, or tissue under study, and the second collection of probe nucleic acids is prepared from a reference cell, cell population, or tissue. Reference cells can be normal non-diseased cells, or they can be from a sample of diseased tissue that serves as a standard for other aspects of the disease. For example, if the reference probe is genomic DNA isolated from normal cells, then the copy number of each sequence in that probe relative to the others is known (e.g., two copies of each autosomal sequence, and one or two copies of each sex chromosomal sequence depending on gender). Comparison of this to a test probe permits detection of variations from normal. Alternatively, the reference probe may be prepared from genomic DNA from a primary tumor, which may contain substantial variations in copy number among its different sequences, and the test probe may be prepared from genomic DNA of metastatic cells from that tumor, so that the comparison shows the differences between the primary tumor and its metastasis. Further, both probes may be prepared from normal cells. For example, comparison of mRNA populations between normal cells of different tissues permits detection of differential gene expression that is a critical feature of tissue differentiation. Thus, in general, the terms test and reference are used for convenience to distinguish the two probes, but they do not imply other characteristics of the nucleic acids they contain.
The invention also provides kits comprising materials useful for carrying out the methods of the invention. Kits of the invention comprise a solid support having an array of target nucleic acids bound thereto and a container containing nucleic acids representing a normal reference genome, or cDNA from a reference cell type, and the like. The kit may further comprise two different fluorochromes, reagents for labeling the test genomes, alternate reference genomes and the like.
A “nucleic acid array” is a plurality of target elements, each comprising one or more target nucleic acid molecules immobilized on a solid surface to which probe nucleic acids are hybridized.
“Target nucleic acids” of a target element typically have their origin in a defined region of the genome (for example a clone or several contiguous clones from a genomic library), or correspond to a functional genetic unit, which may or may not be complete (for example a full or partial cDNA). The target nucleic acids can also comprise inter-Alu or Degenerate Oligonucleotide Primer PCR products derived from such clones. If gene expression is being analyzed, a target element can comprise a full or partial cDNA.
The target nucleic acids of a target element may, for example, contain specific genes or be from a chromosomal region suspected of being present at increased or decreased copy number in cells of interest, e.g., tumor cells. The target element may also contain an mRNA, or cDNA derived from such mRNA, suspected of being transcribed at abnormal levels.
Alternatively, a target element may comprise nucleic acids of unknown significance or location. An array of such elements could represent locations that sample, either continuously or at discrete points, any desired portion of a genome, including, but not limited to, an entire genome, a single chromosome, or a portion of a chromosome. The number of target elements and the complexity of the nucleic acids in each would determine the density of sampling. For example an array of 300 target elements, each target containing DNA from a different genomic clone, could sample the entire human genome at 10 megabase intervals. An array of 30,000 elements, each containing 100 kb of genomic DNA could give complete coverage of the human genome.
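The two figures in the preceding example follow from simple arithmetic, sketched below. The sketch assumes a haploid human genome size of roughly 3,000 Mb, a value the text implies but does not state; the constant and function names are hypothetical.

```python
# Assumed approximate haploid human genome size in megabases (not stated in the text).
HUMAN_GENOME_MB = 3000


def sampling_interval_mb(num_elements, genome_mb=HUMAN_GENOME_MB):
    """Average spacing, in Mb, between evenly distributed target elements."""
    return genome_mb / num_elements


def elements_for_full_coverage(element_size_kb, genome_mb=HUMAN_GENOME_MB):
    """Number of non-overlapping elements of the given size needed to tile the genome."""
    genome_kb = genome_mb * 1000
    return -(-genome_kb // element_size_kb)  # ceiling division on integers


interval = sampling_interval_mb(300)
full_coverage = elements_for_full_coverage(100)
```

Under the assumed genome size, 300 elements give a 10 Mb sampling interval and 100 kb elements require 30,000 elements for complete coverage, matching the figures above.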
Similarly, an array of target elements comprising nucleic acids from anonymous cDNA clones would permit identification of those that might be differentially expressed in some cells of interest, thereby focusing attention on study of these genes.
Target elements of various dimensions can be used in the arrays of the invention. Generally, smaller target elements are preferred. Typically, a target element will be less than about 1 cm in diameter. Generally, element sizes are from about 1 μm to about 3 mm, preferably between about 5 μm and about 1 mm.
The target elements of the arrays may be arranged on the solid surface at different densities. The target element densities will depend upon a number of factors, such as the nature of the label, the solid support, and the like.
One of skill will recognize that each target element may comprise a mixture of target nucleic acids of different lengths and sequences. Thus, for example, a target element may contain more than one copy of a cloned piece of DNA, and each copy may be broken into fragments of different lengths. The length and complexity of the target sequences of the invention are not critical to the invention. One of skill can adjust these factors to provide optimum hybridization and signal production for a given hybridization procedure, and to provide the required resolution among different genes or genomic locations. Typically, the target sequences will have a complexity between about 1 kb and about 1 Mb.
In preferred embodiments, the targets of the invention are nucleic acids which substantially lack the superstructure associated with the condensed metaphase chromosomes from which they are derived. The general nature of the packing of DNA into eukaryotic chromosomes is well known to those of skill in the art. Briefly, the superstructure of a eukaryotic chromosome comprises many orders of complexity. DNA is wrapped around a histone core to form regular repeating nucleosomes, which, in turn, are packed one upon another to generate more tightly condensed 30 nm chromatin fibers. The chromatin fibers are then further packed in a variety of looped domains to produce higher orders of folding and condensation in the metaphase chromosome. The nucleic acid targets of the invention lack some or all of these features of naturally occurring condensed metaphase chromosomes. For a general description of the global structure of eukaryotic chromosomes, see Alberts et al., Molecular Biology of the Cell, 2nd ed., pp. 496-506, Garland Publishing Inc., New York (1989).
The terms “nucleic acid” or “nucleic acid molecule” refer to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form, and unless otherwise limited, would encompass known analogs of natural nucleotides that can function in a similar manner as naturally occurring nucleotides.
As used herein a “probe” is defined as a collection of nucleic acid molecules (either RNA or DNA) capable of binding to a target nucleic acid of complementary sequence through one or more types of chemical bonds, usually through hydrogen bond formation. The probes are preferably directly or indirectly labelled as described below. They are typically of high complexity, for instance, being prepared from total genomic DNA or mRNA isolated from a cell or cell population.
The term “complexity” is used here according to the standard meaning of this term as established by Britten et al. Methods of Enzymol. 29:363 (1974). See also Cantor and Schimmel, Biophysical Chemistry: Part III at 1228-1230, for further explanation of nucleic acid complexity.
“Bind(s) substantially” refers to complementary hybridization between a probe nucleic acid and a target nucleic acid and embraces minor mismatches that can be accommodated by reducing the stringency of the hybridization media to achieve the desired detection of the target polynucleotide sequence.
The terms “specific hybridization” or “specifically hybridizes with” refer to hybridization in which a probe nucleic acid binds substantially to target nucleic acid and does not bind substantially to other nucleic acids in the array under defined stringency conditions. One of skill will recognize that relaxing the stringency of the hybridizing conditions will allow sequence mismatches to be tolerated. The degree of mismatch tolerated can be controlled by suitable adjustment of the hybridization conditions.
One of skill will also recognize that the precise sequence of the particular nucleic acids described herein can be modified to a certain degree to produce probes or targets that are “substantially identical” to others, and retain the ability to bind substantially to a complementary nucleic acid. Such modifications are specifically covered by reference to individual sequences herein. The term “substantial identity” of polynucleotide sequences means that a polynucleotide comprises a sequence that has at least 90% sequence identity, and more preferably at least 95%, compared to a reference sequence using the methods described below using standard parameters.
Two nucleic acid sequences are said to be “identical” if the sequence of nucleotides in the two sequences is the same when aligned for maximum correspondence as described below. The term “complementary to” is used herein to mean that the complementary sequence is complementary to all or a portion of a reference polynucleotide sequence.
Sequence comparisons between two (or more) polynucleotides are typically performed by comparing the sequences over a “comparison window” to identify and compare local regions of sequence similarity. A “comparison window”, as used herein, refers to a segment of at least about 20 contiguous positions, usually about 50 to about 200, more usually about 100 to about 150, in which a sequence may be compared to a reference sequence of the same number of contiguous positions after the two sequences are optimally aligned.
Optimal alignment of sequences for comparison may be conducted by the local homology algorithm of Smith and Waterman Adv. Appl. Math. 2: 482 (1981), by the homology alignment algorithm of Needleman and Wunsch J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson and Lipman Proc. Natl. Acad. Sci. (U.S.A.) 85: 2444 (1988), or by computerized implementations of these algorithms.
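To illustrate the kind of computation a local homology algorithm performs, the following is a simplified sketch of Smith-Waterman style scoring with a linear gap penalty. It returns only the best local alignment score, not the alignment itself. The scoring values (match, mismatch, gap) and the function name are assumptions chosen for illustration; they are not parameters given in this document, and the sketch is not a substitute for an established computerized implementation.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (linear gap penalty)."""
    cols = len(seq_b) + 1
    previous = [0] * cols  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(seq_a) + 1):
        current = [0] * cols
        for j in range(1, cols):
            # Score for aligning seq_a[i-1] with seq_b[j-1].
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            current[j] = max(
                0,                      # start a new local alignment here
                previous[j - 1] + step, # extend along the diagonal
                previous[j] + gap,      # gap in seq_b
                current[j - 1] + gap,   # gap in seq_a
            )
            best = max(best, current[j])
        previous = current
    return best


identical_score = smith_waterman_score("ACGT", "ACGT")
unrelated_score = smith_waterman_score("AAAA", "TTTT")
shared_core_score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

Because cell scores are never allowed to fall below zero, unrelated flanking sequence does not penalize a short shared region, which is what makes the alignment local rather than global.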
“Percentage of sequence identity” is determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison and multiplying the result by 100 to yield the percentage of sequence identity.
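The calculation just described can be sketched as follows, assuming the two sequences have already been optimally aligned and are supplied as equal-length strings in which "-" marks a gap. The function names, the gap character, and the example sequences are hypothetical; the 90% default threshold is the figure given above for substantial identity.

```python
def percent_identity(aligned_query, aligned_reference):
    """Percentage of window positions at which both aligned sequences share a base."""
    if len(aligned_query) != len(aligned_reference):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("comparison window must not be empty")
    # A gap character never counts as a matched position.
    matched = sum(
        1
        for q, r in zip(aligned_query, aligned_reference)
        if q == r and q != "-"
    )
    return 100.0 * matched / len(aligned_query)


def is_substantially_identical(percentage, threshold=90.0):
    """Apply the 'at least 90%' criterion for substantial identity."""
    return percentage >= threshold


# Hypothetical 10-position window: one gap in the query and one mismatch.
identity = percent_identity("ACGT-ACGTA", "ACGTTACGAA")
```

Here 8 of the 10 window positions match, so the identity is 80%, which falls below the 90% criterion for substantial identity.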
Another indication that nucleotide sequences are substantially identical is if two molecules hybridize to the same sequence under stringent conditions. Stringent conditions are sequence dependent and will be different in different circumstances. Generally, stringent conditions are selected to be about 5° C. lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. The Tm is the temperature (under defined ionic strength and pH) at which 50% of the target sequence hybridizes to a perfectly matched probe.