1. Field of the Invention
The invention relates to a high-resolution, precise method for detecting genomic rearrangements in vitro using specially designed combinations of polynucleotide probes. The invention concerns accurate methods of detection and diagnosis of conditions, disorders and diseases associated with rearrangement of genomic DNA.
2. Description of the Related Art
The Multigenic Paradigm of Human Diseases
Advances in genetic analysis of human diseases have provided better insights into the molecular mechanisms contributing to disease initiation and progression. Earlier work linked particular diseases, through association and/or linkage disequilibrium, to single base mutations in somatic genetic sequences or to particular single nucleotide polymorphisms (“SNPs”) in genomic DNA. Newer technologies have provided evidence that larger genetic alterations and rearrangements are associated with, or can constitute major causes of, diseases, disorders or conditions having a genetic origin or basis. Disease associations have now moved from a monogenic to a multigenic paradigm in which a disease's origins and progression are mainly linked to more than one genetic mutation or origin. While these new insights provide better avenues for disease detection and treatment, they also highlight the need for combinatorial genetic analysis that goes beyond detection of single mutational events or SNPs by assessing disease associations with larger genomic rearrangements. Such combinatorial genetic analysis would provide a more precise and accurate diagnosis of a particular condition, disorder, disease or pathology, and would also help establish a more appropriate medical survey and more accurate therapeutic decisions and interventions, as well as help in assessing the efficacy of such therapies and interventions.
Multigenic Causes of Genetic Disease
Genetic disorders manifesting the same or similar clinical signs and consequences can arise from single and exclusive mutations, or from combined mutations, in various genes. Such mutations can fall within the class of single base alterations and/or the class of large genetic rearrangements. A few examples of such genetic disorders are Fragile X syndrome (mutations and expansions in the FMR1 gene), Ataxia Telangiectasia (single base pair mutations in either intronic or exonic sequences as well as deletions and translocations of the ATM gene), Seckel syndrome (mutations as well as large rearrangements in SCKL1, SCKL2, SCKL3, PCTN and ATR), autism (mutations as well as large rearrangements in GLO 1, MTF 1 and SLC11A3), Spinal Muscular Atrophy (mutations, deletions, transconversions as well as cis-duplications involving the SMN1 and SMN2 genes) and myotonic dystrophy (trinucleotide/tetranucleotide expansions in DM1 and DM2).
Multigenic Causes of Cancer Predisposition
In the case of cancer predisposition, there are several familial cancer predisposition syndromes for which several causative genes can be named and in which single base alterations and/or large rearrangements have been identified.
Breast and Ovary Cancer. Causative genes: BRCA1, BRCA2, ATM . . .
Mutation type: higher proportion of point mutations identified so far.
Hereditary nonpolyposis colorectal cancer (Lynch syndrome). Causative genes: MSH2, MLH1, MSH6, EPCAM . . .
Mutation type: equivalent proportion of point mutations has also been identified.
Multigenic Causes of Cancer Progression
Cancer progression is surely the human disease domain in which the monogenic causative hypothesis has been definitively ruled out for several years. First, the disease's initiation is strictly dependent on two molecular events (immortalizing and transforming) due to genetic alterations in at least two independent genes classified as either oncogenes or tumor suppressor genes. Second, the disease's progression is linked to additional genetic alterations independent of the causative ones. Not only do these additional alterations play a role in cancer progression, they have also been demonstrated to be the basis for the appearance of resistance to therapy during treatment. Strikingly, in the list of cancer-related genes, only extremely rare examples are subject solely to discrete single base mutations (e.g., KRas or BRaf); the large majority are subject either to large rearrangements only (e.g., HER2, ALK . . . ) or to both single base mutations and large rearrangements (p53, c-myc, c-Met, EGFR . . . ).
The identification and characterization of multigenic conditions, disorders and diseases, including cancer, cardiovascular disease, diabetes and other heritable genetic conditions, has been made difficult in part by the imprecision of existing methods of molecular diagnosis. Molecular Combing is probably the sole approach allowing detection of all types of large genetic rearrangements (deletions, amplifications, expansions, inversions, translocations . . . ) even in a complex and heterogeneous population (such as tumors).
High resolution barcodes allowing multiplex analysis of patients could help diagnosis at different levels, such as patient stratification/classification and/or prognosis.
Multiplex High Resolution Barcodes for Identifying the Right Genetic Alterations as a Key Driver for Therapeutic Intervention
The Example of Myotonic Dystrophy
Myotonic Dystrophy (DM1) and Myotonic Dystrophy 2 (DM2) are two muscular dystrophies characterized by trinucleotide/tetranucleotide expansions in two different genes. Although severe forms of DM1 can be clinically differentiated from DM2, milder forms of DM1 display clinical signs extremely similar to those of DM2. There is currently no cure for, or treatment specific to, myotonic dystrophy. However, DM1 patients exhibit complications of the disease (heart problems, cataracts . . . ) that do not occur in DM2 and that can be treated but not cured. Differentiating DM1 from DM2 by the use of a multiplex assay of high resolution barcodes could thus help in preventing and treating secondary effects.
The Example of Hereditary Breast and Ovary Cancer
In certain countries (U.S.), detection of constitutional alterations in BRCA1/2 leads to therapeutic intervention (surgery/reconstitution). Thus, there is a clear need for an accurate diagnostic covering all the potentially involved genes. Such a test could be based on a multiplex assay of high resolution barcodes covering large chromosomal regions around genes known to be involved in this syndrome: BRCA1, BRCA2, ATM, ATR . . .
DNA Damage and Response Inhibitors Example
Synthetic lethality has become a strong basis for the therapeutic decision to include cancer patients in specific protocols/regimens. One of the first examples was the demonstration that breast cancer patients with BRCA deficiency exhibit a higher sensitivity to PARP inhibitors, a new category of drug acting on the DNA Damage and Response pathway. More recently, this was extended to other types of inhibitors in this category, such as ATM inhibitors, but also to more traditional anti-cancer drugs including all types of DNA polymerase and replication inhibitors.
Not only has this concept been extended to other inhibitors, but it has also been demonstrated that it can be extended to other types of cancer such as lung cancer and metastatic melanoma.
Here, a multiplex high resolution barcode will allow detection of genetic alterations in genes involved in DNA damage and response, which could help predict sensitivity to this class of inhibitors. A list of such genes could include BRCA1, BRCA2, ATM, ATR, MSH2, MLH1, MSH6, EPCAM . . .
The Lung Cancer Example
Numerous alterations involved in lung cancer could be multiplexed for a better patient classification such as:
LOH/Deletion (P53, STK11, LKB1, BRG1, KLF6);
Amplification (FGFR1, MET, EGFR, HER2 . . . );
Translocation (ALK).
All these genetic alterations are associated with therapeutic treatments:
P53: Nutlin (low doses Actinomycin D produce similar effects)
FGFR1: Masitinib, PD173074, SU5402, TK1258, AZD4547 . . .
MET: GSK1363089, ARQ197, SGX523, XL184 . . .
EGFR: Tarceva, Erbitux, Vectibix . . .
HER2: Herceptin, Lapatinib . . .
ALK: Crizotinib
As at least 30% of NSCLCs have been demonstrated to be dependent on at least one of these mutations, defining the genetic profile of the tumor could help guide therapeutic options. This could be made possible by designing multiplex assays combining high resolution barcodes covering these major genetic loci.
Localization of (Genetic) Sequences of Interest
Genetic sequence is the most fundamental information needed to synthesize a functional protein. Alteration of a genetic sequence sometimes results in loss of functional protein synthesis. In addition to alteration of genetic sequence, loss or gain of genetic sequence (copy number variation, CNV) can also be problematic for the homeostasis of cellular activity. For example, loss of a (functional) anti-tumor protein (p53) or gain of a proto-oncogene (c-myc) results in a cancer-prone cell. When such a mutation occurs (or exists) in a germ cell, it is present in every cell of the resulting individual, who is then either a carrier of or a patient with a genetic disease, or has a predisposition to cancer. The germline mutation can be heritable. CNV is becoming more and more important to understand in the field of genetics (ref 1). However, copy number count alone is not always sufficient, and it is often critical to establish the actual location of sequence elements. This is strikingly the case for, e.g., balanced translocations. DNA sequencing and CNV detection methods such as array-based comparative genomic hybridization (aCGH) and quantitative PCR generally cannot detect these balanced mutations, because these methods assess only whether the sequence and the copy number are correct. FISH and its extended forms, such as fiber-FISH or molecular combing, can address these balanced mutations with resolutions and precisions that depend on the method.
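The point about balanced translocations can be illustrated with a toy sketch. It is not a model of any real assay: the chromosome names, the segment labels and the helper functions `copy_numbers` and `adjacencies` are hypothetical and chosen for illustration only. It shows that a reciprocal exchange of segments leaves every copy number unchanged while altering which segments sit next to each other.

```python
from collections import Counter

def copy_numbers(genome):
    """Count how many times each segment label occurs across all chromosomes."""
    return Counter(seg for chrom in genome.values() for seg in chrom)

def adjacencies(genome):
    """Set of neighbouring segment pairs, i.e. the positional information."""
    return {(c[i], c[i + 1]) for c in genome.values() for i in range(len(c) - 1)}

# Hypothetical two-chromosome genome, each chromosome an ordered list of segments.
normal = {"chrA": ["a1", "a2", "a3"], "chrB": ["b1", "b2", "b3"]}
# Reciprocal (balanced) exchange of the distal segments a3 and b3.
translocated = {"chrA": ["a1", "a2", "b3"], "chrB": ["b1", "b2", "a3"]}

# A copy-number-only readout sees no difference between the two genomes.
print(copy_numbers(normal) == copy_numbers(translocated))  # True
# A readout that preserves position (neighbour relationships) does.
print(adjacencies(normal) == adjacencies(translocated))    # False
```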
Resolution and Precision
BAC/PAC/cosmid probes on targeted regions have been used successfully to detect large (a few kb to tens of kb) genomic rearrangements (ref 2). In these approaches, the minimum size of detectable events (e.g., the size of the deleted or amplified sequence), hereafter designated as the “resolution” of such an assay, is limited by the large standard deviation involved in measuring probes or gaps of tens of kilobases. Indeed, in such assays the standard deviation of measurements increases with the length of the measured element. For example, a 40 kb probe is measured with a standard deviation of ~5 kb. Thus, if 16 measurements of a given probe are made on a slide, the precision on the size of the probe, obtained as the mean value of the measurements, is on the order of 2.5 kb (assuming that the distribution is Gaussian and taking the precision as the half-width of the confidence interval, i.e., 2·sd/√n, where sd = standard deviation and n = number of measurements). For a 10 kb probe, where the standard deviation is ~2 kb, the precision would be ~1 kb. This illustrates the fact that shorter probes allow for better (lower) resolution.
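The arithmetic in the preceding paragraph can be checked with a short sketch. It assumes, as the text does, a Gaussian distribution and takes the precision as 2·sd/√n; the function name `mean_precision_kb` is illustrative and not part of the described method.

```python
import math

def mean_precision_kb(sd_kb: float, n: int) -> float:
    """Half-width of the confidence interval on the mean, taken as 2*sd/sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive number of measurements")
    return 2.0 * sd_kb / math.sqrt(n)

# 40 kb probe, standard deviation ~5 kb, 16 measurements
print(mean_precision_kb(5.0, 16))  # 2.5
# 10 kb probe, standard deviation ~2 kb, 16 measurements
print(mean_precision_kb(2.0, 16))  # 1.0
```

Because the precision scales as 1/√n, halving it for a given probe requires four times as many measurements, whereas a shorter probe achieves a comparable gain through its smaller standard deviation.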
Besides, the location of such an event (the position of the extremities of the event) may be defined with a precision (hereafter the location precision) limited by the size of the probe or gap within which it occurs: e.g. if a 40 kb probe is estimated to measure 39 kb in a sample, one can conclude that a 1 kb deletion occurred somewhere within the probe, with no further precision—thus, somewhere in a 40 kb genomic region. If the same 1 kb deletion had occurred within a 10 kb probe, the location of that deletion would be known with a better precision, as the range would be reduced to a 10 kb genomic region. Therefore, the smaller the probes and gaps, the better the location precision.
There are limits to small probes: (i) below a certain size, they become difficult to detect; (ii) they involve more complex color schemes (as there are relatively more probes); (iii) more distinct probes are needed to cover a given region, and the experiments are therefore more expensive and time-consuming; (iv) most importantly, fast and reliable identification of probes, whether by a human operator or a piece of software, is easier with longer probes, as they are more readily distinguished from background. Indeed, background consists mainly of roughly circular fluorescent spots. When probes are long enough, the shape of these spots allows one to easily distinguish them from probes. However, when probes are small enough, they are difficult to distinguish from these spots.
In operating conditions according to the invention, probes shorter than ~3 kb are detected with a diminished efficiency. Within the 3-10 kb range, the standard deviation of measurements varies little, and there is therefore little benefit in resolution with the shorter probes within this range. Therefore, this range is usually considered to be a good compromise for probe size. However, in cases where probes are close enough (less than 10 kb gaps), smaller probes (within the 500-3000 bp range) are still useful, as they will be detected in at least a fraction of signals and the presence of the corresponding sequences may therefore be established with certainty. It was also found that detection of isolated probes longer than 12 kb (preferably longer than 14 kb) is more reliable, whether for a human operator or for automatic detection software.
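The size guidance in the preceding paragraph can be summarized as a small lookup. This is only a sketch: the function `probe_size_note`, its return strings and the handling of exact boundary values (3 kb and 10 kb are treated as inside the 3-10 kb range) are assumptions made for illustration, and lengths the text does not discuss are reported as such rather than guessed.

```python
from typing import Optional

def probe_size_note(length_bp: int, gap_to_neighbour_bp: Optional[int] = None) -> str:
    """Map a probe length to the qualitative guidance stated in the text."""
    if length_bp < 3000:
        # Short probes remain useful when a neighbouring probe is < 10 kb away.
        close = gap_to_neighbour_bp is not None and gap_to_neighbour_bp < 10_000
        if length_bp >= 500 and close:
            return "diminished efficiency, still useful near other probes"
        return "diminished efficiency"
    if length_bp <= 10_000:
        return "good compromise"
    if length_bp > 14_000:
        return "preferred for isolated probes"
    if length_bp > 12_000:
        return "reliable for isolated probes"
    return "not characterized in the text"  # 10-12 kb is not discussed

print(probe_size_note(1500, gap_to_neighbour_bp=6000))
print(probe_size_note(5000))
print(probe_size_note(15_000))
```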
Exclusion of Repeats
Eukaryotic genomic DNA contains various repetitive sequences, i.e., sequences that appear more than once (and more often than statistically predicted based on their length and base content) in a normal haploid genome. Among these, some appear with very high frequency (tens of thousands to millions of copies). In human genomic DNA, the most abundant of these is the Alu family, which has ~1,000,000 copies constituting ~10% of the genome. In any hybridization procedure involving human genomic DNA, it is expected that probes carrying such repeats would hybridize to numerous targets, generating non-specific signal from regions throughout the genome. Other types of repetitive sequences exist, with lower frequency and often more specific localization. The number of copies and the repeat sequence length may vary widely, as may the degree of homology. Beta-satellite sequences, for example, are present in multiple copies (hundreds to thousands), usually as tandem repeat arrays comprising hundreds of copies of the same 50-100 bp long sequence, specifically localized in a limited number of loci. Strategies to get rid of the non-specific signals depend on the type of procedure and probe. Schematically, when probes are very short sequences of DNA (oligonucleotides, typically less than 100 bp), as in aCGH procedures, the sequence of the oligonucleotides is chosen to be free of repetitive sequences, by comparison with repetitive sequences found in databases. This strategy is only practical for very short probes, as short sequences free of repetitive sequences are relatively abundant; it is impractical for longer probes, as long stretches completely devoid of repetitive elements are rare (although it has been adapted to longer FISH probes, in an approach that suffers from multiple drawbacks, see below).
Besides, even for short probes, it constrains the design of probes heavily and some genomic regions, rich in repetitive sequences, have lower density of coverage (and thus lower resolution of events) due to this constraint.
When probes are longer (typically PCR products or cloned DNA inserts, 1 to 150 kb), as in Southern blot or FISH procedures, non-labeled competitor DNA enriched in repetitive elements such as Alu repeats (usually Cot-1 DNA) is added in large excess along with the labeled probe. Competition from the unlabeled DNA on the repetitive sequences minimizes hybridization of the labeled probes to those sequences. This strategy is expensive and, since the competitor DNA is not made purely of repetitive sequences, competition also occurs on the unique sequences for which the probes were designed, which limits the amount of competitor DNA that may be used. Therefore, the efficiency of this approach is limited.
An alternative approach for longer probes has been proposed by Knoll and collaborators (U.S. Pat. No. 7,014,997), resembling the strategy usually adopted for oligonucleotides: probes are chosen within sequence intervals devoid of repetitive elements. This strategy is based on bioinformatics analysis of the regions of interest and exclusion of known repetitive sequences by comparison with sequence databases. However, this approach has several limitations. Prior knowledge of the repetitive sequences is required, which can be a problem, e.g., in species where such knowledge is unavailable. More importantly, intervals longer than 2 kb devoid of repetitive sequences appear only once in 20-30 kb on average and are unevenly distributed, so the design of probes would be highly constrained, impairing the possibility of designing a high-resolution code. This would prove especially difficult in repeat-rich regions and/or regions where pseudogenes are located next to homologous genes of interest, since such low-copy repetitive sequences are also excluded by the strategy of Knoll and collaborators (ref 3). Since regions targeted in rearrangement tests, e.g., for diagnostic purposes, often display these features, this approach is not suitable for the design of high-resolution barcodes, especially if such a code is to be used for diagnostic purposes. Distinctions between this approach and the invention are disclosed in more detail below.
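The constraint imposed by a repeat-exclusion strategy can be made concrete with a minimal sketch. It is not the procedure of the cited patent: the function `repeat_free_intervals`, the half-open coordinate convention and the example coordinates are all hypothetical. Given annotated repeat positions in a region, it lists the intervals of at least 2 kb that contain no repeat, which are the only places such a strategy could place a probe.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), coordinates in bp

def repeat_free_intervals(region: Interval,
                          repeats: List[Interval],
                          min_len: int = 2000) -> List[Interval]:
    """Return the sub-intervals of `region` free of repeats and >= min_len long."""
    start, end = region
    free: List[Interval] = []
    cursor = start  # left edge of the current repeat-free stretch
    for r_start, r_end in sorted(repeats):
        # Clip each repeat to the region; skip repeats lying entirely outside it.
        r_start, r_end = max(r_start, start), min(r_end, end)
        if r_end <= r_start:
            continue
        if r_start - cursor >= min_len:
            free.append((cursor, r_start))
        cursor = max(cursor, r_end)  # also handles overlapping repeats
    if end - cursor >= min_len:
        free.append((cursor, end))
    return free

# Made-up repeat annotation for a 30 kb region.
repeats = [(1500, 1800), (4000, 4300), (9000, 15000), (15500, 29000)]
print(repeat_free_intervals((0, 30000), repeats))
# [(1800, 4000), (4300, 9000)]
```

In this invented example only about 6.9 kb of the 30 kb region remains available for probes, and none of it lies in the second half of the region, which illustrates how uneven coverage arises under such a constraint.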