The present invention, in some embodiments thereof, relates to molecular biology and, more particularly, but not exclusively, to simultaneous copy number quantification and DNA methylation detection.
Nearly half of the human genome is composed of DNA repeats: homologous DNA fragments, exhibiting identity (or great similarity) to each other, that lie along varying stretches of the genome. The length of a single repeat unit may vary between one and millions of bases, and repetitive genomic regions may consist of two to millions of copies. Repeats are divided into two types: tandem repeats, which are composed of concatenated units forming a continuous repetitive region, and interspersed repeats, which are scattered along the entire genome. Many repetitive regions have been shown to be involved in gene regulation, genome stability and disease. Examples include telomeres and the CAG repeats related to Huntington's disease, both of which are tandem repeats with direct functional impact on cellular activity and fate. Large populations of repeats are mobile elements, which can transpose across the genome and undergo homologous recombination events, promoting dynamic genomic transformations. Thus, it is believed that repetitive elements play a major role in human evolution, allowing the exponential speed that characterizes the development of our species. The dynamic nature of repetitive elements also leads to variations both among different individuals and between different cells of the same individual.
A second dimension of dynamics is added to repeat arrays by their methylation status. CpG methylation is a major epigenetic modification responsible for genetic regulation and often associated with the promoter region of genes. It has been shown that methylation levels of DNA repeats are correlated with various types of cancer and with their severity. Consequently, a given repeat array may vary not only in the number of units but also in the methylation status of each of its units.
Despite their significance, repetitive elements are almost inaccessible to Next Generation Sequencing (NGS) methods. This is because current NGS techniques employ random fragmentation of genomic material and only sequence relatively short “reads” that do not span the full repetitive array. Assembly of these short reads into a long contiguous sequence is fundamentally limited because assembly algorithms cannot distinguish between the identical sequence copies that comprise the investigated region. Hence, despite the emergence of new sequencing technologies and sample preparation techniques, the study of repetitive regions is still limited by the typical read length (~50-500 bp) of current gold-standard NGS platforms. With this in mind, it is possible that some of the many sequence duplications that have been discovered are in fact longer repetitive elements.
There are several alternative methods for copy number analysis, including Southern blotting, quantitative PCR and comparative genomic hybridization assays. These methods offer indirect, relative information, making them prone to bias, and typically suffer from resolution problems in size estimation. In addition, similar to NGS, they are based on “bulk” measurements, resulting in averaged information, which often masks intercellular variability. While these methods offer a partial solution for quantification of repeats, studying the methylation status of repeat arrays remains a major stumbling block. Bisulfite sequencing, the “gold-standard” method for CpG methylation studies, is NGS-based and thus limited in the information it can provide on repetitive sequences. These limitations become critical if a repeat unit is longer than the sequencing read. In such cases, only parts of the first and last repeat units can be addressed for methylation studies. As a result, there is currently no method capable of analyzing methylation patterns of repeat arrays.
One of the most striking and well-studied examples of a pathogenic repeat region is the D4Z4 macrosatellite array. D4Z4 is a 3.3-kbp long repeat unit organized in an array at the subtelomeric region of chromosome 4q. This array is directly linked to facioscapulohumeral muscular dystrophy (FSHD), the third most common form of muscular dystrophy, with a world-wide prevalence of 1 in 20,000 (1 in 7,500 in Europe). This disease is characterized by progressive weakening of selective facial and limb-girdle muscle groups. The autosomal dominant genotype of FSHD is a contraction in the D4Z4 array; while healthy individuals carry more than 10 copies of this repeat, carriers possess fewer than 10 repeats (FIG. 1A). This reduction is believed to have an epigenetic effect, altering the repression of the DUX4 gene and leading to expression of this typically silenced transcription factor in patients' muscle cells. Recent studies show that some patients possess a normal number of D4Z4 repeat units but still display FSHD symptoms. The genotype for this variant of the disease, which is termed FSHD2, shows decreased DNA methylation of the typically hypermethylated D4Z4 repeat array (FIG. 1A). Moreover, methylation seems to play a significant role in the activation of the disease, as individuals who possess a genetically pathogenic allele, containing fewer than 10 D4Z4 repeats, do not manifest FSHD symptoms if the repeats are hypermethylated (FIG. 1A). Studies on FSHD patients and their families show that a high percentage (~30%) of patients do not inherit the disease, but rather develop de novo mutations, perhaps due to recombination events between the chromosome 4 repeat array and a very similar (non-pathogenic) array that lies on chromosome 10. Many of these patients exhibit somatic mosaicism, meaning that they have different numbers of repeats in different cells, which further complicates diagnosis by currently available techniques.
Optical mapping of DNA stands out as an attractive approach for studying large genomic arrangements such as repeat arrays. The approach consists of a set of techniques for stretching long genomic fragments, followed by imaging of these fragments using fluorescence microscopy. Image processing is then used to read out a low-resolution physical barcode along the molecules that provides genetic information such as the size and number of large repeat units. Furthermore, the technique offers access to additional genomic information such as DNA damage lesions and epigenetic marks. These attributes have driven the commercialization of several optical mapping approaches, and facilitated the development of a clinical genetic test for FSHD [Nguyen et al., Ann. Neurol. 2011, 70, 627-633]. Recently, whole genome optical mapping has been performed by electrokinetically stretching barcoded DNA molecules in highly parallel nanochannel arrays. This technique, commercialized by BionanoGenomics Inc., is capable of automated copy number analysis on a genome scale [Cao, H. Gigascience 2014, 3, 34; Huichalaf, C. et al., PLoS One 2014, 9, e115278; Mostovoy, Y Nat. Methods 2016, 12-17]. However, commercial protocols do not support detection of the D4Z4 macrosatellite repeat array.
Additional background art includes Kunkel, F.; Lurz, R.; Weinhold, E. A Molecules 2015, 20, 20805-20822; Grunwald, A.; et al. Nucleic Acids Res. 2015, 43, e117 and Nelson, M et al., Nucleic Acids Res. 1991, 19, 2045-2071.