DNA methylation is an epigenetic modification that can occur at the C5 position of cytosine residues. In mammals, 5-methylcytosine can appear in the CpG dinucleotide context (Ramsahoye et al., Proc Natl Acad Sci USA 97:5237-5242, 2000). Recent data suggest that approximately 25% of all cytosine methylation identified in stem cells can occur in a non-CpG context (see Ziller et al., PLoS Genet. 7(12):e1002389, 2011). Although CpG dinucleotides can be underrepresented in the genome, stretches of sequence rich in CpG dinucleotides, known as CpG islands, can exist. These CpG islands can be associated with promoter regions and can span several hundred nucleotides or more.
Epigenomics, i.e., the study of the complete set of epigenetic modifications on the genetic material of a cell, has revealed that epigenetic modifications (e.g., DNA methylation) can play a role in mammalian development and disease. For example, DNA methylation is implicated in embryonic development, genomic imprinting and X-chromosome inactivation through regulation of transcriptional activity, chromatin structure and chromatin stability (Robertson, Nat Rev Genet 6:597-610, 2005). Increased DNA methylation (hypermethylation) at promoter regions of genes can be associated with transcriptional silencing, whereas decreased methylation (hypomethylation) at promoter regions can be associated with increased gene activity. Aberrant methylation patterns can be associated with various human pathologies, including tumor formation and progression (Feinberg and Vogelstein, Nature 301:89-92, 1983; Esteller, Nat Rev Genet 8:286-298, 2007; and Jones and Baylin, Cell 128:683-692, 2007). Therefore, analysis of DNA methylation status across the human genome can be of interest.
Methods for measuring DNA methylation at specific genomic loci include, for example, immunoprecipitation of methylated DNA, methyl-binding protein enrichment of methylated fragments, digestion with methylation-sensitive restriction enzymes, and bisulfite conversion followed by Sanger sequencing (reviewed in Laird, Nat Rev Genet 11:191-203, 2010). Bisulfite treatment can convert unmethylated cytosine residues into uracils, which can be read as thymines after amplification with a polymerase. Methylcytosines can be protected from this conversion and can therefore still be read as cytosines. Following bisulfite treatment, the methylation status of a given cytosine residue can be inferred by comparing the converted sequence to an unmodified reference sequence: a cytosine that is retained indicates methylation, whereas a thymine at a reference cytosine position indicates an unmethylated residue.
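The inference step described above can be illustrated with a minimal sketch. It assumes a single strand, complete conversion, and an ungapped read of the same length as the reference; the function names `bisulfite_convert` and `call_methylation` and the toy sequence are hypothetical and are not taken from any cited method.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite conversion followed by amplification on one strand.

    Unmethylated C is read as T. A C whose 0-based index is listed in
    methylated_positions is protected and stays C. Other bases are unchanged.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, converted_read):
    """Compare a converted read with the unmodified reference.

    Returns {position: bool} for each reference cytosine:
    True if the read still shows C (protected, inferred methylated),
    False if the read shows T (converted, inferred unmethylated).
    """
    if len(reference) != len(converted_read):
        raise ValueError("sketch assumes an ungapped, same-length alignment")
    calls = {}
    pairs = zip(reference.upper(), converted_read.upper())
    for i, (ref_base, read_base) in enumerate(pairs):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True
        elif read_base == "T":
            calls[i] = False
        # Any other base is treated as a mismatch and left uncalled.
    return calls


# Toy example: cytosines at indices 1 and 5 are assumed to be methylated.
reference = "ACGTCCGGATCG"
read = bisulfite_convert(reference, methylated_positions={1, 5})
print(read)
print(call_methylation(reference, read))
```

A practical pipeline would also have to handle the complementary strand (where conversion appears as G to A), incomplete conversion, and sequencing errors, none of which this sketch attempts.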
Techniques have been developed for profiling the methylation status of the whole genome, i.e., the methylome, at single-base resolution using high-throughput sequencing technologies. Bisulfite conversion of genomic DNA combined with next-generation sequencing (NGS), or BS-seq, is one strategy. Because of the high cost still associated with genome-wide methylation sequencing, variations of BS-seq technology that enable genome partitioning to enrich for regions of interest can be used. One such variation is reduced representation BS-seq (RRBS), which can involve digestion of a DNA sample with a methylation-insensitive restriction endonuclease that has a CpG dinucleotide as part of its recognition site, followed by bisulfite sequencing of the selected fragments (Meissner et al., Nucleic Acids Res. 33(18):5868-5877, 2005). RRBS provides less coverage than whole genome bisulfite sequencing, but allows the researcher to obtain quantitative DNA methylation information across many features with approximately 50-fold fewer sequencing reads, resulting in a substantial reduction in sequencing cost. Since the RRBS technique was first described (Meissner, et al. (2005) Nucleic Acids Res 33(18):5868), there have been many enhancements and improvements (see, for example, Boyle, et al. (2012) Genome Biol 13:R92). The current approach utilizes the methylation-insensitive restriction enzyme MspI, which recognizes CCGG. Because of partial fragmentation during bisulfite conversion, PCR, and the efficiency of cluster generation, only a subset of the resulting fragments, typically those under 300 bp in length, can be sequenced. However, these smaller fragments are derived from genomic DNA that has a high frequency of MspI sites and therefore a high frequency of potential CpG methylation sites. This is how RRBS can achieve reduced representation. One feature of RRBS can be that all forward reads start with either TGG or CGG.
This feature can create a lack of color balance that can be problematic for some NGS sequencers. For example, because bases 2 and 3 are G in every forward read, the complete lack of color balance at these two positions can make these samples very challenging for both Illumina MiSeq and HiSeq instruments. The methods, compositions and kits provided herein can allow RRBS libraries to be generated easily and sequenced more efficiently by increasing the diversity of nucleotide bases sequenced during the initial cycles of sequencing. The increased diversity can increase the efficiency and accuracy of cluster identification by sequencers during certain types of next-generation sequencing applications (e.g., Illumina-based sequencing by synthesis).
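The origin of the low base diversity can be illustrated with a small in silico sketch. It assumes a complete MspI digest (cut between the two cytosines of CCGG), considers only the top strand, and ignores end repair and adapter ligation. The function names, the toy sequence, and the size-selection bounds are hypothetical and serve only as an illustration.

```python
import re
from collections import Counter

MSPI_SITE = "CCGG"  # MspI cuts C^CGG whether or not the CpG is methylated
CUT_OFFSET = 1      # the cut falls after the first C of the site


def mspi_fragments(genome):
    """Return top-strand fragments from a complete in silico MspI digest."""
    genome = genome.upper()
    cuts = [m.start() + CUT_OFFSET for m in re.finditer(MSPI_SITE, genome)]
    bounds = [0] + cuts + [len(genome)]
    return [genome[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_len=40, max_len=300):
    """Keep MspI-to-MspI fragments inside an assumed sequenceable size window."""
    internal = fragments[1:-1]  # drop the two pieces that touch the sequence ends
    return [f for f in internal if min_len <= len(f) <= max_len]


def read_start(fragment, first_c_methylated):
    """First three bases of a forward read after bisulfite conversion."""
    if not fragment.startswith("CGG"):
        raise ValueError("expected a fragment that begins at an MspI cut")
    # Only the leading C can change: methylated stays C, unmethylated reads as T.
    return ("C" if first_c_methylated else "T") + "GG"


def base_composition(reads, cycles=3):
    """Fraction of A, C, G and T at each of the first `cycles` positions."""
    comp = []
    for i in range(cycles):
        counts = Counter(r[i] for r in reads)
        comp.append({b: counts[b] / len(reads) for b in "ACGT"})
    return comp


# Toy sequence with three MspI sites; the tiny size window is for illustration only.
genome = "AAACCGGTTTACCGGAATTCCGGAAA"
selected = size_select(mspi_fragments(genome), min_len=1, max_len=300)
reads = [read_start(selected[0], True), read_start(selected[1], False)]
for cycle, fractions in enumerate(base_composition(reads), start=1):
    print(cycle, fractions)
```

In this sketch, cycle 1 is split between C and T according to methylation status, while cycles 2 and 3 contain only G, which is the color-balance problem described above.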