The ability to phase modified nucleotides (e.g., methylated or hydroxymethylated nucleotides) in a genome, i.e., to determine whether two or more modified nucleotides are linked on the same DNA molecule or lie on different DNA molecules, can provide important information in epigenetic studies, particularly studies of imprinting, gene regulation, and cancer. In addition, it would be useful to know which modified nucleotides are linked to sequence variations.
Modified nucleotides cannot be phased using conventional methods for investigating DNA modification because such methods typically involve bisulfite sequencing (BS-seq). In BS-seq methods, a DNA sample is treated with sodium bisulfite, which converts unmethylated cytosine (C) to uracil (U) while leaving 5-methylcytosine (5mC) unchanged. When bisulfite-treated DNA is sequenced, unmethylated C is read as thymine (T), and 5mC is read as C, yielding single-nucleotide resolution information about the methylation status of a segment of DNA. However, sodium bisulfite is known to fragment DNA (see, e.g., Ehrich M 2007 Nucl. Acids Res. 35:e29), making it impossible to determine whether modified nucleotides are linked on the same DNA molecule over a long distance. Specifically, nucleotide modifications cannot be phased in the same way that sequence variants (e.g., polymorphisms) are phased, because variant-phasing methods require intact, long molecules.
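The read-out logic of BS-seq described above can be illustrated with a minimal in silico sketch. This is a toy model for illustration only, not a description of any actual analysis pipeline: the function names `bisulfite_read` and `call_methylation`, the example sequence, and the methylated position are all hypothetical, and the model assumes complete conversion, a single strand, and no sequencing errors.

```python
def bisulfite_read(seq, methylated_positions):
    """Return the sequence as it would be read after bisulfite treatment.

    Assumes complete conversion: every unmethylated C becomes U and is
    read as T, while a C at a methylated position (5mC) is read as C.
    """
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            out.append("T")  # unmethylated C -> U -> read as T
        else:
            out.append(base)  # 5mC and all non-C bases are unchanged
    return "".join(out)


def call_methylation(reference, read):
    """Infer methylation status at each reference C from an aligned read.

    Returns {position: True} where the read still shows C (methylated)
    and {position: False} where the read shows T (unmethylated).
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True
        elif read_base == "T":
            calls[i] = False
    return calls


# Hypothetical 8-base reference with a single 5mC at index 1.
reference = "ACGTCCGA"
read = bisulfite_read(reference, methylated_positions=[1])
print(read)                               # ACGTTTGA
print(call_methylation(reference, read))  # {1: True, 4: False, 5: False}
```

The sketch gives per-base methylation calls for one read. Because each call is tied to the read that carries it, calls can only be linked to one another when they fall on the same sequenced fragment, which is why fragmentation of the DNA limits phasing over long distances.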
Moreover, bisulfite sequencing displays a bias toward cytosine (C) adjacent to certain nucleotides and not others. It would be desirable to remove the observed bias.