Most genomic studies to date ignore the diploid nature of the human genome. However, the context in which variation occurs on each individual chromosome can have a significant impact on gene regulation and may have strong clinical significance. Applications that can greatly benefit from phased genomes include medical genetics (e.g., detecting compound heterozygosity), non-invasive fetal genome sequencing, population genetics, cancer genetics, and HLA (Human Leukocyte Antigen) typing and matching. Thus, there is a strong need for cost-effective methods that support accurate and comprehensive haplotype-resolved sequencing of human genomes.
There are two general approaches for genome-wide haplotyping: computational phasing and experimental phasing. Computational approaches generally pool information across multiple individuals, preferably relatives, by using existing pedigree or population-level data. Depending on the quality of the reference genomes used, these methods cannot necessarily deliver phasing information across the whole genome. Because computational phasing is contingent upon multiple parameters, including sample size, density of genetic markers, degree of relatedness, sample ethnicity, and allele frequency, its performance for genome-wide phasing will inevitably be limited. Rare and de novo variants, which are medically relevant but not observed at appreciable frequencies at the population level, are not phased accurately by computational methods.
Most experimental approaches to genome-wide haplotype-resolved sequencing employ sub-haploid complexity reduction, thereby providing a direct and hypothesis-free approach to genome-wide phasing. In vitro implementations of complexity reduction separate the parental copies into compartments through sub-haploid dilution, amplify the individual copies using random primer amplification, and then derive haplotypes by inferring and genotyping the haploid molecules present in each compartment. However, these methods suffer from several limitations. First, random primer amplification-based methods generate false variants through chimeric sequence formation, can result in a biased representation of the genome with allelic drop-out in the diploid context, and can underrepresent GC-rich sequences. In part as a consequence, very deep sequencing, i.e., 200-500 Gb, is required to obtain phasing information with N50 block sizes in the range of 700 kb to 1 Mb (where N50 is defined as the phased block length such that blocks of equal or greater length cover half the bases of the total phased portion of the genome). Second, the requirement of diluting to sub-haploid content, and thus starting with minute amounts of DNA, may compromise the reproducibility, accuracy, and uniformity of amplification. The complexity of this step scales linearly with the number of compartments (usually between 96 and 384), each of which represents an individual library preparation from a picogram-scale starting amount. Cloning-based approaches allow working with reasonable amounts of DNA, but they require high-efficiency cloning, which is time-consuming and technically challenging, and they are limited by the insert size of the cloning platform (fosmids/BACs). Finally, some methods require upfront size selection of genomic DNA prior to sub-haploid complexity reduction.
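To make the N50 definition given above concrete, the following is a minimal Python sketch that computes the N50 of a set of phased blocks. The function name phased_n50 and the example block lengths are illustrative assumptions and are not taken from any particular phasing method or data set.

```python
def phased_n50(block_lengths):
    """Return the N50 of a collection of phased block lengths (in bases).

    N50 is the block length L such that blocks of length >= L together
    cover at least half of the total phased bases.
    """
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one phased block is required")

    half_total = sum(lengths) / 2
    covered = 0
    # Walk from the longest block downwards until half of the phased
    # bases are covered; the block that crosses the threshold is the N50.
    for length in lengths:
        covered += length
        if covered >= half_total:
            return length


# Hypothetical block lengths in bases (illustrative values only).
example_blocks = [1_000_000, 700_000, 300_000, 200_000, 100_000]
print(phased_n50(example_blocks))
```

Because N50 is computed over the phased portion of the genome only, two methods can report similar N50 values while phasing very different fractions of the genome, so the total phased length is worth reporting alongside it.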
Since the reconstruction of long haplotypes is challenging, any limits on the length of input DNA molecules will fundamentally constrain the length of the resulting haplotypes. Alternative approaches for obtaining long-range phasing information include long-read technologies, but these currently suffer from low accuracy and throughput. Thus, despite advances in phasing methods, major practical obstacles remain to their integration with routine human genome sequencing.