Short-read next-generation sequencing (NGS) analysis has limitations in both research and diagnostics. One key drawback is the problem of phasing: when interrogating multiple loci of sequence variation, it is often impossible to determine which loci are co-located on the same chromosome or on the same chromosomal fragment. One example of a phasing problem occurs in diploid organisms, which inherit two parental chromosomes, one from the mother and one from the father, resulting in two copies of each gene (except for the genes carried on the sex chromosomes). Each of the two copies of a gene in a diploid cell contains regions of sequence variation, or loci, whose sequences fall into distinct types known as alleles. Thus, allelic variation across different loci might exist within a single chromosome (maternal or paternal) of a chromosome pair, or across both chromosomes of a chromosome pair. Determining which loci or regions of sequence variation are co-located on the same (maternal or paternal) chromosome is useful for a variety of reasons, as discussed further below.
The pattern of alleles within each individual chromosome is referred to as a haplotype. Haplotyping has many diagnostic and clinical applications. For example, two inactivating mutations at different loci within a single gene might be of little or no consequence if both are present on the same individual chromosome (i.e., a chromosome of either maternal or paternal origin), because the other copy of the gene will still yield a functional product. On the other hand, if one of the inactivating mutations is present on the maternal chromosome and the other on the paternal chromosome, neither copy of the gene yields a functional product, resulting in a negative phenotype (such as non-viability or increased risk for disease). Haplotyping is also used to predict risk of or susceptibility to specific genetic diseases, as many genetic associations are tied to haplotypes. For example, the various haplotypes of the human leukocyte antigen (HLA) system are associated with genetic diseases ranging from autoimmune disease to cancers.
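The two-mutation example above can be illustrated with a minimal sketch. It assumes a simplified model in which a gene copy is functional only if it carries no inactivating mutation; the function name and locus labels are hypothetical and chosen purely for illustration.

```python
def functional_copies(maternal_mutations, paternal_mutations):
    """Count gene copies that carry no inactivating mutation.

    Each argument is the set of inactivating mutations found on one
    parental copy of the gene (a simplified, hypothetical model).
    """
    copies = (maternal_mutations, paternal_mutations)
    return sum(1 for mutations in copies if not mutations)


# Both mutations on the maternal copy: the paternal copy stays functional.
same_chromosome = functional_copies({"locus_A", "locus_B"}, set())

# One mutation on each copy: no functional copy remains.
different_chromosomes = functional_copies({"locus_A"}, {"locus_B"})

print(same_chromosome, different_chromosomes)
```

The unphased observation is the same in both cases (mutations at locus_A and locus_B are each detected once); only the haplotype assignment distinguishes the two outcomes.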
Another instance in which phasing information is useful is distinguishing between functional genes and their non-functional pseudogene counterparts within the genome. One well-known functional gene/pseudogene pair is SMN1 and SMN2, which differ in sequence by only five nucleotides over many kb of sequence, yet one of the nucleotide differences renders the SMN2 gene almost completely non-functional. Using short-read sequencing, a mutation may be found in one of the two genes, but unless the mutation happens to occur within a sequencing read that also covers one of the known nucleotide differences between SMN1 and SMN2, it will be impossible to know which of the genes (the functional gene or the non-functional pseudogene) is mutated.
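The read-coverage condition described above can be sketched as follows. This is an illustrative model only: the coordinates, the read length, and the function name are invented for the example and are not actual SMN1/SMN2 positions.

```python
def read_can_assign(read_start, read_end, mutation_pos, distinguishing_positions):
    """Return True if a single read spanning [read_start, read_end]
    covers the mutation AND at least one gene-distinguishing position."""

    def covers(position):
        return read_start <= position <= read_end

    return covers(mutation_pos) and any(
        covers(p) for p in distinguishing_positions
    )


# Hypothetical positions of five distinguishing nucleotides spread over many kb.
DISTINGUISHING = [1000, 9000, 15000, 20000, 27000]

# A 150-base read around a mutation at 5000 covers no distinguishing position,
# so the mutation cannot be attributed to either gene.
far_from_marker = read_can_assign(4950, 5099, 5000, DISTINGUISHING)

# A 150-base read around a mutation at 1050 also covers position 1000,
# so the mutation can be attributed to one gene.
near_marker = read_can_assign(990, 1139, 1050, DISTINGUISHING)

print(far_from_marker, near_marker)
```

Under these assumed numbers, only mutations lying within one read length of a distinguishing position can be assigned, which leaves most of the gene unresolved.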
Present NGS methods employ short-read sequencing to query regions of variable DNA sequence (polymorphisms, etc.) interspersed within regions of conserved DNA sequence. Because significant blocks of conserved sequence typically separate the variable regions, a single short read rarely spans more than one variable region, so short-read sequencing does not lend itself to phasing analysis. Although methods have been developed to obtain phasing information, these methods (for example, Sanger sequencing and subcloning) are typically labor intensive and/or costly.
There is a need for improved NGS methods that provide phasing information. Such methods would ideally provide a highly parallel platform for performing multiple sequencing reactions from the same immobilized templates. The invention described herein fulfills this need.