Methods to sequence or identify significant segments of the human genome, and genetic variations within those segments, are becoming commonplace. However, a major impediment to understanding the health implications of genomic variation is the difficulty of correlating genomic differences with their consequences for human health. Whole genome sequencing is an important first step toward elucidating the genomic underpinnings of human health. Once sequenced, genomic DNA must be assembled or aligned to a reference sequence. A generally accepted protocol for genome assembly uses fosmid and BAC libraries, in which long pieces of DNA are introduced into bacterial cells, sequenced independently, and reassembled. Such a process is expensive, laborious, and time consuming (e.g., taking a few weeks to months).
Recent advances in sequencing throughput and library preparation have allowed mammalian-sized genomes to be sequenced in a matter of days. Current sequencing technologies allow the generation of enormous amounts of sequence using short sequence reads (i.e., lengths of about 100 bp to about 200 bp). Those technologies provide up to 30 gigabases (Gb) of sequence per lane, which is equivalent to 10× coverage of the human genome.
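The coverage figure above follows from simple arithmetic, sketched below in Python. The genome size of roughly 3 Gb is an assumption implied by the stated numbers (30 Gb per lane at 10× coverage), and the function names `fold_coverage` and `reads_required` are hypothetical, introduced only for illustration.

```python
# Illustrative sketch only: HUMAN_GENOME_BP is an assumed round figure
# (about 3 Gb), consistent with 30 Gb of sequence giving 10x coverage.
HUMAN_GENOME_BP = 3_000_000_000


def fold_coverage(total_bases: int, genome_size: int = HUMAN_GENOME_BP) -> float:
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


def reads_required(coverage: float, read_length: int,
                   genome_size: int = HUMAN_GENOME_BP) -> float:
    """Number of reads of a given length needed to reach a target coverage."""
    return coverage * genome_size / read_length


lane_output_bp = 30_000_000_000  # 30 Gb of sequence per lane
print(fold_coverage(lane_output_bp))          # coverage from one lane
print(reads_required(10, read_length=100))    # reads needed at 100 bp each
```

At 100 bp per read, 10× coverage corresponds to about 300 million reads under these assumptions, which illustrates how many short fragments an assembler must reconcile.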
However, application of those technologies to de novo genome assembly is limited by short sequence read length, which is insufficient to resolve complex genome structure and to produce a consistent genome assembly. Further, short sequence reads cannot be used to obtain phasing data (i.e., which variants are on the same chromosome). Additionally, assembly from short reads requires construction of a de Bruijn graph, a computationally intensive process requiring supercomputers with large amounts of RAM. That requirement limits application to large sequencing centers with access to supercomputers. Thus, it is difficult and expensive to use short sequence reads to obtain quality de novo reference genome assemblies.
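To make the de Bruijn graph step concrete, the following is a minimal Python sketch of how such a graph can be built from short reads. It is a toy illustration under stated assumptions, not the method of any particular assembler: the function name `build_de_bruijn`, the choice of k, and the example reads are all hypothetical, and real assemblers add error correction, reverse-complement handling, and compact data structures.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph from reads.

    Each k-mer in a read becomes a directed edge from its (k-1)-base
    prefix to its (k-1)-base suffix. Nodes are the (k-1)-mers.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical overlapping reads drawn from the sequence "ACGTCA".
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because every distinct k-mer in the read set is held in memory at once, the graph for a mammalian-sized genome can grow very large, which is consistent with the memory demands described above.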