The goal of many sequencing projects is to determine, for the first time, the entire genome sequence of a target organism (de novo draft genome sequencing). Having a draft genome sequence at hand enables the identification of useful genetic information about an organism, for instance the origin of genetic variation between species or between individuals of the same species. Hence, there is a general desire in the art for techniques that allow the de novo determination of the entire genome sequence of an individual, whether human, animal or plant, at reasonable cost and effort. This quest is typically referred to as the quest for the $1000 genome, i.e. determining the entire genome sequence of an individual for a maximum of $1000 (without considering currency fluctuations). In practice, however, the $1000 genome does not necessarily rely on a de novo genome sequencing and assembly strategy but may also be based on a re-sequencing approach. In the latter case, the re-sequenced genome is not assembled de novo; instead, its DNA sequence is compared to (mapped onto) an existing reference genome sequence for the organism of interest. A re-sequencing approach is therefore technically less challenging and less costly. For the sake of clarity, the focus of the current invention is on de novo genome sequencing strategies that can be applied to organisms for which a reference genome sequence is lacking.
Current efforts are varied and plentiful, and results are being achieved at a rapidly increasing rate. Nevertheless, the goal has not yet been achieved. It is still not economically feasible to sequence and assemble an entire genome in a straightforward fashion. There still exists a need in the art for improved de novo genome sequencing strategies. General requirements for such strategies are that they are cheaper, efficient in terms of the computational power needed to process data from sequence reads into an assembled draft genome, and efficient in terms of the use of high throughput sequencing equipment to generate data of sufficient quality, i.e. in terms of the redundancy with which sequences need to be determined to create sufficiently accurate data.
WO03/027311 describes a clone-array pooled shotgun sequencing method (CAPSS). The method employs random sequence reads from differently pooled (BAC) clones. Based on the cross-assembly of the random reads, a sequence contig can be generated from a plurality of clones, and a map of the clones relative to the sequence can be generated. The publication describes, in more detail, the generation of a BAC library in a multidimensional pool, for example a two-dimensional format in which each row pool and each column pool contains 148 BAC clones (148×148 format). Using CAPSS, BAC pools are sequenced to 4-5× coverage on average, which generates 8-10× coverage per BAC in the case of the two-dimensional pooling scheme, because each BAC is present in two pools. Contigs are built for each BAC separately from sequences that are unique to that BAC, as judged by their occurrence in a single row pool and a single column pool in the case of a two-dimensional pooling scheme. Subsequently, these BACs are assembled into a contig for the genome. The publication demonstrates the technology on only 5 BACs and leaves the problem of data processing untouched. One of the disadvantages of this technology is that the use of randomly sheared fragments requires an enormous number of reads to cover a genome at a sequence redundancy level of 8 to 10 fold, making this method very laborious on a larger scale. Furthermore, it does not yield a sequence-based physical BAC map.
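The two-dimensional pooling logic described above can be illustrated with a minimal sketch. It is not taken from WO03/027311; the function names (`assign_to_clone`, `per_clone_coverage`) and the pool labels are hypothetical, and the coverage calculation simply assumes that the per-pool coverage of a clone adds up over the pools in which that clone occurs.

```python
# Illustrative sketch only: a two-dimensional clone array in which every
# clone sits in exactly one row pool and one column pool.

def assign_to_clone(observed_pools):
    """Assign a sequence to a clone from the pools in which it was observed.

    observed_pools is a set of (kind, index) labels, where kind is "row"
    or "col". A sequence seen in exactly one row pool and one column pool
    is attributed to the clone at that array position. Any other pattern
    (for instance a repeat, or a region shared by overlapping clones) is
    treated as ambiguous and returns None.
    """
    rows = [index for kind, index in observed_pools if kind == "row"]
    cols = [index for kind, index in observed_pools if kind == "col"]
    if len(rows) == 1 and len(cols) == 1:
        return (rows[0], cols[0])
    return None


def per_clone_coverage(pool_coverage, dimensions=2):
    """Coverage per clone, assuming it accumulates over all its pools."""
    return pool_coverage * dimensions


ARRAY_SIZE = 148                       # 148x148 format from the text
N_CLONES = ARRAY_SIZE * ARRAY_SIZE     # clones in the array
N_POOLS = 2 * ARRAY_SIZE               # row pools plus column pools

if __name__ == "__main__":
    print(N_CLONES, "clones are covered by", N_POOLS, "pools")
    print(assign_to_clone({("row", 3), ("col", 7)}))
    print(per_clone_coverage(4), per_clone_coverage(5))
```

Under these assumptions, the number of pools to be sequenced grows with the array side length rather than with the number of clones, and 4-5× coverage per pool corresponds to 8-10× coverage per clone.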
US2007/0082358 describes a method for assembling sequence information based on a clonally isolated and amplified library of single-stranded genomic DNA, in which whole genome shotgun sequence information is combined with whole genome optical restriction mapping, using a restriction enzyme to create an ordered restriction map.
US2002/0182630 discloses a method for BAC contig mapping by comparison of subsequences. The method aims to avoid the difficulties associated with repetitive sequences and the generation of contigs by creating bridges across repeat-rich regions.
Physical maps based on BACs can be determined by sequencing BAC libraries (sequence-based physical mapping of BAC clones), for instance using the method described in WO2008/007951 from Keygene, also referred to as ‘whole genome profiling’ or WGP. In brief, WGP relates to the generation of a physical map of at least part of a genome and comprises the steps of generating an artificial chromosome library from a sample DNA, pooling the clones, digesting the pooled clones with restriction enzymes, ligating identifier-containing adapters, amplifying the identifier-containing adapter-ligated restriction fragments, correlating the amplicons to the clones and ordering the fragments to generate a contig, thereby creating a physical map.
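The final steps of such a sequence-based mapping approach, correlating sequenced fragments to clones and grouping clones that share fragments, can be sketched as follows. This is a simplified illustration and not the procedure of WO2008/007951: the tag names, the `min_shared` threshold and both function names are hypothetical, and the sketch only groups clones into contigs without ordering them within a contig.

```python
from itertools import combinations

# Illustrative sketch only: each clone is represented by the set of
# sequence tags (restriction fragment ends) attributed to it.

def shared_tags(clone_tags):
    """Count the tags shared by every pair of clones with any overlap."""
    overlaps = {}
    for a, b in combinations(sorted(clone_tags), 2):
        count = len(clone_tags[a] & clone_tags[b])
        if count:
            overlaps[(a, b)] = count
    return overlaps


def group_into_contigs(clone_tags, min_shared=2):
    """Group clones into contigs: clones sharing at least min_shared tags
    are linked, and each connected group of clones forms one contig."""
    adjacency = {clone: set() for clone in clone_tags}
    for (a, b), count in shared_tags(clone_tags).items():
        if count >= min_shared:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    contigs = []
    for start in sorted(clone_tags):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        members = []
        while stack:                      # depth-first traversal
            clone = stack.pop()
            members.append(clone)
            for neighbour in adjacency[clone]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        contigs.append(sorted(members))
    return contigs


if __name__ == "__main__":
    example = {
        "A": {"t1", "t2", "t3"},
        "B": {"t2", "t3", "t4"},
        "C": {"t4", "t5", "t6"},
        "D": {"t9", "t10"},
    }
    print(group_into_contigs(example, min_shared=2))
    print(group_into_contigs(example, min_shared=1))
```

The threshold is a design choice in this sketch: a higher `min_shared` value reduces false joins caused by tags that occur in repeats, at the cost of splitting contigs where the true overlap between clones is small.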
Despite all developments in high throughput sequencing, determining draft genome sequences with high accuracy is still considered expensive and laborious, and competition in the market is fierce. Hence, there remains a need to complement the currently existing methods with efficient and economical methods for generating draft genome sequences.