Next-generation DNA sequencing is a low-cost, high-throughput sequencing technology based on the principle of sequencing by synthesis. Taking the Solexa sequencing method as an example, it comprises: first, randomly fragmenting DNA strands by a physical method; second, ligating a specific adaptor, which contains an amplification primer sequence, to both ends of each obtained DNA fragment; and third, sequencing the adaptor-ligated DNA fragments. During the sequencing step, a DNA polymerase synthesizes, starting from the adaptor, a strand complementary to the DNA fragment to be analyzed, and the base sequence is determined by detecting the fluorescence signal carried by each newly incorporated base, so as to obtain the sequence of the DNA fragment to be analyzed. The sequences obtained in this way are referred to as reads. For the basic process of the Solexa sequencing method, see, for example, www.Illumine.com.
To recover an intact genome sequence (for example, to assemble reads into a genome sequence such as a chromosome sequence), next-generation sequencing usually connects reads in a stepwise manner. First, by means of the overlap relationship between reads, the reads are extended as far as possible (namely, connected together) to form contigs. Second, by means of the distance relationship between the two reads of a pair in pair-end sequencing, different contigs that share pair-end reads are connected together by inserting a certain number of N characters between them, to form scaffolds. In a scaffold, the order of the contigs before and after each N region is already known, and their distance from each other in the DNA sequence is also known. Finally, the N regions are converted into sequence information by "gap closure" methods. One such "gap closure" method comprises: finding pair-end reads in which one end is located in the known sequence of the scaffold and the other end is located in the N region of the scaffold; identifying all reads located in the N region; and then performing local assembly by means of the overlap relationship to obtain the sequence information of the N region. For a general protocol of sequence connection, see, for example, Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20, 265-72 (2010).
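The first two stages described above (extending reads into contigs by overlap, then joining contigs into a scaffold with an N region) can be illustrated with a minimal Python sketch. It is a simplified illustration only, not the method of the cited software: it assumes error-free reads from a single strand, uses a plain greedy merge, and estimates the gap from one read pair. The function names (`overlap`, `greedy_contigs`, `estimate_gap`, `scaffold`) and the parameters `min_len`, `insert_size`, `left_tail` and `right_head` are hypothetical names introduced for this example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contigs(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Assumption: reads are error-free and all come from the same strand.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to extend
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs)
                   if n not in (best_i, best_j)] + [merged]


def estimate_gap(insert_size, left_tail, right_head):
    """Estimate the gap between two contigs from one read pair.

    insert_size: expected outer distance between the two reads of a pair.
    left_tail:   bases from the start of read 1 to the end of the left contig.
    right_head:  bases from the start of the right contig to the end of read 2.
    """
    return max(0, insert_size - left_tail - right_head)


def scaffold(left_contig, right_contig, gap):
    """Join two ordered contigs with an N region of the estimated length."""
    return left_contig + "N" * gap + right_contig


reads = ["ATGCGT", "CGTACC", "ACCTTG"]
contigs = greedy_contigs(reads, min_len=3)
gap = estimate_gap(insert_size=20, left_tail=9, right_head=6)
print(contigs)
print(scaffold(contigs[0], "GGATCC", gap))
```

In practice the gap length would be estimated from many read pairs rather than one, and the N region would then be the target of the local assembly used for gap closure.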
Although the sequencing data (namely, reads) of next-generation sequencing can be connected using known software, the reads generally have a relatively short read length (commonly only about 100 bp), which limits how well the sequencing data can be connected: it is very hard to assemble reads into a genome sequence, such as a chromosome sequence, by relying on assembly software alone.
Therefore, there is an urgent need in the art to improve the method of assembling reads, so as to further optimize the assembly result of sequencing data and increase its accuracy (namely, to obtain a highly accurate genome sequence). In particular, the present disclosure also provides a new method of obtaining and improving the genomes of parents using sequencing data of a progeny population of inbred lines.