1. Field of the Invention
The present invention relates to the field of bioinformatics, specifically to the automated alignment and merging of short DNA fragments and the assembly of these fragments into larger DNA molecules.
2. Description of Related Art
The field of bioinformatics involves the practice of sequence assembly, which refers to the aligning and merging of smaller fragments of a much larger DNA sequence in order to reconstruct the original large sequence. Current sequencing technology does not allow the sequencing of very large DNA fragments. Instead, smaller pieces, generally between 20 and 1000 bases in length, are sequenced and then merged.
The problem of sequence assembly can be compared to passing multiple copies of a book through a shredder and then attempting to piece a single copy of the book back together from only shredded pieces. The resulting book may have many repeated paragraphs while some shreds may be modified to have typos. Excerpts from another book may be added in and some shreds may be completely unrecognizable.
Current sequencing techniques rely on breaking large DNA fragments into small fragments, which are then individually sequenced. This procedure is performed redundantly, so that the small fragments overlap, in a way that maximizes the likelihood that every portion of the larger DNA fragment is sequenced one or more times. This process results in a logic, or computational, problem in that the sequences of the small fragments must be assembled or aligned into larger pieces, which are in turn assembled into still larger pieces, in order to create the entire DNA sequence of the large fragment sought to be sequenced.
DNA is a biochemical polymer made up of monomers referred to as “bases” which are conventionally represented by one of four letters, A, T, C, or G. As used herein, the small piece of DNA which is subjected to actual biochemical analysis to determine its base sequence is referred to as a “fragment,” and the data representing the DNA sequence generated from each fragment is referred to as a “fragment read”. Again, in the overall sequencing process, fragment reads are created which are redundant or overlapping so as to cover most or all sections of the larger DNA piece from which the fragments were created. The fragment reads must be aligned into one or more contiguous larger segments, such a larger segment being referred to here as a “contig”. The overall layout of fragments into contigs is used to determine the sequence of large fragments of DNA. This process is referred to here as “fragment assembly”.
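By way of illustration only, the merging of overlapping fragment reads into contigs can be sketched with a simple greedy procedure. The sketch below is not the method of the present invention; the function names `overlap_length` and `greedy_assemble` and the parameter `min_overlap` are hypothetical, and the sketch assumes error-free reads taken from a single strand, with no read wholly contained in another.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Illustrative assumption: exact matches only, so sequencing errors,
    reverse complements, and repeats are not handled.
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_overlap)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ATCGGA", "GGATTC", "TTCAAG"]))
```

In this toy example the three fragment reads overlap by three bases each and are merged into a single contig. Reads that share no sufficiently long overlap remain as separate contigs.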
Because DNA is a polymer, it is common to refer to DNA pieces using the nomenclature of polymers. Hence, the terminology “mer” is used to refer to a sequence of bases in a fragment read. In the conventional terminology used, “mer” refers to a sequence of any length and, when prefixed with the number, is used to refer to a sequence of defined length. Thus a 20-mer is a portion of DNA 20 bases in length.
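As a minimal illustration of this terminology, the following sketch lists every mer of a given length contained in a fragment read. The function name `kmers` and the parameter `k` are hypothetical and are used here only for the example.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every k-mer (substring of exactly k bases) of a read, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read of length L contains L - k + 1 overlapping k-mers
    # (none if the read is shorter than k).
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("ATCGA", 3))
```

For the five-base read shown, the sketch yields the three 3-mers ATC, TCG, and CGA. By the same count, a read 36 bases in length contains 17 overlapping 20-mers.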
Sequencing technology continues to improve. The Solexa™ technology is available and heavily used to generate roughly 100 million reads per day on a single sequencing machine. By comparison, the 35 million reads of the Human Genome Project took several years to produce on hundreds of sequencing machines. The downside is that these reads have a length of only 36 bases. This makes sequence alignment an even more daunting task.