Developments in the field of genetics and molecular biology have led to a concerted scientific effort to determine the entire DNA sequence of whole organisms. Scientists have already sequenced the complete genome of several microorganisms, and genome sequencing efforts are now turning to larger organisms leading inevitably to determining the complete DNA sequence of human beings. The current biochemical methods used to ascertain the sequence of DNA molecules typically are performed on small segments of DNA due to practical constraints in those processes. While it is theoretically possible to sequence a large DNA molecule by sequencing small pieces systematically beginning at one end of the large piece and proceeding to the other end, there are pragmatic reasons why such a process is difficult. Instead, most current techniques rely on breaking large DNA fragments into small fragments which are then individually sequenced. This is done in a redundant or overlapping procedure in a way that maximizes the likelihood that all the portions of the larger DNA fragments are sequenced one or more times by the sequencing of the overlapping small fragments. This process results in a logic, or computational, problem in that the sequences of the small fragments must be assembled or aligned into larger pieces, which larger pieces are then assembled into still larger pieces in order to create the entire DNA sequence of the large fragment sought to be sequenced.
DNA is a biochemical polymer made up of monomers referred to as "bases" which are conventionally represented by one of four letters, A, T, C, or G. As used herein the small piece of DNA which is subjected to actual biochemical analysis to determine its base sequence is referred to as a "fragment," and the data representing the DNA sequence generated from each fragment is referred to as a "fragment read." Again, in the overall sequencing process, fragment reads are created which are redundant or overlapping to cover most or all sections of the larger DNA piece from which the overlapping fragments were created. The fragment reads must be aligned into one or more contiguous larger segments, such a larger segment being referred to here as a "contig." The overall layout of fragments into contigs is used to determine the sequence of large fragments of DNA. This process is referred to here as "fragment assembly."
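As a minimal illustration of how overlapping fragment reads combine into a contig, the Python sketch below merges reads whose ends match exactly. It is a toy under strong assumptions, not any assembler's actual method: the reads are error-free, all lie on the same strand, and are already given in left-to-right order. The function names `suffix_prefix_overlap` and `merge_reads` and the example reads are hypothetical.

```python
from typing import Optional


def suffix_prefix_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches.
    for length in range(longest_possible, min_len - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads on their exact overlap, or return None if they do not overlap."""
    overlap = suffix_prefix_overlap(left, right, min_len)
    if overlap == 0:
        return None
    return left + right[overlap:]


# Hypothetical fragment reads, assumed to be listed in left-to-right order.
reads = ["ATCGGATT", "GATTACAG", "ACAGGC"]

contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is None:
        raise ValueError("reads do not overlap; a second contig would be needed")
    contig = merged

print(contig)
```

Real fragment reads contain sequencing errors and arrive in no particular order or orientation, which is what makes the assembly problem computationally demanding.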
Because DNA is a polymer, it is common to refer to DNA pieces using the nomenclature of polymers. Hence, the term "mer" is used here to refer to a sequence of bases in a fragment read. In the conventional terminology, "mer" refers to a sequence of any length and, when prefixed with a number, refers to a sequence of that defined length. Thus a 20-mer is a portion of DNA 20 bases in length.
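The "mer" terminology can be made concrete with a short Python sketch that lists every k-mer contained in a read. The function name `kmers` and the example read are hypothetical and serve only to illustrate the definition.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every contiguous k-mer of `read`, in left-to-right order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read of length L contains L - k + 1 k-mers (none if k > L).
    return [read[i:i + k] for i in range(len(read) - k + 1)]


example_read = "ATCGGA"
print(kmers(example_read, 3))
```

A read of length L therefore contains L - k + 1 overlapping k-mers, so a 25-base read contains six 20-mers.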
The length of fragment reads and the complexity of the problems inherent in assembling the fragment reads into contigs depend on the length of the overall DNA being sequenced and the molecular biology strategy adopted for sequencing that piece of DNA. DNA fragments or pieces are typically inserted into biological carriers, referred to as vectors, and different classes of vectors can host DNA fragments of different lengths. For larger genomes, for example, a popular DNA sequencing strategy involves cloning DNA fragments in excess of 100 kilobases (kb) from one or more chromosomes into so-called BAC vectors, shotgun-cloning at random smaller DNA fragments from the inserts in the BAC vectors, and then sequencing the smaller fragments. The smaller fragments are sequenced to make the fragment reads. The fragment reads must be reconstituted into "contigs," which represent the sequence of the larger BAC inserts, following which the BAC contigs must be arranged according to their positions on the chromosomes. There are a variety of strategies for assembling the larger genome sequence from these larger contigs, for example, optical mapping.
It should be appreciated that sequencing strategies can differ, and that the severity of the problems encountered in using such strategies depends heavily on the size of the genome sought to be sequenced. The whole genome of some smaller organisms, such as the bacterium E. coli, has been entirely derived by random shotgun cloning of small inserts from the whole genome, and then assembling overlaps among the fragment reads in an exhaustive procedure to assemble the pieces into the whole genome. This strategy is feasible where the genome of the organism is small. The genome of an E. coli organism is approximately 4.7 megabases. This approach becomes less tenable when dealing with larger genome sizes. For example, the human genome is approximately three billion bases. The adoption of a similar assembly method on a genome of this size presents forbidding practical assembly problems.
Significant effort has been undertaken toward software methods that can efficiently handle large amounts of DNA sequence data. Such methods will differ, however, in the efficiency with which they can process the DNA fragment reads into contigs or whole genomic information. The speed with which such algorithms operate is dependent, of course, principally on the number of fragment reads, referred to here by the number n. A number of commonly used computer software fragment read assemblers are now available. However, the assemblers now in use commonly have a processing time that is proportional to n². With such algorithms, doubling the number of reads roughly quadruples the computational time necessary for execution. There are at least seven known programs in operation which do sequence assembly, and all of them have a processing time proportional to n². These packages include the following: The Phred/Phrap Package (Green et al.), The TIGR Assembler (Sutton et al.), GAP (Bonfield et al.), CAP 2 (Huang), The Genome Construction Manager (Lawrence et al.), Bio Image Sequence Assembly Manager (Smith et al.), and SeqMan (Swindell & Plasterer).
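One plausible source of n² behavior is the comparison of every fragment read against every other read. This is an assumption made for illustration, since the text does not say how any of the listed packages works internally. The Python sketch below counts those pairwise comparisons; the function name `pairwise_comparisons` is hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct unordered pairs among n fragment reads: n(n - 1) / 2."""
    return n * (n - 1) // 2


for n in (22_000, 44_000, 88_000):
    print(f"{n:>6} reads -> {pairwise_comparisons(n):>13,} pairwise comparisons")

# Doubling the number of reads roughly quadruples the number of pairs.
growth = pairwise_comparisons(44_000) / pairwise_comparisons(22_000)
print(f"growth factor when n doubles: {growth:.3f}")
```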
Another method that takes time proportional to the square of n (or perhaps the cube of n) uses genetic algorithms to search for possible layouts of fragment reads (Parson and Johnson). At each iteration of processing, a number of possible layouts are considered and evaluated for their likelihood of being correct. Two functions for evaluating the layouts were developed, one with a processing time proportional to n and the other proportional to n². In the most successful experiments with this method, the number of possible layouts considered is set to be larger than n. Depending on which evaluation function is used, the overall processing time to run the algorithm is then proportional either to n² or n³.
For large-scale sequencing operations, the use of algorithms whose processing time is dependent on the square of n results in computational time that may become impractically large. As an example, current efforts at the University of Oklahoma focusing on the genome of the organism Neisseria gonorrhoeae have resulted in over 22,000 fragment reads. Reportedly, the assembly of those fragment reads takes about four hours with the Phred/Phrap Package, run with high-stringency fragment read assembly parameters, on a SPARC Ultra workstation. Extrapolating on the basis of a processing time proportional to n², and assuming fragment read lengths of 500 bases and a redundancy of two, it would take over three months to assemble fragment reads from one chromosome of the human genome on this platform. In contrast, for a method whose processing time is proportional to n, and again taking 4 hours to assemble 22,000 fragment reads under the same assumptions, a human chromosome could theoretically be assembled in less than four days.
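The extrapolation above can be reproduced with a few lines of Python. The text does not state the chromosome length it assumes; the sketch below takes an average chromosome to be the three-billion-base human genome divided by 23, which is an assumption introduced here. The function names `read_count` and `extrapolate_hours` are likewise hypothetical.

```python
def read_count(target_bases: float, read_length: int, redundancy: float) -> float:
    """Number of fragment reads needed to cover `target_bases` at the given redundancy."""
    return redundancy * target_bases / read_length


def extrapolate_hours(base_hours: float, base_reads: float,
                      new_reads: float, exponent: int) -> float:
    """Scale a measured run time, assuming time is proportional to n ** exponent."""
    return base_hours * (new_reads / base_reads) ** exponent


# Figures taken from the text.
BASE_READS = 22_000   # fragment reads in the Neisseria gonorrhoeae project
BASE_HOURS = 4.0      # reported assembly time for those reads
READ_LENGTH = 500     # bases per fragment read
REDUNDANCY = 2        # each base covered about twice

# Assumption (not from the text): an average human chromosome is
# the roughly three-billion-base genome divided by 23.
CHROMOSOME_BASES = 3_000_000_000 / 23

n_reads = read_count(CHROMOSOME_BASES, READ_LENGTH, REDUNDANCY)
quadratic_days = extrapolate_hours(BASE_HOURS, BASE_READS, n_reads, 2) / 24
linear_days = extrapolate_hours(BASE_HOURS, BASE_READS, n_reads, 1) / 24

print(f"reads per chromosome: {n_reads:,.0f}")
print(f"time if proportional to n^2: {quadratic_days:.1f} days")
print(f"time if proportional to n:   {linear_days:.1f} days")
```

Under these assumptions the quadratic estimate comes to a little over three months and the linear estimate to just under four days, consistent with the figures quoted in the text.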
More powerful computers can shorten the execution time of fragment assembly software, even for approaches whose execution time varies in proportion to n². However, greater computational power alone cannot shorten that time sufficiently. As biochemical sequence data acquisition procedures are becoming more rapid, lengthy fragment assembly procedures could create a bottleneck in DNA sequencing. Moreover, it is typically useful to be able to run sequence assembly procedures multiple times on a growing data set created by the sequencing operation. At various times during the sequencing process, it is appropriate to test whether fragment redundancy is approaching the level at which finishing strategies (to cover gaps in the data, for example) should be initiated. Later, when it is determined that coverage of the genome appears adequate, fragment assembly may be run repeatedly to test the effectiveness of assembly using various different assembly criteria. Fast assembly software, running in time proportional to n, would enhance the efficiency of such analysis even for projects based on small genomes or individual clones.