Sequence determination technologies for proteins, RNAs, and DNAs have been pivotal in the development of modern molecular biology. During the past fifteen years, DNA sequencing in particular has been the core technology in an on-going revolution in the scope and the depth of understanding of genomic organization and function. The on-going development of sequencing technology is, perhaps, best symbolized by the determination of the complete sequence of a human genome.
The human genome sequencing project served a number of purposes. It served as a platform for programmatic development of improved sequencing technologies and of genome sequencing efforts. It also served to provide a framework for the production and distribution of sequencing information from increasingly large scale sequencing projects. These projects provided complete genome sequences for a succession of model organisms of increasingly large genetic complements. These accomplishments, culminating in the completion of a human genome sequence, highlight the very considerable power and throughput of contemporary sequencing technology.
At the same time, however, they highlight the limitations of current technology and the need for considerable improvements in speed, accuracy, and cost before sequencing can be fully exploited in research and medicine. Among the areas that can be seen most readily to require advances in sequencing technology are clinical sequencing applications that require whole genome information, environmental applications involving multiple organisms in mixtures, and applications that require processing of many samples. These are, of course, just a few among a great many areas that either require or will benefit greatly from more capable and less expensive sequencing methods.
To date, virtually all sequencing has been done by Sanger chain elongation methods. All Sanger methods require separating the elongation products with single base resolution. Currently, while polyacrylamide gel electrophoresis (PAGE) is still used for this purpose in some commercial sequencers, capillary electrophoresis is the method of choice for high throughput DNA sequencers. Both gel-based and capillary-based separation methods are time consuming and costly, and they limit throughput. Chip-based methods, such as Affymetrix GeneChips and HySeq's sequencing by hybridization methods, require chips that can be produced only by capital intensive and complex manufacturing processes. These limitations pose obstacles to the utilization of sequencing for many purposes, such as those described above. Partly to overcome the limitations imposed by the necessity for powerful separation techniques in chain termination sequencing methods and the manufacturing requirements of chip-based methods, a number of technologies are currently being developed that do not require the separation of elongation products with single base resolution and do not require chips.
A lead technology of this type is a bead, emulsion amplification, and pyrosequencing-based method developed by 454 Life Sciences. (See Margulies, et al. (2005) Nature 437: 376, which is incorporated herein by reference in its entirety, particularly as to the afore-mentioned methods.) The method utilizes a series of steps to deposit single, amplified DNA molecules in individual wells of a plate containing several million picoliter wells. The steps ensure that each well of the plate contains either no DNA or the amplified DNA from a single original molecule. Pyrosequencing is carried out in the wells by elongation of a primer template in much the same way as Sanger sequencing. Unlike Sanger sequencing, however, pyrosequencing does not involve chain termination and does not require separation of elongation products. Instead, sequencing proceeds stepwise by single base addition cycles. In each cycle one of the four bases (A, T, G, or C) is included in the elongation reaction. The other three bases are omitted. A base is added to the growing chain if it is complementary to the next position on the template. Light is produced whenever a base is incorporated into the growing complementary sequence. By interrogating with each of A, C, G, and T in succession, the identity of the base at each position can be determined. Sequencing reactions are carried out in many wells simultaneously. Signals are collected from all the wells at once using an imaging detector. Thus, a multitude of sequences can be determined at the same time.
In principle, each well containing a DNA will emit a signal for only one of the four bases for each position. In practice, runs of the same base at two or more positions in succession lead to the emission of proportionally stronger signals for the first position in the run. Consequently, reading out the sequence from a given well is a bit more complicated than simply noting, for each position, which of the four bases is added. Nevertheless, because signals are proportional to the number of incorporations, sequences can be accurately reconstructed from the signal strength for most runs.
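The readout logic described above can be illustrated with a minimal Python sketch. It is not the 454 Life Sciences implementation. It assumes an idealized, noise-free signal of exactly one unit per incorporated base, a hypothetical cyclic flow order "TACG", and hypothetical function names (simulate_flowgram, decode_flowgram). For simplicity it works directly with the growing (synthesized) strand, which is the complement of the template.

```python
def simulate_flowgram(synthesized, flow_order="TACG"):
    """Return (base, signal) pairs for an idealized pyrosequencing run.

    `synthesized` is the growing strand (the complement of the template).
    The flow order and the one-unit-per-base signal are assumptions made
    for illustration only.
    """
    unknown = set(synthesized) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by flow order: {sorted(unknown)}")

    signals = []
    pos = 0   # next position of the synthesized strand still to be added
    flow = 0  # index of the current single-base addition cycle
    while pos < len(synthesized):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Every consecutive matching base is incorporated in the same
        # cycle, so a run of identical bases yields one stronger signal.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, float(run)))
        flow += 1
    return signals


def decode_flowgram(signals):
    """Rebuild the sequence by rounding each signal to a base count."""
    return "".join(base * int(round(value)) for base, value in signals)


ideal = simulate_flowgram("TTAGGGC")
# Add a small, fixed offset to mimic measurement error (illustrative only).
noisy = [(base, value + 0.2) for base, value in ideal]
decoded = decode_flowgram(noisy)
```

In this sketch the run "GGG" produces a single signal of strength 3 in the G cycle, and rounding each signal to the nearest whole number recovers "TTAGGGC". Rounding becomes less reliable as runs grow longer, because a fixed relative error then spans more than half a base, which is consistent with the statement that reconstruction is accurate for most, rather than all, runs.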
The technology has been shown to read an average of about 250 bases per well with acceptable accuracy. A device offered by 454 Life Sciences currently uses a 6.4 cm² picoliter well “plate” containing 1,600,000 picoliter sized wells for sequencing about 400,000 different templates. The throughput for a single run using this plate currently is about 100 million bases in four hours. Even though this is a first generation device, its throughput is nearly 100 times better than standard Sanger sequencing devices.
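The figures quoted above are mutually consistent, as the following short Python calculation shows. All input values are taken from the preceding paragraph; the function name run_throughput and the derived quantities (bases per hour, fraction of wells read, well density) are introduced here only for illustration.

```python
def run_throughput(templates, mean_read_bases, run_hours,
                   total_wells, plate_area_cm2):
    """Derive per-run figures from the plate parameters quoted in the text."""
    bases_per_run = templates * mean_read_bases
    return {
        "bases_per_run": bases_per_run,
        "bases_per_hour": bases_per_run / run_hours,
        # Share of wells that yield a sequenced template.
        "fraction_of_wells_read": templates / total_wells,
        "wells_per_cm2": total_wells / plate_area_cm2,
    }


summary = run_throughput(
    templates=400_000,      # about 400,000 different templates
    mean_read_bases=250,    # about 250 bases per well
    run_hours=4,            # one run takes about four hours
    total_wells=1_600_000,  # wells on the plate
    plate_area_cm2=6.4,     # plate area
)
```

With these inputs, 400,000 templates at about 250 bases each give the stated 100 million bases per run, or roughly 25 million bases per hour, with about one well in four yielding a read.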
Numerous other methods are being developed for ultra high throughput sequencing by other institutions and companies. Sequencing by synthesis methods that rely on target amplification are being developed and/or commercialized by George Church at Harvard University, by Solexa, and by others. Ligation sequencing methods have been developed and/or are being commercialized by Applied Biosystems and Solexa, among others. Array and hybridization sequencing methods are commercially available and/or are being developed by Affymetrix, Hyseq, Biotrove, Nimblegen, Illumina, and others. Methods of sequencing single molecules are being pursued by Helicos based on sequencing by synthesis and U.S. Genomics (among others) based on poration.
In some regards, these methods represent a considerable improvement in throughput over past methods, and they promise considerable improvement in economy as well. However, currently they are expensive to implement and use, they are limited to relatively short reads and, although massively parallel, they have limitations that must be overcome to realize their full potential.
One particular disadvantage of these methods, for example, is that samples must be processed serially, reducing throughput and increasing cost. This is a particularly great disadvantage when large numbers of samples are being processed, such as may be the case in clinical studies and environmental sampling, to name just two applications.
The incorporation of indexing sequences by ligation to random shotgun libraries has been disclosed in U.S. Pat. Nos. 7,264,929, 7,244,559, and 7,211,390, but the direct ligation methods therein disclosed distort the distribution of the components within the samples (as illustrated in FIG. 4 herein) and therefore are inappropriate for enumerating components within each sample.
Accordingly, there is a need to improve sample throughput, to lower the costs of sequencing polynucleotides from many samples at a time, and to accurately enumerate the components of samples analyzed by high throughput, parallelized and multiplex techniques.