The present invention relates to the analysis of polymer molecules, and particularly of polynucleotides. Polynucleotides are polymeric molecules comprising repeating bases of nucleosides bound together in a linear fashion. Examples of polynucleotides are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA is the genetic material of living organisms. It is the molecule that stores and transmits the code of life. Segments of DNA known as genes act as the templates for the formation of proteins, which are the molecules that comprise the structure and function of all living matter.
DNA polymers are made up of strings of four different nucleotide bases known as adenine (A), guanine (G), cytosine (C), and thymine (T). The particular order, or "sequence," of these bases in a given gene determines the structure of the protein encoded by the gene. Furthermore, the sequence of bases surrounding the gene typically contains information about how often the particular protein should be made, in which cell types, etc. Knowledge of the DNA sequence in and around a gene provides valuable information about the structure and function of the gene, the protein it encodes, and its relationship to other genes and proteins.
The complete nucleotide sequence of all DNA polymers in a particular individual is known as that individual's "genome". Whereas most bacteria have genomes on the order of a few million bases, the human genome contains more than 3.5 billion bases. In recent years, both government and private organizations have expended enormous resources attempting to build a complete, detailed map of the human genome. In particular, the Human Genome Project, a government-funded effort directed by the National Institutes of Health, has promised to deliver a complete human sequence by the year 2003 at a cost that is expected to exceed $3 billion. Private corporations have also entered the race. For example, Celera, Inc., a Rockville, Md. company, has spent over $300 million with the purpose of sequencing the genome by 2001.
There are significant reasons that so much effort and money are focused on sequencing the human genome. First, there is the obvious scientific merit associated with having a detailed map of mankind's genetic template. The information embodied by such a map will allow scientists to better understand the relationship between our genetic code and the functions of the nearly 100,000 proteins that make up our bodies. It is already known that there is a direct relationship between particular DNA sequences and certain disease states. This fact has encouraged many pharmaceutical companies to invest heavily in the field of genomics research in the hope of discovering the underlying genetic nature of these diseases.
Another reason that sequence information is important is the expected ability to determine an individual's susceptibility to particular diseases based on his or her genetic sequence. The field of genetic diagnostics is dedicated to identifying nucleotide sequence elements whose presence in a genome correlates with development of a particular disorder or feature. The more information is available about genomic sequence elements observed in the population, the more powerful this field becomes. Furthermore, the more rapidly information can be obtained about the prevalence and penetrance of sequence elements in the general population, as well as about the presence of particular such elements in the genomes of particular individuals being tested, the more effective the analysis becomes.
Yet another reason that sequence information is valuable is that a number of pharmaceutical companies seek to develop drugs that are custom-tailored to an individual's genetic profile. The hope is to provide targeted, potent drugs, possibly with decreased dosage levels appropriate to the genetic characteristics of the particular individual to whom the drug is being administered.
Current Sequencing Technology
Most currently available nucleotide sequencing technologies determine the nucleotide sequence of a given polynucleotide strand by generating a collection of complementary strands of different lengths, so that the collection includes molecules terminating at each base of the target sequence and ranging in size from just a few nucleotides to the full length of the target molecule. The target molecule's sequence is then determined by analyzing the truncated complementary strands and determining which terminate with each of four DNA nucleotides. A "ladder" is constructed by arranging the truncated molecules in order by length, and the terminal residue of each rung is read off to provide the complement of the target polynucleotide sequence.
The most popular DNA sequencing systems generate the collection of truncated complementary molecules by performing a template synthesis reaction in the presence of low concentrations of modified versions of each of the four natural nucleotides. These modified compounds can be added to a polynucleotide chain but cannot be extended. Furthermore, each one is labeled with a different fluorescent dye, so that chains terminated with different nucleotides can be distinguished from one another by the color of fluorescence they emit (see, for example, Smith et al., U.S. Pat. No. 5,821,058; Smith et al., U.S. Pat. No. 5,747,249; Kaiser et al., Methods Enzymol 218:122-153, 1992; Automated DNA Sequencing Chemistry Guide, PE Applied Biosystems, A division of Perkin-Elmer (1998); each of which is incorporated herein by reference). The sequence of the target molecule is then determined by reading the sequence of fluorescent colors emitted by the arranged rungs of the molecular ladder of complementary strands.
These DNA sequencing methods have been automated, and machines that perform them are available in the commercial marketplace. The most advanced of these machines are capable of carrying out the above sequencing process in parallel reactions (up to 96 at a time). Under certain conditions the output from one machine may exceed 300,000 bases per day. Using large numbers of such machines, some organizations claim sequencing rates as high as 100 million bases per day (The Economist 347(8068):87-88 May 16 1998; incorporated herein by reference).
Currently available DNA sequencing systems are very powerful. However, they are limited by their speed, their complexity, and their cost. Because of these problems, their use is not widespread in clinical environments. For example, even the most sophisticated genetic diagnostic procedures involve the analysis of only very short regions of sequence (often not by direct sequencing but rather by indirect methods that probe the underlying sequence). Large-scale sequencing of patient DNA is simply not performed in the clinic.
The speed of currently available automated sequencers is limited by the inability of the machines to analyze more than several hundred (typically around 600) nucleotides of sequence at a time. Allowing for the overlaps needed to correctly piece together strands less than 1000 bases long, the standard sequencing process may have to be performed as many as 70 million times in order to determine the human genome sequence (Technology Review 102(2):64-68 Mar/Apr 1999; incorporated herein by reference). As has been noted, it takes many 600s to fit into 3 billion, and in practice it takes many more than that to make a full sequence because the individual fragments have to be linked together by matching up their overlaps (The Economist 347(8068):87-88 May 16 1998; incorporated herein by reference). At a theoretical rate of even 100 million bases per day, it will take at least a year to sequence the human genome once. With these techniques, large-scale sequencing cannot become a clinical tool. For genetic diagnostics to become practical in a clinical setting, the sequencing rate will have to be increased by at least three to five orders of magnitude.
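The figures quoted above can be combined in a short calculation. The following Python sketch is illustrative only: the variable names are hypothetical, and it assumes every one of the 70 million runs yields a full 600 bases. Under those assumptions the arithmetic is consistent with the "at least a year" estimate.

```python
# Figures quoted in the text; variable names are illustrative only.
READS_NEEDED = 70_000_000        # sequencing runs cited for one human genome
BASES_PER_READ = 600             # typical read length (assumed for every run)
GENOME_BASES = 3_000_000_000     # genome size used in the quoted passage
BASES_PER_DAY = 100_000_000      # claimed peak rate for a large facility

# Total number of bases that must actually be read, overlaps included.
total_bases_read = READS_NEEDED * BASES_PER_READ

# How many times over the genome is read because of the overlaps.
redundancy = total_bases_read / GENOME_BASES

# Time needed at the claimed peak rate.
days_required = total_bases_read / BASES_PER_DAY

print(f"bases read in total: {total_bases_read:.2e}")
print(f"redundancy (fold coverage): {redundancy:.0f}x")
print(f"days at peak rate: {days_required:.0f}")
```

The redundancy introduced by the overlaps, rather than the raw genome size, is what pushes the estimate past one year in this calculation.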
The complexity of current sequencing technology arises from the need to amplify and modify the genetic molecules being sequenced. This modification is carried out either chemically or enzymatically, and amplification is achieved by numerous cycles of heating and cooling. One of the more popular ways of amplifying and modifying the DNA to be sequenced is the polymerase chain reaction (PCR). PCR involves successive rounds of denaturing, annealing, and extension using a DNA polymerase, resulting in the exponential amplification of the original strand of DNA. The length of time associated with each part of the cycle depends on the fluid volume and the length of DNA to be amplified. Typical times are on the order of 10-30 seconds for the denaturation step, 5-30 seconds for the annealing step, and 1-4 minutes for the extension step. This cycle is usually carried out 15 to 30 times. Therefore, normal PCR times are one to three hours, depending on the length of the DNA to be amplified. The fundamental factors that constrain the denaturing, annealing, and labeling are the number of detectable strands needed, the time needed to carry out this process, and the processivity of the enzyme. This entire process is time consuming and requires following involved procedures. Although many of the available sequencing machines have automated much of this process, even these machines must be operated by highly trained technicians and consume large quantities of chemicals.
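As a rough check on the PCR timing described above, the step durations and cycle counts can be multiplied out. The Python sketch below is illustrative only, with hypothetical names. It counts only the hold time at each step and ignores the time spent ramping between temperatures and any setup, so actual run times would presumably be longer than these bounds.

```python
# Step durations in seconds, as (shortest, longest), taken from the text.
DENATURE_S = (10, 30)
ANNEAL_S = (5, 30)
EXTEND_S = (60, 240)     # 1-4 minutes
CYCLES = (15, 30)

# Lower bound: fewest cycles, each with the shortest hold times.
shortest_s = CYCLES[0] * (DENATURE_S[0] + ANNEAL_S[0] + EXTEND_S[0])

# Upper bound: most cycles, each with the longest hold times.
longest_s = CYCLES[1] * (DENATURE_S[1] + ANNEAL_S[1] + EXTEND_S[1])

print(f"hold time only: {shortest_s / 60:.1f} to {longest_s / 60:.1f} minutes")
```

Even before thermal ramping is counted, the upper bound already reaches two and a half hours for a single amplification.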
In addition to the chemical complexity of traditional sequencing technology, there is also a tremendous amount of computational complexity. As mentioned above, 70 million separate and overlapping sequence fragments, each only about 600 bases long, must be pieced together to form the human genome using current sequencing technology. Numerous powerful computers, sophisticated algorithms, and highly trained computer scientists are required to carry out this task.
Finally, the cost of current sequencing systems, while not overwhelming for research needs, becomes prohibitive when considered on a scale large enough to encompass widespread clinical genetic diagnostics. Indeed, the current state-of-the-art automatic sequencing system costs approximately $100,000. According to the National Human Genome Research Initiative, the current cost of producing fully confirmed sequence information is approximately $0.50 per base. To sequence an entire person would cost about $1.5 billion, assuming three orders of magnitude cost savings by parallelizing current technology.
Clearly, there is a need for the development of improved sequencing technologies that are faster, easier to use, and less expensive.
The present invention provides an improved system for the analysis of polymer molecules. In general, the system provides an optical agent whose detectable properties are altered by interaction or association with the polymer. In preferred embodiments, the agent's detectable properties are differentially affected by individual monomers in the polymer, so that the sequence of monomer units in a given polymer molecule may be determined by sequentially contacting the polymer's monomers with the agent, for example by passing the polymer in a linear fashion past the agent.
In preferred embodiments of the invention, the optical agent comprises a fluorescent compound whose emissions are characteristically quenched by individual monomers. In particularly preferred embodiments, a polynucleotide strand is drawn linearly past such a fluorescent agent, in sufficient proximity to allow the individual nucleotide monomers to exert their characteristic quenching effects. Analysis of fluorescent quenching over time determines the sequence of the polynucleotide molecule.
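The idea of reading a sequence from fluorescence quenching over time can be illustrated with a toy model. The Python sketch below is not the method of the invention. The quench levels are invented for illustration, the function names are hypothetical, and the trace is assumed to contain exactly one intensity sample per nucleotide, which would require a constant, known translocation speed.

```python
# Hypothetical fraction of fluorescence remaining while each base sits next to
# the agent. These numbers are invented for illustration, not measured values.
QUENCH_LEVEL = {"A": 0.80, "C": 0.60, "G": 0.20, "T": 0.40}

def simulate_trace(sequence, offsets):
    """Return one intensity sample per base, perturbed by the given offsets."""
    return [QUENCH_LEVEL[base] + off for base, off in zip(sequence, offsets)]

def decode_trace(trace):
    """Assign each sample to the base whose reference level is closest."""
    return "".join(
        min(QUENCH_LEVEL, key=lambda base: abs(QUENCH_LEVEL[base] - sample))
        for sample in trace
    )

target = "GATTACA"
offsets = [0.03, -0.05, 0.04, -0.02, 0.06, -0.04, 0.01]  # stand-in for noise
trace = simulate_trace(target, offsets)
decoded = decode_trace(trace)
print(decoded)
```

In this toy model, decoding succeeds as long as the noise stays below half the spacing between adjacent quench levels. A real trace would also need to be segmented into per-base intervals before any such classification.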
The inventive polymer analysis techniques allow the rapid analysis of very long polymer chains at the level of a single polymer molecule. Amplification of target molecules is not required. Also, target molecules may be analyzed directly, in their native state; chemical or other modifications are not required. Certain embodiments of the invention allow the integrity of the target molecule to be preserved so that additional analyses may be performed on the same molecule whose sequence has been determined. The speed, simplicity, and effectiveness of the inventive system offer significant advantages over currently available technologies. In certain preferred embodiments, the inventive system allows the analysis of large (greater than 1000, and preferably greater than 10,000, 20,000, 30,000, 40,000, 50,000, or 100,000 nucleotides) nucleic acid molecules at speeds of more than 1,000 bases per second and up to 100,000 bases per second.
The system can be used to analyze any type of polymer with monomer differences that can be optically detected by choosing the appropriate optical agent.
Characterization of the number of bases in a polymer (without making any distinction amongst the bases, i.e., base counting) is also possible. In this case the optical agent need only produce a signal that indicates the presence or absence of the base. This would allow determination of the size of the polymer with single monomer accuracy.
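Base counting of this kind can be sketched as counting presence events in a signal. The Python example below is a minimal illustration under stated assumptions: the function name, the threshold, and the sample trace are all hypothetical, and the signal is assumed to recover above the threshold between consecutive monomers.

```python
def count_bases(trace, threshold):
    """Count downward crossings of the threshold, one per passing monomer."""
    count = 0
    below = False
    for sample in trace:
        # A new monomer is registered only on the transition from above the
        # threshold to below it, so consecutive low samples count once.
        if sample < threshold and not below:
            count += 1
        below = sample < threshold
    return count

# Invented intensity samples: three dips separated by recoveries.
trace = [1.0, 0.4, 0.5, 1.0, 0.9, 0.3, 1.0, 0.45, 0.95, 1.0]
n_bases = count_bases(trace, threshold=0.7)
print(n_bases)
```

If the signal did not recover between adjacent monomers, neighboring bases would merge into a single event, so a real implementation would need some other way to separate them.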
Polarization of the light can yield information about the relative orientation of the emitting and quenching groups (Dickson et al., Phys. Rev. Lett. 81:5322, 1998; incorporated herein by reference). This method would be useful for determining the orientation of the bases as they pass by the optical agent. This method could determine whether the chemical rings are perpendicular to the backbone or how they are oriented with respect to each other.