Four classes of biological molecules are known, namely, those comprising proteins, lipids, carbohydrates and nucleic acids. Nucleic acids, in turn, comprise two subsumed classes: DNA, which is a genetic component of all cells, and RNA, which usually functions in the synthesis of proteins.
The purview of the present invention extends to biomolecules, generally, but a working point for the sake of pedagogy is now established by referencing biomolecules comprising DNA. DNA is emphasized because it is the prime genetic molecule, carrying all hereditary information within chromosomes.
DNA stands for deoxyribonucleic acid. The DNA of most cells resides in a cell's nucleus. Its structure comprises long chains of relatively simple molecules called nucleotides. Each nucleotide comprises three parts: (1) a phosphate group; (2) a sugar called "deoxyribose" (ribose stripped of one special oxygen atom); and (3) a base. It is the base alone which distinguishes one nucleotide from another; thus it suffices to specify a base to identify a nucleotide. The four types of bases which occur in DNA nucleotides are adenine (A), guanine (G), cytosine (C) and thymine (T).
A single strand of DNA comprises many nucleotides strung together like a chain of beads. DNA usually comes in double strands, that is, two single strands which are paired up, nucleotide by nucleotide, in the form of the well known DNA double helix.
DNA carries a vast array of information through its nucleotide sequence. Accordingly, the order of nucleotides (considered as a linear progression, e.g., "A T T C G G A C C . . . ") is highly varied. A nucleotide sequence may comprise inter alia a single nucleotide, a duplet (an adjacent pair of bases), a codon (three consecutive bases), a gene (a portion of a strand which codes for a single enzyme), a strand of arbitrary nucleotides, or a genome comprising a total set of DNA molecules for an organism (e.g., 3 × 10^9 nucleotides for a human cell).
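By way of illustration only, the units just listed can be read off a sequence string as in the following minimal Python sketch. The function names `duplets` and `codons` are hypothetical, and treating duplets as overlapping adjacent pairs and codons as non-overlapping triplets in a chosen reading frame are assumptions made for the example.

```python
def duplets(seq):
    """Return every adjacent pair of bases (overlapping windows of width 2)."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def codons(seq, frame=0):
    """Return consecutive non-overlapping triplets, starting at offset `frame`."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


# The example progression quoted in the text above.
sequence = "ATTCGGACC"
print(duplets(sequence))
print(codons(sequence))
```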
Our work relates to a novel approach, assembly and method for biomolecular code sequencing. We proceed from the following considerations.
First, we set forth why it is significant and of great utility to have a biomolecular code sequencing capability. This effort, secondly, can help elicit problems, difficulties and constraints in an attempt to realize and effect such a capability. Thirdly, we state what is of pertinence with respect to the prior art as it relates to this situation. Finally, we define a novel assembly and method of the present invention, and argue that it addresses and solves the problems to be overcome in realizing a qualitatively new approach to biomolecular code sequencing. Furthermore, we set the novel assembly and method in apposition to the prior art, thereby highlighting its novel and unobvious aspects as well as attesting to its advantages.
Accordingly, we assume firstly that one somehow has nucleotide sequencing information, and that this information may be accessed by conventional computer techniques. Then, once in the computer, nucleotide sequences can be scanned (at least theoretically, in some cases) inter alia for RNA synthesis, a presence of inverted palindromes, preferred segments of potential Z-DNA (alternating purine and pyrimidine stretches), homologies to other known DNA sequences, mutation detection, genotyping, genetic database comparing, or large-scale supersequencing specifying a human genome by way of its component nucleotides and their location with respect to the entire genome.
It is believed that this recital makes self-evident the significance and utility of a biomolecular code sequencing capability. At the same time, it helps elicit outstanding difficulties, problems and constraints implicit in an hypothesized method for effecting such a sequencing capability. For example, a genome comprises approximately 10^9 nucleotides and has an average length of approximately 0.6 m, and a single nucleotide has an average length of approximately one to two angstroms. A candidate methodology must at least, therefore, somehow be able to resolve one nucleotide from an adjacent nucleotide, presumably without damage to the nucleotide, and resolve significant numbers of such nucleotides with precision and accuracy and within a meaningful time span.
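Two of the scans named above may be sketched as follows, purely as an illustration. It is assumed here that an inverted palindrome is a window equal to its own reverse complement, and that a potential Z-DNA segment is a run in which purines (A, G) and pyrimidines (C, T) strictly alternate. The function names, window length and minimum run length are hypothetical choices and not part of the invention.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}  # C and T are the pyrimidines


def reverse_complement(seq):
    """Complement each base and reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def inverted_palindromes(seq, length=6):
    """Return (start, window) for each window equal to its reverse complement."""
    hits = []
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        if window == reverse_complement(window):
            hits.append((i, window))
    return hits


def alternating_stretches(seq, min_len=6):
    """Return (start, run) for maximal purine/pyrimidine alternating runs."""
    hits = []
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the end of the sequence or where two neighbours
        # belong to the same class (both purines or both pyrimidines).
        if i == len(seq) or (seq[i] in PURINES) == (seq[i - 1] in PURINES):
            if i - start >= min_len:
                hits.append((start, seq[start:i]))
            start = i
    return hits


print(inverted_palindromes("TTGAATTCAA"))
print(alternating_stretches("AACGCGCGTT"))
```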
Two important and representative prior art methodologies that are pertinent to this situation comprise separation techniques including gel electrophoresis and free-solution electrophoresis.
Gel electrophoresis requires a physical separation of DNA fragments produced during a sequencing reaction. Instruction on conventional gel electrophoresis may be found in (1) J. Sambrook, E. F. Fritsch, T. Maniatis, "Molecular Cloning: A Laboratory Manual" (Cold Spring Harbor Laboratory, N.Y. 1989), and (2) A. T. Bankier and B. G. Barrell, "Nucleic Acids Sequencing: A Practical Approach", Eds. E. M. Howe, C. J. Rowlings, IRL Press, Oxford 1989, pp. 37-73, which instruction is incorporated by reference herein.
In overview, gel electrophoresis methodology typically comprises the steps of: (1) fragmenting a DNA strand to be sequenced into a series of fragments starting from the same point on the strand, each fragment differing in length from the next by one nucleotide; (2) labelling each fragment with e.g., fluorescent tags which can fluoresce at different colours depending on the end base (A, T, C or G); (3) doing gel electrophoresis for sequentially separating the fragments into bands of decreasing molecular size; and (4) using a suitable detection means for determining the end label of each band.
To this end, present gel electrophoresis methodology relies on a dispersion in the mobility of the DNA molecules with length to separate and effect bands in an electric field. Gel electrophoresis methodology, as it is presently understood, is therefore disadvantageously limited to approximately 700 bases (nucleotides), because there is a saturation in the dispersion for molecular lengths longer than 700 nucleotides. Further, due to the low dispersion and mobility, it takes several hours to achieve the separation of 700 nucleotides. It is true that this speed can be marginally increased by having several lanes (up to, say, 36) sequencing different portions of a strand.
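The logic of the four steps above can be illustrated with an idealized Python sketch. It assumes perfect single-nucleotide separation and error-free label detection, neither of which holds for a real gel, and the function names `make_fragments` and `read_ladder` are hypothetical.

```python
import random


def make_fragments(strand):
    """Steps (1)-(2): nested fragments from a common start point, each
    recorded as (length, end-base label)."""
    return [(n, strand[n - 1]) for n in range(1, len(strand) + 1)]


def read_ladder(fragments):
    """Steps (3)-(4): order the bands by fragment size, then read the end
    label of each band in turn."""
    bands = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(label for _, label in bands)


strand = "ATTCGGACC"
fragments = make_fragments(strand)
random.shuffle(fragments)  # the fragments start out mixed together
print(read_ladder(fragments))
```

In this idealization the read-out reproduces the strand exactly; the limitations discussed in the text arise because real bands broaden and mobility dispersion saturates with length.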
An important advantage of the present invention is that, notwithstanding the present difficulties or deficiencies of gel electrophoresis, as just noted, it is able to offset or remedy these limitations, so that as modified or re-evaluated from the standpoint of the present invention, gel electrophoresis can provide a significantly enhanced utility. This advantage comes about in the following way.
The present invention includes a method which can resolve at least a portion of a biomolecule specifically distinguishable against chemically complex backgrounds. In one embodiment, the present invention can be used for determining a code sequencing of large duplex DNA molecules in polyacrylamide gels using conventional electrophoretic equipment.
In explanation of this advantage, we note that a critical parameter that may limit the performance of present gel-based techniques is a band-broadening of DNA sequencing reactions, as they are separated through a fixed distance of gel at continuous field strengths, often ranging from 50-400 V/cm. The size-dependence of band widths may be a result of various mechanisms of reorientation and migration of the nucleic acid fragments in the gel, such as diffusion and thermal gradient broadenings.
Now, when a sample biomolecule migrates through a polymer solution chemically cross-linked, such as polyacrylamide or agarose gels, an overall friction coefficient can become a complicated function of the pore size in the gel, the size of the sample and the electric field strength, thereby limiting resolution.
Several approaches based upon the use of capillaries or pulsed fields can partially overcome this limit of resolution (C. R. Cantor et al., "Pulsed-Field Gel Electrophoresis of Very Large DNA Molecules", Annual Review of Biophysics and Biophysical Chemistry, vol. 17, 287, 1988).
A spatial resolution of the detection system may also be a source of band broadening, relying on the fact that a detector does not interrogate an infinitely thin section of the sample as it reaches a finite detection volume, thereby precluding single nucleotide resolution. Present confocal-fluorescence microscopes typically provide a far field detection system to interrogate either capillaries or slab gels with a limiting sensitivity, defined as a signal-to-noise ratio of 1, of about 10^-17 mole of fluorescently labeled DNA per band and a spatial resolution ranging from 10 µm (Smith L. M., et al., Nature, vol. 321, 12 Jun. 1986). Based upon several theoretical approaches of band broadening in sequencing analysis by gel electrophoresis (Y. F. Chen et al., Anal. Chem., 62, 496-503, 1990), a theoretical peak width of a band may be determined to be a complex function of starting conditions (i.e., injection time and volume), detection (spot size of the focused laser beam), diffusion and thermal gradient variances.
Now, starting conditions begin with an injection process.
During an injection process, which comprises loading biomolecules in the gel, the biomolecules are not stacked by moving boundaries of buffer conditions, and the biomolecules therefore enter the gel at different rates corresponding to their electrophoretic velocity in the gel, thereby contributing to the net effect on the band width variance. Subsequent detection of the biomolecule may comprise using a focused laser with a Gaussian beam profile. For this situation, a standard deviation of the beam profile can be estimated to be equal to one-half the beam spot. This yields a detection variance of the form $\sigma_{det}^{2} = (w/2)^{2} = w^{2}/4$, where w is the spot size. In most conventional equipment, lenses or fiber optics may be used to focus the laser on the slab gel or filled gel capillary vessel, but due to an orthogonal direction of the excitation radiation with the emitted radiation, the numerical aperture of the lens of the optical detection system may be limited to about 0.20-0.75. For example, several collinear arrangements for on-column detection in capillary electrophoresis have been reported using narrower capillaries and higher numerical aperture, permitting more fluorescence to be collected, thereby contributing to sensitivity improvement.
In preparation for gel electrophoresis, a sample is loaded in each lane of a slab gel in a well of typically 0.4 mm × 6 mm, or 2.4 mm^2, whilst for example in a 50 µm capillary, the surface area of the top of the gel is one thousandth of that in the slab gel, corresponding to about 10^-17 mole of sample in a given band. Accordingly, loading conditions not taking advantage of sample stacking and optical diffraction threshold of detection system may be significant sources of band broadening, affecting resolution.
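The "one thousandth" area ratio quoted above can be checked with a short calculation. This sketch assumes that the 50 µm figure is the inner diameter of a capillary of circular cross-section; the variable names are illustrative only.

```python
import math

# Slab-gel well dimensions as quoted in the text.
well_area_mm2 = 0.4 * 6.0

# Assumption: 50 micrometres is the inner diameter of a circular capillary.
capillary_diameter_mm = 0.050
capillary_area_mm2 = math.pi * (capillary_diameter_mm / 2) ** 2

ratio = capillary_area_mm2 / well_area_mm2
print(f"well area      = {well_area_mm2:.2f} mm^2")
print(f"capillary area = {capillary_area_mm2:.5f} mm^2")
print(f"ratio          = {ratio:.2e}")
```

Under these assumptions the ratio comes out somewhat below 10^-3, consistent in order of magnitude with the figure stated in the text.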
In sharp contrast, the procedures and embodiments of the present invention define innovative approaches to overcoming the above limitations by employing, in a specific embodiment, a mechanism that can focus sample bands to the sample dimensions, at least 0.1 micron, and a near-field detection system that permits spatial resolution beyond the diffraction limit, thereby extending the limit of concentration detection to at least the mass of a single molecule.
One way to overcome the low mobility of conventional gel electrophoresis is to use free-solution electrophoresis. Here, however, there is no dispersion in mobility with molecular length (M bases). This is because mobility (velocity divided by electric field) is equal to electric charge divided by friction coefficient, and both electric charge and friction coefficient scale linearly with molecular length, M. In Mayer et al. (Anal. Chem. 1994, 66, 1777-1780), there is a proposal to attach a large molecule at the end of each fragment in order to add a constant friction contribution to each. In this way, mobility is no longer independent of the number of bases. Theoretical calculations based on this reference suggest that the resulting dispersion can allow one to separate 3000 nucleotides in five minutes, in a best case comprising a far field detection limit.
Finally, we reference in passing proposed advanced technologies comprising large-scale automated DNA sequencing methodologies, namely, applying mass spectrometry to fast sequencing of DNA, or sequencing by hybridization. See references 1) R. J. Lewis et al., J. Am. Chem. Soc., 113, 9665, 1991 and 2) R. Drmanac et al., "Sequencing of Megabase Plus DNA by Hybridization: Theory of the Method" in Genomics, vol. 4, pp. 114-118 (1989), respectively.
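The scaling argument above can be made concrete with a minimal sketch. It assumes the simplest reading of the text: charge and friction are each proportional to M, and the attached end molecule is uncharged and contributes only a constant friction term. The function name `mobility`, its parameters and the arbitrary units are hypothetical and are not taken from the cited reference.

```python
def mobility(m_bases, charge_per_base=1.0, friction_per_base=1.0,
             label_friction=0.0):
    """Mobility = total charge / total friction, in arbitrary units.

    label_friction is the constant friction added by an end label
    (assumed uncharged); zero corresponds to an unlabelled fragment.
    """
    charge = charge_per_base * m_bases
    friction = friction_per_base * m_bases + label_friction
    return charge / friction


for m in (100, 1000, 3000):
    free = mobility(m)
    labelled = mobility(m, label_friction=50.0)
    print(f"M={m:5d}  unlabelled={free:.4f}  end-labelled={labelled:.4f}")
```

Without the label the M dependence cancels and every fragment has the same mobility; with the constant term, mobility rises with M, which is the dispersion the separation relies on.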
We have now discovered an approach to biomolecular code sequencing which is qualitatively distinct from the prior art. This different approach is manifest in a novel method and assembly suitable for identifying a code sequence of at least a portion of a biomolecule.
The method comprises the steps of:
1) using a near-field probe technique for generating a super-resolution chemical analysis of the portion of a biomolecule; and
2) correlating the chemical analysis with a broad spectral content of a reference biomolecule for generating a code sequencing.
The assembly comprises:
1) first means for migrating and separating a portion of a biomolecule in a free-solution;
2) second means comprising a near-field probe for generating a super-resolution chemical analysis of the portion of the biomolecule; and
3) third means for correlating the super-resolution chemical analysis of the portion of the biomolecule with a broad spectral content of a referent biomolecule, for generating a code sequencing of the portion of the biomolecule.
The present invention as defined can realize several significant advantages.
First of all, the novel method and assembly have an immanent capability for generating nucleotide sequencing information of such a quality, quantity and time-responsiveness, that heretofore even merely theorized applications requiring such information can now become a straightforward reality. For example, the invention can be employed for developing a map that accurately reflects both individual nucleotide identification (i.e., A, G, C and T) and the location of an individual nucleotide with respect to a strand of arbitrary length, including an entire genome.
In this sense, moreover, the present invention can evince a remarkable versatility, since it may be selectively and variously employed e.g., in dependent steps, for:
1) identifying a first nucleotide from a second (adjacent) nucleotide; or
2) locating with respect to an arbitrary strand or to a genome, a location of an identified nucleotide; or
3) identifying a first duplet, codon, gene from a second (adjacent) duplet, codon, gene; or
4) locating with respect to an arbitrary strand or to a genome, a location of an identified duplet, codon, gene.
To this end, the present invention has a capability for generating a fast and/or high throughput code sequence, e.g., comprising at least 1000 bases/portion of biomolecule (for example, 3 kilobases in less than 5 minutes), preferably at least 100 kilobases/portion of biomolecule within less than 1 hour, and particularly an entire human genome within less than one day.
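For orientation only, the minimum average read rates implied by these targets can be worked out as in the sketch below. The human genome size of 3 × 10^9 nucleotides is the figure given earlier in this description; the helper name `required_rate` is hypothetical, and the rates are simple averages that say nothing about how they would be achieved.

```python
def required_rate(bases, seconds):
    """Minimum average rate, in bases per second, to read `bases` in `seconds`."""
    return bases / seconds


targets = [
    ("3 kilobases in 5 minutes", 3_000, 5 * 60),
    ("100 kilobases in 1 hour", 100_000, 60 * 60),
    ("human genome (3 x 10^9) in 1 day", 3_000_000_000, 24 * 60 * 60),
]

for label, bases, seconds in targets:
    print(f"{label:34s} >= {required_rate(bases, seconds):10.1f} bases/s")
```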
Other advantages of the present invention proceed from the following considerations. An application of the method can generate, for the first time, nucleotide information of a quality and quantity sui generis. This information, in turn, can become a centerpiece for new and efficient approaches to gene testing or drug design, DNA sequence homology or biomolecular computing.
Other advantages of the present invention are enumerated below.