The present invention relates to methods and apparatuses for analyzing molecules, particularly polymers, and molecular complexes with extended or rod-like conformations. In particular, the methods and apparatuses are used to identify repetitive information, e.g., sequence information, in molecules or molecular ensembles, which is subsequently used to determine structural information about the molecules. The methods are based on the use of an autocorrelation function to identify common information in multiple molecules having at least one overlapping repetitive sequence.
Macromolecules are involved in diverse and essential functions in living systems. The ability to decipher the functions, dynamics, and interactions of macromolecules is dependent upon an understanding of their chemical and three-dimensional structures. These three aspects (chemical structure, three-dimensional structure, and dynamics) are interrelated. For example, the chemical composition of a protein, and more particularly the linear arrangement of amino acids, explicitly determines the three-dimensional structure into which the polypeptide chain folds after biosynthesis (Kim and Baldwin (1990) Ann. Rev. Biochem. 59: 631-660), which in turn determines the interactions that the protein will have with other macromolecules, and the relative mobilities of domains that allow the protein to function properly.
Biological macromolecules are either polymers or complexes of polymers. Different types of macromolecules are composed of different types of monomers, i.e., twenty amino acids in the case of proteins and four major nucleobases in the case of nucleic acids. A wealth of information can be obtained from a determination of the linear, or primary, sequence of the monomers in a polymer chain. For example, by determining the primary sequence of a nucleic acid, it is possible to determine the primary sequences of proteins encoded by the nucleic acid, to generate expression maps for the determination of mRNA expression patterns, to determine protein expression patterns, and to understand how mutations in genes correspond to a disease state. Furthermore, the characteristic pattern of distribution of specific nucleobase sequences along a particular DNA polymer can be used to unequivocally identify the DNA, as in forensic analysis.
In general, DNA identification and sequencing have been performed using methods, such as those described by Maxam and Gilbert (Maxam and Gilbert (1977) Proc. Natl. Acad. Sci. USA 74: 560-564) and by Sanger (Sanger et al. (1977) Proc. Natl. Acad. Sci. USA 74: 5463-5467), that determine the exact sequence of relatively short pieces of DNA. There are also techniques that arrange these short DNA fragments of known sequence in the proper order to obtain a longer sequence, such as those described by Evans (U.S. Pat. No. 5,219,726). Other methods of nucleic acid detection and sequencing have been developed; however, these too have limitations in the number of nucleotides they can read, in their abilities to resolve the identities of adjacent nucleotides, and in the practicality of their implementation.
Several methods for rapid sequencing of nucleic acids have been developed that use exonucleases to cleave individual bases from the nucleic acid polymer, which are subsequently identified in order to generate the sequence of the nucleic acid. U.S. Pat. No. 4,962,037 discloses a method wherein the nucleic acid fragment is suspended in a flowing stream while an exonuclease sequentially cleaves individual bases from the end of the fragment. The flowing stream delivers the cleaved bases in an ordered fashion to a detector for subsequent identification. A similar approach with some modifications is disclosed in U.S. Pat. No. 5,674,743. In this method, the DNA strand to be sequenced is processed with an exonuclease to cleave bases from the strand, and each cleaved base is then transported away from the strand and is incorporated into a fluorescence-enhancing matrix. In a particular embodiment, the intrinsic fluorescence of the nucleotide is induced and is used to identify it. Using a processive exonuclease, it is theoretically possible to sequence 10,000 bases or more at a rate of 10 bases per second. However, exonuclease sequencing has encountered many problems. If extrinsic labels are used to identify each base, all four bases must be tagged with, e.g., different fluorophores, which is sterically difficult; in addition, introduction of fluorophores may interfere with the enzymatic activity of the exonuclease. Furthermore, difficult optical trapping is needed to suspend DNA molecules in a flowing stream. Lastly, single molecules of fluorophore need to be detected with high efficiency, and only 95% efficiency has been achieved.
Methods of nucleic acid sequencing by hybridization with a specific set of oligonucleotide probes are also known in the art (Strezoska et al. (1991) Proc. Natl. Acad. Sci. USA 88: 10089-10093; Bains (1992) BioTechnology 10: 757-758). Although this approach is very costly to set up, sequencing by these methods is ultimately low-cost ($0.03-0.08 per base). Another advantage is the potential integration of the technique with microelectronics using special microchips for sequencing of nucleic acid fragments and even analysis of entire genomes (Service (1998) Science 282: 396-399 and 399-401). Traditional sequencing-by-hybridization techniques have the limitation of imperfect hybridization, especially under conditions in which hybridization is not favored, e.g., low salt, or upon formation of secondary structure in the target nucleic acid, which interferes with binding to the probes. Imperfect hybridization leads to difficulties in generating adequate sequence because the error in hybridization is amplified many times.
U.S. Pat. No. 5,846,727 discloses a microsystem for rapid DNA sequencing in which a DNA template is amplified using the polymerase chain reaction ("PCR") and the PCR products are labeled and immobilized on a capillary tube wall. Then, Sanger extension products of the amplified DNA are prepared, labeled, and electrophoretically separated in a capillary channel. Near-infrared, laser-induced fluorescence of the oligonucleotides is detected. The same fluorophore is used to label all bases; however, different bases can be distinguished by differences in the fluorescence lifetimes induced by the different bases upon labeling.
The substrate used is selected for compatibility with both the solutions and the conditions to be used in analysis, including but not limited to extremes of salt concentrations, acid or base concentration, temperature, electric fields, and transparency to wavelengths used for optical excitation or emission. The substrate material may include those associated with the semiconductor industry, such as fused silica, quartz, silicon, or gallium arsenide, or inert polymers such as polymethylmethacrylate, polydimethylsiloxane, polytetrafluoroethylene, polycarbonate, or polyvinylchloride. Because of its transmissive properties across a wide range of wavelengths, quartz is a preferred embodiment.
The use of quartz as a substrate with an aqueous solution means that the surface in contact with the solution has a positive charge. When working with charged molecules, especially under electrophoresis, it is desirable to have a neutral surface. In one embodiment, a coating is applied to the surface to eliminate the interactions which lead to the charge. The coating may be obtained commercially (capillary coatings by Supelco, Bellefonte, Pa.), or it can be applied by the use of a silane with a functional group on one end. The silane end will bond effectively irreversibly with the glass, and the functional group can react further to make the desired coating. For DNA, a silane with polyethyleneoxide effectively prevents interaction between the polymer and the walls without further reaction, and a silane with an acrylamide group can participate in a polymerization reaction to create a polyacrylamide coating which not only does not interact with DNA, but also inhibits electro-osmotic flow during electrophoresis.
The microchannels may be constructed on the substrate by any number of techniques, many derived from the semiconductor industry, depending on the substrate selected. These techniques include, but are not limited to, photolithography, reactive ion etching, wet chemical etching, electron beam writing, laser or air ablation, LIGA, and injection molding. A variety of these techniques applied to polymer-handling chips have been discussed in the literature, including by Harrison et al. (Analytical Chemistry 1992 (64) 1926-1932), Seiler et al. (Analytical Chemistry 1993 (65) 1481-1488), Woolley et al. (Proceedings of the National Academy of Sciences November 1994 (91) 11348-11352), and Jacobsen et al. (Analytical Chemistry 1995 (67) 2059-2063).
The disclosed microsystem offers several advantages, such as the need for only sub-microliter volumes of expensive reagents, the ability to automate the procedure and perform several analyses simultaneously, and the use of a "highly efficient base-calling scheme using a single lane, single-dye format". Despite these advantages, typical read lengths of this method are still only on the order of 400-500 bases.
There are several other methods (U.S. Pat. No. 4,962,037 and U.S. Pat. No. 5,674,743, discussed above) that can be used to sequence long DNA molecules. However, the maximal length of a single DNA fragment that can be sequenced by existing techniques is still less than 2,000 bases (Mullikin and McMurray (1999) Science 283: 1867-1868; Sinclair (1999) The Scientist 15 (9): 18-20).
Methods have also been developed for quantitative detection of macromolecules in a sample. Recent developments in experimental techniques and available hardware have increased dramatically the sensitivity of detection so that optical measurements can be made of even single molecules in a sample. Such measurements can be done in aqueous solution, at room temperature (Weiss (1999) Science 283: 1676-1683), and in very small volumes to reduce background scattering.
Fluorescence correlation spectroscopy ("FCS") uses an autocorrelation function to process fluctuations in fluorescence emission from a restricted volume (Elson and Magde (1974) Biopolymers 13:1-27). This approach is essentially based on the assumptions that: (a) one or zero fluorescent molecules can be within an illuminated volume; and (b) the fluorescence emitted by the fluorescent molecule in the illuminated volume noticeably exceeds background. The detected fluorescent bursts, whose lengths are related to the time a molecule spends within the illuminated volume, can be used to identify and count molecules, as well as to determine diffusion coefficients (U.S. Pat. No. 4,979,824; WO Pat. No. 94/16313; the latter uses FCS).
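As a minimal sketch of the kind of autocorrelation function FCS relies on, the following Python computes a normalized fluctuation autocorrelation, G(lag) = <dF(t) dF(t+lag)> / <F>^2, of a binned intensity trace. The function name, the binning, and the example trace are illustrative assumptions and are not taken from the cited references.

```python
def fluctuation_autocorrelation(trace, max_lag):
    """Return [G(0), ..., G(max_lag)] for a binned fluorescence trace.

    G(lag) is the mean product of intensity fluctuations separated by
    `lag` bins, normalized by the squared mean intensity.
    """
    n = len(trace)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the trace length")
    mean = sum(trace) / n
    if mean == 0:
        raise ValueError("trace has zero mean intensity")
    deltas = [f - mean for f in trace]
    result = []
    for lag in range(max_lag + 1):
        pairs = n - lag  # number of (t, t + lag) pairs available
        acc = sum(deltas[t] * deltas[t + lag] for t in range(pairs))
        result.append(acc / pairs / mean ** 2)
    return result


# Hypothetical trace: two bursts, each 4 bins long, on a dark background.
trace = [0] * 20
for start in (3, 12):
    for t in range(start, start + 4):
        trace[t] = 10

g = fluctuation_autocorrelation(trace, 6)
```

In this toy trace the correlation decays over a few bins, on the order of the burst length, which is the qualitative link between burst duration and residence time that the paragraph above describes.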
Eigen and Rigler [(1994) Proc. Natl. Acad. Sci. USA 91: 5740-5747] describe the use of FCS for parallel screening of large amounts of genetic material for a particular sequence pattern. In particular, the interaction of a fluorescent ligand, e.g., a labeled oligonucleotide, with a larger target DNA can be measured by the correlation function describing the diffusion of the free and bound ligand. An oligonucleotide hybridized with a large DNA fragment would diffuse more slowly than free oligonucleotide, and therefore, the bound form of the fluorescent oligonucleotide exhibits longer photon bursts. A modification of this technique uses the cross correlation of signals obtained from different oligonucleotides labeled with different fluorophores to detect the presence of different oligonucleotide sequences within a DNA target sample (Schwille et al. (1997) Biophys. J. 72: 1878-1886).
PCT Publication No. WO 98/10097 discloses a method and apparatus for detection of single molecules emitting two-color fluorescence and determination of molecular weight and concentration of the molecules. The method involves the labeling of individual molecules with at least two fluorescent probes. The velocity is determined by measuring the time required for the molecules to travel a fixed distance between two laser beams. Comparison of the molecule's velocity with that of standard species permits determination of the molecular weight of the molecule, which may be present in a concentration as small as one femtomolar. The accuracy of the technique is limited by the time the molecule under scrutiny spends traveling through the spot of the focused laser beam. The diameter of the laser beam is diffraction limited and exceeds 0.4 μm for visible light.
Castro and Shera [(1995) Anal. Chem. 67: 3181-3186] describe the use of single molecule electrophoresis (SME) for the detection and identification of single molecules in solution. The technique involves the determination of electrophoretic velocities by measuring the time required for individual molecules labeled with a single fluorophore to travel a fixed distance between two laser beams. This technique has been applied to DNA, to fluorescent proteins and to simple organic fluorophores. An advantage of SME over conventional zone electrophoresis is that SME is a continuous flow system that permits real-time analysis, which is important when sample concentration and/or composition changes with time. The disclosed system has disadvantages when applied to the detection of a specific DNA sequence within a large genomic background. If a single fluorescent probe complementary to the sequence of interest is used, it can bind non-specifically to other sequences in the genomic DNA, which results in detection of a false positive. Moreover, an unbound probe also produces a detectable signal that could be misinterpreted as the presence of the target sequence.
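The transit-time measurement described for both WO 98/10097 and SME reduces to dividing the known beam separation by the time between the two detection events. The sketch below illustrates that arithmetic, together with the time a molecule spends inside a diffraction-limited spot. The function names, the 250 μm separation, and the timestamps are hypothetical values chosen for illustration; only the 0.4 μm spot diameter comes from the text above.

```python
def transit_velocity(beam_separation_um, t_first_s, t_second_s):
    """Velocity (um/s) of a molecule detected at two beams in sequence."""
    dt = t_second_s - t_first_s
    if dt <= 0:
        raise ValueError("second detection must occur after the first")
    return beam_separation_um / dt


def dwell_time(spot_diameter_um, velocity_um_per_s):
    """Time (s) a molecule spends crossing a spot of the given diameter."""
    if velocity_um_per_s <= 0:
        raise ValueError("velocity must be positive")
    return spot_diameter_um / velocity_um_per_s


# Hypothetical example: beams 250 um apart, bursts detected at 10 ms and 135 ms.
velocity = transit_velocity(250.0, 0.010, 0.135)
# Crossing time for a 0.4 um spot, which bounds how sharply each
# detection event can be timestamped.
crossing = dwell_time(0.4, velocity)
```

Because each detection event lasts as long as the molecule takes to cross the spot, the spot diameter sets a floor on the timing uncertainty, and therefore on the accuracy of the velocity estimate.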
U.S. Pat. No. 5,807,677 discloses a method and device for direct identification of a specific target nucleic acid sequence having a low copy number in a test solution. This method involves the preparation of a reference solution of a mixture of different short oligonucleotides. Each oligonucleotide includes a sequence complementary to a section of the target sequence and is labeled with one or more fluorescent dye molecules. The reference solution is incubated with the test solution under conditions favorable to hybridization of the short oligonucleotides with the nucleic acid target. The target sequence is identified in the solution by detection of the nucleic acid strands to which one or more of the labeled oligonucleotides are hybridized. To amplify the fluorescence signal, a "cocktail" of different oligonucleotides is used, which are capable of hybridizing with sequences adjacent to but not overlapping with the target sequence. The disadvantage of this method is that, in order to design probes of the proper sequence, the exact sequence of the target nucleic acid and surrounding sequences must be known.
PCT Publication No. WO 96/06189 describes a method for quantitative detection of oligonucleotides using capillary electrophoresis. Typically, capillary electrophoresis employs fused silica capillary tubes whose inner diameters are between about 10-200 microns, and which can range in length between about 5-100 cm or more. As the inner diameter of such a capillary is small, electric fields 10 to 100 times stronger than those applicable in conventional electrophoretic systems can be applied because of reduced Joule heating. This permits very high speeds and superior resolution. In the methods described in PCT Publication No. WO 96/06189, a fluorescently labeled peptide nucleic acid ranging in size from 5-50 monomers is hybridized to a DNA sample and capillary electrophoresis through a polyacrylamide gel is performed under denaturing conditions (7 M urea) where the PNA/DNA complex is stable. This method suffers from limited detection sensitivity and cannot be used to detect single copy genes in large genomes.
The existing methods for sequencing polymers and for detecting the presence of small amounts of specific polymers in a sample each have drawbacks. The major drawbacks of sequencing techniques are that they are slow, labor intensive, and have fairly short read lengths (under 2,000 bases for nucleic acid sequencing) and limited accuracy. The methods for detecting molecules in a sample have the drawbacks of lack of sensitivity, frequent occurrence of false positive results, and, in some cases, a requirement that the sequence of the molecule to be detected must already be known. Clearly, there is a need for faster, simpler, more reliable and more universally applicable methods of sequencing and of detecting copies of sequences in a sample in order to elucidate complex genetic function and diagnose diseases and genetic dysfunctions more rapidly and accurately.
Citation of a reference herein shall not be construed as indicating that such reference is prior art to the present invention.
In a first embodiment, the present invention relates to a method for analyzing an extended object comprising: (a) moving with respect to at least one station a plurality of similar extended objects that are each similarly labeled with at least two unit-specific markers to generate a plurality of object-dependent impulses as the labeled extended objects pass the station; (b) measuring the generated plurality of object-dependent impulses as a function of one or more system parameters; and (c) calculating an autocorrelation function of said object-dependent impulses, to analyze the extended object.
In a second embodiment, the present invention relates to a method for analyzing an extended object comprising calculating an autocorrelation function of object-dependent impulses.
In a third embodiment, the present invention relates to an article of manufacture comprising a lattice of spherical beads having a plurality of fixed stations with at least one fluorophore positioned at each fixed station.
In a fourth embodiment, the present invention relates to a system for analyzing an extended object labeled with at least two unit-specific markers comprising: a central processing unit; an input device for inputting a plurality of object-dependent impulses of an extended object; an output device; a memory; at least one bus connecting the central processing unit, the memory, the input device, and the output device; the memory storing a calculating module configured to calculate an autocorrelation function for said plurality of object-dependent impulses of said extended object input using said input device.
In a fifth embodiment, the present invention relates to a computer program product for use in conjunction with a computer, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising a calculating module configured to calculate an autocorrelation function of a plurality of object-dependent impulses.
The methods, articles of manufacture, computer system, and computer program products of the invention are useful for analyzing polymers, particularly DNA.
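As a rough illustration of the first embodiment, the following sketch simulates many similarly labeled extended objects passing a single station, records the resulting impulses as a function of time, and computes an autocorrelation of the pooled signal. It is a toy model under stated assumptions, not the claimed method: time is assumed to be the system parameter, every object is assumed to move at the same speed and to carry exactly two markers, and all names and numeric values (MARKER_SPACING, N_OBJECTS, the seed) are hypothetical.

```python
import random

MARKER_SPACING = 37   # assumed marker separation, in time bins
N_OBJECTS = 60        # assumed number of similar labeled objects
N_BINS = 5000         # length of the recorded signal, in time bins
MAX_LAG = 80          # largest lag examined


def simulate_impulses(rng):
    """Pooled impulse train: each object adds two impulses at a fixed spacing."""
    signal = [0] * N_BINS
    for _ in range(N_OBJECTS):
        arrival = rng.randrange(0, N_BINS - MARKER_SPACING)
        signal[arrival] += 1                   # first marker passes the station
        signal[arrival + MARKER_SPACING] += 1  # second marker passes the station
    return signal


def autocorrelation(signal, max_lag):
    """Unnormalized autocorrelation sum_t s[t] * s[t + lag] for lag = 1..max_lag."""
    n = len(signal)
    return {
        lag: sum(signal[t] * signal[t + lag] for t in range(n - lag))
        for lag in range(1, max_lag + 1)
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
signal = simulate_impulses(rng)
acf = autocorrelation(signal, MAX_LAG)
best_lag = max(acf, key=acf.get)  # lag with the strongest correlation
```

Each object contributes to the autocorrelation at the lag equal to its marker spacing, while impulses from different objects coincide only by chance, so the shared spacing stands out as the dominant peak even though the arrival times are random.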