1. Field of the Invention
The present invention generally relates to the field of molecular biology. The invention particularly provides novel methods and compositions to enable highly efficient sequencing of nucleic acid molecules. The methods of the invention are suitable for sequencing long nucleic acid molecules, including chromosomes and RNA, without cloning or subcloning steps.
2. Description of the Related Art
Nucleic acid sequencing forms an integral part of scientific progress today. Determining the sequence, i.e. the primary structure, of nucleic acid molecules and segments is important in regard to individual projects investigating a range of particular target areas. Information gained from sequencing impacts science, medicine, agriculture and all areas of biotechnology. Nucleic acid sequencing is, of course, vital to the human genome project and other large-scale undertakings, the aim of which is to further our understanding of evolution and the function of organisms and to provide an insight into the causes of various disease states.
The utility of nucleic acid sequencing is evident; for example, the Human Genome Project (HGP), a multinational effort devoted to sequencing the entire human genome, is in progress at various centers. However, progress in this area is generally both slow and costly. Nucleic acid sequences are usually determined on polyacrylamide gels that separate DNA fragments in the range of 1 to 500 bp, differing in length by one nucleotide. The actual determination of the sequence, i.e., the order of the individual A, G, C and T nucleotides, may be achieved in two ways. The first is the Maxam and Gilbert method of chemically degrading the DNA fragment at specific nucleotides (Maxam and Gilbert, 1977); the second is the dideoxy chain termination sequencing method described by Sanger and colleagues (Sanger et al., 1977). Both methods are time-consuming and laborious.
More recently, other methods of nucleic acid sequencing have been proposed that do not employ an electrophoresis step; these methods may be collectively termed Sequencing By Hybridization or SBH (Drmanac et al., 1991; Cantor et al., 1992; Drmanac and Crkvenjakov, U.S. Pat. No. 5,202,231). Development of certain of these methods has given rise to new solid support type sequencing tools known as sequencing chips. The utility of SBH in general is evidenced by the fact that U.S. Patents have been granted on this technology. However, although SBH has the potential for increasing the speed with which nucleic acids can be sequenced, all current SBH methods still suffer from several drawbacks.
SBH can be conducted in two basic ways, often referred to as Format 1 and Format 2 (Cantor et al., 1992). In Format 1, oligonucleotides of unknown sequence, generally of about 100-1000 nucleotides in length, are arrayed on a solid support or filter so that the unknown samples themselves are immobilized (Strezoska et al., 1991; Drmanac and Crkvenjakov, U.S. Pat. No. 5,202,231). Replicas of the array are then interrogated by hybridization with sets of labeled probes of about 6 to 8 residues in length. In Format 2, a sequencing chip is formed from an array of oligonucleotides with known sequences of about 6 to 8 residues in length (Southern, WO 89/10977; Khrapko et al., 1991; Southern et al., 1992). The nucleic acids of unknown sequence are then labeled and allowed to hybridize to the immobilized oligos.
Unfortunately, both of these SBH formats have several limitations, particularly the requirement for prior DNA cloning steps. In Format 1, other significant problems include attaching the various nucleic acid pieces to be sequenced to the solid surface support or preparing a large set of longer probes. In Format 2, major problems include labelling the nucleic acids of unknown sequence, high noise to signal ratios that generally result, and the fact that only short sequences can be determined. Further problems of Format 2 include the secondary structure formation that prevents access to some targets and the different conditions that are necessary for probes with different GC contents. Therefore, the art would clearly benefit from a new procedure for nucleic acid sequencing, and particularly, one that avoids the tedious processes of cloning and/or subcloning.
The present invention seeks to overcome these and other drawbacks inherent in the prior art by providing new methods and compositions for the sequencing of nucleic acids. The novel techniques described herein have been generally termed Format 3 by the inventors and represent marked improvements over the existing Format 1 and Format 2 SBH methods. In the Format 3 sequencing provided by the invention, nucleic acid sequences are determined by means of hybridization with two sets of small oligonucleotide probes of known sequences. The methods of the invention allow high discriminatory sequencing of extremely large nucleic acid molecules, including chromosomal material or RNA, without prior cloning, subcloning or amplification. Furthermore, the present methods do not require large numbers of probes, the complex synthesis of longer probes, or the labelling of a complex mixture of nucleic acid segments.
To determine the sequence of a nucleic acid according to the methods of the present invention, one would generally identify sequences from the nucleic acid by hybridizing with complementary sequences from two sets of small oligonucleotide probes (oligos) of defined length and known sequence, which cover most combinations of sequences for that length of probe. One would then analyze the sequences identified to determine stretches of the identified sequences that overlap, and reconstruct or assemble the complete nucleic acid sequence from such overlapping sequences.
The sequencing methods may be conducted using sequential hybridization with complementary sequences from the two sets of small oligos. Alternatively, a mode described as “cycling” may be employed, in which the two sets of small oligos are hybridized with the unknown sequences simultaneously. The term “cycling” is applied as the discriminatory part of the technique comes from then increasing the temperature to “melt” those hybrids that are non-complementary. Such cycling techniques are commonly employed in other areas of molecular biology, such as PCR, and will be readily understood by those of skill in the art in light of the present disclosure.
The invention is applicable to sequencing nucleic acid molecules of very long length. As a practical matter, the nucleic acid molecule to be sequenced will generally be fragmented to provide small or intermediate length nucleic acid fragments that may be readily manipulated. The term nucleic acid fragment, as used herein, most generally means a nucleic acid molecule of between about 10 base pairs (bp) and about 100 bp in length. The most preferred methods of the invention are contemplated to be those in which the nucleic acid molecule to be sequenced is treated to provide nucleic acid fragments of intermediate length, i.e., of between about 10 bp and about 40 bp. However, it should be stressed that the present invention is not a method of completely sequencing small nucleic acid fragments; rather, it is a method of sequencing nucleic acid molecules per se, which involves determining portions of sequence from within the molecule, whether this is done using the whole molecule or, for simplicity, by first fragmenting the molecule into smaller sized sections of from about 4 to about 1000 bases.
Sequences from nucleic acid molecules are determined by hybridizing to small oligonucleotide probes of known sequence. In referring to “small oligonucleotide probes”, the term “small” means probes of less than 10 bp in length, and preferably, probes of between about 4 bp and about 9 bp in length. In one exemplary sequencing embodiment, probes of about 6 bp in length are contemplated to be particularly useful. For the sets of oligos to cover all combinations of sequences for the length of probe chosen, their number will be represented by 4^F, wherein F is the length of the probe. For example, for a 4-mer, the set would contain 256 probes; for a 5-mer, the set would contain 1024 probes; for a 6-mer, 4096 probes; a 7-mer, 16384 probes; and the like. The synthesis of oligos of this length is very routine in the art and may be achieved by automated synthesis.
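By way of illustration only, the complete probe set for a given probe length can be enumerated programmatically. The following is a minimal sketch in Python; the helper name `all_probes` is a hypothetical label introduced here and is not taken from the disclosure.

```python
from itertools import product

BASES = "ACGT"

def all_probes(length):
    """Return every possible oligonucleotide of the given length (4**length probes)."""
    return ["".join(combo) for combo in product(BASES, repeat=length)]

# The set size grows as 4**F, where F is the probe length.
for f in (4, 5, 6, 7):
    print(f"{f}-mer set: {len(all_probes(f))} probes")
```

The loop reports the set sizes for 4-mers through 7-mers, which can be compared with the figures given in the text.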
In the methods of the invention, one set of the small oligonucleotide probes of known sequence, which may be termed the first set, will be attached to a solid support, i.e., immobilized on that support in such a way that they are available to take part in hybridization reactions. The other set of small oligonucleotide probes of known sequence, which may be termed the second set, will be probes that are in solution and that are labelled with a detectable label. The sets of oligos may include probes of the same or different lengths.
The process of sequential hybridization means that nucleic acid molecules, or fragments, of unknown sequence can be hybridized to the distinct sets of oligonucleotide probes of known sequences at separate times (FIG. 1). The nucleic acid molecules or fragments will generally be denatured, allowing hybridization, and added to the first, immobilized set of probes under discriminating hybridization conditions to ensure that only fragments with complementary sequences hybridize. Fragments with non-complementary sequences are removed and the next round of discriminating hybridization is then conducted by adding the second, labelled set of probes, in solution, to the combination of fragments and probes already formed. Labelled probes that hybridize adjacent to a fixed probe will remain attached to the support and can be detected, which is not the case when there is space between the fixed and labelled probes (FIG. 1).
The process of simultaneous hybridization means that the unknown sequence nucleic acid molecules can be contacted with the distinct sets of oligonucleotide probes of known sequences at the same time. Hybridization will occur under discriminating hybridization conditions. Fragments with non-complementary sequences are then “melted”, i.e., removed by increasing the temperature, and the next round of discriminating hybridization is then conducted, allowing any complementary second probes to hybridize. Labelled probes that hybridize adjacent to a fixed probe will then be detected in the same manner.
Nucleic acid sequences that are “complementary” are those that are capable of base-pairing according to the standard Watson-Crick complementarity rules, and variations of the rules as they apply to modified bases. That is, the larger purines, or modified purines, will always base pair with the smaller pyrimidines to form only known combinations. These include the standard pairs of guanine paired with cytosine (G:C) and adenine paired with either thymine (A:T), in the case of DNA, or uracil (A:U), in the case of RNA. The use of modified bases, or the so-called Universal Base (M, Nichols et al., 1994) is also contemplated.
As used herein, the term “complementary sequences” means nucleic acid sequences that are substantially complementary over their entire length and have very few base mismatches. For example, nucleic acid sequences of six bases in length may be termed complementary when they hybridize at five out of six positions with only a single mismatch. Naturally, nucleic acid sequences that are “completely complementary” will be nucleic acid sequences that are entirely complementary throughout their entire length and have no base mismatches.
After identifying, by hybridization to the oligos of known sequence, various individual sequences that are part of the nucleic acid fragments, these individual sequences are next analyzed to identify stretches of sequences that overlap. For example, portions of sequences in which the 5′ end is the same as the 3′ end of another sequence, or vice versa, are identified. The complete sequence of the nucleic acid molecule or fragment can then be delineated, i.e., it can be reconstructed from the overlapping sequences thus determined.
The processes of identifying overlapping sequences and reconstructing the complete sequence will generally be achieved by computational analysis. For example, if a labelled probe 5′-TTTTTT-3′ hybridizes to the spot containing the fixed probe 5′-AAAAAA-3′, a 12-mer sequence from within the nucleic acid molecule is defined, namely 5′-AAAAAATTTTTT-3′ (SEQ ID NO:1), i.e. the sequence of the two hybridized probes is combined to reveal a previously unknown sequence. The next question to be answered is which nucleotide follows the newly determined 5′-AAAAAATTTTTT-3′ (SEQ ID NO:1) sequence. There are four possibilities represented by the fixed probe 5′-AAAAAT-3′ and labelled probes 5′-TTTTTA-3′ for A; 5′-TTTTTT-3′ for T; 5′-TTTTTC-3′ for C; and 5′-TTTTTG-3′ for G. If, for example, the probe 5′-TTTTTC-3′ is positive and the other three are negative, then the assembled sequence is extended to 5′-AAAAAATTTTTTC-3′ (SEQ ID NO:2). In the next step, an algorithm determines which of the labelled probes TTTTCA, TTTTCT, TTTTCC or TTTTCG are positive at the spot containing the fixed probe AAAATT. The process is repeated until all positive (F+P) oligonucleotide sequences are used or defined as false positives.
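The distinction between complementary and completely complementary sequences can be illustrated with a short sketch. It assumes unmodified DNA bases only, with both sequences written 5′ to 3′; the function names and the `max_mismatches` parameter are hypothetical and are introduced here purely for illustration.

```python
# Watson-Crick pairing for unmodified DNA bases (an assumption of this sketch).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the strand that pairs perfectly with seq, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def mismatches(probe, target_site):
    """Count positions at which probe fails to base-pair with target_site."""
    if len(probe) != len(target_site):
        raise ValueError("probe and target site must have equal length")
    perfect_partner = reverse_complement(target_site)
    return sum(1 for a, b in zip(probe, perfect_partner) if a != b)

def is_complementary(probe, target_site, max_mismatches=1):
    """Substantially complementary, e.g. five out of six positions for a 6-mer."""
    return mismatches(probe, target_site) <= max_mismatches

def is_completely_complementary(probe, target_site):
    """Entirely complementary, with no base mismatches."""
    return mismatches(probe, target_site) == 0
```

Under these assumptions, a 6-mer probe with a single mismatch satisfies `is_complementary` but not `is_completely_complementary`.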
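The stepwise extension just described can be sketched as a simple greedy routine. This is an illustrative simplification rather than the algorithm of the disclosure: it assumes an error-free set of positive (F+P)-mers, extends in one direction only, and stops as soon as zero or more than one candidate is found. The function name `extend_sequence` is hypothetical.

```python
def extend_sequence(seed, positives):
    """Extend seed to the right, one base at a time, using overlapping positives.

    seed      -- a positive (F+P)-mer used as the starting point (length >= 2)
    positives -- collection of all positive (F+P)-mers, each the same length as seed
    Returns the assembled sequence and the set of positives left unused.
    """
    k = len(seed)
    unused = set(positives) - {seed}
    assembled = seed
    while True:
        suffix = assembled[-(k - 1):]  # last F+P-1 bases already determined
        candidates = [suffix + base for base in "ACGT" if suffix + base in unused]
        if len(candidates) != 1:       # no extension, or an ambiguous branch point
            break
        assembled += candidates[0][-1]
        unused.remove(candidates[0])
    return assembled, unused

# Positives matching the worked example: the 12-mer AAAAAATTTTTT followed by C.
positives = {"AAAAAATTTTTT", "AAAAATTTTTTC"}
sequence, leftover = extend_sequence("AAAAAATTTTTT", positives)
print(sequence)
```

Positives remaining in `leftover` after extension in both directions would be candidates for classification as false positives, which this sketch does not attempt.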
The present invention thus provides a very effective way to sequence nucleic acid fragments and molecules of long length. Large nucleic acid molecules, as defined herein, are those molecules that need to be fragmented prior to sequencing. They will generally be of at least about 45 or 50 base pairs (bp) in length, and will most often be longer. In fact, the methods of the invention may be used to sequence nucleic acid molecules with virtually no upper limit on length, so that sequences of about 100 bp, 1 kilobase (kb), 100 kb, 1 megabase (Mb), and 50 Mb or more may be sequenced, up to and including complete chromosomes, such as human chromosomes, which are about 100 Mb in length. Such a large number is well within the scope of the present invention and sequencing this number of bases will require two sets of 8-mers or 9-mers (so that F+P≈16-18). The nucleic acids to be sequenced may be DNA, such as cDNA, genomic DNA, microdissected chromosome bands, cosmid DNA or YAC inserts, or may be RNA, including mRNA, rRNA, tRNA or snRNA.
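One way to appreciate why longer combined probe lengths are indicated for longer molecules is to compare the number of distinct (F+P)-mers with the length of the molecule. The sketch below is a back-of-envelope estimate only; it assumes a random sequence with equal base frequencies, which is an assumption introduced here and not a criterion stated in the disclosure. The function names are hypothetical.

```python
def kmer_space(k):
    """Number of distinct sequences of length k over the four bases."""
    return 4 ** k

def expected_chance_occurrences(molecule_length, k):
    """Expected number of times a given k-mer occurs by chance in a random
    sequence of the given length (equal base frequencies assumed)."""
    windows = molecule_length - k + 1
    return windows / kmer_space(k)

chromosome = 100_000_000  # about 100 Mb, as cited in the text
for k in (12, 16, 18):
    print(k, kmer_space(k), expected_chance_occurrences(chromosome, k))
```

On this rough estimate, a given 12-mer is expected to recur several times by chance in 100 Mb, whereas a given 16-mer or 18-mer is expected far less than once, so overlaps of that length are much less likely to be ambiguous.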
The process of determining the sequence of a long nucleic acid molecule involves simply identifying sequences of length F+P from the molecule and combining the sequences using a suitable algorithm. In practical terms, one would most likely first fragment the nucleic acid molecule to be sequenced to produce smaller fragments, such as intermediate length nucleic acid fragments. One would then identify sequences of length F+P by hybridizing, e.g., sequentially hybridizing, the fragments to complementary sequences from the two sets of small oligonucleotide probes of known sequence, as described above. In this manner, the complete nucleic acid sequence of extremely large molecules can be reconstructed from overlapping sequences of length F+P.
Whether the nucleic acid to be sequenced is itself an intermediate length fragment or is first treated to generate such length fragments, the process of identifying sequences from such nucleic acid fragments by hybridizing to two sets of small oligonucleotide probes of known sequence is central to the sequencing methods disclosed herein. This process generally comprises the following steps:
(a) contacting the set or array of attached or immobilized oligonucleotide probes with the nucleic acid fragments under hybridization conditions effective to allow fragments with a complementary sequence to hybridize sufficiently to a probe, thereby forming primary complexes wherein the fragment has both hybridized and non-hybridized, or “free”, sequences;
(b) contacting the primary complexes with the set of labelled oligonucleotide probes in solution under hybridization conditions effective to allow probes with complementary sequences to hybridize to a non-hybridized or free fragment sequence, thereby forming secondary complexes wherein the fragment is hybridized to both an attached (immobilized) probe and a labelled probe;
(c) removing from the secondary complexes any labelled probes that have not hybridized adjacent to an attached probe, thereby leaving only adjacent secondary complexes;
(d) detecting the adjacent secondary complexes by detecting the presence of the label in the labelled probe; and
(e) identifying oligonucleotide sequences from the nucleic acid fragments in the adjacent secondary complexes by combining or connecting the known sequences of the hybridized attached and labelled probes.
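Steps (a) through (e) can be mimicked by an idealized, noise-free simulation. The sketch below assumes fully stringent conditions (perfect matches only, and only immediately adjacent probes retained), and follows the convention of the worked example above in which probe sequences are written in the same sense as the sequence being read. The function names are hypothetical.

```python
from collections import defaultdict

def adjacent_complexes(fragments, f, p):
    """Simulate steps (a)-(d): for each fixed probe (array spot), collect the
    labelled probes that would remain bound immediately adjacent to it."""
    signals = defaultdict(set)
    for fragment in fragments:
        # Slide over every position where a fixed probe of length f and a
        # labelled probe of length p can sit back to back on the fragment.
        for start in range(len(fragment) - (f + p) + 1):
            fixed = fragment[start:start + f]
            labelled = fragment[start + f:start + f + p]
            signals[fixed].add(labelled)
    return signals

def identified_sequences(signals):
    """Step (e): join each fixed probe with each adjacent labelled probe."""
    return {fixed + labelled
            for fixed, labels in signals.items()
            for labelled in labels}

signals = adjacent_complexes(["AAAAAATTTTTTC"], f=6, p=6)
print(sorted(identified_sequences(signals)))
```

For the 13-base fragment shown, the simulation yields the two 12-mers used in the assembly example above; real data would additionally contain false positives and negatives that this sketch ignores.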
The hybridization or ‘washing conditions’ chosen to conduct either one, or both, of the hybridization steps may be manipulated according to the particular sequencing embodiment chosen. For example, both of the hybridization conditions may be designed to allow oligonucleotide probes to hybridize to a given nucleic acid fragment when they contain complementary sequences, i.e., substantially matching sequences, such as those sequences that hybridize at five out of six positions. The hybridization steps would preferably be conducted using a simple robotic device as is routinely used in current sequencing procedures.
Alternatively, the hybridization conditions may be designed to allow only those oligonucleotide probes and fragments that have completely complementary sequences to hybridize. These more discriminating or ‘stringent’ conditions may be used for both distinct steps of a sequential hybridization process or for either step alone. In such cases, the oligonucleotide probes, whether immobilized or labelled probes, would only be allowed to hybridize to a given nucleic acid fragment when they shared completely complementary sequences with the fragment.
The hybridization conditions chosen will generally dictate the degree of complexity required to analyze the data obtained. Equally, the computer programs available to analyze any data generated may dictate the hybridization conditions that must be employed in a given laboratory. For example, in the most discriminating process, both hybridization steps would be conducted under conditions that allow only oligos and fragments with completely complementary sequences to hybridize. As there will be no mismatched bases, this method involves the least complex computational analyses and, for this reason, it is the currently preferred method for practicing the invention. However, the use of less discriminating conditions for one or both hybridization steps also falls within the scope of the present invention.
Suitable hybridization conditions for use in either or both steps may be routinely determined by optimization procedures or ‘pilot studies’. Various types of pilot studies are routinely conducted by those skilled in the art of nucleic acid sequencing in establishing working procedures and in adapting a procedure for use in a given laboratory. For example, conditions such as the temperature; the concentration of each of the components; the length of time of the steps; the buffers used and their pH and ionic strength may be varied and thereby optimized.
In preferred embodiments, the nucleic acid sequencing method of the invention involves a discriminating step to select for secondary hybridization complexes that include immediately adjacent immobilized and labelled probes, as distinct from those that are not immediately adjacent and are separated by one, two or more bases. A variety of processes are available for removing labelled probes that are not hybridized immediately adjacent to an attached probe, i.e., not hybridized back to back, each of which leaves only the immediately adjacent secondary complexes.
Such discriminatory processes may rely solely on washing steps of controlled stringency wherein the hybridization conditions employed are designed so that immediately adjacent probes remain hybridized due to the increased stability afforded by the stacking interactions of the adjacent nucleotides. Again, washing conditions such as temperature, concentration, time, buffers, pH, ionic strength and the like, may be varied to optimize the removal of labelled probes that are not immediately adjacent.
In preferred embodiments the immediately adjacent immobilized and labelled probes would be ligated, i.e., covalently joined, prior to performing washing steps to remove any non-ligated probes. Ligation may be achieved by treating with a solution containing a chemical ligating agent, such as, e.g., water-soluble carbodiimide or cyanogen bromide. More preferably, a ligase enzyme, such as T4 DNA ligase from T4 bacteriophage, which is commercially available from many sources (e.g., Biolabs), may be employed. In any event, one would then be able to remove non-immediately adjacent labelled probes by more stringent washing conditions that cannot affect the covalently connected labeled and fixed probes.
The remaining adjacent secondary complexes would be detected by observing the location of the label from the labelled probes present within the complexes. The oligonucleotide probes may be labeled with a chemically-detectable label, such as fluorescent dyes, or adequately modified to be detected by a chemiluminescent developing procedure, or radioactive labels such as 35S, 3H, 32P or 33P, with 33P currently being preferred. Probes may also be labeled with non-radioactive isotopes and detected by mass spectrometry.
Currently, the most preferred method contemplated for practicing the present invention involves performing the hybridization steps under conditions designed to allow only those oligonucleotide probes and fragments that have completely complementary sequences to hybridize and that allow only those probes that are immediately adjacent to remain hybridized. This method subsequently requires the least complex computational analysis.
Where the nucleic acid molecule of unknown sequence is longer than about 45 or 50 bp, one effective method for determining its sequence generally involves treating the molecule to generate nucleic acid fragments of intermediate length, and determining sequences from the fragments. The nucleic acid molecule, whether it be DNA or RNA, may be fragmented by any one of a variety of methods including, for example, cutting by restriction enzyme digestion, shearing by physical means such as ultrasound treatment, by NaOH treatment or by low pressure shearing.
In certain embodiments, e.g., involving small oligonucleotide probes between about 4 bp and about 9 bp in length, one may aim to produce nucleic acid fragments of between about 10 bp and about 40 bp in length. Naturally, longer length probes would generally be used in conjunction with sequencing longer length nucleic acid fragments, and vice versa. In certain preferred embodiments, the small oligonucleotide probes used will be about 6 bp in length and the nucleic acid fragments to be sequenced will generally be about 20 bp in length. If desired, fragments may be separated by size to obtain those of an appropriate length, e.g., fragments may be run on a gel, such as an agarose gel, and those with approximately the desired length may be excised.
The method for determining the sequence of a nucleic acid molecule may also be exemplified using the following terms. Initially one would randomly fragment an amount of the nucleic acid to be sequenced to provide a mixture of nucleic acid fragments of length T. One would prepare an array of immobilized oligonucleotide probes of known sequences and length F and a set of labelled oligonucleotide probes in solution of known sequences and length P, wherein F+P≤T and, preferably, wherein T≈3F.
One would then contact the array of immobilized oligonucleotide probes with the mixture of nucleic acid fragments under hybridization conditions effective to allow the formation of primary complexes with hybridized, complementary sequences of length F and non-hybridized fragment sequences of length T−F. Preferably, the hybridized sequences of length F would contain only completely complementary sequences.
The primary complexes would then be contacted with the set of labelled oligonucleotide probes under hybridization conditions effective to allow the formation of secondary complexes with hybridized, complementary sequences of length F and adjacent hybridized, complementary sequences of length P. In preferred embodiments, only those labelled probes with completely complementary sequences would be allowed to hybridize and only those probes that hybridize immediately adjacent to an immobilized probe would be allowed to remain hybridized. In the most preferred embodiments, the adjacent immobilized and labelled oligonucleotide probes would also be ligated at this stage.
Next one would detect the secondary complexes by detecting the presence of the label and identify sequences of length F+P from the nucleic acid fragments in the secondary complexes by combining the known sequences of the hybridized immobilized and labelled probes. Stretches of the sequences of length F+P that overlap would then be identified, thereby allowing the complete nucleic acid sequence of the molecule to be reconstructed or assembled from the overlapping sequences determined.
In the methods of the invention, the oligonucleotides of the first set may be attached to a solid support, i.e. immobilized, by any of the methods known to those of skill in the art. For example, attachment may be via addressable laser-activated photodeprotection (Fodor et al., 1991; Pease et al., 1994). One generally preferred method is to attach the oligos through the phosphate group using reagents such as nucleoside phosphoramidite or nucleoside hydrogen phosphonate, as described by Southern and Maskos (PCT Patent Application WO 90/03382, incorporated herein by reference), and using glass, nylon or teflon supports. Another preferred method is that of light-generated synthesis described by Pease et al. (1994; incorporated herein by reference). One may also purchase support bound oligonucleotide arrays, for example, as have been offered for sale by Affymetrix and Beckman.
The immobilized oligonucleotides may be formed into an array comprising all probes or subsets of probes of a given length (preferably about 4 to 10 bases), and more preferably, into multiple arrays of immobilized oligonucleotides arranged to form a so-called “sequencing chip”. One example of a chip is that where hydrophobic segments are used to create distinct spatial areas. The sequencing chips may be designed for different applications like mapping, partial sequencing, sequencing of targeted regions for diagnostic purposes, mRNA sequencing and large scale genome sequencing. For each application, a specific chip may be designed with different sized probes or with an incomplete set of probes.
In one exemplary embodiment, both sets of oligonucleotide probes would be probes of six bases in length, i.e., 6-mers. In this instance, each set of oligos contains 4096 distinct probes. The first set of probes is preferably fixed in an array on a microchip, most conveniently arranged in 64 rows and 64 columns. The second set of 4096 oligos would be labeled with a detectable label and dispensed into a set of distinct tubes. In this example, 4096 of the chips would be combined in a large array, or several arrays. After hybridizing the nucleic acid fragments, a small amount of the labeled oligonucleotides would be added to each microchip for the second hybridization step; only one of the 4096 labeled oligonucleotides would be added to each microchip.
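A 64 by 64 arrangement lends itself to a simple addressing scheme in which each 6-mer is read as a base-4 number. The sketch below shows one possible layout; the disclosure specifies only the number of rows and columns, so the ordering of bases (A, C, G, T) and the row-major placement are assumptions made here, and the function names are hypothetical.

```python
BASE_ORDER = "ACGT"  # assumed ordering; the text does not prescribe one
BASE_INDEX = {base: i for i, base in enumerate(BASE_ORDER)}

def probe_to_cell(probe, columns=64):
    """Map a probe to a (row, column) position by reading it as a base-4 number."""
    index = 0
    for base in probe:
        index = index * 4 + BASE_INDEX[base]
    return divmod(index, columns)

def cell_to_probe(row, column, length=6, columns=64):
    """Inverse mapping: recover the probe sequence fixed at a given position."""
    index = row * columns + column
    bases = []
    for _ in range(length):
        index, digit = divmod(index, 4)
        bases.append(BASE_ORDER[digit])
    return "".join(reversed(bases))

print(probe_to_cell("AAAAAA"), probe_to_cell("TTTTTT"))
```

With this layout the first three bases of a 6-mer select the row and the last three select the column, so a detected spot can be translated directly back into the sequence of its fixed probe.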
Further embodiments of the invention include kits for use in nucleic acid sequencing. Such kits will generally comprise a solid support having attached an array of oligonucleotide probes of known sequences, as shown in FIG. 2A, FIG. 2B and FIG. 2C, wherein the oligonucleotides are capable of taking part in hybridization reactions, and a set of containers comprising solutions of labelled oligonucleotide probes of known sequences. Arrangements such as those shown in FIG. 4 are also contemplated. This depicts the use of the Universal Base, either as an attachment method, or at the terminus to give an added dimension to the hybridization of fragments.
In the kits, the attached oligonucleotide probes and those in solution may be between about 4 bp and about 9 bp in length, with ones of about 6 bp in length being preferred. The oligos may be labelled with chemically-detectable or radioactive labels, with 32P-labelled probes being generally preferred, and 33P-labelled probes being even more preferred. The kits may also comprise a chemical or other ligating agent, such as a DNA ligase enzyme. A variety of other additional compositions and materials may be included in the kits, such as 96-tip or 96-pin devices, buffers, reagents for cutting long nucleic acid molecules and tools for the size selection of DNA fragments. The kits may even include labelled RNA probes so that the probes may be removed by RNase treatment and the sequencing chips re-used.