This invention relates to imaging for biochemical analysis and more particularly to methods and systems for imaging high density biochemical arrays used in high-throughput genome sequencing.
High-throughput analysis of chemical and/or biological species is an important tool in the fields of diagnostics and therapeutics. Biochemical arrays allow multiple biochemical experiments to be performed in parallel. This ability accrues from the development of techniques to perform each experiment in a small volume and to pack the experiments closely together. Arrays of attached chemical and/or biological species on a substrate can be designed to define specific target sequences, analyze gene expression patterns, identify specific allelic variations, determine copy number of DNA sequences and identify, on a genome-wide basis, binding sites for proteins (e.g., transcription factors and other regulatory molecules). In a specific example, the advent of the human genome project required that improved methods for sequencing nucleic acids, such as DNA (deoxyribonucleic acid) and RNA (ribonucleic acid), be developed. Determination of the entire 3,000,000,000 base sequence of the haploid human genome has provided a foundation for identifying the genetic basis of numerous diseases. However, a great deal of work remains to be done to identify the genetic variations associated with a statistically significant number of human genomes, and improved high throughput methods for analysis can aid greatly in this endeavor.
The high-throughput analytical approaches conventionally utilize assay devices, known as flow cells, that contain arrays of chemical and/or biological species for analysis. The biological species are typically tagged with multiple fluorescent colors that can be read with an imaging system.
Due to the sheer volume of data to be observed, captured and analyzed, a critical factor in genome sequencing analysis is the throughput of the assaying instrument. Throughput has a direct impact on cost. While imaging systems are capable of capturing a large amount of data as compared to other technologies, the throughput of such systems is limited by camera speed and the number of pixels per spot. Camera speed is limited by inherent physical limitations, and the smallest number of pixels per spot is one. While it is desirable to reduce the number of pixels per spot to a minimum, there are typically many pixels per spot in practical instruments.
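The relationship between camera speed, pixels per spot, and throughput can be illustrated with a minimal sketch. The function name and the example pixel rate below are hypothetical values chosen for illustration and are not taken from any particular instrument.

```python
def spots_per_second(pixel_rate_hz: float, pixels_per_spot: float) -> float:
    """Upper bound on spots read per second for a given camera pixel rate.

    pixel_rate_hz   -- pixels the camera delivers per second (assumed value)
    pixels_per_spot -- image pixels devoted to each spot (at least one)
    """
    if pixels_per_spot < 1:
        raise ValueError("a spot cannot occupy fewer than one pixel")
    return pixel_rate_hz / pixels_per_spot


if __name__ == "__main__":
    # Hypothetical camera delivering 100 million pixels per second.
    camera_rate = 100e6
    for pps in (16, 4, 1):
        rate = spots_per_second(camera_rate, pps)
        print(f"{pps:>2} pixels/spot: {rate:.3g} spots/s")
```

Under these assumptions, each reduction in pixels per spot raises the attainable spot rate proportionally, with one pixel per spot as the limiting case.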
Images captured in pixels from light emitted from spots associated with attachment sites on a substrate must be aligned and registered in order to be analyzable. The conventional registration technology, which involves registration marks and guides on the substrate, requires space on the substrate, reducing the number of sites available for analysis and thus the volume of analysis per unit time.
Several different approaches to DNA chips are under development. In one approach a combinatorial array of DNA fragments is created on a chip and these are used for sequencing by hybridization. In another, DNA is randomly arrayed on a surface for the same purpose. One research group is trying to use arrays of DNA polymerase to observe sequencing base by base. Still another research group uses self-assembled DNA nanoarrays interrogated by combinatorial probe-anchor ligation. Although these approaches are quite different from one another, especially in their biochemical details, they all depend on fluorescence imaging techniques to literally “see” the data generated by individual experiments in an array.
Fluorescence imaging is used to identify DNA bases—A, C, G, or T—by designing biochemical reactions such that a different colored dye (for example, red, green, blue, or yellow) corresponds to each one. One may then observe a DNA experiment with a fluorescence microscope. The color observed indicates the DNA base at that particular step. Extracting data from a DNA chip thus depends on recording the color of fluorescence emitted by many millions or even billions of biochemical experiments on a chip.
The practice of the techniques described herein may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and sequencing technology, which are within the skill of those who practice in the art. Such conventional techniques include polymer array synthesis, hybridization and ligation of polynucleotides, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the examples herein. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Green, et al., Eds. (1999), Genome Analysis: A Laboratory Manual Series (Vols. I-IV); Weiner, Gabriel, Stephens, Eds. (2007), Genetic Variation: A Laboratory Manual; Dieffenbach, Dveksler, Eds. (2003), PCR Primer: A Laboratory Manual; Bowtell and Sambrook (2003), DNA Microarrays: A Molecular Cloning Manual; Mount (2004), Bioinformatics: Sequence and Genome Analysis; Sambrook and Russell (2006), Condensed Protocols from Molecular Cloning: A Laboratory Manual; and Sambrook and Russell (2002), Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press); Stryer, L. (1995) Biochemistry (4th Ed.) W.H. Freeman, New York N.Y.; Gait, “Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press, London; Nelson and Cox (2000), Lehninger, Principles of Biochemistry 3rd Ed., W. H. Freeman Pub., New York, N.Y.; and Berg et al. (2002) Biochemistry, 5th Ed., W.H. Freeman Pub., New York, N.Y., all of which are herein incorporated in their entirety by reference for all purposes.
As used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a channel” refers to one or more channels available on an assay substrate, and reference to “the method” includes reference to equivalent steps and methods known to those skilled in the art, and so forth.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. All publications mentioned herein are incorporated by reference for the purpose of describing and disclosing devices, formulations and methodologies that may be used in connection with the presently described invention.
Where a range of values is provided, it is understood that each intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art upon reading the present disclosure that the present invention may be practiced without one or more of these specific details. In other instances, features and procedures well known to those skilled in the art have not been described in order to avoid obscuring the invention.
Selected Definitions
“Amplicon” means the product of a polynucleotide amplification reaction. That is, it is a population of polynucleotides that are replicated from one or more starting sequences. Amplicons may be produced by a variety of amplification reactions, including but not limited to polymerase chain reactions (PCRs), linear polymerase reactions, nucleic acid sequence-based amplification, circle dependant amplification and like reactions (see, e.g., U.S. Pat. Nos. 4,683,195; 4,965,188; 4,683,202; 4,800,159; 5,210,015; 6,174,670; 5,399,491; 6,287,824 and 5,854,033; and U.S. Published Pat. App. No. 2006/0024711).
“Attachment site” or “site” herein refers to functionalized locations arranged in a regular pattern on a substrate to which bioactive structures can be bound. The sites in practice are submicron regions of reactive positive amines that are attached to an oxide surface via a silanization process.
“Circle dependant replication” or “CDR” refers to multiple displacement amplification of a circular template using one or more primers annealing to the same strand of the circular template to generate products representing only one strand of the template. In CDR, no additional primer binding sites are generated and the amount of product increases only linearly with time. The primer(s) used may be of a random sequence (e.g., one or more random hexamers) or may have a specific sequence to select for amplification of a desired product. Without further modification of the end product, CDR often results in the creation of a linear construct having multiple copies of a strand of the circular template in tandem, i.e. a linear, single-stranded concatamer of multiple copies of a strand of the template.
“Circle dependant amplification” or “CDA” refers to multiple displacement amplification of a circular template using primers annealing to both strands of the circular template to generate products representing both strands of the template, resulting in a cascade of multiple-hybridization, primer-extension and strand-displacement events. This leads to an exponential increase in the number of primer binding sites, with a consequent exponential increase in the amount of product generated over time. The primers used may be of a random sequence (e.g., random hexamers) or may have a specific sequence to select for amplification of a desired product. CDA results in formation of a set of concatemeric double-stranded fragments.
“Field” as used herein is a two-dimensional subunit of analysis, referring typically to the data captured by a camera and grouped together for the purpose of analysis.
“Grid” as used herein refers to an abstract Cartesian pattern which is employed to analyze location of information in an image constructed of pixels. The grid for the present purposes has constant periodicity in x and y and is preferably square. The location of the grid is conveniently specified in a pixel reference frame.
“Ligand” as used herein refers to a molecule that may attach, covalently or noncovalently, to a molecule on an assay substrate, either directly or via a specific binding partner. Examples of ligands which can be employed by this invention include, but are not restricted to, antibodies, cell membrane receptors, monoclonal antibodies and antisera reactive with specific antigenic determinants (such as on viruses, cells or other materials), drugs, polynucleotides, nucleic acids, peptides, cofactors, lectins, sugars, polysaccharides, cells, cellular membranes, and organelles.
“Microarray” or “array” refers to a solid phase support having a surface, which in the present embodiment is necessarily a planar or substantially planar surface, which carries an array of sites containing nucleic acids such that each site of the array comprises many copies of oligonucleotides or polynucleotides, the sites being spatially discrete. The oligonucleotides or polynucleotides of the array may be covalently bound to the substrate, or may be non-covalently bound. Conventional microarray technology is reviewed in, e.g., Schena, Ed. (2000), Microarrays: A Practical Approach (IRL Press, Oxford).
“Nucleic acid” and “oligonucleotide” are used herein to mean a polymer of nucleotide monomers. As used herein, the terms may also refer to double stranded forms. Monomers making up nucleic acids and oligonucleotides are capable of specifically binding to a natural polynucleotide by way of a regular pattern of monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, or the like, to form duplex or triplex forms. Such monomers and their internucleosidic linkages may be naturally occurring or may be analogs thereof, e.g., naturally occurring or non-naturally occurring analogs. Non-naturally occurring analogs may include peptide nucleic acids, locked nucleic acids, phosphorothioate internucleosidic linkages, bases containing linking groups permitting the attachment of labels, such as fluorophores, or haptens, and the like. Whenever the use of an oligonucleotide or nucleic acid requires enzymatic processing, such as extension by a polymerase, ligation by a ligase, or the like, one of ordinary skill would understand that oligonucleotides or nucleic acids in those instances would not contain certain analogs of internucleosidic linkages, sugar moieties, or bases at any or some positions, when such analogs are incompatible with enzymatic reactions. Nucleic acids typically range in size from a few monomeric units, e.g., 5-40, when they are usually referred to as “oligonucleotides,” to several hundred thousand or more monomeric units. Whenever a nucleic acid or oligonucleotide is represented by a sequence of letters (upper or lower case), such as “ATGCCTG,” it will be understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, “I” denotes deoxyinosine, “U” denotes uridine, unless otherwise indicated or obvious from context. 
Unless otherwise noted the terminology and atom numbering conventions will follow those disclosed in Strachan and Read, Human Molecular Genetics 2 (Wiley-Liss, New York, 1999). Usually nucleic acids comprise the natural nucleosides (e.g., deoxyadenosine, deoxycytidine, deoxyguanosine, deoxythymidine for DNA or their ribose counterparts for RNA) linked by phosphodiester linkages; however, they may also comprise non-natural nucleotide analogs, e.g., modified bases, sugars, or internucleosidic linkages. It is clear to those skilled in the art that where an enzyme has specific oligonucleotide or nucleic acid substrate requirements for activity, e.g., single stranded DNA, RNA/DNA duplex, or the like, then selection of appropriate composition for the oligonucleotide or nucleic acid substrates is well within the knowledge of one of ordinary skill, especially with guidance from treatises, such as Sambrook et al, Molecular Cloning, Second Edition (Cold Spring Harbor Laboratory, New York, 1989), and like references. As used herein, “targeted nucleic acid segment” refers to a nucleic acid targeted for sequencing or re-sequencing.
“Pixel” is an indivisible light sensing element of a camera reporting the level of detected light at an indivisible location. A monochromatic pixel is a single photodetection element. Color filters can be used to determine the spectrum of light received at a pixel.
“Primer” means an oligonucleotide, either natural or synthetic, which is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers usually have a length in the range of from 9 to 40 nucleotides, or in some embodiments, from 14 to 36 nucleotides.
“Probe” as used herein refers to an oligonucleotide, either natural or synthetic, which is used to interrogate complementary sequences within a nucleic acid of unknown sequence. The hybridization of a specific probe to a target polynucleotide is indicative of the specific sequence complementary to the probe within the target polynucleotide sequence.
“Sequencing” in reference to a nucleic acid means determination of information relating to the sequence of nucleotides in the nucleic acid. Such information may include the identification or determination of partial as well as full sequence information of the nucleic acid. The sequence information may be determined with varying degrees of statistical reliability or confidence. In one aspect, the term includes the determination of the identity and ordering of a plurality of contiguous nucleotides in a nucleic acid starting from different nucleotides in the target nucleic acid.
“Spot” as used herein refers to the location of light emitted from a fluorescing molecule. A spot is not necessarily centered on an attachment site.
“Substrate” refers to a material or group of materials having a rigid or semi-rigid surface or surfaces. In the present context, at least one surface of the substrate will be substantially flat, although in other contexts not related to the present invention, it may be desirable to physically separate synthesis regions for different compounds with, for example, wells, raised regions, pins, etched trenches, or the like. According to other embodiments, the substrate(s) will take the form of beads, resins, gels, microspheres, or other geometric configurations. In the present invention, the surface of the substrate is limited to a planar structure to promote analysis.
As used herein, the term “Tm” is used in reference to the “melting temperature.” The melting temperature is the temperature at which a population of double-stranded nucleic acid molecules becomes half dissociated into single strands. Several equations for calculating the Tm of nucleic acids are well known in the art. As indicated by standard references, a simple estimate of the Tm value may be calculated by the equation Tm=81.5+0.41 (% G+C), when a nucleic acid is in aqueous solution at 1M NaCl (see e.g., Anderson and Young, Quantitative Filter Hybridization, in Nucleic Acid Hybridization (1985)). Other references (e.g., Allawi, H. T. & SantaLucia, J., Jr., Biochemistry 36, 10581-94 (1997)) include alternative methods of computation which take structural and environmental, as well as sequence characteristics into account for the calculation of Tm.
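The simple Tm estimate above can be sketched in a few lines. The function names are hypothetical, and the sketch implements only the stated equation Tm=81.5+0.41 (% G+C) for a nucleic acid at 1M NaCl; it does not attempt the structure- and environment-dependent methods of the other cited references.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence given as letters."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = sum(1 for base in seq if base in "GC")
    return 100.0 * gc_count / len(seq)


def estimate_tm(seq: str) -> float:
    """Simple melting temperature estimate: Tm = 81.5 + 0.41 * (%G+C)."""
    return 81.5 + 0.41 * gc_percent(seq)


if __name__ == "__main__":
    # "ATGCCTG" is the example sequence used in the definitions above.
    example = "ATGCCTG"
    print(f"%G+C = {gc_percent(example):.1f}")
    print(f"Tm estimate = {estimate_tm(example):.1f}")
```

For the example sequence “ATGCCTG”, four of the seven bases are G or C, so the estimate depends only on that fraction and not on the order of the bases.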
A conventional analysis slide consists of a 1″×3″ silicon chip upon which arrays of functionalized sites are created. The sites are submicron regions of reactive positive amines that are attached to an oxide surface via a silanization process. The surrounding field consists of neutral, non-reactive methyl groups. The sites are arranged in 4.5 mm wide lanes down the narrow direction of the analysis slide. Currently a 19 mm×60 mm cover slip is bonded to the chip using glue. The glue forms lanes that are a maximum of 4.5 mm×19 mm. The spacing between the cover slip and the silicon slide is approximately 50 μm. This 50 μm space is maintained by adding 50 μm glass beads into the glue.
The 19 mm width of the cover slip is substantially less than the maximum 25 mm width of the silicon slide because 5 mm is required for an entrance port. The entrance port is a region onto which pipettes dispense fluids onto the top of lanes. Capillary forces move reagents from the top of the lanes into the gap under the cover slip. At the bottom of the slide, 1 mm of additional distance is required to evacuate excess fluid.
There is a 1 mm to 4 mm keep-out region at the top and bottom of the lanes directly under the cover slip. This keep-out region is needed because of reagent evaporation, cover slip alignment accuracies and glue encroachment due to narrowed entrance ports. Taking into account all of these tolerances, the usable width of the analysis slide is about 12-15 mm of a total possible 25 mm in a conventional slide.
In a known design, twelve 4.5 mm lanes are constructed on an analysis slide. This yields a maximum usable width of 54 mm. However, 1 mm per lane is dedicated for glue lines. This gives a maximum usable width of only 42 mm. Fewer lanes could be fabricated, but this has been observed to destabilize the chip because of reduced bond line area and loss of alignment guides.
Given these dimensions, the overall useful percentage area of the chip of a conventional slide is approximately (12.5 mm×42 mm)/(25 mm×75 mm)=28%. What is needed is a design that provides increased usable area and a process that attains accurate alignment.
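The area estimate above can be reproduced with a short calculation. The constant and function names are hypothetical; the numeric values are those given in the preceding paragraphs, with 12.5 mm taken as the usable width as in the ratio above.

```python
SLIDE_WIDTH_MM = 25.0     # narrow dimension of the slide
SLIDE_LENGTH_MM = 75.0    # long dimension of the slide
LANE_COUNT = 12           # lanes in the known design
LANE_PITCH_MM = 4.5       # width allotted to each lane
GLUE_PER_LANE_MM = 1.0    # portion of each lane lost to glue lines
USABLE_WIDTH_MM = 12.5    # usable extent across the slide after keep-outs


def usable_length_mm(lane_count: int = LANE_COUNT,
                     lane_pitch_mm: float = LANE_PITCH_MM,
                     glue_per_lane_mm: float = GLUE_PER_LANE_MM) -> float:
    """Usable extent along the slide once glue lines are subtracted."""
    return lane_count * (lane_pitch_mm - glue_per_lane_mm)


def usable_fraction(usable_width_mm: float = USABLE_WIDTH_MM) -> float:
    """Fraction of the total slide area that carries usable sites."""
    usable_area = usable_width_mm * usable_length_mm()
    total_area = SLIDE_WIDTH_MM * SLIDE_LENGTH_MM
    return usable_area / total_area


if __name__ == "__main__":
    print(f"usable length: {usable_length_mm():.1f} mm")
    print(f"usable fraction: {usable_fraction():.0%}")
```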
The following is a description of the analysis techniques under development in which the present invention is employed.
A promising approach to whole genome studies was recently introduced by a group from the assignee of the present invention led by Radoje Drmanac (“Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays”, Radoje Drmanac et al., Science, 327, p 78-81, Jan. 1, 2010, which is not prior art under U.S. law). Combinatorial probe anchor ligation chemistry was used to independently assay each base from patterned nanoarrays of self-assembling DNA nanoballs. Three human genomes were sequenced with an accuracy of about one false variant per 100,000 bases. The high accuracy, low cost, and scalability of this platform enable complete human genome sequencing for the detection of rare variants in large-scale genetic studies.
Biochemical experiments in the Drmanac study were performed on rectangular chips measuring approximately 25 mm×75 mm. Each chip reportedly had approximately one billion DNA nanoballs arrayed on it in a regular, rectangular pattern. It is useful to visualize this array structure. FIG. 1 shows a conceptual diagram of such a biochemical-array chip 100. Because of the vast number of nanoballs, the chip is divided conceptually into fields; e.g. field 105. A typical field size might be 0.5 mm×1.5 mm, although the exact size is not critical. Fields of manageable size enable imaging analysis to be performed in manageable chunks. In a step-and-repeat imaging system a field size may correspond to the system's field of view; in a continuous scanning system, the field size may be a convenient unit for data processing.
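As a rough illustration of the scale involved, the chip and field dimensions above give an order-of-magnitude estimate of the number of fields per chip and spots per field. The sketch assumes that fields tile the entire chip area without gaps or overlap and that nanoballs are spread evenly, neither of which is stated above; all names are hypothetical.

```python
CHIP_MM = (25.0, 75.0)              # approximate chip dimensions
FIELD_MM = (0.5, 1.5)               # typical field dimensions
NANOBALLS_PER_CHIP = 1_000_000_000  # approximately one billion


def fields_per_chip(chip_mm=CHIP_MM, field_mm=FIELD_MM) -> int:
    """Number of fields if they tile the whole chip area (assumption)."""
    chip_area = chip_mm[0] * chip_mm[1]
    field_area = field_mm[0] * field_mm[1]
    return round(chip_area / field_area)


def spots_per_field(nanoballs: int = NANOBALLS_PER_CHIP) -> float:
    """Average spots per field if nanoballs are spread evenly (assumption)."""
    return nanoballs / fields_per_chip()


if __name__ == "__main__":
    print(f"fields per chip: {fields_per_chip()}")
    print(f"spots per field: {spots_per_field():.0f}")
```

Because a real chip reserves area for purposes other than arrayed sites, these figures are upper-level estimates only.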
Referring to FIG. 2, a conceptual diagram is shown of a field 200 of a biochemical array chip. The field contains an array of spots (e.g., spots 205, 210, 215) where DNA sequencing experiments are performed. Although the field in FIG. 2 is drawn with only a few hundred spots, an actual field may contain approximately 10,000 to 1,000,000 spots. Inset 220 shows six spots from which fluorescence in any of four colors: blue (“B”), red (“R”), yellow (“Y”), and green (“G”) can be observed. The actual colors used depend on the choice of fluorescent dye and may be specified in terms of dye emission spectral data. The six spots shown in inset 220 correspond to data read out from six parallel DNA experiments, each reading a different spectrum. In this case, the fluorescence data indicate adenine (“A”), guanine (“G”), cytosine (“C”), and thymine (“T”) as shown in inset 225.
It is intended that each site on a DNA chip contain a strand of DNA whose sequence is to be determined. The readout of inset 220 shown in FIG. 2 corresponds to a single step in determining the sequence of DNA in strands. The reading process is repeated many times.
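The readout described above can be sketched as a lookup from observed dye color to base, repeated over successive reading steps. The color-to-base assignment below is hypothetical, since the actual assignment depends on the dyes and chemistry used, and treating the order of reading steps as the order of bases is a simplification made purely for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical dye-to-base assignment; the real mapping depends on the dyes.
DYE_TO_BASE: Dict[str, str] = {"B": "A", "R": "G", "Y": "C", "G": "T"}


def call_bases(colors: Sequence[str]) -> List[str]:
    """Translate one reading step (one color per site) into base calls."""
    return [DYE_TO_BASE[color] for color in colors]


def accumulate_reads(cycles: Sequence[Sequence[str]]) -> List[str]:
    """Append the base called at each reading step to each site's read."""
    if not cycles:
        return []
    site_count = len(cycles[0])
    reads = [""] * site_count
    for colors in cycles:
        if len(colors) != site_count:
            raise ValueError("every reading step must cover the same sites")
        for site, base in enumerate(call_bases(colors)):
            reads[site] += base
    return reads


if __name__ == "__main__":
    # Three reading steps over six sites, as in a six-spot inset.
    steps = [
        ["B", "R", "Y", "G", "B", "R"],
        ["Y", "Y", "G", "B", "R", "G"],
        ["G", "B", "R", "R", "Y", "B"],
    ]
    for site, read in enumerate(accumulate_reads(steps)):
        print(f"site {site}: {read}")
```

The sketch relies on each site keeping the same index from one reading step to the next, so a read is only meaningful if the sites are identified consistently across steps.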
It is important to keep track of exactly which spots on a chip one is looking at; otherwise the data obtained by recording fluorescence colors are meaningless. The field spots, i.e., the locations at which fluorescent dye molecules emit light, are nominally located in a regular, rectangular pattern. The actual pattern is not exact because DNA nanoballs do not always fall exactly on the centers of DNA attachment sites defined on the chip. The field spots are viewed with a camera whose image sensor contains a regular, rectangular array of light sensing pixels.
What is needed is a mechanism and methodology to maximize the informational content on a chip, provide registration targets, and provide control information for an imaging system in order to enhance throughput and thus improve sequencing proficiency.