Nothing in the following discussion is admitted to be prior art unless specifically identified as “prior art.” The present invention is related to processing of data scanned from arrays. Array technologies have gained prominence in biological research and are likely to become important and widely used diagnostic tools in the healthcare industry. Currently, molecular-array techniques are most often used to determine the concentrations of particular nucleic-acid polymers in complex sample solutions. Molecular-array-based analytical techniques are not, however, restricted to analysis of nucleic acid solutions, but may be employed to analyze complex solutions of any type of molecule that can be optically or radiometrically scanned and that can bind with high specificity to complementary molecules synthesized within, or bound to, discrete features on the surface of an array. Because arrays are widely used for analysis of nucleic acid samples, the following background information on arrays is introduced in the context of analysis of nucleic acid solutions following a brief background of nucleic acid chemistry.
Deoxyribonucleic acid (“DNA”) and ribonucleic acid (“RNA”) are linear polymers, each synthesized from four different types of subunit molecules. The subunit molecules for DNA include: (1) deoxy-adenosine, abbreviated “A,” a purine nucleoside; (2) deoxy-thymidine, abbreviated “T,” a pyrimidine nucleoside; (3) deoxy-cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) deoxy-guanosine, abbreviated “G,” a purine nucleoside. The subunit molecules for RNA include: (1) adenosine, abbreviated “A,” a purine nucleoside; (2) uridine, abbreviated “U,” a pyrimidine nucleoside; (3) cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) guanosine, abbreviated “G,” a purine nucleoside. FIG. 1 illustrates a short DNA polymer 100, called an oligomer, composed of the following subunits: (1) deoxy-adenosine 102; (2) deoxy-thymidine 104; (3) deoxy-cytidine 106; and (4) deoxy-guanosine 108. When phosphorylated, subunits of DNA and RNA molecules are called “nucleotides” and are linked together through phosphodiester bonds 110–115 to form DNA and RNA polymers. A linear DNA molecule, such as the oligomer shown in FIG. 1, has a 5′ end 118 and a 3′ end 120. A DNA polymer can be chemically characterized by writing, in sequence from the 5′ end to the 3′ end, the single letter abbreviations for the nucleotide subunits that together compose the DNA polymer. For example, the oligomer 100 shown in FIG. 1 can be chemically represented as “ATCG.” A DNA nucleotide comprises a purine or pyrimidine base (e.g. adenine 122 of the deoxy-adenylate nucleotide 102), a deoxy-ribose sugar (e.g. deoxy-ribose 124 of the deoxy-adenylate nucleotide 102), and a phosphate group (e.g. phosphate 126) that links one nucleotide to another nucleotide in the DNA polymer. In RNA polymers, the nucleotides contain ribose sugars rather than deoxy-ribose sugars. In ribose, a hydroxyl group takes the place of the 2′ hydrogen 128 in a DNA nucleotide.
RNA polymers contain uridine nucleosides rather than the deoxy-thymidine nucleosides contained in DNA. The pyrimidine base uracil lacks a methyl group (130 in FIG. 1) contained in the pyrimidine base thymine of deoxy-thymidine.
The DNA polymers that contain the organization information for living organisms occur in the nuclei of cells in pairs, forming double-stranded DNA helices. One polymer of the pair is laid out in a 5′ to 3′ direction, and the other polymer of the pair is laid out in a 3′ to 5′ direction. The two DNA polymers in a double-stranded DNA helix are therefore described as being anti-parallel. The two DNA polymers, or strands, within a double-stranded DNA helix are bound to each other through attractive forces including hydrophobic interactions between stacked purine and pyrimidine bases and hydrogen bonding between purine and pyrimidine bases, the attractive forces emphasized by conformational constraints of DNA polymers. Because of a number of chemical and topographic constraints, double-stranded DNA helices are most stable when deoxy-adenylate subunits of one strand hydrogen bond to deoxy-thymidylate subunits of the other strand, and deoxy-guanylate subunits of one strand hydrogen bond to corresponding deoxy-cytidylate subunits of the other strand.
FIGS. 2A–B illustrate the hydrogen bonding between the purine and pyrimidine bases of two anti-parallel DNA strands. FIG. 2A shows hydrogen bonding between adenine and thymine bases of corresponding adenosine and thymidine subunits, and FIG. 2B shows hydrogen bonding between guanine and cytosine bases of corresponding guanosine and cytidine subunits. Note that there are two hydrogen bonds 202 and 203 in the adenine/thymine base pair, and three hydrogen bonds 204–206 in the guanine/cytosine base pair, as a result of which GC base pairs contribute greater thermodynamic stability to DNA duplexes than AT base pairs. AT and GC base pairs, illustrated in FIGS. 2A–B, are known as Watson-Crick (“WC”) base pairs.
Two DNA strands linked together by hydrogen bonds form the familiar structure of a double-stranded DNA helix. FIG. 3 illustrates a short section of a DNA double helix 300 comprising a first strand 302 and a second, anti-parallel strand 304. The ribbon-like strands in FIG. 3 represent the deoxyribose and phosphate backbones of the two anti-parallel strands, with hydrogen-bonding purine and pyrimidine base pairs, such as base pair 306, interconnecting the two strands. Deoxy-guanylate subunits of one strand are generally paired with deoxy-cytidylate subunits from the other strand, and deoxy-thymidylate subunits in one strand are generally paired with deoxy-adenylate subunits from the other strand. However, non-WC base pairings may occur within double-stranded DNA.
Double-stranded DNA may be denatured, or converted into single-stranded DNA, by changing the ionic strength of the solution containing the double-stranded DNA or by raising the temperature of the solution. Single-stranded DNA polymers may be renatured, or converted back into DNA duplexes, by reversing the denaturing conditions, for example by lowering the temperature of the solution containing complementary single-stranded DNA polymers. During renaturing or hybridization, complementary bases of anti-parallel DNA strands form WC base pairs in a cooperative fashion, leading to reannealing of the DNA duplex. Strict A-T and G-C complementarity between anti-parallel polymers leads to the greatest thermodynamic stability, but partial complementarity, including non-WC base pairing, may also occur to produce relatively stable associations between partially complementary polymers. In general, the longer the regions of consecutive WC base pairing between two nucleic acid polymers, the greater the stability of hybridization between the two polymers under renaturing conditions.
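The base-pairing rules described above can be illustrated with a minimal Python sketch. The function names are hypothetical and introduced only for illustration, and the alignment model (equal-length strands, aligned end to end, no gaps) is a simplifying assumption rather than a model of real hybridization thermodynamics.

```python
# Watson-Crick partners for DNA bases (A pairs with T, G pairs with C).
WC_PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hydrogen bonds contributed by the WC pair containing each base:
# two for an A/T pair and three for a G/C pair.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(seq):
    """Return the anti-parallel complementary strand, written 5' to 3'."""
    return "".join(WC_PARTNER[base] for base in reversed(seq))


def duplex_hydrogen_bonds(seq):
    """Count WC hydrogen bonds in a fully complementary duplex of seq."""
    return sum(H_BONDS[base] for base in seq)


def longest_wc_run(strand, other):
    """Longest run of consecutive WC pairs between two 5'-to-3' strands.

    Assumption: the strands have equal length and are aligned
    anti-parallel, end to end, with no gaps or shifts.
    """
    best = run = 0
    for a, b in zip(strand, reversed(other)):
        run = run + 1 if WC_PARTNER[a] == b else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    oligomer = "ATCG"  # the oligomer of FIG. 1, written 5' to 3'
    print(reverse_complement(oligomer))
    print(duplex_hydrogen_bonds(oligomer))
    print(longest_wc_run(oligomer, "CGTT"))  # one mismatched position
```

Under these assumptions, a longer run of consecutive WC pairs and a higher hydrogen-bond count both indicate a more stable duplex, although neither quantity is a substitute for a measured or modeled melting temperature.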
The ability to denature and renature double-stranded DNA has led to the development of many extremely powerful and discriminating assay technologies for identifying the presence of DNA and RNA polymers having particular base sequences or containing particular base subsequences within complex mixtures of different nucleic acid polymers, other biopolymers, and inorganic and organic chemical compounds. One such methodology is the array-based hybridization assay. FIGS. 4–7 illustrate the principle of the array-based hybridization assay. An array (402 in FIG. 4) comprises a substrate upon which a regular pattern of features is prepared by various manufacturing processes. The array 402 in FIG. 4, and in subsequent FIGS. 5–7, has a grid-like 2-dimensional pattern of square features, such as feature 404 shown in the upper left-hand corner of the array. Each feature of the array contains a large number of identical oligonucleotides covalently bound to the surface of the feature. These bound oligonucleotides are known as probes. In general, chemically distinct probes are bound to the different features of an array, so that each feature corresponds to a particular nucleotide sequence. In FIGS. 4–6, the principle of array-based hybridization assays is illustrated with respect to the single feature 404 to which a number of identical probes 405–409 are bound. In practice, each feature of the array contains a high density of such probes but, for the sake of clarity, only a subset of these are shown in FIGS. 4–6.
Once an array has been prepared, the array may be exposed to a sample solution of target DNA or RNA molecules (410–413 in FIG. 4) labeled with fluorophores, chemiluminescent compounds, or radioactive atoms 415–418. Labeled target DNA or RNA hybridizes through base pairing interactions to the complementary probe DNA, synthesized on the surface of the array. FIG. 5 shows a number of such target molecules 502–504 hybridized to complementary probes 505–507, which are in turn bound to the surface of the array 402. Targets, such as labeled DNA molecules 508 and 509, that do not contain nucleotide sequences complementary to any of the probes bound to the array surface do not hybridize to generate stable duplexes and, as a result, tend to remain in solution. The sample solution is then rinsed from the surface of the array, washing away any unbound labeled DNA molecules. In other embodiments, unlabeled target sample is allowed to hybridize with the array first. Typically, such a target sample has been modified with a chemical moiety that will react with a second chemical moiety in subsequent steps. Then, either before or after a wash step, a solution containing the second chemical moiety bound to a label is reacted with the target on the array. After washing, the array is ready for scanning. Biotin and avidin represent an example of a pair of chemical moieties that can be utilized for such steps.
Finally, as shown in FIG. 6, the bound labeled DNA molecules are detected via optical or radiometric scanning. Optical scanning involves exciting labels of bound labeled DNA molecules with electromagnetic radiation of appropriate frequency and detecting fluorescent emissions from the labels, or detecting light emitted from chemiluminescent labels. When radioisotope labels are employed, radiometric scanning can be used to detect the signal emitted from the hybridized features. Additional types of signals are also possible, including electrical signals generated by electrical properties of bound target molecules, magnetic properties of bound target molecules, and other such physical properties of bound target molecules that can produce a detectable signal. Optical, radiometric, or other types of scanning produce an analog or digital representation of the array as shown in FIG. 7, with features to which labeled target molecules are hybridized, such as feature 706, optically or digitally differentiated from those features to which no labeled DNA molecules are bound. In other words, the analog or digital representation of a scanned array displays positive signals for features to which labeled DNA molecules are hybridized and displays negative signals for features to which no, or an undetectably small number of, labeled DNA molecules are bound. Features displaying positive signals in the analog or digital representation indicate the presence of DNA molecules with complementary nucleotide sequences in the original sample solution. Moreover, the signal intensity produced by a feature is generally related to the amount of labeled DNA bound to the feature, in turn related to the concentration, in the sample to which the array was exposed, of labeled DNA complementary to the oligonucleotide within the feature.
Array-based hybridization techniques allow extremely complex solutions of DNA molecules to be analyzed in a single experiment. An array may contain from hundreds to tens of thousands of different oligonucleotide probes, allowing for the detection of a subset of complementary sequences from a complex pool of different target DNA or RNA polymers. In order to perform different sets of hybridization analyses, arrays containing different sets of bound oligonucleotides are manufactured by any of a number of complex manufacturing techniques. These techniques generally involve synthesizing the oligonucleotides within corresponding features of the array through a series of complex iterative synthetic steps.
One, two, or more than two data subsets within a data set can be obtained from a single molecular array by scanning the molecular array for one, two, or more than two types of signals. Two or more data subsets can also be obtained by combining data from two different arrays. When optical scanning is used to detect fluorescent or chemiluminescent emission from chromophore labels, a first set of signals, or data subset, may be generated by scanning the molecular array at a first optical wavelength, a second set of signals, or data subset, may be generated by scanning the molecular array at a second optical wavelength, and additional sets of signals may be generated by scanning the molecular array at additional optical wavelengths. Different signals may be obtained from a molecular array by radiometric scanning to detect radioactive emissions at one, two, or more than two different energy levels. Target molecules may be labeled with either a first chromophore that emits light at a first wavelength, or a second chromophore that emits light at a second wavelength. Following hybridization, the molecular array can be scanned at the first wavelength to detect target molecules, labeled with the first chromophore, hybridized to features of the molecular array, and can then be scanned at the second wavelength to detect target molecules, labeled with the second chromophore, hybridized to the features of the molecular array. In one common molecular array system, the first chromophore emits light at a red visible-light wavelength, and the second chromophore emits light at a green visible-light wavelength.
The data set obtained from scanning the molecular array at the red wavelength is referred to as the “red signal,” and the data set obtained from scanning the molecular array at the green wavelength is referred to as the “green signal.” While it is common to use one or two different chromophores, it is possible to use three, four, or more than four different chromophores and to scan a molecular array at three, four, or more than four wavelengths to produce three, four, or more than four data sets.
FIG. 8 shows a small region of a scanned image of a molecular array containing an image of a single feature. In FIG. 8, the small region of the scanned image comprises a grid, or matrix, of pixels, such as pixel 802. In FIG. 8, the magnitude of the signal scanned from the small region of the surface of a molecular array spatially corresponding to a particular pixel in the scanned image is indicated by a kind of gray scaling. Pixels corresponding to high-intensity signals, such as pixel 804, are darkly colored, while pixels having very low signal intensities, such as pixel 802, are not colored. The range of intermediate signal intensities is represented, in FIG. 8, by a generally decreasing density of crosshatch lines within a pixel. In FIG. 8, there is a generally disc-shaped region in the center of the region of the scanned image of the molecular array that contains a high proportion of high-intensity pixels. Outside of this central, disc-shaped region corresponding to a feature, the intensities of the pixels fall off relatively quickly, although pixels with intermediate intensities are found, infrequently, even toward the edges of the region of the scanned image, relatively distant from the obvious central, disc-shaped region of high-intensity pixels that corresponds to the feature.
In general, data sets collected from molecular arrays comprise an indexed set of numerical signal intensities associated with pixels. The pixel intensities range over the possible values for the size of the memory-storage unit employed to store the pixel intensities. In many current systems, a 16-bit word is employed to store the intensity value associated with each pixel, and a data set can be considered to be a 2-dimensional array of pixel-intensity values corresponding to the 2-dimensional array of pixels that together compose a scanned image of a molecular array.
FIG. 9 shows a 2-dimensional array of pixel-intensity values corresponding to a portion of the central, disc-shaped region corresponding to a feature in the region of a scanned image of a molecular array shown in FIG. 8. In FIG. 9, for example, pixel intensity 902 corresponds to pixel 806 in FIG. 8.
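A data set of this kind can be sketched in Python as a row-major grid of 16-bit values. The class name `ScannedImage` and its methods are hypothetical, introduced only for illustration, and the saturating behavior on out-of-range intensities is an assumption rather than a description of any particular scanner.

```python
from array import array

MAX_16BIT = 0xFFFF  # largest intensity a 16-bit word can hold (65535)


class ScannedImage:
    """A 2-dimensional grid of pixel intensities stored as 16-bit words."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        # Typecode "H" is an unsigned short (16 bits on common platforms),
        # stored here in row-major order.
        self.pixels = array("H", [0] * (width * height))

    def set(self, row, col, intensity):
        # Assumption: intensities outside the 16-bit range saturate
        # rather than wrap around.
        clamped = max(0, min(MAX_16BIT, int(intensity)))
        self.pixels[row * self.width + col] = clamped

    def get(self, row, col):
        return self.pixels[row * self.width + col]


if __name__ == "__main__":
    image = ScannedImage(width=4, height=3)
    image.set(1, 2, 1234)
    image.set(0, 0, 70000)  # exceeds the 16-bit range and is clamped
    print(image.get(1, 2), image.get(0, 0))
```

Storing each pixel in a fixed-width word bounds the representable intensity range, which is why very bright pixels in such a sketch saturate at the maximum value.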
Features on the surface of a molecular array may have various different shapes and sizes, depending on the manufacturing process by which the molecular array is produced. In one important class of molecular arrays, features are tiny, disc-shaped regions on the surface of the molecular array produced by ink-jet-based application of probe molecules, or probe-molecular-precursors, to the surface of the molecular array substrate. FIG. 10 shows an idealized representation of a feature, such as the feature shown in FIG. 8, on a small section of the surface of a molecular array. FIG. 11 shows a graph of pixel-intensity values versus position along a line bisecting a feature in the scanned image of the feature. For example, the graph shown in FIG. 11 may be obtained by plotting the intensity values associated with pixels along lines 1002 or 1004 in FIG. 10. Consider a traversal of the pixels along line 1002 starting from point 1006 and ending at point 1008. In FIG. 11, points 1106 and 1108 along the horizontal axis correspond to positions 1006 and 1008 along line 1002 in FIG. 10. Initially, at positions well removed from the central, disc-shaped region of the feature 1010, the scanned signal intensity is relatively low. As the central, disc-shaped region of the feature is approached, along line 1002, the pixel intensities remain at a fairly constant, background level up to point 1012, corresponding to point 1112 in FIG. 11. Between points 1012 and 1014, corresponding to points 1112 and 1114 in FIG. 11, the average intensity of pixels rapidly increases to a relatively high intensity level 1115 at a point 1014 coincident with the outer edge of the central, disc-shaped region of the feature. The intensity remains relatively high over the central, disc-shaped region of the feature 1116, and begins to fall off starting at point 1018, corresponding to point 1118 in FIG. 11, at the far side of the central, disc-shaped region of the feature.
The intensity rapidly falls off with increasing distance from the central, disc-shaped region of the feature until again reaching a relatively constant, background level at point 1008, corresponding to point 1108 in FIG. 11. The exact shape of the pixel-intensity-versus-position graph, and the size and shape of the feature, are dependent on the particular type of molecular array and molecular-array substrate, the chromophore or radioisotope used to label target molecules, the experimental conditions to which the molecular array is subjected, the molecular-array scanner used to scan a molecular array, and the data processing components of the molecular-array scanner and an associated computer that produce the scanned image and pixel-intensity data sets. For example, with some types of array manufacturing processes or with different hybridization and washing protocols, the features may resemble donuts, or even more irregular blobs.
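The kind of intensity-versus-position traversal described above can be sketched in Python as follows. The function names, the sample intensity values, and the threshold rule for deciding where the feature begins and ends are all illustrative assumptions; they are not taken from any figure or from any particular scanner's software.

```python
def line_profile(image, row):
    """Return the pixel intensities along one image row as a list."""
    return list(image[row])


def feature_extent(profile, background, threshold):
    """Return (first, last) indices whose intensity exceeds the background
    level by more than `threshold`, or None if no pixel does.

    Assumption: a single fixed threshold above a known background level
    is enough to separate the feature from its surroundings.
    """
    above = [i for i, value in enumerate(profile)
             if value - background > threshold]
    if not above:
        return None
    return above[0], above[-1]


# Hypothetical intensities: row 1 passes through a feature, row 0 does not.
EXAMPLE_IMAGE = [
    [10, 11, 10, 12, 10, 11, 10, 12, 10, 11, 10],
    [10, 12, 11, 40, 200, 210, 205, 198, 45, 12, 10],
]

if __name__ == "__main__":
    profile = line_profile(EXAMPLE_IMAGE, 1)
    print(feature_extent(profile, background=10, threshold=100))
```

In this sketch the two intermediate values on either side of the high-intensity run play the role of the rapidly rising and falling regions of the profile, and are excluded from the feature extent by the threshold.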
The background signal generated during scanning regions of the surface of a molecular array outside of the areas corresponding to features arises from many different sources, including contamination of the molecular-array surface by fluorescent or radioactively labeled or naturally radioactive compounds, fluorescence or radiation emission from the molecular-array substrate, dark signal generated by the photo detectors in the molecular-array scanner, and many other sources. When this background signal is measured on the portion of the array that is outside of the areas corresponding to a feature, it is often referred to as the “local” background signal.
An important part of molecular-array data processing is a determination of the background signal that needs to be subtracted from a feature. With appropriate background-subtraction, it is possible to distinguish low-signal features from no-signal features and to calculate accurate and reproducible log ratios between multi-channel and/or inter-array data. The sources of background signal that appear in the local background region may be identical to the sources of background signal that occur on the feature itself; that is, the signal represented in the local background region may be additive to the signal that arises from the specific labeled target hybridized to probes on that feature. In this case, it is appropriate to use the signal from the local background region as the best estimate of the background to subtract from that feature.
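As a hedged illustration of why background subtraction matters for log ratios, the following Python sketch computes a two-channel log ratio from raw and background signals. The function name, the choice of base 10, and the decision to return `None` for non-positive background-subtracted signals are assumptions made for this example only.

```python
import math


def log_ratio(red_raw, red_background, green_raw, green_background, base=10.0):
    """Log ratio of background-subtracted red and green feature signals.

    Assumption: the background is purely additive in each channel, so it
    can be removed by simple subtraction before the ratio is formed.
    Returns None when either subtracted signal is not positive, because
    the logarithm is then undefined.
    """
    red = red_raw - red_background
    green = green_raw - green_background
    if red <= 0 or green <= 0:
        return None
    return math.log(red / green, base)


if __name__ == "__main__":
    # With subtraction: (1100 - 100) / (200 - 100) = 10.
    print(log_ratio(1100, 100, 200, 100))
    # Without subtraction the same raw signals give a ratio of only 5.5.
    print(log_ratio(1100, 0, 200, 0))
```

The two calls use the same raw signals and differ only in whether the background is removed, which shows how an unsubtracted additive background compresses the ratio for low-signal features.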
FIG. 12 illustrates a currently employed technique for measuring the local background signal for a feature. FIG. 12 corresponds to the small region of the scanned image of a molecular array shown in FIG. 8. Initial pixel-based coordinates for the center of the feature can be estimated from manufacturing data for the molecular array and from a number of scanned-image processing techniques. Using these initial pixel-based coordinates for the center of the feature, the integrated intensities of disc-shaped regions with increasing radii centered at those coordinates can be computed to determine, by a decrease in integrated intensities, the outer edge 1202 of the central, disc-shaped feature region. An intermediate region, in which the integrated pixel intensities rapidly fall off with increasing radius, corresponding to the regions in FIG. 11 between points 1112 and 1114 and between points 1118 and 1108, can be determined to provide the outer boundary 1204 of a region of interest (“ROI”) surrounding and including the central, disc-shaped region of the feature. Finally, an annulus lying between the outer edge of the ROI 1204 and a somewhat arbitrary outer background circumference 1206 is considered to be the background region for the feature, and the integrated intensity of this background region 1208, divided by the area of the background region, is taken to be the background signal for the entire region comprising the feature region, feature ROI, and background annulus. Alternatively, the locations and sizes of feature regions may be known in advance of the image processing stage, based on array manufacturing data and other information, and so the ROI may not need to be determined by a method such as the method described above.
Thus, a current technique for background signal estimation is based on a local method involving determining an integrated signal intensity for an annulus surrounding the ROI disc associated with a feature, and determining a background-signal intensity per image area. The estimated local background signal for a feature is the background-signal intensity per image area, and is subtracted from the normalized raw feature signal to produce a background-subtracted feature signal. A feature-based data set includes a background-subtracted data subset, for each signal scanned, comprising feature signals or raw feature signals.
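The local, annulus-based estimate described above can be sketched in Python under stated assumptions. The function names are hypothetical; the feature radius, ROI radius, and outer background radius are assumed to be already known; and the "normalized raw feature signal" is assumed here to mean the mean intensity per pixel within the feature disc.

```python
def region_mean(image, center_row, center_col, outer_radius, inner_radius=None):
    """Mean pixel intensity in a disc, or in an annulus when inner_radius
    is given (pixels at or inside the inner radius are excluded)."""
    total = 0
    count = 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            d2 = (r - center_row) ** 2 + (c - center_col) ** 2
            if d2 > outer_radius ** 2:
                continue
            if inner_radius is not None and d2 <= inner_radius ** 2:
                continue
            total += value
            count += 1
    if count == 0:
        raise ValueError("region contains no pixels")
    # Integrated intensity divided by the area of the region, in pixels.
    return total / count


def local_background_subtracted(image, center_row, center_col,
                                feature_radius, roi_radius, outer_radius):
    """Feature signal per pixel minus annulus background signal per pixel."""
    feature = region_mean(image, center_row, center_col, feature_radius)
    background = region_mean(image, center_row, center_col,
                             outer_radius, inner_radius=roi_radius)
    return feature - background


def synthetic_feature_image(size, feature_radius, feature_level, background_level):
    """Build an idealized test image: a uniform disc on a uniform background."""
    mid = size // 2
    return [[feature_level
             if (r - mid) ** 2 + (c - mid) ** 2 <= feature_radius ** 2
             else background_level
             for c in range(size)]
            for r in range(size)]


if __name__ == "__main__":
    image = synthetic_feature_image(21, 4, 1100, 100)
    print(local_background_subtracted(image, 10, 10, 4, 6, 9))
```

The gap between the feature radius and the ROI radius in this sketch stands in for the intermediate region in which intensities fall off, so that those pixels contribute to neither the feature signal nor the background estimate.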
Unfortunately, as the density of features placed on molecular-array substrates increases, the local background-signal estimation technique illustrated in FIG. 12 begins to fail. FIGS. 13A–B illustrate a problem with local background-signal estimation that arises with high feature densities. In FIG. 13A, the background annuli, for example background annulus 1302, surrounding features laid out in a grid-like pattern on a small region of a molecular array substrate 1304 are shown to be relatively well-spaced and discrete. However, in a higher-density molecular array, where the same features are more closely crowded together, as shown in FIG. 13B, the background annulus of one feature, for example, background annulus 1306, may overlap with the background annuli 1308 and 1310 of neighboring features and may, in addition, overlap 1312 and 1314 with the ROI or even the central, disc-shaped region of neighboring features. Overlap 1312 and 1314 of a background annulus 1306 with neighboring ROIs can significantly raise the background-signal estimate above the true, non-feature and non-ROI background-signal intensity level. At certain feature densities, it may be possible to decrease the thickness of background annuli in order to prevent overlap, but background annuli cannot be arbitrarily decreased in size past a certain limit. There must be, for example, a minimum number of pixels within the background annulus in order to generate a statistically significant estimation of the intensity of pixels within the background region surrounding a feature. There is, in addition, another problem with the currently employed local background-signal estimation technique illustrated in FIG. 12. As seen in FIG. 13A, the background annuli are discrete, so that the background signal estimated across features of a molecular array is not a continuous function of position with respect to the molecular array.
Thus, it is difficult to use local backgrounds for estimating non-local background-related phenomena, such as background-signal gradients and other such phenomena. For these reasons, designers, manufacturers, and users of molecular arrays have recognized the need for a method for accurately determining an estimated background signal for densely packed features and for estimating background signal in a continuous fashion with respect to position on the surface of a molecular array.
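The crowding problem can be made concrete with a small geometric sketch in Python. It assumes, purely for illustration, identical circular features on a square grid with a uniform center-to-center spacing (called `pitch` here); the function names and the numerical values are hypothetical.

```python
import math


def annulus_overlaps(pitch, roi_radius, outer_radius):
    """Report whether the background annulus of one feature reaches into
    the annulus or the ROI of its nearest grid neighbor.

    Assumption: identical circular features whose centers are `pitch`
    pixels apart.
    """
    return {
        # Two annuli of the same outer radius intersect when the centers
        # are closer together than twice that radius.
        "neighbor_annulus": pitch < 2 * outer_radius,
        # The annulus reaches a neighboring ROI when the centers are
        # closer than the outer radius plus the ROI radius.
        "neighbor_roi": pitch < outer_radius + roi_radius,
    }


def annulus_area(roi_radius, outer_radius):
    """Approximate number of pixels available in the background annulus."""
    return math.pi * (outer_radius ** 2 - roi_radius ** 2)


if __name__ == "__main__":
    for pitch in (30, 18, 14):
        print(pitch, annulus_overlaps(pitch, roi_radius=6, outer_radius=10))
    # Shrinking the annulus avoids overlap but leaves fewer pixels.
    print(annulus_area(6, 10), annulus_area(6, 7))
```

Under these assumptions, reducing the outer radius far enough to avoid overlap at a small pitch also sharply reduces the annulus area, which is the trade-off between overlap and statistical significance noted above.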