The present invention is related to computationally aided analysis of molecular-array data. In order to facilitate discussion of the graphical user interface (“GUI”), a general background of molecular-array technology is provided in this section, and the paragraphs that follow.
Molecular arrays are also referred to as “microarrays” and simply as “arrays” in the literature. Molecular arrays are not regular patterns of molecules, such as occur on the faces of crystalline materials, nor arbitrary patterns produced in manufacturing or printing processes. Rather, as the following discussion shows, molecular arrays are manufactured articles specifically designed for analysis of solutions of compounds of chemical, biochemical, biomedical, and other interest.
Array technologies have gained prominence in biological research and are likely to become important and widely used diagnostic tools in the healthcare industry. Currently, molecular-array techniques are most often used to determine the concentrations of particular nucleic-acid polymers in complex sample solutions. Molecular-array-based analytical techniques are not, however, restricted to analysis of nucleic acid solutions, but may be employed to analyze complex solutions of any type of molecule that can be optically or radiometrically scanned or read and that can bind with high specificity to complementary molecules synthesized within, or bound to, discrete features on the surface of an array. Because arrays are widely used for analysis of nucleic acid samples, the following background information on arrays is introduced in the context of analysis of nucleic acid solutions following a brief background of nucleic acid chemistry.
Deoxyribonucleic acid (“DNA”) and ribonucleic acid (“RNA”) are linear polymers, each synthesized from four different types of subunit molecules. The subunit molecules for DNA include: (1) deoxy-adenosine, abbreviated “A,” a purine nucleoside; (2) deoxy-thymidine, abbreviated “T,” a pyrimidine nucleoside; (3) deoxy-cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) deoxy-guanosine, abbreviated “G,” a purine nucleoside. FIG. 1 illustrates a short DNA polymer 100, called an oligomer, composed of the following subunits: (1) deoxy-adenosine 102; (2) deoxy-thymidine 104; (3) deoxy-cytidine 106; and (4) deoxy-guanosine 108. When phosphorylated, subunits of DNA and RNA molecules are called “nucleotides” and are linked together through phosphodiester bonds 110-115 to form DNA and RNA polymers. A linear DNA molecule, such as the oligomer shown in FIG. 1, has a 5′ end 118 and a 3′ end 120. A DNA polymer can be chemically characterized by writing, in sequence from the 5′ end to the 3′ end, the single-letter abbreviations for the nucleotide subunits that together compose the DNA polymer. For example, the oligomer 100 shown in FIG. 1 can be chemically represented as “ATCG.” A DNA nucleotide comprises a purine or pyrimidine base (e.g. adenine 122 of the deoxy-adenylate nucleotide 102), a deoxy-ribose sugar (e.g. deoxy-ribose 124 of the deoxy-adenylate nucleotide 102), and a phosphate group (e.g. phosphate 126) that links one nucleotide to another nucleotide in the DNA polymer.
The DNA polymers that contain the organization information for living organisms occur in the nuclei of cells in pairs, forming double-stranded DNA helices. One polymer of the pair is laid out in a 5′ to 3′ direction, and the other polymer of the pair is laid out in a 3′ to 5′ direction. The two DNA polymers in a double-stranded DNA helix are therefore described as being anti-parallel. The two DNA polymers, or strands, within a double-stranded DNA helix are bound to each other through attractive forces including hydrophobic interactions between stacked purine and pyrimidine bases and hydrogen bonding between purine and pyrimidine bases, the attractive forces emphasized by conformational constraints of DNA polymers. Because of a number of chemical and topographic constraints, double-stranded DNA helices are most stable when deoxy-adenylate subunits of one strand hydrogen bond to deoxy-thymidylate subunits of the other strand, and deoxy-guanylate subunits of one strand hydrogen bond to corresponding deoxy-cytidylate subunits of the other strand.
FIGS. 2A-B illustrate the hydrogen bonding between the purine and pyrimidine bases of two anti-parallel DNA strands. AT and GC base pairs, illustrated in FIGS. 2A-B, are known as Watson-Crick (“WC”) base pairs. Two DNA strands linked together by hydrogen bonds form the familiar helical structure of double-stranded DNA. FIG. 3 illustrates a short section of a DNA double helix 300 comprising a first strand 302 and a second, anti-parallel strand 304.
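The pairing rules and the anti-parallel orientation described above can be illustrated with a short Python sketch. The function names `reverse_complement` and `is_complementary` are hypothetical and introduced only for illustration; the sketch treats a strand as a string of the letters A, T, C, and G written from the 5′ end to the 3′ end, and considers only perfect WC pairing.

```python
# Watson-Crick pairing rules for DNA: A pairs with T, and G pairs with C.
WC_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the anti-parallel complementary strand, written 5' to 3'."""
    # Reversing the sequence accounts for the opposite strand direction.
    return "".join(WC_PAIRS[base] for base in reversed(seq.upper()))


def is_complementary(strand_a: str, strand_b: str) -> bool:
    """True if two strands, both written 5' to 3', pair perfectly end to end."""
    return reverse_complement(strand_a) == strand_b.upper()


# The oligomer "ATCG" of FIG. 1 pairs with the strand written "CGAT".
partner = reverse_complement("ATCG")
```

Because both strands are conventionally written 5′ to 3′, the complementary strand is obtained by complementing each base and reversing the order, rather than by complementing alone.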
Double-stranded DNA may be denatured, or converted into single-stranded DNA, by changing the ionic strength of the solution containing the double-stranded DNA or by raising the temperature of the solution. Single-stranded DNA polymers may be renatured, or converted back into DNA duplexes, by reversing the denaturing conditions, for example by lowering the temperature of the solution containing complementary single-stranded DNA polymers. During renaturing or hybridization, complementary bases of anti-parallel DNA strands form WC base pairs in a cooperative fashion, leading to reannealing of the DNA duplex.
The ability to denature and renature double-stranded DNA has led to the development of many extremely powerful and discriminating assay technologies for identifying the presence of DNA and RNA polymers having particular base sequences or containing particular base subsequences within complex mixtures of different nucleic acid polymers, other biopolymers, and inorganic and organic chemical compounds. One such methodology is the array-based hybridization assay. FIGS. 4-7 illustrate the principle of the array-based hybridization assay. An array (402 in FIG. 4) comprises a substrate upon which a regular pattern of features is prepared by various manufacturing processes. The array 402 in FIG. 4, and in subsequent FIGS. 5-7, has a grid-like 2-dimensional pattern of square features, such as feature 404 shown in the upper left-hand corner of the array. Each feature of the array contains a large number of identical oligonucleotides covalently bound to the surface of the feature. These bound oligonucleotides are known as probes. In general, chemically distinct probes are bound to the different features of an array, so that each feature corresponds to a particular nucleotide sequence. In FIGS. 4-6, the principle of array-based hybridization assays is illustrated with respect to the single feature 404 to which a number of identical probes 405-409 are bound. In practice, each feature of the array contains a high density of such probes but, for the sake of clarity, only a subset of these are shown in FIGS. 4-6.
Once an array has been prepared, the array may be exposed to a sample solution of target DNA or RNA molecules (410-413 in FIG. 4) labeled with fluorophores, chemiluminescent compounds, or radioactive atoms 415-418. Labeled target DNA or RNA hybridizes through base pairing interactions to the complementary probe DNA, synthesized on the surface of the array. FIG. 5 shows a number of such target molecules 502-504 hybridized to complementary probes 505-507, which are in turn bound to the surface of the array 402. Targets, such as labeled DNA molecules 508 and 509, that do not contain nucleotide sequences complementary to any of the probes bound to the array surface do not hybridize to generate stable duplexes and, as a result, tend to remain in solution. The sample solution is then rinsed from the surface of the array, washing away any unbound labeled DNA molecules. In other embodiments, unlabeled target sample is allowed to hybridize with the array first. Typically, such a target sample has been modified with a chemical moiety that will react with a second chemical moiety in subsequent steps. Then, either before or after a wash step, a solution containing the second chemical moiety bound to a label is reacted with the target on the array. After washing, the array is ready for data acquisition by scanning or reading. Biotin and avidin represent an example of a pair of chemical moieties that can be utilized for such steps.
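The principle of the hybridization step can be sketched in Python under strongly idealized assumptions. In the sketch, an array is modeled as a mapping from feature positions to probe sequences, and a target is considered to hybridize to a feature only when the target contains a subsequence exactly complementary to the probe. The name `hybridized_features` and the example sequences are hypothetical; actual hybridization depends on temperature, ionic strength, and partial mismatches, none of which is modeled here.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse to keep 5'-to-3' notation."""
    return seq.translate(COMPLEMENT)[::-1]


def hybridized_features(probes, targets):
    """Return positions of features bound by at least one target.

    probes:  dict mapping (row, column) to a probe sequence (5' to 3').
    targets: list of target sequences in the sample (5' to 3').
    """
    hits = set()
    for position, probe in probes.items():
        # The site a target must contain in order to pair with this probe.
        site = reverse_complement(probe)
        if any(site in target for target in targets):
            hits.add(position)
    return hits


# Two features with distinct probes, and a sample with a single target.
example_probes = {(0, 0): "ATCG", (0, 1): "GGGG"}
example_targets = ["TTCGATTT"]
bound = hybridized_features(example_probes, example_targets)
```

In this example only the feature at position (0, 0) is bound, because the target contains “CGAT,” the reverse complement of the probe “ATCG,” while no target contains “CCCC.”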
Finally, as shown in FIG. 6, the bound labeled DNA molecules are detected via optical or radiometric scanning or reading. Optical scanning and reading both involve exciting labels of bound labeled DNA molecules with electromagnetic radiation of appropriate frequency and detecting fluorescent emissions from the labels, or detecting light emitted from chemiluminescent labels. When radioisotope labels are employed, radiometric scanning or reading can be used to detect the signal emitted from the hybridized features. Additional types of signals are also possible, including electrical signals generated by electrical properties of bound target molecules, magnetic properties of bound target molecules, and other such physical properties of bound target molecules that can produce a detectable signal. Optical, radiometric, or other types of scanning and reading produce an analog or digital representation of the array as shown in FIG. 7, with features to which labeled target molecules are hybridized, such as feature 706, optically or digitally differentiated from those features to which no labeled DNA molecules are bound. In other words, the analog or digital representation of a scanned array displays positive signals for features to which labeled DNA molecules are hybridized and displays negative signals for features to which no, or an undetectably small number of, labeled DNA molecules are bound. Features displaying positive signals in the analog or digital representation indicate the presence of DNA molecules with complementary nucleotide sequences in the original sample solution. Moreover, the signal intensity produced by a feature is generally related to the amount of labeled DNA bound to the feature, which is in turn related to the concentration, in the sample to which the array was exposed, of labeled DNA complementary to the oligonucleotide within the feature.
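The distinction between positive and negative features can be illustrated with a minimal Python sketch. The sketch assumes, purely for illustration, that a feature is called positive when its signal exceeds the mean of a set of background readings by `k` sample standard deviations; the function name `call_features`, the parameter `k`, and the example values are hypothetical and are not taken from any particular scanning system.

```python
from statistics import mean, stdev


def call_features(feature_signals, background_samples, k=3.0):
    """Label each feature True (positive) or False (negative).

    feature_signals:    dict mapping a feature identifier to its signal.
    background_samples: signals read from areas of the array with no feature.
    k:                  assumed multiplier setting how far above background
                        a signal must lie to count as detectable.
    """
    threshold = mean(background_samples) + k * stdev(background_samples)
    return {name: signal > threshold
            for name, signal in feature_signals.items()}


# Hypothetical readings: one strongly hybridized feature, one near background.
calls = call_features({"f1": 250.0, "f2": 12.0}, [10, 12, 11, 9, 13])
```

With these example values the threshold is roughly 15.7, so feature `f1` is called positive and feature `f2` is called negative.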
One, two, or more than two data subsets within a data set can be obtained from a single molecular array by scanning or reading the molecular array for one, two, or more than two types of signals. Two or more data subsets can also be obtained by combining data from two different arrays. When optical scanning or reading is used to detect fluorescent or chemiluminescent emission from chromophore labels, a first set of signals, or data subset, may be generated by scanning or reading the molecular array at a first optical wavelength, a second set of signals, or data subset, may be generated by scanning or reading the molecular array at a second optical wavelength, and additional sets of signals may be generated by scanning or reading the molecular array at additional optical wavelengths. Different signals may be obtained from a molecular array by radiometric scanning or reading to detect radioactive emissions at one, two, or more than two different energy levels. Target molecules may be labeled with either a first chromophore that emits light at a first wavelength, or a second chromophore that emits light at a second wavelength. Following hybridization, the molecular array can be scanned or read at the first wavelength to detect target molecules, labeled with the first chromophore, hybridized to features of the molecular array, and can then be scanned or read at the second wavelength to detect target molecules, labeled with the second chromophore, hybridized to the features of the molecular array. In one common molecular array system, the first chromophore emits light at a red visible-light wavelength, and the second chromophore emits light at a green visible-light wavelength.
The data set obtained from scanning or reading the molecular array at the red wavelength is referred to as the “red signal,” and the data set obtained from scanning or reading the molecular array at the green wavelength is referred to as the “green signal.” While it is common to use one or two different chromophores, it is possible to use one, three, four, or more than four different chromophores and to scan or read a molecular array at one, three, four, or more than four wavelengths to produce one, three, four, or more than four data sets.
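The relationship between a data set and its per-wavelength data subsets can be sketched as a simple Python data structure. The class name `ArrayDataSet`, its methods, and the example signal values are hypothetical and serve only to illustrate one possible organization, in which each scan at a given wavelength or energy level contributes one subset keyed by feature position.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Position = Tuple[int, int]  # (row, column) of a feature on the array


@dataclass
class ArrayDataSet:
    """A data set holding one data subset per scanned channel."""
    subsets: Dict[str, Dict[Position, float]] = field(default_factory=dict)

    def add_scan(self, channel: str, signals: Dict[Position, float]) -> None:
        """Store the signals read at one wavelength or energy level."""
        self.subsets[channel] = dict(signals)

    def signals_for(self, position: Position) -> Dict[str, float]:
        """Collect every channel's signal for a single feature."""
        return {
            channel: signals[position]
            for channel, signals in self.subsets.items()
            if position in signals
        }


# A two-channel data set holding a "red signal" and a "green signal".
data = ArrayDataSet()
data.add_scan("red", {(0, 0): 812.0, (0, 1): 40.0})
data.add_scan("green", {(0, 0): 205.0, (0, 1): 390.0})
```

Keying the subsets by channel name allows one, two, or more than two channels to be stored in the same structure, and subsets obtained from two different arrays could be added in the same way.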
Many sophisticated computational techniques are applied to the raw, pixel-intensity-based data scanned from a molecular array. Many commercial systems employ a variety of techniques to scan the pixel-based image representation of molecular-array data to locate and index features, and to then extract data from the features and normalize extracted data. Quite often, these techniques produce satisfactory results. However, a great deal of seemingly random and systematic noise may be introduced into molecular-array data as a result of systematic errors that arise during manufacture of molecular arrays, during exposure of molecular arrays to sample solutions, and during post-exposure molecular-array processing. For example, when features are deposited by ink-jet technologies, the feature grid may be distorted due to mechanical irregularities, and features may be deposited in areas shaped differently from the desired disk shape. Because of the many different variables in chemical synthesis of probe molecules, probe molecules may end up distributed non-uniformly within the area of the molecular-array surface corresponding to a feature. During exposure of a molecular array to a sample solution, target molecules may be non-uniformly bound to molecular array features. Following exposure of the molecular array to a sample solution, features may be corrupted due to fingerprints, mechanical abrasion, chemical and particulate contamination, microbial growth, and various other types of events and processes.
FIGS. 8A-F illustrate a few of the many types of feature irregularities that may occur in a molecular array, and in the pixel-based representation of data scanned from a molecular array. In FIG. 8A, a feature 802 is seen with a desirable, circular disk shape perfectly aligned with an expected or calculated xy-position within a rectilinear coordinate grid used to describe feature positions on the surface of the molecular array or in the pixel-based representation of data extracted from a molecular array. However, as shown in FIG. 8B, a well-formed feature 804 may be translationally displaced with respect to an expected or calculated xy-position. Yet another type of irregularity that may occur is that the feature, rather than being disk shaped, may instead be elliptically shaped, as are features 806 and 808 in FIGS. 8C and 8D, respectively. Note that, in general, the elliptical deformations, or directions of the major axes of elliptical features, tend to be oriented either vertically or horizontally with respect to the molecular array, and with respect to the rectilinear coordinate system describing positions on the surface of the molecular array, because mechanical irregularities in the manufacture of molecular arrays tend to produce distortions in the directions in which ink-jet pens, or other deposition devices, track across the surface of the molecular array.
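Translational displacement and axis-aligned elongation of the kind shown in FIGS. 8B-D can be quantified with a short Python sketch. The sketch assumes that a set of pixel coordinates belonging to the feature has already been identified by some other means; the function name `mask_shape` and the returned keys are hypothetical. Because the deformations described above tend to be vertical or horizontal, only the spread along the two grid axes is computed.

```python
import math


def mask_shape(mask_pixels, expected_xy):
    """Summarize position and axis-aligned spread of a feature's pixels.

    mask_pixels: iterable of (x, y) coordinates judged to lie in the feature.
    expected_xy: the expected or calculated (x, y) grid position.
    """
    points = list(mask_pixels)
    if not points:
        raise ValueError("feature mask is empty")
    n = len(points)
    # Geometric center of the pixels, ignoring intensity.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Root-mean-square spread along each grid axis.
    spread_x = math.sqrt(sum((x - cx) ** 2 for x, _ in points) / n)
    spread_y = math.sqrt(sum((y - cy) ** 2 for _, y in points) / n)
    return {
        "offset": (cx - expected_xy[0], cy - expected_xy[1]),
        "spread_x": spread_x,
        "spread_y": spread_y,
    }


# A horizontally elongated block of pixels centered on the origin.
wide = [(x, y) for x in range(-3, 4) for y in range(-1, 2)]
summary = mask_shape(wide, (0, 0))
```

A nonzero offset suggests a displaced feature, and a markedly unequal pair of spreads suggests elongation along one axis. Both are rough heuristics and do not capture asymmetric or otherwise irregular shapes.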
Features may also be asymmetrically shaped, as is feature 810 in FIG. 8E, or may be symmetrically, but non-elliptically shaped, as is the feature 812 in FIG. 8F. The irregularities illustrated in FIGS. 8A-F all concern a region of interest, or data-containing area of a pixel-based representation of the data collected from the surface of a molecular array, that produces a significant signal above a calculated background signal. Many automated molecular-array-data processing systems attempt to automatically correct for the shape and position irregularities, examples of which are shown in FIGS. 8B-F. However, many of these automated systems are quite limited in the models that they employ for describing feature shapes and regions of interest. In many systems, a pixel-intensity centroid may be calculated from the pixels within a calculated region of interest in order to select a pixel corresponding to the center of the feature, from which subsequent calculations can be made. However, all of these methods may fail to properly account for feature shape and positional irregularities, and may lead to anomalies in signal data calculated from pixel-based representations of the data scanned from molecular arrays.
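The pixel-intensity centroid mentioned above can be computed as in the following Python sketch. The function names `intensity_centroid` and `center_pixel` are hypothetical, and the sketch assumes that the pixels within the region of interest are supplied as (x, y, intensity) triples after any background correction.

```python
def intensity_centroid(pixels):
    """Return the intensity-weighted mean position (cx, cy).

    pixels: iterable of (x, y, intensity) triples within the
            calculated region of interest.
    """
    pixels = list(pixels)
    total = sum(intensity for _, _, intensity in pixels)
    if total <= 0:
        raise ValueError("region of interest has no positive signal")
    cx = sum(x * intensity for x, _, intensity in pixels) / total
    cy = sum(y * intensity for _, y, intensity in pixels) / total
    return cx, cy


def center_pixel(pixels):
    """Select the pixel nearest the centroid as the feature center."""
    cx, cy = intensity_centroid(pixels)
    return round(cx), round(cy)


# Two pixels on one row; the brighter pixel pulls the centroid toward it.
roi = [(0, 0, 1.0), (4, 0, 3.0)]
centroid = intensity_centroid(roi)
```

Because every pixel in the region of interest contributes in proportion to its intensity, a bright contaminant near the edge of the region shifts the centroid, which is one way in which this approach can misplace the center of an irregular feature.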
FIGS. 9-10 illustrate a second type of feature-signal irregularity, or non-uniformity, that commonly occurs in the pixel-based representation of data scanned from the surface of a molecular array. As shown in FIG. 9, a feature 902 in the scanned image of a molecular array comprises a number of pixels, such as pixel 904, within a region of interest corresponding to the feature. In the case of feature 902 in FIG. 9, the region of interest is disk shaped, and is centered about the grid-point origin 906. Each pixel within the region of interest, such as pixel 904, is associated with an intensity value, representing the signal strength read from the portion of the surface of the molecular array corresponding to the area and location of the pixel. FIG. 10 illustrates the signal intensities corresponding to each pixel within feature 902 of FIG. 9. In FIG. 10, the vertical height, in the z direction 1002, of the rectangular column rising from each pixel represents the signal intensity of that pixel. Note that the pixel intensities are relatively high at the center of the feature, and fall off dramatically towards the edge of the feature. Such a distribution of pixel intensities within the feature may arise from a variety of different sources. Chemical feature deposition methods may result in probe molecules being concentrated in central portions of a feature as the solution containing probe molecules or probe-molecule precursors deposited on the molecular-array surface evaporates inward from the original boundaries of the feature. Alternatively, different probe-synthetic or probe-deposition solutions may result in concentration of probe molecules in the original boundary regions of a deposited feature, producing an outer annular region of high intensity that falls off radially towards the center of the feature.
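The center-peaked and annular intensity distributions described above can be distinguished by averaging pixel intensities in concentric rings about the feature center. The following Python sketch illustrates the idea; the function name `radial_profile`, the `bin_width` parameter, and the example intensities are hypothetical, and the feature center is assumed to be known.

```python
import math
from collections import defaultdict


def radial_profile(pixels, center, bin_width=1.0):
    """Return the mean intensity per radial ring, innermost ring first.

    pixels:    dict mapping (x, y) to intensity within the region of interest.
    center:    (x, y) position taken as the center of the feature.
    bin_width: assumed radial thickness of each ring, in pixels.
    """
    if not pixels:
        return []
    cx, cy = center
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (x, y), intensity in pixels.items():
        ring = int(math.hypot(x - cx, y - cy) // bin_width)
        sums[ring] += intensity
        counts[ring] += 1
    outermost = max(counts)
    # Rings that contain no pixels are reported with a mean of 0.0.
    return [sums[r] / counts[r] if r in counts else 0.0
            for r in range(outermost + 1)]


# A small feature whose intensity falls off from center to edge.
feature = {
    (0, 0): 100.0,
    (1, 0): 50.0, (-1, 0): 50.0, (0, 1): 50.0, (0, -1): 50.0,
    (2, 0): 10.0, (0, 2): 10.0,
}
profile = radial_profile(feature, (0, 0))
```

A profile that decreases from the first ring to the last corresponds to the center-peaked distribution of FIG. 10, while a profile that increases toward the outer rings corresponds to the annular distribution.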
Automated feature extraction software may attempt to model signal distributions within features, and locally normalize intensities during computation of integrated pixel-intensity signals that represent the total signals for features scanned from a molecular array. However, such automated feature extraction methods are often constrained by relatively simplistic models used to model pixel-intensity distributions, and often do not allow for the knowledge of particular types of molecular arrays, or molecular-array experiments, to be employed in order to assist in integrating pixel intensities to produce feature signals. For these reasons, the designers, manufacturers, and, in particular, users of microarrays have all recognized the need for a more flexible method that would allow molecular-array users to tailor feature extraction and pixel-intensity integration to pixel-intensity-distribution models known to the users of molecular arrays based on the types of probe molecules included in the molecular arrays, the techniques by which the molecular arrays are manufactured, the types of experiments in which the molecular arrays are employed, and the types of contamination and post-exposure processing to which the molecular arrays may have been subjected prior to scanning.