Immobilized oligonucleotides or polynucleotides of known nucleic acid sequences in an array on a substrate can be used as "probes" for monitoring levels of gene expression or to determine the absence or presence of known or new mutations in gene sequences. Sequences of nucleic acids are synthesized into polynucleotides and oligonucleotides either directly on the substrate (in situ), or indirectly (e.g., pre-synthesized) and deposited onto the substrate into an array pattern using well-known methods. Such methods are referenced below. These oligonucleotides are immobilized on the substrate in the array pattern.
The plurality of probes in each location in the array is known in the art as a "nucleic acid feature" or "feature". A feature is defined as a locus onto which a large number of probes, all having the same nucleotide sequence, are immobilized. The oligonucleotide probes are exposed, for hybridization purposes, to a sample containing nucleic acids of known sequences at unknown concentrations, or unknown sequences, to be tested or evaluated. These nucleic acids are known in the art as "targets". Note that some investigators also use the reverse definition, referring to the surface-bound oligonucleotides as targets and the solution sample nucleic acids as probes. Henceforth, this application shall use "probes" to describe surface-bound oligonucleotides and "targets" to describe nucleic acids in solution that comprise the analytic sample in some assay or procedure. The nucleic acids or nucleotides in the target sample may be complementary to the nucleic acid or nucleotide sequences in the oligonucleotide probes.
Hybridization is the process by which complementary nucleic acids pair up, associate or bond together. Using well-known processes and conditions for hybridization, the sample nucleic acid "targets" will hybridize with the nucleic acids of known oligonucleotide probe sequences, and thus information about the target samples can be obtained. The processes and conditions of hybridization between nucleotide sequences are referenced below.
Depending on the make-up of the target sample, hybridization of probe features may or may not occur at all probe feature locations and will occur to varying degrees at the different probe feature locations. After hybridization of the targets with the probe features, the array is analyzed by well-known methods. Hybridized arrays are often interrogated using optical methods. Typically, the targets are labeled using well-known methods and with well-known substances, e.g., a fluorophore that will fluoresce when exposed to a light source. The targets are labeled with a fluorophore either before the targets are applied to the array substrate, or after hybridization with an array substrate, such that the fluorophore will associate only with probe-bound hybridized targets.
Typically, measuring the hybridization to an array of known nucleic acid probes gives valuable information about the target samples. A focused light source (usually a laser) is scanned across the hybridized array causing the hybridized areas to emit an optical signal, such as fluorescence. The fluorophore-specific fluorescence data is collected and measured during the scanning operation, and then an image of the array is reconstructed via appropriate algorithms, software and computer hardware. The expected or intended locations of probe nucleic acid features can then be combined with the fluorescence intensities measured at those locations, to yield the data that is then used to determine gene expression levels or nucleic acid sequence of the target samples. The process of collecting data from expected probe locations is referred to as "feature extraction". The conventional equipment and methods of feature extraction are limited by their dependence upon the expected or intended location of the probe features on the substrate array, which is subject to the accuracy of the manufacturing equipment.
The scanning equipment typically used for the evaluation of hybridized arrays includes a scanning fluorometer and is commercially available from different sources, such as Molecular Dynamics of Sunnyvale, Calif., General Scanning of Watertown, Mass., Hewlett Packard of Palo Alto, Calif., or Hitachi USA of So. San Francisco, Calif. Analysis of the data (i.e., collection, reconstruction of image, comparison and interpretation of data) is performed with associated computer systems and commercially available software, such as IMAGEQUANT™ by Molecular Dynamics or GENECHIP™ by Affymetrix of Santa Clara, Calif.
The laser light source generates a collimated beam. The collimated beam sequentially illuminates small surface regions of known location. The resulting fluorescence signals from the surface regions are collected either confocally (employing the same lens used to focus the laser light onto the array) or off-axis (using a separate lens positioned to one side of the lens used to focus the laser onto the array). The collected signals are transmitted through appropriate spectral filters to an optical detector. A recording device, such as a computer memory, records the detected signals and builds up a raster scan file of intensities as a function of position, or of time as it relates to the position. Such intensities, as a function of position, shall henceforth be referred to as "pixels". The pixels within a region centered upon the expected or intended position of a feature can be averaged to yield the relative quantity of target hybridized to the probe in that feature, if the expected or intended position of the feature is sufficiently close to its true position. For a discussion of the optical scanning equipment, see for example, U.S. Pat. No. 5,760,951 (confocal scanner) and U.S. Pat. No. 5,585,639 (off-axis scanner), each incorporated herein by reference.
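By way of illustration only, the recording and averaging steps described above can be sketched as follows. The sketch assumes a unidirectional scan with a fixed number of samples per line and a circular averaging region; the names build_raster and mean_in_disk are hypothetical and do not correspond to any commercial software.

```python
from typing import List, Sequence, Tuple

Raster = List[List[float]]  # raster[row][col] holds one pixel intensity


def build_raster(samples: Sequence[float], pixels_per_line: int) -> Raster:
    """Arrange time-ordered detector samples into rows of a raster scan.

    Assumes (hypothetically) that every scan line runs in the same direction
    and contains exactly `pixels_per_line` samples.
    """
    if pixels_per_line <= 0 or len(samples) % pixels_per_line != 0:
        raise ValueError("samples must fill a whole number of scan lines")
    return [list(samples[i:i + pixels_per_line])
            for i in range(0, len(samples), pixels_per_line)]


def mean_in_disk(raster: Raster, center: Tuple[float, float], radius: float) -> float:
    """Average the pixels whose centers lie within `radius` of `center`.

    `center` is (row, col) in pixel units, i.e. the expected feature position.
    """
    center_row, center_col = center
    total = 0.0
    count = 0
    for r, row in enumerate(raster):
        for c, value in enumerate(row):
            if (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2:
                total += value
                count += 1
    if count == 0:
        raise ValueError("averaging region contains no pixels")
    return total / count
```

The average is meaningful only to the extent that the supplied center is close to the true feature position, which is the limitation discussed in the remainder of this section.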
A general problem in the feature extraction process described above is the extraction of features having weak or low fluorescence intensities, called "dim features". A feature that yields little or no hybridization to the target sample will produce a low average fluorescence intensity when scanned (i.e., will display poor intensity contrast, relative to a background). However, the dim features are just as important in the analysis of genes as are bright features (having extensive hybridization). The majority of genes in a given cell type are expressed at low levels (for example, less than about 50 copies of the gene per cell). Therefore, an array constructed from features that measure expression levels of any plurality of available genes will result in a majority of the hybridized features being dim rather than bright.
If the dim hybridized probe feature is located or positioned accurately in the array and is of known shape, then accurate feature extraction can be performed automatically, using relatively simple algorithms. The computer is programmed to analyze predefined regions of interest on the array based on the expected or intended locations of the probe features that were placed by the manufacturing equipment. The computer will analyze the results of the optical scan by considering the predefined regions of interest. If a pixel within the raster-scan image of a dim feature is within the region of interest, the computer will include the pixel in its data collection.
One problem arises when the probe feature that produces a weak signal after hybridization is not accurately located on the array substrate by the manufacturing process. Although it is conventional practice to provide fiduciary markings on the array substrate, for example, to which the manufacturing equipment aligns each manufacturing step, errors in the location of the features still occur. The fiduciary markings are also used during feature extraction. The optical scanning equipment aligns the light source with the array fiduciary markings and the computer aligns its predefined region for detection and analysis with the fiduciary markings on the substrate surface. When a pixel within the raster-scan image of a dim feature is outside the region of interest (i.e., the probe feature is misplaced or mislocated due to the manufacturing process, for example), the computer will not count the pixel. The computer will sum the intensities from all pixels within the region of interest, average the signals by dividing the sum by the total number of pixels involved and report the average signal per pixel within the region of interest. When the computer does not include pixels from the misplaced hybridized feature in the sum, and instead extracts from an area on the substrate with no features or with partial features, inaccurate data will result.
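The effect of a mislocated feature on the reported average can be illustrated with a minimal sketch. It assumes a noiseless synthetic scan in which the feature is a uniform disk, the region of interest is a disk of the same radius, and positions are in pixel units; the function name average_in_roi and all numerical values are hypothetical.

```python
def average_in_roi(feature_center, roi_center, radius, signal, background, size=41):
    """Average signal per pixel inside a circular region of interest (ROI).

    The synthetic scan is `size` x `size` pixels. Pixels inside the feature
    disk carry `signal`; all other pixels carry `background`.
    """
    feature_row, feature_col = feature_center
    roi_row, roi_col = roi_center
    total = 0.0
    count = 0
    for r in range(size):
        for c in range(size):
            if (r - roi_row) ** 2 + (c - roi_col) ** 2 > radius ** 2:
                continue  # pixel lies outside the ROI, so it is not counted
            in_feature = (r - feature_row) ** 2 + (c - feature_col) ** 2 <= radius ** 2
            total += signal if in_feature else background
            count += 1
    return total / count


# Feature where the ROI expects it: the average equals the feature signal.
aligned = average_in_roi((20, 20), (20, 20), 8, signal=30.0, background=20.0)

# Feature displaced by 6 pixels: background pixels dilute the average.
shifted = average_in_roi((20, 26), (20, 20), 8, signal=30.0, background=20.0)
```

For a dim feature whose signal is only slightly above background, as assumed here, the diluted average may be difficult to distinguish from background alone.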
Another problem arises when the probe feature that produces a weak signal after hybridization is misshapen for some reason and the computer cannot detect the irregular shape. The computer will extract information from predefined regions with shapes that are capable of only partially overlapping the actual feature. The common source of misshapen features is the manufacturing process. Common misshapen feature morphologies are annular features and football-shaped features. An annular feature has an intensity profile that looks like a donut rather than a uniform spot. Effectively, there are more nucleic acid probes at the edges than in the center of the feature. A football-shaped feature has an intensity profile shaped like the intersection of two or more overlapping circles. Effectively, there are more oligonucleotide probes on one side of the feature. Other, more complex morphologies, such as crescents and defects due to scratches on the substrate surface, are also observed. The computer is typically programmed to sample a disk-shaped region inside the edges of the feature, because a uniform spot is expected. When either an annular or football-shaped feature is sampled inside the edges, a quantity of potentially valuable information is overlooked, and a quantity of substandard information is included. Both measurement defects result in inaccurate assessment of the degree of hybridization of target to the probe feature.
When in situ synthesis of a 25-nucleotide probe feature is considered, the misshapen feature results from a given feature not being placed once, but instead 25 times, because probe ingredients are spotted onto each feature location 25 times. This process generates a Venn diagram of all of the spots (of all of the ingredients). If the spots are deposited in roughly the same place, the feature is approximately circular. Otherwise, if one or more spots are mispositioned during the synthesis, one could obtain a football-shaped feature, for example. Unless the equipment is preprogrammed to employ an alternative algorithm for extraction of misshapen and mislocated features, the quality of the resulting data suffers. However, all current algorithms fail to properly extract dim features that are misshapen or mispositioned.
The primary difficulty lies in determining, with a level of certainty, the actual position of the probe feature that gives rise to the weak signal, to ensure its detection by the optical scanning equipment. A dim feature that is not located on the substrate consistently within the array pattern may be missed during the feature extraction process if the analysis equipment or the operator does not know the likely locations of inconsistently placed features. Therefore, this limitation in the conventional equipment and method yields less accurate results when analyzing the fluorescence data for the composition of the target sample.
The deposition of probes, or in situ synthesis thereof, on substrates is performed with automated equipment, as described above. Current manufacturing equipment can produce probe features ranging in size from about 20 to 1000 microns in diameter, with the preferred size being 200 microns in diameter or less. The features are positioned on substrates in arrays having a spacing less than or equal to two feature diameters, and preferably about 1.5 to 2 feature diameters center-to-center spacing. For example, if the probe features had an average diameter of 100 microns, the preferred center-to-center distance would be 150-200 microns. However, it is the goal in the industry to make the arrays smaller and more compact, since smaller arrays require less sample (which is usually in short supply), can be scanned more rapidly and are less expensive to manufacture. In addition, as features become smaller and more densely packed, more genes can be analyzed using an array of a given size; this again saves sample and reduces costs. Achieving smaller and more compact arrays will depend heavily on the manufacturing equipment and processing. It should be appreciated that as probe arrays for gene analysis become more densely packed, very small errors in probe placement more severely impact the accuracy of the analysis of the hybridization results.
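The spacing arithmetic above can be worked through in a short sketch. The 1.5 to 2 diameter spacing is taken from the text; the square grid layout, the one square centimeter area, and the function names are assumptions made only for illustration.

```python
def pitch_range_um(diameter_um, min_factor=1.5, max_factor=2.0):
    """Preferred center-to-center spacing range for a given feature diameter."""
    return diameter_um * min_factor, diameter_um * max_factor


def features_per_cm2(pitch_um):
    """Number of features that fit in 1 cm x 1 cm on a square grid (assumed layout)."""
    per_side = int(10_000 // pitch_um)  # 1 cm = 10,000 microns
    return per_side * per_side


tightest, loosest = pitch_range_um(100.0)    # 150.0 and 200.0 microns
density_loose = features_per_cm2(loosest)    # 50 x 50 features
density_tight = features_per_cm2(tightest)   # 66 x 66 features
```

Under these assumptions, halving the feature diameter at the same spacing factor roughly quadruples the number of features per unit area, which is why placement errors of a fixed size become more significant as arrays become denser.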
Any real manufacturing process is subject to both random and systematic errors in the dimensions of the manufactured artifact. The manufacturing processes used in creating arrays of nucleic acid features for gene analysis are no exception and therefore, nucleic acid feature locations are subject to both random and systematic errors. An error in the location of the feature on the array of greater than or equal to ten percent of the diameter could affect the scanned fluorescence data and produce inaccurate results by the conventional equipment and method. Since the regions of interest for probe analysis are predefined to typically exclude the edges of a feature, a location error of less than 10 percent of the diameter of the typically shaped feature should not jeopardize the integrity of the collected data.
However, another manufacturing error that may result includes variations in the diameter of the probe feature. Variations in the diameter of a feature may result from surface chemistry problems on the surface of the substrate, such as changes in hydrophobicity of the surface. A higher than expected surface hydrophobicity will result in the feature having a smaller footprint, since the feature tends to bead up more on the more hydrophobic substrate surface. Therefore, the feature might be located in the correct place, but have only one half to three quarters of the diameter that was expected (i.e., the error is greater than 10 percent of the diameter). When the computer samples the predefined region of interest, it collects non-probe feature data in addition to the feature signal. The feature signal is degraded by the additional data.
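The degradation caused by an undersized feature can be estimated with a simple area ratio. The sketch below assumes a uniform circular feature that is concentric with a circular region of interest; the 80 micron region of interest and the intensity values are hypothetical and chosen only for illustration.

```python
def measured_average(feature_diameter, roi_diameter, signal, background):
    """Mean intensity in a circular ROI concentric with a uniform circular feature."""
    if feature_diameter >= roi_diameter:
        return signal  # the ROI lies entirely inside the feature
    # Fraction of the ROI area that is actually covered by the feature.
    fraction = (feature_diameter / roi_diameter) ** 2
    return fraction * signal + (1 - fraction) * background


# Expected feature diameter 100 microns, ROI 80 microns (assumed to exclude edges).
full_size = measured_average(100.0, 80.0, signal=30.0, background=20.0)
# Feature beads up to half the expected diameter: only about 39% of the ROI is feature.
half_size = measured_average(50.0, 80.0, signal=30.0, background=20.0)
```

Under these assumptions, the half-diameter feature reports an average much closer to background than to the true feature signal, even though the feature is correctly located.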
When an array is subjected to hybridization with a target sample, the hybridized feature will produce either a bright fluorescence intensity (i.e., will display good intensity contrast, relative to a background) or a dim fluorescence intensity (i.e., will display poor intensity contrast, relative to the background) when scanned, typically due to the amount of hybridization that occurred with the target sample. If a poorly positioned or located hybridized feature is bright, it becomes self-locating. As long as the intended feature (and only the intended feature) overlaps a region of uncertainty drawn around the intended feature location, then any algorithm capable of recognizing a connected region of pixels whose intensities exceed some threshold can be used to find the actual feature, calculate its center and move the center of the data extraction region to coincide with the feature center. Such algorithms are well known in the art of image processing. A commercial implementation of such an algorithm can be found in the computer program IMAGEQUANT™ marketed by Molecular Dynamics (Sunnyvale, Calif.).
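A minimal sketch of one such threshold-based approach is given below. It is not the algorithm of any commercial program; it assumes a rectangular region of uncertainty, 4-connected pixels, and an unweighted centroid, and the function name locate_bright_feature is hypothetical.

```python
from collections import deque


def locate_bright_feature(image, threshold, region):
    """Find the center of a bright feature that overlaps a region of uncertainty.

    `image[row][col]` is a pixel intensity. `region` is the inclusive rectangle
    (row_min, row_max, col_min, col_max) drawn around the intended location.
    Returns the (row, col) centroid of the connected above-threshold region,
    or None when no pixel in the region exceeds the threshold (a dim feature).
    """
    row_min, row_max, col_min, col_max = region
    # Seed the search at the brightest above-threshold pixel inside the region.
    seed = None
    for r in range(row_min, row_max + 1):
        for c in range(col_min, col_max + 1):
            value = image[r][c]
            if value > threshold and (seed is None or value > image[seed[0]][seed[1]]):
                seed = (r, c)
    if seed is None:
        return None

    # Breadth-first flood fill over 4-connected pixels that exceed the threshold.
    rows, cols = len(image), len(image[0])
    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and image[nr][nc] > threshold):
                seen.add((nr, nc))
                queue.append((nr, nc))

    # The unweighted centroid becomes the new center of the extraction region.
    n = len(seen)
    return (sum(r for r, _ in seen) / n, sum(c for _, c in seen) / n)
```

The flood fill is allowed to leave the region of uncertainty so that the whole feature contributes to the centroid. When no pixel exceeds the threshold, the function returns None, which corresponds to the dim feature case discussed next.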
Unfortunately, dim features are not self-locating and therefore, the location of probe features that are inaccurately placed on a substrate must be determined in order to obtain accurate data during feature extraction of dim features. The problem of locating inaccurately placed probe features that result in weak signals after hybridization becomes particularly difficult as feature size decreases, because the relative importance of location errors increases at the same time that the total number of pixels in the digital array image that contain relevant data is decreasing. The physical laws governing the behavior of light limit the minimum size of a pixel in a raster-scan image obtained via laser-excited fluorescence. This minimum dimension ranges from about 3 microns (blue excitation light) to about 5 microns (red excitation light). Thus, a 100 micron diameter feature scanned by a red excitation laser is spread across approximately 310 pixels; a 50 micron diameter feature is spread across approximately 78 pixels, and a 25 micron diameter feature occupies only 19 pixels. For a 100 micron diameter feature, a 5 micron location error will subtract about 40 signal-containing pixels from the extraction process, and replace them with background pixels (approximately 13% error). The same 5 micron location error will introduce approximately 50% error into the extraction of a 25 micron diameter feature.
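The pixel counts and error percentages quoted above can be approximated with simple geometry. The sketch below assumes square 5 micron pixels, a circular feature, and an extraction region of the same size as the feature. The quoted figures of about 40 pixels, 13% and 50% are reproduced when the error is counted as signal pixels lost plus background pixels gained, which is an assumption about how those figures were obtained. All function names are hypothetical.

```python
import math


def pixels_in_feature(diameter_um, pixel_um):
    """Approximate number of square pixels covering a circular feature."""
    return math.pi * (diameter_um / 2) ** 2 / pixel_um ** 2


def overlap_area(radius, offset):
    """Area common to two equal circles whose centers are `offset` apart."""
    if offset >= 2 * radius:
        return 0.0
    return (2 * radius ** 2 * math.acos(offset / (2 * radius))
            - (offset / 2) * math.sqrt(4 * radius ** 2 - offset ** 2))


def misregistration(diameter_um, offset_um, pixel_um):
    """Return (signal pixels lost, pixels affected, affected fraction of the feature)."""
    radius = diameter_um / 2
    lost_pixels = (math.pi * radius ** 2 - overlap_area(radius, offset_um)) / pixel_um ** 2
    # Each lost signal pixel is replaced by one background pixel in the region.
    affected = 2 * lost_pixels
    return lost_pixels, affected, affected / pixels_in_feature(diameter_um, pixel_um)


counts = [pixels_in_feature(d, 5.0) for d in (100.0, 50.0, 25.0)]
large = misregistration(100.0, 5.0, 5.0)
small = misregistration(25.0, 5.0, 5.0)
```

Under these assumptions, about 20 signal pixels leave the extraction region of a 100 micron feature and about 20 background pixels enter it, for roughly 40 affected pixels in total.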
The conventional equipment and method of feature extraction rely on the locations of the probe features being those expected or intended from the manufacturing processes to analyze and identify the locations of hybridization. The uncertainty in the actual position of each feature in the array can compromise the detection of probe-bound hybridization targets, particularly when the hybridization signal density from a feature is weak ("dim feature").
For instance, if the conventional detection equipment is directed onto a spot that spans the margin of a dim feature and includes some substrate region that has neither probes nor hybridized targets bound to it, then the total signal that reaches the detector may be too weak to yield a positive reading, whereas the signal from a spot fully within the feature boundary could yield a positive reading, so that the existence of the dim hybridized feature would be detected. It is common for the surface of the substrate to produce optical background noise (i.e., undesired signal) when the array is optically scanned to identify hybridized features. If the signal from the hybridized feature is weak, it may be difficult to distinguish the dim feature from the background noise. When the background signal noise from the surface of the substrate is stronger than the signal from the dim feature, the feature is described as a "negative feature". Most arrays contain at least a few negative features; these further add to the difficulty of locating dim features. In addition, "false negative" results are possible when the dim feature is mislocated on the substrate. A false negative reading is caused by missing a signal from a dim feature that was above the detection threshold because the equipment extracted mostly the substrate region or background and not the feature.
Therefore, either the array manufacturing process must be improved to the point that location and other manufacturing errors are negligible, or other methods must be used to locate dim features, or both.
Methods to generally locate features on a substrate are disclosed in U.S. Pat. No. 5,721,435, issued to Troll and assigned to the assignee of the present invention, which is incorporated herein by this reference. The methods of Troll include a plurality of reference markings and test spots on an array, all of which produce signals when optically scanned that are detected and evaluated to determine the location of the test spots. The reference markings have optically unique signatures to distinguish them from the signals from the test spots. The reference markings are spaced apart at known distances and serve to provide a constant calibration for the scanning equipment. The reference markings are typically laser-etched or metal-plated alignment marks that are written to the substrate surface. This method of feature location is commonly referred to as "dead-reckoning" from a mixture of design parameters and physical landmarks.
Another method to generally locate features that can be used to locate dim features is user-assisted feature extraction ("by hand"). Although these methods work well to generally locate features on a substrate, without further intervention, they are not much better at locating dim features that are mislocated (i.e., not properly placed) on the substrate by the manufacturing equipment. Dead reckoning is degraded by both uncompensated systematic location errors and random location errors. Finally, user-assisted extraction is, by definition, subjective and not automated; it is also slow, tedious and subject to errors caused by user fatigue.
Thus, it would be advantageous to have an apparatus, system and method to accurately locate probe features bound to a substrate regardless of whether the features produce a dim or bright fluorescence when hybridized with a target and regardless of the accuracy of the manufacturing equipment and processes, preferably utilizing the features of conventional scanning and analysis equipment.