The present invention relates generally to processing techniques for data from biological microarrays. More particularly, the invention provides one or more methods for identifying areas on images of a mechanically printed or otherwise generated array of spots which contain data, and those which are background areas. Merely by way of example, the invention is applied to processing images of DNA microarrays to distinguish and demarcate the data-containing areas known as spots from the background areas. But it would be recognized that the invention has a much broader range of applicability, including analysis of images of arrays of any kind in which data-containing areas lie in an approximately regular pattern on a background.
Microarray technology has been developed over the past few years to allow scientists to test large numbers of genes, proteins, or other molecules for hybridization or binding to molecules in a test mixture (see, for example, DeRisi et al., “Use of a cDNA microarray to analyse gene expression patterns in human cancer,” Nature Genetics 14:457–460 [1996]; Schena et al., “Parallel Human Genome Analysis: Microarray-Based Expression Monitoring of 1000 Genes,” Proc. Natl. Acad. Sci. USA, 93:10614–10619 [1996]; Schena et al., “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science 270:467–470 [1995]). As large numbers of genes have been identified by the sequencing of the genomes of various organisms, scientists have gained the ability to assess the total range of gene expression in cells of different types or at different stages of differentiation or cell cycle. By comparing gene expression in closely related cells which exhibit distinct phenotypes, for example malignant and nonmalignant cells from the same tissue origin, correlations may be drawn between gene expression and phenotype. Understanding these relationships may lead to identification of new drug targets or other therapeutic interventions for a variety of diseases.
A type of microarray experimental design uses spots of DNA immobilized on glass slides, or “chips.” These spotted DNA microarrays have been developed as a way of testing for expression or presence of large numbers, often tens of thousands, of DNA or RNA sequences using various types of hybridization experiment. To make a spotted DNA microarray, preparations of complementary DNA are made from the messenger RNA or genomic DNA which represent the sequences of many of the genes (or a defined subset thereof) found in the organism of interest. These preparations are held in plates which are divided into many (usually 96 or 384) wells, each of which may hold from 50 μl to 200 μl of DNA in solution. Wells are often identified by position on the plate, and the identity of the DNA sequence in each well is recorded in a list. Droplets of each of the DNA preparations are typically deposited onto transparent slides in regular arrays, often by dipping pins or capillaries into the DNA solution and touching the pins to the surface-treated slide. The DNA is thus fixed in small (50–250 μm), approximately circular areas on the surface of the slide, in a known arrangement such that the identity of each spot can be determined by correlation with the list of sequence names.
Arrays prepared in this manner can be used as a substrate for hybridization experiments to test for the presence or absence of complementary DNA or RNA sequences in a test mixture. A useful application of this type of microarray is to interrogate gene expression in cells, usually comparing the RNA isolated from two or more different sources, for example, a test and a control. A total population of messenger RNA molecules from each origin is labeled with a distinct fluorescent tag, with or without a reverse transcription and amplification step. The labeled nucleic acid preparations are often allowed to hybridize to the immobilized spots of DNA on the chip; specific hybridization occurs where the DNA contains sequences that are homologous to sequences of labeled test nucleic acid. After washing away unbound material, the fluorescent signals from the tags can be detected by a scanning fluorimeter. This instrument generates a data file that contains numeric information about the intensity of emitted light at the wavelength that is generated by the fluorescent dye used to tag the test nucleic acid. Data are collected for each pixel of the scanned area, with number of pixels determined by the size of the chip and the resolution of the scanner. The file containing the data can be converted into an image, which often represents the pattern of emitted fluorescence on the microarray chip.
In order to use information from these spotted DNA microarray images, considerable data processing is generally required. A step in the process is to determine which pixels in the image represent areas where DNA spots were deposited, and which pixels contain only background fluorescence, so that fluorescence intensity data may be extracted from each data-containing site. It is generally desirable to identify the location of each and every DNA spot on the image, regardless of whether a positive fluorescent signal is present at that spot, so that fluorescence signals can be correlated with the identity of the gene represented by each spot. To do this, analysis software ordinarily creates a grid, or a layout of areas each of which encloses the pixels that represent DNA spots. Such a grid is commonly defined as a perfect mathematical grid. Fluorescence intensity of pixels within each spot, adjusted for the intensity of background pixels in the immediate neighborhood, is the first piece of information needed to interpret results. Thus, locating the grid of spots accurately is generally important to obtaining legitimate data from an array. Conventional analysis programs construct simple rectilinear grids based on information provided by the user, such as number of blocks, number of rows and columns of spots, spot size, column and row spacing, and degrees of rotation of each block. Often, this information is difficult to obtain or is slightly (or considerably) inaccurate, and may even be variable within a single microarray image. Grids made in this way are created as perfectly spaced and perfectly aligned circles, without reference to the actual image being analyzed, which makes them limiting. Other limitations also exist.
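By way of illustration only, the conventional grid construction described above can be sketched in a few lines of Python. The function name `rectilinear_grid` and its parameters are hypothetical and are not taken from any particular analysis program; the sketch assumes a single block whose spot centers are computed purely from user-supplied row and column counts, spacings, an origin, and a rotation angle.

```python
import math


def rectilinear_grid(n_rows, n_cols, row_spacing, col_spacing,
                     origin=(0.0, 0.0), rotation_deg=0.0):
    """Return ideal spot centers (x, y) for one block, row by row.

    All inputs are user-supplied (hypothetical parameter names); the
    scanned image itself is never consulted.
    """
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    x0, y0 = origin
    centers = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Unrotated offset of this spot from the block origin.
            dx, dy = c * col_spacing, r * row_spacing
            # Rotate the offset about the origin, then translate.
            centers.append((x0 + dx * cos_t - dy * sin_t,
                            y0 + dx * sin_t + dy * cos_t))
    return centers
```

A call such as `rectilinear_grid(12, 12, 20.0, 20.0, rotation_deg=1.5)` yields 144 perfectly spaced centers. None of them is informed by the actual image, so any error in the supplied spacing or rotation is carried into every spot position, which is the limitation noted above.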
The manufacture of spotted arrays, however, does not produce perfect rectilinear arrays. A commonly used method for array preparation makes use of a pin-type robotic printer, see for example U.S. Pat. No. 6,101,946, in the name of Martinsky, assigned to TeleChem International Inc. (Sunnyvale, Calif.), and Eisen and Brown, Meth. Enzymol. 303:179–205 (1999), or http://cmgm.stanford.edu/pbrown/mguide/index.html, or a capillary-type robotic printer, see for example U.S. Pat. No. 5,807,522, in the name of Brown, P. O., assigned to The Board of Trustees of the Leland Stanford Junior University (Stanford, Calif.), and U.S. Pat. No. 6,110,426, in the name of Shalon, assigned to The Board of Trustees of the Leland Stanford Junior University (Stanford, Calif.). This type of instrument generally has a group of metal pins or capillaries, numbering from one to 48, which dip into DNA preparations contained in multiwell plates. The pins then move rapidly over an arrangement of many (often 100 to 200) glass slides, laid out on a flat table, touching each slide to deposit a droplet of DNA. The pattern of DNA spots that is created in this way approximates a regular, rectilinear grid. However, irregularities in the grid of DNA spots occur due to many factors, including variations in the size and thickness of the glass slides, minute differences in the distances between pins or capillaries in the set, minute differences in the bore size among the set of capillaries or diameter among the set of pins, variable precision of the distances traveled by the set of pins or capillaries when moving from one slide to another, and inconsistency of DNA concentration and thus viscosity among the thousands of DNA samples. Further irregularity in the arrangement of spots on the array can occur when the robot stops periodically during the printing run for replacement of the multiwell DNA plate or for other types of instrument maintenance. 
When multiplied over tens of thousands of spots, even minuscule variation results in DNA arrays which are not perfectly rectilinear. Even the best spotted arrays contain irregularities that make creation of the data analysis grid a challenge. Furthermore, within the area of a given spot, there are often irregularities in the fluorescence intensity of pixels, leading to variation in the shape of spots on the array. For this reason, and the fact that not every target site gives a positive signal, a simple blob analysis does not identify the locations of every spot. Further deviation from a perfect grid can arise during processing, for example when sections of a microarray become altered during the hybridization procedure. The coating of poly-L-lysine or other chemical, which allows the spotted DNA to adhere well to the surface of the glass slide, can lift and shift its position, leading to changes in the arrangement of rows and columns of DNA spots. These spots may still provide valid hybridization information, but because spot positions are not aligned with a simple grid, such data are often lost.
As noted above, conventional software programs commonly begin data analysis by mathematically generating a perfect rectilinear grid from user-provided information. The grid is then overlaid onto an image of the fluorescence scan of the chip. For nearly every microarray slide, a great deal of manual adjustment of these grids is required to align each and every spot on the grid accurately with the fluorescent spots on the chip image. This adjustment process, which must be done before any analysis can be carried out, is quite time-consuming and cumbersome, and often leads to inaccuracy, as the user tires of the tedious process of adjusting the positions of hundreds or thousands of individual spots on a grid. Automatically creating a grid based on the information contained in the fluorescence scan of the chip would improve the accuracy and efficiency of data analysis of microarray hybridization experiments.
Other techniques of data analysis have been developed to reduce the amount of manual adjustment of the grids to make each and every spot on the grid correlate with the fluorescent spots, although such techniques still require some manual adjustment. As merely an example, such techniques include those provided by ArrayVision, Imaging Research Inc., of 500 Glenridge Ave., St. Catharines, Ontario, Canada L2S 3A1, in which a rectilinear template is generally created and analyzed for signal intensity to determine whether a spot is present which is a likely fit to each element within the template. Statistical analysis and confidence weighting are used to help align spots. This technique, however, often leads to large misinterpretation of images of arrays containing areas of weak or absent signals, misalignments, or high background fluorescence. The Institute for Genomic Research (TIGR), of 9172 Medical Center Drive, Rockville, Md. 20850 USA, provides a program known as SpotFinder which uses a similar adaptive thresholding method to locate spots within a simple rectilinear grid. With bright spots, large features and widely separated spots, this method works fairly well, but spots on images of arrays or parts of arrays with irregular features are not accurately located and require considerable user manipulation. Another method of target site identification, using a statistical test to analyze ratiometric data in spotted arrays, has been reported [U.S. Pat. No. 6,245,517, in the name of Chen, Y., et al., and assigned to The United States of America as represented by the Department of Health and Human Services (Washington, D.C.)]. This method begins with a “target mask” inferred from landmark signals placed into the potential target area, and thus it suffers from the same limitations as described for other methods which use simple mathematical grids and statistical methods. Array Pro™ Analyzer by Media Cybernetics, Inc. of 8484 Georgia Avenue, Suite 200, Silver Spring, Md. 20910 U.S.A., utilizes a method which looks for inherent periodicity in the clustering of spots into “grids and subgrids,” using a Fourier analysis to calculate the angle of skewed blocks and the distance between spots in the image. This method makes the task of locating spots easier by suggesting row and column spacing values, but individual spots are frequently misaligned and low-signal spots are often missed altogether. The method can be made to work by expert users but is cumbersome and often unsuccessful when operated by less highly trained personnel. In all of these programs, default values may be used to carry out data analysis “automatically,” but in reality, accuracy is poor if the many specific parameters of each array are not accurately entered. Unfortunately, in practice, all of such techniques usually require considerable manual inspection and adjustment in order to align each and every spot on the demarcation grid with the actual spots on the image.
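As a generic illustration of how a Fourier analysis can suggest a spot spacing, the following Python sketch estimates the dominant period of a one-dimensional intensity profile, such as the column sums of an image block. It is not the implementation used by any of the programs named above; the function name `dominant_period` is hypothetical, and the sketch assumes a profile that spans a whole number of periods and uses a plain discrete Fourier transform from the standard library.

```python
import cmath
import math


def dominant_period(profile):
    """Estimate the dominant period (in samples) of a 1-D intensity profile."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]  # remove the constant (DC) term
    best_k, best_power = 0, 0.0
    # Scan the positive frequencies and keep the strongest one.
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k == 0:
        raise ValueError("profile has no periodic component")
    # Frequency index k corresponds to a period of n / k samples.
    return n / best_k
```

An estimate of this kind yields only an average spacing for the block. It says nothing about the displacement of any individual spot, which is consistent with the misalignment of individual spots noted above.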
From the above, one understands that it is often desired to have a technique for improved analysis of data contained in images of biological microarrays or the like.