The field of the present invention relates to the image processing of gene expression microarrays. In particular, the invention relates to automatically processing the information contained in a gene expression microarray.
A cell relies on proteins for a variety of its functions. Producing energy, biosynthesizing all component macromolecules, maintaining cellular architecture, and acting upon intracellular and extracellular stimuli are all protein-dependent activities. Almost every cell within an organism contains the information necessary to produce the entire repertoire of proteins that the organism can specify. This information is stored as genes within the organism's DNA genome. Different organisms have different numbers of genes to define them. The number of human genes, for example, is estimated to be between 30,000 and 100,000.
Only a portion of the genome is composed of genes, and the set of genes expressed as proteins varies between cell types. Some of the proteins present in a single cell are likely to be present in all cells because they serve functions required in every type of cell. These proteins can be thought of as “housekeeping” proteins. Other proteins serve specialized functions that are only required in particular cell types. Such proteins are generally produced only in limited types of cells. Given that a large part of a cell's specific functionality is determined by the genes that it is expressing, it is logical that transcription, the first step in the process of converting the genetic information stored in an organism's genome into protein, would be highly regulated by the control network that coordinates and directs cellular activity.
The regulation of transcription is readily observed in studies that scrutinize activities evident in cells configuring themselves for a particular function (specialization into a muscle cell) or state (active multiplication or quiescence). As cells alter their state, coordinate transcription of the protein sets required for the change of state can be observed. As a window both on cell status and on the system controlling the cell, detailed, global knowledge of the transcriptional state could provide a broad spectrum of information useful to biologists. For instance, knowledge of when and in what types of cell the protein product of a gene of unknown function is expressed would provide useful clues as to the likely function of that gene. Furthermore, determining gene expression patterns in normal cells could provide detailed knowledge of how the control system achieves the highly coordinated activation and deactivation required to develop and differentiate a single fertilized egg into a mature organism. Also, comparing gene expression patterns in normal and pathological cells could provide useful diagnostic “fingerprints” and help identify aberrant functions that would be reasonable targets for therapeutic intervention.
The ability to perform studies that determine the transcriptional state of a large number of genes has, however, until recently, been severely inhibited by limitations on the ability to survey cells for the presence and abundance of a large number of gene transcripts in a single experiment. A primary limitation has been the small number of identified genes. In humans, only a few thousand of the complete set have been physically purified and characterized to any extent. Another significant limitation has been the cumbersome nature of transcription analysis. Even a large experiment on human cells can track expression of only a dozen genes, clearly an inadequate sampling to make any meaningful inferences about so complex a control system.
Two recent technological advances have provided the means to overcome some of these limitations in examining the patterns and relationships in gene transcription. The cloning of molecules derived from mRNA transcripts in particular tissues, followed by the application of high-throughput sequencing to the DNA ends of the members of these libraries, has yielded a catalog of expressed sequence tags (ESTs). M. S. Boguski and G. D. Schuler, “Establishing a Human Transcript Map,” Nature Genetics 10(4), 369–371 (1995). These signature sequences provide unambiguous identifiers for a large cohort of genes. At present, approximately 40,000 human genes have been “tagged” by this route, and many have been mapped to their genomic location. G. D. Schuler, M. S. Boguski, et al., “A Gene Map of the Human Genome,” Science 274(5287), 540–546 (1996).
In addition, the clones from which these sequences were derived provide analytical reagents that can be used in the quantitation of transcripts from biological samples. Specifically, the nucleic acid polymers, DNA and RNA, are biologically synthesized in a copying reaction in which one polymer serves as a template for the synthesis of an opposing strand, which is termed its complement. Even after separation from each other, these strands can be induced to pair quite specifically with each other to form a very tight molecular complex in a process called hybridization. This specific binding is the basis of most analytical procedures for quantitating the presence of a particular species of nucleic acid, such as the mRNA specifying a particular protein gene product.
Furthermore, the recent development of microarray technology, a hybridization-based process, has begun to enable the simultaneous quantitation of many nucleic acid species, even genome-wide quantitation. M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative Monitoring of Gene Expression Patterns With a Complementary DNA Microarray,” Science 270(5235), 467–470 (1995); J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent, “Use of a cDNA Microarray to Analyze Gene Expression Patterns in Human Cancer,” Nature Genetics 14(4), 457–460 (1996); M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, “Parallel Human Genome Analysis: Microarray-based Expression Monitoring of 1000 Genes,” Proc. Nat. Acad. Sci. U.S.A. 93(20), 10614–10619 (1996). For mRNA expression studies, the goal is to develop microarrays that contain every gene in a genome against which mRNA expression levels can be quantitatively assessed. This technology combines robotic placement (spotting) of small amounts of individual, pure nucleic acid species on a glass slide, hybridization to this array with multiple fluorescently labeled nucleic acids, and, traditionally, detection and quantitation of the resulting fluor-tagged hybrids with a scanning confocal fluorescent microscope. When the technology is used to detect transcripts, a particular RNA transcript (an mRNA) is copied into DNA (a cDNA) and this copied form of the transcript is immobilized on a glass slide. The entire complement of transcript mRNAs present in a particular cell type is extracted from cells, and a fluor-tagged cDNA representation of the extracted mRNAs is then made in vitro by an enzymatic reaction termed reverse transcription. Fluor-tagged representations of mRNA from several cell types, each tagged with a fluor emitting a different color of light, are hybridized to the array of cDNAs, and the fluorescence at the site of each immobilized cDNA is then quantitated.
The various characteristics of this analytic method make it particularly useful for directly comparing the abundance of mRNAs present in two cell types. An example of such a system is presented in FIG. 1. In this experiment, an array of cDNAs was hybridized with a green fluor-tagged collection of mRNAs extracted from a tumorigenic melanoma cell line (UACC-903) and a red fluor-tagged collection of mRNAs extracted from a nontumorigenic derivative of the original cell line (UACC-903 +6). J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent, “Use of a cDNA Microarray to Analyze Gene Expression Patterns in Human Cancer,” Nature Genetics 14(4), 457–460 (1996). Monochrome images of the fluorescent intensity observed for each of the fluors are then combined by placing each image in the appropriate color channel of a red-green-blue (RGB) image. Intense red fluorescence at a spot indicates a high level of expression of that gene in the nontumorigenic cell line, with little expression of the same gene in the tumorigenic parent. Conversely, intense green fluorescence at a spot indicates high expression of that gene in the tumorigenic line, with little expression in the nontumorigenic daughter line. When both cell lines express a gene at similar levels, the observed array spot is yellow.
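The channel combination described above can be illustrated with a minimal sketch. It assumes each monochrome image is a list of rows of 8-bit intensities; the function name compose_rgb and the sample values are hypothetical and are not taken from the cited experiment.

```python
def compose_rgb(red, green):
    """Place two same-sized monochrome images in the red and green
    channels of an RGB image; the blue channel is left empty (0)."""
    same_shape = len(red) == len(green) and all(
        len(r_row) == len(g_row) for r_row, g_row in zip(red, green)
    )
    if not same_shape:
        raise ValueError("channel images must have the same shape")
    return [
        [(r, g, 0) for r, g in zip(r_row, g_row)]
        for r_row, g_row in zip(red, green)
    ]


# Hypothetical 2x2 intensity images, one pixel per spot for brevity.
red = [[200, 10], [180, 0]]
green = [[190, 220], [5, 0]]
rgb = compose_rgb(red, green)
# rgb[0][0] has similar red and green values, so it renders yellow;
# rgb[0][1] is predominantly green; rgb[1][0] is predominantly red.
```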
Visual inspection of the results with, for example, a scanning microscope, is adequate to analyze genes where there is a very large differential rate of expression. A more thorough study of the changes in expression requires the ability to discern more subtle changes in expression level and to determine whether observed differences are the result of random variation or whether they are characteristic of the gene being expressed. For this level of analysis, a visual inspection-based methodology is generally inadequate.
Moreover, advances in microarray technology have made using a visual inspection-based methodology even more impractical. Microarray generation systems are in place to produce over 10,000 spots on a single microscope slide. A hybridization experiment using one such slide yields an expression profile of thousands of genes. Thus, these systems produce massive amounts of information. The massive output of data makes possible high-throughput gene expression analysis at an acceptable cost and enables a more efficient study of the interaction and interrelationships of thousands of genes. If the information can be efficiently processed and analyzed, the results can potentially yield a complete understanding of the genomic functions in biological systems. Using visual inspection to quantitate the expression levels, however, is far too cumbersome, time-consuming and imprecise to effectively analyze these data-rich slides. Thus, along with the opportunities created by the rapid advancement in microarray generation technology, a management of information or “informatics” problem has arisen.
The application of digital image processing technology has largely been adopted as the avenue for solving the informatics problem. Using digital image processing, images of the microarray slides are digitally captured and analyzed using a high-speed computer. A typical microarray image depicts bright spots arranged in sets of sub-grids against a dark background. Typically, the sub-grids in a microarray image have the same number of rows and columns of spots. Normally, the sub-grids in a microarray image are arranged as a grid of sub-grids, or “meta-grid.”
Theoretically, processing a microarray image containing a meta-grid of spots is straightforward. First, the individual sub-grids in the meta-grid are detected. Then, for each detected sub-grid, the spots in the sub-grid are detected. Once the spots are located, their intensities reflecting the gene expression levels are measured. Finally, the reliability of the measurements for each spot and each sub-grid is assessed. Under ideal conditions, a microarray image is easily processed. These ideal conditions require that 1) the sub-grids within a meta-grid have the same dimensions, 2) the sub-grids be positioned in a predetermined location within a microarray image, 3) the sub-grids be equally spaced from each other, 4) rows and columns within a sub-grid be equally spaced from each other, 5) the spots be centered on sub-grid-line intersections, 6) the spots be of the same size and shape, 7) the spots have intensities distinguishable from the background, and 8) the slides have no contamination that appears in the microarray images. A simple software program can process a microarray image having the above “ideal” characteristics.
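To illustrate why the ideal case is simple, the following sketch computes every spot center by arithmetic alone when conditions 1) through 5) hold. The function name, the parameter names, and the numeric layout are assumptions chosen for illustration; they do not describe any particular arrayer or scanner.

```python
def ideal_spot_centers(origin, subgrid_shape, subgrid_pitch,
                       spot_shape, spot_pitch):
    """Return {(sub_row, sub_col, row, col): (x, y)} for an ideal meta-grid.

    origin        -- (x, y) of the first spot of the first sub-grid
    subgrid_shape -- (rows, cols) of sub-grids in the meta-grid
    subgrid_pitch -- (dx, dy) distance between adjacent sub-grid origins
    spot_shape    -- (rows, cols) of spots in each sub-grid
    spot_pitch    -- (dx, dy) distance between adjacent spot centers
    """
    x0, y0 = origin
    centers = {}
    for sub_row in range(subgrid_shape[0]):
        for sub_col in range(subgrid_shape[1]):
            for row in range(spot_shape[0]):
                for col in range(spot_shape[1]):
                    x = x0 + sub_col * subgrid_pitch[0] + col * spot_pitch[0]
                    y = y0 + sub_row * subgrid_pitch[1] + row * spot_pitch[1]
                    centers[(sub_row, sub_col, row, col)] = (x, y)
    return centers


# Hypothetical layout: 2x2 sub-grids, each holding 3 rows x 4 columns of spots.
centers = ideal_spot_centers(
    origin=(10, 20),
    subgrid_shape=(2, 2),
    subgrid_pitch=(100, 120),
    spot_shape=(3, 4),
    spot_pitch=(8, 8),
)
```

Every position in such a sketch follows from five fixed numbers, so any misalignment of a pin or offset in scanning invalidates all of the computed centers at once.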
However, because of inherent limitations in the microarray generation hardware and process, the microarray images rarely, if ever, exhibit these conditions. For example, the pins for generating the spots in the array during the spotting process can be misaligned. Also, the spatial mapping between the slides and the scanned images can be offset. The result of these hardware imperfections is that the location of each grid in the microarray can vary from image to image and the spots will not be linearly aligned such that they are centered on grid-line intersections. Furthermore, some spots will appear to be missing from a grid entirely because of gene expression levels that are too low to be measurable.
Besides the positioning inconsistencies, the shapes and sizes of the spots vary significantly. Such variations are again due to limitations in the spotting hardware and process. First, the sizes of the droplets of DNA solution vary, causing the sizes of the spots to vary. Second, the concentrations of DNA and salt in the spotting solution vary over time. Consequently, the shapes of the spots will deviate over time from a circle as the density of DNA varies within a spot. Furthermore, the contact space between the tips of the pins and the slide surface varies, as do the surface properties of the slide. All of the above factors perturb the shapes and sizes of the spots.
Other factors can affect the quality of the microarray image data that is generated. During the spotting process, temperature nonuniformities across a slide or between slides and accidental scrapes by pins can alter the results. Another issue that causes a microarray image to deviate from ideal conditions is contamination of the slide surface. For example, dust landing on the slides during the hybridization process can produce high-intensity pixels in the microarray image. In the slide-drying process, small bumps on the slide surface can appear as peculiar reflections in the microarray image. Another potential source of contamination is accidental splashes and drips of DNA solution from the spotting pins. Thus, any meaningful processing of a microarray image should account for the above factors.
Because of these issues, previous image processing techniques for automatically processing and analyzing microarray images have been impractical. The methods used to automatically extract microarray data through digital image processing are normally classified into two groups: signal detection and signal analysis. Signal detection methods attempt to locate the spots in the microarray images. One of the early image processing-based methodologies used computer-based tools that allowed a user to direct the image processor to spot locations in the microarray images. A user applied a grid frame to an image and then resized the frame to fit the grid of spots in the image. When the spots in the image were not evenly spaced, the user would adjust the grid frame lines to align them with the spots in the image. This method, however, was prohibitively time-consuming and labor-intensive for microarray images, particularly where precise grid alignment was needed before proceeding to a measurement phase for the spot signal.
Another image processing-based signal detection method automatically establishes grid lines after a user has identified the approximate location of a grid of spots in a microarray image. The user, for example, specifies the location of the four corners of the grid in the image. The spot finding method then locates the spots near the calculated grid points. The obvious problem with this method is that human involvement is still required, making the analysis of large microarrays prohibitively expensive.
Thus, a need exists for a system and method of automatically locating sub-grids of gene expression signals in a microarray that account for the inherent inconsistencies and errors in the microarray generation process and that do not necessitate the expense of human involvement.
Once the sub-grids in the microarray are identified, the signal analysis methods take over. In signal analysis, the gene expression spots in each sub-grid are detected and characterized. A number of signal analysis methods have been applied to extract or “segment” the gene expression signals from the spots. In a space-based signal segmentation method, for example, a circle of a predetermined size and having a location based on the most likely position of the spot signal is placed in the image to separate signal pixels from background pixels. Signal measurements are made based on the assumption that signal pixels reside inside the circle, while background pixels reside outside the circle. However, because of the high potential for microarray contamination and spot shape and location irregularity, the space-based signal segmentation method is inadequate.
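A minimal sketch of the space-based approach, under the assumption that the measurement is the mean intensity inside the circle minus the mean intensity outside it, shows how sensitive the result is to spot position. The helper names, the plus-shaped test spot, and the intensity values are hypothetical.

```python
from statistics import mean


def make_spot_image(size, spot_center, spot_value=110, background_value=10):
    """Build a square test image containing one plus-shaped (radius 1) spot."""
    sx, sy = spot_center
    return [
        [
            spot_value if (x - sx) ** 2 + (y - sy) ** 2 <= 1 else background_value
            for x in range(size)
        ]
        for y in range(size)
    ]


def fixed_circle_measure(image, center, radius):
    """Treat every pixel inside the circle as signal and every pixel
    outside it as background; return mean signal minus mean background."""
    cx, cy = center
    signal, background = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (signal if inside else background).append(value)
    return mean(signal) - mean(background)


# The circle is always placed at the expected position (2, 2).
centered = fixed_circle_measure(make_spot_image(5, (2, 2)), (2, 2), 1)
# The same spot printed one pixel to the right of its expected position.
shifted = fixed_circle_measure(make_spot_image(5, (3, 2)), (2, 2), 1)
```

In this toy case a one-pixel displacement of the spot reduces the measured value from 100 to 25, even though the spot itself is unchanged.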
Pure intensity-based signal segmentation methods have also been ineffective at obtaining accurate signal measurements for the gene expression spots. These methods use pixel intensity information to extract the signal pixels. In these methods, it is assumed that the gene expression signal pixels are brighter than the background pixels. While simple and computationally fast, these methods have significant disadvantages. First, low gene expression levels will likely not be adequately characterized because the signal and background pixels cannot be separated based on intensity alone. Second, microarray images with contamination or noise are easily mischaracterized because contamination and noise pixels exhibit intensities as strong as those of true signal pixels and therefore cannot be separated from the signal based on pixel intensity.
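Both disadvantages can be seen in a minimal sketch of intensity-only segmentation with a single global threshold. The function name, the threshold of 50, and the pixel values are assumptions made for illustration.

```python
def threshold_segment(image, threshold):
    """Label a pixel as signal when its intensity exceeds the threshold."""
    return [[value > threshold for value in row] for row in image]


# Hypothetical 4x4 region: a bright spot, one dust pixel, and a weak spot.
image = [
    [10, 10, 10, 255],  # 255 is a dust particle, not a spot
    [10, 90, 95, 10],   # bright spot, upper half
    [10, 92, 88, 10],   # bright spot, lower half
    [14, 15, 10, 10],   # 14 and 15 belong to a weakly expressed spot
]
mask = threshold_segment(image, threshold=50)
signal_pixels = [
    (x, y)
    for y, row in enumerate(mask)
    for x, is_signal in enumerate(row)
    if is_signal
]
# The dust pixel at (3, 0) is labeled as signal, while the weak-spot
# pixels at (0, 3) and (1, 3) are labeled as background.
```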
To enhance segmentation performance, methods that incorporate space and signal intensity information have been developed. In a Journal of Biomedical Optics article dated October 1997 by Yindong Chen et al. entitled “Ratio-Based Decisions and the Quantitative Analysis of cDNA Microarray Images,” a pixel selection method based on the Mann-Whitney test was proposed. In the method, a circle is placed in a target region that includes the region of the spot. Outside the circle, statistical properties of the assumed-to-be-background pixels are calculated. From these calculations, a threshold level is calculated to determine which pixels inside the circle are signal pixels and which are background pixels. A problem with the method occurs when contamination is present inside the circle, because the contamination pixels are likely to be classified as signal pixels. Correspondingly, contamination pixels outside the circle cause the calculated threshold level to be higher than it otherwise would be. The method also performs poorly on spots having weak signals and on microarray images that are noisy. In these situations, the intensity distributions for signal and background overlap. This overlap of intensity distributions inherently limits the performance of threshold-based segmentation.
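The following is a simplified sketch of rank-test-based pixel selection and is not a reproduction of the cited method. It assumes a one-sided Mann-Whitney test using the normal approximation without tie correction, a sliding window of eight sorted pixels, and a significance level of 0.001; these choices, the function names, and the sample intensities are all illustrative assumptions.

```python
import math


def mann_whitney_p(sample, reference):
    """One-sided p-value that `sample` tends to exceed `reference`
    (normal approximation to the U statistic, no tie correction)."""
    n, m = len(sample), len(reference)
    u = sum(
        1.0 if s > r else 0.5 if s == r else 0.0
        for s in sample
        for r in reference
    )
    mean_u = n * m / 2
    sd_u = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2))


def select_signal_threshold(inside, background, window=8, alpha=0.001):
    """Slide a window up the sorted inside-circle pixels until the window
    is significantly brighter than the background sample. Return the
    lowest intensity in that window, or None if no window qualifies."""
    ordered = sorted(inside)
    reference = background[:window]
    for start in range(len(ordered) - window + 1):
        candidate = ordered[start:start + window]
        if mann_whitney_p(candidate, reference) < alpha:
            return candidate[0]
    return None


# Hypothetical pixel intensities.
background = [10, 12, 11, 9, 10, 13, 11, 10]          # outside the circle
strong_spot = [11, 10, 12, 9, 80, 85, 90, 95, 100, 105, 110, 120]
weak_spot = [10, 11, 12, 9, 13, 11, 10, 12, 14]

strong_threshold = select_signal_threshold(strong_spot, background)
weak_threshold = select_signal_threshold(weak_spot, background)
```

In this sketch the weak spot yields no threshold at all, which mirrors the limitation noted above for spots whose intensity distribution overlaps that of the background.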
The method of trimmed measurements is another method that uses both spatial and intensity information to perform segmentation. In this method, a circle is placed around the signal region after the signal detection process. In most cases, some signal pixels will be outside the circle and some background pixels will be inside the circle. The impact of these misplaced pixels on threshold calculations is removed by “trimming off” these pixels from the intensity distributions for signal and background. A significant problem with this method, however, is its reliance on the precision of the location of the center of the circle and the determination of its radius. Small errors in either may result in the loss of significant signal information regarding the spot. Furthermore, the method incorrectly presumes that the spot is always circular. When the spot is not circular, the method again fails to identify significant signal information.
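The trimming step can be sketched as follows, assuming that a fixed fraction of the darkest pixels is discarded from the inside-circle distribution and a fixed fraction of the brightest pixels is discarded from the outside-circle distribution. The trimming fractions, the function name, and the pixel values are hypothetical.

```python
from statistics import mean


def trimmed_mean(values, low_fraction=0.0, high_fraction=0.0):
    """Mean of `values` after discarding the given fractions of the
    lowest and highest entries."""
    ordered = sorted(values)
    low = int(len(ordered) * low_fraction)
    high = len(ordered) - int(len(ordered) * high_fraction)
    kept = ordered[low:high]
    if not kept:
        raise ValueError("trimming removed every value")
    return mean(kept)


# Hypothetical pixels: two background pixels fall inside the circle and
# one signal pixel falls outside it.
inside = [10, 12, 100, 105, 110, 95, 98, 102, 101, 99]
outside = [10, 11, 9, 12, 10, 100, 11, 10, 9, 10]

signal_level = trimmed_mean(inside, low_fraction=0.2)        # drops 10 and 12
background_level = trimmed_mean(outside, high_fraction=0.1)  # drops 100
```

Fixed trimming fractions help only when the circle is nearly correct. If the circle is badly misplaced, more pixels are misassigned than the trim removes, which is the dependence on center and radius noted above.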
A need exists, therefore, for a robust system and method for segmenting and characterizing gene expression spots. Specifically, a need exists for a system and method that discerns contamination regions, tolerates noisy images, detects low-signal spots, and preserves a maximum of signal information.