(1) Technical Field
The present invention relates to image processing of biomaterial information, and more particularly to tools for processing the information contained in microarrays.
(2) Discussion
The bioinformatics field, which in a broad sense includes any use of computers to solve information problems in the life sciences and, more particularly, the creation and use of extensive electronic databases on genomes, proteomes, and the like, is currently in a stage of rapid growth. In particular, much of the analysis of proteomic and genomic information is performed through the use of microarrays. Microarrays provide a means for simultaneously performing thousands of experiments, with multiple microarray tests resulting in many millions of data samples.
DNA is a primary example of the substances that are analyzed through the use of microarrays. However, many other types of biological chemicals, such as proteins, can also be analyzed using this technique. DNA microarray analysis has become an important source of information for geneticists, permitting the simultaneous monitoring of thousands of genes. Modern microarrays contain tens of thousands of genes spotted on them. Once such a large volume of information is extracted from a microarray image, a wide variety of statistical techniques may be applied to make various decisions regarding the gene characteristics.
The data mining procedure typically performed on a microarray slide includes two main steps: image analysis and statistical data processing. As any statistical processing procedure may be influenced by the quality of its input, the statistical data processing step relies heavily on the image analysis step. The image analysis step typically comprises three stages: grid finding and spot location adjustment; spot region segmentation; and measurement extraction.
Grid finding is performed to locate the periodic grids of (usually circular) spots printed on a slide. The approximate grid structure is usually known in advance, and grid finding may be performed by a variety of well-known and effective searching procedures. Each image typically contains several subgrids that are also placed periodically with respect to each other. Deviations of subgrids and of individual spots (data points) from their expected positions on the slide can occur due to technical imperfections of the printing process. Spot location and size adjustment techniques are used to compensate for such deviations.
After each spot is located on the image, the region around its center is ideally segmented into signal pixels, background pixels, and ignored pixels. There are several techniques by which images may be segmented, ranging from purely spatial to purely intensity-based. Spatial schemes usually place a circular mask for the signal at the spot's center location, assigning a “signal” label to every pixel within the circle. Intensity-based schemes analyze the intensity distribution around the spot location, attempting to extract the signal distribution from the snip.
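The purely spatial scheme described above can be illustrated with a minimal sketch. The function name, the ring of ignored pixels around the signal circle, and the `ignore_margin` parameter are assumptions introduced only for illustration; the text above does not specify how ignored pixels are chosen.

```python
import math


def segment_spot_circular(height, width, center_row, center_col,
                          signal_radius, ignore_margin=1.0):
    """Label every pixel of a rectangular snip around one spot.

    Pixels within `signal_radius` of the spot center are labeled "signal".
    Pixels in a thin ring just outside the circle are labeled "ignored"
    (an assumed convention, not taken from the text above). All remaining
    pixels are labeled "background".
    """
    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            # Euclidean distance from this pixel to the spot center.
            dist = math.hypot(r - center_row, c - center_col)
            if dist <= signal_radius:
                row.append("signal")
            elif dist <= signal_radius + ignore_margin:
                row.append("ignored")
            else:
                row.append("background")
        labels.append(row)
    return labels


# Example: a 7x7 snip with the spot centered at pixel (3, 3).
labels = segment_spot_circular(7, 7, 3, 3, signal_radius=2.0)
signal_count = sum(row.count("signal") for row in labels)
```

A purely spatial mask of this kind never looks at pixel intensities, which is what distinguishes it from the intensity-based schemes mentioned above.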
After the segmentation procedure is complete, the mean expression of the signal and the background may be measured along with their variances and other spatial and distributional quantities. To assess the quality of these measurements, a variety of approaches may be found in the literature, several of which are listed below for further reference. Generally, low measured expression quality is rooted in the aforementioned three stages of the image analysis step, as well as in measurement contamination and misprints on the slide.
As mentioned, there are currently several general approaches to expression quality measurement. Two principally different groups of methods may be found in the literature: replicate-based quality assessment and image-based quality assessment. With regard to replicate-based quality assessment, spot replicates are considered to be a valuable source of information, for example, for significance analysis of differentially expressed genes, among other uses. However, before performing any kind of analysis, it is useful to analyze the distribution of replicate expressions and to remove the outliers, which usually appear due to defects in printing, scanning, or measurement extraction procedures. Techniques of varying complexity are currently available. However, the main drawback of this type of quality assessment is the need for a relatively large number of replicates. In order to generate nearly flawless replicate measurements, a complicated design of experiments would be required to prevent the appearance of slide defects common to all replicates of an individual gene (sample).
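As one illustration of removing outliers from replicate expressions, the following sketch flags replicates using a robust score based on the median and the median absolute deviation (MAD). This particular rule, the function name, and the threshold of 3.5 are assumptions chosen for illustration; the literature cited here describes techniques of varying complexity, and this is not presented as any one of them.

```python
import statistics


def remove_replicate_outliers(values, threshold=3.5):
    """Split replicate expression values into (kept, outliers).

    Each value is scored by its distance from the median, scaled by the
    median absolute deviation (MAD). The factor 0.6745 makes the score
    comparable to a z-score for normally distributed data.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to judge against; keep every replicate.
        return list(values), []
    kept, outliers = [], []
    for v in values:
        score = 0.6745 * abs(v - med) / mad
        if score <= threshold:
            kept.append(v)
        else:
            outliers.append(v)
    return kept, outliers


# Five replicates of one gene; the last one is far from the others.
kept, outliers = remove_replicate_outliers([10.0, 10.5, 9.8, 10.2, 30.0])
```

With only a handful of replicates, the median and MAD are themselves poorly estimated, which reflects the drawback noted above: replicate-based assessment needs a relatively large number of replicates to be reliable.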
On the other hand, with regard to direct image-based quality assessment, different quality measures may be used, with the choice depending mainly on the microarray design, the equipment sophistication, and the measurement extraction procedures. The most widely used set of measures includes the ratio of the signal standard deviation within the spot to its mean expression; the offset of a spot from its expected position in the grid; and measures of spot circularity (e.g., the ratio of squared perimeter to spot area).
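The three widely used measures listed above can be written down directly. The sketch below is a minimal illustration; the function names are hypothetical, and the use of the population standard deviation and of Euclidean distance for the offset are assumptions, since the text does not fix those details.

```python
import math
import statistics


def signal_variation_ratio(signal_pixels):
    """Ratio of the signal standard deviation within the spot to its mean."""
    mean = statistics.fmean(signal_pixels)
    if mean == 0:
        return math.inf  # the ratio is undefined for a zero mean
    # Population standard deviation is an assumed choice.
    return statistics.pstdev(signal_pixels) / mean


def position_offset(found_center, expected_center):
    """Euclidean offset of a spot from its expected grid position."""
    return math.hypot(found_center[0] - expected_center[0],
                      found_center[1] - expected_center[1])


def circularity_ratio(perimeter, area):
    """Ratio of squared perimeter to spot area.

    The ratio equals 4*pi for a perfect circle and grows as the
    spot outline becomes less circular.
    """
    return perimeter ** 2 / area


# Example values for a single spot.
variation = signal_variation_ratio([2, 4, 4, 4, 5, 5, 7, 9])
offset = position_offset((3.0, 4.0), (0.0, 0.0))
circularity = circularity_ratio(2 * math.pi * 3.0, math.pi * 3.0 ** 2)
```

Note that the three values live on different scales: the variation ratio and the offset are non-negative with no upper limit, while the circularity ratio is never smaller than 4π.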
These measures are computed independently, and are either used independently of one another or combined using basic logical operations, such as AND, OR, etc. Although these quality measures and their uses help in making decisions regarding the spot, the values are currently not kept within specific bounds, which prevents them from being used together in a synergistic manner. It is therefore desirable to provide a set of quality measurements that are bounded to a predetermined value range so that they are mutually compatible. It is further desirable to provide a system that uses a wider variety of quality measures, and it is preferable that the system combine the various measures into an overall confidence measure for the data. By doing so, not only would a broader set of measures provide a more complete quality assessment, but combining the set of measures in a meaningful way would provide a more robust and flexible way of handling the issue of spot quality.
(1) Mei-Ling Ting Lee, F. C. Kuo, G. A. Whitmore, Jeffrey Sklar, “Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations”, Proceedings of the National Academy of Sciences, August 2000, vol. 97, no. 18.
(2) Yidong Chen, E. R. Dougherty, M. L. Bittner, “Ratio-based decisions and the quantitative analysis of cDNA microarray images”, Journal of Biomedical Optics, October 1997, no. 2(4).
(3) Yee Hwa Yang, M. J. Buckley, Sandrine Dudoit, T. P. Speed, “Comparison of methods for image analysis on cDNA microarray data”, Technical report #584, 2000, Department of Statistics, University of California, Berkeley.
(4) R. Adams, L. Bischof, “Seeded region growing”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994, no. 16.
(5) I. H. Witten, E. Frank, “Data Mining. Practical machine learning tools and techniques with Java implementations”, Morgan Kaufmann Publishers, 2000.