Microarray biochips are increasingly used to perform large numbers of closely related chemical tests. For example, to ascertain the genetic differences between lung tumors and normal lung tissue, one might deposit small samples of different DNA sequences on a microscope slide and chemically bond them to the glass. Ten thousand or more such samples can easily be arrayed as dots on a single microscope slide using mechanical microarraying techniques. Next, sample RNA is extracted from normal lung tissue (a control sample) and from a lung tumor (a test sample). The RNA represents all of the genes expressed in these tissues, and the differences in the expression of RNA between the diseased tissue and the normal tissue can provide insights into the cause of the cancer and perhaps point to possible therapeutic agents as well. The “probe” samples from the two tissues are labeled with different fluorescent dyes. A predetermined amount of each of the two samples is then deposited on each of the microarray dots, where they competitively react with the DNA molecules. The RNA molecules that correspond to the DNA strands in the microarray dots bind to the strands, and those that do not are washed away.
The slide is subsequently processed in a scanner that illuminates each of the microarray dots with laser beams whose wavelengths correspond to the fluorescences of the labeling dyes. The fluorescent emissions are sensed and their intensity measured to ascertain, for each of the microarray dots, the degree to which the RNA samples correspond to the respective DNA sequences. In the experiment outlined above the image scanner separately senses the two fluorescences, and thereby provides for each dot two numerical values, or “expression levels,” that represent reactions of the RNA extracted from the normal and diseased tissues. The scanner may then plot the data on a scatter plot, which has axes that correspond, respectively, to the intensity levels of the two fluorescences. A user then analyses the pattern of the data on the scatter plot.
The purpose of these experiments is to identify individual data points that are located far enough from an identity line, i.e., a line along which the two intensities are the same, or from some other closed-form mathematical function, to denote a significant response difference. These points are commonly referred to as “out-lyers.” In other types of experiments, the purpose is to determine whether the data produce a scatter plot pattern that approximates the identity line, some other straight line, or some other function, such as, for example, a parabola. In these experiments, the observer of the plot judges how closely the plotted data points correlate with the locus of the line produced by the mathematical function. The invention described below is concerned with the types of experiments in which out-lyers are identified.
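As a minimal sketch of what "far from the identity line" could mean numerically, the following Python computes the perpendicular distance of a plotted point from the line on which both intensities are equal. The function name and the choice of perpendicular distance as the measure are illustrative assumptions; the text does not prescribe a particular distance measure.

```python
import math

def distance_from_identity(control, test):
    """Perpendicular distance of the point (control, test) from the line y = x."""
    # For the line x - y = 0, the point-to-line distance is |x - y| / sqrt(2).
    return abs(control - test) / math.sqrt(2)

# Hypothetical intensity pairs: one on the identity line, one well off it.
on_line = distance_from_identity(500.0, 500.0)
off_line = distance_from_identity(900.0, 100.0)
```

A point whose two intensities match yields a distance of zero, and the distance grows as the two intensities diverge.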
The out-lyers that are of particular interest in the experiment described above correspond to genes that are sufficiently “differentially expressed.” Differential gene expression is most often measured as the ratio of the control tissue expression level to the test tissue expression level, where an expression level is the absolute value of the associated fluorescence intensity.
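The ratio described above can be sketched in a few lines of Python. The function name, the sample intensities, and the handling of a zero test level are assumptions made for illustration only.

```python
def expression_ratio(control_intensity, test_intensity):
    """Ratio of control to test expression level, or None if undefined."""
    # An expression level is taken as the absolute value of the intensity.
    control_level = abs(control_intensity)
    test_level = abs(test_intensity)
    if test_level == 0:
        # Assumed handling: the ratio is undefined when the test level is zero.
        return None
    return control_level / test_level

# Hypothetical intensities for two microarray dots.
ratio_equal = expression_ratio(1200.0, 1200.0)  # equally expressed
ratio_diff = expression_ratio(2400.0, 600.0)    # control four times the test
```

A ratio near 1 corresponds to a point on or near the identity line, while ratios well above or well below 1 correspond to points farther from it.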
Genes that are nearly equally expressed in both the control tissue and the test tissue will produce scatter plot data that are on or near the identity line, while genes that are differentially expressed will produce plot data that are farther from the identity line. Genes with low expression levels will produce plot data that are near the origin, or (0,0) point, regardless of their differential expression levels. Low expression levels can indicate lower data reliability, owing to the low signal-to-noise ratio of the measurement. Accordingly, the experimenter may choose to omit the data from these genes from further study.
The identification of the genes that are candidates for further study is often done subjectively by visually judging which plotted points of the scatter plot are sufficiently far from the origin, that is, have high enough signal levels to justify confidence in the data, and/or are sufficiently far from the identity line, and thus, strongly differentially expressed. Known computer programs designed for the analysis of differential gene expression data often display a scatter plot, and provide to the user a mechanism to identify individual points of interest. For each identified point, the program may, for example, display or otherwise process the underlying gene data that generated the plotted point. Once the plotted points that meet the selection criteria have been identified by the user, the user may then collect or otherwise process the results for further analysis and experimentation.
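The two selection criteria described above, sufficient signal level and sufficient distance from the identity line, could be applied numerically rather than by eye. The sketch below is one possible formulation: the function name, the threshold values, the gene names, and the choice to require both criteria at once are all hypothetical, and expression levels are assumed to be non-negative.

```python
def select_candidates(spots, min_level=200.0, min_fold=2.0):
    """Return the names of spots that pass both selection criteria.

    spots: iterable of (name, control_level, test_level) tuples.
    min_level and min_fold are hypothetical thresholds.
    """
    selected = []
    for name, control, test in spots:
        low, high = min(control, test), max(control, test)
        # Criterion 1: far enough from the origin (at least one strong channel).
        if high < min_level:
            continue
        # Criterion 2: far enough from the identity line, expressed as a
        # fold difference in either direction.
        if low == 0 or high / low >= min_fold:
            selected.append(name)
    return selected

# Hypothetical expression levels for three microarray dots.
spots = [
    ("gene_a", 1000.0, 1050.0),  # strong signal, nearly equal expression
    ("gene_b", 2000.0, 400.0),   # strong signal, five-fold difference
    ("gene_c", 50.0, 10.0),      # five-fold difference, but weak signal
]
candidates = select_candidates(spots)
```

With these assumed thresholds only `gene_b` is selected: `gene_a` lies too close to the identity line, and `gene_c` lies too close to the origin to justify confidence in its data.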
It is simple to make qualitative judgments of the characteristics of individual plotted points in scatter plots that comprise a relatively small number of points. However, it is difficult to judge the differential expression ratio of the points, and/or to judge which points are just above or just below any particular expression level threshold. Further, these judgments and the identification of points of interest are more difficult to make with scatter plots that contain hundreds or thousands of data points. Accordingly, such visual judgments are poorly suited to the scatter plots associated with microarrays.