The advent of new experimental technologies that support molecular biology research has resulted in an explosion of data and a rapidly increasing diversity of biological measurement data types. Examples of such biological measurement types include gene expression from DNA microarray or TaqMan experiments, protein identification from mass spectrometry or gel electrophoresis, cell localization information from flow cytometry, phenotype information from clinical data or knockout experiments, and genotype information from association studies and DNA microarray experiments. This data is rapidly changing, as new technologies frequently generate new types of data.
Understanding observed trends in gene or protein expression often requires correlating this data with additional information such as phenotype information, clinical patient data, putative drug treatment dosages, and graphical representations of biological information. Even when fairly rigorous computational techniques such as machine learning-based clustering or classification schemes are used, the results of these techniques are typically cross-checked with observed phenotypes or clinical diagnoses to interpret what the computational results might mean.
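The cross-check described above can be illustrated with a minimal sketch, assuming each sample carries one computed cluster label and one observed phenotype. The function names, the purity measure, and the toy data are hypothetical and chosen only for illustration; they are not taken from any of the tools discussed in this section.

```python
from collections import Counter


def cross_tabulate(cluster_labels, phenotypes):
    """Count how often each (cluster, phenotype) pair occurs."""
    if len(cluster_labels) != len(phenotypes):
        raise ValueError("one phenotype is required per clustered sample")
    return Counter(zip(cluster_labels, phenotypes))


def cluster_purity(cluster_labels, phenotypes):
    """Fraction of samples whose phenotype is the most common one in their cluster."""
    if not cluster_labels:
        raise ValueError("at least one sample is required")
    table = cross_tabulate(cluster_labels, phenotypes)
    best = {}  # largest phenotype count seen so far for each cluster
    for (cluster, _phenotype), count in table.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(cluster_labels)


# Hypothetical toy data: six samples, two computed clusters, two phenotypes.
clusters = ["A", "A", "A", "B", "B", "B"]
phenotypes = ["invasive", "invasive", "non-invasive",
              "non-invasive", "non-invasive", "non-invasive"]

table = cross_tabulate(clusters, phenotypes)
purity = cluster_purity(clusters, phenotypes)
```

A purity near 1.0 would suggest that the computed clusters line up with the observed phenotypes, while a low value would suggest that the clustering reflects some other property of the data.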
Currently, correlations of the experimental data with the types of additional information exemplified above are often found by manually (i.e., visually) inspecting the additional (e.g., clinical) data and visually comparing it with the experimental data to look for similarities (i.e., correlations) between experimental and observed phenomena. For example, a researcher might notice a highly up- or down-regulated gene during inspection of a microarray experiment and then explore the available clinical data to see if any observed clinical data correlates with the known function of the gene involved in the microarray experiment. Finding correlations in this manner could be described as a “hit-or-miss” procedure and is also dependent upon the accumulated knowledge of the researcher. Further, the large volumes of data that are generated by current experimental procedures, such as microarray procedures, make this method of correlating an extremely tedious, if not impossible, task.
Efforts have been made to consolidate the data to be analyzed for correlations between experimental results and observed phenomena into massive spreadsheets or tabular displays. However, the usefulness of these types of approaches has been limited because, due to the sheer volumes of data that usually need to be analyzed, it becomes impossible to view all relevant experimental data together, at once, on a single screen to allow visual comparison. Accordingly, it becomes necessary to provide split views, scrolling or multiple windows in order to view all of the data needed for performing the analysis. Not only does this make it difficult to make visual comparisons among the data contained in different screens, windows or views, but manipulating the data so as to make visual comparisons according to different characterizations of the data (different types of sorting, clustering, classification, etc.) to search for trends, correlations or other insights becomes unwieldy and problematic.
Efforts to visualize and discover overall gene expression patterns from large gene expression data sets have met with little success. For example, scatter plots and parallel coordinate techniques available with Spotfire 4.0 and Spotfire 5.0 were used by Pan in an attempt to identify expressed sequence tags (ESTs) having expression patterns similar to those of known genes. The expression patterns of both the ESTs and the known genes were obtained from a data set including melanoma samples and normal (control) samples provided by the National Human Genome Research Institute (see Pan, Zhijian: “Application Project: Visualized Pattern Matching of Malignant Melanoma with Spotfire and Table Lens”, http://www.cs.umd.edu/class/spring2001/cmsc838b/Apps/presentations/Zhijian_Pan/). The use of scatter plots was reported to be incapable of managing the complexity of the data set being examined. The use of parallel coordinates with Spotfire 5.0 was more promising, in that it was capable of displaying all thirty-eight experimental conditions on a single page, where similarities in expression patterns could be searched for.
Table Lens was also employed by the same researcher to visualize expression patterns of the ESTs and known genes. However, it was reported that Table Lens was ineffective and “very difficult” to use for finding matching patterns. Neither version of Spotfire (4.0 or 5.0) was used to compare expression or other experimental data with supporting clinical data or data sets of any other type; both were used only in attempting to group like data within the experimental data set.
A tool for forming a compressed view of gene expression results from multiple microarrays is described in co-pending and commonly owned application Ser. No. 10/209,477 filed Jul. 30, 2002 and titled “Method of Identifying Trends, Correlations, and Similarities Among Diverse Biological Data Sets and System for Facilitating Identification”, which is incorporated herein in its entirety, by reference thereto. In one example, microarray experimental data used to generate the compressed visualization was obtained from the National Human Genome Research Institute of the National Institutes of Health. Experiments were performed with respect to thirty-one subcutaneous melanoma patients using DNA microarrays. For each patient, eight thousand and sixty-six individual microarray measurements were displayed. Additionally, clinical data, patient cluster, and gene-specific annotations corresponding to the gene represented by the expression ratios were contained within the respective rows of microarray data. Since the data set is highly de-normalized, for a given patient, the data in the clinical columns was repeated for each gene measured by that patient's microarray. In order to display such a massive number of rows in a single visualization, this system also employed Table Lens, which allowed the diverse data sets to be compressed, displayed and inspected simultaneously in graphical form on a single display. In this example, the system was based on a product known as Eureka, by Inxight. A complete description of the functionality of Table Lens can be found in U.S. Pat. Nos. 5,632,009; 5,880,742 and 6,085,202, each of which is incorporated herein, in its entirety, by reference thereto. The resultant visualization was a very dense graphical display representing 241,980 rows of data entirely visible on a single standard computer display.
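The de-normalized layout described above, in which a patient's clinical values are repeated on every gene row for that patient, can be sketched as follows. This is only an illustration under stated assumptions: the function name, the field names, the gene identifiers and all values are hypothetical and are not drawn from the referenced application or its data set.

```python
def denormalize(clinical_by_patient, expression_by_patient):
    """Build one flat row per (patient, gene) measurement.

    The patient's clinical fields are copied onto every row, which is what
    makes the resulting table de-normalized.
    """
    rows = []
    for patient_id, measurements in expression_by_patient.items():
        clinical = clinical_by_patient[patient_id]
        for gene, ratio in measurements.items():
            row = {"patient": patient_id, "gene": gene, "ratio": ratio}
            row.update(clinical)  # repeat the clinical columns on each gene row
            rows.append(row)
    return rows


# Hypothetical toy data: two patients, three genes each.
clinical = {
    "P1": {"patient_cluster": 1, "invasive_ability": "high"},
    "P2": {"patient_cluster": 2, "invasive_ability": "low"},
}
expression = {
    "P1": {"GENE_A": 2.1, "GENE_B": 0.4, "GENE_C": 1.0},
    "P2": {"GENE_A": 0.9, "GENE_B": 1.7, "GENE_C": 3.2},
}

rows = denormalize(clinical, expression)
```

The row count grows as the number of patients multiplied by the number of genes measured per patient, so even a modest number of patients yields a table far too long to inspect row by row on one screen.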
The visualization was highly compressed, with graphical values displayed to represent groups of cell values, since the compression prevented each individual row or cell value from being displayed. The tool further provides the capability of sorting by various data categories, such as “patient cluster” and “invasive ability”, for example, as described in the application. As a result of such sorting operations, correlations may be observed between patient clusters or other categorical criteria. Although the system and methods described in the above application can be very useful and powerful in preparing visualizations for the analysis of biological data, they also require a significant amount of learning and familiarization with what is otherwise a quite non-intuitive display for those trained in the biological research disciplines. Users who have not dedicated enough time to fully understand how to manipulate and interpret the display are likely to be confused or intimidated by the graphical representations of the compressed data and unsure how to interpret them.
More powerful methods of combining widely diverse, but related and potentially correlated, biological data sets are needed to improve the ease, speed and efficiency of correlating information in these data sets. Further, more powerful methods are needed to improve the probability that such correlations will be identified.