The present invention deals with the analysis and visualization of multidimensional data. In particular, the analysis and visualization of multidimensional biological data is addressed.
Biological systems are notorious for their complexity. One small change can have unpredictable consequences in apparently unrelated areas. The study of complex biological systems relies heavily on statistical analysis and on the experience of the analyst in recognizing patterns and designing experiments that highlight the relationships among a multiplicity of factors.
The present invention provides methods and systems for the visualization of complex, multidimensional data in a manner that permits the recognition of a variety of relationships in the data. The present application of a component plane presentation to clustered data from complex biological systems, coloring the clustered data according to values for one component at a time, shows surprisingly different patterns among the clustered data compared to the typical visualization methods of the art, such as U-map and self-organizing map output.
With the completion of human genome sequencing rapidly approaching, functional genomics is becoming extremely prominent in the field of biology. DNA microarray technology emerged [Schena, M., Shalon, D., Davis, R. W., Brown, P. O., “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science, 270:467-470 (1995)]. In microarray methodology, inserts from tens of thousands of cDNA clones (i.e., probes) robotically arrayed on a glass slide are probed with labeled pools of RNA (i.e., targets). These technological advances have made it possible to conduct research on a microscale at very high throughput. Microarray and gene chip technologies permit many microreactions to be conducted in parallel on a small scale, using relatively small amounts of reagents. These technological advances in obtaining biological data strengthen the need for simple, visual inspection of the large quantities of data obtained.
Because the amount of data generated by each microarray experiment is substantial—potentially equivalent to that obtained through tens of thousands of individual nucleotide hybridization experiments done in the manner of traditional molecular biology (i.e., Northern blots)—it is extremely challenging to convert such a massive amount of data into meaningful biological networks. Current efforts toward this direction have primarily focused on clustering and visualization methods of data analysis.
The goal of clustering methods is to catalogue genes or RNA samples into functionally meaningful groups. Data visualization methods help to exhibit clustering results by conveniently representing the clustered data as an image for visual elucidation.
A commonly applied clustering method is hierarchical clustering, which is an unsupervised clustering algorithm primarily based on the similarity measure between individuals using a pairwise average-linkage clustering [Eisen, M. B., Spellman, P. T., Brown, P. O., Botstein, D., “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl Acad. Sci., USA, 95:14863-14868 (1998)]. Through the pairwise comparison, this algorithm eventually clusters individuals into a tree view. The length of the branches of the tree depicts the relationship between individuals, where the shorter the branch the more similarity there is between individuals.
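The pairwise average-linkage procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation of the cited work: it assumes Euclidean distance as the dissimilarity measure (published applications often use a correlation-based similarity instead), and the function name `average_linkage` and the sample profiles are hypothetical.

```python
from itertools import combinations
from math import dist

def average_linkage(points):
    """Agglomerative clustering with pairwise average linkage.

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    # Each cluster is a tuple of indices into `points`.
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Average of all pairwise distances between members of a and b.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical expression profiles (rows = individuals, columns = conditions).
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
merges = average_linkage(profiles)
```

The merge distances play the role of the branch lengths in the tree view: the two tight pairs are joined first at a short distance, and the final merge between the two pairs occurs at a much larger distance.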
A major drawback of hierarchical clustering is the phylogenetic structure of the algorithm. The phylogenetic clustering algorithm may lead to incorrect clustering, which is a particular problem with large and complex data sets, such as those from biological experiments.
Another clustering method that has been gaining in popularity is the recently introduced self-organizing map (SOM) [Kohonen, T., “Self-organizing maps,” in Volume 30 of Springer Series in Information Sciences, Springer (Berlin, Heidelberg, N.Y.: 1995); Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J., “Engineering applications of the self-organizing map,” Proc. IEEE, 84:1358-1384 (1996)]. SOM is an artificial intelligence algorithm based on unsupervised learning. The SOM algorithm configures the output vectors into a topological presentation of the original data, producing a self-organizing map in which individuals with similar features are mapped to the same map unit or nearby neighboring units. The SOM neighborhood map creates a smooth transition of related individuals to unrelated individuals over the entire map. More importantly, an SOM ordered map provides a convenient platform for visual inspections of large numerical data sets.
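The way an SOM places similar individuals on the same or neighboring map units can be illustrated with a short training loop. The sketch below is a simplified online SOM under stated assumptions: a rectangular grid, a Gaussian neighborhood, and linearly decaying learning rate and radius. The names `train_som` and `best_matching_unit`, the parameter values, and the sample data are hypothetical and are not taken from the cited references.

```python
import math
import random

def train_som(data, rows, cols, epochs=50, seed=0):
    """Train a small rectangular SOM; returns {(row, col): model_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One model vector per map unit, initialised at random in [0, 1).
    som = {(r, c): [rng.random() for _ in range(dim)]
           for r in range(rows) for c in range(cols)}
    steps = epochs * len(data)
    for t in range(steps):
        x = data[rng.randrange(len(data))]
        frac = t / steps
        alpha = 0.5 * (1.0 - frac)                              # decaying learning rate
        sigma = max(0.5, (max(rows, cols) / 2) * (1.0 - frac))  # shrinking radius
        bmu = best_matching_unit(som, x)
        for unit, w in som.items():
            grid_d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2 * sigma ** 2))           # Gaussian neighbourhood
            for k in range(dim):
                # Pull the unit towards the sample, more strongly near the BMU.
                w[k] += alpha * h * (x[k] - w[k])
    return som

def best_matching_unit(som, x):
    """Grid position of the map unit whose model vector is closest to x."""
    return min(som, key=lambda u: math.dist(som[u], x))

# Hypothetical data: two groups of similar three-component profiles.
data = [[0.1, 0.1, 0.2], [0.0, 0.2, 0.1], [0.9, 0.8, 1.0], [1.0, 0.9, 0.9]]
som = train_som(data, rows=3, cols=3)
```

Because every update moves the best-matching unit and its grid neighbors together, adjacent units end up with similar model vectors, which is what produces the smooth transition across the map.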
SOM has been utilized by several groups for gene clustering analysis [Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., Golub, T. R., “Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation,” Proc. Natl Acad. Sci., USA, 96:2907-2912 (1999); Toronen, P., Kolehmainen, M., Wong, G., Castren, E., “Analysis of gene expression data using self-organizing maps,” FEBS Lett., 451:142-146 (1999); Chen, J. J., Peck, K., Hong, T. M., Yang, S. C., Sher, Y. P., Shih, J. Y., Wu, R., Cheng, J. L., Roffler, S. R., Wu, C. W., Yang, P. C., “Global analysis of gene expression in invasion by a lung cancer model,” Cancer Res., 61:5223-30 (2001); White, K. P., Rifkin, S. A., Hurban, P., Hogness, D. S., “Microarray analysis of Drosophila development during metamorphosis,” Science, 286:2179-2184 (1999)].
However, many of the potential benefits of SOM—particularly for visual inspections—have not yet been explored. The deficiency in applying visualization methods to SOM output may have led to the observed under-utilization of the powerful SOM data mining tool in the analysis of microarray data.
The conversion of such massive amounts of data into meaningful information has been limited largely by a lack of robust and easy-to-interpret methods of data analysis. Lately, there have been significant advances in the automation of data organization to facilitate the recognition of characteristic features of a data matrix. The most remarkable advances in data organization revolve around processing the data with a self-organizing network to produce a self-organizing feature space mapping. Preferably, the self-organizing network is unsupervised. The organization of the data is known as “training” or “modeling” of the data. However, there remains a need for visualization of the organized data in a manner that facilitates drawing conclusions regarding the data.
Many methods of the art for visualizing data output after data modeling view the value of the final modeled data or reduce the number of dimensions of the data output to a few dimensions (typically two or three dimensions). Examples of visualization methods of the art are shown in FIGS. 2 to 3 and FIGS. 5 to 6, and are discussed in more detail hereinbelow.
Because the present invention involves the visualization of data that has already been clustered, an important aspect of the background of the present invention is the known methods of data clustering. In particular, a brief discussion is warranted of methods of data clustering (organization) known in the art.
One useful statistical method of handling vast quantities of data is to model the data using an independent, iterative process known as SOM (self-organizing map). Although the SOM has shown promising potential for the processing of microarray data, the tools utilized to date to visualize the organized data fail to fully reveal many beneficial features of the algorithm and depreciate the value of this powerful data mining tool in gene expression analysis.
In “SOM-Based Exploratory Analysis of Gene Expression Data,” Samuel Kaski applied SOM technology to the expression of yeast genes, analyzing gene clusters such as genes known to be associated with cytoplasmic degradation, respiration and mitochondrial organization. Kaski visualized the SOM output in a U-matrix (Unified Distance Matrix, a Euclidean neighborhood analysis) display. The SOM was defined by an ordered set of data model vectors, one vector attached to each map unit or grid point.
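A U-matrix of the kind referred to above summarizes, for each map unit, how far its model vector lies from those of its grid neighbors. The following sketch is a simplified per-unit variant under stated assumptions: a rectangular grid with four neighbors per unit (a hexagonal grid would have six) and a map of at least two units. The function name `u_matrix` and the sample model vectors are hypothetical.

```python
from math import dist

def u_matrix(som, rows, cols):
    """Mean Euclidean distance from each unit's model vector to its grid neighbours."""
    umat = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Up, down, left and right neighbours that fall inside the grid.
            neighbours = [(r + dr, c + dc)
                          for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                          if 0 <= r + dr < rows and 0 <= c + dc < cols]
            total = sum(dist(som[(r, c)], som[n]) for n in neighbours)
            umat[r][c] = total / len(neighbours)
    return umat

# Hypothetical 1 x 4 map: two tight groups of model vectors separated by a gap.
som = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
       (0, 2): [0.0, 9.0], (0, 3): [0.0, 10.0]}
umat = u_matrix(som, 1, 4)
print(umat)  # [[1.0, 4.5, 4.5, 1.0]]
```

High values mark units that sit on a boundary between groups of model vectors, while low values mark the interior of a group. Note that the display reflects distances over the full model vector rather than any single component.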
However, Kaski found the “noisiness” of the U-matrix visualization to be problematic. As a solution, Kaski proposed a method to better define the edges of the clusters by coloring the U-matrix based on the difference between the data gradients of the U-matrix visualized SOM output data cells. Kaski used lightness to show similar data density gradients (i.e. clusters) and color to depict similarity of the data. Kaski's advance in U-matrix visualization of the data provides one approach to better define groups in the clustered data. The present work provides an alternative approach.
In “Analysis and Visualization of Gene Expression Data Using Self-Organizing Maps”, by Kaski et al., an SOM-treated nonlinear map of multidimensional genetic data is analyzed and visualized as a hexagonal U-matrix map. Kaski's cluster-defining method discussed above was used in this example application to biological data.
The above methods applied by Kaski et al. focus on analysis of the density of the SOM output model vectors. As such, the methods permit visualization of various aspects of the full SOM output data vector, and the density of the overall data clusters. Kaski's work primarily uses U-matrix visualization of the data and provides one view of possible relationships in the data. There remains a need for additional information to be drawn from the data using alternative visualization methods such as that provided by the present invention (e.g. compare FIG. 1 and FIG. 3).
Another useful statistical method of handling vast quantities of data is to model the data using an independent, iterative process known as feed-forward neural networks. Several patents relating to data organization into clusters include Pao, et al. U.S. Patent Publication No. US 2001/0032198 A1, which is a continuation of U.S. Pat. No. 6,212,509, which is a continuation of U.S. Pat. No. 6,134,537, which is a continuation-in-part of U.S. Pat. No. 5,734,796. Pao et al. use reduced-dimension data mapping of pattern data using conventional single-hidden-layer feed-forward neural networks with nonlinear neurons. Pao et al. visualize the data as a topologically correct low-dimension approximation of the clustered data. Such a visualization method projects the modeled vectors into lower-dimensional space (for example, a sphere may be projected as a circle and a helix as a spiral or zig-zag) and reflects the actual modeled data.
Still another useful statistical method of handling vast quantities of data is to model the data using an independent, iterative process known as a hierarchical artificial neural network. Hoffman U.S. Pat. No. 6,278,799 B1, a continuation of U.S. Pat. No. 6,035,057, discloses a hierarchical data matrix pattern recognition system that uses a hierarchical artificial neural network for the analysis of complex data to automate the recognition of patterns in data matrices. Hoffman's methodology is applied to weather maps visualized at various altitudes. As with Pao et al., above, Hoffman's visualization method is a projection that preserves the topology of the trained and clustered data.
Almasi et al. teach yet another statistical method of handling vast quantities of data. Almasi et al. U.S. Pat. No. 6,260,036 B1 discloses a method and apparatus for organizing data into clusters where each cluster comprises a number of records with common input parameters. Almasi et al. visualized the clustered data as a neighborhood map of square cells in which the data is presented as a dot (relative size depending on the data density) or as a pie chart in each cell. The visualization method of Almasi et al. is similar to that of Kaski et al., using bar graphs. Such data visualizations, as shown in FIG. 2, are complex and difficult to interpret.
In still another statistical method of handling vast quantities of data, Sirosh U.S. Pat. No. 6,226,408 B1 discloses methods of pre-analysis and clustering of data using unsupervised identification of nonlinear data clusters in multidimensional data. Sirosh visualizes a weighted topological graph of the vector space, using the cluster centers as nodes and weighting the cluster edges between the nodes as a function of the density of the vectors between the linked nodes to depict the relationships between the mapped data. As with Kaski's advance in U-matrix visualization, such visualization methods focus on the density of the clustered data and provide limited means to study the relationships between the clustered data.
Vesanto discloses component plane presentation as a visualization tool of SOM data [Vesanto, J., “SOM-based data visualization methods,” Intelligent Data Analysis, 3:111-126 (1999); Basilevsky, A., “Statistical factor analysis and related methods, theory and applications,” John Wiley & Sons, New York, N.Y., 1994]. Vesanto fails to teach or suggest the potential benefits of applying component plane presentation visualization methods to draw conclusions about data from biological experiments. None of the other workers who investigated SOM clustering methods on biological data taught or suggested the application of component plane presentation to analyze the data.
As is evident from the discussion above, there are various ways to depict the reduced-dimension data. A common approach is to view a grouped representation of the data vectors. An example of this approach is a map with bar charts in cells representing the data vectors. Bar chart cells near one another depict more closely related data than bar chart cells distant from one another on the map. Similarly, line or pie charts depicting the data can be shown in the cells.
There is a need for other methods and apparatus for the visualization of multidimensional data that permit analysis of empirical relationships between the data. Methods of data visualization that permit viewing of the clustered data based on the components of the modeled data, such as the component of time in a time course, temperature of the reaction, intensity of the output, quantity of a reagent, or an empirical parameter, allow appreciation of relationships between the data that may not be apparent from inspection of the full data modeling output.
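Viewing clustered data one component at a time can be sketched as slicing the SOM model vectors by component, so that each component yields its own grid of values over the same map layout. The code below is an illustrative sketch only and is not asserted to be the method of the present invention: the function names `component_planes` and `normalise`, the min-max scaling, and the sample model vectors are all assumptions.

```python
def component_planes(som, rows, cols):
    """Slice SOM model vectors into one rows x cols grid per input component."""
    dim = len(next(iter(som.values())))
    return [[[som[(r, c)][k] for c in range(cols)] for r in range(rows)]
            for k in range(dim)]

def normalise(plane):
    """Rescale one plane to [0, 1] so that it can be passed to a colour map."""
    flat = [v for row in plane for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a constant plane
    return [[(v - lo) / span for v in row] for row in plane]

# Hypothetical 2 x 2 map with three components (e.g. three time points).
som = {(0, 0): [1.0, 5.0, 2.0], (0, 1): [2.0, 5.0, 4.0],
       (1, 0): [3.0, 5.0, 6.0], (1, 1): [4.0, 5.0, 8.0]}
planes = component_planes(som, 2, 2)
print(planes[0])  # [[1.0, 2.0], [3.0, 4.0]]
print(planes[1])  # [[5.0, 5.0], [5.0, 5.0]]
```

In this toy example the first and third planes show the same pattern after scaling, which suggests that those two components vary together across the map, while the second plane is flat. Relationships of this kind are not visible in a display that summarizes the full model vector in a single value.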
There is a great demand for easily interpretable methods and apparatus for visualizing multidimensional data in ways that highlight patterns and trends and/or help data analysts appreciate various aspects of the data.
The present invention provides methods and systems to facilitate pattern recognition in complex biological data using component plane presentations of clustered data.