1. Field of the Invention
This invention pertains generally to large database analysis, and more particularly to a system and method for identifying global association patterns in databases to facilitate an understanding of complex systems with numerous variables such as biological systems.
2. Description of Related Art
The development of computer systems with high-density memory devices and efficient communications has led to the appearance of large databases with complex data structures. For example, point of sale information from retailers, financial and insurance transactions and claims, credit card activity and similar information from across the country is stored in a number of large databases. Extracting useful information from such large databases to develop business or sales strategies is problematic. Various methods of “data mining” have been developed to identify and interpret patterns and global relationships in such databases so that information can be derived about the individual variables, allowing strategic decision making and reliable predictions. Data mining allows the observation of trends, rules, patterns, constraints, irregularities, correlations and the like in large volumes of collected data. The process of data mining may use such methods as clustering, decision trees, association rule generation, memory based reasoning and neuro-analysis. The most suitable analysis will depend on the nature of the collected data to be analyzed and the purpose of the analysis.
Databases relating to the research of biological systems tend to be particularly large and complex. Cellular activities form a complex biological system that includes a network of often interrelated subsystems. The rules governing the genetic circuitry that expresses some genes and represses others, in response to a variety of both inner and outer stimuli, are largely unknown. Modern DNA gene arrays allow the simultaneous measurement of the levels of expression of thousands of genes from a single biological sample. The gene array, or microarray, is a high throughput biochemistry technique for generating data about the quantity of mRNA expressed by each gene in cells of an organism under a set of conditions. Here “conditions” broadly refers, for example, to doses of a tumor growth inhibition agent, times in a cell cycle, strains of mutation, types of cancer, and so on. The data, commonly referred to as a gene expression profile, can be used to infer global cellular activities that would otherwise be hard to describe.
However, the statistical analysis of gene array data is usually the greatest difficulty encountered in using the technique in molecular biology. Large volumes of array data typically overwhelm traditional methods and models of data analysis. The ultimate usefulness of this powerful biotechnology hinges on how the data are processed.
One procedure for distilling information from microarray data is through the use of correlation. Correlation is a traditional way of summarizing the relationship between any two variables in a system after the collection of empirical data. A positive correlation between variables implies similar behavior and a similar role within the system, while a negative correlation indicates opposite behavior and a complementary role.
It is well accepted that genes with similar profiles are likely to share common cellular roles and participate in related pathways. Pearson's correlation coefficient, a popular measure of similarity, is thought to conform well to the intuitive biological notion of what it means for two genes to be “coexpressed”. For any pair of genes (X, Y), a correlation can be computed using formulas in the prior art. All correlations are then compared with each other or with a default value to determine statistical significance. A high and significant correlation is interpreted as likely evidence of a functional association between genes, meaning that their gene products are more likely to form a protein complex or participate in common metabolic or regulatory pathways and biological processes. This correlation-based similarity analysis, together with variations and extensions for clustering or classification, has generated an enormous number of biologically significant results. Indeed, many gene clusters with similar expression patterns and shared roles in common cellular processes have been observed, including histones, cytoplasmic ribosomal proteins, proteasome, cytochrome-C, ATPase complex, and others.
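The pairwise computation described above can be sketched as follows. This is only an illustrative sketch of the standard Pearson formula, written in plain Python; the function name `pearson` and the profiles `gene_x`, `gene_y` and `gene_z` are hypothetical and are not measurements or part of any particular prior art implementation.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles over five conditions (illustrative values only).
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [2.1, 3.9, 6.2, 8.0, 9.8]  # rises together with gene_x
gene_z = [5.0, 4.0, 3.0, 2.0, 1.0]  # falls as gene_x rises

r_xy = pearson(gene_x, gene_y)  # strongly positive: similar profiles
r_xz = pearson(gene_x, gene_z)  # strongly negative: opposite profiles
```

In this toy case the first pair would be read as coexpressed and the second as oppositely expressed; judging statistical significance would require a further step that is not shown here.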
Positively or negatively associated variables, once detected, can be used as building blocks toward understanding the system and predicting its behavior. In extremely complex systems, however, the vast majority of cases produce weak correlations or none at all between variables. This is often the rule, even for variables that are suspected on theoretical grounds to be closely related. This renders traditional correlation-based similarity analysis ineffective in such cases.
Despite its many successful applications, Pearson's correlation coefficient, like the related Spearman's rho and Kendall's tau known in the art, is not flexible enough to describe all of the kinds of in vivo association patterns between genes. There are numerous cases in which conventional correlation analysis fails: functional association is evident according to experimental evidence, yet the statistical correlation from gene assay evidence is practically zero. The abundance of such negative results constitutes a bottleneck in distilling more information from microarray data.
The observed lack of correlation can be due to noise in the data. Quite often, however, the primary reason is the biological complexity of the subsystem and the nature of its components, such as the pleiotropic or multiple functions of a protein, varying cellular oxidation-reduction states, fluctuating hormone levels or other cellular signals, and the like. The case of two transcription factors, Max and thyroid hormone receptor (TR), is illustrative. These factors can serve either as activators or as repressors, depending on the other molecules that are bound to them. Max can bind to Myc and form a Myc-Max dimer that acts as a transcription activator. But when Max is bound to Mad, the Mad-Max dimer serves as a repressor. TR associates with RXR to form a TR-RXR dimer that serves as a repressor in the absence of thyroid hormone. In the presence of thyroid hormone, the TR-RXR dimer is converted into an activator. Histone deacetylation is involved in both repressing events. Thus, for example, if X is taken to be the expression profile of the gene encoding TR and Y is taken to be the profile of one of its target genes, then X and Y may be either positively correlated or negatively correlated, depending on the hormone level. If the hormone level fluctuates in an unspecific manner, the opposing directions of correlation may cancel each other out, and no similarity-based analysis may succeed in detecting the functional association between X and Y. Thus, uncovering the biological reasons behind the lack of detectable correlation is a new approach to understanding the gene expression network.
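The cancellation described above can be illustrated with a small numerical sketch. It assumes an idealized, noise-free toy model in which a target profile Y equals X when a hidden state is "present" and equals the negative of X when the state is "absent"; the names `x_levels`, `samples` and `corr_for`, and all values, are hypothetical and serve only to show the effect. The sketch does not represent the method of the present invention.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical, not measurements): centered regulator levels X.
x_levels = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Each row is (state, x, y). State +1 stands for "hormone present" (activation,
# Y follows X); state -1 stands for "hormone absent" (repression, Y opposes X).
samples = []
for state in (+1, -1):
    for x in x_levels:
        samples.append((state, x, state * x))


def corr_for(rows):
    """Correlation between X and Y over the given rows."""
    return pearson([r[1] for r in rows], [r[2] for r in rows])


r_all = corr_for(samples)                                 # states pooled together
r_present = corr_for([r for r in samples if r[0] == +1])  # within "present"
r_absent = corr_for([r for r in samples if r[0] == -1])   # within "absent"
```

With the states pooled, the positive and negative contributions to the covariance offset one another and the overall correlation vanishes, while within each state X and Y remain perfectly correlated in opposite directions.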
Accordingly, there is a need for a system and method for identifying an advanced type of association between variables in complex systems, where the correlation between variables is not readily apparent but in fact exists and changes constantly in response to changes in a state variable from unknown sources. The present system simultaneously identifies such variables and points to variables that are correlated with the state variable. The present invention generally overcomes the deficiencies found in existing equipment, systems and methods, which fail to exploit structures that are uncorrelated or only weakly correlated.