1. Field of the Invention
This invention relates to autonomous data mining, such as autonomous data mining in large sets of gene expression data.
2. Related Art
In computer systems having relatively large amounts of data, such as data recorded in a database system or other system for storage and retrieval of data, it is sometimes desirable to review that data to determine whether there are relationships between data elements that were previously unconfirmed or even unknown. This process is sometimes called “data mining,” and the term is typically used for programmed processes that operate on relatively large databases. For example, searching a large database of stock data for those securities that meet predetermined criteria for capitalization and earnings would be a form of data mining.
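The stock-screening example above can be sketched in a few lines. This is only an illustrative sketch: the `Security` record, its field names, the sample tickers and figures, and the threshold values are all hypothetical and are not drawn from any real database.

```python
from dataclasses import dataclass


@dataclass
class Security:
    """Hypothetical record for one security in a stock database."""
    ticker: str
    market_cap: float  # capitalization, in dollars
    earnings: float    # annual earnings, in dollars


def screen(securities, min_market_cap, min_earnings):
    """Return the securities meeting both predetermined criteria."""
    return [
        s for s in securities
        if s.market_cap >= min_market_cap and s.earnings >= min_earnings
    ]


# Made-up sample data, for illustration only.
SAMPLE = [
    Security("AAA", 5.0e9, 2.0e8),
    Security("BBB", 4.0e8, 9.0e7),   # fails the capitalization criterion
    Security("CCC", 2.0e9, 1.0e7),   # fails the earnings criterion
    Security("DDD", 1.5e9, 6.0e7),
]

matches = screen(SAMPLE, min_market_cap=1.0e9, min_earnings=5.0e7)
print([s.ticker for s in matches])
```

Note that both criteria are fixed by the user before the search begins; the search itself only reports which records satisfy them.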
Known methods of data mining include “clustering,” that is, attempting to divide the multiple data elements into a relatively small set of clusters. Other known methods include applying statistical methods to best-fit a predetermined relationship against the set of data, so as to determine a set of parameters for the predetermined relationship. These other known methods include multiple linear regression and other statistical and stochastic techniques. While these methods of the known art can generally achieve the purpose of evaluating predetermined relationships against a relatively large set of data, they are subject to the drawback common to all statistical methods, namely that they can deliver only a probabilistic assessment of the predetermined relationship against the set of data.
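As an illustration of best-fitting a predetermined relationship, the following sketch estimates the parameters of a multiple linear regression by solving the normal equations. It is a minimal example under stated assumptions, not a description of any particular known system: the function names, the two-predictor sample data, and the choice of Gaussian elimination are all introduced here for illustration. The form of the relationship (a linear one) is chosen by the user in advance; the method only determines its parameters.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique fit")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(rows, targets):
    """Least-squares fit of targets ~ b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + [float(v) for v in r] for r in rows]  # intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from y = 1 + 2*x1 + 3*x2, for illustration only.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coeffs = fit_linear(rows, targets)
print([round(c, 6) for c in coeffs])
```

With noisy real data the fitted parameters would carry uncertainty, which is the probabilistic character noted above; this sketch omits any estimate of that uncertainty.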
One problem with the known art is that the researcher or other person (that is, a “user”) must have a predetermined relationship in mind before attempting to apply it against the set of data. For example, when searching a large database of stock data, the user must have a predetermined relationship and a set of predetermined stock parameters in mind for evaluation before known data mining techniques can evaluate whether that predetermined relationship applies well to that set of predetermined stock parameters. This could be referred to as a hypothesis-generating problem.
A second problem with the known art is that the predetermined relationship might have little or no relationship to domain-specific knowledge about the set of data. For example, when searching a large database of stock data, the user might request evaluation of a predetermined relationship among a set of predetermined stock parameters that have, in any real-world model of the stock market, no relationship to each other (for example, whether stocks with a price/earnings ratio that is a prime number occur more frequently when the Moon is in the Aries constellation). This could be referred to as the uninteresting-hypothesis problem.
A third problem with the known art is that the predetermined relationship and the set of data must be determined ahead of the operation of the data mining method. For example, when searching a large database of stock data, the user must assure that all needed data is available before attempting to perform data mining. This could be referred to as the known-database problem.
All three of these problems are particularly acute in the field of scientific research into gene expression, for the reasons that follow.
First, databases of gene expression data have been collected by researchers and are often made available to each other, either in the context of academic research or in the context of pharmaceutical or other for-profit research and development. These databases are relatively large, and are getting substantially larger as time goes by, both due to work by researchers in obtaining new gene expression data and due to improved methods for obtaining that data in greater quantity and at greater speed. As an emergent consequence of the rapid growth of databases of gene expression data, it has become extremely difficult for individual researchers to maintain familiarity even with the scope of data available for review.
Second, gene expression data includes raw data describing measurements of activity for individual strands of mRNA (messenger RNA). These measurements can differ according to the times at which they were taken, the patients from whom they were taken, the clinical samples obtained from one or more patients, the medical conditions of those patients, the prescription or other drugs the patients were taking, the chemical milieus in which the measurements were taken, and many other possible conditions. Collection, recording and publication of gene expression data are known in the art of biochemical research. As might be inferred from this description, sets of gene expression data can be extremely complex, with no relationships immediately apparent to a reviewer of the data. Moreover, new sets of gene expression data are generated from time to time, thus increasing the available pool of gene expression data relatively continuously.
Third, the research community does not always make these sets of gene expression data available immediately upon production. Sometimes individual sets of data are checked for consistency or quality control. Sometimes one or more researchers have a particular predetermined relationship they would like to evaluate (and publish) before allowing other research groups to access those sets of data. As the number and size of sets of gene expression data become larger, and as the number of researchers interested in those sets of data becomes larger, the chance that a valuable set of data is not available to one or more researchers interested in that set of data becomes greater.
Fourth, the particular biological processes that sets of gene expression data reflect are relatively complex. There are relatively large numbers of genes, activation of each of which possibly affects large subsets of other genes, in ways that are presently not well known. (That is why study of gene expression data is called “research.”) Many of these processes are highly nonlinear, that is, a small change in amount of gene expression for a first gene can result in very large changes in amounts of gene expression for one or more downstream sets of genes. Many of these processes have feedback, feed-forward, or other complex topological loops, so that gene expression for a first gene can have multiple different effects on gene expression for both a second gene and for the first gene itself. Even relatively simple examples known as cell cycles can involve relatively long feedback loops, each element of which itself includes a relatively complex set of interactions.
Known methods of examining gene expression data include examining the data “by hand,” that is, by an interested researcher who formulates hypotheses, performs operations on the data to evaluate those hypotheses, and determines if there is sufficient support for those hypotheses to warrant further experiment or even publication of results of the evaluation. While these known methods generally achieve the goals of finding and publishing interesting and useful statements about gene expression to the research world, they are subject to several drawbacks. As noted above, there is a relatively large amount of gene expression data. The amount of such data is rapidly increasing and is not easily subject to efficient or effective search by human researchers. Researchers do not have adequate time to review all the relevant data. Researchers also do not have adequate time to determine all the relevant data in their field, or in related fields. Researchers also often work in close-knit groups and are therefore not always aware of similar work being performed by other researchers. Moreover, as noted above, problems in gene expression analysis are relatively complex, and are therefore not easily subject to “by hand” analysis of extensive data.
Accordingly, it would be desirable to provide a technique in which data mining is performed with regard to a set of data, possibly interesting hypotheses are formulated in response thereto, and those hypotheses are reported. In one aspect of the invention, this technique can be achieved by performing a robotic process with regard to a set of data in a database, so as to formulate potentially interesting hypotheses and so as to communicate those hypotheses to researchers and other persons having an interest therein.