The present invention relates to the field of sorting and analysis of vast amounts of scientific data, and more particularly to methods and systems for the discretizing of biological, medical or biochemical data and generation of logical rules from that data, followed by processing of the generated rules.
Scientists create vast amounts of raw data. The sheer volume of such data renders difficult, if not impossible, the ability to draw complete conclusions from that data. Accordingly, mathematicians are requested to develop processes to analyze such data and, in particular, to study, organize, and determine rules (also called logical statements) for the presentation and analysis of such raw data in ways that can become scientifically important by exposing rules (and their bases) in a manner that permits conclusions to be drawn.
Traditional approaches to the creation of scientific data (and, more importantly, biological, medical and biochemical data) followed by human analysis fall short because literally thousands if not millions of data points are created. The intuitive ability of the human mind to analyze such data and draw logical, rational and appropriate conclusions has been emulated by computer-assisted analytical techniques including, e.g., the creation and analysis of logical rules for such data.
With respect to biological data in particular, vast amounts are created virtually daily. Within the class of biological data lies a subclass of genetic data. With respect to genetic data, the Human Genome Project and its progeny have created a simply unmanageable quantity of potentially relevant information relating to gene sequencing and expression. One of the major goals of molecular biology is to study such data and determine how different genes regulate one another. Thus, a major research effort has been targeted towards understanding and discovering gene regulation patterns. Likewise, huge amounts of medical data are created by laboratory and other analyses of medically significant biochemical moieties and their variations. The coding of protein interactions (and their DNA/RNA interfaces for synthesis) in the field of proteomics also results in significant data creation. Not all the data is relevant, yet some of the data that might appear at first blush to be marginal, when combined with other data points, can reveal logical rules with appropriate statistical reliability, thereby enhancing the ability to modulate the experimental protocols employed or the conclusions determined.
One of the main techniques used by biologists for the creation of data concerning genetic expression is the oligonucleotide microarray method, which has reached popularity in the last few years. This technique permits biologists to produce large quantities of gene microarray data points that profile gene expressions under different conditions, at different times during development or in the presence of different factors that include, without limitation, drugs, environmental conditions, biochemical compounds, and the like. Typically, biologists generate a set of tests applying this method to a biological sample, where a single test would contain information on the expression levels of genes in the sample, and the number of tests would result in a range of measurements from a few dozen to a few hundred.
Gene regulation may be understood by measuring the amounts of different gene products produced by a cell. This production process, called gene expression, creates as a product a form of RNA. The oligonucleotide microarray method is a standard method employed to measure amounts of this form of RNA, in which this form of RNA is hybridized to an oligonucleotide microarray that allows the measurement of expression levels of up to tens of thousands of genes in a single experiment. From the computational point of view, the expression level is represented as an arbitrary real number. Therefore, the result of a single experiment is an array of "N" real numbers, where "N" remains the same across different experiments and depends upon the genes sought to be measured by the experimenter.
In order to discover how different genes regulate one another, biologists typically conduct multiple experiments to determine the manner in which different gene expressions change depending upon the type of tissue, age of the organism, therapeutic agents, and environmental conditions. Moreover, biologists are more interested in the method by which gene expressions vary in these experiments relative to normal expression levels in an organism, rather than absolute values of gene expression.
Accordingly, the manner in which patterns of genetic output change across different samples reflects underlying biological processes in the organism whose genes are being studied. It is of crucial importance for biologists to understand these biological processes, and a major research effort has been launched towards the discovery and biological interpretation of gene regulation patterns. As a result, millions of data points have been generated.
Typical data analysis techniques for handling vast amounts of oligonucleotide microarray data are based mainly on manual selection, querying and clustering techniques. Manual selection of patterns is usually performed by a direct "eyeballing" of the data by a person with some amount of experience or specialized expertise. This traditional approach becomes virtually impossible when the database grows too large.
Database-querying techniques include SQL querying methods, and permit the data analyst to apply pertinent queries to the data and receive responsive information. While such techniques are effective in instances when the analyst is cognizant of the attributes of the data and thus can determine the queries, when the data is vast in size and the queries are less obvious, these techniques prove to be ineffective.
Clustering methods are shown in, for example, Eisen, et al., "Cluster Analysis and Display of Genome-wide Expression Patterns," Proc. Nat'l. Acad. Sci. USA, 95(25):14863-8, 1998, and also include self-organizing maps as shown in Tamayo, et al., "Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation," Proc. Nat'l. Acad. Sci. USA, Vol 96, pp. 2907-2912, March 1999. Such methods group genes into clusters that exhibit "similar" types of behavior in the experiments. These clustering methods allow biologists to design experiments helping them to understand further the relationships among the underlying data points, and hence the genetic expressions shown by those data points. However, such traditional clustering methods fail to provide deep insights into specific relationships among genes and biological processes in the cell because the clusters are, by definition, broad categories.
Support vector machines ("SVMs") have been employed to overcome the problems associated with the querying, clustering and self-organizing map approaches, as shown in Brown, et al., "Knowledge-Based Analysis of Microarray Gene Expression Data by Using Support Vector Machines," in PNAS, vol. 97, Issue 1, pages 262-267, Jan. 4, 2000. In particular, the SVM method described in Brown, et al. builds a gene classifier based on some training data by using SVM methods that draw hyperplanes that separate different classes of data (e.g., positives from negatives). Then these classifiers are used to identify unknown functions of genes. SVM methods seek to solve an important but very specific problem of identifying functions of genes based upon predetermined classifications of functions using supervised machine learning methods. There is, however, much more to the analysis of biological and genomic data than just the identification of gene functions.
It is thus an object of the present invention to overcome the shortcomings of the prior art and provide a mathematical system and method that employs a computer for processing large amounts of biological, medical and/or biochemical data through the process of discretizing this data, generating logical rules from such discretized data, and processing of those rules by the system in a manner that presents rules of greatest relevance.
The various features of novelty which characterize the invention are pointed out with particularity in the claims annexed to and forming a part of the disclosure. For a better understanding of the invention, its operating advantages, and specific objects attained by its use, reference should be had to the drawings and descriptive matter in which there are illustrated and described preferred embodiments of the invention.
The present invention relates to a method and system for the analysis of biological, medical and/or biochemical "raw" data, and more particularly to the creation of discrete values for each of such data points based upon a plurality of attributes, followed by creation of rules and subsequent processing of these rules. The steps involved in the instant invention include receiving the raw data and entering such data into a storage arrangement including, e.g., a relational database, in accordance with the values and attributes of the data; determining discrete bins for at least one of the attributes for the data in accordance with the values of the data entered in the relational database for each of the attributes; discretizing the data in the database by replacing each of the data points in the database with a number reflecting the appropriate bin in which each such point is found; mining the discretized data in the relational database to determine sets of applicable logical rules including, e.g., association rules; processing the logical rules at least once; and presenting the processed logical rules, with or without the discretized and raw data. The number of bins is a small number "n," in which 1 < n ≤ 100, preferably n ≤ 10, and more preferably n = 3.
The discretization process converts numeric values, e.g., gene expression data from a particular experiment, into a discrete value corresponding to a specific bin. The number of such bins and the boundaries between them can be determined using methods known to someone of ordinary skill in the art, or determined mathematically by establishing the full range of data values and parsing boundaries between them using known methods. Once the appropriate bin is determined, the instant method and system will place a tag upon each data point indicative of the bin in which that data point is to be placed.
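By way of illustration only, the boundary determination and tagging described above might be sketched as follows. This is a minimal sketch that assumes equal-width bins over the full range of observed values; the text leaves the choice of boundary method open, and the function names `bin_boundaries` and `tag_with_bins` and the sample values are hypothetical.

```python
def bin_boundaries(values, n_bins):
    """Return the n_bins - 1 interior boundaries of equal-width bins.

    Assumption: equal-width binning over [min, max]; other known
    boundary-selection methods could be substituted.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def tag_with_bins(values, boundaries):
    """Tag each value with the 0-based index of the bin it falls in."""
    tags = []
    for v in values:
        # The bin index equals the number of boundaries at or below the value.
        tags.append(sum(1 for b in boundaries if v >= b))
    return tags


# Hypothetical raw expression levels for one attribute.
levels = [0.0, 1.0, 2.5, 4.0, 6.0, 9.0]
boundaries = bin_boundaries(levels, 3)
tags = tag_with_bins(levels, boundaries)
print(boundaries)  # [3.0, 6.0]
print(tags)        # [0, 0, 0, 1, 2, 2]
```

Each tag plays the role of the number that replaces the raw data point in the database.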
As a result of the discretization step, boundaries between data points are determined and each of the raw values of data are then converted into discrete values. For the particular preferred embodiment of expression levels of individual genes in a single experiment, these values are preferably represented with three different states: unchanged (denoted as #), upregulated (denoted as ↑), and downregulated (denoted as ↓). Therefore, each experiment is re-represented with a vector of N genes taking one of these three values (#, ↑, ↓). Each entry in a relational database has, for each attribute, preferably one of these three values. Assuming that there are M experiments (or attributes) and N genes, the relational database can be represented by an M×N matrix of n values. After creating this M×N matrix, the inventive process next generates logical rules for gene expression.
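The three-state representation can be illustrated with a short sketch. It assumes, purely for illustration, that each raw value is a log2 ratio of measured expression to a normal baseline and that a symmetric threshold of 1.0 (a twofold change) separates the states; neither the threshold nor the names `discretize_level` and `discretize_matrix` come from the text.

```python
UNCHANGED, UP, DOWN = "#", "↑", "↓"


def discretize_level(log_ratio, threshold=1.0):
    """Map one relative expression value to one of the three states.

    Assumption: log_ratio is log2(measured / baseline), and the
    threshold of 1.0 is an illustrative cutoff, not a prescribed one.
    """
    if log_ratio >= threshold:
        return UP
    if log_ratio <= -threshold:
        return DOWN
    return UNCHANGED


def discretize_matrix(raw):
    """Discretize an M x N matrix (M experiments as rows, N genes as columns)."""
    return [[discretize_level(x) for x in row] for row in raw]


# Hypothetical data: M = 2 experiments, N = 3 genes.
raw = [
    [0.2, 1.5, -2.0],
    [1.1, -0.3, -1.0],
]
discrete = discretize_matrix(raw)
for row in discrete:
    print(" ".join(row))
```

Each row of `discrete` is the vector of N gene states for one experiment, so the nested list corresponds to the M×N matrix described above.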
Although the invention preferably possesses three discrete bins for discretization of the genetic attributes (upregulated, downregulated, and unchanged), it is not limited to this specific case. In fact, the system and method can accommodate any other number of gene expression states or scientific attributes. For example, genes can have binary states (yes/no), or arbitrary n-valued states (e.g. n=4, 5, etc.).
Once the data is discretized, the inventive system and method then "mines" the data to determine logical rules for that data, including degrees of reliability of the logical rules based upon the underlying data. Data mining techniques are known in the art, including those stated in Agrawal, et al., "Fast Discovery of Association Rules," in Fayyad, U. M., et al., Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996, Chapter 12) (hereinafter "Agrawal, et al."), which are especially useful to the instant invention.
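To make the notion of a logical rule with a degree of reliability concrete, the following sketch enumerates simple one-to-one association rules and scores them by support and confidence, the standard measures used in association rule mining. It is a brute-force illustration, not the algorithm of Agrawal, et al.; the representation of each experiment as a set of (gene, state) pairs, the function name `mine_pair_rules`, and the thresholds are all assumptions.

```python
from itertools import permutations


def mine_pair_rules(transactions, min_support, min_confidence):
    """Find rules 'a => b' between single (gene, state) items.

    support    = fraction of transactions containing both a and b
    confidence = fraction of transactions containing a that also contain b
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    rules = []
    for a, b in permutations(items, 2):
        if a[0] == b[0]:
            continue  # skip rules relating a gene to itself
        count_a = sum(1 for t in transactions if a in t)
        count_ab = sum(1 for t in transactions if a in t and b in t)
        support = count_ab / n
        confidence = count_ab / count_a if count_a else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical discretized data: four experiments over two genes.
transactions = [
    {("G1", "↑"), ("G2", "↓")},
    {("G1", "↑"), ("G2", "↓")},
    {("G1", "↑"), ("G2", "#")},
    {("G1", "#"), ("G2", "#")},
]
rules = mine_pair_rules(transactions, min_support=0.5, min_confidence=0.9)
for a, b, s, c in rules:
    print(f"{a} => {b}  support={s:.2f} confidence={c:.2f}")
```

On this toy data only the rule "G2 downregulated implies G1 upregulated" passes both thresholds. A practical miner would also consider multi-item antecedents and prune the search space, which this sketch does not attempt.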
The data mining step creates a plethora of logical rules especially for genomic data where N is very large. Where the underlying biological, medical or biochemical data is voluminous, the expectation is the creation of a smaller, but still quite large, subset of logical rules. Hence, it is important to sift through this large mass of logical rules to enable the presentation of those that are truly of interest. Accordingly, the instant invention provides for iterative processing of the created rules by way of a plurality of different operators, described in greater detail below.
The inventive operators include filtering operators (MATCH(S,T), MISMATCH(S,T), and CONTRADICT(S,T)), in which S is a set of logical rules and T is a template. Also presented are clustering operators. Also included is a set of data mapping operators (TRANS_MATCH(D,T,C) and TRANS_MISMATCH(D,T,C)), in which D is a data set, T is a template, and C is a matching condition "for all" or "for any." Lastly, included is a data characterization operator (DATA-CHAR(D,S,R)), wherein D is a set of transactions, S is a structure, and R is a set of rules.
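As a rough illustration of how template-based filtering of a rule set might behave, the sketch below implements hypothetical versions of MATCH(S,T) and MISMATCH(S,T). The precise semantics of the operators are not given in this passage, so the rule representation, the wildcard-based template format, and the helper names `item_fits` and `rule_fits` are assumptions made only for the example.

```python
def item_fits(item, pattern):
    """Check a (gene, state) item against a pattern; '*' is a wildcard."""
    gene, state = item
    p_gene, p_state = pattern
    return p_gene in ("*", gene) and p_state in ("*", state)


def rule_fits(rule, template):
    """A rule fits when every antecedent item and the consequent fit."""
    antecedent, consequent = rule
    t_antecedent, t_consequent = template
    return (all(item_fits(i, t_antecedent) for i in antecedent)
            and item_fits(consequent, t_consequent))


def match(rules, template):
    """Hypothetical MATCH(S, T): keep the rules in S that fit template T."""
    return [r for r in rules if rule_fits(r, template)]


def mismatch(rules, template):
    """Hypothetical MISMATCH(S, T): keep the rules in S that do not fit T."""
    return [r for r in rules if not rule_fits(r, template)]


# Each rule is (tuple of antecedent items, consequent item); data is made up.
rules = [
    ((("G1", "↑"),), ("G2", "↓")),
    ((("G3", "↓"),), ("G2", "↑")),
    ((("G1", "↑"), ("G4", "#")), ("G5", "↑")),
]
# Template: any antecedent, consequent concerning gene G2 in any state.
template = (("*", "*"), ("G2", "*"))
kept = match(rules, template)
dropped = mismatch(rules, template)
print(len(kept), len(dropped))  # 2 1
```

Under these assumptions the two operators partition the rule set, so applying them in sequence with different templates gives one simple way to narrow a large rule population iteratively.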
It should be appreciated that the system and method described are not limited to the gene expression problem. In fact, the invention is applicable to biological, medical or biochemical data for which the discovery of large numbers of logical rules is a necessary consequence of the analysis of volumes of underlying data. As a result of the invention and its application of the operators to the logical rules produced from the discretized data, a smaller population of rules is produced that are of greater importance and reliability, and the basis for creating this population can be iteratively adapted if other indicia of significance are considered. The invention is particularly useful in applications having relatively few data points/records (e.g., measured in hundreds) and a huge number of attributes/fields/variables (e.g., measured in 10's or even 100's of thousands of attributes). Obviously, the gene expression problem constitutes one such application. Other applications include analysis of biochemical compounds, proteomics (protein interaction) and efficacy of drugs and their analogs.
Other features of the present invention will become apparent from the following detailed description considered in conjunction with the accompanying drawings. It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims.