Using current DNA microarray technology, researchers and clinicians are able to easily collect large amounts of data to indicate which genes or ESTs (expressed sequence tags) are regulated upwards or downwards during various disease states, following various pharmacological treatments, or following exposure to a variety of toxicological insults. The relevance of this gene expression data is often determined by its relationship to other information within the context of the current analysis. For example, knowing that there is an increased expression of a particular gene or EST during the course of a disease is important information. In addition, there is a need to correlate this data with various types of clinical data, for example, a patient's age, sex, weight, genetic, environmental and behavioral history, stage of clinical development, stage of disease progression, etc. What is needed is a way to correlate the vast amounts of gene and EST expression data that are available from DNA microarrays with the corresponding clinical data from the samples that are tested. As used herein, the term “gene” or “gene expression” will also include “EST” or “EST expression”, unless otherwise indicated.
Researchers wish to answer questions such as: 1) which genes are expressed in cells of a malignant tumor but not expressed in either healthy tissue or tissue treated according to a particular regime; 2) which genes or ESTs are expressed in particular organs but not in others; and 3) how is a gene-of-interest regulated across a comprehensive panel of diseases and therapeutic areas with respect to human systems biology? A better understanding of the complex network of gene interactions can lead to discovery of novel therapeutic mechanisms and identify and prioritize potential drug targets with respect to specific therapeutic and large-scale biological considerations.
Traditional sample-based analysis methods for gene expression data involve manual curation of sample sets. Investigators begin their analysis with a specific goal (e.g., “today I will investigate Alzheimer's disease”) in mind and build the sample sets accordingly. This method biases the resulting analyses towards the initial goal of the investigator and leaves potentially interesting patterns undiscovered because the investigator did not have time to manually exhaust all potential analysis routes through the available data. To provide an example, discovering a gene regulated in Alzheimer's disease is interesting; but finding a gene regulated across all known degenerative neural diseases is potentially far more useful.
Commercially available databases have been created containing massive amounts of annotated gene expression data derived from clinically important tissues, both normal and diseased, which can be searched for identifying relationships between specific genes and resulting proteins, e.g., those that are involved in a disease pathway. These databases are managed to enhance accuracy and reliability and to ensure that the data are regularly updated to include the latest developments in the field. Internet links provide easy access to related data contained in public databases. Graphics tools can be linked to enable visualization of the results. (See, e.g., International Publication No. WO 02/071059, published 9 Sep. 2002, assigned to the present assignee. The disclosure of this publication is incorporated herein by reference in its entirety.) Exemplary of such commercial databases are the GeneExpress® line of products available from Gene Logic Inc., Gaithersburg, Md., which utilizes the Affymetrix GeneChip® microarray data and its sample identification standards. Typically, such databases are used for evaluation of a researcher's manually curated sample and gene sets, e.g., manually selected sets of samples from specific microarray sets, tissue types, pathology/morphology, etc. Often, the researcher will use the curated sample (or gene) set to compare his or her own laboratory results from tests performed using GeneChip® microarrays with gene expression data and corresponding sample information extracted from the existing database. Such analysis generally requires a fairly high level of sophistication in dealing with microarray-based data analysis, since individual samples and/or genes must be selected for inclusion in the sample sets. If the search queries are not carefully tailored, the search may become mired in the huge volume of data that must be searched.
Once the search is completed, additional downstream data synthesis is usually required for interpretation of the search results.
While customization of the search strategy at the sample or gene level can be a valuable discovery tool, many researchers may be interested in investigating higher level relationships between disease pathology and genomics without the need for manually curating their own sample sets, interpreting the gene expression data or generating custom gene lists. This latter approach can be referred to as an “in silico experiment”, where the data to be mined is pre-existing within the computer database and the “experiment” consists of selecting and/or making different combinations of data from the database based upon pathological and biological considerations, without inputting specific information about individual samples. Such an approach serves as a useful reference tool; however, without careful data management and organization, the researcher may run into problems such as slow computational response time and an inability to recognize global gene expression patterns, owing to the large volume of data to be searched and presented.
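By way of illustration only, an “in silico experiment” of the kind described above might be sketched as follows. The sketch groups pre-existing expression records by their pathology annotation, rather than by individually selected samples, and compares each disease group against normal tissue. The record layout, the gene names, the expression values and the function names are all hypothetical and are assumed solely for this example; they do not describe the data model of any particular commercial database.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical pre-existing records (illustrative values only):
# (sample_id, tissue, pathology, gene, expression)
RECORDS = [
    ("s1", "brain", "Alzheimer's disease", "GENE_A", 8.0),
    ("s2", "brain", "Alzheimer's disease", "GENE_A", 10.0),
    ("s3", "brain", "normal", "GENE_A", 4.0),
    ("s4", "brain", "normal", "GENE_A", 5.0),
    ("s5", "brain", "Parkinson's disease", "GENE_A", 9.0),
    ("s1", "brain", "Alzheimer's disease", "GENE_B", 3.0),
    ("s3", "brain", "normal", "GENE_B", 3.0),
]


def mean_expression_by_group(records, gene, tissue):
    """Average expression of one gene in one tissue, grouped by pathology.

    Samples are never named by the caller; they are selected only through
    their tissue and pathology annotations.
    """
    groups = defaultdict(list)
    for _sample_id, rec_tissue, pathology, rec_gene, value in records:
        if rec_gene == gene and rec_tissue == tissue:
            groups[pathology].append(value)
    return {pathology: mean(values) for pathology, values in groups.items()}


def fold_change_vs_normal(records, gene, tissue, baseline="normal"):
    """Ratio of each pathology group's mean expression to the baseline mean.

    Raises KeyError if no baseline samples exist for the gene and tissue.
    """
    means = mean_expression_by_group(records, gene, tissue)
    base = means[baseline]
    return {p: m / base for p, m in means.items() if p != baseline}


if __name__ == "__main__":
    print(fold_change_vs_normal(RECORDS, "GENE_A", "brain"))
```

In this toy data set the hypothetical GENE_A shows the same fold change in both neurological disease groups, which is the kind of cross-disease pattern that a manually curated, single-disease sample set could miss. A linear scan such as this one would not scale to a large database, which is the data organization problem addressed in the following paragraph.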
Accordingly, it would be desirable to provide a reference tool in which the database is organized to facilitate rapid query searching and subsequent data presentation. It is to such a system and method that the present invention is directed.