The study of gene expression gives the researcher valuable information about cellular function that can be applied directly to drug discovery and development. Devices, computer systems, and methods have been developed for collecting information about gene expression or expressed sequence tag (EST) expression in large numbers of tissues.
DNA microarrays are glass or nylon chips or substrates containing arrays of DNA samples, or “probes”, which can be used to analyze gene expression. A fluorescently labeled nucleic acid is brought into contact with the microarray and a scanner generates an image file indicating the locations within the microarray at which the labeled nucleic acids are bound. Based on the identity of the probes at these locations, information such as the monomer sequence of DNA or RNA can be extracted. By profiling gene expression, transcriptional changes can be monitored through organ and tissue development, microbiological infection, and tumor formation. The robotic instruments used to spot the DNA samples onto the microarray surface allow thousands of samples to be simultaneously tested. This high-throughput approach increases reproducibility and production.
Microarray technologies enable the generation of vast amounts of gene expression data. Effective use of these technologies requires mechanisms to manage and explore large volumes of primary and derived (analyzed) gene expression data. Furthermore, the value of examining the biological meaning of the information is enhanced when set in the context of sample profiles and gene annotation data. The format and interpretation of the data depend strongly on the underlying technology. Hence, exploring gene expression data requires mechanisms for integrating gene expression data across multiple platforms and with sample and gene annotations.
The GeneChip® probe array of Affymetrix, Inc. (Santa Clara, Calif.) is one example of a widely adopted microarray technology that provides for the high-volume screening of samples for gene expression. Affymetrix also offers a series of software solutions for data collection, conversion to AADM™ (“Affymetrix Analysis Data Model”) database format, data mining, and a multi-user laboratory information management system (“LIMS”). LIMS is a microarray data management package for users who are generating large quantities of GeneChip® probe array data. Data are published to an AADM-standard database that can be searched by mining tools that are AADM-compliant. The Affymetrix technology has become one of the standards in the field, and large databases of gene expression data generated using this technology, along with associated information, have been assembled and are publicly available for data mining by pharmaceutical, biotechnology, and other researchers and clinicians. Such researchers may wish to use a specific analysis and visualization tool, or to use multiple such tools, for efficient identification and comparison of gene expression data.
For example, toxicology experiments may involve administration of a toxin to a set of laboratory animals and then sampling the animals at various time intervals following introduction of the toxin. There may be groups of animals that are sampled at three, six, and twenty-four hours after toxin administration, as well as some untreated animals. A researcher may also have previously observed an indication that a gene is involved in a toxic response and also that its expression level increases or decreases in a certain pattern at the time intervals sampled. In order to find other genes that may be involved in the same toxic response, the researcher may wish to identify other genes that demonstrate that same pattern of expression across these groups of samples corresponding to these time intervals.
It is conventional to run three to ten or more replicate experiments for each set of experimental conditions and/or time points. For example, seven animals may be sampled at each of three hours, six hours, and twenty-four hours after toxin administration. It is desirable to utilize the information derived from these replicate experiments to find genes that consistently have a specified pattern of expression over a given set of experimental conditions and sampling times; that is, genes whose expression level varies very little within a group of samples that are replicates, but whose expression level varies greatly between samples corresponding to different experimental conditions or time points, all according to the pattern specified by the user. Such analytical tools have not been previously available.
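The selection criterion described above (small variation within a replicate group, large variation between groups, following a user-specified pattern) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not drawn from any existing tool: the function names `matches_pattern` and `find_genes`, the encoding of the pattern as +1 (increase) or -1 (decrease) between consecutive groups, and the `min_ratio` threshold comparing the change in group means with the largest within-group standard deviation.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def matches_pattern(groups: Sequence[Sequence[float]],
                    pattern: Sequence[int],
                    min_ratio: float = 2.0) -> bool:
    """Check one gene against a user-specified pattern (illustrative only).

    groups  -- replicate expression values, one inner sequence per
               condition or time point, in order
    pattern -- +1 (increase) or -1 (decrease) for each consecutive
               pair of groups (hypothetical encoding)
    min_ratio -- assumed threshold: each change in group mean must be at
               least min_ratio times the largest within-group spread
    """
    if len(pattern) != len(groups) - 1:
        raise ValueError("pattern needs one entry per consecutive pair of groups")
    means = [mean(g) for g in groups]
    # Largest within-group standard deviation across all replicate groups.
    spread = max(pstdev(g) for g in groups)
    for i, direction in enumerate(pattern):
        step = means[i + 1] - means[i]
        if step * direction <= 0:           # change is in the wrong direction
            return False
        if abs(step) < min_ratio * spread:  # change is small relative to replicate noise
            return False
    return True


def find_genes(data: Dict[str, Sequence[Sequence[float]]],
               pattern: Sequence[int],
               min_ratio: float = 2.0) -> List[str]:
    """Return the sorted identifiers of genes that follow the pattern."""
    return sorted(g for g, groups in data.items()
                  if matches_pattern(groups, pattern, min_ratio))
```

For instance, with groups ordered as three, six, and twenty-four hours, calling `find_genes(data, [+1, -1])` would return genes whose expression rises between the first two time points and falls between the last two while remaining consistent across replicates. A real analysis would likely use a formal statistical test rather than this simple ratio.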
In another example, there have been past attempts to develop efficient methods for analyzing gene expression data, such as clustering, of which there are many subtypes that are well known to those skilled in the art. One method, referred to as gene signature differential, identifies genes which have detectable expression in at least a minimum fraction of samples in one sample set and do not have detectable expression in at least another fraction of samples in a second sample set. For example, samples in the first sample set may typically be drawn from diseased or toxin-treated tissues, while samples in the second sample set may be drawn from normal or untreated tissue.
Gene signature differential is based on another analysis method known as gene signature. Gene signature analysis identifies two sets of genes, given a set of samples and two user-specified threshold fractions P and A. The two gene sets generated by the gene signature are referred to as the present set and the absent set. The present set includes genes that are called present (having detectable expression) in at least a fraction P of the samples in the sample set; similarly, the absent set includes genes that are called absent (not having detectable expression) in at least a fraction A of the samples. The conventional method of performing a gene signature differential is to run two gene signatures, one for each sample set, and then compute set intersections of the resulting gene sets. First, the present set from gene signature number one is intersected with the absent set from gene signature number two. Next, the absent set from gene signature number one is intersected with the present set from gene signature number two. Accordingly, this method requires the preliminary step of computing the present gene set and the absent gene set for each of the two sample sets, and such a method is inefficient.
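The conventional two-step procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular product: the input format (each sample as a dictionary mapping a gene identifier to a detection call of "P" or "A") and the function names `gene_signature` and `gene_signature_differential` are hypothetical, and the same thresholds are applied to both sample sets for simplicity.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical input format: each sample maps a gene identifier to a
# detection call, "P" (present) or "A" (absent).
Sample = Dict[str, str]


def gene_signature(samples: List[Sample], p: float, a: float
                   ) -> Tuple[Set[str], Set[str]]:
    """Return (present_set, absent_set) for one sample set.

    p and a are the threshold fractions P and A, each in [0, 1].
    """
    n = len(samples)
    genes: Set[str] = set().union(*(s.keys() for s in samples))
    present: Set[str] = set()
    absent: Set[str] = set()
    for gene in genes:
        n_present = sum(1 for s in samples if s.get(gene) == "P")
        n_absent = sum(1 for s in samples if s.get(gene) == "A")
        if n_present >= p * n:   # called present in at least fraction P
            present.add(gene)
        if n_absent >= a * n:    # called absent in at least fraction A
            absent.add(gene)
    return present, absent


def gene_signature_differential(set1: List[Sample], set2: List[Sample],
                                p: float, a: float
                                ) -> Tuple[Set[str], Set[str]]:
    """Conventional approach: run two full gene signatures, then intersect."""
    present1, absent1 = gene_signature(set1, p, a)
    present2, absent2 = gene_signature(set2, p, a)
    # Genes present in set 1 but absent in set 2, and the reverse.
    return present1 & absent2, absent1 & present2
```

Note that both full signatures are computed for every gene before any intersection is taken, which is the preliminary work the passage above identifies as inefficient.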
Accordingly, there is a need for methods and systems for the efficient comparison, identification, and processing of gene expression data.