The study of gene expression brings valuable information to the researcher about cellular function that can be applied directly to drug discovery and development. Devices and computer systems have been developed for collecting information about gene expression or expressed sequence tag (“EST”) expression in large numbers of tissues.
DNA microarrays are glass or nylon chips or substrates containing arrays of DNA samples, or “probes,” which can be used to analyze gene expression. A fluorescently labeled nucleic acid is brought into contact with the microarray and a scanner generates an image file indicating the locations within the microarray at which the labeled nucleic acids are bound. Based on the identity of the probes at these locations, information such as the monomer sequence of DNA or RNA can be extracted. By profiling gene expression, transcriptional changes can be monitored through organ and tissue development, microbiological infection, and tumor formation. The robotic instruments used to spot the DNA samples onto the microarray surface allow thousands of samples to be simultaneously tested. This high-throughput approach increases reproducibility and production.
Microarray technologies enable the generation of vast amounts of gene expression data. Effective use of these technologies requires mechanisms to manage and explore large volumes of primary and derived or analyzed gene expression data. Furthermore, the value of examining the biological meaning of the information is enhanced when set in the context of sample profiles and gene annotation data. The format and interpretation of the data depend strongly on the underlying technology. Hence, exploring gene expression data requires mechanisms for integrating gene expression data across multiple platforms and with sample and gene annotations.
The GeneChip® Probe Array of Affymetrix, Inc. of Santa Clara, Calif. is one example of a widely adopted microarray technology that provides for the high-volume screening of samples for gene expression. Affymetrix® also offers a series of software solutions for data collection, conversion to AADM™ (“Affymetrix® Analysis Data Model”) database format, data mining and a multi-user laboratory information management system (“LIMS”). LIMS is a microarray data management package for users who are generating large quantities of GeneChip® Probe Array data. Data are published to a GATC™ (“Genetic Analysis Technology Consortium”)-standard database that can be searched by GATC-compliant mining tools. The Affymetrix® technology has become one of the standards in the field, and large databases of gene expression data generated using this technology, along with associated information, have been assembled and are publicly available for data mining by pharmaceutical, biotechnology and other researchers and clinicians. However, these researchers often have proprietary gene expression data, also generated using the Affymetrix® technology, and associated data that they may wish to compare with the existing database for validation, or to combine with the database for expanded searching. Further, the researchers may wish to utilize a specific analysis and visualization tool, or to use multiple such tools for comparison. Accordingly, a system is needed for integrating data from multiple sources and providing multiple options for analyzing the results.
As is apparent from the description above, the study of gene expression entails analyzing vast amounts of gene expression data, which can become very difficult to manage. One major problem in managing information that links gene expression data to biological function is the need to seamlessly integrate terabytes of information and to gather information under a wide variety of conditions. Additionally, all of this information must be translated into a common format. The sheer size of the database requires specific hardware and software capabilities that are not usually available to pharmaceutical and academic researchers. The problem of differing formats poses a separate challenge; it is somewhat like developing programs that can simultaneously work with millions of files saved in hundreds of file formats on thousands of Macintosh®, Unix, Linux, Windows®, and other types of computer systems.
Many pharmaceutical researchers and academic researchers need to somehow link gene expression information, clinical data, and the published literature in order to get meaningful leads about the basis of disease and good therapeutic approaches. In some cases, the gene expression information is available commercially, but the information is not immediately useful because it lacks the clinical information from source samples that connects biological functions with expression levels. As a result, researchers must work with a combination of commercial and academic resources and try to pull the results together. Since the information can exist in incompatible formats or be based on experiments conducted under widely varying conditions, the data can be difficult or impossible to manage.
An additional challenge in analyzing the vast amount of data is the time involved in managing and processing the gene expression data. For example, loading gene expression data into databases, along with transforming and validating that data, consumes a considerable amount of time and is typically performed after regular business hours. The need to perform these tasks after hours necessarily results in a period of time when users are unable to perform any study of gene expression data. Another challenge is providing multiple users with access to gene expression data. When an analysis is performed that involves a set of gene expression data, that gene expression data is often inaccessible to other users. Because gene expression data is typically large in size, an analysis or study of gene expression data often ties up system resources in accessing the data, analyzing the data, and then presenting and managing the results of the analysis. A need exists for more efficient ways of managing gene expression data.
Another difficulty in managing the vast amount of data is that much of the data is binary data. Binary data is efficient in terms of storage space and enables faster processing because it carries little byte overhead. On the other hand, binary data is inflexible and is not readily portable to other formats or between Macintosh®, Unix, Linux, Windows® and other computer systems. This inflexibility is of particular concern with gene expression data, since the data may be derived from multiple sources in different formats, such as commercial sources, private sources, and even internally derived proprietary data.
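The trade-off between compact binary storage and portability can be illustrated with a minimal sketch. The intensity values, the 32-bit float layout, and the tab-separated text form below are assumptions chosen for illustration only; they do not represent any actual Affymetrix® or GATC™ file format.

```python
import struct

# Hypothetical expression intensities for three probes (illustrative values only).
intensities = [1523.5, 87.25, 40210.0]

# Binary encoding: 4 bytes per value, with no delimiters or labels.
# The byte order must be agreed on in advance by writer and reader.
big_endian = struct.pack(">3f", *intensities)

# Text encoding: larger, but readable without knowing the writer's byte order.
as_text = "\t".join(repr(v) for v in intensities)

# Reading the same bytes with the correct and with the wrong byte order.
correct = struct.unpack(">3f", big_endian)
misread = struct.unpack("<3f", big_endian)

print(len(big_endian), "bytes as binary;", len(as_text), "characters as text")
print("correct byte order:", correct)
print("wrong byte order:  ", misread)
```

The binary form is smaller, but a reader that assumes the wrong byte order silently recovers different numbers, which is one reason data exchanged between dissimilar systems is often converted to a self-describing or text-based format first.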