Historically, the discovery and development of new drugs has been an expensive, time-consuming and inefficient process. Bringing a single drug to market is estimated to require approximately 8 to 12 years and an investment of approximately $350 to $500 million, so the pharmaceutical research and development market is in need of new technologies that can streamline the drug discovery process. Companies in this market are under fierce pressure to shorten research and development cycles for new drugs, while at the same time novel drug discovery screening instrumentation technologies are being deployed that produce huge amounts of experimental data.
Innovations in automated screening systems for biological and other research are capable of generating enormous amounts of data. The massive volumes of feature-rich data being generated by these systems, and the effective management and use of information from the data, have created a number of very challenging problems. As is known in the art, “feature-rich” data includes data wherein one or more individual features of an object of interest (e.g., a cell) can be collected. To fully exploit the potential of data from high-volume data generating screening instrumentation, there is a need for new informatic and bioinformatic tools.
Identification, selection, validation and screening of new drug compounds is often completed at a nucleotide level using sequences of Deoxyribonucleic Acid (“DNA”), Ribonucleic Acid (“RNA”) or other nucleotides. “Genes” are regions of DNA, and “proteins” are the products of genes. The existence and concentration of protein molecules typically help determine if a gene is “expressed” or “repressed” in a given situation. Responses of genes to natural and artificial compounds are typically used to improve existing drugs, and develop new drugs. However, it is often more appropriate to determine the effect of a new compound on a cellular level instead of a nucleotide level.
Cells are the basic units of life and integrate information from DNA, RNA, proteins, metabolites, ions and other cellular components. New compounds that may look promising at a nucleotide level may be toxic at a cellular level. Fluorescence-based reagents can be applied to cells to determine ion concentrations, membrane potentials, enzyme activities, gene expression, as well as the presence of metabolites, proteins, lipids, carbohydrates, and other cellular components.
There are two types of cell screening methods that are typically used: (1) fixed cell screening; and (2) live cell screening. For fixed cell screening, initially living cells are treated with experimental compounds being tested. No environmental control of the cells is provided after application of a desired compound and the cells may die during screening. Live cell screening requires environmental control of the cells (e.g., temperature, humidity, gases, etc.) after application of a desired compound, and the cells are kept alive during screening. Fixed cell assays allow spatial measurements to be obtained, but only at one point in time. Live cell assays allow both spatial and temporal measurements to be obtained.
The spatial and temporal frequency of chemical and molecular information present within cells makes it possible to extract feature-rich cell information from populations of cells. For example, multiple molecular and biochemical interactions, cell kinetics, changes in sub-cellular distributions, changes in cellular morphology, changes in individual cell subtypes in mixed populations, changes in sub-cellular molecular activity, changes in cell communication, and other types of cell information can be obtained.
The types of biochemical and molecular cell-based assays now accessible through fluorescence-based reagents are expanding rapidly. The need for automatically extracting additional information from a growing list of cell-based assays has led to the development of automated platforms for feature-rich assay screening of cells. For example, the ArrayScan System by Cellomics, Inc. of Pittsburgh, Pa., is one such feature-rich cell screening system. Cell-based systems such as FLIPR, by Molecular Devices, Inc. of Sunnyvale, Calif., FMAT, of PE Biosystems of Foster City, Calif., ViewLux by EG&G Wallac, now a subsidiary of Perkin-Elmer Life Sciences of Gaithersburg, Md., and others also generate large amounts of data and photographic images that would benefit from efficient data management solutions. Photographic images are typically collected using a digital camera. A single photographic image may take up as much as 512 Kilobytes (“KB”) or more of storage space, as is explained below. Collecting and storing a large number of photographic images adds to the data problems encountered when using high throughput systems. For more information on fluorescence-based systems, see “Bright ideas for high-throughput screening—One-step fluorescence HTS assays are getting faster, cheaper, smaller and more sensitive,” by Randy Wedin, Modern Drug Discovery, Vol. 2(3), pp. 61-71, May/June 1999.
Such automated feature-rich cell screening systems and other systems known in the art typically include microplate scanning hardware, fluorescence excitation of cells, fluorescence captive emission optics, a photographic microscope with a camera, data collection, data storage and data display capabilities. For more information on feature-rich cell screening see “High content fluorescence-based screening,” by Kenneth A. Giuliano, et al., Journal of Biomolecular Screening, Vol. 2, No. 4, pp. 249-259, Winter 1997, ISSN 1087-0571, “PTH receptor internalization,” Bruce R. Conway, et al., Journal of Biomolecular Screening, Vol. 4, No. 2, pp. 75-68, April 1999, ISSN 1087-0571, “Fluorescent-protein biosensors: new tools for drug discovery,” Kenneth A. Giuliano and D. Lansing Taylor, Trends in Biotechnology, (“TIBTECH”), Vol. 16, No. 3, pp. 99-146, March 1998, ISSN 0167-7799, all of which are incorporated by reference.
An automated feature-rich cell screening system typically scans a microplate with multiple wells automatically and acquires multi-color fluorescence data of cells at one or more instances of time at a pre-determined spatial resolution. Automated feature-rich cell screening systems typically support multiple channels of fluorescence to collect multi-color fluorescence data at different wavelengths and may also provide the ability to collect cell feature information on a cell-by-cell basis, including such features as the size and shape of cells and sub-cellular measurements of organelles within a cell.
The collection of data from high throughput screening systems typically produces a very large quantity of data and presents a number of bioinformatics problems. As is known in the art, “bioinformatic” techniques are used to address problems related to the collection, processing, storage, retrieval and analysis of biological information including cellular information. Bioinformatics is defined as the systematic development and application of information technologies and data processing techniques for collecting, analyzing and displaying data obtained by experiments, modeling, database searching, and instrumentation to make observations about biological processes. The need for efficient data management is not limited to feature-rich cell screening systems or to cell based arrays. Virtually any instrument that runs High Throughput Screening (“HTS”) assays also generates large amounts of data. For example, with the growing use of other data collection techniques such as DNA arrays, bio-chips, microscopy, micro-arrays and gel analysis, the amount of data collected, including photographic image data, is also growing exponentially. As is known in the art, a “bio-chip” is a stratum with hundreds or thousands of absorbent micro-gels fixed to its surface. A single bio-chip may contain 10,000 or more micro-gels. When performing an assay test, each micro-gel on a bio-chip is like a micro-test tube or a well in a microplate. A bio-chip provides a medium for analyzing known and unknown biological samples (e.g., nucleotides, cells, etc.) in an automated, high-throughput screening system.
Although a wide variety of data collection techniques can be used, cell-based high throughput screening systems are used as an example to illustrate some of the associated data management problems encountered by virtually all high throughput screening systems. One problem with collecting feature-rich cell data is that a microplate used for feature-rich screening typically includes 96 to 1536 individual wells. As is known in the art, a “microplate” is a flat, shallow dish that stores multiple samples for analysis. A “well” is a small area in a microplate used to contain an individual sample for analysis. Each well may be divided into multiple fields. A “field” is a sub-region of a well that represents a field of vision (i.e., a zoom level) for a photographic microscope. Each well is typically divided into one to sixteen fields. Each field typically will have between one and six photographic images taken of it, each using a different light filter to capture a different wavelength of light for a different fluorescence response for desired cell components. In each field, a pre-determined number of cells is selected for analysis. The number of cells will vary (e.g., between ten and one hundred). For each cell, multiple cell features are collected. The cell features may include features such as size, shape, etc. of a cell. Thus, a very large amount of data is typically collected for just one well on a single microplate.
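The nesting described above (microplate, well, field, images and cells, cell features) can be sketched as a small data model. The following Python sketch is illustrative only: the class names, the `build_plate` helper, and the channel and feature names are hypothetical and are not taken from any particular screening system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cell:
    # Feature name -> measured value (e.g., size, shape).
    features: Dict[str, float] = field(default_factory=dict)


@dataclass
class FieldOfView:
    # One photographic image per light filter (channel) applied to this field.
    image_channels: List[str] = field(default_factory=list)
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Well:
    fields: List[FieldOfView] = field(default_factory=list)


@dataclass
class Microplate:
    wells: List[Well] = field(default_factory=list)

    def image_count(self) -> int:
        # Total images = sum over all wells and fields of images per field.
        return sum(len(f.image_channels) for w in self.wells for f in w.fields)

    def feature_value_count(self) -> int:
        # Total feature values = sum over all cells of features per cell.
        return sum(
            len(c.features)
            for w in self.wells
            for f in w.fields
            for c in f.cells
        )


def build_plate(n_wells, n_fields, channels, n_cells, feature_names):
    """Build a plate with identical wells; all feature values are placeholders."""
    return Microplate(wells=[
        Well(fields=[
            FieldOfView(
                image_channels=list(channels),
                cells=[
                    Cell(features={name: 0.0 for name in feature_names})
                    for _ in range(n_cells)
                ],
            )
            for _ in range(n_fields)
        ])
        for _ in range(n_wells)
    ])


# Hypothetical example: 96 wells, 4 fields per well, 3 channels, 10 cells per field.
plate = build_plate(96, 4, ["ch1", "ch2", "ch3"], 10, ["size", "shape"])
print(plate.image_count(), plate.feature_value_count())
```

Even with only ten cells per field and two features per cell, this small example holds thousands of values for a single plate, which illustrates why the per-plate volume grows so quickly.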
From a data volume perspective, the data to be saved for a microplate can be estimated from the number of cell feature records collected and the number of images collected. The number of images collected can typically be estimated by: (number of wells×number of fields×images per field). The current size of an image file is approximately 512 Kilobytes (“KB”) of uncompressed data. As is known in the art, a byte is 8 bits of data. The number of cell feature records can typically be estimated by: (number of wells×number of fields×cells per field×features per cell). Data collected from multiple wells on a microplate is typically formatted and stored on a computer system. The collected data is stored in a format that can be used by visual presentation software and that allows for data mining and archiving using bioinformatic techniques.
For example, in a typical scenario, scanning one low density microplate with 96 wells, using four fields per well, three images per field and an image size of 512 KB per image, generates about 1,152 images and about 576 megabytes (“MB”) of image data (i.e., 96×4×3=1,152 images; 1,152 images×512 KB per image=589,824 KB; and 589,824 KB÷1,024 KB per MB=576 MB). As is known in the art, a megabyte is 2^20 or 1,048,576 bytes and is commonly interpreted as “one million bytes.”
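The image-volume arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function and parameter names are hypothetical, and the default 512 KB image size is the estimate given in the text rather than a measured value.

```python
KB = 1024          # bytes per kilobyte
MB = 1024 * KB     # bytes per megabyte (2**20)


def image_count(wells: int, fields_per_well: int, images_per_field: int) -> int:
    """Number of images = wells x fields per well x images per field."""
    return wells * fields_per_well * images_per_field


def image_data_mb(wells: int, fields_per_well: int, images_per_field: int,
                  image_size_kb: int = 512) -> float:
    """Uncompressed image data for one plate, in megabytes."""
    total_bytes = image_count(wells, fields_per_well, images_per_field) * image_size_kb * KB
    return total_bytes / MB


# Low-density scenario from the text: 96 wells, 4 fields per well, 3 images per field.
print(image_count(96, 4, 3))     # 1152
print(image_data_mb(96, 4, 3))   # 576.0
```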
If one hundred cells per field are selected with ten features per cell calculated, such a scan also generates (96×4×100×10)=384,000 cell feature records, whose data size varies with the number of cell features collected. This results in about 12,000 MB of data being generated per day and about 60,000 MB per week, scanning the 96 well microplates twenty hours a day, five days a week.
In a high data volume scenario based on a current generation of feature-rich cell screening systems, scanning one high-density microplate with 384 wells, using sixteen fields per well, four images per field, 100 cells per field, ten features per cell, and 512 KB per image, generates about 24,576 images or about 12,288 MB of image data and about 6,144,000 cell feature records. This results in about 14,400 MB of data being generated per day and about 100,800 MB per week, scanning the 384 well microplates twenty-four hours a day, seven days a week.
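The per-plate figures for the high-density scenario follow from the same two estimates (images, and cell feature records). The sketch below bundles the scenario parameters into one object; `ScanScenario` and its attribute names are hypothetical, and only image data is converted to megabytes because the text does not give a size for a cell feature record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanScenario:
    wells: int
    fields_per_well: int
    images_per_field: int
    cells_per_field: int
    features_per_cell: int
    image_size_kb: int = 512  # per-image estimate given in the text

    @property
    def images(self) -> int:
        # wells x fields per well x images per field
        return self.wells * self.fields_per_well * self.images_per_field

    @property
    def image_data_mb(self) -> float:
        # 1 MB = 1024 KB
        return self.images * self.image_size_kb / 1024

    @property
    def cell_feature_records(self) -> int:
        # wells x fields per well x cells per field x features per cell
        return (self.wells * self.fields_per_well
                * self.cells_per_field * self.features_per_cell)


# High-density scenario from the text.
high_density = ScanScenario(wells=384, fields_per_well=16, images_per_field=4,
                            cells_per_field=100, features_per_cell=10)
print(high_density.images)                # 24576
print(high_density.image_data_mb)         # 12288.0
print(high_density.cell_feature_records)  # 6144000
```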
Since multiple microplates can be scanned in parallel, and multiple automated feature-rich cell screening systems can operate 24 hours a day, seven days a week, and 365 days a year, the experimental data collected may easily exceed physical storage limits for a typical computer network. For example, disk storage on a typical computer network may be in the range from about ten gigabytes (“GB”) to about one hundred GB of data storage. As is known in the art, a gigabyte is 2^30 bytes, or 1024 MB, and is commonly interpreted as “one billion bytes.”
The data storage requirements for using automated feature-rich cell screening on a conventional computer network used on a continuous basis could easily exceed a terabyte (“TB”) of storage space, which is extremely expensive based on current data storage technologies. As is known in the art, one terabyte equals 2^40 bytes, and is commonly interpreted as “one trillion bytes.” Thus, collecting and storing data from an automated feature-rich cell screening system may severely impact the operation and storage of a conventional computer network.
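To put these storage limits in perspective, the following sketch counts how many plates' worth of image data fit in a given capacity. It is a rough illustration under stated assumptions: the function name is hypothetical, the per-plate figures (576 MB and 12,288 MB) are the image-data estimates from the scenarios above, and cell feature records, compression and file-system overhead are ignored.

```python
MB = 2**20  # bytes
GB = 2**30  # bytes, i.e., 1024 MB
TB = 2**40  # bytes, i.e., 1024 GB


def plates_that_fit(capacity_bytes: int, image_mb_per_plate: int) -> int:
    """Whole plates whose image data fits in the given capacity."""
    return capacity_bytes // (image_mb_per_plate * MB)


LOW_DENSITY_MB = 576      # 96-well scenario, image data only
HIGH_DENSITY_MB = 12288   # 384-well scenario, image data only

print(plates_that_fit(10 * GB, LOW_DENSITY_MB))    # 17
print(plates_that_fit(100 * GB, LOW_DENSITY_MB))   # 177
print(plates_that_fit(TB, LOW_DENSITY_MB))         # 1820
print(plates_that_fit(TB, HIGH_DENSITY_MB))        # 85
```

Under these assumptions, a ten GB volume holds the image data of only 17 low-density plates, and a full terabyte holds the image data of 85 high-density plates.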
Another problem with feature-rich cell screening systems is that even though a massive amount of cell data is collected, only a very small percentage of the total cell feature data and image data collected will ever be used for direct visual display. Nevertheless, to gather statistically relevant information about a new compound, all of the cell data generated is typically stored on a local hard disk and made available for analysis. This may also severely impact local hard disk storage.
Yet another problem is that microplate scan results information for one microplate can easily exceed about 1,000 database records per plate, and cell feature data and image data can easily exceed about 6,000,000 database records per plate. Most conventional databases used on personal computers cannot easily store and manipulate such a large number of data records. In addition, waiting relatively long periods of time to open such a large database on a conventional personal computer to query and/or display data may severely affect the performance of a network and may quickly lead to user frustration or user dissatisfaction.
Thus, it is desirable to provide a data storage system that can be used for feature-rich screening on a continuous basis. The data storage system should provide a flexible and scalable repository of cell data that can be easily managed and allows data to be analyzed, manipulated and archived.