1. Field
The present invention is generally directed to storing and managing data in a data warehouse, and more particularly directed to storing and managing data from biological sample analyzers, such as flow cytometer instruments.
2. Background
Biological sample analyzers, such as flow cytometer instruments, are widely used in clinical and research settings. A biological sample may comprise a fluid medium carrying a plurality of discrete biological particles, e.g., cells, suspended therein. Biological samples can include blood samples or other cells within a heterogeneous population of cells. Information obtained from the biological particles is often used for clinical diagnostics and/or data analyses.
Flow cytometry is a technology that is used to simultaneously measure and analyze multiple physical characteristics or dimensions of particles, such as cells. Flow cytometry includes techniques for analyzing multiple parameters or dimensions of samples. Characteristics, properties, and dimensions measurable by flow cytometry include cellular size, granularity, internal complexity, fluorescence intensity, and other features. Detectors are used to detect forward scatter, side scatter, fluorescence, etc. in order to measure various cellular properties. Cellular characteristics and properties identified by flow cytometer instruments can then be used to analyze, identify, and/or sort cells.
In traditional flow cytometry systems, a flow cytometer instrument is a hardware device used to pass a plurality of cells, one at a time, through a beam of radiation formed by a light source, such as a laser beam. A flow cytometer instrument captures light that emerges from each of the plurality of cells as each cell passes through the beam of radiation.
Currently available flow cytometry systems may include three main systems, i.e., a fluidic system, an optical system, and an electronics system. The fluidic system may be used to transport the particles in a fluid stream past the laser beam. The optical system may include the laser that illuminates the individual particles in the fluid stream, optical filters that filter the light before or after it interacts with the fluid stream, and photomultiplier tubes that detect the light after it passes through the fluid stream, for example, as fluorescence and/or scatter. The electronics system may be used to process the signals generated by the photomultiplier tubes or other detectors, convert those signals, if necessary, into digital form, store the digital signals and/or other identification information for the cells, and generate control signals for controlling the sorting of particles. In traditional flow cytometry systems, a computer system converts signals received from light detectors into digital data that is analyzed.
Flow cytometry systems capture large amounts of data from passing thousands of cells per second through the laser beam. Captured flow cytometry data must be stored and indexed so that statistical analysis can subsequently be performed on the data. Since flow cytometers operate at very high speeds and collect large amounts of data in short amounts of time, it is necessary for the data management and storage systems to operate at very high speeds and to efficiently store and manage the data. Statistical analysis of the data can be performed by a computer system running software that generates reports on the characteristics (i.e., dimensions) of the cells, such as cellular size, complexity, phenotype, and health.
Many conventional flow cytometry systems use relational or transactional databases to store and manage the data. Relational databases are not well suited for near instantaneous analysis and display of large amounts of data. The relational databases traditionally used with flow cytometry systems are better suited for creating records for On-Line Transaction Processing (OLTP) databases. Unlike relational databases, On-Line Analytical Processing (OLAP) databases are designed to enhance query performance for large amounts of data (i.e., data warehouses) involving relatively few data updates (i.e., data record updates, inserts, and deletes). Although many report-writing tools exist for relational databases, query performance suffers when a large database is summarized. OLTP databases are designed to enhance data update performance, which is achieved at the expense of query performance when OLTP databases contain a large number of tables and a large amount of data. Conversely, OLAP databases allow users to alter and fine-tune query results interactively, dynamically adjusting views of the data, even in cases where the database contains large amounts of data. A design goal of OLAP databases is to enable users to form queries (i.e., ask questions) and receive results quickly. However, current OLTP and OLAP database schemas are not dynamic, in that they cannot readily be modified or extended by users who simply request that a “new field” be created.
Traditional relational database management systems (RDBMS) are unable to provide OLAP query performance for large relational databases (i.e., databases containing more than a terabyte of data). Similarly, existing OLAP systems are not typically configured to efficiently handle large amounts of data updates.
Traditional flow cytometry database applications have focused on retrieving data from list mode files or relatively small relational OLTP databases, and are not integrated with an OLAP database or a data warehouse. Currently available flow cytometry data analysis and storage systems are limited to storage, management, and sharing of flow cytometry list mode files. Flow cytometry list mode files are files containing raw flow cytometry data, hereafter called FCS files. As used herein, an FCS file refers to a flow cytometry data file compliant with the International Society for Advancement of Cytometry (ISAC) Flow Cytometry Standard (FCS). The traditional tools merely index metadata in list mode files, but do not search across hundreds, thousands, or millions of list mode files in search of past experiments that identified a particular phenotype with a particular statistical value. For example, traditional systems cannot query list mode files in search of any fact/dimension combination contained within the files. An example of a fact/dimension combination is a protocol identifying a Naïve T Cell population that occupies at least 15% of total events.
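By way of illustration only, the fact/dimension query described above can be sketched against a small star schema. The following is a minimal sketch in Python using the standard sqlite3 module; the table names (dim_population, dim_experiment, fact_population_stat), the column names, and the sample rows are all hypothetical and are not taken from any existing flow cytometry system or from the FCS standard.

```python
import sqlite3


def build_demo_warehouse():
    """Create an in-memory star schema with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables: the "who/what" a statistic is about.
        CREATE TABLE dim_population (
            population_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE dim_experiment (
            experiment_id INTEGER PRIMARY KEY,
            protocol TEXT NOT NULL,
            fcs_file TEXT NOT NULL
        );
        -- Fact table: one measured statistic per experiment/population pair.
        CREATE TABLE fact_population_stat (
            experiment_id INTEGER REFERENCES dim_experiment(experiment_id),
            population_id INTEGER REFERENCES dim_population(population_id),
            percent_of_total REAL NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_population VALUES (?, ?)",
        [(1, "Naive T Cell"), (2, "B Cell")],
    )
    conn.executemany(
        "INSERT INTO dim_experiment VALUES (?, ?, ?)",
        [(1, "Protocol A", "a.fcs"),
         (2, "Protocol B", "b.fcs"),
         (3, "Protocol C", "c.fcs")],
    )
    conn.executemany(
        "INSERT INTO fact_population_stat VALUES (?, ?, ?)",
        [(1, 1, 18.2), (1, 2, 9.5), (2, 1, 12.0), (3, 1, 15.0), (3, 2, 22.1)],
    )
    return conn


def protocols_with_population(conn, population, min_percent):
    """Return protocols whose named population is at least min_percent of total events."""
    rows = conn.execute(
        """
        SELECT e.protocol
        FROM fact_population_stat AS f
        JOIN dim_population AS p ON p.population_id = f.population_id
        JOIN dim_experiment AS e ON e.experiment_id = f.experiment_id
        WHERE p.name = ? AND f.percent_of_total >= ?
        ORDER BY e.protocol
        """,
        (population, min_percent),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    warehouse = build_demo_warehouse()
    print(protocols_with_population(warehouse, "Naive T Cell", 15.0))
```

In this sketch, the percentage is the fact and the population and experiment are dimensions, so a single query can span every stored experiment rather than opening list mode files one at a time.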
Polychromatic flow cytometry data currently includes 8 or more colors. Polychromatic flow cytometry refers to methods to analyze and display complex multi-parameter data from a flow cytometer instrument. There are technical challenges involved in analyzing and querying large amounts of Polychromatic Flow Cytometry data. In traditional systems, as flow cytometry datasets increase in size, there is a corresponding degradation in data management and query performance.
Accordingly, what is needed are methods and systems that enable storage, analysis, and mining of large amounts of Polychromatic Flow Cytometry data. Further, when list mode data files from a clinical flow cytometry lab contain patient identifiers, what is needed are systems and computer program products that are capable of unifying proteomic and genomic data alongside flow cytometry data. What is also needed are systems, methods, and computer program products that allow queried data to be modified or “cleaned up” by users in both research and clinical environments. What is further needed is a dynamically extensible database schema capable of manipulating up to 1 terabyte or more of flow cytometry data, wherein the database schema can be readily extended by users by requesting that “new fields” be created.
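By way of illustration only, one common way to let users request a “new field” without altering table definitions is an entity-attribute-value layout, in which each field is a row in a registry table rather than a column. The following minimal sketch uses Python and the standard sqlite3 module; the table names (field_def, field_value), the function names, and the sample field names are hypothetical, and the sketch is an assumption offered for clarity rather than a description of any particular system.

```python
import sqlite3


def open_store():
    """Create an in-memory store whose fields are rows, not columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE field_def (
            field_id INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE field_value (
            record_id INTEGER NOT NULL,
            field_id INTEGER NOT NULL REFERENCES field_def(field_id),
            value TEXT,
            PRIMARY KEY (record_id, field_id)
        );
    """)
    return conn


def add_field(conn, name):
    """Register a field if it is new and return its id; no schema change is needed."""
    conn.execute("INSERT OR IGNORE INTO field_def (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT field_id FROM field_def WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def set_value(conn, record_id, name, value):
    """Store a value for a record, creating the field on first use."""
    field_id = add_field(conn, name)
    conn.execute(
        "INSERT OR REPLACE INTO field_value VALUES (?, ?, ?)",
        (record_id, field_id, str(value)),
    )


def get_record(conn, record_id):
    """Return all fields of a record as a name-to-value dictionary."""
    rows = conn.execute(
        """
        SELECT d.name, v.value
        FROM field_value AS v
        JOIN field_def AS d ON d.field_id = v.field_id
        WHERE v.record_id = ?
        ORDER BY d.name
        """,
        (record_id,),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    store = open_store()
    set_value(store, 1, "tube", "T1")
    set_value(store, 1, "operator", "kim")  # a field requested after the fact
    print(get_record(store, 1))
```

The trade-off in this layout is that all values are stored as text and queries require a join per field, which is one reason extensibility and query performance are difficult to achieve together.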