Analysis of complex data often follows a reductionist approach. In other words, discrete analysis steps are performed on the data that, in general, simplify or reduce the number of data values or group the data values into similar clusters. Further analysis steps are then carried out independently on the results of these initial algorithms until the data is finally reduced to one or more outputs that the user desires. These outputs can be as simple as a single number (for example, a mean of values), or as complex as a series of graphs representing different aspects of the data.
Visualizations of the outputs of algorithms can be as simple as a display of a single number, or as complex as a dynamic multidimensional series of graphs. The generation of a visualization is itself an algorithmic process, and is as important to the analysis of data as the functional manipulation of the data. Visualizations can be associated with any given analysis step; thus, a user can completely analyze a data sample by associating successive algorithms and viewing the associated visualization in order to monitor the analysis process.
The advantage of a reductionist approach involving discrete analysis steps is that parts of the analysis can be applied to different datasets that may require different pre-analysis. For example, some datasets require smoothing or elimination of spurious data before proceeding with further analysis.
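By way of illustration only, the following minimal Python sketch shows how a shared analysis step can be reused on datasets that need different pre-analysis. The function names (`drop_spurious`, `analyze`), the acceptance range, and the data values are hypothetical and are not taken from any existing program.

```python
from statistics import mean

def drop_spurious(values, low, high):
    """Pre-analysis step: discard values outside a plausible range."""
    return [v for v in values if low <= v <= high]

def analyze(values, pre_steps=()):
    """Apply any dataset-specific pre-analysis steps, then reduce to a mean."""
    for step in pre_steps:
        values = step(values)
    return mean(values)

# Hypothetical datasets: the second contains one spurious reading.
clean_sample = [10.0, 12.0, 11.0]
noisy_sample = [10.0, 12.0, 11.0, 9999.0]

# The same final step (the mean) is reused; only the pre-analysis differs.
clean_result = analyze(clean_sample)
noisy_result = analyze(noisy_sample, pre_steps=[lambda v: drop_spurious(v, 0.0, 100.0)])
```

Because each step is discrete, the final reduction does not need to know whether a given dataset was filtered beforehand.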
This mode of data analysis is particularly useful in the field of flow cytometry. For example, scientists studying the very heterogeneous composition of white blood cells will typically employ measurements that discriminate these cells by revealing the presence or absence of particular proteins on the cells. Some of these proteins can discriminate major classes of white blood cells (e.g., B cells vs. T cells); others can discriminate subsets of these major classes. However, most of the proteins are expressed by many of the subsets; thus, it is necessary to use a combination of many different measurements to identify unique kinds of blood cells.
Typically, a scientist will first separate flow cytometric data values into sets corresponding to the major white blood cell types. To further differentiate between subsets, the researcher will view graphs that are derived only from the data corresponding to these sets. As more and more restrictions are placed on the data, finer and finer subsets of cells are identified. Once the subsets have been identified, the scientist will typically desire a variety of different statistics to be determined for the cells contained in each subset.
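The successive restriction described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the marker names (CD3, CD19, CD4, CD8), the intensity values, and the threshold of 500 are illustrative choices, and a one-dimensional threshold stands in for the graphical gates a researcher would normally draw.

```python
from statistics import mean

# Hypothetical events: one dict per cell, with made-up marker intensities.
events = [
    {"CD3": 900, "CD19": 20, "CD4": 800, "CD8": 30},
    {"CD3": 850, "CD19": 15, "CD4": 40, "CD8": 700},
    {"CD3": 10, "CD19": 950, "CD4": 25, "CD8": 20},
    {"CD3": 920, "CD19": 30, "CD4": 760, "CD8": 45},
]

def gate(cells, marker, low, high=float("inf")):
    """Keep only the cells whose value for `marker` lies in [low, high)."""
    return [c for c in cells if low <= c[marker] < high]

# First restriction: a major class (assumed here to be CD3-positive cells).
t_cells = gate(events, "CD3", 500)

# Second restriction, applied only to the first set: a finer subset.
cd4_t_cells = gate(t_cells, "CD4", 500)

# Statistics determined for the cells contained in the final subset.
stats = {
    "count": len(cd4_t_cells),
    "fraction_of_parent": len(cd4_t_cells) / len(t_cells),
    "mean_CD4": mean(c["CD4"] for c in cd4_t_cells),
}
```

Each gate operates only on the output of the previous one, so every added restriction yields a finer subset.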
Often, the steps taken to analyze flow cytometric data can be repeatedly applied to multiple data samples. The specific gating (i.e., the restriction of the data values to particular sets) can be applied to, for example, different samples obtained from different individuals. A particular gating can also be used within the same sample to differentiate subsets of different major classes (for example, the same gating may identify subsets of B cells or subsets of T cells, depending on which data values are supplied to the algorithm). This is an underlying principle of batch analysis: the repetitive application of a series of algorithms in order to achieve similar analysis results on multiple samples.
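Batch analysis in this sense can be sketched as the same ordered series of gates applied unchanged to every sample. The sketch below is illustrative only; the scheme layout, the sample identifiers, the marker names, and the thresholds are all hypothetical.

```python
def gate(cells, marker, low):
    """Keep only the cells whose value for `marker` is at least `low`."""
    return [c for c in cells if c[marker] >= low]

# A hypothetical analysis scheme: an ordered series of (name, marker, threshold).
SCHEME = [("T cells", "CD3", 500), ("CD4+ T cells", "CD4", 500)]

def run_scheme(cells, scheme):
    """Apply each gate to the output of the previous one; record subset sizes."""
    results = {}
    current = cells
    for name, marker, low in scheme:
        current = gate(current, marker, low)
        results[name] = len(current)
    return results

# Made-up samples from two individuals.
samples = {
    "donor_1": [
        {"CD3": 900, "CD4": 800},
        {"CD3": 850, "CD4": 40},
        {"CD3": 10, "CD4": 25},
    ],
    "donor_2": [
        {"CD3": 700, "CD4": 650},
        {"CD3": 720, "CD4": 610},
        {"CD3": 30, "CD4": 15},
        {"CD3": 640, "CD4": 20},
    ],
}

# Batch analysis: the identical scheme is applied to every sample.
batch_results = {sid: run_scheme(cells, SCHEME) for sid, cells in samples.items()}
```

Note that every sample receives exactly the same thresholds; nothing in this sketch lets one sample use a different gate while the remaining steps stay shared.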
A significant drawback of this approach is that different samples may require slightly (or significantly) different algorithms to achieve the same principal goal. In other words, in one sample, the major cell divisions may require a different type of gating than that required in another sample. However, subsequent analyses such as further gating or statistics may be identical between the two samples. Current analysis techniques do not provide the flexibility to allow for specific modification of certain algorithms within an analysis scheme while still allowing for easy batch analysis.
It will be apparent to one knowledgeable in the field of data analysis that the analysis processes and inherent limitations described above for flow cytometric data can be equally found in other types of data analysis. These include, but are not limited to, the analysis of demographic data and the analysis of clinical data. These data types are examples of highly multiparametric datasets (wherein many measurements are made for each member of the dataset) that can require complex analysis that may take many steps.
Current implementations of data analysis programs are extremely poor in the area of batch-mode analysis (i.e., repetitive analysis of multiple sample datasets). In general, batch-mode analyses are accomplished by the identical and repeated application of an algorithm, without allowing for sample-specific modifications to such algorithms. Therefore, after application of the batch process, the user must go back and re-analyze those samples requiring different steps. This process becomes especially tedious and error-prone when the batch analysis must be repeated (for example, to change one step in the batch analysis). This puts an enormous demand on the user to remember which samples require modifications of the algorithms, and what those modifications are.
Current implementations also have no "automatic" mechanisms for scheduling batch analysis. Typically, users must select a set of sample data files and issue the command to apply a given algorithm to that entire set. When a new set of data samples has been collected, the user must re-issue the batch command for every algorithm to the new data samples.
Finally, most data analysis implementations do not allow the user to associate a descriptive name with the algorithms employed. The algorithms are often cryptic and difficult to immediately understand; thus, the user will often make mistakes by failing to recognize subtle modifications to algorithms. Even when implementations allow users to annotate algorithms, the annotation itself serves no function within the implementation, which tends to dissuade users from performing the annotation.
In the end, current data analysis programs place too much of a burden on users to keep track of the precise algorithms used to analyze samples. In addition, they provide few tools to employ these algorithms repetitively, and when they do provide such tools, these tools do not allow for any flexibility in the application to datasets requiring specific modifications of those algorithms.