Working with extremely large datasets poses a problem for a researcher: it is difficult to maintain an appropriate level of detail in the subject data being examined while also maintaining sufficient context from the surrounding data. This problem is amplified when the data of interest is extremely sparse and spread out over a very large dataset.
In the biological field, these problems arise in many different forms of large datasets that researchers need to study. With the advent of high-throughput technologies, dataset sizes and the volumes of data available to the researcher have grown by orders of magnitude.
One area of biology in which the data to be studied is particularly sparse within very large datasets is the field of proteomics. One method of attempting to identify the proteins that are present in a sample is protein expression profiling. Protein expression profiling involves the identification of proteins in a particular sample as a function of a particular state of the cell or as a function of exposure to a drug, chemical, or physical stimulus. One common approach to protein expression profiling is the use of mass spectrometry, wherein proteins are fragmented, such as by digestion of the proteins, and then processed in a mass spectrometer. Another approach to protein expression profiling is the use of protein microarrays.
Another area in biology characterized by sparse data within very large datasets is the field of metabolomics. Metabolomics is the study of cellular metabolites, the small molecules that are the intermediates and products of metabolism. Metabolism refers to the totality of chemical changes that occur within tissues of an organism during the buildup and breakdown of molecules for utilization by the body. As with proteomics, one common approach to metabolomic analysis is through the use of mass spectrometry.
A common workflow for the use of mass spectrometry to, for example, determine the cohort of proteins in a sample of serum, involves the following steps. First, cells are broken down and complex protein mixtures are extracted through a process known as “lysis”. Next, “digestion” is performed to break down proteins in the sample into fragments. The complex mixture of protein fragments is then separated into its component fragments. Common laboratory separation techniques include gel electrophoresis and chromatography. In some laboratory techniques, separation occurs before digestion. Finally, the separated fragments are further prepared and input into a mass spectrometer. A mass spectrometer commonly consists of three functional units. A “source” ionizes protein fragments, giving them positive or negative charges. A “mass analyzer” then separates the mixture of ions according to their mass-to-charge ratios. A “detector” detects the different ions and drives a data acquisition system that prepares raw mass spectral data. A “mass spectrum” is a plot of data typically containing m/z values along the x-axis and intensity values along the y-axis. The raw mass spectral data is typically interpreted by application software, for example by searching the data against protein databases to provide protein identification.
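As an illustration of the mass-to-charge ratio and of a mass spectrum as m/z versus intensity, the following is a minimal sketch rather than any instrument's actual processing. It assumes positive-mode ionization in which each unit of charge comes from one added proton; the function name `mass_to_charge`, the neutral mass of 1000.0 Da, and the intensity values are hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons (Da)


def mass_to_charge(neutral_mass, charge):
    """m/z of a positive ion formed by adding `charge` protons (assumed ionization model)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# Hypothetical spectrum for one 1000.0 Da fragment seen at three charge states.
# Each entry is (m/z, intensity): the x-axis and y-axis values of the plot.
observations = [(1, 120.0), (2, 340.0), (3, 45.0)]  # (charge, made-up intensity)
spectrum = sorted((mass_to_charge(1000.0, z), i) for z, i in observations)

# The most intense point in a spectrum is commonly called the base peak.
base_peak = max(spectrum, key=lambda point: point[1])

for mz, intensity in spectrum:
    print(f"m/z {mz:10.4f}  intensity {intensity:6.1f}")
```

The same fragment appears at several m/z positions depending on its charge, which is one reason a single compound can occupy scattered locations in the raw data.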
An example workflow for metabolomic analysis involves separating metabolites in a sample by capillary electrophoresis, then selectively detecting metabolites using mass spectrometry by monitoring over a range of mass/charge values. This is described in Soga et al., “Quantitative Metabolome Analysis Using Capillary Electrophoresis Mass Spectrometry”, Journal of Proteome Research, 2003, 2, 488–494.
In addition to working with interpreted data, it is often necessary and desirable for the user to view and interact directly with the raw mass spectral data itself. One example of this is in confirming whether a mass spectrometry experiment has provided coverage across the range of likely peptide fragments expected to be seen in a sample. Directly inspecting the raw mass spectral data is also very useful in estimating the quality of the data. For example, from visual inspection, a user can get a sense of overall signal/noise ratio. From the shape of visual features and from bleeding between visual features, the user can get a sense of the saturation of data in one dimension and the resolution of the data. Additionally, experimental artifacts such as chemical pollution in the source of the mass spectrometer are typically quite visible. Thus, a rough measure of quality is very evident.
Existing software for viewing and interacting with mass spectra is quite complex because there can be hundreds of thousands of compounds involved in the analysis of a typical sample. Current mass spectrometry data visualization software is limited to a narrow cross-section of data at a time, typically a slice of data in one dimension, for example either a sum of intensities over time or a distribution of intensities over mass/charge values at one given time point. Thus, while these approaches may provide sufficient detail of the data, they do so at the cost of significantly losing the context of the surrounding data from which the narrow cross-section is taken. This is because the datasets are enormous and sparse.
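The two one-dimensional slices mentioned above can be sketched as follows. This is an illustrative sketch under assumptions, not the behavior of any particular product: the sparse layout (a dictionary keyed by scan index and m/z bin), the function names, and all values are hypothetical.

```python
# Hypothetical sparse run: (scan_index, mz_bin) -> intensity. All values are made up.
run = {
    (0, 500): 10.0,
    (0, 742): 3.0,
    (1, 500): 40.0,
    (3, 1203): 7.0,
}


def total_ion_chromatogram(data, n_scans):
    """Sum intensities over all m/z bins for each scan: one value per time point."""
    tic = [0.0] * n_scans
    for (scan, _mz_bin), intensity in data.items():
        tic[scan] += intensity
    return tic


def spectrum_at(data, scan):
    """Return the m/z -> intensity slice recorded at a single time point."""
    return {mz: i for (s, mz), i in data.items() if s == scan}


tic = total_ion_chromatogram(run, n_scans=4)   # intensity summed per scan
one_scan = spectrum_at(run, scan=0)            # intensities across m/z at scan 0
print(tic)
print(one_scan)
```

Each slice discards one axis entirely: the summed trace no longer shows which m/z values contributed, and the single-scan spectrum shows nothing about neighboring time points.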
Several programs, for example the Analyst QS software product from MDS Sciex (http://www.sciex.com), have the ability to display two-dimensional graphs of mass spectrometry data, for example intensity vs. mass/charge ratio over time. However, they for the most part work with fixed images. Changing any aspect of the image boundaries forces a redrawing of the image and/or the creation of a new window. This, in general, is slow and cumbersome.
Because mass spectrometry data is usually sparse and fine-featured, a simple display of scaled-down mass spectral data is hard to read. Often the important information in mass spectral data, time and abundance, is visible in only a very small area, typically only a few data points in either direction. These fine features are often grouped together in patches, but these groups of features can often be far apart. Thus, information is lost when the image is zoomed out to a low degree of magnification, as features in the display of the data blend into each other.
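The loss of fine features on zooming out can be shown with a small sketch. It assumes a single row of 16 intensity values with two narrow, made-up peaks, and it compares two generic reduction choices (averaging and taking the maximum); neither is presented here as the method of any existing viewer, and the function names are hypothetical.

```python
def downsample(row, factor, reduce):
    """Collapse each run of `factor` adjacent values into one value using `reduce`."""
    return [reduce(row[i:i + factor]) for i in range(0, len(row), factor)]


def mean(values):
    return sum(values) / len(values)


# One row of a hypothetical display: two narrow peaks in an otherwise empty region.
row = [0.0] * 16
row[2] = 100.0
row[6] = 100.0

by_mean = downsample(row, 8, mean)  # averaging dilutes the peaks
by_max = downsample(row, 8, max)    # taking the maximum keeps the height

print(by_mean)
print(by_max)
```

With a factor of 8, averaging reduces the two 100.0 peaks to a single value of 25.0, and taking the maximum preserves the height but still merges both peaks into one display cell. In either case the two features are no longer distinguishable.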
The field of cartography deals with similar problems. Although some software systems provide the capability of zooming in and out from detailed views to high-level views, these views are generally not continuously zoomable and must instead be redrawn at each magnification. Redrawing takes time, and the user loses a sense of continuity, which can be detrimental to his/her sense of context. See, for example, http://www.mapquest.com/.
Additionally, the state of the art for comparing multiple mass spectrometry datasets is fairly primitive. Although some ad hoc research is being conducted on this problem, the research has tended to focus on very specific problems. Thus, there is no widely accepted solution to the problem of comparing multiple datasets.
In other domains, there are systems for using a graphical feature as an index for querying a database. An example of this is “content-based image retrieval”, as provided in products from Virage, Inc. (http://www.virage.com/), wherein a sample image is provided as input and the system returns a set of images in the database that are most similar, in terms of features such as color, texture or structure, to the sample image. These systems are not adapted to searching large, sparse datasets such as mass spectrometry data.
In view of the existing systems, what is needed are systems, methods, and tools that provide means to easily navigate large, multi-dimensional datasets from a global perspective, instead of being confined to a narrow slice of data at a time, that provide representations of the data at different levels of complexity, and that provide a fast and self-evident interface for zooming in and out so that features of interest are conveniently recognizable.