In many fields of science, collections of mass spectra, optical absorption spectra, chromatograms, electropherograms or other analytical peak-containing strings of digital data are investigated with respect to inherent patterns and to correlations of such patterns with external parameters of the original samples from which the data strings are acquired. Each string of data describes the distribution of “peak intensities” along the scale of a “scaling parameter”, which may be a “mass” (mass spectrometry), a “retention time” (chromatography) or the like. As a rule, such data strings are acquired by a chemical analysis procedure, and each peak represents a certain substance. In different strings of data, peaks with the same scaling parameter can be related and assigned a common parameter value (e.g., “mass”, “retention time” or simply a peak number) throughout the collection of data strings.
These strings of digital data can be displayed in a two-dimensional diagram showing the peaks within the data strings in graphical form. The term “peak” designates not just one peak in a single data string but, in a broader sense, all related peaks in the collection of data strings that share a common scaling parameter value.
Mass spectra of affinity-extracted proteins from body fluids in clinical proteomics may serve as an example. Here, the peaks in the data strings are mass peaks; each represents the signal of a protein (or some other biomolecule) having this mass. Usually, two collections of spectra are acquired, one from healthy patients and another from patients with a well-confirmed and well-documented disease, and significant differences between the two collections of spectra are searched for. This is done first by visual inspection of a suitable graphical presentation of the collections of spectra. Highly significant differences, such as peaks appearing in only one collection and lacking in the other, may be found immediately in this way. Usually, however, refinements in the search for such differentiation parameters will be necessary. These mathematical refinements are performed by applying pattern recognition, correlation, cluster-searching, or classification algorithms.
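The most obvious kind of difference mentioned above, a peak appearing in only one collection, can be illustrated with a minimal sketch. It assumes that each spectrum has already been reduced to a mapping from peak mass to intensity; the function name `exclusive_peaks`, the presence threshold, and the sample values are hypothetical and serve only as illustration.

```python
def exclusive_peaks(coll_a, coll_b, threshold):
    """Return (only_in_a, only_in_b): peaks present in every spectrum of one
    collection and in no spectrum of the other.

    Each spectrum is a dict {peak mass: intensity}; a peak counts as
    "present" when its intensity reaches the (assumed) threshold.
    """
    all_peaks = set().union(*coll_a, *coll_b)  # dicts iterate over their keys
    only_a, only_b = [], []
    for peak in sorted(all_peaks):
        in_a = [s.get(peak, 0.0) >= threshold for s in coll_a]
        in_b = [s.get(peak, 0.0) >= threshold for s in coll_b]
        if all(in_a) and not any(in_b):
            only_a.append(peak)
        elif all(in_b) and not any(in_a):
            only_b.append(peak)
    return only_a, only_b


# Hypothetical example data: two spectra per collection, two peaks each.
healthy = [{1000.0: 5.0, 2000.0: 0.1}, {1000.0: 4.0, 2000.0: 0.0}]
disease = [{1000.0: 5.5, 2000.0: 3.0}, {1000.0: 4.5, 2000.0: 2.5}]

only_healthy, only_disease = exclusive_peaks(healthy, disease, threshold=1.0)
```

Real collections rarely separate this cleanly, which is why the refinements by pattern recognition described above are usually needed.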
In some cases, even three-dimensional data ensembles are generated and investigated for inherent patterns. Examples are chromatograms of body fluids sampled before and after application of drugs, measured by LC-MS (liquid chromatography coupled with mass spectrometry). The goal is to detect the appearance of drug metabolites and other changes in composition caused by regulatory effects. Here, the graphical display is more difficult, but in many cases it is sufficient to show just the total ion current chromatograms, with their peaks representing the biosubstances. These total ion current chromatograms form the strings of data, which can be displayed with the retention time as the scaling parameter. Only the mathematical pattern recognition investigations may access the mass spectra hidden behind the peaks.
There are many known kinds of graphical two-dimensional presentations for collections of such data strings: single data strings (intensity vs. scaling parameter), each arranged in its own window one below the other; stacked data strings (shifted by small displacements in both dimensions); contour plots; gray scale plots; density plots; plots of averaged intensities (means) and relative standard deviations; and the like. The graphical display programs may show the data strings only in a passive way, or they may allow interactive user access to predetermined features of the graphical presentation, such as peaks, baselines, spikes or the like. User access is usually realized by clicking with the computer mouse on the selected feature.
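As an illustration of the averaged-intensity presentation, the following minimal sketch computes the mean and the relative standard deviation at each point of the scaling parameter across a collection. It assumes that all data strings share the same scale and length; the function name `mean_and_rsd` and the sample values are hypothetical.

```python
from statistics import mean, stdev


def mean_and_rsd(collection):
    """Return one (mean, relative standard deviation) pair per scale position.

    `collection` is a list of data strings (lists of intensities) of equal
    length, with at least two data strings. The relative standard deviation
    is the sample standard deviation divided by the mean.
    """
    summary = []
    for values in zip(*collection):  # intensities at one scale position
        m = mean(values)
        rsd = stdev(values) / m if m != 0 else float("nan")
        summary.append((m, rsd))
    return summary


# Hypothetical collection of three data strings with three points each.
collection = [
    [1.0, 10.0, 2.0],
    [1.0, 12.0, 2.0],
    [1.0, 14.0, 2.0],
]
summary = mean_and_rsd(collection)
```

In this example only the middle position varies between the data strings, so it is the only one with a non-zero relative standard deviation.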
There are likewise many programs for pattern recognition in given types of data strings. The data strings may be investigated as such, or they may be reduced beforehand to lists of peaks by “peak finding algorithms”. The peak list is a special form of a data string, but is characterized by two data values per peak (intensity; scaling parameter), whereas the original data string consists of a string of digital measurement values of intensities acquired at predetermined time intervals. Data strings or peak lists may be put together in “collections” stemming from different types of samples, e.g., from healthy and ill patients.
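The reduction of a data string to a peak list can be sketched as follows. This is a deliberately simple local-maximum search with an intensity threshold, not one of the peak finding algorithms used in practice, which typically also handle baseline, noise, and peak width. The function name `find_peaks_simple`, the threshold, and the sample data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def find_peaks_simple(intensities: Sequence[float],
                      scale: Sequence[float],
                      threshold: float) -> List[Tuple[float, float]]:
    """Reduce a data string to a peak list of (scaling parameter, intensity).

    A point counts as a peak if it reaches the threshold, is higher than its
    left neighbor, and is at least as high as its right neighbor.
    """
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y >= threshold and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append((scale[i], y))
    return peaks


# Hypothetical data string sampled at equal steps of the scaling parameter.
intensities = [0.0, 1.0, 5.0, 1.0, 0.0, 2.0, 9.0, 3.0, 0.0]
scale = [1000.0 + 0.5 * i for i in range(len(intensities))]

peak_list = find_peaks_simple(intensities, scale, threshold=4.0)
```

Each entry of the resulting peak list carries the two data values per peak described above.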
The term “pattern recognition” is used here for all programs which search for classifying, differentiating, or correlating structures in collections of data strings or peak lists. The pattern recognition programs thus comprise classification algorithms, principal component analysis, cluster analysis, cross correlation analysis, and many others. There are “supervised pattern recognition algorithms” (sometimes called “supervised learning programs”), used if different collections of data strings can be identified beforehand as belonging to different classes (e.g., healthy and ill patients), and there are “unsupervised pattern recognition algorithms”, used if no such membership in different classes is known beforehand.
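The supervised case can be illustrated with a minimal sketch of a nearest-centroid classifier, chosen here only because it is one of the simplest classification algorithms. It assumes that each sample is a vector of intensities for the same ordered set of peaks and that class membership of the training samples is known beforehand; the names `centroid` and `classify` and the sample values are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def centroid(rows):
    """Mean intensity vector of a collection of peak-intensity vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def classify(sample, centroids):
    """Assign the sample to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(sample, centroids[label]))


# Hypothetical training collections: two peak intensities per sample,
# with class membership known beforehand (the supervised case).
healthy = [[1.0, 0.1], [1.2, 0.0], [0.9, 0.2]]
ill = [[1.1, 2.0], [1.0, 2.4], [0.8, 1.9]]

centroids = {"healthy": centroid(healthy), "ill": centroid(ill)}
label = classify([1.0, 2.1], centroids)
```

An unsupervised algorithm would receive the same intensity vectors without the class labels and would have to find the grouping itself.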
Results of pattern recognition programs are usually shown in graphical presentations of their own; e.g., clusters are shown in a diagram with principal components as coordinates, the principal components being built by transformation of the original peak parameters in some complicated way. In these types of graphical presentation, there is no easily recognizable connection to the original peaks in the data strings and therefore no connection to the substances represented by the peaks. In some cases it may even be difficult to refer back to the peaks (and substances) responsible for a recognized pattern, e.g., if the principal components for a cluster presentation are non-linear combinations of the parameters of many peaks.
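For the simpler linear case, the connection between a principal component and the original peaks can be made explicit through the component's coefficients (loadings). The following minimal sketch computes the first principal component by power iteration on the covariance matrix. It assumes a linear principal component analysis and a start vector that is not orthogonal to the leading component, so it does not cover the non-linear combinations mentioned above; the function name and the sample values are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return the loadings of the first principal component.

    `rows` holds one intensity vector per sample, one column per peak.
    Power iteration on the sample covariance matrix is used; this assumes
    the start vector is not orthogonal to the leading eigenvector.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance at all in the data
            break
        v = [x / norm for x in w]
    return v


# Hypothetical intensities of three peaks in four samples; only the second
# peak varies strongly between samples.
rows = [
    [1.0, 10.0, 5.0],
    [1.1, 20.0, 5.2],
    [0.9, 30.0, 4.9],
    [1.0, 40.0, 5.1],
]
loadings = first_principal_component(rows)

# The peak with the largest absolute loading contributes most to the component.
dominant_peak = max(range(len(loadings)), key=lambda j: abs(loadings[j]))
```

Here the largest loading belongs to the second peak (index 1), which is how a linear component can be traced back to a peak and thus to a substance.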