A “marker” typically refers to a polypeptide or some other molecule that differentiates one biological status from another. It is useful to identify novel markers for diagnostics and drug discovery processes. One way to discover if substances are markers for a disease is by determining if they are “differentially expressed” in biological samples from patients exhibiting the disease as compared to samples from patients not having the disease. For example, FIG. 1(A) shows one graph 100 of a plurality of overlaid mass spectra derived from samples from a group of 18 diseased patients. Another graph 102 is shown in FIG. 1(B) and illustrates a plurality of overlaid mass spectra derived from samples from a group of 18 normal patients. In each of the graphs 100, 102, signal intensity is plotted as a function of mass-to-charge ratio. The intensities of the signals shown in the graphs 100, 102 are proportional to the concentrations of markers having a molecular weight corresponding to the mass-to-charge ratio A in the samples. As shown in the graphs 100, 102, at the mass-to-charge ratio A, a number of signals are present in both pluralities of mass spectra.
When the signals in the graphs 100, 102 are viewed collectively, it is apparent that the average intensity of the signals at the mass-to-charge ratio A is higher in the samples from diseased patients than the average intensity of the signals at the mass-to-charge ratio A from the normal patient samples. The marker at the mass-to-charge ratio A is said to be “differentially expressed” in diseased patients, because the concentration of this marker is, on average, greater in samples from diseased patients than in samples from normal patients.
Mass spectra like those shown in FIGS. 1(A) and 1(B) can be used to form an analytical model, which can be used as a diagnostic tool. Continuing the above example, a mass spectrum may be generated from an unknown sample from a test patient. The mass spectrum can be analyzed and the intensity of the signal at the mass-to-charge ratio A can be determined in the test patient's mass spectrum. The signal intensity can be compared to the average signal intensities at the mass-to-charge ratio A for diseased patients and normal patients. Using an analytical model formed from the spectra shown in FIGS. 1(A) and 1(B), a prediction can then be made as to whether the unknown sample indicates that the test patient has or will develop the disease. For example, if the signal intensity at the mass-to-charge ratio A in the unknown sample is much closer to the average signal intensity at the mass-to-charge ratio A for the diseased patient spectra than for the normal patient spectra, then a prediction can be made that the test patient is more likely than not to develop or have the disease.
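The comparison described above can be illustrated with a minimal sketch. It assumes the simplest possible model, namely assigning the test sample to whichever group has the closer average intensity at the mass-to-charge ratio A. The function name, the class labels, and the intensity values are hypothetical and are not taken from the figures.

```python
from statistics import mean


def predict_status(test_intensity, diseased_intensities, normal_intensities):
    """Assign the test sample to the group whose average intensity is closer.

    All arguments are signal intensities measured at the same
    mass-to-charge ratio (ratio A in the example above).
    """
    diseased_avg = mean(diseased_intensities)
    normal_avg = mean(normal_intensities)
    # A tie is resolved as "normal"; this is an arbitrary choice made
    # only for this illustration.
    if abs(test_intensity - diseased_avg) < abs(test_intensity - normal_avg):
        return "diseased"
    return "normal"


# Hypothetical intensities at the mass-to-charge ratio A.
diseased = [8.0, 10.0, 12.0]
normal = [2.0, 3.0, 4.0]
prediction_high = predict_status(9.0, diseased, normal)
prediction_low = predict_status(3.5, diseased, normal)
```

A practical analytical model would typically weigh more than one mass-to-charge ratio and account for the spread of intensities within each group; the sketch shows only the single-ratio comparison described in the text.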
When forming more sophisticated analytical models, signals in mass spectra are often “clustered” together and are then further processed by a computer. For example, various signals associated with the different mass spectra at one or more mass-to-charge ratios can form one or more signal clusters. The signals forming the signal clusters may be further processed, for example, to identify markers and/or to form an analytical model. If, for example, it were not known that the mass-to-charge ratio A represented a differentially expressed marker in normal and diseased patients, a computer could cluster all 36 signals shown in FIGS. 1(A) and 1(B) together. The computer could thereafter determine that the mass-to-charge ratio A is a mass-to-charge ratio of interest. A statistical process running on the computer could be used to analyze the 36 signals in the signal cluster and could automatically determine that the marker that is associated with the mass-to-charge ratio A is a differentially expressed marker.
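The text does not say which statistical process would be used. As one possible illustration, the sketch below applies Welch's t-statistic to the intensities of a signal cluster, split by patient group, and flags the cluster when the statistic exceeds a cutoff. The choice of test, the cutoff of 2.0, the function names, and the data are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of signal intensities.

    Each group needs at least two values, and the groups must not both
    have zero variance, otherwise the denominator is zero.
    """
    var_a = variance(group_a)  # sample variance
    var_b = variance(group_b)
    return (mean(group_a) - mean(group_b)) / sqrt(
        var_a / len(group_a) + var_b / len(group_b)
    )


def is_differentially_expressed(diseased, normal, threshold=2.0):
    """Flag the cluster when |t| exceeds a hypothetical cutoff."""
    return abs(welch_t(diseased, normal)) > threshold


# Hypothetical cluster intensities, split by patient group.
flag_different = is_differentially_expressed([8.0, 10.0, 12.0], [2.0, 3.0, 4.0])
flag_same = is_differentially_expressed([2.0, 3.0, 4.0], [2.0, 3.0, 4.0])
```

With these hypothetical values, the first cluster is flagged and the second is not. A real analysis would normally convert the statistic to a p-value and correct for the number of mass-to-charge ratios tested.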
Deciding which signals to include within a signal cluster is a problem. Different signal peaks with slightly different mass-to-charge ratios in respectively different mass spectra may in fact represent the same marker. Consequently, these signals are clustered together as a signal cluster and each of the signals in the signal cluster is treated as having the mass-to-charge ratio associated with the signal cluster, even though the signals are in fact associated with slightly different mass-to-charge ratios.
A “cluster window” can be used to capture all desired signals for a signal cluster. The cluster window is typically a continuous range of values such as time-of-flight values, mass-to-charge ratio values, or values derived therefrom. All signal peaks within the cluster window form a signal cluster, and the signals in the signal cluster and the mass-to-charge ratio for the signal cluster are used for further data analysis. The width of a cluster window has been specified in terms of a percentage of the mass-to-charge ratio (e.g., 1% of a particular mass-to-charge ratio).
A problem with such a cluster window is that it may not be wide enough to capture all signals that should be in the same signal cluster. If some signal peaks are incorrectly excluded in this clustering process, then any subsequent data analysis and model formation may also be incorrect. Accordingly, it is desirable to cluster signals correctly.
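A percentage-based cluster window can be sketched as follows. The sketch assumes that the window is centered on the mass-to-charge ratio of the cluster, so that a 1% width extends 0.5% to either side; the text gives only the width, so the centering is an assumption. The function names and peak values are hypothetical.

```python
def cluster_window(center_mz, width_fraction=0.01):
    """Return (low, high) bounds of a window whose total width is
    width_fraction of center_mz, assumed centered on center_mz."""
    half_width = center_mz * width_fraction / 2.0
    return center_mz - half_width, center_mz + half_width


def signals_in_window(peaks, center_mz, width_fraction=0.01):
    """Select the (mz, intensity) peaks that fall inside the window."""
    low, high = cluster_window(center_mz, width_fraction)
    return [peak for peak in peaks if low <= peak[0] <= high]


# Hypothetical peaks from different spectra, as (mz, intensity) pairs.
peaks = [(9960.0, 5.0), (10040.0, 7.0), (10100.0, 3.0)]
low, high = cluster_window(10000.0)
cluster = signals_in_window(peaks, 10000.0)
```

For a mass-to-charge ratio of 10000, the 1% window spans roughly 9950 to 10050, so the first two hypothetical peaks are clustered and the peak at 10100 is left out.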
The cluster window could be widened so that more signals are included in a signal cluster. For example, the proportional growth rate of the cluster window could be increased as the time-of-flight or mass-to-charge ratio increases. However, doing so may upset the clustering of peaks at lower molecular masses. For example, at low times-of-flight or low mass-to-charge ratios, one might capture too many signals within a signal cluster if the cluster window is too wide. Signals associated with different markers could then be erroneously included in the same cluster, which would be undesirable. This potential solution would also require manual tuning on the part of the user, which is subjective and prone to human error.
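The trade-off described above can be made concrete with a small numeric sketch. It assumes a window centered on the mass-to-charge ratio of the cluster and two hypothetical peaks at a low mass-to-charge ratio that are taken, for the sake of the example, to belong to different markers. The function names and all values are illustrative only.

```python
def window_bounds(center_mz, width_fraction):
    """Bounds of a window whose total width is width_fraction of center_mz."""
    half_width = center_mz * width_fraction / 2.0
    return center_mz - half_width, center_mz + half_width


def captured(peak_mzs, center_mz, width_fraction):
    """Mass-to-charge ratios of the peaks that fall inside the window."""
    low, high = window_bounds(center_mz, width_fraction)
    return [mz for mz in peak_mzs if low <= mz <= high]


# Hypothetical: two peaks assumed to come from two different markers.
low_mass_peaks = [1000.0, 1008.0]
narrow = captured(low_mass_peaks, 1000.0, 0.01)  # window of about 995 to 1005
wide = captured(low_mass_peaks, 1000.0, 0.02)    # window of about 990 to 1010
```

With the 1% window only the peak at 1000 is captured, whereas the 2% window also pulls in the peak at 1008, merging two distinct markers into one signal cluster.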
Embodiments of the invention address these and other problems.