Mass spectrometry is a powerful tool for determining the masses of molecules present in a sample. A mass spectrum consists of a set of mass-to-charge ratios, or m/z values, and corresponding relative intensities, where each intensity is a function of all ionized molecules in the sample having that mass-to-charge ratio. The m/z value, which is calculated by dividing the mass of a particle by its charge, determines how the particle will respond to an electric or magnetic field. The ratio is expressed as the dimensionless quantity m/z, where m is the molecular weight, or mass number, and z is the charge number, that is, the number of elementary charges carried by the ion. Because mass spectrometry provides the mass-to-charge ratios of the molecular species in a measured sample, the mass spectrum observed for a sample is a function of the molecules present. Conditions that affect the molecular composition of a sample should therefore affect its mass spectrum. As such, mass spectrometry is often used to test for the presence or absence of one or more molecules. The presence of such molecules may indicate a particular condition such as a disease state or cell type.
A “marker” refers to an identifiable feature in mass spectrum data that differentiates the biological status, such as a disease state, represented by one data set of mass spectra from that represented by another data set. For example, a marker can differentiate a person having a specific disease from a person not having that disease. In some cases, differences in peaks in the mass spectra may be used as differentiating features to form one or more markers. One way to determine markers for a disease is to determine whether features in the mass spectra of biological samples from patients with the disease are differentially expressed relative to the mass spectra of samples from patients not having the disease.
By comparing mass spectra obtained from the blood, serum, tissue, or some other source of patients with a disease against mass spectra obtained from healthy individuals, clinicians hope to identify markers for the disease and to create diagnostic tools that can be used to detect or confirm the presence of the disease.
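The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from this document: it assumes every spectrum has already been sampled on the same m/z grid, and the function name `candidate_markers`, the `min_score` cutoff, and the toy intensities are all hypothetical. The score is a simple effect size (difference in group means over a pooled standard deviation), not a statistical significance test.

```python
from statistics import mean, stdev


def candidate_markers(diseased, healthy, min_score=2.0):
    """Return (bin_index, score) pairs where the two groups differ.

    Each spectrum is a list of intensities on a shared m/z grid (an
    assumption of this sketch). The score is the absolute difference of
    the group means divided by the pooled standard deviation.
    """
    markers = []
    for i in range(len(diseased[0])):
        d = [spectrum[i] for spectrum in diseased]
        h = [spectrum[i] for spectrum in healthy]
        pooled = ((stdev(d) ** 2 + stdev(h) ** 2) / 2) ** 0.5
        if pooled == 0:
            continue  # no variation in either group; score is undefined
        score = abs(mean(d) - mean(h)) / pooled
        if score >= min_score:
            markers.append((i, score))
    return markers


# Toy data (hypothetical): three spectra per group, three m/z bins each.
diseased = [[1.0, 5.0, 2.0], [1.2, 5.5, 2.1], [0.9, 4.8, 1.9]]
healthy = [[1.1, 1.0, 2.0], [1.0, 1.2, 2.2], [0.8, 0.9, 1.8]]

markers = candidate_markers(diseased, healthy)
print([index for index, _ in markers])  # only the middle bin is flagged
```

In this toy data only the middle bin separates the groups; the other two bins differ by less than the spread within each group, so they are not reported.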
Manual inspection may be feasible for a small number of mass spectra, but it is not feasible for larger mass spectra data sets. Advances in mass spectrometry technology allow for higher-throughput screening of samples. Recently, a number of algorithms have been developed to find differences in mass spectra data in order to differentiate between the mass spectra of samples taken under two separate conditions. Algorithms that discriminate one condition from another by comparing spectral differences are called mass spectrometry classification algorithms, or classifiers. For example, one mass spectra data set may be a control data set with a known marker or markers for identifying a certain disease state, and the other mass spectra data set may be a sample that has not been classified. The classifier may compare the unclassified sample against the control data set to determine whether the sample has any of the markers from the control data set, and may therefore be used to classify the sample as having the disease state. Various types of classifiers apply different algorithms to these types of problems, including Classification and Regression Trees (CART), artificial neural networks, and linear discriminant analysis.
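To make the idea of a classifier concrete, the following sketch assigns an unclassified spectrum to whichever labeled group it most resembles. It uses a nearest-centroid rule as a deliberately simple stand-in; it is not CART, a neural network, or linear discriminant analysis, and it is not presented as the approach of any particular system. The function names, labels, and intensities are hypothetical, and all spectra are assumed to share one m/z grid.

```python
import math


def centroid(spectra):
    """Average a group of spectra bin by bin."""
    count = len(spectra)
    return [sum(s[i] for s in spectra) / count for i in range(len(spectra[0]))]


def classify(sample, centroids):
    """Return the label whose centroid is closest to the sample.

    Closeness is plain Euclidean distance over the intensity vector.
    """
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


# Labeled reference spectra (hypothetical toy values).
training = {
    "disease": [[1.0, 5.0, 2.0], [1.2, 5.5, 2.1], [0.9, 4.8, 1.9]],
    "healthy": [[1.1, 1.0, 2.0], [1.0, 1.2, 2.2], [0.8, 0.9, 1.8]],
}
centroids = {label: centroid(spectra) for label, spectra in training.items()}

print(classify([1.0, 4.9, 2.0], centroids))  # disease
print(classify([1.0, 1.1, 2.0], centroids))  # healthy
```

A rule this simple weighs every m/z bin equally, so noisy bins that carry no marker can dominate the distance.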
The accuracy and running time of classifiers in discriminating between separate conditions are affected by the quality and preparation of the mass spectra data. Spectra obtained from mass spectrometry machines are noisy signals that contain many peaks, some of which may correspond to markers. More expensive machines can produce less noisy data. However, differences in peaks are not guaranteed to differentiate between two conditions. Furthermore, there may be differentiating signals that do not appear differentially expressed because of the noise, or that are otherwise not easily distinguished in the patterns of the mass spectra data. For example, a smaller peak that follows a larger peak may not stand out because of the smearing effect of the data pattern of the larger peak.
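One way to picture the smearing problem is with a naive peak picker. The sketch below is an illustration under stated assumptions, not a description of any instrument's processing: the moving-average smoother, the threshold value, and the toy signal are all hypothetical. A small peak sitting on the tail of a large one is visible in the raw signal but disappears once the signal is smoothed to suppress noise.

```python
def moving_average(signal, window=3):
    """Smooth a signal with a centered moving average (shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def find_peaks(signal, threshold):
    """Return indices of strict local maxima whose intensity exceeds threshold."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] > threshold
        and signal[i] > signal[i - 1]
        and signal[i] > signal[i + 1]
    ]


# Toy intensities: a large peak at index 5 and a small one at index 8 on its tail.
raw = [0.1, 0.3, 0.1, 0.2, 5.0, 9.0, 5.2, 2.0, 2.6, 1.0, 0.2, 0.3, 0.1]

print(find_peaks(raw, threshold=1.0))                  # [5, 8]
print(find_peaks(moving_average(raw), threshold=1.0))  # [5]
```

The same smoothing that removes low-level noise also flattens the smaller peak into the slope of the larger one, which is why preparation of the data can change what a classifier is able to see.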
Identifying markers is an important step in discriminating between two conditions, such as in the diagnosis of diseases. Classifiers can be time-consuming and expensive to run when identifying markers, especially when working with raw mass spectrum intensity signals in which the markers are unknown. Furthermore, it is not readily apparent which characteristics of mass spectra data patterns may represent a potential marker. Therefore, methods and systems are desired that improve the accuracy of classifiers and provide better classification of mass spectra.