Embodiments of the present invention relate generally to analyzing indexed data and more specifically to methods of identifying and/or characterizing features in indexed data, for example spectral data, and to devices for performing such methods.
As used herein, the term "indexed dataset" or "spectrum" refers to a collection of measured values, called responses, where each response is related to one or more of its neighboring elements. The relationship between a response and its neighboring elements may be, for example, categorical, spatial or temporal. In addition, the relationship may be explicitly stated or implicitly understood from knowing the type of response data and/or how such data were obtained. When a unique index, either one-dimensional or multi-dimensional, is assigned to each response, the data are considered indexed. One-dimensional indexed data are defined as data in ordered pairs (index value, response). The index values represent values of a physical parameter such as time, distance, frequency, or category; the responses can include but are not limited to signal intensity, particle or item counts, or concentration measurements. A multi-dimensional indexed dataset or spectrum is also ordered data, but with each response indexed to a value for each dimension of a multi-dimensional array. Thus a two-dimensional matrix has a unique row and column address for each response (index value 1, index value 2, response).
The identification and/or characterization of significant or useful features in the analysis of indexed data is a classic problem. Often this problem is reduced to separating the desired signal from undesired noise by, for example, identifying peaks that may be of interest. For indexed data, each such peak appears as a deviation, that is to say a rise and a fall, in the responses over consecutive indices. However, background noise can also produce such deviations in the responses, leading, for example, to false peaks being included in indexed data.
Traditionally, peak detection has been based upon identifying responses above a threshold value. Whether this peak detection has been performed manually or by use of an automated tool, threshold selection has been a critical feature that has resisted an objective solution. Thus such previously known methods for threshold selection typically require arbitrary and subjective operator/analyst-dependent decision-making and are therefore an art. The effectiveness of such artful decision making, and as a result of peak detection, using these known traditional methods is also affected by signal to noise ratio, signal drift, and variations in the baseline signal. Consequently, the operator/analyst has often had to apply several thresholds to the responses over different regions of indices to capture as much signal as possible. This approach has proven difficult to reproduce, suffers from substantial signal loss, and is subject to operator/analyst uncertainty.
An example of the problems with traditional peak detection and characterization algorithms and methods is illustrated by the development of statistical analysis methods for MALDI-MS (matrix-assisted laser desorption/ionization mass spectrometry). The MALDI-MS process begins with an analyte of interest placed on a sample plate and mixed with a matrix. The matrix is a compound selected to absorb specific wavelengths of light that are emitted by a selected laser. Light from such laser is then directed at the analyte mixture, causing the matrix material, selected to absorb the light energy, to become ionized. This ionization of the matrix material in turn ionizes some molecules of the analyte, which become analyte ions 100 (FIG. 1). A charge is applied at a detector 104 to attract analyte ions 100 through a flight tube 102 and ultimately to detector 104, where detector 104 measures a mass and ionic charge of each ion 100 that arrives over a time interval. This number, or abundance of ions over time, is converted using mass and charge data to an abundance of ions as a function of a mass/charge (m/z) ratio. Since ions 100 arrive at detector 104 in a disperse packet which spans multiple sampling intervals, ions 100 are binned and counted over several m/z units as illustrated in FIG. 2. Currently used algorithms require an operator/analyst to specify a detection threshold 200 for the intensities observed so that only peaks 202 that exceed this specified threshold will be detected and characterized. This procedure for setting the detection threshold appears conceptually appealing and suggests that m/z values for which no ions are present will read baseline relative abundance, while m/z values for which ions are present will result in a peak.
However, as a result of this procedure peaks 202 detected for a specific analyte are not only dependent on the MALDI-MS instrument used but also on the skill of the operator/analyst in setting the detection threshold 200 used for the analysis. If such a user-defined threshold 200 is too low, noise can erroneously be characterized as a peak, whereas if threshold 200 is too high, small peaks might be erroneously identified as noise. Thus the manual setting of detection threshold 200 induces variability that makes accurate statistical characterization of MALDI-MS spectra difficult, such variability decreasing even further the effectiveness of current peak detection algorithms. Also related to the problem of distinguishing signals from noise is the bounding uncertainty of the signal. It is well known that replicate analyses of a sample often produce slightly different indexed data due to instrument variability and other factors not tied to an operator/analyst.
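The sensitivity of threshold-based detection to the chosen threshold can be illustrated with a minimal sketch. The function name `threshold_peaks`, the toy intensities, and the three threshold values are hypothetical and chosen only for illustration; they do not reproduce any particular instrument's algorithm.

```python
def threshold_peaks(intensities, threshold):
    """Return positions of local maxima whose intensity exceeds `threshold`.

    A simplified stand-in for threshold-based peak detection (assumption:
    a peak is a strict local maximum above the threshold).
    """
    peaks = []
    for i in range(1, len(intensities) - 1):
        is_local_max = (intensities[i] > intensities[i - 1]
                        and intensities[i] > intensities[i + 1])
        if is_local_max and intensities[i] > threshold:
            peaks.append(i)
    return peaks


# Hypothetical spectrum: noise near 2-3, a small peak (9), a large peak (30).
spectrum = [2, 3, 2, 9, 2, 3, 2, 30, 3, 2]

low = threshold_peaks(spectrum, 2.5)    # threshold too low: noise bumps reported
mid = threshold_peaks(spectrum, 5.0)    # both genuine peaks, no noise
high = threshold_peaks(spectrum, 20.0)  # threshold too high: small peak lost
```

In this toy case the same data yield four, two, or one detected peaks depending only on the analyst's choice of threshold.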
Thus, it would be advantageous, in the art of indexed data collection and analysis, for there to be methods of processing indexed data that provide greater confidence in identification/characterization of feature(s). In addition, it would be advantageous if such methods also provided for greater confidence in separating actual signals from noise with less signal loss, and that such methods are robust and minimize adverse effects of low signal to noise ratio, signal drift, varying baseline signal, boundary uncertainties and combinations thereof. In addition, it would be advantageous for such methods to be applicable to multi-dimensional arrays as well as for characterizing multi-dimensional uncertainty of signals. Finally, it would be advantageous for such methods to provide some or all of the aforementioned advantages while providing greater automation than currently available.
Methods for identifying features in an indexed dataset or spectrum are provided. Whereas prior methods focused on comparing responses such as signal intensities to a response or signal intensity threshold, embodiments in accordance with the present invention combine such responses with indices, for example, mass/charge (m/z) ratio values. More specifically, embodiments of the present invention consider such signal intensities, or any other measured response, as a histogram of indices, and use this histogram concept to construct a measure of dispersion of indices. The responses associated with each of the indices are used as histogram frequencies in measuring the dispersion of indices. Comparison of the index dispersion, e.g. an intensity weighted variance (IWV), to a dispersion critical value or critical threshold provides for the identification or determination of significant or useful feature(s). Thus, some methods of the present invention encompass, but are not limited to:
(a) selecting a subset of indices, the subset being encompassed by a window-of-interest, the subset having at least one beginning index and at least one ending index that are usable for computing a measure of dispersion;
(b) computing a measure of dispersion for the subset of indices using a subset of responses corresponding to the subset of indices; and
(c) comparing the measure of dispersion to a dispersion critical value.
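Steps (a) through (c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the IWV is taken to be the ordinary variance of the indices weighted by the responses, the reference is the IWV of the same window with equal responses, and the critical value is a hypothetical fraction (`critical_fraction`) of that reference. The function names and the direction of the comparison are likewise assumptions.

```python
def intensity_weighted_variance(indices, responses):
    """Variance of `indices`, using `responses` as histogram frequencies."""
    total = sum(responses)
    if total <= 0:
        raise ValueError("responses must have a positive sum")
    mean = sum(x * w for x, w in zip(indices, responses)) / total
    return sum(w * (x - mean) ** 2 for x, w in zip(indices, responses)) / total


def window_has_feature(indices, responses, start, stop, critical_fraction=0.5):
    """Apply steps (a)-(c) to the window indices[start:stop]."""
    # (a) select the subset of indices inside the window-of-interest
    sub_idx = indices[start:stop]
    sub_resp = responses[start:stop]
    # (b) compute the measure of dispersion from the matching responses
    iwv = intensity_weighted_variance(sub_idx, sub_resp)
    # Reference dispersion: same indices, equal (noise-like) responses.
    flat = intensity_weighted_variance(sub_idx, [1.0] * len(sub_idx))
    # (c) compare to a critical value (hypothetical: a fraction of `flat`)
    return iwv < critical_fraction * flat


indices = list(range(11))
noise_only = [1] * 11
with_peak = [1, 1, 1, 1, 1, 50, 1, 1, 1, 1, 1]

noise_flag = window_has_feature(indices, noise_only, 0, 11)
peak_flag = window_has_feature(indices, with_peak, 0, 11)
```

Here the flat window is not flagged, while the window containing a concentrated response is, because the concentrated response pulls the weighted dispersion well below that of the flat reference.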
In addition, some methods in accordance with the present invention encompass, but are not limited to:
(a) selecting a subset of indices, the subset being encompassed by a window-of-interest, the subset having at least one beginning index and at least one ending index that are usable for computing an intensity weighted variance (IWV);
(b) computing the intensity weighted variance (IWV) for the subset of indices using a subset of responses corresponding to the subset of indices;
(c) computing an intensity weighted covariance (IWCV) for the subset of indices using a subset of responses corresponding to the subset of indices;
(d) comparing the IWV to a critical value determined from the statistical properties of the IWV; and
(e) comparing the IWCV to a critical value determined from the statistical properties of the IWCV.
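For a two-dimensional window, the quantities in steps (b) and (c) can be sketched as below. The formulas are an assumption: the IWV along each dimension and the IWCV are taken to be the ordinary response-weighted variance and covariance of the two index values. The function name `weighted_moments_2d` and the toy 5 x 5 windows are hypothetical, and no critical values are computed here.

```python
def weighted_moments_2d(rows, cols, responses):
    """Return (IWV along rows, IWV along columns, IWCV) for a 2-D window.

    responses[i][j] is the response at index pair (rows[i], cols[j]).
    """
    total = sum(sum(line) for line in responses)
    if total <= 0:
        raise ValueError("responses must have a positive sum")
    mean_r = sum(rows[i] * w
                 for i, line in enumerate(responses) for w in line) / total
    mean_c = sum(cols[j] * w
                 for line in responses for j, w in enumerate(line)) / total
    var_r = var_c = cov = 0.0
    for i, line in enumerate(responses):
        for j, w in enumerate(line):
            dr = rows[i] - mean_r
            dc = cols[j] - mean_c
            var_r += w * dr * dr
            var_c += w * dc * dc
            cov += w * dr * dc
    return var_r / total, var_c / total, cov / total


axis = list(range(5))
# Noise-only window: equal response everywhere.
flat = [[1.0] * 5 for _ in range(5)]
# Hypothetical feature: extra response along the diagonal.
ridge = [[11.0 if i == j else 1.0 for j in range(5)] for i in range(5)]

flat_moments = weighted_moments_2d(axis, axis, flat)
ridge_moments = weighted_moments_2d(axis, axis, ridge)
```

In this toy case the diagonal ridge leaves the per-dimension IWV unchanged, because each row and each column still carries the same total response, yet the IWCV moves away from zero. This suggests why a covariance test can complement the variance test for multi-dimensional data.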
For MALDI-MS, the index values are generally m/z ratios and the responses are the corresponding intensities. Each index value represents a specific m/z ratio, and its corresponding intensity measurement represents the relative abundance of ions having that specific m/z ratio. Thus a MALDI-MS spectrum can be thought of as a histogram of m/z ratios that depicts the relative abundance of each m/z ratio measured.
From this histogram concept, features in the spectrum can be identified and characterized by comparing some of the properties of a histogram for any window-of-interest to the corresponding properties for a hypothesized noise-only distribution. In some embodiments of this invention, this noise-only distribution is used as a criterion for distinguishing spectral features or peaks that are due to an actual signal from those spectral features that are due to noise. In particular, when no transient feature or actual signal is present in a first window-of-interest, the neighborhood intensity is relatively constant.
In one-dimensional applications, a histogram created from the data collected from within the first window-of-interest will essentially be a one-dimensional (1-D) discrete uniform distribution, which is understood to be a histogram where the intensity is approximately the same for all bins. On the other hand, where an actual signal or transient feature is present within a second window-of-interest, the distribution of intensities across the window will be unequal and a histogram created from the data of that second window will show at least one bin with an intensity unequal to the other bins. Thus differences in the distribution of intensities or signals from one window-of-interest to another are advantageously employed to detect the presence of an actual signal or peak within a spectrum or indexed dataset. As mentioned above, for MALDI-MS, index values or bins are generally m/z ratios and the responses are generally the corresponding intensities. However, other index values and responses can be used to form an indexed dataset or spectrum. For example, some spectra that can be evaluated by embodiments in accordance with the present invention encompass an index value that is a physical displacement from a point of origin and a response that represents an intensity at that displacement. In addition, embodiments of the present invention can also be employed to evaluate a multi-dimensional spectrum or multi-indexed dataset. Thus, as will be discussed, some embodiments are advantageously used to detect and/or characterize transient features from datasets that incorporate a first index value, a second index value and a response.
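As a worked illustration of the noise-only case, the dispersion of a discrete uniform histogram has a closed form. The derivation below assumes a window of N consecutive indices with unit spacing and exactly equal responses; the symbols a, N, \bar{k} and \Delta are introduced here for illustration only.

```latex
% Window of N consecutive indices k = a, a+1, ..., a+N-1, equal responses.
\bar{k} = a + \frac{N-1}{2},
\qquad
\mathrm{IWV}_{\text{uniform}}
  = \frac{1}{N} \sum_{k=a}^{a+N-1} \left( k - \bar{k} \right)^{2}
  = \frac{N^{2} - 1}{12}.
% For an index spacing \Delta other than 1, multiply the result by \Delta^{2}.
```

For example, a window of N = 11 unit-spaced indices gives a noise-only dispersion of (121 - 1)/12 = 10, which serves as a natural reference against which the dispersion of an observed window can be compared.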
Advantages of embodiments of the present invention include minimizing the effects of signal to noise ratio, signal drift, varying baseline signal and combinations thereof. In addition, such embodiments of the present invention provide for the automation of transient feature detection and data reduction by minimizing or eliminating the need for user selection of a threshold and by automatic, iterative scans of the data using windows of interest of varying sizes where a first window size is selected based on the resolution of the instrument providing the data.
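The automated, iterative scanning described above might be sketched as follows. Several details are assumptions made only for illustration: the window size doubles on each pass, the critical value is a fixed fraction of the noise-only dispersion for the same window, and a window is flagged when its dispersion falls below that value. The names `iwv`, `scan`, `first_size`, `n_passes` and `critical_fraction` are hypothetical.

```python
def iwv(indices, responses):
    """Response-weighted variance of the indices (assumed IWV definition)."""
    total = sum(responses)
    mean = sum(x * w for x, w in zip(indices, responses)) / total
    return sum(w * (x - mean) ** 2 for x, w in zip(indices, responses)) / total


def scan(indices, responses, first_size, n_passes=3, critical_fraction=0.5):
    """Slide windows of growing size; return (start, stop) of flagged windows.

    `first_size` would be chosen from the instrument resolution; doubling
    the size on each pass is an illustrative assumption.
    """
    flagged = []
    size = first_size
    for _ in range(n_passes):
        if size > len(indices):
            break
        for start in range(len(indices) - size + 1):
            stop = start + size
            sub_idx = indices[start:stop]
            sub_resp = responses[start:stop]
            if sum(sub_resp) <= 0:
                continue  # no counts in this window, nothing to measure
            reference = iwv(sub_idx, [1.0] * size)  # noise-only dispersion
            if iwv(sub_idx, sub_resp) < critical_fraction * reference:
                flagged.append((start, stop))
        size *= 2
    return flagged


indices = list(range(20))
noise_only = [1.0] * 20
with_peak = [1.0] * 20
with_peak[10] = 50.0  # hypothetical transient feature at index 10

flagged_noise = scan(indices, noise_only, first_size=5)
flagged_peak = scan(indices, with_peak, first_size=5)
```

No user-selected intensity threshold appears in this sketch; the only inputs are the starting window size and the assumed critical fraction.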
Other embodiments in accordance with the present invention encompass equipment that is configured to perform the methods described herein. Thus such embodiments include a general purpose computer apparatus having program code effective to perform the methods of the present invention. Still other embodiments of the present invention encompass analytical instruments configured to both collect and analyze data.
The subject matter of the present invention is particularly pointed out and distinctly claimed in the concluding portion of this specification. However, both the organization and method of operation, together with further advantages and objects thereof, may best be understood by reference to the following description taken in connection with accompanying drawings wherein like reference characters refer to like elements.