Data is often analyzed (and experimented with) using data processing algorithms (e.g., to automate the data analysis). The data processing algorithms can include a set of data processing parameters that can be set and/or adjusted to configure how the algorithm processes the data. Typically, sample data (e.g., real-world data) is collected and used to configure the data processing parameters for a particular experiment. For example, input and output data for a particular process can be collected and used to generate a model for the experiment. The data processing parameters for the data processing algorithm(s) used in the experiment can be adjusted based on the model (e.g., so the data processing algorithm(s) can predict unknown output data based on available input data). Often, the configuration of the data processing parameters impacts the efficiency of data analysis and experimentation.
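As a minimal sketch of the idea above, the following Python example fits a simple model to collected input and output data and then uses it to predict an unknown output. The straight-line model, the function names `fit_line` and `predict`, and the sample values are illustrative assumptions only; they do not describe any particular data processing algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, x):
    """Predict an output value for input x using the fitted parameters."""
    slope, intercept = params
    return slope * x + intercept


# Hypothetical sample data collected for one process.
inputs = [1.0, 2.0, 3.0, 4.0]
outputs = [3.0, 5.0, 7.0, 9.0]

# The fitted parameters play the role of the configured processing parameters.
params = fit_line(inputs, outputs)

# Predict the unknown output for an input that was not in the sample data.
prediction = predict(params, 5.0)
```

Here the fitted slope and intercept stand in for the data processing parameters: they are set from the sample data and then reused on new inputs.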
As an example, metabolomics generally refers to the systematic study of the unique chemical fingerprints that specific metabolic processes leave behind; specifically, metabolomics is the study of the small-molecule metabolite profiles that make up those fingerprints. The by-products of metabolic processes are referred to as metabolites. A metabolome represents the collection of metabolites, which are the end products of cellular processes, in a biological cell, tissue, organ or organism. Metabolic profiling can give a snapshot of the physiology of a cell, which advantageously provides insight into what is happening to the cell (e.g., during a cellular process).
Studies in the field of metabolomics often involve several steps to proceed from a hypothesis (e.g., a group or category of metabolites of interest, such as fatty acids, oxidized lipids, nucleosides, etc.) to biological interpretation. These steps may include experimental planning, sampling, storage and pre-treatment of data samples, instrumental analysis, data processing and multivariate statistical modeling, validation and/or interpretation. The end result of a metabolomic study can be highly dependent on how well each step in this exemplary chain of events has been conducted. Therefore, the quality of an end result depends on the weakest link of the process. For example, one poorly conducted processing step can compromise the entire experiment or evaluation.
To extract interpretable, reliable and reproducible information, standardized protocols have been proposed for many of these metabolomics experimentation steps. However, some of the experimentation steps, such as the data processing step, have not been standardized. As a result, the data processing step is still optimized ad hoc, for example by trial and error based on user experience, or is simply run with default settings for the data processing parameters.
Usually, the quality of the results in the metabolomics data processing stage is determined by the quantity of detected spectral peaks in a particular sample, without regard to the quality of individual peaks and/or the proportion of noisy peaks or other signal artifacts (which may be unrelated to the actual samples and/or the underlying hypothesis). The peaks represent, for example, small-molecule metabolites (such as metabolic intermediates, hormones and other signaling molecules, and secondary metabolites) found within a biological sample. However, if noisy peaks and/or peaks unrelated to the sample are not removed, such peaks can limit the reliability of the results.
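The limitation of judging results by peak count alone can be illustrated with a small Python sketch. The local-maximum peak finder, the `min_height` parameter, and the intensity values below are hypothetical, and an intensity threshold is only a crude stand-in for peak quality; the sketch is not the peak detection used by any particular metabolomics tool.

```python
def find_peaks(signal, min_height):
    """Return indices of local maxima whose intensity is at least min_height."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] > signal[i - 1]
        and signal[i] > signal[i + 1]
        and signal[i] >= min_height
    ]


# Hypothetical intensity trace: two strong peaks on a noisy baseline.
signal = [0.1, 0.3, 0.1, 5.0, 0.2, 0.4, 0.1, 8.0, 0.3, 0.2, 0.5, 0.1]

# A permissive parameter setting maximizes the number of detected peaks.
permissive = find_peaks(signal, min_height=0.2)

# A stricter setting keeps only the high-intensity peaks.
strict = find_peaks(signal, min_height=1.0)

# Fraction of the permissive result that the stricter setting rejects.
rejected_fraction = (len(permissive) - len(strict)) / len(permissive)
```

With these values the permissive setting reports more peaks than the strict one, yet the additional peaks all come from the low-intensity baseline, so a higher peak count does not by itself indicate a better parameter setting.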
For example, in untargeted metabolomics analysis, the objective is to find as many potential biomarkers as possible associated with the underlying hypothesis, with relatively little a priori information. In the data processing step, the task of optimizing the data processing parameter settings becomes difficult, because there is no easy and accurate way of assessing the quality of an integrated spectral peak without extensive statistical testing and investigation of the variables from the perspective of biological context. However, extensive statistical testing and investigation require both time and resources that are often not available at the data processing stage.