The present invention relates to analyzing chemical reactions and, more particularly, but not exclusively, to systems and methods for automatically identifying outliers among chemical reaction assays.
An outlier may be indicative of a measurement error, a contaminated sample, a human error, an experimental error, etc., as known in the art.
Traditionally, the classification of chemical assays is based on manual examination by an expert in the field.
The expert manually examines hundreds or thousands of samples, say thousands of graphs derived from results of Quantitative Fluorescent Polymerase Chain Reaction (QF-PCR) based assays, or other chemical assays.
The expert detects certain features in the samples, and classifies each sample into one of two or more groups.
Typically, the results of the chemical assays are obtained through real-time photometric measurements of reactions such as real-time Polymerase Chain Reaction (PCR) and Quantitative Fluorescent Polymerase Chain Reaction (QF-PCR), thus producing a time series of values.
The values produced through the measurements may be represented in a two-dimensional graph depicting spectral changes over time, say of a real-time PCR based assay.
The values may also be represented in a three-dimensional graph depicting spectral changes vs. molecule length vs. time, say of a Capillary PCR based assay, etc., as known in the art.
For example, the spectral changes may include Fluorescence Intensity (FI) values measured over a PCR reaction apparatus, as known in the art. The measured FI values are indicative of presence and quantity of specific molecules, as detected in the PCR reaction.
The measured values may be used to classify the chemical reaction assay into a certain type, to determine if the assay is positive or negative (say with respect to occurrence of a certain genetic mutation), etc.
For example, in QF-PCR, a graph representing the values measured over time may have linear properties, which indicate that no amplification takes place in a reaction apparatus.
Alternatively, the QF-PCR graph may include a sigmoid curve interval, which indicates the occurrence of a DNA amplification reaction in the reaction apparatus.
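By way of a non-limiting illustration, the distinction between a linear graph and a graph that includes a sigmoid curve interval may be sketched as follows. The sketch fits a least-squares line to the measured values and reports amplification when the largest deviation from that line exceeds a tolerance. The function names, the tolerance value, and the synthetic curves are assumptions made for illustration only, and are not taken from any particular assay or from the traditional methods described herein.

```python
import math


def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept). Requires at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_line_residual(values):
    """Largest absolute deviation of the values from their fitted line."""
    slope, intercept = fit_line(values)
    return max(abs(y - (slope * i + intercept)) for i, y in enumerate(values))


def shows_amplification(values, tolerance=0.05):
    """True when the curve deviates from a straight line by more than
    `tolerance` (a hypothetical threshold, in fluorescence units)."""
    return max_line_residual(values) > tolerance


# Synthetic, hypothetical curves over 40 cycles.
cycles = range(40)
linear_curve = [0.10 + 0.001 * c for c in cycles]
sigmoid_curve = [0.10 + 1.0 / (1.0 + math.exp(-0.5 * (c - 20))) for c in cycles]

print(shows_amplification(linear_curve))
print(shows_amplification(sigmoid_curve))
```

A real assay would typically require a tolerance chosen with respect to the measurement noise of the specific reaction apparatus.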
Parameters extracted from the graph are used to determine the properties of the amplification.
The right combination of parameters, say slopes of the graph at selected points on the graph, may indicate the existence of a specific subject (say the existence of a specific bacterial DNA sequence).
The traditional methods rely on a model built manually by the expert.
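As a minimal sketch of such parameter extraction, the slope of the graph at a selected point may be approximated by a central difference between the neighboring measurements. The helper names, the selected cycle indices, and the synthetic curve below are hypothetical and serve only to illustrate the idea.

```python
import math


def slope_at(values, index):
    """Central-difference slope (value units per cycle) at an interior point."""
    if not 0 < index < len(values) - 1:
        raise ValueError("index must refer to an interior point")
    return (values[index + 1] - values[index - 1]) / 2.0


def slope_features(values, indices):
    """Slopes at the selected points, usable as parameters of one sample."""
    return [slope_at(values, i) for i in indices]


# Synthetic, hypothetical sigmoid curve over 40 cycles, centered at cycle 20.
curve = [0.10 + 1.0 / (1.0 + math.exp(-0.5 * (c - 20))) for c in range(40)]

# Slopes early in the run, at the midpoint, and late in the run.
features = slope_features(curve, [5, 20, 35])
print(features)
```

In this sketch the slope is steepest at the midpoint of the sigmoid interval and close to zero on the baseline and on the plateau, so the three values together characterize the shape of the curve.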
In order to build the model, the expert has to manually examine hundreds or thousands of samples of a training set.
The expert may position points in a coordinate system, say on paper or on a computer screen. Each of the points represents one of the samples. The position of each of the points depends on the parameters extracted from the graph, and represents the results of a respective assay (i.e. a single one of the samples).
Then, the expert classifies each of the samples (as represented by points) into one of two or more groups.
Finally, the expert manually draws a line defining a separation between the two or more groups. For example, the expert may draw a line defining a separation between positive and negative samples.
A new sample may thus be classified into one of the groups based on the position of the point which represents the new sample, on one side or the other of the line which defines the separation between the groups.
Occasionally, while building the training set, the expert may find some of the samples problematic and difficult to classify into one of the groups, and may thus regard the problematic samples as outliers.
Some currently used methods are based on automatic classification of samples. For example, some of the currently used methods use SVM (Support Vector Machine) to identify patterns in biological systems.
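The classification of a point by its side of the separating line may be illustrated by the following sketch, in which the line is given by two points on it and the side is determined by the sign of a two-dimensional cross product. The function names, the example line, and the choice of which side is labeled positive are assumptions made for illustration only.

```python
def side_of_line(point, line_start, line_end):
    """Signed value that is > 0 when `point` lies to the left of the directed
    line from `line_start` to `line_end`, < 0 to the right, 0 on the line."""
    (px, py), (x1, y1), (x2, y2) = point, line_start, line_end
    return (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)


def classify(point, line_start, line_end):
    """Assign a group by side of the line. Labeling the left side as
    'positive' is an arbitrary, hypothetical convention."""
    side = side_of_line(point, line_start, line_end)
    if side > 0:
        return "positive"
    if side < 0:
        return "negative"
    return "on the line"


# Hypothetical separating line drawn from (0, 0) to (1, 1).
print(classify((0.0, 1.0), (0.0, 0.0), (1.0, 1.0)))
print(classify((1.0, 0.0), (0.0, 0.0), (1.0, 1.0)))
```

A point that falls exactly on the line, or very close to it, cannot be assigned with confidence to either group.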
Support Vector Machines (SVMs) are a set of related supervised learning methods that analyze data and recognize patterns. Supervised learning methods are widely used for classification and regression analysis.
A standard SVM may take a set of input data and predict, for each given input, which of two possible categories the input belongs to.
Given a set of training samples, each marked as belonging to one of the two categories, an SVM training algorithm builds a model usable for assigning a new sample into one category or the other.
Intuitively, the SVM-built model is a representation of the samples as points in space, mapped so that the samples of the separate categories are divided by a clear gap.
Consequently, new samples may be mapped into that same space and predicted to belong to one of the categories, based on which side of the gap the new samples fall on.
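For illustration only, the training and prediction steps of a linear two-category SVM may be sketched as follows. The sketch minimizes a regularized hinge loss by batch subgradient descent, which is one simple way among several to train a linear SVM. It is not presented as the training algorithm of any particular currently used method. The function names, the hyperparameter values, and the toy training set are hypothetical. In practice an established library implementation, such as the `SVC` class of scikit-learn, would usually be preferred.

```python
def train_linear_svm(samples, labels, lam=0.01, learning_rate=0.01, epochs=5000):
    """Minimize lam/2 * |w|^2 + mean hinge loss by batch subgradient descent.

    `labels` must contain +1 or -1. Returns (weights, bias).
    All default hyperparameters are illustrative assumptions.
    """
    n = len(samples)
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularization term
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # sample lies inside the gap or on the wrong side
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def predict(model, x):
    """Return +1 or -1 according to the side of the gap on which x falls."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical training set: each sample is a pair of extracted parameters.
training_samples = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
                    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
training_labels = [1, 1, 1, -1, -1, -1]

model = train_linear_svm(training_samples, training_labels)
print(predict(model, (3.0, 2.0)))
print(predict(model, (0.5, 0.5)))
```

The toy training set is linearly separable, so every training sample ends up on the correct side of the gap. Real assay data may not be separable, in which case some samples remain inside the gap or on the wrong side.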