It is known to use mass spectrometry to identify microorganisms, and more particularly bacteria. A sample of the microorganism is prepared, after which a mass spectrum of the sample is acquired and pre-processed, particularly to eliminate the baseline and to eliminate the noise. The peaks of the pre-processed spectrum are then detected and the list of peaks thus obtained is “analyzed” and “compared”, by means of classification tools, with data of a knowledge base built from lists of peaks, each associated with an identified microorganism or group of microorganisms (strain, class, family, etc.).
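The acquisition chain described above (baseline removal, noise reduction, peak detection) can be sketched as follows. This is only a minimal illustration under stated assumptions: the rolling-minimum baseline, the 3-point moving average, the intensity threshold, and the names `preprocess` and `detect_peaks` are hypothetical choices made for the example and are not taken from any particular instrument or software.

```python
def preprocess(intensities, window=5):
    """Subtract a rolling-minimum baseline, then smooth with a 3-point average.

    Both choices are assumptions made for illustration only.
    """
    n = len(intensities)
    half = window // 2
    # Baseline estimate: minimum intensity inside a sliding window.
    baseline = [min(intensities[max(0, i - half): i + half + 1]) for i in range(n)]
    corrected = [y - b for y, b in zip(intensities, baseline)]
    # Noise reduction: average each point with its immediate neighbours.
    smoothed = []
    for i in range(n):
        chunk = corrected[max(0, i - 1): i + 2]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def detect_peaks(mz, intensities, threshold):
    """Return (m/z, intensity) pairs for strict local maxima above threshold."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y > threshold and y > intensities[i - 1] and y > intensities[i + 1]:
            peaks.append((mz[i], y))
    return peaks


# Toy spectrum (invented values): constant offset of 2 plus two peaks.
mz = [1000 + i for i in range(11)]
raw = [2, 2, 4, 10, 4, 2, 2, 3, 8, 3, 2]
peak_list = detect_peaks(mz, preprocess(raw), threshold=1.0)
peak_positions = [m for m, _ in peak_list]
```

The resulting `peak_positions` list plays the role of the list of peaks that is subsequently passed to the classification tools.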
Among classification tools, SVM (“Support Vector Machine”) classifications of “one versus all” type are known (hereafter, “OVA SVM”). A “one versus all” SVM classification comprises determining, for each class of objects of a set of classes, an oriented boundary which separates this class from the other classes in the set. As many “one-vs.-all” classifiers as there are classes in the set are thus obtained. The identification of an unknown object then comprises querying each of the classifiers by calculating the algebraic distance between the unknown object and the boundary associated with the classifier. Usually, the unknown object is determined as belonging to the class associated with the largest calculated distance.
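For the particular case of linear boundaries, the decision rule just described can be written compactly. The notation is an assumption introduced here for illustration: $x$ is the vector of the unknown object, $K$ the number of classes, and $(w_k, b_k)$ the normal vector and offset of the oriented boundary of class $k$.

```latex
d_k(x) = \frac{w_k^{\top} x + b_k}{\lVert w_k \rVert}, \qquad k = 1, \dots, K,
\qquad
\hat{k} = \arg\max_{k \in \{1, \dots, K\}} d_k(x)
```

Here $d_k(x)$ is the algebraic distance to the boundary of class $k$, positive on the side toward which the boundary is oriented, and $\hat{k}$ is the class retained for the unknown object.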
This principle is schematically illustrated in FIGS. 1 and 2, which show the very simple case of three microorganisms that can be identified by means of only two peaks in a mass spectrum, for example the two peaks of highest intensity in the mass spectrum of each of the three microorganisms. The first microorganism is characterized by a first peak located at a value m11 and a second peak located at a value m12 (FIG. 1A), the second microorganism is characterized by a first peak located at a value m21 and a second peak located at a value m22 (FIG. 1B), and the third microorganism is characterized by a first peak located at a value m31 and a second peak located at a value m32 (FIG. 1C).
The OVA SVM classification comprises, first, acquiring a set of training mass spectra of each of the microorganisms and determining the location of the two concerned peaks in each spectrum, to form a set of training vectors $\begin{pmatrix} p_1 \\ p_2 \end{pmatrix}$, $p_1$ being the measured position of the first peak and $p_2$ being the measured position of the second peak. Due to the measurement uncertainty, a dispersion of the values of the vectors can be observed. In a second step, for each microorganism, a boundary separating the set of vectors $\begin{pmatrix} p_1 \\ p_2 \end{pmatrix}$ associated with this microorganism from the vectors $\begin{pmatrix} p_1 \\ p_2 \end{pmatrix}$ associated with the two other microorganisms is calculated. Three boundaries F1, F2, and F3 are thus obtained, as shown in FIG. 2, and are provided with a direction, for example, that indicated by the arrows in dotted lines.
The identification of an unknown microorganism then comprises acquiring one or a plurality of mass spectra of the microorganism, deducing therefrom a vector M of measured peaks $\begin{pmatrix} p_1 \\ p_2 \end{pmatrix}$, and calculating the algebraic distance, also called “margin”, of this vector M to each of the oriented boundaries F1, F2, and F3. An algebraic distance vector, for example equal to $\begin{pmatrix} -0.4 \\ +0.3 \\ -1.3 \end{pmatrix}$, is thus obtained. In the very simple illustrated case, it could thus be deduced that the unknown microorganism is the second microorganism, the second distance being the largest.
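The identification step just described can be sketched in a few lines, assuming linear boundaries. The boundary parameters and the measured vector below are hypothetical: they were chosen only so that the computed distances reproduce the example values -0.4, +0.3, and -1.3, and they do not correspond to the boundaries of FIG. 2.

```python
import math


def signed_distance(w, b, x):
    """Algebraic distance from point x to the oriented hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def identify(boundaries, x):
    """Query every one-versus-all boundary and keep the largest distance.

    Returns the zero-based index of the retained class and all distances.
    """
    distances = [signed_distance(w, b, x) for w, b in boundaries]
    best = max(range(len(distances)), key=distances.__getitem__)
    return best, distances


# Hypothetical oriented boundaries F1, F2, F3, each given as (w, b).
boundaries = [
    ((-1.0, 0.0), 1.6),
    ((0.0, 1.0), -0.7),
    ((0.0, -1.0), -0.3),
]
M = (2.0, 1.0)  # hypothetical vector of measured peak positions (p1, p2)

best, distances = identify(boundaries, M)
print(f"distances: {[round(d, 3) for d in distances]}")
print(f"retained microorganism: number {best + 1}")
```

Only the largest distance is used to make the decision; the other two distances are computed but play no role in the result.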
Of course, the case illustrated herein is extremely simple. In reality, a microorganism has to be identified from among hundreds of microorganisms, and the number of peaks retained for the identification may substantially exceed 1,000. Further, the illustrated case is also simple since the microorganisms are very distant from one another and the measurement has been performed with enough accuracy to be able to deduce significant information from the distances.
In real cases, it is difficult or even impossible to directly deduce relevant information from the calculated distances to the boundaries. Indeed, a given distance value may correspond to very different situations. FIGS. 3A to 3D illustrate this principle in a simple fashion. These drawings show boundary F1 separating a set of training peak vectors $\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}$ associated with a first microorganism, represented by circles, from the other training peak vectors $\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}$ associated with the other microorganisms, represented by triangles. Vectors of measured peaks M of an unknown microorganism to be identified are represented by squares.
In the case illustrated in FIG. 3A, the distance of measured vector M to boundary F1 is positive. However, vector M is so remote from the set of training vectors of the first microorganism that it cannot be deduced with certainty that the unknown microorganism actually is the first microorganism. In the case illustrated in FIG. 3B, measured vector M is now close to the set of training vectors but also very close to the other sets of training vectors. In this case also, it is difficult to deduce that the unknown microorganism is the first microorganism. In the case illustrated in FIG. 3C, measured vector M is distant from boundary F1 and is close to the set of training vectors, while being at the border of this set. Although this case is more favorable than the previous cases, there still is an uncertainty as to where the microorganism to be identified belongs. It is in particular necessary to study the accuracy of the measurement. Finally, the case illustrated in FIG. 3D is the rare typical case where the measured vector is both distant from the boundary and located among the set of training vectors. The measured distance then is a value characteristic of the first microorganism, and can be relied upon.
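The contrast between the situations of FIG. 3A and FIG. 3D can be reproduced numerically with a minimal sketch. All values below are invented for the illustration: the boundary F1 is assumed linear, the four training vectors are hypothetical, and the distance to the nearest training vector is used only as a simple, assumed indicator of proximity to the training set.

```python
import math


def signed_distance(w, b, x):
    """Algebraic distance from point x to the oriented hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def nearest_training_distance(training, x):
    """Euclidean distance from x to the closest training vector."""
    return min(math.dist(t, x) for t in training)


# Hypothetical boundary F1 (the line m1 = 0, oriented toward m1 > 0).
w, b = (1.0, 0.0), 0.0
# Hypothetical training vectors of the first microorganism.
training = [(0.9, 0.0), (1.1, 0.0), (1.0, 0.1), (1.0, -0.1)]

inside = (1.0, 0.0)   # among the training vectors, as in FIG. 3D
remote = (1.0, 5.0)   # far from the training vectors, as in FIG. 3A

margin_inside = signed_distance(w, b, inside)
margin_remote = signed_distance(w, b, remote)
near_inside = nearest_training_distance(training, inside)
near_remote = nearest_training_distance(training, remote)

print(f"inside: margin={margin_inside:.2f}, nearest training vector={near_inside:.2f}")
print(f"remote: margin={margin_remote:.2f}, nearest training vector={near_remote:.2f}")
```

Both measured vectors have the same algebraic distance to F1, yet only the first lies among the training vectors, which is why the distance alone does not indicate how far the identification can be trusted.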
As can be observed, the calculated distances are only partially relevant. For example, in a first case, a distance equal to 0.4 is highly relevant while in another case, it is impossible to deduce anything therefrom. It is thus necessary to analyze these distances to deduce the type of unknown microorganism therefrom, as well as the degree of confidence to be placed in this identification. This additional analysis step is conventionally carried out by an operator, be it a biologist or a doctor, who determines by means of his/her know-how what conclusion can be drawn from the distances calculated by the classification tool.
An SVM-type vector classification has been described, which calculates an algebraic distance between two objects of a vector space, that is, a vector corresponding to the microorganism to be identified, and a hyperplane corresponding to a boundary partitioning the space ($\mathbb{R}^2$ in the illustrated example) into two sub-spaces. The type of problem discussed in relation with this type of classification also appears in other classification types as soon as they generate a value or score representing a distance to reference objects, whether or not the classifications are of SVM type, and more generally whether or not they are of vector type, such as for example Bayesian classifications, linear classifications, classifications based on neural networks, tolerant distance classifications, etc.
To a certain extent, it may be argued that there still exists no reliable tool for identifying microorganisms by means of mass spectrometry and of classification tools calculating distance values.