The invention relates to the identification of microbes in a sample by calculating the similarities between a mass spectrum of the sample and reference mass spectra in large libraries. The routine, fast and error-free identification of many samples of microorganisms plays an important role, particularly in clinical and non-clinical infection diagnostics, in hygiene monitoring in hospitals or in rivers and lakes used for swimming, and also in food analysis, monitoring of biotechnological processes or in microbiological research. Microorganisms, here also called microbes for short, include all microscopically small organisms, for example unicellular fungi (e.g. yeasts), algae, or protozoa (e.g. plasmodia as malaria pathogens), although the focus of the identification is usually on bacteria. Viruses are also occasionally categorized as microorganisms, although in the strictest sense they are not true organisms because they have no metabolism.
Identifying microbes basically means determining their species and thus categorizing them in the taxonomic hierarchical scheme: domain (eukaryotes and prokaryotes), kingdom, phylum, class, order, family, genus, and species.
The practice of classifying microorganisms into species originates from a time when taxonomy was largely based on differentiating by means of biochemical reactions and, in many cases, it is imprecise and describes non-uniform phylogenetic units within the microorganisms. The conventional biological definition for distinguishing species from each other by the unlimited sexual reproductive ability of their members among themselves cannot, unfortunately, be applied to microorganisms. Modern methods of molecular biology therefore lead to many corrections in the assignment of species to the genera, and also to the introduction of new species and, in the case of bacteria, additional taxonomic classes below the species, e.g. subspecies. Furthermore, observations in medicine and cell biology have led to the insertion of serovars or serotypes, which are particularly distinguished by different types of attachment behavior at the cell membrane, but which do not constitute a separate species or subspecies. Microorganisms are collected worldwide in many places in the form of frozen or freeze-dried strains.
The identification of a microbe sample within the meaning of this text involves the determination of at least the genus, usually the species, and if possible the subspecies as well, or—in favorable cases—even the serotype or the strain. The strains of one species stored around the world can often be distinguished from each other using molecular biological methods, just as humans are different although they all belong to the same species.
In a more general sense an identification can also mean a characterization in terms of other properties, such as the pathogenicity of a microorganism (ability to cause disease) or the resistance of a microorganism to antibiotics, but this type of identification is only regarded here in a more general sense because many of these characteristics are directly linked, or often have at least a high statistical probability of being linked, with the species, subspecies, serovars or strains. The statistical linkage may even vary from location to location, sometimes from hospital to hospital.
The traditional identification of microorganisms in a sample under investigation requires the cultivation of colonies of the microorganisms. The “API tests” used in laboratory practice comprise different culture media for the cultivation, which can be used to detect specific metabolic characteristics of the microorganisms, thus allowing an initial, usually approximate, taxonomic classification of the microorganisms. Moreover, the microscopic morphology of individual organisms of a colony and the morphology of the colony itself are investigated. On the other hand, new molecular-biological identification methods based, for example, on a DNA or RNA sequence analysis after replication of specific genetic sequences by polymerase chain reaction (PCR), or on mass spectrometric detection of specific molecular cell components of microorganisms, have been known for some years. These new methods are superior to conventional methods in terms of specificity (true-negative rate), sensitivity (true-positive rate), other error rates and analytical speed.
The identification of bacteria by mass spectrometric measurements has been described in detail in the review by van Baar (FEMS Microbiology Reviews, 24, 2000, 193-219: “Characterization of bacteria by matrix-assisted laser desorption/ionization and electrospray mass spectrometry”), for example. The identification is achieved by means of a similarity analysis between a mass spectrum of the bacteria to be identified and reference spectra of accurately known bacteria. During the similarity analysis, a similarity index is assigned to each of the reference spectra. This index characterizes the agreement between the reference spectrum and the mass spectrum of the sample. A bacterium can be classified as identified, for example, if the similarity index is significantly larger than the similarity index for all other reference spectra and also larger than a specified minimum value.
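The decision rule outlined above can be sketched as follows. This is only a minimal illustration: the function name pick_identification and the parameters min_score and min_margin are hypothetical, and expressing "significantly larger" as a fixed margin over the runner-up is an assumption made here, not something specified in the literature cited.

```python
from typing import Optional


def pick_identification(scores: dict[str, float],
                        min_score: float,
                        min_margin: float) -> Optional[str]:
    """Return the name of the best-matching reference spectrum, or None.

    scores     -- similarity index per reference spectrum (name -> index)
    min_score  -- specified minimum value the best index must exceed
    min_margin -- assumed lead required over the second-best index
    """
    if not scores:
        return None
    # Sort the references by descending similarity index.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    # The best index must be larger than the specified minimum value.
    if best_score <= min_score:
        return None
    # The best index must clearly exceed the index of every other reference.
    if len(ranked) > 1 and best_score - ranked[1][1] < min_margin:
        return None
    return best_name
```

With such a rule, a sample whose two best indices lie close together is reported as not identified rather than being assigned to the marginally better reference.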
The reference spectra are usually collected in a library, which can contain not only reference spectra of bacteria, but also of other microbes, in order to identify not only bacteria, but also other species of microorganism.
Validation of a library of reference mass spectra requires every entry to be traceable and very accurately documented. The reference spectra are obtained from accurately characterized strains. Such strains of microorganisms are collected worldwide in governmental, public and private institutes, usually stored in the deep-frozen or freeze-dried state, and are available for scientific purposes. Microbiological research institutes frequently hold further strains of newly discovered species of microbe. The exact classification in the taxonomical hierarchy scheme is sometimes disputed, but this does not diminish the value of such strains, as long as the data are traceable. The exact taxonomical classification may even be improved by mass spectrometric means because the similarity indices reflect the relationship between microbe species and their membership of genera and families.
The term “strain” describes a population which has been propagated from a single organism and identified with certainty in a laboratory of recognized reputation. Spectral libraries are compiled using strains whose identity and classification in the hierarchy system above is accurately known (even if occasionally disputed and subject to changes), i.e. which belong to a certain species of microbe, or, if available, a specific subspecies. Since the microbes are collected and stored in different places worldwide, there are also many strains worldwide which belong to the same subspecies. Although these strains are classified as the same subspecies, there are sometimes slight differences in the mass spectra, which indicate that there are individual differences (as is the case with animals or plants of the same species), such as the serotypes. The strains are marked by internationally agreed labels after the name of the species or subspecies. In contrast to the term “strain”, a population which has been grown from a single organism in a microbiological laboratory, e.g. in the process of identification, is termed an “isolate”.
The generation of mass spectra of the microbes usually starts with a cleanly isolated colony on a solid, usually gelatinous nutrient medium or a centrifuge sediment (pellet) from a liquid nutrient medium. A small swab, e.g. a wooden toothpick, is used to transfer a tiny quantity of microbes from the selected colony or sediment to the mass spectrometric sample support. A strongly acidified solution of a conventional matrix substance is then sprinkled onto this sample, the matrix substance serving for a subsequent ionization by matrix-assisted laser desorption (MALDI). The acid of the matrix solution attacks the cell walls and weakens them; the organic solvent penetrates the microbial cells, causes them to burst due to osmotic pressure, and releases the soluble proteins. The sample is then dried by evaporating the solvent, which causes the dissolved matrix material to crystallize. The soluble proteins and, to a minor extent, other substances of the cell are embedded into the matrix crystals.
There are borderline cases where the cell walls of the microbes are difficult to destroy or are not destroyed at all by the matrix solution. A slightly different type of digestion is then possible, where in addition to strong acids, sonication or mechanical treatment also helps to destroy the microbial cell wall. These digestions result in mass spectra which are very similar to those prepared in the usual way on sample supports. These digestion methods will not be discussed further here, however. The libraries of reference spectra may contain reference spectra for both preparation methods in parallel.
The sample preparations dried on sample supports, i.e. the matrix crystals with the embedded analyte molecules, are bombarded with pulsed UV laser light in a mass spectrometer, creating ions of the analyte molecules which can then be measured in the mass spectrometer, separated according to the mass of the ions. This type of ionization by matrix-assisted laser desorption is usually abbreviated to “MALDI” (“Matrix-Assisted Laser Desorption and Ionization”). Usually, special MALDI time-of-flight mass spectrometers are used for this purpose.
Nowadays, the mass spectra of the microbe proteins are scanned in the linear mode of these time-of-flight mass spectrometers, i.e. without using an energy-focusing reflector, because this gives a particularly high detection sensitivity, even though the mass resolution and the mass accuracy of the spectra from time-of-flight mass spectrometers are much better in the reflector mode. In the reflector mode, however, only around a twentieth of the ion signals appear, and the detection sensitivity is one to two orders of magnitude worse. The high sensitivity of the linear mode is based on the fact that not only the stable ions but also the charged and neutral fragments from so-called “metastable” decays of the ions are detected in a time-of-flight mass spectrometer. Secondary electron multipliers (SEM) are used to measure the ions, which means that the ion detector measures not only the unfragmented molecular ions and the fragment ions but also the neutral particles, because they also generate secondary electrons on impact. If a singly charged molecular ion fragments into five particles, for example, four of them are by necessity neutral particles. All the fragments that originate from one species of parent ion have the same speed as the parent ions and thus arrive at the ion detector at the same time. The time of flight is a measure of the mass of the originally undecayed ions.
The increased detection sensitivity is so crucial for many applications that one accepts many of the disadvantages of time-of-flight mass spectrometers in linear operation, such as a significantly lower mass resolution and also a reduced mass accuracy. The energy of the desorbing and ionizing laser is increased for these applications, something which increases the ion yield but also increases their instability, although this is of no consequence here.
The poor reproducibility of the desorption and ionization processes for the generation of the ions in a MALDI time-of-flight mass spectrometer operated in linear mode means the masses of the individual mass signals shift slightly from spectrum to spectrum. These shifts in the mass scales of the repeat spectra with respect to each other can be readjusted using a method described in the document DE 10 2004 051 043 A1 (M. Kostrzewa et al.; GB 2 419 737 B; U.S. Pat. No. 7,391,017 B2), before the repeat spectra are combined to produce a reference spectrum. The mass scales of sample and reference spectra can also be aligned with each other by this mass scale adjustment program. This means that smaller mass tolerance intervals can be used to determine matching mass signals during the similarity analysis, which is decisive for a good identification, even if it takes some time.
The mass spectrum of a microbe isolate is the frequency profile of the mass values of the ions. The ions here are predominantly protein ions, in most cases ions of ribosomal proteins. The mass spectra are usually acquired in the mass range from 2,000 to 20,000 atomic mass units; the most useful information for identifications is found in the mass range from around 3,000 atomic mass units to 15,000 atomic mass units. The reduced resolution means the mass signals of the different isotopic compositions of the ions in this mass range are no longer resolved individually; instead, each isotope group forms a single fused mass signal. The protein ions in this method are usually only singly charged (charge number z=1), thus we can simply refer to the mass m of the ions here, instead of using the more accurate term of the “charge-related mass” m/z, as is actually necessary and conventional in mass spectrometry. Mass signals of doubly charged ions occur only occasionally in the mass spectra of microbes, but as these mass signals are treated exactly like all the others, there is no need to distinguish between singly and doubly charged ions.
Every laser light pulse produces a single mass spectrum, but one which contains the signals of only a few hundred to a few thousand ions. In order to obtain more reliable and less noisy mass spectra, a few tens to a few hundreds of these individual mass spectra are added together to form a sum mass spectrum. The individual mass spectra here can preferably originate from different parts of the sample preparation or even from different sample preparations. The term “mass spectrum of a microbe”, or more simply “microbe spectrum”, shall always denote this sum mass spectrum.
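The addition of individual mass spectra to a sum mass spectrum can be illustrated by a short sketch. It assumes, purely for illustration, that every individual spectrum is stored as a list of intensities on one common mass axis; the function name sum_spectrum is hypothetical.

```python
def sum_spectrum(single_shot_spectra: list[list[float]]) -> list[float]:
    """Add individual mass spectra bin by bin to form a sum mass spectrum.

    Assumption: all spectra are sampled on the same mass axis, so that
    bin i refers to the same mass value in every spectrum.
    """
    if not single_shot_spectra:
        return []
    n_bins = len(single_shot_spectra[0])
    total = [0.0] * n_bins
    for spectrum in single_shot_spectra:
        if len(spectrum) != n_bins:
            raise ValueError("all spectra must share the same mass axis")
        for i, intensity in enumerate(spectrum):
            total[i] += intensity
    return total
```

Because noise is largely uncorrelated from pulse to pulse while true ion signals recur at the same masses, the sum spectrum is less noisy than any individual spectrum.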
The profile of the proteins of this microbe spectrum is very characteristic of the species of microbe in question because each species of microbe produces its own, genetically predetermined proteins, each having their own characteristic masses. The abundances of the individual proteins in the microbes, inasmuch as they can be measured mass spectrometrically, are also largely genetically determined because their production is controlled by other proteins, and the abundances depend only slightly on the nutrient medium or the degree of maturity of the colony. The protein profiles are characteristic of the microbes in the same way that fingerprints are characteristic of humans.
Reference spectra for spectral libraries are generated by first producing colonies or centrifuge sediments of microbes of specific, accurately documented strains and then acquiring mass spectra from them. A large number of sum mass spectra are always acquired for a reference spectrum; they are termed repeat spectra here. Mass spectra of microbes usually contain around 50 to 200 separate mass signals, but many of them are pure noise because the search for mass signals is set to high sensitivity. The reference spectra are therefore usually reduced to a maximum of 70 or 100 mass signals, for example; specialists consider even a limit of 50 mass signals to be sufficient. The information content of a mass spectrum with 50 mass signals in the mass range between 3,000 and 15,000 atomic mass units, where even at reduced mass resolving power far more than 2,000 distinguishable mass signals can occur, is already extremely high, without taking account of the intensity differences (close to 2,000^50 ≈ 10^165 patterns can be distinguished from each other). For the restriction to 70 or 100 mass signals, the repeat spectra are initially combined to give an average spectrum very rich in signals; first, all mass signals which occur only a few times in the repeat spectra are deleted, and then the mass signals with the lowest intensities are deleted until the desired maximum number of mass signals remains.
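The two-stage reduction of an average spectrum to a maximum number of mass signals can be sketched as follows. The names Peak and reduce_peaks, the representation of the occurrence as a fraction between 0 and 1, and the threshold min_occurrence are assumptions made for this illustration only; the text above does not state how "only a few times" is quantified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mass: float        # averaged mass in atomic mass units
    intensity: float   # averaged intensity (arbitrary units)
    occurrence: float  # fraction of repeat spectra containing the signal


def reduce_peaks(peaks: list[Peak],
                 max_signals: int = 70,
                 min_occurrence: float = 0.25) -> list[Peak]:
    """Reduce an average spectrum to at most max_signals mass signals."""
    # Stage 1: delete mass signals that occur in too few repeat spectra
    # (min_occurrence is an assumed, freely chosen threshold).
    frequent = [p for p in peaks if p.occurrence >= min_occurrence]
    # Stage 2: delete the weakest signals until the maximum number remains.
    strongest = sorted(frequent, key=lambda p: p.intensity,
                       reverse=True)[:max_signals]
    # Return the surviving signals ordered by mass, as in a spectrum.
    return sorted(strongest, key=lambda p: p.mass)
```

The order of the two stages matters: an intense but rarely occurring signal is removed in the first stage and therefore cannot displace a weaker, reproducible signal in the second.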
The mass spectra of the microbes to be identified, called “sample spectra” below for short, are usually generated in a similar way from repeat spectra and limited to a predetermined number of mass signals in order to exclude noise signals as best as possible. This number of mass signals in these sample spectra is usually selected to be slightly higher than the number of signals in the reference spectra.
There are different types of similarity analysis, which are also usually based on different forms of the reference spectra. The reference spectra can store many or few mass-spectrometric parameters for each mass signal, which has a great impact on the length of the reference spectra and thus on the size of the library.
The publication by Jarman et al. (Analytical Chemistry, 72(6), 2000, 1217-1223: “An Algorithm for Automated Bacterial Identification Using Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry”), for example, elucidates a computational method for the generation of reference spectra of a library and for similarity analysis between a mass spectrum of a sample under investigation (called “sample spectrum” here) and the reference spectra of the library. This method particularly utilizes the reproducibility of the individual mass signals when generating the reference spectra. For the similarity analysis of a sample spectrum, the method derives an individual index for every mass signal of every reference spectrum, indicating how well it matches the mass signal of the sample spectrum. Particular attention is paid to the agreement of the intensities, on the one hand, and a weighting resulting from the spread of the reference signals, on the other hand. The smaller the spread of the intensity for this mass signal (i.e. the better this mass signal can be reproduced), the higher this individual weighting index will be. Mass signals which have poor reproducibility receive a low individual weighting index. The individual weighting indices of the mass signals of the reference spectra thus obtained are then summed to derive a similarity index indicating how closely each reference spectrum matches the sample spectrum. The reference spectra in a library are then sorted according to the magnitude of the similarity indices. The result is a list, sorted according to similarities, which contains the designations of the microorganisms assigned to the reference spectra and the similarity indices.
This algorithm according to Jarman et al., which is only outlined here, requires that a reference spectrum contains the following values for the individual mass signals of the reference spectrum, which are ascertained from the repeat spectra: the averaged mass, the mean deviation of the averaged mass, the average intensity, the mean deviation of the average intensity and how frequently this mass signal occurs above background in the repeat spectra, as a percentage, i.e. its occurrence above the sensitivity threshold. It is usual here to take into account only those signals in a reference spectrum which have a predetermined minimum percentage of occurrence.
In addition to the method by Jarman et al., several other types of identification algorithm and reference libraries have been elucidated in the literature, but they will not be dealt with further here.
As identification methods developed by the applicant's company show, significantly simpler mass spectrometric identification methods can also have a very high success rate. For example, in contrast to the method used by Jarman et al., it is expedient—for the acquisition of both the reference spectra and the sample spectra—if the spectra are generated under standardized conditions for the cultivation of the colony, the sample preparation and the mass spectrometric spectrum acquisition. This measure alone leads to an improved identification. There is then no need to store any of the mean deviations of mass values and intensity values in the reference spectra, which makes the library smaller and more practical, and makes the similarity analysis faster. A method for aligning the mass scales of the repeat spectra with respect to each other, which frequently show a slight mass shift, has already been dealt with above. Since many mass signals occur in only some of the repeat measurements, but can nevertheless contribute to the identification, it has proved expedient in many experiments with reference spectra simplified in this way to also record the occurrence rate of a mass signal. The occurrence rate gives the percentage of the repeat spectra in which this mass signal occurs. A mass signal then only has three entries: averaged mass, averaged intensity and occurrence rate.
In its simplest form, the method for the similarity analysis with these simplified reference spectra can consist in examining every reference spectrum to see how many of its mass signals agree in each case with those of the microbe spectrum within a specified mass tolerance. The number of these hits, divided by the number of mass signals in the reference spectrum is then an initial partial measure for the similarity; the number of hits divided by the number of mass signals in the microbe spectrum is a second partial measure. A third partial measure can be derived from the intensity similarity of the mass signals that agree. The product of the three partial measures gives the similarity index. A refinement can be introduced by counting each hit only with the occurrence rate of this mass signal, i.e. with a number which is possibly less than one.
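The simple similarity analysis described above can be sketched as follows. Several details are assumptions made here for illustration and are not taken from the text: the names ReferencePeak and similarity_index, the absolute mass tolerance in atomic mass units, the matching of each reference signal to the nearest unused sample signal, and in particular the third partial measure, which is taken as the mean ratio of the smaller to the larger normalized intensity of each matching pair. The occurrence-rate refinement is applied to the hit count in both the first and the second partial measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferencePeak:
    mass: float               # averaged mass in atomic mass units
    intensity: float          # averaged intensity (arbitrary units)
    occurrence: float = 1.0   # occurrence rate as a fraction, 0..1


def similarity_index(reference: list[ReferencePeak],
                     sample: list[tuple[float, float]],
                     tolerance: float = 2.0) -> float:
    """Product of three partial measures; sample holds (mass, intensity)."""
    if not reference or not sample:
        return 0.0
    ref_max = max(p.intensity for p in reference)
    smp_max = max(i for _, i in sample)
    if ref_max <= 0 or smp_max <= 0:
        return 0.0
    weighted_hits = 0.0
    intensity_terms = []
    used = set()  # indices of sample signals that are already matched
    for p in reference:
        best = None  # (mass difference, sample index) of nearest candidate
        for j, (mass, _) in enumerate(sample):
            diff = abs(mass - p.mass)
            if j not in used and diff <= tolerance:
                if best is None or diff < best[0]:
                    best = (diff, j)
        if best is None:
            continue
        j = best[1]
        used.add(j)
        # Refinement: count the hit only with its occurrence rate.
        weighted_hits += p.occurrence
        a = p.intensity / ref_max
        b = sample[j][1] / smp_max
        hi = max(a, b)
        intensity_terms.append(min(a, b) / hi if hi > 0 else 0.0)
    if not intensity_terms:
        return 0.0
    first = weighted_hits / len(reference)   # hits per reference signal
    second = weighted_hits / len(sample)     # hits per sample signal
    third = sum(intensity_terms) / len(intensity_terms)
    return first * second * third
```

In this sketch, identical spectra yield an index of 1, and a sample sharing one of two signals with a two-signal reference yields 0.5 × 0.5 × 1 = 0.25.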
This algorithm can be adjusted by an appropriate scale transformation to a maximum similarity index between measured and reference spectra, for example a maximum similarity index of 3.00 for identical spectra. It is even possible to transform the similarity indices in such a way that a similarity value of 2.00 can be considered to be a minimum requirement for an identification. In our experience, such a minimum requirement and a corresponding maximum value have a high psychological value for the acceptance of the method.
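The text does not state which scale transformation is used. One possibility, shown here purely as an assumption, is a logarithmic mapping of a raw similarity index lying between 0 and 1: the raw value 1 (identical spectra) is mapped to 3.00 and the raw value 0.1 to 2.00. The function name scaled_score is hypothetical.

```python
import math


def scaled_score(raw_index: float, max_score: float = 3.0) -> float:
    """Map a raw similarity index in (0, 1] onto a scale from 0 to max_score.

    Assumed transformation: max_score + log10(raw_index), clipped at zero.
    """
    if raw_index <= 0.0:
        return 0.0
    # Raw values above 1 are treated as 1 so that max_score is never exceeded.
    return max(0.0, max_score + math.log10(min(raw_index, 1.0)))
```

With this mapping, the minimum requirement of 2.00 corresponds to a raw index of one tenth of the value obtained for identical spectra.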
For this simple similarity analysis it is possible to develop an algorithm which calculates a similarity index for a reference spectrum in around five milliseconds, some of this time being taken up by the mass scale adjustment. An identification with the aid of a sample spectrum in a reference library containing around 3,500 reference spectra, as are available nowadays, requires around 15 seconds on normal computer servers. This time is compatible with acquisition times for the sample spectra which are achieved with pulsed lasers at 20 hertz repetition frequency, but today's mass spectrometers can acquire mass spectra at 1,000 hertz in a much shorter time.
The development of MALDI mass spectrometers is advancing apace; the first spectrometers with a 2,000 hertz laser shot frequency are already on the market. The acquisition time for sample spectra is decreasing to between one and three seconds. On the other hand, one can expect that the libraries will quickly grow to 10,000 reference spectra and more as further reference spectra are entered, while the development of higher speeds for PCs and computer servers is advancing only slowly. The computation times for the identification will therefore rapidly increase to around one minute and will thus no longer be compatible with the acquisition times. One solution (albeit expensive) to the problem consists in equipping mass spectrometers with multi-processor systems. It would, however, be a welcome development if methods for increasing the identification speed were available which, in combination with the simple computer systems currently in use, produce suitably short identification times.
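The computation times quoted above follow from a simple proportionality, which the short sketch below reproduces. It assumes that the identification time is simply the number of reference spectra multiplied by the time per similarity calculation, without any overhead; the function name identification_time_s is hypothetical.

```python
def identification_time_s(n_reference_spectra: int,
                          ms_per_comparison: float = 5.0) -> float:
    """Estimate the identification time in seconds for one sample spectrum."""
    return n_reference_spectra * ms_per_comparison / 1000.0


print(identification_time_s(3_500))   # 17.5
print(identification_time_s(10_000))  # 50.0
```

At five milliseconds per comparison, a library of 3,500 reference spectra gives 17.5 seconds, which is of the same order as the figure of around 15 seconds, and a library of 10,000 reference spectra gives 50 seconds, i.e. around one minute.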