Determining the structure of unknown compounds is of prime importance to practitioners of analytical chemistry. In particular, mass spectrometry has been and continues to be a widely employed structure characterization technique in the chemical, biological and medical sciences. The chemical structure of interest can range from very simple diatomic arrangements to very complex protein or DNA macromolecules. Considerable effort has been invested into hardware and software systems capable of efficient structure determination of unknown compounds by means of mass spectrometry.
Widely used library search systems are designed to identify compounds represented in the reference library that might have generated the submitted single stage or tandem mass spectra from the unknown compound. These systems are based on the assumption that a chemical entity exhibits a unique spectral fingerprint that should have a counterpart in a reference library (McLafferty and Stauffer 1985; Stein and Scott 1994; Sander 1999; Gross et al. 2002; Alfassi 2003). When the unknown compound is not represented in the library, the compound cannot be identified by this means. To overcome this shortcoming, various “interpretative” library search techniques have been developed to derive at least partial structural information by estimating the probability of substructure occurrence and absence in a single stage mass spectrum using a predefined set of substructures (Damen et al. 1978; Warr 1993; Stein 1994). The identification of a substructure from a given mass spectrum using such a method can be difficult or even impossible because its success depends on the relative rates of competitive processes that depend, in turn, on other structural features of the molecule. Even for substructures that commonly produce characteristic patterns, the actual “signatures” can be highly variable (Stein 1994).
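The fingerprint-matching idea behind library search can be illustrated with a minimal sketch. It assumes a cosine (normalized dot product) score on unit-mass bins, which is only one of several scoring schemes in use and is not taken from any of the cited systems. The library entries, compound names and peak lists below are invented toy data, not real spectra.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into bins keyed by an integer bin index."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def cosine_similarity(spec_a, spec_b):
    """Cosine of the angle between two binned spectra (0 = no shared peaks)."""
    a = bin_spectrum(spec_a)
    b = bin_spectrum(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search_library(query, library, top_n=3):
    """Return the top_n (score, name) pairs, best match first."""
    scored = [(cosine_similarity(query, peaks), name)
              for name, peaks in library.items()]
    scored.sort(reverse=True)
    return scored[:top_n]

# Hypothetical toy library: (m/z, relative intensity) pairs, invented values.
library = {
    "compound_A": [(43.0, 100.0), (58.0, 30.0), (71.0, 10.0)],
    "compound_B": [(77.0, 100.0), (105.0, 80.0), (51.0, 20.0)],
}
query = [(43.0, 95.0), (58.0, 35.0), (72.0, 5.0)]
hits = search_library(query, library)
print(hits)
```

A query whose compound is absent from the library still receives a ranked list; the score only says which stored fingerprint is closest, which is the limitation noted above.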
Statistical pattern recognition methods have been applied for the selective detection of compound classes or individual functional groups from mass spectra. Most of these methods are based on the presumption that common structural moieties exhibit identical or similar fragmentation patterns (Scsibrany and Varmuza 1992; Drablos 1992; Lohninger 1994; Lebedev and Cabrol-Bass 1998; Klagkou 2003). To achieve statistically relevant results, a relatively large number of suitable spectra of compounds with common structural properties must be processed. Multivariate statistical methods have been successfully employed in the determination of certain structural features in selected data sets, usually in the range of 70–90% correct identifications; however, erroneous classification cannot be avoided. Since the mass spectrum reflects not only the relative rates of competitive processes but also complex gaseous ion thermochemistry, the dynamics of the reaction are highly variable. Even structurally highly similar compounds often do not exhibit a uniform fragmentation pattern. As a consequence, multivariate statistics and related methods cannot by themselves be considered general-purpose, reliable interpretation approaches.
Various methods for the interpretation of mass spectra based on expert systems or artificial intelligence have been developed (Lindsay et al. 1980; Warr 1993 Part 2). These methods employ a variety of advanced mathematical algorithms to derive structural information from spectra using individual or a combination of pattern recognition methods, decision trees, empirical rule-based systems, knowledge bases, exploratory techniques and other heuristic systems. A central problem when dealing with expert systems and artificial intelligence methods is their narrow application range in terms of structural variety. These methods need to be selectively trained for each chemical class of interest. In order to perform satisfactorily, these methods require, in the training phase, a statistically relevant number of spectral representatives for each structural class or group, which may pose a serious problem if no such spectra are available. On the other hand, specific applications that do not require universal substructure determination capabilities can benefit from the inherent selectivity of these methods, which allows the achievement of high probabilities of correct identification.
In the past, there have been several attempts to design algorithms for structure elucidation based on substructure identification from tandem mass spectra (Enke et al. 1987; Wade et al. 1988; Palmer et al. 1989). Although these systems include some expert system features, they are very similar to the interpretative techniques of single stage library search methods mentioned above. These methods try to derive substructural information by comparing calculated m/z value ratios and/or neutral losses of predefined single- or diatomic substructures stored in a library with correspondingly calculated parameters from the analyzed tandem spectra. Owing to the immense structural variability and the combinatorially large number of structurally different isobaric ions (structures or fragments with identical mass), the m/z ratios and neutral loss values, even with exact mass precision, are usually not distinct enough to provide rules for unambiguous identification of the predefined fragment structures.
With the advent of proteomic research, a wide variety of new structure characterization techniques for linear molecules has emerged. One method compares an experimental product mass spectrum with theoretical spectra calculated from amino acid sequences of database proteins and identifies the sequence that best fits the tandem mass spectrum (Yates III et al. 1995; Perkins et al. 1999; Sadygov et al. 2002; Anderson et al. 2003). An alternative approach, termed “de novo sequencing,” converts the fragment ion mass values derived from spectra into a ranked list of most probable amino acid sequences (Shevchenko et al. 1997; Fernandez-de-Cossio et al. 1998; Dancik et al. 1999; Horn et al. 2000). The major limitation of these methods is the fragmentation model, which assumes that peptides fragment in a uniform manner. A considerable number of routinely observed peptide spectra do not exhibit a contiguous series of backbone cleavage sequence ions because of the vast variability of dissociation patterns. This becomes even more prevalent in non-linear molecules, preventing the adoption of proteomic methods for the interpretation of a majority of organic compounds.
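The uniform fragmentation model used by database search can be sketched as follows. This is a simplified illustration, assuming singly charged b and y backbone ions only and a plain count of matched peaks as the score; it is not the scoring function of any cited program. The function names and the example sequences are hypothetical; the residue masses are standard monoisotopic values.

```python
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # monoisotopic mass of H2O, Da

# Monoisotopic amino acid residue masses, Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def theoretical_ions(sequence):
    """Singly charged b and y ion m/z values for every backbone cleavage."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal piece
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal piece
    return b_ions, y_ions

def count_matches(sequence, observed_mz, tol=0.02):
    """Number of theoretical ions found in the observed peak list."""
    b_ions, y_ions = theoretical_ions(sequence)
    return sum(
        any(abs(theo - obs) <= tol for obs in observed_mz)
        for theo in b_ions + y_ions
    )

# Toy "observed" spectrum: only the b ions of one candidate (assumption).
observed = theoretical_ions("PEPTIDE")[0]
scores = {seq: count_matches(seq, observed) for seq in ("PEPTIDE", "PEPTDIE")}
print(scores)
```

Every theoretical ion is treated as equally likely to appear, so a spectrum that lacks a contiguous ion series gives low counts for all candidates, including the correct one.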
There has been a rapid expansion in the use of tandem mass spectrometry for the structural elucidation of organic compounds. In tandem mass spectrometry, the ions that emerge from the ionization process (precursor ions) can be further isolated and fragmented by means of collision induced dissociation or various other ion activation techniques that give rise to second stage spectra called product spectra. Given the appropriate hardware, the isolation and activation procedures can be successively repeated in several stages. Resulting product spectra exhibit fragmentation peaks from isolated ions, providing an added dimension to the overall fragmentation pattern. Although tandem mass spectra along with the masses of their precursor ions contain important portions of structural information of the elucidated molecule, the structural arrangement remains ciphered through the set of product fragment masses. Accurate mass measurements can greatly reduce the number of possible elemental compositions for a given fragment mass. Still, the immense variability of dissociation patterns obscures structural determination.
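The effect of mass accuracy on the number of candidate elemental compositions can be illustrated with a brute-force sketch. It assumes a neutral monoisotopic mass, only the elements C, H, N and O, and no chemical plausibility filtering (valence or ring/double-bond rules), so it overcounts relative to practical tools. The function name and the example mass (caffeine, C8H10N4O2) are illustrative choices, not taken from the text.

```python
# Monoisotopic masses of the most abundant isotopes, Da.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

def candidate_formulas(target_mass, ppm):
    """Enumerate (C, H, N, O) counts whose mass lies within +/- ppm of target.

    Assumes the tolerance is well below half a hydrogen mass, so that at
    most one hydrogen count can fit for each heavy-atom combination.
    """
    tol = target_mass * ppm * 1e-6
    hits = []
    for c in range(int(target_mass // MONO["C"]) + 1):
        for n in range(int(target_mass // MONO["N"]) + 1):
            for o in range(int(target_mass // MONO["O"]) + 1):
                heavy = c * MONO["C"] + n * MONO["N"] + o * MONO["O"]
                remainder = target_mass - heavy
                if remainder < -tol:
                    break  # more oxygen only overshoots further
                h = max(0, round(remainder / MONO["H"]))
                mass = heavy + h * MONO["H"]
                if abs(mass - target_mass) <= tol:
                    hits.append((c, h, n, o))
    return hits

target = 194.08038  # neutral monoisotopic mass of C8H10N4O2
narrow = candidate_formulas(target, ppm=5)
wide = candidate_formulas(target, ppm=500)
print(len(narrow), len(wide))
```

Tightening the tolerance shrinks the candidate list, but every surviving formula still corresponds to many possible structures, so the composition alone does not resolve the structural arrangement.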
With the introduction of atmospheric pressure ionization techniques in combination with tandem mass spectrometry to analytical chemistry, library techniques for tandem spectra have been developed (Dheandhanoo 1988; Martinez and Ganguli 1989; Martinez 1991; Bristow et al. 2004; Joseph and Sanders 2004; Pittenauer et al. 2004). There are, however, several potential difficulties in obtaining standard library-searchable spectra. Because different types of analyzers favor different fragmentation pathways for the same compound, owing to different kinetic energies of the precursors, different collision energy regimes, few or multiple collisions and unimolecular or consecutive decays, data from different types of analyzers cannot be easily incorporated in one database. Despite the inherent variability of sample preparation, experimental conditions and instrumentation designs, attempts have been made to create libraries applicable to a wide range of possible “real-life” situations and to automate the structure identification process; nevertheless, improved systems are still needed (Sander 1999; Mistrik et al. 2003; U.S. Pat. Nos. 6,624,408, 6,623,935, 5,072,115, 4,008,388).