Methods for rapidly identifying unknown compounds from their mass spectra have been evolving. Sweeney (2003) described in detail a process for deriving modular structures directly from CID-type accurate-mass mass spectral data; this process will herein be called partitioning. Modular structures obtained by partitioning show how mass spectral fragments may be related to one another. Many small organic compounds can be represented as unbreakable cells, or subfragments, of known elemental composition, joined together at cleavable seams. These representations are called modular structures. Modular structures are a convenient way of summarizing and viewing CID-type mass spectral data. Each modular structure has a unique molecular formula. The fragment ions are viewed as different sets of connected subfragments; each subfragment has an elemental composition that is complementary to those of all the other subfragments composing the modular structure. For example, if a plausible elemental composition of the whole molecule has only one sulfur atom, then assigning that sulfur atom to one particular subfragment precludes all of the other subfragments from having a sulfur atom.
In contrast to Wu's basket-in-a-basket approach, which can also yield structural information, partitioning does not require accurate-mass MS4 or MS5 data, which are obtained with difficulty on expensive instruments such as FT-ICR mass spectrometers. In addition, partitioning can often yield spatial information about how the subfragments are arranged in the modular structure, whereas the basket-in-a-basket approach yields little spatial information. Because there are usually more fragments than subfragments, the calculated mass defects of the subfragments will often be more accurate than the fragment ion masses, since the subfragments are “weighed” in combinations rather than one at a time (Sweeney 2003). The partitioning approach is also conceptually simple; it has few “rules,” in contrast to some competing expert system software. For example, Mass Frontier now has about 20000 rules according to Kind.
Modular structures differ from molecular structures in two ways. First, the number of hydrogens in a particular subfragment of the modular structure will often differ from the number of hydrogens in the corresponding part of the molecular structure. However, the non-hydrogen atoms (herein called heavy atoms) are present in equal numbers (Drawing). In addition, while the heavy atoms of the subfragments are usually present in exactly the same combinations found in corresponding parts of the molecular structure, there is a lack of atomic sequence information in the modular structures. For example, one subfragment of the modular structure of xemilofiban (Drawing, blue color) is a combination of atoms (C2H6O), which corresponds to the ethoxy moiety (—O—CH2-CH3) in xemilofiban. Ignoring the hydrogens, the same combination of atoms (C2O) is present in both the modular structure and the molecular structure. However, while the combinations of elements are the same, the molecular structure has a specific ordering of atoms (—O—C—C) that is lacking in the modular structures.
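The heavy-atom comparison described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the parser handles only simple formulas without parentheses or charges, and C2H5O is assumed here as the composition of the ethoxy moiety (—O—CH2-CH3).

```python
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    """Parse a simple formula such as 'C2H6O' into element counts.

    Assumption: no parentheses, charges, or isotope labels.
    """
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def heavy_atoms(formula: str) -> Counter:
    """Return the element counts with hydrogen removed."""
    counts = parse_formula(formula)
    counts.pop("H", None)
    return counts

modular_subfragment = "C2H6O"  # subfragment composition given in the text
ethoxy_moiety = "C2H5O"        # assumed composition of -O-CH2-CH3

# The hydrogen counts differ, but the heavy-atom combination (C2O) is the same.
same_heavy_atoms = heavy_atoms(modular_subfragment) == heavy_atoms(ethoxy_moiety)
```

The comparison deliberately carries no ordering information: both compositions reduce to the same unordered set of heavy atoms, which is the sense in which modular structures lack atomic sequence.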
Rational Numbers® partitioning software was commercially available in an Apple Mac mini format from December 2006 to December 2007. It was also available on the Sun Grid Compute Utility (called the Sun Cloud on Wikipedia) from April 2007 until October 2008, when Sun closed the Sun Grid Compute Utility in a cost-cutting move.
How Partitioning has been Used (Sweeney 2007)
1. De Novo Identification of a Novel Compound (Rational Numbers® Partition)
With limited background information, it is extremely difficult to identify a novel compound from mass spectral data. However, combined with NMR data, the complete molecular structure can often be derived. NMR is very useful for determining which atom is connected to which atom, but sometimes there are gaps (substructures with no hydrogens or carbons) in a compound. In a sense, mass spectrometry shows the clumps of trees in the whole forest, whereas NMR shows exactly how the trees are arranged in each clump.
In the case of de novo identification, the 10 modular structures best accounting for the mass spectral data are saved. These modular structures give a rough idea of the overall structure of the compound. Some modular structures will fit the data very well, but may not correspond well to the actual molecular structure. Although the modular structures are ranked, there is no way of knowing a priori which ones match the structure of the compound that produced the spectral data and which ones do not. For de novo identification work, modular structures with up to five subfragments have been used.
2. Identification Using the “Template” Approach (Rational Numbers® Assign)
In the pharmaceutical industry, unknown compounds are usually closely related to a lead compound: degradation products, impurities, or metabolites. Traditionally, the mass spectral data of that lead compound are used to work out the fragmentation pathways, and the unknown compounds are then identified based on the changes in the masses of various fragments. This approach works well, but it can be very time consuming.
Watson et al. and Hill et al. used systematic bond-disconnection to assign accurate-mass fragments to known compounds. A similar approach is used to assign subfragments of modular structures to specific molecular subgroups of a lead compound. The heavy atom distribution of modular structures, derived from the mass spectral data, is compared to the heavy atom distribution of a computerized molecular structure of the lead compound to find matches. Only the modular structures that correlate with the computerized molecular structure are saved, and a monochrome molecular structure can then be color-coded with the same color scheme as the modular structures. This makes the fragmentation easy to visualize.
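The comparison of heavy atom distributions can be sketched as follows. This is a simplified illustration, not the published method of Watson et al., Hill et al., or the Rational Numbers® software. The molecule is ethyl acetate, chosen arbitrarily and written as a list of heavy atoms plus bonds, and all function names are hypothetical. The sketch cuts every combination of k-1 bonds and keeps the cuts whose pieces have the same heavy-atom compositions as a k-subfragment modular structure. Ring systems, where one cut does not separate the molecule, simply produce no match here.

```python
from collections import Counter
from itertools import combinations

def components(n_atoms, bonds):
    """Return the connected groups of atom indices for the given bonds."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, groups = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            atom = stack.pop()
            group.append(atom)
            for neighbour in adjacency[atom]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(group)
    return groups

def canonical(compositions):
    """Order-independent form of a list of element-count mappings."""
    return sorted(tuple(sorted(c.items())) for c in compositions)

def matching_cuts(atoms, bonds, subfragments):
    """Find bond cuts whose pieces match the subfragment heavy atoms."""
    target = canonical(subfragments)
    hits = []
    for cut in combinations(bonds, len(subfragments) - 1):
        kept = [bond for bond in bonds if bond not in cut]
        groups = components(len(atoms), kept)
        found = canonical(Counter(atoms[i] for i in g) for g in groups)
        if found == target:
            hits.append(cut)
    return hits

# Ethyl acetate heavy atoms: CH3-C(=O)-O-CH2-CH3 (hydrogens ignored).
atoms = ["C", "C", "O", "O", "C", "C"]
bonds = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]
# Hypothetical two-subfragment modular structure: C2O + C2O.
modular = [Counter({"C": 2, "O": 1}), Counter({"C": 2, "O": 1})]
cuts = matching_cuts(atoms, bonds, modular)
```

In this toy case only the cut between the carbonyl carbon and the ester oxygen reproduces the C2O + C2O distribution, so that is the seam that would be color-coded.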
By using the modular structures that match the lead compound as templates, related unknown compounds can now be identified by comparing modular structures to modular structures. The modular structures of the unknown compound that best match the templates are saved and linked to the template modular structure that they most closely match. For correlating related compounds to a lead compound of known structure using the template approach described by Rourick et al., subfragments are clearly the simplest units of comparison.
3. Identification by Matching Compounds (Rational Numbers® FragSearch and IndexSearch)
The basic approach used to assign subfragments and fragments to a single template compound (systematic bond disconnection followed by comparison of the heavy atom distributions) has also been applied to searching molecular structure databases. Traditional spectral libraries are not needed. A set of modular structures is derived from the mass spectral data, and this set is then compared to all computerized molecular structures in the database that have a similar mass. Computerized molecular structures that match modular structures are then ranked according to how many modular structures are matched and the scores of the matching modular structures. The overall objective is to draw a rough picture of molecules that would correlate with the accurate-mass fragmentation data, and then to search through an index of the MDL® (now Symyx) Available Chemicals Directory or PubChem to find matching compounds. For searching, modular structures with up to four subfragments have been used. The searching was done by comparing the heavy atom compositions of subfragments to the heavy atom compositions of subgroups generated by applying systematic bond disconnection to a computerized molecular structure. The distribution of RDEs (ring and double-bond equivalents) was also compared.
Determining modular structures from mass spectral data requires finding the accurate masses of the subfragments, determining the elemental compositions of the subfragments, and finding a way to connect the subfragments together in a manner consistent with all of the mass spectral data. This invention deals with finding the accurate masses of the subfragments.
Prior Art Used to Determine the Accurate Masses of Subfragments
The spectral ions are neutralized by adding the mass of a proton to negative ions and subtracting the mass of a proton from positive ions. Positive and negative ion data are then pooled. This procedure of neutralizing ions is performed on all data sets, prior to finding the subfragment masses.
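The neutralization step can be written as a small Python sketch. It assumes singly charged ions; the m/z values and the function name are hypothetical and serve only to show the arithmetic.

```python
PROTON_MASS = 1.007276  # mass of a proton in unified atomic mass units

def neutralize(mz: float, polarity: str) -> float:
    """Convert a singly charged ion m/z to the corresponding neutral mass.

    Positive ions lose the mass of a proton; negative ions gain it.
    """
    if polarity == "+":
        return mz - PROTON_MASS
    if polarity == "-":
        return mz + PROTON_MASS
    raise ValueError("polarity must be '+' or '-'")

# Hypothetical peak lists, for illustration only.
positive_ions = [359.1714, 342.1449]
negative_ions = [357.1568]

# Pool both polarities into one list of neutral masses.
pooled = sorted(
    [neutralize(mz, "+") for mz in positive_ions]
    + [neutralize(mz, "-") for mz in negative_ions]
)
```

After this step, ions recorded in either polarity can be treated as one data set.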
Accurate masses of subfragments are currently found in a four step process (Sweeney 2003):
Step 1: Partitions of the integral molecular weight are found. A partition is a mathematical term for a set of integers that sum up to another integer. For each partition, every combination of those integers is then summed to select those partitions that best account for the fragment masses.
Step 2: Fragment masses are then “assigned” as sums of different combinations of the individual integers. The individual integers can be viewed as the integral masses of subfragments; assigned fragments are then sums of subfragments. A score based on coverage (weighted intensity) of each assigned ion is also calculated.
Step 3: Partitions with “linked subfragments” are then removed. Linked subfragments are basically trivial solutions in which a subfragment has been divided into two subfragments that always are assigned together.
Step 4: The fragments have been assigned as integral sums of various combinations of subfragments. The mass defects of the subfragments that compose any particular fragment must also sum up to the mass defect of that fragment. Since the mass defects of the fragments are known, the mass defects of the subfragments can be calculated by solving a set of simultaneous linear equations.
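Steps 1 to 3 can be sketched in Python on a toy problem. This is a minimal illustration under stated assumptions, not the published implementation: the integral molecular weight (20), the fragment masses and intensities, and all function names are hypothetical, and the score is assumed to be the plain percentage of total intensity carried by assigned fragments.

```python
from itertools import combinations

def partitions(total, parts, smallest=1):
    """Yield non-decreasing tuples of `parts` integers summing to `total`."""
    if parts == 1:
        if total >= smallest:
            yield (total,)
        return
    for first in range(smallest, total // parts + 1):
        for rest in partitions(total - first, parts - 1, first):
            yield (first,) + rest

def assign(partition, fragment_masses):
    """Map each fragment mass to the index subsets whose parts sum to it."""
    assignments = {}
    indices = range(len(partition))
    for size in range(1, len(partition) + 1):
        for subset in combinations(indices, size):
            mass = sum(partition[i] for i in subset)
            if mass in fragment_masses:
                assignments.setdefault(mass, []).append(subset)
    return assignments

def score(assignments, intensities):
    """Assumed score: percentage of total intensity that is assigned."""
    total = sum(intensities.values())
    return 100.0 * sum(intensities[m] for m in assignments) / total

def has_linked_subfragments(assignments, n_parts):
    """True if some pair of subfragments only ever appears together."""
    subsets = [set(s) for options in assignments.values() for s in options]
    return any(
        all((i in s) == (j in s) for s in subsets)
        for i, j in combinations(range(n_parts), 2)
    )

# Hypothetical neutralized integral fragment masses with intensities.
fragments = {20: 100.0, 17: 40.0, 12: 30.0, 5: 20.0, 3: 10.0}

results = []
for candidate in partitions(20, 3):
    assigned = assign(candidate, fragments)
    if not has_linked_subfragments(assigned, len(candidate)):
        results.append((score(assigned, fragments), candidate))

best_score, best_partition = max(results)
```

For this toy spectrum the partition (3, 5, 12) accounts for every fragment, whereas a partition such as (1, 2, 17) is discarded because its first two subfragments are only ever assigned together.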
At this point we have a score and a set of subfragment accurate masses for each partition. The current process for finding accurate masses of subfragments is CPU intensive and therefore time-consuming.
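The simultaneous equations of Step 4 can be illustrated with a short Python sketch. The process described here solves them by multi-stage Monte Carlo optimization; the sketch below substitutes ordinary least squares (normal equations solved by Gaussian elimination) purely to show the structure of the problem. The three-subfragment assignment matrix and the fragment mass defects are invented toy values, and the function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def least_squares(rows, values):
    """Least-squares solution of rows @ x = values via normal equations."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(n)]
    return solve(ata, atb)

# Each row marks which of three subfragments make up one fragment.
membership = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
# Invented fragment mass defects, with small measurement errors.
fragment_defects = [0.0102, 0.0198, 0.0301, 0.0499, 0.0600]

subfragment_defects = least_squares(membership, fragment_defects)
```

Because five fragments constrain three unknowns, the errors in individual fragment masses are averaged out, which is the sense in which subfragments are “weighed” in combinations. If two subfragments were linked, their columns in the membership matrix would be identical and this least-squares system would have no unique solution.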
Partitioning is very CPU intensive and this has limited its development because most potential improvements would also significantly increase the CPU requirements. As an illustration, the data for xemilofiban, which was an example in the 2003 Sweeney paper, will be used. The masses of the subfragments of 4-subfragment partitions were found.
The accurate-mass MS/MS data for xemilofiban in the paper has 12 fragments, including the protonated molecule. The molecular weight is 358. For this molecular weight, depending on the starting mass, there are 151559 possible integral partitions of 4 subfragments. Generating these 151559 partitions took 6 milliseconds (step 1). Finding partitions having a score greater than 57 (arbitrary score chosen for comparison purposes) took another 253 milliseconds (step 2). The most CPU intensive operation was calculating the mass defects using the multi-stage Monte Carlo optimization (MSMCO) to solve the simultaneous equations. The 169 MSMCO optimizations that were done took 10237 milliseconds (step 3), roughly 61 milliseconds each. This gave a total time of 10496 milliseconds. This does not include any operations to determine possible spatial arrangements of the subfragments or to find elemental compositions of the subfragments.
Total Partitions: 151559

Score   A         B         C          D
58      17.0264   18.9376   141.0789   181.1214
58      17.0265   35.9641   141.0789   164.0948
58      17.0265   40.0425   141.0788   160.0165
70      17.0267   64.9791   135.0798   141.0790
73      17.0270   82.0063   118.0525   141.0787
61      17.0264   82.0058   124.0526   135.0797
58      18.9377   40.0424   141.0789   158.1053
61      33.9853   48.0205   135.0797   141.0790
61      35.9642   41.9630   99.1158    181.1213
70      35.9639   46.0420   135.0796   141.0790
61      35.9645   82.0056   99.1155    141.0785
61      36.0464   82.0058   105.0326   135.0798
61      38.0498   97.0302   103.0288   120.0554
67      40.0426   41.9634   135.0797   141.0788
67      40.0426   46.0419   95.0370    177.0430
73      40.0429   82.0059   95.0368    141.0789
58      40.1173   82.0058   100.9617   135.0797
58      42.0467   82.0059   93.0330    141.0789
61      41.9630   82.0055   99.1158    135.0800
73      46.0421   82.0059   95.0369    135.0797
61      48.0204   82.0058   93.0586    135.0797
58      53.0736   82.0064   82.0055    141.0788
58      53.0737   82.0063   88.0051    135.0792
61      58.9805   64.9790   76.0995    158.1052
61      58.9803   76.0997   82.0056    141.0787
58      59.0731   82.0059   82.0058    135.0797
58      59.6609   81.4181   82.0058    135.0797
61      64.9788   76.0998   82.0055    135.0801

Partitions bolded in the original table correlate well with the molecular structure.
The basic problem with the present approach for generating modular structures is that the process is very CPU intensive and therefore time-consuming, especially as the molecular weight and the number of subfragments increase (e.g., a 5-subfragment set of masses takes much longer to find than a 4-subfragment set of masses). More computer power is very helpful; using a computer cluster such as the Sun Grid allows parallel processing and significantly reduces the elapsed time, but introduces the added complexity of opening and maintaining an account on a compute utility.