The present invention relates to chemical structures and, more particularly, to a method and apparatus for assigning numeric or alpha-numeric identifiers to chemical structures and for identifying chemical equivalency.
In the field of chemistry there are numerous molecular properties that are of possible interest to chemists. Molecules may be compared for equivalency based on these properties. For example, a high level or general comparison of equivalency may compare two compounds or molecules with respect to the number of non-hydrogen atoms they contain or the number of bonds they have. Alternatively, a more detailed comparison of equivalency may compare chemical structures based not only on the number of bonds or atoms, but on specific atomic connectivity or spatial relationships of those atoms. For example, ring systems or cyclic systems of two molecules or compounds may be compared for equivalency.
A cyclic system is a chemical structure in which atoms are bonded together to form single or multiple rings. Cyclic system equivalence is of particular interest in the field of medicinal chemistry where the equivalency of cyclic systems may correspond to physiological or biological properties. However, a particular criterion of molecular equivalence only becomes operable, in a practical sense, when a chemist can identify each of the distinct classes of molecules that are equivalent with respect to the properties of interest.
Relative and absolute identification schemes are commonly used for identifying chemical structures that are molecularly equivalent. Relative identification schemes assign a unique identifier to each molecular structure encountered by the identification scheme. The assigned identifiers are not related to any particular information in a chemical structure. For example, a relative identification scheme may assign an identifier of one to a first chemical structure, an identifier of two to a second chemical structure and so on. A relative identification scheme, therefore, requires a memory that stores a list of identifiers that have been previously assigned to molecular structures.
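The dependence of a relative scheme on stored state can be illustrated with a minimal sketch. The class name `RelativeRegistry` and the use of a hashable `structure_key` to stand for a structure are assumptions made for illustration only; they do not come from any particular identification system.

```python
class RelativeRegistry:
    """Hands out 1, 2, 3, ... in order of first appearance (hypothetical helper)."""

    def __init__(self):
        # The stored list of identifiers that have already been assigned.
        self._assigned = {}

    def identify(self, structure_key):
        # structure_key: any hashable description of a structure (assumed here).
        if structure_key not in self._assigned:
            self._assigned[structure_key] = len(self._assigned) + 1
        return self._assigned[structure_key]


registry = RelativeRegistry()
first = registry.identify("structure A")   # first structure seen
second = registry.identify("structure B")  # second structure seen
again = registry.identify("structure A")   # looked up, not reassigned
```

Because the identifier depends on the order of arrival, a second registry that encounters "structure B" first would give it the identifier one, so identifiers from two registries cannot be compared directly.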
An absolute identification scheme assigns an identifier to a molecular structure based solely on the information available in the molecular structure being identified. For example, an absolute identification scheme may assign the identifier 32 to a chemical structure having three atoms and two bonds, wherein the first digit represents the number of atoms in the structure and the second digit represents the number of bonds in the structure. Absolute identification schemes are beneficial in that the scheme need not check whether an identifier is already in use when assigning a new identifier. Additionally, through the use of an absolute identification scheme, two collections of compounds (e.g., molecular structures) may be directly compared with respect to the molecules they contain without coordinating their identifiers.
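The two-digit example above may be sketched as follows. The function name `atom_bond_identifier` is hypothetical, and the restriction to single-digit counts is an assumption that follows from the "first digit, second digit" wording of the example.

```python
def atom_bond_identifier(num_atoms, num_bonds):
    """Toy absolute identifier: first digit = atom count, second digit = bond count."""
    if not (0 <= num_atoms <= 9 and 0 <= num_bonds <= 9):
        # Assumption: the two-digit toy encoding only covers counts 0 through 9.
        raise ValueError("this single-digit toy encoding only covers counts 0-9")
    return num_atoms * 10 + num_bonds


# Three atoms and two bonds, as in the example in the text.
identifier = atom_bond_identifier(3, 2)
```

No stored list of earlier identifiers is consulted; the result depends only on the structure itself.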
Criteria of equivalence are useful when a chemist is selecting compounds (e.g., collections or mixtures of molecules) for purchase. When selecting compounds for purchase, the chemist may first filter the list of compounds to screen out the compounds that are clearly of no interest. After screening out the uninteresting compounds, the chemist may visually inspect the remaining compounds. The chemist may sort the remaining compounds according to their cyclic system identifiers and, therefore, may include or exclude groups of compounds having common cyclic system identifiers. The use of cyclic system identification may thus save time and reduce error in the selection. However, if a particular identification system erroneously assigns the same cyclic system identifier to compounds having different cyclic systems, the chemist loses faith in the fidelity of the identification system and in its ability to distinguish different chemical structures.
Criteria of equivalence are also useful in comparing two or more different collections of compounds. For example, if a chemist desires to know which compounds are similar with regard to particular properties among compound collections and which compounds differ with regard to particular properties among the compound collections, the chemist may use an identification system to name or identify each compound in the two compound collections. If the identification system the chemist uses is an absolute identification system, the chemist may simply compare the identifiers of the chemical structures of the compounds in the two compound collections. An identifier common to the compound collections indicates a common compound between the compound collections. A unique identifier in one of the collections indicates a compound found only in that collection.
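With absolute identifiers, the comparison described above reduces to set operations. The following is a minimal sketch; the function name `compare_collections` and the example identifier values are illustrative assumptions.

```python
def compare_collections(ids_a, ids_b):
    """Split the identifiers of two collections into shared and collection-specific sets."""
    a, b = set(ids_a), set(ids_b)
    return {
        "common": a & b,   # identifiers present in both collections
        "only_a": a - b,   # identifiers found only in the first collection
        "only_b": b - a,   # identifiers found only in the second collection
    }


# Hypothetical identifier lists for two compound collections.
result = compare_collections([32, 43, 54], [43, 65])
```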
In chemical and drug research, chemists often construct compound screening collections or libraries. Screening collections are used to scan a subset of a collection of compounds for a particular activity, rather than scanning the entire compound collection. The subset could be designed to emphasize particular types of compounds or could be designed to contain dissimilar compounds. If the cyclic systems of the compounds are used as a typing criterion for the collection, screening subsets are easily constructed. For example, after the chemist uses a filtering process to exclude compounds that the chemist does not wish to consider, the chemist may randomly order the cyclic-system identifiers and then select the number of compounds the chemist wishes from each successive cyclic system group until the chemist has a subset of the desired size.
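The subset construction described above may be sketched as follows, assuming the filtered compounds have already been grouped by cyclic-system identifier. The function name, the parameters `per_group` and `subset_size`, and the example data are hypothetical.

```python
import random


def build_screening_subset(groups, per_group, subset_size, seed=None):
    """groups: dict mapping a cyclic-system identifier to a list of compounds."""
    rng = random.Random(seed)
    cyclic_ids = list(groups)
    rng.shuffle(cyclic_ids)                 # randomly order the identifiers
    subset = []
    for cyclic_id in cyclic_ids:            # visit each successive group
        for compound in groups[cyclic_id][:per_group]:
            if len(subset) == subset_size:  # desired size reached
                return subset
            subset.append(compound)
    return subset                           # fewer compounds were available


# Hypothetical grouped collection.
groups = {101: ["a1", "a2", "a3"], 202: ["b1"], 303: ["c1", "c2"]}
subset = build_screening_subset(groups, per_group=2, subset_size=4, seed=0)
```

Taking at most `per_group` compounds from each group keeps any single cyclic system from dominating the subset, which suits the case where dissimilar compounds are wanted.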
If a screening operation has a large number of active compounds (compounds active in a biological test system of interest to a project team), the task of focusing on which of those compounds (called leads or hits) to pursue as useful starting points for lead optimization can be difficult. Numerous factors enter into the evaluation of a lead and, in many cases, close analogs of an active compound exist which differ at only one position by a small structural change from the active compound. A structure activity relationship (SAR) is sometimes said to exist if a chemist finds pairs of close analogs that differ significantly in their activity. Grouping compounds by cyclic system greatly accelerates and systematizes the process of finding such pairs. Finding such an SAR supports the choice of that cyclic system for one criterion to be used in finding leads.
In lead optimization efforts, large numbers of closely related compounds may be synthesized and tested. These efforts are guided by a growing understanding of the related SARs. SARs evolve out of numerous pairwise comparisons of closely related structures. If N compounds related to a lead exist, there are N(N−1)/2, or roughly N²/2, possible pairwise comparisons that must be considered. If N is between 1,000 and 10,000, there may be between roughly 500,000 and 50,000,000 pairwise comparisons.
Obviously, in practice most pairwise comparisons are never made. Instead, the comparisons that are considered are restricted to much smaller subgroups of compounds. For subgroups that are each 1/K the size of N, there are roughly (N/K)²/2 pairwise comparisons per subgroup. Thus, if N is between 1,000 and 10,000 and each subgroup is 1/100 the size of N (i.e., K is 100), there may be between roughly 50 and 5,000 pairwise comparisons per subgroup. With such efficiency gains in subgrouping, there is a compelling interest in a flexible and fast way of forming and organizing subgroups. Such a flexible and fast technique is provided by using identified cyclic systems.
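The counts quoted above can be checked with a few lines of arithmetic. The function name `pair_count` is an illustrative assumption; the exact values differ slightly from the rounded N²/2 estimates.

```python
def pair_count(n):
    """Exact number of unordered pairs among n compounds: n(n-1)/2."""
    return n * (n - 1) // 2


K = 100  # each subgroup is 1/100 the size of the full set
for n in (1_000, 10_000):
    # Full-set pair count, then the pair count inside one subgroup of size n/K.
    print(n, pair_count(n), pair_count(n // K))
```

For N of 1,000 the exact counts are 499,500 pairs overall and 45 within a subgroup of 10; for N of 10,000 they are 49,995,000 overall and 4,950 within a subgroup of 100.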
A cyclic system browsing index partitions a large compound collection into interesting and non-overlapping subgroups, and thereby, enables a user to realize the preceding efficiencies in constructing useful pairwise comparisons. Constructing a comparable number of subgroups using conventional substructure and similarity searching methods is a time consuming and error prone operation.
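A minimal sketch of such a partition, assuming each compound already carries a cyclic-system identifier, is given below. The function name `pairs_within_groups` and the example compound names and identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairs_within_groups(compounds):
    """compounds: iterable of (name, cyclic_system_id) tuples."""
    groups = defaultdict(list)
    for name, cyclic_id in compounds:
        groups[cyclic_id].append(name)      # non-overlapping subgroups
    # Only pairs that share a cyclic system are enumerated.
    return {cid: list(combinations(names, 2)) for cid, names in groups.items()}


demo = [("cmpd1", 7), ("cmpd2", 7), ("cmpd3", 7), ("cmpd4", 9), ("cmpd5", 9)]
pairs = pairs_within_groups(demo)
```

In this example the five compounds yield four within-group pairs rather than the ten pairs that exist across the whole set.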
As will be appreciated by those having ordinary skill in the art, highly accurate schemes for identifying chemical structures based on chemical graphs or pseudographs play a key role in the foregoing applications. An accurate identification system facilitates high-throughput browsing, grouping and searching of chemical databases. One absolute identification scheme, commonly referred to as the "Morgan Algorithm," was proposed in "The Generation of a Unique Machine Description for Chemical Graphs - A Technique Developed at Chemical Abstracts Service," J. Chem. Doc. 5:107, 1965. As shown in FIG. 1, a process representative of the Morgan Algorithm 10 includes various steps that may be executed on a processor, a computer, or the like. At step 12, the Morgan Algorithm 10 receives a chemical diagram, which may be in the form of a computer file. Step 14 processes the chemical diagram by assigning to each vertex (i) of the chemical structure an initial vertex value.
After each vertex (i) of the chemical structure has been assigned an initial value, step 16 updates the value of each vertex (i). In particular, for each vertex (i), step 16 sums the vertex values for the vertices connected to the vertex in question and assigns the sum to the vertex in question. A mathematical representation of the operation performed by step 16 is shown below in Equation 1.

$$v_i' = \sum_{j} v_j \qquad \text{(Equation 1)}$$
Wherein $v_i'$ is the updated vertex value for vertex i, j is an index representative of the vertices connected to vertex i and $v_j$ is the value of vertex j. Equation 1 is repeated for each value of i (i.e., for each vertex), wherein the number of values of i is equal to the number of vertices in the chemical graph.
Step 18 determines whether the Morgan Algorithm 10 has iterated sufficiently to converge to a numerical identifier for the chemical structure. If the Morgan Algorithm 10 has not sufficiently iterated to converge, control passes from step 18 to step 16, wherein the value of each vertex is again updated. If, however, the Morgan Algorithm 10 has sufficiently iterated, control passes from step 18 to step 20, wherein the Morgan Algorithm 10 assigns a numerical identifier (ID) or numerical name to the chemical structure. Step 20 may be carried out by taking the sum or the product of all of the vertex values for the chemical structure.
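Steps 14 through 20 may be sketched as follows. The description above does not specify the initial vertex value or the convergence test, so two assumptions are made and marked in the comments: every vertex starts at one, and iteration stops when the number of distinct vertex values no longer increases. The function name `morgan_identifier` and the `max_rounds` safeguard are likewise hypothetical.

```python
def morgan_identifier(adjacency, max_rounds=100):
    """adjacency: dict mapping each vertex to a list of its neighbours."""
    # Step 14: initial vertex values (assumption: every vertex starts at 1).
    values = {v: 1 for v in adjacency}
    distinct = len(set(values.values()))
    for _ in range(max_rounds):
        # Step 16: each vertex takes the sum of its neighbours' values.
        updated = {v: sum(values[n] for n in adjacency[v]) for v in adjacency}
        new_distinct = len(set(updated.values()))
        # Step 18 (assumed test): stop once no new distinct values appear.
        if new_distinct <= distinct:
            break
        values, distinct = updated, new_distinct
    # Step 20: combine the vertex values; the sum is used here.
    return sum(values.values())


# A three-vertex chain, with two different vertex labelings.
chain = {0: [1], 1: [0, 2], 2: [1]}
relabeled = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
chain_id = morgan_identifier(chain)
relabeled_id = morgan_identifier(relabeled)
```

The two labelings of the chain receive the same identifier, which is the property an absolute scheme requires.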
When operating on certain chemical structures, the Morgan Algorithm may not converge to a unique solution and may fail to distinguish non-isomorphic chemical structures. Therefore, the Morgan Algorithm is less accurate than is desired by chemists and the like. An algorithm capable of distinguishing all non-isomorphic chemical graphs does not presently exist.
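One way to see this limitation is with two abstract graphs in which every vertex has the same number of neighbours. The sketch below assumes a uniform initial vertex value of one and uses hypothetical helper names; the graphs (a triangular prism and the complete bipartite graph on three plus three vertices) are chosen for illustration and are not taken from the description above.

```python
def neighbour_sum_rounds(adjacency, rounds):
    """Apply the neighbour-sum update a fixed number of times; return sorted values."""
    values = {v: 1 for v in adjacency}  # assumption: uniform initial value
    for _ in range(rounds):
        values = {v: sum(values[n] for n in adjacency[v]) for v in adjacency}
    return sorted(values.values())


def has_triangle(adjacency):
    """True if some three vertices are mutually connected."""
    return any(
        c in adjacency[a]
        for a in adjacency
        for b in adjacency[a]
        for c in adjacency[b]
        if c != a
    )


# Triangular prism: two triangles joined by three bonds.
prism = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5],
         3: [0, 4, 5], 4: [1, 3, 5], 5: [2, 3, 4]}
# Complete bipartite graph: vertices 0-2 each bonded to vertices 3-5.
k33 = {0: [3, 4, 5], 1: [3, 4, 5], 2: [3, 4, 5],
       3: [0, 1, 2], 4: [0, 1, 2], 5: [0, 1, 2]}

same_values = all(
    neighbour_sum_rounds(prism, r) == neighbour_sum_rounds(k33, r) for r in range(5)
)
```

The prism contains triangles and the bipartite graph does not, so the two are not isomorphic, yet every vertex carries the same value in both graphs after any number of updates, and any sum or product of those values is therefore identical.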
According to one aspect, the present invention may be embodied in a method of generating a numerical identifier representative of a chemical structure having a first atom of a first type, a second atom of a second type and a bond connecting the first atom and the second atom. The method may include the steps of representing the first atom with a first numerical value, representing the second atom with a second numerical value and representing the bond with a numerical bond value. The method may further include the steps of determining a number of bridge bonds that are found in the chemical structure, calculating an updated first numerical value based on the first numerical value, the second numerical value and the numerical bond value, calculating an updated second numerical value based on the second numerical value, the first numerical value and the numerical bond value and calculating the numerical identifier based on the updated first numerical value, the updated second numerical value and the number of bridge bonds.
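The step of determining a number of bridge bonds may be sketched as follows. This section does not define "bridge bond," so the sketch assumes the graph-theoretic meaning: a bond that lies on no ring, such that removing it would disconnect the structure. The function name `count_bridges` and the edge-list input format are hypothetical.

```python
def count_bridges(num_atoms, bonds):
    """bonds: list of (a, b) atom-index pairs; parallel bonds may repeat a pair."""
    adjacency = [[] for _ in range(num_atoms)]
    for edge_id, (a, b) in enumerate(bonds):
        adjacency[a].append((b, edge_id))
        adjacency[b].append((a, edge_id))

    discovery = [-1] * num_atoms   # order in which atoms are first reached
    low = [0] * num_atoms          # earliest atom reachable via one ring closure
    counter = 0
    bridges = 0

    def visit(atom, via_edge):
        nonlocal counter, bridges
        discovery[atom] = low[atom] = counter
        counter += 1
        for neighbour, edge_id in adjacency[atom]:
            if edge_id == via_edge:
                continue                      # do not reuse the bond we arrived by
            if discovery[neighbour] == -1:
                visit(neighbour, edge_id)
                low[atom] = min(low[atom], low[neighbour])
                if low[neighbour] > discovery[atom]:
                    bridges += 1              # no ring closes back over this bond
            else:
                low[atom] = min(low[atom], discovery[neighbour])

    for atom in range(num_atoms):
        if discovery[atom] == -1:
            visit(atom, -1)
    return bridges


# Two three-membered rings joined by a single bond between atoms 2 and 3.
two_rings = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
bridge_count = count_bridges(6, two_rings)
```

The recursive traversal is adequate for small structures; a very large structure might call for an iterative version to stay within Python's recursion limit.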
In some embodiments, the steps of representing the first and second atoms with first and second numerical values may include the step of representing the first and second atoms with different numerical values if the first and second atom types are not similar or representing the first and second atoms with identical numerical values if the first and second atom types are similar.
In certain embodiments, the first and second atoms may be represented by first and second chemical symbols having first and second sets of characters, wherein the steps of representing the first and second atoms with first and second numerical values may include the steps of setting the first numerical value equivalent to an ASCII code sum of the first set of characters and setting the second numerical value equivalent to an ASCII code sum of the second set of characters.
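The ASCII code sum may be computed as in the following minimal sketch. The function name `symbol_value` is hypothetical; the character codes are the standard ASCII values.

```python
def symbol_value(symbol):
    """Sum of the ASCII codes of the characters in a chemical symbol."""
    return sum(ord(ch) for ch in symbol)


carbon = symbol_value("C")     # 67
nitrogen = symbol_value("N")   # 78
chlorine = symbol_value("Cl")  # 67 + 108 = 175
bromine = symbol_value("Br")   # 66 + 114 = 180
```

A plain sum does not guarantee a distinct value for every symbol: "Na" (78 + 97) also sums to 175, the same value as "Cl".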
Additionally, the bond connecting the first atom and the second atom may have a bond type and the step of representing the bond with a numerical bond value may comprise representing the bond with a numerical bond value that is related to the bond type. The numerical bond value may be divided by a factor of two if more than one bond connects the first atom and the second atom. If the bond type is a single, double, triple or aromatic bond, the step of representing the bond with a numerical bond value may include the step of making the numerical bond value equal to one, two, three or four, respectively.
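The bond value assignment may be sketched as follows. The mapping of bond types to one through four and the division by two come from the paragraph above; the names `BOND_VALUES`, `bond_value` and `parallel_bonds` are hypothetical.

```python
# Numerical values for single, double, triple and aromatic bonds.
BOND_VALUES = {"single": 1, "double": 2, "triple": 3, "aromatic": 4}


def bond_value(bond_type, parallel_bonds=1):
    """parallel_bonds: how many bonds connect the same pair of atoms."""
    value = BOND_VALUES[bond_type]
    if parallel_bonds > 1:
        value = value / 2  # divided by a factor of two when bonds are duplicated
    return value


aromatic = bond_value("aromatic")
halved = bond_value("double", parallel_bonds=2)
```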
The method may further include the step of scaling the updated first and second numerical values using a modulus operation. Additionally, the method may include the step of generating an array of prime numbers, the array having a size at least as large as the updated first numerical value and the updated second numerical value, wherein the step of calculating the numerical identifier is based on the array of prime numbers.
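One possible reading of these two steps is sketched below: a modulus keeps each updated value within a fixed range, and the scaled value then indexes an array of primes sized to cover that range. The modulus of 97, the helper names, and the use of the scaled value as an array index are all assumptions; this section does not state how the primes enter the identifier calculation.

```python
def first_primes(count):
    """Return the first `count` prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # candidate is prime if no smaller prime up to its square root divides it
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


MODULUS = 97                    # hypothetical scaling modulus
PRIMES = first_primes(MODULUS)  # one prime for every possible scaled value


def scale(value):
    """Scale an updated atom value into the range 0..MODULUS-1."""
    return value % MODULUS


def prime_for(value):
    """Look up the prime associated with a scaled value (assumed usage)."""
    return PRIMES[scale(value)]


scaled = scale(1000)
prime = prime_for(1000)
```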
According to a second aspect, the present invention may be used on a processor and embodied in a system for generating a numerical identifier representative of a chemical structure having a first atom of a first type, a second atom of a second type and a bond connecting the first atom and the second atom. The system may include a computer readable medium communicatively coupled to the processor, a first portion of software stored on the computer readable medium and adapted to be executed on the processor to represent the first atom with a first numerical value and to represent the second atom with a second numerical value, a second portion of software stored on the computer readable medium and adapted to be executed on the processor to represent the bond with a numerical bond value, a third portion of software stored on the computer readable medium and adapted to be executed on the processor to determine a number of bridge bonds found in the chemical structure and a fourth portion of software stored on the computer readable medium and adapted to be executed on the processor to calculate an updated first numerical value based on the first numerical value, the second numerical value and the numerical bond value. The system may further include a fifth portion of software stored on the computer readable medium and adapted to be executed on the processor to calculate an updated second numerical value based on the second numerical value, the first numerical value and the numerical bond value and a sixth portion of software stored on the computer readable medium and adapted to be executed on the processor to calculate the numerical identifier based on the updated first numerical value, the updated second numerical value and the number of bridge bonds.
According to a third aspect, the present invention may be embodied in a method of compiling a library for drug research, wherein the library may include a number of identifiers representative of a number of chemical structures. The method may include the steps of selecting a chemical structure from the number of chemical structures, the selected chemical structure having a first atom, a second atom and a bond connecting the first atom and the second atom, representing the first atom with a first numerical value, representing the second atom with a second numerical value, representing the bond with a numerical bond value, calculating an updated first numerical value based on the first numerical value, the second numerical value and the numerical bond value, and calculating an updated second numerical value based on the second numerical value, the first numerical value and the numerical bond value. The method may also include the step of calculating the identifier based on the updated first numerical value and the updated second numerical value and storing the identifier in a memory. Additionally, the method may include the step of determining a number of bridge bonds in the chemical structure and using that number of bridge bonds to calculate the identifier.
In some embodiments the method may be repeated for each chemical structure in the number of chemical structures, thereby storing the number of identifiers in the memory.
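Repeating the method over a set of structures and storing the results may be sketched as follows. The identifier calculation itself is passed in as a function, since its details are not fixed here; the stand-in used in the example simply encodes atom and bond counts, and every name in the sketch is hypothetical.

```python
def compile_library(structures, identify):
    """structures: dict of name -> structure; identify: structure -> identifier."""
    library = {}
    for name, structure in structures.items():
        # Store each structure under the identifier calculated for it.
        library.setdefault(identify(structure), []).append(name)
    return library


def toy_identify(structure):
    # Stand-in identifier (assumption): structure is an (atoms, bonds) pair.
    atoms, bonds = structure
    return atoms * 10 + bonds


# Illustrative entries given as (atom count, bond count).
structures = {"water": (3, 2), "hydrogen sulfide": (3, 2), "methane": (5, 4)}
library = compile_library(structures, toy_identify)
sorted_ids = sorted(library)  # identifiers in sorted order for browsing
```

Keeping the library keyed by identifier means that searching for a given identifier is a dictionary lookup and sorting is a sort of the keys.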
In certain embodiments, the method may also include the steps of searching the memory for chemical structures having a desired attribute and outputting a list of chemical structures having the desired attribute. The list of chemical structures may then be used to select a compound for medical treatment.
Additionally, or alternatively, the method may also include the steps of sorting the number of identifiers in the memory according to a desired attribute and outputting a sorted list of chemical structures sorted according to the desired attribute. The sorted list of chemical structures may then be used to select a compound for medical treatment.
In other embodiments, the library may be a first library, the number of identifiers may be a first number of identifiers and the number of chemical structures may be a first number of chemical structures, and the method may further include the step of compiling a second library including a second number of identifiers representative of a second number of chemical structures. The method may then compare the first number of identifiers with the second number of identifiers.
These and other features of the present invention will be apparent to those of ordinary skill in the art in view of the description of the preferred embodiment, which is made with reference to the drawings, which are briefly described below.