A frequently used approach for the development of new drugs is similarity-based screening. This approach does not depend on structural knowledge about the molecular target, that is, the binding mode and the exact location of the interaction between the drug and the body of the patient. The knowledge of one drug (active molecule or compound) is sufficient. Starting from a single active molecule or a small set (<1000) of active molecules, databases of small (druglike) molecules are screened for candidates for the development of new drugs. Publicly available databases contain between two and three million druglike molecules. Proprietary databases may contain several times this number.
The screening can be done in a laboratory or (virtually) with a computer. In practice, virtual screening is done before lab-based screening to save time and cost. Established virtual screening methods use one-to-one comparisons between active molecules and the screened molecules. Simple bit-vector-based methods can calculate one million pairwise comparisons in one minute. More elaborate graph-based methods, which compute maximum common subgraphs or substructures or edit distances, need several hours for one million pairwise comparisons.
Fast virtual screening programs represent molecules by a vector of bits. Examples of such programs are MACCS Keys from MDL Information Systems Inc., BCI fingerprints from Barnard Chemical Information Ltd. and Daylight Fingerprints. The screening is done by bit-vector comparison. Each bit encodes the absence or presence of one small molecular fragment, or of a set of such fragments, within the represented molecule.
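As a minimal sketch of such a bit-vector comparison, the following Python code stores each fingerprint as an integer and scores two fingerprints with the Tanimoto coefficient. The Tanimoto coefficient is assumed here as a commonly used measure; the text above does not name a specific one. The fragment dictionary and the helper names are hypothetical and do not reproduce the MACCS, BCI or Daylight definitions.

```python
# Hypothetical fragment dictionary: each fragment is assigned one bit position.
FRAGMENT_BITS = {"C=O": 0, "c1ccccc1": 1, "N": 2, "O": 3}


def fingerprint(fragments):
    """Set one bit for every fragment that is present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits


def tanimoto(a, b):
    """Shared set bits divided by the bits set in either fingerprint."""
    common = bin(a & b).count("1")
    total = bin(a | b).count("1")
    # Two empty fingerprints are treated as identical (an assumption).
    return common / total if total else 1.0


# Two made-up molecules described only by the fragments they contain.
mol_a = fingerprint(["C=O", "c1ccccc1", "N"])
mol_b = fingerprint(["C=O", "c1ccccc1", "O"])
similarity = tanimoto(mol_a, mol_b)  # 2 shared bits out of 4 set bits
```

The comparison reduces to two bitwise operations and two bit counts, which is why such methods are fast. It also shows why two molecules can score highly without being structurally related: only the presence of fragments is compared, not how they are connected.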
While the comparison of bit-vectors is very fast, the results are hard to interpret. The descriptor allows for high similarities between completely unrelated structures that share a set of common fragments. A maximum common subgraph based similarity measure, as applied in this invention, is highly specific. The reported matches can be easily visualized and interpreted. The measure can be relaxed in well-defined ways by using edit operations or by representing the compared molecules as reduced graphs.
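To illustrate what a maximum common subgraph based similarity can look like, the sketch below computes the largest common induced subgraph of two small labeled graphs by backtracking and normalizes its size by the larger graph. Both the normalization and the graph encoding (a pair of a label dictionary and an adjacency dictionary) are assumptions made for this example; the text above does not define the measure, and this brute-force search is not the method of the invention.

```python
def mcs_size(g1, g2):
    """Number of atoms in a maximum common induced subgraph (brute force)."""
    labels1, adj1 = g1
    labels2, adj2 = g2
    atoms1 = list(labels1)
    best = 0

    def extend(i, mapping):
        nonlocal best
        best = max(best, len(mapping))
        # Stop when all atoms are decided or the bound cannot beat `best`.
        if i == len(atoms1) or len(mapping) + len(atoms1) - i <= best:
            return
        a = atoms1[i]
        used = set(mapping.values())
        for b in labels2:
            if b in used or labels1[a] != labels2[b]:
                continue
            # Induced: a bond exists on one side iff it exists on the other.
            if all((m in adj1[a]) == (mapping[m] in adj2[b]) for m in mapping):
                mapping[a] = b
                extend(i + 1, mapping)
                del mapping[a]
        extend(i + 1, mapping)  # alternative: leave atom `a` unmatched

    extend(0, {})
    return best


def mcs_similarity(g1, g2):
    """Assumed normalization: common atoms divided by the larger atom count."""
    return mcs_size(g1, g2) / max(len(g1[0]), len(g2[0]))


# Heavy-atom graphs: (atom labels, adjacency sets).
ethanol = ({0: "C", 1: "C", 2: "O"}, {0: {1}, 1: {0, 2}, 2: {1}})
ethylamine = ({0: "C", 1: "C", 2: "N"}, {0: {1}, 1: {0, 2}, 2: {1}})
score = mcs_similarity(ethanol, ethylamine)  # the common part is C-C
```

Unlike a fingerprint score, the result comes with an explicit atom mapping, which is what makes the matches easy to visualize. The cost is an exponential worst-case running time for each pair of molecules.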
Subgraph isomorphism based methods are used successfully for substructure searching in molecular databases (see M. G. Hicks and C. Jochum, J. Chem. Inf. Comput. Sci., 30:191-199, 1990; J. M. Barnard, J. Chem. Inf. Comput. Sci., 33:533-538, 1993). Here a two-step procedure is generally used: a fast bit-vector based screening (prefiltering) program identifies a coarse set of the most similar molecules. Then a subgraph isomorphism program is applied to the reduced input set.
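The two-step procedure can be sketched as follows. The fingerprint, the molecule encoding and all function names are hypothetical and chosen only for illustration; the cited programs use their own fragment dictionaries and more elaborate matching algorithms. The prefilter keeps a database molecule only if every bit of the query fingerprint is also set in the molecule's fingerprint, and the backtracking test is then run on the survivors only.

```python
import zlib


def fingerprint(mol):
    """Hypothetical 64-bit fingerprint from atom labels and bonded label pairs."""
    labels, adj = mol
    features = set(labels.values())
    for a, neighbors in adj.items():
        for b in neighbors:
            features.add("-".join(sorted((labels[a], labels[b]))))
    bits = 0
    for feature in features:
        bits |= 1 << (zlib.crc32(feature.encode()) % 64)
    return bits


def is_substructure(query, target):
    """Backtracking test: map every query atom and bond into the target."""
    q_labels, q_adj = query
    t_labels, t_adj = target
    q_atoms = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return True
        q = q_atoms[len(mapping)]
        used = set(mapping.values())
        for t in t_labels:
            if t in used or t_labels[t] != q_labels[q]:
                continue
            # Every already mapped neighbor of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


def substructure_search(query, database):
    q_fp = fingerprint(query)
    # Step 1: prefilter. All query bits must be present in the candidate.
    candidates = [n for n, m in database.items() if (q_fp & ~fingerprint(m)) == 0]
    # Step 2: exact subgraph isomorphism on the reduced candidate set.
    return [n for n in candidates if is_substructure(query, database[n])]


DATABASE = {
    "ethanol": ({0: "C", 1: "C", 2: "O"}, {0: {1}, 1: {0, 2}, 2: {1}}),
    "dimethyl ether": ({0: "C", 1: "O", 2: "C"}, {0: {1}, 1: {0, 2}, 2: {1}}),
    "ethylamine": ({0: "C", 1: "C", 2: "N"}, {0: {1}, 1: {0, 2}, 2: {1}}),
}
c_o_bond = ({0: "C", 1: "O"}, {0: {1}, 1: {0}})
hits = substructure_search(c_o_bond, DATABASE)
```

The prefilter can let false positives through but never rejects a true match, because every feature of a substructure is also a feature of the molecule that contains it. That guarantee holds for exact substructure queries.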
The main drawback of this approach for virtual screening is that the result of the prefiltering program gives no useful bounds on the maximum common subgraph similarity. This is acceptable for substructure searching, where only exactly matching subgraphs are desired and a high similarity threshold can be used for the prefiltering. In virtual screening, where maximum common subgraph based similarity values of 0.8 and below are desired, the prefiltering step becomes inefficient.
An interesting approach for substructure searching is used by the hierarchical tree substructure search program (HTSS) (see M. Z. Nagy, S. Kozics, T. Veszpremi, and P. Bruck, “Substructure search on very large files using tree-structured databases”, in W. A. Warr, editor, “Chemical Structures. The International Language of Chemistry.”, pages 127-130, Springer-Verlag, Berlin, Germany, 1988). This program precomputes a tree of expanding atomic neighborhoods. For each atom of the query molecule a path of exactly matching neighborhoods in the tree is traversed. A subgraph isomorphism match can be directly inferred from a matching set of atomic neighborhoods. A direct extension of this approach to the computation of maximum common subgraphs or edit-distances is not possible. Instead of exactly matching neighborhoods and a simple path traversal, approximate matches and a backtracking procedure would be necessary.
Wipke and Rogers introduced such an algorithm using a more general tree of subgraphs instead of atomic neighborhoods (see W. T. Wipke and D. Rogers, Tetrahedron Computer Methodology, 2:177-202, 1989). While the algorithm gives a good speedup compared to the computation of maximum common subgraphs between the query and each database molecule, it cannot be combined with prefiltering methods and uses exponential time for each traversed vertex (node) of the subgraph tree.
Messmer, Shearer, Bunke and Venkatesh introduced a decision tree for the computation of subgraph isomorphisms and maximum common subgraph isomorphisms of object graphs derived from video images against a database of model graphs (see B. T. Messmer and H. Bunke, Pattern Recognition, 32(12):1979-1998, 1999 and K. Shearer, H. Bunke, and S. Venkatesh, Pattern Recognition, 34(5):1075-1091, 2001). In a preprocessing step, the approach stores the complete isomorphism classes of all model graphs in a decision tree. An asymmetric graph with n vertices has n! (n factorial) isomorphic representations. With current hardware, the size of the graphs that can be stored and searched in the decision tree is limited to 12 vertices, and it is not expected to exceed 14 in the foreseeable future.
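The factorial growth behind this limit can be made concrete with a short calculation. The sketch below only counts the n! vertex orderings of one asymmetric graph; it does not model the actual memory layout of the cited decision tree, and the function name is chosen for this example.

```python
import math


def isomorphic_representations(n):
    """Number of distinct vertex orderings of an asymmetric graph with n vertices."""
    return math.factorial(n)


for n in (8, 10, 12, 14):
    print(f"{n:2d} vertices: {isomorphic_representations(n):,} representations")

# Going from 12 to 14 vertices multiplies the count by 13 * 14 = 182.
growth = isomorphic_representations(14) // isomorphic_representations(12)
```

A single 12-vertex graph already has 479,001,600 representations, and a 14-vertex graph has 87,178,291,200, so two additional vertices cost more than two orders of magnitude.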
Borgelt, Meinl and Berthold (see C. Borgelt, T. Meinl, and M. R. Berthold, “MoSS: A program for molecular substructure mining” in Workshop Open Source Data Mining Software—2005, OSDM, pages 6-15, 2005; C. Borgelt, “Combining ring extensions and canonical form pruning” in Workshop on Mining and Learning with Graphs—2006, MGL, 2006) introduced in the field of frequent subgraph mining an algorithm for the computation of a subgraph tree containing canonic representatives of all subgraphs that have a given minimal support in a set of molecules. The algorithm is used for mining subgraphs that are correlated with molecular properties. Activity against a given target is such a property.
Bit-vector based screening methods are fast, but when they are applied to a database that contains millions of compounds, a single screening run takes several minutes. The methods therefore cannot be used interactively. In addition, their results are hard to interpret.
None of the discussed methods computes the substructure space of a given set of molecules. The algorithms compute a set of molecules similar to a given query without further structuring the result. The result set can only be extended by a complete recomputation. None of the discussed approaches for the computation of maximum common subgraphs in the 1-to-n case can be easily extended to edit distances, to general topological transformations, or to the m-to-n case, where multiple query molecules are searched simultaneously.
It has now been found that the presented method according to the invention allows for the virtual screening of millions of druglike molecules within a second.