Molecular databases are routinely screened for compounds that most closely resemble a molecule of known biological activity to provide novel drug leads. It is widely believed that 3D molecular shape is the most discriminating pattern for biological activity, as it is directly related to the steep repulsive part of the interaction potential between the drug-like molecule and its macromolecular target. However, efficient comparison of molecular shape is currently a challenge.
Virtual Screening is a key technique in computational drug discovery, aimed at identifying those drug-like molecules that are likely to have beneficial biological properties. It is an obvious way to reduce expensive biological tests and tackle the high failure rate currently faced by the pharmaceutical industry. In Molecular Docking, for instance, the process of docking the screened molecule to a macromolecular biological target (almost always a protein) is simulated to provide an estimate of its binding energy and thus its likelihood of being bioactive. These techniques have spurred the generation of massive databases of drug-like molecules.
An alternative Virtual Screening technique consists of searching a molecular database for compounds that most closely resemble a given query molecule. This chemical template can be a known product or inhibitor of a target protein; a natural product; or even a patented compound. The underlying assumption is that molecules similar to the active query molecule are likely to share similar properties. This similarity can be in terms of molecular shape or a range of molecular descriptors, most of which are in one way or another related to the geometry of the molecule.
Methods for molecular shape comparison can be roughly divided into two categories: superposition-based methods and descriptor-based methods. Superposition methods rely on finding an optimal superposition of the molecules being compared, whereas descriptor-based methods (non-superposition methods) are independent of molecular orientation and position. Superposition methods are regarded as particularly effective but not as efficient, while descriptor-based methods are more efficient but are generally considered less effective than superposition methods.
A widely used, commercially available superposition method is ROCS (Rapid Overlay of Chemical Structures) (Rush et al., A Shape-Based 3-D Scaffold Hopping Method and Its Application to a Bacterial Protein-Protein Interaction. J. Med. Chem. 48, 1489-1495 (2005), which is hereby incorporated by reference herein). ROCS calculates a similarity score from the volume overlap of the molecules being compared. The required alignment is carried out through what is essentially a local optimization process, in which each iteration involves calculating the volume overlap for the currently tested relative orientation and position of the molecules. Although ROCS has been touted as much more efficient than a typical superposition method, it makes two approximations that can introduce error. First, unlike other superposition methods, it assigns the same radius value to all heavy atoms in the molecule. Second, by keeping only the zero-order Gaussians, ROCS calculates just the first term of the molecular volume expansion, as opposed to up to the sixth term as done in an earlier superposition method (Grant et al., J. Phys. Chem., 1995, 99, 3503). This truncation introduces an error of about 75% with respect to the original method when tested on macromolecules (the magnitude of these errors on drug-sized molecules is to date undetermined).
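The first-term volume overlap described above can be illustrated with a minimal sketch. It is not the ROCS implementation. It assumes each heavy atom is a spherical Gaussian with the same radius (1.7 Å here) and the same amplitude `p` (2.7 here), both chosen only for illustration, and it scores the overlap for one fixed relative placement with a Tanimoto-style ratio. The function names are hypothetical.

```python
import math


def gaussian_exponent(radius, p=2.7):
    # Exponent chosen so the Gaussian p*exp(-alpha*r^2) has the same
    # volume as a hard sphere of the given radius (assumed convention).
    return math.pi * (3.0 * p / (4.0 * math.pi)) ** (2.0 / 3.0) / radius ** 2


def first_order_overlap(coords_a, coords_b, radius=1.7, p=2.7):
    # Sum of pairwise Gaussian overlap integrals: only the first term
    # of the volume expansion, with one radius for every heavy atom.
    alpha = gaussian_exponent(radius, p)
    prefactor = p * p * (math.pi / (2.0 * alpha)) ** 1.5
    total = 0.0
    for a in coords_a:
        for b in coords_b:
            d2 = sum((x - y) ** 2 for x, y in zip(a, b))
            total += prefactor * math.exp(-0.5 * alpha * d2)
    return total


def shape_tanimoto(coords_a, coords_b, radius=1.7, p=2.7):
    # Overlap-based similarity for the given (fixed) relative placement.
    o_ab = first_order_overlap(coords_a, coords_b, radius, p)
    o_aa = first_order_overlap(coords_a, coords_a, radius, p)
    o_bb = first_order_overlap(coords_b, coords_b, radius, p)
    return o_ab / (o_aa + o_bb - o_ab)


# Hypothetical three-atom chain and the same chain rotated by 90 degrees.
chain = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
rotated = [(0.0, 0.0, 0.0), (0.0, 1.5, 0.0), (0.0, 3.0, 0.0)]
print(shape_tanimoto(chain, chain), shape_tanimoto(chain, rotated))
```

Because the score depends on the relative placement, the same molecule scores lower when it is misaligned, which is why a superposition method must search over orientations and positions before the score is meaningful.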
More importantly, ROCS does not guarantee that the best superposition between the compared molecules will be found. This can be alleviated by increasing the number of starting points at the cost of further optimizations (one per starting point), thus lowering ROCS efficiency. In addition, reduced effectiveness due to suboptimal molecular overlap is very hard to detect because only the top ranked molecules are visible in practice. Those molecules that have a sufficiently similar shape to that of the query, but obtain a suboptimal molecular overlap because of superposition errors, will unnoticeably drop below the threshold and be lost among possibly millions of other rejected molecules.
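The trade-off between the number of starting points and the quality of the superposition found can be sketched with a toy example. It is a simplified two-dimensional illustration under stated assumptions, not the ROCS optimizer: the only degree of freedom is a rotation angle, the overlap uses a single arbitrary Gaussian exponent, and the hill-climbing routine and all names are hypothetical.

```python
import math


def overlap(coords_a, coords_b, alpha=0.8):
    # First-order Gaussian overlap in 2D with one shared exponent
    # (arbitrary units, assumed for illustration).
    total = 0.0
    for ax, ay in coords_a:
        for bx, by in coords_b:
            d2 = (ax - bx) ** 2 + (ay - by) ** 2
            total += math.exp(-0.5 * alpha * d2)
    return total


def rotate(coords, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in coords]


def local_search(query, target, theta, step=0.2, tol=1e-4, max_iter=10000):
    # Hill-climb on the rotation angle; it stops at a local maximum,
    # which is not necessarily the best superposition.
    best = overlap(query, rotate(target, theta))
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for cand in (theta + step, theta - step):
            score = overlap(query, rotate(target, cand))
            if score > best:
                theta, best, improved = cand, score, True
                break
        if not improved:
            step *= 0.5
    return theta, best


def multi_start(query, target, n_starts):
    # One local optimization per starting angle; cost grows with n_starts.
    starts = [2.0 * math.pi * k / n_starts for k in range(n_starts)]
    results = [local_search(query, target, t) for t in starts]
    return max(results, key=lambda r: r[1])


# Hypothetical L-shaped query and a rotated copy of it as the target.
query = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0), (3.0, 1.5)]
target = rotate(query, 2.5)
single = multi_start(query, target, 1)
several = multi_start(query, target, 8)
print(single[1], several[1])
```

The run with eight starts includes the single starting angle used by the first run, so its score can never be lower; the price is eight local optimizations instead of one.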
Descriptor-based comparison methods use geometrical descriptors to encode the shape of a molecule, with the similarity score between molecules calculated by comparing the corresponding descriptors. In one descriptor-based technique, Shape Signatures (Zauhar et al., Shape Signatures, a New Approach to Computer-Aided Ligand- and Receptor-Based Drug Design. J. Med. Chem. 46, 5674-5690 (2003), hereby incorporated by reference herein), each molecule is described by a histogram of information derived from simulating a ray reflecting within the molecular volume. Although the ranking provided by this method is largely consistent with human-perceived shape similarity, the query molecule is not ranked first in most cases, which raises questions about its accuracy. In addition, while comparing precomputed signatures is quite efficient, calculating the shape signature of each molecule in the database is a very expensive procedure, taking about 1,600 hours for a database of just 113,331 molecules on a single 450 MHz Pentium III processor.
Another descriptor-based technique is EigenSpectrum Shape Fingerprints (ESshape3D), which is a commercially available technique included in the Molecular Operating Environment (MOE 2006) software suite (MOE 2006.08 Release (http://www.chemcomp.com/)). This method starts by calculating a matrix of the Euclidean distances between all heavy atoms in the molecule; the eigenvalues of this matrix then form a spectrum characteristic of the molecule's shape. Next, this spectrum is encoded as a fingerprint, and the similarity score is calculated as the inverse of the distance between the corresponding fingerprints. However, this method may still suffer from lower accuracy than a number of competing methods.
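The steps just described (distance matrix, eigenvalue spectrum, fingerprint, inverse-distance score) can be sketched as follows. This is an illustration only and not the MOE implementation: the fingerprint encoding (sorted eigenvalues padded or truncated to a fixed length) and the score 1/(1 + d), used to avoid division by zero for identical fingerprints, are assumptions, as are the function names. NumPy is assumed to be available.

```python
import numpy as np


def eigen_spectrum(coords, length=16):
    # Distance matrix between all (heavy) atoms of the molecule.
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    # The matrix is symmetric, so eigvalsh applies; sort descending.
    eigvals = np.sort(np.linalg.eigvalsh(dist))[::-1]
    # Assumed encoding: fixed-length vector, zero-padded or truncated.
    spectrum = np.zeros(length)
    n = min(length, eigvals.size)
    spectrum[:n] = eigvals[:n]
    return spectrum


def similarity(coords_a, coords_b, length=16):
    # Assumed inverse-distance score; equals 1.0 for identical spectra.
    d = np.linalg.norm(
        eigen_spectrum(coords_a, length) - eigen_spectrum(coords_b, length)
    )
    return 1.0 / (1.0 + d)


# Hypothetical four-atom molecules: a straight chain and a square.
linear = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0], [4.5, 0.0, 0.0]]
square = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 0.0]]
print(similarity(linear, linear), similarity(linear, square))
```

Because interatomic distances do not change under rotation or translation, the spectrum is the same for any placement of the molecule, so no alignment step is needed before comparison.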
While more traditional descriptor-based methods can be fast (in the range of 500-2000 comparisons per second on a 1995 PC), they are known to be less effective than superposition methods and are primarily used for database prescreening rather than for stand-alone molecular shape comparison. In contrast, superposition methods can have higher accuracy rates, but their comparison rates are much slower and they require prior alignment of the molecules, which is a source of errors, particularly with symmetrical query molecules. In light of the foregoing, it is clear that none of the current shape comparison methods is completely effective.