A fundamental problem in cheminformatics is the calculation of similarity between two given molecules. As a consequence, a large variety of similarity techniques exists. These measures draw on a variety of molecular properties, including various encodings of molecular substructure, volume and surface similarity, and electrostatic similarity. While many of the more complicated techniques are able to uncover relevant chemical similarities not found by simpler methods, they are often computationally expensive to evaluate.
Many important algorithms in cheminformatics contain these pairwise similarity comparisons as a critical subroutine. For example, a database search against a single query (without filters) amounts to the evaluation of a similarity measure once for each database molecule. More complicated algorithms, such as those for clustering or network construction, may require a number of similarity evaluations that is quadratic, rather than linear, in the size of the database. Evaluation of similarities can be a bottleneck, limiting performance as well as the size of problems that can be considered. While fingerprint-style methods have been developed to approximate these similarity measures, they lack rigorous justifications of their accuracy.
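The difference between linear and quadratic evaluation counts can be made concrete with a minimal sketch. It assumes, purely for illustration, that each molecule is a set of hypothetical feature indices and that the similarity measure is the Tanimoto coefficient on those sets; this is a cheap stand-in, not one of the expensive measures discussed here. The function names `search` and `pairwise` are likewise illustrative.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of feature indices.

    Assumption: used only as a cheap stand-in for a costly similarity measure.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def search(query, database, sim=tanimoto):
    """Database search: one similarity evaluation per database molecule (N total)."""
    return [sim(query, mol) for mol in database]


def pairwise(database, sim=tanimoto):
    """All-pairs comparison: N * (N - 1) / 2 evaluations for N molecules."""
    return {
        (i, j): sim(database[i], database[j])
        for i, j in combinations(range(len(database)), 2)
    }


# Toy "molecules": sets of hypothetical feature indices, not real fingerprints.
database = [
    frozenset({1, 2, 3}),
    frozenset({2, 3, 4}),
    frozenset({5, 6}),
    frozenset({1, 2, 3, 4}),
]
query = frozenset({1, 2, 3})

scores = search(query, database)   # 4 evaluations for 4 molecules
matrix = pairwise(database)        # 6 evaluations for 4 molecules
```

For N molecules the search performs N evaluations while the all-pairs computation performs N(N-1)/2, so growing the database by a factor of 10 makes the pairwise step roughly 100 times more expensive.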
A significant continuing trend in cheminformatics is the increasing size of virtual chemical databases. Public libraries listing known chemical matter, such as PubChem (31 million molecules) and ZINC (34 million molecules), are routinely used in database searches. Continuing advances in both computational power and storage space enable the use of even larger exhaustive and combinatorial databases. GDB13 is an exhaustive database enumerating all 970 million possible compounds composed of 13 or fewer heavy atoms (C, N, O, S, and Cl), according to simple stability and synthesizability criteria. Virtual combinatorial libraries can similarly reach or exceed the 10^9 molecule mark, even with as few as three or four points of substitution. In the limit, it is believed that as many as 10^60 molecules are potentially synthesizable. The combination of rapidly growing chemical libraries with computationally difficult similarity metrics suggests a need for dramatically faster methods of calculating chemical similarity.
Algorithms for several emerging large-scale problems in cheminformatics have as their rate-limiting step the evaluation of relatively slow chemical similarity measures, such as structural similarity or three-dimensional (3-D) shape comparison.