1. Field of the Invention
The present invention generally relates to computer-based and/or computer-assisted calculation of the chemical and/or textual similarity of chemical structures, compounds, and/or molecules and, more particularly, to ranking the similarity of chemical structures, compounds, and/or molecules with regard to the chemical and/or textual description of, for example, a user's probe, proposed, and/or lead compound(s).
2. Background Description
In recent years, pharmaceutical companies have developed large collections of chemical structures, compounds, or molecules. Typically, one or more employees of such a company will find that a particular structure in the collection has an interesting chemical and/or biological activity (e.g., a property that could lead to a new drug, or a new understanding of a biological phenomenon).
Similarity searches are a standard tool for drug discovery. A large portion of the effort expended in the early stages of a drug discovery project is dedicated to finding "lead" compounds (i.e., compounds which can lead the project to an eventual drug). Lead compounds are often identified by a process of screening chemical databases for compounds "similar" to a probe compound of known activity against the biological target of interest. Computational approaches to chemical database screening have become a foundation of the drug industry because the size of most commercial and proprietary collections has grown dramatically over the last decade.
Chemical similarity algorithms operate over representations of chemical structure based on various types of features called descriptors. Descriptors include the class of two dimensional representations and the class of three dimensional representations. As will be recognized by those skilled in the art, two dimensional representations include, for example, standard atom pair descriptors, standard topological torsion descriptors, standard charge pair descriptors, standard hydrophobic pair descriptors, and standard inherent descriptors of properties of the atoms themselves. By way of illustration, regarding the atom pair descriptors, for every pair of atoms in the chemical structure, a descriptor is established or built from the type of atom, some of its chemical properties, and its distance from the other atom in the pair.
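By way of a non-limiting sketch, the atom pair construction described above may be illustrated in Python as follows. The function names, the simplified atom type (element symbol plus number of bonded heavy-atom neighbors), and the use of a bond count as the distance are assumptions made for illustration only; published atom pair definitions encode additional atom properties and may count distance differently.

```python
from collections import Counter, deque


def bond_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def atom_pair_descriptors(elements, bonds):
    """Count (type_a, type_b, bond_distance) triples over all pairs of atoms.

    `elements` is a list of element symbols (heavy atoms only) and `bonds`
    is a list of (index, index) tuples. Both are hypothetical input formats.
    """
    n = len(elements)
    adjacency = {i: [] for i in range(n)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    def atom_type(i):
        # Assumed simplified type: element symbol plus neighbor count.
        return f"{elements[i]}{len(adjacency[i])}"

    descriptors = Counter()
    for i in range(n):
        dist = bond_distances(adjacency, i)
        for j in range(i + 1, n):
            if j in dist:  # skip atoms in disconnected fragments
                first, second = sorted((atom_type(i), atom_type(j)))
                descriptors[(first, second, dist[j])] += 1
    return descriptors


# Heavy atoms of ethanol, C-C-O, as a toy input.
ethanol = atom_pair_descriptors(["C", "C", "O"], [(0, 1), (1, 2)])
```

Sorting the two atom types makes each descriptor independent of the order in which the pair is visited, so the same pair always maps to the same key.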
Three dimensional representations include, for example, standard descriptors accounting for the geometry of the chemical structure of interest, as mentioned above. Geometry descriptors may take into account, for example, the fact that a first atom is a short distance away in three dimensions from a second atom, although the first atom may be twenty bonds away from the second atom. Topological similarity searches, especially those based on comparing lists of pre-computed descriptors, are computationally very inexpensive.
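The distinction between distance through bonds and distance through space may be illustrated by the following sketch. The chain of 21 atoms, its coordinates (atoms placed on a circle so that the chain nearly closes on itself), and the 1.5 angstrom bond length are hypothetical values chosen only to make the two distances easy to compare.

```python
import math
from collections import deque


def bond_distance(bonds, n_atoms, start, end):
    """Shortest path length, in bonds, between two atoms (breadth-first search)."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if atom == end:
            return dist[atom]
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return None  # atoms are not connected


# Hypothetical folded chain: 21 atoms on 21 of 22 evenly spaced circle positions.
N_ATOMS = 21
BOND_LENGTH = 1.5  # assumed, in angstroms
step = 2 * math.pi / 22
radius = BOND_LENGTH / (2 * math.sin(step / 2))
coords = [(radius * math.cos(i * step), radius * math.sin(i * step))
          for i in range(N_ATOMS)]
bonds = [(i, i + 1) for i in range(N_ATOMS - 1)]

topological = bond_distance(bonds, N_ATOMS, 0, N_ATOMS - 1)  # bonds along the chain
geometric = math.dist(coords[0], coords[-1])                 # straight line in space
```

In this toy arrangement the two terminal atoms are twenty bonds apart yet lie less than two bond lengths from each other in space, which is the kind of relationship a geometry descriptor can capture and a purely topological descriptor cannot.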
The vector space model of chemical similarity involves the representation of chemical compounds as feature vectors. As will be recognized by those skilled in the art, exemplary features include substructure descriptors such as atom pairs (see Carhart, R. E.; Smith, D. H.; Venkataraghavan, R., "Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications", J. Chem. Inf. Comp. Sci. 1985, 25:64-73) and/or topological torsions (see Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R., "Topological Torsions: A New Molecular Descriptor for SAR Applications", J. Chem. Inf. Comp. Sci. 1987, 27:82-85), both of which are incorporated herein by reference.
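As a minimal sketch of the vector space representation, descriptor counts for each compound may be laid out over a shared, ordered vocabulary of features and then compared as vectors. The feature labels below are hypothetical, and the cosine measure is only one common choice for comparing such vectors; it is used here as an assumption and is not prescribed by the foregoing description.

```python
import math


def to_vectors(*descriptor_counts):
    """Map several {feature: count} dicts onto vectors over one shared vocabulary."""
    vocabulary = sorted(set().union(*descriptor_counts))
    vectors = [[counts.get(feature, 0) for feature in vocabulary]
               for counts in descriptor_counts]
    return vocabulary, vectors


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical descriptor counts for a probe and one candidate compound.
probe = {"CC1": 2, "CO1": 1}
candidate = {"CC1": 1, "CN1": 1}

vocabulary, (probe_vec, candidate_vec) = to_vectors(probe, candidate)
similarity = cosine(probe_vec, candidate_vec)
```

Features absent from a compound contribute a zero in the corresponding position, so every compound in a collection is described by a vector of the same length.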
As seen, many strategies for representing molecules in the collection and computing similarity between them have been devised. We have recognized, however, that these searches are often more involved when the goal is to select compounds that have similar activity or properties, but not obviously similar structure. That is, we have identified a need to ascertain, from a large collection of chemical structures, compounds, or molecules, a set of diverse chemical structures, for example, that may look dissimilar from the original probe compound, but exhibit similar chemical or biological activity. We have also recognized that although algorithms using, for example, Dice-type and/or Tanimoto-type coefficients, each known to those skilled in the art, by design, yield compounds that are most similar to the probe compound, such algorithms may fail to provide compounds or chemical structures characterized by diversity relative to the probe compound.
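For reference, the Tanimoto-type and Dice-type coefficients mentioned above may be sketched for binary (present or absent) descriptors as follows. For feature sets A and B, the Tanimoto coefficient is |A ∩ B| / |A ∪ B| and the Dice coefficient is 2|A ∩ B| / (|A| + |B|). The feature labels, the compound names, and the `rank_by_similarity` helper are hypothetical and serve only to show why ranking by these coefficients favors compounds that share the most features with the probe.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: shared features over all features."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Dice coefficient of two feature sets: twice the shared features over the size sum."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def rank_by_similarity(probe, database, coefficient=tanimoto):
    """Return compound names ordered from most to least similar to the probe."""
    return sorted(database,
                  key=lambda name: coefficient(probe, database[name]),
                  reverse=True)


# Hypothetical probe and collection, each compound given as a set of features.
probe = {"f1", "f2", "f3", "f4"}
database = {
    "A": {"f1", "f2", "f3"},  # close structural analog of the probe
    "B": {"f1", "f5"},        # shares a single feature
    "C": {"f6"},              # shares no feature with the probe
}

ranking = rank_by_similarity(probe, database)
```

In this toy collection, compound C receives a score of zero and is ranked last under either coefficient, regardless of whether it happens to share the probe's activity. This is the limitation noted above: a ranking driven solely by shared structural features cannot surface structurally dissimilar compounds.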
With respect to a chemical example, if a particular compound were found to be an HIV inhibitor, we have recognized that it would be desirable to search a database of chemical compounds or compositions and identify HIV inhibitors that have the same or similar pharmacological effect as the original HIV inhibitor, but that may be structurally dissimilar to the original HIV inhibitor probe. The capability of being able to find one or more dissimilar HIV inhibitors quickly and effectively can potentially be worth billions of dollars in revenue.
We have also recognized that utilizing a probe and providing a database that includes a textual description in addition to a chemical description reveals correlations and relationships therebetween that cannot be obtained by utilizing either textual or chemical descriptors alone.