A molecule is normally thought of as a set of atoms of varying atomic type, with a certain bonding pattern. Indeed this “chemical” description can uniquely describe the molecule. It is the language that chemists use to compare and contrast different molecules. Efficient database models have been constructed to store such information for fast retrieval and storage. However, this form of description does not actually describe the three dimensional structure of the molecule, e.g., the positions of each atom, and since the interaction of molecules is a spatial event, the “chemical” description is incomplete for physical phenomena. One such phenomenon of commercial importance is the binding of drug molecules to sites of biological importance, such as the active areas or “sites” on protein surfaces, which is the mode of action of nearly all pharmaceuticals.
Drug molecules are often small, on the order of 20 atoms (excluding hydrogens). They interact with large macromolecules such as proteins by binding to them. Through binding, the drug may activate or inhibit the normal action of the macromolecule. The binding occurs at specific sites on the macromolecule, and the basis of tight and specific binding is complementarity in shape and other properties, such as electrostatic, between the two molecules.
Pharmaceutical companies maintain computer databases of all molecules they have synthesized, plus other compounds available on the market. The use of these databases and the techniques of computer-aided drug design are beginning to replace trial and error lab testing in new drug development. Important components of this process are finding new small molecules similar in shape to ones known to bind a target, and designing new molecules to fit into known or hypothesized binding sites.
There have been many attempts to describe or “encode” the three dimensional information of molecules beyond a simple list of coordinates. Many involve the distances between pairs of atoms in a molecule, i.e., an atomistic approach akin to the chemical description but with extra, spatial degrees of freedom.
A more radical departure is to adopt an alternate representation of a molecule: the field representation. A field is essentially just a number assigned to every point in space. For instance, the air temperature in a room at every point in that room forms a field quantity. Molecules have one fundamental field associated with them, namely the quantum mechanical field that describes the probability of electrons and nuclei existing at each point in space. However, this field can be thought of as giving rise to other, simpler fields of more use in understanding a molecule's properties. Chief amongst these are the steric and electrostatic fields, although others are used, such as the hydrophobic and the hydrogen bond potential field. An illustration of a Gaussian representation of a steric field is shown in FIG. 1. As is customary, on each contour line in the field, each point has equal value.
Steric fields describe the mass or shape of the molecule, and at the simplest level such a field might have a value of one inside the molecule and zero outside. Electrostatic fields represent the energy it would take to place an electron at a particular place in space, by convention positive if the energy is unfavorable and negative if favorable. These two fields are the most relevant for molecular interactions because of basic physical laws, i.e., that two molecules cannot overlap (steric repulsion) and that positive atoms like to be near negative atoms and vice versa. These are the basic components of molecular interactions.
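By way of illustration only, the two kinds of steric field mentioned above can be sketched as follows. The first function is the simplest "one inside, zero outside" field; the second is a sum of atom-centered Gaussians of the general kind depicted in FIG. 1. The function names, the `amplitude` parameter, the use of the atomic radius as the Gaussian width, and the example coordinates are all assumptions made for this sketch and are not taken from any particular implementation.

```python
import math

def hard_sphere_field(point, atoms):
    """Simplest steric field: 1.0 inside any atomic sphere, 0.0 outside."""
    for center, radius in atoms:
        if math.dist(point, center) <= radius:
            return 1.0
    return 0.0

def gaussian_field(point, atoms, amplitude=1.0):
    """Smooth steric field: a sum of one Gaussian per atom.

    Assumption: each Gaussian decays as exp(-r^2 / radius^2); other
    width conventions are equally possible.
    """
    total = 0.0
    for center, radius in atoms:
        r2 = sum((p - c) ** 2 for p, c in zip(point, center))
        total += amplitude * math.exp(-r2 / radius ** 2)
    return total

# Hypothetical two-atom "molecule": (center, radius) pairs in Angstroms.
atoms = [((0.0, 0.0, 0.0), 1.7), ((1.5, 0.0, 0.0), 1.7)]
inside = hard_sphere_field((0.5, 0.0, 0.0), atoms)
far_away = gaussian_field((20.0, 0.0, 0.0), atoms)
```

The hard-sphere field changes abruptly at the molecular surface, whereas the Gaussian field falls off smoothly, so that points of equal value form contour lines such as those described for FIG. 1.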
If a molecule is known to bind and have effect on some biological target, it is of great commercial interest to find other molecules of similar shape and electrostatic properties (i.e., similar fields) since this enhances the likelihood of such molecules having similar biological activity. Since shape and electrostatic character are consequences of the underlying atoms, which can be efficiently encoded by a chemical description, such searches have traditionally been performed at this level, by looking for molecules which are “chemically” similar. One disadvantage of this approach is that the relationship between chemical similarity and structural similarity is not precise; and chemically similar molecules may be structurally quite different. Another disadvantage is that a chemically similar compound may well be covered by the same patents as the original molecule. Finally, searching only chemically similar compounds inevitably means one will not find active molecules that are not chemically similar.
This latter point is key. David Weininger of Daylight Chemical Information Systems has reported an analysis that suggests there are 10^200 different molecules synthesisable by known means. (Only 10^107 molecules of typical drug size would fit in the known universe!) As such, any molecule, of any shape or electrostatic profile, has a potentially astronomical number of similarly shaped and charged “doppelgangers”. Only a fraction of them are necessarily chemically similar. Hence by restricting the search to chemical similarity a vast number of potential drug leads are never discovered.
Although 10^200 molecules is too large a number to ever enumerate, I believe that it is possible to determine bounds to the possible variations of molecular fields of this hypothetical set. Furthermore, I plan to compute for database storage a very large number of molecular structures (e.g., of the order of billions) that sample this range such that I am able to find a match, or “mimic”, from this collection to any novel structure presented. Such a database would be many thousands of times larger than any currently in existence, and hence crucial to this plan are the efficient organization of the database for fast search and retrieval of such mimics and the assessment of whether I have indeed “covered” chemical space. It is these problems that the present invention addresses.
State of the Art
Much has been done in the use of molecular fields to compare and contrast molecules and to predict activity from such operations. Some of these approaches are described below. I believe that the crucial aspect of my approach which differs from all prior work is in the application of a particular property of field comparison, namely the “metric” property, and in a novel way to decompose fields into separable domains, wherein each is quantifiably similar to a geometrically simpler field.
The most widely known “field analysis” approach is that known as Comparative Molecular Field Analysis (COMFA). See U.S. Pat. Nos. 5,025,388 and 5,307,287 assigned to Tripos Inc. of St. Louis, Mo. The idea behind COMFA is to take a series of molecules of known activity and to find which parts of these molecules are responsible for activity. The procedure is to first overlay the set of molecules onto each other such that the combined difference of the steric and electrostatic fields between all pairs of molecules is at a minimum. (The concept of overlaying, i.e., finding an orientation between a pair of molecules that minimizes a field difference, is fundamental to all methods that utilize molecular fields for molecular comparison.)
Given this ensemble average, one then finds values of properties such as the electrostatic field at a number of grid points surrounding the set of molecules. These then become data points in a statistical analysis known as Partial Least Squares (PLS), which seeks to identify which points correlate with some measure of activity. For instance, if all active molecules, once overlaid, had a similar region of positive potential, while less active or inactive molecules did not, the procedure would identify this as an important common motif in activity.
Problems inherent in COMFA are the multiple alignment of a set of molecules, the placement of grid points near the molecules, and the interpretation of the PLS output.
Another approach which uses molecular overlay is that set forth in U.S. Pat. Nos. 5,526,281 and 5,703,792 of Chapman et al. of ARRIS Inc. They are interested in selecting a subset of compounds from a much larger set that retains much of the diversity of the larger set. The basic concept is to start with as few as one molecule as the representative set, then to overlay a candidate molecule to minimize steric and/or electrostatic field differences to all in the set, and then to calculate differences between the molecules based upon this alignment. This is repeated for each of the candidate molecules. The candidate which is “most different” from those already in the representative set is added and the procedure then repeated until the number of compounds chosen reaches a desired threshold.
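The selection loop just described can be sketched as a greedy "farthest from the chosen set" procedure. This is a minimal illustration under stated assumptions and not the method of the cited patents: the overlay step is replaced here by a precomputed table of pairwise differences `dist`, "most different" is taken to mean the candidate whose nearest representative is farthest away, and the names `select_diverse`, `seed` and `count` are hypothetical.

```python
def select_diverse(dist, seed, count):
    """Greedily pick up to `count` item indices that are mutually dissimilar.

    dist  : square table where dist[i][j] is the difference between items
            i and j (assumed to have been computed after overlay).
    seed  : index of the single item that starts the representative set.
    count : desired size of the representative set.
    """
    chosen = [seed]
    remaining = set(range(len(dist))) - {seed}
    while remaining and len(chosen) < count:
        # Score each candidate by its difference to the nearest chosen item,
        # then take the candidate with the largest such score.
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining),
                   key=lambda c: min(dist[c][s] for s in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Four hypothetical items lying at positions 0, 1, 10 and 11 on a line.
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in positions] for a in positions]
representatives = select_diverse(dist, seed=0, count=2)
```

Starting from item 0, the item at position 11 is selected next because it is the farthest from everything already chosen.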
In both COMFA and the Chapman approach, field similarity is used as a tool to solve the alignment issue, and similarities or differences are then calculated. The value of the field similarity or difference is of secondary importance; it merely solves what is called the “assignment” problem, i.e., which atoms, or areas of a molecule's field, are “equivalent”.
In contrast, in Mestres et al., “A Molecular Field-Based Approach to Pharmacophoric Pattern Recognition,” J Molecular Graphics and Modelling, Vol. 15, pp. 114-121 (April 1997) incorporated herein by reference, molecules are aligned based upon the overlap of their steric or electrostatic fields, or by a weighted sum of the two. A similarity measure is defined that equals one when the fields are the same, and minus one when they are maximally different. The Mestres et al. work is embodied in a program called MIMIC, which performs global and local optimization of the field overlap. They note that there are several possible overlays that have the appearance of being the best overlap. These are so called “local minima”, because while small displacements lead to a decrease in their similarity function, they may not be the best “global” solution. This is as expected since field overlay belongs to the class of problems known to have “multiple minima”. Mathematically this is usually an intractable problem, solvable only by much computation, e.g., as is evident in the descriptions by Mestres et al. In fact, the multiple overlay solutions are one of the key aspects of their work, in that one cannot be sure which is the most “biologically” relevant overlay, and what might be the correct weighting of steric to electrostatic fields.
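A similarity measure with the stated range of one to minus one can be illustrated with a normalized overlap of two fields sampled at the same grid points. The sketch below uses a cosine-like form; whether this is exactly the index used in MIMIC is an assumption, and the names `field_similarity`, `combined_similarity` and `weight` are hypothetical. Both fields are assumed to have already been placed in a common orientation.

```python
import math

def field_similarity(f, g):
    """Normalized overlap of two fields sampled on the same grid points.

    Returns 1.0 for identical (or proportional) fields and -1.0 when one
    field is the negative of the other.
    """
    overlap = sum(a * b for a, b in zip(f, g))
    norm = math.sqrt(sum(a * a for a in f) * sum(b * b for b in g))
    if norm == 0.0:
        raise ValueError("a field that is zero everywhere has no defined similarity")
    return overlap / norm

def combined_similarity(steric, electrostatic, weight=0.5):
    """Weighted sum of a steric and an electrostatic similarity.

    Assumption: `weight` is the fraction given to the steric term.
    """
    return weight * steric + (1.0 - weight) * electrostatic

# Hypothetical electrostatic samples at three grid points.
same = field_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
opposite = field_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Note that a value of minus one requires a field that takes both signs, such as the electrostatic field; a steric field that is nowhere negative cannot produce a negative overlap in this form.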
An additional aspect considered by Mestres et al. is the issue of molecules existing in multiple structural conformations, i.e., energetically there may be more than one possible structure for a given molecule. Mestres et al. calculate the similarity indexes of all pairs of conformations of a molecule and perform what is known as principal component analysis (PCA). They do this to find representatives of all possible conformations that are most distinct. Although this procedure is really akin to finding the dimensionality of the space in which these conformers exist, Mestres et al. do not use PCA for this purpose, but merely to cluster the conformers. They do not apply PCA to sets of different molecules, only to conformers of the same molecule, and they do not use any other “metric” property of their similarity measure. In fact they seem unaware of such.
There is an important distinction to be made between a “measure” of similarity and a “metric” of similarity, although these words are often used interchangeably. A measure can be any quantity which has a correspondence with molecular similarity, i.e., the idea that the closer the values of the measure, the more similar the compounds. A metric has a precise mathematical interpretation, namely that if the metric, or more commonly the metric distance, between A and B is zero then the two items are the same item, that the distance from A to B is the same as the distance from B to A, and that the distance from A to B plus the distance from B to a third compound C must be greater than or equal to the distance from A to C. This last condition is called the “Triangle Inequality” because the same conditions hold for the sides of a triangle ABC. The Triangle Inequality, or metric upper bound, also leads to a lower bound, namely that in the case above, C can be no closer to A than the absolute difference of the distances A to B and B to C.
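The metric conditions and the two bounds that follow from the Triangle Inequality can be stated compactly in code. The sketch below is illustrative only; the names `is_metric` and `triangle_bounds` and the tolerance `tol` are assumptions. The check covers zero self-distance, non-negativity, symmetry and the Triangle Inequality; it does not test whether two distinct entries at zero distance are in fact the same item.

```python
def is_metric(dist, tol=1e-9):
    """Check a square distance table against the metric conditions."""
    n = len(dist)
    for i in range(n):
        if abs(dist[i][i]) > tol:                      # d(A, A) must be zero
            return False
        for j in range(n):
            if dist[i][j] < -tol:                      # no negative distances
                return False
            if abs(dist[i][j] - dist[j][i]) > tol:     # d(A, B) == d(B, A)
                return False
            for k in range(n):
                # Triangle Inequality: d(A, C) <= d(A, B) + d(B, C)
                if dist[i][k] > dist[i][j] + dist[j][k] + tol:
                    return False
    return True

def triangle_bounds(d_ab, d_bc):
    """Lower and upper bounds on d(A, C) given d(A, B) and d(B, C)."""
    return abs(d_ab - d_bc), d_ab + d_bc

# If A is 5.0 from B and B is 1.0 from C, then C lies between 4.0 and 6.0 from A.
lower, upper = triangle_bounds(5.0, 1.0)
```

The lower bound is what makes a metric useful for searching: in the example, C can be ruled out as lying within a distance of 2.0 of A without ever comparing A and C directly.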
In M. Petitjean, “Geometric Molecular Similarity from Volume-Based Distance Minimization-Application to Saxitoxin and Tetrodotoxin,” J Computational Chemistry, Vol. 16, No. 1, pp. 80-95 (1995) incorporated herein by reference, it is recognized that the quantity that measures the overlay of fields forms a metric quantity, and that the measure of the optimum overlay of two fields also forms a metric which is intrinsic to the molecule, i.e., independent of orientation or position.
A metric distance may also be used in a technique called “embedding”. The number of links between the elements of a set of N elements can be shown to be N*(N−1)/2 and each link can be shown to be a metric distance. While a set of N elements has N*(N−1)/2 distances, the set can always be represented by an ordered set of (N−1) numbers, i.e., I can “embed” from a set of distances to a set of N positions in (N−1) dimensional space. This is identical to Principal Component Analysis mentioned previously, except that with PCA one finds the most “important” dimensions, i.e., the “principal” directions, which carry most of the variation in position. Typically with PCA one truncates the dimensionality at 2 or 3 for graphical display purposes. In general, the number of dimensions which reproduces the set of N*(N−1)/2 distances within an acceptable tolerance may be much smaller than (N−1), yet still be greater than 2 or 3. Hence one talks of “embedding into a hyper-dimensional subspace”, where hyper-dimensional means more than 3 dimensions, and subspace means fewer than (N−1) dimensions. Techniques for such an embedding are standard linear algebra. When applied to molecular fields, the result of embedding is a shape-space of M ≤ N−1 dimensions.
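One standard linear-algebra route from a table of distances to positions is sketched below. It places the first element at the origin, forms the matrix of inner products of the remaining elements from the law of cosines, and factors that matrix by a Cholesky decomposition, whose rows are then coordinates in at most (N−1) dimensions. This is only one of several possible techniques and is an assumption of this sketch, as are the names `embed`, `effective_dimension` and `tol`. The sketch further assumes the distances can be reproduced exactly by points in a flat (Euclidean) space and raises an error when they clearly cannot.

```python
import math

def embed(dist, tol=1e-9):
    """Return N coordinate rows of length N-1 reproducing the distances."""
    m = len(dist) - 1
    # Inner products of the vectors from element 0 to elements 1..N-1.
    g = [[(dist[0][i + 1] ** 2 + dist[0][j + 1] ** 2 - dist[i + 1][j + 1] ** 2) / 2.0
          for j in range(m)] for i in range(m)]
    # Cholesky factor: g = L * L^T, so row i of L is the position of element i+1.
    L = [[0.0] * m for _ in range(m)]
    for j in range(m):
        pivot = g[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if pivot < -tol:
            raise ValueError("distances cannot be embedded in Euclidean space")
        diag = math.sqrt(pivot) if pivot > tol else 0.0  # zero: unused dimension
        L[j][j] = diag
        for i in range(j + 1, m):
            rest = g[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = rest / diag if diag > 0.0 else 0.0
    return [[0.0] * m] + L

def effective_dimension(coords, tol=1e-9):
    """Count the coordinate axes that are actually used (M <= N-1)."""
    return sum(1 for axis in zip(*coords) if any(abs(v) > tol for v in axis))

# Four hypothetical elements whose distances are those of a 3-by-4 rectangle.
corners = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
dist = [[math.dist(a, b) for b in corners] for a in corners]
coords = embed(dist)
used = effective_dimension(coords)
```

Here N is 4, so three coordinates are available, yet only two are needed to reproduce all six distances; this is the sense in which the embedded space may have fewer than (N−1) dimensions.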