This invention relates in general to a machine-learning approach to modeling biological activities or other characteristics and, in particular, to a machine-learning approach to modeling biological activity or other characteristics for molecular design. In modeling biological activity, the approach is preferably shape-based.
The shape that a molecule adopts when bound to a biological target, the bioactive shape, is an essential component of its biological activity. This shape, and any specific interactions such as hydrogen bonds, can be exploited to derive predictive models used in rational drug design. Such models can be used to optimize lead compounds, design de novo compounds, and search databases of existing compounds for novel structures possessing the desired biological activity. To aid the drug discovery process, these models must make useful predictions, relate chemical substructures to activity, and confidently extrapolate to chemical classes beyond those used for model derivation.
Physical data such as X-ray crystal structures of drug-target complexes provide a shape model directly and have led to recent successes in structure-based drug design. However, in the absence of such data, rational drug design must rely upon predictive models derived solely from observed biological activity. Several methods exist that produce predictive models relying, in part, on molecular shape.
Existing methods for constructing predictive models are unable to model steric interactions accurately, particularly when these interactions involve large regions of the molecular surface. Existing quantitative structure-activity relationship (QSAR) models are severely limited by the types of molecular properties they consider. Methods that employ properties of substituents assume that the molecules share a common structural skeleton, and hence cannot be extrapolated to molecules with different skeletons. Many methods employ ad hoc features that make it difficult to interpret the models as a guide for drug design. Pharmacophore models (e.g., BioCAD) model activity in terms of the positions of a small number of atoms or functional groups. This approach overcomes many of the problems of traditional QSAR methods, but it has difficulty addressing steric interactions.
In U.S. Pat. No. 5,025,388 to Cramer, III, et al., a comparative molecular field analysis (COMFA) methodology is proposed. In this methodology, the three-dimensional structure for each molecule is placed within a three-dimensional lattice. A probe atom is chosen and placed successively at each lattice intersection, and the steric and electrostatic interaction energies between the probe atom and the molecule are calculated for all lattice intersections. Such energies are listed in a 3D-QSAR table. A field fit procedure is applied by choosing the molecule with the greatest biological activity as the reference and conforming the remaining molecules to it. In determining which conformation of the molecule to use in the analysis, COMFA proposes using averaging or Boltzmann distribution weighting to determine a most representative conformer. After the 3D-QSAR table is formed, a partial least squares analysis and cross-validation are performed. The outcome is a set of coefficient values, one for each column in the data table. When used in a linear equation relating column values to measured biological values, these coefficients would tend to predict the observed biological properties in terms of differences in the energy fields among the molecules in the data set, at every one of the sampled lattice points.
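The lattice sampling step described above can be illustrated with a minimal Python sketch. It is not the procedure of the cited patent: the function names, the Lennard-Jones 12-6 form for the steric term, the Coulomb form for the electrostatic term, the energy cutoff, and all parameter values are assumptions chosen only for illustration. Each atom is given as a hypothetical tuple (x, y, z, partial charge), with distances in angstroms.

```python
import itertools
import math

COULOMB_CONSTANT = 332.06  # kcal*angstrom/(mol*e^2), standard conversion factor


def lattice_points(lo, hi, step):
    """Return every (x, y, z) intersection of a regular cubic lattice."""
    n = int(round((hi - lo) / step)) + 1
    axis = [lo + i * step for i in range(n)]
    return list(itertools.product(axis, repeat=3))


def probe_energies(atoms, point, probe_charge=1.0,
                   epsilon=0.1, r_min=3.4, cutoff=30.0):
    """Steric and electrostatic energy of a probe atom placed at `point`.

    epsilon, r_min and cutoff are illustrative values (assumptions).
    """
    steric = 0.0
    electrostatic = 0.0
    for x, y, z, charge in atoms:
        # Guard against a probe sitting exactly on an atom.
        r = max(math.dist(point, (x, y, z)), 1e-6)
        ratio = r_min / r
        steric += epsilon * (ratio ** 12 - 2.0 * ratio ** 6)
        electrostatic += COULOMB_CONSTANT * probe_charge * charge / r
    # Truncate very large energies so points inside the molecule
    # do not dominate the table.
    steric = min(steric, cutoff)
    electrostatic = max(-cutoff, min(electrostatic, cutoff))
    return steric, electrostatic


def field_row(atoms, points):
    """One table row: all steric values, then all electrostatic values."""
    energies = [probe_energies(atoms, p) for p in points]
    return [s for s, _ in energies] + [e for _, e in energies]


# Hypothetical two-atom "molecule" on a small 3 x 3 x 3 lattice.
molecule = [(0.0, 0.0, 0.0, -0.4), (1.2, 0.0, 0.0, 0.4)]
points = lattice_points(-2.0, 2.0, 2.0)
row = field_row(molecule, points)
```

Stacking one such row per aligned molecule would give a table with two columns per lattice point, which is the kind of input a partial least squares analysis would then relate to the measured activities.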
The COMFA method is disadvantageous since it requires that the chemist guess the alignment and active conformation of each molecule or, alternatively, compute the average or a weighted distribution of the steric and electrostatic fields for all conformations. Either choice can undermine the applicability and accuracy of the method.
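As an illustration of the weighted alternative, the following Python sketch combines the field rows of several conformers using Boltzmann weights. The function names, the use of relative conformer energies in kcal/mol, and the default temperature are assumptions made for this example and are not taken from the cited patent.

```python
import math

GAS_CONSTANT = 0.0019872041  # kcal/(mol*K)


def boltzmann_weights(energies, temperature=298.15):
    """Normalized Boltzmann weights for conformer energies in kcal/mol."""
    kt = GAS_CONSTANT * temperature
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def weighted_field(fields, energies, temperature=298.15):
    """Boltzmann-weighted average of per-conformer field rows.

    `fields` holds one list of field values per conformer, all sampled
    at the same lattice points.
    """
    weights = boltzmann_weights(energies, temperature)
    n_columns = len(fields[0])
    return [
        sum(w * field[j] for w, field in zip(weights, fields))
        for j in range(n_columns)
    ]
```

Because the weighted row blends the fields of every conformer, it need not correspond to the field of any single conformer, including the one actually bound to the target.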
The COMFA method is also disadvantageous because it constructs a linear model to predict activity as a function of the properties measured at the grid points. Biological activity is an inherently nonlinear function of molecular surface properties (such as electrostatic, weak polar, and van der Waals interactions). In COMFA these nonlinearities must be captured in the field values measured at the grid points, because the linear model itself cannot represent them.
None of the above-described approaches is entirely satisfactory. It is therefore desirable to provide an improved approach for modeling biological activity in which the above-described difficulties are alleviated.