The present invention relates to a method of identifying candidate molecules expected to be biologically active.
Evaluating receptor or target suitability of molecules is an important task in pharmaceutical drug research. With the increasing use of automation techniques in drug discovery processes over recent years, methods such as High-Throughput Screening (HTS) and High-Throughput Synthesis have become industry standards in pharmaceutical research. Nowadays, it is possible to test more than 20,000 molecules per day for their biological activities against certain disease targets. Likewise, in the area of chemical synthesis, combinatorial chemistry in combination with automation processes can make hundreds of molecules per day physically available. Since, based on today's chemical knowledge, more than 10^100 molecules could theoretically be synthesized and tested, and several hundred thousand molecules are commercially available, computer-assisted methods have been developed to select the subsets of molecules that are actually to be tested, based on their predicted potential for biological activity against certain disease targets.
Two categories of computer assisted methods serve the purpose of discovering (selecting and/or prioritizing) molecules from data sets of theoretically available molecules for biological activity testing. The first category comprises diversity or similarity based discovery methods, whereas the second category comprises structure based discovery methods. Among the second category, there are database search techniques, as well as (Q)SAR methods and Docking methods.
Only the (Q)SAR methods and the Docking methods implicitly consider information related to specific targets, namely either common structural patterns of a series of active molecules ((Q)SAR) or the 3-dimensional structure of a target protein (Docking), and therefore deliver the most specific results. In practice, methods based on (Q)SAR or Docking are applied to smaller data sets (up to 50,000 sets), since they need relatively high computing power. Although parallel computing techniques can be used to gain speed, the biological activity of data sets consisting of more than 10^6 molecules still cannot be predicted within a reasonable time frame.
The term biological activity is hereinafter used to comprise in particular pharmaceutical as well as agrochemical activity with respect to a certain receptor or target.
The search for candidate molecules also comprises the search for lead compounds.
It is therefore an object of the present invention to provide a method of and a system for finding candidate molecules expected to be biologically active, which method and system can be applied to molecule libraries comprising large amounts of data and yield results in a reasonable time.
According to the invention, the method of identifying candidate molecules expected to be biologically active comprises the following steps:
a) creating a set consisting of different molecules;
b) to each of said molecules of said set, assigning a descriptor representing a predetermined number of molecular properties;
c) mapping said set of molecules onto points of a two-dimensional grid with regard to a predetermined similarity relation of the respective assigned descriptors such that the grid distance between grid points of two molecules is a measure for the similarity of said two molecule descriptors;
d) forming a three-dimensional surface over said grid of molecules, said surface representing the distribution of biological activity of the molecules on the grid approximatively according to a predetermined quality criterion;
e) selecting from said three-dimensional surface candidate molecules satisfying a predetermined criterion with respect to their biological activity.
According to the invention, said three-dimensional surface in step d) may be formed by applying the following steps:
da) taking the whole two-dimensional grid as initial region of approximation;
db) selecting molecules on predetermined grid points of said region, and calculating their respective values of biological activity;
dc) approximating the surface over said region using said previously determined values of biological activity of said molecules on said predetermined grid points;
dd) determining whether said approximated surface satisfies a predetermined quality criterion; if so, proceeding to step e); if not, refining the approximation of the surface by selecting molecules on further grid points, calculating their respective values of biological activity, and repeating step dc) and this step dd).
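The refinement loop of steps da) to dd) can be illustrated by the following minimal sketch. It rests on several assumptions that are not taken from the description: the grid is square with a side of 2**k + 1 points, regions are subdivided quadtree-fashion rather than by triangulation, the quality criterion compares the calculated activity at the centre of a region with the mean of its four corner values, and `mock_activity` is a hypothetical stand-in for a docking calculation or measurement.

```python
import math
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]


def approximate_surface(
    activity: Callable[[int, int], float],
    size: int,
    tolerance: float,
) -> Dict[Point, float]:
    """Return the grid points whose activity was explicitly calculated."""
    side = size - 1
    if side < 1 or side & (side - 1):
        raise ValueError("size must be 2**k + 1 (assumption of this sketch)")
    evaluated: Dict[Point, float] = {}

    def value(x: int, y: int) -> float:
        # The expensive calculation is performed at most once per grid point.
        if (x, y) not in evaluated:
            evaluated[(x, y)] = activity(x, y)
        return evaluated[(x, y)]

    def refine(x0: int, y0: int, side: int) -> None:
        x1, y1 = x0 + side, y0 + side
        # Step db): calculate the activity at predetermined points (corners).
        corners = [value(x0, y0), value(x1, y0), value(x0, y1), value(x1, y1)]
        if side == 1:
            return  # the region cannot be subdivided further
        half = side // 2
        # Step dc): bilinear estimate at the centre equals the corner mean.
        estimate = sum(corners) / 4.0
        # Step dd): quality criterion; refine only where the estimate fails.
        if abs(value(x0 + half, y0 + half) - estimate) <= tolerance:
            return
        for dx in (0, half):
            for dy in (0, half):
                refine(x0 + dx, y0 + dy, half)

    refine(0, 0, side)  # step da): the whole grid is the initial region
    return evaluated


def mock_activity(x: int, y: int) -> float:
    # Hypothetical activity landscape with a single peak near (12, 4).
    return 10.0 * math.exp(-((x - 12) ** 2 + (y - 4) ** 2) / 8.0)


points = approximate_surface(mock_activity, size=17, tolerance=0.5)
print(f"calculated {len(points)} of {17 * 17} grid points")
```

A criterion of this simple kind can miss a narrow peak that lies entirely between the sampled points, so the tolerance and the initial sampling density would have to be chosen with the expected activity landscape in mind.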
The method used for forming said three-dimensional surface in step d) is preferably an approximation method of the Delaunay Triangulation type.
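As an illustration of how a triangulated surface yields activity estimates between calculated grid points, the following sketch interpolates linearly inside a single triangle using barycentric coordinates. It does not construct the triangulation itself, and the choice of linear interpolation as well as the function name `interpolate_in_triangle` are assumptions made for this example only.

```python
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def interpolate_in_triangle(
    triangle: Sequence[Point2D],
    values: Sequence[float],
    p: Point2D,
) -> float:
    """Linearly interpolate vertex values at point p (barycentric weights)."""
    (x1, y1), (x2, y2), (x3, y3) = triangle
    v1, v2, v3 = values
    px, py = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle: vertices are collinear")
    # Each weight is 1 at its own vertex and 0 at the other two vertices.
    w1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    w2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    w3 = 1.0 - w1 - w2
    return w1 * v1 + w2 * v2 + w3 * v3


# Three grid points with calculated activities 0, 4 and 8.
tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
estimate = interpolate_in_triangle(tri, [0.0, 4.0, 8.0], (1.0, 1.0))
print(estimate)  # 3.0
```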
Thus, the inventive method consists in performing two major steps. In the first step, the molecules are sorted and mapped onto a 2-D grid according to the similarity of their descriptors. In the second step, the biological activity of the mapped molecules is approximated by modelling its distribution as a surface over the molecule map. From the surface, suitable candidate molecules for further evaluation can be determined. According to the invention, the biological activity has to be actually calculated for only a very small number of the molecules within the data set. This results in a considerable gain in performance. The recursive procedure allows the database to be studied according to customizable quality criteria. Error and quality criteria for the analysis can be tailored to a given problem. Docking simulations of collections of molecules can easily be run in parallel, which leads to an additional performance gain.
Thus, the method according to the present invention translates predicted/measured biological activity into topographical information of a three-dimensional surface, which is analyzed iteratively using approximation algorithms. Only those regions of the surface that represent regions of high biological activity are thoroughly analyzed, while those of low binding energies for a given protein binding site are approximated by few data points. Thus, as examples have shown, active molecules can be identified from data sets by explicitly calculating/measuring just 4-6% of the molecules within the data set.
By using the method according to the invention, drug lead candidates can be identified without the need of making large molecule sets physically available and testing them.
The selected candidate molecules are suitable for chemical synthesis.
Preferably, the molecular properties represented by said descriptors are at least two of:
molecular weight,
number of rotatable bonds,
number of hydrophobic groups,
number of hydrophilic groups,
number of acid groups,
number of basic groups,
number of neutral groups,
number of zwitter groups,
number of heavy atoms,
number of H-bond donors,
number of H-bond acceptors,
number of 1-2 dipoles,
number of 1-3 dipoles,
number of 1-4 dipoles.
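By way of illustration, a descriptor built from a few of the properties listed above can be represented as a fixed-length vector and compared with a distance function. The sketch below is an assumption rather than the similarity relation of the invention: it uses only four of the listed properties, a scaled Euclidean distance, and hypothetical scale factors.

```python
import math
from dataclasses import astuple, dataclass
from typing import Sequence


@dataclass(frozen=True)
class Descriptor:
    # A subset of the listed properties, chosen for this example only.
    molecular_weight: float
    rotatable_bonds: int
    h_bond_donors: int
    h_bond_acceptors: int


def descriptor_distance(
    a: Descriptor, b: Descriptor, scales: Sequence[float]
) -> float:
    """Euclidean distance after dividing each property by its scale."""
    return math.sqrt(
        sum(((x - y) / s) ** 2 for x, y, s in zip(astuple(a), astuple(b), scales))
    )


# Hypothetical scales so that molecular weight does not dominate the counts.
scales = (100.0, 1.0, 1.0, 1.0)
mol_a = Descriptor(180.0, 3, 1, 4)
mol_b = Descriptor(280.0, 3, 1, 4)
print(descriptor_distance(mol_a, mol_b, scales))  # 1.0
```

Scaling matters because the properties have very different numeric ranges; without it, molecular weight would outweigh every count-valued property in the distance.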
Preferably, said molecule-mapping is performed using self-organizing maps within a neural network or statistical methods like linear vector quantization.
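A minimal sketch of a self-organizing map that places descriptor vectors on a two-dimensional grid is given below. It is written under stated assumptions: descriptors are already scaled to the range 0 to 1, the learning rate and neighbourhood radius decay linearly, and the function names, grid size and toy data are hypothetical and not taken from the description.

```python
import math
import random
from typing import List, Sequence, Tuple

Grid = List[List[List[float]]]


def best_matching_unit(weights: Grid, vec: Sequence[float]) -> Tuple[int, int]:
    """Return the (row, col) of the grid node closest to vec."""
    best, best_d = (0, 0), math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, vec))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(
    data: Sequence[Sequence[float]],
    rows: int,
    cols: int,
    epochs: int = 50,
    seed: int = 0,
) -> Grid:
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    dim = len(data[0])
    weights = [
        [[rng.random() for _ in range(dim)] for _ in range(cols)]
        for _ in range(rows)
    ]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                  # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
        for vec in data:
            br, bc = best_matching_unit(weights, vec)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-d2 / (2.0 * radius * radius))
                    w = weights[r][c]
                    # Pull the node towards the sample, weighted by
                    # its grid distance to the best matching unit.
                    for i in range(dim):
                        w[i] += rate * influence * (vec[i] - w[i])
    return weights


# Toy descriptors (already scaled): two groups of similar molecules.
descriptors = [[0.0, 0.0], [0.05, 0.05], [1.0, 1.0], [0.95, 0.95]]
som = train_som(descriptors, rows=4, cols=4)
for d in descriptors:
    print(d, "->", best_matching_unit(som, d))
```

After training, each molecule is assigned to the grid point of its best matching unit, so that molecules with similar descriptors tend to be placed on the same or neighbouring grid points.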