The use of robotics and miniaturization is now allowing researchers to quickly screen thousands of compounds for biological activity. Combinatorial chemistry provides the logistics of mass production of compounds and a wide range of molecular diversity for drug discovery. The automation of biological assays, High Throughput Screening (HTS), allows for investigation of thousands of compounds against biological targets per week. While this brute-force approach to lead generation certainly has its place in the field of drug discovery, it is not practical to adopt HTS for every new target of potential biological importance, given the size of today's chemical libraries (e.g., hundreds of thousands to millions of compounds).
Various molecular descriptors (explanatory variables) can be readily computed to describe the chemical properties of every molecule in the database. When there is no prior model relating biological response to these descriptors, the generally accepted procedure is to screen (test) a diverse subset of the overall database, and then examine further compounds that are structurally similar to any promising leads. Measures of “diversity” and “similarity” are based on the numerical descriptors. The assumption here is that similar objects are more likely to have similar biological responses. Thus, if an initial subset is to be selected, the subset should “fill” or “cover” the numerical space in some sense. Ideally, selected objects should be as dissimilar as possible and any candidate not selected should be near a molecule in the experimental design.
To measure the “coverage” of a descriptor space, the space is divided into cells. A good experimental design will ideally have at least one molecule in every cell. When this condition is met, the space is said to be covered. In a conventional cell-based method, the range for each of the k descriptors is subdivided into m bins of equal size, yielding m^k cells. With even moderate values of m and k, a huge number of cells are generated, most of which are empty even for the candidate set of all molecules in the database. Any subset selected has even poorer coverage of the cells, making comparison of potential experimental designs difficult.
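The growth of the cell count can be illustrated with a short calculation. The following is a minimal sketch in Python; the function names and the library size of 100,000 molecules are illustrative assumptions, not values taken from the cited work. Because each molecule falls in exactly one cell, the number of occupied cells can never exceed the number of molecules, which gives a simple upper bound on occupancy.

```python
def cell_count(m, k):
    """Number of cells when each of k descriptors is split into m bins."""
    return m ** k


def max_occupancy(n_molecules, m, k):
    """Upper bound on the fraction of non-empty cells.

    Each molecule occupies exactly one cell, so at most n_molecules
    cells can be non-empty.
    """
    return min(1.0, n_molecules / cell_count(m, k))


# Hypothetical library size, chosen only for illustration.
n_molecules = 100_000
for k in (3, 6, 10):
    for m in (4, 7):
        print(f"k={k:2d} m={m}: cells={cell_count(m, k):>12,d} "
              f"max occupancy={max_occupancy(n_molecules, m, k):.4f}")
```

The bound is optimistic: real descriptor values are correlated and clustered, so actual occupancy is typically well below it.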
FIG. 1 shows the univariate and pairwise plots (100 and 105) of the six descriptors for the NCI candidate molecules (described later), that is, the distributions of the NCI molecules in one-dimensional (1-D) and two-dimensional (2-D) projections for all six descriptors. It is clear that much of the space is empty. Either the collection is missing chemicals or it is not possible to make compounds with certain combinations of descriptors. In more than two dimensions, this problem will be even worse. Consequently, to deal with a problem of practical importance, P. R. Menard et al., “Chemistry Space Metrics in Diversity Analysis, Library Design, and Compound Selection,” Journal of Chemical Information and Computer Sciences, 38, 1204-1213, (1998), restricted the number of descriptors to 3 to 6 and the number of bins per descriptor to 4 to 7 and excluded a large number of candidate points by treating them as outlying observations. Even with these restrictions, over 80% of cells were empty in an example that they presented with 66 cells.
The existing methods for selecting an experimental design to cover a space of explanatory variables, and their deficiencies, are now discussed. The most common designs for selecting diverse molecules are random designs, distance designs, and cell-based binning designs.
The simplest designs are based on random sampling. In fact, most new leads have been discovered through random screening, in which large numbers of compounds are tested for a specific biological activity, and the active compounds are then selected for optimization. S. S. Young et al., “Random Versus Rational—Which is Better for General Compound Screening?”, Network Science, www.netsci.org/Science/Screening/feature09.html (1996), used a constant radius hypersphere around each randomly selected compound to measure the coverage of the descriptor space. They concluded that, unless a very large number of compounds are used to fill space, randomly selected compounds will cover as much space as carefully selected compounds. On the other hand, if the important dimensions for a particular problem are identified, and if a focused set of compounds is desired, then rational selection should be more effective than random designs.
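A constant-radius coverage measure of the kind described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: coverage is taken to be the fraction of candidate points lying within a fixed radius of at least one selected compound, and the two-descriptor uniform data, the radius of 0.1, and the function name covered_fraction are hypothetical. The exact measure used by Young et al. may differ.

```python
import math
import random


def covered_fraction(candidates, selected, radius):
    """Fraction of candidates within `radius` of at least one selected point."""
    hit = 0
    for p in candidates:
        if any(math.dist(p, s) <= radius for s in selected):
            hit += 1
    return hit / len(candidates)


rng = random.Random(1)
# Hypothetical candidate set: 2000 points in a 2-D descriptor space.
candidates = [(rng.random(), rng.random()) for _ in range(2000)]

# Coverage achieved by random designs of increasing size.
for n in (10, 50, 200):
    design = rng.sample(candidates, n)
    print(n, round(covered_fraction(candidates, design, radius=0.1), 3))
```

Applying the same function to a rationally selected design of equal size allows the two selection strategies to be compared on the same footing.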
There are three main types of distance-based designs for selecting molecules from chemical databases. R. E. Higgs et al., “Experimental Designs for Selecting Molecules from Large Chemical Databases,” Journal of Chemical Information and Computer Sciences, 37, 861-870 (1997), refer to these as “Edge”, “Spread” and “Coverage” designs. These methods first define a descriptor distance metric (e.g., Euclidean or Manhattan distance) to measure the similarities or dissimilarities of the molecules, and then find the optimal coverage of the space based on some distance criterion. Edge (D-optimal) designs identify molecules at the edge of the descriptor space that produce minimum variance estimators for parameters in a regression model that is linear in the descriptors (explanatory variables). Spread designs (see, e.g., R. W. Kennard et al., “Computer Aided Design of Experiments,” Technometrics 11, 137-148 (1969)) identify a subset of molecules that are maximally dissimilar with respect to each other. Coverage designs (see, e.g., P. J. Zemroch, “Cluster Analysis as an Experimental Design Generator, With Application to Gasoline Blending Experiments,” Technometrics 28, 39-49 (1986)) select a subset of molecules that are maximally similar to the candidate set of molecules. The following references provide more detailed descriptions of these designs: R. E. Higgs et al., “Experimental Designs for Selecting Molecules from Large Chemical Databases,” Journal of Chemical Information and Computer Sciences, 37, 861-870 (1997); M. E. Johnson et al., “Minimax and Maximin Distance Designs,” Journal of Statistical Planning and Inference 26, 131-148 (1990); and R. Tobias, SAS QC Software. Volume 1: Usage and Reference, SAS Institute Inc., Cary, N.C., 657-728 (1995).
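The idea behind a spread design can be sketched with a greedy heuristic. The Python code below is an illustration only, not the algorithm of any of the cited references: it assumes Euclidean distance, starts from the candidate farthest from the centroid (an arbitrary choice), and repeatedly adds the candidate whose distance to its nearest already-selected point is largest. The name greedy_spread is hypothetical.

```python
import math
import random


def greedy_spread(candidates, n_d):
    """Greedily select n_d candidate indices that are mutually far apart."""
    k = len(candidates[0])
    n = len(candidates)
    centroid = [sum(p[j] for p in candidates) / n for j in range(k)]
    # Assumed starting rule: the point farthest from the centroid.
    first = max(range(n), key=lambda i: math.dist(candidates[i], centroid))
    selected = [first]
    # min_dist[i] is the distance from candidate i to its nearest selected point.
    min_dist = [math.dist(p, candidates[first]) for p in candidates]
    while len(selected) < n_d:
        # Add the candidate that is farthest from everything chosen so far.
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(candidates):
            d = math.dist(p, candidates[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


rng = random.Random(0)
# Hypothetical candidate set: 200 points in a 2-D descriptor space.
candidates = [(rng.random(), rng.random()) for _ in range(200)]
design = greedy_spread(candidates, 10)
print(design)
```

The greedy search evaluates only a tiny fraction of the possible subsets, so it yields a good design rather than a provably optimal one. Because it favors extreme points, a few outlying candidates are selected early, which illustrates the sensitivity of distance-based designs to outliers.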
There are three problems with distance-based designs. First, in general these designs try to find a subset with optimal coverage of the entire descriptor space but pay little attention to the coverage in lower-dimensional subspaces. The low-dimensional coverage (i.e., 1-D, 2-D and 3-D) can be quite poor. The following references addressed this issue by incorporating 1-D coverage into their spread designs: M. D. Morris et al., “Bayesian Design and Analysis of Computer Experiments: Use of Derivatives in Surface Prediction,” Technometrics 35, 243-255 (1993); and M. D. Morris et al., “Exploratory Designs for Computational experiments,” Journal of Statistical Planning and Inference, 43, 381-402 (1995). Second, descriptors that are unrelated to target activity can have a significant impact on the distribution of the molecules in the space, and can make the “optimality” of a design irrelevant. Without proper selection of descriptors, these optimal designs are not expected to improve the quality of rational sampling over that of random sampling. Third, the presence of relatively few outlying observations can have a significant impact on these designs. Very often, many outlying molecules must be removed to arrive at a sensible design.
In the conventional cell-based method, each of the k numerical descriptors is subdivided into m bins of equal size, yielding m^k cells or hypercubes, and the experimental design chooses at least one molecule from every cell. This method is attractive because it is easy to divide the descriptor space into cells, and allocating even a very large dataset to these cells is straightforward. Missing diversity (i.e., empty cells) can easily be identified. The following references disclose cell-based binning methods to compare the relative diversity of molecular databases and to select diverse subsets of molecules: D. J. Cummins et al., “Molecular Diversity in Chemical Databases: Comparison of Medicinal Chemistry Knowledge Bases and Databases of Commercially Available Compounds,” Journal of Chemical Information and Computer Sciences, 36, 750-763 (1996); and P. R. Menard et al., “Chemistry Space Metrics in Diversity Analysis, Library Design, and Compound Selection,” Journal of Chemical Information and Computer Sciences, 38, 1204-1213 (1998). A problem with many existing cell-based binning methods is that they generate too many cells and many of the cells are empty. Even when k and m are relatively small, the number of empty cells is often more than the number of nonempty cells for chemistry problems. To reduce the number of empty cells, Cummins et al. (1996) and Menard et al. (1998) restricted the number of descriptors and the number of bins per descriptor. They also excluded many outlying candidate points. Even with these compromises, they reported a large proportion of empty cells. Indeed, Menard et al. (1998) expect a very low cell occupancy: they recommended a targeted occupancy of 12-15%.
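The conventional cell-based procedure can be sketched as follows. This Python code is a minimal illustration under stated assumptions: bins are of equal width over the observed range of each descriptor, the representative kept for each occupied cell is simply the first molecule encountered, and the synthetic candidate set (5000 points with six correlated descriptors) is hypothetical. The names bin_index and cell_based_design are not taken from the cited references.

```python
import random


def bin_index(x, lo, hi, m):
    """Equal-width bin of x in [lo, hi]; the upper edge goes in the last bin."""
    if hi == lo:
        return 0
    return min(int((x - lo) / (hi - lo) * m), m - 1)


def cell_based_design(points, m):
    """Return {cell: index of the first point found in that cell}."""
    k = len(points[0])
    lows = [min(p[j] for p in points) for j in range(k)]
    highs = [max(p[j] for p in points) for j in range(k)]
    chosen = {}
    for i, p in enumerate(points):
        cell = tuple(bin_index(p[j], lows[j], highs[j], m) for j in range(k))
        chosen.setdefault(cell, i)  # keep one representative per occupied cell
    return chosen


rng = random.Random(0)
# Hypothetical candidates: six descriptors sharing a common component,
# so the descriptor values are correlated.
points = []
for _ in range(5000):
    shared = rng.gauss(0, 1)
    points.append([shared + rng.gauss(0, 0.5) for _ in range(6)])

m = 6
design = cell_based_design(points, m)
total_cells = m ** 6
print(f"{len(design)} of {total_cells} cells occupied "
      f"({len(design) / total_cells:.1%})")
```

With 5000 candidates and 6^6 = 46,656 cells, at most about 11% of the cells can be occupied, so most of the space is reported as empty regardless of how the candidates are distributed.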
Two popular space-filling designs, currently only applied to more regular sampling spaces, are Latin hypercube designs (see, e.g., M. D. McKay et al., “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code,” Technometrics 21, 239-245 (1979)) and uniform shell designs (see, e.g., D. H. Doehlert, “Uniform Shell Designs,” Applied Statistics, 19, 231-239 (1970)). Latin hypercubes have excellent 1-D coverage and are very popular in experiments with computer models. The main problem in applying these methods to compound selection is that for chemical compounds only certain combinations of descriptor values exist.
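For reference, a basic Latin hypercube sample on the unit hypercube can be generated as sketched below. This is a minimal Python illustration of the general construction, not code from the cited reference; the function name latin_hypercube and the sizes used are assumptions. Each axis is divided into n equal strata, and each stratum on each axis contains exactly one design point, which is the source of the good 1-D coverage.

```python
import random


def latin_hypercube(n, k, rng):
    """Return n points in [0, 1)^k with one point per 1-D stratum on each axis."""
    columns = []
    for _ in range(k):
        strata = list(range(n))
        rng.shuffle(strata)  # independent random permutation for each axis
        columns.append([(s + rng.random()) / n for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n)]


rng = random.Random(0)
pts = latin_hypercube(10, 3, rng)
for p in pts:
    print(tuple(round(x, 3) for x in p))
```

The generated coordinates are arbitrary positions in a regular region. For compound selection, there is generally no existing molecule at such a position, which is why the construction does not carry over directly.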
Some of the other problems related to design and analysis of molecular data include:
1. The model is vague. In many drug design problems, it is not clear which model is appropriate to relate biological response to molecular properties beyond the assertion that similar objects are more likely to respond similarly. Thus if a collection of chemical objects is described numerically and if a subset is to be selected, then the subset should “fill” the numerical space in some sense: selected objects should be as dissimilar as possible, and any candidate not selected should be near a selected object.
2. There is more than one response. Several biological screenings, each designed to detect a specific biological activity, may be in operation within the research division of a pharmaceutical company at any given time. It is hoped that the collective output of these screens will provide enough leads to contribute to the discovery process in a meaningful way.
3. The samples to be chosen from are often a collection of restricted sampling points and their descriptor values are dependent. A standard experimental design procedure assumes that the descriptor space can be represented as a region bounded by a k-dimensional hypercube with any point in the cube being a candidate point. The standard design is not possible for compound selection problems because, even though the space is continuous, the possible molecules are discrete. It is not possible to place a compound at certain positions in the space. One is restricted to the compounds that one has or can make.
4. The number of candidate points (N_c) and the number of design points (n_d) are large. The number of possible combinations of the samples to be chosen from is so large that it becomes computationally impossible to consider every possible combination in the experimental design; it may take days, weeks, or even months to compare every combination. In theory, to identify the optimal design, one needs to examine all possible subsets of size n_d from the N_c candidate points, thus performing N_c-choose-n_d subset evaluations. In practice, the magnitudes of N_c and n_d prohibit a full-scale optimization. For example, even to choose a small design of 100 points from a very small candidate set of 1000 molecules, there are 6.4×10^139 possible subsets. For moderate or large datasets, search algorithms are usually applied to find a very good design, as it is impossible to find the globally best design.
5. The number of descriptors can be very large. Tens to hundreds to thousands of molecule descriptors are possible. This implies a high-dimensional problem.
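The subset count quoted in item 4 can be checked with exact integer arithmetic. The following short Python sketch uses the standard-library function math.comb; the variable names are illustrative.

```python
import math

# Number of ways to choose a design of 100 points from 1000 candidates.
subsets = math.comb(1000, 100)

# Express the exact integer in scientific notation.
exponent = len(str(subsets)) - 1
mantissa = subsets / 10 ** exponent
print(f"{mantissa:.1f} x 10^{exponent}")
```

Even at a hypothetical rate of one billion subset evaluations per second, exhaustive enumeration of a count of this magnitude is out of reach, which is why heuristic search is used in practice.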
Thus, what is needed is a fast method to select representative objects while providing optimal (or near optimal) space coverage.