Molecular similarity is one of the most ubiquitous concepts in chemistry (Johnson, M. A., and Maggiora, G. M., Concepts and Applications of Molecular Similarity, Wiley, New York (1990)). It is used to analyze and categorize chemical phenomena, rationalize the behavior and function of molecules, and design new chemical entities with improved physical, chemical, and biological properties. Molecular similarity is typically quantified in the form of a numerical index derived either through direct observation, or through the measurement of a set of characteristic properties (descriptors), which are subsequently combined in some form of dissimilarity or distance measure. For large collections of compounds, similarities are usually described in the form of a symmetric matrix that contains all the pairwise relationships between the molecules in the collection. Unfortunately, pairwise similarity matrices do not lend themselves to numerical processing and visual inspection. A common solution to this problem is to embed the objects into a low-dimensional Euclidean space in a way that preserves the original pairwise proximities as faithfully as possible. This approach, known as multidimensional scaling (MDS) (Torgerson, W. S., Psychometrika 17:401–419 (1952); Kruskal, J. B., Psychometrika 29:115–129 (1964)) or nonlinear mapping (NLM) (Sammon, J. W., IEEE Trans. Comp. C18:401–409 (1969)), converts the data points into a set of real-valued vectors that can subsequently be used for a variety of pattern recognition and classification tasks.
Given a set of k objects, a symmetric matrix, rij, of relationships between these objects, and a set of images on an m-dimensional map {yi, i=1, 2, . . . , k; yi ∈ ℝ^m}, the problem is to place yi onto the map in such a way that their Euclidean distances dij=∥yi−yj∥ approximate as closely as possible the corresponding values rij. The quality of the projection is determined using a sum-of-squares error function known as stress, which measures the differences between dij and rij over all k(k−1)/2 possible pairs. This function is numerically minimized in order to generate the optimal map. This is typically carried out in an iterative fashion by: (1) generating an initial set of coordinates yi, (2) computing the distances dij, (3) finding a new set of coordinates yi that leads to a reduction in stress using a steepest descent algorithm, and (4) repeating steps (2) and (3) until the change in the stress function falls below some predefined threshold. There is a wide variety of MDS algorithms involving different error (stress) functions and optimization heuristics, which are reviewed in Schiffman, Reynolds and Young, Introduction to Multidimensional Scaling, Academic Press, New York (1981); Young and Hamer, Multidimensional Scaling: History, Theory and Applications, Erlbaum Associates, Inc., Hillsdale, N.J. (1987); Cox and Cox, Multidimensional Scaling, Number 59 in Monographs in Statistics and Applied Probability, Chapman-Hall (1994), and Borg, I., Groenen, P., Modern Multidimensional Scaling, Springer-Verlag, New York, (1997). The contents of these publications are incorporated herein by reference in their entireties.
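The iterative procedure in steps (1) through (4) can be illustrated with a minimal sketch. It assumes the raw sum-of-squares stress, a random initial placement, and a fixed-step gradient update that halves the step whenever a trial move would increase the stress; the function names, default step size, and tolerance are illustrative assumptions, not taken from the cited works.

```python
import math
import random


def stress(y, r):
    """Raw stress: sum of (d_ij - r_ij)^2 over all k(k-1)/2 pairs."""
    k = len(y)
    return sum((math.dist(y[i], y[j]) - r[i][j]) ** 2
               for i in range(k) for j in range(i + 1, k))


def mds_steepest_descent(r, m=2, step=0.05, tol=1e-9, max_iter=5000, seed=0):
    """Embed k objects in m dimensions by steepest descent on the stress."""
    rng = random.Random(seed)
    k = len(r)
    # (1) initial coordinates, drawn at random (an assumption of this sketch)
    y = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(k)]
    current = stress(y, r)
    for _ in range(max_iter):
        # (2) compute distances and accumulate the stress gradient for each y_i
        grad = [[0.0] * m for _ in range(k)]
        for i in range(k):
            for j in range(k):
                if i == j:
                    continue
                d = math.dist(y[i], y[j])
                if d == 0.0:
                    continue  # gradient undefined for coincident points
                coef = 2.0 * (d - r[i][j]) / d
                for a in range(m):
                    grad[i][a] += coef * (y[i][a] - y[j][a])
        # (3) move against the gradient; reject the move if stress goes up
        trial = [[y[i][a] - step * grad[i][a] for a in range(m)]
                 for i in range(k)]
        trial_stress = stress(trial, r)
        if trial_stress > current:
            step *= 0.5  # overshoot: shrink the step and try again
            continue
        change = current - trial_stress
        y, current = trial, trial_stress
        # (4) stop once the change in stress falls below the threshold
        if change < tol:
            break
    return y, current
```

Every iteration evaluates all pairs of objects, so the cost of a single pass grows quadratically with k.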
Unfortunately, the quadratic nature of the stress function (i.e., the fact that the computational time required scales proportionally to k^2) makes these algorithms impractical for data sets containing more than a few thousand items. Several approaches have been devised to reduce the complexity of the task. (See Chang, C. L., and Lee, R. C. T., IEEE Trans. Syst., Man, Cybern., 1973, SMC-3, 197–200; Pykett, C. E., Electron. Lett., 1978, 14, 799–800; Lee, R. C. Y., Slagle, J. R., and Blum, H., IEEE Trans. Comput., 1977, C-27, 288–292; Biswas, G., Jain, A. K., and Dubes, R. C., IEEE Trans. Pattern Anal. Machine Intell., 1981, PAMI-3(6), 701–708). However, these methods either focus on a small subset of objects or a small fraction of distances, and the resulting maps are generally difficult to interpret.
Recently, two very effective alternative strategies were described. The first is based on a self-organizing procedure which repeatedly selects subsets of objects from the set of objects to be mapped, and refines their coordinates so that their distances on the map approximate more closely their corresponding relationships. (See U.S. Pat. Nos. 6,295,514 and 6,453,246, each of which is incorporated by reference herein in its entirety). The method involves the following steps: (1) placing the objects on the map at some initial coordinates, yi, (2) selecting a subset of objects, (3) revising the coordinates, yi, of at least some of the selected objects so that at least some of their distances, dij, match more closely their corresponding relationships rij, (4) repeating steps (2) and (3) for additional subsets of objects, and (5) exporting the refined coordinates, yi, for the entire set of objects or any subset thereof.
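A minimal sketch of such a self-organizing refinement is given below. It makes several assumptions that are not stated above: each selected subset is a random pair of objects, both members of the pair are moved along the line joining them, and a learning rate `lam` is decreased linearly over the run. The function name and all default parameter values are hypothetical.

```python
import math
import random


def self_organizing_map(r, m=2, cycles=50, steps_per_cycle=2000,
                        rate=1.0, seed=0):
    """Refine map coordinates by repeated pairwise adjustments."""
    rng = random.Random(seed)
    k = len(r)
    # Initial placement: random coordinates (an assumption of this sketch).
    y = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(k)]
    for c in range(cycles):
        lam = rate * (1.0 - c / cycles)  # assumed linear annealing schedule
        for _ in range(steps_per_cycle):
            # Select a subset of objects; here, one random pair.
            i, j = rng.sample(range(k), 2)
            d = math.dist(y[i], y[j])
            # Revise both coordinates so that d_ij moves toward r_ij.
            # With lam = 1 the pair lands (almost) exactly at distance r_ij.
            scale = 0.5 * lam * (r[i][j] - d) / (d + 1e-12)
            for a in range(m):
                delta = scale * (y[i][a] - y[j][a])
                y[i][a] += delta
                y[j][a] -= delta
    # Export the refined coordinates for the entire set of objects.
    return y
```

Each update touches a single pair, so the cost of a run is set by the number of updates rather than by the k(k−1)/2 pairs in the full relationship matrix.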
The second method attempts to derive an analytical mapping function that can generate mapping coordinates from a set of object features. (See U.S. application Ser. No. 09/303,671, filed May 3, 1999, and U.S. application Ser. No. 09/814,160, filed Mar. 22, 2001, each of which is incorporated by reference herein in its entirety). The method works as follows. Initially, a subset of objects from the set of objects to be mapped and their associated relationships are selected. This subset of objects is then mapped onto an m-dimensional map using the self-organizing procedure described above, or any other MDS algorithm. Hereafter, the coordinates of objects in this m-dimensional map shall be referred to as “output coordinates” or “output features”. In addition, a set of n attributes is determined for each object in the selected subset. Hereafter, these n attributes shall be referred to as “input coordinates” or “input features”. Thus, each object in the selected subset of objects is associated with an n-dimensional vector of input features and an m-dimensional vector of output features. A supervised machine learning approach is then employed to determine a functional relationship between the n-dimensional input and m-dimensional output vectors, and that functional relationship is recorded. Hereafter, this functional relationship shall be referred to as a “mapping function”. Additional objects that are not part of the selected subset of objects may be mapped by computing their input features and using them as input to the mapping function, which produces their output coordinates. The mapping function can be encoded in a neural network or a collection of neural networks.
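As a rough illustration of a mapping function, the sketch below trains a small feed-forward network with one tanh hidden layer by stochastic gradient descent on squared error. The class name `TinyMLP`, the hidden-layer size, the learning rate, and the helper `mean_squared_error` are assumptions made for this example; the referenced applications may use a different architecture or training scheme.

```python
import math
import random


class TinyMLP:
    """One-hidden-layer network mapping n input features to m output coordinates."""

    def __init__(self, n, m, hidden=8, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry that acts as a bias term.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n + 1)]
                   for _ in range(hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
                   for _ in range(m)]

    def _forward(self, x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x + [1.0])))
             for row in self.w1]
        out = [sum(w * v for w, v in zip(row, h + [1.0])) for row in self.w2]
        return h, out

    def predict(self, x):
        """Apply the mapping function to one input feature vector."""
        return self._forward(list(x))[1]

    def train(self, inputs, outputs, epochs=200, lr=0.05):
        """Fit the network to (input features, output coordinates) pairs."""
        for _ in range(epochs):
            for x, t in zip(inputs, outputs):
                x = list(x)
                h, out = self._forward(x)
                err = [o - tv for o, tv in zip(out, t)]
                # Back-propagate the output error through the tanh layer.
                dh = [(1.0 - h[j] ** 2)
                      * sum(err[o] * self.w2[o][j] for o in range(len(err)))
                      for j in range(len(h))]
                for o, e in enumerate(err):
                    for j, v in enumerate(h + [1.0]):
                        self.w2[o][j] -= lr * e * v
                for j, g in enumerate(dh):
                    for i, v in enumerate(x + [1.0]):
                        self.w1[j][i] -= lr * g * v


def mean_squared_error(net, inputs, outputs):
    """Average squared distance between predicted and target coordinates."""
    total = 0.0
    for x, t in zip(inputs, outputs):
        total += sum((o - tv) ** 2 for o, tv in zip(net.predict(x), t))
    return total / len(inputs)
```

In use, `inputs` would hold the n-dimensional input features of the selected subset and `outputs` the m-dimensional coordinates obtained for that subset by MDS; after training, `predict` yields output coordinates for objects outside the subset without any further pairwise computation.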
Both the self-organizing and the neural network methods are general and can be used to produce maps of any desired dimensionality.
MDS can be particularly valuable for analyzing and visualizing combinatorial chemical libraries. A combinatorial library is a collection of chemical compounds derived from the systematic combination of a prescribed set of chemical building blocks according to a specific reaction protocol. A combinatorial library is typically represented as a list of variation sites on a molecular scaffold, each of which is associated with a list of chemical building blocks. Each compound (or product) in a combinatorial library can be represented by a unique tuple, {r1, r2, . . . , rd}, where ri is the building block at the i-th variation site, and d is the number of variation sites in the library. For example, a polypeptide combinatorial library is formed by combining a set of chemical building blocks called amino acids in every possible way for a given compound length (here, the number of variation sites is the number of amino acids along the polypeptide chain). Millions of products theoretically can be synthesized through such combinatorial mixing of building blocks. As one commentator has observed, the systematic combinatorial mixing of 100 interchangeable chemical building blocks results in the theoretical synthesis of 100 million tetrameric compounds or 10 billion pentameric compounds (Gallop et al., “Applications of Combinatorial Technologies to Drug Discovery, Background and Peptide Combinatorial Libraries,” J. Med. Chem. 37, 1233–1250 (1994), which is incorporated by reference herein in its entirety). A computer representation of a combinatorial library is often referred to as a virtual combinatorial library.
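The tuple representation and the size estimates quoted above can be reproduced with a short sketch. The function names and the toy building-block lists in the usage note are hypothetical and serve only to illustrate the counting argument.

```python
from itertools import product
from math import prod


def library_size(sites):
    """Number of products: the product of the building-block counts per site."""
    return prod(len(blocks) for blocks in sites)


def enumerate_products(sites):
    """Lazily yield every product as a tuple (r1, r2, ..., rd)."""
    return product(*sites)
```

For example, `library_size([range(100)] * 4)` gives 100,000,000 and `library_size([range(100)] * 5)` gives 10,000,000,000, matching the tetrameric and pentameric figures cited above. A two-site toy library such as `[["A", "B"], ["x", "y", "z"]]` enumerates to six tuples, beginning with `("A", "x")`.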
MDS can simplify the analysis of combinatorial libraries in two important ways: (1) by reducing the number of dimensions that are required to describe the compounds in some abstract chemical property space in a way that preserves the original relationships among the compounds, and (2) by producing Cartesian coordinate vectors from data supplied directly or indirectly in the form of molecular similarities, so that they can be analyzed with conventional statistical and data mining techniques. Typical applications of coordinates obtained with MDS include visualization, diversity analysis, similarity searching, compound classification, structure-activity correlation, etc. (See, e.g., Agrafiotis, D. K., The diversity of chemical libraries, The Encyclopedia of Computational Chemistry, Schleyer, P. v. R., Allinger, N. L., Clark, T., Gasteiger, J., Kollman, P. A., Schaefer III, H. F., and Schreiner, P. R., Eds., John Wiley & Sons, Chichester, 742–761 (1998); and Agrafiotis, D. K., Myslik, J. C., and Salemme, F. R., Advances in diversity profiling and combinatorial series design, Mol. Diversity, 4(1), 1–22 (1999), each of which is incorporated by reference herein in its entirety).
Analyzing a combinatorial library based on the properties of the products (as opposed to the properties of their building blocks) is often referred to as product-based design. Several product-based methodologies for analyzing virtual combinatorial libraries have been developed. (See, e.g., Sheridan, R. P., and Kearsley, S. K., Using a genetic algorithm to suggest combinatorial libraries, J. Chem. Inf. Comput. Sci., 35, 310–320 (1995); Weber, L., Wallbaum, S., Broger, C., and Gubernator, K., Optimization of the biological activity of combinatorial compound libraries by a genetic algorithm, Angew. Chem. Int. Ed. Engl., 34, 2280–2282 (1995); Singh, J., Ator, M. A., Jaeger, E. P., Allen, M. P., Whipple, D. A., Soloweij, J. E., Chowdhary, S., and Treasurywala, A. M., Application of genetic algorithms to combinatorial synthesis: a computational approach for lead identification and lead optimization, J. Am. Chem. Soc., 118, 1669–1676 (1996); Agrafiotis, D. K., Stochastic algorithms for maximizing molecular diversity, J. Chem. Inf. Comput. Sci., 37, 841–851 (1997); Brown, R. D., and Martin, Y. C., Designing combinatorial library mixtures using genetic algorithms, J. Med. Chem., 40, 2304–2313 (1997); Murray, C. W., Clark, D. E., Auton, T. R., Firth, M. A., Li, J., Sykes, R. A., Waszkowycz, B., Westhead, D. R. and Young, S. C., PRO_SELECT: combining structure-based drug design and combinatorial chemistry for rapid lead discovery. 1. Technology, J. Comput.-Aided Mol. Des., 11, 193–207 (1997); Agrafiotis, D. K., and Lobanov, V. S., An efficient implementation of distance-based diversity metrics based on k-d trees, J. Chem. Inf. Comput. Sci., 39, 51–58 (1999); Gillet, V. J., Willett, P., Bradshaw, J., and Green, D. V. S., Selecting combinatorial libraries to optimize diversity and physical properties, J. Chem. Inf. Comput. Sci., 39, 169–177 (1999); Stanton, R. V., Mount, J., and Miller, J. L., Combinatorial library design: maximizing model-fitting compounds with matrix synthesis constraints, J. Chem. Inf. Comput. Sci., 40, 701–705 (2000); and Agrafiotis, D. K., and Lobanov, V. S., Ultrafast algorithm for designing focused combinatorial arrays, J. Chem. Inf. Comput. Sci., 40, 1030–1038 (2000), each of which is incorporated by reference herein in its entirety).
However, as will be understood by a person skilled in the relevant art(s), this approach requires explicit enumeration (i.e., virtual synthesis) of the products in the virtual library. This process can be prohibitively expensive when the library contains a large number of products. That is, the analysis cannot be accomplished in a reasonable amount of time using available computing systems. In such cases, the most common solution is to restrict attention to a smaller subset of products from the virtual library, or to consider each variation site independently of all the others. (See, e.g., Martin, E. J., Blaney, J. M., Siani, M. A., Spellmeyer, D. C., Wong, A. K., and Moos, W. H., J. Med. Chem., 38, 1431–1436 (1995); Martin, E. J., Spellmeyer, D. C., Critchlow, R. E. Jr., and Blaney, J. M., Reviews in Computational Chemistry, Vol. 10, Lipkowitz, K. B., and Boyd, D. B., Eds., VCH, Weinheim (1997); and Martin, E., and Wong, A., Sensitivity analysis and other improvements to tailored combinatorial library design, J. Chem. Inf. Comput. Sci., 40, 215–220 (2000), each of which is incorporated by reference herein in its entirety). Unfortunately, the latter approach, which is referred to as reagent-based design, often produces inferior results. (See, e.g., Gillet, V. J., Willett, P., and Bradshaw, J., J. Chem. Inf. Comput. Sci., 37(4), 731–740 (1997); and Jamois, E. A., Hassan, M., and Waldman, M., Evaluation of reagent-based and product-based strategies in the design of combinatorial library subsets, J. Chem. Inf. Comput. Sci., 40, 63–70 (2000), each of which is incorporated by reference herein in its entirety).
Hence there is a need for methods, systems, and computer program products that can be used to analyze large combinatorial chemical libraries, which do not have the limitations discussed above. In particular, there is a need for methods, systems, and computer program products for rapidly generating mapping coordinates for compounds in a combinatorial library that do not require the enumeration of every possible product in the library.