The present invention relates to combinatorial chemistry and computer aided molecular design. The present invention also relates to pattern analysis, information representation, information cartography and data mining. In particular, the present invention relates to predicting measurable or computed properties of products in a combinatorial chemical library based on features of their corresponding reagents.
Algorithmic efficiency has been a long-standing objective in computational drug design. There is perhaps no other problem in chemistry where the need for efficiency is as pressing as in combinatorial chemistry. As will be understood by a person skilled in the relevant art, a significant bottleneck in the virtual screening of a large combinatorial chemical library is the explicit enumeration of products and the calculation of their pertinent properties.
Whether it is based on molecular diversity, molecular similarity, structure-activity correlation, or structure-based design, the design of a combinatorial experiment typically involves the enumeration of every possible product in a virtual library, and the computation of key molecular properties that are thought to be pertinent to the application at hand. (See, e.g., Agrafiotis, D. K., The diversity of chemical libraries, The Encyclopedia of Computational Chemistry, Schleyer, P. v. R., Allinger, N. L., Clark, T., Gasteiger, J., Kollman, P. A., Schaefer III, H. F., and Schreiner, P. R., Eds., John Wiley and Sons, Chichester, 742-761 (1998); and Agrafiotis, D. K., Myslik, J. C., and Salemme, F. R., Advances in diversity profiling and combinatorial series design, Mol. Diversity, 4(1), 1-22 (1999), each of which is incorporated by reference herein in its entirety).
Several product-based methodologies for screening virtual libraries have been developed. (See, e.g., Sheridan, R. P., and Kearsley, S. K., Using a genetic algorithm to suggest combinatorial libraries, J. Chem. Inf. Comput. Sci., 35, 310-320 (1995); Weber, L., Wallbaum, S., Broger, C., and Gubernator, K., Optimization of the biological activity of combinatorial compound libraries by a genetic algorithm, Angew. Chem. Int. Ed. Engl., 34, 2280-2282 (1995); Singh, J., Ator, M. A., Jaeger, E. P., Allen, M. P., Whipple, D. A., Soloweij, J. E., Chowdhary, S., and Treasurywala, A. M., Application of genetic algorithms to combinatorial synthesis: a computational approach for lead identification and lead optimization, J. Am. Chem. Soc., 118, 1669-1676 (1996); Agrafiotis, D. K., Stochastic algorithms for maximizing molecular diversity, J. Chem. Inf. Comput. Sci., 37, 841-851 (1997); Brown, R. D., and Martin, Y. C., Designing combinatorial library mixtures using genetic algorithms, J. Med. Chem., 40, 2304-2313 (1997); Murray, C. W., Clark, D. E., Auton, T. R., Firth, M. A., Li, J., Sykes, R. A., Waszkowycz, B., Westhead, D. R., and Young, S. C., PRO_SELECT: combining structure-based drug design and combinatorial chemistry for rapid lead discovery. 1. Technology, J. Comput.-Aided Mol. Des., 11, 193-207 (1997); Agrafiotis, D. K., and Lobanov, V. S., An efficient implementation of distance-based diversity metrics based on k-d trees, J. Chem. Inf. Comput. Sci., 39, 51-58 (1999); Gillet, V. J., Willett, P., Bradshaw, J., and Green, D. V. S., Selecting combinatorial libraries to optimize diversity and physical properties, J. Chem. Inf. Comput. Sci., 39, 169-177 (1999); Stanton, R. V., Mount, J., and Miller, J. L., Combinatorial library design: maximizing model-fitting compounds with matrix synthesis constraints, J. Chem. Inf. Comput. Sci., 40, 701-705 (2000); and Agrafiotis, D. K., and Lobanov, V. S., Ultrafast algorithm for designing focused combinatorial arrays, J. Chem. Inf. Comput. Sci., 40, 1030-1038 (2000), each of which is incorporated by reference herein in its entirety).
These product-based methodologies become impractical, however, when they are applied to large combinatorial libraries, i.e., libraries that contain a large number of possible products. In such cases, the most common solution is to restrict attention to a smaller subset of products from the virtual library, or to consider each substitution site independently of all the others. (See, e.g., Martin, E. J., Blaney, J. M., Siani, M. A., Spellmeyer, D. C., Wong, A. K., and Moos, W. H., J. Med. Chem., 38, 1431-1436 (1995); Martin, E. J., Spellmeyer, D. C., Critchlow, R. E. Jr., and Blaney, J. M., Reviews in Computational Chemistry, Vol. 10, Lipkowitz, K. B., and Boyd, D. B., Eds., VCH, Weinheim (1997); and Martin, E., and Wong, A., Sensitivity analysis and other improvements to tailored combinatorial library design, J. Chem. Inf. Comput. Sci., 40, 215-220 (2000), each of which is incorporated by reference herein in its entirety). Unfortunately, the latter approach, which is referred to as reagent-based design, often produces inferior results in terms of meeting the primary design objectives. (See, e.g., Gillet, V. J., Willett, P., and Bradshaw, J., J. Chem. Inf. Comput. Sci., 37(4), 731-740 (1997); and Jamois, E. A., Hassan, M., and Waldman, M., Evaluation of reagent-based and product-based strategies in the design of combinatorial library subsets, J. Chem. Inf. Comput. Sci., 40, 63-70 (2000), each of which is incorporated by reference herein in its entirety).
Hence, there is a need for methods, systems, and computer program products that can be used to screen large combinatorial chemical libraries and that do not have the limitations discussed above.
The present invention provides a method, system, and computer program product for determining properties of combinatorial library products from features of library building blocks.
As described herein, at least one feature is determined for each building block of a combinatorial library having a plurality of products. A training subset of products is selected from the plurality of products of the combinatorial library, and at least one property is determined for each product of the training subset. A building block set is identified for each product of the training subset, and an input features vector is formed for each such product from the features of the building blocks in its building block set. A supervised machine learning approach is used to infer a mapping function that transforms the input features vector for each product of the training subset to the corresponding at least one property for that product. After the mapping function is inferred, it is used for determining, estimating, or predicting properties of other products of the library from their corresponding input features vectors. Building block sets are identified for a plurality of additional products of the combinatorial library. Input features vectors are formed for the plurality of additional products. The input features vectors for the plurality of additional products are transformed using the mapping function to obtain at least one estimated property for each of the plurality of additional products.
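The steps above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the two-site library, the reagent names, the single made-up descriptor per reagent, and the additive stand-in for the product property are all assumptions chosen for illustration, and an ordinary least-squares fit stands in for whichever supervised learner is actually used.

```python
import itertools
import random

# Made-up single-descriptor features for the reagents at two substitution sites.
SITE_A = {"A1": [1.0], "A2": [1.4], "A3": [2.1], "A4": [2.6]}
SITE_B = {"B1": [0.6], "B2": [0.9], "B3": [1.7], "B4": [2.2]}

def input_vector(block_set):
    """Concatenate the building-block features of one product, in site order."""
    a, b = block_set
    return SITE_A[a] + SITE_B[b]

def measured_property(block_set):
    """Hypothetical stand-in for an expensive product-level measurement."""
    a, b = block_set
    return SITE_A[a][0] + SITE_B[b][0] - 0.18

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_mapping(xs, ys):
    """Infer a linear mapping function (with bias) via the normal equations."""
    rows = [x + [1.0] for x in xs]
    d = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    w = solve(ata, aty)
    return lambda x: sum(wi * xi for wi, xi in zip(w, x + [1.0]))

# The virtual library: one building block set per product.
library = list(itertools.product(SITE_A, SITE_B))

# Training subset: properties are determined only for these products.
training = random.Random(0).sample(library, 8)
mapping = fit_mapping([input_vector(p) for p in training],
                      [measured_property(p) for p in training])

# Remaining products: properties are estimated from reagent features alone.
estimates = {p: mapping(input_vector(p)) for p in library if p not in training}
```

In this toy case the property is exactly additive in the reagent descriptors, so the linear mapping reproduces it for the unseen products; real product properties generally are not, which is one reason to prefer a more flexible learner.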
In embodiments of the invention, measured values, computed values, or both are used as features for the building blocks of the combinatorial library. Measured values, computed values, or both are also used as properties for the products of the training subset. In embodiments of the invention, at least one of the features of the building blocks is the same as at least one of the properties of the products.
In an embodiment of the invention, the mapping function is implemented using a multilayer perceptron. The multilayer perceptron is trained to implement the mapping function using the input features vector and the corresponding at least one property for each product of the training subset of products.
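As a rough illustration of such training, the following Python sketch fits a small multilayer perceptron to pairs of input features vectors and properties. The network size, tanh activation, learning rate, epoch count, and the synthetic training data are all assumptions made for this example; the class name TinyMLP is hypothetical, and nothing here is taken from the described embodiments.

```python
import math
import random

class TinyMLP:
    """One hidden tanh layer and a linear output, trained by per-sample SGD."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x):
        h = self._hidden(x)
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

    def train_step(self, x, y, lr):
        """One gradient step on the squared error 0.5 * (prediction - y)**2."""
        h = self._hidden(x)
        err = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2 - y
        for j, hj in enumerate(h):
            # Gradient at the hidden pre-activation, using w2 before its update.
            grad_pre = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * grad_pre
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_pre * xi
        self.b2 -= lr * err

def mse(model, data):
    """Mean squared error of the model over (input vector, property) pairs."""
    return sum((model.predict(x) - y) ** 2 for x, y in data) / len(data)

# Synthetic training subset: two reagent descriptors per product and a
# made-up, non-additive property (a * b + 0.5 * a).
levels = [0.1, 0.4, 0.7, 1.0]
training_set = [([a, b], a * b + 0.5 * a) for a in levels for b in levels]

model = TinyMLP(n_in=2, n_hidden=4)
initial_mse = mse(model, training_set)
for _ in range(500):
    for x, y in training_set:
        model.train_step(x, y, lr=0.05)
final_mse = mse(model, training_set)
```

Comparing final_mse with initial_mse shows whether training reduced the error on the training subset. In practice a held-out set of products would also be needed to judge how well the network generalizes to the rest of the library.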
In an embodiment of the invention, the building blocks of the combinatorial library include reagents used to construct the combinatorial library. In other embodiments, the building blocks of the combinatorial library include fragments of the reagents used to construct the combinatorial library. In still other embodiments, the building blocks of the combinatorial library include modified fragments of the reagents used to construct the combinatorial library.
Further embodiments, features, and advantages of the present invention, as well as the structure and operation of the various embodiments of the present invention, are described in detail below with reference to the accompanying figures.