The present invention relates to combinatorial chemistry. More particularly, it relates to virtual combinatorial libraries used in computer aided molecular design.
Among the tools available to a medicinal chemist, combinatorial chemistry is one of the most powerful and best suited for exploring chemical space in search of new drug leads. Combinatorial chemistry provides access to millions of novel compounds from a limited number of building blocks using synthetic procedures that work reliably across a wide range of starting materials.
A virtual combinatorial library is a collection of chemical compounds or products, in electronic form, generated by combining a number of chemical building blocks such as reagents. For example, a polypeptide virtual combinatorial library can be formed by combining a set of chemical building blocks called amino acids, in electronic form, in every possible or nearly every possible way for a given compound length (i.e., the number of amino acids in a polypeptide compound).
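The polypeptide example above can be illustrated with a minimal sketch. It assumes that each building block is represented by the one-letter code of one of the 20 standard amino acids and that a product is simply a string of such codes. The function names are hypothetical and chosen for illustration only; real products would be represented by connection tables rather than strings.

```python
from itertools import product

# One-letter codes for the 20 standard amino acids (illustrative building blocks).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_peptides(length, building_blocks=AMINO_ACIDS):
    """Yield every sequence of the given length, one building block per position."""
    for combo in product(building_blocks, repeat=length):
        yield "".join(combo)


def library_size(length, n_blocks=len(AMINO_ACIDS)):
    """Number of products in the full library: n_blocks raised to the length."""
    return n_blocks ** length


# Full enumeration is only practical for short lengths.
dipeptides = list(enumerate_peptides(2))
print(len(dipeptides), library_size(2))  # both counts are 20 * 20
print(library_size(10))                  # grows exponentially with length
```

The size of the library grows exponentially with compound length, which is why the generator form is used here: products can be produced one at a time without holding the whole library in memory.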
Generally speaking, there are two kinds of virtual combinatorial libraries that can be formed: a viable library and an accessible library. A viable library is relatively small in size. It is assembled from readily available reagents that have been filtered, for example, by a medicinal chemist. A viable library will often have a physical counterpart. An accessible library, on the other hand, is relatively large in size. It can encompass millions or billions of products. An accessible library will typically include all possible reagents that are in principle compatible with a particular chemical reaction scheme. Typically, an accessible library is so large that it can never be physically synthesized in its entirety. Thus, in many cases, appropriate selection techniques must be applied to an accessible library in order to identify a subset of compounds or products for physical synthesis and biological testing. In order to take advantage of robotic hardware, minimize the number of reagents, and simplify the logistical aspects of a chemical experiment, physical libraries are almost invariably synthesized in the form of arrays, which represent the products derived by combining a given subset of reagents in all possible combinations as prescribed by the reaction scheme.
Depending on their use, virtual combinatorial libraries are divided into two main categories: (1) focused or directed libraries, which are biased toward a specific target, structural class, or known pharmacophore; and (2) exploratory or probe libraries, which are target-independent and are designed to span a wide range of physicochemical and structural characteristics. Focused libraries are typically designed to follow up on a known lead, optimize a set of properties, or validate some structure-activity hypothesis. Access to the chemical structures of the products is required in order to assess molecular similarity, predict biological activity, or estimate some other property of interest. In contrast, probe libraries explore chemical space in search of novel hits, and their design is based predominantly on molecular diversity. Although fairly diverse libraries can be built by selecting a diverse set of reagents, there is considerable evidence (see V. J. Gillet et al., The effectiveness of reactant pools for generating structurally-diverse combinatorial libraries, J. Chem. Inf. Comput. Sci., 1997, 37, 731-740; and E. A. Jamois et al., Evaluation of reagent-based and product-based strategies in the design of combinatorial library subsets, J. Chem. Inf. Comput. Sci., 2000, 40, 63-70, each of which is incorporated by reference herein in its entirety), though not conclusive evidence (see A. Linusson et al., Statistical Molecular Design of Building Blocks for Combinatorial Chemistry, J. Med. Chem., 2000, 43, 1320-1328; and E. J. Martin et al., Oriented Substituent Pharmacophore PropErtY Space (OSPPREYS): A substituent-based calculation that describes combinatorial library products better than the corresponding product-based calculation, J. Mol. Graphics Modell., 2000, 18, 383-403, each of which is incorporated by reference herein in its entirety), suggesting that product-based designs are substantially better.
Experience suggests that selections based exclusively on molecular diversity tend to include “extreme” reagents, which can increase cost, cause delays due to limited availability, lead to unforeseen synthetic problems, and produce unusual compounds of limited pharmaceutical interest. The hit rate achieved with such libraries has proven disappointingly low (see A. R. Leach and M. M. Hann, The in silico world of virtual libraries, Drug Discovery Today, 2000, 5, 326-336, which is incorporated by reference herein in its entirety), and the compounds often exhibit unfavorable biological properties that could potentially result in ADME liabilities (see C. A. Lipinski et al., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997, 23, 3-25; and D. N. Rassokhin and D. K. Agrafiotis, Kolmogorov-Smirnov statistic and its application in library design, J. Mol. Graphics Modell., 2000, 18(4-5), 370-384, each of which is incorporated by reference herein in its entirety). Thus, the focus in the design of probe libraries has begun to shift from pure diversity to chemical feasibility, availability of monomers, and drug likeness (see D. N. Rassokhin and D. K. Agrafiotis, Kolmogorov-Smirnov statistic and its application in library design, J. Mol. Graphics Modell., 2000, 18(4-5), 370-384; A. R. Leach and M. M. Hann, The in silico world of virtual libraries, Drug Discovery Today, 2000, 5, 326-336; J. Sadowski and H. Kubinyi, A scoring scheme for distinguishing between drugs and non-drugs. J. Med. Chem., 1998, 41, 3325-3329; Ajay et al., Can we learn to distinguish between “drug-like” and “nondrug-like” molecules?, J. Med. Chem., 1998, 41, 3314-3324; and J. Wang and K. Ramnarayan, Toward designing drug-like libraries: a novel computational approach for prediction of drug feasibility of compounds, J. Comb. Chem., 1999, 1, 524-533, each of which is incorporated by reference herein in its entirety).
Creating designs that combine molecular diversity or similarity with desired property profiles and drug likeness requires the use of optimization techniques such as simulated annealing (see D. K. Agrafiotis, Stochastic algorithms for maximizing molecular diversity, J. Chem. Inf. Comput. Sci., 1997, 37, 841-851; D. K. Agrafiotis, On the use of information theory for assessing molecular diversity. J. Chem. Inf. Comput. Sci., 1997, 37(3), 576-580; D. K. Agrafiotis and V. S. Lobanov, An efficient implementation of distance-based diversity metrics based on k-d trees, J. Chem. Inf. Comput. Sci., 1999, 39(1), 51-58; M. Hassan et al., Optimization and visualization of molecular diversity of combinatorial libraries, J. Comput. Aided. Mol. Des., 1996, 2, 64-74; and A. C. Good and R. A. Lewis, New methodology for profiling combinatorial libraries and screening sets: cleaning up the design process with HARPick, J. Med. Chem., 1997, 40, 3926-3936, each of which is incorporated by reference herein in its entirety) or genetic algorithms (see U.S. Pat. Nos. 5,463,564; 5,574,656; 5,684,711; and 5,901,069 to D. K. Agrafiotis et al.; R. D. Brown and Y. C. Martin, Designing combinatorial library mixtures using a genetic algorithm, J. Med. Chem., 1997, 40, 2304-2313; and V. J. Gillet et al., Selecting combinatorial libraries to optimize diversity and physical properties, J. Chem. Inf. Comput. Sci., 1999, 39, 169-177, each of which is incorporated by reference herein in its entirety) and access to the properties of the individual products. To that end, in silico enumeration or virtual library generation becomes an essential part of the design process.
Despite advances in the processing speed and storage capacity of modern computers, there are many combinatorial libraries that defy enumeration. Enumeration, or product expansion, refers to the translation of a library into a database containing connection tables for the products of the library. For example, it is easy to imagine a combinatorial library containing 10^12 compounds (see R. D. Cramer et al., Virtual compound libraries: a new approach to decision making in molecular discovery research. J. Chem. Inf. Comput. Sci. 1998, 38, 1010-1023, which is incorporated by reference herein in its entirety), which would require over three years to enumerate at a rate of 10,000 structures per second. Since most of the descriptors that are typically employed in diversity profiling, similarity searching and QSAR are calculated at a much slower rate, an exhaustive analysis of such a library would be impossible. Hence, there is a need for virtual library enumeration and analysis techniques that are scalable and that can be applied to massive virtual libraries containing hundreds of millions of compounds.
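The enumeration-time estimate above can be checked with a short calculation, taking the library size to be ten to the twelfth power and the rate to be 10,000 structures per second. The function name and the 365.25-day year are assumptions made for illustration.

```python
# Average length of a year in seconds (assumes 365.25 days per year).
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def enumeration_years(n_products, rate_per_second):
    """Years needed to enumerate n_products at a constant rate."""
    return n_products / rate_per_second / SECONDS_PER_YEAR


# 10**12 products at 10,000 structures per second is 10**8 seconds.
years = enumeration_years(10**12, 10_000)
print(f"{years:.2f} years")
```

The result is a little over three years for structure generation alone. Descriptor calculation, which the text notes is much slower, would lengthen the total further.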
The present invention provides a method, system, and computer program product for encoding and building products of a virtual combinatorial library. As described herein, the invention involves a pre-calculation or encoding stage in which data and computer instructions needed to build products of a virtual combinatorial library are generated, compiled, and stored in a compact data structure for subsequent retrieval. This stage of the invention eliminates any need to fully enumerate the virtual combinatorial library whenever a product is needed. The invention also involves a real-time or building stage, in which the data and computer instructions of the stored data structure are accessed and used, for example, to quickly build or generate product connection tables for selected products of the library on an as-needed basis.
As described herein, during the encoding stage of embodiments of the invention at least one chemical transformation for generating product connection data from reagent connection data and one or more reagent substructure patterns involved in forming the products of the virtual combinatorial library are encoded in a computer readable form (e.g., a scripting language). A compiler operates on the encoded information and generates reagent mapping data. The reagent mapping data is generated from the one or more reagent substructure patterns and reagent connection data for a set of reagents from which the products of the virtual combinatorial library are formed. In an embodiment, the reagent mapping data encodes how an atom or group of atoms of the one or more reagent substructure patterns is mapped to an atom or group of atoms of a reagent molecule. The compiler compiles the encoded at least one chemical transformation to generate computer instructions that can control the operation of a processor. A library object containing the compiled computer instructions, the generated reagent mapping data, and the reagent connection data for the set of reagents is then generated and stored in a memory.
During the building stage of embodiments of the invention, a builder is used to generate product connection data for the products of the virtual combinatorial library. The builder uses the compiled computer instructions stored as a part of the library object. The builder operates on reagent mapping data and reagent connection data retrieved from the library object. In an embodiment, the reagent mapping data is stored as a plurality of reaction maps, and the reagent connection data for the set of reagents is stored as a plurality of reagent connection tables. In an embodiment, the output of the builder is a product connection table for each product built.
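The relationship between the library object and the builder can be pictured with a toy sketch. Everything in it is an assumption made for illustration and is not the format described in this specification: the class names, the two opcodes ("delete_atom" and "add_bond"), the heavy-atom-only connection tables, and the use of string labels in the reaction maps are all hypothetical. The example forms an amide from acetic acid and methylamine.

```python
from dataclasses import dataclass


@dataclass
class ConnectionTable:
    atoms: list  # element symbols, heavy atoms only
    bonds: list  # (atom_i, atom_j, bond_order) tuples


@dataclass
class LibraryObject:
    instructions: list   # "compiled" transformation: (opcode, *args) tuples
    reagents: list       # per site: list of ConnectionTable
    reaction_maps: list  # per site: list of {pattern label: reagent atom index}


def build_product(lib, reagent_indices):
    """Build one product connection table from one reagent per site."""
    atoms, bonds, labels = [], [], {}
    for site, idx in enumerate(reagent_indices):
        table = lib.reagents[site][idx]
        offset = len(atoms)  # renumber this reagent's atoms in the product
        atoms.extend(table.atoms)
        bonds.extend((i + offset, j + offset, o) for i, j, o in table.bonds)
        for label, atom in lib.reaction_maps[site][idx].items():
            labels[label] = atom + offset  # labels assumed unique across sites
    deleted = set()
    for op, *args in lib.instructions:
        if op == "delete_atom":
            deleted.add(labels[args[0]])
        elif op == "add_bond":
            bonds.append((labels[args[0]], labels[args[1]], args[2]))
        else:
            raise ValueError(f"unknown opcode: {op}")
    # Drop deleted atoms, renumber the rest, and discard bonds to deleted atoms.
    keep = [i for i in range(len(atoms)) if i not in deleted]
    renum = {old: new for new, old in enumerate(keep)}
    return ConnectionTable(
        [atoms[i] for i in keep],
        [(renum[i], renum[j], o) for i, j, o in bonds
         if i in renum and j in renum],
    )


# Hypothetical two-site library: one acid and one amine.
acetic_acid = ConnectionTable(["C", "C", "O", "O"],
                              [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
methylamine = ConnectionTable(["C", "N"], [(0, 1, 1)])
library = LibraryObject(
    instructions=[("delete_atom", "leaving_O"),
                  ("add_bond", "acyl_C", "amine_N", 1)],
    reagents=[[acetic_acid], [methylamine]],
    reaction_maps=[[{"acyl_C": 1, "leaving_O": 3}], [{"amine_N": 1}]],
)
amide = build_product(library, (0, 0))
print(amide.atoms, amide.bonds)
```

In this sketch the substructure matching has already been done when the reaction maps were created, so building a product only requires copying reagent atoms and applying a short, fixed list of edits. That separation is the point of the illustration; the actual instruction set and storage layout may differ.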
In an embodiment, data needed to build a particular product is retrieved using a product identification number. In another embodiment, data needed to build a particular product is retrieved using an identification number associated with one or more reagents used to form the particular product.
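One way to relate a product identification number to reagent identification numbers is a mixed-radix encoding, sketched below. The specification does not state the numbering scheme, so this is an assumption: products are numbered from zero, the last variation site varies fastest, and the function names and the site_sizes parameter are hypothetical.

```python
def product_id_to_reagents(product_id, site_sizes):
    """Decode a product number into one reagent index per variation site."""
    total = 1
    for n in site_sizes:
        total *= n
    if not 0 <= product_id < total:
        raise IndexError(f"product_id must be in [0, {total})")
    indices = []
    # Peel off the fastest-varying (last) site first.
    for n in reversed(site_sizes):
        product_id, idx = divmod(product_id, n)
        indices.append(idx)
    return tuple(reversed(indices))


def reagents_to_product_id(indices, site_sizes):
    """Encode one reagent index per site back into a product number."""
    product_id = 0
    for idx, n in zip(indices, site_sizes):
        if not 0 <= idx < n:
            raise IndexError(f"reagent index {idx} out of range for site of size {n}")
        product_id = product_id * n + idx
    return product_id


# A hypothetical three-component library with 3, 4, and 5 reagents per site.
sizes = (3, 4, 5)
print(product_id_to_reagents(27, sizes))
print(reagents_to_product_id((1, 1, 2), sizes))
```

With such a scheme no product table needs to be stored: any product in a library of any size can be located from its number with a few integer divisions, and the resulting reagent indices select the reagent data needed to build it.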
Further embodiments, features, and advantages of the present invention, as well as the structure and operation of the various embodiments of the present invention, are described in detail below with reference to the accompanying figures.