This invention relates to a system and method useful for the generation and representation of chemical libraries and, more particularly, to a computer-implemented system and method useful for the generation and representation of combinatorial chemistry libraries.
Combinatorial chemistry allows scientists to generate large numbers of unique molecules with a small number of chemical reactions. Rather than using the traditional approach of synthesizing novel compounds one at a time, compounds are synthesized by performing chemical reactions in stages, and reacting all of the molecules formed in stage n-1 with each reactant in stage n. An example of this process is shown in FIG. 1. While, for purposes of this example, it is assumed that R1-R9 of FIG. 1 represent single reactants which are used to perform single reactions, those skilled in the art will appreciate that any or all of R1-Rn can represent multiple reactions with which different types of chemistry or chemical sequences can be performed.
In stage 1 of the example of FIG. 1, molecules A and B are reacted with reactant R1. Similarly, molecules C and D are reacted with reactant R2, and molecules E and F are reacted with reactant R3 (although only one of each type of molecule is shown in FIG. 1, many of each type are used in the first stage and, consequently, many of each type are formed in subsequent stages). Molecules A-F are the "starting molecules," and the molecules formed after each stage are represented in FIG. 1 by the starting molecule followed by the sequence of reactants separated by colons.
In stage 2, all of the molecules formed in stage 1 are reacted with reactants R4, R5 and R6, and in stage 3, all of the molecules formed in stage 2 are reacted with reactants R7, R8 and R9. As is shown in FIG. 1, this process generates 54 diverse molecules after stage 3, having started with only six molecules and having performed only nine reactions. The diverse library of molecules thus formed may be used to screen for biological activity against a therapeutic target or for any other desirable property.
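By way of illustration only, the staged process of FIG. 1 can be sketched as a Cartesian product over the reactants of the later stages. The function name `enumerate_library` and the exact colon-separated string format are assumptions made for this sketch; the pairing of starting molecules with stage 1 reactants follows the description above.

```python
from itertools import product

# Stage 1 pairs each starting molecule with the reactant it meets (per FIG. 1).
stage1 = {"R1": ["A", "B"], "R2": ["C", "D"], "R3": ["E", "F"]}
stage2 = ["R4", "R5", "R6"]
stage3 = ["R7", "R8", "R9"]


def enumerate_library(first_stage, later_stages):
    """Return reaction-history strings such as 'A:R1:R4:R7' (hypothetical format)."""
    firsts = [f"{mol}:{r}" for r, mols in first_stage.items() for mol in mols]
    # Every molecule formed in stage n-1 is reacted with each reactant of stage n.
    return [":".join((first, *rest))
            for first in firsts
            for rest in product(*later_stages)]


library = enumerate_library(stage1, [stage2, stage3])
print(len(library))  # 54
print(library[0])    # A:R1:R4:R7
```

Six starting molecules and nine reactions thus yield the 54 reaction histories of the example.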
A general formula for the maximum number of unique molecules which can be formed using a combinatorial process is ##EQU1## where N is the number of stages, R is the number of reactants at stage j, K is the total number of reactants in the first stage, and m is the number of molecules reacted with reactant n. This formula represents the maximum number of unique molecules formed because it is possible for different reaction steps to generate the same compounds.
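The equation itself is not reproduced above. One reading consistent with the stated definitions and with the 54 molecules of FIG. 1 is that the maximum equals the sum of m over the K first-stage reactants, multiplied by the product of R over stages 2 through N. The following sketch computes the count under that assumed reading; the function and parameter names are hypothetical.

```python
from math import prod


def max_unique_molecules(m, later_stage_reactants):
    """Upper bound on library size under the assumed reading of the formula.

    m: molecules reacted with each of the K first-stage reactants.
    later_stage_reactants: number of reactants R at each of stages 2..N.
    """
    return sum(m) * prod(later_stage_reactants)


# FIG. 1: K = 3 first-stage reactants with two molecules each,
# then three reactants in each of stages 2 and 3.
print(max_unique_molecules([2, 2, 2], [3, 3]))  # 54
```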
The following references are related to combinatorial chemistry, and are hereby incorporated by reference in their entirety: PCT International Application Number WO 94/08051, filed Oct. 1, 1993; "Combinatorial Approaches Provide Fresh Leads for Medicinal Chemistry," Chemical & Engineering News, Vol. 72, Feb. 7, 1994, pp. 20-26; "A Paradigm for Drug Discovery Employing Encoded Combinatorial Libraries," Proc. Natl. Acad. Sci. USA, Vol. 92, pp. 6027-6031, June 1995; "Synthesis of a Small Molecule Combinatorial Library Encoded with Molecular Tags," Journal of the American Chemical Society, Vol. 117, No. 20, pp. 5588-5589, 1995; "A General Method for Molecular Tagging of Encoded Combinatorial Chemistry Libraries," The Journal of Organic Chemistry, Vol. 59, No. 17, pp. 4723-4724, 1994; "Synthetic Receptor Binding Elucidated with an Encoded Combinatorial Library," Journal of the American Chemical Society, Vol. 116, No. 1, pp. 373-374, 1994; "Complex Synthetic Chemical Libraries Indexed with Molecular Tags," Proc. Natl. Acad. Sci. USA, Vol. 90, pp. 10922-10926, December 1993; "The Promise of Combinatorial Chemistry", Windhover's In Vivo The Business & Medicine Report, Vol. 12, No. 5, May, 1994, pp. 23-31.
When a compound generated using combinatorial chemistry is found to have a desirable property, it is important to be able to determine either the structure of the compound or the manner in which it was synthesized so that it can be made in large quantities. Until recently, combinatorial chemistry was practical only for generating peptides and other large oligomeric molecules because direct structure elucidation for most compounds is problematic, and such large molecules (made of repeating subunits) offered the advantage of being amenable to sequencing to determine their structure. In contrast, only very small libraries of small (i.e., nonoligomeric) molecules could be generated: because such small molecules cannot be sequenced, the size of the library had to be kept small enough to allow a scientist to keep track of every compound made.
Combinatorially generated peptide libraries proved to be of limited value. Peptides are poor therapeutic agents, in part because of their lack of stability in vivo. Drug companies preferred libraries of small organic molecules which, unlike most large molecules such as peptides, can frequently act when taken orally.
A need therefore existed for a scheme by which the reaction history of small molecules generated using combinatorial chemistry could be tracked. A method was developed for "tagging" the generated compounds with an identifier for each reaction step in their synthesis. The process is called the "cosynthesis" method because, as a compound is synthesized, a tag which encodes the series of steps and reagents used in the synthesis of the library element is also synthesized, linked by means of a chemical bond to the compound (or to the solid support, e.g., a bead, upon which the compound is being synthesized). When a library compound is found to have a desirable property, the tag is sequenced to determine the series of reaction steps which formed the compound. Because the tags must be sequenced, large molecule tags such as oligonucleotides and oligopeptides have been used.
The cosynthesis method has many inherent problems. For example, the tagging structures themselves are necessarily chemically labile and unstable and as such are incompatible with many of the reagents commonly used in small molecule combinatorial chemistry. Additionally, multiple protecting groups are required and the cosynthesis of a tag may reduce the yield of the library compounds. For these reasons, the cosynthesis method has not made small molecule combinatorial chemistry a commercially viable technology.
The assignee of the instant invention has developed a proprietary, pioneering technology which makes small molecule combinatorial chemistry commercially feasible. This technology is fully described in PCT published application number WO 94/08051 and employs binary coding of the synthesized compounds such that only the presence or absence of tags, and not their sequence, defines the compound's reaction history. The operation of the assignee's binary coding system is depicted in FIGS. 2A-2C.
FIG. 2A shows a three-stage combinatorial synthesis with three reactants in each stage. While, as is known, two binary digits can uniquely identify four reactants, in a preferred embodiment, the binary digits 00 are not used to identify a reactant. Consequently, as shown in FIG. 2B, the reaction history of any compound formed in the combinatorial synthesis of FIG. 2A can be represented with a six-digit binary code. The two least significant digits represent the reactant employed in stage 1, the next two digits the reactant employed in stage 2, and the two most significant digits the reactant employed in stage 3. The two digit binary code for each reactant in each stage is shown below the reactant in FIG. 2A, with underlining representing bits contributed by other stages.
As shown in FIG. 2C, then, compound A, which was synthesized with reactants R3, R5 and R9, can be represented with the binary code 111011. Similarly, compound B, which was synthesized with reactants R1, R6 and R8, can be represented with the binary code 101101 and compound C, which was synthesized with reactants R2, R4 and R7, can be represented with the binary code 010110.
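The coding of FIGS. 2A-2C can be sketched as simple bit packing. The sketch assumes, consistently with the three examples above, that the first, second and third reactants of each stage carry the two-digit codes 01, 10 and 11 respectively; the names `STAGES`, `encode` and `decode` are hypothetical.

```python
# Reactants per stage, in the order assumed to map to codes 01, 10, 11.
STAGES = [["R1", "R2", "R3"], ["R4", "R5", "R6"], ["R7", "R8", "R9"]]


def encode(history):
    """Return the binary code string for one reactant per stage (stage 1 first)."""
    code = 0
    for stage_index, reactant in enumerate(history):
        position = STAGES[stage_index].index(reactant) + 1  # 1..3; 00 is never used
        code |= position << (2 * stage_index)  # stage 1 fills the two lowest bits
    return format(code, f"0{2 * len(STAGES)}b")


def decode(bits):
    """Recover the reactant used at each stage from a binary code string."""
    code = int(bits, 2)
    history = []
    for stage_index, reactants in enumerate(STAGES):
        position = (code >> (2 * stage_index)) & 0b11
        if position == 0 or position > len(reactants):
            raise ValueError(f"no reactant recorded for stage {stage_index + 1}")
        history.append(reactants[position - 1])
    return history


print(encode(["R3", "R5", "R9"]))  # 111011 (compound A)
print(encode(["R1", "R6", "R8"]))  # 101101 (compound B)
print(encode(["R2", "R4", "R7"]))  # 010110 (compound C)
print(decode("111011"))            # ['R3', 'R5', 'R9']
```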
Pursuant to the assignee's proprietary tagging technology, each of the bits of the binary code which defines a compound's reaction history is represented by a tagging molecule. These tagging molecules are bound to the solid support as the synthesis progresses such that the presence of a tag indicates that the value of the bit it represents is "1", while the absence of a tag indicates that the value of the bit it represents is "0". As illustrated in FIG. 2B, tag T1 represents the least significant bit, with successive tags assigned to successive bits such that tag T6 represents the most significant bit. As shown in FIG. 2C, then, tags T1, T2, T4, T5 and T6 will be bound to the solid support on which compound A was synthesized, tags T1, T3, T4 and T6 will be bound to the solid support on which compound B was synthesized, and tags T2, T3 and T5 will be bound to the solid support on which compound C was synthesized. While, in a preferred embodiment, the assignee's binary coding technique employs tagging molecules which are bound to the solid support, those skilled in the relevant art will appreciate that the assignee's binary coding technique is not limited to this implementation, and that binary coding can be implemented with any tagging technique including but not limited to radio tagging. Radio tagging is described in "Radio Tags Speed Compound Synthesis," SCIENCE, Vol. 270, p. 577, October 1995, which is hereby incorporated by reference in its entirety.
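The correspondence between code bits and tags can be sketched as follows. This is an illustration only; the helper names `tags_for` and `bits_from` are hypothetical, and the sketch models nothing of the underlying chemistry, only the presence or absence of each tag.

```python
def tags_for(bits):
    """List the tags present for a binary code string; T1 is the least significant bit."""
    return [f"T{i}" for i, bit in enumerate(reversed(bits), start=1) if bit == "1"]


def bits_from(tags, width=6):
    """Rebuild the binary code string from the set of tags detected on a support."""
    present = {int(tag[1:]) for tag in tags}
    # Emit the most significant bit (highest-numbered tag) first.
    return "".join("1" if i in present else "0" for i in range(width, 0, -1))


print(tags_for("111011"))             # ['T1', 'T2', 'T4', 'T5', 'T6'] (compound A)
print(tags_for("101101"))             # ['T1', 'T3', 'T4', 'T6'] (compound B)
print(bits_from(["T2", "T3", "T5"]))  # 010110 (compound C)
```

Because only the presence or absence of each tag matters, the order in which the tags are detected does not affect the recovered code.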
This binary tagging technique overcomes the above referenced disadvantages of the cosynthesis method, making small molecule combinatorial chemistry feasible. As alluded to in the above referenced article titled "The Promise of Combinatorial Chemistry", however, this chemical advance has given rise to a new engineering problem, namely, how to concisely represent the contents of small molecule combinatorial libraries, each potentially containing hundreds of thousands of unique chemical compounds, and how to plan their generation such that the probability of generating compounds with useful characteristics is increased. Existing systems, such as those developed by Tripos, Inc. ("Tripos"), MDL Information Systems, Inc. ("MDL") and Daylight Chemical Information Systems, Inc. ("Daylight") are either infeasible or impractical for use with small molecule combinatorial libraries because the representation schemes implemented by these systems do not allow for concise representations of all types of small molecule combinatorial libraries, for tracking of those libraries which are binary coded or for correct enumerations of those libraries generated on solid support.
In the MDL system, the operation of which is described with reference to FIGS. 3A-3D, a combinatorial library is represented with:
1) one core chemical structure having attachment points for chemical moieties which are added or attached to the core structure at each stage of the combinatorial synthesis; and
2) lists of the moieties which can be added to the core structure at each stage ("additions").
An example of an MDL representation of a combinatorial library is shown in FIG. 3A. The core structure is shown with attachment points R1-R4, representing points of attachment for moieties which can be added to the core structure in stages 1-4 of the combinatorial synthesis respectively. Also shown are four lists of structures, each list representing the compounds which can be added to the core structure in one of the four stages. The point of attachment of each compound added to the core structure is indicated with a dot. The contents of the combinatorial library can be enumerated by identifying all permutations of the compounds of the four lists as attached to the core structure. As used in this specification, the term "enumeration" will mean the process of generating representations of the entire structure of each of the compounds in the library based on the concise representation employed by a system.
The MDL system has many limitations which render it infeasible for use with small molecule combinatorial chemistry. For example, each addition can have at most two attachment points. While suitable for peptide chemistry, two attachment points per addition are insufficient to represent the structures contained in many small molecule combinatorial libraries. For example, the MDL system would be unable to represent a core structure such as that shown in FIG. 3B, since the moiety R2 has three points of attachment.
Another limitation of the MDL system which makes it infeasible for use with small molecule combinatorial chemistry is that all the possible additions at each reaction stage must attach at the same point or points on the core structure. It is possible in small molecule combinatorial synthesis to have different additions at a given reaction stage which attach at different points on the structures generated in previous stages.
Furthermore, the MDL system cannot represent additions which attach only on a subset of the structures formed in previous reaction stages. An example of such a library is shown in FIG. 3C, where a core structure of a combinatorial library can be seen along with a subset of the additions possible from stages 1 and 2. Since the first addition from stage 2 attaches only if the second addition from stage 1 attaches, the MDL system could not represent the library, since all additions from all stages must attach directly to the core structure in the MDL system. Since substituents from a given stage may or may not attach depending on the identities of the substituents attached during previous stages, this limitation also renders the MDL system unsuitable for use with small molecule combinatorial chemistry.
Finally, there are many ways in which small molecule combinatorial libraries can be generated for which a single core structure cannot be defined and which, consequently, cannot be represented by the MDL system. For example, if the possible additions at each of the first three stages are as shown in FIG. 3D, where the black boxes are used to represent chemical structures and the numbered bonds represent points of attachment, the MDL system could not represent the library. No single core structure can be defined to which all the possible additions from all the stages attach. Rather, in the example of FIG. 3D, two core structures are required, depending on whether the first or the second addition from R2 attaches to the addition from R1.
The Tripos system is similar in many respects to the MDL system, and is similarly incapable of concisely representing all types of small molecule combinatorial libraries. For example, like the MDL system, the Tripos system requires that a common core structure be defined, and like the MDL system, it cannot handle combinatorial chemistry where additions attach only to a subset of the structures formed in previous stages. Although in some respects the Tripos system is more flexible than the MDL system (e.g. all additions from all stages need not link directly to the core), it is in many respects even more limiting than the MDL system. For example, the core structure of the Tripos system must be well defined in that it cannot be made up of all variables. This severely limits the types of chemistry with which it can be used because, as discussed above, defining a core structure can be problematic. In short, for many of the same reasons discussed with respect to the MDL system, it is infeasible to use the Tripos system as a tool for representing the contents of small molecule combinatorial chemistry libraries.
The Daylight system, while purportedly designed to represent combinatorial libraries, does not solve the problem of concisely representing small molecule libraries in such a way that a chemist, from the concise representation alone, can understand the makeup of the library. In order to represent the contents of these libraries concisely, Daylight employs several levels of indirection, meaning the "concise" representation is actually useful only as an index into a database from which the contents of the library can ultimately be discerned. The operation of the Daylight system is illustrated in FIGS. 4A-4E.
In the Daylight system, monomers are assigned arbitrary names by the user, and are represented and stored in the system using a linear representation of the atoms of the monomer as demonstrated in FIG. 4A. The atoms are listed in the linear representation in the order in which they bind, with branches indicated in parentheses. Atoms to which other monomers bind, or which serve as the point of a ring closure on the monomer, are labeled with numbers appearing to the immediate right of the atom. When the final polymerized structure is enumerated, paired number labels internal to the monomer definition will be bound together first, after which like labels will be bound together from left to right in the order in which they appear. Because each monomer is independently defined, and it is impossible to know a priori the location in the polymerized structure at which any atom will bind, each monomer contains number labels from 1-N, and the representation scheme for the polymerized structure provides for substitution of these labels with labels indicating the attachment points in the polymerized structure. FIG. 4A shows the chemical structure for three monomers, the manner in which they are represented with the Daylight linear representation scheme, and arbitrary names which can be assigned to the monomers by the user. The sulfur atom in the "Cys" monomer has a label of 1. Since no other atom in Cys is labeled with a 1, which would indicate points for a ring closure, the sulfur atom can serve as a point of attachment to other monomers.
FIG. 4B demonstrates how a polymerized molecule can be represented in the Daylight system using linked monomer names. Also shown in FIG. 4B is the Daylight linear representation of these linked monomers, and the actual chemical structure of the polymerized molecule represented. Note that while the sulfur of every Cys monomer is not required to serve as a point of attachment, the two that do are bound at the sulfurs with like labels of "1".
FIGS. 4C and 4D present a more detailed example of the way monomers are represented in the Daylight system. FIG. 4C illustrates the way five monomers are represented by Daylight's linear representation scheme and arbitrary user-assigned names for each. The first monomer of FIG. 4C shows both ring binding indicators, which occur in pairs (the numbers 7, 8 and 9 in the first structure of FIG. 4C, indicated in grey), and unpaired numbers, which indicate points of attachment with other monomers (the numbers 1-4 in the first structure of FIG. 4C, indicated in underlining). As can be seen with respect to this monomer, which has been named "Pam" by the user, a lower case "c" represents a carbon of an aromatic ring and an upper case "C" represents a carbon of a non-aromatic ring. Single atom monomers do not require labels to indicate that they can serve as points of attachment, since they have only one possible point of attachment.
FIG. 4D demonstrates how the contents of a combinatorial library can be represented as a string of monomer names separated by periods (additional monomers, besides those defined in FIG. 4C, are used for purposes of the example). Monomer names followed by numbers in a polymerized compound or library definition indicate, by their order, the numbers which should be substituted for the numbers in the monomer definition. Thus, "Pam2768" indicates that the number in the first position following the monomer name, "2", should be substituted for the number "1" in the monomer Pam. Similarly, the number in the second position following the monomer name, "7", should be substituted for the number "2" in the monomer Pam, and so on. The identities of possible additions "2", "7", "6" and "8" are listed in brackets before each of the respective numbers. FIG. 4E shows a Daylight representation of a partial enumeration of the library of FIG. 4D, as well as the chemical structure it represents.
As is intuitively clear, a scientist could not look at a library representation such as
Pam2768.[Brx;Clx;Fx;Hx;Ix]2.[Clx;Fx;Hx;Nitrox]7.[Eohx;Etx;Hx;Mcpx;Mex;Mohx;Phex;Ppox;Tfex;Tppx]6.[Carx;Hx;Ohx]8
and obtain a conceptual understanding of the contents of the library. To obtain such an understanding of the library, the scientist would have to:
1) Index the database to determine the linear representations for each of the linked arbitrary monomer names in the library representation;
2) Draw out the chemical structure of the monomers represented by each linear representation; and
3) Substitute the number labels in the library definition for the number labels in the monomer definition.
The Daylight system also has some of the same deficiencies as do the MDL and Tripos systems. For example, it is incapable of representing substituents which attach only on a subset of the structures formed in previous reaction stages. When substituents do not attach, the Daylight system will nevertheless show the unattached substituents as distinct members of the enumerated library, which is particularly unsuitable for use with the assignee's binary coding tagging technology as used with solid phase synthesis, wherein all unattached molecules are washed away and do not become part of the library. Additionally, neither the Daylight system nor the MDL or Tripos systems provide facilities for keeping track of small molecule combinatorial libraries generated with binary coding.
Planning the design of a library with molecules having desired characteristics has also proven to be difficult because there exists no computationally feasible deterministic method for selecting starting molecules and reactants with which a diverse small molecule combinatorial library having such characteristics will be created. The manner in which synthons (starting molecules and reactants collectively may be referred to as "synthons") are generally selected, and the limitations therewith, are detailed in U.S. Pat. No. 5,463,564 to Agrafiotis et al., issued on Oct. 31, 1995 (the "'564 patent"), which is hereby incorporated by reference in its entirety. The solution described in the '564 patent involves iteratively:
1) robotically synthesizing "directed diversity" chemical libraries;
2) analyzing the compounds created in step (1);
3) storing structure-activity data for the compounds created;
4) comparing the structure-activity data for the compounds created with those desired for the library;
5) assigning rating factors to the synthons based on how close the generated library is to the desired library;
6) analyzing the structure-activity data to select which synthons will produce libraries with properties closer to the desired library; and
7) generating computer instructions such that the next iteration will utilize the synthons selected in step (6).
There are many drawbacks with the system described in the '564 patent, including but not limited to the following:
1) chemical libraries must be generated repeatedly, a process which may be impractical based on the limited availability of the necessary compounds and reactants. To the extent it is feasible, the process will be very expensive, particularly in light of the fact that many synthons which ultimately may turn out to be superfluous will be required;
2) it is not especially useful with combinatorial chemistry, but rather implements what is explicitly described as a different process altogether, namely "directed diversity" chemistry (Col 5 lines 1-22);
3) it does not describe a solution useful with small molecule chemistry, but rather states that "[t]o date, most work with combinatorial chemical libraries has been limited only to peptides and oligonucleotides . . . " (Col 2 lines 32-34) and that "[t]he peptide synthesis technology is preferred in producing the directed diversity libraries associated with the present invention"; and
4) while a computer is described for evaluating the characteristics of the library generated vis-a-vis the desired library, no scheme is contemplated or described for graphically representing the contents of the generated library such that a scientist could quickly understand exactly what compounds were produced.
It is also highly questionable whether a system such as that described in the '564 patent could actually be built.
The following articles, each of which is hereby incorporated by reference in its entirety, also describe methods for selecting starting molecules and/or criteria used in their selection: "Measuring Diversity: Experimental Design of Combinatorial Libraries for Drug Discovery," Journal of Medicinal Chemistry, Vol. 38, No. 9, pp. 1431-1436, 1995 by Martin et al.; "A Nonlinear Map of Substituent Constants for Selecting Test Series and Deriving Structure-Activity Relationships. 1. Aromatic Series," Journal of Medicinal Chemistry, Vol. 37, No. 7, pp. 973-987, 1994; "Hydrogen Bonding. 32. An Analysis of Water-Octanol and Water-Alkane Partitioning and the Δlog P Parameter of Seiler," Journal of Pharmaceutical Sciences, Vol. 83, No. 8, pp. 1085-1100, 1994. However, none of these references describes an automated or semi-automated system for use with combinatorial chemistry or, for that matter, the application of evaluation criteria to a proposed combinatorial library.
There is, therefore, a need for a system and method which provides a concise and accurate representation of the contents of actual or planned small molecule combinatorial libraries created with solid phase synthesis. Additionally, there is a need for a system and method useful for planning the development of small molecule combinatorial libraries. Finally, there is a need for a system and method which combines these two capabilities such that synthons can be automatically and intelligently selected, and the library which would be combinatorially created therewith evaluated such that the results of the evaluation add to the intelligence with which synthons will be selected in the future.