A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention relates to computer-based analysis of data and to computer-based correlation of data features with data responses, in order to determine or predict which features correlate with or are likely to result in one or more responses. The invention is particularly suitable for use in the fields of chemistry, biology and genetics, such as to facilitate computer-based correlation of chemical structures with observed or predicted pharmacophoric activity. More particularly, the invention is useful in facilitating identification and development of potentially beneficial new drugs.
2. Description of Related Art
The global biotech and pharmaceutical industry is a $200 billion/year business. Most of the estimated $13 billion R&D spending in this industry is focused on discovering and developing prescription drugs. Current R&D effort is characterized by low drug discovery rates and long time-to-market.
In an effort to accelerate drug discovery, biotech and pharmaceutical firms are turning to robotics and automation. The old methods of rationally designing molecules using known structural relationships are being supplanted by a shotgun approach of rapidly screening hundreds of thousands of molecules for biological activity. High Throughput Screening (HTS) is being used to test large numbers of molecules for biological activity. The primary goal is to identify hits or leads, which are molecules that affect a particular biological target in the desired manner. For instance and without limitation, a lead may be a chemical structure that binds particularly well to a protein.
Automated HTS systems are large, highly automated liquid handling and detection systems that allow thousands of molecules to be screened for biological activity against a test assay. Several pharmaceutical and biotech companies have developed systems that can perform hundreds of thousands of screens per day.
The increasing use of HTS is being driven by a number of other developments in the industry. The greater the number and diversity of molecules that are run through screens, the more successful HTS is likely to be. This fact has propelled rapid developments in molecule library collection and creation. Combinatorial chemistry systems have been developed that can automatically create hundreds of thousands of new molecules. Combinatorial chemistry is performed in large automated systems that are capable of synthesizing a wide variety of small organic molecules using combinations of "building block" reagents. HTS systems are the only way that the enormous volume of new molecules generated by combinatorial chemistry systems can be tested for biological activity. Another force driving the increased use of HTS is the Human Genome program and the companion field of bioinformatics that is enabling the rapid identification of gene function and accelerating the discovery of therapeutic targets. Companies do not have the resources to develop an exhaustive understanding of each potential therapeutic target. Rather, pharmaceutical and biotech companies use HTS to quickly find molecules that affect the target and may lead to the discovery of a new drug.
High throughput screening does not directly identify a drug. Rather the primary role of HTS is to detect lead molecules and supply directions for their optimization. This limitation exists because many properties critical to the development of a successful drug cannot be assessed by HTS. For example, HTS cannot evaluate the bioavailability, pharmacokinetics, toxicity, or specificity of an active molecule. Thus, further studies of the molecules identified by HTS are required in order to identify a potential lead to a new drug.
The further study, a process called lead discovery, is a time- and resource-intensive task. High throughput screening of a large library of molecules typically identifies thousands of molecules with biological activity that must be evaluated by a pharmaceutical chemist. Those molecules that are selected as candidates for use as a drug are studied to build an understanding of the mechanism by which they interact with the assay. Scientists try to determine which molecular properties correlate with high activity of the molecules in the screening assay. Using the drug leads and this mechanism information, chemists then try to identify, synthesize and test molecules analogous to the leads that have enhanced drug-like effect and/or reduced undesirable characteristics in a process called lead optimization. Ideally, the end result of the screening, lead discovery, and lead optimization is the development of a new drug for clinical testing.
As the number of molecules in the test library and the number of therapeutic target assays exponentially increase, lead discovery and lead optimization have become the new bottleneck in drug discovery using HTS systems. Because of the large number of HTS results that must be analyzed, scientists often seek only first-order results such as the identification of molecules in the library that exhibit high assay activity. In one such method, for instance, all of the molecules in the data set are divided into groups based on common properties of their molecular structures. An analysis is then made to determine which groups contain molecules with high activity levels and which groups contain molecules with low activity levels. Those groups representing high activity levels are then deemed to be useful groups. Commonly, the analysis will stop at this point, leaving chemists to analyze the members of the active groups in search of new or optimized leads.
In another method, a more extensive automated analysis is conducted in an effort to partition the molecules into groups of particular interest and particularly to derive structure-activity relationship rules. For instance, well known recursive partitioning techniques, commonly referred to as classification trees, may be used to iteratively partition a data set (such as results of HTS or other automated chemical synthesis) into active classes. The data set includes molecules and indicia of empirically determined potency (activity-level) per molecule.
According to this method, a set of descriptors is first generated, each indicating a structural feature that can be described as present or absent in a given molecule. For each molecule, a bit string is then built, indicating whether the molecule has each particular descriptor (1-bit) or not (0-bit). These strings are then configured as a matrix, in which each row represents a molecule and each column represents a descriptor. Recursive partitioning is then used to divide the molecules (rows) into exactly two groups according to whether the molecules have a particular "best" descriptor in common. The "best" descriptor is the descriptor that would result in the largest possible difference in average potency between those molecules containing the descriptor and those molecules not containing the descriptor.
The recursive partitioning method then continues iteratively with respect to each subdivided group, dividing each group into two groups based on a next "best" descriptor. The result of this process is a tree structure, in which some terminal nodes may contain a preponderance of inactive molecules (or molecules having relatively low potency) and other terminal nodes may contain a preponderance of active molecules (or molecules having relatively high potency) (the latter being "good terminal nodes"). Tracing the lineage of the structures defined by a good terminal node may then reveal molecular components that cooperatively reflect a high likelihood of potency.
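By way of illustration only, and without limitation, the selection of a "best" descriptor at a single partitioning step might be sketched in Python as follows. The function name best_descriptor, the matrix bits and the potency values are hypothetical and are not taken from any particular prior system; the sketch also assumes that the difference in average potency is compared as an absolute value.

```python
from statistics import mean


def best_descriptor(bits, potency):
    """Return the column index of the descriptor whose presence/absence
    split yields the largest absolute difference in mean potency.

    bits    -- list of rows, one per molecule, each a list of 0/1 flags
    potency -- list of potency values, one per molecule
    Returns None if no descriptor actually splits the molecules.
    """
    best_col, best_gap = None, -1.0
    for col in range(len(bits[0])):
        has = [p for row, p in zip(bits, potency) if row[col] == 1]
        lacks = [p for row, p in zip(bits, potency) if row[col] == 0]
        if not has or not lacks:
            continue  # this descriptor does not divide the group
        gap = abs(mean(has) - mean(lacks))
        if gap > best_gap:
            best_col, best_gap = col, gap
    return best_col


# Hypothetical data: four molecules (rows), three descriptors (columns).
bits = [[1, 0, 1],
        [1, 1, 0],
        [0, 1, 1],
        [0, 0, 1]]
potency = [9.0, 8.0, 2.0, 1.0]
split_column = best_descriptor(bits, potency)
```

Applying the same function again to each of the two resulting groups would produce the tree structure described above.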
Unfortunately, the use of recursive partitioning to partition molecules on the basis of their structural and activity similarity is limiting. For example, with the recursive partitioning analysis, each molecule can fall within only a single terminal node of the tree structure, based on one or more determinations along the way as to whether the molecule includes various descriptors known to confer activity. Consequently, if there may be more than one set of descriptors in a molecule (or set of molecules) that results in observed activity, the method may be unable to identify all of the pertinent descriptor sets.
In view of the foregoing, the inventors have discovered that a need exists for an improved method to screen HTS data.
The present invention is directed to a computer-based system (e.g., method, apparatus and/or machine) for identifying and correlating relationships between features and responses in a data set. In the chemistry field, for instance, the invention provides a computer-based system for generating (learning) structure-to-activity relationship (SAR) information and pharmacophore models for each pharmacophoric mechanism identified in the HTS screen of a diverse (heterogeneous) library. In this context, the term "mechanism" may refer to the different ways for the molecules in the library to interact with a specified target. A mechanism model or pharmacophore can be a multi-dimensional arrangement of physical and structural features that enable a molecule to interact with a target through a specific interaction with the target's active site.
As noted above, existing analysis systems typically involve (i) dividing a set of molecules into subclasses based on structural similarity and then identifying which subclass represents higher potency and is therefore of interest for further study, or (ii) dividing a set of molecules into subclasses based on differences in potency for given structural features. The existing art thus addresses the question of how well a given subclassification distinguishes active molecules from inactive molecules.
In an exemplary embodiment, a computer learns pharmacophoric mechanisms by analyzing a plurality of molecules. More particularly, the computer begins with a set of data representing a plurality of molecules, where the data set preferably indicates for each molecule both a feature characteristic (e.g., a chemical structure and/or other features) and an activity characteristic (e.g., an observed or measured level of activity in one or more assays).
Provided with the input data set, the computer first identifies those molecules that have more than some predefined activity characteristic (level of activity), on an absolute or normalized scale. The computer then employs an agglomerative clustering technique to cluster representations of those molecules based on their structural similarity. The result of this process is a pyramidal data structure, in which each node of the structure represents one or more of the molecules.
As the pyramid is created, or after it is created, the computer preferably identifies, for each node, a feature set common to all of the molecules in the node. This common feature set may be a substructure, for instance. In that case, the computer preferably selects the largest common substructure, which is the structure most likely to explain why the molecules ended up together in the node.
In addition, for each node, the computer preferably identifies a measure of activity that is representative of the activity levels of the molecules in the node. For instance, the activity measure for a given node might be an average of the activity levels of the molecules represented by the node. This activity level may reasonably be correlated with the common substructure identified for the node, supporting a conclusion that the common substructure is, at least relatively speaking, responsible for that observed activity.
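By way of example only, the selection of active molecules and the per-node computations described above might be sketched as follows. The sketch makes simplifying assumptions that are not part of the foregoing description: each molecule is a hypothetical record with a bit vector of features, and the common feature set is taken as the features present in every member (a bitwise AND), which stands in for a search for the largest common substructure.

```python
from statistics import mean


def select_actives(molecules, threshold):
    """Keep only molecules whose activity exceeds the predefined threshold."""
    return [m for m in molecules if m["activity"] > threshold]


def common_features(members):
    """Features (bit positions) present in every molecule of the node."""
    return [int(all(column)) for column in zip(*(m["bits"] for m in members))]


def node_activity(members):
    """Representative activity of a node, here the simple average."""
    return mean(m["activity"] for m in members)


# Hypothetical input data set.
molecules = [
    {"name": "m1", "bits": [1, 1, 0, 1], "activity": 8.0},
    {"name": "m2", "bits": [1, 0, 0, 1], "activity": 6.0},
    {"name": "m3", "bits": [0, 1, 1, 1], "activity": 1.0},
]

actives = select_actives(molecules, threshold=5.0)   # m1 and m2
shared = common_features(actives)                    # [1, 0, 0, 1]
average = node_activity(actives)                     # 7.0
```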
Thus, rather than merely determining how well a particular subgroup distinguishes active molecules from inactive molecules, an exemplary embodiment of the present invention can go further and determine the reason or reasons for the distinction: namely, the substructures responsible for the observed activity.
As it builds the pyramid or when it finishes building the pyramid, the processor may provide as output for viewing by an observer a description of some or all of the pyramid. By way of example, the output may take the form of a graphical depiction of the pyramid, illustrating the common substructures (e.g., chemical formulae) and representative activity levels (e.g., numerical measures, or color coding) that the processor identified per node.
Further, the processor may provide other useful output indicia. For example, the processor may provide an indication of whether the activity measure of a child node in the pyramid is greater than or less than the activity measure of its parent node and/or an indication of the extent of difference in activity. This activity differential may signify to a chemist what bearing the common substructure of the child node is likely to have with respect to the molecules of the parent node. For instance, if a given parent node gives rise to first and second children nodes, and the first child reflects an increase in average activity compared to the parent while the second child reflects a decrease in average activity compared to the parent, then a chemist can reasonably conclude that the common substructure of the first child node is likely to be a better lead (i.e., is more likely to correlate to the observed activity).
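As a minimal sketch of the activity differential just described, and assuming (hypothetically) that each node's representative activity has already been computed, the differential of each child relative to its parent might be obtained as follows. The node names and activity values are illustrative only.

```python
def activity_differentials(parent_activity, child_activities):
    """Map each child node name to (child activity - parent activity).

    A positive value suggests the child's common substructure is
    associated with higher activity than the parent's; a negative value
    suggests the opposite.
    """
    return {name: activity - parent_activity
            for name, activity in child_activities.items()}


# Hypothetical parent node with two children.
diffs = activity_differentials(5.0, {"child_1": 7.5, "child_2": 3.0})
```

In this hypothetical case the first child shows an increase and the second a decrease, which corresponds to the situation in which the first child's substructure would be regarded as the better lead.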
The process of agglomeratively clustering representations of molecules may generally operate as follows. First, as the base (starting level) of the pyramid, a processor forms a number of nodes (data objects, or cluster objects) in memory, each representing a respective single one of the molecules, and thus defining a singleton. Each node can thus be characterized by the structure of the molecule that it contains. (For instance, a node containing a C-N molecule can be characterized by the structure of the C-N molecule).
The processor then compares the nodes and determines which nodes are most similar to each other based on the structures of the molecules that the nodes contain. At this first level in the exemplary embodiment, this comparison is effectively a comparison of the molecules themselves, to determine which molecules are structurally most similar to each other. The processor merges those most similar nodes together into a new node, which can be characterized by the structures of the molecules that it contains. This effectively creates the next level of the pyramid, made up of the merged node and all of the remaining nodes, if any.
At the next level, the processor then repeats the comparison between nodes, merging together the most similar nodes to form another next level of the pyramid, and so forth. Ultimately, two nodes remain and are merged together to form the tip of the pyramid, which, in the exemplary embodiment, will represent the entire collection of molecules being clustered.
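The general process just described might be sketched, purely for illustration, as follows. The sketch assumes bit-vector representations, a Hamming distance between molecules and a complete-link distance between nodes; these choices, the function names and the sample vectors are hypothetical. Where two pairs are equally similar, this conventional version simply takes the first pair it encounters.

```python
from itertools import combinations


def hamming(u, v):
    """Number of bit positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def cluster_distance(c1, c2, vectors):
    """Complete-link distance: the largest member-to-member distance."""
    return max(hamming(vectors[i], vectors[j]) for i in c1 for j in c2)


def agglomerate(vectors):
    """Return the list of levels, from singletons up to a single node."""
    level = [frozenset([name]) for name in sorted(vectors)]
    levels = [level]
    while len(level) > 1:
        # Pick one most-similar pair (first minimum wins on a tie).
        a, b = min(combinations(level, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], vectors))
        level = [n for n in level if n not in (a, b)] + [a | b]
        levels.append(level)
    return levels


# Hypothetical molecules and feature vectors.
vectors = {"m1": (1, 1, 0, 0), "m2": (1, 1, 1, 0), "m3": (0, 0, 1, 1)}
levels = agglomerate(vectors)
```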
A problem arises at any given level of this analysis, however, when the processor encounters a tie in similarity (also referred to as a "tie in proximity") between nodes. If the processor finds that a given node A is just as similar to node B as it is to node C, then (if this is the greatest inter-node similarity at this level) a question would arise as to which nodes the processor should merge together.
Ties in similarity are more likely to occur if the molecular structures are represented by bit vectors, for instance, where each structural element can be either present or absent (1 or 0), than if features are represented by real numbers (e.g., weights). Consider three molecules x, y and z, for instance, and five structural properties a, b, c, d and e. Assume the bit vectors for these molecules are:
        a  b  c  d  e
    x   1  1  0  0  1
    y   1  1  1  1  0
    z   1  0  1  1  1
Molecule x includes all but structures c and d. Molecule y includes all but structure e. And molecule z includes all but structure b. Thus, molecule x differs from molecule y by 3 bits, and molecule x differs from molecule z by 3 bits as well. In this scenario, if every structural property has the same weight, then molecule x is equidistant from molecules y and z.
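The equidistance in this example can be checked with a short sketch. The bit vectors below simply encode the description above (x lacks c and d, y lacks e, z lacks b), and the helper name hamming is hypothetical.

```python
def hamming(u, v):
    """Count the bit positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Columns correspond to structural properties a, b, c, d, e.
x = (1, 1, 0, 0, 1)
y = (1, 1, 1, 1, 0)
z = (1, 0, 1, 1, 1)

d_xy = hamming(x, y)  # differs at c, d and e
d_xz = hamming(x, z)  # differs at b, c and d
```

With equal weights on every property, both distances come to 3, which is the tie described above.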
With a set of diverse compounds, such bit vector representations could give rise to a large number of ties in similarity. But in a more typical case, as the homogeneity of the compound set increases, the likelihood of encountering ambiguous ties when employing bit-vector representations increases even more.
One way to solve this problem is to artificially break the tie. For instance, a rule can be preset to indicate that, in response to a tie in similarity such as that described above, the choice of whether to merge A with B or C should depend on at which level in the pyramid B and C were formed. For example, if B was formed by a merger two levels ago and C was formed by a merger three levels ago, then the rule might dictate that A should be merged with C. Other such rules could be developed as well.
By breaking a tie in similarity, however, the processor will likely discard very useful information, both in terms of the merger that the processor does not select to make and in terms of further mergers that would have evolved from that non-selected merger. For instance, by opting to merge A with B rather than with C, the processor might never develop a common substructure based on a merger of A and C and therefore might never provide such potentially useful information to a chemist. Further, until very high in the pyramid, the processor might then never merge the molecules of A and C together with the molecules of another node, D. Any common substructure that could have been developed from such a subsequent merger might therefore never appear, thus depriving a chemist of possibly useful information.
The present inventors have discovered, however, that a better way to deal with a tie in similarity during the clustering process is to use the tie rather than break the tie. In particular, according to an exemplary embodiment, when the processor determines at a given level of the pyramid that substantially the same greatest similarity exists both between nodes A and B and between nodes A and C, the processor will merge A separately with both B and C, so as to form two merged nodes, A-B and A-C. Consequently, the next level of the pyramid may be made up of these two merged nodes as well as other nodes (if any) from the current level.
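By way of illustration only, one possible sketch of this tie-preserving merger is given below. It assumes Hamming distances on bit vectors and a complete-link distance between nodes; the nodes A, B, C and D, their vectors and all function names are hypothetical and are offered as an example rather than as the only way of carrying out the exemplary embodiment.

```python
from itertools import combinations


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def complete_link(c1, c2, vectors):
    """Largest member-to-member distance between two nodes."""
    return max(hamming(vectors[i], vectors[j]) for i in c1 for j in c2)


def merge_level(nodes, vectors):
    """Form the next level: merge EVERY pair tied at the smallest distance."""
    pairs = list(combinations(nodes, 2))
    dists = [complete_link(a, b, vectors) for a, b in pairs]
    smallest = min(dists)
    tied = [pair for pair, d in zip(pairs, dists) if d == smallest]
    touched = {node for pair in tied for node in pair}
    merged = {a | b for a, b in tied}          # e.g. A-B and A-C
    return merged | {n for n in nodes if n not in touched}


def build_pyramid(vectors):
    """Return all levels, from singletons up to the tip of the pyramid."""
    level = {frozenset([name]) for name in vectors}
    levels = [level]
    while len(level) > 1:
        level = merge_level(level, vectors)
        levels.append(level)
    return levels


# Hypothetical data: A is equally close to B and to C (distance 1).
vectors = {"A": (1, 1, 0, 0), "B": (1, 1, 1, 0),
           "C": (1, 1, 0, 1), "D": (0, 0, 1, 1)}
levels = build_pyramid(vectors)
```

In this hypothetical run the second level contains both A-B and A-C alongside D, so molecule A appears in two nodes at once, which is what distinguishes the resulting pyramid from a conventional tree.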
By merging A separately with both B and C, the processor effectively maintains, rather than discards, information. For instance, the processor may identify a common substructure respectively for each of nodes (i) A-C, (ii) A-B, (iii) A, (iv) B and (v) C. And the processor may identify a representative activity measure for each of these nodes. Advantageously, the processor may then provide this and other information (e.g., activity differential information as mentioned above) as output for use by a chemist. With the benefit of this information, a chemist may thus readily determine, for instance, that a much greater activity differential exists between parent node A-C and child node A than between parent node A-C and child node C.
In the exemplary embodiment, the present invention therefore advantageously establishes a multi-domain pyramid (or tree) structure, built from the ground up (or from the leaves to the root). Each node of the pyramid may define a pharmacophoric mechanism (e.g., substructure) and represents or comprises one or more molecules that include that mechanism. Backtracking down the pyramid (i.e., opposite the direction that the pyramid was built), each parent node may lead to one or more children nodes, each preferably defining a further pharmacophoric mechanism, and each including those molecule(s) from its parent node that include the mechanism.
According to the exemplary embodiment, the processor may further trim the pyramid (i.e., the tree), to remove nodes that are not particularly useful. For instance, if the processor determines that the common substructure identified for a given node is the same as that of its parent node, then the processor can remove the child node from the pyramid and change the output to reflect that any children of the child node are instead children of the parent node. As another example, the processor can be programmed to remove all nodes from the base layer of the pyramid, since each of those nodes in the exemplary embodiment represents a single molecule, which is not particularly useful information for a chemist.
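The first trimming rule might be sketched as follows, for illustration only. The sketch assumes a simplified representation in which each node is a dictionary with a "substructure" label and a list of "children", and in which each node has a single parent; the labels shown are hypothetical.

```python
def trim(node):
    """Return a copy of the node in which any child whose substructure
    equals its parent's is removed and that child's own children are
    promoted to the parent."""
    kept = []
    pending = list(node["children"])
    while pending:
        child = pending.pop(0)
        if child["substructure"] == node["substructure"]:
            # Redundant child: examine its children in its place.
            pending = child["children"] + pending
        else:
            kept.append(trim(child))
    return {"substructure": node["substructure"], "children": kept}


# Hypothetical pyramid fragment: the first child repeats the root's label.
root = {"substructure": "C", "children": [
    {"substructure": "C", "children": [
        {"substructure": "C-N", "children": []},
        {"substructure": "C-O", "children": []},
    ]},
    {"substructure": "C=O", "children": []},
]}
trimmed = trim(root)
```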
A pyramid structure produced in accordance with an exemplary embodiment of the invention can represent, in and of itself, a large amount of commercially valuable information, much of which was previously out of reach. As an example, for each node of the pyramid (after the root node), the common substructure (pharmacophoric mechanism) identified for the node can be commercially valuable information, since it represents a substructure that is likely to be responsible for observed pharmacophoric activity. Such a substructure might therefore be usefully employed to develop beneficial new drugs.
As another example, any lineage of nodes in the pyramid (e.g., from a given node up or down to another node) can embody a significant amount of commercially valuable information. By the time one or more molecules reaches a terminal node (i.e., the base) of the pyramid, for instance, the molecule(s) may have passed through a number of nodes defining their ancestral parent node(s), each having a respective common pharmacophore. This ancestral line of pharmacophores may therefore represent the pharmacophoric mechanisms that, cooperatively, are likely to result in an activity level reflected by the molecule(s) in the terminal node.
As yet another example, as noted above, the difference in activity levels between molecules in a child node and molecules in its parent node can be very valuable information, since the difference may represent the enhancing or detracting effect of the pharmacophoric mechanism that gave rise to the child node. Such information is even more valuable when a given parent node gives rise to a pair of children nodes and the activity differential varies greatly among the children nodes. For instance, if one child node reflects an activity increase compared to the parent, while the other child node reflects an activity decrease compared to the parent, it is reasonable to conclude that the pharmacophoric mechanism defined by the one child node is likely to be more useful for development of beneficial new drugs.
An exemplary embodiment of the present invention can thus take a massive amount of data representing chemical compounds and convert that data into a pyramid structure that conveniently and intuitively represents the foregoing and other valuable information. A chemist, who could not manually analyze such a vast amount of input data, can then readily analyze the organized information represented by the pyramid structure. The information generated by the invention can thus assist in the development of leads and in turn the development of beneficial new drugs.
Thus, in one respect, an embodiment of the invention can take the form of a method for identifying chemical substructures by analysis of a data set representing a plurality of chemical structures. The method can include executing a computer program to pyramidally cluster representations of the chemical structures, so as to produce in a data storage medium a hierarchy of clusters, where each cluster represents one or more of the chemical structures. This process can include comparing clusters and merging together pairs of clusters that have the greatest similarity. In this regard, the process can include finding, at a given level of the hierarchy, that at least two pairs of clusters have substantially the same similarity, and then responsively merging each pair respectively, so as to form at least two new clusters at the next level of the hierarchy.
Further, the process of executing a computer program to pyramidally cluster the molecular representations can involve applying a clustering algorithm. The identity of the clustering algorithm (i.e., the particular algorithm, such as Ward's, complete-link, or the like) can be specified by a user, and a computer may execute the specified algorithm. Further, a user may specify one or more other aspects of the clustering algorithm, such as, for instance, a fuzziness parameter that indicates how strict or lenient the computer will be when deciding whether a tie in similarity exists between two pairs of clusters. As an example, the fuzziness parameter could indicate a range of similarities that could be considered ties.
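One possible reading of the fuzziness parameter, offered only as an assumption, is an absolute tolerance on inter-cluster distance: any pair whose distance lies within that tolerance of the smallest distance is treated as tied. A minimal sketch under that assumption follows; the function name, the cluster labels and the distance values are hypothetical.

```python
def tied_pairs(pair_distances, fuzziness=0.0):
    """Return the pairs whose distance is within `fuzziness` of the
    smallest distance, i.e. the pairs to be treated as ties.

    pair_distances -- dict mapping a (cluster, cluster) pair to a distance
    fuzziness      -- 0.0 demands exact ties; larger values are more lenient
    """
    smallest = min(pair_distances.values())
    return sorted(pair for pair, d in pair_distances.items()
                  if d - smallest <= fuzziness)


# Hypothetical distances between three clusters.
distances = {("A", "B"): 0.20, ("A", "C"): 0.22, ("B", "C"): 0.50}
strict = tied_pairs(distances)                   # only the closest pair
lenient = tied_pairs(distances, fuzziness=0.05)  # A-B and A-C both qualify
```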
With respect to each of the clusters of the hierarchy, the method can further include analyzing the chemical structure(s) in the cluster and determining a chemical substructure that is representative of the chemical structure(s) in the cluster. In turn, the method can include outputting for consideration by a person a description of at least a portion of the hierarchy and an indication of at least one of the representative chemical substructures.
In another respect, an embodiment of the invention can take the form of a method of identifying pharmacophoric mechanisms through analysis of a plurality of molecules, where each molecule has a respective feature characteristic and a respective activity characteristic. The method can involve establishing in a computer memory a plurality of cluster objects, each representing one of the molecules, and then agglomeratively clustering the cluster objects based on comparisons of the feature characteristics of the molecules that the cluster objects represent. In this process of agglomeratively clustering, to the extent any given cluster object is determined to be equidistant to a plurality of other cluster objects, the method may further include merging the given cluster object with each cluster object of the plurality of other cluster objects. In any event, the result can be, in a computer memory (or, equivalently, another type of data storage medium), a hierarchical pyramid made up of a number of cluster objects each representing a number of the molecules.
With respect to each of at least some cluster objects of the pyramid, the method may further include identifying a substructure that is common to molecules represented by the cluster object. Such a substructure may define a respective pharmacophoric mechanism. In turn, the method may include outputting for viewing by a person a description of at least part of the hierarchical pyramid, including at least one of the identified substructures.
In yet another respect, an embodiment of the invention can take the form of a method of identifying pharmacophoric mechanisms through analysis of a plurality of molecules, where each molecule defines a feature characteristic and an activity characteristic. The method can include establishing in a computer memory a plurality of data objects, each representing one of the molecules and having associated with it a feature vector that represents the feature characteristic of the molecule.
In turn, the method can include pyramidally clustering the data objects based on their associated feature vectors, so as to form in the computer memory a pyramidal data structure having a number of nodes each representing one or more of the molecules. In the process of pyramidally clustering the data objects, the method preferably includes encountering a tie in proximity between a given node and at least two other nodes and responsively merging the given node separately with each of the at least two other nodes.
The method may further include, with respect to each node of the pyramidal data structure, identifying a chemical feature set common to the one or more molecules represented by the node. This chemical feature set can be considered to define a pharmacophore. Still further, the method can include providing an output that describes (or, equivalently, otherwise indicates) at least a portion of the pyramidal data structure and that includes a description of the chemical feature set identified with respect to at least one node of the pyramidal data structure.
In another respect, an embodiment of the invention could take the form of a method of learning pharmacophoric mechanisms through analysis of a plurality of molecules, each having a respective feature characteristic and a respective activity characteristic. This embodiment of the invention could involve selecting from the plurality of molecules a group of molecules that has at least a threshold activity characteristic (i.e., in an exemplary embodiment, each molecule of the group having at least the threshold activity characteristic, such as a threshold level of activity, for instance). Further, the method could involve establishing in a data storage medium a plurality of data objects that each represent at least one of the molecules of the group, such that at least one of the data objects (object 1) represents two or more molecules. Establishing these data objects in memory can itself involve developing a representation of each molecule and then agglomeratively clustering the representations into a hierarchy, where object 1 resides at a given level.
The invention may then involve measuring similarity between the data objects based on the feature characteristics of the molecules represented by the data objects. Based on these measurements, the invention could involve making a determination that the similarity between object 1 and another data object (object 2) is substantially equal to the similarity between object 1 and still another data object (object 3). In response to that determination, the method could involve merging object 1 separately with object 2 to form a new data object (object 4) and with object 3 to form a new data object (object 5).
The method may then involve identifying at least (i) a common feature set among the feature characteristics of the molecules represented by object 1 and (ii) a common feature set among the feature characteristics of the molecules represented by object 4. Each of these common feature sets can be considered to define a respective pharmacophoric mechanism.
The method may further include providing to a person an indication of at least the common feature sets identified with respect to the molecules of objects 1 and 4. In conjunction with this output, the method could include computing representative activity levels of each object as well as a differential between the activity levels of at least objects 1 and 4, and possibly providing an indication of the differential. A person may then correlate the differential with the common feature set identified with respect to object 1.
The method may additionally include representing each feature characteristic as a binary vector whose members indicate the presence or absence of respective molecular features. (The process of so representing the feature characteristic may involve generating the binary vectors, or simply receiving the vectors as input.) With this arrangement, the process of measuring similarity between data objects can involve evaluating (i.e., measuring or computing) similarity between respective pairs of the data objects based on the binary vectors of the molecules represented by the data objects of the pair. As between any two data objects, this similarity computation can involve computing a Tanimoto distance, a Euclidean distance, or other distance measure between the data objects.
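For illustration, the two named distance measures might be computed on binary vectors as sketched below. The Tanimoto distance is taken here as one minus the ratio of shared features to features present in either vector; treating two all-zero vectors as having distance zero is an assumption of the sketch, and the function names and sample vectors are hypothetical.

```python
import math


def tanimoto_distance(u, v):
    """1 - (features in both) / (features in either), for 0/1 vectors."""
    both = sum(a & b for a, b in zip(u, v))
    either = sum(a | b for a, b in zip(u, v))
    if either == 0:
        return 0.0  # assumption: two empty feature sets are identical
    return 1.0 - both / either


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical binary feature vectors for two molecules.
u = (1, 1, 0, 1)
v = (1, 0, 0, 1)
t = tanimoto_distance(u, v)   # 2 shared of 3 present, so 1 - 2/3
e = euclidean_distance(u, v)  # the vectors differ in one position
```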
In still another respect, an embodiment of the invention could take the form of a method for analyzing a plurality of molecules, each of which has a respective feature characteristic and a respective activity characteristic. The respective activity characteristic of each molecule preferably represents at least a threshold activity level. The method can then include establishing in a computer memory a plurality of cluster objects, each cluster object representing at least one of the molecules.
With respect to the cluster objects, the method can involve conducting a merging process that includes (i) comparing pairs of the cluster objects and, for each pair, measuring a respective dissimilarity between the cluster objects within the pair based on the feature characteristics of the molecules represented by the respective cluster objects, (ii) of the dissimilarities measured in step (i), identifying a smallest dissimilarity, (iii) selecting at least one pair of the cluster objects that has the smallest measured dissimilarity, and (iv) with respect to each of the at least one pair selected in step (iii), merging the cluster objects of the pair to establish a cluster object cooperatively representing the molecules that were represented by the cluster objects of the pair. With respect to at least each cluster object established in step (iv), the method can further include identifying a common substructure among the molecules represented by the cluster object.
At each level of merger, if at least two cluster objects have not yet been merged, the method can involve conducting the merging process again, but with respect to the cluster objects that have not yet been merged.
Further, the method can include outputting a description of at least one of the established cluster objects, including at least an indication of the common substructure identified for that cluster object. This output can include a graphical description (such as a tree structure) of the cluster objects, including for each cluster object an indication of the common substructure established with respect to the cluster object. Alternatively or additionally, a graphical depiction can include for each cluster object an indication of a measure of the activity characteristics of the molecules represented by the cluster object, and/or perhaps a measure of activity differential between parent and child clusters in the pyramid.
In yet another respect, an embodiment of the present invention can take the form of a processing system for screening a data set representing a plurality of molecules, so as to assist in identifying sets of molecular features that are likely to correlate with specified activity. The data set may define, for each represented molecule, a feature characteristic and an activity characteristic. And the processing system may include a processor, at least one data storage medium, and a set of machine language instructions stored in the data storage medium and executable by the processor to perform functions such as those described above.
In a further respect, an embodiment can take the form of a computer-readable medium (such as a memory, a magnetic or optical disk, or a tape, for instance) that embodies a set of machine language instructions executable by a computer for performing method steps such as those described above or such as those depicted in the accompanying figures.
In yet a further aspect, an exemplary embodiment of the present invention involves applying a pyramid structure generated in accordance with the invention in order to classify other compounds, so as to "virtually" determine what level of activity might be expected of a known or unknown molecule.