1. Field of the Invention
The present invention relates to computer-based analysis of data and generally to the computer-based correlation of data features with data responses, in order to determine or predict which features correlate with or are likely to result in one or more responses. The invention is particularly suitable for use in the fields of chemistry, biology and genetics, such as to facilitate computer-based correlation of chemical structures with observed or predicted pharmacophoric activity. The invention is particularly useful in facilitating the identification and development of potentially beneficial new drugs.
For purposes of illustration, the invention will be described primarily in the context of computer-based analysis of chemical structure-activity relationships (SAR). However, based on the present disclosure, those of ordinary skill in the art will appreciate that the invention may be applicable in other areas as well. By way of example and without limitation, the invention may be applicable in genetics and antibody-protein analysis.
2. Description of Related Art
The global biotech and pharmaceutical industry is a $200 billion/year business. Most of the estimated $13 billion R&D spending in this industry is focused on discovering and developing prescription drugs. Current R&D effort is characterized by low drug discovery rates and long time-to-market.
In an effort to accelerate drug discovery, biotech and pharmaceutical firms are turning to robotics and automation. The old methods of rationally designing molecules using known structural relationships are being supplanted by a shotgun approach of rapidly screening hundreds of thousands of molecules for biological activity. High Throughput Screening (HTS) is being used to test large numbers of molecules for biological activity. The primary goal is to identify hits or leads, which are molecules that affect a particular biological target in the desired manner. For instance and without limitation, a lead may be a chemical structure that binds particularly well to a protein.
Automated HTS systems are large, highly automated liquid handling and detection systems that allow thousands of molecules to be screened for biological activity against a test assay. Several pharmaceutical and biotech companies have developed systems that can perform hundreds of thousands of screens per day.
The increasing use of HTS is being driven by a number of other developments in the industry. The greater the number and diversity of molecules that are run through screens, the more successful HTS is likely to be. This fact has propelled rapid developments in molecule library collection and creation. Combinatorial chemistry systems have been developed that can automatically create hundreds of thousands of new molecules. Combinatorial chemistry is performed in large automated systems that are capable of synthesizing a wide variety of small organic molecules using combinations of “building block” reagents. HTS systems are the only way that the enormous volume of new molecules generated by combinatorial chemistry systems can be tested for biological activity. Another force driving the increased use of HTS is the Human Genome program and the companion field of bioinformatics that is enabling the rapid identification of gene function and accelerating the discovery of therapeutic targets. Companies do not have the resources to develop an exhaustive understanding of each potential therapeutic target. Rather, pharmaceutical and biotech companies use HTS to quickly find molecules that affect the target and may lead to the discovery of a new drug.
High throughput screening does not directly identify a drug. Rather, the primary role of HTS is to detect lead molecules and supply directions for their optimization. This limitation exists because many properties critical to the development of a successful drug cannot be assessed by HTS. For example, HTS cannot evaluate the bioavailability, pharmacokinetics, toxicity, or specificity of an active molecule. Thus, further studies of the molecules identified by HTS are required in order to identify a potential lead to a new drug.
The further study, a process called lead discovery, is a time- and resource-intensive task. High throughput screening of a large library of molecules typically identifies thousands of molecules with biological activity that must be evaluated by a pharmaceutical chemist. Those molecules that are selected as candidates for use as a drug are studied to build an understanding of the mechanism by which they interact with the assay. Scientists try to determine which molecular properties correlate with high activity of the molecules in the screening assay. Using the drug leads and this mechanism information, chemists then try to identify, synthesize and test molecules analogous to the leads that have enhanced drug-like effect and/or reduced undesirable characteristics in a process called lead optimization. Ideally, the end result of the screening, lead discovery, and lead optimization is the development of a new drug for clinical testing.
As the number of molecules in the test library and the number of therapeutic target assays increase exponentially, lead discovery and lead optimization have become the new bottleneck in drug discovery using HTS systems. Because of the large number of HTS results that must be analyzed, scientists often seek only first-order results such as the identification of molecules in the library that exhibit high assay activity. In one method, for instance, all of the molecules in the data set are divided into groups based on common properties of their molecular structures. An analysis is then made to determine which groups contain molecules with high activity levels and which groups contain molecules with low activity levels. Those groups representing high activity levels are then deemed to be useful groups. Commonly, the analysis will stop at this point, leaving chemists to analyze the members of the active groups in search of new or optimized leads.
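By way of illustration only, the grouping method just described might be sketched as follows. The group keys, the activity values, and the use of a mean-activity threshold to decide which groups are "useful" are all assumptions made for this sketch and are not taken from any particular prior method.

```python
from collections import defaultdict
from statistics import mean

def active_groups(records, threshold):
    """Return the group keys whose mean activity meets the threshold.

    records: iterable of (group_key, activity) pairs, where group_key stands
    for a shared structural property (hypothetical labels in this sketch).
    """
    groups = defaultdict(list)
    for key, activity in records:
        groups[key].append(activity)
    # Assumption: a group counts as "useful" when its mean activity is high.
    return sorted(k for k, values in groups.items() if mean(values) >= threshold)

# Hypothetical screening results: (structural group, measured activity).
records = [
    ("scaffold_1", 8.0), ("scaffold_1", 7.0),
    ("scaffold_2", 1.0), ("scaffold_2", 2.0),
    ("scaffold_3", 6.0),
]
useful = active_groups(records, threshold=5.0)
print(useful)
```

The sketch stops where the described method commonly stops: it reports which groups look active and leaves the members of those groups for a chemist to examine.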
In another method, for instance, a more extensive automated analysis is conducted in an effort to partition the molecules into groups of particular interest and particularly to derive structure-activity relationship rules. An example of this method is described in International Patent Application No. PCT/US98/07899 (designating the United States), filed Apr. 17, 1998 by Glaxo Group Ltd., published as International Publication No. WO 98/47087 on Oct. 22, 1998, and further by Xin Chen et al., “Recursive Partitioning Analysis of a Large Structure-Activity Data Set Using Three-Dimensional Descriptors” (University of North Carolina, and Glaxo Wellcome, Inc., May 17, 1998), both of which are expressly incorporated herein by reference in their entireties.
As described by Glaxo and Chen et al., well known recursive partitioning (RP) techniques, commonly referred to as classification trees, are used to iteratively partition a data set (such as results of HTS or other automated chemical synthesis) into activity classes. The data set includes molecules and indicia of empirically determined potency (activity-level) per molecule. According to the method, a set of descriptors is first provided, each indicating a structural feature that can be described as present or absent in a given molecule. For each molecule, a bit string is then built, indicating whether the molecule has each particular descriptor (1-bit) or not (0-bit). These strings are then configured as a matrix, in which each row represents a molecule and each column represents a descriptor. RP is then used to divide the molecules (rows) into exactly two groups according to whether the molecules have a particular “best” descriptor in common. The “best” descriptor is the descriptor that would result in the largest possible difference in average potency between those molecules containing the descriptor and those molecules not containing the descriptor.
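For illustration and without limitation, the selection of a “best” descriptor from such a bit matrix might be sketched as follows. The function name, the toy matrix, and the potency values are hypothetical, and the plain difference of mean potencies used here is a simplification that follows the description above rather than the exact statistic of the cited references.

```python
from statistics import mean

def best_descriptor(matrix, potency):
    """Return (column, gap) for the descriptor whose present/absent split
    yields the largest difference in mean potency."""
    best_col, best_gap = None, 0.0
    for col in range(len(matrix[0])):
        has = [p for row, p in zip(matrix, potency) if row[col] == 1]
        lacks = [p for row, p in zip(matrix, potency) if row[col] == 0]
        if not has or not lacks:
            continue  # this descriptor does not divide the molecules
        gap = abs(mean(has) - mean(lacks))
        if gap > best_gap:
            best_col, best_gap = col, gap
    return best_col, best_gap

# Hypothetical data: 4 molecules (rows) by 3 descriptors (columns).
matrix = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]
potency = [9.0, 8.0, 2.0, 1.0]
col, gap = best_descriptor(matrix, potency)
print(col, gap)
```

In this toy data the first descriptor separates the two high-potency molecules from the two low-potency molecules, so it is chosen as the split.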
As further described, the method continues iteratively with respect to each subdivided group, dividing each group into two groups based on a next “best” descriptor selected from the group of descriptors. The result of this process is a tree structure, in which some terminal nodes may contain a preponderance of inactive molecules (or molecules having relatively low potency) and other terminal nodes may contain a preponderance of active molecules (or molecules having relatively high potency) (the latter being “good terminal nodes”). Tracing the lineage of the structures defined by a good terminal node may then reveal molecular components that cooperatively reflect a high likelihood of potency. After generating the tree structure through use of RP, it is possible to predict the activities of molecules that have not yet been empirically tested for activities. In particular, a known molecule can be passed down through the tree and examined for the presence or absence of descriptors established to confer activity. HTS or other analysis can then be efficiently conducted with respect to only those molecules that have at least a threshold level of predicted activity.
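The iterative partitioning and the subsequent prediction step might be sketched, under stated assumptions, as follows. The stopping parameters `max_depth` and `min_gap`, the dictionary-based tree layout, and the toy data are all hypothetical choices made for this sketch, and the split criterion is again a plain difference of mean potencies.

```python
from statistics import mean

def split_gap(rows, col):
    """Difference in mean potency between molecules with and without a descriptor."""
    has = [p for bits, p in rows if bits[col]]
    lacks = [p for bits, p in rows if not bits[col]]
    if not has or not lacks:
        return None  # descriptor does not divide this group
    return abs(mean(has) - mean(lacks))

def grow(rows, depth=0, max_depth=3, min_gap=1.0):
    """Recursively split rows of (descriptor bits, potency) into a binary tree."""
    node = {"mean": mean(p for _, p in rows), "n": len(rows)}
    if depth >= max_depth or len(rows) < 2:
        return node
    gaps = {c: split_gap(rows, c) for c in range(len(rows[0][0]))}
    gaps = {c: g for c, g in gaps.items() if g is not None and g >= min_gap}
    if not gaps:
        return node  # terminal node: no descriptor separates the group enough
    col = max(gaps, key=gaps.get)
    node["col"] = col
    node["has"] = grow([r for r in rows if r[0][col]], depth + 1, max_depth, min_gap)
    node["lacks"] = grow([r for r in rows if not r[0][col]], depth + 1, max_depth, min_gap)
    return node

def predict(tree, bits):
    """Pass a molecule down the tree and return the mean potency of its terminal node."""
    while "col" in tree:
        tree = tree["has"] if bits[tree["col"]] else tree["lacks"]
    return tree["mean"]

# Hypothetical data: two descriptors (A, B) and a measured potency per molecule.
rows = [
    ((1, 1), 9.0), ((1, 1), 8.0), ((1, 0), 3.0),
    ((0, 1), 2.0), ((0, 0), 1.0), ((0, 0), 1.0),
]
tree = grow(rows)
print(predict(tree, (1, 1)), predict(tree, (1, 0)))
```

An untested molecule is scored simply by following the presence or absence of each split descriptor until a terminal node is reached, which is why each molecule ends up in exactly one terminal node.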
The present inventors have discovered that the use of RP to partition molecules on the basis of their structural and activity similarity is limiting. By way of example, with RP, each molecule can fall within only a single terminal node of the tree structure, based on one or more determinations along the way as to whether the molecule includes various descriptors known to confer activity. Consequently, if there is more than one set of descriptors in a molecule (or set of molecules) that results in observed activity, RP may be unable to identify all of the pertinent descriptor sets.
For instance, given an initial set of 10 molecules, assume that the molecules are first partitioned on the basis of descriptor A into groups A0 and A1, where group A0 contains 3 low-potency molecules not having A and group A1 contains 7 high-potency molecules having A. Assume that the 7 high-potency molecules are then partitioned on the basis of descriptor B into groups B0 and B1, where group B0 contains 2 low-potency molecules not having B and group B1 contains 5 high-potency molecules having B. Finally, assume that the 5 high-potency molecules are then partitioned on the basis of descriptor C into groups C0 and C1, where group C0 contains 3 low-potency molecules not having C and group C1 contains the remaining 2 high-potency molecules having C. As a result, a reasonable conclusion is that molecules having descriptors A, B and C are likely to have a high degree of potency. However, assume further that there exists another descriptor D, and that if the original group of 10 molecules were divided on the basis of descriptor D, the tree would grow a different set of branches, indicative of a different set of descriptors corresponding to a high degree of potency. Unfortunately, since the RP method necessarily partitions molecules into mutually exclusive groups, it is unable to discover that lead candidates might in fact be optimized along two or more different pathways or branches.
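The counts in the foregoing example can be traced in a short sketch. The group sizes for A, B and C follow the example, with the size of group C1 obtained by subtraction (5 minus 3). The assignment of descriptor D to particular molecules is purely hypothetical and is included only to show how molecules sharing D can be scattered across mutually exclusive terminal nodes.

```python
# Each molecule is represented by the set of descriptors it contains.
# Memberships for A, B, C follow the example; D is assigned arbitrarily.
molecules = [
    frozenset(), frozenset({"D"}), frozenset(),             # 3 lacking A
    frozenset({"A"}), frozenset({"A", "D"}),                 # 2 with A, lacking B
    frozenset({"A", "B"}), frozenset({"A", "B", "D"}),
    frozenset({"A", "B"}),                                   # 3 with A, B, lacking C
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B", "C", "D"}),                         # 2 with A, B, C
]

def split(group, descriptor):
    """Divide a group into (molecules lacking, molecules having) the descriptor."""
    lacks = [m for m in group if descriptor not in m]
    has = [m for m in group if descriptor in m]
    return lacks, has

a0, a1 = split(molecules, "A")
b0, b1 = split(a1, "B")
c0, c1 = split(b1, "C")

leaves = (a0, b0, c0, c1)
leaf_sizes = [len(leaf) for leaf in leaves]
d_per_leaf = [sum("D" in m for m in leaf) for leaf in leaves]
print(leaf_sizes, d_per_leaf)
```

Because the terminal node sizes sum to the 10 original molecules, every molecule is accounted for exactly once, and a descriptor such as D that cuts across those nodes never appears as a branch of this particular tree.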
A thorough exploration of numerous RP trees generated with the same data set, using a collection of methods, such as selective elimination of some features, or changing the splitting criterion, or performing surrogate splits, can create alternative sets of descriptors for the same molecule or set of molecules. However, these procedures would require a great deal of time and effort on the part of the user. It is often very difficult to find a consensus among the vast number of possible reasonable trees that can be useful in building a predictive model.
In addition, if the sizes of the two classes are significantly unbalanced, as is often the case with active and inactive classes in a large diverse population of compounds (where typically 5% or less show as active in a high-throughput screen), building a classifier and hence a predictive model can be considerably more difficult. A small amount of noise in the data will often prevent the descriptors from discriminating the very small class from the much larger class. Unfortunately, in HTS, particularly in early screening, the noise level of the response is notoriously high, and the levels of false positive and false negative responses for molecules can be high. The present inventors have discovered that this noise level may contribute to compromised or faulty splitting decisions for the RP tree.
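The effect of noise on an unbalanced data set can be illustrated with simple arithmetic. In the sketch below, the 5% active fraction comes from the discussion above, while the library size and the false positive and false negative rates are hypothetical values chosen only for illustration.

```python
def observed_active_purity(n, active_frac, fp_rate, fn_rate):
    """Fraction of molecules reported as active that are truly active."""
    actives = n * active_frac
    inactives = n - actives
    true_hits = actives * (1 - fn_rate)   # true actives that the screen detects
    false_hits = inactives * fp_rate      # inactives wrongly reported as active
    return true_hits / (true_hits + false_hits)

# Hypothetical screen: 100,000 molecules, 5% truly active,
# 5% false positive rate, 20% false negative rate.
purity = observed_active_purity(100_000, 0.05, fp_rate=0.05, fn_rate=0.20)
print(round(purity, 3))
```

Under these assumed rates, fewer than half of the molecules reported as active are truly active, because even a modest false positive rate applied to the much larger inactive class yields more false hits than there are true hits.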