Medicinal chemists are faced with the continuing process of enhancing the desirable attributes of a wide range of pharmaceuticals and potential drug candidates. Typically, this process comprises the steps of:
(a) acquiring compounds for testing;
(b) performing one or more biological assays and, possibly, analysis of physical properties;
(c) examining the results and formulating a structural hypothesis that explains compound activity; and
(d) designing a focused compound library to test the structural hypothesis.
At any one time, a chemist may receive assay data on a set of up to 10,000 compounds in step (c). His or her task is to determine which molecular feature or combination of features is responsible for the compounds' biological or physical activity, in order to formulate a structural hypothesis for that activity and thereby design a focused library to test the hypothesis. There are a number of software products presently on the market that provide some help to the medicinal chemist, including packages from Daylight Chemical Information Systems, Inc. of Mission Viejo, Calif.; MDL Information Systems, Inc. of San Leandro, Calif.; Oxford Molecular Group of Oxford, England; Molecular Simulations, Inc. of San Diego, Calif.; Synopsis Scientific Systems, Ltd. of Leeds, England; and Tripos, Inc. of St. Louis, Mo.
A popular type of software tool available today is the "molecular spreadsheet," modeled on spreadsheets for financial applications. The spreadsheet typically has one row for each substance or compound and one column that holds a structural diagram of the substance. Other columns hold substance identifiers and/or other bookkeeping information, biological response data, and experimental or calculated property data such as molecular weight. The medicinal chemist must access several different corporate and/or project files to load the desired information into the spreadsheet.
With chemical structures, biological activity and other data loaded into a molecular spreadsheet, the medicinal chemist will then sort on activity, bringing the most active compounds to the top, and begin visually examining the compounds with the highest activity, one at a time. After inspecting 50-100 compounds, the chemist will probably have noticed some substructure that seems to occur frequently and will hypothesize that this substructure is partially responsible for the compounds' activity. The chemist could verify this by the following procedure:
(1) formulate a substructure search query corresponding to the hypothesized active structure and conduct the substructure search;
(2) determine the numbers of active and inactive compounds that contain the structural feature; and
(3) perform a statistical calculation on the mean activity of compounds containing the structure versus mean activity of the full set.
Each step in this process will require a different software program and a significant amount of the chemist's time. The process could then be repeated over and over until the medicinal chemist concludes that the evidence supports his or her hypothesis.
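The three-part verification procedure (substructure search, counting active and inactive hits, and comparing mean activities) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the compound records, the substructure labels, the activity cutoff, and the function name `evaluate_hypothesis` are hypothetical, the "substructure search" is reduced to a set-membership test on precomputed labels, and the z-like score is only one simple choice of statistical calculation.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical records: (compound id, substructure labels present, activity).
# A real system would obtain the labels from a substructure search engine.
COMPOUNDS = [
    ("C1", {"indole", "amide"}, 8.1),
    ("C2", {"indole"}, 7.6),
    ("C3", {"amide"}, 5.2),
    ("C4", {"phenol"}, 4.9),
    ("C5", {"indole", "phenol"}, 7.9),
    ("C6", {"amide", "phenol"}, 5.0),
]


def evaluate_hypothesis(compounds, feature, active_cutoff):
    """Summarize how well `feature` explains activity in `compounds`."""
    # Step 1: stand-in for the substructure search (set membership).
    hits = [c for c in compounds if feature in c[1]]
    if not hits:
        raise ValueError(f"no compound contains {feature!r}")

    # Step 2: count active and inactive compounds among the hits.
    n_active = sum(1 for c in hits if c[2] >= active_cutoff)
    n_inactive = len(hits) - n_active

    # Step 3: compare the mean activity of the hits with the full set,
    # scaled by the standard error expected for a subset of this size.
    all_activities = [c[2] for c in compounds]
    hit_mean = mean(c[2] for c in hits)
    full_mean = mean(all_activities)
    z_score = (hit_mean - full_mean) / (stdev(all_activities) / sqrt(len(hits)))

    return {
        "n_active": n_active,
        "n_inactive": n_inactive,
        "hit_mean": hit_mean,
        "full_mean": full_mean,
        "z_score": z_score,
    }


result = evaluate_hypothesis(COMPOUNDS, "indole", active_cutoff=7.0)
print(result)
```

Even in this reduced form, the inactive compounds stay in the calculation, since they contribute to the full-set mean and spread against which the hypothesis is judged.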
Presently available molecular spreadsheets force the chemist to deal with compounds one at a time and are, thus, limited to small sets. Furthermore, because the iterative process of verifying a structural hypothesis outlined above is so cumbersome and time-consuming, the chemist is forced to cut corners. Often this means that the inactive compounds are simply discarded. Eliminating inactive compounds precludes the opportunity to learn from negative results.
The structure of a chemical substance is responsible for its biological activity and physical properties. There is a large body of literature [8] and a number of commercially available software packages for correlating structural descriptors with activity, that is, for deriving quantitative structure-activity relationships (QSAR programs). Two of the newest and most popular commercial programs are Comparative Molecular Field Analysis (CoMFA) and HQSAR, both from Tripos Assoc., St. Louis, Mo.
There are usually three steps to QSAR analysis: 1) selection of a set of molecular descriptors; 2) calculation of the molecular descriptors for each substance; and 3) statistical analysis of descriptor/activity data. A wide variety of structural descriptors have been used, including generalized atom-pairs, atom-pair fingerprints, substructure search screens, two-dimensional and three-dimensional shape descriptors, partial atomic charges, and topological indices. In the HQSAR program, molecular structures are dissected into all possible connected atom-bond fragments of predetermined size (number of atoms). Once the molecular descriptors have been calculated, a statistical method is used to generate a QSAR model relating descriptors to activity. Commonly used statistical methods are multiple linear regression, principal component analysis and partial least squares.
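The three QSAR steps can be sketched in a few lines of Python. This is an illustrative toy, not the HQSAR algorithm: the molecules are hand-written atom/bond lists, fragments are labeled only by their sorted atom symbols (connectivity inside a fragment is ignored), the activity values are invented, and a single-descriptor least-squares line stands in for multiple linear regression. All function names are hypothetical.

```python
from collections import Counter


def connected_fragments(atoms, bonds, size):
    """Return every connected set of `size` atom indices in the molecule."""
    neighbors = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    found = set()

    def grow(fragment):
        if len(fragment) == size:
            found.add(fragment)
            return
        # Atoms bonded to the fragment but not yet part of it.
        frontier = set().union(*(neighbors[i] for i in fragment)) - fragment
        for atom in frontier:
            grow(fragment | {atom})

    for start in range(len(atoms)):
        grow(frozenset({start}))
    return found


def fragment_counts(atoms, bonds, size):
    """Steps 1 and 2: count fragments, labeled by sorted atom symbols."""
    return Counter(
        "".join(sorted(atoms[i] for i in frag))
        for frag in connected_fragments(atoms, bonds, size)
    )


def fit_line(xs, ys):
    """Step 3: ordinary least squares for activity = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical molecules (heavy atoms only) and made-up activities.
MOLECULES = {
    "ethane": (["C", "C"], [(0, 1)]),
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "ethylene glycol": (["O", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]),
}
ACTIVITY = {"ethane": 1.0, "ethanol": 2.0, "ethylene glycol": 3.0}

co_counts = [fragment_counts(*MOLECULES[m], size=2)["CO"] for m in MOLECULES]
slope, intercept = fit_line(co_counts, [ACTIVITY[m] for m in MOLECULES])
print(co_counts, slope, intercept)
```

A real descriptor set would contain many fragment types per molecule, which is where multivariate methods such as partial least squares become necessary.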
CoMFA uses variance in field strengths around a set of aligned three-dimensional structures to describe the observed variance in biological activity. Although CoMFA is the most popular and highly regarded three-dimensional QSAR method, it requires expert knowledge. The chemist must make decisions regarding conformation and relative alignment, which can be difficult and complex, especially with structurally diverse molecules.
Another approach to structure-activity relations (SAR) is known as "recursive partitioning." Computer algorithms have been developed [9] that partition a set of chemical structures into subsets based on a statistical calculation comparing subsets which contain 0, 1, or more instances of a predefined structural feature. Then the procedure is recursively reapplied to the newly created subsets until some statistical threshold is exceeded. The procedure produces a dendrogram where the nodes are compound sets. A dendrogram is a branching diagram representing a hierarchy of categories based on degree of similarity or number of shared characteristics. The root node is the full compound set or parent set, and the offspring of any node is a partitioning of the parent set. The structural features that have been used are similar to those used for clustering and conventional SAR. There is no provision for the chemist to participate in this partitioning process in the prior art programs.
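The recursive partitioning idea can be sketched as follows. This is a simplified illustration under stated assumptions, not any published algorithm: features are treated as present or absent rather than counted (0, 1, or more instances), the split statistic is the reduction in the sum of squared deviations of activity, and recursion stops when that reduction falls below a threshold or a subset becomes too small. The data, thresholds, and function names are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def partition(compounds, features, min_gain=1.0, min_size=2):
    """Build a tree of compound sets; each compound is (feature_set, activity)."""
    activities = [act for _, act in compounds]
    node = {"n": len(compounds), "mean": mean(activities), "split": None}

    # Find the feature whose presence/absence split most reduces the spread.
    best = None
    for feature in features:
        present = [c for c in compounds if feature in c[0]]
        absent = [c for c in compounds if feature not in c[0]]
        if len(present) < min_size or len(absent) < min_size:
            continue
        gain = (
            sse(activities)
            - sse([a for _, a in present])
            - sse([a for _, a in absent])
        )
        if best is None or gain > best[0]:
            best = (gain, feature, present, absent)

    # Recurse only while the best split clears the statistical threshold.
    if best is not None and best[0] >= min_gain:
        _, feature, present, absent = best
        remaining = [f for f in features if f != feature]
        node["split"] = feature
        node["present"] = partition(present, remaining, min_gain, min_size)
        node["absent"] = partition(absent, remaining, min_gain, min_size)
    return node


# Hypothetical compound set: (structural features, measured activity).
DATA = [
    ({"indole", "amide"}, 8.1),
    ({"indole"}, 7.6),
    ({"indole", "phenol"}, 7.9),
    ({"amide"}, 5.2),
    ({"phenol"}, 4.9),
    ({"amide", "phenol"}, 5.0),
]

tree = partition(DATA, ["indole", "amide", "phenol"])
print(tree)
```

The nested dictionary plays the role of the dendrogram: the root node is the full compound set, and each split node's `present` and `absent` children together partition their parent.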
There are a number of problems with integrating commercially available structure-activity software into the iterative drug discovery process of design, synthesis, testing, analysis and hypothesis formulation. For example, the molecular descriptors used for correlations in the available software are difficult for medicinal chemists to use in designing a compound set for the next iteration of the discovery process. Further, many of the commercial software packages require expertise outside the typical medicinal chemist's knowledge and experience. The raw assay results will typically need to be processed first by a computational chemist. This is time-consuming, and the medicinal chemist cannot participate and use his or her intuition and experience to guide the process. Another problem with presently available software packages is that there are a tremendous number of molecular descriptors to choose from, and selecting an optimal descriptor set can be time-consuming and may require the assistance of a statistician to avoid problems such as collinearity and over-selection of descriptors.
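To make the collinearity problem concrete, the following sketch flags pairs of descriptors whose values are almost perfectly correlated across a compound set, so that including both in a regression adds little information. The descriptor names, values, and the 0.95 cutoff are all hypothetical, and a pairwise Pearson check is only a first screen; it does not detect collinearity involving three or more descriptors.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def collinear_pairs(descriptors, cutoff=0.95):
    """Return (name_a, name_b, r) for descriptor pairs with |r| >= cutoff."""
    flagged = []
    for a, b in combinations(descriptors, 2):
        r = pearson(descriptors[a], descriptors[b])
        if abs(r) >= cutoff:
            flagged.append((a, b, r))
    return flagged


# Hypothetical descriptor table: one value per compound, four compounds.
DESCRIPTORS = {
    "mol_weight": [100.0, 150.0, 200.0, 250.0],
    "heavy_atoms": [7.0, 11.0, 14.0, 18.0],
    "logp": [1.2, 0.3, 2.5, 0.9],
}

flagged = collinear_pairs(DESCRIPTORS)
for a, b, r in flagged:
    print(f"{a} and {b} are nearly collinear (r = {r:.3f})")
```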