This invention relates generally to a method of determining relationships between the structure or properties of chemical compounds and the biological activity of those compounds.
The pharmaceutical and biotechnology industries are continuously searching for effective therapeutic or diagnostic agents. The process for finding effective agents includes target identification, ligand identification, toxicology and clinical trials.
Target identification is the identification of a particular biological component, namely a protein, and its association with particular disease states or regulatory systems. A protein identified in a search for a chemical compound (drug) that can affect a disease or its symptoms is called a target. Proteins are large chemical compounds comprising a polymer chain of amino acids. The word protein is used herein to refer to any chemical compound that is involved in the regulation or control of biological systems (e.g. enzymes) and whose function can be interfered with by a drug.
The word disease is used herein to refer to an acquired condition or genetic condition. A disease can alter the normal biological systems of the body, causing an overabundance or underabundance of chemical compounds (i.e. a “chemical imbalance”). The regulatory systems for these chemical compounds involve the use, by the body, of certain proteins to detect imbalances or cause the body to produce neutralizing compounds in an attempt to restore the chemical balance. The word body is used herein to refer to any biological system: e.g. plant, animal or bacterial.
Ligand identification includes the search for a chemical compound that binds to a particular target. A ligand is a chemical compound that can attach itself to a protein and interfere with the normal functioning of the protein. A useful analogy is viewing the protein as a “lock” and the ligand as a “key.” A ligand that fits the “lock” is called “active.”
Toxicology and clinical trials involve characterizing the effects on the entire body of an identified ligand for a particular target. Additionally, the overall effectiveness against the disease must be measured. These efforts are conducted in model bodies (i.e. generally animals) and then ultimately in the intended body (i.e. generally humans).
The present invention relates to ligand identification. In other words, a target has been identified and the identity of an active ligand is desired. Ligand identification generally involves the developing of a hypothesis that a particular chemical compound will be active, performing a physical experiment to determine if the hypothesized compound is active, and if the compound is not active, then returning to the step of developing a hypothesis.
There are several methods available for developing hypotheses that a particular chemical compound will be active.
A very slow and unpredictable process is introspection. That is, the expertise gained by humans in the hypothesis-experiment process can be put to use in developing new hypotheses regarding the selection of candidate ligands.
Computer simulation methods have also been proposed to reduce the cost of physical experiments. These methods include simulations of activity and suggestions for new candidate ligands. These simulations have not had broad success and are generally too slow and unreliable unless a number of active compounds have already been discovered and minor modifications are desired to improve some property.
The current method of choice is generally called high throughput screening (HTS). This includes the automation of the physical experiment step with robots so that hundreds of thousands or millions of experiments can be performed in a short period of time. This process has allowed a brute-force approach to ligand discovery. The hypothesis phase consists of obtaining large collections of molecules either from external suppliers or through combinatorial chemistry type production of large numbers of compounds. Combinatorial chemistry is a methodology in which many chemical reactions are performed simultaneously to produce a large collection of compounds. The large collection of compounds can then be physically tested with robots and activity results measured.
The universe of possible ligands is extremely large, estimated at between 10^40 and 10^400 compounds. Accordingly, even with HTS approaches it is impossible to physically test all possible ligand candidates. Thus, methods are needed to discard the majority of the possibilities in advance or as the search proceeds.
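The scale of the problem can be illustrated with a short calculation. The sketch below reads the lower estimate as ten to the fortieth power and assumes a throughput of one million experiments per day; that throughput figure and the function name `years_to_screen` are illustrative assumptions, not values from this document.

```python
# Back-of-the-envelope estimate of exhaustive screening time.
# ASSUMPTION: the daily throughput is hypothetical and chosen only for illustration.

def years_to_screen(num_compounds: int, experiments_per_day: int) -> float:
    """Years needed to test every compound at a fixed daily throughput."""
    days = num_compounds / experiments_per_day
    return days / 365

lower_bound = 10**40           # low end of the estimated ligand universe
assumed_throughput = 10**6     # hypothetical: one million assays per day

years = years_to_screen(lower_bound, assumed_throughput)
print(f"{years:.2e} years to screen the lower-bound universe")
```

Under these assumptions the result is on the order of 10^31 years, which is why candidates must be discarded without physical testing.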
It is generally accepted that the structure, composition, or physical properties of a ligand directly affect its biological activity against a target. The attempt to transform this qualitative belief into a quantitative method of activity assessment is known as the determination of Quantitative Structure Activity Relationships, or QSAR. QSAR began with the work of Hansch and was further developed by others. See Hansch, C., Fujita, T., ρ-σ-π Analysis, A Method for the Correlation of Biological Activity and Chemical Structure, J.Am.Chem.Soc., 1964; Cramer, R. D., Patterson, D. E., Bunce, J. D., Comparative Molecular Field Analysis (CoMFA), 1. Effect of Shape on Binding of Steroids to Carrier Proteins, J.Am.Chem.Soc., 1988, 110, 5959-5967; and Rogers, D., Hopfinger, A. J., Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships, J.Chem.Info.Comp.Sci., 1994, 34.
Determining a QSAR generally includes the following steps:
First, a quantitative measure of activity needs to be defined.
Second, the ligand needs to be expressed in some quantitative manner. This step generally includes selecting a collection of numbers that characterize the ligand. These numbers are called molecular descriptors or descriptors.
Then, a functional relationship between activity and the selected descriptors must be determined. This includes developing a mathematical function that has the property that “activity = a function of the descriptors”, to a suitably high level of accuracy.
The functional relationship and the molecular descriptors are generally used to predict the activity of new candidate ligands.
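The steps above can be sketched for the simplest possible case: one descriptor and a straight-line relationship fitted by ordinary least squares. The data values and the names `fit_line` and `predict` are hypothetical and serve only to illustrate the workflow; real QSAR models use many descriptors.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + b for a single descriptor."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b

# Steps 1 and 2: a continuous activity measure and one descriptor per ligand.
# ASSUMPTION: these numbers are made up for illustration.
descriptor = [0.0, 1.0, 2.0, 3.0, 4.0]
activity = [1.0, 3.1, 4.9, 7.2, 8.8]

# Step 3: determine the functional relationship.
m, b = fit_line(descriptor, activity)

# Step 4: use the relationship to predict the activity of a new candidate.
def predict(x):
    return m * x + b

print(f"m = {m:.2f}, b = {b:.2f}, predicted activity at x = 5: {predict(5.0):.2f}")
```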
Activity is traditionally measured as the amount of ligand needed to produce a particular interference with a target. The amount needed is on a continuous scale.
The selection of molecular descriptors is usually target-specific. Physical properties are often used. Mathematical properties based on the line drawing of a chemical compound are also used. The use of the electric field of the ligand as a molecular descriptor is called Comparative Molecular Field Analysis (CoMFA) and has been the subject of previous patents. Other molecular descriptor sets include “fingerprints” or “holograms”, which are descriptions of small sub-structures in the ligand.
The most widely used method of determining the functional relationship is the statistical technique of regression or least squares. Techniques such as genetic algorithms and partial least squares are used to select the “important” descriptors from the “less important” descriptors or “noise”.
The use of high throughput screening (HTS) to identify active compounds has greatly challenged commonly used QSAR techniques. HTS usually generates large amounts of assay data, which initially classifies compounds as active or inactive. In addition, compounds in screening libraries are typically noncongeneric, i.e., they do not share similar core structures. This makes it difficult, if not impossible, to analyze HTS data by classical QSAR techniques and to predict active compounds.
Higher throughput reduces the precision of the activity measurement. Many HTS technologies report a binary condition; a candidate ligand is either “active” or “inactive”. Some HTS technologies report a discrete measure; i.e. activity on a scale of 1 to 10. In either case, classical QSAR techniques require a continuous activity measurement, e.g. accurate to two to three decimal places.
Many HTS techniques have the unfortunate property that the activity measurement is error prone. The error rate is significant enough to warrant special attention since classical QSAR technology is very sensitive to error and outliers (data extremes). A significant error rate will neutralize the predictive capabilities of classical QSAR technology.
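The sensitivity of least squares to a single erroneous measurement can be shown with a small sketch. The data set below is hypothetical: ten points lying exactly on y = 2x + 1, with one measurement then replaced by zero to mimic a failed assay. The helper name `ols_slope` is likewise an assumption made for illustration.

```python
from statistics import fmean

def ols_slope(xs, ys):
    """Least squares slope of y on x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# ASSUMPTION: synthetic, noise-free data on the line y = 2x + 1.
xs = [float(i) for i in range(10)]
clean = [2.0 * x + 1.0 for x in xs]

# Corrupt a single observation (the last one) to simulate one assay error.
corrupted = clean[:-1] + [0.0]

clean_slope = ols_slope(xs, clean)
corrupted_slope = ols_slope(xs, corrupted)
print(f"clean slope: {clean_slope:.3f}, slope with one error: {corrupted_slope:.3f}")
```

In this constructed case one bad point out of ten cuts the estimated slope roughly in half, which is the kind of distortion a significant HTS error rate would introduce.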
To illustrate, consider the following simple example. Suppose that activity y is linearly related to a single descriptor x. The linear relationship is expressed as follows:
y=mx+b
A conventional data set would consist of n observations $(y_i, x_i)$. Without loss of generality it may be assumed that the slope is greater than zero ($m > 0$), that the $x_i$ have mean 0 and variance 1, and that activity is indicated by the condition $y < 0$ (i.e. when $x < -b/m$).
Using linear regression, the estimates for m and b are:
$$\hat{m} = \frac{1}{n}\sum_{i=1}^{n} y_i x_i, \quad \hat{b} = \bar{y}, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
When presented with HTS binary measurements (i.e. 1 is active and 0 is inactive) representing the condition that $y_i < 0$, the linear regression estimates become:
$$\hat{m} = \frac{1}{n}\sum_{x_i < -b/m} x_i, \quad \hat{b} = \frac{a}{n}$$
where a is the number of active compounds. These estimates are completely different from those obtained from non-binary input (e.g., the b estimate is always in the range [0,1] for binary data). For example, the estimated descriptor value at the boundary between active and inactive is:
$$x = \frac{-1}{\sum_{x_i < -b/m} x_i / a}$$
This is inversely proportional to the mean active descriptor value. Contrast the above equation, which was developed with linear regression, with $-b/m$, the true descriptor value at the boundary. The assumptions of linear regression are not satisfied with binary HTS data.
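The example can be checked numerically. The sketch below assumes m = 2 and b = 1 (so the true boundary is at x = -0.5), draws standardized descriptor values from a normal distribution, and applies the regression estimates first to the continuous activity and then to the binary labels. The parameter values, sample size, random seed and variable names are all assumptions chosen for illustration.

```python
import random
from statistics import fmean, pstdev

random.seed(0)
n = 10_000
m_true, b_true = 2.0, 1.0   # ASSUMPTION: arbitrary illustrative values, m > 0

# Descriptor values standardized to mean 0 and variance 1.
raw = [random.gauss(0.0, 1.0) for _ in range(n)]
mu, sd = fmean(raw), pstdev(raw)
x = [(r - mu) / sd for r in raw]
y = [m_true * xi + b_true for xi in x]

# Regression estimates from the continuous activity values.
m_hat = sum(yi * xi for yi, xi in zip(y, x)) / n
b_hat = fmean(y)

# Binary HTS-style labels: 1 (active) when y < 0, else 0 (inactive).
z = [1 if yi < 0 else 0 for yi in y]
a = sum(z)                                        # number of active compounds
m_bin = sum(zi * xi for zi, xi in zip(z, x)) / n  # sum runs over actives only
b_bin = a / n

true_boundary = -b_true / m_true
est_boundary = -b_bin / m_bin   # equals -1 / (mean active descriptor value)

print(f"continuous fit: m = {m_hat:.3f}, b = {b_hat:.3f}")
print(f"binary fit:     m = {m_bin:.3f}, b = {b_bin:.3f}")
print(f"true boundary {true_boundary:.3f} vs estimated {est_boundary:.3f}")
```

With continuous data the estimates recover m and b. With binary labels every active descriptor value lies below the true boundary, so their mean is negative and the estimated boundary is positive, on the opposite side of zero from the true value of -0.5.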
It is an object of the present invention to provide a method for developing a quantitative structure activity relationship that overcomes the shortfalls of the prior art.
Another object of the present invention is to provide a method for developing a quantitative structure activity relationship that allows a candidate compound for a particular target to be predicted as either active or inactive.
A further object of the present invention is to provide a method for developing a quantitative structure activity relationship that is less sensitive to High Throughput Screening input data error and outliers than the prior art.
Still a further object of the present invention is to provide a method for developing a quantitative structure activity relationship and analyzing candidate compounds with the use of computer equipment.
Yet a further object of the present invention is to provide a method for developing a quantitative structure activity relationship that is not significantly influenced by data boundary effects.
Still a further object of the present invention is to predict whether or not a chemical compound is a member of a particular set.
Yet another object of the present invention is to provide a method for developing a quantitative structure activity relationship that includes obtaining a training set of chemical compounds with molecular descriptors consisting of a number of multidimensional vectors with an activity class for each of the vectors; partitioning the multidimensional vectors into groups having interdependence; transforming the descriptors such that the interdependence of the groups is lessened; estimating a probability distribution of the descriptors by assuming that the probability distribution of the product of each of the groups is approximately equal to the probability distribution of the molecular descriptors; performing the partitioning, transforming and estimating steps for each of the activity classes; and, developing a probability distribution for the activity classes.
Still a further object of the present invention is to provide a method for predicting activity of candidate ligands that includes developing a prediction model; obtaining a candidate chemical compound; and, applying the prediction model to the candidate compound.
Yet another object of the present invention is to provide a system for predicting activity of candidate compounds as either active or inactive that includes an analyzer that receives a training set of chemical compounds; a prediction model that is developed by the analyzer and is based on the training set; and, a sorter that receives a candidate ligand and receives the model from the analyzer, wherein the sorter applies the model to the candidate ligand to predict the activity of the candidate ligand.
Still a further object of the present invention is to provide a computer-based method of generating a quantitative structure activity relationship that includes calculating a numerical representation of molecules consisting of n numbers per molecule; and, estimating a probability distribution that a molecule is active.
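The general shape of a probabilistic analyzer and sorter of the kind recited in the objects above can be sketched as follows. This is not the method of the invention. It is a heavily simplified stand-in under stated assumptions: each group contains a single descriptor, the transformation that lessens interdependence is omitted, and each per-class descriptor distribution is taken to be a one-dimensional Gaussian. Under those assumptions the sketch reduces to a Gaussian naive Bayes classifier. All function names and data values are hypothetical.

```python
import math
from statistics import fmean, pstdev

def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def train(vectors, labels):
    """Analyzer: estimate a prior and per-descriptor Gaussians for each class."""
    model = {}
    for cls in set(labels):
        rows = [v for v, lab in zip(vectors, labels) if lab == cls]
        columns = list(zip(*rows))
        # ASSUMPTION: a small floor on sigma avoids division by zero.
        params = [(fmean(c), max(pstdev(c), 1e-6)) for c in columns]
        model[cls] = {"prior": len(rows) / len(vectors), "params": params}
    return model

def predict_proba(model, vector):
    """Sorter: posterior probability of each activity class for one candidate."""
    joint = {}
    for cls, info in model.items():
        density = info["prior"]
        # Product over groups approximates the joint descriptor distribution.
        for x, (mu, sigma) in zip(vector, info["params"]):
            density *= gaussian_pdf(x, mu, sigma)
        joint[cls] = density
    total = sum(joint.values())
    return {cls: d / total for cls, d in joint.items()}

# ASSUMPTION: a made-up training set with two descriptors per compound.
training_vectors = [(0.1, 1.0), (0.3, 1.2), (0.2, 0.9),
                    (2.0, 3.1), (2.2, 2.9), (1.9, 3.0)]
training_labels = ["inactive"] * 3 + ["active"] * 3

model = train(training_vectors, training_labels)
probs = predict_proba(model, (2.1, 3.0))
print(max(probs, key=probs.get))
```

Because the output is a class probability instead of a fitted continuous value, binary HTS labels can be used directly as training input. A practical implementation would work with log densities to avoid numerical underflow when many descriptors are multiplied together.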