The present invention relates to an improved computational method for predicting a property and/or performance of polymers, and/or identifying and designing polymers that provide said desired property and/or performance, wherein the desired property can be provided by the neat, undiluted polymers, or diluted polymers in a composition.
An experienced chemist can tell much about the chemical reactivity or physical properties of a molecule just by looking at its structure. As the pool of chemical experience and knowledge accumulates, and the speed of computers increases, there is a growing desire to design methods to correlate the chemical and physical properties as well as other useful properties (such as biological activities) of the chemicals to their chemical structure.
The general method is described as a quantitative structure-activity relationship (QSAR) or quantitative structure-property relationship (QSPR), and is described in, e.g., H. Kubinyi in QSAR: Hansch Analysis and Related Approaches, published by VCH, Weinheim, Germany, 1993, and D. J. Livingstone, Structure Property Correlations in Molecular Design, in Structure-Property Correlations in Drug Research, Han van de Waterbeemd, ed., Academic Press, 1996, said publications being incorporated herein by reference. In this method the structures of a representative set of materials are characterized using physical properties such as log P (base-10 logarithm of the octanol-water partition coefficient P), fragment constants like Hammett's sigma, or any of a large number of computed molecular descriptors (for example, see P. C. Jurs, S. L. Dixon, and L. M. Egolf, Representations of Molecules, in Chemometric Methods in Molecular Design, Han van de Waterbeemd, ed., published by VCH, Weinheim, Germany, 1995).
In the general case, a "representative set", sometimes also called a "training set", of materials is a collection of materials that represent the expected range of change in both the property of interest (the property to be predicted using the model) and also the range of molecular structure types to which the model is designed to apply. The size of the set of materials necessary to constitute a "representative set" is dependent on the diversity of the target structures and the range of property values for which the model needs to be valid. Typically, one needs to have about 20 to about 25 materials to begin to generate statistically valid models. However, it is possible to obtain valid models with smaller sets of materials if there is a large degree of similarity between the molecular structures. A general rule of thumb suggests that the final model should include at least about five unique materials in a training set for each parameter (molecular descriptor or physical property) in the model in order to achieve a statistically stable equation and to avoid "overfitting", the inclusion of statistical noise in the model. The range of the experimental property being modeled must also be broad enough to be able to detect statistically significant differences between members of the representative set given the magnitude of the uncertainty associated with the experimental measurement. For biological properties, a typical minimum range is about two orders of magnitude (a 100-fold difference between the lowest and highest values) because of the relatively large uncertainty associated with biological experiments. The minimum range requirement for physical properties (e.g., boiling points, surface tension, aqueous solubility) is usually smaller because of the greater accuracy and precision achieved in measuring such properties.
There are practical limits to the size of the molecules that can be studied using known QSAR techniques. Typically, these methods are applied to small organic molecules. The term "small" usually refers to non-polymeric materials with fewer than about 200 atoms including hydrogens. The practical reason for this limitation is that the vast majority of calculated molecular descriptors begin to lose the ability to distinguish one structure from another as the size of the molecules gets larger. For example, the addition of one methyl group (a carbon and three hydrogens) to benzene increases the molecular weight (an example of a molecular descriptor) by about 17.9% whereas the addition of the same methyl group to a C100 linear alkane changes the molecular weight by less than 1%.
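The loss of sensitivity described above can be checked with a short calculation. The following is a minimal sketch, assuming standard rounded atomic masses and treating the added methyl group as replacing one hydrogen (a net gain of CH2); the function names are illustrative only and do not come from the cited references.

```python
# Rounded standard atomic masses in g/mol (assumed values for illustration).
ATOMIC_MASS = {"C": 12.011, "H": 1.008}


def molecular_weight(n_carbon, n_hydrogen):
    """Molecular weight of a hydrocarbon with the given atom counts."""
    return n_carbon * ATOMIC_MASS["C"] + n_hydrogen * ATOMIC_MASS["H"]


def percent_change_on_methylation(n_carbon, n_hydrogen):
    """Percent change in molecular weight when one H is replaced by CH3."""
    before = molecular_weight(n_carbon, n_hydrogen)
    after = molecular_weight(n_carbon + 1, n_hydrogen + 2)
    return 100.0 * (after - before) / before


benzene_change = percent_change_on_methylation(6, 6)      # C6H6 -> C7H8
alkane_change = percent_change_on_methylation(100, 202)   # C100H202 -> C101H204
print(f"benzene: {benzene_change:.2f}%  C100 alkane: {alkane_change:.3f}%")
```

The same structural change moves the descriptor by roughly 18% for benzene and by only about 1% for the C100 alkane, which illustrates why whole-molecule descriptors of this kind discriminate poorly between large structures.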
The model developed is often a multivariate (involving many parameters) linear regression equation that is computed by regressing a selected set of molecular descriptors or physical properties against measured values of the property of interest (e.g., $Y = m_0 + m_1 x_1 + \dots + m_n x_n$, wherein $Y$ is the measured property of interest, $x_1, x_2, \dots, x_n$ are the molecular descriptors or physical properties, $m_0, m_1, \dots, m_n$ are the regression coefficients, and $n$ is the number of descriptors or physical properties in the model). A number of different methods have been employed for the selection of the parameters to be included in the regression equation, such as stepwise regression, stepwise regression with progressive deletion, best-subsets regression, etc. More recently, evolutionary methods such as genetic algorithms, or learning machines such as neural networks, have been used for parameter selection.
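As an illustration of the regression equation above, the following is a minimal sketch of an ordinary least-squares fit using only the Python standard library. It solves the normal equations by Gaussian elimination; the function names and the small data set are hypothetical, and a statistical package would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_model(descriptor_rows, measured):
    """Return [m0, m1, ..., mn] for Y = m0 + m1*x1 + ... + mn*xn."""
    rows = [[1.0] + list(r) for r in descriptor_rows]  # leading 1 gives m0
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, measured)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from Y = 1 + 2*x1 - 3*x2 (no noise).
descriptor_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
measured = [1.0, 3.0, -2.0, 0.0, 2.0, -3.0]
coefficients = fit_linear_model(descriptor_rows, measured)
print(coefficients)
```

Because the example data are noise-free, the fit recovers the generating coefficients to within floating-point error. The sketch does not address descriptor selection, which the methods listed above (stepwise regression, genetic algorithms, and so on) are meant to handle.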
The first indicator used to judge the quality of a regression model is the coefficient of multiple determination, or $R^2$. This measures the proportion of the variation of the observed property (the property being modeled, the dependent variable) that is accounted for by the set of descriptors (independent variables) in the model. The correlation coefficient between the fitted property values (calculated using the model) and the experimentally observed property values is termed the coefficient of multiple correlation, commonly called the correlation coefficient, or $R$, which is the positive square root of $R^2$. All commercial statistical packages report $R^2$ as a standard part of the results of a regression analysis. A high $R^2$ value is a necessary, but not a sufficient, condition for a good model. It is important that a model account for as much variation in the dependent variable as possible. However, the validity of the model must be determined using a variety of other criteria.
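The coefficient of multiple determination can be computed directly from observed and fitted values. The sketch below assumes the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The numbers are hypothetical and were not produced by an actual regression, so the sketch illustrates only the arithmetic.

```python
import math


def coefficient_of_determination(observed, fitted):
    """Return 1 - SS_residual / SS_total for paired observed and fitted values."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical observed property values and model-fitted values.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]

r_squared = coefficient_of_determination(observed, fitted)
multiple_r = math.sqrt(r_squared)  # positive square root, as defined above
print(round(r_squared, 4), round(multiple_r, 4))
```

For a least-squares fit that includes an intercept term, this quantity equals the square of the correlation between fitted and observed values; for arbitrary predictions it can differ and can even be negative.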
Once a model has been developed, it must be validated. This process includes the consideration of statistical validation of the model as a whole (e.g., overall-F value from analysis of variance, AOV) and of the individual coefficients of the equation (e.g., partial-F values), analysis of collinearity between the independent variables (e.g., variance inflation factors, or VIF), and the statistical analysis of stability (e.g., cross-validation). Most commercial statistics software can compute and report these diagnostic values. If possible, one employs an "external prediction set", a set of materials for which the property of interest has been measured, but which were not included in the development of the model, to evaluate and demonstrate the predictive accuracy of the model.
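One of the stability checks named above, cross-validation, can be sketched as follows. This is a minimal leave-one-out example for a model with a single descriptor, assuming the commonly used cross-validated statistic of one minus PRESS divided by the total sum of squares. The function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_q_squared(xs, ys):
    """Refit with each material left out in turn and score its prediction."""
    press = 0.0  # predictive residual sum of squares
    for i in range(len(xs)):
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (intercept + slope * xs[i])) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total


# Hypothetical descriptor values and measured property values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_exact = [2.0, 4.0, 6.0, 8.0, 10.0]
ys_noisy = [2.1, 3.9, 6.2, 7.8, 10.1]

q2_exact = leave_one_out_q_squared(xs, ys_exact)
q2_noisy = leave_one_out_q_squared(xs, ys_noisy)
print(q2_exact, q2_noisy)
```

A cross-validated value that falls well below the fitted coefficient of determination suggests that the model depends heavily on individual members of the training set. An external prediction set, where available, remains the more direct test of predictive accuracy.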
A wide variety of software is available to perform various parts of the model development process. Descriptors can be pulled from databases (e.g., in the case of fragmental constants), or computed directly from the molecular structure of the materials. Non-limiting examples of programs which can be used to compute descriptors are SYBYL (Tripos, Inc., St. Louis, Mo.), Cerius2 (Accelrys, Princeton, N.J.), and ADAPT (P. C. Jurs, Pennsylvania State University, University Park, Pa.). These same programs can also be used to perform the statistical model development, which includes the determination of the correlation coefficient between the computed estimates and the experimentally-derived property of interest, plus subsequent model validation. Alternatively, commercial statistical programs like Minitab for Windows (Minitab, Inc., State College, Pa.) can be used to generate and validate model equations.
One approach commonly used in QSAR/QSPR work for describing the chemical structure of molecules in detail is the group contribution method. In this approach, the structure of the molecule is divided into small fragments. The software keeps track of the number and type of each fragment. A database is then searched and a fragment constant is found for each fragment in the structure. The physical property is then estimated by summing, over all fragment types found in the structure, the fragment constant multiplied by the number of times that fragment occurs in the structure. For example, the group contribution method is used to compute and predict log P, the base-10 logarithm of the partition coefficient P, as described in A. Leo, Comprehensive Medicinal Chemistry, Vol. 4, C. Hansch, P. G. Sammes, J. B. Taylor and C. A. Ramsden, Eds., p. 295, Pergamon Press, 1990, incorporated herein by reference. Alternatively, a model developed to estimate and predict normal boiling points using whole-molecule structure descriptors is described in "Development of a Quantitative Structure-Property Relationship Model for Estimating Normal Boiling Points of Small Multifunctional Organic Molecules", David T. Stanton, Journal of Chemical Information and Computer Sciences, Vol. 40, No. 1, 2000, pp. 81-90, incorporated herein by reference. In this approach, the structure is not divided into fragments. Rather, measurements of a variety of structural features are computed using the whole structure. For most of these small molecules, the chemical structure can be described quickly and accurately using these types of approaches.
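The group contribution calculation described above amounts to a count-weighted sum of fragment constants. The following is a minimal sketch. The fragment names and constants are hypothetical placeholders chosen for easy arithmetic and are not values from the cited log P tables.

```python
from collections import Counter

# Hypothetical fragment constants (placeholders, not published values).
FRAGMENT_CONSTANTS = {"CH3": 0.5, "CH2": 0.25, "OH": -1.0}


def estimate_property(fragments, constants):
    """Sum each fragment constant multiplied by that fragment's count."""
    counts = Counter(fragments)
    missing = sorted(f for f in counts if f not in constants)
    if missing:
        # Without a constant for every fragment, no estimate can be made.
        raise KeyError(f"no fragment constant for: {missing}")
    return sum(constants[f] * n for f, n in counts.items())


# A structure decomposed into one CH3, three CH2 and one OH fragment.
fragments = ["CH3", "CH2", "CH2", "CH2", "OH"]
estimate = estimate_property(fragments, FRAGMENT_CONSTANTS)
print(estimate)
```

The explicit error for an unknown fragment reflects a practical limit of the method: a structure containing a fragment that is absent from the database cannot be estimated at all.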
There are also efforts to apply QSAR/QSPR methods to various classes of polymers including homopolymers and copolymers. A polymer is a chemical compound or mixture of compounds formed by polymerization and consisting essentially of repeating structural units called monomers. A homopolymer is comprised of essentially one type of monomer. A copolymer is comprised of more than one type of monomer. Approaches that are useful for small molecules, however, are typically not applicable for developing predictive polymer QSARs. The number of atoms in the polymer molecule is usually much larger, and thus developing the necessary descriptors for the group contribution method requires very large sets of experimental data. If a polymer contains a structural unit whose additive contribution to a certain property cannot be estimated, the value of that property cannot be predicted for that polymer. Attempts to by-pass the need for large sets of experimental data necessary to develop group contribution descriptors can result in time-consuming force-field or quantum mechanical calculations, which often fail to provide accurate descriptors. Both approaches have been investigated by A. J. Hopfinger, M. G. Koehler, R. A. Pearlstein, and S. K. Tripathy in Journal of Polymer Science, Polymer Physics Edition, Vol. 26, 1988, pp. 2007-2028, and by J. Bicerano in Prediction of Polymer Properties, 2nd edition, Marcel Dekker, Inc., New York, Basel, 1996, incorporated herein by reference. Furthermore, except for some natural polymers such as enzymes, most polymers, especially synthetic polymers, are mixtures of polymeric molecules of various molecular weights, sizes, structures and compositions. Commercially available polymers, especially those that are used by industry on a large scale, commonly contain certain levels of unreacted fragments and/or by-products. In most cases, there is not one exact chemical formula or structure that can describe such a polymer.
Such polymers are characterized most commonly by their average properties, such as average molecular weight, viscosity, glass transition temperature, melting point, solubility, cloud point, heat capacity, interfacial tension and adhesion, refractive index, stress relaxation, shear, conductivity, permeability, and the like. Another common way that polymers are characterized is by the number and type of monomers. Polymers are also sometimes defined by the amounts of starting ingredients used in the polymerization process; from the starting ingredients and the conditions under which the polymerization reaction proceeds, one can sometimes derive a generalized structure and/or formula of the resulting polymer.
Applications of QSAR/QSPR approaches to polymers typically use descriptors derived for repeat units, such as molecular weight of a repeat unit, end-to-end distance of a repeat unit in its fully extended conformation, van der Waals volume of a repeat unit, positive and negative partial surface area normalized by the number of atoms, topological Randic index computed for a repeat unit, cohesive energy, which can be estimated using the group contribution method, and a parameter related to the number of rotational degrees of freedom of the backbone of a polymer chain, which can be derived from the structure of a repeat unit, as described by J. T. Seitz in Journal of Applied Polymer Science, Vol. 49, 1993, pp. 1331-1351, or by topological connectivity indices as described by J. Bicerano in Prediction of Polymer Properties, 2nd edition, Marcel Dekker, Inc., New York, Basel, 1996, both of which are incorporated herein by reference.
Most QSAR/QSPR polymer models correlate theoretically calculated molecular descriptors of a repeating unit with bulk physical properties of the polymer, such as glass transition temperature, refractive index, heat capacity, diamagnetic susceptibility, viscosity, thermal conductivity, and the like. In addition, development of these models requires atomic and/or group correction terms. Another approach to predicting properties of homopolymers of a regular structure is to model three repeating units for each polymer and calculate descriptors only for the middle unit. In this way the influence of the adjacent units can also be taken into account, as described by A. R. Katritzky et al. in Journal of Chemical Information and Computer Sciences, Vol. 38, 1998, pp. 300-304, incorporated herein by reference. However, a limitation of these models is that they are applicable only to homopolymers and cannot be easily reapplied to block and/or random copolymers.
One approach to predicting properties of copolymers is to develop and calculate applicable group contribution descriptors and to extend existing group contribution tables. This, however, requires large experimental data sets. An approach to overcome this deficiency for alternating block copolymers is to treat blocks of a copolymer as separate polymers and assume simple additivity rules for prediction of extensive properties, as described by J. Bicerano in Prediction of Polymer Properties, 2nd edition, Marcel Dekker, Inc., New York, Basel, 1996, incorporated herein by reference. Calculation of the properties of random copolymers requires using weighted averages (from molar fractions of repeating units) of all extensive properties and appropriate definitions for the intensive properties in terms of the extensive properties, as described by J. Bicerano in Prediction of Polymer Properties, 2nd edition, cited herein above.
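The weighted-average treatment of random copolymers can be illustrated with a short calculation. The sketch below assumes two hypothetical repeat units, A and B, with made-up molar masses and molar volumes. Each extensive property is averaged by mole fraction, and density is then defined as an intensive property from the two averaged extensive properties. This is an illustration of the idea only, not a reproduction of the cited procedure.

```python
def mole_fraction_average(mole_fractions, unit_values):
    """Mole-fraction-weighted average of an extensive repeat-unit property."""
    total = sum(mole_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("mole fractions must sum to 1")
    return sum(x * unit_values[unit] for unit, x in mole_fractions.items())


# Hypothetical random copolymer: 25 mol% unit A, 75 mol% unit B.
mole_fractions = {"A": 0.25, "B": 0.75}
molar_mass = {"A": 100.0, "B": 60.0}    # g/mol per repeat unit (made up)
molar_volume = {"A": 80.0, "B": 40.0}   # cm^3/mol per repeat unit (made up)

average_mass = mole_fraction_average(mole_fractions, molar_mass)
average_volume = mole_fraction_average(mole_fractions, molar_volume)

# Intensive property defined in terms of the averaged extensive properties.
density = average_mass / average_volume
print(average_mass, average_volume, density)
```

Averaging the extensive properties first and deriving the intensive property afterwards generally gives a different result from averaging the unit densities directly, which is why the definitions of the intensive properties matter.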
The present invention relates to a novel approach of QSAR for polymers wherein the descriptors used are structural descriptors, which are experimentally generated and/or derived using one or more analytical methods. The term polymer as used herein comprises both homopolymer and copolymer, and mixtures thereof.
The present invention relates to a method for identifying a predictive model from which to select existing polymers, and/or to prepare new polymers having a desired property, the method comprising the steps of:
a. identifying a set of existing polymers including representatives having a broad range of values of the desired property;
b. determining the desired property for each of the polymers in the set, wherein the property of each polymer has a numerical value;
c. generating quantitative structural descriptors that characterize at least a portion of the molecular structure, preferably characterizing the whole molecular structure, of each polymer of the set of polymers; and
d. identifying a mathematical function that relates a selected group of quantitative structural descriptors to the desired property, said group comprising at least 2, preferably at least 3 quantitative structural descriptors, preferably from 2 to about 10, more preferably from 2 to about 6, and even more preferably from 2 to 4 quantitative structural descriptors, the predictive model comprising the identified mathematical function;
wherein the desired property can be provided by the neat, undiluted polymer, but preferably the desired property is provided by the polymer in a composition, more preferably the desired property is a useful functional property in a consumer product composition and/or industrial composition, and even more preferably the desired property is a consumer relevant property provided by the polymer under use conditions in a consumer product composition comprising the polymer.
The method of the present invention can further comprise the steps of:
e. identifying one or more additional mathematical function(s); and
f. determining which mathematical function more accurately correlates molecular structure with the desired property.
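Steps e and f can be sketched as a comparison of candidate functions against the measured property values. The example below is a minimal illustration that assumes root-mean-square error as the accuracy criterion, which the text above does not prescribe. The candidate functions, descriptor values and measured values are all hypothetical.

```python
import math


def rms_error(model, descriptor_rows, measured):
    """Root-mean-square difference between predictions and measurements."""
    squared = [(model(row) - y) ** 2 for row, y in zip(descriptor_rows, measured)]
    return math.sqrt(sum(squared) / len(squared))


def most_accurate(candidates, descriptor_rows, measured):
    """Return the name and error of the candidate with the lowest RMS error."""
    errors = {name: rms_error(f, descriptor_rows, measured)
              for name, f in candidates.items()}
    best = min(errors, key=errors.get)
    return best, errors[best]


# Hypothetical structural descriptors (two per polymer) and measured values.
descriptor_rows = [(1.0, 0.0), (2.0, 1.0), (3.0, 1.0), (4.0, 2.0)]
measured = [3.0, 5.5, 7.5, 10.0]

# Hypothetical candidate mathematical functions from steps d and e.
candidates = {
    "two_descriptor": lambda d: 1.0 + 2.0 * d[0] + 0.5 * d[1],
    "one_descriptor": lambda d: 1.0 + 2.0 * d[0],
}

best_name, best_error = most_accurate(candidates, descriptor_rows, measured)
print(best_name, best_error)
```

In practice the comparison would more usefully be made on polymers that were not used to fit either function, since a function with more descriptors tends to fit its own training set at least as well.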