The present invention relates to methods for determining and refining three dimensional structures of molecules, and more particularly a method for determining and refining three dimensional structures of molecules by non-linear recursive filtering of chemical information and structural observations.
Accurate knowledge of the three dimensional structure of macro-molecules is essential for modeling their biological functions and, thus, for structure-based drug design. Nuclear Magnetic Resonance (hereinafter referred to as NMR) technology makes it possible to measure quantitative characteristics of macro-molecules in solution. At the same time, this poses new challenges for the problem of converting the NMR data into a three dimensional structure. It is well understood that the NMR data, for example, Nuclear Overhauser Effect (hereinafter referred to as NOE) intensities have a complicated nonlinear structure arising from multiple-spin-effect, which is also known as the spin diffusion problem. Simplification of that information by attempting to un-couple spins leads to loss of accuracy. Direct refinement of the three dimensional structure by the NOE intensities poses, however, significant computational difficulties for large molecules, even within the current limit on the size of molecules accessible by NMR experiments. Also, the complexity of the measurement model is aggravated by various deficiencies in data such as experimental noise, uncertainties in the assignments of proton resonances and uncertainties in relaxation parameters and contributions from dynamical effects.
Traditionally, the problem of obtaining three dimensional structures from pre-processed experimental NMR data is divided into the distinct steps of determination and refinement. During determination, an initial family of structures is generated that approximately satisfy covalent and experimental restraints. Each member of the structural family is then refined against the experimental data to increase the final accuracy. This conceptual division is convenient due to the multiple-minima nature of a global functional representation of the optimization target for the entire structure. Historically, the algorithmic methods used during the determination step were distinct from the methods used during the refinement step. However, the dividing line can be blurred if algorithms are able to systematically handle the multiple-minima problem.
In NMR applications, the method of structure optimization is substantially defined by how the NMR data is treated. A significant effort in determination and refinement techniques of the NMR structures is aimed at utilizing more complete information from experimental NOE intensities. The results can be split into indirect and direct approaches. The indirect approach offers better interpretation of NOE intensities in terms of distances rather than the simplest classification of distances attributed to strong, medium and weak NOE peaks. The indirect approach is usually used at both determination and refinement stages. Various methods offer ways to smooth distances by triangle bound inequalities or by more accurate, but computationally intensive, tetrangle inequalities. In the indirect approach, relaxation matrix calculations are also used to account for all magnetization transfers, also known as spin diffusion, resulting in more accurate distance representations. The target distances can be iteratively modified manually or automatically to match observed and calculated two dimensional NOE intensities. Another approach is based on the back-transformation of the matrix of intensities to obtain the relaxation matrix and hence, the cross-relaxation rates. In practice the experimental NOE matrix is incompletely known due to overlap and missing assignments. This problem can be circumvented by iterative construction of a xe2x80x9chybridxe2x80x9d NOE matrix comprised of experimental and calculated intensities. A problem with indirect methods is the approximate nature of the distance xe2x80x9cmeasurementsxe2x80x9d which do not account properly for spin diffusion effects.
The direct approach directly incorporates NOE intensities as experimental constraints in powerful restrained dynamics simulations. This approach does not require the complete NOE matrix and provides significant improvement in accuracy due to direct accounting for spin diffusion effects. The main problem with the direct approach is that computing the gradient of the experimental potential, i.e. the gradient of each NOE intensity is very computationally intensive and needs to be done for each member of the structural family generated during the determination step, and for each time step in the restrained dynamics simulation.
Modern estimation techniques such as Kalman filters can be considered as alternative and supplementary to traditional global optimization techniques used for structural determination and refinement. The Double-Iterated Kalman Filter (hereinafter referred to as DIKF) was implemented to generate a protein structure from distances derived from NOE data, with geometrical restraints being interpreted as additional data. This approach enables both structure determination and a definitive estimate of its uncertainty and thus provides significant additional knowledge and insight on those regions of the protein which correspond to non-unique conformations.
Kalman filter techniques provide important new features for interpreting the quality of a determined molecular structure. However, since the Kalman filter is an optimal, recursive filter based on the linearization of measurement non-linearities, a significant amount of information is lost when the filter is applied to complicated, nonlinear structures such as NOE intensities. In addition, large matrix data structures are required whose straightforward manipulation severely limit the computational efficiency of these techniques when applied to large macro-molecules. Consequently, Kalman filters, including the DIKF, do not significantly improve the efficiency of determination and refinement algorithms of prior art systems, in terms of solving the multiple-minima problem.
There is a need for an improved method of determining and refining a three dimensional molecular structure, which utilizes non-linear processing techniques to solve the multiple-minima problem and which is computationally efficient.
It is therefore an object of the invention to provide an improved method of determining and refining a three dimensional molecular structure, which utilizes non-linear processing techniques to solve the multiple-minima problem in a computationally efficient manner.
Other objects and advantages of the present invention will become apparent upon consideration of the appended drawings and description thereof.
The foregoing and other objects are achieved by the invention which in one aspect comprises a computer assisted method for generating a representation of the three dimensional structure of a molecule. The representation is embodied in an n dimensional vector X=(X1, X2, . . . Xn)T, where n is an integer, and Xi is a structural parameter, for all i from 1 to n. The set of structural parameters X1 through Xn may include, but is not limited to, one or more of the following classes of structural parameters, or a combination thereof: (1) three dimensional Cartesian coordinates (i.e., x, y and z) for each of the N atoms; (2) three dimensional coordinates for each of the N atoms expressed in other suitable coordinate systems, for example, internal coordinate space or dihedral angle space; (3) order parameters; (4) B-factors; and, (5) any other parameters that describe features of a molecular structure. The invention also generates a covariance matrix PX which represents the statistical uncertainty of the generated vector of structural parameters.
The user of the invention provides as input a set, or sets, of observed data for the particular molecule under consideration. This observed data may consist of, but is not limited to, one or more of the following classes, or combinations thereof: (1) experimental NMR data, including COSY (Correlated Spectroscopy) data and NOESY (Nuclear Overhauser Effect Spectroscopy) data, also known as NOE intensities; (2) experimental X-ray crystallography measurements; (3) covalent information from molecular topology/parameter data bases, e.g., bond lengths, bond angles, dihedral angles; (4) interatomic distance measurements or estimates of such distances, derived from experimental data, homology considerations, or other source; and, (5) any other class of data that is relatable to structural characteristics of the molecule. In the most general case, the observed data input to the invention is signified by Yk, jobs, with associated noise xcex7k, j. The k index corresponds to different sets of observations as a function of time, for times k=1 through K. The j index corresponds to each component of the one or more classes of observed data identified above, for all components j=1 through J. Although the number of components J may depend on the time index k, such dependency is omitted for simplicity and without loss of generality. In most applications of the invention for molecular structure determination, only one time set will be available, and hence the input observed data will be a single vector composed of components from one or more of the classes identified above, Yl, jobs for all components j=1 through J. In cases where time sequenced measurements are available, the input observed data will be a matrix, Yk, jobs, for all k=1 to K and for all j=1 to J. In this more general case, the invention employs a propagation model to step from one time state k to the next.
The invention utilizes an analytical model g(X), which relates the observed data, Y, to the structural parameters, X. Thus, Ycalc(X)=g(X), where Ycalc(X) is the set of theoretical, i.e., modeled, observed data, as compared to the actual observed data Yk, jobs. The analytical model, g(X), which consists of mathematical expressions that relate each component of each class of observed data to one or more of the structural parameters Xi, may include, but is not limited to: (1) NMR measurements, i.e., NOE intensities, in terms of interatomic distances and the atomic coordinates; (2) X-ray crystallography measurements in terms of the atomic coordinates; (3) bond or non-bond distances between two atoms in terms of atomic coordinates; (4) bond angles and dihedral angles in terms of atomic coordinates; and, (5) mathematical relationships for any other class of observed data in terms of the appropriate structural parameters. The specific nature of the analytical model relating the observed data Y to the structural parameters X does not matter to the invention, as long as such a relationship exists and can be reduced to algorithmic form. In one aspect, for ease of use, the invention contains the analytic models for categories 1 through 4 above. In another aspect, the invention provides the means for the user to specify the analytic model algorithm relating arbitrary observed data to the corresponding structural parameters.
The user provides an initial state of the vector of structural parameters, X. The invention uses this initial state to start the nonlinear recursive filter (hereinafter referred to as NRF), which recursively processes each observed data element Yk, jobs from the kth observation set, for all j=1 to J. For each observed data element Yk, jobs, the NRF recursive step consists of a prediction step, followed by an update step. Outputs of the prediction step are denoted with a {circumflex over ( )} superscript, for example {circumflex over (X)}, and outputs of the update step are denoted with a * superscript, for example X*, when it is necessary to distinguish between the two. During the prediction step for the jth element, the invention uses the vector of structural parameters, X*[k, jxe2x88x921], and the associated covariance PX*[k, jxe2x88x921], from the previous (jxe2x88x921)th element processing, and the analytic model, g(X), to compute new calculated observed data, Yk, jcalc(X*[k, jxe2x88x921]), and associated covariances {circumflex over (P)}y[k, j]and {circumflex over (P)}XY[k, j]. The update step then computes the difference between Yk, jcalc({circumflex over (X)}[k, j]) (where {circumflex over (X)}[k, j]=X*[k, jxe2x88x921] is the trivial structural parameter vector prediction step) and Yk, jobs and uses the difference in conjunction with the covariance information {circumflex over (P)}Y[k, j] and {circumflex over (P)}XY[k, j] to generate a new vector of structural parameters X*[k, j], and associated covariance, PX*[k, j]. In this manner, the non-linearities contained in the analytic model are isolated in the prediction step, where they can be handled accurately and efficiently. The filtering process continues recursively for all elements of the observed data from j=1 to J, and if there are time varying observations, from k=1 to K. At the completion of the recursive filtering, the final generated vector of structural parameters, X*[K, J], represents the best estimate of the three dimensional structure of the molecule consistent with the input observed data, Y, associated uncertainties, xcex7, and the analytic model, g(X).
The covariance-based formulation of the NRF is an important aspect of the invention. First, the covariance, PX* provides a direct estimate of the uncertainty in the generated structural parameters. Second, it provides the memory mechanism from one recursive step to the next in the NRF. More generally, this memory mechanism enables the ability to restart, or further refine, a given molecular structure in response to new observed data without wasting the effort expended during the initial refinement.
The mathematical operations required for processing the covariance matrices in the NRF increase as the number of structural parameters for the molecule increase. One aspect of the invention called the Refinement Wave (hereinafter referred to as RW), takes advantage of the sparse nature of the covariance matrix, and manages the NRF processing in order that the mathematical operations and required computer memory do not become prohibitive for large molecule applications. To accomplish this, the invention operates on only a small fragment, the active fragment, of the covariance matrix at any one point during NRF processing. The RW utilizes a multi-level criterion to identify and select the appropriate active fragment of the covariance matrix for processing at each step of the NRF recursion. In one embodiment of the invention, a first level criterion can exclude portions of the covariance matrix corresponding to those pairs of atoms which cannot be closer than a specified cutoff (e.g., 10 xc3x85), given the initial structure and its a priori uncertainties. A second level criterion can identify non-essential covariances from the magnitudes of correlations evidenced in the mutual covariance PXY, and excludes all low magnitude elements from subsequent mathematical operations. Additional levels in the criterion can manage active/passive memory exchanges.
These aspects of the RW element of the invention enable an important asymptotic property, in that the computational time scales as n, where n is the number of structural parameters, as opposed to n2 or greater for most prior art methods. In addition, since the RW only requires a relatively small portion of the covariance matrix in active memory at one point during the processing, the invention can be applied, in principle, to a molecule of any size using only moderate computer memory and speed resources.