The present invention relates generally to the determination of crystal structure from the analysis of diffraction patterns, and, more particularly, to identification of protein crystal structure represented by electron density patterns.
The determination of macromolecular structures, e.g., proteins, by X-ray crystallography is a powerful tool for understanding the arrangement and function of such macromolecules. Very powerful experimental methods exist for determining crystallographic features, e.g., structure factors and phases. While the structure factor amplitudes can be determined quite well, it is frequently necessary to improve or extend the phases before a realistic atomic model of the macromolecule, such as an electron density map, can be built.
Many methods have been developed for improving the phases by modifying initial experimental electron density maps with prior knowledge of characteristics expected in these maps. The fundamental basis of density modification methods is that there are many possible sets of structure factors (amplitudes and phases) that are all reasonably probable based on the limited experimental data that is obtained from a particular experiment, and those structure factors that lead to maps that are most consistent with both the experimental data and the prior knowledge are the most likely overall. In these methods, the choice of prior information that is to be used and the procedure for combining prior information about electron density with experimentally-derived phase information are important features.
Until recently, electron density modification has generally been carried out in a two-step procedure that is iterated until convergence occurs. In the first step, an electron density map is obtained experimentally and then modified in real space in order to make it consistent with expectations. The modification can consist of, e.g., flattening solvent regions, averaging non-crystallographic symmetry-related regions, or histogram-matching. In the second step, phases are calculated from the modified map and are combined with the experimental phases to form a new phase set.
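The two-step cycle described above can be sketched on a one-dimensional toy map. This is only an illustration under stated assumptions: the function names (`dft`, `inverse_dft`, `flatten_solvent`, `density_modification_cycle`), the one-dimensional unit cell, the choice of solvent flattening as the modification, and the fixed `weight` used to recombine phases are all hypothetical and are not taken from any published program.

```python
import cmath
import math

def inverse_dft(F):
    """Real-space map from structure factors (real part of a naive inverse DFT)."""
    n = len(F)
    return [sum(F[h] * cmath.exp(2j * math.pi * h * x / n) for h in range(n)).real / n
            for x in range(n)]

def dft(rho):
    """Structure factors from a real-space map (naive forward DFT)."""
    n = len(rho)
    return [sum(rho[x] * cmath.exp(-2j * math.pi * h * x / n) for x in range(n))
            for h in range(n)]

def flatten_solvent(rho, solvent_mask):
    """Step 1: replace density at solvent points by the mean solvent density.

    Assumes the mask marks at least one solvent point.
    """
    solvent = [r for r, s in zip(rho, solvent_mask) if s]
    mean = sum(solvent) / len(solvent)
    return [mean if s else r for r, s in zip(rho, solvent_mask)]

def density_modification_cycle(amplitudes, phases, solvent_mask, weight=0.5):
    """One cycle: modify the map in real space, then recombine phases.

    `weight` is an assumed, fixed weight given to the experimental phases.
    """
    F = [a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases)]
    modified = flatten_solvent(inverse_dft(F), solvent_mask)
    new_phases = []
    for p_exp, F_mod in zip(phases, dft(modified)):
        p_mod = cmath.phase(F_mod)
        # Step 2: combine the two phase estimates as weighted unit phasors.
        combined = weight * cmath.exp(1j * p_exp) + (1 - weight) * cmath.exp(1j * p_mod)
        new_phases.append(cmath.phase(combined))
    return new_phases
```

In a real procedure the cycle would be repeated until the phases stop changing. The fixed `weight` here is a placeholder only, since the sketch does not address how the two phase sources should be weighted.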
The disadvantage of this real-space modification approach is that it is not clear how to weight the observed phases relative to those obtained from the modified map. This is because the modified map contains some of the same information as the original map and some new information. This has been recognized for a long time and a number of approaches have been designed to improve the relative weighting from these two sources, including the use of maximum-entropy methods, the use of weighting optimized using cross-validation, and “solvent-flipping.”
A comprehensive theory of the phase problem in X-ray crystallography and a formalism for solving it based on maximum entropy and maximum likelihood methods has been presented by Bricogne, Acta Cryst. A40, pp. 410-445 (1984) and Bricogne, Acta Cryst. A44, pp. 517-545 (1988). This formalism describes the contents of a crystal in terms of a collection of point atoms along with probabilities for their positions. From the positions of these atoms, crystallographic structure factors can be calculated, with a certainty depending on the certainties of the positions of the atoms. Extensions of the formalism are described in Bricogne (1988). The extended formalism specifically addresses the situation encountered in crystals of macromolecules in which defined solvent and macromolecule regions exist in the crystallographic unit cell, and formulas for calculating probabilities of structure factors based on the presence of “flat” solvent regions are presented (Bricogne, 1988). The implementation of this formalism is not straightforward according to Xiang et al., Acta Cryst. D49, pp. 193-212 (1993), who point out that a full-fledged implementation of this approach would be highly desirable and would provide a statistical technique for enforcing solvent flatness in advance. Xiang et al. (1993) report that they settled for an approximation in which solvent flatness outside the envelope is imposed after the calculation of a model for the distribution of atoms, which corresponds to the existing procedure of flattening the solvent in an electron density map (Wang, Methods Enzymol. 115, pp. 90-112 (1985)).
Somoza et al., Acta Cryst. A51, pp. 691-708 (1995) describe an algorithm for recovering crystallographic phase information that is related to the method of Bricogne (1988), but in which electron density is estimated by minimizing a combined target function consisting of the weighted sum of two terms. One term is the weighted sum of squares of differences between calculated and known electron density in the region where electron density is known. The other term is the weighted sum of squares of differences between calculated and observed amplitudes of structure factors. In this method, the electron density in a model description of the crystal is adjusted in order to minimize the combined target function. The use of the first term was shown by Somoza et al. (1995) to correspond to the solvent flattening procedures described above. This allowed solvent flattening and other related density modification procedures (such as non-crystallographic symmetry averaging) to be carried out without the iterative phase recombination steps required in previous methods. Beran and Szoke, Acta Cryst. A51, pp. 20-27 (1995) describe a procedure for finding crystallographic phases that lead to an electron density map that matches known electron density within a target region. This procedure consists of minimizing a target function given by the squared difference between calculated and known electron density within the target region, by adjusting crystallographic phases and using observed amplitudes. The method was shown to be superior to difference Fourier methods and the improvement was attributed to the ability of the method to specify the uncertainties in electron density in different physical regions of the unit cell of the crystal.
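The two-term target function attributed above to Somoza et al. (1995) can be written down in a few lines. The sketch below is an assumption-laden simplification rather than their published formulation: the function name `combined_target` and its parameters are hypothetical, and a single scalar weight per term stands in for whatever per-point and per-reflection weights the original method uses.

```python
def combined_target(rho_calc, rho_known, known_mask, amp_calc, amp_obs,
                    w_density=1.0, w_amplitude=1.0):
    """Weighted sum of a real-space term and a reciprocal-space term.

    rho_calc, rho_known : calculated and known density on the same grid
    known_mask          : True where the density is considered known
    amp_calc, amp_obs   : calculated and observed structure factor amplitudes
    w_density, w_amplitude : assumed scalar weights for the two terms
    """
    # Term 1: squared density differences, only where the density is known.
    density_term = sum((rc - rk) ** 2
                       for rc, rk, known in zip(rho_calc, rho_known, known_mask)
                       if known)
    # Term 2: squared differences between calculated and observed amplitudes.
    amplitude_term = sum((ac - ao) ** 2 for ac, ao in zip(amp_calc, amp_obs))
    return w_density * density_term + w_amplitude * amplitude_term
```

Setting `w_amplitude` to zero and holding the amplitudes at their observed values leaves only the real-space term, which resembles the squared-difference target described for Beran and Szoke (1995).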
The present invention solves the same problem that earlier procedures proposed by Bricogne (1988) address, and also includes the use of likelihood as a basis for choosing optimal crystallographic structure factors. The assumptions used in the present procedure differ substantially from those used by Bricogne (1988). For treatment of solvent and macromolecule (protein) regions in a crystal, Bricogne develops statistical relationships among structure factors based on a model of the contents of the crystal in which point atoms are randomly located, but in which atoms in the protein region are sharply-defined with low thermal parameters and atoms in the solvent region are diffuse, with high thermal parameters. In the present approach, no assumptions about the presence of atoms or possible values of thermal factors are used. Instead, it is assumed that values of electron density in the protein and solvent regions, respectively, are distributed in the same way in the crystal as in a model calculation of a crystal that may or may not be composed of discrete atoms.
The methods used to find likely solutions to the phase problem are also very different in the present approach compared to that of Bricogne (1988) because the assumptions used require the problem to be set up in different ways. Bricogne (1988) applies a maximum-entropy formalism developed by Bricogne (1984) to find likely arrangements of atoms in the crystal, which in turn can be used to calculate the arrangement of electron density in the crystal. In the present method, likely values of the structure factors are found by applying a likelihood-based approach based on a combination of experimental information and the likelihood of resulting electron density maps. These structure factors can be used to calculate an electron density map that is then, in turn, a likely arrangement of electron density in the crystal.
The present invention also addresses much the same problem that earlier procedures by Somoza et al. (1995) address. However, the present invention differs considerably in the way that it is formulated, and consequently in the way that a solution is obtained. In particular, the approaches are different because Somoza et al. (1995) use electron density as the independent variable and the present method uses crystallographic phases, generally fixing amplitudes to measured values. In the method of Somoza et al. (1995), the range of possible combinations of amplitudes and phases of structure factors that is explored is the set of all those that correspond to all or a subset of arrangements of positive electron density in the map, while in the present invention it is all possible crystallographic phases in combination with observed amplitudes. Consequently, the two methods sample different possible combinations of amplitudes and phases of crystallographic structure factors.
The mathematical approaches for obtaining solutions in the two methods are different as well. The method of Somoza et al. (1995) calculates derivatives of their target function with respect to electron density, resulting in a linear system of equations to solve for the electron density at all points in the electron density map, while the present invention calculates derivatives of a likelihood-based target function with respect to structure factors in order to solve for crystallographic phases (or phases and amplitudes if amplitudes are not measured).
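The idea of differentiating a map-based target with respect to phases can be illustrated on a one-dimensional toy problem. Everything in this sketch is an assumption made for illustration: the function names, the one-dimensional density built from reflections h = 1, 2, ... and their Friedel mates, and above all the stand-in target, which is an independent Gaussian log-likelihood at each grid point rather than the likelihood function of the present method. The experimental phase term is omitted entirely.

```python
import math

def density(amplitudes, phases, n, mean=0.0):
    """rho(x) on n grid points from reflections h = 1..len(amplitudes) and Friedel mates."""
    return [mean + (2.0 / n) * sum(a * math.cos(2 * math.pi * h * x / n + p)
                                   for h, (a, p) in enumerate(zip(amplitudes, phases), start=1))
            for x in range(n)]

def gaussian_map_log_likelihood(rho, mu, sigma):
    """Stand-in map log-likelihood: one Gaussian per grid point, constants dropped."""
    return sum(-0.5 * ((r - m) / s) ** 2 for r, m, s in zip(rho, mu, sigma))

def phase_gradient(amplitudes, phases, n, mu, sigma, mean=0.0):
    """d(map log-likelihood)/d(phase_h), by the chain rule through rho(x)."""
    rho = density(amplitudes, phases, n, mean)
    # Derivative of the stand-in log-likelihood with respect to rho at each point.
    dll_drho = [-(r - m) / s ** 2 for r, m, s in zip(rho, mu, sigma)]
    grad = []
    for h, (a, p) in enumerate(zip(amplitudes, phases), start=1):
        # Derivative of rho(x) with respect to the phase of reflection h.
        drho_dphi = [-(2.0 / n) * a * math.sin(2 * math.pi * h * x / n + p)
                     for x in range(n)]
        grad.append(sum(g * d for g, d in zip(dll_drho, drho_dphi)))
    return grad
```

A simple way to use such a gradient would be a small ascent step on each phase, repeated until the target stops improving. The amplitudes stay fixed throughout, consistent with the description above of phases as the independent variables.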
Finally, the target function that is optimized in the method of Somoza et al. (1995) is a weighted sum of squared differences, while the target function in the present invention is a log-likelihood-based function. The target function in Somoza et al. (1995) simply restrains the electron density in the region where it is known to be similar to the known values. The present invention instead calculates the log-likelihood of the electron density map and maximizes it. Consequently, the weighting schemes and the details of the target functions used are different.
The method of Beran and Szoke (1995) is related to the present method in that their target function has the same form as a special case of the map log-likelihood function to be described below, in which the local map log-likelihood is zero for all points outside the target area and a constant for all points within it. The method of Beran and Szoke differs in several ways from the present invention. First, as in the method of Somoza et al. (1995), the target function is a weighted sum of squared differences, while the target function in the present case is a log-likelihood-based function. Second, in the method of Beran and Szoke (1995), no terms corresponding to agreement of phases with experimental observations are considered, while in the present method, these are a key source of phase information. Finally, the method of Beran and Szoke (1995) differs from the present method in that the means used to minimize the target function is based on simulated annealing (a method for sequentially adjusting phases in a biased, but random, walk, choosing those that improve the target function), in contrast to the approach of the present invention, which calculates gradients of a likelihood-based target function with respect to phases.
Various objects, advantages and novel features of the invention will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
An electron density map for a crystallographic structure having protein regions and solvent regions is improved by maximizing the log-likelihood of a set of structure factors {F_h} using a local log-likelihood function:
LL(ρ(x, {F_h})) = ln[p(ρ(x)|PROT) p_PROT(x) + p(ρ(x)|SOLV) p_SOLV(x) + p(ρ(x)|H) p_H(x)],
where p_PROT(x) is the probability that x is in the protein region; p(ρ(x)|PROT) is the conditional probability for ρ(x) given that x is in the protein region; p_SOLV(x) and p(ρ(x)|SOLV) are the corresponding quantities for the solvent region; p_H(x) is the probability that there is a structural motif at a known location, with a known orientation, in the vicinity of the point x; and p(ρ(x)|H) is the probability distribution for electron density in the vicinity of point x, given that the structural motif actually is present.
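As a rough illustration of how the local log-likelihood could be evaluated at a grid point, the sketch below assumes Gaussian forms for the three conditional densities. The Gaussian shapes, their (mean, sigma) values, and the function names are hypothetical placeholders chosen only to make the example runnable; the text above indicates that the actual distributions are taken from a model calculation.

```python
import math

def gaussian_pdf(value, mean, sigma):
    """Normal probability density, used here as an assumed conditional p(rho | region)."""
    return math.exp(-0.5 * ((value - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def local_log_likelihood(rho, p_prot, p_solv, p_motif=0.0,
                         prot=(0.3, 1.0), solv=(0.0, 0.1), motif=(1.0, 0.2)):
    """ln of the mixture over protein, solvent, and motif hypotheses at one point.

    rho                    : electron density at the point x
    p_prot, p_solv, p_motif: probabilities that x lies in each kind of region
    prot, solv, motif      : assumed (mean, sigma) of each conditional density
    """
    total = (gaussian_pdf(rho, *prot) * p_prot
             + gaussian_pdf(rho, *solv) * p_solv
             + gaussian_pdf(rho, *motif) * p_motif)
    return math.log(total)

def map_log_likelihood(rho_values, p_prot_values, p_solv_values):
    """Sum of the local log-likelihood over all grid points (motif term omitted)."""
    return sum(local_log_likelihood(r, pp, ps)
               for r, pp, ps in zip(rho_values, p_prot_values, p_solv_values))
```

With these placeholder distributions, a point that is probably solvent scores higher when its density is near the flat solvent level than when it is far from it, which is the behavior the mixture form is meant to capture.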