The new initiative for high throughput structural determination promises to revolutionize all stages of the drug discovery process by providing many new high-resolution structures of novel protein folds and of complexes between proteins and small molecule drugs. This new knowledge will allow drug development teams to acquire a much better understanding of structure-activity relationships. Before the vision of high throughput protein crystallography can be realized, however, many time-consuming steps in the process must be streamlined. The invention described herein seeks to address two of the bottlenecks in high throughput crystallography: the determination of new protein structures and the identification of new leads for drug compounds. Although the hurdles occur at different stages of the process, both may be addressed by extending pair-wise comparisons of molecules to the scale of large databases.
One of the guiding principles of drug discovery is that similarly shaped molecules are more likely to share biological properties than dissimilar molecules. Thus, a number of algorithms have been developed for making shape-based comparisons of molecules in the field of small molecule drug discovery [1–8]. These approaches rely on strict superpositioning of coordinates, matching and aligning of chemical descriptors, or making topological comparisons of molecules. In general, these methods were designed to find molecules that are similar in activity and so are limited to compounds that vary at a few chemical groups. As a result, they will group compounds with very similar structures but will not identify molecules where only a small subset of the structure is shared between two compounds. One method that can identify shared subsets of structures, although it was developed specifically for comparing proteins, is DALI [9]. Briefly, DALI generates a matrix of all interatomic Cα vectors for each polypeptide chain in the comparison. Both matrices are reduced to essential contact patterns of structural elements in the polypeptide, and then the patterns are aligned, compared, and scored according to the degree of similarity. The alignments are then ranked in the output by similarity score. The technique is quite powerful when applied to proteins with known structures; however, there is no means to extend the software to other types of molecules or to include protein atom types other than Cα in the comparison. A more flexible pair-wise comparison of molecules that can be extended to many types of structures must be an integral component of the drug discovery process, and any improvement in methodology will speed the way to new drug leads.
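The idea of comparing two chains through matrices of interatomic Cα relationships can be illustrated with a minimal sketch. This is not DALI's algorithm: it skips the reduction to contact patterns and the alignment step, assumes the two chains have the same number of residues in the same order, and uses scalar distances. The coordinates and the function names `distance_matrix` and `matrix_difference` are hypothetical, chosen only for illustration.

```python
import math

def distance_matrix(coords):
    """Return the symmetric matrix of pairwise distances between points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def matrix_difference(m1, m2):
    """Mean absolute difference between two equally sized distance matrices."""
    n = len(m1)
    if n != len(m2):
        raise ValueError("matrices must have the same size")
    total = sum(abs(m1[i][j] - m2[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)

# Hypothetical C-alpha traces in angstroms (illustrative values only).
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
# chain_b is chain_a translated, so its internal distances are unchanged.
chain_b = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in chain_a]
# chain_c is a fully extended trace with a different shape.
chain_c = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

score_ab = matrix_difference(distance_matrix(chain_a), distance_matrix(chain_b))
score_ac = matrix_difference(distance_matrix(chain_a), distance_matrix(chain_c))
print(f"a vs b: {score_ab:.3f}  a vs c: {score_ac:.3f}")
```

Because internal distances do not change when a molecule is moved, the translated copy scores essentially zero while the differently shaped chain scores higher. No prior superposition is needed, which is one attraction of matrix-based comparison.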
Beyond small molecule drug discovery, another arena in which pair-wise comparison of structures is important is in the determination of new protein structures through x-ray crystallographic methods. Two common approaches to solving structures are available to the crystallographer: one is multiple isomorphous replacement (MIR), and the other is molecular replacement (MR). MR can be thought of as a type of pair-wise comparison between molecules, but with the special condition that for one of the molecules the structure has not been modeled. MR consists of positioning and orienting the structure of a known molecule in the crystal environment of a protein for which x-ray data is available. Fourier-based Patterson methods are used to generate grids containing peaks that represent interatomic distances for the x-ray data and the structure of the known model. The grids are rotated and translated with respect to one another until the correlation is maximized. MR is used exclusively when crystallographic data is collected from a protein with strong structural homology to another protein. In most cases where MR is applied, the known structure comprises 25% or more of the mass of the unknown protein. Furthermore, as long as there is high structural homology, molecular replacement has succeeded with sequence homology as low as 33% as in the case for protein kinases [10]. In general, this means that MR has only been useful in the context of a protein that has been very well characterized (for which the function is known or guessed). Using MR to help solve structures of the enormous numbers of proteins with unknown function identified in the human genome project would at first seem unfeasible.
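The rotation step of such a search can be sketched in a toy two-dimensional form. This is an illustration under stated assumptions, not the Crowther rotation function or any real MR program: the sets of interatomic vectors stand in for Patterson peaks, the "observed" data are simulated by rotating and shifting a hypothetical model, and agreement is measured with an assumed Gaussian overlap score instead of a true correlation coefficient over grids.

```python
import math

def interatomic_vectors(points):
    """All vectors between distinct atoms: a crude stand-in for Patterson peaks."""
    return [(bx - ax, by - ay)
            for i, (ax, ay) in enumerate(points)
            for j, (bx, by) in enumerate(points) if i != j]

def rotate(vectors, degrees):
    """Rotate 2D vectors (or points) counterclockwise about the origin."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y) for x, y in vectors]

def overlap(vecs_a, vecs_b, width=0.3):
    """Sum of Gaussian overlaps between two vector sets; larger means better agreement."""
    return sum(math.exp(-((ax - bx) ** 2 + (ay - by) ** 2) / (2 * width ** 2))
               for ax, ay in vecs_a for bx, by in vecs_b)

def rotation_search(model, target_vectors, step=1):
    """Return the trial angle (degrees) whose rotated model vectors best match the target."""
    model_vectors = interatomic_vectors(model)
    # The vector set contains both v and -v, so in 2D only 0..179 degrees is distinct.
    return max(range(0, 180, step),
               key=lambda angle: overlap(rotate(model_vectors, angle), target_vectors))

# Hypothetical four-atom model (arbitrary units).
model = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0), (6.0, 2.0)]
# Simulated "unknown" structure: the model rotated by 40 degrees and then shifted.
target = [(x + 7.0, y - 2.0) for x, y in rotate(model, 40)]

best_angle = rotation_search(model, interatomic_vectors(target))
print(best_angle)
```

The shift applied to the target does not affect the result, because interatomic vectors do not depend on where the molecule sits. A real search works in three dimensions and follows the rotation step with a translation search.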
Without functional information, the search space for candidate models becomes much larger and the barriers to applying MR much greater. In the past, when confronted with a large search space, a crystallographer would abandon MR in favor of other, more time-consuming approaches such as MIR. But the availability of powerful computers and the growing number of protein structures deposited with the Protein Data Bank (PDB) could potentially make molecular replacement a much more viable technique. Currently, there are over 14,000 structures in the PDB, and that number is increasing exponentially [11]. As more folds are deposited, the likelihood of a match between a model in the PDB and the subject protein increases accordingly. With the invention available to mine protein structural databases systematically and automatically, it should be possible to use molecular replacement for the ab initio determination of any protein structure. Current methods for automating molecular replacement searches, however, are too primitive.
Most current molecular replacement algorithms are modifications of the original rotation function [12] and translation function formulated by Crowther and Blow [13]. The existing embodiments currently do not permit automated database searches; however, two programs appear to be promising candidates for modifications to allow them to do database searches: EPMR [14] and AMoRe [15].
EPMR employs evolutionary search algorithms on a variation of the brute force six-dimensional search for rotation and translation solutions. The algorithm randomly samples six-dimensional space to find a set of starting solutions with high correlation coefficients. Those that satisfy criteria set by the program are subjected to iterative rounds of searches in which the starting orientations of the models have been shifted randomly by small increments. The process is repeated until the solutions are optimized, and then the program calls for a round of local rigid body refinement. The authors claim better signal-to-noise ratios in the solutions and a higher tolerance of errors and incompleteness in the search models than with AMoRe.
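The general shape of an evolutionary search of this kind can be sketched as follows. This is a generic illustration and does not reproduce EPMR's actual operators, selection criteria, or scoring. The function `evolve`, the population sizes, the mutation step, and the toy score (closeness to a hidden six-parameter solution, standing in for a correlation coefficient against x-ray data) are all assumptions made for the example.

```python
import random

def evolve(score, bounds, population=30, survivors=6, generations=40,
           step=0.02, seed=0):
    """Keep the best candidates each round and refill the pool with small random shifts."""
    rng = random.Random(seed)
    # Random sampling of the search space supplies the starting solutions.
    pool = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(population)]
    history = []
    for _ in range(generations):
        pool.sort(key=score, reverse=True)
        parents = pool[:survivors]
        history.append(score(parents[0]))
        children = []
        while len(parents) + len(children) < population:
            parent = rng.choice(parents)
            # Shift each parameter by a small random increment, clamped to its bounds.
            child = [min(hi, max(lo, value + rng.gauss(0.0, step * (hi - lo))))
                     for value, (lo, hi) in zip(parent, bounds)]
            children.append(child)
        pool = parents + children
    return max(pool, key=score), history

# Hypothetical target: three rotation angles (degrees) and three fractional translations.
TRUE_SOLUTION = [40.0, 75.0, 120.0, 0.25, 0.5, 0.75]
BOUNDS = [(0.0, 360.0)] * 3 + [(0.0, 1.0)] * 3

def toy_score(candidate):
    """Higher is better; zero only at the hidden solution (stand-in for a correlation)."""
    return -sum(((c - t) / (hi - lo)) ** 2
                for c, t, (lo, hi) in zip(candidate, TRUE_SOLUTION, BOUNDS))

best, history = evolve(toy_score, BOUNDS)
print([round(v, 2) for v in best], round(toy_score(best), 5))
```

Because the surviving parents are carried into the next round unchanged, the best score never decreases from one generation to the next. A production program would follow this stage with local rigid body refinement of the top solutions.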
However, EPMR is a time-consuming algorithm, and so AMoRe is still preferred by many because of its speed and ability to test many solutions simultaneously. AMoRe is based on a fast rotation function using spherical harmonics and Bessel function expansions. The modifications to the rotation search permit more accurate calculation of the rotation matrices and provide better resolution of the rotation peaks.
Even though the execution time for AMoRe is much faster than that for EPMR, AMoRe has two limitations that make it cumbersome to use for high volume comparisons. In the normal mode of operation, AMoRe must be run in an iterative manner. A crystallographer intervenes at the end of each cycle to analyze and parse out needed parameters from the log files generated by AMoRe and feeds them into the next round of computation. Thus, AMoRe lacks automation. Furthermore, AMoRe requires support programs to manage input data. AMoRe is part of the CCP4 program suite and uses defined input formats in order to make it compatible with other programs in the suite. Consequently, AMoRe requires that input data be passed through the programs f2mtz and pdbset. All the programs, including AMoRe, are designed to run on a single processor and cannot be recompiled easily to take advantage of multiple CPUs. Both limitations prevent a user from taking advantage of the computing power normally available to distributable applications.
The lack of automation and limited computing power available to AMoRe make an exhaustive search of the complete Protein Data Bank impractical. Assuming a dedicated crystallographer could edit, write, and parse the files necessary to complete one molecular replacement search every 10 minutes, covering the more than 14,000 structures in the PDB would take roughly 100 days of work around the clock. Aside from the Herculean effort on the part of the crystallographer, keeping track of the output generated from the effort would also require a database. Currently, there are no programs available that satisfy the requirement for conducting high throughput pair-wise shape-based comparisons of protein molecules or small molecules.
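The time estimate above can be checked with a short calculation. It assumes exactly 14,000 entries (the text gives only "over 14,000") and one search every 10 minutes with no breaks.

```python
structures = 14_000          # assumed count; the text says "over 14,000"
minutes_per_search = 10      # manual turnaround assumed in the text
minutes_per_day = 60 * 24

days = structures * minutes_per_search / minutes_per_day
# Number of entries at which the manual search reaches 100 full days.
entries_for_100_days = 100 * minutes_per_day / minutes_per_search

print(f"{days:.1f} days for {structures} structures")
print(f"100 days reached at {entries_for_100_days:.0f} structures")
```

At exactly 14,000 entries the total is about 97 days, and it passes 100 days once the database holds more than 14,400 entries, so the estimate is of the right order for a database of that size.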