A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by any one of the patent disclosure, as it appears in the Patent and Trademark files or records, but otherwise reserves all copyright rights whatsoever.
The invention relates to a computer-implemented method for determining all-atom, real-space protein structures.
Protein sequences can be deduced from the DNA sequence of an organism, and the worldwide genome project has provided tens of thousands of new protein sequences. Proteins have flexible backbones, and protruding, rotating side-chains. They can take up a countless number of shapes i.e. conformations, in three-dimensional space. Yet proteins eventually fold into an ordered structure, the native conformation. The protein-folding problem reflects the inability to predict the native folded conformation of a protein given only its amino acid sequence.
Various methods have been described for solving the folding problem. The methods include direct and template-based methods. Direct methods try to determine the native conformation as a lowest energy point in some defined hyperspace of conformational possibilities. Template-based methods compare a sequence of unknown three-dimensional structure against a library of known three-dimensional structures and score good matches as likely folds. There is a substantial body of research literature on these methods, but successes are rare and often not reproducible. There has been a call for new computational methods that broadly explore conformational space and that are true to the details of protein structure (Dill, K. A. et al., Nature Structural Biology 4:10, 1997; Karplus, M., Fold Des. 2:S69, 1997).
The present inventors have developed a method to generate plausible random protein structures. All-atom proteins are made directly in continuous 3-dimensional space starting from primary sequence with an N to C directed build-up method. The method uses a novel pipelined residue addition approach in which the leading edge of the protein is constructed 3 residues at a time for optimal protein geometry, including the placement of cis proline. Build-up methods represent a classic N-body problem, expected to scale as N². When proteins become more collapsed, build-up methods are susceptible to backtracking problems, which can scale exponentially with the number of residues required to back out of a trapped walk. Solutions to both problems are provided: a multiway binary tree makes the N-body problem of bump-checking scale as NlogN, and backtracking is sped up by varying the number of tries before backtracking according to the available conformational space.
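As an illustration only, the build-up and backtracking idea described above might be sketched as follows in Python. The sketch places one point per residue, uses a brute-force bump check rather than a tree, and uses a fixed retry budget rather than one that varies with the available conformational space. The names `build_chain`, `clashes`, `CA_STEP`, `MIN_SEP` and `max_tries`, and all numerical values, are assumptions made for this sketch and are not taken from the disclosure.

```python
import math
import random

CA_STEP = 3.8   # assumed spacing between consecutive points, in angstroms
MIN_SEP = 4.0   # assumed minimum distance between non-consecutive points


def random_unit_vector(rng):
    """Return a direction drawn uniformly from the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)


def clashes(candidate, chain):
    """Brute-force bump check against every point except the bonded one."""
    return any(math.dist(candidate, p) < MIN_SEP for p in chain[:-1])


def build_chain(n_points, max_tries=20, seed=0, max_steps=1_000_000):
    """Grow a self-avoiding chain, backtracking when a position is exhausted."""
    rng = random.Random(seed)
    chain = [(0.0, 0.0, 0.0)]
    failures = [0]  # failed attempts to extend the chain from each point
    steps = 0
    while len(chain) < n_points:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step budget exhausted before completion")
        if failures[-1] >= max_tries and len(chain) > 1:
            # Dead end: remove the last point and charge its parent a failure.
            chain.pop()
            failures.pop()
            failures[-1] += 1
            continue
        d = random_unit_vector(rng)
        last = chain[-1]
        candidate = tuple(last[i] + CA_STEP * d[i] for i in range(3))
        if clashes(candidate, chain):
            failures[-1] += 1
        else:
            chain.append(candidate)
            failures.append(0)
    return chain
```

Calling `build_chain(60)` returns 60 points in continuous space. Because every candidate is compared with all earlier points, this sketch scales quadratically, which is the cost the tree-based bump check is intended to avoid.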
In particular, the method constructs all-atom protein structures in O(NlogN) time (rather than in quadratic O(N²) time) by residue addition that is balanced in both speed and detail. The primary sequence and a multidimensional trajectory graph system are employed; the trajectory graph system directs the sampling of conformational space and behaves like the theoretical protein folding funnel. Trajectory graphs can direct either the random sampling of protein conformer space (the funnel "mouth") or the reconstruction of a known protein backbone (the funnel "spout"). Several novel geometrical, methodological, and algorithmic approaches are introduced in the method. A schematic diagram of a method of the invention is shown in FIG. 1.
The methods of the invention have been validated at both extremes of the folding funnel by comparison with polymer theory, and by reconstructing known proteins. In particular, random all-atom proteins generated using the E. coli genomic amino acid composition had radius of gyration statistics that showed the expected swelling compared to non-self-avoiding random polyalanine, and Flory's (12) theoretical curves approximate a lower bound for these results. For tests of protein fold reconstruction using nine different protein folds, an average RMSD of 0.63 Å was obtained for Cβ, C, N and O backbone atoms. WHAT_CHECK, a protein-structure checking software suite (30), validated that the method generates physically and chirally valid backbones and sidechains.
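For reference, the two summary statistics mentioned above can be computed as in the following minimal Python sketch. It assumes unit atomic masses for the radius of gyration and computes RMSD over matched atoms without any superposition step; whether the reported figures were obtained with mass weighting or after superposition is not stated here, so both choices are assumptions. The function names are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (unit masses)."""
    n = len(coords)
    centroid = tuple(sum(p[i] for p in coords) / n for i in range(3))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in coords) / n)


def rmsd(coords_a, coords_b):
    """RMSD between matched points, with no superposition step applied."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```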
The binary-d tree is a new hierarchical data structure developed to deal with the O(N²) problem of atomic bump-checking (collision detection based on atomic radii). It permits overall O(NlogN) time complexity (validated out to N=2,500 amino acids), together with efficient backtracking. It utilizes a unique 3-dimensional tree that partitions space in a relative fashion, unlike the voxels used in an octree system. Branch and bound search methods on the binary-d tree can retrieve coordinates contained by probe volumes. The method allows atoms or sections of molecules to be moved without repartitioning the space occupied by the entire set of atoms. Binary searches are also used in the fitting of amino acid backbones between alpha-carbons, and in the random sampling of the trajectory graphs, which also contribute to the overall O(NlogN) performance of the method of the invention.
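The general idea of retrieving the coordinates inside a probe volume by branch and bound search can be illustrated with a conventional k-d tree. The following Python sketch is a stand-in chosen for familiarity and is not the binary-d tree of the invention; the class and method names (`KDTree`, `insert`, `within`) are hypothetical. The tree is not rebalanced, so the logarithmic search cost holds only on average for reasonably scattered points.

```python
import math


class Node:
    __slots__ = ("point", "axis", "left", "right")

    def __init__(self, point, axis):
        self.point = point
        self.axis = axis      # coordinate (0, 1 or 2) this node splits on
        self.left = None      # points with a smaller coordinate on that axis
        self.right = None     # points with an equal or larger coordinate


class KDTree:
    def __init__(self):
        self.root = None

    def insert(self, point):
        """Add one point by descending to an empty child slot."""
        if self.root is None:
            self.root = Node(point, 0)
            return
        node = self.root
        while True:
            side = "left" if point[node.axis] < node.point[node.axis] else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(point, (node.axis + 1) % 3))
                return
            node = child

    def within(self, center, radius):
        """Return stored points strictly closer than radius to center."""
        found = []
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            if math.dist(node.point, center) < radius:
                found.append(node.point)
            delta = center[node.axis] - node.point[node.axis]
            # Bound step: skip a half-space the probe sphere cannot reach.
            if delta - radius < 0:
                stack.append(node.left)
            if delta + radius >= 0:
                stack.append(node.right)
        return found
```

In a build-up loop, a candidate atom would be rejected when `tree.within(candidate, min_separation)` is non-empty, so each check visits only the branches that overlap the probe sphere instead of every atom placed so far.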
Therefore, in accordance with an aspect of the invention, a method is provided for creating or identifying a conformation of a protein of known or unknown structure, which comprises the steps of:
(a) providing an amino acid sequence of the protein;
(b) constructing a backbone structure of α-carbons of the protein by adding and removing carbon atoms through chain elongation and backtracking, wherein an atom is positioned based on a predicted two-dimensional space, and wherein backtracking removes an atom if it is closer to its neighbour than allowed by van der Waals radii;
(c) positioning β carbons, C, N, and O atoms to provide favourable bond lengths and bond angles; and
(d) positioning sidechain rotamers; thereby outputting a conformation of the protein.
The method constructs the conformation of the protein in O(NlogN) time, and the conformation is constructed in real space rather than being confined to a lattice. The conformation is preferably an all-atom protein structure, including hydrogen atoms. The method may further comprise assembling different conformations of the protein to provide an ensemble of conformations of the protein. The ensemble may be incorporated in a database which may comprise from about 50,000 to 500,000 different conformations of the protein.
Another aspect of the invention is a computer-implemented process for identifying a conformation of a protein of known or unknown structure from an amino acid sequence of the protein. The steps of the process performed by the computer include (a) constructing a backbone structure of α-carbons of the protein by adding and removing carbon atoms through chain elongation and backtracking, wherein an atom is positioned based on a predicted two-dimensional space, and wherein backtracking removes an atom if it is closer to its neighbour than allowed by van der Waals radii; (b) positioning β carbons, C, N, and O atoms to provide favourable bond lengths and bond angles; and (c) positioning sidechain rotamers; thereby identifying a conformation of the protein.
Another aspect of the invention is part of a computer system for creating or identifying a conformation of a protein of known or unknown structure from an amino acid sequence of the protein. This part of the computer system includes (a) means for constructing a backbone structure of α-carbons of the protein by adding and removing carbon atoms through chain elongation and backtracking, wherein an atom is positioned based on a predicted two-dimensional space, and wherein backtracking removes an atom if it is closer to its neighbour than allowed by van der Waals radii; (b) means for positioning β carbons, C, N, and O atoms to provide favourable bond lengths and bond angles; and (c) means for positioning sidechain rotamers.
Another aspect of the invention is part of a computer system for identifying favorable areas of conformational space in an ensemble of conformations of a protein. This part of the computer system includes a conformer generator module and a structure analysis module. The conformer generator module has an input for receiving an amino acid sequence of the protein and it defines an ensemble of conformations of the protein by (a) constructing a backbone structure of α-carbons of the protein by adding and removing carbon atoms through chain elongation and backtracking, wherein an atom is positioned based on a predicted two-dimensional space, and wherein backtracking removes an atom if it is closer to its neighbour than allowed by van der Waals radii; (b) positioning β carbons, C, N, and O atoms to provide favourable bond lengths and bond angles; and (c) positioning sidechain rotamers. The module records the amino acid sequence of the protein as an ensemble of conformations of the protein wherein each conformer of the protein is represented by a backbone conformation graph. The structure analysis module is connected to the output of the conformer generator module and it comprises means for creating or identifying a next ensemble of conformers of the protein using a weighting scheme and scoring function; and means for repeating these steps until the backbone conformation graph maintains its shape.
Backbone structures of α-carbons of a protein may be constructed by randomly sampling "trajectory graphs" or "trajectory distributions" of amino acid residues representing a statistical sampling for each amino acid residue of the conformational space it is observed to visit in known proteins wherein the trajectory graphs are resolved into α, β, and coil secondary structure components for each amino acid residue. The secondary-structure based trajectory graphs may be recombined in predicted proportions (e.g. % α, % β, % coil) for each amino acid in a protein to be analyzed to form a starting backbone conformation graph. Electron microscopy, atomic microscopy, and/or NMR data may also be used to confine selected conformational spaces, i.e. the data may be mapped into the trajectory graphs.
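A minimal Python sketch of the recombination and sampling described above follows. It assumes, purely for illustration, that each secondary-structure component is stored as a one-dimensional list of per-bin counts, that the predicted proportions are given as a dictionary, and that sampling is done by binary search on a cumulative distribution. The function names, bin layout and data are hypothetical and are not taken from the disclosure.

```python
import bisect
import itertools
import random


def mix_components(components, proportions):
    """Combine per-component histograms, weighted by predicted proportions."""
    total = sum(proportions.values())
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    n_bins = len(next(iter(components.values())))
    mixed = [0.0] * n_bins
    for name, hist in components.items():
        weight = proportions.get(name, 0.0) / total
        hist_sum = sum(hist)
        for i, count in enumerate(hist):
            mixed[i] += weight * count / hist_sum
    return mixed


def cumulative(weights):
    """Running sum of the bin weights, used as a lookup table for sampling."""
    return list(itertools.accumulate(weights))


def sample_bin(cdf, rng):
    """Draw one bin index by binary search on the cumulative distribution."""
    u = rng.random() * cdf[-1]
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)


def sample_bins(cdf, n, seed=0):
    """Draw n bin indices with a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [sample_bin(cdf, rng) for _ in range(n)]
```

Each draw costs O(log B) for B bins, and bins with zero weight are never selected because `bisect_right` skips over flat stretches of the cumulative table.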
The invention contemplates a part of a computer system for constructing backbone structures of α-carbons of a protein comprising a trajectory file which defines trajectory graphs or distributions of amino acid residues representing a statistical sampling for each amino acid residue of the conformational space it is observed to visit in known proteins wherein the trajectory graphs are resolved into α, β, and coil secondary structure components for each amino acid type; and optionally the graphs are recombined in predicted secondary structure proportions for the protein.
The invention provides a novel hierarchical data structure (i.e. binary-d tree) that fits residues in the backbone structure of a protein between α-carbons and in random sampling of the trajectory graphs.
Another aspect of the invention is a computer-implemented process for identifying favorable conformational spaces in an ensemble of conformations of a protein comprising:
(a) providing a database of conformations of the protein wherein each conformer is represented by a backbone conformation graph;
(b) creating a next ensemble of conformations of the protein using a weighting scheme and scoring function; and
(c) repeating step (b) until the backbone conformation graph maintains its shape.
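Steps (a) to (c) can be pictured as a fixed-point iteration. The Python sketch below is a drastic simplification offered under stated assumptions: the backbone conformation graph is reduced to a single list of bin probabilities, each conformer is reduced to one sampled bin, the scoring function is a fixed table of per-bin scores, the weighting scheme is exp(-score), and "maintains its shape" is interpreted as a small summed absolute change between successive graphs. None of these choices, nor the names `next_graph` and `refine`, come from the disclosure.

```python
import math
import random


def next_graph(graph, scores, n_conformers, rng):
    """Sample an ensemble from the graph, weight it, and re-estimate the graph."""
    bins = rng.choices(range(len(graph)), weights=graph, k=n_conformers)
    weighted = [0.0] * len(graph)
    for b in bins:
        weighted[b] += math.exp(-scores[b])  # assumed weighting: lower score is better
    total = sum(weighted)
    return [w / total for w in weighted]


def refine(graph, scores, n_conformers=200, tol=1e-3, max_iter=200, seed=0):
    """Repeat the ensemble step until the graph stops changing or max_iter is hit."""
    rng = random.Random(seed)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        updated = next_graph(graph, scores, n_conformers, rng)
        change = sum(abs(a - b) for a, b in zip(updated, graph))
        graph = updated
        if change < tol:
            break
    return graph, iteration
```

Starting from a uniform graph such as `[0.25, 0.25, 0.25, 0.25]` with scores `[3.0, 1.0, 0.0, 2.0]`, the probability mass shifts toward the lowest-scoring bin over successive ensembles.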
The methods and processes of the invention provide an ensemble of conformations of a protein i.e. conformers. Each protein in an ensemble may be generated using the method of the invention on a single processor. Other advantages of the processes and methods of the invention include fixed memory usage, minimal disk usage, and many adjustable parameters to affect the quality versus speed tradeoff of structure generation.
The details of the preferred embodiment of the present invention are set forth in the accompanying drawings and the description below. Once the details of the invention are known, numerous additional innovations and changes will become obvious to one skilled in the art.