1. Field of the Invention
This invention relates to a method and apparatus for extracting and evaluating mutually coinciding or similar portions between sequences of atoms or atomic groups in molecules and/or between three-dimensional structures of molecules and, particularly, to a method and apparatus for automatically extracting and evaluating mutually coinciding or similar portions between amino acid sequences in protein molecules and/or between three-dimensional structures of protein molecules.
2. Description of the Related Art
A gene is in substance DNA, and is expressed as a base sequence including four bases of A (adenine), T (thymine), C (cytosine), and G (guanine). There are about twenty types of amino acids constituting an organism, and it has been shown that arrangements of three bases correspond to the respective amino acids. Accordingly, it has been found that, in the organism, the amino acids are linked in the order given by the base sequence of the DNA and that a protein is formed by folding the resulting chain of amino acids. The arrangement of amino acids is expressed as an amino acid sequence in which the respective amino acids are expressed by letters, in the same manner as the base sequence.
Methods for determining sequences of bases and amino acids have been established together with the development of molecular biology, and therefore a huge amount of gene information including base sequence data and amino acid sequence data has been stored. Thus, in the field of gene information processing, a core subject has been how to extract biological information concerning the structure and function of the protein out of the huge amount of stored gene information.
A basic technique in extracting the biological information is to compare the sequences. This is because it is considered that the biological functions are similar if the sequences are similar. Accordingly, two techniques are presently studied: a homology search, in which a data base of sequences whose functions are known is searched for a sequence similar to an unknown sequence in order to estimate the function of the unknown sequence; and an alignment, in which the sequences are rearranged so as to maximize the degree of analogy between them when researchers compare the sequences.
Further, it is considered that a region of the sequence in which a function important for the organism is coded is conserved through the evolution process. For instance, when the amino acid sequences of proteins having the same function are compared between different types of organisms, a commonly existing sequence pattern (region) is known to be found. This region is called a motif. Accordingly, if the motif can be extracted automatically, the property and function of a protein can be estimated by finding which motifs are included in its sequence. Further, automatic motif extraction is applicable to a variety of protein engineering fields such as strengthening the properties of preexisting proteins, adding functions to preexisting proteins, and synthesizing new proteins. As described above, extracting the motif from the amino acid sequence can be considered an effective means of extracting the biological information. However, no extraction method has yet been established, and researchers currently decide manually which part is a motif sequence after the homology search and alignment.
A dynamic programming technique that is used in voice recognition processing has been the only method used for automatically comparing two amino acid sequences.
However, according to the method of comparing the amino acid sequences using the dynamic programming technique, the amino acid sequences are compared two-dimensionally. Thus, this method requires a large memory capacity and a long processing time.
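The memory cost of the two-dimensional comparison can be illustrated with a minimal sketch. The text does not state which score the dynamic programming technique computes, so the sketch assumes the simplest case, the length of the longest common subsequence; the function name `lcs_length_dp` and the returned cell count are illustrative only.

```python
def lcs_length_dp(first, second):
    """Conventional two-dimensional dynamic programming comparison.

    Assumption: the score is the longest-common-subsequence length.
    Returns (score, number of table cells) to show the memory cost.
    """
    m, n = len(first), len(second)
    # One cell per pair of prefixes: (m + 1) x (n + 1) cells in total.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if first[i - 1] == second[j - 1]:
                # Matching symbols extend the best result of both prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better of the two neighbours.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n], (m + 1) * (n + 1)
```

Both the table size and the number of cell updates grow with the product of the two sequence lengths, which is the cost referred to above.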
Meanwhile, in the fields of physics and chemistry, in order to examine the properties of a new (unknown) substance and to produce the new substance artificially, three-dimensional structures of substances are determined by a technique such as an X-ray crystal analysis or an NMR analysis, and information on the determined three-dimensional structures is stored in a data base. As a typical data base, a PDB (Protein Data Bank) in which three-dimensional structures of proteins or the like identified by the X-ray crystal analysis of protein are registered is widely known and universally used. Further, a CSD (Cambridge Structural Database) is known as a data base in which chemical substances are registered.
In the protein, a plurality of amino acids are linked to one another as a single chain and this chain is folded in an organism to thereby form a three-dimensional structure. In this way, the protein exhibits a variety of functions. The respective amino acids are expressed by numbering them from an N-terminal through a C-terminal. These numbers are called amino acid numbers, amino acid sequence numbers, or amino acid residue numbers. Each amino acid includes a plurality of atoms according to the type thereof. Therefore, the PDB registers the names and administration numbers of proteins, the amino acid numbers of the amino acids constituting each protein, the types and three-dimensional coordinates of the atoms constituting the respective amino acids, and the like.
It is known that the three-dimensional structure of the substance is closely related to the function thereof from the result of chemical studies conducted thus far, and a relationship between the three-dimensional structure and function is shown through a chemical experiment in order to change the substance and to produce a substance having a new function. Particularly, since a structurally similar portion (or a specific portion) between the substances having the same function is considered to influence the function of the substance, it is essential to discover a similar structure commonly existing in the three-dimensional structures.
However, since there is no method of extracting a characteristic portion directly from the three-dimensional coordinate, the researchers are at present compelled to express the respective three-dimensional structures in a three-dimensional graphic system and to search the characteristic portion manually. There is in general no method of determining an orientation of the substance, and thus the characteristic portion is searched while rotating one substance using the other substance as a reference, which requires a substantial amount of time.
When the researcher searches for a similar three-dimensional structure, an r.m.s.d (root mean square distance) value is used as a measure of the similarity of the three-dimensional structures of the substances. The r.m.s.d value is the square root of the mean square distance between the corresponding elements constituting the substances. Empirically, the substances are thought to be exceedingly similar to each other in the case where the r.m.s.d value between the substances is not greater than 1 Å.
For instance, it is assumed that there are substances expressed by a point set A={a1, a2, . . . , ai, . . . , am} and a point set B={b1, b2, . . . , bj, . . . , bn}, wherein ai (i=1, 2, . . . , m) and bj (j=1, 2, . . . , n) are vectors expressing positions of the respective elements in the three-dimensional space. The elements constituting these substances A and B are related to each other, and the substance B is rotated and moved so that the r.m.s.d value between the corresponding elements is minimized. For example, if ak is related to bk (k=1, 2, . . . , n), the r.m.s.d value is obtained by the following equation (1), wherein U denotes a rotation matrix and wk denotes the respective weights:

$$\mathrm{r.m.s.d.} = \frac{\left( \sum_{k=1}^{n} \left( w_k \left( U b_k - a_k \right) \right)^{2} \right)^{1/2}}{n} \tag{1}$$
A technique of obtaining the rotation and movement of the substances that minimizes the r.m.s.d value between these corresponding points was proposed by Kabsch et al. (for example, refer to "A Solution for the Best Rotation to Relate Two Sets of Vectors," by W. Kabsch, Acta Cryst. (1976), A32, 923), and is presently widely used. However, since this method compares point sets having the same number of points, the researchers are presently studying, by trial and error, which combinations of elements should be related between the substances so as to obtain the minimum r.m.s.d value.
Further, it is necessary to study the preexisting substances in order to produce the new substance. For instance, in the case where the heat resistance of a certain substance is to be strengthened, a structure commonly existing among strongly heat-resistant substances is determined, and such a structure is added to a newly produced substance to thereby strengthen the function of the substance. To this end, a function of retrieving the necessary structure from the data base is required. However, for the aforementioned reasons, the researchers are presently searching the data base for the necessary structure by trial and error, using the computer graphic system.
As described above, the operators are compelled to graphically display the three-dimensional structure of the substance they want to analyze using the graphic system, and to analyze by visual comparison with other molecules on a screen, superposition, and like operations.
Meanwhile, basic structures such as an α helix and a β strand are commonly found in the three-dimensional structure of protein, and they are called a secondary structure. Methods of carrying out an automatic search by a similarity of the secondary structure without using the r.m.s.d. value have been considered. According to these methods, a partial structure is expressed by symbols of the secondary structures along the amino acid sequence and the comparison is made using these symbols. Therefore, the comparison could not be made according to a similarity of the spatial positional relationship of the partial structure.
As mentioned above, in the case where the three-dimensional structure of the substance is analyzed using the CSD and PDB, a great amount of time and labor is required to manually search a huge amount of data for a structure and to compare the retrieved structure with the three-dimensional structure to be analyzed, thereby imposing a heavy burden on the operators. As a result, the data included in the data base cannot be utilized effectively, thus presenting the problem that the structure of the substance cannot be analyzed sufficiently. Accordingly, there has been a need for a retrieval system that retrieves structures from a three-dimensional structure data base based on the analogy of the three-dimensional structures.
An object of the invention is to provide a method and apparatus capable of automatically extracting and evaluating mutually coinciding or similar portions between sequences of atoms or atomic groups in molecules such as protein molecules in accordance with a simple processing mechanism.
Another object of the invention is to provide a method and apparatus capable of automatically extracting and evaluating mutually coinciding or similar portions between three-dimensional structures of the molecules such as protein molecules.
In accordance with the present invention there is provided a method of analyzing sequences of atomic groups including a first sequence having m atomic groups and a second sequence having n atomic groups where m and n are integers, comprising the steps of:
a) preparing an array S[i] having array elements S[0] to S[m];
b) initializing all array elements of the array S[i] to zero and initializing an integer j to 1;
c) adding 1 to each array element S[i] that is equal to an array element S[r] and that satisfies i≧r if the array element S[r] is equal to an array element S[r−1] where r is an occurrence position of j-th atomic group of the second sequence in the first sequence;
d) adding 1 to the integer j;
e) repeating the steps c) and d) until the integer j exceeds n; and
f) obtaining a longest common atomic group number between the first and the second sequences from a value of the array element S[m].
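Steps a) to f) can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the sequences are taken to be Python strings or lists of symbols, the name `longest_common_count` and the occurrence table `occ` are hypothetical, and the occurrence positions r of one j-th atomic group are visited in descending order. The steps above do not fix that order; the sketch assumes it so that a single atomic group of the second sequence is never counted twice.

```python
def longest_common_count(first, second):
    """Longest common atomic group number via a one-dimensional array S."""
    m = len(first)
    # Steps a) and b): S[0] .. S[m], all zero.
    S = [0] * (m + 1)
    # Hypothetical helper: 1-based occurrence positions of each symbol.
    occ = {}
    for pos, sym in enumerate(first, start=1):
        occ.setdefault(sym, []).append(pos)
    # Steps d) and e): j runs over the second sequence.
    for sym in second:
        # Assumption: descending r, so S[r] and S[r-1] still hold the
        # values from before this j was processed.
        for r in reversed(occ.get(sym, [])):
            if S[r] == S[r - 1]:
                # Step c): add 1 to every S[i], i >= r, equal to S[r].
                old = S[r]
                i = r
                while i <= m and S[i] == old:
                    S[i] += 1
                    i += 1
    # Step f): the answer is read from S[m].
    return S[m]
```

Only the array S of m+1 integers and the occurrence table are held in memory, in contrast to a table with one cell per pair of positions.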
It is preferable that the method further comprises the steps of:
g) preparing an array data[k] having array elements data[0], data[1] . . . ;
h) storing paired data (r, j) in an array element data[k] if the array element S[i] is changed in the step c) where k=S[r];
i) linking the paired data (r, j) stored in the step h) to paired data (r′, j′) if r′<r and j′<j where the paired data (r′, j′) is one stored in an array element data[k−1]; and
j) obtaining a longest common subsequence between the first and the second sequences and occurrence positions of the longest common subsequence in the first and the second sequence by tracing the link formed in the step i).
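Steps g) to j) can be sketched by extending the array update with linked records. The sketch is illustrative only and rests on several assumptions: k is taken to be the value of S[r] after the update, each record links to the first qualifying record found in data[k−1], occurrence positions are visited in descending order, and the names `longest_common_subsequence` and `occ` are hypothetical.

```python
def longest_common_subsequence(first, second):
    """Return (common symbols, list of 1-based (r, j) position pairs)."""
    m = len(first)
    S = [0] * (m + 1)
    occ = {}
    for pos, sym in enumerate(first, start=1):
        occ.setdefault(sym, []).append(pos)
    # Step g): data[k] holds records (r, j, link to a record in data[k-1]).
    data = [[] for _ in range(m + 1)]
    for j, sym in enumerate(second, start=1):
        for r in reversed(occ.get(sym, [])):
            if S[r] != S[r - 1]:
                continue
            old = S[r]
            i = r
            while i <= m and S[i] == old:
                S[i] += 1
                i += 1
            k = S[r]  # assumption: k is the value after the update
            link = None
            if k > 1:
                # Step i): link to a record with r' < r and j' < j.
                link = next(e for e in data[k - 1] if e[0] < r and e[1] < j)
            # Step h): store the paired data together with its link.
            data[k].append((r, j, link))
    # Step j): trace the links back from a record of the longest length.
    pairs = []
    node = data[S[m]][0] if S[m] > 0 else None
    while node is not None:
        r, j, node = node
        pairs.append((r, j))
    pairs.reverse()
    return [first[r - 1] for r, _ in pairs], pairs
```

For example, `longest_common_subsequence("ABCBDAB", "BDCABA")` yields four position pairs; which of several equally long subsequences is returned depends on the link chosen in step i).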
In accordance with the present invention there is also provided a method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of:
a) generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to the second point set from among all candidates for the combination of correspondence; and
b) calculating a root mean square distance between the elements corresponding in the combination of correspondence generated in the step a).
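Steps a) and b) can be sketched as a brute-force filter followed by a distance calculation. The restriction condition is not specified at this point, so the sketch assumes two illustrative restrictions: the correspondence preserves the order of the elements, and every distance between two elements of the first point set agrees with the distance between their counterparts to within a tolerance `tol`. The sketch also assumes both structures are already expressed in a common frame, so no rotation is applied before the root mean square distance is taken. All function names are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def rmsd(A, B, idx):
    """Square root of the mean squared distance between A[i] and B[idx[i]]."""
    return sqrt(sum(dist(a, B[j]) ** 2 for a, j in zip(A, idx)) / len(A))


def restricted_correspondences(A, B, tol):
    """Yield index tuples idx relating A[i] to B[idx[i]].

    Assumed restrictions: idx is increasing (order preserving) and
    all pairwise distances agree to within tol.
    """
    m = len(A)
    for idx in combinations(range(len(B)), m):
        if all(abs(dist(A[i], A[k]) - dist(B[idx[i]], B[idx[k]])) <= tol
               for i in range(m) for k in range(i + 1, m)):
            yield idx


def best_correspondence(A, B, tol):
    """Return (rmsd, idx) with the smallest rmsd, or None if none qualifies."""
    scored = [(rmsd(A, B, idx), idx)
              for idx in restricted_correspondences(A, B, tol)]
    return min(scored) if scored else None
```

Filtering on the restriction before computing any distance between the two structures keeps the costly step limited to the candidates that can plausibly match.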
In accordance with the present invention there is also provided a method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of:
a) dividing the second point set into a plurality of subsets having a size that is determined by the size of the first point set;
b) generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to each of the subsets of the second point set from among all candidates for the combination of correspondence; and
c) calculating a root mean square distance between the elements corresponding in the combination of correspondence generated in the step b).
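The effect of the dividing step a) can be sketched by counting candidates. How the subset size follows from the size of the first point set is not given here, so the sketch assumes contiguous, overlapping windows whose size and overlap are chosen by the caller (for example, twice the size of the first point set); `split_into_subsets` and `candidate_counts` are hypothetical names, and the counts assume order-preserving correspondences that use every element of the first point set.

```python
from math import comb


def split_into_subsets(points, size, overlap):
    """Split points into contiguous windows of the given size.

    Assumption: consecutive windows share `overlap` elements, and a final
    window is aligned to the end so that every element is covered.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    last = max(len(points) - size, 0)
    starts = list(range(0, last + 1, size - overlap))
    if starts[-1] != last:
        starts.append(last)
    return [points[s:s + size] for s in starts]


def candidate_counts(m, n, size, overlap):
    """Compare candidate counts without and with division.

    m: size of the first point set, n: size of the second point set.
    """
    subsets = split_into_subsets(list(range(n)), size, overlap)
    undivided = comb(n, m)
    divided = sum(comb(len(s), m) for s in subsets)
    return undivided, divided
```

With m=3, n=20, size=6 and overlap=3, the count drops from 1140 to 120. The trade-off under these assumptions is that a correspondence spanning more than one window is no longer generated.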
In accordance with the present invention there is also provided a method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of:
a) dividing the first point set and the second point set into first subsets and second subsets, respectively, according to a secondary structure exhibited by the three-dimensional coordinates of the elements of the first and the second point sets;
b) generating a combination of correspondence satisfying a first restriction condition between the first subsets and the second subsets from among candidates for the combination of correspondence;
c) determining an optimum correspondence between the elements belonging to each pair of subsets corresponding in the combination of correspondence generated in the step b); and
d) calculating a root mean square distance between all of the elements corresponding in the optimum correspondence in the step c).
In accordance with the present invention there is also provided an apparatus for analyzing sequences of atomic groups including a first sequence having m atomic groups and a second sequence having n atomic groups where m and n are integers, comprising:
means for preparing an array S[i] having array elements S[0] to S[m];
means for initializing all array elements of the array S[i] to zero and initializing an integer j to 1;
means for renewing the array S[i] by adding 1 to each array element S[i] that is equal to an array element S[r] and that satisfies i≧r if the array element S[r] is equal to an array element S[r−1] where r is an occurrence position of j-th atomic group of the second sequence in the first sequence;
means for incrementing the integer j by 1;
means for repeatedly activating the renewing means and the incrementing means until the integer j exceeds n; and
means for obtaining a longest common atomic group number between the first and the second sequences from a value of the array element S[m].
It is preferable that the apparatus further comprises:
means for preparing an array data[k] having array elements data[0], data[1] . . . ;
means for storing paired data (r, j) in an array element data[k] if the array element S[i] is changed by the renewing means where k=S[r];
means for linking the paired data (r, j) stored by the storing means to paired data (r′, j′) if r′<r and j′<j where the paired data (r′, j′) is one stored in an array element data[k−1]; and
means for obtaining a longest common subsequence between the first and the second sequences and occurrence positions of the longest common subsequence in the first and the second sequence by tracing the link formed by the linking means.
In accordance with the present invention there is provided an apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising:
means for generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to the second point set from among all candidates for the combination of correspondence; and
means for calculating a root mean square distance between the elements corresponding in the combination of correspondence generated by the generating means.
In accordance with the present invention there is also provided an apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising:
means for dividing the second point set into a plurality of subsets having a size that is determined by the size of the first point set;
means for generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to each of the subsets of the second point set from among all candidates for the combination of correspondence; and
means for calculating a root mean square distance between the elements corresponding in the combination of correspondence generated by the generating means.
In accordance with the present invention there is also provided an apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising:
means for dividing the first point set and the second point set into first subsets and second subsets, respectively, according to a secondary structure exhibited by the three-dimensional coordinates of the elements of the first and the second point sets;
means for generating a combination of correspondence satisfying a first restriction condition between the first subsets and the second subsets from among candidates for the combination of correspondence;
means for determining an optimum correspondence between the elements belonging to each pair of subsets corresponding in the combination of correspondence generated by the generating means; and
means for calculating a root mean square distance between all of the elements corresponding in the optimum correspondence.