1) Field of the Invention
This invention relates to a common structure extraction apparatus which extracts a common structure from two three-dimensional structures each formed from a set of sequenced points, and more particularly to an apparatus which retrieves and extracts analogous common portions from a plurality of substances having different three-dimensional structures.
2) Description of the Related Art
In the fields of physics and chemistry, in order to investigate a property of a novel or unknown substance or artificially produce a new substance, a molecular structure is analyzed to make clear a mechanism for manifestation of a function of the substance.
Thanks to results of investigations in the past, it is known that a function and a three-dimensional structure of a substance have a close relation to each other, and it is considered that a structurally analogous (or specific) portion contributes very much to a function of a substance.
Thus, three-dimensional structures of various substances have been made clear and determined by such techniques as an X-ray crystal analysis or an NMR (Nuclear Magnetic Resonance) method, and data bases are produced for the three-dimensional structures which have thus become clear.
When a research worker tries to retrieve and extract analogous portions between three-dimensional structures from such a data base as described above, a series of cumbersome operations must be performed. If such retrieval and extraction can be performed automatically, then the burden of the series of operations to the research worker can be reduced.
In recent years, in order to assist clarification and production of a novel substance and modification to a function of a known substance, much effort has been and is being directed to operations to determine a three-dimensional structure of an object substance by such a technique as an X-ray crystal analysis or an NMR method and store the thus determined three-dimensional structure into a data base. One representative data base in world-wide use is the Protein Data Bank (PDB), in which three-dimensional structures of proteins, ribonucleic acids and like substances are registered. Further, the Cambridge Structural Database (CSD) is known as a data base in which chemical substances are registered.
A protein is constituted from a plurality of amino acids connected to each other by way of peptide linkages like a chain folded in vivo to form a three-dimensional structure and manifests various functions. The individual amino acids are represented by numbering them in order from 1 beginning with a terminal of N (nitrogen) and ending with the other terminal of C (carbon). The numbers are called amino acid numbers or amino acid residue numbers.
A protein is normally constituted from about 20 kinds of amino acids and is arranged stably, including a portion having an α helix structure, another portion having a β structure which extends generally linearly in a zigzag pattern and a further portion of a disordered random coil structure, at a variable rate. Meanwhile, each amino acid is constituted from a plurality of atoms depending upon the kind thereof. Accordingly, information including the name of a protein, a management number, the numbers of amino acids forming the protein, and the kinds and three-dimensional coordinates of atoms constituting each of the amino acids is registered in the PDB.
Again, thanks to results of investigations in the past, it is known that a function and a three-dimensional structure of a substance have a close relation to each other, and much effort is directed to operations to make clear the relation between a function and a structure. Above all, since it is considered that a structurally analogous (or specific) portion between different substances having a same function contributes very much to the function of the substances, it is essentially required to find out an analogous structure which exists commonly between different three-dimensional structures.
Under present conditions, however, since no technique is available to directly extract a characteristic portion from three-dimensional coordinates of a three-dimensional structure of a substance, each research worker manually searches for a characteristic portion by displaying each three-dimensional structure by means of a 3D (three-dimensional) graphic system. Generally, there is no fixed method for deciding the orientation of a substance, and since a substance is rotated with reference to another substance to search for a characteristic portion thereof, much time is required for the operation.
When a research worker searches for an analogous three-dimensional structure, an rmsd (root mean square distance) value is used as a scale for the analogy between three-dimensional structures of substances. The rmsd value is a square root of a mean square distance between components of substances matched with each other. Empirically, where the rmsd value between two substances is smaller than 1 angstrom, it is considered that the two substances are very analogous to each other.
A popular method for calculation of an rmsd value will be described below with reference to FIGS. 79(A) to 79(D).
It is assumed that there are a substance A represented by such a point set P={p1, p2, . . . , pi, . . . , pN} as shown in FIG. 79(A) and another substance B represented by such a point set T={t1, t2, . . . , tj, . . . , tN} as shown in FIG. 79(B). The elements (points) constituting the substances A and B are matched with each other as shown in FIG. 79(C), and the substance B is rotated or moved and superposed on the substance A as shown in FIG. 79(D) so that the rmsd value between the thus matched elements may exhibit a lowest value. The rmsd value is calculated in accordance with the following equation:

\[ \mathrm{rmsd} = \sqrt{\frac{\sum_{k=1}^{N} \left( w_k \left( U t_k - p_k \right) \right)^2}{N}} \]

where N is the number of the matched points, U is a rotation matrix, and wk is a weight at each of the matched points.
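The equation above can be illustrated by a minimal sketch, which is not part of the apparatus described here. It assumes that the points have already been matched in order, that the rotation matrix U is given as a 3x3 nested list, and that any translation has been applied beforehand. The name weighted_rmsd and its parameters are hypothetical.

```python
import math

def weighted_rmsd(p, t, rotation, weights=None):
    """Return sqrt(sum_k (w_k * |U t_k - p_k|)^2 / N) for matched points.

    p, t     : equal-length lists of (x, y, z) tuples, matched index by index
    rotation : 3x3 rotation matrix U as a nested list (assumed given)
    weights  : optional list of w_k; defaults to 1.0 for every point
    """
    if len(p) != len(t) or not p:
        raise ValueError("point sets must be non-empty and of equal size")
    n = len(p)
    if weights is None:
        weights = [1.0] * n
    total = 0.0
    for pk, tk, wk in zip(p, t, weights):
        # Rotate t_k by U.
        ut = [sum(rotation[r][c] * tk[c] for c in range(3)) for r in range(3)]
        # Accumulate the squared weighted displacement.
        total += sum((wk * (ut[i] - pk[i])) ** 2 for i in range(3))
    return math.sqrt(total / n)
```

With the identity matrix as U, the function reduces to the plain root mean square distance between the matched points.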
A technique for calculation of rotation or movement of a substance which minimizes the rmsd value between such matched points has been proposed by Kabsch et al. and is utilized widely at present.
However, since the technique compares different substances with each other in regard to an equal number of points, under present conditions a research worker must find out by trial and error which matching between components of one substance and components of the other substance minimizes the rmsd value.
Further, in order to produce a novel substance, existing substances must be investigated. For example, when it is intended to increase the heat resisting property of a certain substance, a structure common to substances which are high in heat resisting property is searched out, and the structure is additionally provided to the substance to be produced newly to promote the function.
Accordingly, a function of retrieving a necessary structure from a data base is required. However, for a reason similar to that described above, under present conditions a structure is searched out from a data base by trial and error of a research worker using a computer graphic system.
Furthermore, in recent years, the importance of an analogous structure is recognized also in designing and improvement of a protein. One example is an experiment for improvement in function of human lysozyme (HL). It has been found out that the three-dimensional structure of the protein HL, which does not have an activity to couple a calcium ion, includes a structure analogous to that of α lactalbumin, which is a protein which couples a calcium ion.
Thus, it has been reported that an experiment to replace an amino acid at a portion of the structure in HL with another amino acid of a different kind by a genetic recombination operation proved coupling of the resulting substance to a calcium ion (Kuroki R. et al., Proc. Natl. Acad. Sci. U.S.A., 86, pp. 6,903-6,907, 1989). As can be seen also from the report, information which is very important for designing and improvement of a protein can be obtained by paying attention to analogy between structures.
The assignee of the present invention has proposed a three-dimensional structure processing apparatus designed so as to superpose sets of points forming three-dimensional structures, sets of points having sequential relations or sets of partially matched points with each other such that the rmsd value between them may have an optimum value or to search out a structure having a high degree of analogy from a data base of three-dimensional structures of protein (refer to Japanese Patent Laid-Open Application No. Heisei 6-180737, Application No. Heisei 4-331703, filed on Dec. 11, 1992).
In the three-dimensional structure processing apparatus, a combination of a fixed number of points on a three-dimensional coordinate system represented by a point set is prepared as a search key (probe), and the point set is searched from among point sets representing three-dimensional structures of a plurality of substances stored in a data base to determine whether or not a same or analogous structure to that of the point set serving as the probe is included as a structure of a portion of the three-dimensional structure of the substance.
To this end, the three-dimensional structure processing apparatus fundamentally operates in the following manner. First, upon matching of elements of two point sets, such a method as to match them with the objects displaced from each other or to make combinations of matching of points using a tree structure is used. Then, narrowing down of candidates (points determined to have been matched) based on a geometrical relation, narrowing down of candidates based on a predetermined threshold value requirement, narrowing down of candidates based on an attribute of a point and some other narrowing down are performed to produce combinations of elements satisfying the requirements. Thereafter, from among the thus produced combinations, a combination which presents a minimum mean of distances between individual points (which corresponds to an rmsd value) of the two point sets is searched out, and the position and the orientation in which the two three-dimensional structures coincide best with each other are calculated. Then, a result of the thus calculated superposition is outputted as a result of retrieval.
The narrowing down of candidates based on a geometrical relation is performed by any of such techniques as described in the following items ① to ③; the narrowing down of candidates based on a predetermined threshold value requirement is performed by such a technique as described in the following item ④; and the narrowing down of candidates based on an attribute of a point is performed by such a technique as described in the following item ⑤.
① Narrowing down of candidates based on a distance relation: upon matching, only those point sets between which the distance relation between an element in a point set (point set A) and s adjacent elements and the distance relation between an element in the other point set (point set B) and s adjacent elements remain within a tolerance are selected to narrow down the candidates.
② Narrowing down of candidates based on an angle: only those point sets between which angles between an element of a point set A and s adjacent elements remain within a tolerance from angles between an element in the other point set B and s adjacent elements are selected to narrow down the candidates.
③ Narrowing down of candidates based on a distance and an angle from the center of gravity: the centers of gravity are calculated among selected points, and distances and angles with respect to the thus calculated centers of gravity are compared with each other in a similar manner as in the technique ① or ② described above to narrow down the candidates.
④ Narrowing down of candidates based on a threshold value requirement: a predetermined threshold value is set, and when an attribute value of a candidate is higher than the threshold value, it is abandoned or trimmed away. In this instance, the number of nils (points for which matched points are not present) is limited such that, upon matching between elements of a point set A and the other point set B, when the total number of nils becomes greater than the threshold value, the elements are removed from candidates of combinations to avoid production of a useless candidate. Further, when elements bi of the point set B are matched with elements ai of the point set A, if the rmsd value among all points is extremely great, since it is desired to except the elements from candidates, a threshold level for the rmsd value is provided, and if the rmsd value is equal to or lower than the threshold level, then the point is left as a candidate, but if the rmsd value is higher than the threshold level, the point is excepted from a candidate. Thus, candidates for matching are narrowed down efficiently.
⑤ Narrowing down of candidates based on an attribute of a point: as an attribute of a point, for example, the kind, the hydrophilic property, the hydrophobic property or the polarity of charge of an atom, an atomic group or a molecule may be used. By checking whether or not such attribute or attributes of a point coincide with those of another point, it is determined whether the point should be left as a candidate.
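Three of the narrowing down techniques listed above (distance relation, limitation of the number of nils, and attribute of a point) can be illustrated by the following minimal sketch. It is given under assumptions and does not reproduce the earlier apparatus: "s adjacent elements" is taken to mean the s elements which follow in sequence, a nil is represented by None, and all function names are hypothetical.

```python
import math

def distance_relation_ok(set_a, i, set_b, j, s, tol):
    """Distance relation check: compare the distances from element i of
    point set A to its next s sequence neighbours with the corresponding
    distances from element j of point set B (assumed reading of 'adjacent')."""
    for k in range(1, s + 1):
        if i + k >= len(set_a) or j + k >= len(set_b):
            break  # fewer than s neighbours remain in one of the sets
        da = math.dist(set_a[i], set_a[i + k])
        db = math.dist(set_b[j], set_b[j + k])
        if abs(da - db) > tol:
            return False  # distance relation is outside the tolerance
    return True

def too_many_nils(matching, max_nils):
    """Threshold check: matching[k] is the index in B matched with element k
    of A, or None (nil) when no matched point is present."""
    return sum(1 for m in matching if m is None) > max_nils

def attributes_ok(attr_a, attr_b):
    """Attribute check: keep the pair only if the attributes coincide."""
    return attr_a == attr_b
```

A candidate pair (i, j) would be left as a candidate only while each check applied to it passes; the order in which the checks are applied is a design choice which this sketch leaves open.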
If matching of components of substances can be automated as described above, then it is possible to retrieve and extract, from a data base, an analogous structure which exists commonly between different substances having a same function. However, when a three-dimensional structure of a substance is analyzed making use of the existing CSD or PDB, since retrieval of structures from a large amount of data and comparison between structures are performed by a manual operation, much time and labor are required, which is a burden to the operator.
Further, with the three-dimensional structure processing apparatus proposed by the assignee of the present invention, it can be retrieved whether or not a partial structure constituted from a point set of a fixed scale which has a known structure is present as a common structure in a three-dimensional structure of another substance.
However, the three-dimensional structure processing apparatus has a problem to be solved in that, when two three-dimensional structures having similar functions and having a common structure are superposed as a whole, it is difficult to detect what portions of the entire three-dimensional structures have a common structure, because the portion which serves as a key for retrieval (probe) is unknown.
If a common structure which is similar in structure can be extracted when partial matching is performed to superpose two three-dimensional structures with each other, then it is recognized that also the substance of one of the two three-dimensional structures has a same function as the function which the substance having the other three-dimensional structure has.
Further, when two different three-dimensional structures are known to have a plurality of common structures from the fact that they have similar functions, it is sometimes unknown what common structure makes the center (or makes a nucleus). In this instance, if the partial structure (structure which makes a key) serving as the center for superposition is determined in error, then when the two three-dimensional structures are superposed at the nucleus provided by the partial structure, even if an analogous common structure is actually included in the two three-dimensional structures, a common structure cannot be detected. Therefore, another problem to be solved by the three-dimensional structure processing apparatus is precise discrimination of a common structure which makes the center.
It is an object of the present invention to provide a common structure extraction apparatus wherein analogous portions in different three-dimensional structures can be extracted automatically by means of a computer to allow automation of superposed display of three-dimensional structures in a computer graphic system or retrieval of an analogous three-dimensional structure from a data base to reduce the time, the number of operators and the cost required for a retrieving and extracting operation of a common structure and achieve a high efficiency in a retrieving and extracting operation of a common structure.
In order to attain the object described above, according to the present invention, there is provided a common structure extraction apparatus for extracting, from two sequenced point sets each forming a three-dimensional structure, a set of points of a common portion between the two point sets as a common structure between the two three-dimensional structures, which comprises an entire structure superposition section for parallelly and rotationally moving the entire two point sets in accordance with partial matching information for partial matching between the two point sets to superpose the two point sets with each other, a common portion length calculation section for calculating a number of points paired with each other to form a common portion between the two point sets superposed with each other by said entire structure superposition section as a common portion length, a cumulative distance calculation section for accumulating distances between the points paired with each other to form a common portion between the two point sets superposed with each other by said entire structure superposition section to obtain cumulative distance information, and a common portion extraction section for extracting that one of common portions between the two point sets with which the common portion length calculated by said common portion length calculation section exhibits a greatest length and the cumulative distance information calculated by said cumulative distance calculation section exhibits a lowest value as a common structure.
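The selection performed by the common portion extraction section can be illustrated by a minimal sketch. It assumes that the two point sets have already been superposed, that each candidate common portion is given as a list of paired points, and that the two criteria are combined by preferring the greatest common portion length and, among equal lengths, the lowest cumulative distance. That ordering and all function names are assumptions made for illustration only.

```python
import math

def common_portion_length(pairs):
    """Number of points paired with each other in a common portion."""
    return len(pairs)

def cumulative_distance(pairs):
    """Sum of distances between the paired points after superposition."""
    return sum(math.dist(p, t) for p, t in pairs)

def extract_common_structure(candidates):
    """Pick the candidate with the greatest length; ties are broken by
    the lowest cumulative distance (assumed ordering of the criteria)."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda pairs: (-common_portion_length(pairs),
                           cumulative_distance(pairs)),
    )
```

Here each element of candidates is a list of (p, t) tuples, where p and t are the superposed coordinates of a pair of points from the two point sets.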
With the common structure extraction apparatus, a plurality of three-dimensional structures which can be partially matched with each other can be superposed with each other to accurately and rapidly extract another common structure existing between the three-dimensional structures. This allows display of a common structure by a graphic system, retrieval of an analogous structure from a data base, estimation of a function based on analogy in structure and so forth.
Accordingly, since an operation which has conventionally been carried out by trial and error by research workers in order to achieve improvements for discovery or reinforcement of a function of a substance such as a protein can be established and executed as a research and development cycle in which a function is estimated based on a structure and then a result of the estimation is proved by an experiment, the efficiency in operation can be improved very much.
Further objects, features and advantages of the present invention will become apparent from the following detailed description when read in conjunction with the accompanying drawings in which like parts or elements are denoted by like reference characters.