1. Technical Field
The present invention relates to a method for changing an array, and to a method, an apparatus, a storage medium and a transmission medium for analyzing a structure. In particular, the present invention relates to a method for changing an array in order to analyze its structure, a method for analyzing the structure of an array by employing the array changing method, an array structure analysis apparatus that employs the array structure analysis method, a storage medium on which is stored a program that permits a computer to implement the array structure analysis method, and a transmission medium for transmitting that program.
2. Prior Art
Recently, the deciphering of genetic information has been completed for a variety of organisms other than human beings, and it is now anticipated that the same deciphering can be performed for the human genome. DNA, which is the main component of chromosomes, is represented by an array of four bases: adenine (A), thymine (T), cytosine (C) and guanine (G). RNA, which is transcribed from DNA, is represented by an array of four bases in which the T in DNA is replaced by uracil (U). For both, genetic information is analyzed by representing the base array of a single-stranded DNA or RNA as a character string for convenience, extracting from the obtained character string the patterns that appear frequently, and analyzing the extraction results.
A suffix tree (a data structure) is well known as a conventional technique that is effective for rapidly searching character strings in order to extract a character string that appears frequently, or a character string that is common to two or more character strings. The suffix tree represents all the suffixes of a character string to which the character “$,” which does not exist in the string currently being processed, has been added at the end. An example is the character string “mississippi$” that, as is shown in FIG. 7, is obtained by adding the character “$” at the end of “mississippi,” the character string that is currently being processed.
As is shown in FIG. 7, a label that corresponds to a character string is provided for each edge of the suffix tree. The labels provided for the outgoing edges of one node (including the root node) each begin with a different character, and the edges are sorted in accordance with the first characters of their labels (for example, in FIG. 7, the edges are arranged in English alphabetical order from left to right). In the suffix tree, the concatenation of the labels provided for the individual edges from the root node to a specific leaf node (a node at the distal end of an edge to which no other edge is connected) is the suffix that corresponds to that leaf node. For example, “issippi$” is the suffix that corresponds to the leaf node reached from the root node through the edges with labels “i,” “ssi” and “ppi$,” and “ssissippi$” is the suffix that corresponds to the leaf node reached from the root node through the edges with labels “s,” “si” and “ssippi$.”
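The edge labels described above can be reproduced with a small sketch. The following Python code is only an illustration, not the linear-time construction referred to in this section: it inserts the suffixes one by one into a compressed trie, which takes quadratic time. The function names and the use of nested dictionaries (edge label mapped to child node) are assumptions made for this example, and the input is assumed not to contain the terminator character.

```python
def insert_suffix(node, suffix):
    """Insert one suffix below `node`, splitting an edge where labels diverge."""
    for label in list(node):
        # Length of the common prefix of the edge label and the suffix.
        k = 0
        while k < min(len(label), len(suffix)) and label[k] == suffix[k]:
            k += 1
        if k == 0:
            continue  # different first character: try the next edge
        if k == len(label):
            # The whole label matched: descend and insert the remainder.
            insert_suffix(node[label], suffix[k:])
        else:
            # Partial match: split the edge at the point of divergence.
            child = node.pop(label)
            node[label[:k]] = {label[k:]: child, suffix[k:]: {}}
        return
    node[suffix] = {}  # no edge shares the first character: add a new leaf


def build_suffix_tree(text, terminator="$"):
    """Naive construction; `text` is assumed not to contain `terminator`."""
    s = text + terminator
    root = {}
    for start in range(len(s)):
        insert_suffix(root, s[start:])
    return root


def leaf_suffixes(node, prefix=""):
    """Concatenate the labels from the root to each leaf (plain code point order)."""
    if not node:
        return [prefix]
    result = []
    for label in sorted(node):
        result.extend(leaf_suffixes(node[label], prefix + label))
    return result


tree = build_suffix_tree("mississippi")
```

In the resulting tree, the path through the labels "i", "ssi" and "ppi$" ends at a leaf, as does the path through "s", "si" and "ssippi$", which matches the two examples given for FIG. 7. This sketch also creates a leaf for the suffix that consists of "$" alone; whether FIG. 7 includes such a leaf is not stated here.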
An algorithm is well known whereby the data structure of a suffix tree can be constructed in O(n log s) time, where n denotes the length (character count) of the original character string and s denotes the number of types of alphabetic characters that form the original character string. In particular, when the alphabet is an integer alphabet (numerals from 1 to n), the suffix tree can be constructed in O(n) time. Therefore, even when a target character string is enormously long, like a character string that represents a DNA or an RNA base array, the suffix tree for the pertinent character string can be completed within a short period of time (more specifically, in time linear in the length of the original character string). Further, if the suffix tree is employed, a character string having a length (character count) m can be found in the target character string in O(m log s) time, so that a character string used in common or a frequently appearing character string can be listed within a short period of time (in time linear in the length of the original character string).
In addition, when the label provided for each edge is replaced with information that represents the locations, in the original character string, of the first character and the last character (the character preceding “$”) of the label (e.g., “mississippi$” is replaced with [1●11]), the size of the representation of the suffix tree can be held to a constant multiple of the length of the original character string. The suffix array is also well known as a technique by which the size of the representation of a suffix tree can be reduced.
As was previously described, leaf nodes of a suffix tree correspond respectively to the suffixes of an original string. When the individual suffixes are arranged beginning with the suffix that corresponds to the leaf node at one end of the suffix tree (the left end in FIG. 7), an array wherein all the suffixes of an original string are arranged in dictionary order is obtained. When the suffixes that are elements of the array are replaced with data that represent the locations of the first characters of the suffixes in the original string (e.g., “ippi$” is replaced with “8”), an array (called a suffix array) having the same length as the original character string is obtained. For example, the suffix array for “mississippi” in FIG. 7 is “8 5 2 11 1 10 9 7 4 6 3.”
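As an illustration of the preceding paragraph, the following Python sketch builds a suffix array by sorting the start positions of the suffixes directly. It is a naive construction for small inputs only. The function name is hypothetical, and the sketch rests on one assumption that is not stated in the text: the terminator "$" is taken to sort after every letter, because that ordering reproduces the array quoted above for "mississippi".

```python
def suffix_array(text, terminator="$"):
    """Return the 1-based start positions of the suffixes in dictionary order.

    Assumption for this sketch: the terminator sorts AFTER every other
    character. `text` is assumed not to contain the terminator.
    """
    s = text + terminator

    def sort_key(start):
        # Map the terminator to a value above every Unicode code point.
        return [0x110000 if ch == terminator else ord(ch) for ch in s[start:]]

    # Only suffixes that start inside the original string are listed, so the
    # array has the same length as the original string.
    order = sorted(range(len(text)), key=sort_key)
    return [start + 1 for start in order]


print(suffix_array("mississippi"))
# prints [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3]
```

If "$" were instead taken to sort before every letter, the suffix "i$" would move to the front and the array would begin 11 8 5 2; the remaining entries would be unchanged.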
When the above suffix array is employed, the memory capacity required for a search for a character string can be reduced compared with when a suffix tree is employed. However, the time required for searching for the character string is O(m log n), where n denotes the length of a target character string and m denotes the length of a character string that is to be searched for.
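The O(m log n) search time follows from a binary search over the sorted suffixes, in which each of the O(log n) steps compares at most m characters. The following Python sketch illustrates this under simplifying assumptions: positions are 0-based, no terminator is appended, plain Python string ordering is used, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array (0-based positions, plain string ordering)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 0-based positions at which `pattern` occurs in `text`."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes between the two bounds begin with the pattern.
    return sorted(sa[first:lo])


text = "mississippi"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ssi"))  # prints [2, 5]
```

Only the array of integers and the original string are kept in memory, which is the saving over the suffix tree that the paragraph above describes.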
A parameterized suffix tree is also well known as a technique (a data structure) for searching for a character string that frequently appears or a character string used in common when the character string includes a variable. For a gene sequence, such as a DNA or RNA base array, a specific element in the array may be exchanged with another specific element (for example, the A and T or the G and C of DNA complement each other and can be exchanged). Thus, in a parameterized suffix tree, replaceable elements of an array are treated as variables, and two character strings that include variables are regarded as being the same when they can be made identical by replacing the variables.
For example, when x, y and z are defined as variables and a, b and c are defined as fixed characters, “axbycxaza” and “azbxczaya” are regarded as being the same character string (called a p-string (Parameterized String)) because the same character array can be obtained by exchanging the variables x, y and z. An encoding that is expressed as prev( ) is used to detect a p-string. This encoding replaces each variable in a character string with a numerical value that represents the distance from the immediately preceding occurrence of the same variable (the first occurrence of each variable is replaced with 0). When the encoding prev( ) is performed for the two previously mentioned character strings, prev(axbycxaza)=prev(azbxczaya)=a0b0c4a0a is obtained.
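The prev( ) encoding can be sketched in a few lines of Python. The function name prev_encode and the choice to return a list of tokens rather than a string are assumptions made for this illustration; a list keeps multi-digit distances unambiguous.

```python
def prev_encode(s, variables):
    """Encode `s` so that each variable becomes the distance to its previous
    occurrence (0 for a first occurrence); fixed characters are kept as-is."""
    last_seen = {}  # variable -> index of its most recent occurrence
    encoded = []
    for pos, ch in enumerate(s):
        if ch in variables:
            encoded.append(pos - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = pos
        else:
            encoded.append(ch)
    return encoded


print(prev_encode("axbycxaza", "xyz"))
# prints ['a', 0, 'b', 0, 'c', 4, 'a', 0, 'a']
print(prev_encode("axbycxaza", "xyz") == prev_encode("azbxczaya", "xyz"))
# prints True
```

Two strings are the same p-string exactly when their encodings are equal, so the comparison on the last line is the detection step described above.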
The parameterized suffix tree represents the results obtained by performing the prev( ) encoding for each of the suffixes of a character string to which the character $, which is not present in the target character string, has been added (this differs from a normal suffix tree prepared by performing the prev( ) encoding once for the whole character string to which the character $ has been added and regarding the resulting array as a normal character string). In a parameterized suffix tree, as in a suffix tree, the leaf nodes correspond to the respective suffixes. Each edge has a label that corresponds to a partial character string, and the concatenation of the labels provided for the edges from the root node to a specific leaf node represents the result obtained by the prev( ) encoding of the suffix that corresponds to that leaf node.
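The distinction drawn in the parenthetical above can be made concrete. In the following Python sketch, which uses hypothetical function names and a made-up input string, the encoding of a suffix is compared with the corresponding tail of the encoding of the whole string. The two differ whenever a variable in the suffix has an earlier occurrence that lies outside the suffix.

```python
def prev_encode(s, variables):
    """Replace each variable with the distance to its previous occurrence
    (0 for a first occurrence); fixed characters are kept as-is."""
    last_seen = {}
    encoded = []
    for pos, ch in enumerate(s):
        if ch in variables:
            encoded.append(pos - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = pos
        else:
            encoded.append(ch)
    return encoded


def parameterized_suffixes(s, variables):
    """Encode every suffix separately, as a parameterized suffix tree requires."""
    return [prev_encode(s[i:], variables) for i in range(len(s))]


s = "xaxbx$"  # illustrative input: x is a variable, a and b are fixed
whole = prev_encode(s, "x")
per_suffix = parameterized_suffixes(s, "x")

print(whole[2:])      # prints [2, 'b', 2, '$']  (tail of the whole-string encoding)
print(per_suffix[2])  # prints [0, 'b', 2, '$']  (encoding of the suffix "xbx$")
```

The leading 2 in the first result refers to an occurrence of x that precedes the suffix, so it must become 0 when the suffix is encoded on its own.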
Further, in the same manner as for the suffix tree, the labels provided for the edges extending from a node (including the root node) each begin with a different character, and the labels are sorted in accordance with their first characters. In addition, since the label of each edge is represented by its first and last positions in the original character string, the size of the data structure is a constant multiple of the length of the character string.
For a gene sequence, such as a DNA or RNA base array, it is well known that arrays that have the same structure tend to have the same functions or properties even though they may differ in appearance. For a DNA base array, for example, when either or both of the complementary A and T, and the complementary G and C components are exchanged with each other, or when the non-complementary A and C components are exchanged and the non-complementary T and G components are exchanged, the structure of the array (the relationship of the elements of the array) tends to be unchanged, even though the array differs from the original array, and the functions and the properties obtained by effecting the exchange tend to be similar to those of the original array. Therefore, when analyzing a gene sequence, it is extremely important that arrays having the same structure be defined as the same array, regardless of whether the arrays themselves are identical, and that a frequently appearing array be extracted or that a partial array commonly included in two arrays be searched for.
On the other hand, with the conventional technique for employing a suffix tree or a suffix array, a character string other than an identical one cannot be defined as being the same character string, so that even though an array may have the same structure, if it has a different element arrangement it cannot be treated as the same array. Further, in a parameterized suffix tree, any character string in which variables are simply replaced is defined as being the same character string. Thus, when, for example, only A and C are exchanged in the DNA base array, or only T and G are exchanged, or when A is exchanged with C and T with A, an array having a different structure from the original array cannot be distinguished from an array having the same structure as the original array. Therefore, even when any of the above conventional techniques is employed, it is difficult to efficiently analyze a gene sequence.
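The limitation of the parameterized approach can be illustrated with a short Python sketch. It assumes that all four bases are treated as variables, uses a made-up five-base array, and uses hypothetical function names. An array in which only A and C are exchanged then receives the same prev( ) encoding as an array in which the complementary pairs are exchanged, so the encoding cannot tell the two cases apart.

```python
def prev_encode(s, variables):
    """Replace each variable with the distance to its previous occurrence
    (0 for a first occurrence); fixed characters are kept as-is."""
    last_seen = {}
    encoded = []
    for pos, ch in enumerate(s):
        if ch in variables:
            encoded.append(pos - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = pos
        else:
            encoded.append(ch)
    return encoded


def exchange(s, pairs):
    """Exchange the two bases of every pair, e.g. pairs=["AT", "GC"]."""
    mapping = {}
    for a, b in pairs:
        mapping[a], mapping[b] = b, a
    return "".join(mapping.get(ch, ch) for ch in s)


BASES = "ATGC"                                # assumption: every base is a variable
original = "ATGCA"                            # illustrative array, not from the text
complemented = exchange(original, ["AT", "GC"])  # A<->T and G<->C -> "TACGT"
only_a_c = exchange(original, ["AC"])            # only A<->C      -> "CTGAC"

# All three arrays receive the same encoding.
print(prev_encode(original, BASES))      # prints [0, 0, 0, 0, 4]
print(prev_encode(complemented, BASES))  # prints [0, 0, 0, 0, 4]
print(prev_encode(only_a_c, BASES))      # prints [0, 0, 0, 0, 4]
```

A plain suffix tree or suffix array has the opposite problem: it would treat all three arrays as different, including the complemented one.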
To resolve the above shortcomings, it is one object of the present invention to provide a method for changing an array in order to efficiently analyze the structure of the array.
It is another object of the present invention to provide a method, an apparatus, a storage medium and a transmission medium for efficiently analyzing the structure of an array.