1. Field of the Invention
The present invention relates to a data structure used to search an array for a frequently appearing segment, such as a character string, or to search for an array segment that is common to two or more arrays, and to a pattern search method using this data structure.
2. Related Art
A suffix tree is a well known data structure that can be effectively employed to perform a quick search of character strings for a frequently appearing segment or for a character string segment used in common in two or more character strings. A suffix tree represents all the suffixes of a character string, and is constructed after adding, to the end of the process target character string, the character $, which is not present in the character strings that are processed. The leaf nodes (nodes, at the ends of edges, to which no further edges are connected) of a suffix tree correspond to the individual suffixes.
When a specific character is designated in a predetermined character string, the suffix is the character string segment that begins at the specific character and continues to the end of the character string.
FIG. 6 is a diagram showing an example suffix tree. In FIG. 6, the suffix tree is constructed for a character string “mississippi$”, obtained by adding the character $ to the end of the process target character string “mississippi”.
Appended to each edge of the suffix tree is a label corresponding to a character string segment, and the arrangement of the labels appended to the edges descending from the root node to a leaf node is employed to define the pertinent leaf node as a suffix. In the example in FIG. 6, “issippi” is a suffix corresponding to the leaf node that is reached from the root node via edges to which the labels “i”, “ssi” and “ppi” are appended. Similarly, “ssissippi” is a suffix for a leaf node that is reached from the root node via edges to which the labels “s”, “si” and “ssippi” are appended.
Further, the labels appended to the edges extending from a single node (including the root node) in the suffix tree each begin with a different first character, and the edges are sorted in accordance with the first character of each label. In the example in FIG. 6, the edges are arranged in alphabetical order (in the order i, m, p and s) from left to right.
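The structure described above can be illustrated with a minimal sketch. It assumes a naive construction that inserts each suffix in turn and splits an edge where labels diverge; this is not one of the well known fast algorithms, and the names Node, build_suffix_tree and leaf_paths are hypothetical. The edges are visited in plain character-code order, in which $ precedes the letters; this may differ from the ordering used in FIG. 6.

```python
class Node:
    def __init__(self):
        # Maps the first character of an edge label to (label, child node).
        self.children = {}


def build_suffix_tree(text):
    """Naive construction: insert every suffix of text (which must end in a unique '$')."""
    root = Node()
    for start in range(len(text)):
        node, suffix = root, text[start:]
        while True:
            first = suffix[0]
            if first not in node.children:
                # No edge starts with this character: add a new leaf edge.
                node.children[first] = (suffix, Node())
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
                k += 1
            if k == len(label):
                # The whole label matches: descend and continue with the remainder.
                node, suffix = child, suffix[k:]
                continue
            # Labels diverge inside the edge: split it at position k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            mid.children[suffix[k]] = (suffix[k:], Node())
            node.children[first] = (label[:k], mid)
            break
    return root


def leaf_paths(node, prefix=()):
    """Yield, for each leaf, the tuple of edge labels from the root, edges in sorted order."""
    if not node.children:
        yield prefix
        return
    for first in sorted(node.children):
        label, child = node.children[first]
        yield from leaf_paths(child, prefix + (label,))


tree = build_suffix_tree("mississippi$")
for path in leaf_paths(tree):
    print(path, "->", "".join(path))
```

Concatenating the labels along each path yields one suffix per leaf node, so the twelve paths printed correspond to the twelve suffixes of “mississippi$”.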
A well known algorithm for the generation of a suffix tree requires a time of O(n log s), wherein n denotes the length of the target character string, and s denotes the size of the alphabet (the number of character types) constituting the character string. An O(n) algorithm is also well known for the case in which an integer alphabet (the numbers from 1 to n) is used. The meaning of O(func(n)) is that, when the actual calculation time is t, a pair of constants c and k exists such that 0≦t≦c×func(n) is established for every n (n≧k). Therefore, O(n log s) means that the calculation can be performed within a time period proportional to n log s, and O(n) means that the calculation can be performed within a time period proportional to n.
When a suffix tree generated in this manner is employed, the search for a character string segment having a length m can be performed within a time period corresponding to O(m log s). And since the size of the alphabet used is normally a constant, this period can be regarded as linear in m. The memory capacity required for a storage device used to process a suffix tree for English text (n characters) is 20n to 40n bytes.
Since the data size of a suffix tree is large, to reduce this size, a well known suffix array is used as the data structure when a search is performed for a similar pattern. As is described above, each of the leaf nodes of a suffix tree corresponds to a suffix of a character string. When the suffixes corresponding to the leaf nodes are arranged in order, beginning at one end (the left in FIG. 6) of the suffix tree, an array can be obtained wherein all the suffixes of a process target character string are listed in order, as in a dictionary. It should further be noted that the end character $ is placed at the end of each suffix.
The suffixes, which are the elements of the array, are replaced by entries representing the positions of the first character of each suffix in the target character string; for example, “ippi$” is replaced by “8” and “issippi$” is replaced by “5”. As a result, an array (a suffix array) is obtained that has the same length as the target character string. In FIG. 6, for example, the suffix array for “mississippi$” is “8 5 2 11 1 10 9 7 4 6 3 12”. In this case, it is assumed that the character $ is placed after all the other characters, in accordance with the dictionary order.
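The replacement of suffixes by their start positions can be illustrated with a minimal sketch. It assumes a naive construction that simply sorts the suffixes, which is not the efficient method; the function name suffix_array is hypothetical. As in the text, positions are counted from 1 and the character $ is treated as following all the other characters.

```python
def suffix_array(text, terminator="$"):
    """Return the 1-based start positions of the suffixes of text + terminator,
    in dictionary order, with the terminator sorted after every other character."""
    s = text + terminator

    def key(start):
        # Map the terminator above the largest Unicode code point so it sorts last.
        return [0x110000 if ch == terminator else ord(ch) for ch in s[start:]]

    order = sorted(range(len(s)), key=key)
    return [start + 1 for start in order]


print(suffix_array("mississippi"))
# prints [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
```

Sorting whole suffixes in this way takes far longer than the known construction methods; the sketch is intended only to make the resulting array concrete.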
When a suffix array is employed, compared with when a suffix tree is used, the memory capacity required for a search performed for a character string is reduced. Further, since a binary search is performed, the time required for the character string search is O(p log q), where q denotes the size of a database and p denotes the length of the character string to be searched for. Generally, since the required memory capacity is four bytes for each entry of the array, i.e., for each character of the text, and the text itself consists of English characters (one byte each), 5n bytes are required for a database containing n characters of text.
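The binary search described above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function names build_suffix_array and find_occurrences are hypothetical, positions are counted from 0, and plain character-code order is used (so $ sorts before the letters, unlike the ordering assumed for FIG. 6). The search is valid for any consistent ordering.

```python
def build_suffix_array(s):
    # Naive construction by sorting suffixes; 0-based start positions.
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, pattern):
    """Return the sorted 0-based positions at which pattern occurs in s."""
    p = len(pattern)

    # Lower bound: first suffix whose leading p characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose leading p characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix between the two bounds begins with the pattern.
    return sorted(sa[first:lo])


text = "mississippi$"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ssi"))  # prints [2, 5]
```

Each step of the binary search compares at most p characters, and about log q steps are needed, which is the source of the O(p log q) search time.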
Further, a table holding the common prefix length of each pair of adjacent suffixes can also be provided. When such a table is employed, compared with when only a suffix array is employed, the search time can be reduced to O(p+log q). In this case, 9n bytes are required for the database.
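The table of common prefix lengths can be illustrated with a minimal sketch. It takes the suffix array given above for “mississippi$” (positions counted from 1) and compares adjacent suffixes character by character; the function name lcp_table and the convention that the first entry is 0 are assumptions. The sketch only computes the table and does not implement the O(p+log q) search itself.

```python
def lcp_table(s, sa_1based):
    """For each suffix in the array, return the length of the prefix it shares
    with the preceding suffix in the array (0 for the first entry)."""
    starts = [pos - 1 for pos in sa_1based]  # convert to 0-based indices

    def common_prefix_len(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n

    return [0] + [
        common_prefix_len(starts[k - 1], starts[k]) for k in range(1, len(starts))
    ]


text = "mississippi$"
sa = [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
print(lcp_table(text, sa))
# prints [0, 1, 4, 1, 0, 0, 1, 0, 2, 1, 3, 0]
```

For example, the third entry is 4 because the adjacent suffixes “issippi$” and “ississippi$” share the prefix “issi”.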