1. Field of the Invention
This invention relates to code string searching, in which a computer searches for codes or code strings consisting of bit strings, in the same way that a character string search looks for character codes or character code strings consisting of bit strings.
2. Description of Related Art
In recent years it has become customary to create business documents with word processors, and with the spread of the Internet, the number and size of electronic documents that use character codes consisting of bit strings processable by computers have grown immensely throughout the world. For this reason, various character string search methods are being developed so that a necessary document can be retrieved by computer from this huge collection of documents.
In these character string search methods it is general practice to prepare an index ahead of time in order to realize fast searches. For example, a well-known method extracts words from the documents and builds an inverted index that associates each word with the names of the documents containing it. This method has the advantages that the inverted index is relatively small, the search is fast, and the index is easy to configure. However, there are languages for which words are difficult to extract. This method also has the disadvantage that a search for a set of multiple words requires matching the word positions within each document. A search for an arbitrary string of characters in a single document is also difficult.
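The inverted index described above can be illustrated with a minimal sketch. The function name `build_inverted_index`, the document names, and the sample texts are hypothetical, and whitespace splitting is only a stand-in for the word extraction step, which is the part that is difficult for some languages.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        # Assumption: words are separated by whitespace. Languages without
        # explicit word boundaries need a real word extraction step here.
        for word in text.lower().split():
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}

# Hypothetical sample documents, for illustration only.
documents = {
    "doc1": "suffix array search",
    "doc2": "inverted index search",
}
index = build_inverted_index(documents)
print(index["search"])  # ['doc1', 'doc2']
print(index["suffix"])  # ['doc1']
```

Because only whole words are keys, a lookup such as `index.get("rray")` finds nothing, which reflects the difficulty of searching for an arbitrary character string with this kind of index.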
For this reason an index called a suffix array, which enables a search for any character string, has been developed. Patent document 1 and Non-Patent document 1 below disclose a suffix array and a search method using that array.
FIG. 1A describes an example of a prior art search method related to the above suffix array. FIG. 1A shows an example of a character string, character string 10, which is the target of a search. Character string 10 consists of the alphabetic characters A, B, C, E, and the separator character $. The character A is located in character positions 1, 4, and 7 of character string 10. The character B is located in character positions 2 and 5 of character string 10. The character C is located in character positions 6 and 8 of character string 10. The character E is located in character position 3 of character string 10. The separator character $ is located in character position 9, which is the tail end of character string 10.
FIG. 1A also depicts the suffixes in character position sequence 20, the suffixes in dictionary sequence 20a, and the suffix array 30, all of which correspond to character string 10. FIG. 1A further depicts dotted-line arrow 81, which shows that the suffixes in character position sequence 20 are those of character string 10, and dotted-line arrow 82, which shows that the suffixes in dictionary sequence 20a are obtained by sorting the suffixes in character position sequence 20 into dictionary sequence.
Character string 10, as shown by the suffixes in character position sequence 20, can be regarded as having 9 suffixes as its partial character strings. The suffixes in character position sequence 20 are arranged in the order of the character position of the leading character of each suffix, and sorting them into dictionary sequence yields the suffixes in dictionary sequence 20a. Storing in an array the character position of the leading character of each suffix, in this dictionary sequence, yields suffix array 30. By means of this suffix array, the leading character position of a partial character string that matches the pattern of the search character string can be obtained from the character string that is the target of the search.
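As a minimal sketch of the construction just described, the following code sorts the suffixes of character string 10 and uses the resulting array for a lookup. It is an illustration under stated assumptions, not the method of the cited references: the string is taken to be ABEABCAC$ (the character at position 3 is assumed to be E, consistent with the listed character set and with search character string 40), the separator $ is assumed to sort before the letters, and the names `build_suffix_array` and `find_positions` are hypothetical. Sorting whole suffixes is simple but not efficient for long texts.

```python
TEXT = "ABEABCAC$"  # character string 10, reconstructed under the assumptions above

def build_suffix_array(text):
    """Return the 1-based start positions of all suffixes in dictionary order."""
    # Python compares by code point, so '$' sorts before the letters 'A'..'E'.
    return sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])

def find_positions(text, suffix_array, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    def prefix(pos):
        # First len(pattern) characters of the suffix starting at pos.
        return text[pos - 1:pos - 1 + len(pattern)]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # lower bound: first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(suffix_array[mid]) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # upper bound: first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(suffix_array[mid]) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

suffix_array = build_suffix_array(TEXT)
print(suffix_array)                               # [9, 4, 1, 7, 5, 2, 8, 6, 3]
print(find_positions(TEXT, suffix_array, "AB"))   # [1, 4]
print(find_positions(TEXT, suffix_array, "ABE"))  # [1]
```

The binary search works because suffixes that share a prefix occupy a contiguous run of the sorted array, so any pattern can be located without scanning the text.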
FIG. 1B conceptually describes a character string search using a compressed suffix array, as an example of a prior art search method. It shows compressed suffix array 50 (a conceptual diagram) in association with search character string 40 and with suffix array 30 described with reference to FIG. 1A. Array element number (i) of compressed suffix array 50 (conceptual diagram) stores the next array element number (Ψ). The next array element number (Ψ) is the array element number of suffix array 30 that stores the character position obtained by adding 1 to the character position stored in array element number (i) of suffix array 30.
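The definition of the next array element number (Ψ) can be sketched as follows. The suffix array values are those obtained for the string ABEABCAC$ with $ sorted first, which is an assumption about the content of FIG. 1A; the name `build_psi` is hypothetical, and representing the element that has no successor by `None` is a choice made for this sketch only.

```python
# Suffix array for "ABEABCAC$" with 1-based character positions (assumed from FIG. 1A).
SUFFIX_ARRAY = [9, 4, 1, 7, 5, 2, 8, 6, 3]

def build_psi(suffix_array):
    """Return a dict mapping array element number i (1-based) to psi(i)."""
    # Inverse lookup: character position -> array element number that stores it.
    element_of = {pos: i for i, pos in enumerate(suffix_array, start=1)}
    psi = {}
    for i, pos in enumerate(suffix_array, start=1):
        # psi(i) is the element that stores position pos + 1. The last position
        # has no successor, so None is used there (an assumption of this sketch).
        psi[i] = element_of.get(pos + 1)
    return psi

psi = build_psi(SUFFIX_ARRAY)
print([psi[i] for i in range(1, 10)])  # [None, 5, 6, 7, 8, 9, 1, 4, 2]
```

With this table, `psi[3]` is 6 and `psi[6]` is 9, so starting at array element number 3 and following Ψ twice visits the array element numbers 3, 6, and 9, each step advancing one character position in the text.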
By changing the content stored in the array from a character position to a next array element number (Ψ), the values stored within each character group, that is, the array elements whose suffixes begin with the same character, are arranged in ascending order, as shown in the drawing. As a result, each array element need not store the actual next array element number (Ψ) itself but can instead store the increment over the value stored in the preceding array element, so the bit width of the stored values can be made smaller and the amount of information can be compressed.
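The idea of storing increments can be sketched as below. This is a simplified illustration, not the encoding used in the cited references: the Ψ values and leading characters are those that follow from the string ABEABCAC$ (an assumption about the figure), the array element for the separator $ is left out, and the names `delta_encode` and `delta_decode` are hypothetical. Practical compressed suffix arrays typically combine such increments with a variable-length code, which is not shown here.

```python
PSI = [5, 6, 7, 8, 9, 1, 4, 2]  # next array element numbers for elements 2..9
FIRST_CHARS = "AAABBCCE"        # leading character of the suffix at each of those elements

def delta_encode(psi, first_chars):
    """Keep the first value of each character group, then store increments."""
    encoded = []
    for k, value in enumerate(psi):
        if k == 0 or first_chars[k] != first_chars[k - 1]:
            encoded.append(value)               # group start: absolute value
        else:
            encoded.append(value - psi[k - 1])  # within a group: small positive increment
    return encoded

def delta_decode(encoded, first_chars):
    """Rebuild the original values by summing increments within each group."""
    decoded = []
    for k, value in enumerate(encoded):
        if k == 0 or first_chars[k] != first_chars[k - 1]:
            decoded.append(value)
        else:
            decoded.append(decoded[-1] + value)
    return decoded

encoded = delta_encode(PSI, FIRST_CHARS)
print(encoded)                             # [5, 1, 1, 8, 1, 1, 3, 2]
print(delta_decode(encoded, FIRST_CHARS))  # [5, 6, 7, 8, 9, 1, 4, 2]
```

Because the values within a group are ascending, every increment is positive and usually small, which is what allows a narrower bit width than storing each next array element number in full.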
Regarding the concept of a search, FIG. 1B shows the search steps by means of dotted-line arrows from each of the characters in search character string 40 to array element numbers (i) of compressed suffix array 50 (conceptual diagram), and by means of arrows between the array element numbers (i) 3, 6, and 9 shown in bold and the next array element numbers (Ψ) 6 and 9 shown in bold. In other words, suppose that array element number 3, for example, is selected from among the array element numbers corresponding to the leading character A of search character string 40. The next array element number stored in array element number 3 is 6, which is an array element number corresponding to the second character B of search character string 40, and the next array element number stored in array element number 6 is 9, which is the array element number corresponding to the third character E of search character string 40. It can therefore be understood that a search of character string 10, the target of the search, using search character string 40 results in a hit.
Patent document 1: JP 3,672,242 B
Non-Patent document 1: Sadakane Kunihiko, “A Note on the Compressed Suffix Arrays”; IEICE technical report, Data engineering; 100 (226), pp. 49-56, 2000/07/19; The Institute of Electronics, Information and Communication Engineers.