A conventional search technique uses a character component table that indicates the correlation between a character and the documents that include the character, and a condensed text file that stores condensed text, obtained by removing ancillary words from a document, in correlation with that document. According to this searching technique, the character component table is referred to; the documents that correspond to the characters included in a search keyword are identified; and based on the result of referring to the character component table, the documents that include the search keyword are identified from the condensed text in the condensed text file (see, e.g., Japanese Patent No. 2986865).
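For illustration only, the two-stage lookup described above may be sketched as follows. The sketch assumes whitespace-separated text, and the ancillary word list, the function names, and the sample documents are hypothetical; none of them is taken from the cited patent.

```python
# Hypothetical ancillary (stop) words; the cited patent does not list them here.
ANCILLARY_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def condense(text):
    """Return the text with ancillary words removed (lower-cased)."""
    return " ".join(w for w in text.lower().split() if w not in ANCILLARY_WORDS)


def build_index(documents):
    """Build the character component table and the condensed text store.

    documents: dict mapping a document id to its text.
    Returns (table, condensed), where table maps each character to the set
    of document ids whose condensed text contains that character.
    """
    condensed = {doc_id: condense(text) for doc_id, text in documents.items()}
    table = {}
    for doc_id, text in condensed.items():
        for ch in set(text):
            table.setdefault(ch, set()).add(doc_id)
    return table, condensed


def search(keyword, table, condensed):
    """Narrow down candidates with the table, then verify in the condensed text."""
    keyword = condense(keyword)
    if not keyword:
        return []
    candidates = set(condensed)
    for ch in set(keyword):
        candidates &= table.get(ch, set())
    # The table only proves that every character occurs somewhere in the
    # document, so the condensed text is scanned to confirm the keyword itself.
    return sorted(d for d in candidates if keyword in condensed[d])


documents = {
    1: "The search of a table",
    2: "A bleat in the dark",
    3: "table lookup",
}
table, condensed = build_index(documents)
matches = search("table", table, condensed)
```

In this sample, document 2 contains every character of "table" and therefore survives the table lookup, but it is rejected when the condensed text is scanned. Such false candidates are the search noise that a single-character table cannot exclude.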
Another disclosed technique involves reading text data as character strings each having a length of “n”; recording, in the entry of a character component table that corresponds to each character string, information indicating that the character string is present; dividing a search term into character strings each having a length of “n”; outputting the documents whose presence information is recorded in all of the entries of the concatenated character component table that correspond to the character strings of the search term; screening the objects to be searched by executing stepwise character component table searches before searching the text itself; and thereby executing a full-text search at high speed (see, e.g., Japanese Patent No. 3263963).
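A minimal sketch of such a screening step is given below. It assumes that the character strings of length “n” overlap, that presence information is one bit per document held in a Python integer, and that a search term shorter than “n” is simply not handled; the function names and sample data are hypothetical.

```python
def ngrams(text, n):
    """Return all overlapping character strings of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_table(docs, n):
    """Map each n-character string to a bitmap; bit i is set if docs[i] contains it."""
    table = {}
    for index, text in enumerate(docs):
        for gram in set(ngrams(text, n)):
            table[gram] = table.get(gram, 0) | (1 << index)
    return table


def candidates(term, docs, table, n):
    """Return indices of documents whose bit is set in every entry for the term."""
    grams = ngrams(term, n)
    if not grams:  # term shorter than n: outside the scope of this sketch
        return []
    mask = (1 << len(docs)) - 1
    for gram in grams:
        mask &= table.get(gram, 0)
    return [i for i in range(len(docs)) if (mask >> i) & 1]


def search(term, docs, table, n):
    """Screen with the table first, then search the text of the survivors only."""
    return [i for i in candidates(term, docs, table, n) if term in docs[i]]


docs = ["full text search", "text mining", "ext sea te"]
table = build_table(docs, 2)
screened = candidates("text", docs, table, 2)
found = search("text", docs, table, 2)
```

Here the third document contains "te", "ex", and "xt" separately, so it passes the screening and is removed only by the final text scan, whereas any document lacking one of these strings would never be scanned at all.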
Another disclosed technique realizes full-text searching at a speed equivalent to that of searching documents written in a language having few types of phonograms, such as English, by means of a concatenated character component table search unit that can fully narrow down candidates from a given search term (see, e.g., Japanese Patent No. 3497243).
Another disclosed technique involves generating, for each document to be registered, a character component table describing the appearance state of each character in the text data; recognizing the document structure according to predetermined document structure names; dividing the text data by structure; for each character that appears, setting “1” at the bit position that corresponds to the document structure in which the character appears; and storing, for each character, a structure bit string describing the document structure positions at which the character appears. For example, when a user designates “critical work” as the character string to be searched for and “name of the invention”, “claims”, or “effect of the invention” as the document structure, a character component table search is executed using “critical work” and documents 1, 7, 15, 38, . . . are obtained as the result; the bit AND of the designated document structure bit string “100100001”, which is based on the designated document structure, and the structure bit string of each retrieved document is taken; and documents 1, 7, 38, . . . are obtained as the search result (see, e.g., Japanese Patent No. 3518933).
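The bit AND step may be sketched as follows. The ordering of the nine document structures, the placeholder structure names, and the structure bit strings assigned to documents 1, 7, 15, and 38 are assumptions chosen only to reproduce the quoted example; the sketch also simplifies by holding one structure bit string per retrieved document rather than one per character.

```python
# Hypothetical ordering of nine document structures. Only the three names
# quoted in the text come from the source; the others are placeholders.
STRUCTURES = [
    "name of the invention", "structure 2", "structure 3", "claims",
    "structure 5", "structure 6", "structure 7", "structure 8",
    "effect of the invention",
]


def structure_mask(names):
    """Build the designated bit string; the leftmost bit is the first structure."""
    mask = 0
    for name in names:
        position = STRUCTURES.index(name)
        mask |= 1 << (len(STRUCTURES) - 1 - position)
    return mask


def filter_by_structure(hits, designated):
    """Keep documents whose structure bit string shares a bit with the mask."""
    return [doc for doc, bits in sorted(hits.items()) if bits & designated]


designated = structure_mask(
    ["name of the invention", "claims", "effect of the invention"]
)

# Assumed structure bit strings for the documents returned by the
# character component table search on the search string.
hits = {
    1: 0b100000000,   # appears in "name of the invention"
    7: 0b000100001,   # appears in "claims" and "effect of the invention"
    15: 0b010010000,  # appears only in structures that were not designated
    38: 0b000000001,  # appears in "effect of the invention"
}
result = filter_by_structure(hits, designated)
```

Under these assumptions, document 15 is dropped because the AND of its structure bit string and the designated bit string is zero.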
However, in the conventional techniques, the character component table is generated for 64,000 types of character codes, each of which is a 16-bit character code, for content constituted of a very large number (for example, 10,000) of document files. When character component tables for two sequential characters, three sequential characters, four sequential characters, . . . are added to that for single characters to reduce search noise, the table size explodes and the file size of the data increases drastically. Therefore, a problem arises in that processing becomes difficult in a resource-saving hardware environment. On the other hand, if the file size is reduced using a hash function, etc., a problem arises in that search noise increases and the search speed drops. A further problem arises in that the processing time required to create the character component tables for two sequential characters, three sequential characters, four sequential characters, . . . increases.
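The scale of this size explosion can be estimated with a short calculation. The estimate assumes an uncompressed table holding one bit for every combination of character string and document, which is a simplification made here for illustration and is not stated in the cited patents; the function name is hypothetical.

```python
def dense_table_bytes(char_types, documents, n):
    """Size in bytes of a dense table with one bit per (n-character string, document)."""
    return char_types ** n * documents // 8


CHAR_TYPES = 64_000   # types of 16-bit character codes, as stated in the text
DOCUMENTS = 10_000    # example number of document files, as stated in the text

single = dense_table_bytes(CHAR_TYPES, DOCUMENTS, 1)   # 80,000,000 bytes
double = dense_table_bytes(CHAR_TYPES, DOCUMENTS, 2)   # 5,120,000,000,000 bytes
triple = dense_table_bytes(CHAR_TYPES, DOCUMENTS, 3)   # 327,680,000,000,000,000 bytes
```

Under this assumption the table for single characters occupies about 80 megabytes, while each additional sequential character multiplies the size by 64,000, so that the table for two sequential characters alone would already require about 5 terabytes.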