1. Field of the Invention
The present invention relates to a document management method for effectively managing registered documents so that a large number of documents saved in a storage can be searched for a document matching a retrieval keyword, to a document search method for searching for such a document, and to a document management system for effectively managing documents.
2. Description of the Related Art
When document data matching a search keyword is retrieved from a large set of document data saved in a database, a known way to speed up retrieval is to build an index at the time the document data is saved in a storage. One known method indexes the document data in units of N consecutive characters. This is referred to as an N-Gram index system. N is an integer greater than 1; for a Japanese document it is conventional to clip Grams in units of N=2 (Bi-Gram), and for an English document it is common to clip Grams in units of N=3 or more. In the case of N=2, for example, a character string beginning with “XML” is clipped as “XM”, “ML”, and so on, with each Gram shifted by one character from the preceding one. When the set of document data is searched, the Grams clipped from the retrieval keyword are used as index keys.
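The clipping and indexing described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular system; the function names `clip_grams` and `build_index` and the sample documents are assumptions introduced for the example.

```python
from collections import defaultdict

def clip_grams(text, n=2):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(documents, n=2):
    """Map each Gram to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in clip_grams(text, n):
            index[gram].add(doc_id)
    return index

# Hypothetical sample data, used only for illustration.
documents = {1: "XML database", 2: "HTML page"}
index = build_index(documents)

print(clip_grams("XML", 2))  # ['XM', 'ML']
print(sorted(index["ML"]))   # [1, 2]
```

A keyword shorter than N yields no Grams in this sketch, so such a keyword would need separate handling.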
The N-Gram index system does not need a language-dependent dictionary and therefore lends itself to multilingual applications. It is used in particular for Japanese and Chinese, which have no word delimiter such as a blank. If searching is done with each Gram combined with an offset (the occurrence position of the Gram in the document data), search loss can be reduced.
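As an illustration of combining Grams with offsets, the sketch below stores the occurrence positions of each Gram and accepts a document only when the Grams of the keyword occur at consecutive offsets. The names `build_positional_index` and `search` and the sample documents are assumptions made for the example; the layout of an actual index may differ.

```python
from collections import defaultdict

def build_positional_index(documents, n=2):
    """Map each Gram to {document id: [offsets at which the Gram occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]][doc_id].append(offset)
    return index

def search(index, keyword, n=2):
    """Return ids of documents where the keyword's Grams occur at consecutive offsets."""
    grams = [keyword[i:i + n] for i in range(len(keyword) - n + 1)]
    if not grams:
        return set()  # keyword shorter than n: no Gram to look up
    hits = set()
    for doc_id, starts in index.get(grams[0], {}).items():
        for start in starts:
            # The k-th Gram of the keyword must occur at offset start + k.
            if all(start + k in index.get(gram, {}).get(doc_id, ())
                   for k, gram in enumerate(grams)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical sample data: both documents contain the Grams "XM" and "ML".
documents = {1: "XML data", 2: "XM ML"}
index = build_positional_index(documents)
print(search(index, "XML"))  # {1}
```

Document 2 contains both Grams of the keyword but not at adjacent positions, so collating the offsets rejects it.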
Despite these merits, the N-Gram index system has a trade-off with respect to the size of the Gram (the size of N). If N increases, the candidates of document data corresponding to a Gram used as an index are narrowed down, so that retrieval speed improves; however, the Gram information region (the region of the storage that holds information on the Grams) increases exponentially. In contrast, if N decreases, the number of candidates of document data corresponding to each Gram increases. As a result, the number of times positions must be collated increases, so that the search time increases. Further, as N increases, the number of kinds of indexes (Gram classes) increases. When an index is extracted from, for example, a Japanese document with N=2, more than 3 Mbytes of Gram classes occur. Accordingly, when N exceeds 2, it is clear that the index data size increases further.
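The trade-off can be observed on a toy example. The sketch below counts, for several values of N, the number of Gram classes and the average number of candidate documents per Gram. The function name `gram_stats` and the three sample strings are assumptions chosen only to make the tendency visible; real document sets behave less regularly.

```python
from collections import defaultdict

def gram_stats(documents, n):
    """Return (number of Gram classes, average candidate documents per Gram)."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for i in range(len(text) - n + 1):
            postings[text[i:i + n]].add(doc_id)
    classes = len(postings)
    if classes == 0:
        return 0, 0.0
    average = sum(len(ids) for ids in postings.values()) / classes
    return classes, average

# Hypothetical sample data over a two-character alphabet.
documents = ["abab", "abba", "baab"]
for n in (1, 2, 3):
    print(n, gram_stats(documents, n))
# 1 (2, 3.0)
# 2 (4, 2.0)
# 3 (6, 1.0)
```

As N grows, each Gram points to fewer candidate documents, while the number of Gram classes to be stored grows. For an alphabet of A distinct characters there can be up to A to the power N classes, which is why the growth becomes severe for languages with large character sets.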
With respect to this trade-off on the size of N, Japanese Patent Laid-Open No. 2000-57151 provides a method of increasing the size of N in order to increase search speed while keeping the increase in index data size to a minimum. Specifically, the position information of text data having a positional relation as a substring of a retrieval term is extracted by an index corresponding to that substring, and the size of the index corresponding to the substring of the text data is compared with a predetermined reference index size. When the index is larger than the reference index size, it is determined whether the substring corresponding to the index is highly likely to be searched for. If so, an extension character string obtained by adding a character string to the substring is made, together with an index corresponding to the extension character string.
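The general idea of extending only oversized index entries can be sketched loosely as follows. This is not the procedure of the cited publication: the index size is assumed here to be the number of stored positions, the extension is assumed to be a single following character, and the likelihood test is a hypothetical callback `is_likely_query` supplied by the caller. All function names are likewise assumptions.

```python
from collections import defaultdict

def build_positions(text, n=2):
    """Map each Gram to the list of offsets at which it occurs in text."""
    index = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append(i)
    return index

def extend_large_entries(text, index, reference_size, is_likely_query):
    """Build entries for extended strings of Grams whose position lists are too long."""
    extended = defaultdict(list)
    for gram, positions in index.items():
        # Assumption: "index size" is measured as the number of positions.
        if len(positions) > reference_size and is_likely_query(gram):
            for pos in positions:
                longer = text[pos:pos + len(gram) + 1]
                if len(longer) == len(gram) + 1:  # skip a Gram at the end of text
                    extended[longer].append(pos)
    return extended

# Hypothetical sample data; every Gram is treated as likely to be searched for.
text = "abababcab"
index = build_positions(text)
extended = extend_large_entries(text, index, reference_size=2,
                                is_likely_query=lambda gram: True)
print(dict(extended))  # {'aba': [0, 2], 'abc': [4]}
```

Only "ab", whose four positions exceed the reference size of 2, is extended. The outcome depends entirely on the chosen reference size and on the likelihood test.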
According to Japanese Patent Laid-Open No. 2000-57151, if the size of N is increased, the number of Gram classes may be decreased when a long search keyword is given. However, it is difficult to set a precise criterion for determining whether the character string corresponding to an index is highly likely to be searched for, and thereby to increase the size of N effectively. Accordingly, there is a limit to how far the times for registering and retrieving a document can be shortened.
An object of the present invention is to provide a document management method capable of shortening the times for registering and searching for a document while using an N-Gram index system, a document retrieval method using the same, and a document management system therefor.