The present invention relates to a text index registering and retrieving method for retrieving registered text data, and to a recording medium storing a program for executing the method. In particular, the present invention relates to a technique effectively applicable to a text index registering and retrieving method in which a desired document is acquired by designating a query term (search term) consisting of a keyword character string and thereby performing a full-text search of a document database.
Various conventional document retrieving methods have been proposed for searching a document database storing a large number of documents. One of the conventional methods is disclosed in JP-A-8-190571, in which a full text search is carried out efficiently using plural-character information so as to reduce retrieval noise, shorten the processing time and, at the same time, reduce the number of accesses to the storage disks of the database.
A document retrieving method will be described below specifically taking the Japanese language as an example.
Briefly, the above-mentioned method comprises the steps of storing, as a text index, a plural-character occurrence file that records the existence of pluralities of characters in the text data of each document, and of referencing the plural-character occurrence file to determine, as a candidate of retrieval, each document containing the plural characters included in the query term of a designated conditional formula. The plural-character store compression step includes the substeps of defining the number of types (type number) of the plural-character components appearing in the text data and the number of documents (document number) containing each of the plural-character components; registering a bit string having "1" at the position corresponding to the document number of each such document in the case where the total number of documents containing the plural-character component is larger than a predetermined threshold value; and storing the document number of each such document as binary data in the case where the total number of documents is smaller than the threshold value. The plural-character components are called n-grams.
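The threshold rule described above can be illustrated by the following minimal sketch. It is not the implementation of JP-A-8-190571; the names `ngrams`, `build_occurrence_file` and `documents_for`, the use of bigrams (n = 2), the use of a Python integer as the bit string, and the choice to store a list when the count equals the threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple, Union

Entry = Tuple[str, Union[int, List[int]]]


def ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the unique plural-character components (n-grams) of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_occurrence_file(docs: Dict[int, str], threshold: int,
                          n: int = 2) -> Dict[str, Entry]:
    """Build a hypothetical plural-character occurrence file.

    An n-gram contained in more than `threshold` documents is stored as
    a bit string (bit d set means document number d contains it).
    Otherwise the document numbers themselves are stored. The case of a
    count equal to the threshold is not specified in the text; this
    sketch stores a list in that case (assumption).
    """
    postings: Dict[str, Set[int]] = {}
    for doc_no, text in docs.items():
        for gram in ngrams(text, n):
            postings.setdefault(gram, set()).add(doc_no)

    index: Dict[str, Entry] = {}
    for gram, doc_nos in postings.items():
        if len(doc_nos) > threshold:
            bits = 0
            for d in doc_nos:
                bits |= 1 << d
            index[gram] = ("bits", bits)
        else:
            index[gram] = ("list", sorted(doc_nos))
    return index


def documents_for(entry: Entry) -> List[int]:
    """Decode either storage form back into sorted document numbers."""
    kind, payload = entry
    if kind == "list":
        return list(payload)
    return [d for d in range(payload.bit_length()) if (payload >> d) & 1]


docs = {1: "東京都庁", 2: "京都大学", 3: "東京大学"}
index = build_occurrence_file(docs, threshold=1)
```

With a threshold of 1, an n-gram found in two documents, such as "東京", is held as a bit string, while an n-gram found in one document, such as "都庁", is held as a list of document numbers.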
In this method, the document numbers for all the plural-character components in each document are registered in a plural-character occurrence file serving as a text index, and, at the time of retrieval, a document containing all the plural-character components included in the query term is retrieved with reference to the plural-character occurrence file. The plural-character occurrence file (table) used in this case contains, for each plural-character component, a list of document numbers, i.e., identifiers of the documents containing that plural-character component.
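Retrieval against such a table can be sketched as an intersection of the document-number lists of the n-grams in the query term. The sketch below is illustrative only; the names `ngram_list`, `build_index` and `search`, the use of bigrams, and the handling of a query shorter than n characters (an empty result) are assumptions.

```python
from typing import Dict, List, Optional, Set


def ngram_list(text: str, n: int = 2) -> List[str]:
    """Return the n-grams of text in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs: Dict[int, str], n: int = 2) -> Dict[str, Set[int]]:
    """Map each n-gram to the set of document numbers containing it."""
    index: Dict[str, Set[int]] = {}
    for doc_no, text in docs.items():
        for gram in ngram_list(text, n):
            index.setdefault(gram, set()).add(doc_no)
    return index


def search(index: Dict[str, Set[int]], query: str, n: int = 2) -> List[int]:
    """Return candidate documents containing every n-gram of the query."""
    grams = ngram_list(query, n)
    if not grams:
        return []  # query shorter than n characters (assumption)
    candidates: Optional[Set[int]] = None
    for gram in grams:
        posting = index.get(gram, set())
        candidates = set(posting) if candidates is None else candidates & posting
        if not candidates:
            return []  # some n-gram occurs in no remaining document
    return sorted(candidates)


docs = {1: "東京都庁", 2: "京都大学", 3: "東京大学"}
index = build_index(docs)
hits = search(index, "京都大")
```

The result is a set of candidates: a document that contains every n-gram of the query term does not necessarily contain them adjacently, so a confirmation step on the text itself may still be needed.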
According to the above-mentioned conventional document retrieving method, the document numbers for all the plural-character components in each document are registered in the plural-character occurrence file. A text index for a database storing a great number of documents, however, includes a vast number of plural characters appearing in the documents, and the storage size of the text index becomes large. Moreover, registering a document requires as many file accesses as there are unique types of plural characters contained in that document. The resulting problem is that a very long processing time is required for registering a new document, or for replacing or deleting a document registered in a large database. Fine searching algorithms, which use information on the positions of the plural characters in the document, require still more index space, so that the index becomes much larger than the original text.
In deleting a specified document from the database, for example, all the document numbers registered in the plural-character occurrence file for the related character components must be deleted. A text index for a large database, however, may have a plural-character occurrence file with a capacity on the order of gigabytes. It is practically impossible to update a database of this size on line.
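The cost of deletion can be made concrete with a small sketch in which each entry of an in-memory dictionary stands in for one record of the plural-character occurrence file. The names `build_index` and `delete_document` and the use of bigrams are assumptions; the point illustrated is only that one entry must be touched per unique n-gram of the deleted document.

```python
from typing import Dict, Set


def unique_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the unique n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs: Dict[int, str], n: int = 2) -> Dict[str, Set[int]]:
    """Map each n-gram to the set of document numbers containing it."""
    index: Dict[str, Set[int]] = {}
    for doc_no, text in docs.items():
        for gram in unique_ngrams(text, n):
            index.setdefault(gram, set()).add(doc_no)
    return index


def delete_document(index: Dict[str, Set[int]], doc_no: int, text: str,
                    n: int = 2) -> int:
    """Remove doc_no from every entry it appears in.

    Returns the number of entries touched, which stands in for the
    number of file accesses on a disk-resident occurrence file.
    """
    touched = 0
    for gram in unique_ngrams(text, n):
        posting = index.get(gram)
        if posting is None:
            continue
        posting.discard(doc_no)
        touched += 1
        if not posting:
            del index[gram]  # no document contains this n-gram any more
    return touched


docs = {1: "東京都庁", 2: "京都大学", 3: "東京大学"}
index = build_index(docs)
accesses = delete_document(index, 1, docs[1])
```

Even for this four-character document, three entries are rewritten; for a long document the count grows with the number of unique n-grams it contains.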
The foregoing description concerns a document retrieving method using plural-character components of the Japanese language. For English, a method is known that uses words rather than artificial plural-character components as in Japanese. In that method too, maintenance of the indexes is as burdensome as for plural-character components when a large number of documents are registered (refer to Textbase, Open Information Systems January 1996, V11 n1). Another widely known example is the troublesome work of updating the indexes of web search engines for home pages on the Internet. Also, as an example of indexes for texts in a database, the "IBM DB2 Text Extender Administration and Programming Guide" discloses an option, designated at the time of defining a text index, by which the index is not updated immediately at the time of text registration. Similarly, Oracle's "Developing Application with the Oracle ConText Option" discloses a deferred index update mechanism. These conventional methods, however, disclose no means for solving the problem that the accuracy of retrieval using an index is adversely affected by the delay in updating the index. Specifically, these well-known techniques are intended only to provide means for avoiding index updating during periods busy with retrieval by delaying the updating, and not to provide any means for guaranteeing retrieval accuracy such that even a document that has just been registered can be retrieved.
A possible means against this problem is to store a second text index in main memory and to perform all updating on the second index, thereby eliminating the input/output time, in view of the fact that most of the time required for index updating is consumed by input and output with external storage such as a magnetic disk. A succession of a large number of updates, however, enlarges the second index until it can no longer reside in main memory. Updating and searching the second index then also require input/output operations, which considerably lowers the performance.
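The arrangement with a second index, and the way it outgrows main memory, can be sketched as follows. This is an illustration under stated assumptions, not a disclosed implementation: the class `TwoTierIndex`, the `memory_budget` parameter (counted here in document-number entries rather than bytes), and the use of a dictionary as a stand-in for the disk-resident main index are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def unique_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the unique n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class TwoTierIndex:
    """Main index (stand-in for disk) plus a second, in-memory index."""

    def __init__(self, main: Dict[str, Set[int]], memory_budget: int) -> None:
        self.main = main                      # not modified by register()
        self.second: Dict[str, Set[int]] = {}
        self.memory_budget = memory_budget    # hypothetical entry limit

    def register(self, doc_no: int, text: str, n: int = 2) -> None:
        """Record a new document in the second index only."""
        for gram in unique_ngrams(text, n):
            self.second.setdefault(gram, set()).add(doc_no)

    def second_size(self) -> int:
        """Count document-number entries held in the second index."""
        return sum(len(p) for p in self.second.values())

    def second_fits_in_memory(self) -> bool:
        return self.second_size() <= self.memory_budget

    def search(self, query: str, n: int = 2) -> List[int]:
        """Intersect per-n-gram results drawn from both indexes."""
        grams = unique_ngrams(query, n)
        if not grams:
            return []
        result: Optional[Set[int]] = None
        for gram in grams:
            posting = self.main.get(gram, set()) | self.second.get(gram, set())
            result = posting if result is None else result & posting
        return sorted(result)


main = {"東京": {1}, "京都": {1}, "都庁": {1}}
idx = TwoTierIndex(main, memory_budget=4)
idx.register(2, "京都大学")
fits_after_one = idx.second_fits_in_memory()
idx.register(3, "東京大学")
fits_after_two = idx.second_fits_in_memory()
```

A newly registered document is retrievable at once because every search consults both indexes, but each registration adds entries to the second index, and after enough registrations it exceeds whatever budget main memory allows.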