The present invention generally relates to a document data processing system, and particularly to a full document retrieval system, also known as a full text search system, for searching and retrieving the full text of a document from a document database on the basis of a designated character string. More particularly, the present invention is concerned with a document retrieval method and system capable of significantly speeding up full text retrieval processing by using an auxiliary file for the search processing.
In the document registration/retrieval systems known heretofore, a scheme is generally adopted in which a word or term (referred to as a keyword) representing the content of a document to be registered is used as an index. According to this method, however, it is necessary to have an expert called an "indexer" read thoroughly every document to be registered and assign pertinent keywords to the documents on the basis of his or her understanding of the contents thereof. As an attempt to avoid such troublesome and time-consuming work for document registration, there has been proposed a method according to which the words or terms occurring in the text of a document are all registered as keywords in an index file, as is disclosed, for example, in JP-A-63-198124.
However, the method mentioned above still suffers from a drawback in that difficulty is encountered in determining the minimum unit of a semantically meaningful word or term upon preparation or creation of the index file. Besides, due to possible deficiencies in the word dictionary and/or grammatical rules, analysis of sentences often fails, presenting a problem that even an important word cannot be extracted as a keyword.
As an approach to solving the above problem, there has already been proposed a full document retrieval system, also referred to as a full text search system, in which documents are loaded directly into a database by a computer as texts composed of coded characters upon document registration, while upon retrieval of a document, the contents of all the documents stored in the database are read to thereby retrieve the document containing a given or designated keyword (hereinafter referred to as a "search term" to distinguish it from the authorized or controlled keyword used in conjunction with the conventional system), as is disclosed, for example, in an article entitled "Text Database Manage System SIGMA and Applications" contained in "Study Reports of The Information Processing Society of Japan: Informatics Fundamentals 14-7", Vol. 89, No. 66 (Jul. 27, 1989). This full text search system features, among others, character-by-character scanning of a whole text file from the beginning, as is described in the preamble of the second section of the abovementioned article. By virtue of this feature, it is possible to search or retrieve a document from the database by using the text body as a clue, even in the case where no index file containing document identifiers corresponding to the keywords is available. In other words, by conducting a character-string based search of all the text data with the aid of a given search term, only the documents in which the search term is described or contained are outputted as the result of the retrieval.
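The character-by-character scanning described above can be illustrated by the following minimal sketch. It is not the implementation of the cited SIGMA system; the function name `scan_documents` and the in-memory dictionary standing in for the text file are assumptions made purely for illustration.

```python
def scan_documents(documents, search_term):
    """Return the identifiers of documents whose text contains search_term.

    `documents` is a hypothetical mapping of document identifier -> text body.
    No index file is consulted; every text is scanned from its beginning.
    """
    hits = []
    m = len(search_term)
    for doc_id, text in documents.items():
        # Slide over the text one character position at a time.
        for start in range(len(text) - m + 1):
            if text[start:start + m] == search_term:
                hits.append(doc_id)
                break  # one occurrence is enough to retrieve the document
    return hits


documents = {
    1: "full text search system",
    2: "keyword index file",
    3: "text database",
}
print(scan_documents(documents, "text"))  # [1, 3]
```

Because every character of every stored text is visited, the cost of such a search grows in proportion to the total size of the database.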
This full document or text retrieval system takes, however, a lot of time for the search processing because the whole text file has to be scanned from the beginning on a character-by-character basis, incurring a problem that the full text search cannot practically be applied to a large scale database. As also stated in the second section of the abovementioned article, the full text search system under consideration can realize a search processing speed (rate) only on the order of 2 MB/sec., even when a general-purpose large scale computer is used. Of course, a processing speed on this order can afford a practically admissible search time so far as the capacity of the database is several megabytes or so. In reality, however, a database used in practice for business purposes or the like usually demands a capacity of several hundred megabytes or so. In that case, the full text search system mentioned above is not in a position to assure a satisfactory response time for the document search.
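The effect of the 2 MB/sec. scanning rate on response time can be estimated with simple arithmetic, sketched below. The database sizes of 5 MB and 500 MB are illustrative assumptions standing in for "several megabytes" and "several hundred megabytes"; the function name `scan_seconds` is likewise hypothetical.

```python
def scan_seconds(database_megabytes, rate_mb_per_sec=2.0):
    """Time to scan the whole text file once at the given rate."""
    return database_megabytes / rate_mb_per_sec


# Assumed example sizes; only the 2 MB/sec. rate comes from the text.
print(scan_seconds(5))    # 2.5 seconds for a database of several megabytes
print(scan_seconds(500))  # 250.0 seconds for several hundred megabytes
```

Under these assumptions a single search of the larger database takes on the order of several minutes, which is why the response time is regarded as unsatisfactory.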
In an effort to cope with the difficulties mentioned above, the inventors of the present application have already proposed an information retrieval system in which the reading of text data as well as the search processing effected by using a search term are speeded up by providing hardware dedicated thereto, while performing, prior to a text body search, a presearch, so to say, on an auxiliary file in which the text data are previously stored in a compressed state, to thereby screen or sift the documents to undergo the text body search, with a view to realizing the full text search at an equivalently increased speed. In this conjunction, reference may be made to PCT/JP/90/00774, U.S. patent application Ser. No. 555,483, now U.S. Pat. No. 5,168,533 and WO/90/16036. More specifically, this information retrieval system features presearch procedures referred to as a component character table search and a condensed text search, respectively, wherein the documents to be subjected to the text body search are screened (i.e. reduced in number) hierarchically, so to say, by executing stepwise the component character table search and the condensed text search. In other words, through the document screening or narrowing-down preprocessing, the number of the documents to be subjected to the text body search, the time for which occupies a greater proportion of the whole search time, can be decreased, which in turn means that the time taken for the search or retrieval processing as a whole can correspondingly be shortened, whereby the full text search can be realized at an equivalently increased speed.
According to the abovementioned hierarchical presearch characterizing the system proposed by the inventors, the number of the documents is decreased first through the character-based search performed by consulting the component character table, which is then followed by a second reduction in the number of documents through the word- or term-based search performed by using the condensed text on the documents remaining after the character-based search. In connection with the capacity of the database, it is to be mentioned that storage of a condensed text requires about 30% of the capacity for storing a text, while the component character table requires 256 bytes per document.
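The stepwise narrowing described above can be sketched as follows. This is only an illustration under stated assumptions and not the dedicated-hardware system of the cited applications: the component character table is modeled as a 256-byte bitmap with a hypothetical character-to-bit mapping, and the condensed text is modeled as the distinct whitespace-delimited words of the text, since the actual table layout and condensation method are not described here.

```python
TABLE_BYTES = 256  # 256 bytes per document, as stated in the text


def _slot(ch):
    # Assumption: hypothetical mapping of a character to one of 2048 bits.
    return ord(ch) % (TABLE_BYTES * 8)


def build_char_table(text):
    """Bitmap recording which characters occur in the document."""
    table = bytearray(TABLE_BYTES)
    for ch in text:
        slot = _slot(ch)
        table[slot // 8] |= 1 << (slot % 8)
    return bytes(table)


def table_has_all(table, term):
    """True if every character of the search term is marked in the table."""
    for ch in term:
        slot = _slot(ch)
        if not table[slot // 8] & (1 << (slot % 8)):
            return False
    return True


def build_condensed_text(text):
    # Assumption: the condensed text keeps each distinct word once.
    return " ".join(sorted(set(text.split())))


def hierarchical_search(documents, term):
    """Return the documents surviving each of the three stages."""
    # In practice the auxiliary data would be built at registration time.
    tables = {d: build_char_table(t) for d, t in documents.items()}
    condensed = {d: build_condensed_text(t) for d, t in documents.items()}
    stage1 = [d for d in documents if table_has_all(tables[d], term)]
    stage2 = [d for d in stage1
              if all(w in condensed[d] for w in term.split())]
    final = [d for d in stage2 if term in documents[d]]  # text body search
    return stage1, stage2, final


documents = {
    1: "full text search system",
    2: "search the text file fully",
    3: "keyword index",
    4: "exact charts",
}
print(hierarchical_search(documents, "text search"))
# ([1, 2, 4], [1, 2], [1])
```

In this example the text body search, which is the most expensive step, is applied to two documents instead of four. Neither presearch stage discards a document that actually contains the search term; each stage only removes documents that cannot contain it.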
In the information retrieval system mentioned above, however, no consideration is paid to the sentences or words in which the characters contained in the component character table are used, because the document screening or reduction depends solely on whether or not each character constituting a part of the search term exists in the component character table. As a consequence, for an input search term composed of characters which appear in the text at a high frequency, the component character table search cannot afford a sufficiently high screening ratio for reduction of the documents. In that case, the number of the documents to be subjected to the text body search is not diminished to such an extent as to assure a sufficiently fast retrieval response.
As another approach for speeding up the full text search, there can be mentioned a method disclosed in an article entitled "Method of Speeding-Up Katakana Character Search in Full Document Retrieval By Using Character String Matching" contained in "Study Reports of The Information Processing Society of Japan: Database System 83-1" Vol. 91, No. 46 (May 24, 1991). According to this known method, positional information of all the characters appearing in a document is stored as indexes on a character-by-character basis, and a document in which all the characters constituting a designated or inputted search term appear in succession is sought by reference to the indexes. This method requires, however, as much as about 40 KB for the indexes of a document containing ten thousand characters, by way of example, on the assumption that positional information of four bytes is stored for each character. Accordingly, an attempt to structure a text database containing one hundred thousand or so such documents will require a storage capacity of 4 GB for the indexes in addition to 2 GB for the storage of the documents themselves. It can therefore by no means be said that such an attempt is practical, in view of the enormous capacity demanded for the index storage.
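The character position index and the storage estimate given above can be sketched as follows. This is a minimal illustration and not the method of the cited article; the function names are hypothetical, and the figure of two bytes per stored character is an assumption inferred from the 2 GB document storage quoted in the text.

```python
from collections import defaultdict


def build_position_index(text):
    """Map each character to the list of positions at which it occurs."""
    index = defaultdict(list)
    for pos, ch in enumerate(text):
        index[ch].append(pos)
    return index


def contains_term(index, term):
    """True if the characters of term occur at successive positions."""
    if not term:
        return True
    positions = [set(index.get(ch, ())) for ch in term]
    return any(
        all(start + k in positions[k] for k in range(1, len(term)))
        for start in positions[0]
    )


def index_storage_bytes(chars_per_doc, bytes_per_position=4):
    """Index size for one document: one position entry per character."""
    return chars_per_doc * bytes_per_position


index = build_position_index("full text search")
print(contains_term(index, "text"))  # True
print(contains_term(index, "txet"))  # False

per_doc = index_storage_bytes(10_000)   # 40,000 bytes, i.e. about 40 KB
print(per_doc * 100_000)                # 4,000,000,000 bytes, i.e. 4 GB
# Assumption: two bytes per character of document text.
print(10_000 * 2 * 100_000)             # 2,000,000,000 bytes, i.e. 2 GB
```

Under these assumptions the indexes occupy twice the capacity of the documents themselves, which is the source of the impracticality noted above.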