1. Field of the Invention
The present invention relates to a method for creating an index fast for a full-text search through documents and a method for implementing fast searches by use of the index, and more particularly to the structure of the index.
2. Description of the Prior Art
A method using a data structure called a signature file is available for a full-text search through large volumes of documents. According to the method disclosed in Japanese Published Unexamined Patent Application No. Hei 7-244671, an index is formed to represent the occurrence of characters in documents by bits. This method allows relatively fast searches independently of the number of stored documents.
However, since one bit is assigned to several different words, there is a possibility that a document containing words not specified is searched, posing a problem that a correct search cannot be performed. Also, the complex algorithm of index creation and search has made it difficult to implement fast full-text searches on existing database management systems.
As a solution to such a problem, in the literature xe2x80x9cCompression and Fast Indexing for Multi-Gigabyte Text Databasexe2x80x9d, a method is proposed that implements the functions of a fast full-text search by using indexing methods such as a hash table and B+ tree, as offered by general database management systems. According to this method, an identification number is assigned to a word used as a key and a document used as a value of an index, respectively, and they are compressed before being stored.
This reduces the number of disk pages to be read that are required for searches, enabling fast searches. Also, since different identification numbers are assigned to different words, searches can be performed correctly. This literature shows that about seven hundred thousands of documents can be searched fast.
However, the method described in the above-mentioned literature has a problem that high performance cannot be obtained because the performance of index creation and update processing is not taken into account. Particularly, confirmation processing for avoiding the duplicate registration of an identical document for a certain word during updating cannot be implemented efficiently because it must be performed by repetitive processing on a collection of document identification numbers.
Also, when there are several millions of target documents as in a full-text search through Internet""s WWW pages, addition to the index and search processing cannot be performed efficiently even by the method described in the above-mentioned literature because a B+ tree becomes as large as tens of gigabytes.
There are two reasons for this. First, in order to add or search for a set of a word and a document with respect to the B+ tree, a hard disk must be accessed as many times as the height of the tree plus one. Second, if the B+ tree becomes huge, the effect of caching disk contents to a memory would not be obtained and almost all accesses to the hard disk would involve actual reading of data from the hard disk.
For example, if the overall size of a B+ tree is 10 GB, the size of each node of the B+ tree is 8 KB, and the number of branches of each node is 500, then the height (=log 500 (10 GB/8 KB)) of the B+ tree minus 1 is 2.16, indicating that the disk must be accessed two to three times on the average to add or search for a set of a word and a document. Since it takes several tens of milliseconds to several hundreds of milliseconds to access a hard disk once, it would take from 0.1 to 1 second to add or search for a set of a word and a document. Accordingly, if 100 different words are contained in one document, 10 to 100 seconds would be required to register one document.
For these reasons, as the number of data items stored in a B+ tree increases, the B+ tree becomes taller and larger and the number of pages to be accessed increases, so that time required for storage and search becomes longer. For example, as shown in FIG. 17, which shows a trend of change of storage time, storage time increases significantly as the number of documents stored increases.
The present invention has been made in consideration of the above-mentioned conventional circumstances, and it is an object of the present invention to offer a method for creating fast an index for a full-text search through a huge number of documents, for example.
Another object of the present invention is to offer a method for implementing fast searches by using an index thus created.
Specifically, the present invention uses B+ tree index keys with a word identification number followed by a document identification number, whereby the occurrence of a certain word in a given document is made possible by one search through the B+ tree index.
In this case, however, if the index is to be managed by a single B+ tree, the B+ tree would become huge when large volumes of documents are stored, so that, in an attempt to add one document, as many pages as different words contained in the document might have to be updated in the worst case.
To avoid this, the B+ tree index is split to a plurality of subindexes by hash values obtained by applying a certain hash function to word identification numbers and hash values obtained by applying another hash function to document identification numbers and the subindexes are placed in a two-dimensional array. By collectively performing registration for the same hash values of document identification numbers during creation or updating of an index, the number of pages to write to has been reduced and processing efficiency has been improved.
Also, an AND or OR search of a plurality of words is performed in such a way that groups are formed by the hash values of word identification numbers and the B+ trees grouped by the hash values of document identification numbers are collectively searched on a group basis, whereby a hit ratio of page cache during page reading has been improved, thereby achieving higher processing efficiency.
That is, this invention provides a method for creating an index in which keys (e.g., words) and values (e.g., one document containing the words) are associated to search for a value from a specified key, wherein the index to register sets of keys and values are constituted, for example, of a plurality of subindexes of B+ tree structure, each of which is stored in a two-dimensional array position referenced by a value determined by applying a predetermined function to a value to be registered and a value determined by applying a predetermined function to a key.
More specifically, the present invention assigns a document identification number and a word identification number to a document and a word, respectively, to uniquely identify them, provides, as a function to apply to documents, a hash function that maps a document identification number to a value indicating the position of one direction of a two-dimensional array, and as a function to apply to words, a hash function that maps a word identification number to a value indicating the position of another direction of the two-dimensional array, and registers the occurrence of a word in a document in a corresponding subindex by using values obtained by applying the hash functions to the document identification number and the word identification number, respectively.
Also, the present invention assigns a word identification number to a word to uniquely identify it, provides, as a function to apply to the occurrence of word, a hash function that maps the occurrence count or frequency of the word in a document to a value indicating the position of one direction of a two-dimensional array, and as a function to apply to words, a hash function that maps a word identification number to a value indicating the position of another direction of the two-dimensional array, and registers the occurrence of a word in a document in a corresponding subindex by using values obtained by applying the hash functions to the occurrence count of the word and the word identification number, respectively.
In the above registration, when the occurrences of words in a plurality of documents are registered as a single unit, groups are formed by the values determined by applying a hash function to the document identification numbers (or the occurrence count or frequency of the words) of the documents, and the occurrences of words are registered on a group basis.
Furthermore, in the above registration, when the occurrences of all words in documents collected into one group are registered, the occurrences of words are grouped by the values obtained by applying a hash function to the words so that they are registered on a group basis.
In the above registration, a page cache capable of storing at least one subindex provided in the main storage is used.
The present invention provides a search method for obtaining a corresponding document name from a word serving as a key by using an index with document names associated with words contained in the documents, wherein document identification numbers and word identification numbers are assigned to document names and words, respectively, to uniquely identify them, and a value with a document identification number concatenated with a word identification number is used as the key.
More specifically, the present invention provides a search method which constitutes an index with document names associated with words contained in the documents by a plurality of subindexes, assigns document identification numbers and word identification numbers to document names and words, respectively, to uniquely identify them, and obtains a corresponding document name from a word serving as a key by using an index registered in a subindex in a two-dimensional array position referenced by a value determined by applying a hash function to a document identification number and a word identification number, wherein a search by a plurality of words is performed in such a way that groups are formed by the values obtained by applying a hash function to the identification numbers of the words and the subindexes are searched on a group basis.
Also, the present invention provides a search method which constitutes an index with document names associated with words contained in the documents by a plurality of subindexes, assigns document identification numbers and word identification numbers to document names and words, respectively, to uniquely identify them, and obtains a corresponding document name from a word serving as a key by using an index registered in a subindex in a two-dimensional array position referenced by a value determined by applying hash functions to document identification numbers and word identification numbers, wherein an AND or OR search of a plurality of words is performed in such a way that subindexes for documents grouped by the values obtained by applying a hash function to document identification numbers are searched for the occurrence of each word and the AND or OR operation on the search results is performed.