With the popularity of the Internet and the rapid growth in the amount of information on networks, people increasingly rely on efficient full-text search engines. A full-text search engine, which takes text as its target, builds an inverted index that maps words to documents for a large number of documents. In response to a user's keyword query for documents (e.g., web pages), the search engine returns the documents that contain the keywords to the user. To cut hardware and software costs, most current search engines simply store the inverted index as a file (referred to as an inverted file) in non-volatile storage (such as a hard disk), and search directly over the on-disk index by accessing the inverted file through file reading/writing operations. China patent application No. 03109847.9 (publication number CN1536509A, publication date Oct. 31, 2004), which is hereby incorporated in full by reference, discloses a method for storing an inverted index, a method for on-line updating of an inverted index, and an inverted index mechanism.
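By way of illustration only, the word-to-document mapping described above may be sketched as follows. The function names `build_inverted_index` and `search`, the whitespace tokenization, and the rule that a document must contain every keyword are assumptions made for this sketch; they are not taken from the cited application or from any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of ids of the documents containing it.

    `docs` is a dict of {doc_id: text}. Tokenization is a plain
    lowercase whitespace split (an assumption of this sketch).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, keywords):
    """Return the ids of documents that contain every keyword (assumed AND query)."""
    postings = [set(index.get(k.lower(), ())) for k in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents used only for illustration.
docs = {
    1: "full text search engine",
    2: "inverted index file",
    3: "search over inverted index",
}
index = build_inverted_index(docs)
```

In this sketch, a query such as `search(index, ["inverted", "index"])` looks up the posting list of each keyword and intersects them, so no document text needs to be scanned at query time.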
There are two key performance indicators for full-text search: Index Building Speed (IBS) and Query Response Time (QRT). Improvement in these two indicators depends largely on the index structure and the corresponding search method used by the search engine. A good index structure should increase IBS and reduce QRT.
A traditional indexing method uses two index files on the disk, namely a primary index file and a secondary index file. When the index on the disk is updated, for example when new documents are indexed into the primary index file, the primary index file needs to be opened and closed frequently, which leads to a low updating speed.
Thus, in normal indexing processes, instead of performing a writing operation on the index file whenever a new document is indexed, indexing operations are first performed in memory and file writing operations are performed in batches. The longer the interval between batches is, the fewer file writing operations are needed, but the more memory space is used. Conversely, the shorter the interval between batches is, the less memory space is occupied; however, file I/O operations become more frequent, resulting in a slower indexing speed.
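The trade-off between batch interval, memory use, and the number of file writing operations may be illustrated with the following minimal sketch. It assumes that the batch interval is measured as a number of documents, and it replaces the actual file write with a caller-supplied `write_batch` callable; the class `BatchedIndexer`, the helper `run`, and the counters are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict


class BatchedIndexer:
    """Buffer postings in memory and write them out once per batch."""

    def __init__(self, batch_size, write_batch):
        self.batch_size = batch_size    # documents per batch (assumed interval unit)
        self.write_batch = write_batch  # stand-in for one file writing operation
        self.buffer = defaultdict(list)
        self.pending_docs = 0
        self.flush_count = 0            # number of simulated file writes
        self.peak_postings = 0          # rough proxy for peak memory use

    def add_document(self, doc_id, text):
        for word in set(text.lower().split()):
            self.buffer[word].append(doc_id)
        self.pending_docs += 1
        buffered = sum(len(p) for p in self.buffer.values())
        self.peak_postings = max(self.peak_postings, buffered)
        if self.pending_docs >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending_docs:
            return
        self.write_batch(dict(self.buffer))
        self.buffer.clear()
        self.pending_docs = 0
        self.flush_count += 1


def run(batch_size, docs):
    """Index `docs` with the given batch size; return (writes, peak buffered postings)."""
    written = []  # in-memory sink used in place of an index file
    indexer = BatchedIndexer(batch_size, written.append)
    for doc_id, text in docs:
        indexer.add_document(doc_id, text)
    indexer.flush()  # write out any remaining partial batch
    return indexer.flush_count, indexer.peak_postings


# Hypothetical sample data: eight documents with two distinct words each.
docs = [(i, "word%d common" % i) for i in range(8)]
small_batches = run(2, docs)
large_batches = run(8, docs)
```

With this sample data, a batch size of 2 yields more writes and a smaller in-memory buffer than a batch size of 8, which mirrors the tension described above.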
The prior art lacks an effective technique for efficiently controlling the number of such file writing operations.