The ability to locate relevant documents in a large pool of documents is becoming increasingly desirable. Programs that provide this capability are commonly known as search engines. Search engines typically process a pool of documents and build an index of words. A user can enter a search request, or query, seeking a list of documents that contain certain words. The search engine processes the index and returns a list of documents that satisfy the request. Search engines are used frequently to determine which web sites on the Internet contain relevant content. Search engines are also used to access information from intranets, file servers, and databases. Given the vast amount of data that is electronically available, search engines are becoming increasingly important as a mechanism for finding relevant documents.
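The index-then-query flow described above can be illustrated with a minimal sketch. It assumes a simple in-memory inverted index that maps each word to the set of documents containing it. The function names `build_index` and `search_all`, the tokenizing rule, and the sample documents are hypothetical and chosen only for illustration; they are not taken from any particular search engine.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumed tokenizer: runs of letters and digits count as words.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search_all(index, words):
    """Return the sorted ids of documents that contain every requested word."""
    postings = [index.get(word.lower(), set()) for word in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample pool of documents.
docs = {
    1: "Search engines build an index of words.",
    2: "An index maps words to documents.",
    3: "Databases store records.",
}
index = build_index(docs)
matches = search_all(index, ["index", "words"])  # documents 1 and 2
```

The query is answered from the index alone; the original document text is not rescanned at search time.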
Efficiency is very important in search engine technology. While inefficient indexing and/or search processing may not be noticeable when the relevant pool of documents is relatively small, inefficiency will quickly lead to excessive index and search processing times when the pool of documents is relatively large. Efficiency is also an important consideration for other aspects of full text indexes, such as processing complex queries, or processing natural language queries. A search engine typically implements natural language searching by breaking a search request into multiple sub-queries. Consequently, if the searching algorithm is inefficient, response time can be seriously degraded.
A search request typically takes the form of one or more words combined by one or more operators, such as AND, OR, or NOT, and may include proximity restrictions, such as word A within 10 words of word B. The search engine determines which documents satisfy the request, and returns a list of such documents.
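As a rough illustration of how such requests might be evaluated, the sketch below assumes a positional index that records, for each word, the word positions at which it occurs in each document. Boolean operators then reduce to set operations on document ids, and a proximity restriction compares positions. The names `build_positional_index`, `docs_with`, and `within`, and the sample documents, are hypothetical.

```python
import re
from collections import defaultdict


def build_positional_index(documents):
    """Map word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word][doc_id].append(position)
    return index


def docs_with(index, word):
    """Set of document ids that contain the word."""
    return set(index[word]) if word in index else set()


def within(index, a, b, distance):
    """Documents where some occurrence of a is within `distance` words of b."""
    hits = set()
    for doc_id in docs_with(index, a) & docs_with(index, b):
        if any(abs(pa - pb) <= distance
               for pa in index[a][doc_id]
               for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits


# Hypothetical sample pool of documents.
docs = {
    1: "full text index stored on disk",
    2: "the index is large and the disk is slow",
    3: "memory is fast",
}
index = build_positional_index(docs)

both = docs_with(index, "index") & docs_with(index, "disk")       # index AND disk
either = docs_with(index, "index") | docs_with(index, "memory")   # index OR memory
without = docs_with(index, "index") - docs_with(index, "large")   # index NOT large
near = within(index, "index", "disk", 3)                          # index within 3 words of disk
```

In this sample, both documents 1 and 2 satisfy the AND request, but only document 1 satisfies the proximity restriction, because "index" and "disk" are five words apart in document 2.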
When the pool, or set, of indexed documents is large, a large number of documents may fulfill the search request. To help the user determine which documents will most likely contain relevant content, many search engines provide a "relevance" ranking for each document that fulfills the search request. The relevance ranking is an estimate, provided by the search engine, of the importance of the document in view of the particular search request. The ability to rank and present documents to a user in order of their relevance is becoming increasingly important to minimize the time a user must spend in determining which of the many documents that fulfill the search request are, in fact, relevant. Ranking documents by relevance adds complexity to the search engine, and presents another potential efficiency consideration. Ideally, the relevance determination will not add significantly to the overall response time of the search engine.
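The text above does not specify how relevance is computed. Purely as an illustration, the sketch below assumes one very simple scoring rule: a document's score is the total number of times the requested words occur in it. The function name `rank`, the scoring rule, and the sample documents are all assumptions; practical search engines generally use more elaborate formulas.

```python
import re
from collections import Counter


def rank(documents, query_words):
    """Return (doc_id, score) pairs for matching documents, best score first."""
    query = [word.lower() for word in query_words]
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        # Only documents containing every requested word fulfill the request.
        if all(counts[word] > 0 for word in query):
            # Assumed score: total occurrences of the requested words.
            scored.append((doc_id, sum(counts[word] for word in query)))
    # Sort by descending score; break ties by ascending document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample pool of documents.
docs = {
    1: "index index search",
    2: "search the index",
    3: "nothing here",
}
ranked = rank(docs, ["index", "search"])
```

This sketch rescans every document for clarity, which is exactly the kind of cost the efficiency concern above warns about; an engine would normally derive the counts from its index instead.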
One of the best mechanisms for increasing the efficiency of a search engine is to minimize peripheral input/output (I/O) operations and in-memory table accesses. A full text index is typically made up of several tables of information, including cross-reference information, and during a search request, many different tables are accessed to make pertinent determinations, including determining in what document a word is located. Full text indexes can be very large, and can take up hundreds of megabytes or more of space. Because of its size, an entire full text index typically will not fit in the memory of a computer, so an index table access will likely result in at least one I/O operation to disk, and, depending on the access methodology, can result in multiple I/Os. An I/O operation is an extremely time-consuming process. Moreover, a single search request may require hundreds of thousands of table accesses, depending on how common the searched word is. Since eliminating or reducing I/O operations can significantly reduce response time, it is beneficial to reduce table accesses. One mechanism for reducing table accesses would be to store word information in such a manner that the information itself allows document-level determinations to be made without the need to access a separate document-level table.
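One way to picture the last point is a sketch in which each stored occurrence of a word carries its document identifier inside the entry itself, so the document can be read off the entry without consulting a separate table. The encoding below is an assumption made for illustration only, not the storage format the text has in mind: the document id occupies the high bits of an integer and the word position occupies the low bits. The names `POSITION_BITS`, `pack`, `unpack`, and `documents_for` are hypothetical.

```python
POSITION_BITS = 20  # assumed limit: fewer than 2**20 words per document


def pack(doc_id, position):
    """Combine a document id and a word position into one integer entry."""
    if doc_id < 0:
        raise ValueError("doc_id must be non-negative")
    if not 0 <= position < (1 << POSITION_BITS):
        raise ValueError("position out of range")
    return (doc_id << POSITION_BITS) | position


def unpack(entry):
    """Recover (doc_id, position) from a packed entry."""
    return entry >> POSITION_BITS, entry & ((1 << POSITION_BITS) - 1)


def documents_for(entries):
    """Distinct document ids, read directly from the entries themselves."""
    return sorted({entry >> POSITION_BITS for entry in entries})


# Hypothetical entries for one word: twice in document 3, once in document 9.
entries = [pack(3, 7), pack(3, 42), pack(9, 0)]
doc_ids = documents_for(entries)  # no separate document-level table is read
```

With this layout, determining which documents contain a word costs one shift per entry. The trade-off is a fixed upper bound on document length, set here by the assumed `POSITION_BITS` value.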