In the prior art, it has been well known that computer systems can be used to manage indices to records of databases. Many techniques are known to parse, index and search databases. However, managing extremely large databases presents special problems.
In recent years, a unique distributed database has emerged in the form of the World-Wide-Web (Web). The database records of the Web are in the form of pages accessible via the Internet. Here, tens of millions of pages are accessible by anyone having a communications link to the Internet.
The pages are dispersed over millions of different computer systems all over the world. Users of the Internet constantly desire to locate specific pages containing information of interest. The pages can be expressed in any number of different languages and character sets, such as English, French, German, Spanish, Cyrillic, Katakana, and Mandarin. In addition, the pages can include specialized components, such as embedded "forms," executable programs, JAVA applets, and hypertext.
Moreover, the pages can be constructed using various formatting conventions, for example, ASCII text, PostScript files, HTML files, and Acrobat files. The pages can include links to multimedia content other than text, such as audio, graphics, and moving pictures. As a further complexity, the Web can be characterized as a database subject to unpredictable random updates, insertions, and deletions, and having a constantly changing morphology.
Prior art search engines are ill-equipped to handle the formidable task of indexing the Web. Most database access tools are designed to be context dependent. Extant indexing systems, such as Lexis/Nexis, Dialog, and EXCITE, index a limited number of pages, either by choice or because of limitations in their browsing or indexing capabilities. Attempts have been made to reduce the size of their indices by excluding commonly occurring English words such as "a," "the," "of," and "in."
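The stop-word exclusion described above can be sketched as follows; the stop-word set and sample text here are illustrative only, and do not reflect any particular prior art engine.

```python
# A minimal sketch of stop-word exclusion during indexing.
# The stop-word set and sample text are illustrative only.
STOP_WORDS = {"a", "the", "of", "in"}

def index_terms(text):
    """Return the words to be indexed, excluding common stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(index_terms("The index of the Web in a nutshell"))
# ['index', 'web', 'nutshell']
```

Note that such exclusion reduces index size at the cost of making phrases composed of stop words unsearchable.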
Other search engines only index abstracts of the Web pages, such as their titles, authors, and locations, and not the full content of the pages. These are severe limitations, particularly in an environment which permits the creation of pages in other linguistic and grammatical constructs.
It is an additional problem to present the search results in a usable manner, particularly if a search request can locate thousands of qualifying records. It is a burden to require the user to peruse all qualifying records.
It is also a problem to conduct a search in a timely fashion. In a commercial environment, users may be charged for connect time. Therefore, it is important that the searches be performed in a matter of seconds. As an additional problem, new records appear by the thousands, and incrementally updating an index is difficult, particularly if the index needs to be continuously accessible for searching.
Most conventional full text indices are arranged as one or more sorted lists of unique-valued words, together with pointers to lists of locations where the words occur in the database. Typically, the lists of locations are maintained separately. This approach requires at least one level of indirection to access a word/location pair. In addition, this type of data structure implies that the words and their locations, taken together, are ordered not sequentially but hierarchically.
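The conventional arrangement just described can be sketched as follows. This is a simplified model, not any particular prior art system: a sorted list of unique words, each entry pointing to a separately maintained list of locations, so that retrieving a word/location pair requires one level of indirection.

```python
# Sketch of a conventional full-text index: a sorted list of unique
# words, each pointing to a separately kept list of locations where
# the word occurs. The sample text is illustrative.
from bisect import bisect_left

class ConventionalIndex:
    def __init__(self):
        self.words = []       # sorted list of unique words
        self.postings = []    # postings[i] = locations of words[i]

    def add(self, word, location):
        i = bisect_left(self.words, word)
        if i < len(self.words) and self.words[i] == word:
            self.postings[i].append(location)
        else:
            self.words.insert(i, word)
            self.postings.insert(i, [location])

    def locations(self, word):
        # One level of indirection: find the word in the sorted list,
        # then follow the pointer to its separate location list.
        i = bisect_left(self.words, word)
        if i < len(self.words) and self.words[i] == word:
            return self.postings[i]
        return []

idx = ConventionalIndex()
for loc, w in enumerate("to be or not to be".split()):
    idx.add(w, loc)
print(idx.locations("be"))   # [1, 5]
```

The indirection is harmless for a single lookup but, as noted below, becomes costly when the location lists reside on slower storage than the word list.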
It is also a problem to minimize the amount of physical media, e.g., memories, required to store the index. For example, in the prior art, attempts are made to keep all of the first-level pointers in dynamic random access memories (DRAM), which have relatively low access latencies. The lists of pointers are typically maintained on slower disk storage. For these reasons, most large-scale indices use a hierarchical architecture, such as multi-way trees or tries, for scanning the indexed entries.
Simplistically, the word/location pairs can be considered a large two-dimensional array, with the words along one axis, and the location of the words along the other axis.
This makes it relatively inexpensive to provide access to any of the words, since most words can be cached in DRAM. However, determining the locations associated with the selected words will typically require a relatively large number of expensive disk operations, reducing the throughput efficiencies of the index.
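The two-dimensional view described above can be sketched as follows; the sample text and structure are purely illustrative. One axis holds the unique words, and along the other axis lie the locations at which each word occurs.

```python
# Sketch: viewing word/location pairs as a two-dimensional structure,
# with words along one axis and their locations along the other.
# The sample text is illustrative only.
text = "a rose is a rose"
words = text.split()          # locations are word positions 0..4

# One row per unique word; the row's entries are that word's locations.
rows = {}
for loc, word in enumerate(words):
    rows.setdefault(word, []).append(loc)

for word in sorted(rows):
    print(word, rows[word])
# a [0, 3]
# is [2]
# rose [1, 4]
```

Caching the word axis in fast memory is cheap because unique words are few; the location axis is far larger, which is why fetching the locations dominates the cost.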
Also, most prior art full text indices deal inefficiently with indexing context-dependent attributes of the information being indexed. Some systems maintain separate indices to store general information about the data or records indexed. This approach increases the cost of searching the indices with queries having terms which combine words with general descriptions of documents or parts of documents, such as titles.
Furthermore, in conventional indices, ad hoc data structures are typically provided to allow updating, e.g., the addition and deletion of entries, concurrently with searching. For example, a journal or "stop-press" file may store new entries. The design and support of different data structures, and of the processes which operate thereon, degrades performance. Periodically, when the typically inefficient journal file becomes too large, it must be merged with the primary index, perhaps precluding searching of the index during the update process.
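The journal or "stop-press" scheme can be sketched as follows. All names and the merge threshold are illustrative assumptions; real systems merge at far larger sizes, and the merge may exclude concurrent searching.

```python
# Sketch of the "stop-press" journal scheme: new entries accumulate
# in a small journal, searches consult both structures, and the
# journal is periodically merged into the primary index.
class JournaledIndex:
    MERGE_THRESHOLD = 3   # illustrative; real systems use far larger limits

    def __init__(self):
        self.primary = {}  # word -> sorted list of locations
        self.journal = {}  # recent additions awaiting merge

    def add(self, word, location):
        self.journal.setdefault(word, []).append(location)
        if sum(len(v) for v in self.journal.values()) >= self.MERGE_THRESHOLD:
            self.merge()

    def merge(self):
        # The costly periodic step: fold the journal into the primary
        # index (during which searching may have to be suspended).
        for word, locs in self.journal.items():
            self.primary[word] = sorted(self.primary.get(word, []) + locs)
        self.journal.clear()

    def locations(self, word):
        # Every search must consult both data structures.
        return sorted(self.primary.get(word, []) +
                      self.journal.get(word, []))
```

The double lookup on every search, and the suspension of service during merges, are the performance penalties the passage above alludes to.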
If the database from which the index is to be created is large, the documents to be processed are usually divided into several parts. The several parts are indexed separately, and when all parts have been indexed, the indexed parts are processed by a sort/merge operation. In a lexicon based partitioning approach, during a first pass of the database, only words having values in a predetermined range are indexed, for example, only the words beginning with the letters A through D, then during subsequent passes the rest of the words are processed. This type of approach is not well suited for a database such as the Web.
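The lexicon-based partitioning approach can be sketched as follows; the letter ranges and sample documents are illustrative assumptions. Note that each range requires a complete pass over the entire database, which is what makes the approach ill-suited to a database as large and volatile as the Web.

```python
# Sketch of lexicon-based partitioning: each pass over the database
# indexes only words whose first letter falls in a given range, so a
# complete index requires several full passes. Ranges are illustrative.
PASSES = [("a", "d"), ("e", "m"), ("n", "z")]

def build_index(documents):
    index = {}  # word -> list of document ids
    for lo, hi in PASSES:                    # one full pass per range
        for doc_id, text in enumerate(documents):
            for word in text.lower().split():
                if lo <= word[0] <= hi:
                    index.setdefault(word, []).append(doc_id)
    return index

docs = ["cats and dogs", "every dog barks"]
print(sorted(build_index(docs)))
# ['and', 'barks', 'cats', 'dog', 'dogs', 'every']
```

Here three passes over both documents were needed to index six words; for a Web-scale database, repeated full scans of this kind are prohibitively expensive.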
Therefore, it is desired to provide a search engine which can index large databases storing content in a number of different forms. The data structure of the index should be compact in order to reduce the cost of the search engine. In addition, the processes which operate on the data structure should be efficient so that search results can be presented in a reasonable amount of time and a meaningful manner.