It is well known in the prior art that computer systems can be used to manage indices to the records of databases. Many techniques are known for indexing and searching databases. However, managing extremely large databases presents special problems.
In recent years, a unique distributed database has emerged in the form of the World Wide Web (Web). The database records of the Web are in the form of pages accessible via the Internet. Here, tens of millions of pages are accessible to anyone having a communications link to the Internet.
The pages are dispersed over millions of different computer systems all over the world. Users of the Internet constantly desire to locate specific pages containing information of interest. The pages can be expressed in any number of different languages and character sets, such as English, French, German, Spanish, Cyrillic, Katakana, and Chinese. In addition, the pages can include specialized components, such as embedded "forms," executable programs, Java applets, and hypertext.
Moreover, the pages can be constructed using various formatting conventions, for example, ASCII text, PostScript files, HTML files, and Acrobat files. The pages can include links to multimedia information content other than text, such as audio, graphics, and moving pictures.
Prior art search engines are ill-equipped to handle the formidable task of indexing the Web. Most database access tools are designed to be context dependent. Extant indexing systems such as Lexis/Nexis, Dialog, and EXCITE index a limited number of pages, either by choice or because of limitations in their browsing or indexing capabilities. Attempts have been made to reduce the size of their indices by excluding commonly occurring English words such as "a," "the," "of," and "in."
Other search engines index only abstracts of the Web pages, such as their titles, authors, and locations, and not the full content of the pages. These are severe limitations, particularly in an environment that permits the creation of pages in other linguistic and grammatical constructs.
Conducting a search in an efficient and timely manner is also a problem. Since the content can be expressed in a number of different modalities, the interfaces to the indices can be complex and numerous.
It is desired to provide a small number of simple-to-use interfaces to an index which has an extremely large number of entries. The interfaces should allow the searching of the index using a variety of logically combined search terms.
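By way of illustration only, searching an index with logically combined terms can be sketched as set operations over a small in-memory inverted index. The sketch below is an assumption made for exposition and is not the indexing method of any system named above; the names `build_index` and `search`, the parameters `all_of`, `any_of`, and `none_of`, and the sample pages are all hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, all_of=(), any_of=(), none_of=()):
    """Combine terms logically: AND (all_of), OR (any_of), NOT (none_of)."""
    # Start from every indexed page, then narrow the candidate set.
    result = set().union(*index.values())
    for term in all_of:
        result &= index.get(term.lower(), set())
    if any_of:
        result &= set().union(*(index.get(t.lower(), set()) for t in any_of))
    for term in none_of:
        result -= index.get(term.lower(), set())
    return result


# Hypothetical sample pages, keyed by page id.
pages = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}
index = build_index(pages)

# Pages containing "quick" but not "red".
matches = search(index, all_of=["quick"], none_of=["red"])
```

In this sketch every word is indexed, including common words such as "the" and "a," so no term is excluded from searching. A single function with three optional term lists stands in for the small number of simple interfaces described above.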