In the prior art, it has been well known that computer systems can be used to index databases, and to search the index to locate records qualified by queries. In recent years, a unique distributed database has emerged in the form of the World-Wide-Web (Web). The database records of the Web are in the form of pages accessible via the Internet. Here, tens of millions of pages are accessible by anyone having a communications link to the Internet.
The pages are dispersed over millions of different computer systems all over the world. Users of the Internet constantly desire to locate specific pages containing information of interest. The pages can be expressed in any number of different languages and character sets, such as English, French, German, Spanish, Cyrillic, Katakana, and Mandarin. In addition, the pages can include specialized components, such as embedded “forms,” executable programs, Java applets, and hypertext.
Moreover, the pages can be constructed using various formatting conventions, for example, ASCII text, PostScript files, HTML files, and Acrobat files. The pages can include links to multimedia information content other than text, such as audio, graphics, and moving pictures.
Search engines have been provided to allow users to locate Web pages of interest. These search engines typically have a query interface where the users specify terms and operators which they want to use to qualify pages.
There are a number of problems with locating pages using an index to the Web. First, the number of pages accessible through the Web is very large, so the number of potential qualifying pages is also going to be large. In addition, many Web users are unsophisticated, so in many instances queries are going to be loosely specified, potentially yielding many pages which may not be of interest to the users. The number of qualifying pages may number in the tens of thousands.
It is desired to minimize the number of index entries which need to be searched for query terms which are not likely to yield fruitful results, and maximize the search of the index using query terms that are more likely to locate records of interest to users.
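By way of illustration only, the following minimal sketch shows one common way such a goal can be approached: an index maps each word to the set of pages containing it, and a query requiring all terms is evaluated starting with the rarest term, so that frequently occurring terms are checked only against the few remaining candidates. The function names `build_index` and `search_all`, the sample pages, and the whitespace tokenization are assumptions made for this example and are not taken from the description above.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return dict(index)


def search_all(index, terms):
    """Return the sorted ids of pages containing every term.

    Terms are processed from rarest to most common, so the candidate
    set is as small as possible before common terms are consulted.
    """
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    postings.sort(key=len)  # rarest term first
    candidates = set(postings[0])
    for posting in postings[1:]:
        if not candidates:
            break  # no page can qualify; skip the remaining terms
        # Only the surviving candidates are looked up in each posting set.
        candidates = {page_id for page_id in candidates if page_id in posting}
    return sorted(candidates)


# Hypothetical sample data for the example.
pages = {
    1: "the web is a distributed database",
    2: "the index locates records in the database",
    3: "the query terms qualify pages",
}
index = build_index(pages)
print(search_all(index, ["the", "database", "index"]))
```

In this sketch the common word "the" is consulted last, after the rarer terms have already narrowed the candidates to a single page.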