For most users, a search of a database for documents related to a particular topic begins with the formulation of a search query for use by a search engine. The search engine then identifies documents that match the specifications that the user sets forth in the search query. These documents are then presented to the user, usually in an order that approximates how closely each document matches those specifications.
In its simplest form, the search query might be no more than a word or a phrase. However, such simple search queries typically result in the retrieval of far too many documents, many of which are likely to be irrelevant. To avoid this, search engines provide a mechanism for narrowing the search, typically by allowing the user to specify some Boolean combination of words and phrases. More complex search queries allow a user to specify that two Boolean combinations be found within a particular distance, usually measured in words, of each other. Search queries can also include wildcard characters, or mechanisms for including or excluding certain word variants.
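The query mechanisms described above can be sketched in a few lines. The following is a minimal illustration, not any particular engine's implementation: the function names, the sample document, and the word-distance convention (absolute difference of token positions) are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all(tokens, terms):
    """Boolean AND: every query term must appear somewhere in the document."""
    token_set = set(tokens)
    return all(term in token_set for term in terms)

def within_distance(tokens, term_a, term_b, max_words):
    """Proximity: some occurrence of term_a lies within max_words of term_b."""
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= max_words
               for a in positions_a for b in positions_b)

doc = "The patent examiner searched the case law database for prior art."
tokens = tokenize(doc)
print(matches_all(tokens, ["patent", "database"]))  # True
print(within_distance(tokens, "case", "law", 1))    # True: adjacent words
print(within_distance(tokens, "patent", "art", 3))  # False: nine words apart
```

A wildcard such as `law*` could be handled the same way by testing `t.startswith("law")` instead of exact equality; the point is only that every such query reduces to a test on the positions of character strings.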
Regardless of its complexity, a search query is fundamentally no more than a user's best guess as to the distribution of alphanumeric characters that is likely to occur in a document containing the information of interest. The success of a search query thus depends on the user's skill in formulating the search query and on the predictability of the documents in the database. Hence, a search query of this type is likely to be most successful when the documents in the database are either inherently structured or under editorial control. Because of the necessity for thorough editorial review, such databases tend to be either somewhat specialized (for example, databases for patent searching or for searching case law) or slow to change (for example, CD-ROM encyclopedias).
Because of its distributed nature, the internet offers a breadth of up-to-date information. However, documents posted on the internet are often posted with little editorial control. As a result, many documents are plagued with inconsistencies and errors that reduce the effectiveness of a search engine. In addition, because the internet has become an advertising medium, many sites seek to attract visitors; proprietors of those sites therefore pepper them with words that are invisible to the reader, as bait for attracting the attention of search engines. The presence of such invisible words thwarts the search engine's attempt to judge the relevancy of a document solely by the distribution of words in the document.
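The effect of such invisible bait words on a purely word-distribution signal is easy to demonstrate. The sketch below is illustrative only; the page texts and the naive counting measure are assumptions made for the example.

```python
from collections import Counter

def term_count(text, term):
    """Naive relevance signal: how many times does the term occur in the text?"""
    return Counter(text.lower().split())[term]

# A genuinely relevant page mentions the topic once, in context.
honest_page = "Our travel guide covers hotels in Paris and day trips nearby."

# An unrelated page padded with repeated bait words, invisible to the reader.
stuffed_page = "Cheap widgets for sale. " + "paris " * 50

print(term_count(honest_page, "paris"))   # 1
print(term_count(stuffed_page, "paris"))  # 50
```

An engine that sees only the word distribution has no basis for preferring the honest page: by its sole available measure, the stuffed page appears fifty times more relevant.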
The unreliability associated with many documents on the internet poses a difficult problem when a search engine attempts to rank the relevance of retrieved documents. Because all the search engine knows is the distribution of words, it can do no more than indicate that the distribution of words in one document matches the search query more or less closely than the distribution of words in another document. This can result in such a profusion of search results that it is impractical to examine them all. Moreover, because there is no absolute standard for relevance on the internet, there is no assurance that the most highly ranked document returned by a search engine is even relevant at all. It may simply be the least irrelevant document in a collection of irrelevant documents.
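The purely relative character of distribution-based ranking can be made concrete with a small sketch. The documents, the scoring rule (a simple sum of query-term occurrences), and all names below are assumptions made for illustration, not a description of any actual engine.

```python
from collections import Counter

def score(doc_tokens, query_terms):
    """Score a document by how often the query terms occur in it.
    The score is purely relative: it says nothing about absolute relevance."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

docs = {
    "a": "apple pie recipe with apple filling".split(),
    "b": "apple stock price report".split(),
    "c": "train schedule and fares".split(),
}
query = ["apple", "recipe"]

ranked = sorted(docs, key=lambda name: score(docs[name], query), reverse=True)
print(ranked)  # ['a', 'b', 'c']: scores 3, 1, 0
```

Note that the ranking would come out identically if none of the three documents were actually useful to the user: the top result is simply the one whose word distribution least poorly matches the query.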
Attempts have been made to improve the searchability of the internet by having human editors assess the reliability and relevance of particular sites. Addresses of those sites meeting a threshold of reliability are then provided to the user. For example, major publishers of encyclopedias on CD-ROM provide pre-selected links to internet sites in order to augment the materials provided on the CD-ROM. However, these attempts are hampered by the fact that internet sites can change, both in content and in address, overnight. Thus, a reviewed site that existed when the CD-ROM was published may no longer exist when a user later attempts to activate its link.
It is apparent that the dynamic and free-form nature of the internet results in a highly diversified and current storehouse of reference materials. However, the uncontrolled nature of documents on the internet results in an environment that is not efficiently searchable by a conventional search engine.