The World Wide Web (“the web”) is a distributed international electronic library of documents and other data resources. A particular document is accessed on the web using a unique identifier for the document, called a “URL,” short for Uniform Resource Locator. If a user seeking to access a particular document has the URL for the document, s/he may simply type it into the URL field of a web browser. In many cases, the URL for a document may be obtained from a second, related document containing a link to the first document.
It is conservatively estimated that over a billion documents are available on the web. (Indeed, smaller “webs,” such as “Intranets” used only by the employees of a particular business, may themselves provide access to hundreds of thousands of documents.) For a particular user having a particular need, the web may contain several documents that address the need, all unknown to the user. For example, for a user interested in details of the 1955 grape harvest in Eastern Washington, 15 documents may be available on the web that contain such information, all unknown to the user.
In order to help users identify documents on the web relating to particular subjects, hierarchical web directories and web search engines have been developed. A hierarchical web directory is a set of human-compiled lists of documents available via the web, each list relating to a particular subject represented in a hierarchy of topics. Table 1 below shows a designation of a hierarchical web directory topic whose corresponding list includes documents containing information about the 1955 grape harvest in Eastern Washington.
TABLE 1
Society and Culture
  Food and Drink
    Spirits
      Wine
        Regional
          Eastern Washington
The topic corresponding to the list, Eastern Washington, is a subtopic of the topic “Regional,” which is a subtopic of the topic “Wine,” which is a subtopic of the topic “Spirits,” which is a subtopic of the topic “Food and Drink,” which is a subtopic of the topic “Society and Culture.” In order to provide a hierarchical web directory, its provider must create a hierarchy of topics, identify documents available via the web, and identify topics to whose lists the identified documents should be added.
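The provider's three tasks (creating a hierarchy of topics, identifying documents, and assigning documents to topic lists) can be pictured with a minimal sketch. The class `TopicNode`, the functions `add_document` and `lookup`, and the example URL are hypothetical names chosen for illustration; no particular directory is known to be implemented this way.

```python
# Illustrative sketch only: a topic hierarchy held as a tree of nodes.
# TopicNode, add_document, and lookup are hypothetical names.

class TopicNode:
    def __init__(self, name):
        self.name = name
        self.subtopics = {}   # subtopic name -> TopicNode
        self.documents = []   # URLs listed under this topic

    def subtopic(self, name):
        # Return the named subtopic, creating it on first use.
        if name not in self.subtopics:
            self.subtopics[name] = TopicNode(name)
        return self.subtopics[name]


def add_document(root, path, url):
    # Walk (and create) the sequence of subtopics, then list the URL there.
    node = root
    for name in path:
        node = node.subtopic(name)
    node.documents.append(url)


def lookup(root, path):
    # Follow an exact sequence of subtopics; an unknown step yields no list.
    node = root
    for name in path:
        node = node.subtopics.get(name)
        if node is None:
            return []
    return list(node.documents)


root = TopicNode("Directory")
path = ["Society and Culture", "Food and Drink", "Spirits",
        "Wine", "Regional", "Eastern Washington"]
add_document(root, path, "http://example.com/harvest-1955")  # assumed URL

found = lookup(root, path)
missed = lookup(root, ["Society and Culture", "Agriculture"])
```

In this sketch a lookup succeeds only when the user follows exactly the sequence of subtopics under which the document was filed; any other path returns an empty list.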
A web search engine, on the other hand, allows users to type one or more keywords and returns a list of documents containing those keywords. In particular, web search engines typically list the documents in which occurrences of the keywords make up the highest percentages of the text, among all of the documents searched. For example, to identify documents containing details of the 1955 grape harvest in Eastern Washington, a user might type the keyword string “1955 grape harvest Eastern Washington.” The web search engine processes such queries against a database representing the contents of as many web pages as possible, typically gathered by “spidering,” or automatically traversing links from known web pages to new web pages.
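The ranking by percentage of keyword occurrences described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any actual search engine: the function names `keyword_percentage` and `search`, the whitespace tokenization, and the choice to list any document containing at least one keyword are all assumptions.

```python
# Illustrative sketch only: rank documents by the fraction of their words
# that are query keywords. All names here are hypothetical.

def keyword_percentage(text, keywords):
    # Fraction of the document's words that match any keyword.
    words = text.lower().split()
    if not words:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in wanted)
    return hits / len(words)


def search(documents, query, limit=10):
    # documents maps a URL (or other identifier) to the page text.
    keywords = query.split()
    scored = [(keyword_percentage(text, keywords), url)
              for url, text in documents.items()]
    # Assumption: keep any document containing at least one keyword.
    scored = [(score, url) for score, url in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored[:limit]]


pages = {
    "a": "1955 grape harvest in eastern washington was large",
    "b": "grape juice recipes for the whole family at home",
    "c": "weather report",
}
results = search(pages, "1955 grape harvest Eastern Washington")
```

Here page "a" ranks first because five of its eight words are keywords, page "b" follows with one keyword in nine words, and page "c" is omitted because it contains no keyword.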
Both of these conventional approaches to identifying documents on the web have significant disadvantages. Hierarchical web directories are extremely labor intensive, requiring human editors to review and categorize web documents. This reliance on manual processes often results in outdated or inaccurate content. Also, hierarchical web directories can be used only to identify web pages relating to topics created by human editors. Hierarchical web directories are also difficult for users to use successfully, as a user must typically select exactly the same sequence of subtopics as the editor who catalogued the web site.
Web search engines, while not typically suffering from the deficiencies of hierarchical web directories relating to their manual nature, have the disadvantage that they rely on the occurrence of particular keywords in sought web pages. Because many words have multiple meanings, web search engines often generate false positive matches, where a keyword appears in the web page in a different sense than the sense intended by the user in formulating the query. On the other hand, because of the large number of words that can be used to express the same idea, web search engines also often generate false negative matches, where a relevant web page is missed because it uses different words than those in the query. Web search engines also typically filter out noise words that occur in most web pages, such as “an” or “if,” which makes it impossible to search for web pages using these words. Further, aside from applying certain frequency analysis techniques, web search engines typically ignore the specific usage and significance of particular keywords in the searched web pages.
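These three failure modes can be reproduced with a toy keyword matcher. The sketch below is illustrative only: the noise word list, the function names `query_terms` and `matches`, and the sample page texts are assumptions made for the example.

```python
# Illustrative sketch only: a toy matcher showing false positives,
# false negatives, and noise word filtering. All names are hypothetical.

NOISE_WORDS = {"a", "an", "the", "if", "of", "in"}  # assumed sample list


def query_terms(query):
    # Drop noise words from the query before matching.
    return [w for w in query.lower().split() if w not in NOISE_WORDS]


def matches(text, query):
    # A page matches only if every remaining query term occurs in it.
    terms = query_terms(query)
    words = set(text.lower().split())
    return bool(terms) and all(t in words for t in terms)


# A user looking for the animal types "jaguar".
false_positive = matches("jaguar dealership opens downtown", "jaguar")
false_negative = matches("the big cat prowls the rainforest", "jaguar")

# A query made up only of noise words has no terms left to match.
noise_only = matches("what to do if an alarm sounds", "if an")
```

The first page matches although it concerns cars, the second page is missed although it concerns the animal, and the third query cannot match anything because filtering leaves it empty.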
Accordingly, a more effective approach to identifying documents on the web and in other electronic libraries would have significant utility.