Information has recently undergone a transition from a scarce commodity to an overabundant one. With a scarce commodity, efforts center on acquiring the commodity, whereas with an overabundant commodity, efforts center on filtering it to make it more valuable. The prime example of this phenomenon is the explosion of information resulting from the growth of the global internetwork of networks known as the “Internet.” Networks and computers connected to the Internet exchange data using TCP/IP (Transmission Control Protocol/Internet Protocol), which reliably passes data packets from a source node to a destination node. A variety of higher-level protocols are used on top of TCP/IP to transport objects of digital data, the particular protocol depending on the nature of the objects. For example, e-mail is transported using the Simple Mail Transfer Protocol (SMTP), while files are transported using the File Transfer Protocol (FTP).
Hypertext documents and their associated effects are transported using the Hypertext Transfer Protocol (HTTP). When many hypertext documents are linked to other hypertext documents, they collectively form a “web” structure, which led to the name “World Wide Web” (often shortened to “WWW” or “the Web”) for the collection of hypertext documents that can be transported using HTTP. Of course, hyperlinks are not required in a document for it to be transported using HTTP. In fact, any object can be transported using HTTP, so long as it conforms to the requirements of HTTP.
In a typical use of HTTP, a browser sends a uniform resource locator (URL) to a Web server and the Web server returns a Hypertext Markup Language (HTML) document for the browser to display. The browser is one example of an HTTP client and is so named because it displays the returned hypertext and allows the user an opportunity to select and display other hypertext documents referenced in the returned document. The Web server is an Internet node which returns hypertext documents requested by HTTP clients.
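The request/response exchange described above can be sketched in a few lines. This sketch only constructs the text of a minimal HTTP/1.0 GET request that a browser would send for a URL; the host name and path are hypothetical examples, and a real browser would send many additional headers.

```python
# Sketch of the client side of the HTTP exchange described above:
# a browser-like client forms a GET request for a URL, and a Web
# server would answer it with an HTML document for display.
# The URL used here is a hypothetical example.
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Build the text of a minimal HTTP/1.0 GET request for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {parts.netloc}\r\n"
        "\r\n"
    )


request = build_get_request("http://www.xerox.com/index.html")
print(request)
```

Sending this text over a TCP connection to port 80 of the named host, and reading the reply, is all that a minimal HTTP client does; the server's reply carries the requested hypertext document.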
Some Web servers, in addition to serving static documents, can return dynamic documents. A static document is a document which exists on a Web server before a request for the document is made; the Web server merely sends out the static document upon request. A static page URL is typically in the form of “host.subdomain.domain.TLD/path/file” or the like. That static page URL refers to a document named “file” found at the path “path” on the machine which has the domain name host.subdomain.domain.TLD. An actual domain name such as “www.” followed by “xerox” followed by “.com” refers to the machine (or machines) designated “www” at the domain “xerox” in the “.com” top-level domain (TLD). By contrast, a dynamic document is a document which is generated by the Web server when it receives a particular URL which the server identifies as a request for a dynamic document.
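The decomposition of a static-page URL into machine name, domain labels, path, and file, as described above, can be illustrated with the standard library. The URL below is a hypothetical example in the “host.subdomain.domain.TLD/path/file” form.

```python
# Decompose a static-page URL of the form
# "host.subdomain.domain.TLD/path/file" into its parts.
# The example URL is hypothetical.
from urllib.parse import urlsplit
import posixpath

url = "http://www.xerox.com/products/index.html"
parts = urlsplit(url)

machine = parts.netloc          # the machine's full domain name
labels = machine.split(".")     # host label, domain, and TLD
directory, filename = posixpath.split(parts.path)

# labels[0] is the designated machine ("www"), labels[1] the domain
# ("xerox"), and labels[-1] the top-level domain ("com"); filename
# is the requested document ("index.html").
print(machine, labels, directory, filename)
```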
Many Web servers operate “Web sites” which offer a collection of linked hypertext documents controlled by a single person or entity. Since the Web site is controlled by a single person or entity, the hypertext documents, often called “Web pages” in this context, have a consistent look and subject matter. Especially in the case of Web sites put up by commercial interests selling goods and services, the hyperlinked documents which form a Web site will have few, if any, links to pages not controlled by the interest. The terms “Web site” and “Web page” are often used interchangeably, but herein a “Web page” refers to a single hypertext document which forms part of a Web site and “Web site” refers to a collection of one or more Web pages which are controlled (i.e., modifiable) by a single entity or group of entities working in concert to present a site on a particular topic.
With all the many sites and pages that the many millions of Internet users might make available through their Web servers, it is often difficult to find a particular page or to determine where to find information on a particular topic. There is no “official” listing of what is available, because anyone can place anything on a Web server without reporting it to any official agency, and because the Web changes so quickly. In the absence of an official “table of contents”, several approaches to indexing the Web have been proposed.
One approach is to index all of the Web documents found everywhere. While this approach is useful for finding a document on a rarely discussed topic or a reference to a person with an uncommon first or last name, it often leads to excessive numbers of “hits.” Another approach is to categorize Web documents and make them searchable by category.
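The two indexing approaches described above can be sketched side by side: a full-text inverted index maps every word of every document to the pages containing it, while a category index maps an assigned category to its pages. All page URLs, texts, and category labels here are hypothetical.

```python
# Minimal sketch of the two Web-indexing approaches described above:
# a full-text inverted index (word -> pages) and a category index
# (category -> pages). All pages and categories are hypothetical.
from collections import defaultdict

pages = {
    "http://example.com/a": "rare topic discussed here",
    "http://example.com/b": "common topic discussed everywhere",
}
categories = {
    "http://example.com/a": "science",
    "http://example.com/b": "news",
}

# Full-text approach: index every word of every document.
word_index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        word_index[word].add(url)

# Category approach: make pages searchable by assigned category.
category_index = defaultdict(set)
for url, cat in categories.items():
    category_index[cat].add(url)

print(sorted(word_index["discussed"]))     # both pages "hit"
print(sorted(category_index["science"]))   # only the categorized page
```

The full-text query for a common word returns every page containing it, illustrating the “excessive hits” problem, while the category query narrows the result to pages filed under that category.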
Although the use of Internet search engines/agents to gather information from the Internet reduces the voluminous amount of information that must be examined, the search engines still return a very large number of Internet sites (URLs), which the person searching must tediously “visit” to extract applicable information and then decide whether to do more searching. Often the person finds an “alternate link” which may be interesting yet not within the search criteria, and the person spends time visiting other sites which are not directly applicable. This results in wasted time and longer overall information collection times. Therefore, there is a need to decrease the overall information collection time by extending the use of Internet search engines with an alternate methodology for information extraction.