1. Field of the Invention:
The present invention is generally related to systems for discriminating and organizing informational locators or key references obtained from source information and, in particular, to a system and process for expediently developing locators of independently distributed information accessible through a heterogeneous protocol network, such as the Internet.
2. Description of the Related Art:
The national and international packet switched public network generically referred to as the Internet has existed for some time. Although often referred to as a single technological entity, the Internet is represented by a substantial complex of communication systems ranging from conventional analog and digital telephone lines through fiber optic, microwave and satellite communications links. The physical structure of the Internet is logically unified through the establishment of common information transport protocols and addressing and resource referencing schemes that allow quite disparate computer systems to communicate both locally and internationally with one another.
Common information transport protocols include the basic file transfer protocol (FTP) and simple mail transfer protocol (SMTP). Other information transport protocols that are progressively more interactive, particularly in a visual manner, include the comparatively simple telnet protocol and the typically telnet-based gopher information request and retrieval service.
Recently, a new information transport protocol, known as the hypertext transfer protocol (HTTP), has been widely accepted on the Internet. This transport protocol is utilized to support a graphically interactive distributed information system variously known as the World Wide Web (WWW or W3) or simply as "the Web." The HTTP protocol provides for the transfer of both textual and graphical information via the Internet in a coordinated manner based on a system of client web page browser requests and remote web page server information responses. An HTTP session is established between a client browser and page server based on an HTTP transaction initiated in response to a browser reference to a uniform resource locator (URL). The URL system was comparatively recently established to provide a convenient and de facto standardized format by which different Internet based or accessed information sources can be identified by type, and therefore inferentially by access transport protocol. In general, URLs have the following form:
<protocol identifier>://<protocol server address>/<qualifier>
Typical protocol identifiers include FTP, Gopher, HTTP, and News. The protocol server address typically is of the form "prefix.domain," where the prefix is typically "www" for web servers and "ftp" for FTP servers. The "domain" is the standard Internet sub-domain.top-level-domain of the server. Optional qualifiers may be provided to specify, for example, a particular hypertext page maintained by a web server or a sub-directory accessible through an FTP server.
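By way of illustration only, the general URL form described above can be decomposed with a few lines of Python. This is a minimal sketch, not part of the described system: the function name `split_url` and the example addresses are hypothetical, and the split of the server address into a "prefix" and a "domain" simply follows the "prefix.domain" convention noted above, which not every server address obeys.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into the parts named in the text (hypothetical helper)."""
    parts = urlsplit(url)
    server = parts.hostname or ""
    # Assumption: the first label is the "prefix" and the rest is the "domain".
    prefix, _, domain = server.partition(".")
    return {
        "protocol": parts.scheme,             # e.g. "http", "ftp", "gopher"
        "server": server,                     # protocol server address
        "prefix": prefix,                     # e.g. "www" or "ftp"
        "domain": domain,                     # sub-domain.top-level-domain
        "qualifier": parts.path.lstrip("/"),  # optional page or sub-directory
    }


if __name__ == "__main__":
    for example in ("http://www.example.com/docs/index.html",
                    "ftp://ftp.example.org/pub"):
        print(split_url(example))
```

The protocol identifier recovered in this way is what allows a client to infer which access transport protocol to use for the referenced resource.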
Internet protocols such as FTP, Gopher and HTTP typically provide access to generally static information sources. The information is not entirely static, but rather typified by a static basic URL that provides referential access to information that is substantially persistent and typically updated or expanded on a periodic basis. Other Internet transport protocols exist to support dynamic information sources. These dynamic information sources are typified as highly fluid streams of information, often defined as articles or messages, exchanged via the Internet. In general, the content of these information streams is not persistent, at least in the sense that the information is not immediately organized and accessible, if ever, through generally static URLs.
A principal dynamic information source is the network news as transported over the Internet using the network news transfer protocol (NNTP). The network news system, historically referred to as Usenet, provides for the successive upstream and downstream propagation of news articles between interconnected computer systems. Specifically, news articles are posted to logically defined news groups and are propagated generally via the Internet to other computer systems that temporarily store the articles subject to expiration rules. Each participating computer system also serves to propagate the articles to other computer systems that have not previously received the propagating news articles.
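The propagation behavior described above, in which each system forwards an article only to systems that have not previously received it, can be sketched as a simple flood over a peer graph. This is an illustrative sketch under stated assumptions and not an implementation of NNTP: the function `propagate`, the `PEERS` table and the host names are hypothetical, and expiration rules and news groups are not modeled.

```python
from collections import deque


def propagate(peers, origin):
    """Return the (sender, receiver) transfers needed to flood one article.

    peers maps each host to the hosts it exchanges news with (hypothetical data).
    """
    have = {origin}        # hosts that already hold the article
    transfers = []         # record of each hop the article takes
    queue = deque([origin])
    while queue:
        host = queue.popleft()
        for peer in peers[host]:
            if peer not in have:   # offer only to hosts lacking the article
                have.add(peer)
                transfers.append((host, peer))
                queue.append(peer)
    return transfers


# Hypothetical four-host topology with redundant feeds.
PEERS = {
    "north": ["east", "west"],
    "east": ["north", "south"],
    "west": ["north", "south"],
    "south": ["east", "west"],
}

if __name__ == "__main__":
    for sender, receiver in propagate(PEERS, "north"):
        print(sender, "->", receiver)
```

Although the example topology contains redundant feeds, each host receives the article exactly once, since a host that already holds the article is skipped.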
Another, and again historically older, dynamic information source is provided by independently operating list servers (ListServ) residing on computer systems that are, in general, connected to the Internet. A list server is a typically automated service that functions autonomously to repeat electronic mail messages received by a publicly known list server E-Mail account to an established list of subscribers known to the list server by explicit or fully qualified E-Mail addresses. The list server is thus an automated electronic remailer that allows a one-to-many distribution of E-Mail messages through the indirection of the list server. The remailing of E-Mail messages is typically dynamic and, therefore, persistent messages are maintained, if at all, selectively by the subscribers of a particular mailing list. Furthermore, the list servers are themselves subject to extreme variability in location and operation, since only a publicly available dedicated E-Mail address is required in substance to operate a list server.
The ability to simply track if not expediently search for information available via the Internet has not kept pace with the rapid expansion of information available via the Internet. One predominant source of new information appears as essentially static web pages. Various automatons, often generally referred to as "web crawlers," have been developed to incrementally trace through URLs embedded in the various web pages and thereby develop an information map of available information resources within the logical web space. Since the Web is not entirely static, but rather greatly increasing in its extent and complexity on a continuing basis, web crawlers face a daunting task in repeatedly tracing out and maintaining a web space map of URLs.
Simply tracing through all URLs available via the web is not practical, if only in terms of the time and cost required to actually complete a trace before substantial portions of the map are antiquated by the addition and gradual revision of web URLs. Some estimates of the size of the Web place the number of presently active URLs at greater than about 50 million and growing rapidly. Furthermore, any such incremental tracing must be, by any practical definition, incomplete. A URL trace must contend with problems of infinite depth due to URL mutual references and reference looping, made further complex by the existence of URL aliases. A trace must also deal with discrete discontinuities that inherently exist at any given time in the basic structure of the URL defined web space. Normally, a self-contained or only outwardly directed island (connected group) of URL references may exist either by choice or as a consequence of the delay in the ponderous operation of web crawlers before discovering a URL trace that leads to a URL island. This tracing delay is conventionally reduced by trimming the depth at which URLs are traced from a base URL. However, this strategy actually increases the likelihood that more islands exist, and that more widely distributed and even larger islands of URLs are excluded from the URL map created by a web crawler.
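The tracing problems described above can be illustrated with a small in-memory link graph. The following is a sketch only, assuming a hypothetical `LINKS` table in place of real web pages and a hypothetical `trace` function in place of an actual web crawler; it shows how a record of visited URLs guards against reference loops, how an island that nothing links into is never discovered, and how trimming the trace depth enlarges the set of excluded URLs.

```python
from collections import deque


def trace(links, seeds, max_depth):
    """Breadth-first trace of URLs reachable from seeds within max_depth links."""
    seen = set(seeds)                       # guards against reference loops
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:              # depth trimming: do not expand further
            continue
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# Hypothetical web space: "x" and "y" form an only outwardly directed island.
LINKS = {
    "a": ["b"],
    "b": ["c", "a"],   # mutual reference between "a" and "b"
    "c": ["d"],
    "d": [],
    "x": ["y", "a"],   # links out to "a", but no page links in to "x" or "y"
    "y": ["x"],
}

if __name__ == "__main__":
    for depth in (10, 1):
        missed = sorted(set(LINKS) - trace(LINKS, ["a"], depth))
        print("max_depth", depth, "misses", missed)
```

In this sketch, an effectively unlimited trace from "a" still misses the island formed by "x" and "y", while a trace trimmed to a depth of one additionally excludes "c" and "d".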
A class of Internet business services (IBS) has developed to deal with the problems of locating information available through the Internet. These business services characteristically utilize web crawlers to establish searchable web space maps. These maps, in turn, are made available on the Internet typically through an advertising supported or user-fee based search engine interface accessible via a defined web page. One well-known and one of the oldest Web searching systems is provided by Lycos®, Inc. (www.lycos.com). Completeness and timeliness of the listing of information resources available through the Internet is of paramount concern to such Internet business services. These problems are of particular importance since the newest sources of information are often the most important to subscribers of such Internet business services. A related problem is in identifying for the subscriber the most active information sources of current interest. The ability to ensure the completeness, timeliness and currentness of the searchable information available through an Internet business service is therefore highly desirable. However, because of the fundamental nature of web crawlers and the fully distributed nature of the web space, no direct method or system of achieving these goals is conventionally known. For example, Lycos has developed a search strategy based on conducting an essentially random search of URLs tempered by preferences. These preferences allow for the explicit or manual specification of starting URLs to include in the search and generally automated efforts by the search engine to identify and traverse Web server home pages, Web pages with substantial external links, user home pages and URLs that are short, suggestive of a logical if not actual server hierarchy of Web pages. However, the Lycos search system is otherwise limited to the identification of URLs from the pages selected for traversal.
The application of these preferences, the practical limitation of the depth of URL search and the randomness of the URL tracing operation may all act to inadvertently limit or at least substantially delay the inclusion of new Web URLs and even entire Web islands into the Web map space traced by the Lycos Web crawler.