1. Field of the Invention
The invention relates to techniques for collecting, arranging, and coordinating information pertaining to the connectivity of Web pages and, more particularly, to the construction of a connectivity server, including a data structure incorporating a URL Database, a Host Database and a Link Database, the connectivity server for facilitating efficient and effective representation and navigation of Web pages.
2. Description of the Related Art
The World Wide Web (Web) is constituted from the entire set of interlinked hypertext documents that reside on Hypertext Transfer Protocol (HTTP) servers that are globally connected by the Internet. Documents resident on the Web (Web pages) are generally written in a mark-up language such as HTML (Hypertext Markup Language) and are identified by URLs (Uniform Resource Locators). In general, URLs correspond to addresses of Internet resources and serve to specify the protocol to be used in accessing a resource, as well as the particular server and pathname by which the resource may be accessed.
Files are transmitted from a Web server to an end user under HTTP. Codes, called tags, that are embedded in an HTML document associate particular words and images in the document with URLs, so that an end user can access other Web resources, regardless of where they are physically located, upon the activation of a key or mouse.
Users of client computers use Web browsers to locate Web pages that, as indicated above, are identified by URLs. Specialized servers, called search engines, maintain indices of the contents of Web pages. The browsers may be used to pose textual queries. In response, the search engines return result sets of URLs that identify Web pages that satisfy the queries. Usually, the result sets are rank ordered according to relevance.
In this regard, information related to the connectivity of Web pages, such as the number of links to or from a page, can be used as a tie-breaking mechanism in ranking the result sets or as an input in deciding the relative importance of result pages.
The URL names of the result sets may then be used to retrieve the identified Web pages, as well as other pages connected by “hot links.”
However, many users are interested in more than merely the content of the Web pages. Specifically, users may be interested in the manner in which Web pages are interconnected. In other words, users may be interested in exploring the connectivity information embedded within the Web for practical, commercial, or other reasons.
The connectivity information provided by search engines exists largely as a byproduct of their paramount function. Although an unsophisticated user may easily follow a trail between connected Web pages, the extraction of a global view of connectivity quickly becomes tedious. The connectivity representation in the search engines serves a single purpose: to provide answers to queries. However, determination of all pages that are, for example, two links removed from a particular page may require thousands of queries, and a substantial amount of processing by the user. Without a separate representation of the Web, it is very difficult to provide linkage information. In fact, most search engines fail to provide access to any type of connectivity information.
This is a significant drawback, because linkage information between Web pages is a valuable resource for Web visualization and page ranking. Several ongoing research projects use such information. Most connectivity information is obtained from ad-hoc Web “crawlers” that build relatively small databases of local linkage information.
A database may be constructed on the fly or statically. When constructed on the fly, each new page is parsed as it is accessed in order to identify links. The linked neighboring pages are retrieved until the required connectivity information is gathered. When statically constructed, a connectivity database is essentially rebuilt from scratch whenever updates are required. For example, the service designated Linkalert™ provided by Lycos (see http://www/lycos.com/linkalert/Overview.htm) uses static databases specifically designed to offer linkage information for particular Web sites. Earlier implementations of both on-the-fly and static approaches have proven inefficient and clumsy to use, and do not scale to the entire Web and a large number of clients. Consequently, prior-art implementations of connectivity databases generally perform poorly and/or are limited in scope.
Accordingly, U.S. Pat. No. 6,073,135, entitled “Connectivity Server for Locating Linkage Information Between Web Pages,” hereby incorporated by reference, is directed to a server that enables convenient and efficient representation and navigation of connectivity information of Web pages. The server described therein (hereinafter “CS1”) maintains accurate linkage information for a significant portion of the Web and supports a large number of client users that desire numerous variants of connectivity information. In addition, the system dynamically updates the connectivity information so that the linkage information is current.
FIGS. 1 through 9 of the Drawings depict the implementation of CS1 in accordance with U.S. Pat. No. 6,073,135.
As depicted in FIG. 1, the Web is shown to comprise a widely distributed network of computers 100 that include numerous client computers 110 connected to server computers 120 by a network 130. Generally, servers 120 provide information, products, and services to users of the clients 110.
Client computers 110 may be personal computers (PCs), workstations, or laptops. Typically, clients are equipped with input/output devices 115, such as a keyboard, mouse, and display device. Software in the form of a Web browser 111 interacts with devices 115 to provide an interface between the user and the Web.
The server computers 120 are usually larger computer systems, although this does not always need to be so. Some of the servers, also known as “Web sites,” maintain a database (DB) 121 of Web pages 122. Each Web page 122 is identified and can be located by its URL 123. Web pages are usually formatted using HTML, which establishes links to other pages. A user is afforded the opportunity to “click” on a link within a page viewed with the browser in order to access a “pointed to” page.
Search engines, in the form of servers 140, maintain an index 141 of the contents of Web pages. Using a search engine application programming interface (API) 142, client users may locate pages having specific content of interest to the users. The user specifies pages of interest to the API of the search engine 140 by composing queries that are processed by the search engine's API 142.
A specialized, “connectivity” server 150 is also provided. Connectivity server 150 maintains a connectivity database 151. Using a connectivity server API 152, users may locate pages (URLs) according to the definition of the interconnection between pages.
As shown in FIG. 2, a graph 200 is built to represent the connectivity of Web pages. In the graph 200, each node (A, . . . , G) 210 represents a Web page 122. Each edge, for example an edge (AB) 220, represents a link from one page to another, with edge AB representing a link from page A to page B. The connectivity API 152, in various forms, enables client users to “explore” or “navigate” graph 200 to extract connectivity information.
It is readily appreciated that the data representation of graph 200 in memory must be carefully designed to minimize memory storage requirements. Assuming the graph contains approximately 100M Web pages with an average outdegree of seven, the graph will have about 700M edges. A rudimentary implementation would store two pointers per edge. Furthermore, given that the average size of a URL is about 80 bytes, the uncompressed URLs of the nodes in the rudimentary implementation will occupy about 8 GB (gigabytes). From another perspective, storage of 1 billion (uncompressed) edges will similarly require 8 GB of storage, even if the endpoints are represented as 4-byte integers. Because 1 billion edges may currently be captured in a single week's Web crawl, the demand for storage capacity quickly becomes extraordinary.
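By way of illustration only, the storage figures quoted above can be reproduced with a short Python calculation. The constant and function names are hypothetical, and gigabytes are taken as 10^9 bytes; only the numeric inputs come from the text.

```python
# Back-of-the-envelope storage estimate using the figures quoted in the text.
PAGES = 100_000_000        # approximate number of Web pages (nodes)
AVG_OUTDEGREE = 7          # average number of outgoing links per page
AVG_URL_BYTES = 80         # average length of an uncompressed URL
ENDPOINT_BYTES = 4         # one 4-byte integer per edge endpoint

def url_storage_bytes(pages=PAGES, avg_url_bytes=AVG_URL_BYTES):
    """Bytes needed to hold every URL uncompressed."""
    return pages * avg_url_bytes

def edge_storage_bytes(edges, endpoint_bytes=ENDPOINT_BYTES):
    """Bytes needed when each edge stores its two endpoints."""
    return edges * 2 * endpoint_bytes

edges_in_graph = PAGES * AVG_OUTDEGREE

print(f"edges in graph:        {edges_in_graph:,}")
print(f"uncompressed URLs:     {url_storage_bytes() / 1e9:.1f} GB")
print(f"1 billion edges:       {edge_storage_bytes(1_000_000_000) / 1e9:.1f} GB")
```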
Graph 200 is built, maintained, and traversed as follows. Preferably, the input utilized in building the graph is provided by the search engine 140. However, it should be understood that the input for constructing the graph may also come from other sources.
As shown in FIG. 3, the input for constructing graph 200 is a set of URLs {URL A, . . . , URL Z} 310. URL set 310 identifies known Web pages 122. The URLs or names of the set 310 are first lexicographically sorted in module 320. Next, the sorted URLs are delta encoded in module 330 to produce a list 340. In list 340, each entry 341 is stored as a difference (delta) between the current URL and a previous URL. Because pages maintained at the same site are likely to have a fairly large prefix portion in common, storage reduction due to delta encoding is considerable. For 100 million URLs, storage may be reduced by about 70%.
For example, if the input URLs 310 are:
www.foobar.com/
www.foobar.com/gandalf.html
www.foograb.com/,
then the output, delta-encoded URLs 340 are:
0 www.foobar.com/
15 gandalf.html
7 grab.com/
More precisely, each entry 341 of the list 340 includes the following fields: a size field 342 that indicates the number of bytes shared with the previous URL; a delta field 343 that stores the bytes that follow the shared prefix, terminated by a zero byte 344; and a field (Node ID) 345 that identifies the node that represents the corresponding page.
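The encoding just described can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, node IDs are assumed to be assigned in sorted order, and the zero-byte terminator is omitted because Python byte strings carry their own length. Counting the trailing slash, the second URL of the example shares 15 bytes with the first.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def delta_encode(urls):
    """Sort the URLs and return (size, delta, node_id) entries.

    size  : bytes shared with the previous URL (field 342)
    delta : bytes that follow the shared prefix (field 343)
    node_id is assumed here to be the position in sorted order (field 345).
    """
    entries, prev = [], b""
    for node_id, url in enumerate(sorted(u.encode("ascii") for u in urls)):
        size = common_prefix_len(prev, url)
        entries.append((size, url[size:], node_id))
        prev = url
    return entries

def delta_decode(entries):
    """Rebuild every full URL by applying the deltas in order."""
    urls, prev = [], b""
    for size, delta, _node_id in entries:
        prev = prev[:size] + delta
        urls.append(prev.decode("ascii"))
    return urls

example = ["www.foobar.com/", "www.foobar.com/gandalf.html", "www.foograb.com/"]
encoded = delta_encode(example)
for size, delta, node_id in encoded:
    print(size, delta.decode("ascii"), node_id)
```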
Delta encoding URL values comes at a price, namely an increase in the processing required to perform an inverse translation to recover a full URL. In order to recover a complete URL, one must start with the first entry of the list 340 and linearly apply all delta values 343 until the URL under consideration is reconstructed.
This situation may be ameliorated by periodically remembering an entire URL as a checkpoint URL entry 350. The checkpoints 350 can be maintained as a separate sorted list 360 on which a binary search can be applied. Thus, once the closest preceding checkpoint URL 350 has been located, only the delta values from that point on need be applied. The cost of inverse translation can be controlled by the number of entries 350 in the checkpoint list 360. In one embodiment, a checkpoint entry may be maintained for approximately every thousand bytes of URL data in the list 340.
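A hedged sketch of the checkpoint mechanism follows. It assumes that a checkpoint is taken whenever roughly `checkpoint_every` bytes of uncompressed URL data have been seen since the last one, and that the node ID is the entry's position in sorted order; both choices, and all names, are illustrative assumptions. The standard-library `bisect` module supplies the binary search over the sorted checkpoint list.

```python
from bisect import bisect_right

def build(urls, checkpoint_every=1000):
    """Delta-encode sorted URLs and record periodic full-URL checkpoints.

    Returns (entries, checkpoints) where entries[i] = (size, delta) and
    checkpoints is a sorted list of (full_url, entry_index).
    """
    entries, checkpoints = [], []
    prev, bytes_since = "", checkpoint_every   # forces a checkpoint at entry 0
    for i, url in enumerate(sorted(urls)):
        size = 0
        while size < min(len(prev), len(url)) and prev[size] == url[size]:
            size += 1
        if bytes_since >= checkpoint_every:
            checkpoints.append((url, i))
            bytes_since = 0
        entries.append((size, url[size:]))
        bytes_since += len(url)
        prev = url
    return entries, checkpoints

def lookup(url, entries, checkpoints):
    """Return the entry index of `url`, or None if it is not stored."""
    keys = [full for full, _ in checkpoints]
    k = bisect_right(keys, url) - 1            # closest preceding checkpoint
    if k < 0:
        return None
    current, i = checkpoints[k]
    while current != url:
        i += 1
        if i >= len(entries) or current > url: # ran past where url would be
            return None
        size, delta = entries[i]
        current = current[:size] + delta       # apply one delta
    return i

urls = [f"www.example.com/page{n:03d}.html" for n in range(50)] + ["www.foobar.com/"]
entries, checkpoints = build(urls, checkpoint_every=200)
print(len(checkpoints), lookup("www.example.com/page017.html", entries, checkpoints))
```

Only the deltas between the chosen checkpoint and the target are applied, so a smaller `checkpoint_every` trades extra storage for shorter scans.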
Referring now to FIG. 4, the edges of the graph 200 are constructed from a list of pairs 410. Each pair 420 includes the node ID (URL1) 421 of a first page and the node ID (URL2) 422 of a second page that contains a link to the first page. The pairs 420 essentially indicate the connectivity of the pages. The pairs may be obtained from a search engine 140 or from other sources.
The list 410 is sorted twice (431, 432), first according to the first node ID 421 to produce an inlist table 441, and, second, according to the second node ID 422 to produce an outlist table 442. The inlist table contains only the second node ID from each pair: the high order bit (bit 32) 450 of a list entry is set to indicate the end of a group of common connected nodes, that is, a group of nodes that point to the same page P. The entry 510, described below and illustrated in FIG. 5, corresponding to P contains a field 512 that points to the beginning of the group of nodes within the inlist that point to P. The outlist table is organized in a similar way. In other words, each edge 220 of the graph 200 is represented twice to indicate pages pointing to a particular page, and to indicate pages pointed to from a particular page.
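The construction of the inlist and outlist tables may be illustrated with the following Python sketch. It assumes node IDs smaller than 2^31 so that the high-order bit of a 32-bit entry is free to mark the end of a group; the function names, the `start` dictionary (standing in for fields 512 and 513), and the four-node example graph are hypothetical.

```python
END_OF_GROUP = 1 << 31  # high-order bit of a 32-bit list entry

def build_table(pairs, key, value):
    """Sort pairs on position `key` and emit position `value` of each pair.

    Returns (table, start): `table` is the flat list of node IDs with the
    high-order bit set on the last entry of every group, and `start` maps a
    node ID to the offset at which its group begins.
    """
    ordered = sorted(pairs, key=lambda p: (p[key], p[value]))
    table, start = [], {}
    for i, pair in enumerate(ordered):
        start.setdefault(pair[key], i)
        is_last = i + 1 == len(ordered) or ordered[i + 1][key] != pair[key]
        table.append(pair[value] | (END_OF_GROUP if is_last else 0))
    return table, start

def read_group(table, start, node):
    """Return the node IDs in the group belonging to `node`."""
    if node not in start:
        return []
    ids, i = [], start[node]
    while True:
        entry = table[i]
        ids.append(entry & ~END_OF_GROUP)   # strip the marker bit
        if entry & END_OF_GROUP:
            return ids
        i += 1

# Pairs are (URL1, URL2): page URL2 contains a link to page URL1.
# Hypothetical IDs A=0, B=1, C=2, D=3 with links A->B, A->C, B->C, D->C.
pairs = [(1, 0), (2, 0), (2, 1), (2, 3)]
inlist, in_start = build_table(pairs, key=0, value=1)
outlist, out_start = build_table(pairs, key=1, value=0)

print(read_group(inlist, in_start, 2))    # pages that point to C
print(read_group(outlist, out_start, 0))  # pages that A points to
```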
As shown in FIG. 5, graph 200 itself is maintained as an array 500. The nodes of the graph are represented by elements 510 of the array 500. Each element 510 includes three fields 511, 512 and 513. Field 511 stores a pointer (URL pointer) to the delta-encoded list 340 of FIG. 3. Fields 512 and 513 point to the corresponding respective inlist 441 and outlist 442. In other words, field 511 points to the node name, field 512 points to the incoming edges, and field 513 points to the outgoing edges.
As shown in FIG. 6, a user is able to explore the connectivity of the Web by supplying an input URL (URL in) 601. The input URL 601 is used to binary (or interpolation) search 610 the checkpoint list 360 to locate the closest delta checkpoint 350. Subsequently, delta values 343 are applied in a Delta Scan module 620 until a full URL 621 equal to the input 601 is recovered. The associated node ID 345 is used to index, via module 630, the array 500. Indexing the array 500 locates a start node 631 from which connectivity can be explored in step 640. Graph 200 can be navigated to the depth desired using the inlist table 441 and outlist table 442, thereby producing an output list of URLs (URLs out) 609.
FIG. 7 depicts in greater detail a data structure (ID-to-URL Array) 511 that is used to recover a full URL from a node ID. In the array 511, one entry exists for each node 210 in graph 200. Entries 701 point to the nearest checkpoint URL 350 for each node in the checkpoint list 360. Subsequent delta values 343 are applied until an entry with a matching node ID 345 is found. At this point, the full URL 709 has been recovered.
The above-referenced process is illustrated in FIG. 8. The input to the process is one of the output URLs 609 of FIG. 6. The node ID is used as an index in the ID-to-URL table 511 to determine a closest checkpoint 350. Delta values are decoded until the matching node ID in field 345 is found, at which point the full URL 709 has been recovered.
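The recovery of a URL from a node ID may be sketched as below. For brevity the sketch places a checkpoint at every `interval`-th entry rather than every thousand bytes, and it uses Python dictionaries in place of arrays; these simplifications and all names are assumptions made for illustration.

```python
def build_id_to_url(entries, interval):
    """Index delta-encoded entries for ID-to-URL recovery.

    entries: list of (size, delta, node_id) in sorted-URL order.
    Returns (checkpoints, id_to_checkpoint) where checkpoints maps an entry
    index to its full URL and id_to_checkpoint maps a node ID to the index of
    the nearest preceding checkpoint.
    """
    checkpoints, id_to_checkpoint = {}, {}
    current, cp_index = "", 0
    for i, (size, delta, node_id) in enumerate(entries):
        current = current[:size] + delta
        if i % interval == 0:
            cp_index = i
            checkpoints[i] = current
        id_to_checkpoint[node_id] = cp_index
    return checkpoints, id_to_checkpoint

def url_for_node(node_id, entries, checkpoints, id_to_checkpoint):
    """Start at the node's checkpoint and apply deltas until the ID matches."""
    i = id_to_checkpoint[node_id]
    current = checkpoints[i]
    while entries[i][2] != node_id:
        i += 1
        size, delta, _ = entries[i]
        current = current[:size] + delta
    return current

# Node IDs 10, 11, 12 are arbitrary illustrative values.
entries = [
    (0, "www.foobar.com/", 10),
    (15, "gandalf.html", 11),
    (7, "grab.com/", 12),
]
checkpoints, id_to_checkpoint = build_id_to_url(entries, interval=2)
print(url_for_node(11, entries, checkpoints, id_to_checkpoint))
```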
The overall structure of the connectivity server 150 is shown in FIG. 9. The connectivity data structures 151 may, in one embodiment, be stored in a hard disk, or disk array, associated with server 150. The connectivity structures 151 include the delta encoded list 340 of URLs, including checkpoints, as well as inlist and outlist tables 441 and 442, the node ID array 500, and the ID-to-URL array 511. Connectivity processes 910 are operable to locate a starting node in the graph 200 for a given URL. The processes 910 can also navigate the graph 200 to locate connected nodes. Data structure 151 may be updated to add new nodes and edges that correspond to newly found pages and links, or to delete portions of the graph for which Web pages are no longer accessible.
Connectivity server 150 includes the following APIs. A first API 911 interfaces to the search engine 140. This interface is used to obtain the URLs of Web pages that are represented by the nodes of the graph. A Web API 912 is connected to a conventional Web HTTP server 920 to provide a World Wide Web interface 921.
In addition, a public API 913 is provided for public clients 930, and a private API 914 is provided for private clients 940. The private API 914 allows access to more efficient data structures and processes for privileged users. A user may gain access to the APIs with the browser 111 of FIG. 1.
A basic connectivity query assumes the form: “List L,” where L is the URL of a Web page. In response, the connectivity server supplies a list of all URLs pointing to Web page L, as well as all Web pages pointed to by page L.
A neighborhood query assumes the form: “List L, D,” where D specifies the degree of connectivity to be explored. Here the connectivity server's response will be a list of URLs at a distance D from page L. It should be understood that more complex queries may be composed specifying logical combinations of URLs and distances. A private query allows users to pose queries in an internal format of the connectivity server; and the server's response may include more detailed information, such as names of the servers storing the connected pages.
As described above, the connectivity server provides linkage information for a significant portion of the Web. The information can be used by applications that rank Web pages according to their connectivity. For instance, pages with many connections may be considered authoritative pages, or “hubs.” The information can be used to build Web visualization and navigation tools, and can be used in conjunction with search engine results to lead users to portions of the Web that store content that may be of interest. In addition, the technique may be used to optimize the design and implementation of Web crawlers based on statistics derived from the in degrees and out degrees of nodes.
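One plausible reading of the neighborhood query is a breadth-first traversal that follows links in either direction and reports the pages whose shortest distance from L is exactly D. The text does not fix the direction or the distance semantics, so the following Python sketch should be taken as an assumption; the function name and the dictionary-based adjacency lists are likewise illustrative.

```python
from collections import deque

def neighborhood(inlinks, outlinks, start, depth):
    """Return the pages at shortest link distance `depth` from `start`.

    inlinks / outlinks map a page to the pages linking to it / linked from it.
    Links are followed in both directions (an assumption of this sketch).
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # do not expand beyond the requested distance
        for nxt in list(outlinks.get(node, [])) + list(inlinks.get(node, [])):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sorted(n for n, d in dist.items() if d == depth)

# Hypothetical graph: A->B, A->C, B->C, D->C.
outlinks = {"A": ["B", "C"], "B": ["C"], "D": ["C"]}
inlinks = {"B": ["A"], "C": ["A", "B", "D"]}

print(neighborhood(inlinks, outlinks, "A", 1))
print(neighborhood(inlinks, outlinks, "A", 2))
```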
In one embodiment, the connectivity server described above may be implemented on Digital Equipment Corporation 300 MHz Alpha processors configured with 4 GB of RAM and a 48 GB disk. Graph 200 included 230M nodes with about 360M edges. The average storage space for each URL is approximately 25 bytes for a total of 5.6 Gigabytes for the delta compressed URL database. The connectivity server responds to user queries at the rate of about one URL every 0.1 millisecond.
Although the connectivity server described above may fairly be viewed as a substantial advance in the techniques formerly available for extracting connectivity information related to Web pages, there remain opportunities for further significant advances that are addressed by the subject invention. For example, further compression of both URLs and links results in the ability to store appreciably more information in the same quantity of physical storage media. In addition, the subject invention enables connectivity information to be extracted more rapidly than heretofore, thereby facilitating applications such as the static ranking of pages (eigenranks), query precomputation, mirror site detection and related-page identification.
The above and other features, capabilities and advantages are achieved, in one aspect of the invention, by a connectivity server that comprises a URL Database that stores URLs and that associates a fingerprint and a CS_id with each stored URL; a Host Database that associates a Host_id with each distinct hostname in the URL Database; and a Link Database that stores links between a source URL and a destination URL. The URL Database interface is operable to translate between any two of a URL, a fingerprint, and a Host_id. The Host Database interface is operable to accept a Host_id and return a number equal to the number of URLs on the respective host and to return the CS_ids of those URLs. The Link Database interface is operable to retrieve, for a given CS_id, the number of inlinks to and outlinks from the URL corresponding to the CS_id.
In another aspect, the invention is embodied in a computer program product for efficiently arranging and storing information regarding the World Wide Web (Web). The computer program product may be used in connection with a computer system, including but not limited to a connectivity server, and comprises a computer readable storage medium onto which is written information and instructions in the form of a URL Database that comprises: a plurality of URLs, a fingerprint associated with each of the URLs, and a URL Interface for translating a URL to a fingerprint or to a CS_id, a fingerprint to a URL or to a CS_id, and a CS_id to a URL or to a fingerprint. In a more specific realization of the computer program product, the URL Database comprises at least three partitions. A first partition is occupied by URLs with a respective indegree or outdegree that is greater than a first number; a second partition is occupied by URLs with a respective indegree or outdegree that is greater than or equal to a second number but less than the first number; and a third partition is occupied by URLs with a respective indegree or outdegree that is less than the second number.
A further aspect of the invention may be apprehended as a method for obtaining data defining the connectivity of pages on the Web. The method comprises obtaining access to a URL Database that stores URLs and that associates a fingerprint and a CS_id with each stored URL, wherein the URL Database comprises a URL Database API and a URL Index Array; obtaining access to a Host Database that associates a Host_id with each distinct hostname in the URL Database, wherein the Host Database comprises a Host Database API that operates to accept a CS_id and return a corresponding Host_id and to accept a Host_id and return the CS_ids on the corresponding host; and obtaining access to a Link Database that stores links between source URLs and destination URLs, wherein the Link Database comprises a Link Database API that operates to retrieve, for a given CS_id, the number of outlinks from the URL corresponding to the CS_id and the number of inlinks to that URL.
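The three-way partitioning may be illustrated by a small Python function. The threshold values below are hypothetical, the larger of the indegree and outdegree is taken as the deciding quantity, and a URL whose degree exactly equals the first number is placed in the second partition; none of these choices is fixed by the text.

```python
def partition_of(indegree, outdegree, first, second):
    """Return 0, 1, or 2 for the first, second, or third partition."""
    degree = max(indegree, outdegree)
    if degree > first:
        return 0      # heavily linked URLs
    if degree >= second:
        return 1      # moderately linked URLs (includes degree == first here)
    return 2          # sparsely linked URLs

FIRST, SECOND = 100, 10   # hypothetical thresholds, for illustration only

for in_deg, out_deg in [(500, 3), (4, 40), (2, 1)]:
    print((in_deg, out_deg), "-> partition", partition_of(in_deg, out_deg, FIRST, SECOND))
```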
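Purely as an illustration of how the three databases relate, the following Python sketch keeps all of them in memory. It is not the claimed data structure: the class and attribute names are hypothetical, CS_ids and Host_ids are assumed to follow sorted order, the hostname is taken to be everything before the first slash of a scheme-less URL, and a truncated SHA-1 digest stands in for the fingerprint function, which the text does not specify.

```python
import hashlib

class ConnectivityStore:
    """Toy in-memory stand-in for the URL, Host, and Link Databases."""

    def __init__(self, urls, links):
        # URL Database: URL <-> CS_id <-> fingerprint
        self.urls = sorted(set(urls))
        self.cs_id_of = {u: i for i, u in enumerate(self.urls)}
        self.fp_of = [self.fingerprint(u) for u in self.urls]
        self.cs_id_by_fp = {f: i for i, f in enumerate(self.fp_of)}
        # Host Database: hostname -> Host_id -> CS_ids on that host
        hosts = sorted({self.hostname(u) for u in self.urls})
        self.host_id_of = {h: i for i, h in enumerate(hosts)}
        self.cs_ids_on_host = {i: [] for i in range(len(hosts))}
        for cs_id, url in enumerate(self.urls):
            self.cs_ids_on_host[self.host_id_of[self.hostname(url)]].append(cs_id)
        # Link Database: (source URL, destination URL) pairs keyed by CS_id
        self.outlinks = {i: [] for i in range(len(self.urls))}
        self.inlinks = {i: [] for i in range(len(self.urls))}
        for src, dst in links:
            s, d = self.cs_id_of[src], self.cs_id_of[dst]
            self.outlinks[s].append(d)
            self.inlinks[d].append(s)

    @staticmethod
    def fingerprint(url):
        """64-bit stand-in fingerprint (assumption: truncated SHA-1)."""
        return int.from_bytes(hashlib.sha1(url.encode("utf-8")).digest()[:8], "big")

    @staticmethod
    def hostname(url):
        """Host part of a scheme-less URL such as 'www.foobar.com/x.html'."""
        return url.split("/", 1)[0]

    def host_id_for(self, cs_id):
        return self.host_id_of[self.hostname(self.urls[cs_id])]

    def degree(self, cs_id):
        """Return (number of inlinks, number of outlinks) for a CS_id."""
        return len(self.inlinks[cs_id]), len(self.outlinks[cs_id])

store = ConnectivityStore(
    urls=["www.foobar.com/", "www.foobar.com/gandalf.html", "www.foograb.com/"],
    links=[
        ("www.foobar.com/", "www.foobar.com/gandalf.html"),
        ("www.foograb.com/", "www.foobar.com/gandalf.html"),
    ],
)
print(store.cs_ids_on_host[store.host_id_for(1)], store.degree(1))
```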