1. Field of the Invention
The present invention relates, generally, to content delivery networks and, in preferred embodiments, to systems and methods employing random walks for mining web page associations and usage (data mining) and for optimizing user-oriented web page refresh and pre-fetch scheduling.
2. Description of the Related Art
Web performance is a key point of differentiation among content providers. Snafus and slowdowns with major Web sites demonstrate the difficulties companies face when trying to scale to handle heavy Web traffic. As Internet backbone technologies develop, many innovations, such as quality of service management, have been used to increase network bandwidth and reduce Web content retrieval time. These improvements to infrastructure, however, cannot solve traffic problems occurring at any one point in the Internet. For example, in FIG. 1, an end-user 10 in a network 12 in Japan wants to access a page in a content provider original Web site 14 in a network 16 in the U.S. The request will pass through several Internet Service Provider (ISP) gateways 18, 20, and 22 before it reaches the content provider original Web site 14. Because of gateway bottlenecks and other delay factors along the Internet paths between the end-user and the content provider original Web site 14, a content pre-fetching and refreshing methodology utilizing a proxy server on the end-user side of the gateways could provide faster response time.
FIG. 2 illustrates a typical Web content delivery and caching scheme 24 which includes a caching system 26 connected to multiple non-specific Web sites 28 and 30. The caching system 26 comprises a proxy server or cache server 32 and a cache 34. It should be understood that the cache 34 may be a proxy cache, edge cache, front end cache, reverse cache, or the like. Alternatively, the caching system 26 of FIG. 2 can be replaced by a content delivery services provider and mirror sites, which would be connected to Web sites that have entered into subscriber contracts with the content delivery services provider. These subscriber Web sites will deliver content to the content delivery services provider for mirroring, but will not necessarily notify the content delivery services provider when the content has changed.
In FIG. 2, when content is delivered from a Web site to cache 34, a header called a meta-description or meta-data is delivered along with the content. The meta-data may be a subset of the content, or it may indicate certain properties of the content. For example, the meta-data may contain a last-modified date, an estimate that the content will expire at a certain time, and an indication that the content is to expire immediately, or is not to be cached. After the content and meta-data are delivered, if storing the content in cache 34 is indicated by the meta-data, the content will be stored in cache 34.
When a user 36 (user 1) requests access to a page (e.g., index.html) from a Web site 28 (Web site 1), the Web browser of user 1 will first send a request to a domain name server (DNS) to find the Internet Protocol (IP) address corresponding to the domain name of Web site 1. If, as in the example of FIG. 2, a caching system 26 is employed, the Web browser may be directed to the proxy server 32 rather than Web site 1. The proxy server 32 will then determine if the requested content is in cache 34.
However, even though the requested content may be found in cache 34, it must be determined whether the content in cache 34 is fresh. This problem can be described as database synchronization. In other words, it is desirable for the cache 34 and Web site 1 to have content that is the same. As described above, however, subscriber Web sites may not notify the proxy server 32 when their content has changed. Thus, the proxy server 32 may examine the meta-data associated with the requested content stored in cache 34 to assist in determining if the content is fresh.
If the requested content is found in the cache 34 and the meta-data indicates that the estimated time for expiration has not yet occurred, some caching systems will simply deliver the content directly to user 1. However, more sophisticated caching systems may send a request to Web site 1 for information on when the desired content was last updated. If the content was updated since the last refresh into cache 34, the content currently in the cache 34 is outdated, and fresh content will be delivered into the cache 34 from Web site 1 before it is delivered to user 1. It should be understood, however, that this process of checking Web sites to determine if the content has changed will also increase bandwidth or system resource utilization.
Similarly, if the requested content is found in the cache 34 but the content was set to expire immediately, some caching systems will simply fetch the content from Web site 1 and deliver it to user 1. However, if the end-user requests a validation of data freshness, some caching systems may send a request to Web site 1 for information on when the desired content was last updated. If the content was last updated prior to the last refresh into cache 34, the content is still fresh and the caching system will deliver the content to user 1, notwithstanding the "expired immediately" status of the content.
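The freshness decisions described above can be summarized in a short sketch. The following Python is illustrative only: the `CacheEntry` fields, the `fetch_last_modified` callback, and the `validate` flag are hypothetical names introduced here, and the handling of an entry with no expiration estimate is an assumption rather than behavior stated in the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CacheEntry:
    """Hypothetical cache record: content plus the meta-data delivered with it."""
    content: str
    last_refresh: float           # time the content was last loaded into cache
    expires_at: Optional[float]   # estimated expiration time, None if unknown
    expire_immediately: bool = False


def decide(entry: Optional[CacheEntry], now: float, validate: bool,
           fetch_last_modified: Callable[[], float]) -> str:
    """Return 'serve' to deliver from cache or 'fetch' to go to the Web site."""
    if entry is None:
        return "fetch"            # content is not in the cache
    if validate:
        # Ask the Web site when the content last changed (this costs bandwidth).
        if fetch_last_modified() > entry.last_refresh:
            return "fetch"        # updated since the last refresh: cache is stale
        return "serve"            # still fresh, even if marked "expired immediately"
    if entry.expire_immediately:
        return "fetch"
    if entry.expires_at is not None and now < entry.expires_at:
        return "serve"
    # Assumption: with no usable expiration estimate, treat the entry as stale.
    return "fetch"
```

The `validate` path corresponds to the more sophisticated caching systems described above; it trades an extra request to the Web site for a more reliable freshness answer.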
If the requested content is not in the cache 34, the proxy server 32 will send the request to Web site 1 to fetch the text of the desired Web page (e.g., index.html). After user 1's Web browser receives index.html, the browser will parse the HTML page and may issue additional requests to Web site 1 to fetch any embedded objects such as images or icons. However, if a caching system 26 is employed, the proxy server 32 will first determine if the embedded objects are available in the cache 34. All traffic (i.e., data flow) is recorded in a log file 38 in the proxy server 32. The log file 38 may include the IP addresses of the location from which requests are issued, the URLs of objects fetched, the time stamp of each action, and the like. Note that a proxy server 32 is usually shared by many end-users so that the content in the cache 34 can be accessed by end-users with similar interests. That is, if user 1 accesses a page and the page is stored in the cache 34, when another user 40 (user 2) requests the same page, the proxy server 32 can simply provide the content in the cache 34 to user 2.
In some caching systems a refresh may be performed even when there is no end user request for content. Without any user request being received, the cache will send a request to the Web site that delivered content into the cache to determine when the content in the Web site was last updated. If the content has changed, the content will be refreshed from the Web site back into cache. Thus, when a request for content is received from an end user, it is more likely that the content in cache will be fresh and transmitted directly back to the end user without further delay.
Network bandwidth resources and system resources are important for end users and proxy servers connected to the Internet. The end users and proxy servers can be considered to be "competing" with each other for bandwidth and connection resources, although their goals are the same: to provide users with the fastest response time.
FIG. 3 illustrates the connections available for a typical proxy server 42. The fastest response time for an individual request can be achieved when the requested content is located in the proxy server cache and is fresh, so that the proxy server 42 does not need to fetch the content from the Web site through the Internet. This situation is known as a cache "hit." System-wide, the fastest response times are achieved with a very high cache hit ratio. Thus, it would seem clear that more pre-fetching 44, refreshing, and pre-validation will lead to more fresh content, a higher cache hit ratio, and faster response times for an end user. However, there is a trade-off. To achieve a very high cache hit ratio, the proxy server 42 may need to utilize a high percentage of network bandwidth for content refreshing, pre-fetching, fetching, or pre-validation 44 into cache. Nevertheless, despite a large amount of refreshing, there will be occasions when an end user will request content that has not been refreshed into cache, or is simply not in the cache. In such a circumstance the proxy server 42 must issue a request fetch 46 to request the content from the Web site. However, if an excessive amount of bandwidth is currently being used to refresh other content, there may be insufficient bandwidth available for the cache to fetch the requested content from the Web site, and the response time of the content fetch may actually increase substantially. Thus, it should be understood that cache refreshing and pre-fetching competes with, and can be detrimental to, Web site content fetching.
Of course, if there is unused bandwidth at any moment in time, it makes sense to pre-fetch the highest priority content into cache so that it can be available for a requesting end user. For example, assume that 20% of the bandwidth is used for fetching content from a Web site when an end user requests the content and there is no cache hit. If 20% of the bandwidth is used for such fetches, then 80% of the bandwidth is unused. This unused bandwidth can be used to pre-fetch other content into cache so that when end users request that content it will be available to them. However, because only a percentage of the content stored in cache can be refreshed or pre-fetched due to network bandwidth limitations, a method for selecting the content to be refreshed or pre-fetched is desired.
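The bandwidth arithmetic above, together with the need to choose what to pre-fetch, can be illustrated with a minimal sketch. It assumes a hypothetical priority score per item and fills the unused bandwidth greedily by priority; this selection rule is an assumption made for illustration and is not prescribed by the text. The names `select_prefetch`, `total_bandwidth`, and `fetch_share` are likewise illustrative.

```python
def select_prefetch(candidates, total_bandwidth, fetch_share):
    """Pick items to pre-fetch with the bandwidth left over from on-demand fetches.

    candidates: list of (name, size, priority) tuples; priority is a hypothetical score.
    total_bandwidth: capacity per scheduling interval, in the same units as size.
    fetch_share: fraction of bandwidth consumed by on-demand fetches (e.g. 0.2).
    """
    budget = total_bandwidth * (1.0 - fetch_share)  # unused bandwidth
    chosen = []
    # Greedy assumption: consider the highest-priority items first.
    for name, size, priority in sorted(candidates, key=lambda c: c[2], reverse=True):
        if size <= budget:
            chosen.append(name)
            budget -= size
    return chosen
```

With `fetch_share=0.2`, as in the example above, 80% of the capacity is available as the pre-fetch budget. How the priority score itself should be computed is the question the remainder of this section addresses.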
Depending on the circumstances, the selection of which content to pre-fetch may not be a trivial task. In the simplest case, for example, assume that a single end-user is currently accessing a particular Web page in a Web site. Shortly, this end-user may navigate to another Web page. By pre-fetching those Web pages most likely to be navigated next, it may be possible to improve that end-user's response time. Because of the likelihood that the end-user will use a hyperlink on the current Web page to navigate to another Web page, it may make sense to pre-fetch Web pages according to the hyperlinks (link structure) found at the current location of an end-user. However, if two or more end-users are navigating one or more Web sites, and only a limited number of Web pages may be pre-fetched, the determination of which Web pages to pre-fetch becomes more difficult.
One way to determine the priority of Web pages to be pre-fetched is based on update frequency and query frequency. However, although the home page in a Web site may be queried more frequently than any other Web page, end-users currently navigating the Web site may not return to the home page for some time, and thus query frequency may not be the best determining factor in deciding which Web pages to pre-fetch. Furthermore, because end-users typically enter the Web site from the home page, the home page may already be available in cache. In addition, the update frequency of a Web page is not necessarily related to the likelihood that it will be accessed next, given the current location of end-users navigating a Web site.
The challenge of identifying a Web page that has a high probability of being accessed next can also be viewed as one of "associations" between Web pages. For example, two Web pages may be associated with each other because they both contain information about the stock market. Generally speaking, given the current location of an end-user, it is more likely than not that the next Web page to be accessed will somehow be associated with the current Web page. Thus, understanding something about the associations between Web pages may provide some insight in determining pre-fetching priorities.
When an author prepares a Web document, primary information is provided directly within the Web page, while related information on other Web pages is linked using anchors. In traditional information retrieval systems, the association between a given set of documents is determined by comparing keyword vectors that represent the content of the primary information provided directly within the Web page. These document associations are used for providing users with pages relevant to what they are currently viewing. However, such systems do not take link structure into consideration.
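As an illustration of such content-only association, the sketch below compares keyword vectors using cosine similarity. Cosine similarity and the simple whitespace tokenization are assumptions chosen for illustration; the text above does not specify which comparison measure traditional systems use.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Build a term-frequency vector from whitespace-separated, lowercased words."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_a * norm_b)
```

Note that nothing in this computation looks at the anchors connecting the pages, which is the limitation identified above.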
Therefore, it is an advantage of embodiments of the present invention to provide a system and method employing random walks for mining web page associations and usage, and to optimize user-oriented web page refresh and pre-fetch scheduling that takes both link structure and Web page content into consideration.
It is a further advantage of embodiments of the present invention to provide a system and method employing random walks for mining web page associations and usage to optimize user-oriented web page refresh and pre-fetch scheduling that includes link analysis derived based on solving equations rather than using iteration-based methods.
It is a further advantage of embodiments of the present invention to provide a system and method employing random walks for mining web page associations and usage to optimize user-oriented web page refresh and pre-fetch scheduling that allows a set of Web pages to be specified to focus the reasoning.
These and other advantages are accomplished according to a method for estimating an association between the media objects and the seed Web page. The method is employed in the context of a Web space having a set of Web pages V and a set of links between those Web pages E modeled as a directed graph G(V,E). Each Web page v_i ∈ V comprises a pair (O_v, a_v), where O_v is a set of media objects (including a main HTML file) and a_v is a page author. Each object o ∈ O_v has a known size size(o), an end-user preference upref(u) for an end-user u, and a page author preference apref(a_v) for a page author a_v. The Web space further includes an end-user u currently located at a seed Web page v_c and an available pre-fetch bandwidth P.
The method first calculates a page preference weight pref(u,v) for each Web page v_i by applying preference rules defined by upref(u) and apref(a_v) to the contents of O_v, and calculates an object preference weight pref(u,o,v) for each object o ∈ O_v by applying the preference rules defined by upref(u) and apref(a_v) to the contents of O_v.
Next, a random walk graph is generated, and a page gain gain(u,v) is calculated by finding a steady state distribution (convergence vector) of the random walk graph. An object gain gain(u,o) is then calculated for each object as
$$\mathrm{gain}(u,o) = \sum_{o \in O_v} \mathrm{gain}(u,v) \times \mathrm{pref}(u,o,v),$$
wherein the object gain represents an association between the object and the seed Web page.
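The last two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the construction of the random walk graph from the preference weights is not reproduced here, so the row-stochastic `transition` matrix is a hypothetical input, and the chain is assumed to have a unique steady state. The steady state is obtained by solving a linear system directly rather than by iteration, in keeping with the advantage stated above. All function names are illustrative.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def steady_state(transition):
    """Steady-state distribution pi of a row-stochastic matrix T: pi = pi T, sum(pi) = 1."""
    n = len(transition)
    # Rows of (T^T - I) pi = 0, with the last row replaced by the normalization.
    a = [[transition[j][i] - (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    a[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    return solve_linear(a, b)


def object_gain(page_gain, object_pref):
    """gain(u,o): sum of gain(u,v) * pref(u,o,v) over the pages v that contain o.

    page_gain maps page -> gain(u,v); object_pref maps page -> {object: pref(u,o,v)}.
    """
    gains = {}
    for page, prefs in object_pref.items():
        for obj, pref in prefs.items():
            gains[obj] = gains.get(obj, 0.0) + page_gain[page] * pref
    return gains
```

For example, with two pages and `transition = [[0.5, 0.5], [0.25, 0.75]]`, `steady_state` yields page gains of 1/3 and 2/3; an object that appears on both pages accumulates a contribution from each.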