1. Field of the Invention
The present invention relates to distributed computing networks, and deals more particularly with techniques for addressing the name space mismatch between content servers, which typically store content using file names, and content caching systems, which typically store content using Uniform Resource Locators.
2. Description of the Related Art
The popularity of distributed computing networks and network computing has increased tremendously in recent years, due in large part to growing business and consumer use of the public Internet and the subset thereof known as the “World Wide Web” (or simply “Web”). Other types of distributed computing networks, such as corporate intranets and extranets, are also increasingly popular. As solutions providers focus on delivering improved Web-based computing, many of the solutions which are developed are adaptable to other distributed computing environments. Thus, references herein to the Internet and Web are for purposes of illustration and not of limitation.
Millions of people use the Internet on a daily basis, whether for their personal enjoyment or for business purposes or both. As consumers of electronic information and business services, people now have easy access to sources on a global level. Similarly, an enterprise's web-enabled applications may use information and services of other enterprises around the globe. When a human user is the content requester, delays in returning responses may have a very negative impact on user satisfaction, even causing the users to switch to alternative sources. Delivering requested content quickly and efficiently is critical to the success of an enterprise's web presence.
An additional concern in a distributed computing environment is the processing load on the computing resources. If a bottleneck occurs, overall system throughput may be seriously degraded. To address this situation, the content supplier may have to purchase additional servers, which increases the cost of doing business.
One technique which has been developed to address these problems is the use of content caching systems, which are sometimes referred to as “web caches”, “cache servers”, or “content caches”. The goal of a caching system is to store or “cache” content at a location (or at multiple locations) in the computing network from which the content can be returned to the requester more quickly, and which also relieves the processing burden on the back-end systems by serving some requests without routing them to the back-end. Two basic approaches to caching systems are commonly in use. These are called (1) proxies, also known as “forward proxies”, and (2) surrogates, also known as “reverse proxies”. Each of these will now be described.
A forward proxy configuration is shown in FIGS. 1A and 1B. Forward proxies function in what is known as a “client pull” approach to content retrieval. That is, the forward proxy functions on behalf of the client (for example, an end user's browser or other user agent) to either deliver content to the client directly from the proxy's accessible cache storage, if the requested content is already in the cache, or to request that content from a content server otherwise. FIG. 1A shows a client 100 requesting 105 some content, where this request 105 travels through the Internet 110 and reaches a forward proxy 115. In FIG. 1A, it is assumed that the requested content is not yet available from proxy 115's cache storage 120. Therefore, proxy 115 sends 125 its own request for that content to a content server 130. (For purposes of illustration but not of limitation, a content server is also referred to herein as a “web server”). It may happen that proxy 115 also functions as a load balancing host or network dispatcher, whereby it selects a content server 130 from among several content servers 130, 131, 132 that are available for servicing a particular request. The WebSphere® Edge Server which is available from the International Business Machines Corporation (“IBM”) is an example of a solution providing both load balancing and proxy caching. (“WebSphere” is a registered trademark of IBM.) A separate load balancing host might be placed in the network path between proxy 115 and content servers 130, 131, 132 as an alternative. This has not been illustrated in the figures, as the load balancing function is not necessary to an understanding of the present invention.
Returning to the description of the content request scenario, content server 130 obtains the requested content and returns 135 that content to the proxy 115. To obtain the requested content, a particular content server may invoke the services of an Application Server (such as a WebSphere® application server which is available from IBM), where this application server may be co-located with the content server 130 in a single hardware box or may be located at a different device (not shown). The Web server may also or alternatively invoke the services of a back-end enterprise data server (such as an IBM OS/390® server running the DB/2 or CICS® products from IBM), which may in turn access one or more databases or other data repositories. These additional devices have not been illustrated in the figure. (“OS/390” and “CICS” are registered trademarks of IBM.)
After proxy 115 receives the content from the content server 130, proxy 115 returns 140 this content to its requesting client 100. In addition, proxy 115 may store 145 a locally-accessible copy of the content in a data store 120 which is used as cache storage. (There may be cases in which content is marked as “not cachable”, and in these cases, the store operation 145 does not occur.) The benefit of using this forward proxy and its data store 120 is illustrated in FIG. 1B.
FIG. 1B illustrates a scenario in which a different client 101 (or perhaps the same client 100) which accesses proxy 115 makes a request 150 for the same content which was requested in FIG. 1A. This request 150 again travels through the Internet 110 and reaches the forward proxy 115. Now, however, assume that the requested content was stored in proxy 115's cache storage 120 following its earlier retrieval from content server 130. Upon detecting that the requested content is locally-accessible, proxy 115 retrieves 155 and returns 160 that content to the requesting client 101. A round-trip from the proxy 115 to the content server 130 has therefore been avoided, saving time and also freeing content server 130 to perform other functions, thereby increasing the efficiency of the back-end resources while providing a quicker response to the requesting client.
As a forward proxy continues to retrieve content for various requests from content servers, it will populate its cache storage with that content. Assuming that the system has sufficient storage to accumulate a proper “working set” of popular content, the ratio of requests which can be served from cache should grow after the initial populating process, such that fewer requests are routed to the back-end.
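By way of illustration only, the client pull behavior described above for FIGS. 1A and 1B might be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular product: the class name `ForwardProxy`, the `fetch_from_origin` callable, the `cacheable` flag, and the example URL are all hypothetical names introduced here.

```python
from typing import Callable, Dict


class ForwardProxy:
    """Toy client-pull cache: serve from cache on a hit, else ask the content server."""

    def __init__(self, fetch_from_origin: Callable[[str], str]) -> None:
        # Hypothetical stand-in for the request/response exchange with a content server.
        self._fetch = fetch_from_origin
        # Hypothetical stand-in for the proxy's cache storage, keyed by URL.
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str, cacheable: bool = True) -> str:
        if url in self._cache:
            # Cache hit: no round-trip to the content server is needed.
            self.hits += 1
            return self._cache[url]
        # Cache miss: retrieve from the content server on the client's behalf.
        self.misses += 1
        content = self._fetch(url)
        if cacheable:
            # Content marked as not cacheable is returned but never stored.
            self._cache[url] = content
        return content

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


origin_calls = []


def origin(url: str) -> str:
    """Assumed content server: records each request it actually receives."""
    origin_calls.append(url)
    return f"<content for {url}>"


proxy = ForwardProxy(origin)
first = proxy.get("http://www.example.com/index.html")   # miss, goes to origin
second = proxy.get("http://www.example.com/index.html")  # hit, served from cache
```

In this sketch the second request never reaches `origin`, which corresponds to the avoided round-trip of FIG. 1B, and `hit_ratio()` gives the fraction of requests served from cache.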
A surrogate configuration is shown in FIGS. 2A and 2B. Surrogates function in a “server push” approach to content retrieval. That is, a content server pushes content to a surrogate based upon determinations made on the back-end of the network. For example, a content creator might know that certain content is likely to be heavily used, and can configure a content server to push that content proactively to the surrogate, without waiting for clients to request it. Then when requests do arrive from clients, the requests can be served directly from the cache storage without making a request of the back-end resources and waiting for a response. In addition, a content creator using a content management system (“CMS”) to create new content or to revise existing content can cause the CMS to notify the content server of the presence of this content, and the content server may push the content to the surrogate in response. FIG. 2A shows a CMS 220 pushing content 215 to content servers 130, 131, 132. A selected one of these content servers 130 is depicted as notifying 210 the surrogate 200 of the new content, which the surrogate then stores 205 in its cache 120. The benefit of using this surrogate and its data store 120 is illustrated in FIG. 2B.
FIG. 2B illustrates a scenario in which a client 100 which accesses surrogate 200 makes a request 230 for content which was pushed out to the surrogate's cache as shown in FIG. 2A. This request 230 travels through the Internet 110 and reaches the surrogate 200. Upon detecting that the requested content is locally-accessible, the surrogate 200 retrieves 235 and returns 240 that content to the requesting client 100. As with the scenario illustrated in FIG. 1B, a round-trip from the surrogate 200 to the content server 130 has therefore been avoided, decreasing response time to the requesting client 100 and reducing the processing load on the back-end system.
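The server push behavior of FIGS. 2A and 2B may be contrasted with the forward proxy case by a similarly small sketch. It is offered for illustration only and assumes, for simplicity, that the pushed content arrives already keyed by the URL under which clients will request it. The names `Surrogate`, `push`, and `get`, and the example URL, are hypothetical.

```python
from typing import Dict, Optional


class Surrogate:
    """Toy server-push cache: content is stored before any client asks for it."""

    def __init__(self) -> None:
        # Hypothetical stand-in for the surrogate's cache storage, keyed by URL.
        self._cache: Dict[str, str] = {}

    def push(self, url: str, content: str) -> None:
        # Called from the back-end side; replaces any earlier copy for this URL.
        self._cache[url] = content

    def get(self, url: str) -> Optional[str]:
        # Called on a client request; returns None if nothing was pushed for this URL.
        return self._cache.get(url)


surrogate = Surrogate()
surrogate.push("http://www.example.com/sale.html", "<sale page v1>")
served = surrogate.get("http://www.example.com/sale.html")
missing = surrogate.get("http://www.example.com/unknown.html")
```

Unlike the forward proxy, nothing here is populated as a side effect of a client request: the cache contents are determined entirely by what the back-end chooses to push.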
In some cases, the functions of a proxy and surrogate are combined to operate in a single network-accessible device. IBM's WebSphere Edge Server is a caching solution that can be configured to function as either a forward proxy or a surrogate, or both. Hereinafter, the term “caching system” is intended to refer to both forward proxies and surrogates.
Content management systems use a content distribution protocol to publish notifications to content servers, and use a protocol such as Hypertext Transfer Protocol (“HTTP”) or File Transfer Protocol (“FTP”) to move content from an authoring machine to a staging server (or to a production server). Typically, a notification as described herein comprises a content distribution protocol message which is sent from the CMS to one or more content servers, notifying them of new content or changes to previously-distributed content. A CMS knows the file system path with which that content is stored (e.g. at a staging server). This external, file-oriented identifier is the only identifier by which a CMS typically identifies content: the CMS does not typically know the Uniform Resource Locator, or “URL”, by which the content will be requested from user agents (referred to equivalently herein as client browsers). This is because of a name space mismatch between the identifiers used at content servers and those used at content caching systems.
Content caching systems use a name space view based on URLs. Content servers and staging servers, on the other hand, store their content using a traditional directory structure (i.e. file path and file name) view wherein the file names relate to physical locations on storage devices. Content servers translate from an incoming URL to a local path and file name (referred to hereinafter simply as a file name for ease of reference) through the use of rewrite rules. The content server retrieves the content from its physical location using the translated file name, and returns that content to the requesting client as a response to the client's request for a particular URL. Because caching systems are positioned in the network path between the requesting client and the content server, they know only the URL for the content, and do not know the associated file name used to store that content at the content server. This is referred to herein as the name space mismatch.
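The translation from URL to file name through rewrite rules might, for purposes of illustration, look like the following sketch. The rule syntax shown (regular expression and replacement pairs), the directory layout, and the function name `url_to_file_name` are assumptions made for this example; actual content servers each define their own rewrite rule formats.

```python
import re
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical rewrite rules: (pattern on the URL path, file name template).
REWRITE_RULES: List[Tuple[str, str]] = [
    (r"^/products/(\w+)$", r"/var/www/catalog/\1.html"),
    (r"^/images/(.+)$", r"/var/www/static/img/\1"),
]


def url_to_file_name(url: str) -> Optional[str]:
    """Translate an incoming URL to a local file name using the first matching rule."""
    path = urlsplit(url).path
    for pattern, replacement in REWRITE_RULES:
        if re.match(pattern, path):
            return re.sub(pattern, replacement, path)
    return None  # no rule applies to this URL


widget_file = url_to_file_name("http://www.example.com/products/widget")
logo_file = url_to_file_name("http://www.example.com/images/logo.png")
```

The rules reside at the content server. A caching system positioned in front of that server sees only the URL on the left-hand side of this translation, which is the source of the mismatch described above.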
FIG. 3 illustrates the flow of messages in this name space mismatch situation. A user of a content authoring tool 360 may create new content, or revise existing content, and/or determine that previously-published content should be invalidated for one reason or another. That information is then conveyed 350 to a CMS 320 for distribution into the content network. CMS 320 therefore sends a notification message 310 to a content distribution client on content server 130, specifying a file name of the corresponding content. Suppose, for purposes of this example, that caching system 300 also has a content distribution client, and that CMS 320 also sends a notification message 310 to this client. The CMS may also place 330 the new or revised content onto a staging server 340 (which may be located on a separate physical device from the CMS or may be located on the same device). (Note also that the CMS may have the content authoring tool built in, or may have hooks to invoke it as necessary.) If notification 310 indicates that new or revised content is available, content server 130 uses the file name from the notification to retrieve 370 the content from the staging server 340. However, caching system 300 has no knowledge of how this file name is related to the content stored in its URL-based storage 120, and therefore cannot process the notification message.
Or, if content is to be invalidated (i.e. deleted), the CMS 320 issues an appropriate invalidation message notifying the content server to remove the content from its storage. Caching system 300 cannot process this message, which identifies the content by file name, and therefore cannot remove the invalidated cached content from its cache storage.
This prior art approach does not allow caching systems to use notification messages sent by a CMS, and thus the efficiency of the caching system and of the CMS's content distribution operations is reduced. More serious problems may result from the inability of the caching system to respond to content update and invalidation notifications: the caching system may continue to serve content that should have been replaced or removed.
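The consequence of the name space mismatch can be shown with a short sketch, offered for illustration only. It assumes a cache keyed by URL and an invalidation notification that carries only a file name. The URL, the file name, and the helper `invalidate` are hypothetical.

```python
from typing import Dict

# Hypothetical URL-keyed cache storage of a caching system.
url_cache: Dict[str, str] = {
    "http://www.example.com/products/widget": "<old widget page>",
}


def invalidate(cache: Dict[str, str], identifier: str) -> bool:
    """Remove the entry stored under `identifier`; report whether anything was removed."""
    return cache.pop(identifier, None) is not None


# The notification identifies the content by file name, as a CMS typically would.
removed = invalidate(url_cache, "/var/www/catalog/widget.html")

# Nothing matched, so the outdated copy is still served for the URL.
still_served = url_cache.get("http://www.example.com/products/widget")
```

Because the file name is not a key in the URL-based cache, `removed` is false and the outdated entry remains available to clients, which is the stale-content condition described above.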