A distributed application is a software system that runs on two or more computers connected by a computer network. Client-server computing is a special case of distributed application computing. With the growth of the World Wide Web (WWW), interactive distributed applications have become a substantial part of popular computer usage. Web services based on Hypertext Markup Language (HTML) and Hypertext Transfer Protocol (HTTP) represent one type of distributed application. Other kinds of distributed applications include instant messaging, streaming media, and automated teller machines used by banks. Electronic mail (email) is an example of a non-interactive distributed application. Distributed applications are commonly implemented using the Internet, but can also be implemented using private wide area networks (intranets), virtual private networks (VPNs), or local area networks (LANs).
A significant problem for users and providers of network services can be the slow or poor performance of a distributed application. Software that enables the performance of distributed applications to be monitored is thus an important tool in analyzing such performance issues. However, measuring the delivery of content over the Internet via protocols such as HTTP is complicated by the federated nature of the Internet (compared to LANs or intranets), because the overall performance of the system depends not only on infrastructure that is directly controlled by the application provider, but also on a multitude of third parties. These third parties include the providers of collocation and hosting services (e.g., Rackspace, Netinfra, Digex), providers of Internet network connections (e.g., InterNAP, UUNet, and Cable & Wireless), multiple backbone providers (e.g., ATT, Sprint, MCI, UUNet, and Cable & Wireless), content delivery networks (CDNs) (e.g., Akamai, Mirror Image Internet, and Digital Island, a subsidiary of Cable & Wireless), advertising networks (e.g., Double-Click and Avenue-A), and consumer Internet service providers (ISPs) (e.g., AOL, Earthlink, MSN, and ATT Broadband Internet). Problems in any of these third-party providers can lead to distributed application service degradation or failure, but the number of providers involved (literally thousands) and the limited visibility that an application provider generally has into these independently administered systems commonly make service problems particularly difficult to detect and diagnose.
A critical aspect of addressing performance problems is measurement, so that problems can be detected quickly when they occur, and so that their specific domain of impact can be identified in support of problem diagnosis. A useful measurement technique is described in commonly assigned U.S. patent application Ser. No. 09/991,127, titled “METHOD AND SYSTEM FOR MONITORING THE PERFORMANCE OF A DISTRIBUTED APPLICATION,” filed on Nov. 14, 2001, the specification and drawings of which are hereby specifically incorporated herein by reference. The above-noted patent application describes a technique for collecting certain “application level” metrics that provide an end-user's view of the performance of a distributed application. However, the above-noted patent application does not disclose details for the analysis of large bodies of data collected from a potentially large population of end-user access points, which may potentially be collected over an extended period of time. It would be desirable to provide a viable technique capable of monitoring an internetwork that is arbitrarily complex in terms of topology, geographic distribution, and fragmentation of administrative control. Today's Internet is exemplary of such a complex internetwork.
Before desirable characteristics of a technique capable of monitoring an internetwork can be defined, more detail regarding the characteristics of an internetwork needs to be provided. Conceptually, an internetwork generally assumes an Internet-style, host-addressing scheme, where communication end points are identified using positive integers as addresses. In the case of the Internet Protocol, version 4 (IPv4) based Internet, which is the version currently in use, these integers are 32 bits in length. IPv6 networks use 128-bit addresses. In general, the addresses can be of any length, provided they are of sufficient length to uniquely identify all the hosts on the internetwork. In the context of a communications internetwork, the term “subnet” is commonly employed (relative to IPv4 and the Internet) to describe a consecutive sequence of end point addresses. These address blocks may be aligned on boundaries that are powers of 2, but this detail is not a requirement.
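The subnet concept described above can be illustrated with a short sketch. This example is purely illustrative: the 192.0.2.0/24 prefix is a documentation address block, and Python's standard ipaddress module is used only for demonstration.

```python
import ipaddress

# A /24 subnet: 256 consecutive IPv4 addresses, aligned on a power-of-2 boundary.
subnet = ipaddress.ip_network("192.0.2.0/24")

first = int(subnet.network_address)    # lowest 32-bit address in the block
last = int(subnet.broadcast_address)   # highest 32-bit address in the block
print(last - first + 1)                # 256 addresses in the block

# Membership in the subnet is simply a range test on the underlying integers.
host = ipaddress.ip_address("192.0.2.57")
print(first <= int(host) <= last)      # True: the host falls inside the block
print(host in subnet)                  # the equivalent check via the library
```

Note that the library enforces the power-of-2 alignment described above; a subnet defined purely as an arbitrary consecutive range would require only the integer range test shown on the second-to-last line.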
Two useful, related abstract concepts in the discussion of internetwork performance are the “core” and the “edge” of an internetwork. A modern internetwork such as the Internet consists of some number of well-connected, high capacity links. Commonly, in the United States, these links are operated by major telecommunications carriers, for example, AT&T, Sprint, and MCI/WorldCom. Also, these links are typically very well interconnected, with multiple redundant very-high-capacity connectivity points. These connectivity points are commonly called Network Access Points (NAPs). These high-performance links form the core of the internetwork. The modern Internet is commonly described as having a single core, although in reality, economic and geographic boundaries tend to complicate the topology somewhat. It should be recognized that direct or proximal connections to this core are the best choice for a commercially relevant distributed application, but such connections are expensive, and are generally therefore inappropriate for an end point connection, such as an end user's connection to their home. End users will typically connect through a “Consumer ISP” such as AOL or ATT Broadband Internet. The Consumer ISPs aggregate multiple end-users in special network centers commonly known as Points Of Presence (POPs). These POPs in turn are connected either directly or after additional aggregation to the core. The term “edge” is used herein to refer to the aggregation of all POPs and subsequent last-mile connections.
By way of analogy, the core is to an internetwork what the U.S. Interstate system is to the U.S. system of public roadways. The Interstate system is an aggregation of high capacity roads with high volume inter-connective points where the roads intersect. Although congestion does occur, in most cases these roads have been carefully designed and engineered to minimize such problems. Following this analogy, the Edge of the public road system comprises all of the roads that provide direct access to locations of interest, such as homes and offices. The Edge includes metropolitan surface streets, residential grids, narrow twisting mountain roads, and limited-access gravel roads. Just as it is impractical to have every home situated on an Interstate highway, it is generally impractical to have every internetwork end point directly connected to the internetwork core. As with the road system, connection end points are aggregated through a complex and arbitrary mesh of local connectivity with widely varying capacity and quality characteristics. When a person travels on a road, they often use major streets that are not directly connected to the destination and are not a part of the Interstate system; such roads represent a gray area between the core and the Edge of the road system. Similarly, by analogy, there is a gray area between the internetwork core and the edge. An important difference between the internetwork and the road network is that speeds on the computer network approach the speed of light; thus, “transcontinental” travel occurs with much greater frequency on the internetwork.
There are a number of specific problems that are currently faced by the administrators of distributed applications that rely on such a complex internetwork. Such distributed applications typically employ a number of third-party service providers to implement effective network communication between the application data center and the end point. These third-party service providers commonly include:
ISPs.
CDNs.
Advertising Networks.
ISPs provide connectivity between the application data center and the internetwork edge. Their role in the performance of the site is fundamental, providing the distributed application with the ability to establish a network connection with end users, as well as with the CDNs and other services on the internetwork. These carriers commonly operate large overlay networks with arbitrarily complex patterns of connectivity between the ISP and other parts of the Internet. Even though end point connectivity is the fundamental value of engaging an ISP, it is difficult or impossible in the status quo to evaluate an ISP based on the quality of end point connectivity it provides. More specifically, it is not possible to evaluate an ISP based on the quality of connectivity it provides to specific subnets on a commercial internetwork.
The concept of an Autonomous System (AS) is related to the notion of an ISP. Whereas an ISP is fundamentally a unit of business organization, autonomous systems are a structural element of computer internetworks, corresponding to a set of links that are centrally managed and controlled. A useful reference for understanding autonomous systems and their role in the modern Internet is BGP4: Inter-Domain Routing in the Internet, by John W. Stewart III (Addison-Wesley, 1999). It is easy to recognize that the modern Internet comprises many different autonomous systems, as each ISP manages and controls the network links it owns. It is very common for ISPs to be organized as multiple autonomous systems. In the status quo, it is difficult or impossible to evaluate an ISP based on the quality of connectivity it provides to the specific autonomous systems that form the Internet. As a consequence of this general inability to evaluate connectivity at the level of either subnets or autonomous systems, it is difficult or impossible to compare two ISPs based on their performance, or more generally, on their service, in terms of end point connectivity. Thus, it would be desirable to provide tools to enable users to quantitatively measure performance differences between two or more ISPs.
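The kind of subnet-level ISP comparison described above can be sketched as follows. Every name and timing in this example is invented for illustration; the point is only that end-point measurements, once grouped by ISP and subnet, permit a direct quantitative comparison of connectivity.

```python
from collections import defaultdict
from statistics import median

# Hypothetical end-point measurements: (ISP name, subnet, response time in ms).
measurements = [
    ("ISP-A", "192.0.2.0/24", 120), ("ISP-A", "192.0.2.0/24", 135),
    ("ISP-A", "198.51.100.0/24", 310),
    ("ISP-B", "192.0.2.0/24", 95), ("ISP-B", "192.0.2.0/24", 105),
    ("ISP-B", "198.51.100.0/24", 400),
]

# Group response times by (ISP, subnet), then summarize each group with the
# median, so the two ISPs can be compared on connectivity to each subnet.
by_key = defaultdict(list)
for isp, subnet, ms in measurements:
    by_key[(isp, subnet)].append(ms)

summary = {key: median(times) for key, times in by_key.items()}
print(summary[("ISP-A", "192.0.2.0/24")])  # 127.5
print(summary[("ISP-B", "192.0.2.0/24")])  # 100
```

The same grouping could be keyed by autonomous system number rather than subnet; the aggregation logic is unchanged, only the key differs.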
CDNs provide a service that magnifies the delivery power of a Web site by serving content components through a large distributed network of servers. In this way, they both increase the server resources available to a Web site and shorten the effective network distance between the content components and the communication end points. In such a deployment, the servers that are owned, operated, and dedicated to a specific Web site are commonly known as “origin servers,” and such terminology is employed herein. Although improved end point connectivity is the fundamental value provided by a CDN, it is generally not possible with existing techniques to evaluate the quality of service a CDN provides, that is, the performance it delivers to specific subnets. It would therefore be desirable to provide tools to enable users to quantitatively measure performance differences between two or more CDNs.
Additionally, there are certain factors relating to CDNs that impact both the performance and the cost of CDN services, but these factors are generally not available in an operational way to site administrators who control an application's use of the CDN. To understand these considerations, it should be understood that CDNs function by providing a cache of distributed application content on a distributed network of servers that are generally closer to the communication end point than the main distributed application data center(s). Those of ordinary skill in the art will appreciate that the effectiveness of these caches, as with all caches, depends fundamentally on the property of locality of reference, the basic notion being that if a particular item is referenced at time t, then it is likely that it will be referenced again soon thereafter. On an internetwork, popular content has better locality of reference than content that is rarely accessed. Because of the properties of caches, the CDN will generally be effective and beneficial for delivering content with good locality. However, when content has poor locality, the CDN will generally be less beneficial, and in some common cases, can even negatively impact the performance of content delivery. The negative impact can occur because, in the case of a cache miss, the CDN interposes a middleman that adds an additional step, the cache miss, to the process of retrieval of content components by the communications end point. When poor-locality content is delivered by a CDN, the distributed application suffers a double penalty: not only is application performance made worse, but the application managers must pay the CDN a fee for this reduced quality of service.
Typically, CDN customers are unaware of these subtle points of CDN performance. They arbitrarily use the CDN to deliver as much of their content as possible, with little genuine knowledge as to when the CDN is beneficial and when it is not. Even for those with a deeper understanding of CDN performance, the situation does not significantly improve. Knowledgeable users may understand that some content is very popular and is more likely to benefit from the CDN, while other content that is unpopular is less likely to benefit. However, between “very popular” and “unpopular” there is a vast continuum where most content lies. Currently, CDN customers have no effective means to determine which content benefits from a CDN and which content does not. It would thus be desirable to provide tools to enable users to quantitatively determine the content that can benefit from employing the services of a CDN and how much of a benefit will be provided by using a CDN.
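A minimal model of the double penalty described above can be sketched as follows. The expected_time helper and all hit rates and latencies are illustrative assumptions, not measured values; the sketch shows only how cache locality determines whether a CDN helps or hurts delivery time.

```python
# Simple expected-delivery-time model for content served through a CDN cache.
# All numbers below are illustrative assumptions, not measured values.

def expected_time(hit_rate, edge_ms, origin_ms, miss_overhead_ms):
    """Expected fetch time: a cache hit is served from the nearby edge cache;
    a miss pays the edge round trip, the miss overhead, and the origin trip."""
    hit_time = edge_ms
    miss_time = edge_ms + miss_overhead_ms + origin_ms
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_time

origin_only_ms = 300  # baseline: fetching directly from the origin server

# Popular content (good locality of reference): the CDN helps.
print(expected_time(0.95, 40, 300, 20))   # well under the 300 ms baseline

# Rarely accessed content (poor locality): the extra hop makes things worse,
# and the application manager still pays the CDN fee for this delivery.
print(expected_time(0.05, 40, 300, 20))   # worse than the 300 ms baseline
```

Under these assumed numbers, the break-even hit rate is the point where the expected time equals the direct-from-origin time; content below that hit rate is precisely the content that suffers the double penalty.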
An Advertising Network is similar to a CDN in that content components (in this case, advertisements) are delivered by a third party. Commonly, the advertising networks will either use a CDN or will have a structure similar to a CDN to ensure that the content is delivered with acceptable performance. As with CDNs, the same lack of performance insight generally applies. It would again be desirable to provide tools to enable users to quantitatively measure performance differences between two or more different advertising networks, streaming media delivery networks, and other third-party content delivery systems.
In the context of this disclosure, the end point of a distributed application could be an end-user operating a Web browser, as with the modern Internet. Alternatively, as used herein, it will be understood that the end point may take a number of other forms. For example, an end point may comprise an “agent,” i.e., a computer system acting with some degree of autonomy on behalf of an end user. An end point may also be a component of another distributed application, as suggested by the architectural description of the Java 2 Enterprise Edition (J2EE) environment or of Microsoft Corporation's “.NET” initiative.
Although numerous performance management solutions are available in the status quo to monitor distributed application performance, they all have major shortcomings with respect to the problems discussed above. Most performance management software products can be categorized as “system monitoring” products. While such system monitoring products are quite useful and effective for monitoring and maintaining a Web server or a database, they are generally ineffective at addressing the internetwork problems described above. Referring once again to the highway analogy, tools for diagnosing a problem in an automobile subsystem, such as the engine, are generally not useful for diagnosing congested highway conditions. There are many popular sources for system monitoring solutions, such as Freshwater SiteSeer™ and Keynote Red Alert™, and numerous product offerings from Hewlett Packard, NetIQ, BMC Software, Tivoli, Computer Associates, Precise Software, and Quest Software.
Web Site Perspective™ software from Keynote Systems is one of the most popular tools for monitoring CDN and ISP performance. Substantially similar offerings include Topaz Active Watch™ from Mercury Interactive and the Gomez Performance Network™, all of which are delivered as managed services. These systems work by generating a stream of test traffic from load-generation robots placed at locations distributed throughout the internetwork. The reliable streams of test traffic generated by these robotic systems are very effective for monitoring Web site availability and for obtaining a baseline measure of system performance. However, these systems have serious limitations in terms of the problems described above, all of which ultimately relate to the cost of generating artificial test traffic from robots. That cost is related directly to the cost of deploying and maintaining a remote robot. There are significant capital outlays for highly reliable hardware, software, and networking equipment. There are also significant operating expenses for hosting these systems and for providing network connectivity. As a result, the cost per managed Web page for these systems is quite high, commonly from $150–$1,000 per month per managed document. More specifically, Keynote-style performance management solutions have the following general limitations:
Monitoring all content. To solve the content-related problems described above, it is generally necessary to collect management data for all the content delivered by a distributed application. For a typical Web site with several hundred distinct documents, the management cost for a Keynote-style solution would be $15,000–$100,000 per month. Such fees are not generally supported by the economics of a modern Web site.
Collecting adequate metrics. Keynote-style management solutions do not provide a number of useful metrics, such as render time and sizes for document components. Metrics that are available from Keynote-style solutions are generally limited either to manual “experiments” or obtainable only at significant additional expense.
Limited number of robots. Due to cost considerations, the size of a robot network is usually limited to several hundred locations. When one recognizes that Akamai started its CDN with over 2,000 caches, and that there are over a million independently administered subnets on the Internet, it is clear that such a small network of monitoring robots will be severely limited, and most problems will be concealed by the vastness of the internetwork edge.
Monitoring from the network edge. Keynote-style networks typically locate monitors within or near the internetwork core. This characteristic is motivated in part by the ability to reach multiple locations on the network topology from a single physical location, and also by the availability of high-quality hosting services. The vastness and relative remoteness of the network edge make it economically very difficult for a robot network to monitor from the edge.
It would therefore be desirable to provide tools to enable users to quantitatively measure performance issues across a distributed network using monitoring from the edge of such a distributed network.
The Topaz Active Watch™ product from Mercury Interactive represents another popular approach to distributed application performance monitoring. Similar offerings are available from Adlex. Like the site monitoring tools mentioned above, Topaz Active Watch™ is a software product that is deployed within the data center of the distributed application. However, rather than monitoring server internals, the Topaz Active Watch™ system monitors network traffic on the data center LAN. By reconstructing TCP/IP connections, it is frequently able to extract useful performance measurements for content delivered from the data center. It avoids the cost of the Keynote-style solutions, removing an economic barrier to managing all the content for the distributed application. Topaz Active Watch™ is limited in that it is unable to measure CDN performance, which cannot generally be observed from within the data center or from the data center LAN. It is further limited in that it is generally unable to measure and compare the performance experienced by an arbitrary internetwork subnet for content served from two distinct ISPs. It would be desirable to provide tools to enable users to quantitatively measure performance issues across a distributed network, working from the edge of such a distributed network, such that CDN performance can be measured.
Other solutions, some of which are of a complementary nature to the present invention, include services and tools such as those from Vividence, which provide subjective information on the experience of an end-user of a distributed application (e.g., “I liked this, but I didn't like that.”). Although performance has an impact on this subjective experience, it becomes difficult to isolate and act on performance issues when quantitative performance data are blended indiscriminately with subjective opinions about experience. A further problem is that the cost of interrogating end-users makes it impractical to apply a Vividence-style approach to wide-scale, detailed internetwork performance measurement. Solutions such as Web Trends™ products and services from NetIQ, Hit BOX™ services from Web Side Story, and Funnel Web™ from Quest Software provide ample statistics on usage of a Web site, but do little to expose network performance. Offerings from WebHancer are expected to be useful as a data source to be utilized in accord with the present invention, as will be described in more detail below. It would be desirable to provide tools to enable users to quantitatively measure performance issues across a distributed network by analyzing data obtained from other sources.
Finally, it should be noted that the technology described in the above-referenced patent application is employed in the invention described in the patent application that has been incorporated herein by reference, which provides limited internetwork analysis functionality. However, the invention in that other patent application does not provide the full range of desirable utility described above. It would therefore be desirable to provide tools to enable users to quantitatively measure performance issues across a distributed network, working from the edges of such a distributed network, such that CDN performance, ISP performance, advertising network performance, streaming media delivery network performance, and third-party content delivery system performance can be measured in a cost effective manner.