A distributed application is a software system that runs on two or more computers connected by a computer network. Client-server computing is a special case of distributed application computing. With the growth of the World Wide Web (WWW), interactive distributed applications have become a substantial part of popular computer usage. Web services based on Hypertext Markup Language (HTML) and Hypertext Transfer Protocol (HTTP) represent one type of distributed application. Other kinds of distributed applications include instant messaging, streaming media, and automated teller machines used by banks. Electronic mail is an example of a noninteractive distributed application. Distributed applications are commonly implemented using the Internet, but can also be implemented using private wide area networks (intranets), virtual private networks (VPNs), or local area networks (LANs).
A significant problem for users and providers of network services can be the slow or poor performance of a distributed application. Software that enables the performance of distributed applications to be monitored is thus an important tool in addressing this problem. However, measuring the delivery of content over the Internet via protocols such as HTTP is complicated by the federated nature of the Internet (compared to LANs or intranets), because the overall performance of the system depends not only on infrastructure that is directly controlled by the application provider, but also on infrastructure controlled by a multitude of third parties. These third parties include the providers of collocation and hosting services (e.g., Rackspace, Netinfra, Exodus, Digex), providers of Internet network connections (e.g., InterNAP, UUNet, and Cable & Wireless), multiple backbone providers (e.g., ATT, Sprint, MCI, UUNet, and Cable & Wireless), content delivery networks (e.g., Akamai, Mirror Image Internet, and Digital Island), advertising networks (e.g., Double-Click and Avenue-A), and consumer Internet service providers (ISPs) (e.g., AOL, Earthlink, MSN, and @Home). Problems in any of these third-party providers can lead to distributed application service degradation or failure, but the number of providers involved and the limited visibility that an application provider generally has into these independently administered systems commonly make service problems particularly difficult to detect and diagnose.
A critical aspect of addressing performance problems is measurement, so that problems can be detected quickly when they occur, and so that their specific domain of impact can be identified in support of problem diagnosis. For these measurements, application level metrics, which indicate the performance experienced by an application end user, are the most direct measure of successful application delivery. Secondary measures, such as network level and system level metrics, can be useful in diagnosis when a problem has been detected. Examples of network level metrics are network packet counts and link errors. Examples of system level metrics include central processing unit (CPU) and memory utilization. Although secondary metrics can be very informative, they do not enable an administrator to understand the level of service that the application end users have experienced.
In current practice, the distinction between application level metrics and secondary metrics is often blurred or confused. To provide an example of application level information, it is necessary to consider a specific distributed application, such as a book shopping application implemented on the Internet. In this example of a distributed application, relevant application specific Web pages might include a home page, a search page, numerous catalog pages, a shopping cart Web page, and a sequence of Web pages to implement a checkout process. Also, for this example, application level performance information might include an average response time, i.e., the average wait time experienced by an end user for a specific Web page such as the home page or the search page to be fully rendered in the user's browser program. Other measures of application level performance will also be of interest to those managing the book shopping service.
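As a rough illustration of the average response time described above, the following Python sketch groups per-page wait times and averages them. The function name, page names, and sample values are hypothetical and are not taken from any particular monitoring product.

```python
from collections import defaultdict
from statistics import mean


def average_response_times(samples):
    """Return the mean response time in seconds for each page.

    `samples` is an iterable of (page, seconds) pairs, where `seconds` is the
    wait an end user experienced until the page was fully rendered.
    """
    by_page = defaultdict(list)
    for page, seconds in samples:
        by_page[page].append(seconds)
    return {page: mean(times) for page, times in by_page.items()}


# Hypothetical measurements for the book shopping example.
samples = [
    ("home", 1.0),
    ("home", 2.0),
    ("search", 3.0),
    ("search", 5.0),
    ("home", 3.0),
]
averages = average_response_times(samples)
```

With these sample values, the home page averages 2.0 seconds and the search page averages 4.0 seconds.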
Generally, the delivery system for a modern distributed application can be simplified if viewed as comprising three major components. The first of these components, the “first mile,” commonly includes a multitier server farm or application server where the content of the distributed application is generated (or stored) and served. In the case of a Web-based distributed application, this first component might include HTTP servers, application servers, and database servers. In addition, the first component commonly includes load-spreading devices and firewalls. Also often included in the first component are private networks that provide interconnection of server-side systems and connect the server ensemble to the larger external network.
The third component, the “last mile,” includes the end user's system (commonly a desktop computer running a browser program) and its connection to the inter-network. The domain between the first and third components comprises the second component, which includes the inter-network that enables clients to communicate with servers.
Although those responsible for maintaining a distributed application are generally concerned with the performance delivered to end users, they are typically severely restricted in their ability to manage it, because of the limited resources at their disposal for detecting and diagnosing the full range of performance problems that impact end users. Substantial information is readily available about the performance of the first component to those who directly administratively control and manage this level; yet, little or no information is available for systems that are administered by others in the second and third components. For systems comprising the first component, administrators having direct control can employ management frameworks and server monitors. Examples of such programs include NetIQ's AppManager™, BMC's Patrol™, Hewlett Packard's OpenView™, Quest's Spotlight on Web Servers™, and Topaz Prizm™ from Mercury Interactive. These management tools are effective for delivering system and network metrics, but they are generally not able to deliver application level metrics. As a result, the administrators of the distributed applications typically do not have adequate information to detect or diagnose performance problems or other service problems experienced by end users, or to evaluate the health and performance of the inter-network through which the application servers are connected to the end users.
In spite of this lack of visibility and control, application administrators are still generally motivated to do what they can to monitor and improve an application's performance because of the significant impact that the performance has on their business. This need has fostered the development of a number of technologies (along with companies to deliver them) that provide approximate measures of application level metrics. The most common approach for Web sites involves using artificially generated traffic from “robots” at a small number (typically tens or hundreds) of locations that periodically request Web pages as a test of the performance of a Web site. Examples of this technique include Keynote Perspective™ from Keynote Systems, ActiveWatch™ from Mercury Interactive, the Gomez Performance Network™ from Gomez Networks, as well as solutions by Appliant Inc. This type of performance monitoring system is sometimes referred to as “active monitoring.” Active monitors enable periodic experiments in a relatively stable, controlled environment. Because the number of robots and the frequency of experiments are very small compared to the size of the end user population or the complexity of the Internet, active monitors at best provide an approximation of the performance experience of actual end users.
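The active monitoring approach described above can be sketched in Python as a robot that repeatedly requests a fixed set of pages and times each request. This is only a minimal illustration, not how any of the named products work: the function names, the injected `fetch` callable, and the example URL are assumptions, and a real robot would wait between rounds and fetch over the network.

```python
import time


def probe(fetch, url, clock=time.monotonic):
    """Time a single synthetic request and return the elapsed seconds."""
    start = clock()
    fetch(url)
    return clock() - start


def run_robot(fetch, urls, rounds, clock=time.monotonic):
    """Request every URL once per round and collect the timings per URL."""
    results = {url: [] for url in urls}
    for _ in range(rounds):
        for url in urls:
            results[url].append(probe(fetch, url, clock))
    return results


# Demonstration with a stub fetch and a scripted clock, so no network is needed.
ticks = iter([0.0, 0.5, 1.0, 1.25])
timings = run_robot(
    lambda url: None,
    ["http://www.example.com/"],
    rounds=2,
    clock=lambda: next(ticks),
)
```

Because the robot measures only its own requests from its own location, the timings it collects approximate, rather than record, what actual end users experience.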
Another solution for obtaining application level metrics is possible in the case where the administrators of both the first component and the third component cooperate in implementing a monitoring system. A special case occurs when both the first component and third component are organized under the same administrator, as is commonly the case with a corporate intranet. In this case, the administrators have the option of installing software components in both the first and third components of the system. Example vendor solutions of this type include NetIQ's End2End™ and Mercury Interactive's Topaz Observer™. However, this solution is frequently inappropriate for the following reasons:
For Web browsing on the Internet, end users commonly prefer not to download and install desktop performance monitoring programs, due to security and privacy concerns. These concerns apply both to executable content, such as Active-X controls, and to Java applets. Yet, there is an important distinction between Java applets and JavaScript, since JavaScript is widely accepted/allowed by browser program security settings, while Java applets are not.
Even when the first and third components are in the same administrative domain, a solution with no desktop installation requirement is often preferred due to the complication of installing and maintaining an additional desktop monitoring component.
Accordingly, data collection techniques for determining the performance of a distributed application should preferably use a different approach that does not require the active cooperation of the end user. Specifically, it is important to develop a technique for collecting application level metrics from the end user's computing devices without requiring the active installation of software components by the end user. In this manner, collection of a broad set of application level performance metrics from the end user perspective can be accomplished in a manner that is transparent to the end user and without requiring the end user to participate in the software installation on the end user's computing device.
With respect to application level information, three specific types of metrics are relevant.
Compound metrics are collected using a mechanism that maintains per-user state across multiple application requests. For example, the latency required to react to an end-user request to navigate from a document A to a document B can be measured as the interval from the time that a request to fetch document B is made, while document A is being displayed, until the time that the HTML file corresponding to document B has been downloaded by the browser program. Measuring this fetch latency in a non-intrusive manner requires maintaining and associating state information collected in the context of both document A and document B. However, there is generally no provision (except a browser register) provided for maintaining state information between Web documents displayed by a browser program unless the state information is retained as a cookie. The prior art does not teach or suggest how to determine compound metrics.
Correlated metrics are derived from measurements on both the client and the server. More specifically, they require the comparison of the server measurement and the client measurement for a specific end user request as a part of their computation.
Event-based metrics indicate or characterize an event (such as an error) that occurred in responding to a request for a distributed application or in rendering an image.
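To make the compound and correlated metrics more concrete, the following Python sketch simulates them under stated assumptions. A plain dictionary stands in for the browser's cookie store, and all function names, the cookie name `fetch_start_ms`, and the timestamps are hypothetical. The text above says only that a correlated metric compares the client and server measurements. The subtraction used here is one assumed form of that comparison.

```python
def on_navigate(cookies, now_ms):
    """Run in the context of document A when the user requests document B.

    The request time is saved as a cookie so that it survives the page change.
    """
    cookies["fetch_start_ms"] = str(now_ms)


def on_document_loaded(cookies, now_ms):
    """Run in the context of document B once its HTML has been downloaded.

    Returns the fetch latency in milliseconds, or None when no start time was
    recorded (for example, when the user arrived from an unmonitored page).
    """
    start = cookies.pop("fetch_start_ms", None)
    if start is None:
        return None
    return now_ms - int(start)


def correlate(client_latency_ms, server_time_ms):
    """Compare client and server measurements for the same request.

    Assumption: the difference approximates the time spent outside the server.
    """
    return client_latency_ms - server_time_ms


cookies = {}                                        # stand-in for the cookie store
on_navigate(cookies, 1000)                          # user clicks a link in document A
fetch_latency = on_document_loaded(cookies, 1450)   # document B's HTML has arrived
outside_server = correlate(fetch_latency, 120)      # server reported 120 ms of work
```

In this example the fetch latency is 450 ms, of which 330 ms is attributed to everything other than the server's own processing.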
Solutions exist that collect limited application level information. However, although these solutions deliver ample usage information, they fail to deliver the performance information that is required to support more effective detection and deeper diagnosis of distributed application service problems. An example of a prior art solution that exhibits this limitation is HitBoX™, which is available from WebSideStory (www.websidestory.com). HitBoX™ uses JavaScript annotations to HTML Web pages to collect page usage metrics, but does not determine or collect performance metrics. Another relevant offering is the WebTrendsLive™ service from NetIQ (www.webtrendslive.com). These software solutions are limited in that they cannot determine or collect compound metrics or correlated metrics. They are also limited in the scope of their event-based metric collection. Finally, they are unable to tolerate common error or environmental conditions, such as network partitions.
A system and method for monitoring a distributed application is disclosed in U.S. Pat. No. 5,958,010. In this prior art approach, each computer on a client-server network has a Mission Universal Monitor (MUM) agent installed on it that monitors the data being exchanged over the network. The MUM agents can be installed as software modules, hardware modules coupled to the backplane of each managed node, or as a combination of hardware and backplane elements. The MUM agent can collect data regarding business transactions, databases, systems and networks, and events, and can report the information to a MUM console module for subsequent review. However, a MUM agent must be explicitly installed on each monitored computer or node and is not capable of being implemented without having administrative control over the entire network, or the cooperation of the end users in installing the agents.
U.S. Pat. No. 6,006,260 discloses a method and apparatus for evaluating service to a user over the Internet at the browser program level, which can be done without requiring that the user actively install the code to do browser monitoring. In this approach, a user requests a desired Web page from a Web server, for example, with a selection made by the user in a browser program. The Web server sends back the requested Web page, which contains code to execute a browser agent. Either the user selects a hyperlink in the returned Web page that is rendered in the browser program to download a test page, or the browser monitor automatically sends a request to the Web server to download the test page. In response, the Web server sends the test page back to the browser program, enabling the browser monitor to calculate a download interval for the test page. The download interval is encoded into a request for a third Web page that is directed to a relay server, which returns a blank Web page signifying that the download interval was received. The patent also discloses that other performance parameters can be determined by the browser agent, but does not provide any details about what those performance parameters are or how they are determined. Moreover, the invention disclosed by this prior art reference does not enable a correlated or compound performance metric to be determined for the distributed application, because it does not disclose determining a performance component for the Web server that might be combined with a performance parameter determined by the browser monitor. Furthermore, the prior art approach is deficient, because it is not transparent to (i.e., hidden from) the end user.
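The reporting step of this prior art approach, in which a measured download interval is encoded into a request directed to a relay server, might look roughly like the following Python sketch. The summary above does not specify the encoding, so the query-string format, the parameter name `interval_ms`, the relay address, and the function names are all assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

RELAY_URL = "http://relay.example.com/report"  # hypothetical relay server address


def build_report_url(start_ms, end_ms, relay_url=RELAY_URL):
    """Encode the test page's download interval into a relay server request."""
    interval_ms = end_ms - start_ms
    return relay_url + "?" + urlencode({"interval_ms": interval_ms})


def read_interval(report_url):
    """Recover the download interval on the relay server side."""
    query = parse_qs(urlsplit(report_url).query)
    return int(query["interval_ms"][0])


# The test page download started at 2000 ms and finished at 2750 ms.
report_url = build_report_url(2000, 2750)
```

Note that this sketch carries only a client-side measurement. Nothing in the request reflects time measured at the Web server, which is the limitation identified above.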
From the preceding discussion, it will be apparent that it is important to collect a broader range of application level metrics than is permitted by the prior art, including compound, correlated, and event-based metrics. In addition, the collection of application level information should be robust in the presence of common error and environmental conditions. The present invention addresses these problems and is specifically able to determine a correlated performance metric that includes performance information determined at each end of a distributed application data transfer.