1. Technical Field
The present application relates generally to a system and method for automatically diagnosing information systems that suffer from degradations in performance and service availability and, more particularly, to a system and method for automatically identifying bottleneck resources in a computer network using inference methods for analyzing end-to-end performance data without the need for detailed information about individual resources and processes.
2. Description of Related Art
A critical task in managing distributed applications is the process of diagnosing the cause of degradation in quality of service to end-users. For mission-critical applications, the ability to resolve problems in an expeditious manner is particularly important. Due to the complexity of distributed applications, problem diagnosis requires skills across multiple disciplines. Unfortunately, problems often get routed to the wrong department or the departments themselves do not agree on who should accept the responsibility. Therefore, it would greatly enhance productivity and reduce the time and cost of problem resolution if the scope of the problem could be automatically isolated to a small subset of bottleneck resources.
The process of identifying bottlenecks, however, is a difficult task when a large number of resources are involved. This is the case even for a simple business transaction such as purchasing items over the Internet, because the application supporting the business transaction typically requires services from multiple servers. The servers may include name servers, proxy servers, web servers, mail servers, database servers, etc. In addition, the underlying application may require the service of connectivity resources to transfer data between the user machines and the servers as well as among the server machines. These connectivity resources typically include routers, switches, gateways and network bandwidth. Moreover, at the software level, the application may require services from various functional components such as file systems, directories, communication facilities, transaction middleware and databases.
Conventional approaches to problem determination include monitoring detailed metrics from individual resources. For instance, counters and meters are instrumented into various hardware and software entities to measure utilization, contention, data rates, error rates, etc. These metrics reveal the internal workings of each component. If any metric exceeds its predefined threshold value, an alarm is generated.
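By way of illustration only, the threshold-based alarm generation described above may be sketched as follows. The metric names, threshold values and the function name check_thresholds are hypothetical and are not taken from any particular monitoring product.

```python
# Hypothetical metric names and threshold values, chosen only for illustration.
THRESHOLDS = {"cpu_utilization": 0.90, "error_rate": 0.05, "queue_length": 100}


def check_thresholds(resource, metrics, thresholds=THRESHOLDS):
    """Return one alarm message per metric that exceeds its predefined threshold."""
    alarms = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        # Metrics without a configured threshold are ignored.
        if limit is not None and value > limit:
            alarms.append(f"{resource}: {name}={value} exceeds {limit}")
    return alarms


# A single sample of metrics from one (hypothetical) resource.
sample = {"cpu_utilization": 0.97, "error_rate": 0.01, "queue_length": 40}
alarms = check_thresholds("web-server-1", sample)
for alarm in alarms:
    print(alarm)
```

Each resource and each metric requires its own check of this kind, which is the source of the monitoring workload discussed below.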
There are various disadvantages to using the conventional diagnostic approach. One disadvantage is that the method requires the constant monitoring of multiple metrics of potential bottleneck resources, thereby generating large volumes of data and traffic and imposing an excessive workload on the information analysis system. Another disadvantage is that resource metrics may carry a large amount of redundant information, as well as apparently conflicting information. In addition, the metrics cannot reveal all possible problems.
Another disadvantage is that an excessive value of a particular resource metric at any point in time does not necessarily imply a bottleneck condition because the adverse effect of one metric is often compensated by the favorable conditions of other metrics. Indeed, in systems having built-in redundancy (e.g., alternate paths), the deficiency of one resource instance can also be absorbed by other resource instances, thereby reducing the impact of the temporary local anomaly on overall performance. Consequently, the extensive monitoring of individual resources tends to generate large numbers of false alarms. Therefore, the aforementioned disadvantages associated with the quantitative resource metric approach may lead to scalability and accuracy problems when resource metrics are used for bottleneck identification in medium to large enterprises.
Another conventional technique for diagnosing problems is referred to as the "event-based" method, which involves correlating events or alarms from resources. In particular, this method involves detecting "patterns" in an event stream, where a "pattern" is generally defined as the occurrence of related events in close proximity of time. With the event-based approach, events that are part of any recognizable pattern are considered to be part of an event group. Each pattern has a leading event and the resource that originates the leading event is considered the root cause of other events in the group and the root cause of the problem associated with the pattern.
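A minimal sketch of such event grouping is given below for illustration. It assumes, as a simplification, that all events in the stream are related, that each event is a (timestamp, resource) pair, and that "close proximity of time" means a gap of at most a fixed window between consecutive events. The window length, the resource names and the function names are hypothetical.

```python
def group_events(events, window=5.0):
    """Group (timestamp, resource) events whose consecutive gaps are at most `window` seconds."""
    groups = []
    for event in sorted(events):
        # Extend the current group if this event is close in time to the previous one.
        if groups and event[0] - groups[-1][-1][0] <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def root_causes(groups):
    """Treat the resource that originated the leading (earliest) event of each group as its root cause."""
    return [group[0][1] for group in groups]


# Hypothetical event stream: three events close together, then one isolated event.
events = [(100.0, "router-a"), (101.5, "web-1"), (103.0, "db-1"), (250.0, "dns-1")]
groups = group_events(events)
print(root_causes(groups))
```

A practical correlator would also have to decide which events are related; that step is omitted here.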
The effectiveness of the event-based approach is limited to problems arising from serious failures and malfunctions for which explicit alarm mechanisms have been instrumented. Another disadvantage of the event-based approach is that it requires the analysis of large amounts of event or alarm data from each resource. As such, it suffers from the same scalability and accuracy problems as the resource metric based approach.
Another conventional method for identifying bottlenecks places emphasis on collecting quality of service data such as end-to-end response times and end-to-end availability measures. This data is effective for detecting problems from the end-user's perspective and provides a valid basis for suspecting that a bottleneck condition exists. The end-to-end data by itself, however, does not exactly identify the bottleneck resource. Indeed, this approach cannot be used for diagnosis in the absence of intelligent interpretations by human experts.
To overcome this problem, a more direct conventional approach involves producing a detailed breakdown of the end-to-end data into components. The component with the largest response time is deemed a bottleneck that causes problems in end-to-end response time. Unfortunately, such component level data is not always readily available from most network and server products deployed in a network configuration. Moreover, a detailed response time decomposition process requires instrumentation at each network or server resource. It often requires modifications to the application, the middleware and software modules running in the network devices.
For certain network protocols, a trace analysis approach may be used wherein response time components can be deduced from traces of low-level events by recognizing the time instants when a request or reply is sent or received by a host. Again, the analysis of protocol traces involves a great deal of reverse engineering and guesswork to correlate events because the beginning and the end of each response time component are not always clearly demarcated in the trace. In addition, trace analysis poses a great challenge when the data over the network is encrypted for security reasons since the data necessary for correlation is not visible. On top of all these issues, the decomposition approach runs into scalability problems because large amounts of data have to be collected and correlated at the per resource level. As a result, the trace-based decomposition approach is used mostly for application debugging during the development stage and is not recommended for regular quality of service management after the deployment of the application.
Accordingly, a simplified system and method that provides automatic identification of bottleneck resources in a computer network is highly desirable. A simplified bottleneck identification process should use only end-to-end quality of service data and eliminate the need for monitoring detailed internal resource metrics, monitoring and correlating events from resources, and measuring or estimating component response times, such as required by conventional techniques.
The present invention is directed to a system and method for providing automated bottleneck identification in networks and networked application environments by processing end-to-end quality of service measurements in combination with knowledge of internal resource dependency information generated by a network administrator. Advantageously, the present invention utilizes end-to-end data that represents an external view of the application and makes inferences about resources that are internal to the system. The system and methods described herein do not require detailed measurements and/or estimates of parameters or data at the per resource level. Instead, a simplified heuristic approach is employed to accept or reject the hypothesis that any particular resource is a bottleneck based on evidence from various probe transactions.
In one aspect of the present invention, a method for identifying bottleneck resources in a network having a plurality of resources comprises the steps of:
specifying a plurality of probe transactions in the network;
generating resource dependency information representing dependent resources for each of the probe transactions;
executing the probe transactions;
measuring end-to-end quality of service data resulting from the executing of the probe transactions; and
processing the resource dependency information and measured end-to-end quality of service data to identify bottleneck resources in the network.
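For illustration only, the above steps may be sketched as follows. The probe names, resource names, response times and response time limit are hypothetical. In particular, the rule used in the final step (a resource is suspected when every probe transaction that depends on it is slow) is merely one assumed heuristic chosen for this sketch and is not asserted to be the heuristic of the present invention.

```python
# Specifying probes and their dependency information (hypothetical names).
DEPENDENCIES = {
    "probe_web_from_site_a": {"router_a", "web_server"},
    "probe_db_from_site_a": {"router_a", "web_server", "db_server"},
    "probe_web_from_site_b": {"router_b", "web_server"},
}


def run_probes(dependencies, measure):
    """Execute each probe and record its end-to-end response time in seconds."""
    return {probe: measure(probe) for probe in dependencies}


def identify_bottlenecks(dependencies, response_times, limit):
    """Assumed rule: suspect a resource when all probes depending on it exceed `limit`."""
    slow = {probe for probe, t in response_times.items() if t > limit}
    resources = set().union(*dependencies.values())
    suspects = []
    for resource in sorted(resources):
        dependents = {p for p, deps in dependencies.items() if resource in deps}
        if dependents and dependents <= slow:
            suspects.append(resource)
    return suspects


# Canned measurements stand in for real probe executions in this sketch.
canned = {
    "probe_web_from_site_a": 0.4,
    "probe_db_from_site_a": 3.2,
    "probe_web_from_site_b": 0.5,
}
times = run_probes(DEPENDENCIES, canned.get)
suspects = identify_bottlenecks(DEPENDENCIES, times, limit=2.0)
print(suspects)
```

In this example only the database probe is slow. The router and web server it shares with the fast probes are exonerated by those probes, leaving db_server as the sole suspect, and no per resource metric is consulted.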
In another aspect, probe transactions can be specified by allocating probe stations at desired nodes in the network and configuring the probe stations to execute various probe tests for initiating execution of service functions in remote servers in the network.
The resource dependency information for a given probe transaction may be defined by information such as (1) each important entity in a path between the probing station and the remote server, (2) the service function initiated in the remote server, (3) each function that is related to the service function initiated in the remote server, (4) additional servers that provide the related function, and (5) each important entity in a path between the remote server and the additional servers.
In another aspect, the resource dependency information of the probe transactions may be modeled by a dependency matrix D having a plurality of rows i, each row representing a resource in the network that a system administrator considers a potential bottleneck, and a plurality of columns j, each column representing a probe transaction, wherein a matrix element D[i,j] is assigned a value representing the dependency of a given probe transaction j on a given resource i.
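One possible construction of the dependency matrix D is sketched below for illustration. The sketch assumes a binary encoding in which D[i,j] is 1 if probe transaction j depends on resource i and 0 otherwise. The encoding, the resource names and the probe names are assumptions made for this example only. In the nested-list representation used here, the element D[i,j] is written D[i][j].

```python
def build_dependency_matrix(resources, probes, dependencies):
    """Return D as a list of rows; D[i][j] is 1 if probe j depends on resource i, else 0."""
    return [
        [1 if resource in dependencies[probe] else 0 for probe in probes]
        for resource in resources
    ]


# Hypothetical resources (rows) and probe transactions (columns).
resources = ["router_a", "router_b", "web_server", "db_server"]
probes = ["p1", "p2", "p3"]
dependencies = {
    "p1": {"router_a", "web_server"},
    "p2": {"router_a", "web_server", "db_server"},
    "p3": {"router_b", "web_server"},
}

D = build_dependency_matrix(resources, probes, dependencies)

# Reading column j recovers the resources on which probe j depends.
j = probes.index("p2")
used_by_p2 = [resources[i] for i in range(len(resources)) if D[i][j] == 1]

for name, row in zip(resources, D):
    print(f"{name:12s} {row}")
```

Reading along a row instead gives the probe transactions that exercise a given resource.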