This pertains to responsiveness of systems and, more particularly, to the measurement of the sensitivity of a system's responsiveness to changes in attributes of elements that make up the system. Advantageously, the measurement may be made while the system is in normal operation mode.
Many systems exist whose normal operation involves responding to stimuli, such as requests to do something, where the stimuli are substantially random. One example of such systems is network-based applications where users connect to a network and enter a request that may require numerous hardware devices to participate in order to satisfy the request. Those may be geographically dispersed devices that are interconnected through a network, for example, the world-wide-web (www) IP network, and they interact with each other through network links. Those devices, together with the links that interconnect them, are elements of a network-based system. Such a system differs from non-network-based systems in that the links that interconnect the elements are shared with other users of the network that have nothing to do with the system.
In the context of this disclosure, the term “link” without the qualifier “physical” refers to the logical connection from component A to component B. Such a logical connection typically comprises routers and physical links that interconnect the routers. Link latency is a measure of the delay that a message which is sent out by component A experiences as it flows to its component B destination via the link between components A and B.
Network-based systems that provide network-based applications are becoming quite prevalent. Their proliferation makes it increasingly important to be able to predict and control responsiveness. Unfortunately, the complexity inherent in current software, hardware, and network architectures makes this task extremely difficult. For example, an enterprise application structured as a multi-tier architecture may involve complicated collections of interconnected and replicated web servers, application servers, and databases, not to mention load balancers, firewalls, routers and physical links. Other aspects of modern enterprise practices only exacerbate the problems. For example, an application may be running on a third-party platform in remote hosting centers or on hardware that is shared in a utility model using virtualization technology. All of these factors make it virtually impossible even to detect all the dependencies that affect the end-to-end performance of a system (i.e., availability of a distributed application as seen by the end user), much less manage them. Without such information, it is difficult to predict the ultimate effect of, for example, moving a function to another machine, upgrading a network link, or replicating a server for increased availability.
In many applications the typical interaction is a request that triggers a response. The responsiveness of a network-based system to requests depends on many parameters, such as network topology, bandwidth, link latency, and processing resources; but it is generally considered that link latency is one of the more important attributes, especially if some of the links span wide area networks.
While this invention is generally directed to measurement of sensitivity of a system's responsiveness to changes in attributes of elements that make up the system, for illustrative purposes the following focuses on the task of measuring the sensitivity of a network-based system's end-to-end response time to changes in a link's latency.
Measuring the dependency of a system's responsiveness on link latency is not easy. Simply measuring a link's latency is not enough, because a high-latency link between two components may have little impact on the overall response time of an application if, for example, communication over that link is required for only a small fraction of the user requests. Clearly, therefore, it is necessary to take into account the relationship between the inputs applied to the system and how the link under consideration affects the system's responsiveness. Randomness in the particular requests that appear at the system's inputs, and randomness in the arrival times of these requests, make it extremely difficult to establish this relationship in any form that permits analysis.
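The point that a link's latency alone does not determine its impact can be illustrated with a small calculation. The following Python sketch is purely illustrative: the function name, the simple multiplicative model, and all numeric values are assumptions chosen for the example, not measurements or a method taken from this disclosure.

```python
def mean_latency_contribution(link_latency_ms, fraction_of_requests,
                              traversals_per_request=1):
    """Average delay (ms) that a link adds per user request.

    Assumed toy model: every request that uses the link crosses it a fixed
    number of times, and each crossing costs the full link latency.
    """
    return link_latency_ms * fraction_of_requests * traversals_per_request


# Hypothetical link X: high latency, but needed by only 2% of requests.
slow_rare = mean_latency_contribution(200.0, 0.02, traversals_per_request=2)

# Hypothetical link Y: low latency, but crossed 4 times by 90% of requests.
fast_common = mean_latency_contribution(10.0, 0.90, traversals_per_request=4)

# Under these assumptions, link Y contributes more to the mean response time
# than link X, even though its latency is 20 times lower.
print(f"link X: {slow_rare:.1f} ms, link Y: {fast_common:.1f} ms")
```

In a real system the fraction of requests that use a link and the number of traversals per request are generally unknown and vary with the random request mix, which is why such a calculation cannot simply be carried out in advance.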
An article by A. Brown, et al., “An active approach to characterizing dynamic dependencies for problem determination in a distributed environment,” Proc. of the 7th IFIP/IEEE International Symposium on Integrated Network Management, pages 377-390, May 2001, introduces a dependency determination based on active perturbation of system components and observation of the effects. System dependencies are modeled as a directed acyclic graph where nodes are the system components and weighted edges represent dependencies. An edge is drawn from component A to component B if the failure of component B can affect component A. The weight of the edge represents the impact of the failure's effect on A. While the described approach is generic in a sense, the experiment that the article describes employs periodic database table locking (effectively a failure to respond) as the perturbation method.
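A dependency model of the kind just described, with components as nodes and weighted edges from a dependent component to the component it depends on, might be represented as follows. This is a minimal sketch, not the implementation from the cited article; the class name, method names, component names, and weights are all hypothetical.

```python
from collections import defaultdict


class DependencyGraph:
    """Directed graph: an edge A -> B means a failure of B can affect A."""

    def __init__(self):
        # Maps dependent component -> {component it depends on: weight}.
        self._edges = defaultdict(dict)

    def add_dependency(self, dependent, dependency, weight):
        """Record that `dependent` relies on `dependency`.

        `weight` stands for the impact of the dependency's failure
        (illustrative values only).
        """
        self._edges[dependent][dependency] = weight

    def affected_by(self, failed):
        """Return every component that directly or transitively depends on `failed`."""
        affected = set()
        frontier = [failed]
        while frontier:
            current = frontier.pop()
            for dependent, dependencies in self._edges.items():
                if current in dependencies and dependent not in affected:
                    affected.add(dependent)
                    frontier.append(dependent)
        return affected


# Hypothetical three-tier application.
graph = DependencyGraph()
graph.add_dependency("web_server", "app_server", weight=0.9)
graph.add_dependency("app_server", "database", weight=0.5)

print(sorted(graph.affected_by("database")))
```

Such a graph records which components a failure can reach and how strongly, but it does not by itself express how response time varies with a gradual change in an attribute such as link latency.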
In an article titled “Dependency analysis in distributed systems using fault injection: Application to problem determination in an e-commerce environment,” Proc. of the 12th International Workshop on Distributed Systems: Operations & Management, October 2001, S. Bagchi, et al. employ the same approach in combination with fault injection as the perturbation method. Candea et al., in “Automatic failure-path inference: A generic introspection technique for Internet applications,” Proc. of the 3rd IEEE Workshop on Internet Applications (WIAPP), June 2003, describe an Automatic Failure-Path Inference (AFPI) technique that combines pre-deployment failure injection with runtime passive monitoring.
Generalizing, it can be said that the above articles describe an approach for identifying relationships. They tell whether a failure in one component has an effect at another component. None of the above, however, provides a measure of the sensitivity of an application to changes in some attribute, such as link latency, and none of the above deals with injected changes that induce a measure of degradation short of a failure.