A system that provides services to clients may implement some mechanism to protect itself from a crushing load of service requests that could potentially overload the system. For example, for a web-based service or remote procedure call (RPC) service, the service provider system is considered to be in an “overloaded” state if it is not able to provide an expected quality of service for some portion of client requests it receives. One solution employed by service provider systems to deal with request overload is to throttle, or deny service to, a certain proportion of incoming client requests until the system recovers from the overloaded state.
In some systems, throttling may be implemented in a pseudorandom fashion to reduce loads by returning failure results to random requests. In larger systems, throttling may be implemented separately for different subsystems of the system. For example, an e-commerce website may implement a number of backend application servers, each of which is associated with a different function and a different throttling policy. In such a system, a single top-level request, such as a page load directed to a web server, may induce numerous subrequests to a subsystem. When overloaded, the subsystem may simply throttle subrequests randomly. However, random throttling of just a small percentage of subrequests may induce a much larger effective throttling rate for top-level requests.
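The amplification described above can be illustrated with a short calculation. The following is a minimal sketch under two assumptions that the text does not state: each subrequest is throttled independently with the same probability, and a top-level request fails if any one of its subrequests is throttled. The function name and parameters are hypothetical.

```python
def effective_throttle_rate(subrequest_rate: float, subrequests: int) -> float:
    """Return the fraction of top-level requests that lose at least one subrequest.

    Assumes (for illustration only) that each subrequest is throttled
    independently with probability `subrequest_rate`, and that a top-level
    request fails when any of its `subrequests` subrequests is throttled.
    """
    if not 0.0 <= subrequest_rate <= 1.0:
        raise ValueError("subrequest_rate must be between 0 and 1")
    if subrequests < 0:
        raise ValueError("subrequests must be non-negative")
    # Probability that every subrequest survives is (1 - p) ** n,
    # so the probability that at least one is throttled is the complement.
    return 1.0 - (1.0 - subrequest_rate) ** subrequests


# Hypothetical example: 1% random throttling, 100 subrequests per page load.
rate = effective_throttle_rate(0.01, 100)
```

Under these assumptions, throttling only 1% of subrequests would affect roughly 63% of top-level requests that each fan out into 100 subrequests.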
Moreover, large-scale service provider systems are often implemented using collections of software modules that may be developed by different teams, sometimes in different programming languages, and could span thousands of compute nodes across multiple physical locations. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such environments. Some of these tools implement facilities to generate, capture, and log trace data that reflect the operation flow of a system during the servicing of a request. Trace data may include, for example, messages generated by the system modules, the state of a call stack during execution, and certain operational metrics, such as execution time.
In these large systems, traces may be acquired via a pseudorandom sampling of subrequests handled in each subsystem. In some cases, the sampling rates may vary based on the subsystem. In systems where the subsystems independently collect trace samples of random subrequests, it is extremely unlikely that a complete trace (i.e., all transactions on all subsystem interactions that result from a single top-level request) will be captured, even where the sampling percentage for each subsystem is relatively high. This problem thus prevents a complete review of the system's behavior in the handling of a top-level request.
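The unlikelihood of capturing a complete trace can be made concrete with a short calculation. The sketch below assumes, for illustration only, that each subsystem decides independently whether to sample and that a top-level request produces exactly one subrequest per subsystem. The function name and the example rates are hypothetical.

```python
import math
from typing import Sequence


def complete_trace_probability(sampling_rates: Sequence[float]) -> float:
    """Return the probability that every subsystem samples the same request.

    Assumes (for illustration only) that each subsystem samples its
    subrequest independently with the given rate, and that the top-level
    request touches each listed subsystem exactly once.
    """
    for sampling_rate in sampling_rates:
        if not 0.0 <= sampling_rate <= 1.0:
            raise ValueError("each sampling rate must be between 0 and 1")
    # Independent events: the joint probability is the product of the rates.
    return math.prod(sampling_rates)


# Hypothetical example: ten subsystems, each sampling 50% of its subrequests.
probability = complete_trace_probability([0.5] * 10)
```

Under these assumptions, even a 50% sampling rate in each of ten subsystems would yield a complete trace for fewer than one in a thousand top-level requests (0.5 to the tenth power, or about 0.001).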
While embodiments are described herein by way of example with reference to several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.
It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first contact could be termed a second contact, and, similarly, a second contact could be termed a first contact, without departing from the scope of the present invention. The first contact and the second contact are both contacts, but they are not the same contact.