Performance issues in critical business applications are typically hard to reproduce in test environments. In at least some cases, this may be because the applications are tested and tuned for performance in a pre-production environment and are released to production once the performance tests have passed. It is often hard or even impossible for a test environment to cover all conditions under which the application will execute in production, where it faces real load and real data.
Often, performance issues in production surface only infrequently, such as when a certain combination of specific data, specific user interactions, specific database states, and/or specific timing is encountered. As a result, detecting and identifying the conditions that cause performance issues can be challenging.
A common architectural model for server-side applications today is based on a reactive principle. Specifically, the server performs some work upon receiving an external event (request) through a well-defined protocol (HTTP, RMI, SOAP, etc.), and may send a response (result) to the requesting party. The unit of work performed upon receiving a single event is sometimes called a “server request”. A major indicator of server performance is server request latency, i.e., the time elapsed between receiving the requesting event and sending the response.
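As a minimal illustration of server request latency as defined above, the following Python sketch times one unit of work from receipt of the request until the response is ready. The names `timed_request` and `echo_handler` are hypothetical and used only for illustration; a real server would take its timestamps at the protocol boundary rather than around a function call.

```python
import time
from typing import Callable, Tuple


def timed_request(handler: Callable[[str], str], request: str) -> Tuple[str, float]:
    """Run one server request and return (response, latency in seconds)."""
    start = time.perf_counter()            # requesting event received
    response = handler(request)            # the unit of work: one "server request"
    latency = time.perf_counter() - start  # response is ready to be sent
    return response, latency


def echo_handler(request: str) -> str:
    # Hypothetical handler standing in for real application work.
    time.sleep(0.01)
    return request.upper()


response, latency = timed_request(echo_handler, "ping")
print(f"response={response!r} latency={latency * 1000:.1f} ms")
```

Note that the latency value only exists after the handler returns, which is the root of the data collection difficulty discussed next.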
A performance analyst may be interested in determining which server requests took a long time to execute. There are a number of tools available that are directed at addressing this need. A performance analyst may also be interested in seeing internal details of long-running server requests to understand and pinpoint root causes of delay. This can pose severe technical challenges. There is generally a strong trade-off between the amount and quality of collected data about a server request's execution and the performance overhead of the data collection. The trade-off is made even more profound by the fact that the server request latency is not known when the server request starts, and most data collection techniques are not capable of capturing data retroactively. Thus, prior systems may force data collection for all server requests, or for a randomly selected subset of requests, and risk missing the server requests that are causing issues.
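The risk of random sampling described above can be sketched with a small simulation. This is an illustrative model only, not a description of any particular prior system: the function name `simulate_sampling`, the sample rate, the threshold, and the latency values are all assumptions. The key point is that the decision to collect data is made at request start, before the latency is known.

```python
import random
from typing import Sequence, Tuple


def simulate_sampling(
    latencies_ms: Sequence[float],
    sample_rate: float,
    slow_threshold_ms: float,
    seed: int = 0,
) -> Tuple[int, int]:
    """Return (slow requests in total, slow requests with detailed data)."""
    rng = random.Random(seed)
    slow_total = 0
    slow_captured = 0
    for latency in latencies_ms:
        # The sampling decision happens when the request starts,
        # so it cannot depend on the (still unknown) latency.
        collect = rng.random() < sample_rate
        if latency >= slow_threshold_ms:
            slow_total += 1
            if collect:
                slow_captured += 1
    return slow_total, slow_captured


# Hypothetical workload: 98 fast requests and 2 slow ones.
workload = [10.0] * 98 + [5000.0] * 2
total, captured = simulate_sampling(workload, sample_rate=0.05, slow_threshold_ms=1000.0)
print(f"slow requests: {total}, captured with details: {captured}")
```

With a sample rate of 1.0 every slow request is captured, at the cost of collection overhead on every request; with a low sample rate the overhead drops, but a rare slow request is captured only with probability equal to the sample rate.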