The present disclosure generally relates to performance engineering, and more particularly relates to adaptive monitoring of application performance.
Performance engineering in today's large, distributed systems is a complex endeavor. Testing systems against performance requirements for response time, memory consumption, and CPU (central processing unit) time, while also ensuring adherence to service level agreements, requires proficient use of monitoring tools. Performance test tools record logs, or audit traces, together with metrics such as CPU utilization, in order to capture event details and application and/or system behavior during a run. Although logging is a vital part of performance assessment, it represents a substantial data overhead, in the range of petabytes of data. As an example, a performance test run for an enterprise-grade architecture spans several days, monitoring the performance of several hundred servers handling 500 transactions per second and involving 100,000 concurrent users. The act of producing monitoring data, including logging output and metrics, adds to the workload and can affect application and/or system performance. Given that each server can feature multiple logging points, and each logging point can generate terabytes of data per day, it is easy to understand why log overhead has become a critical issue.
One cost-saving solution is to reduce monitoring levels, also referred to as log or trace levels. A monitoring level is the level of detail in the output generated at logging points. Reducing monitoring levels can yield substantial cost savings by cutting down on the monitoring data that must be stored, processed, and analyzed. Another benefit of reducing monitoring levels is a smaller impact of monitoring on application and/or system performance, because fewer CPU cycles are diverted to monitoring functions. The downside of setting low monitoring levels is obvious: the loss of detail hinders the ability to debug any performance problems.
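The trade-off between monitoring level and output volume can be illustrated with a minimal sketch. It uses Python's standard `logging` module purely as a stand-in for a logging point; the function name `run_logging_point`, the logger names, the message texts, and the count of 1,000 simulated transactions are all hypothetical and are not taken from the present disclosure.

```python
import io
import logging


def run_logging_point(level: int) -> int:
    """Emit the same simulated events at the given level.

    Returns the number of characters of log output actually produced.
    (Hypothetical helper, for illustration only.)
    """
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    logger = logging.getLogger(f"demo.level{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep output out of the root logger
    logger.addHandler(handler)

    for txn in range(1000):  # assumed workload size
        logger.debug("txn %d: request and response details", txn)
        logger.info("txn %d: completed", txn)
        if txn % 100 == 0:
            logger.warning("txn %d: slow response", txn)

    logger.removeHandler(handler)
    return len(buffer.getvalue())


# Same workload, two monitoring levels.
verbose_size = run_logging_point(logging.DEBUG)
reduced_size = run_logging_point(logging.WARNING)
print(f"DEBUG level output:   {verbose_size} characters")
print(f"WARNING level output: {reduced_size} characters")
```

At the reduced level only the warnings are written, so far less data is stored, but the per-transaction detail that would help locate the cause of a slow response is no longer available.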
Customarily, when a performance issue is found, it becomes necessary to repeat the faulty scenario with higher monitoring levels in order to identify and track the root cause of the problem. The general procedure is to stop the workload, increase the monitoring level, and rerun the hours-long or days-long test, hoping that the same problem will occur in the same node. This does not always happen, because runs are randomized in a production environment.