Detecting bottlenecks and identifying their root causes are fundamental issues in system management. Automatically detecting application bottleneck points is difficult in large-scale, complex computing environments, including those that host cloud-based multi-tier applications. A typical cloud-based multi-tier application includes multiple software components (i.e., tiers) that are connected over inter- and/or intra-communication networks in data centers. Some components are serialized to process user requests step by step, and some components are configured to process user requests in parallel. For example, a web transaction application typically includes a front-end web server tier, a middle application server tier, and a back-end database server tier. Each tier can be configured with multiple servers to process requests in parallel (e.g., a database server cluster). In this situation, many system resources (e.g., CPU, memory, disk I/O, and network I/O) of the distributed components of the multi-tier application can be used to handle user requests (e.g., web transactions). Precisely identifying bottleneck causes among such distributed resources is a burdensome and time-consuming task. One attempted overall solution is to monitor changes in application performance (e.g., application throughput) and then to correlate the system resource usage of all components with the saturation of application performance for system diagnosis. This can be done by pinpointing the bottleneck starting point of the application performance and correlating it with the bottleneck starting points of all system resources that are used by the application across tiers and servers.
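By way of illustration only, the correlation approach described above might be sketched as follows. The sketch assumes that application throughput and each resource's usage are sampled at the same increasing load levels, and it treats a "bottleneck starting point" as the first sample after which growth stays flat. The function names, the `flat_ratio` threshold, the `tolerance` parameter, and the resource labels are all hypothetical and are not taken from any particular system.

```python
from typing import Dict, List, Optional, Sequence


def saturation_start(values: Sequence[float], flat_ratio: float = 0.2) -> Optional[int]:
    """Return the index of the sample after which growth stays flat.

    A step between consecutive samples counts as flat when it is smaller
    than flat_ratio times the initial step (values[1] - values[0]).
    Returns None when the series never flattens (assumed heuristic).
    """
    if len(values) < 3:
        return None
    baseline = values[1] - values[0]
    if baseline <= 0:
        return None
    steps = [b - a for a, b in zip(values, values[1:])]
    for i in range(1, len(steps)):
        # steps[i] is the gain from sample i to sample i + 1.
        if all(step < flat_ratio * baseline for step in steps[i:]):
            return i
    return None


def correlated_resources(
    app_series: Sequence[float],
    resource_series: Dict[str, Sequence[float]],
    tolerance: int = 1,
) -> List[str]:
    """List resources whose saturation start lies near the application's."""
    app_start = saturation_start(app_series)
    if app_start is None:
        return []
    suspects = []
    for name, series in resource_series.items():
        start = saturation_start(series)
        if start is not None and abs(start - app_start) <= tolerance:
            suspects.append(name)
    return suspects


# Made-up measurements taken at seven increasing load levels.
throughput = [100, 200, 300, 390, 400, 402, 401]
resources = {
    "db1.cpu": [10, 20, 30, 39, 40, 40, 40],      # flattens with the application
    "db1.disk_io": [5, 10, 15, 20, 25, 30, 35],   # still growing
    "web1.memory": [20, 21, 22, 23, 24, 25, 26],  # still growing
}
print(saturation_start(throughput))
print(correlated_resources(throughput, resources))
```

On this clean, made-up data only the resource that flattens at the same load level as the throughput is reported as a suspect. Real measurements are rarely this well behaved.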
However, automatically pinpointing and correlating bottleneck starting points is not trivial. It is very hard to express the usage of some system resources on a unified scale (i.e., what percentage of disk I/O, system cache, or network bandwidth is used). Additionally, performance measurements usually contain some noise, which makes the pinpointing effort even harder. This noise can generate a number of false positives (e.g., false alarms), and resolving such false positives incurs cost. The noise becomes severe once the application approaches a bottleneck point, which makes the problem (i.e., pinpointing the bottleneck starting point) even more difficult. Meanwhile, missing the application bottleneck for a long time interval results in a false negative, which forfeits the chance to diagnose system behaviors and then resolve the real application bottleneck.
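The effect of noise on such pinpointing can be illustrated with a small sketch. It assumes a simple detector that flags a bottleneck when the throughput gain between consecutive samples drops below a threshold. It then contrasts that detector on raw samples with the same detector applied after a moving-median filter and a requirement that the low gain persist for several steps. The function names, the `min_gain` and `hold` parameters, and the sample values are hypothetical, and the smoothing shown is only one of many possible mitigations.

```python
from statistics import median
from typing import List, Optional, Sequence


def moving_median(values: Sequence[float], window: int = 3) -> List[float]:
    """Smooth a series with a centered moving median (shorter at the edges)."""
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def first_sustained_flat(
    values: Sequence[float], min_gain: float, hold: int = 3
) -> Optional[int]:
    """Return the sample index where a run of `hold` low-gain steps begins.

    A step is low-gain when values[i] - values[i - 1] < min_gain.
    Returns None when no such run exists.
    """
    run = 0
    for i in range(1, len(values)):
        if values[i] - values[i - 1] < min_gain:
            run += 1
            if run >= hold:
                return i - hold
        else:
            run = 0
    return None


# Made-up throughput samples: a transient dip at index 2, real saturation near 400.
noisy = [100, 200, 180, 300, 400, 405, 395, 402, 398]

# Naive detector: a single low-gain step raises an alarm (a false positive here).
print(first_sustained_flat(noisy, min_gain=10, hold=1))

# Smoothed detector: the dip is filtered out and the flat run must persist.
print(first_sustained_flat(moving_median(noisy), min_gain=10, hold=3))
```

In this example the naive detector raises an alarm at the transient dip, well before throughput actually levels off, whereas the smoothed detector reports the sample at which the plateau begins. The trade-off is detection delay: a larger `hold` suppresses more false alarms but postpones the report, which moves toward the false-negative risk noted above.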
Thus, there is a need for improved methods and systems that automatically detect bottleneck points in application resources in order to identify bottleneck causes in cloud-based multi-tier applications for system diagnosis.