Internet services today, such as those provided by Google™, Facebook™, Amazon™, and the like, tend to be built upon distributed server stacks. Understanding the performance behavior of distributed server stacks at scale is a non-trivial problem. For instance, the servicing of just a single request can trigger numerous sub-requests across heterogeneous software components, and many similar requests are often serviced concurrently and in parallel. When a user experiences poor performance, it is extremely difficult to identify the root cause and to pinpoint the software components and machines responsible.
For example, a simple Hive query may involve a YARN (Yet Another Resource Negotiator) “ResourceManager,” numerous YARN “ApplicationMaster” and “NodeManager” components, several “MapReduce” tasks, and multiple Hadoop Distributed File System (HDFS) servers.
Numerous tools have been developed to help identify performance anomalies and their root causes in these types of distributed systems. The tools have employed a variety of methods, which tend to have significant limitations. Many methods require the target systems to be instrumented with dedicated code to collect information. As such, they are intrusive and often cannot be applied to legacy or third-party components. Other methods are non-intrusive and instead analyze already existing system logs, either using machine learning approaches to identify anomalies or relying on static code analysis. Approaches that use machine learning techniques cannot understand the underlying system behavior and thus may not help identify the root cause of an anomaly. Approaches that require static code analysis are limited to components where such static analysis is possible, and they tend to be unable to understand the interactions between different software components.
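To illustrate why log-based machine learning approaches can flag an anomaly without explaining it, consider a minimal sketch of one common idea: counting log events per time window and flagging windows whose counts deviate sharply from the mean. The function name, the window-count representation, and the standard-deviation threshold are illustrative assumptions, not the method of any specific tool; such a detector reports *that* a window is unusual but says nothing about *which* component caused it.

```python
# Illustrative sketch (not any specific tool's algorithm): detect
# anomalous log windows purely from per-window event counts.
from statistics import mean, stdev

def anomalous_windows(window_counts, k=2.0):
    """Return indices of windows whose event count deviates from the
    mean by more than k sample standard deviations.

    window_counts: list of log-event counts, one per time window.
    """
    mu = mean(window_counts)
    sigma = stdev(window_counts)
    if sigma == 0:
        return []  # all windows identical; nothing stands out
    return [i for i, c in enumerate(window_counts)
            if abs(c - mu) > k * sigma]

# A burst of log events in window 4 is flagged, but the detector
# cannot say which component or machine produced the burst.
print(anomalous_windows([10, 11, 9, 10, 50, 10]))
```

The sketch highlights the limitation noted above: the output is only an index into the log timeline, leaving root-cause attribution to manual investigation.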