Production applications running in a large-scale data center sometimes encounter performance degradation, which may lead to a complete system halt. Large data centers typically employ monitoring tools that track the system resources used by production applications and raise red flags if an application consumes too many of those resources. These monitoring tools detect potentially troublesome situations and perform various tasks, such as sending alerts or executing recovery jobs, to prevent a complete system halt. Conventionally, these tools can be configured to automatically take actions (e.g., restarting applications) to restore application availability and to return resource consumption to normal levels.
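By way of illustration only, the threshold-based monitoring described above may be sketched as follows. The metric names, the `Threshold` type, and the `check` function are hypothetical and are not drawn from any particular monitoring tool; the sketch assumes that resource samples arrive as a simple mapping from metric name to value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Threshold:
    """Hypothetical rule: run `action` when `metric` exceeds `limit`."""
    metric: str                            # assumed metric name, e.g. "cpu_percent"
    limit: float                           # value above which a red flag is raised
    action: Callable[[str, float], None]   # e.g. send an alert or restart an application


def check(sample: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Run the action for every exceeded threshold and return the flagged metrics."""
    flagged = []
    for rule in thresholds:
        value = sample.get(rule.metric)
        # A metric missing from the sample is skipped rather than flagged.
        if value is not None and value > rule.limit:
            rule.action(rule.metric, value)
            flagged.append(rule.metric)
    return flagged
```

In such a sketch, the action could equally be a recovery job; the monitor itself only compares observed consumption against a configured limit.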
With the introduction of various component level architectures (e.g., applications executing within containers, where each container is allocated system resources), the relationship between various production applications in a data center is becoming more difficult to determine. For example, a typical scenario may be one in which applications execute inside multiple containers, and each application uses services available to other applications executing inside a different container or inside a cluster of containers. Further, each container may be executing several threads associated with the application in the container. In some instances, the application may run out of threads (i.e., all the threads are consumed in processes and are not returned to the thread pool), causing the container to lock up (i.e., the container cannot respond to any incoming requests due to the lack of threads).
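The thread exhaustion scenario described above may be reproduced in miniature. The following is a minimal sketch, assuming a fixed-size pool from Python's standard `concurrent.futures` module stands in for a container's thread pool; the function name `demonstrate_exhaustion` and its parameters are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def demonstrate_exhaustion(pool_size: int = 2, wait_seconds: float = 0.2) -> bool:
    """Return True if a new request starves while every pool thread is blocked."""
    release = threading.Event()
    pool = ThreadPoolExecutor(max_workers=pool_size)
    try:
        # Occupy every worker with a call that does not return its thread
        # to the pool until `release` is set.
        for _ in range(pool_size):
            pool.submit(release.wait)

        # An incoming request is queued, but no thread is free to serve it.
        request = pool.submit(lambda: "response")
        try:
            request.result(timeout=wait_seconds)
            return False
        except FutureTimeout:
            return True
    finally:
        # Unblock the workers so the pool can drain and shut down cleanly.
        release.set()
        pool.shutdown(wait=True)
```

In this sketch the queued request is never rejected; it simply waits, which corresponds to a container that accepts incoming requests but cannot respond to them.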
A conventional mechanism for monitoring applications in component level architectures is profiling. Profiling traces executing functions and identifies the resources used by those functions. Typically, a profiler is called by the system during the execution of a particular function or process. The profiler subsequently receives a notice every time an event of interest occurs within the executing function or process, and gathers statistical data on these events. By gathering statistical data on executing functions, the profiler can build a comprehensive picture of which functions or processes use the most system resources, which functions or processes use the least system resources, etc.
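The event-driven profiling described above may be illustrated with a minimal sketch, assuming Python's standard `sys.setprofile` hook as the notification mechanism. The helper name `profile_calls` is hypothetical, and call counts and elapsed wall-clock time are used here only as stand-ins for "system resources".

```python
import sys
import time
from collections import Counter, defaultdict


def profile_calls(func, *args, **kwargs):
    """Run `func` under a simple profiler; return (result, call counts, seconds)."""
    calls = Counter()               # number of calls per function name
    seconds = defaultdict(float)    # cumulative wall-clock time per function name
    started = {}                    # frame -> start time of the call in progress

    def on_event(frame, event, arg):
        # The interpreter notifies this callback on each call and return event.
        name = frame.f_code.co_name
        if event == "call":
            calls[name] += 1
            started[frame] = time.perf_counter()
        elif event == "return":
            begin = started.pop(frame, None)
            if begin is not None:   # ignore returns whose call was not observed
                seconds[name] += time.perf_counter() - begin

    sys.setprofile(on_event)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)        # always detach the profiler
    return result, calls, dict(seconds)
```

Sorting the collected statistics by count or by time then indicates which functions are the heaviest consumers, under the stated assumptions.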
Production issues are often non-deterministic, impacting system performance and availability without warning. Almost invariably, these issues occur under peak system load. In some instances, production issues may occur when it is too late to turn on profiling; other times it is difficult to determine which events of an application to trace using profiling. To complicate matters further, in many cases, symptoms of the problem show up in one application while the root cause of the problem may be hidden in another application.