The complexity of distributed systems and of the mechanisms used to test them has been widely explored for many years. Distributed systems pose many inherent challenges, such as the latency of asynchronous communications, error recovery, clock drift, and service partitioning, which lead to problems including deadlocks and race conditions. Testing and monitoring such complex systems is correspondingly difficult. Over the years, many methods for automatic test generation, deployment, and execution have been investigated and implemented. However, considerable effort is still required in the area of monitoring and maintaining such systems.
It is very useful to be able to monitor the performance and behavior of servers in a datacenter or production setting. Most contemporary operating systems provide event tracing or performance counter mechanisms for this purpose, but turning them on usually incurs some performance overhead, consuming CPU and disk or network IO to write their log files. If this overhead has not been taken into account during capacity planning, it can overload the production servers, resulting in failures and/or poor performance. Hence, tracing mechanisms are rarely used in production.
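The capacity-planning concern can be illustrated with a short calculation. The following is a minimal sketch only: the function names, the 80% safe-utilization threshold, and the assumption that tracing overhead scales with the existing load are all hypothetical and are not taken from any particular operating system or tracing tool.

```python
def utilization_with_tracing(baseline_util: float, tracing_overhead: float) -> float:
    """Project utilization once tracing is enabled.

    baseline_util    -- fraction of the server in use without tracing (0.0 to 1.0)
    tracing_overhead -- assumed extra cost of tracing, as a fraction of the
                        baseline load (hypothetical model: overhead scales with load)
    """
    return baseline_util * (1.0 + tracing_overhead)


def fits_capacity(baseline_util: float,
                  tracing_overhead: float,
                  max_safe_util: float = 0.80) -> bool:
    """Return True if the server stays at or below the assumed safe utilization."""
    return utilization_with_tracing(baseline_util, tracing_overhead) <= max_safe_util


if __name__ == "__main__":
    # Illustrative numbers only: two servers, both facing a 10% tracing overhead.
    for name, baseline in [("lightly loaded", 0.60), ("heavily loaded", 0.75)]:
        ok = fits_capacity(baseline, tracing_overhead=0.10)
        print(f"{name}: tracing {'fits' if ok else 'exceeds'} planned capacity")
```

Under these assumed numbers, the server planned with headroom absorbs the tracing cost, while the server sized without it crosses the threshold as soon as tracing is enabled.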
Many datacenters now use virtual machines to allow multiple production applications to run on a single server, each within a virtual environment, so that each application behaves as if it has exclusive use of the machine. In fact, the application typically has exclusive use only of its virtual machine. The virtual machine provides the application with a guaranteed amount of hardware resources, such as CPU speed, memory size, disk capacity, and network bandwidth. Hypervisors that allow multiple virtual machines to run side by side on the same computer are widely available for commodity hardware (e.g., Xen, Hyper-V, and VMware). The hypervisor multiplexes (and sometimes schedules) access to physical resources such as CPU, memory, disk, and network. The hypervisor provides schedulers for both CPU and IO resources that are capable of providing a fixed partition of all resources between two or more VMs. This can be done in many ways, e.g., using hard real-time scheduling algorithms.
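One simple way to picture a fixed partition of CPU time is a static cyclic schedule in which each virtual machine receives the same reserved slice in every scheduling period. The sketch below is an illustration under stated assumptions and does not reproduce the scheduler of Xen, Hyper-V, VMware, or any other hypervisor; the `Reservation` type, the `build_fixed_schedule` function, and the 100 ms period are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Reservation:
    vm: str            # hypothetical VM name
    cpu_share: float   # fraction of one physical CPU reserved for this VM


def build_fixed_schedule(reservations: List[Reservation],
                         period_ms: float = 100.0) -> List[Tuple[str, float, float]]:
    """Lay out one scheduling period as fixed, non-overlapping time slices.

    Returns (vm, start_ms, end_ms) tuples. The same layout repeats every
    period, so a VM's slice never shrinks because another VM is busy.
    """
    total = sum(r.cpu_share for r in reservations)
    if total > 1.0 + 1e-9:
        raise ValueError("reservations exceed the capacity of the physical CPU")

    schedule = []
    start = 0.0
    for r in reservations:
        length = r.cpu_share * period_ms
        schedule.append((r.vm, start, start + length))
        start += length
    return schedule  # any time left after the last slice stays unallocated


if __name__ == "__main__":
    plan = build_fixed_schedule([Reservation("vm-a", 0.5), Reservation("vm-b", 0.25)])
    for vm, begin, end in plan:
        print(f"{vm}: {begin:.0f} ms to {end:.0f} ms of every 100 ms period")
```

Because each slice is fixed in advance, the time left unallocated in a period (25 ms in this example) is not consumed by either virtual machine, which is the property that makes the partition predictable.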
Monitoring performance within a virtual machine faces the same challenges as monitoring within a physical machine. Turning on performance counters, logging, and other monitoring tools consumes some of the virtual machine's resources and may change the behavior of the application. For applications with high guaranteed uptime, the operator cannot risk this kind of interference. In some cases, this has been addressed by leaving monitoring on all the time, developing and testing the system under the assumption that the monitoring burden will always be present. While this adds predictability, the application may still run slower than it otherwise would, and the approach involves planning for and purchasing more hardware resources than would otherwise be needed. Many applications are thus left to choose between constant monitoring, with the accompanying slower performance, and no monitoring, with the accompanying difficulty of diagnosing problems and observing behavior.