Data center management is based upon monitoring the servers in a data center. For example, monitoring of the servers provides the information needed to determine the health of systems, often providing the first warning that problems are occurring, and also assisting in the localization and debugging of those problems. Monitoring also provides information regarding the utilization of servers, which figures into capacity planning and provisioning decisions.
Successfully monitoring servers depends on instructing the servers to measure desired characteristics in a manner that does not overwhelm the servers' resources. The instructions the servers use to measure themselves often need to be customized.
Servers are well instrumented, producing far more data about their status than can realistically be stored locally on the server or sent elsewhere for analysis. As a result, processes called server monitoring agents are typically deployed on or near the servers. The agents are responsible for extracting the part of the data deemed interesting by the data center operators, and for forwarding that part, or a summary of it, for further analysis. However, because even the extracted data is too voluminous, a large amount of it is lost.
Contemporary monitoring agent processes may consume so much of a server's resources (e.g., CPU, memory, disk space, I/O bandwidth and so forth) that the primary functionality of the server (e.g., serving content) is adversely impacted. Resources consumed by monitoring can, for example, distort SLA (service level agreement) measurements. As a result, many of the agents that are deployed are extremely limited in the processing they perform, which limits the value of the information they can provide. Data needed for anomaly detection, debugging, and system management is often not available, especially because developers and operators may not realize what information is important until after the system is deployed and experience with operating it is obtained.
Because of their potential impact on server performance, agents and their processing rules typically need to undergo extensive qualification testing before deployment is allowed. Having to re-qualify an agent every time a change is made to its processing rules makes it difficult to refine the agents, even though such refinement is highly desirable.
Further, monitoring a large set of servers creates additional challenges. These challenges typically need to be overcome by relying on the experience of a system administrator. By way of example, consider monitoring to identify unusual or potentially performance-threatening situations; such situations may differ significantly depending on the underlying architecture, processing mode (batch, transaction, failover), time of day (peak, off-peak) and so forth. For example, detecting ninety percent processor utilization for several minutes may trigger an alert for most transaction-processing applications. However, the same level of processor utilization is normal in batch processing; indeed, for some types of batch processing, any lower utilization should trigger an alert, as utilization below ninety percent may suggest that the application is not performing the expected work or has even stopped working.
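The mode-dependent interpretation of processor utilization described above can be illustrated with a minimal sketch. The names `Mode` and `should_alert`, the representation of "several minutes" as a list of samples, and the rule that every sample must cross the threshold are assumptions made for illustration only; they are not taken from any particular monitoring product.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical processing modes, for illustration only."""
    TRANSACTION = "transaction"
    BATCH = "batch"


# Ninety percent utilization, the figure used in the example above.
THRESHOLD = 0.90


def should_alert(mode, cpu_samples, threshold=THRESHOLD):
    """Return True if sustained CPU utilization is abnormal for the mode.

    cpu_samples is a list of utilization readings (0.0 to 1.0) covering
    the observation window, e.g. one reading per minute (an assumption).
    """
    if not cpu_samples:
        return False  # no data, nothing to judge
    if mode is Mode.TRANSACTION:
        # Sustained high utilization is suspicious for transaction processing.
        return all(s >= threshold for s in cpu_samples)
    if mode is Mode.BATCH:
        # Sustained low utilization suggests the batch job is stalled.
        return all(s < threshold for s in cpu_samples)
    raise ValueError(f"unsupported mode: {mode}")
```

The same readings therefore lead to opposite conclusions depending on the mode, which is why a single fixed threshold per server is inadequate when servers are re-purposed.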
At the same time, system behavior is described by hundreds of variables, and any combination of them may need to be used to spot and alert on the occurrence of some problem. For example, high CPU utilization may be a problem only when it occurs simultaneously with lower-than-usual utilization of a disk drive that holds database logs.
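A rule that combines two variables in this way might be sketched as follows. The `Sample` type, the `combined_alert` function, and the specific cutoffs (ninety percent CPU, log-disk utilization below half of its usual value) are hypothetical values chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical observation of two of the many system variables."""
    cpu_utilization: float       # 0.0 to 1.0
    log_disk_utilization: float  # 0.0 to 1.0, disk holding database logs


def combined_alert(sample, usual_log_disk, cpu_high=0.90, disk_drop=0.5):
    """Alert only when high CPU coincides with unusually low log-disk use.

    usual_log_disk is the utilization normally seen on the log disk;
    disk_drop is the fraction of that value below which the disk counts
    as "lower than usual" (both are illustrative assumptions).
    """
    cpu_is_high = sample.cpu_utilization >= cpu_high
    disk_is_low = sample.log_disk_utilization < usual_log_disk * disk_drop
    return cpu_is_high and disk_is_low
```

Neither condition triggers an alert alone; with hundreds of variables, the number of such combinations that an administrator would have to define by hand grows quickly.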
Because of these difficulties, known approaches to server monitoring do not scale well to a large number of servers (possibly on the order of hundreds to one hundred thousand), given the servers' varying load patterns and the dynamic nature of a contemporary data center. Modern data centers may have tens of thousands of servers, for example, running hundreds of differing applications and serving a load that arrives from clients (e.g., Internet-connected clients) in a mostly uncontrollable fashion. In addition, the servers may be frequently re-purposed to serve a different application, which completely changes the load pattern on the re-purposed server. Having a system administrator set individual alerts on each server is not a practical solution.
Yet another problem with conventional monitoring approaches is that they concentrate on and report the performance metrics directly available from the system at the moment. The monitor does not have other information, such as what is considered a normal situation based upon the given time of day or other knowledge (e.g., a holiday). Instead, the alerts and/or data collection rules are set for some ‘average’ situation, such as a weekday or a weekend, but that does not account for differences between weekends during a holiday period and other weekends, for example. It is sometimes technically possible to create such a multitude of parameter settings, but it is presently impractical to apply them as the load patterns vary from time to time.
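The limitation of rules set for an ‘average’ situation can be seen in a small sketch. All names, the expected-load figures, the holiday date, and the tolerance below are invented for illustration; they do not come from any measured system.

```python
from datetime import date

# Hypothetical expected loads (requests per second) for each context.
EXPECTED_LOAD = {"weekday": 1000.0, "weekend": 400.0, "holiday_weekend": 1500.0}

# Hypothetical holiday calendar; 2023-12-24 falls on a Sunday.
HOLIDAYS = {date(2023, 12, 24)}


def coarse_context(day):
    """Conventional rule: only distinguishes weekday from weekend."""
    return "weekend" if day.weekday() >= 5 else "weekday"


def holiday_aware_context(day, holidays=HOLIDAYS):
    """Adds the outside knowledge that a weekend day may be a holiday."""
    if day.weekday() >= 5 and day in holidays:
        return "holiday_weekend"
    return coarse_context(day)


def is_unusual(load, expected, tolerance=0.5):
    """Flag a load that deviates from expectation by more than tolerance."""
    return abs(load - expected) > tolerance * expected


day = date(2023, 12, 24)
observed = 1400.0  # hypothetical observed load on the holiday weekend
coarse_flag = is_unusual(observed, EXPECTED_LOAD[coarse_context(day)])
aware_flag = is_unusual(observed, EXPECTED_LOAD[holiday_aware_context(day)])
```

Under these assumed figures the coarse rule flags the holiday load as unusual, while the holiday-aware rule does not. Maintaining such context tables by hand for every server and every load pattern is the part that is presently impractical.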