1. Technical Field
This disclosure is generally related to server clusters. More specifically, this disclosure is related to determining the health of a particular component of a server farm comprising a multitude of servers.
2. Related Art
A server farm is a collection of computer servers configured to provide reliable computing service. An organization can use a server farm to provide networked search services, computing resources, etc. Generally, the server farm comprises a set of monitored computers organized into clusters controlled by a cluster scheduler. Each of the monitored computers provides operational state signals that indicate the state of its components. Thus, if a cooling fan fails, a temperature at a given sensor exceeds a limit, or a disk drive fails in a monitored computer, the cluster scheduler can ignore that monitored computer until it is repaired. For small server farms, the cluster scheduler can request the state directly from the monitored computers it is scheduling. However, this approach does not scale, and it does not use historical data about the monitored computer. In addition, many processes other than the cluster scheduler often need to know about the health of a monitored computer in the server farm. One example of such a process is a file system that strives to ensure that file data exists on three different monitored computers in the server farm; if one of these monitored computers fails, its portion of the file data is replicated to another monitored computer to maintain the file data redundancy.
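By way of illustration only, the direct-request approach described above can be sketched as follows. The names `StateSignals`, `TEMP_LIMIT_C`, `is_schedulable`, and `schedulable_hosts`, the `poll` callable, and the temperature limit are all hypothetical and are not taken from any particular cluster scheduler.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StateSignals:
    """Hypothetical operational state signals for one monitored computer."""
    fan_ok: bool
    temperature_c: float
    disk_ok: bool


# Assumed limit for illustration; a real limit depends on the hardware.
TEMP_LIMIT_C = 85.0


def is_schedulable(signals: StateSignals) -> bool:
    """A computer is usable only if no component reports a fault."""
    return (
        signals.fan_ok
        and signals.disk_ok
        and signals.temperature_c <= TEMP_LIMIT_C
    )


def schedulable_hosts(
    poll: Callable[[str], StateSignals], hosts: Iterable[str]
) -> List[str]:
    """Request the current state of every host and keep the healthy ones.

    One request is issued per host on every scheduling pass, and only the
    current state is consulted; no history is retained.
    """
    return [host for host in hosts if is_schedulable(poll(host))]
```

Because the sketch issues one request per monitored computer on each pass, its cost grows with the size of the server farm, which is the scaling limitation noted above.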
It is known in the art how to obtain operational state signals for a monitored computer. For example, a monitored computer can periodically send its operational state signals to a monitoring system, a monitoring system can poll each monitored computer for its operational state signals, etc. Some monitoring systems gather signals from the monitored computers and trigger rules based on those signals. However, operational state signals for a single monitored computer provide only a snapshot of the current state of the monitored computer. While snapshots of operational state signals can indicate that the monitored computer has had an error, it is difficult to determine whether such an error is transitory or is a predictor of a component failure. In addition, these systems need to provide timely indicators of the status of the monitored computers to processes in the server farm (for example, the cluster schedulers and/or file system). They are therefore limited in how much data they can retain and in the number and complexity of rules they can apply to the data without becoming untimely. One example of the known art is the open-source Nagios® program.
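The limitation of snapshot-based rules can be illustrated with a minimal sketch. The signal name `disk_read_retries`, the rule, and the sample histories below are invented for illustration and do not reflect the rule syntax of any particular monitoring system.

```python
from typing import Mapping, Sequence


def snapshot_rule(signals: Mapping[str, int]) -> bool:
    """Hypothetical rule: fire when the current snapshot shows read retries."""
    return signals["disk_read_retries"] > 0


def latest(history: Sequence[Mapping[str, int]]) -> Mapping[str, int]:
    """Return the most recent snapshot, which is all a snapshot rule sees."""
    return history[-1]


# Invented per-interval retry counts for two monitored computers.
one_off = [{"disk_read_retries": n} for n in (0, 0, 0, 0, 2)]
worsening = [{"disk_read_retries": n} for n in (0, 1, 1, 2, 2)]

# Both computers present the same current snapshot, so the rule fires
# identically for each, even though their histories differ.
fires_for_one_off = snapshot_rule(latest(one_off))
fires_for_worsening = snapshot_rule(latest(worsening))
```

In this sketch the rule cannot separate an isolated error from a recurring one, because the earlier samples are never consulted.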
Many server farms do not have enough monitored computers to provide statistically significant amounts of data to reliably predict component failures. While very large server farms (containing tens of thousands of servers or more) can provide statistically significant amounts of data, the prior approaches for obtaining and analyzing historical operational state signals (and repair history) do not scale well, and it is difficult to efficiently detect when a monitored computer has failed or is ready to fail. In addition, ad-hoc approaches to determining the wellness state of a server lead to a proliferation of performance data formats and a diversity of programs used to determine the status of a particular monitored computer.
Existing monitoring systems show operational or non-operational status, but do not provide information regarding the likelihood of imminent failure. Measurements of temperatures, fan speeds, and internal device status (such as SMART status for disk drives, as well as disk read retries) can be useful in determining the present state of the computer. However, the present values of these measurements often do not indicate the imminence of failure. For example, one model of disk drive can have intermittent access failures with successful retries without these failures being a predictor of imminent disk drive failure, while similar failures in another model of disk drive can be a powerful predictor of imminent drive failure. Similarly, one disk drive model may be more sensitive to ambient temperature conditions than another disk drive model.
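The model dependence described above can be made concrete with a small sketch. The model names and the table `RETRIES_ARE_PREDICTIVE` are hypothetical placeholders, not measured characteristics of any real disk drive.

```python
# Hypothetical table: whether read retries are a failure predictor for
# a given drive model. The entries are invented for illustration only.
RETRIES_ARE_PREDICTIVE = {
    "model_a": False,  # retries occur intermittently without preceding failure
    "model_b": True,   # retries tend to precede drive failure
}


def suggests_imminent_failure(model: str, read_retries: int) -> bool:
    """Interpret the same measurement differently depending on drive model.

    Raises KeyError for a model that has no entry in the table, since the
    meaning of the measurement is unknown for that model.
    """
    return read_retries > 0 and RETRIES_ARE_PREDICTIVE[model]
```

The same retry count yields different conclusions for the two hypothetical models, so a single farm-wide threshold on the present value of the measurement would misclassify one of them.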
It is difficult to efficiently, quickly and reliably locate failed and/or ready-to-fail computers in a server farm that includes tens of thousands (or more) of monitored computers.