Modern companies operate servers, personal computers, and other computing devices as part of their day-to-day operations. In many cases, a significant portion of the company's mission involves the operation of such devices. For example, financial institutions provide customers with up-to-the-minute details about their accounts. Downtime can inconvenience customers and prompt complaints that reflect poorly on the company. Content providers rely on the operation of their servers to deliver content to their customers. Downtime for these companies can reduce customer interest, potentially reducing demand for advertisements, which are often the sole source of the company's revenue. Similarly, network retailers rely on their servers to process orders. Any downtime experienced by these businesses may not only harm their reputation or affect advertising rates; it can also reduce revenue when potential customers go elsewhere to make their purchases. These are just a few examples of companies that typically have large groups of servers required to operate around the clock.
One problem, among others, that arises when relying on large groups of servers is that it can be difficult to monitor the key operating parameters of each individual machine and to determine when an anomaly has occurred or is occurring. Even when the key operating parameters are monitored, it can be difficult to determine which measurements are normal and which are anomalous. Some companies utilize monitoring systems that require them to specify the normal range for measurements of the key operating parameters and the absolute thresholds beyond which the measurements may indicate anomalies. System administrators who configure these monitoring systems determine the normal operating range based on their own anecdotal evidence or on recommendations from other system administrators, whose systems may be operating in an entirely different environment.
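
By way of illustration only, the kind of manually configured threshold check described above might be sketched as follows. The names `StaticThreshold` and `classify`, the choice of CPU utilization as the monitored parameter, and all numeric limits are hypothetical and are not drawn from any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticThreshold:
    """Hypothetical administrator-supplied limits for one operating parameter."""
    normal_low: float
    normal_high: float
    absolute_low: float
    absolute_high: float


def classify(value: float, limits: StaticThreshold) -> str:
    """Label a single measurement against fixed, manually chosen limits."""
    # Beyond an absolute threshold: the measurement may indicate an anomaly.
    if value < limits.absolute_low or value > limits.absolute_high:
        return "anomalous"
    # Inside the range the administrator declared to be normal.
    if limits.normal_low <= value <= limits.normal_high:
        return "normal"
    # Between the normal range and the absolute thresholds.
    return "outside_normal_range"


# Illustrative limits only (assumed values): CPU utilization in percent.
CPU_PERCENT = StaticThreshold(
    normal_low=5.0, normal_high=70.0, absolute_low=0.0, absolute_high=95.0
)

if __name__ == "__main__":
    for reading in (50.0, 80.0, 98.0):
        print(reading, classify(reading, CPU_PERCENT))
```

Because the limits in such a sketch are fixed by hand, a server that routinely runs at 80 percent utilization in its own environment would be reported as outside the normal range on every reading, even though that level may be entirely ordinary for that machine.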