In many systems, whether physical, biological, or informational, trouble produces various symptoms of component fault or failure that are periodically observed yet rarely point directly to what lies beneath: the common root cause of the different symptoms. For complex systems such as the human body, an airliner, an automobile, the Internet, the national power grid, or financial systems, the sheer size or detail of the system can make it daunting to trace or surmise a “true cause” underlying an assortment of currently occurring symptoms or component failures. Even in a system as commonplace as a desktop computer, virtuoso troubleshooters can have difficulty bridging the gap between the symptoms of a fault (system misbehaviors and/or cryptic error messages) and the component responsible for the failure (hard drive, memory, operating system, or software application).
The Internet has become vast and wild in its global reach and in its obscure levels of lower-layer plumbing. A significant failure, even on a higher layer of Internet architecture, can be hard to diagnose; that is, it can be hard to identify the true root cause that manifests as thousands of symptoms.
In some ways similar to its cousins, the electric power grid and the public telephone system, the Internet is not monolithic but consists of numerous autonomous systems, such as Internet Service Providers (ISPs), for example AT&T and COMCAST. These autonomous systems are connected to each other and to content providers. If one autonomous system fails, it may affect its own end-users, but it may also affect secondary autonomous systems upstream and downstream that depend on it, and subsequent autonomous systems that depend on the secondary autonomous systems, and so on, in a domino effect.
An Internet content provider typically has one or more data centers with many web servers (examples of large content providers include MSN, YAHOO, and GOOGLE). Content providers, large and small, are typically businesses and expect to receive a certain number of requests for content from different autonomous systems over time. Most content providers may not know and cannot control the many autonomous systems that lie between themselves and their end-users or customers. But when a failure affects one of the autonomous systems in the content provider's service chain, the content provider loses money through lost advertising, lost purchasing, lost dissemination of information, etc.
When a problem is occurring, it takes some sophistication even to be able to sense the symptoms of a failure. Conventionally, the courses of action are limited once a problem reaches a human level of awareness. One course of action is to wait until the autonomous system that has the failure fixes itself. The other is to dispatch an engineer to explore each individual failure symptom, however many there may be.
Ideally, the choice between multiple courses of action in response to failure symptoms depends on a prioritization of the underlying problem, based on its magnitude, duration, and frequency of recurrence, as well as the expected cost of each course of action. While conventional methods can evaluate the priority of individual failure symptoms, the significance of the underlying problem is manifested only by the combined (e.g., summed) significance of its associated failure symptoms.
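As a minimal illustration of why combined significance matters, the sketch below sums per-symptom significance over groups of symptoms and ranks the groups. The symptom records, the group labels, and the numeric significance values are hypothetical and chosen only for illustration; how symptoms come to be grouped is assumed to be given and is not shown here.

```python
from collections import defaultdict

# Hypothetical symptom records: (symptom_id, problem_group, significance).
# The grouping and the significance scores are assumed inputs.
symptoms = [
    ("s1", "A", 2.0),
    ("s2", "A", 1.5),
    ("s3", "B", 3.0),
    ("s4", "A", 1.0),
]

def rank_problems(records):
    """Sum symptom significance per group and sort groups, highest total first."""
    totals = defaultdict(float)
    for _symptom_id, group, significance in records:
        totals[group] += significance
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = rank_problems(symptoms)
```

In this made-up data the single most significant symptom (s3) belongs to group B, yet group A ranks first once its three symptoms are summed, which is the situation that per-symptom prioritization alone would miss.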
What is needed is a way to automatically group together the failure symptoms that occur in complex systems, with little or no a priori knowledge of the system structure, in order to quickly identify high-priority problems, as well as to point up root causes so that when a rash of failures occurs, a root cause of many simultaneous failure symptoms may be quickly located and resolved.