Diagnosing component failures and detecting topology mis-configurations are important goals in distributed computing. Health monitoring and the automated diagnosis and localization of failures are particularly important in large-scale distributed systems. Existing solutions for automated failure diagnosis require complete knowledge of the component associations in the system. Examples of known solutions are:
Reference 1: R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren. “IP Fault Localization via Risk Modeling.” In Proceedings of Networked Systems Design and Implementation (NSDI), 2005.
Reference 2: Minaxi Gupta and Mani Subramanian. “Preprocessor Algorithm for Network Management Codebook.” USENIX 1st Workshop on Intrusion Detection and Monitoring (ID), 1999.
Reference 3: Srikanth Kandula, Dina Katabi, and Jean-Philippe Vasseur. “Shrink: A Tool for Failure Diagnosis in IP Networks.” ACM SIGCOMM Workshop on Mining Network Data (MineNet-05), Philadelphia, Pa., August 2005.
These known solutions rely heavily on completely known component associations to diagnose component failures. However, part of this information is often unavailable: in many real-world distributed systems, topology or failure-association information is incomplete, if not missing entirely. Existing solutions cannot be applied directly in such scenarios. Even when complete association information is given, it is usually configured manually or semi-manually, so mis-configuration due to human error is inevitable. A new solution is needed that copes with missing association information, enables failure diagnosis, and detects potential mis-configurations.
It would thus be desirable to overcome the limitations in previous approaches.