The present invention relates to network debugging and, more particularly, to network debugging for a data center.
Modern data centers are very large, on the order of thousands of servers and network components. They all claim on-demand resource provisioning and 24×7 system availability to their users, though it is very hard to eradicate all hardware and software failures in such complex, large-scale infrastructures. There have been many recent incidents in which failures have led to unavailability of services, and the lack of efficient debugging has caused heavy commercial losses to businesses. Efficient debugging of failures and misconfigurations in such large-scale data centers is very difficult. Applications not only interact with each other but also interact with infrastructure services in diverse ways. Moreover, there is a strict visibility boundary between the application layer and the infrastructure layer, which limits the ability of data center operators to look inside the applications for debugging purposes. There is currently also no easy way for operators to verify the success of management operations in the data center. Therefore, efficient data-center-wide debugging is still an open research area in computer science.
Existing commercial or academic [1, 2, 5, 6] solutions have taken a microscopic approach, in which issues are diagnosed on specific servers or processes using domain knowledge (such as agents) or statistical techniques. Data-center-wide debugging using coarse-grained, lightweight monitoring remains a challenge. Previous techniques focus on extracting a per-application dependency graph (in most cases, using network flow concurrency or delay properties) and using it for diagnosis. Commercial solutions have relied on enterprise management solutions, which require agents to be installed with semantic knowledge of application protocols and application configuration files. Efforts have been ongoing to apply model checking to distributed states. Furthermore, researchers have tried instrumentation for tracing requests [3] and record and replay using distributed system logging [4] and network traffic [7]. The current approaches are far from practically deployable. Typically, the solutions require heavy instrumentation, resulting in significant overhead. Also, the commercial cloud is heterogeneous, which poses additional problems for instrumentation. To sum up, intrusive monitoring, scalability issues in deployment, network overhead, and insufficient data availability are some of the challenges in data center debugging.
OFDiff approaches the problem from a unique angle and takes advantage of OpenFlow's monitoring capabilities, which are built upon message exchange in its control plane. OFDiff captures network traffic from the OpenFlow controller for debugging purposes. Debugging is performed using logs of working and non-working states. To compare them, OFDiff models the application-level and infrastructure-level behavior of the data center for the corresponding logging periods. OFDiff first attempts to explain any changes in the application signatures captured from those logs (e.g., a different connectivity graph or a change in application response time) using operational tasks. The operational tasks are themselves identified from the traffic logs by pattern matching against the patterns of previously known tasks (learned offline, also from OpenFlow traffic logs). Application-layer changes detected by OFDiff that cannot be attributed to well-known operational tasks are further correlated with infrastructure-level changes to identify the problem class in the data center. Finally, the problem class is correlated with the system components for further troubleshooting.
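The comparison of working and non-working logs described above can be illustrated with a minimal Python sketch. It is not the OFDiff implementation: the `Flow` record, the function names, the host names, and the representation of a known operational task as a set of edges are all assumptions made for illustration. It shows only one application signature, the connectivity graph, being diffed between two logging periods, and then the removal of changes that a known task accounts for.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple

Edge = Tuple[str, str]


class Flow(NamedTuple):
    """Hypothetical flow record extracted from a controller traffic log."""
    src: str
    dst: str


def connectivity_graph(flows: Iterable[Flow]) -> Set[Edge]:
    """Collapse a flow log into a set of directed (src, dst) edges."""
    return {(f.src, f.dst) for f in flows}


def diff_graphs(working: Set[Edge], faulty: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Return the edges added and removed between the two logging periods."""
    return {"added": faulty - working, "removed": working - faulty}


def unexplained(changes: Dict[str, Set[Edge]],
                task_edges: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Drop the changes that a known operational task accounts for."""
    return {kind: edges - task_edges for kind, edges in changes.items()}


# Illustrative logs; the host names are made up.
working_log = [Flow("web", "app"), Flow("app", "db")]
faulty_log = [Flow("web", "app"), Flow("app", "db2"), Flow("app", "cache")]

# Assumption: a previously learned migration task explains db -> db2.
migration_edges = {("app", "db"), ("app", "db2")}

changes = diff_graphs(connectivity_graph(working_log),
                      connectivity_graph(faulty_log))
remaining = unexplained(changes, migration_edges)
```

In this sketch, whatever is left in `remaining` stands in for the application-layer changes that would next be correlated with infrastructure-level changes. A fuller model would also cover other signatures, such as response time.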
[1] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. Maltz, and M. Zhang, “Towards highly reliable enterprise network services via inference of multi-level dependencies,” in Proc. SIGCOMM'07, August 2007, pp. 13-24.
[2] X. Chen, M. Zhang, Z. M. Mao, and P. Bahl, “Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions,” in Proc. of OSDI, 2008.
[3] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, “X-Trace: A Pervasive Network Tracing Framework,” in Proc. USENIX NSDI, Cambridge, Mass., USA, 2007.
[4] D. Geels, G. Altekar, S. Shenker, and I. Stoica, “Replay debugging for distributed applications,” in Proc. USENIX Annual Technical Conference, 2006.
[5] S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, and P. Bahl, “Detailed diagnosis in enterprise networks,” in Proc. SIGCOMM, 2009.
[6] L. Popa, B.-G. Chun, I. Stoica, J. Chandrashekar, and N. Taft, “Macroscope: End-Point Approach to Networked Application Dependency Discovery,” in Proc. ACM CoNEXT, 2009.
[7] A. Wundsam, D. Levin, S. Seetharaman, and A. Feldmann, “OFRewind: Enabling Record and Replay Troubleshooting for Networks,” in Proc. USENIX ATC, 2011.