The present invention relates to methods and systems for monitoring and detecting failure of nodes in a data center environment by using a software defined failure detector that can be adjusted to varying conditions and data center topology.
Modern data centers typically contain large numbers of computer systems organized and connected using a number of interconnected networks. In turn, each computer system may implement containers, virtual machines, and processes. Monitoring such large numbers of processes, containers, virtual machines, or physical computers and detecting their failures is a necessary component of any distributed and/or fault-tolerant system. A monitoring system for a data center must continuously monitor all machines in the data center and quickly and accurately identify failures.
Performance requirements for failure detection include factors such as the speed of detection (how quickly a failure is detected and reported), the accuracy of detection (minimal false positives, the ability to detect complex failures such as partial network failures, etc.), and scalability (how many nodes can be monitored, and what is involved in increasing or decreasing the number of nodes monitored). Many conventional solutions exist for failure detection and monitoring of large clusters of hardware and software objects, but they have limitations. For example, many conventional approaches require that the system topology be fixed and coded into the implementation. Once deployed, the topology cannot be easily changed. Many conventional solutions are also targeted at a setting where the network is quite flat; that is, a node is monitored over a single local area network (LAN) or through a single network interface controller (NIC). In modern data centers, a node may be connected to many networks (for example, Ethernet on separate LANs, a torus, wireless, etc.). The fact that one route to a node is down does not mean that the node itself, or all routes to the node, are down. However, mapping the monitoring topology to the underlying structure is difficult because every deployment is different.
Accordingly, a need arises for techniques for flexible, scalable monitoring of nodes and networks that can be adjusted to varying conditions and data center topology.
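By way of illustration only, the observation above that one failed route to a node does not imply that the node itself has failed may be sketched as follows. This is a minimal sketch under stated assumptions and is not the claimed failure detector: the names `NodeStatus` and `classify_node`, the route labels, and the three-way classification are hypothetical and do not appear elsewhere in this description.

```python
from enum import Enum
from typing import Mapping


class NodeStatus(Enum):
    """Hypothetical status values, assumed for illustration only."""
    UP = "up"              # every monitored route reaches the node
    DEGRADED = "degraded"  # at least one route fails, at least one succeeds
    DOWN = "down"          # no monitored route reaches the node
    UNKNOWN = "unknown"    # no routes are being monitored


def classify_node(route_reachable: Mapping[str, bool]) -> NodeStatus:
    """Combine per-route reachability into a single node status.

    route_reachable maps a route label (for example "lan-a") to whether
    the most recent probe over that route succeeded.
    """
    if not route_reachable:
        return NodeStatus.UNKNOWN
    reachable = sum(1 for ok in route_reachable.values() if ok)
    if reachable == len(route_reachable):
        return NodeStatus.UP
    if reachable == 0:
        # Presumed failed: nothing distinguishes the node from its networks.
        return NodeStatus.DOWN
    # A partial failure: the node still answers on some route.
    return NodeStatus.DEGRADED


if __name__ == "__main__":
    # One LAN is down, but the node still answers over the torus network.
    probes = {"lan-a": False, "lan-b": True, "torus": True}
    print(classify_node(probes).value)
```

In this sketch a detector that observed only the route labeled `lan-a` would report the node as failed, whereas combining the routes reports a partial failure. A flat, single-network monitoring arrangement cannot make that distinction.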