Susceptibility to failures is a key factor in the success of any network technology. Despite all the effort that has gone into making networking systems reliable and highly available, faults such as software bugs, hardware failures, and configuration errors are inevitable and can severely degrade the performance of a network. Therefore, in parallel with developing new debugging tools, it is necessary to improve support for recovery from errors.
Checkpoint and rollback recovery is a powerful approach for recovering from transient errors in servers and distributed systems. In this approach, a system periodically records its state during normal operation (checkpointing) and stores it in non-volatile storage; each saved state is referred to as a checkpoint. Upon failure, a previous correct state is restored and execution restarts from this intermediate state (the rollback process), thereby reducing the amount of lost computation. For long-running applications, this avoids a costly restart from the beginning.
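The checkpoint and rollback cycle described above can be illustrated with a minimal sketch. It is not the mechanism proposed in this work; it assumes a toy single-process computation whose state fits in a small dictionary, and the names `CheckpointedCounter`, `fail_at`, and the checkpoint file name are hypothetical, introduced only for illustration.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Toy long-running computation (hypothetical) with checkpoint support."""

    def __init__(self, path):
        self.path = path  # assumed location of the checkpoint file
        self.state = {"step": 0, "total": 0}

    def checkpoint(self):
        # Write to a temporary file, then rename it over the old checkpoint,
        # so a crash during the write cannot corrupt the last good checkpoint.
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def rollback(self):
        # Restore the most recent checkpointed state.
        with open(self.path) as f:
            self.state = json.load(f)

    def run(self, steps, interval, fail_at=None):
        # Perform `steps` units of work, checkpointing every `interval` steps.
        # `fail_at` simulates a transient failure once that step is reached.
        for _ in range(steps):
            if fail_at is not None and self.state["step"] == fail_at:
                raise RuntimeError("simulated transient failure")
            self.state["step"] += 1
            self.state["total"] += self.state["step"]
            if self.state["step"] % interval == 0:
                self.checkpoint()


def demo():
    with tempfile.TemporaryDirectory() as d:
        job = CheckpointedCounter(os.path.join(d, "ckpt.json"))
        job.checkpoint()  # initial checkpoint, so a rollback is always possible
        try:
            job.run(steps=10, interval=3, fail_at=7)
        except RuntimeError:
            job.rollback()  # discard work done since the last checkpoint
        resumed_from = job.state["step"]
        job.run(steps=10 - resumed_from, interval=3)  # redo only the lost steps
        return resumed_from, job.state
```

In this sketch the failure occurs after step 7, the last checkpoint was taken at step 6, and so only one step of work is repeated instead of all seven. The checkpoint interval trades the overhead of writing state against the amount of computation lost on a failure.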
Software-defined networking (SDN) is a network architecture that has gained significant interest in the networking industry. An SDN system offers programmability, centralized intelligence, and abstraction from the underlying network infrastructure. It would therefore be advantageous to be able to recover efficiently from errors in an SDN system.