Computer systems perform many operations, and it is often important that those operations run with as little interruption as possible. For example, a network device, such as a switch or a router, may play a critical role in establishing and maintaining connections between nodes in a communications network. Therefore, it is important that the network device remain in operation.
In data centers housing multiple system components (e.g., routers, switches, host servers), numerous software modules run in each component and communicate with different processes. Occasionally, a module residing in a system component fails. For instance, a software bug in the module may cause inter-process communication (IPC) buffers to leak, such that the module is unable to drop the IPC messages it receives. The leak continues until a buffer overflow occurs, resulting in an exception or a core dump.
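The buffer leak described above can be illustrated with a minimal sketch. It is a simplified model, not the behavior of any particular product: the names `BufferPool`, `handle_message`, and `messages_handled_before_failure` are hypothetical, and the sketch assumes a fixed-size pool that raises an exception when exhausted, standing in for the overflow, exception, or core dump mentioned in the text.

```python
class BufferPoolExhausted(Exception):
    """Raised when no free IPC buffer is left in the pool (hypothetical)."""


class BufferPool:
    """A fixed-size pool of IPC buffers, tracked only by a free count."""

    def __init__(self, size):
        self.free = size

    def acquire(self):
        # Fail once every buffer has been handed out and never returned.
        if self.free == 0:
            raise BufferPoolExhausted("no free IPC buffers")
        self.free -= 1

    def release(self):
        self.free += 1


def handle_message(pool, message, buggy):
    """Process one IPC message; the buggy variant never releases its buffer."""
    pool.acquire()
    try:
        return message.upper()  # stand-in for real message processing
    finally:
        if not buggy:
            pool.release()  # the correct path returns the buffer to the pool


def messages_handled_before_failure(pool_size, total_messages, buggy):
    """Count how many messages are handled before the pool is exhausted."""
    pool = BufferPool(pool_size)
    handled = 0
    try:
        for i in range(total_messages):
            handle_message(pool, f"msg-{i}", buggy)
            handled += 1
    except BufferPoolExhausted:
        pass  # the module has failed; no further messages are handled
    return handled
```

Under these assumptions, a module with a pool of four buffers handles all ten of ten messages when it releases buffers correctly, but fails after the fourth message when every buffer leaks.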
Typically, if a system component in a data center enters an error state (e.g., due to a software bug), administrators have a limited number of options. One option is to restart the process that caused the problem. However, some system products might not support process restarts without rebooting the entire system component. In that case, the administrator reports the issue to the vendor and provides logs that allow an engineering team to diagnose and debug the issue. Once the vendor identifies the issue as being caused by a particular module, the vendor can provide a new image or an updated version of the module to the administrator. If the vendor provides a new image, the administrator shuts down the system, replaces the image, and reboots the system. If the vendor provides an updated version of the module, the administrator replaces the module before restarting the component, which involves terminating any processes running on the system component. Both options may disrupt the computing environment and yield unwanted consequences. This is especially undesirable in a large computing environment, where the downtime of any component or subsystem can result in long processing delays for other systems in the data center.
Further, in some circumstances, software bugs are difficult for the vendor to reproduce locally, and the debugging logs sent by the user may not give the vendor enough information to diagnose a software bug. For example, after receiving a report from a user, the vendor may request that the user issue commands to a process running inside the system component to help diagnose the problem. When the commands do not yield the information or values needed to diagnose the bug, the vendor can run a debugger on the live process instead. However, doing so is often disruptive, because the debugger may prevent other system components or subsystems from communicating with the process.