The present invention relates, in general, to converged networks, and in particular, to soft error recovery.
There are two generally accepted definitions of errors in computer hardware and networks: soft errors and hard errors. Hard errors are the result of broken hardware, e.g. hardware with defects for one reason or another. These errors are repeatable. Soft errors are also known as transient errors and are usually not repeatable. Soft errors are random in nature and are caused by noise in the system such as high-energy particles (alpha, beta, gamma, etc.), electrical interference, clock jitter, etc. The susceptibility of hardware and networks to soft errors is determined by the robustness of the design.
One major concern with errors, particularly in a datacenter network, is “silent data corruption” (SDC), which may be caused by either soft or hard errors. SDC refers to altered data that went undetected because checking mechanisms were either insufficient or absent. In other words, SDC is the same as an undetected error that leads to data corruption. It should be noted that some undetected errors cause no problems but are still considered SDC.
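The difference between a detected error and SDC can be illustrated with a minimal sketch. It assumes a CRC-32 check value stands in for a "checking mechanism"; the helper `flip_bit`, the sample payload, and the chosen bit position are hypothetical and are not taken from the described system.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, emulating a soft error."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Hypothetical payload and a check value computed before the fault occurs.
payload = b"datacenter packet payload"
stored_crc = zlib.crc32(payload)

# A single-bit upset alters the data in transit or in memory.
corrupted = flip_bit(payload, 13)

# With no checking mechanism, the altered bytes are simply accepted: an SDC.
altered_without_check = corrupted != payload

# With the check value, the same upset becomes a detected error.
detected_with_check = zlib.crc32(corrupted) != stored_crc
```

Both flags evaluate to true here: the data was altered, and only the presence of the check value turns the event from silent corruption into a detected error.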
Current industry standard approaches for converged datacenter networks are susceptible to soft errors due to a variety of factors, including the high cost of radiation chamber testing and radiation hardening. This susceptibility extends to many of the new cloud data centers. Soft errors may occur because of radiation events, such as particle strikes (e.g. from cosmic rays and alpha particles), that interfere with the network. These radiation events may lead to transient errors in hardware and to undetected state changes in software.
Soft errors in a network switch may affect both its data plane, such as the crossbar/shared memory and input/output switch ports, and its control plane, such as the switch operating system (OS). This may lead to multiple errors, including misrouting by gateway routers in a datacenter, which may send packets to erroneous external locations, misclassification of packets, and misclassification of the availability of switches. Soft errors may also affect the packet processing, compute, and memory elements of a switch.
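As a rough illustration of how a single upset in switch memory can cause misrouting, the following sketch models a toy forwarding table. The table layout, the prefixes, the port numbers, and the per-entry parity bit are all assumptions made for this example; they do not describe any particular switch.

```python
def parity(value: int) -> int:
    """Parity bit of value: 1 if it has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


def make_entry(port: int) -> tuple[int, int]:
    """Store the egress port together with a parity bit computed at write time."""
    return (port, parity(port))


def inject_upset(table: dict, dest: str, bit: int) -> None:
    """Flip one bit of the stored port, leaving the stored parity untouched."""
    port, stored = table[dest]
    table[dest] = (port ^ (1 << bit), stored)


def lookup_unchecked(table: dict, dest: str) -> int:
    """Return the stored port without any integrity check."""
    return table[dest][0]


def lookup_checked(table: dict, dest: str) -> int:
    """Return the stored port, raising if it no longer matches its parity bit."""
    port, stored = table[dest]
    if parity(port) != stored:
        raise ValueError(f"parity mismatch for {dest}")
    return port


# Hypothetical forwarding table: destination prefix -> (egress port, parity).
table = {"10.0.0.0/24": make_entry(3), "10.0.1.0/24": make_entry(5)}

# A particle strike flips bit 2 of the first entry: port 3 becomes port 7.
inject_upset(table, "10.0.0.0/24", bit=2)
```

After the upset, `lookup_unchecked` returns port 7 for the first prefix, so packets would leave on the wrong port with no indication of a fault, whereas `lookup_checked` raises an error for that entry and still returns port 5 for the untouched one. A single parity bit only catches an odd number of flipped bits, so it is used here purely to keep the example short.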