A multiprocessor system may face a critical situation when one or more of its processors spends an inordinate amount of time handling external interrupts. An interrupt storm on a processor or in an operating system kernel is generally defined as the condition where the processor spends so much of its processing time in an interrupt context that processes or lower priority interrupts are blocked from normal execution. Two other indicators of this condition are a long run of contiguous interrupts arriving at the processor and a small number of interrupts that each take an extended amount of time to process.
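The three indicators described above can be illustrated with a minimal sketch. It is not a method taken from this document: the names `InterruptSample` and `classify_window`, and every threshold value (`busy_ratio`, `max_run`, `long_us`, `gap_us`), are hypothetical choices made only for illustration. The sketch assumes interrupts have been recorded as (start time, duration) pairs over one observation window.

```python
from dataclasses import dataclass


@dataclass
class InterruptSample:
    start_us: int     # time the handler was entered, in microseconds
    duration_us: int  # time spent in interrupt context, in microseconds


def classify_window(samples, window_us, busy_ratio=0.8, max_run=50,
                    long_us=5000, gap_us=10):
    """Return the set of storm indicators that fired in one window.

    All thresholds are illustrative assumptions, not recommended values.
    `samples` must be ordered by start time.
    """
    indicators = set()

    # Indicator 1: share of the window spent in interrupt context.
    in_irq = sum(s.duration_us for s in samples)
    if window_us > 0 and in_irq / window_us >= busy_ratio:
        indicators.add("time_share")

    # Indicator 2: a long run of back-to-back interrupts, where the next
    # handler starts within gap_us of the previous handler finishing.
    run = longest = 1 if samples else 0
    for prev, cur in zip(samples, samples[1:]):
        if cur.start_us - (prev.start_us + prev.duration_us) <= gap_us:
            run += 1
        else:
            run = 1
        longest = max(longest, run)
    if longest >= max_run:
        indicators.add("contiguous")

    # Indicator 3: at least one interrupt with an extended service time.
    if any(s.duration_us >= long_us for s in samples):
        indicators.add("long_handler")

    return indicators


# 100 interrupts of 95 us each, 5 us apart, in a 10 ms window.
storm = [InterruptSample(i * 100, 95) for i in range(100)]
print(classify_window(storm, window_us=10_000))
```

In this sketch an empty result means none of the indicators fired; a real kernel would have to choose its own thresholds and sampling method.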
In the past, the prevailing view has been that when a processor spends an inordinate amount of its processing power servicing interrupts, a good design choice is to replace the slower processor with a more powerful one. Because processors have steadily increased in processing power while relatively decreasing in cost, the problem of excessive interrupts has typically been addressed by applying more computing power. Unfortunately, increasing the processing power often increases the cost of a system, either by adding another processor or by including a more powerful and costly processor. Another option has been to prioritize incoming interrupts so that the most important interrupts are processed first. Regardless of how much additional processing power is available, there may be situations where the number of interrupts overwhelms even the fastest processor.
An interrupt storm may occur because of an excessive number of device interrupts from one or more devices or because of an error condition in a device. I/O (Input/Output) interrupts can present a particular problem to an operating system kernel because of the longer time required to service them. Even when individual interrupts take only a short time to process, a large number of them can block out processes that the kernel would otherwise execute.
One example of a situation that can cause an interrupt storm is a network router that is configured incorrectly or is experiencing a hardware failure. If the network router receives packets from one or more networks and then incorrectly sends all or a large part of those packets back to a single network, the receiving network server will be overwhelmed by interrupts for the network packets. In particular, the network card of the network server will generate an overwhelming number of I/O interrupts to the server processor and its operating system kernel.
In a similar situation, a network router may be misconfigured in a loopback arrangement where the network packets sent to the router by a server are bounced directly back to the network or server that sent them. This can also cause an interrupt storm. Further, a denial of service attack or a flood of communication packets are other situations where an interrupt storm may take place. Of course, there may simply be peripheral devices or network components that require a significant amount of interrupt attention or I/O.
Whatever the reason, such a situation can result in other important processes being blocked from executing on the processor. If the blocked process is a heartbeat timer or any other time sensitive process, the operating system or diagnostic software may be misled into flagging an error condition on the system, which in turn may activate unnecessary correction triggers. In some situations, specialized diagnostic software may be executing on the computer, and when that software does not receive the appropriate heartbeat and other processing signals, it may reboot the server because the server appears to have crashed. Thus, an interrupt storm may result in constant rebooting of the server if the storm cannot be accurately detected and properly dealt with.
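The false reboot described above can be made concrete with a small simulation. This is only a sketch under stated assumptions: the function `first_reboot_time`, its discrete tick model, and the optional `storm_ticks` argument are hypothetical and are not taken from any particular diagnostic product. It models a watchdog that declares the server dead once the heartbeat has been silent for longer than `timeout` ticks, and optionally one that is told when a storm is in progress.

```python
def first_reboot_time(heartbeats, timeout, horizon, storm_ticks=()):
    """Return the first tick at which the watchdog would reboot, or None.

    heartbeats  -- ticks at which the heartbeat process managed to run
    timeout     -- maximum tolerated silence, in ticks (assumed value)
    horizon     -- last tick to simulate
    storm_ticks -- ticks known to fall inside a detected interrupt storm;
                   the watchdog withholds judgement during these ticks
    """
    beats = set(heartbeats)
    storm = set(storm_ticks)
    last_beat = 0
    for now in range(horizon + 1):
        if now in beats:
            last_beat = now           # heartbeat received on time
        elif now in storm:
            continue                  # storm detected: do not flag an error
        elif now - last_beat > timeout:
            return now                # silence exceeded timeout: reboot
    return None


# The heartbeat is blocked during ticks 3..8 by an interrupt storm.
blocked = [0, 1, 2, 9, 10]
print(first_reboot_time(blocked, timeout=3, horizon=10))
print(first_reboot_time(blocked, timeout=3, horizon=10,
                        storm_ticks=range(3, 9)))
```

With these inputs the first call reports a reboot at tick 6 even though the server is still running, while the second call, which is informed of the storm, reports no reboot. How a storm would actually be detected and signaled to the diagnostic software is outside the scope of this sketch.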