Computer systems need to ensure that unexpected failures do not result in data being written to permanent storage without an error indication. If corrupted data are written to disk, for example, the application data can be corrupted in a way that could be nearly impossible to detect and fix.
Many containment strategies employed in computer systems are premised on the prevention of bad data infiltrating permanent storage. The usual way to prevent this is through a process called containment. Containment means that the error is contained to portions of the system outside of the disk. Typically, systems maintain containment by stopping all Direct Memory Access or “DMA” traffic after an error is signaled. More specifically, the standard technology for error containment in most bus-based computers is provision of a wire (or set of wires), which signal the errors among the devices on the bus. The error signal might be a special signal on the bus which all the bus agents can read, or a special transaction that is sent on the bus so that all units in the system may be notified. As such, units in the prior art detect the error signal within a cycle or two of the source unit asserting it. Thereafter, the receiving units perform whatever action is needed for containment.
The primary reason for effectuating error containment in this manner is that it is very inexpensive and relatively easy, given that it generally only requires one wire, all agents can read the error simultaneously, and act on it. Typically, such systems can also have different severities of error indicators when more than one wire is used. Most systems which utilize this type of error containment have at least one fatal error type indications which reflects that some containment has been lost, that both the normal system execution and DMA should stop, and the processor should go to recovery mode.
A large, distributed system cannot use this type of signaling, however. Too many dedicated wires are required to interconnect each component of the system, leading to a system which is too complex for routine use. As such, the prior art methodologies of the type described above are useful primarily for small scale system schemes which use dedicated wires to signal errors, and are not suitable for large scale situations. Further, timing problems result if prior art systems have different clock domains. Moreover, this type of error containment is not suited for use in systems which (1) are not on a shared bus system or (2) where cells or agents communicate on a packet basis. The prior art methodologies are not well suited to packet based systems because there is no simple way to propagate an error, given the use of a shared wire. If one were to implement the prior art methodology on a large distributed system, it would entail many shared wires with a central hub agent, collecting error information together and redistributing it. The end result of this is a structure that adds complexity to system infrastructure and is substantially more expensive than implementation thereof on a small scale bus-based system.
Another potential solution taught in the prior art, for packet-based systems, is a special, packet type which indicates an error, which is then sent around the system to each agent with which the system was communicating. Such strategies, however, involve complexity in that they require the system to send an extra error indication packet, and require all the receivers to then act on the packet in a very specific way. As such, there is a need in the prior art for a practical, large scale methodology for error containment in packet-based communication systems between computers.