The invention relates in general to error containment in computer systems, and in particular, to error containment for large scale, packet based systems.
Computer systems need to ensure that unexpected failures do not result in data being written to permanent storage without an error indication. If corrupted data are written to disk, for example, the application data can be corrupted in a way that could be nearly impossible to detect and fix.
Many containment strategies employed in computer systems are premised on the prevention of bad data infiltrating permanent storage. The usual way to prevent this is through a process called containment. Containment means that the error is contained to portions of the system outside of the disk. Typically, systems maintain containment by stopping all Direct Memory Access or xe2x80x9cDMAxe2x80x9d traffic after an error is signaled. More specifically, the standard technology for error containment in most bus-based computers is provision of a wire (or set of wires), which signal the errors among the devices on the bus. The error signal might be a special signal on the bus which all the bus agents can read, or a special transaction that is sent on the bus so that all units in the system may be notified. As such, units in the prior art detect the error signal within a cycle or two of the source unit asserting it. Thereafter, the receiving units perform whatever action is needed for containment.
The primary reason for effectuating error containment in this manner is that it is very inexpensive and relatively easy, given that it generally only requires one wire, all agents can read the error simultaneously, and act on it. Typically, such systems can also have different severities of error indicators when more than one wire is used. Most systems which utilize this type of error containment have at least one fatal error type indications which reflects that some containment has been lost, that both the normal system execution and DMA should stop, and the processor should go to recovery mode.
A large, distributed system cannot use this type of signaling, however. Too many dedicated wires are required to interconnect each component of the system, leading to a system which is too complex for routine use. As such, the prior art methodologies of the type described above are useful primarily for small scale system schemes which use dedicated wires to signal errors, and are not suitable for large scale situations. Further, timing problems result if prior art systems have different clock domains. Moreover, this type of error containment is not suited for use in systems which (1) are not on a shared bus system or (2) where cells or agents communicate on a packet basis. The prior art methodologies are not well suited to packet based systems because there is no simple way to propagate an error, given the use of a shared wire. If one were to implement the prior art methodology on a large distributed system, it would entail many shared wires with a central hub agent, collecting error information together and redistributing it. The end result of this is a structure that adds complexity to system infrastructure and is substantially more expensive than implementation thereof on a small scale bus-based system.
Another potential solution taught in the prior art, for packet-based systems, is a special, packet type which indicates an error, which is then sent around the system to each agent with which the system was communicating. Such strategies, however, involve complexity in that they require the system to send an extra error indication packet, and require all the receivers to then act on the packet in a very specific way. As such, there is a need in the prior art for a practical, large scale methodology for error containment in packet-based communication systems between computers.
These and other objects, features and technical advantages are achieved by a system and method which is generally directed to a simple mechanism for use by a distributed system as an error indication in an existing protocol packet, for informing receiving units in a distributed system that a particular unit may have corrupted data. Proper handling by receiving units of the inventive indication prevents corrupted data from being propagated to a permanent storage medium, thereby maintaining containment.
The invention provides containment for, e.g., arbitrary usage patterns, when preferably installed within the framework of directory based coherent systems, and requires no application level code changes, and can be implemented completely in the fabric without any changes to the processor core. In order to accomplish this, the present invention preferably uses a fixed position bit in the packet header to indicate an error status of the source in a non bit or hardware intensive manner. The protocols of the invention are further defined in such a way that all receivers of packets must respond to packets with this error indicator in order to achieve containment. The invention thus forces a system to pass on an error indication to other receiving units and requires the CPU to both stop processing and to implement the recovery handler in order that the packets can flow freely.
This methodology provides for error containment by using just one bit in the packet header. As a result, there is no need to install wires or to involve other cumbersome prior art structures in order to provide effective error containment. By using a bit which is already used to communicate between units in the protocol, a scalable, simple error containment strategy results. Moreover, the invention is easily expanded to provide containment for less severe errors, such as those involving shared memory regions used between systems in high-speed communications. Because the data being processed within the system can never be processed by an agent which does not read the error indication on the data, the present invention provides for near perfect containment, and offers flexibility for a system such that it can be scaled up to larger numbers of units easily.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.