1. Field of the Invention
The present invention generally relates to a multi-processor system. More particularly, the invention relates to the detection of corrupted data in a multi-processor system. More particularly still, the invention relates to the detection of corrupted data and replacement of the corrupted data with a predetermined value to indicate to the rest of the system that a transmission error has occurred that has already been detected.
2. Background of the Invention
It often is desirable to include multiple processors in a single computer system, This is especially true for computationally intensive applications and applications that otherwise can benefit from having more than one processor simultaneously performing various tasks. It is not uncommon for a multi-processor system to have 2 or 4 or more processors working in concert with one another. Typically, each processor couples to at least one and perhaps three or four other processors.
Such systems usually require data and commands (e.g., read requests, write requests, etc.) to be transmitted from one processor to another. For the data or commands to pass from the source to the destination, the transmission may have to pass through one or more intervening processors interconnecting the source and the destination processors. Accordingly, messages can be passed from one processor to another and another with the intervening processors simply forwarding the message on to the next processor in the communication path.
A desirable feature of such systems is to be able to detect the presence of corrupted data and, if possible, correct the corrupted data. A data packet might have one of its bits reverse logic state (i.e., switch from a 0 to a 1 or 1 to a 0) at some point between the source and the destination. Further, more than one bit in a data packet might improperly change state. A single bit in a data packet that becomes corrupted is referred to as a “single bit error” and more than one bit becoming corrupted is a “multi-bit error.” There are a variety of causes of such corruption. For example, cosmic radiation can change the state of individual gates causing a bit to change state. Further, it is possible for electromagnetic interference generated by nearby electronics to effect the electrical state gates in a multi-processor system. Regardless of the source of the data corruption it is desirable to be able to detect that the corruption has occurred and, if possible, correct the problem.
A variety of error detection schemes have been suggested and used. Some techniques are capable of only detecting single bit errors, while other techniques can detect double-bit errors. Further, some techniques also include error correction to permit the corrupted bit or bits to be corrected. Such error correction techniques generally require detecting, not only that an error has occurred, but also the identification of which bit(s) is erroneous. Some systems will be able to detect that a multi-bit error has occurred, but not be able to determine which bit is erroneous and thus be unable to correct the problem. There is a tradeoff between the capabilities of an error detection and correction scheme and its complexity. For instance, single bit detection and correction schemes are generally less complex than multi-bit error detection and correction schemes but cannot correct more than one corrupted bit at a time.
Whatever type of error detection and correction scheme is chosen for implementing in a given multi-processor system, a problem still remains as to what to do with those errors that can be detected, but not corrected. In conventional systems, there generally have been two choices. On one hand, the message containing the detected, but uncorrectable, error can be halted and not retransmitted to the next processor in the communication path. This approach advantageously isolates the error, but can cause the system to “deadlock” meaning that the system generally becomes unusable. Deadlock can occur when future tasks that the processors are to perform are contingent upon a particular data message. If that message is stopped due to a corrupted bit or bits, the system will not be able to determine what action to perform next.
Alternatively, the message with the corrupted bit can be forwarded on to the next processor in the communication link. Deadlock is avoided in this case, as the message is sent. However, each processor that receives the message will detect the error and signal an error event (typically by asserting an error flag). For a message with an error that passes through 10 processors, all 10 processors will signal an error. With 10 processors all indicating the same error, error isolation becomes problematic. That is, determining the source of the error becomes difficult, if not impossible.
Accordingly, a need exists to efficiently and effectively handle errors in a multi-processor system that can detect, but not necessarily correct the error. Such a system should be able to detect the error, preclude the system from becoming deadlocked and permit the error to be efficiently isolated. To date, no such system is known to exist.