Not applicable.
1. Field of the Invention
The present invention generally relates to a multi-processor system. More particularly, the invention relates to the detection of corrupted data in a multi-processor system. More particularly still, the invention relates to the detection of corrupted data and replacement of the corrupted data with a predetermined value to indicate to the rest of the system that a transmission error has occurred that has already been detected.
2. Background of the Invention
It often is desirable to include multiple processors in a single computer system. This is especially true for computationally intensive applications and applications that otherwise can benefit from having more than one processor simultaneously performing various tasks. It is not uncommon for a multi-processor system to have two, four, or more processors working in concert with one another. Typically, each processor couples to at least one and perhaps three or four other processors.
Such systems usually require data and commands (e.g., read requests, write requests, etc.) to be transmitted from one processor to another. For the data or commands to pass from the source to the destination, the transmission may have to pass through one or more intervening processors interconnecting the source and the destination processors. Accordingly, messages can be passed from one processor to another and another with the intervening processors simply forwarding the message on to the next processor in the communication path.
A desirable feature of such systems is to be able to detect the presence of corrupted data and, if possible, correct the corrupted data. A data packet might have one of its bits reverse logic state (i.e., switch from a 0 to a 1 or 1 to a 0) at some point between the source and the destination. Further, more than one bit in a data packet might improperly change state. A single bit in a data packet that becomes corrupted is referred to as a “single bit error” and more than one bit becoming corrupted is a “multi-bit error.” There are a variety of causes of such corruption. For example, cosmic radiation can change the state of individual gates, causing a bit to change state. Further, it is possible for electromagnetic interference generated by nearby electronics to affect the electrical state of gates in a multi-processor system. Regardless of the source of the data corruption, it is desirable to be able to detect that the corruption has occurred and, if possible, correct the problem.
A variety of error detection schemes have been suggested and used. Some techniques are capable of only detecting single bit errors, while other techniques can detect double-bit errors. Further, some techniques also include error correction to permit the corrupted bit or bits to be corrected. Such error correction techniques generally require not only detecting that an error has occurred, but also identifying which bit or bits are erroneous. Some systems will be able to detect that a multi-bit error has occurred, but not be able to determine which bits are erroneous and thus be unable to correct the problem. There is a tradeoff between the capabilities of an error detection and correction scheme and its complexity. For instance, single bit detection and correction schemes are generally less complex than multi-bit error detection and correction schemes but cannot correct more than one corrupted bit at a time.
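By way of illustration only, the limited capability of a simple detection scheme can be sketched with a single even-parity bit. The sketch below is not taken from any particular system; the function names `parity_bit` and `has_error` and the sample word are assumptions chosen for the example. It shows that a single flipped bit is detected while two flipped bits go unnoticed, and that parity alone gives no indication of which bit is wrong.

```python
def parity_bit(data: int) -> int:
    """Return the even-parity bit for `data` (1 if the count of 1 bits is odd)."""
    return bin(data).count("1") % 2


def has_error(data: int, stored_parity: int) -> bool:
    """Report whether the received data disagrees with its stored parity bit."""
    return parity_bit(data) != stored_parity


word = 0b10110010            # hypothetical 8-bit data word
stored = parity_bit(word)    # parity computed at the source

single_flip = word ^ 0b00000100   # one bit changes state in transit
double_flip = word ^ 0b00010100   # two bits change state in transit

single_detected = has_error(single_flip, stored)
double_detected = has_error(double_flip, stored)
```

Under these assumptions `single_detected` is true and `double_detected` is false, which is why schemes intended to catch multi-bit errors carry additional check bits at the cost of added complexity.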
Whatever type of error detection and correction scheme is chosen for implementing in a given multi-processor system, a problem still remains as to what to do with those errors that can be detected, but not corrected. In conventional systems, there generally have been two choices. On one hand, the message containing the detected, but uncorrectable, error can be halted and not retransmitted to the next processor in the communication path. This approach advantageously isolates the error, but can cause the system to “deadlock,” meaning that the system generally becomes unusable. Deadlock can occur when future tasks that the processors are to perform are contingent upon a particular data message. If that message is stopped due to a corrupted bit or bits, the system will not be able to determine what action to perform next.
Alternatively, the message with the corrupted bit can be forwarded on to the next processor in the communication link. Deadlock is avoided in this case, as the message is sent. However, each processor that receives the message will detect the error and signal an error event (typically by asserting an error flag). For a message with an error that passes through 10 processors, all 10 processors will signal an error. With 10 processors all indicating the same error, error isolation becomes problematic. That is, determining the source of the error becomes difficult, if not impossible.
Accordingly, a need exists to efficiently and effectively handle errors in a multi-processor system that can detect, but not necessarily correct, such errors. Such a system should be able to detect the error, preclude the system from becoming deadlocked and permit the error to be efficiently isolated. To date, no such system is known to exist.
The problems noted above are solved in large part by a multi-processor system in which each processor can receive a message from one or more other processors in the system. The message may contain corrupted data that was corrupted during transmission from the preceding processor. Upon receiving the message, the processor detects that a portion of the message contains corrupted data. The processor then replaces the corrupted portion with a predetermined bit pattern that is known to or otherwise programmed into all other processors in the system. The predetermined bit pattern indicates that a data transmission error has occurred in the corresponding portion of the message. The processor that detects the error in the message preferably alerts the system, for example by setting an error flag, that an error has been detected. The message now containing the predetermined bit pattern in place of the corrupted data can be retransmitted to another processor. The predetermined bit pattern will indicate that an error in the message was detected by the previous processor. In response, the processor detecting the predetermined bit pattern preferably will not alert the system of the existence of an error. The same message with the predetermined bit pattern then can be retransmitted to other processors which also will detect the presence of the predetermined bit pattern and in response not alert the system of the presence of the error. As such, because only the first processor to detect an error alerts the system of the error and because messages containing uncorrectable errors still are transmitted through the system, fault isolation is improved and the system is less likely to fall into a deadlock condition.
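The behavior summarized above can be sketched, purely for illustration, as a chain of processors forwarding one message. The sketch is a simplified model and not the disclosed hardware: the `Processor` class, the `POISON` marker standing in for the predetermined bit pattern, and the `corrupted` argument used to model a link error are all assumptions introduced for the example.

```python
POISON = "<POISON>"  # hypothetical stand-in for the predetermined bit pattern


class Processor:
    def __init__(self, name):
        self.name = name
        self.error_flag = False  # set only when this processor itself detects an error

    def receive(self, portions, corrupted=()):
        """Return the message to forward to the next processor.

        `corrupted` lists the indexes damaged on the link into this processor.
        This is a modeling assumption; real detection would rely on check bits.
        """
        forwarded = []
        for index, portion in enumerate(portions):
            if portion == POISON:
                # Error already reported upstream: forward without alerting.
                forwarded.append(portion)
            elif index in corrupted:
                # First detection: alert the system and substitute the pattern.
                self.error_flag = True
                forwarded.append(POISON)
            else:
                forwarded.append(portion)
        return forwarded


chain = [Processor(f"P{i}") for i in range(4)]
message = ["a", "b", "c"]
for hop, processor in enumerate(chain):
    damaged = (1,) if hop == 1 else ()  # corruption occurs on the link into P1
    message = processor.receive(message, corrupted=damaged)

flags = [p.error_flag for p in chain]
```

In this model only the second processor sets its error flag, and the message still reaches the end of the chain with the marker in place of the damaged portion, so the error is both isolated and delivered.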
Each processor preferably includes a memory controller for connection to a memory device, an interface to an input/output controller, a router for connection to one or more other processors, and other components. The router transmits and receives messages to and from other processors in the system. The router also detects transmission errors and replaces the erroneous portion with the predetermined bit pattern.
Each message preferably includes multiple “ticks” of data with each tick comprising multiple bits of information including error check bits. The error check bits permit the router to detect transmission errors and may permit correction of the erroneous bits. Some types of errors, however, are uncorrectable given the number of error check bits. These uncorrectable errors can be detected but cannot be corrected. Upon detecting an uncorrectable error in a tick, the router replaces all of the bits in the corrupted tick with the predetermined bit pattern. Data ticks include multiple data bits and multiple error check bits. An exemplary predetermined bit pattern includes all 1's to replace the data bits and an otherwise unused value to replace the error check bits.
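The tick-level replacement can be sketched as follows, as a toy model only. The field widths (8 data bits, 4 check bits), the check code (a count of 1 bits, which never takes the value 0xF for 8-bit data), and the names `check_bits`, `route_tick`, and `POISON_TICK` are all assumptions made for the example and are not the disclosed encoding. The toy code also treats every detected error as uncorrectable.

```python
DATA_BITS = 8                        # assumed data width per tick
ALL_ONES = (1 << DATA_BITS) - 1      # data field of the predetermined pattern
POISON_CHECK = 0xF                   # check value the toy code never produces
POISON_TICK = (ALL_ONES, POISON_CHECK)


def check_bits(data: int) -> int:
    """Toy check code: the number of 1 bits in the data (0..8 for 8-bit data)."""
    return bin(data).count("1")


def route_tick(tick):
    """Return (tick_to_forward, alert) for one received (data, check) tick."""
    data, check = tick
    if tick == POISON_TICK:
        return tick, False           # already reported upstream, pass it along
    if check_bits(data) != check:
        return POISON_TICK, True     # detected error: substitute and alert
    return tick, False               # clean tick, forward unchanged


good = (0b10100000, 2)
bad = (0b10100001, 2)                # one data bit flipped in transit
genuine_all_ones = (ALL_ONES, check_bits(ALL_ONES))

results = [route_tick(t) for t in (good, bad, POISON_TICK, genuine_all_ones)]
```

Because the check field of the pattern holds a value that no valid tick can carry, a legitimate tick whose data happens to be all 1's is forwarded normally and is not mistaken for the pattern.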