This invention relates generally to the field of computing, and more particularly, to error detection in a computer system.
As it is known in the art, a computer system may include a central processing unit (CPU), and system resources, such as memory and I/O devices. In a network, two or more computer systems or nodes are connected and communicate through a network communication medium for internode communication. A first process executing in one node may send a data transmission to another process executing in another node of the network. Problems may be encountered in either transmitting or receiving the data transmission in a network. A hardware and/or software mechanism is generally used to detect and handle data transmission errors.
In internode communication, a sending or transmitting node typically requires confirmation that the data transmission to a receiving node has completed reliably, that is, that the receiving node has successfully received the data uncorrupted and the sending node does not need to retransmit the data. The receiving node generally examines the data only after all of the data has been received complete and uncorrupted.
Using one technique, the sending node sends a data transmission to a receiving process in accordance with the pseudo-code description below:
______________________________________
Sending node:
    write first portion of data
    request acknowledgment from receiving node(s) that the first portion of data is received successfully
    write last portion of data
    request acknowledgment from receiving node(s) that the last portion of data is received successfully
    :
    write DONE flag /* indicating transmission is complete */
    request acknowledgment from receiving node(s) that the flag is received successfully

Receiving node:
    send acknowledgments for each portion of data received
    wait for DONE flag /* indicating data is ready for processing */
    send acknowledgment for DONE flag data
    process data
______________________________________
In the foregoing description, any individual write operation of a portion of data may have been unsuccessful. Accordingly, the sending node verifies that the receiver has received the data by requesting an acknowledgment signal from the receiving node as each data portion is successfully received. The DONE flag is set when the sending node completes transmission of the data and no errors are encountered by the sending node.
One drawback of this first technique is the length of time for an entire data transmission to complete due to the transmit time or latency for the acknowledgments sent from the receiving node for each data portion. Additionally, a large portion of network bandwidth is devoted to transmitting acknowledgments thereby decreasing the data carrying capacity of the network.
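By way of illustration only, the acknowledgment overhead of this first technique may be sketched in Python as follows. The `Receiver` class, the `DONE` sentinel, and the `send_with_acks` function are hypothetical names introduced for this sketch, and the network is replaced by an in-memory object that acknowledges every write. Under these assumptions, a transmission of N data portions waits for N + 1 acknowledgments: one per portion plus one for the DONE flag.

```python
from dataclasses import dataclass, field

# Hypothetical sentinel standing in for the DONE flag.
DONE = object()


@dataclass
class Receiver:
    """Hypothetical in-memory stand-in for a receiving node."""
    portions: list = field(default_factory=list)
    done: bool = False

    def deliver(self, message):
        # Store the message and return True as the acknowledgment.
        if message is DONE:
            self.done = True
        else:
            self.portions.append(message)
        return True


def send_with_acks(portions, receiver):
    """Write each portion, then the DONE flag, checking an ack after every write.

    Returns the number of acknowledgments the sender waited for.
    """
    acks = 0
    for message in list(portions) + [DONE]:
        if not receiver.deliver(message):
            # A real sender would retransmit here; this sketch only reports it.
            raise RuntimeError("write was not acknowledged")
        acks += 1
    return acks
```

In this sketch, sending three portions returns four, so the number of acknowledgment round trips grows in step with the number of data portions.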
To address this drawback, a second technique does not require an acknowledgment from the receiving node for each data portion transmitted. However, this second technique, as will be described, is generally used in networks that have an internode communication mechanism other than message passing for communicating error counter values among nodes in the network.
In this second technique, a global error counter is used in error detection. Generally, the sending node and receiving node examine a global error counter value "before" and "after" a data transmission. An error handler routine on each node increments a global error counter when an error occurs on that node. Thus, if the global error counter is the same value both "before" and "after", the sending node and receiving node conclude that there has been no sender or receiver data error. The following pseudocode description summarizes this second technique:
__________________________________________________________________________
sending node:
L1: read global Error Counter (GEC) /* this is the "before" GEC */
    <write all data portions>
    write "before" GEC to receiving process
    write DONE flag /* to indicate that data transmission is complete */
    compare "before" GEC to current GEC /* current GEC is "after" GEC */
    if "before" GEC = current GEC
    then done with transmission /* no errors */
    else retransmit by going to L1 /* errors */

receiving node:
L1: wait for DONE flag
    read "before" GEC /* guaranteed ordering ensures that when DONE flag is read by the receiving node, GEC has been sent */
    if "before" GEC = current GEC
    then successful transmission and process data
    else start over and go to L1

error handler:
    when an error occurs
    mutex begin /* mutually exclusive access to error counter during update */
    update GEC to reflect current error
    mutex end /* end mutually exclusive access to error counter */
__________________________________________________________________________
In the foregoing description, it is assumed that write operations by the sending node are observed and received by the receiving node in the order in which they are dispatched by the sending node.
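The "before" and "after" comparison of this second technique may be sketched in Python as follows. This is a minimal single-threaded simulation rather than an implementation: the `GlobalErrorCounter` class, the `send` and `receive` functions, the list used as an ordered channel, and the `inject_errors` parameter are all hypothetical, and the mutex of the error handler is omitted because nothing runs concurrently here.

```python
class GlobalErrorCounter:
    """Hypothetical shared counter that the error handler increments."""

    def __init__(self):
        self.value = 0

    def record_error(self):
        # The technique described above guards this update with a mutex;
        # this single-threaded sketch has no concurrent access to guard.
        self.value += 1


def send(portions, channel, gec, inject_errors=0):
    """Transmit until the GEC is unchanged across an attempt.

    `channel` is a list standing in for an ordered communication medium.
    `inject_errors` is the number of initial attempts during which a
    simulated error fires. Returns the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        before = gec.value              # the "before" GEC
        channel.clear()
        channel.extend(portions)        # write all data portions
        if inject_errors > 0:
            inject_errors -= 1
            gec.record_error()          # error handler runs mid-transfer
        channel.append(("GEC", before))  # write "before" GEC
        channel.append("DONE")           # write DONE flag
        if before == gec.value:          # compare with the "after" GEC
            return attempts


def receive(channel, gec):
    """Return the data if the transfer looks clean, otherwise None."""
    if not channel or channel[-1] != "DONE":
        return None                     # DONE flag not yet seen
    _tag, before = channel[-2]          # ordering guarantees GEC precedes DONE
    if before != gec.value:
        return None                     # an error occurred; start over
    return list(channel[:-2])
```

With one injected error, the sender in this sketch needs two attempts, after which the receiver accepts the data because the "before" value it reads matches the current counter.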
There are several drawbacks with the foregoing second technique. One drawback is relevant when there is more than one sending or receiving node. In this instance, the global error counter only captures summary information that an error has occurred in data transmission. Thus, it is not possible to determine which data transmission from which sending node to which receiving node has failed.
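This loss of localization can be illustrated with a short Python sketch. The function `transfers_forced_to_retry` and the representation of a transfer as a (sender, receiver) pair are assumptions made only for this illustration.

```python
def transfers_forced_to_retry(transfers, failed):
    """Return the transfers that observe a changed counter after one failure.

    `transfers` is a list of hypothetical (sender, receiver) pairs that are
    in flight at the same time; `failed` is the one pair that hits an error.
    """
    if failed not in transfers:
        raise ValueError("failed transfer must be one of the transfers")
    gec = 0                                   # the single global error counter
    before = {t: gec for t in transfers}      # each transfer's "before" value
    gec += 1                                  # error handler runs once, for `failed`
    # The counter keeps no record of `failed`, so every transfer sees a mismatch.
    return [t for t in transfers if before[t] != gec]
```

In this sketch a single error on one transfer causes every concurrent transfer to see a mismatch, so all of them would retransmit even though only one actually failed.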
Another drawback is that the second technique requires a locking mechanism to ensure that the error handler has mutually exclusive access to the global error counter. Generally, mutually exclusive access to a shared resource, such as the global error counter, is synchronized by using a locking and an unlocking operation. Use of this locking mechanism itself has several drawbacks. One is that the lock and unlock operations tend to be expensive in terms of computer resources. Another is that implementations of the lock and unlock operations are generally written to operate in an environment in which an error state is not being processed; that is, they typically depend upon various system hardware and software states. This is in direct opposition to what is required to synchronize access to the global error counter, which is updated precisely when an error is being processed. Consequently, some of the underlying assumptions and dependencies of commonly used lock and unlock implementations do not permit them to be used to synchronize access to the global error counter.
Thus, what is required is an efficient technique for performing error detection in a network that provides accurate error detection and reporting, and localized error information to minimize error recovery actions, without requiring the use of locking and unlocking operations.