1. Field of the Invention
This invention relates generally to the field of computer systems and, more particularly, to communication protocols within computer systems.
2. Description of the Related Art
High-speed, low-latency communications networks that include unreliable transport media often rely on a communications protocol to implement reliable message transport.
Examples of such communications protocols include TCP, NGIO 1.0, and PCI 2.x. In some of these protocols, a request can be sent from a sending device to a target device and an acknowledgment (ACK) can be sent in response from the target device back to the sending device. The sending device may include a timeout mechanism such that it can resend the request if an ACK is not received from the target device within a timeout duration set by properties of the communications network.
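The timeout-and-resend behavior described above can be illustrated with a minimal sketch. It is not taken from any of the named protocols: the function names, the `transmit` callable, and the `max_attempts` limit are all hypothetical, and the unreliable medium is replaced by a fake channel that drops a fixed number of requests.

```python
from typing import Callable, Optional


def send_with_timeout(
    transmit: Callable[[bytes], Optional[str]],
    request: bytes,
    max_attempts: int = 5,
) -> int:
    """Send `request` until an ACK arrives; return the number of attempts used.

    `transmit` is a hypothetical callable that sends the request, waits up to
    the timeout duration, and returns "ACK" or None if the timeout expired.
    """
    for attempt in range(1, max_attempts + 1):
        if transmit(request) == "ACK":
            return attempt
        # No ACK within the timeout duration: fall through and resend.
    raise TimeoutError(f"no ACK after {max_attempts} attempts")


def make_lossy_channel(drops: int) -> Callable[[bytes], Optional[str]]:
    """Return a fake channel (assumption) that loses the first `drops` requests."""
    remaining = [drops]

    def transmit(_request: bytes) -> Optional[str]:
        if remaining[0] > 0:
            remaining[0] -= 1
            return None  # request lost; the sender's timeout expires
        return "ACK"

    return transmit
```

For example, `send_with_timeout(make_lossy_channel(2), b"req")` succeeds on the third attempt, after two timeouts.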
Some protocols may use a negative acknowledgement (NAK) to indicate that the target device or the communications network has detected an error. Errors can include data corruption, an illegal packet type, etc. The NAK can give a positive indication that an error has occurred and may also indicate the type of error that occurred. A sending device may, depending on the communications protocol, resend the request in response to a NAK.
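As a rough illustration of a NAK that carries an error type, the sketch below models a response and one possible sender policy. The type names and the policy itself (resend after data corruption, give up after an illegal packet type) are assumptions made for this example; an actual protocol may define different error types and different rules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NakReason(Enum):
    """Error types a NAK might report (illustrative, not from any protocol)."""
    DATA_CORRUPTION = "data corruption"
    ILLEGAL_PACKET_TYPE = "illegal packet type"


@dataclass(frozen=True)
class Response:
    kind: str                           # "ACK" or "NAK"
    reason: Optional[NakReason] = None  # set only when kind == "NAK"


def should_resend(response: Response) -> bool:
    """Hypothetical sender policy for reacting to a response."""
    if response.kind == "ACK":
        return False  # request was delivered; nothing to resend
    # Assumed policy: a corrupted packet may get through on a second try,
    # whereas resending an illegal packet type unchanged would fail again.
    return response.reason is NakReason.DATA_CORRUPTION
```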
In some communications networks, certain types of errors may temporarily prevent a target device from processing an incoming request. These types of errors can include a temporary loss of system resources (e.g., a dynamic reconfiguration of a node), a temporary lack of processing resources on the target device, or a lack of a valid virtual-to-physical address translation in cases where the contents of the request are to be written in the virtual address space of the target device's node. While these errors may be temporary, the time required to resolve them can vary widely. For example, a dynamic reconfiguration of system resources in a server may take on the order of hundreds of milliseconds to resolve, a page miss in the virtual memory system may take on the order of tens of milliseconds to resolve, and a temporary resource unavailability in the network interface may take on the order of hundreds of microseconds to resolve. Thus, the time that the temporarily unavailable condition persists may vary by three orders of magnitude or more.
When a target device is temporarily unable to process a request, it can send a NAK to the sending device. The sending device can later resend the request, but it may again receive a NAK from the target device if the temporarily unavailable condition has not been cleared. This process could potentially repeat a large number of times and result in a large increase in traffic on the communications network. Alternatively, the sending device may delay resending the request for too long (i.e., well beyond the time needed for the target device to resolve the temporarily unavailable condition). As a result, the sending device may incur unnecessary latency while the processing of its request is delayed. A system and method are needed to more efficiently handle conditions where a target device may be temporarily unavailable.
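The trade-off between excess traffic and excess latency can be made concrete with a small model. The sketch below is a simplification under stated assumptions: the sender resends at a fixed interval, network transit time is ignored, and the target accepts the first request that arrives once the condition has cleared. The function name and parameters are hypothetical.

```python
import math
from typing import Tuple


def retry_cost(unavailable_us: int, retry_delay_us: int) -> Tuple[int, int]:
    """Return (naks_received, wasted_latency_us) for a fixed retry delay.

    Assumed model: requests are sent at t = 0, d, 2d, ... (d = retry delay);
    every request arriving before `unavailable_us` is NAKed, and the first
    one arriving at or after that time is accepted.
    """
    if unavailable_us <= 0:
        return 0, 0  # target was never unavailable
    # Number of sends that land while the condition still persists.
    naks = math.ceil(unavailable_us / retry_delay_us)
    accepted_at_us = naks * retry_delay_us
    # Time spent waiting after the condition had already cleared.
    return naks, accepted_at_us - unavailable_us
```

Under this model, a 100 ms condition retried every 100 microseconds gives `retry_cost(100_000, 100) == (1000, 0)`, that is, 1000 NAKs on the network. A 300 microsecond condition retried every 100 ms gives `retry_cost(300, 100_000) == (1, 99_700)`, a single NAK but almost 100 ms of avoidable delay.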