1. Field
This disclosure generally relates to the field of communication. More particularly, the disclosure relates to managing communication failures between devices.
2. General Background
Devices in a computer system often need to exchange data. For example, a processing device is commonly connected to a plurality of input and output devices. Such data communication is achieved by connecting the devices so that they can send data packets to each other over a link, which may be a wired link or a wireless link.
It is increasingly important for systems of connected devices to be capable of managing unanticipated failure or unavailability of one of those connected devices. For example, systems need to be resilient (i.e., continue providing their designated service) when a cable is accidentally disconnected (thereby disconnecting a connected device from a server, for example) or when a remote device fails. While operating systems generally provide some support for managed (i.e., planned) removal and replacement of Input/Output (I/O) adaptors in a running (“hot”) system (i.e., hot-plugging or hot-swapping), they provide little or no support for unexpected removal of devices from a hot system.
Previously, PCI Express, for example, was used only for permanent connection of I/O adaptors within the server (i.e., “in-box”). For such configurations, in which an I/O device is permanently installed within the server, failure of the I/O device represents a failure of the server as a whole and, as such, continued operation of that server is not considered to be necessary. An increasing interest in remotely connected PCI Express I/O devices (i.e., I/O devices connected outside a server) means that the need for servers to gracefully support unanticipated hot-removal of connected devices is becoming critical for system resilience.
The sending and receiving of data packets is often described in terms of transactions. A transaction involves one or more data packets being sent between devices. PCI Express implements a split transaction model, wherein a source device transmits a request data packet to a destination device, and awaits a completion data packet from the destination device in response. In general, operating systems are not adapted to handle failed PCI Express transactions gracefully. For example, if a server sends a request data packet to a connected device and, unexpectedly, receives no completion data packet in response to that request, the operating system of the server is likely to crash. As such, current connected systems based on PCI Express are likely to crash when a connected PCI Express resource becomes unexpectedly unavailable.
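By way of illustration only, the split transaction model described above may be sketched as a toy model in which a source device records each outstanding request and later either matches it to a completion or finds that no completion has arrived. The sketch below is not a PCI Express implementation: the class names, the `tag` and `address` fields, the `timeout` value, and the explicit `now` clock argument are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Request:
    tag: int        # hypothetical identifier used to match a completion to its request
    address: int    # hypothetical target address at the destination device
    sent_at: float  # time at which the request was transmitted


@dataclass
class Completion:
    tag: int        # tag of the request this completion answers
    data: bytes     # payload returned by the destination device


@dataclass
class Requester:
    """Toy source device that tracks requests awaiting a completion."""
    timeout: float  # assumed waiting period before a request is treated as failed
    outstanding: Dict[int, Request] = field(default_factory=dict)
    next_tag: int = 0

    def send_request(self, address: int, now: float) -> Request:
        # Record the request so that a later completion can be matched to it.
        req = Request(tag=self.next_tag, address=address, sent_at=now)
        self.outstanding[req.tag] = req
        self.next_tag += 1
        return req

    def receive_completion(self, cpl: Completion) -> Optional[Request]:
        # Return the matching request, or None if the completion was not expected.
        return self.outstanding.pop(cpl.tag, None)

    def expire(self, now: float) -> List[Request]:
        # Collect and drop requests whose completion has not arrived in time.
        expired = [r for r in self.outstanding.values()
                   if now - r.sent_at >= self.timeout]
        for r in expired:
            del self.outstanding[r.tag]
        return expired


# Two requests are sent; only the first is ever completed.
requester = Requester(timeout=0.05)
first = requester.send_request(0x1000, now=0.0)
second = requester.send_request(0x2000, now=0.0)
matched = requester.receive_completion(Completion(tag=first.tag, data=b"\x00"))
lost = requester.expire(now=0.1)  # the second request never received a completion
```

In this sketch, the request left in `lost` corresponds to the failure case discussed above, in which a request data packet is sent and no completion data packet is received in response.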
Standard implementations of PCI Express do not provide adequate means of dealing with failure of PCI Express subsystems or connected PCI Express devices. As a result, difficulties may arise in building remote or shared I/O systems which have an acceptable level of resilience to failure.