In recent years there has been a proliferation in the networking of computer systems. The recent expansion of the Internet is just one example of the trend toward distributed computing and information sharing. In most forms of computer or communication networking there are communication paths between the computers in the networks. These paths may include multiple links or hops between intermediate equipment in the path. Thus, a communication may be originated by a first computer at a first endpoint node and pass through several links before reaching the destination computer at a second endpoint node. The control over these communications is typically carried out by some form of networking architecture. Many architectures exist for defining communications between computers in a network. For example, Systems Network Architecture (SNA) and Transmission Control Protocol/Internet Protocol (TCP/IP) are two existing network architectures.
One existing network architecture for controlling communications between computers is known as Advanced Peer-to-Peer Networking (APPN). APPN, like many networking architectures, is based upon the transmission of data where a communication is broken into one or more "packets" of data which are then transmitted from the source to the destination over the communication path. Packet-based communication allows for error recovery of less than an entire communication, which improves communication reliability, and allows packets to take multiple paths to an endpoint destination, thus improving communication availability.
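By way of illustration only, the following minimal sketch shows how breaking a communication into sequence-numbered packets permits recovery of less than the entire communication. It is not APPN's packet format; the names `Packet`, `packetize`, `missing_sequences` and `reassemble` are hypothetical and are assumed here solely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet: a sequence number plus a slice of the message."""
    seq: int
    payload: bytes


def packetize(message: bytes, size: int) -> list[Packet]:
    """Break a communication into fixed-size, sequence-numbered packets."""
    if size <= 0:
        raise ValueError("packet size must be positive")
    return [
        Packet(seq=offset // size, payload=message[offset:offset + size])
        for offset in range(0, len(message), size)
    ]


def missing_sequences(received: list[Packet], total: int) -> list[int]:
    """Return the sequence numbers the receiver has not yet seen."""
    seen = {packet.seq for packet in received}
    return [seq for seq in range(total) if seq not in seen]


def reassemble(packets: list[Packet]) -> bytes:
    """Rebuild the communication, tolerating out-of-order arrival."""
    ordered = sorted(packets, key=lambda packet: packet.seq)
    return b"".join(packet.payload for packet in ordered)
```

In this sketch, a receiver that is missing a single packet requests only that sequence number again, and packets that arrive out of order (for example, because they took different paths) are reordered before the communication is rebuilt.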
While APPN has proven to be a reliable networking architecture, increasing computer networking demands have created a need for network architectures which utilize the higher performance communication systems and computer systems currently available. These demands have resulted in the development of High Performance Routing (HPR), which is an enhancement to APPN. The migration from APPN to HPR may be a result of changes in two areas: processing technology and link technology. Processing capability has increased and become less expensive. This has driven the need for larger peer-to-peer networks. Link technology has advanced by several orders of magnitude over the past decade. Advances in wide area links have dramatically increased transmission rates and decreased error rates. Thus, to take advantage of these advances, HPR provides high speed data routing which includes end-to-end recovery (i.e., error recovery is performed by the sending and receiving systems) and end-to-end flow and congestion control (i.e., the flow of data is controlled by the sending and receiving systems).
HPR includes two main components: the Rapid Transport Protocol (RTP) and Automatic Network Routing (ANR). RTP is a connection-oriented, full-duplex transport protocol designed to support high speed networks. One feature of RTP is to provide end-to-end error recovery, with optional link level recovery. RTP also provides end-to-end flow/congestion control by means of an adaptive rate-based (ARB) mechanism.
One advantage of HPR is its ability to route around failed links in a path. HPR may use alternate paths to bypass failing nodes on a path. This ability gives HPR considerable flexibility in recovering from errors at intermediate nodes. However, if a failure occurs at the endpoint node of a path, either the source or the destination endpoint, HPR alone cannot route around the failed endpoint. Thus, to provide error recovery for failures of an endpoint, the endpoints themselves should provide the error recovery.
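The distinction between intermediate-node failures and endpoint failures may be illustrated with a minimal sketch. The sketch below is not HPR's route selection algorithm; it is a plain breadth-first search over a hypothetical topology, and the names `find_path`, `links` and `failed` are assumptions made solely for the example.

```python
from collections import deque


def find_path(links, source, destination, failed=frozenset()):
    """Breadth-first search for a path that avoids every failed node.

    `links` maps each node to the nodes it is directly linked to.
    Returns the path as a list of nodes, or None when no path exists.
    """
    # A failed endpoint cannot be bypassed: every path must include it.
    if source in failed or destination in failed:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in failed and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical topology: endpoints A and D, intermediate nodes B, C and E.
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "E": ["C", "D"],
    "D": ["B", "E"],
}
```

With this topology, a failure of intermediate node B still leaves the alternate path through C and E, whereas a failure of endpoint D leaves no path at all, which is why recovery from an endpoint failure cannot be provided by rerouting alone.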
Errors at endpoints may generally be categorized as one of two types: errors of applications at an endpoint and errors of the endpoint itself. Depending upon the type of endpoint and the applications at the endpoint, a failure of an application may be recoverable. For example, if an endpoint is a Virtual Telecommunications Access Method (VTAM) facility and the application is a persistence-enabled application, then failures of the application may be recovered by restarting the application at the state prior to the failure. Such an error recovery method is described in commonly assigned U.S. Pat. No. 5,027,269. However, if the failure is of the endpoint itself, i.e., a VTAM failure, an operating system failure or a hardware failure, no mechanism currently exists to correct for such a failure.
Previously, errors of the endpoints themselves have been dealt with by providing redundant endpoints. A standby processor would be designated and a third party routing node would be utilized to recover connections. A live backup connection would be maintained to the dedicated standby processor. In the event of failure, the third party routing node would aid in establishing the connections to the dedicated standby processor. However, such a solution requires dedicated system resources at the third party routing node and the backup processor, as well as the resources needed to maintain two connections. Furthermore, in the event of failure of the backup processor there is no further error recovery.
In view of the above discussion, there exists a need for improvement in error recovery from failures at endpoint nodes in a network.