The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
1. Stream Control Transmission Protocol
Stream Control Transmission Protocol (SCTP) is a network packet data transport protocol that provides for transparent transfer of data between computer systems, or hosts, and is responsible for end-to-end error recovery and flow control (for a detailed description of SCTP, see Randall Stewart & Qiaobing Xie, Stream Control Transmission Protocol (SCTP), A Reference Guide, ISBN 0-201-72186-4, (Addison-Wesley, 2002)). SCTP is a reliable transport protocol operating on top of a potentially unreliable connectionless packet service protocol, such as the Internet Protocol (IP), and offers acknowledged error-free non-duplicated transfer of datagrams, or packets.
SCTP is a general-purpose transport protocol for message-oriented applications. It was designed by the Internet Engineering Task Force (IETF) SIGTRAN working group, which released the SCTP standard draft document RFC2960 in October 2000. SCTP provides Transport Layer connectivity for computer applications, processes, services, or daemons that run in layers above the Transport Layer. SCTP also provides support for multi-homed hosts, and can be used as the transport protocol for upper-layer applications that require monitoring and detection of loss of session. For such upper-layer applications, SCTP uses a number of path/session failure detection mechanisms, such as a heartbeat mechanism, to actively monitor the connectivity of the session.
SCTP is designed around the concept of a plurality of data streams within a transport connection. The data units transported over an SCTP transport connection are referred to as SCTP packets. If SCTP runs over IP, an SCTP packet forms the payload of an IP packet.
The hosts communicating over an SCTP transport connection are usually represented by SCTP endpoints. An SCTP endpoint is the logical sender/receiver of SCTP packets. On a multi-homed host, such as a computer system that can be reached at more than one network address, an SCTP endpoint is represented to its peers as a combination of a set of eligible destination transport addresses to which SCTP packets can be sent and a set of eligible source transport addresses from which SCTP packets can be received. All transport addresses used by an SCTP endpoint must use the same port number, but can use multiple IP addresses. A transport address used by an SCTP endpoint cannot be used by another SCTP endpoint. A transport address is defined by a Network Layer address, a Transport Layer protocol and a Transport Layer port number. For example, in the case of SCTP running over IP, a transport address is defined by the combination of an IP address and an SCTP port number (where SCTP is the Transport Layer protocol).
An SCTP association is a protocol relationship between SCTP endpoints, and is composed of the two SCTP endpoints and the protocol state information. The protocol state information includes, among other parameters, one or more verification tags, a set of transmission sequence numbers, and a set of stream sequence numbers. An SCTP association can be identified by the transport addresses used by the endpoints in the association. Two SCTP endpoints cannot have more than one SCTP association between them at any given time.
An SCTP packet is composed of a common header and one or more chunks. The common header contains fields for a source port number, a destination port number, a verification tag, and a checksum. The source port numbers and the destination port numbers are used for the identification of an SCTP association. SCTP uses the same port concept used by the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). The verification tag is a 32-bit randomly generated value that is specific to an SCTP association, and is exchanged between the SCTP endpoints at the SCTP association startup. The verification tag serves as a key that allows a receiver to verify that the SCTP packet belongs to the current SCTP association. The checksum is used for the detection of transmission errors.
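By way of illustration only, the common header described above can be sketched as a fixed 12-byte structure. The 32-bit verification tag is taken from the description; the 16-bit port widths and the 32-bit checksum width follow the published SCTP specification. The function names are hypothetical, and the checksum is carried as an opaque value rather than computed.

```python
import struct

# Common header: source port (16 bits), destination port (16 bits),
# verification tag (32 bits), checksum (32 bits), in network byte order.
COMMON_HEADER = struct.Struct("!HHII")


def pack_common_header(src_port, dst_port, verification_tag, checksum=0):
    """Serialize the common header. The checksum is a placeholder here."""
    return COMMON_HEADER.pack(src_port, dst_port, verification_tag, checksum)


def unpack_common_header(packet):
    """Read the common header from the first 12 bytes of an SCTP packet."""
    src_port, dst_port, tag, checksum = COMMON_HEADER.unpack_from(packet, 0)
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "verification_tag": tag,
        "checksum": checksum,
    }


def belongs_to_association(packet, expected_tag):
    """Accept the packet only if it carries the tag agreed at association startup."""
    return unpack_common_header(packet)["verification_tag"] == expected_tag
```

A receiver following this sketch would discard any packet for which belongs_to_association returns False before examining its chunks.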
A chunk is a unit of information within an SCTP packet, consisting of a chunk header and chunk-specific content. Multiple chunks may be multiplexed into one SCTP packet. A chunk may contain either control information or upper-layer application data, and may be of variable length. A chunk header includes a chunk type field, used to distinguish data chunks from the different types of control chunks, a chunk flags field for chunk-specific flags, and a chunk length field.
The chunk-specific content occupies the rest of the chunk, and is represented as a value field. The original SCTP specification defined several chunk types for standard use, including a Payload Data Chunk (DATA, chunk type value 0x0), Initiation Chunk (INIT, chunk type value 0x1), Initiation Acknowledgement Chunk (INIT ACK, chunk type value 0x2), Selective Acknowledgement Chunk (SACK, chunk type value 0x3), Heartbeat Request Chunk (HEARTBEAT, chunk type value 0x4), Heartbeat Acknowledgement (HEARTBEAT ACK, chunk type value 0x5), State Cookie Chunk (COOKIE ECHO, chunk type value 0xA), and Cookie Acknowledgement (COOKIE ACK, chunk type value 0xB). Subsequently, the SCTP specification has been extended to include the Address Configuration Change Chunk (ASCONF, chunk type value 0xC1), the Address Configuration Acknowledgement Chunk (ASCONF ACK, chunk type value 0x80), and the Stream Reset Chunk (STREAM RESET, chunk type value 0x82). A 32-bit Transmission Sequence Number (TSN) is attached to each chunk containing upper-layer application data to permit the receiving SCTP endpoint to acknowledge its receipt and detect duplicate deliveries.
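As a non-limiting sketch, the chunk layout and the chunk type values listed above might be handled as follows. The sketch assumes, following the published SCTP specification, an 8-bit type field, an 8-bit flags field, a 16-bit length field that counts the header plus the value, and padding of each chunk to a 4-byte boundary. The names build_chunk and iter_chunks are hypothetical.

```python
import struct

# Chunk type values as listed in the description above.
CHUNK_TYPES = {
    0x00: "DATA", 0x01: "INIT", 0x02: "INIT ACK", 0x03: "SACK",
    0x04: "HEARTBEAT", 0x05: "HEARTBEAT ACK",
    0x0A: "COOKIE ECHO", 0x0B: "COOKIE ACK",
    0xC1: "ASCONF", 0x80: "ASCONF ACK", 0x82: "STREAM RESET",
}

# Chunk header: type (8 bits), flags (8 bits), length (16 bits).
CHUNK_HEADER = struct.Struct("!BBH")


def build_chunk(chunk_type, flags, value):
    """Serialize one chunk; the length covers header and value, not padding."""
    length = CHUNK_HEADER.size + len(value)
    padding = (-length) % 4
    return CHUNK_HEADER.pack(chunk_type, flags, length) + value + b"\x00" * padding


def iter_chunks(payload):
    """Yield (type name, flags, value) for each chunk bundled after the common header."""
    offset = 0
    while offset + CHUNK_HEADER.size <= len(payload):
        chunk_type, flags, length = CHUNK_HEADER.unpack_from(payload, offset)
        if length < CHUNK_HEADER.size or offset + length > len(payload):
            raise ValueError("malformed chunk length")
        value = payload[offset + CHUNK_HEADER.size:offset + length]
        yield CHUNK_TYPES.get(chunk_type, "UNKNOWN"), flags, value
        # Skip the value and any padding up to the next 4-byte boundary.
        offset += length + (-length) % 4
```

For example, a HEARTBEAT chunk and a COOKIE ACK chunk built with build_chunk and concatenated can be read back in order with iter_chunks, which illustrates how several chunks are multiplexed into one SCTP packet.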
SCTP supports different streams of messages within one SCTP association. A message is a unit of data in a chunk sent by an upper-layer application over the SCTP association from one SCTP endpoint to another. A stream is a uni-directional logical channel established from one SCTP endpoint to another associated SCTP endpoint, within which all data messages are delivered in sequence unless out-of-order delivery is requested by the upper-layer application. A 16-bit Stream Sequence Number (SSN) is associated with each stream, and is maintained internally by SCTP to ensure sequenced delivery of the data messages within a given stream to the upper-layer application. One Stream Sequence Number is attached to each data message.
SCTP operates on two levels—the SCTP association level and the stream level. At the SCTP association level, the reliable transfer of SCTP packets is ensured by using checksums, transmission sequence numbers, and a selective retransmission mechanism. At the stream level, ordered delivery of data messages to an upper-layer application is ensured by using Stream Sequence Numbers (SSNs).
The establishing of an SCTP association between two SCTP endpoints is completed on the SCTP association level. When an upper-layer application wants to start an SCTP association, it makes a standard SCTP API call to its SCTP endpoint (the sending SCTP endpoint) to call the SCTP stack and initialize association data structures and association state parameters. The association state parameters include at least the initial TSNs, the number of outbound streams, the number of inbound streams, and a verification tag. The initial association state parameters are then assembled in an INIT chunk. The sending SCTP endpoint sends this INIT chunk to one transport address (e.g. a combination of an IP address and a port number) of the desired SCTP endpoint (the receiving SCTP endpoint). The sending SCTP endpoint then starts a timer that triggers repetitive sending of the INIT chunk until an INIT ACK chunk is received from the receiving SCTP endpoint. If the INIT chunk has been sent a configurable number of times and no INIT ACK chunk has been received from the receiving SCTP endpoint, then the sending SCTP endpoint reports an error to the upper-layer application, and the receiving SCTP endpoint is considered unreachable.
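The timer-driven resending of the INIT chunk might be sketched as a bounded retry loop. The callables send_init and wait_for_init_ack are hypothetical stand-ins for the SCTP stack, and the default values for max_attempts and timeout are arbitrary placeholders, not values taken from the specification.

```python
def establish(send_init, wait_for_init_ack, max_attempts=5, timeout=3.0):
    """Send INIT until an INIT ACK arrives or the attempt budget is exhausted.

    send_init():                 transmits one INIT chunk (hypothetical).
    wait_for_init_ack(timeout):  returns True if an INIT ACK arrived before
                                 the timer expired (hypothetical).
    Returns the number of INIT chunks that were sent.
    """
    for attempt in range(1, max_attempts + 1):
        send_init()
        if wait_for_init_ack(timeout):
            return attempt
    # Reported to the upper-layer application as an unreachable peer.
    raise ConnectionError(
        "peer unreachable: no INIT ACK after %d INIT chunks" % max_attempts
    )
```

In this sketch the error raised after the final attempt plays the role of the report to the upper-layer application.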
The receiving SCTP endpoint receives the INIT chunk (with the request to set up an SCTP association), and analyzes the data contained in this chunk. From this data the receiving SCTP endpoint generates all the values needed to establish an SCTP association at its side, including the verification tag, the initial TSNs, and the numbers of the streams in the inbound and the outbound directions. The receiving SCTP endpoint then generates a secure hash of these values and a secret key. The values are then put into a State Cookie Parameter. The receiving SCTP endpoint then sends its initial association setup parameters and the State Cookie Parameter to the sending SCTP endpoint in an INIT ACK chunk. The receiving SCTP endpoint then saves none of this state information and waits until the sending SCTP endpoint sends back the State Cookie parameter in a COOKIE ECHO chunk.
When the sending SCTP endpoint receives an INIT ACK chunk from the receiving SCTP endpoint, it stops the timer, puts the State Cookie parameter from the receiving SCTP endpoint's INIT ACK chunk into a new COOKIE ECHO chunk, and returns it to the receiving SCTP endpoint. The sending SCTP endpoint then starts a cookie timer that triggers repetitive sending of the new COOKIE ECHO chunk until a COOKIE ACK chunk is received from the receiving SCTP endpoint. If no COOKIE ACK chunk is received after a configurable number of COOKIE ECHO chunks have been sent to the receiving SCTP endpoint, the sending SCTP endpoint reports to the upper-layer application that the receiving SCTP endpoint is unreachable.
Upon receipt of the COOKIE ECHO chunk from the sending SCTP endpoint, the receiving SCTP endpoint unpacks the data contained in the chunk and verifies that the chunk was sent by the sending SCTP endpoint. The data contained in the chunk, specifically the State Cookie parameter, is validated against the secret key and includes at least the verification tag, the number of inbound and outbound streams, and the initial TSNs. The receiving SCTP endpoint then uses the values of these parameters to initialize an SCTP association with the sending SCTP endpoint by creating and initializing the data structures necessary to support the association. The receiving SCTP endpoint then sends a COOKIE ACK chunk to the sending SCTP endpoint, and is thereby ready to accept data or send data chunks over the SCTP association. The sending SCTP endpoint receives and verifies the COOKIE ACK chunk, and thereby can start transmitting or receiving upper-layer application data messages over the SCTP association.
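The stateless cookie exchange described above can be illustrated with a keyed hash. This is a minimal sketch under stated assumptions: HMAC-SHA-256 is chosen here only for illustration, the cookie carries just the four values named in the description (a real State Cookie also carries further fields, such as a timestamp), and the names make_state_cookie and validate_state_cookie are hypothetical.

```python
import hashlib
import hmac
import os
import struct

# Secret known only to the receiving endpoint (assumption: random per process).
SECRET_KEY = os.urandom(32)

# Cookie body: verification tag, initial TSN, outbound streams, inbound streams.
COOKIE_FIELDS = struct.Struct("!IIHH")
MAC_LEN = hashlib.sha256().digest_size


def make_state_cookie(tag, initial_tsn, out_streams, in_streams, key=SECRET_KEY):
    """Build the cookie sent in INIT ACK; the receiver keeps no per-peer state."""
    body = COOKIE_FIELDS.pack(tag, initial_tsn, out_streams, in_streams)
    mac = hmac.new(key, body, hashlib.sha256).digest()
    return body + mac


def validate_state_cookie(cookie, key=SECRET_KEY):
    """Check a cookie returned in COOKIE ECHO; return its fields, or None if invalid."""
    if len(cookie) != COOKIE_FIELDS.size + MAC_LEN:
        return None
    body, mac = cookie[:COOKIE_FIELDS.size], cookie[COOKIE_FIELDS.size:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None
    return COOKIE_FIELDS.unpack(body)
```

Because the association parameters travel inside the cookie and are authenticated by the keyed hash, the receiving endpoint can defer allocating association data structures until a valid COOKIE ECHO arrives.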
If a host is multi-homed on an IP network, its associated SCTP endpoint informs the other SCTP endpoint in the association about all of the host's IP addresses with the INIT chunk's address parameters (if the multi-homed host initiates the establishing of the association), or with the INIT ACK chunk's address parameters (if the multi-homed host does not initiate the establishing of the association). If no explicit network addresses are contained in the INIT or INIT ACK chunks, the source IP address of the IP packet that carries the SCTP packet is used. This mechanism eases application of SCTP when Network Address Translation (NAT) is involved, e.g. at the edge of large private IP networks. To further facilitate the use of SCTP along with NAT, an additional optional feature has been introduced into the SCTP specification that allows the usage of host names in addition to or instead of IP addresses.
All data chunks sent from an SCTP endpoint are numbered with the current Transmission Sequence Number (TSN) for the endpoint. This enables the detection of loss and duplication of data chunks. Acknowledgements sent from an SCTP endpoint that receives the data chunks are based on this TSN. When the SCTP endpoint that receives the data chunks detects one or more gaps in the sequence of data chunks, each received SCTP packet is acknowledged by sending a Selective Acknowledgement (SACK) control chunk that reports all gaps. Whenever the SCTP endpoint that sends data chunks receives four consecutive SACKs reporting the same data chunk missing, this data chunk is immediately retransmitted (fast retransmit).
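The gap reporting and fast retransmit behavior described above might be sketched as follows. The threshold of four consecutive reports is taken from the description. The sketch ignores 32-bit TSN wraparound for brevity, represents a SACK simply as a list of missing TSNs rather than as encoded gap blocks, and uses hypothetical names throughout.

```python
from collections import defaultdict

FAST_RETRANSMIT_THRESHOLD = 4  # consecutive SACKs, per the description above


def missing_tsns(received, cumulative_ack):
    """Receiver side: TSNs above cumulative_ack missing below the highest TSN seen."""
    highest = max(received, default=cumulative_ack)
    return [t for t in range(cumulative_ack + 1, highest) if t not in received]


class Sender:
    """Sender side: counts consecutive SACKs that report a TSN as missing."""

    def __init__(self):
        self.miss_counts = defaultdict(int)

    def on_sack(self, reported_missing):
        """Process one SACK; return the TSNs that should be retransmitted now."""
        # A TSN no longer reported missing breaks its run of consecutive reports.
        for tsn in list(self.miss_counts):
            if tsn not in reported_missing:
                del self.miss_counts[tsn]
        to_retransmit = []
        for tsn in reported_missing:
            self.miss_counts[tsn] += 1
            if self.miss_counts[tsn] == FAST_RETRANSMIT_THRESHOLD:
                to_retransmit.append(tsn)
        return to_retransmit
```

For instance, a receiver holding TSNs 1, 2, 4 and 5 with a cumulative acknowledgement of 2 reports TSN 3 as missing, and a sender following this sketch retransmits TSN 3 on the fourth consecutive such report.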
The stream level utilizes a flexible delivery mechanism that is based on the concept of multiple streams within an SCTP association. With respect to an SCTP endpoint, the SCTP association includes a set of inbound streams and a set of outbound streams, where the SCTP endpoint receives data through the inbound streams, and transmits data through the outbound streams. Chunks belonging to one or several streams may be bundled and transmitted in one SCTP packet. Every data chunk correctly received by an SCTP endpoint is delivered to the stream level.
At the stream level, an upper-layer application transmitting over an SCTP association may assign each data message to one of several streams within the association. When the SCTP association is established, the number of available streams per direction is exchanged between the associated SCTP endpoints. Within each stream, SCTP assigns independent Stream Sequence Numbers (SSNs) to the data messages. These numbers are used at the SCTP endpoint receiving the data messages to determine the sequence of delivery to the upper-layer application. SCTP performs in-sequence delivery per stream for all messages that are not marked for unordered delivery.
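Per-stream sequenced delivery can be illustrated with a small reorder buffer. This is a sketch only: the class name StreamReorderBuffer is hypothetical, each stream is assumed to start at SSN 0, and the 16-bit SSN is wrapped modulo 65536.

```python
class StreamReorderBuffer:
    """Holds out-of-order messages per stream until the expected SSN arrives."""

    def __init__(self):
        self.next_ssn = {}  # stream id -> next SSN expected (assumed to start at 0)
        self.pending = {}   # stream id -> {ssn: message}

    def receive(self, stream_id, ssn, message, unordered=False):
        """Return the messages that can now be delivered to the upper layer."""
        if unordered:
            # Messages marked for unordered delivery bypass sequencing.
            return [message]
        expected = self.next_ssn.get(stream_id, 0)
        held = self.pending.setdefault(stream_id, {})
        held[ssn] = message
        deliverable = []
        while expected in held:
            deliverable.append(held.pop(expected))
            expected = (expected + 1) % 65536  # 16-bit SSN wraps around
        self.next_ssn[stream_id] = expected
        return deliverable
```

Because each stream keeps its own expected SSN, a message missing on one stream delays delivery only on that stream; messages arriving on other streams are delivered immediately.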
2. High-Availability Computer Systems
One past approach for providing a high-availability computer system is to have a backup system that periodically determines the status of the computer system (the primary system), and when the primary system fails, the backup system takes over for the primary system by assuming its identity. Under this approach, the backup system communicates with and monitors the primary system via a special LAN (Local Area Network) connection or some other network connection. Usually, high-availability implemented using this approach does not require special hardware for the connection between the primary system and the backup system. Under this approach, when the primary system comes back to life, the primary system continues to perform its duties as a primary, and the backup system assumes its own identity and reverts back to perform as a backup.
This approach has a number of disadvantages. Consider, for example, a primary computer system that is a host in an IP network using a reliable transport protocol such as TCP. When the primary host fails, its backup host must establish transport-level connectivity to all network clients that had TCP connections to the primary host. The establishment of transport-level connectivity between the backup host and a client requires: (1) establishment of a TCP connection at the backup host (assuming the backup host had no prior TCP connection to the client), and (2) re-setting the TCP connection at the client. Both the establishment and the re-setting of a TCP connection require changing the source and/or destination IP address for the connection, as well as re-initialization of the data structures that support the connection. If the client runs an application that is not designed to support TCP connection re-establishment or failover, the entire client application may need to be restarted in order to establish transport-level connectivity with the backup host. Even if the application is designed to support TCP connection re-establishment or failover, there is still undesirable added delay incurred in setting up the new transport connections at the backup host. Moreover, the detailed transport connection state of the TCP connection needs to be exactly mirrored in the backup host. As the state of a TCP connection is very dynamic, mirroring it to a backup host that may be physically separate from the primary platform (by many tens of milliseconds or hundreds of miles) may not be practical or feasible.
Another disadvantage of this approach is that the backup system must timely discover the failure of the primary system, which requires more elaborate and frequent communications between the two systems. The backup system must discover the failure of the primary before a client connected to the primary discovers the failure, because otherwise, upon discovering that the primary has failed, the client might simply conclude that the primary system is unavailable and might give up trying to connect to it. Thus, even if later the backup system takes over the primary, there will be no way for the backup system to know of, and establish connection to, the client that gave up trying to connect to the primary.
Another approach for providing high-availability computer systems is to provide special hardware for communications between the primary and the backup computer systems. Under this approach, if the primary system fails the backup assumes the identity of the primary system. The special hardware is used by the primary system to constantly update the backup system with the state of the applications running on the primary system, and with the state of all transport connections between the primary system and any clients connected to it. An example of special hardware that can be used to implement this approach is a shared reflective memory that allows instant updates to the backup system whenever any changes to application states or transport connections occur in the primary system.
One of the many disadvantages of this approach is the high cost of the special hardware necessary for the communications between the primary and the backup systems. In other words, this approach gets rid of the necessity to reset transport connections and to transfer application state between the primary and the backup systems at the expense of higher hardware costs. Other significant disadvantages of this approach are the higher complexity and the higher costs involved in setting up and maintaining the failover scheme described above.
Based on the foregoing, there is a clear need for techniques providing a high-availability computer system with the ability to preserve and move, to a backup system, the transport connections that exist between the computer system and its clients without employing special hardware.