The present invention relates to distributed software systems. In particular, the invention relates to monitoring a process status based on transmitted data packets.
As an example of a distributed system, high-performance computer clusters are implemented to provide increased performance by splitting computational tasks across several computers in the cluster. Such a setup is often not only much more cost-effective than a single computer of comparable speed, but in many cases it is also the only way to further increase the computational power and reliability needed by evolving applications. Another good example of a distributed system is complex operator telecommunication equipment, such as the Radio Network Controller (RNC) in a Radio Access Network (RAN), which produces a huge workload and demands high reliability.
To distribute the workload among the nodes of a distributed system, efficient communication between the nodes is indispensable: high bandwidth, low delay, and tolerable data loss or corruption, all while demanding few resources. Such communication is usually provided by light-weight transport protocols.
The term “node” shall be understood as a “computer” in a distributed system, a telecom network element in a telecom network, or a part of a modular telecom network element, in each case running at least one process.
The requirements on network transport protocols intended for use in distributed systems are driven by only a few factors: high efficiency and robustness against failures. Failures are further divided into failures of nodes or parts of a node (e.g. hardware malfunction or process restart/SW failure) and failures of the communication network in between (congestion, line break, etc.). Currently, robustness is often neglected, but the steadily growing size and complexity of distributed systems drastically increase the probability of failures. This emphasizes the importance of precise error detection and efficient recovery.
Taking a look at the most widely used transport protocol, TCP (Transmission Control Protocol), these requirements are fulfilled only to a certain extent. Although TCP is fault-tolerant in a general manner, it is not able to recognize the specific kind of transmission failure: its behaviour is identical in case of a network failure (e.g. transmission congestion) and a node related failure (e.g. a node or process restart). It would be beneficial, however, to distinguish between these types of transmission failure. The key difference is that the state between two communicating processes, defined by the history of the earlier communication between them, remains intact in case of a network failure, whereas in case of a node related failure the state is lost and thus corrective actions may be necessary in the surviving peer process. For instance, after a broken network line recovers, data transmission could be resumed without losing the connection. If a process fails, on the other hand, corrective actions might be taken, for instance workload re-distribution or internal resource cleanup. TCP and the other available protocols lack these features of recognizing and correcting such problems.
Furthermore, the costly retransmission mechanism of TCP reduces transport efficiency, because acknowledgements and retransmitted data must be exchanged between the processes. Even more problematically, it adds a considerable processing overhead to each process involved in inter-node communication. This is especially critical in a distributed system where all processes need to communicate with each other, which often results in a huge number of connections that need to be maintained within the nodes.
Classical (connection-oriented) transport protocols allow the detection of process and network failures by reporting an unexpected connection loss. However, this does not allow differentiation between process and network failures. TCP is a well-known example: its error recovery is based on timeouts triggered by missing acknowledgements from the receiving side. Consequently, it takes a long time until a process failure is recognised. In this case the whole connection needs to be released and re-established, causing an even longer down-time of the affected process.
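The limitation described above can be illustrated with a minimal sketch. It is not a model of TCP itself; the names `Cause`, `Peer` and `timeout_detector` and the retry count are hypothetical and chosen only for illustration. The sketch assumes that both kinds of failure simply suppress acknowledgements, so that a detector based purely on missing acknowledgements reports the same result for both.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Cause(Enum):
    """Hypothetical failure causes, named after the two kinds in the text."""
    NETWORK = "network failure"
    NODE = "node related failure"


@dataclass
class Peer:
    # None means the peer is healthy and reachable (assumption of this sketch).
    cause: Optional[Cause] = None

    def acknowledges(self) -> bool:
        # Either kind of failure suppresses the acknowledgement.
        return self.cause is None


def timeout_detector(peer: Peer, max_retries: int = 3) -> str:
    """Report 'connection lost' after max_retries unanswered attempts."""
    for _ in range(max_retries):
        if peer.acknowledges():
            return "alive"
    return "connection lost"


if __name__ == "__main__":
    for peer in (Peer(), Peer(Cause.NETWORK), Peer(Cause.NODE)):
        print(peer.cause, "->", timeout_detector(peer))
```

In this sketch the detector returns the same value for a network failure and for a node related failure, so the surviving process cannot tell whether the state of its peer is still intact.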
An improvement with respect to process failure detection is introduced by SCTP (Stream Control Transmission Protocol), which uses special “Heartbeat Request” chunks in its packets to query another process' status: a node receiving such a request must respond by sending a “Heartbeat Acknowledgement”. This speeds up failure detection remarkably but also adds network overhead, especially in distributed systems, where in the worst-case scenario each process communicates with every other process in the system.
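The growth of this overhead in the worst case can be estimated with a short calculation. The following sketch assumes, purely for illustration, that every process probes every other process once per supervision interval and that each probe costs one request and one acknowledgement; the function name `heartbeat_messages_per_interval` is hypothetical.

```python
def heartbeat_messages_per_interval(num_processes: int) -> int:
    """Count supervision messages per interval in a full mesh.

    Assumption of this sketch: each process probes every other process
    once per interval, and each probe consists of one request and one
    acknowledgement.
    """
    if num_processes < 0:
        raise ValueError("num_processes must not be negative")
    # Number of (sender, receiver) combinations in a full mesh.
    ordered_pairs = num_processes * (num_processes - 1)
    # One request plus one acknowledgement per ordered pair.
    return 2 * ordered_pairs


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, heartbeat_messages_per_interval(n))
```

Under these assumptions the number of supervision messages grows quadratically with the number of processes, which is why the overhead becomes significant in large distributed systems.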
A similar approach is followed in the SS7 (Signalling System No. 7) signalling stack: “Signalling Link Test Messages” (SLTM) and “Signalling Link Test Acknowledgements” (SLTA) are exchanged in the Message Transfer Part (MTP) to detect network failures and node related failures.
Still, neither protocol differentiates between network failures and node related failures in the system.
Another transport protocol candidate is the Transparent Inter Process Communication (TIPC) protocol, which also uses “probe” messages for link-layer supervision and therefore has the same drawbacks as mentioned above.