1. Field of the Invention
In at least one aspect, the present invention relates to communication within a cluster of computer nodes.
2. Background Art
A computer cluster is a group of closely interacting computer nodes operating in a manner so that they may be viewed as though they are a single computer. Typically, the component computer nodes are interconnected through fast local area networks. Internode cluster communication is typically accomplished through a protocol such as TCP/IP or UDP/IP running over an ethernet link, or a protocol such as uDAPL or IPoIB running over an Infiniband (“IB”) link. Computer clusters offer cost effective improvements for many tasks as compared to using a single computer. However, for optimal performance, low latency cluster communication is an important feature of many multi-server computer systems. In particular, low latency is extremely desirable for horizontally scaled databases and for high performance computer (“HPC”) systems.
Although present day cluster technology works reasonably well, there are a number of opportunities for performance improvements regarding the utilized hardware and software. For example, ethernet does not support multiple hardware channels with user processes having to go through software layers in the kernel to access the ethernet link. Kernel software performs the mux/demux between user processes and hardware. Furthermore, ethernet is typically an unreliable communication link. The ethernet communication fabric is allowed to drop packets without informing the source node or the destination node. The overhead of doing the mux/demux in software (trap to the operating system and multiple software layers) and the overhead of supporting reliability in software result in significant negative impact on application performance.
Similarly, Infiniband (“IB”) offers several additional opportunities for improvement. IB defines several modes of operation including Reliable Connection, Reliable Datagram, Unreliable Connection, and Unreliable Datagram. Each communication channel utilized in IB Reliable Datagrams requires the management of at least three different queues. Commands are entered into send or receive work queues. Completion notification is realized through a separate completion queue. Asynchronous completion results in significant overhead. When a transfer has been completed, the completion ID is hashed to retrieve context to service the completion. In IB, receive queue entries contain a pointer to the buffer instead of the buffer itself resulting in buffer management overhead. Moreover, send and receive queues are tightly associated with each other Implementations cannot support scenarios such as multiple send channels for one process, and multiple receive channels for others, which is useful in some cases. Finally, reliable datagram is implemented as a reliable connection in hardware, and the hardware does the muxing and demuxing based on the end-to-end-context provided by the user. Therefore, IB is not truly connectionless and results in a more complex implementation.
Accordingly, there exists a need for improved methods and systems for connectionless internode cluster communication.