The exponential increase in the number of cores per computer cluster node demands more efficient and scalable communication mechanisms, since they are the limiting factors for cluster performance. The adoption of high-speed networks, such as InfiniBand, Myrinet, and 1/10/40/100 Gigabit Ethernet, generally improves communication performance, but their support in Java is poor. The main reason for this is that all Java communications are based on TCP/UDP sockets, which do not support reliable delivery of messages: TCP supports only reliable streaming, and UDP supports only unreliable datagram messaging. The upcoming JDK 1.7 will improve this situation by incorporating Stream Control Transmission Protocol (SCTP) sockets and Sockets Direct Protocol (SDP) support, but neither solution is portable. In fact, SCTP sockets cannot be used on several platforms, and SDP is initially supported only in Solaris. Moreover, both solutions still provide poor performance, since SCTP sockets rely on their native implementation, which performs less effectively than TCP sockets, and SDP has a performance similar to Internet Protocol (IP) emulation on InfiniBand (IPoIB), which performs well below the capability of the communication hardware. Additionally, Java does not support an efficient mechanism for message delivery in shared memory systems. In fact, Java communications are strongly oriented to the efficient support of distributed WAN applications, to the detriment of Java communications performance on clusters with high-speed networks, which are widely employed in High Performance Computing (HPC), data processing centers, and cloud infrastructures.
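To illustrate why a reliable byte stream is not the same as reliable message delivery, the following minimal sketch layers message boundaries on top of a stream using a length prefix. It is an illustration under stated assumptions, not code from any of the projects discussed here: the class `FramingSketch` and its methods are hypothetical names, and an in-memory byte stream stands in for a TCP connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: length-prefixed framing over a plain byte stream.
final class FramingSketch {

    // Write one message as a 4-byte length followed by the payload bytes.
    static void writeMessage(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Read one message; readFully blocks until the whole payload is available.
    static byte[] readMessage(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }

    // Round trip through an in-memory stream (an assumed stand-in for a TCP socket).
    static List<String> roundTrip(List<String> messages) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(wire);
            for (String m : messages) {
                writeMessage(out, m.getBytes(StandardCharsets.UTF_8));
            }
            out.flush();

            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire.toByteArray()));
            List<String> received = new ArrayList<>();
            for (int i = 0; i < messages.size(); i++) {
                received.add(new String(readMessage(in), StandardCharsets.UTF_8));
            }
            return received;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without the length prefix, the receiver would see only a concatenated run of bytes and could not tell where one message ends and the next begins. This per-message bookkeeping is work that a message-oriented transport would otherwise do on the application's behalf.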
High-speed networks are supported in standard Java Virtual Machines (JVM) using IP emulations. These emulation libraries exhibit high start-up latency (0-byte message latency), low bandwidth, and high CPU load. The main reason for this poor performance is that the IP protocol was designed to cope with low-speed, unreliable, and failure-prone links in WAN environments, whereas current cluster networks in Local Area Network (LAN) and System Area Network (SAN) environments are high-speed, hardware-reliable, and failure-resistant. Examples of IP emulations are IP over Myrinet low-level libraries MX (Myrinet Express) and GM (IPoMX and IPoGM), LANE driver over Giganet, IP over InfiniBand (IPoIB), and ScaIP and SCIP on Scalable Coherent Interface (SCI).
In order to provide Java with fuller and more efficient support on high-speed networks, several approaches have been followed: (1) Virtual Interface Architecture (VIA) based projects, (2) Remote Method Invocation (RMI) optimizations, (3) Java Distributed Shared Memory (DSM) middleware on clusters, (4) high performance Java sockets implementations, and (5) low-level libraries on high-speed networks.
Javia and Jaguar provide access to high-speed cluster interconnections through VIA, a communication library implemented on Giganet, Myrinet, Gigabit Ethernet and SCI, among others. More specifically, Javia reduces data copying using native buffers, and Jaguar acts as a replacement for the Java Native Interface (JNI). Their main drawbacks are the use of custom APIs, the need for modified Java compilers, and the lack of non-VIA communication support. Additionally, Javia exposes programmers to buffer management and uses a custom garbage collector.
Typical projects that deal with Remote Method Invocation (RMI) optimization are Manta, a Java to native code compiler with a fast RMI protocol, and KaRMI, which improves RMI through efficient object serialization that reduces protocol latency. Serialization is the process of transforming objects into a series of bytes, in this case to be sent across the network. However, the use of custom high-level solutions that incur substantial protocol overhead, together with the focus on Myrinet, has restricted the applicability of these projects. In fact, their start-up latency ranges from several times to an order of magnitude greater than socket latencies.
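As a concrete illustration of serialization, the sketch below uses the standard Java object streams to turn an object into bytes and back. It shows only the default JDK mechanism, not the optimized protocols of Manta or KaRMI. The class `SerializationSketch` and the message type `Particle` are hypothetical names introduced for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical example of default Java serialization (not an optimized RMI protocol).
final class SerializationSketch {

    // Hypothetical message type: 4 bytes of int data plus 8 bytes per coordinate.
    static final class Particle implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final double[] position;

        Particle(int id, double[] position) {
            this.id = id;
            this.position = position;
        }
    }

    // Transform an object into the byte sequence that would be sent over the network.
    static byte[] toBytes(Serializable object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuild the object from the received bytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a `Particle` with three coordinates, the raw field data amounts to 28 bytes, yet the serialized form is larger because the stream also carries a header and class metadata. This kind of overhead is one reason why serialization is a frequent optimization target in Java communication middleware.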
Noteworthy Java DSM projects are Jackal, cJVM, CoJVM, JESSICA2, and JavaSplit. As these are socket-based projects, they benefit from socket optimizations, especially in shared memory communication. However, they share undesirable characteristics, such as the use of modified JVMs, the need for source code modification, and limited interoperability. Additionally, they do not directly support high-speed networks. A related project is Pleiad, which provides a shared memory abstraction on top of physically distributed resources. The programmer uses an API with special threads, shared arrays, and shared objects across a multi-core cluster. However, Pleiad does not directly support high-speed networks.
Java Fast Sockets (JFS) is an optimized Java socket implementation that: (1) more directly supports high-speed networks such as SCI, Myrinet, and Gigabit Ethernet; (2) alleviates the serialization overhead; (3) reduces buffering and unnecessary copies; and (4) re-implements the protocol for boosting shared memory (intra-node) communication by resorting to UNIX sockets. Another related project in high performance sockets implementations is Non-Blocking IO (NBIO), which introduced non-blocking features. These features were eventually standardized in Java New I/O (Java NIO) sockets, which are crucial for increased scalability in server applications. Nevertheless, neither NBIO nor NIO sockets provide high-speed network support. The Ibis sockets library is a high performance sockets implementation over the Ibis Portability Layer (IPL), which can run on TCP or MX (Myrinet). However, this socket implementation does not significantly exploit MX direct support, and consequently only achieves a performance similar to TCP support on IPoMX.
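The non-blocking behavior standardized in Java NIO can be sketched as follows. This is a minimal illustration that assumes a `java.nio.channels.Pipe` as a stand-in for a socket connection, so that it runs without a network. The class `NonBlockingSketch` is a hypothetical name, and the same selector pattern is what a `SocketChannel` would use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

// Hypothetical example: non-blocking reads driven by a Selector.
final class NonBlockingSketch {

    // Sends a short message through a pipe and reads it back without blocking reads.
    static String demo(String message) {
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(256);
        if (payload.length > buffer.capacity()) {
            throw new IllegalArgumentException("message too long for this sketch");
        }
        try {
            Pipe pipe = Pipe.open();
            try (Pipe.SinkChannel sink = pipe.sink();
                 Pipe.SourceChannel source = pipe.source();
                 Selector selector = Selector.open()) {

                source.configureBlocking(false);

                // Nothing has been written yet: a non-blocking read returns 0 at once.
                if (source.read(buffer) != 0) {
                    throw new IllegalStateException("expected an empty channel");
                }

                source.register(selector, SelectionKey.OP_READ);

                ByteBuffer out = ByteBuffer.wrap(payload);
                while (out.hasRemaining()) {
                    sink.write(out);
                }

                // Wait for readiness instead of blocking inside read().
                int received = 0;
                while (received < payload.length) {
                    selector.select(1000);
                    selector.selectedKeys().clear();
                    int n = source.read(buffer);
                    if (n < 0) {
                        break; // end of stream
                    }
                    received += n;
                }
                buffer.flip();
                return StandardCharsets.UTF_8.decode(buffer).toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because a single selector can monitor many channels, one thread can serve many connections, which is the scalability benefit mentioned above. The sketch does nothing to address high-speed network support, which NIO leaves to the underlying transport.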
Another approach for the support of high-speed networks in Java is the development of custom low-level Java libraries on a specific network. An example is Jdib, which accesses Mellanox Verbs Interface (VAPI) on InfiniBand through a low-level API that directly exploits Remote Direct Memory Access (RDMA) and communication queues. In this way, Jdib achieves almost native performance on InfiniBand, but the use of a low-level API makes the implementation of Java applications difficult, and compromises the efficiency of the solution due to the need for multiple accesses to VAPI native functions for each message transfer.
Other efforts to provide fuller and more efficient support for high-speed networks in Java have several drawbacks, such as the use of non-standard JVMs and Java compilers, the use of custom APIs, and the relatively small performance benefits due to the inefficiency of the communication mechanisms implemented.
In addition to poor Java support on high-speed networks, Java applications usually suffer from inefficient communication middleware that is substantially based on protocols with high communication overhead, such as sockets and especially Java RMI, whose protocol involves a significant number of socket transfers.
Initial implementations of Message-Passing in Java (MPJ) middleware, which are messaging systems oriented towards HPC, relied upon RMI for communications. However, for reasons of efficiency, they are now implemented either with sockets or with wrapped native Message Passing Interface (MPI) communication libraries. The most common socket-based implementations are MPJ Express, MPJ/Ibis and F-MPJ, and the most common wrapper implementation is mpiJava.
MPJ Express is a “pure” Java (100% Java) MPJ solution, implemented on top of Java NIO. MPJ Express is thread-safe and implements a pluggable architecture that combines the portability of “pure” Java NIO communications with high performance Myrinet support. The latter occurs through use of the native MX communication library. However, the use of several communication layers, such as MPJ, mpjdev, xdev and the buffering layer, adds significant overhead to MPJ Express communications.
MPJ/Ibis is another MPJ library. It has been implemented on top of Ibis, a parallel and distributed Java computing framework. Ibis can use either “pure” Java communications or native communications on Myrinet. There are two low-level communication devices in Ibis: TCPIbis, which is based on Java IO sockets (TCP), and NIOIbis, which provides both blocking and non-blocking communication through Java NIO sockets. However, MPJ/Ibis is not thread-safe, does not take advantage of non-blocking communication, and its Myrinet support is based on the GM library, which has lower performance than the MX library.
F-MPJ is an MPJ library that outperforms MPJ Express and MPJ/Ibis. It does so by using Java Fast Sockets (JFS) and by implementing a communication protocol that provides efficient non-blocking communication, thereby allowing communication overlapping, and thus more scalable performance. Additionally, F-MPJ reduces the buffering overhead and implements efficient MPJ collective primitives.
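The idea of communication overlapping can be sketched with standard Java concurrency utilities. The example below is not the F-MPJ protocol or API. It assumes a hypothetical `irecv` method that returns immediately with a handle, in the spirit of a non-blocking receive, while the transfer itself is simulated by copying an array on a separate thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical example: overlapping a (simulated) transfer with local computation.
final class OverlapSketch {

    // Assumed stand-in for a non-blocking receive: returns a handle immediately.
    // The "transfer" is simulated by copying the incoming array on another thread.
    static CompletableFuture<int[]> irecv(ExecutorService commThread, int[] incoming) {
        return CompletableFuture.supplyAsync(() -> incoming.clone(), commThread);
    }

    // Sums local and remote data, computing on local data while the transfer runs.
    static long sumWithOverlap(int[] local, int[] remote) {
        ExecutorService commThread = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<int[]> request = irecv(commThread, remote);

            long sum = 0;
            for (int v : local) {
                sum += v; // useful work that overlaps the pending transfer
            }

            // Equivalent of a wait call: block only when the data is actually needed.
            for (int v : request.join()) {
                sum += v;
            }
            return sum;
        } finally {
            commThread.shutdown();
        }
    }
}
```

With a blocking receive, the local loop could not start until the transfer had finished. Deferring the wait until the data is needed is what allows the communication cost to be hidden behind computation.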
The most relevant MPJ Java wrapper project is mpiJava, a library that uses native MPI implementations for communications. However, although mpiJava performance is usually high, this library currently supports only some native MPI implementations, since wrapping a large number of functions and supporting heterogeneous runtime environments entails a significant effort. Additionally, this implementation is not thread-safe, and therefore is unable to take advantage of multi-core systems through multithreading. Because of these drawbacks, the mpiJava project has been superseded by the development of MPJ Express.
Shared memory communication support in Java messaging systems is currently only implemented in MPJ Express within its “smpdev” multi-core communication device. This allows thread-based shared memory (intra-node) transfers. However, the performance benefits of this thread-based shared memory communication support are severely limited by the MPJ Express buffering layer (mpjbuf), excessive synchronization overhead, and multiple processing layers.
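A thread-based intra-node transfer can, in principle, hand over a reference instead of copying or serializing data. The sketch below illustrates that idea with a standard blocking queue. It is not the design of smpdev. The class `IntraNodeSketch` and its mailbox are hypothetical names used only for this example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical example: message passing between two threads in the same JVM.
final class IntraNodeSketch {

    // Sends a message to a receiver thread and reports whether the receiver
    // obtained the very same array object (no copy, no serialization).
    static boolean sameObjectDelivered(double[] message) {
        BlockingQueue<double[]> mailbox = new ArrayBlockingQueue<>(1);
        double[][] received = new double[1][];

        Thread receiver = new Thread(() -> {
            try {
                received[0] = mailbox.take(); // "receive": blocks until a message arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        receiver.start();

        try {
            mailbox.put(message); // "send": only a reference is handed over
            receiver.join();      // join makes the receiver's write visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return received[0] == message;
    }
}
```

The only cost in this sketch is the synchronization inside the queue. Any intermediate buffering or extra processing layers would add to that cost, which is the kind of overhead the paragraph above attributes to the existing implementation.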
Examples of other native (non-Java) messaging systems that support shared memory communications are MPICH2 (through its “shm” and “Nemesis” channels), TOMPI, and TMPI. However, these systems are limited to intra-node communications and do not apply to clusters of multi-core processors.
Hybrid shared/distributed memory architectures increase the complexity of communication protocols, since they combine network (inter-node) communications with shared memory (intra-node) communications, thereby requiring efficient communication overlapping.
Existing systems that utilize hybrid shared/distributed memory architectures use a hybrid programming paradigm of shared memory and Message-Passing libraries, such as OpenMP and MPI, in a hierarchical structure. In this structure, MPI is used for inter-node communications and OpenMP is used for parallel processing within each node. Although this approach might appear to fully exploit the available computer resources, it actually presents several problems, such as preventing compiler optimizations due to the use of threads, excessive synchronization overhead, and the need for thread safety in the Message-Passing library. In the MPI-2.0 standard, Section 8.7, the hybrid approach replaces OpenMP with POSIX threads (pthreads) in order to provide a higher degree of control and reduce the impact of the aforementioned issues. A hybrid paradigm approach can also be followed in Java. Thus, messaging libraries can be combined with Java threads and Java OpenMP-like libraries, such as JOMP and the shared memory API of the Parallel Java library. Nevertheless, all these projects require the use of two programming paradigms, which involves a significant programming effort. In large part this is due to the use of low-level threading models, which typically incur higher synchronization overhead than a single programming paradigm.
Current native messaging systems take advantage of hybrid shared/distributed memory architectures using only the messaging programming model when combining, transparently to the user, shared memory communication devices (intra-node) with network communication devices (inter-node). Examples of the transparent support of hybrid memory architectures in messaging systems are MPICH2 “sshm” (sockets plus “shm” shared memory support), Nemesis-IB (shared memory and InfiniBand support), and SHIBA (POSIX shared memory and InfiniBand support). Nevertheless, when supporting both shared and distributed memory communications (both intra-node and inter-node transfers), the shared memory communication is implemented as an inter-process transfer, not as a thread-based intra-process communication.