In-memory cluster computing frameworks are a key component of the modern computing era and provide an economically viable alternative to specially built supercomputers. Cluster computing frameworks use commodity hardware that is easily and cheaply obtained. For example, a cluster of personal computers can be networked together to provide computing power that compares favorably (in price, if not in physical space) with a supercomputer. But whereas traditional operating systems work well with individual personal computers that are not organized in a cluster, special software is needed to make a cluster of personal computers work together. Apache Spark™, an example of such software, is growing quickly, and internet-service companies such as Google, Facebook, and Amazon are seriously considering Apache Spark. (Apache, Apache Spark, and Spark are trademarks of The Apache Software Foundation.) Moreover, SAP®, Cloudera™, MapR™, and DataStax are working to build new products on top of the Apache Spark framework. (SAP is a registered trademark of SAP SE in the United States and other countries. Cloudera is a trademark of Cloudera, Inc. MapR is a trademark of MapR Technologies Inc.)
Apache Spark is well known for its capability to provide “memory-speed” computations, especially for, but not limited to, iterative big-data analytics and real-time applications. To achieve this performance improvement over existing distributed computing platforms such as Apache Hadoop™, Apache Spark needs to keep its data in the memory of the cluster, in “resilient distributed dataset” (RDD) format, for fast computation. (Apache Hadoop and Hadoop are trademarks of The Apache Software Foundation.)
Existing Apache Spark implementations utilize the memory heap space of the Java Virtual Machine (JVM), but this introduces significant performance degradation due to the time spent on garbage collection (GC). A GC event pauses the entire JVM and thus halts all execution for its duration.
To alleviate the high cost of maintaining RDDs in the Java heap space, Apache Spark developers created another solution, called “Tachyon”. Tachyon utilizes RAMDisks to cache RDDs in memory without triggering GC events in the JVM, while also maintaining its file system in memory. Tachyon not only eliminates GC overhead, but also provides better separation between the execution engine (Apache Spark) and the storage/cache engine (Tachyon), because Tachyon runs as a separate process and is controlled by a central manager, which can itself be made fault-tolerant with another application such as ZooKeeper.
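The distinction between caching on the JVM heap and caching outside it can be illustrated with a minimal Java sketch. It does not reproduce Apache Spark or Tachyon code. It assumes standard `java.nio` buffers as a stand-in for off-heap storage such as a RAMDisk, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: contrasts JVM-heap storage with off-heap storage.
// The class and method names are hypothetical; they are not Spark or Tachyon APIs.
final class CachePlacementSketch {

    // Allocates a cache block on the JVM heap. The backing array is managed
    // by the garbage collector like any other Java object.
    static ByteBuffer allocateOnHeap(int bytes) {
        return ByteBuffer.allocate(bytes);
    }

    // Allocates a cache block as a direct buffer. Its backing memory typically
    // resides outside the garbage-collected heap, so the cached bytes are not
    // scanned or moved during a GC event.
    static ByteBuffer allocateOffHeap(int bytes) {
        return ByteBuffer.allocateDirect(bytes);
    }

    public static void main(String[] args) {
        ByteBuffer onHeap = allocateOnHeap(1024);
        ByteBuffer offHeap = allocateOffHeap(1024);

        // Both buffers expose the same read/write interface to the caller.
        onHeap.putLong(0, 42L);
        offHeap.putLong(0, 42L);

        System.out.println("on-heap buffer is direct:  " + onHeap.isDirect());
        System.out.println("off-heap buffer is direct: " + offHeap.isDirect());
        System.out.println("values match: " + (onHeap.getLong(0) == offHeap.getLong(0)));
    }
}
```

In this sketch the caller sees the same interface in both cases; only the placement of the cached bytes differs, which is the property that lets an off-heap cache avoid adding to GC work.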
But despite these efforts from the Apache Spark community, performance bottlenecks still exist in Apache Spark and Tachyon. Because Apache Spark and Tachyon share memory space in the same memory system, and both demand high memory bandwidth, they compete for that bandwidth. As a result, Apache Spark cannot achieve maximum performance.
Moreover, Tachyon by itself does not provide any fault tolerance, but instead depends on the fault tolerance of the underlying storage systems. This lack of fault tolerance within Tachyon can be a serious problem where system engineers, seeking to squeeze the best performance out of a cluster, configure it by mounting non-fault-tolerant memory/storage systems for use by Tachyon.
While the above description focuses on Apache Spark and Tachyon, the problem of determining which devices should cache data can arise in any cluster computing framework.
A need remains for a way to better manage the caching of data in a cluster computing framework that addresses these and other problems.