The present invention relates to the electrical, electronic and computer arts, and, more particularly, to caching, pre-fetching, and the like.
In-memory directed acyclic graph (DAG)-based data analytic platforms support complex data analytic workflows in a high-performance, distributed, fault-tolerant manner. The Spark framework provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines. Spark has been rapidly adopted by industry.
Memory management is critical for in-memory Spark to achieve high performance. Memory is becoming a scarce resource due to RDD data persistence: intermediate data are usually cached on disk and in memory to avoid re-computation. The management of the RDD cache, however, is oblivious to DAG scheduling. Suboptimal memory management can lead to significant performance degradation and low memory efficiency.
DAG-based platforms are popular and easy to use, but DAG processing with Spark is memory intensive. The RDD abstraction used by Spark acts like a specialized cache for storing intermediate data. Currently, the RDD cache is managed with a traditional policy called Least Recently Used (LRU); as noted, this technique is oblivious to DAG scheduling and is therefore suboptimal.
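By way of illustration only, the following minimal Python sketch shows how a recency-based policy can evict a block that a later stage of the workflow still needs. It is a toy model rather than Spark's actual cache implementation: the class `LRUBlockCache`, the function `run_demo`, the block identifiers `rdd_A`, `rdd_B` and `rdd_C`, and the capacity of two blocks are all hypothetical names and values chosen for the example.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy LRU cache keyed by block id (hypothetical; not Spark's implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used first, most recent last
        self.evicted = []            # record of evicted block ids, oldest first

    def get(self, block_id):
        """Return the cached block and mark it as most recently used, or None on a miss."""
        if block_id not in self.blocks:
            return None
        self.blocks.move_to_end(block_id)
        return self.blocks[block_id]

    def put(self, block_id, data):
        """Insert a block, evicting least recently used blocks if over capacity."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
        self.blocks[block_id] = data
        while len(self.blocks) > self.capacity:
            victim, _ = self.blocks.popitem(last=False)
            self.evicted.append(victim)


def run_demo():
    """Assumed toy workflow: rdd_A is needed by a later stage, rdd_B is not."""
    cache = LRUBlockCache(capacity=2)
    cache.put("rdd_A", "partitions of A")  # will be read again by a later stage
    cache.put("rdd_B", "partitions of B")  # not read again after the next access
    cache.get("rdd_B")                     # rdd_B is now the most recently used
    cache.put("rdd_C", "partitions of C")  # over capacity: LRU evicts rdd_A
    later_stage_misses = cache.get("rdd_A") is None  # rdd_A must be recomputed
    return cache.evicted, later_stage_misses
```

In this sketch the policy sees only access recency, so it discards `rdd_A` even though the assumed workflow reads it again, while `rdd_B` stays cached although it is no longer needed. A policy with knowledge of the DAG schedule could, in principle, make the opposite choice.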