Apache Spark is a general-purpose framework for distributed data processing that provides functional APIs for manipulating data at scale, in-memory data caching, and reuse of data across computations. It applies a set of coarse-grained transformations over partitioned data and relies on a dataset's lineage to re-compute tasks in case of failures. Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. From a caching point of view, an RDD represents distributed immutable data (partitioned data and an iterator) together with lazily evaluated operations (transformations). RDDs have the following properties: different RDDs may have different sizes (in bytes); different RDDs may take different amounts of time to re-generate; a child RDD's re-generating time may change if one or more of its parent RDDs are removed from memory; and operations on RDDs are of different types, such as transformations, actions, and persistence. Spark improves system performance by keeping as many intermediate results (such as RDDs) as possible in volatile memory rather than in long-term data storage such as a hard disk drive (HDD).
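The properties listed above can be illustrated with a minimal single-machine sketch. It is not Spark code: `ToyRDD`, its methods, and the `compute_count` counter are hypothetical names introduced here only to mimic lazy transformations, actions, persistence, and lineage-based re-generation, with the assumption that counting evaluations is an adequate stand-in for re-generating time.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, lazy, and lineage-tracked."""

    def __init__(self, source=None, parent=None, op=None, name=""):
        self._source = source    # base data (only for a root dataset)
        self._parent = parent    # lineage: the dataset this one is derived from
        self._op = op            # function applied to the parent's items
        self._persist = False    # whether the user asked to keep the result
        self._cache = None       # materialized items, once computed and persisted
        self.name = name
        self.compute_count = 0   # how many times this dataset was (re-)generated

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, f, name="map"):
        return ToyRDD(parent=self, op=lambda items: [f(x) for x in items], name=name)

    def filter(self, pred, name="filter"):
        return ToyRDD(parent=self, op=lambda items: [x for x in items if pred(x)], name=name)

    # Persistence: mark (or unmark) the dataset for in-memory reuse.
    def persist(self):
        self._persist = True
        return self

    def unpersist(self):
        self._persist = False
        self._cache = None
        return self

    # Action: triggers evaluation of the whole lineage.
    def collect(self):
        return list(self._materialize())

    def _materialize(self):
        if self._cache is not None:
            return self._cache              # reuse instead of re-generating
        self.compute_count += 1
        if self._parent is None:
            items = list(self._source)
        else:
            items = self._op(self._parent._materialize())
        if self._persist:
            self._cache = items
        return items


base = ToyRDD(source=range(10), name="base")
squares = base.map(lambda x: x * x, name="squares").persist()
evens = squares.filter(lambda x: x % 2 == 0, name="evens")

first = evens.collect()    # computes base, squares (then cached), evens
second = evens.collect()   # squares is served from memory; base is not touched
counts_cached = (base.compute_count, squares.compute_count)

squares.unpersist()        # the parent is removed from memory
third = evens.collect()    # the child now has to re-generate its parents too
counts_evicted = (base.compute_count, squares.compute_count)

print(first, counts_cached, counts_evicted)
```

In this toy model, evicting `squares` makes the next evaluation of `evens` more expensive, which mirrors the property that a child's re-generating time depends on whether its parents are still in memory.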
However, the performance of this all-in-memory design depends to some extent on how users (i.e., application developers) hardcode explicit instructions telling Spark to reuse certain RDDs in order to reduce the makespan, that is, the length of time required to process all jobs in a schedule. Without such RDD-reuse declarations from users, Spark has to pay both temporal costs (re-computing RDDs) and spatial costs (storing duplicated RDDs) for duplicated RDDs and RDD operations. This kind of resource waste commonly occurs in database queries (e.g., duplicated complicated SQL conditions) and in scenarios where different applications share the same logic.
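The temporal and spatial costs described above can be made concrete with a small, self-contained sketch. It does not use Spark; `CostMeter`, `expensive_filter`, the two `run_*` functions, and the lineage key are hypothetical, and the counters are rough proxies (number of evaluations, number of items held) rather than measurements. In actual Spark, the reuse declaration would typically be a call to `persist()` or `cache()` on the shared RDD.

```python
from dataclasses import dataclass


@dataclass
class CostMeter:
    compute_calls: int = 0   # temporal cost proxy: evaluations of the shared step
    stored_items: int = 0    # spatial cost proxy: items held as intermediate results


def expensive_filter(data, meter):
    """Stands in for a costly, duplicated operation (e.g. a complicated SQL condition)."""
    meter.compute_calls += 1
    return [x for x in data if x % 3 == 0]


def run_without_reuse(data, n_jobs, meter):
    """Each job re-computes and stores its own copy of the same intermediate result."""
    results = []
    for _ in range(n_jobs):
        intermediate = expensive_filter(data, meter)
        meter.stored_items += len(intermediate)
        results.append(sum(intermediate))
    return results


def run_with_reuse(data, n_jobs, meter):
    """Jobs share one intermediate result, looked up by a (hypothetical) lineage key."""
    cache = {}
    key = ("source", "filter:x%3==0")   # assumed signature of the shared lineage
    results = []
    for _ in range(n_jobs):
        if key not in cache:
            cache[key] = expensive_filter(data, meter)
            meter.stored_items += len(cache[key])
        results.append(sum(cache[key]))
    return results


data = list(range(30))

no_reuse = CostMeter()
results_no_reuse = run_without_reuse(data, n_jobs=3, meter=no_reuse)

reuse = CostMeter()
results_reuse = run_with_reuse(data, n_jobs=3, meter=reuse)

print(results_no_reuse, no_reuse)
print(results_reuse, reuse)
```

Both variants return the same job results; under these assumptions, the variant without reuse evaluates the shared step once per job and holds one copy per job, whereas the variant with reuse evaluates and stores it once.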