In some dataflows, a given action can be executed multiple times during the dataflow, with various dependent transformations. To improve the performance of such dataflows, some dataflow engines provide mechanisms to persist the output of a transformation using a caching operation, thereby avoiding the re-execution of the preceding operations. The caching operation indicates that the dataset produced by an operation should be kept in memory for future reuse, so that it need not be recomputed.
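The effect of a caching operation can be illustrated with a minimal sketch in Python. The `Dataset` class, its `map`, `cache`, `count` and `total` methods, and the `run_two_actions` helper are hypothetical names introduced here for illustration only; they do not correspond to the API of any particular dataflow engine. The sketch assumes lazy evaluation, in which a transformation runs only when an action requests its output.

```python
from typing import Callable, Dict, List, Optional


class Dataset:
    """Hypothetical lazy dataset: nothing is computed until an action runs."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self._cache_requested = False
        self._cached: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int],
            counter: Optional[Dict[str, int]] = None) -> "Dataset":
        """Transformation: returns a new lazy dataset; runs nothing yet."""
        def compute() -> List[int]:
            if counter is not None:
                counter["runs"] += 1  # record each real execution
            return [fn(x) for x in self._materialize()]
        return Dataset(compute)

    def cache(self) -> "Dataset":
        """Caching operation: keep the output in memory once computed."""
        self._cache_requested = True
        return self

    def _materialize(self) -> List[int]:
        if self._cached is not None:
            return self._cached  # reuse, no re-computation
        result = self._compute()
        if self._cache_requested:
            self._cached = result
        return result

    # Actions: these trigger execution of the preceding operations.
    def count(self) -> int:
        return len(self._materialize())

    def total(self) -> int:
        return sum(self._materialize())


def run_two_actions(use_cache: bool) -> int:
    """Run two actions on one transformed dataset; return how many
    times the transformation actually executed."""
    runs = {"runs": 0}
    source = Dataset(lambda: [1, 2, 3, 4])
    squared = source.map(lambda x: x * x, counter=runs)
    if use_cache:
        squared.cache()
    squared.count()  # first action
    squared.total()  # second action
    return runs["runs"]
```

In this sketch, `run_two_actions(False)` executes the transformation once per action, whereas `run_two_actions(True)` executes it once and serves the second action from memory.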
The use of a caching operation can avoid the increased cost incurred by multiple actions in a dataflow. In complex dataflows, however, composed of tens to hundreds of operations and control flows, deciding which datasets to cache is not trivial. Thus, the decision to cache a dataset requires considerable effort from users to estimate a number of metrics.
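One metric a user might estimate by hand is how often each operation would run if nothing were cached. The following Python sketch is an illustration under stated assumptions, not a method described in this document: the function name `executions_without_cache`, the example graph and its operation names are hypothetical, and the model assumes that every action re-executes all operations it depends on, ignoring any engine-specific reuse of intermediate results. Under that assumption, an operation runs once for each distinct path leading from it to an action.

```python
from typing import Dict, List, Set


def executions_without_cache(children: Dict[str, List[str]],
                             actions: Set[str]) -> Dict[str, int]:
    """Count executions per operation when no dataset is cached.

    `children` maps each operation to the operations that consume its
    output; `actions` names the operations that trigger execution.
    The graph is assumed to be acyclic.
    """
    memo: Dict[str, int] = {}

    def paths(node: str) -> int:
        # Number of distinct paths from `node` to any action.
        if node not in memo:
            if node in actions:
                memo[node] = 1
            else:
                memo[node] = sum(paths(c) for c in children.get(node, []))
        return memo[node]

    return {node: paths(node) for node in children}


# Hypothetical dataflow: three actions share part of their lineage.
example_graph = {
    "read": ["parse"],
    "parse": ["filter", "sample"],
    "filter": ["count", "agg"],
    "agg": ["save"],
    "count": [],
    "sample": [],
    "save": [],
}
example_actions = {"count", "sample", "save"}
example_result = executions_without_cache(example_graph, example_actions)
```

Operations with a count above one, here `read`, `parse` and `filter`, are candidates for caching. Even in this seven-operation graph the choice among them depends on further factors such as computation cost and dataset size, which hints at the effort required for dataflows with hundreds of operations.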
A need therefore exists for improved techniques for automatic placement of cache operations for such dataflows.