Shared analytics clusters have become the de facto way for large organizations to analyze and gain insights over their data. Often, a cluster is comprised of tens of thousands of machines, storing exabytes of data, and supporting thousands of users, collectively running hundreds of thousands of batch jobs daily.
With shared analytics clusters, significant overlaps can be observed in the computations performed by the submitted jobs. Naively computing the same job subexpressions multiple times wastes cluster resources, which has a detrimental effect on the cluster's operational costs.