A framework, e.g., Apache Hadoop, can be deployed to manage distributed storage and distributed processing of large data sets on clusters of many computers, i.e., nodes, which may be physical or virtual. The framework can include multiple components to be run on different nodes in the cluster. Each component can be responsible for a different task. For example, a first component, e.g., Hadoop Distributed File System (HDFS), can implement a file system, and a second component, e.g., Hive, can implement a database access layer. The components work together to distribute processing of a workload of files among the nodes in the cluster.
A third component, e.g., YARN, of the framework can break up the workload into multiple tasks. In particular, the third component is a resource manager that assigns each task to a respective compute node in the cluster that performs computations of the task. Each compute node can retrieve a portion of data required for the task from the file system before executing a process that uses the portion of data to complete the task.