A framework, e.g., Apache Hadoop, can be deployed to manage distributed storage and distributed processing of large data sets on clusters of many computers, i.e., nodes, which may be physical or virtual. The framework can include multiple components to be run on different nodes in the cluster. Each component can be responsible for a different task. For example, a first component, e.g., Hadoop Distributed File System (HDFS), can implement a file system, and a second component, e.g., Hive, can implement a database access layer. The components work together to distribute processing of a workload of files among the nodes in the cluster.
When managing data access at the cluster, a file system component, e.g., HDFS, can include a master node, e.g., a Name Node, that manages a file system namespace and regulates access to files. The master node can store a mapping of data in the files to one or more slave nodes, e.g., Data Nodes. The slave nodes can store the files and serve read and write requests from client devices.