This specification relates to workload profiling in a computer cluster.
A distributed computing framework, e.g., Apache Hadoop, can be deployed to manage distributed storage and distributed processing of large data sets on clusters of many computers which may be physical or virtual. One computer is referred to as a node. The framework includes multiple components that can be run on different nodes in the cluster. Each component is responsible for a different task. For example, a first component, e.g., Hadoop Distributed File System (HDFS), can implement a file system, and a second component, e.g., Hive, can implement a database access layer. The components work together to distribute processing of a workload among nodes in the cluster.
A cluster of computers running the distributed computing framework is generally highly scalable. Additional nodes can be added to the cluster to increase throughput. Each cluster is also highly resistant to failure because data can be copied to multiple nodes in the cluster in case one or more nodes fail.