1. Field of the Invention
The present invention relates to a computer program product, system, and method for determining storage tiers for placement of data sets during execution of tasks in a workflow.
2. Description of the Related Art
Enterprises are moving computational operations, including big data analytics, to the cloud, where computing can be performed across distributed computing nodes. One system for managing the execution of multiple tasks across various computing nodes is known as Apache™ Hadoop®. (Apache is a trademark and Hadoop is a registered trademark of the Apache Software Foundation throughout the world.) Hadoop is an open-source software project that enables distributed processing of large data sets across clusters of commodity servers. Hadoop is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. The Hadoop framework is used to run long-running analytics jobs on very large data sets through distributed map-reduce style processes.
Some Hadoop distributed computing environments utilize a shared backend storage managed by a storage layer, where each computing node is assigned a portion of the shared storage, which acts as local storage for that computing node. The storage layer may use a hot/cold data classification to determine where to locate data on different storage tiers, so that the “hot,” or more frequently accessed, data is placed in the more expensive, higher performance storage tier. Other options include assigning higher performance tiers to data sets that have higher Service Level Agreement (SLA) guarantees, or assigning tiers based on pricing models.
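By way of illustration only, the hot/cold classification described above may be sketched as follows. The tier names, the access-count threshold, and the `DataSet`, `classify`, and `assign_tier` names are hypothetical and chosen solely for this example; they are not drawn from Hadoop or from any particular storage layer.

```python
from dataclasses import dataclass

# Hypothetical tier names and threshold, assumed only for illustration.
HIGH_PERFORMANCE_TIER = "ssd"
LOW_COST_TIER = "hdd"
HOT_ACCESS_THRESHOLD = 100  # accesses per observation window (assumed value)


@dataclass
class DataSet:
    name: str
    access_count: int  # accesses observed in the current window


def classify(data_set: DataSet, threshold: int = HOT_ACCESS_THRESHOLD) -> str:
    """Label a data set "hot" if it meets the access threshold, else "cold"."""
    return "hot" if data_set.access_count >= threshold else "cold"


def assign_tier(data_set: DataSet, threshold: int = HOT_ACCESS_THRESHOLD) -> str:
    """Place hot data on the higher performance tier, cold data on the lower cost tier."""
    if classify(data_set, threshold) == "hot":
        return HIGH_PERFORMANCE_TIER
    return LOW_COST_TIER


if __name__ == "__main__":
    for ds in (DataSet("clickstream", 450), DataSet("archive_2012", 3)):
        print(ds.name, "->", assign_tier(ds))
```

In such a scheme, the tier assigned to a data set depends only on its observed access frequency, without regard to which task in a workflow will next read or write the data set.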
There is a need in the art for improved techniques for assigning storage tiers to tasks in a distributed computing environment.