It is often difficult to balance the conflicting demands of storage capacity requirements and performance requirements (e.g., access times). Multi-tier storage environments, such as two-tier storage systems, typically provide a performance tier that employs memory based on performance considerations and a capacity tier that employs storage based on capacity considerations. In this manner, multi-tier storage systems balance between the relative costs of memory and other storage and their relative speeds. Such multi-tier storage environments typically allow particular levels of performance to be achieved at a significantly lower cost than would otherwise be possible.
It is often desirable to provide such multi-tier storage environments transparently to users and applications. In some circumstances, however, applications can obtain performance improvements when the multiple tiers are visible to applications.
MapReduce is a programming model for processing large data sets, such as distributed computing tasks on clusters of computers. During the map phase, a master node receives an input, divides the input into smaller sub-tasks, and distributes the smaller sub-tasks to worker nodes. During the reduce phase, the master node collects the answers to the sub-tasks and combines the answers to form an output (i.e., the answer to the initial problem).
A need exists for improved techniques for transparently including multi-tier storage systems in such MapReduce environments. A further need exists for content-aware storage tiering techniques in such MapReduce environments that are based on semantic data chunks.