Parallel storage systems are widely used in many computing environments. Such systems provide high degrees of concurrency, in which many distributed processes within a parallel application simultaneously access a shared file namespace.
Parallel computing techniques are used in many industries and applications for implementing computationally intensive models or simulations. For example, the Department of Energy uses a large number of distributed compute nodes tightly coupled into a supercomputer to model physics experiments. In the oil and gas industry, parallel computing techniques are often used for computing geological models that help predict the location of natural resources. One particular parallel computing application models the flow of electrons within a cube of virtual space by dividing the cube into smaller sub-cubes and then assigning each sub-cube to a corresponding process executing on a compute node.
Storage tiering techniques are increasingly used in parallel computing environments to store vast amounts of information more efficiently. For example, the Symmetrix system from EMC Corporation is an enterprise storage array that optionally includes Fully Automated Storage Tiering (FAST). Storage tiering techniques typically combine Non-Volatile Random Access Memory (NVRAM), such as flash memory, with more traditional hard disk drives (HDDs). Flash memory is used to satisfy the bandwidth requirements of a given system, while the hard disk drives are used to satisfy the capacity requirements.
MapReduce is a programming model for processing large data sets, such as in distributed computing tasks on clusters of computers. During the “map” step, a master node receives the input, divides it into smaller sub-problems, and distributes the smaller sub-problems to worker nodes. The worker nodes process the smaller sub-problems and pass their answers back to the master node. During the “reduce” step, the master node collects the answers to the sub-problems and combines the answers to form the output (i.e., the answer to the initial problem).
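The map and reduce steps described above can be illustrated with a minimal single-process sketch. The function names (split_input, worker, map_reduce), the choice of summation as the problem, and the default of four workers are assumptions made purely for illustration; an actual MapReduce deployment would distribute the sub-problems across separate nodes.

```python
from functools import reduce


def split_input(data, num_workers):
    """Master node: divide the input into roughly equal sub-problems."""
    size = max(1, -(-len(data) // num_workers))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(sub_problem):
    """Worker node: solve one sub-problem (here, a partial sum)."""
    return sum(sub_problem)


def map_reduce(data, num_workers=4):
    # "Map" step: divide the input and hand each piece to a worker.
    sub_problems = split_input(data, num_workers)
    # Each worker passes its answer back to the master (simulated in-process).
    answers = [worker(s) for s in sub_problems]
    # "Reduce" step: combine the partial answers into the final output.
    return reduce(lambda a, b: a + b, answers, 0)


total = map_reduce(list(range(1, 101)))
```

In this sketch, total holds the same value as a direct sum over the whole input, which shows that combining the workers' partial answers reproduces the answer to the initial problem.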
The map phase acts as a filter across all data blocks. The filtered blocks are then applied to the reduce phase. For example, consider climate data that has been loaded into a MapReduce storage file system. Assume that there are 100 blocks of data spread across 100 MapReduce nodes and the application wants to process data blocks for which the air pressure is greater than a predefined threshold, T. If there are two blocks matching this criterion, then the map job will read all 100 blocks and forward only the two matching blocks to the reducer. The remaining 98 blocks are read only to discover that they do not satisfy the criterion. Thus, a complete search of the entire data set (i.e., a MapReduce function applied on all of the data) must be performed even though only a small percentage of the data blocks are actually needed.
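The cost of the climate data example above can be made concrete with a short sketch. The threshold value, the pressure readings, the block identifiers 17 and 42, and the names make_blocks and map_filter are all hypothetical and chosen only to reproduce the 100-block, two-match scenario; they do not come from any real data set or file system.

```python
THRESHOLD_T = 1030.0  # hypothetical value for the threshold T


def make_blocks(num_blocks=100, matching=(17, 42)):
    """Build illustrative climate blocks; only `matching` exceed the threshold."""
    blocks = []
    for block_id in range(num_blocks):
        pressure = 1040.0 if block_id in matching else 1000.0
        blocks.append({"id": block_id, "air_pressure": pressure})
    return blocks


def map_filter(blocks, threshold):
    """Map phase: read every block and forward only those above the threshold."""
    blocks_read = 0
    forwarded = []
    for block in blocks:
        blocks_read += 1  # every block must be read to evaluate the criterion
        if block["air_pressure"] > threshold:
            forwarded.append(block)
    return forwarded, blocks_read


forwarded, blocks_read = map_filter(make_blocks(), THRESHOLD_T)
wasted_reads = blocks_read - len(forwarded)
```

Under these assumptions, blocks_read is 100, two blocks are forwarded to the reducer, and wasted_reads is 98, matching the counts given in the example.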
A need therefore exists for improved data analytic techniques for data distributed across a plurality of flash-based storage nodes in a hierarchical storage tiering system.