1. Technical Field
The present disclosure relates to data processing systems and, more specifically, to caching of data in a distributed data processing system.
2. Background Information
In many current analytics frameworks, distributed data processing systems may be used to process and analyze large datasets, such as files. An example of such a framework is Apache Hadoop, which provides data storage services using a distributed file system and data processing services through a cluster of commodity servers. The Hadoop-based distributed system partitions the datasets into blocks of data for distribution and storage among local storage devices coupled to the servers to enable processing of the data by the servers in accordance with one or more data analytics processes. MapReduce is an example of a computational model or paradigm employed by Apache Hadoop to perform distributed data analytics processes on large datasets using the servers.
Broadly stated, a MapReduce process is organized into a Map step and a Reduce step. In the Map step, an analytics request or “job” is apportioned into a plurality of sub-jobs or “tasks” that are distributed to the servers. Each server performs its tasks independently on its stored data blocks and produces intermediate results. The servers then execute the Reduce step to combine all of the intermediate results into an overall result.
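The Map and Reduce steps described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the word-count job, the function names (`map_task`, `reduce_step`, `run_job`), and the server names are assumptions chosen only for illustration, and the per-server loop stands in for work that a real cluster would run in parallel.

```python
from collections import defaultdict

def map_task(block):
    """Map step (hypothetical task): emit (word, 1) pairs for one data block."""
    return [(word, 1) for word in block.split()]

def reduce_step(intermediate_results):
    """Reduce step: combine all intermediate results into an overall result."""
    totals = defaultdict(int)
    for pairs in intermediate_results:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)

def run_job(blocks_by_server):
    """Apportion the job into one task per block; each task reads only
    the blocks stored on its own server."""
    intermediate = []
    for server, blocks in blocks_by_server.items():
        for block in blocks:
            intermediate.append(map_task(block))
    return reduce_step(intermediate)

# Illustrative data: two servers, each holding one block of the dataset.
blocks_by_server = {
    "server-a": ["cache block cache"],
    "server-b": ["block storage"],
}
result = run_job(blocks_by_server)
print(result)
```

Each call to `map_task` touches only one block, which is why the tasks can run independently on the servers that store the blocks; only `reduce_step` needs to see results from every server.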
When deployed in an enterprise environment, however, such distributed systems typically suffer from problems including reliance on a single storage tier (i.e., the local storage device tier) for both performance and reliability, as well as a lack of data management features. To address these problems, the system may be enhanced through the addition of a storage system and a caching layer distributed among the servers that increases the number of storage tiers, e.g., a shared storage tier and a distributed cache tier. Yet, the enhanced distributed system may be subjected to congestion conditions, such as local and remote cache bottlenecks at the servers, data popularity at the servers, and a shared storage bottleneck at the storage system, that may adversely affect throughput and performance.
According to the distributed data analytics process, a block of data may reside on a local storage device of a server, as well as on the shared storage system. Different tasks pertaining to multiple jobs that require that block of data may be scheduled on the server. If all the tasks request the data block, the local storage device may become a local bottleneck, which adversely impacts throughput of the device and server. Each server may also be assigned a limited number of “slots” or tasks that may be run in parallel. If the slots are occupied by existing tasks, new tasks may be scheduled on a different server, resulting in traffic forwarded from remote servers and creating a remote bottleneck at the different server.
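The slot behavior described above can be sketched as a toy scheduler. This is an illustrative assumption, not the scheduling policy of Hadoop or of any particular system: the function `schedule`, the "first server with a free slot" fallback, and the task, block, and server names are all hypothetical.

```python
def schedule(tasks, block_location, slots):
    """Place each task on the server holding its block if a slot is free;
    otherwise place it on another server, which must read the block remotely.

    tasks          -- list of (task_id, block_id) pairs
    block_location -- maps block_id to the server storing that block
    slots          -- maps server to its number of parallel task slots
    Returns a list of (task_id, server, is_remote) placements.
    """
    free = dict(slots)  # copy so the caller's slot counts are not modified
    placements = []
    for task_id, block_id in tasks:
        home = block_location[block_id]
        if free[home] > 0:
            target = home
        else:
            # Hypothetical fallback: the first server that still has a free slot.
            target = next((s for s in free if free[s] > 0), None)
            if target is None:
                raise RuntimeError("no free slots in the cluster")
        free[target] -= 1
        placements.append((task_id, target, target != home))
    return placements

# Three tasks from different jobs all need the same block on server-a,
# but server-a has only two slots.
placements = schedule(
    tasks=[("t1", "blk-1"), ("t2", "blk-1"), ("t3", "blk-1")],
    block_location={"blk-1": "server-a"},
    slots={"server-a": 2, "server-b": 2},
)
remote_reads = sum(1 for _, _, is_remote in placements if is_remote)
print(placements, remote_reads)
```

In this sketch the first two tasks contend for the same local storage device (the local bottleneck), while the third runs on `server-b` and must fetch the block from `server-a`, the kind of forwarded traffic that the text identifies as a remote bottleneck.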
In addition, a server of the cluster may fail, requiring that the server's block of data be accessed from the shared storage system, e.g., during reconstruction. If multiple servers of the cluster experience failures, there may be an increase in traffic to the shared storage system to access multiple blocks. The resulting increase in traffic may effectively reduce the size of the cluster supported by the shared storage system and create a shared storage bottleneck. Moreover, there may be one or more blocks residing on the local storage device of a server that are popular in the sense that multiple requests from other servers are directed to those blocks. The increased traffic at the server due to popularity of these data blocks may degrade performance of the server and its local storage device.