Every day, several quintillion bytes of data may be created around the world. These data come from everywhere: posts to social media sites, digital pictures and videos, purchase transaction records, bank transactions, sensors used to gather data and intelligence (such as climate information), cell phone GPS signals, and many others. This type of data and its vast accumulation is often referred to as “big data.” This vast amount of data is eventually stored and maintained in storage nodes, such as hard disk drives (HDDs), solid-state storage drives (SSDs), or the like, and these may reside on networks or on storage accessible via the Internet, which may be referred to as the “cloud.” This stored data may also require processing, or be subject to operations, such as search, pattern mining, classification, or other processes. Typically, a processing device, such as a central processing unit (CPU), in a server performs operations on the data. The data is read from the storage node and processed by the CPU, and the processed data is sent to the source of a request and/or stored back on the storage node. Standard storage nodes generally do not include computational resources to perform such operations on data stored in the storage node.
Moreover, standard storage node interfaces, such as Serial Advanced Technology Attachment (SATA), Fibre Channel, or Serial Attached SCSI (SAS), do not define commands to trigger the storage node to perform data operations in the storage node. Accordingly, operations are performed outside of the storage node, e.g., in a server CPU. To perform such an operation, a server uses standard read and write commands supported by existing storage node interfaces to move data from and to the storage node. Specifically, the server sends a standard read command to the storage node via a bus. The storage node then sends the stored data over the bus to the server, which typically holds the data in its main memory. The CPU of the server then performs operations on the data to produce a result. Depending on the type of operation, the server provides the result to a requesting source and/or stores the result on the storage node.
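The read, process, and write-back sequence described above can be sketched as follows. This is a minimal illustration only: the `StorageNode` class, the `bytes_on_bus` counter, and the `server_side_search` function are hypothetical names introduced here, and the in-memory dictionary merely stands in for a drive that supports nothing beyond read and write commands.

```python
class StorageNode:
    """Hypothetical storage node that supports only read and write commands."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.bytes_on_bus = 0  # bytes moved between the node and the server

    def read(self, block_id):
        data = self.blocks[block_id]
        self.bytes_on_bus += len(data)
        return data

    def write(self, block_id, data):
        self.blocks[block_id] = data
        self.bytes_on_bus += len(data)


def server_side_search(node, pattern):
    """Search every block on the server CPU, then store the result on the node."""
    matches = []
    for block_id in sorted(node.blocks):
        data = node.read(block_id)   # data crosses the bus into server memory
        if pattern in data:          # the server CPU performs the operation
            matches.append(block_id)
    result = ",".join(str(b) for b in matches).encode()
    node.write("result", result)     # the result crosses the bus back to the node
    return matches


node = StorageNode({0: b"alpha beta", 1: b"gamma delta", 2: b"beta gamma"})
found = server_side_search(node, b"beta")
print(found, node.bytes_on_bus)
```

In this sketch every stored byte crosses the bus even though only two of the three blocks match, which is the cost that the following paragraphs describe.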
There are several disadvantages associated with this process of reading the data from the storage node, processing the data within the server, and potentially storing the processed data back on the storage node. Because of these disadvantages, the process of performing data operations on the server is often referred to as “costly” or “expensive” in terms of device performance and power consumption. Because the server CPU is involved in every step of the process, this process occupies the CPU of the server, consumes power, blocks other user operations that otherwise could have been performed, and requires that the server contain a buffer, or a larger buffer than would otherwise be needed. The buffer is typically the main memory of the CPU, or double data rate (DDR) random access memory. This process also ties up the communication bus between the server and the storage node, since data is sent from the storage node to the server and then back to the storage node. In other words, existing processes for searching and analyzing large distributed unstructured databases are time-consuming and consume large amounts of resources, such as CPU time, memory, and energy.
In summary, typical operations like search, pattern mining, classification, machine learning algorithms and data analysis are, in existing systems, performed on the local server's CPU. Search and processing may be performed over all of the data residing in storage nodes (e.g., solid state drives (SSDs) or hard disk drives (HDDs)) within the server. Data needs to be moved from the storage node into the CPU memory, where it can then be processed. This is inefficient, e.g., slow, because a single server CPU, which may control a large collection of storage nodes, has relatively little processing power with which to process the large volume of data stored on the collection of storage nodes. Efficiency may also be compromised by one or more data bottlenecks between the server CPU and the storage nodes. Moreover, requiring the server's CPU to do this work makes inefficient use of energy as well, in part because a general-purpose CPU like a server CPU generally is not optimized for large data set processing, and in part because transferring data over a data bus and across the interface to the storage node requires a significant amount of power.
Big data may be managed and analyzed using the Hadoop™ software framework and using the Map-Reduce programming model. The Hadoop™ framework may implement Map-Reduce functions to distribute the data query, which may be a Map-Reduce job, into a large number of small fragments of work, referred to herein as tasks, each of which may be performed on one of a large number of compute nodes. In particular, the work may involve map tasks and reduce tasks which may be used to categorize and analyze large amounts of data in distributed systems. As used herein, a compute node is a piece of hardware capable of performing operations, and a storage node is a piece of hardware capable of storing data. Thus, for example, a piece of hardware may be, or contain, both a compute node and a storage node, and, as another example, a compute node may include or contain a storage node.
Related art Map-Reduce systems for large-scale processing of data in a parallel processing environment include one or more map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values. An intermediate data structure stores the intermediate data values. These systems also include reduce modules, which are configured to retrieve the intermediate data values from the intermediate data structure and to apply at least one user-specified reduce operation to the intermediate data values to provide output data. Preferably, the map and/or reduce tasks are automatically parallelized across multiple compute nodes in the parallel processing environment. The programs or instructions for handling parallelization of the map and reduce tasks are application independent. The input data and the intermediate data values can include key/value pairs and the reduce operation can include combining intermediate data values having the same key. The intermediate data structure can include one or more intermediate data files coupled to each map module for storing intermediate data values. The map and reduce tasks can be executed on different compute nodes. The output data can be written to the local storage node or to another compute node using a distributed file system, for instance, a Hadoop™ distributed file system (HDFS).
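The division of labor described above, in which application-independent code handles grouping while the application supplies only the map and reduce operations, can be illustrated with a short single-process sketch. The function names `run_map_reduce`, `word_count_map`, and `word_count_reduce` are hypothetical, the word-count job is an assumed example, and an in-memory dictionary stands in for the intermediate data structure; a real system would distribute this work across compute nodes.

```python
from collections import defaultdict


def run_map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to each input, group intermediate values by key, then reduce."""
    intermediate = defaultdict(list)       # stands in for the intermediate data structure
    for record in inputs:
        for key, value in map_fn(record):  # the map operation emits key/value pairs
            intermediate[key].append(value)
    # The reduce operation combines all intermediate values that share a key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def word_count_map(line):
    """Application-specific map operation: emit (word, 1) for every word."""
    return [(word, 1) for word in line.split()]


def word_count_reduce(key, values):
    """User-specified reduce operation: sum the counts for one word."""
    return sum(values)


counts = run_map_reduce(["a b a", "b c"], word_count_map, word_count_reduce)
print(counts)
```

Only the two small application functions change from job to job; the grouping logic in `run_map_reduce` is reused unchanged.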
Map-Reduce (M-R) is a programming model that allows large amounts of data to be processed on parallel computer platforms using two basic functions: map and reduce. Data is first mapped (for grouping purposes) using the map function and then reduced (aggregated) using the reduce function. For example, records having different attributes such as “dog” and “cat” could be mapped, for grouping purposes, to new records (or tuples) where each has attributes of “animal” instead of “dog” or “cat”. Then, by a reduce function, all the “animal” records (or tuples) could be aggregated. A Map-Reduce model implemented in a parallel processing computer system may enhance the processing of massive quantities of data by a “divide-and-conquer” strategy that may result from dividing the data into portions and processing it on parallel-processing computer installations.
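The “dog” and “cat” example above can be written out in a few lines. This is only a sketch: the record layout (an attribute paired with a count) and the sample values are assumptions made for illustration, and `to_animal` is a hypothetical helper name.

```python
from functools import reduce

# Assumed sample records: (attribute, count)
records = [("dog", 3), ("cat", 2), ("dog", 1), ("cat", 4)]


def to_animal(record):
    """Map step: replace the "dog" or "cat" attribute with "animal"."""
    _species, count = record
    return ("animal", count)


mapped = list(map(to_animal, records))

# Reduce step: aggregate all "animal" records into a single total.
total = reduce(lambda acc, rec: acc + rec[1], mapped, 0)
print(mapped[0], total)
```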
Related art hardware systems may include a set of data nodes, which may also be referred to as slave nodes, controlled by a master node, which may also be referred to as a job tracker or name node. Within the Hadoop™ framework, the master node may use the Map-Reduce process to assign tasks to slave nodes, the slave nodes may complete the tasks, and the master node may then aggregate the results produced by the slave nodes.
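The assign, complete, and aggregate pattern described above might be sketched as follows. This is not the Hadoop™ API: the names `slave_task` and `master_run` are hypothetical, threads in one process stand in for separate slave nodes, and the task of counting "error" records is an assumed example job.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_task(chunk):
    """Task run on one slave node: count the matching records in its chunk."""
    return sum(1 for record in chunk if record == "error")


def master_run(data, num_slaves):
    """Split the job into tasks, assign them to slaves, and aggregate the results."""
    chunks = [data[i::num_slaves] for i in range(num_slaves)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partial_counts = list(pool.map(slave_task, chunks))
    return sum(partial_counts)  # the master aggregates the slaves' results


log = ["ok", "error", "ok", "error", "error", "ok"]
total_errors = master_run(log, num_slaves=3)
print(total_errors)
```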
The master node and the slave nodes may be servers, each including a CPU and a storage node. As with the other operations described above, a slave node's tasks are executed by a CPU that retrieves data from a storage node and may save results on a storage node, which is relatively slow and power-inefficient. Thus, there is a need for a system and method, in, e.g., a Hadoop™ system, for more efficiently processing data stored on storage nodes.