1. Field of the Invention
The present invention relates to distributed computing for large data sets on clusters of computers and, more particularly, to suspicious node detection and recovery in MapReduce computing.
2. Description of the Related Art
Application server clusters have become common in the field of high-availability and high-performance computing. Application cluster-based systems exhibit three important and fundamental characteristics or properties: reliability, availability and serviceability. Each of these features is of paramount importance when designing a robust clustered system. Generally, a clustered system consists of multiple application server instances grouped together in a server farm of one or more server computing nodes that are connected over high-speed network communicative linkages. Each application server process in the application cluster can enjoy access to memory, possibly disk space and the facilities of a host operating system.
Among the many challenges faced by those who manage the capacity and performance of a clustered system is the allocation of network resources for consumption by a particular application or workload. Network resources in a cluster can be managed through agents known as workload managers. The workload managers can optimally assign different network resources within endpoint containers to handle selected workloads in an application. In many cases, workload managers can adjust the assignment of network resources based upon performance metrics measured through systems management components in the clustered system.
MapReduce is a parallel programming technique frequently used in Cloud computing environments. More specifically, MapReduce is a framework for solving certain kinds of distributable problems over huge datasets using a large number of computers (nodes), collectively referred to as a cloud or cluster. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured). MapReduce has two main components: a “Map” step and a “Reduce” step.
“Map” step: The master node takes the input, partitions it into smaller sub-problems, and distributes those to worker nodes. (A worker node may do this again in turn, leading to a multi-level tree structure.) Each worker node processes its smaller problem and passes the answer back to its master node.
“Reduce” step: The master node then takes the answers to all the sub-problems and combines them to produce the output, that is, the answer to the problem it was originally trying to solve.
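By way of illustration only, the two steps described above can be sketched in a few lines of Python. The sketch assumes a toy problem (a sum of squares) and uses a local thread pool to stand in for worker nodes; the function names `split_input`, `worker_map`, `master_reduce` and `run_job` are hypothetical and are not drawn from any particular MapReduce implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(numbers, chunk_size):
    """Master side of the Map step: partition the input into sub-problems."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def worker_map(chunk):
    """Worker side of the Map step: solve one sub-problem (sum of squares)."""
    return sum(x * x for x in chunk)


def master_reduce(partial_answers):
    """Reduce step: combine the answers to all sub-problems."""
    return sum(partial_answers)


def run_job(numbers, chunk_size=3, workers=4):
    chunks = split_input(numbers, chunk_size)
    # The thread pool is only a stand-in for worker nodes in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_answers = list(pool.map(worker_map, chunks))
    return master_reduce(partial_answers)
```

In this sketch, `run_job([1, 2, 3, 4, 5, 6, 7])` partitions the input into three sub-problems, has each solved independently, and combines the three partial sums into a single result.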
One advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice the degree of parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of ‘reducers’ can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than a single “commodity” server can handle. For example, a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data are still available.
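The two properties noted above, namely that all map outputs sharing a key reach the same reducer and that failed work can be rescheduled, can be illustrated with a minimal Python sketch. It assumes a word-count problem, a CRC32-based assignment of keys to reducers, and a simple retry loop in place of a real scheduler; all function names are hypothetical.

```python
import zlib
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs, num_reducers):
    """Route every pair with the same key to the same reducer partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Assumed partitioning rule: a deterministic hash of the key.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    return partitions


def reduce_partition(partition):
    """Reduce: sum the values collected for each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


def run_with_retry(task, argument, max_attempts=3):
    """Reschedule a failed task, assuming its input is still available."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(argument)
        except RuntimeError as error:
            last_error = error
    raise RuntimeError("task failed after all attempts") from last_error


def word_count(lines, num_reducers=2):
    pairs = []
    for line in lines:
        pairs.extend(run_with_retry(map_words, line))
    counts = {}
    for partition in shuffle(pairs, num_reducers):
        counts.update(run_with_retry(reduce_partition, partition))
    return counts
```

Because each key is assigned to exactly one partition, the per-partition results never overlap and can be merged directly. The retry loop only addresses tasks that fail outright; it does nothing about a worker that returns a wrong answer.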
However, in this computing configuration there are a number of possible attacks, including a rogue worker node that produces bad results, produces no results, produces results slowly, produces extra tasks, replaces good tasks with bad tasks, or “leaks” tasks or results to allow parties outside a firewall to see them.
It will be apparent to the skilled artisan, then, that security in Cloud computing environments can be complicated. In the presence of potentially malicious nodes in a public cloud, the master node needs to be able to both detect suspicious nodes and take corrective action when a suspicious node is detected.