Cloud computing provides traditional information technology resources such as computation, capacity, communication, and storage on demand. Typically, cloud computing vendors offer their consumers the ability to access and rent these resources at agreed-upon rates. These arrangements can offer significant benefits to consumers over traditional enterprise data center implementations, which typically feature a network of computing hardware that is privately procured, integrated, secured, and monitored. These benefits include the ability to provision additional resources on demand, to dynamically scale a client's application or system, and to limit costs to reflect actual resource usage and consumption. In addition, the cloud computing model avoids the need to construct and maintain a network architecture, which eliminates the time required for hardware procurement and assimilation as well as the notorious difficulties of software integration.
The majority of current cloud computing infrastructures consist of numerous servers with varying levels of virtualization technologies. Architecturally, cloud computing data center networks can resemble traditional enterprise architectures, although generally on a much larger scale. For example, the data center network of a typical cloud computing vendor may be implemented as a hierarchy of routers and concentric subnets connecting a large network of servers, often numbering in the hundreds or thousands. However, like enterprise infrastructures, cloud computing data center networks are typically under-provisioned relative to the aggregate peak demand of all consumers, often by a significant factor. This under-provisioning can compromise the efficacy of the network and prevent it from performing at its nominal level of throughput when the simultaneous demand for resources is high.
To mitigate this problem, some prominent cloud computing vendors have recently introduced spot markets. In a cloud spot market, the cost of certain computing resources can fluctuate according to demand. Typically, processing services (as opposed to communication or storage services) experience the most drastic impact of high demand, and, accordingly, processing service rates often exhibit the most elasticity. A spot market allows a consumer to automatically cease operation of some or all of the consumer's requisitioned cloud computing resources when the service rate for the resource gets too high (e.g., exceeds a pre-determined threshold rate), and to automatically restart operation (and thus resume incurring service fees) when the current service rate drops below the threshold rate. This allows consumers to reduce the costs incurred by cloud computing service fees and to take advantage of reduced rates, while potentially alleviating the strain of accommodating a high number of consumers during peak demand hours.
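The threshold behavior described above can be illustrated with a minimal sketch. The function names, the sample rates, and the billing model (one billing unit per observed rate while the resource runs) are assumptions made for illustration and do not correspond to any particular vendor's interface.

```python
def spot_actions(rates, threshold):
    """Return one action per observed rate under a simple threshold policy.

    Assumption: the resource starts out stopped, stops when the rate
    exceeds the threshold, and restarts when the rate drops below it.
    A rate exactly equal to the threshold leaves the state unchanged.
    """
    running = False
    actions = []
    for rate in rates:
        if running and rate > threshold:
            running = False
            actions.append("stop")
        elif not running and rate < threshold:
            running = True
            actions.append("start")
        else:
            actions.append("run" if running else "idle")
    return actions


def billed_cost(rates, threshold):
    """Sum the rates of the periods in which the resource was running.

    Hypothetical billing model: one unit of usage per observation.
    """
    actions = spot_actions(rates, threshold)
    return sum(r for r, a in zip(rates, actions) if a in ("start", "run"))


# Hypothetical hourly rates and a consumer-chosen threshold.
hourly_rates = [0.10, 0.12, 0.30, 0.25, 0.11]
schedule = spot_actions(hourly_rates, threshold=0.20)
cost = billed_cost(hourly_rates, threshold=0.20)
```

In this sketch the consumer pays only for the periods in which the rate is acceptable, at the price of having the resource stopped whenever the rate spikes.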
MapReduce is an increasingly popular programming framework for processing application data in parallel over a plurality of computing units, and is especially suited to cloud computing (e.g., by performing the parallel processing in provisioned virtual machines in a cloud infrastructure). The individual computing units that perform the processing are often referred to as “nodes.” According to conventional practice, MapReduce performs the data processing of an application over a series of “map” and “reduce” steps.
During a map step, a “master” (control) node takes the input (such as a processing task), partitions it into smaller sub-tasks held in a queue (typically, an instance of a communication or messaging service), and distributes the sub-tasks to “worker” (processing) nodes. The worker nodes process the sub-tasks and pass the resultant data back to their master node. During the reduce step, the master node takes the processed data from all the sub-tasks and combines them in some way; the combined data, once aggregated and organized, becomes the output of the original task. Once the output is derived, it can be “committed,” that is, stored. Within a cloud computing infrastructure, the outputs can be temporarily stored in an instance of a storage service, such as an instanced database. Alternatively, the output can be stored in memory (either volatile or non-volatile) of the master node.
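The map, reduce, and commit steps described above can be sketched in a single process as follows. This is an illustrative word-count example only: the function names are hypothetical, an in-process queue stands in for the messaging service, and a plain dictionary stands in for the storage service.

```python
from collections import Counter
from queue import Queue


def partition(lines, chunk_size):
    """Master: split the input into smaller sub-tasks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_worker(sub_task):
    """Worker: count the words in one sub-task."""
    counts = Counter()
    for line in sub_task:
        counts.update(line.lower().split())
    return counts


def reduce_results(partials):
    """Master: combine the per-sub-task counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(lines, chunk_size=2, store=None):
    """Run map, reduce, and (optionally) commit, all in one process."""
    queue = Queue()  # stand-in for a messaging service instance
    for sub_task in partition(lines, chunk_size):
        queue.put(sub_task)

    partials = []
    while not queue.empty():
        # A real deployment would hand each sub-task to a separate node.
        partials.append(map_worker(queue.get()))

    output = reduce_results(partials)
    if store is not None:
        store["output"] = dict(output)  # "commit": persist the result
    return output


storage = {}  # stand-in for a storage service instance
word_counts = run_job(["a b a", "b c", "a"], chunk_size=2, store=storage)
```

Until the final assignment into `store`, the result exists only in the memory of the process playing the master role, which is the uncommitted state referred to above.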
One of the benefits of the MapReduce framework is that it allows for distributed parallel processing of the map and reduce operations, rather than typical sequential processing. For example, if each mapping operation is independent of the others, all maps can be performed in parallel. Similarly, a number of ‘reducers’ can perform the reduction phase. An advantage of MapReduce implementations over traditional server solutions is that the MapReduce framework can be applied to significantly larger datasets with greater efficiency. The distributed structure of processing task assignments also offers some degree of fault tolerance during node failures, since the sub-tasks can be rescheduled and reassigned.
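A minimal sketch of independent map operations running in parallel, with failed sub-tasks rescheduled, might look like the following. Threads stand in for worker nodes, and `NodeFailure`, `run_maps`, and `fail_once_on_even` are hypothetical names introduced only for this illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class NodeFailure(Exception):
    """Hypothetical signal that a worker node died mid-task."""


def run_maps(sub_tasks, map_fn, max_rounds=3, workers=4):
    """Run independent map operations in parallel, rescheduling failures."""
    results = {}
    pending = list(range(len(sub_tasks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            # Submit every outstanding sub-task; they run concurrently.
            futures = {i: pool.submit(map_fn, sub_tasks[i]) for i in pending}
            pending = []
            for i, future in futures.items():
                try:
                    results[i] = future.result()
                except NodeFailure:
                    pending.append(i)  # reschedule in the next round
    if pending:
        raise RuntimeError(f"sub-tasks {pending} failed in every round")
    return [results[i] for i in range(len(sub_tasks))]


def fail_once_on_even(fn):
    """Wrap fn so the first attempt on each even input simulates a failure."""
    seen = set()
    lock = threading.Lock()

    def wrapper(x):
        with lock:
            first_attempt = x not in seen
            seen.add(x)
        if first_attempt and x % 2 == 0:
            raise NodeFailure(f"node lost while processing {x}")
        return fn(x)

    return wrapper


squares = run_maps([1, 2, 3, 4], fail_once_on_even(lambda x: x * x))
total = sum(squares)  # a trivial reduce step over the map outputs
```

Rescheduling of this kind works when failures are isolated. It offers no help if every worker, or the coordinating process itself, disappears at the same moment.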
Although conventional MapReduce implementations can tolerate failures, they are ill-suited to the spot market environment. Conventional MapReduce implementations are designed to tolerate infrequent or smaller-scale machine failures, but are unable to cope with massive, simultaneous machine terminations caused by a spot price increase. If both the primary and backup master nodes fail, no computation can progress. More problematically, the results of processing performed by the nodes for a specific map task that have not yet been committed (stored) can be lost during termination as well. Even if the master nodes do not run on spot instances, several simultaneous node terminations could cause all replicas of a data item to be lost. In addition to such data loss, operation pauses and delays can significantly lengthen MapReduce computation times.