MapReduce is becoming a popular programming model for processing large jobs in a distributed network of processing clusters or nodes, such as Hadoop®. Currently, a variety of service providers offer Hadoop® cloud services, such as, for example, Amazon® Elastic MapReduce, Skytap®, Joyent®, Windows® Azure, Rackspace® and the like.
Currently, there is no fast and efficient way to estimate a cost and a job completion time for MapReduce jobs. Obtaining a job completion time estimate and a cost estimate can be challenging because many infrastructure configurations are hidden from a user in cloud computing environments. MapReduce jobs are typically very complex, and the only way to obtain a job completion time may be to run the MapReduce job itself on each cluster or service. Unfortunately, obtaining an estimated cost and job completion time in this way may take a considerable amount of time.
In addition, each one of the services may offer multiple types of virtual nodes with different hardware configurations and software. For example, Amazon® Elastic MapReduce may offer more than eight different types of virtual nodes on which a user can choose to run his or her MapReduce job. Thus, running the MapReduce job on each one of the vast number of available virtual nodes to obtain estimated job completion times and estimated costs would be challenging, complex and time consuming.
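The brute-force approach described above can be sketched as follows. This is a minimal illustration only: the node type names, hourly prices, cluster sizes, and runtimes are all invented for the example, and the measurement function stands in for actually executing the job, which in practice would take hours per node type.

```python
# Hypothetical illustration of brute-force estimation: run the MapReduce job
# on every available virtual node type, measure its runtime, and derive a
# cost estimate. All figures below are invented for illustration.

NODE_TYPES = {
    # node type: (assumed price per node-hour in USD, assumed nodes in cluster)
    "small": (0.10, 16),
    "medium": (0.20, 8),
    "large": (0.40, 4),
    "xlarge": (0.80, 2),
}

def run_job_and_measure(node_type: str) -> float:
    """Stand-in for actually executing the MapReduce job on a cluster of the
    given node type and timing it; returns the measured runtime in hours.
    In reality each call would require a full (possibly hours-long) job run."""
    assumed_runtimes = {"small": 4.0, "medium": 2.5, "large": 1.5, "xlarge": 1.0}
    return assumed_runtimes[node_type]

def brute_force_estimates():
    """Obtain (runtime_hours, cost) per node type by running the job on each.
    The loop itself is trivial; the expense lies in every iteration executing
    the entire job, which is why this approach is slow and costly."""
    estimates = {}
    for node_type, (price_per_hour, node_count) in NODE_TYPES.items():
        hours = run_job_and_measure(node_type)       # full job execution
        cost = hours * price_per_hour * node_count   # cost = time x rate x nodes
        estimates[node_type] = (hours, round(cost, 2))
    return estimates
```

Even this toy version requires one complete job execution per node type; with more than eight node types per provider and several providers, the number of full runs needed grows quickly, which motivates the need for a faster estimation method.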