1. Field of the Invention
The present invention relates generally to distributed computing, and in particular, to a method, apparatus, system, and article of manufacture for utilizing a map-reduce software framework to perform computation processing. The invention relates to optimally leveraging distributed computing hardware for data mining by developing a highly reusable infrastructure that gives developers an easy way to execute map-reduce functionality over an entire cluster. Thus, the invention turns multiple computers into a single problem-solving machine through the parallel distribution of paired data.
2. Description of the Related Art
Map-reduce is a software framework used to support distributed computing on large data sets on clusters of computers (nodes). There are two steps as part of a map-reduce framework—“map” and “reduce”. During the “map” step, a master node takes input, chops it up into smaller sub-problems and distributes those to worker nodes. The worker node processes the smaller problem and passes the answer back to the master node. During the “reduce” step, the master node takes the answers to all the sub-problems and combines them in a way to get the output (the answer to the problem it was originally trying to solve). Many problems exist in the prior art map-reduce implementations: (1) they are often restricted to a particular operating system such as Linux™; (2) they are difficult to use without substantial experience in a particular programming language; (3) they require a preexisting knowledge of the master computer, the functionality of the master computer, and the use of the master computer in the cluster; and (4) processing may not be evenly distributed across all machines in a cluster. These problems may be better understood with a more detailed explanation of prior art map-reduce implementations and uses.
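By way of illustration only, the master/worker flow described above may be sketched in Python as follows. The sketch simulates worker nodes with local threads, and the function names (split_into_subproblems, worker_process, master_run) as well as the choice of summation as the sub-problem are hypothetical and are not drawn from any particular prior art implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subproblems(data, num_workers):
    """Master side: chop the input into roughly equal chunks."""
    # Ceiling division so that at most num_workers chunks are produced.
    chunk_size = max(1, -(-len(data) // num_workers))
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def worker_process(chunk):
    """Worker side: solve one sub-problem (here, a partial sum)."""
    return sum(chunk)


def master_run(data, num_workers=4):
    """Master side: distribute sub-problems, then combine the answers."""
    subproblems = split_into_subproblems(data, num_workers)
    # Threads stand in for worker nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_answers = list(pool.map(worker_process, subproblems))
    # "Reduce" step: combine the partial answers into the final output.
    return sum(partial_answers)


if __name__ == "__main__":
    print(master_run(list(range(1, 101))))
```

Note that in this sketch every request passes through master_run, which mirrors the single point of entry that the prior art requires.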
Prior art map-reduce implementations are commonly open source and have been primarily limited to a Linux™ operating system environment. While such implementations may be useful, many users and developers are used to the Microsoft™ Windows™ operating system environment and are unable to take advantage of such Linux™-based implementations.
Further, prior art map-reduce technologies often require substantial programming experience in a particular programming language. In addition, the developer is required to maintain a preexisting knowledge base regarding the functionality of all nodes in a cluster (i.e., in order to determine which node should be used to perform a desired function). Also, a single master node must be used as the entry point for accessing and utilizing the functionality provided by a cluster.
Some more specific details regarding map-reduce functions may be useful in better understanding the problems of the prior art. The “map” and “reduce” functions are both defined with respect to data structured in (key,value) pairs. The “map” function takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain. The map function is applied in parallel to every item in the input dataset to produce a list of (key,value) pairs for each call. Thus, the “map” function identifies a set of (key,value) pairs. All of the pairs with the same key from all lists may be collected and grouped together to create one group for each of the different generated keys. The “reduce” function is then applied in parallel to each group, to produce a collection of values in the same domain. Accordingly, the map-reduce function transforms a list of (key,value) pairs into a list of values. During a “map” function, (key,value) pairs are identified, and during the “reduce” function, like keys are brought together and merged to produce a result.
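As a non-limiting example, the (key,value) flow described above may be sketched in Python using a word count. The names map_fn, group_by_key, reduce_fn, and map_reduce are hypothetical, the calls run sequentially rather than in parallel for simplicity, and the result retains each key alongside its reduced value for readability.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: one (document_name, text) pair -> list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def group_by_key(pairs):
    """Collect all values that share a key into one group per key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups


def reduce_fn(key, values):
    """Reduce: merge the values of one group into a single result."""
    return sum(values)


def map_reduce(inputs):
    """Apply map to every input pair, group like keys, then reduce."""
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))
    groups = group_by_key(intermediate)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}


if __name__ == "__main__":
    documents = [("a", "the cat"), ("b", "the dog The")]
    print(map_reduce(documents))
```

Here the input domain is (document name, text) and the intermediate domain is (word, count), which corresponds to the change of data domain performed by the “map” function.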
However, as part of the execution of the “map” and “reduce” functions, the data may be skewed such that a larger portion of processing may be performed by a particular machine in the cluster. Further, a single master node is responsible for executing and controlling the function execution. Examples of various prior art implementations include the Apache™ Hadoop™ project implementation of map-reduce, the Apache™ CouchDB™ project, the Skynet™ open source implementation of Google™'s map-reduce framework, and the Disco™ open-source implementation.
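The data skew noted above may be illustrated with the following Python sketch. It assumes, purely for illustration, that each key is assigned to a worker by hashing the key modulo the number of workers; this assignment scheme and the names partition and worker_loads are hypothetical and are not attributed to any of the implementations named above.

```python
import zlib
from collections import Counter


def partition(key, num_workers):
    """Assign a key to a worker by a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def worker_loads(pairs, num_workers):
    """Count how many (key,value) pairs land on each worker."""
    loads = Counter({w: 0 for w in range(num_workers)})
    for key, _ in pairs:
        loads[partition(key, num_workers)] += 1
    return dict(loads)


if __name__ == "__main__":
    # One "hot" key accounts for 90 of the 100 pairs.
    pairs = [("hot", 1)] * 90 + [(f"k{i}", 1) for i in range(10)]
    print(worker_loads(pairs, num_workers=4))
```

Because all pairs sharing a key are sent to the same worker, the worker that receives the hot key handles at least 90 of the 100 pairs regardless of how many workers are in the cluster.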
In addition to the above-described limitations, the prior art fails to provide the ability to perform cluster-based debugging and editing of code/transactions. Accordingly, what is needed is a distributed computer system that enables a map-reduce function to be easily performed on any node in a cluster without requiring a specific master node, where data and processing are evenly distributed across the cluster, and where cluster-based processing/debugging can be utilized.