Configuration files are required in a distributed computing environment so that the computers in the network can communicate with one another. FIG. 1 is a block diagram of a conventional distributed computing process 10. The computing process 10 comprises a user 12 which sends configuration files to a master process 14. The master process 14 then initiates all of the slave processes 16, 18 and 20. Typically, the computers which run these processes must share a standard convention for communication within the distributed computing environment. A typical convention for such an environment is the message passing interface (MPI); applications that use this interface can communicate across the computers on which they run. Accordingly, the standard way to generate the configuration files used to run an MPI application on a cluster of computers in a distributed environment is as follows: start with a list of the computers in the cluster that will host the MPI processes; determine, based on their addresses and the number of CPUs in each computer, what the contents of each configuration file should be; send the appropriate configuration file to each computer; start all of the slave processes; and finally start the master process.
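The conventional procedure can be illustrated with a minimal sketch. It assumes the host list (addresses and CPU counts) is known ahead of time, that one process runs per CPU, and that the first host in the list acts as the master. The Host type, the build_config_files function and the file layout are hypothetical and are not the format of any particular MPI implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    address: str  # IP address or hostname, known before the job starts
    cpus: int     # CPU count, known before the job starts


def build_config_files(hosts):
    """Return {address: config_text} for every host in the cluster.

    Assumption: the first host in the list hosts the master process.
    The text layout is illustrative only.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # One roster line per host: "<address> <process count>".
    roster = "\n".join(f"{h.address} {h.cpus}" for h in hosts)
    master_address = hosts[0].address
    configs = {}
    for index, host in enumerate(hosts):
        role = "master" if index == 0 else "slave"
        configs[host.address] = (
            f"role={role}\nmaster={master_address}\n{roster}\n"
        )
    return configs


hosts = [Host("10.0.0.1", 2), Host("10.0.0.2", 4), Host("10.0.0.3", 1)]
configs = build_config_files(hosts)
```

Every value written into the files comes from the host list, which is why the procedure depends on that list being available before any process is started.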
This standard method of configuration file generation is not possible if a list of the IP addresses and CPU counts of the computers that run the MPI processes is not available ahead of time. To explain this problem in more detail, consider the following. Apple Computer provides, for example, Xgrid, a suite of applications for running computationally intensive applications. Xgrid enables administrators to group locally networked computers or nodes into clusters or grids and allows users on the network to remotely submit long-running computations as jobs to the clusters. Xgrid then creates multiple tasks for each job and distributes those tasks among multiple nodes, which can be either multipurpose desktops or dedicated cluster nodes.
Distributed Computing Under Xgrid Architecture
FIG. 2 is a block diagram of a distributed computing environment cluster 100.
Components
A cluster comprises three main software components:
1. An agent 106-110 runs one task at a time per CPU, in either dedicated mode or screensaver mode.
2. A controller 104 queues tasks, distributes those tasks to agents, and handles failover.
3. A client 102 submits jobs to the controller in the form of multiple tasks.
A user interacts with the grid via the client. The client uses either a multicast broadcast (for example, via Rendezvous) or an internet protocol (IP) address or hostname to find a controller, and then submits a job, which is a collection of execution instructions that may include data and executables. The controller 104 accepts the job and its associated files and communicates with the agents. Agents 106-110 accept the tasks, perform the calculations, and return the results to the controller, which aggregates them and returns them to the appropriate client.
In principle, all three components can run on the same computer, but it is often more efficient to have a dedicated controller.
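The division of labor among the three components can be sketched as a small single-process simulation. The Agent and Controller classes below are hypothetical stand-ins, not the Xgrid programming interface. The round-based dispatch policy is an assumption made for illustration, and failover is omitted.

```python
from collections import deque


class Agent:
    """Hypothetical agent: runs at most one task per CPU per round."""

    def __init__(self, name, cpus):
        self.name = name
        self.cpus = cpus

    def run(self, task):
        return task()


class Controller:
    """Hypothetical controller: queues tasks and hands them to agents."""

    def __init__(self):
        self.agents = []
        self.queue = deque()

    def register(self, agent):
        self.agents.append(agent)

    def submit(self, tasks):
        # A job arrives from the client as a collection of (id, callable).
        self.queue.extend(tasks)

    def dispatch(self):
        if sum(agent.cpus for agent in self.agents) == 0:
            raise RuntimeError("no agent CPUs are registered")
        results, placement = [], []
        while self.queue:
            # Each round, every agent receives at most one task per CPU.
            for agent in self.agents:
                for _ in range(agent.cpus):
                    if not self.queue:
                        break
                    task_id, fn = self.queue.popleft()
                    results.append((task_id, agent.run(fn)))
                    placement.append((task_id, agent.name))
        return results, placement


controller = Controller()
controller.register(Agent("agent-a", cpus=2))
controller.register(Agent("agent-b", cpus=1))
# The client side: one job made of four tasks.
controller.submit([(i, lambda i=i: i * i) for i in range(4)])
results, placement = controller.dispatch()
```

In this sketch the client supplies only the tasks. The mapping of tasks to agents is produced inside dispatch, on the controller side.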
Client 102
A user submits a job to the controller via an Xgrid client application, using either the command-line tool (xgrid) or a graphical user interface application built using the Xgrid application framework. The user defines the parameters for the job to be executed in the Xgrid client, and these are sent to the controller. When the job is complete, the client is notified and can retrieve the results from the controller.
Any system can be an Xgrid client provided it has the Xgrid application installed and has a network connection to the controller system. In general, the client submits a job to a single controller at a time.
Controller 104
The controller service (xgridcontrollerd) manages the communications and the resources of the clusters. The xgridcontrollerd process accepts network connections from clients and agents. It receives job submissions from the clients, breaks the jobs up into tasks, dispatches tasks to the agents and provides feedback to the clients.
Agents 106, 108, 110
The agents run the computational tasks that comprise a job. When the agent process (xgridagentd) launches at system startup, it registers with the controller, which sends instructions and data to the agent when appropriate. An agent can be connected to only one controller at a time. Once the instructions from the controller are received, the agent executes the appropriate code and sends the results back to the controller.
Accordingly, Xgrid allows a client to submit a list of processes to run on a distributed set of computers but does not let the client decide ahead of time which computers will host which processes. With a system such as Xgrid, the client knows neither the IP addresses of the computers that will be assigned to run the processes nor how many processes will run on each computer. Therefore, it is impossible for the client to generate either the master configuration file or the slave configuration files for the processes.
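The difficulty can be made concrete with a short sketch. The place function below is hypothetical. It assumes a simple first-fit policy over whichever agents happen to have free CPUs, which is not a description of how Xgrid actually schedules tasks. The addresses are illustrative.

```python
def place(num_processes, free_cpus):
    """Assign processes to agents with free CPUs; return {address: count}.

    free_cpus maps an agent address to its free CPU count. Only the
    controller sees this information, and only at dispatch time.
    """
    layout = {}
    remaining = num_processes
    for address, cpus in free_cpus.items():
        if remaining == 0:
            break
        take = min(cpus, remaining)
        if take:
            layout[address] = take
            remaining -= take
    if remaining:
        raise RuntimeError("not enough free CPUs for the job")
    return layout


job_size = 4  # the only thing the client specifies

# The same job submitted against two different cluster states.
first_run = place(job_size, {"10.0.0.5": 2, "10.0.0.9": 2})
second_run = place(job_size, {"10.0.0.7": 1, "10.0.0.5": 1, "10.0.0.8": 4})
```

The same four-process job lands on different addresses with different per-computer process counts in each run, so a configuration file written at submission time could not list the correct hosts.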
Accordingly, as mentioned above, the standard way to generate the configuration files used to run an MPI application on a cluster of computers is to start with a list of the computers in the cluster that will host the MPI processes; determine, based on their addresses and the number of CPUs in each computer, what the contents of each configuration file should be; send the appropriate configuration file to each computer; start all of the slave processes; and finally start the master process.
This standard method of configuration file generation is not possible if a list of IP addresses and CPU counts is not available ahead of time for the computers that run the MPI processes.
Accordingly, what is needed is a system and method for configuration file generation which does not require a list of addresses and CPU counts ahead of time. The system and method should be easily implemented on existing systems and should be adaptable therewith. The present invention addresses such a need.