Parallel processing has been used in the past to execute parallel jobs quickly and efficiently. In parallel processing systems, the multiple tasks of a parallel job are separated and allocated to more than one resource, generally located on more than one host, for parallel execution. The exit statuses of the completed tasks together constitute the completed job.
FIG. 1 illustrates a prior art parallel processing system, shown generally by reference numeral 10. As shown in FIG. 1, the prior art system 10 comprises a vendor parallel job launcher 8 that allocates and dispatches message passing interface (MPI) tasks 4 to resources 6. The vendor parallel job launcher 8 may also support secondary features, such as providing support for a switch to facilitate high-speed communication between resources. The resources 6 then complete the MPI tasks 4 and generally send the exit status of each MPI task 4 to an appropriate location, which may be dictated by the framework and/or programmed by the system. The resources 6 are generally present on more than one host 40, and a particular host 40 may have more than one resource. The resources 6 can include processors, memory, swap space or even a license for software.
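By way of illustration only, the dispatch and exit-status behaviour described above can be sketched as follows. The function names `dispatch` and `job_status`, the round-robin placement policy, and the convention that an exit status of zero means success are assumptions made for this sketch; they are not taken from any particular vendor parallel job launcher.

```python
from itertools import cycle


def dispatch(tasks, resources):
    """Assign each task to a resource in round-robin order.

    Round-robin is an assumed policy for illustration; a real launcher
    may place tasks differently.
    """
    return {task: resource for task, resource in zip(tasks, cycle(resources))}


def job_status(exit_statuses, expected_tasks):
    """Summarise a job from the exit statuses reported so far.

    The job is treated as still running until every expected task has
    reported. Exit status 0 is assumed to mean success.
    """
    if not set(expected_tasks).issubset(exit_statuses):
        return "running"
    failed = any(exit_statuses[task] != 0 for task in expected_tasks)
    return "failed" if failed else "done"


# Hypothetical hosts and tasks: a resource is modelled as (host, slot).
resources = [("hostA", 0), ("hostB", 0)]
tasks = ["task0", "task1", "task2"]
placement = dispatch(tasks, resources)
```

In this sketch the third task wraps around to the first resource, and the job is reported as complete only once an exit status has been received for every task.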
The prior art system 10 shown in FIG. 1 was found to suffer from the disadvantage that there was no job management present. The lack of a job management function resulted in inefficient allocation of resources to jobs, an inability to monitor resources used by the job and an inability to control the jobs.
Accordingly, a further improvement on the prior art system 10 illustrated in FIG. 1 was developed and is shown generally by reference numeral 20 in FIG. 2. As shown in FIG. 2, the improved prior art system 20 comprises a resource and parallel application manager (PAM), shown generally by reference numeral 24. The PAM 24 controls and manages the parallel jobs and offers comprehensive collection and management of the parallel jobs. In general, the PAM 24 collects the resources 6 required to execute each task 4 of a job, whether the job is parallel, sequential, or a combination of both. In this way, the PAM 24 provides a point for job control. In one embodiment, the PAM 24 controls the execution of the parallel job, such as by stopping, resuming and suspending a parallel job or the tasks of a parallel job.
While the system 20 shown in FIG. 2 provides improved resource management, the system 20 suffers from the disadvantage that, in order to permit the PAM 24 to communicate with the vendor parallel job launcher 18, substantial revisions and customization are required to the different components. This is the case, in part, because the PAM 24 is generally generic but the vendor parallel job launcher 18 is vendor-specific. Therefore, customized software and code are generally required for the PAM 24 to communicate with the job launcher 18, or users must utilize a specific vendor parallel job launcher 18 supplied by the same vendor as the PAM 24.
If the PAM 24 and the job launcher 18 cannot communicate, some features or abilities may be sacrificed. For example, the PAM 24 needs information regarding the tasks 4 in order to (a) collect resources for the job, and (b) control the job. In order to access a task 4, such as to monitor usage or control execution of the task, the PAM 24 generally requires the host and process identifier (host/pid) of the task 4. However, because in the prior art system 20 the parallel job launcher 18 commenced the task 4 on the resource 6, the job launcher 18, rather than the PAM 24, would generally have the process identifier of the task 4. Therefore, it was generally necessary to customize the PAM 24 and the job launcher 18 so that the host and process identifier (host/pid) could be communicated from the job launcher 18 to the PAM 24.
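The dependence of job control on the host and process identifier can be illustrated with the following minimal sketch. The class `TaskRegistry`, its methods, and the mapping of the stop, suspend and resume functions onto POSIX signal names are hypothetical and are chosen for illustration only; they do not describe the interface of any actual PAM or job launcher.

```python
# Assumed mapping of job-control actions to POSIX signal names.
ACTION_SIGNALS = {
    "suspend": "SIGSTOP",
    "resume": "SIGCONT",
    "stop": "SIGTERM",
}


class TaskRegistry:
    """Hypothetical manager-side record of where each task is running."""

    def __init__(self):
        self._tasks = {}  # task id -> (host, pid)

    def register(self, task_id, host, pid):
        """Record the host/pid pair reported for a task."""
        self._tasks[task_id] = (host, pid)

    def plan_control(self, action):
        """Return the (host, pid, signal) triples needed to apply an action.

        Only tasks whose host/pid has been reported can be reached.
        """
        sig = ACTION_SIGNALS[action]
        return [(host, pid, sig) for host, pid in self._tasks.values()]

    def unreachable(self, expected_task_ids):
        """Return the tasks the manager cannot control or monitor."""
        return [t for t in expected_task_ids if t not in self._tasks]


registry = TaskRegistry()
registry.register("task0", "hostA", 4101)  # hypothetical host and pid
plan = registry.plan_control("suspend")
missing = registry.unreachable(["task0", "task1"])
```

Here a task whose host/pid was never passed from the launcher to the manager, such as the hypothetical `task1`, cannot be suspended, resumed, stopped or monitored by the manager.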
Furthermore, some resources 6 may also be vendor specific. In this case, customized software may be required to communicate between the resource 6, the job launcher 18 and PAM 24.
Accordingly, in order to launch and execute tasks 4 of parallel jobs, it has been necessary to customize the PAM 24, the job launcher 18, and sometimes applications being executed on the resource 6 so that the various components could communicate. This has been cumbersome for several reasons. Firstly, it has been difficult and time-consuming to prepare the customized portions for the PAM 24 and job launcher 18. In addition, this customization would need to be done for each PAM 24 and job launcher 18 combination. Furthermore, this customization generally must be updated each time either the PAM 24 or the job launcher 18 is updated. In the case where the resource 6 comprises a vendor-specific application, the customization may need to be updated each time a new version of the application is installed.
Secondly, in order to create this customized portion, it is generally necessary to have information regarding the PAM 24, the job launcher 18 and any vendor-specific application being executed by the resource 6. While the PAM 24 is often generic, the job launcher 18 is generally vendor-specific, and the applications being executed by the resource 6 are usually vendor-specific and occasionally have specific requirements that are not available to the public. In other words, applications and job launchers 18 are generally purchased from different vendors and can be “closed applications”, meaning that it is not easy, or even possible, to see how they operate. In order to create the customized portion for “closed applications”, an analysis must be made of the external functioning of the closed applications so that the customization can be completed. This is often a time-consuming and difficult process and sometimes may not even be possible. Alternatively, if the vendor of the job launcher 18 is co-operating, the internal details of the generally “closed” parallel job launcher 18 may be obtained.
A further disadvantage of the prior art devices is that the PAM 24 lacks information about the completion of the tasks 4. In other words, while the PAM 24 manages the resources 6 and controls the execution of jobs, the ability of the PAM 24 to perform these functions is limited because the prior art PAM 24 devices do not generally have easy access to the exit status of the tasks 4. As illustrated in FIGS. 1 and 2, the prior art systems 10, 20 provide little or no information back to the PAM 24 regarding the tasks 4. The PAM 24 in the prior art systems 10, 20 has little or no information about the MPI tasks after they are dispatched to the parallel job launcher 18.
Accordingly, the prior art suffers from several disadvantages. Principal among these is that the various components, such as the PAM 24 and the vendor parallel job launcher 18, cannot communicate with one another unless detailed customized software is prepared for each of them or the same vendor supplies them. The prior art also suffers from the disadvantage that the PAM 24 cannot easily access, or communicate with, the tasks 4, which limits the ability of the PAM 24 to comprehensively manage the resources 6 and control the execution of parallel jobs. Furthermore, the prior art devices suffer from the disadvantage that the PAM 24 lacks easy access to information regarding the exit status of the tasks 4 and, in particular, the resource usage of the tasks 4.