1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a method for providing hardware-based dynamic load balancing of message passing interface tasks.
2. Description of Related Art
A parallel computing system is a computing system with more than one processor for parallel processing of tasks. A parallel program is a program that may consist of one or more jobs that may be separated into tasks that may be executed in parallel by a plurality of processors. Parallel programs allow the tasks to be simultaneously executed on multiple processors, with some coordination between the processors, in order to obtain results faster.
There are many different approaches to providing parallel computing systems. Examples of some types of parallel computing systems include multiprocessing systems, computer cluster systems, parallel supercomputer systems, distributed computing systems, grid computing systems, and the like. These parallel computing systems are typically distinguished from one another by the type of interconnection between the processors and memory. One of the most widely accepted taxonomies of parallel computing systems classifies them according to whether all of the processors execute the same instructions, i.e. single instruction/multiple data (SIMD), or each processor executes different instructions, i.e. multiple instruction/multiple data (MIMD).
Another way in which parallel computing systems are classified is based on their memory architectures. Shared memory parallel computing systems have multiple processors accessing all available memory as a global address space. These shared memory parallel computing systems may be further classified into uniform memory access (UMA) systems, in which access times to all parts of memory are equal, or non-uniform memory access (NUMA) systems, in which access times differ depending on which part of memory is accessed. In yet another class, distributed memory parallel computing systems, multiple processors are likewise utilized, but each of the processors can only access its own local memory, i.e. no global memory address space exists across them. Still another type of parallel computing system, and the most prevalent in use today, is a combination of the above systems in which nodes of the system have some amount of shared memory for a small number of processors, but many of these nodes are connected together in a distributed memory parallel system.
The Message Passing Interface (MPI) is a language-independent application programming interface (API) specification for message passing on shared memory or distributed memory parallel computing systems. With MPI, a parallel application is typically provided as one or more jobs which are then separated into tasks which can be processed in a parallel manner on a plurality of processors. MPI provides a communication API for the processors to communicate with one another regarding the processing of these tasks.
There are currently two versions of the MPI standard that are in use. Version 1.2 of the MPI standard emphasizes message passing and has a static runtime environment. Version 2.1 of the MPI standard includes new features such as scalable file I/O, dynamic process management, and collective communication of groups of processes. These MPI standards are available from the MPI Forum website. It is assumed, for purposes of this description, that the reader has an understanding of the MPI standards.
Of particular note, the MPI standard provides for collective communication of processes or tasks, i.e. communications that involve a group of processes or tasks. A collective operation is executed using MPI by having all the tasks or processes in the group call a collective communication routine with matching arguments. Such collective communication routine calls may (but are not required to) return as soon as their participation in the collective communication is complete. The completion of a call indicates that the caller is now free to access locations in a communication buffer but does not indicate that other processes or tasks in the group have completed or even have started the operation. Thus, a collective communication call may, or may not, have the effect of synchronizing all calling processes.
One way in which MPI enforces synchronization of the processes or tasks is to provide a synchronization operation referred to as the MPI_BARRIER( ) call. The MPI_BARRIER( ) call blocks the caller until all tasks or processes in the group have called MPI_BARRIER( ). Thus, the MPI_BARRIER( ) call is used with a group of tasks in which each task must wait for the other tasks in the group to complete before proceeding, i.e. each task must call MPI_BARRIER( ) before any of the processors are able to execute additional tasks. Essentially, the barrier operation enforces synchronization of the tasks of a job and enforces temporal dependence.
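The blocking behavior described above may be illustrated with a minimal sketch. The sketch does not use MPI; it assumes Python's standard threading.Barrier as a stand-in for MPI_BARRIER( ), with threads standing in for tasks. The names run_with_barrier, work_seconds, and events are hypothetical and are introduced only for this illustration.

```python
import threading
import time


def run_with_barrier(work_seconds):
    """Run one thread per entry of work_seconds.

    Each thread simulates its task by sleeping, records that it has
    arrived at the barrier, waits at the barrier, and then records that
    it has passed. Returns the ordered list of recorded events.
    """
    n = len(work_seconds)
    barrier = threading.Barrier(n)  # analog of MPI_BARRIER( ), not MPI itself
    events = []  # list.append is atomic in CPython

    def task(rank, seconds):
        time.sleep(seconds)               # simulated task of varying length
        events.append(("arrived", rank))  # the task reaches the barrier
        barrier.wait()                    # blocks until all n tasks have called wait()
        events.append(("passed", rank))   # the task is free to continue

    threads = [
        threading.Thread(target=task, args=(rank, seconds))
        for rank, seconds in enumerate(work_seconds)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for event in run_with_barrier([0.01, 0.03, 0.05]):
        print(event)
```

In this sketch every "arrived" event is recorded before any "passed" event, regardless of how quickly an individual task finishes, which mirrors the requirement that each task call the barrier before any task proceeds.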
While such synchronization operations aid programmers in generating parallel programs that ensure that dependent tasks are accommodated without errors, the synchronization results in inefficient use of the processor resources. For example, if a processor executes a task in parallel with one or more other processors, and finishes its task before the other processors, then it must wait for each of the other processors to complete their tasks and call the synchronization operation before it can proceed. As a result, processor cycles are wasted while the faster processors wait for the slower processors to complete. During this time period, the faster processors are still consuming power but are not performing any useful work.
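The cost of this waiting may be estimated with a simple calculation. The following is a minimal sketch under the assumption that each processor runs exactly one task and that all processors then meet at a single barrier; the function names and the example durations are hypothetical and are not taken from any measured system.

```python
def barrier_idle_times(durations):
    """Return, per processor, the time spent waiting at the barrier.

    The barrier releases only when the slowest task finishes, so each
    processor idles for (slowest duration - its own duration).
    """
    slowest = max(durations)
    return [slowest - d for d in durations]


def utilization(durations):
    """Fraction of total processor time spent on useful work."""
    slowest = max(durations)
    return sum(durations) / (len(durations) * slowest)


if __name__ == "__main__":
    # Hypothetical task durations, in arbitrary time units.
    durations = [4, 7, 10]
    print(barrier_idle_times(durations))       # [6, 3, 0]
    print(sum(barrier_idle_times(durations)))  # 9
    print(utilization(durations))
```

With these assumed durations, the two faster processors idle for 6 and 3 time units respectively, so 9 of the 30 available processor time units are spent waiting rather than working.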