1. Field of the Invention
Embodiments of the present invention generally relate to improving workload distribution for parallel computing applications.
2. Description of the Related Art
Computer system performance can be increased by increasing computing power. For example, parallel computer systems utilizing more powerful processors and/or a larger number of processors enable applications to run faster or solve more complex problems. The efficiency of applications run on parallel systems is directly related to dataset and workload distribution.
Massively parallel supercomputers, such as those that may be formed using the Altix systems available from Silicon Graphics, Inc., may include several hundred processors and several terabytes (TB) of memory. These computers can be interconnected to form a large cluster. Such systems make it possible to solve problems that advance the bounds of scientific research. Example applications span a wide variety of scientific endeavors, including climate simulations for weather prediction, 3-D reservoir modeling for oil and gas exploration and production, and flight path calculations for space flight.
A parallel application uses several (up to several thousand) processors to solve a problem. Since the computations are split among tasks running on the different processors available, the application runs faster. Ideally, assuming 100% efficiency, the total run time for an application run on a parallel system is found by dividing the run time on a single processor by the total number of processors in the parallel system. As the problem size increases, the amount of work increases as well. Parallel computers reduce the total application runtime and make it possible for users to solve larger problems in a manageable time.
The efficiency of parallel applications is directly related to the workload distribution. To parallelize an application, the workload is distributed among the various tasks (parallel processing threads). If T_1 is the application runtime when running on a single processor, the ideal time when running in parallel with M threads will be:

T_ideal(M) = T_1 / M.

If the distribution is uneven, however, the thread with the largest amount of work will run longer than the others, and its runtime will correspond to the runtime of the whole application:

T_application(M) = T_longest(M), with T_longest(M) > T_ideal(M).

Since the computation time is assumed to be proportional to the workload:

T_application(M) = T_longest(M) = (Max_Workload(M) / Average_Workload(M)) * T_ideal(M).

The previous formula illustrates the importance of the workload distribution. Any deviation from an optimal distribution, in which each thread gets the same amount of work, translates into a less efficient parallel application. The ratio Max_Workload(M) / Average_Workload(M) defines the workload balancing. For a perfect distribution, the workload balancing is 1. The larger the ratio, the less efficient the parallelism.
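The relationship above can be illustrated with a minimal Python sketch. The function names and the sample workloads are hypothetical and chosen only for illustration; the sketch assumes, as the text does, that computation time is proportional to workload.

```python
def workload_balancing(per_thread_work):
    """Return Max_Workload / Average_Workload for a list of per-thread workloads."""
    average = sum(per_thread_work) / len(per_thread_work)
    return max(per_thread_work) / average


def application_runtime(t1, per_thread_work):
    """Estimate T_application(M) from the single-processor runtime T_1.

    Assumes runtime is proportional to workload, so the most heavily
    loaded thread determines the runtime of the whole application.
    """
    m = len(per_thread_work)          # number of threads M
    t_ideal = t1 / m                  # T_ideal(M) = T_1 / M
    return workload_balancing(per_thread_work) * t_ideal


# Hypothetical example: T_1 = 120 time units spread over three threads.
balanced = application_runtime(120.0, [10, 10, 10])
uneven = application_runtime(120.0, [20, 10, 10])
print(balanced, uneven)
```

With an even split the estimate equals the ideal time T_1 / M, whereas the uneven split is slowed by the balancing ratio of its most heavily loaded thread.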
Workload distribution is performed either statically at compilation time or dynamically at runtime. When the computing cost of each work item to be distributed among the different threads is known in advance, that is, whenever the application has a predictable workload, the application may define a fixed distribution pattern statically (at compilation time). When the computing cost of the different operations is highly variable, the application may use a dynamic scheme to distribute the workload. A dynamic distribution scheme involves a higher overhead than a static one.
Static workload distribution allows for optimal dataset distribution. In general, the computing cost of an elementary item depends not only on the number of operations to be performed, but also on how the thread accesses the data it is operating on. Local access is the most efficient; remote access can cost much more. With static workload distribution, the dataset can be distributed so that each thread accesses data locally as much as possible. The communication overhead can thus be reduced and the parallel efficiency further improved.
As described above, static workload distribution may reduce both the overhead of parallelization and the communication overhead. Because the parallel efficiency depends directly on the workload distribution, the best static workload distribution yields the best parallel efficiency for the application.
Typical static distributions consist of splitting the workload into equal parts that are either contiguous or cyclically distributed (e.g., in a round-robin distribution). Alternatively, rather than distributing one item at a time, a "block-cyclic" approach extends the round-robin scheme to contiguous blocks of fixed size. Distributing blocks rather than single items may be advantageous when contiguous items share common parts. Considering three threads A, B and C, the different static distributions mentioned above would split 24 workload items as shown in the following table:
TABLE 1
CONVENTIONAL WORKLOAD DISTRIBUTION SCHEMES

Equal partitions:                A A A A A A A A - B B B B B B B B - C C C C C C C C
Simple cyclic (round-robin):     A B C - A B C - A B C - A B C - A B C - A B C - A B C - A B C
Block cyclic (block size = 2):   A A B B C C - A A B B C C - A A B B C C - A A B B C C
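The three schemes of Table 1 can be reproduced with a short Python sketch. The function names are hypothetical and serve only as an illustration of how each item index maps to a thread; they are not taken from any particular implementation.

```python
def equal_partitions(n_items, threads):
    """Assign contiguous, equally sized ranges of items to each thread."""
    size = -(-n_items // len(threads))  # ceiling division: items per thread
    return [threads[i // size] for i in range(n_items)]


def simple_cyclic(n_items, threads):
    """Assign items one at a time in round-robin order."""
    return [threads[i % len(threads)] for i in range(n_items)]


def block_cyclic(n_items, threads, block_size):
    """Assign contiguous blocks of block_size items in round-robin order."""
    return [threads[(i // block_size) % len(threads)] for i in range(n_items)]


threads = ["A", "B", "C"]
print(" ".join(equal_partitions(24, threads)))
print(" ".join(simple_cyclic(24, threads)))
print(" ".join(block_cyclic(24, threads, 2)))
```

Each function returns, for every item index, the label of the thread that owns it, so the printed rows correspond to the rows of Table 1 without the group separators.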
Unfortunately, when the cost for each element varies, the static distributions are generally not optimal. Further, if the cost for each element increases (or decreases) monotonically, simple cyclic and block cyclic distributions may not provide optimal workload balancing.
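The effect of a monotonically increasing cost can be checked numerically. The following Python sketch assumes a purely hypothetical cost model in which the i-th of 24 items costs i units (1 through 24), and it computes the workload balancing ratio (maximum thread workload divided by average thread workload) for each conventional scheme with three threads.

```python
def balancing_ratio(owners, costs, n_threads):
    """Return Max_Workload / Average_Workload for a given item-to-thread map."""
    loads = [0.0] * n_threads
    for owner, cost in zip(owners, costs):
        loads[owner] += cost
    return max(loads) / (sum(loads) / n_threads)


n_items, n_threads = 24, 3
# Assumed cost model (illustrative only): cost grows linearly with item index.
costs = [i + 1 for i in range(n_items)]

partition_size = n_items // n_threads
schemes = {
    "equal partitions": [i // partition_size for i in range(n_items)],
    "simple cyclic": [i % n_threads for i in range(n_items)],
    "block cyclic (block size = 2)": [(i // 2) % n_threads for i in range(n_items)],
}

ratios = {name: balancing_ratio(owners, costs, n_threads)
          for name, owners in schemes.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Under this assumed cost model the ratios come out to 1.64 for equal partitions, 1.08 for simple cyclic, and 1.16 for block cyclic, so none of the three schemes reaches the perfect value of 1.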
Accordingly, what is needed is an improved scheme for workload distribution in parallel systems.