A. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to data driven processing of computer programs using a multiprocessor computer system.
B. Description of the Related Art
Multiprocessor computer systems include two or more processors that may be employed to execute the various instructions of a computer program. A particular set of instructions may be performed by one processor while other processors perform unrelated sets of instructions.
Fast computer systems, like multiprocessor computer systems, have stimulated the rapid growth of a new way of performing scientific research. The broad classical branches of theoretical science and experimental science have been joined by computational science. Computational scientists simulate on supercomputers phenomena too complex to be reliably predicted by theory and too dangerous or expensive to be reproduced in a laboratory. Successes in computational science have caused demand for supercomputing resources to rise sharply in recent years.
During this time, multiprocessor computer systems, also referred to as "parallel computers," have evolved from experimental contraptions in laboratories to become the everyday tools of computational scientists who need the ultimate in computing resources in order to solve their problems. Several factors have stimulated this evolution. It is not only that the speed of light and the effectiveness of heat dissipation impose physical limits on the speed of a single processor. It is also that the cost of advanced single-processor computers increases more rapidly than their power. And price/performance ratios become more favorable if the required computational power can be found from existing resources instead of purchased. This factor has caused many sites to use existing workstation networks, originally purchased to do modest computational chores, as "SCAN"s (SuperComputers At Night) by utilizing the workstation network as a parallel computer. This scheme has proven so successful, and the cost effectiveness of individual workstations has increased so rapidly, that networks of workstations have been purchased to be dedicated to parallel jobs that used to run on more expensive supercomputers. Thus, considerations of both peak performance and price/performance are pushing large-scale computing in the direction of parallelism. Despite these advances, parallel computing has not yet achieved widespread adoption.
The biggest obstacle to the adoption of parallel computing and its benefits in economy and power is the problem of inadequate software. The developer of a program implementing a parallel algorithm for an important computational science problem may find the current software environment to be more of an obstruction than an aid in using the very capable, cost-effective hardware available. This is because computer programmers generally follow a "control flow" model when developing programs, including programs for execution by multiprocessor computer systems. According to this model, the computer executes a program's instructions sequentially (i.e., in a series from the first instruction to the last instruction) as controlled by a program counter. Although this approach tends to simplify the program development process, it is inherently slow.
For example, when the program counter reaches a particular instruction in a program that requires the result of another instruction or set of instructions, the particular instruction is said to be "dependent" on the result and the processor cannot execute that instruction until the result is available. Moreover, executing programs developed under the control flow model on multiprocessing computer systems results in a significant waste of resources because of these dependencies. For example, a first processor executing one set of instructions in the control flow program may have to wait for some time until a second processor completes execution of another set of instructions, the result of which is required by the first processor to perform its set of instructions. This wait-time translates into an unacceptable waste of computing resources in that at least one of the processors in this two-processor configuration is idle for part of the time the program is running.
To better exploit parallelism in a program, some scientists have suggested use of a "data flow" model in place of the control flow model. The basic concept of the data flow model is to enable the execution of an instruction whenever its required operands become available, and thus, no program counters are needed in data-driven computations. Instruction initiation depends on data availability, independent of the physical location of an instruction in the program. In other words, instructions in a program are not ordered. The execution simply follows the data dependency constraints.
Programs for data-driven computations can be represented by data flow graphs. An example data flow graph is illustrated in FIG. 1 for the calculation of the following expression:
z=(x+y)*2
When, for example, x is 5 and y is 3, the result z is 16. As shown graphically in the figure, z is dependent on the result of the sum of x and y. The data flow graph is a directed acyclic graph ("DAG") whose nodes correspond to operators and whose arcs are pointers for forwarding data. The graph demonstrates sequencing constraints (i.e., constraints with data dependencies) among instructions.
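The firing rule described above can be illustrated with a short Python sketch. It is offered only as an illustration under stated assumptions and is not the implementation of FIG. 1: the dictionary encoding of the graph, the node names `sum` and `two`, and the function `run_dataflow` are all hypothetical.

```python
import operator

# Hypothetical encoding of the graph for z=(x+y)*2:
# node name -> (operator, names of the operands it consumes).
GRAPH = {
    "sum": (operator.add, ["x", "y"]),
    "z": (operator.mul, ["sum", "two"]),
}

def run_dataflow(graph, inputs):
    """Fire each node once all of its operands are available."""
    values = dict(inputs)      # operand name -> value known so far
    pending = dict(graph)      # nodes that have not fired yet
    while pending:
        # A node is ready when every operand already has a value.
        ready = [name for name, (_, args) in pending.items()
                 if all(a in values for a in args)]
        if not ready:
            raise ValueError("some operands can never become available")
        for name in ready:
            op, args = pending.pop(name)
            values[name] = op(*(values[a] for a in args))
    return values

result = run_dataflow(GRAPH, {"x": 5, "y": 3, "two": 2})
print(result["z"])  # 16
```

No program counter appears in the sketch; the order in which nodes fire is determined solely by which operands are available.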
For example, in a conventional computer, program analysis is often done (i) when a program is compiled to yield better resource utilization and code optimization, and (ii) at run time to reveal concurrent arithmetic logic activities for higher system throughput. For instance, consider the following sequence of instructions:
1. P=X+Y
2. Q=P/Y
3. R=X*P
4. S=R−Q
5. T=R*P
6. U=S/T
The following five computational sequences of these instructions are permissible to guarantee the integrity of the result when executing the instructions on a serial computing system (e.g., a uniprocessor system):
1, 2, 3, 4, 5, 6
1, 3, 2, 5, 4, 6
1, 3, 5, 2, 4, 6
1, 2, 3, 5, 4, 6
1, 3, 2, 4, 5, 6
For example, the first instruction must be executed first, but the second or third instruction can be executed second, because the result of the first instruction is required for either the second or third instruction, but neither the second nor the third requires the result of the other. The remainder of each sequence follows this simple rule: no instruction can be run until its operands (or inputs) are available.
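The count of five permissible sequences can be checked by brute force. The following Python sketch is illustrative only; the table `DEPS`, derived by reading the operands of the six instructions above, and the function `valid_orders` are hypothetical names introduced here.

```python
from itertools import permutations

# For each numbered instruction, the instructions whose results it reads.
# For example, instruction 4 (S=R-Q) reads R from 3 and Q from 2.
DEPS = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {1, 3}, 6: {4, 5}}

def valid_orders(deps):
    """Return every serial order in which no instruction precedes its inputs."""
    orders = []
    for perm in permutations(sorted(deps)):
        done = set()
        for instr in perm:
            if not deps[instr] <= done:   # an operand is not yet available
                break
            done.add(instr)
        else:
            orders.append(perm)           # every instruction was runnable
    return orders

orders = valid_orders(DEPS)
print(len(orders))  # 5
for order in orders:
    print(order)
```

Exhaustive enumeration is practical only for very small instruction sets; it is used here solely to confirm the list above.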
In a multiprocessor computer system with two processors, however, it is possible to perform the six operations in four steps (instead of six) with the first processor computing step 1, followed by both processors simultaneously computing steps 2 and 3, followed by both processors simultaneously computing steps 4 and 5, and finally either processor computing step 6. This is an obvious improvement over the uniprocessor approach because execution time is reduced.
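The four-step schedule can be reproduced with a simple greedy scheduler, sketched below in Python. This is an assumption-laden illustration, not a method taken from the text: the function `schedule`, the dependency table `DEPS`, and the tie-breaking rule (lowest instruction number first) are hypothetical, and every instruction is assumed to take one time step.

```python
# For each numbered instruction, the instructions whose results it reads.
DEPS = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {1, 3}, 6: {4, 5}}

def schedule(deps, processors):
    """Greedily group instructions into time steps of at most `processors` each."""
    done, steps = set(), []
    while len(done) < len(deps):
        # Instructions not yet run whose operands are all available.
        ready = sorted(i for i in deps if i not in done and deps[i] <= done)
        if not ready:
            raise ValueError("circular dependency")
        batch = ready[:processors]    # one instruction per processor
        steps.append(batch)
        done.update(batch)
    return steps

print(schedule(DEPS, 2))  # [[1], [2, 3], [4, 5], [6]]
print(len(schedule(DEPS, 1)))  # 6
```

With two processors the sketch yields four steps; with one processor it degenerates to one of the six-step serial sequences.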
Using data flow as a method of parallelization will thus extract the maximum amount of parallelism from a system. Most source code, however, is in a control flow form, which is difficult and clumsy to parallelize efficiently for all types of problems.
It is therefore desirable to provide a facility for developers to more easily develop data flow programs and to convert existing control flow programs into data flow programs for execution on multiprocessor computer systems. There is also a need for a technique that optimizes performance of the data flow programs in a multiprocessor computer system.
Methods, systems, and articles of manufacture consistent with the present invention overcome the shortcomings of existing systems by enabling developers to easily convert control flow programs into a data flow approach and to develop new programs according to the data flow model. In accordance with one aspect of the present invention, as embodied and broadly described herein, the program development process of such methods, systems, and articles of manufacture includes defining a memory region and dividing it into multiple blocks, each block defining a set of values associated with a function. Sets of the blocks are defined, each block in a set having a state reflected by a designated portion of the program that when executed transforms the values forming the block based on the function. Additionally, any dependencies among the blocks are specified by the user. Each dependency indicates a relationship between two blocks and requires the portion of the program associated with one of the two blocks to be executed before the portion of the program associated with the other block.
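One way to picture the memory region, its blocks, and the user-specified dependencies is the following Python sketch. It is a hypothetical illustration only: the `Block` class, its fields, the function `divide_region`, and the choice of a two-dimensional region split into rectangular tiles are assumptions and are not drawn from the description.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """A hypothetical block: a rectangular set of values within the region."""
    block_id: int
    rows: range
    cols: range
    depends_on: set = field(default_factory=set)  # ids of blocks that must run first

def divide_region(n_rows, n_cols, block_rows, block_cols):
    """Divide an n_rows x n_cols region into blocks of at most the given size."""
    blocks = []
    for r in range(0, n_rows, block_rows):
        for c in range(0, n_cols, block_cols):
            blocks.append(Block(len(blocks),
                                range(r, min(r + block_rows, n_rows)),
                                range(c, min(c + block_cols, n_cols))))
    return blocks

# A 4 x 4 region divided into four 2 x 2 blocks.
blocks = divide_region(4, 4, 2, 2)
# A dependency specified by the user: block 3 must run after block 0.
blocks[3].depends_on.add(0)
print(len(blocks))  # 4
```

In this sketch the portion of the program associated with a block would be a function applied to the values in `rows` and `cols`; that association is left out for brevity.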
In accordance with another aspect of the present invention, methods, systems, and articles of manufacture, as embodied and broadly described herein, execute a data flow program in a multiprocessor computer system. Execution of the program involves selecting information in a queue identifying a block formed of a set of values associated with a function of the program and determining whether execution of a portion of the program associated with the selected block is dependent on a result of the execution of a portion of the program associated with another block. The portion of the program associated with the selected block is then executed when it is determined that execution of the portion of the program associated with the selected block is not dependent on a result of the execution of a portion of the program associated with the other block. This selection and determination is repeated when it is determined that execution of the portion of the program associated with the selected block is dependent on a result of the execution of a portion of the program associated with the other block.
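The select, check, and execute cycle described above may be sketched in Python as follows. The sketch is single-threaded and hypothetical: the function `execute`, the block identifiers, and the choice to return a dependent block to the back of the queue before selecting again are assumptions made for illustration.

```python
from collections import deque

def execute(queue_order, deps, run_block):
    """Run every block in the queue, deferring blocks whose prerequisites are unfinished."""
    queue = deque(queue_order)
    finished = []
    stalls = 0                      # consecutive selections that could not run
    while queue:
        block = queue.popleft()     # select a block from the queue
        if all(d in finished for d in deps.get(block, ())):
            run_block(block)        # not dependent on any unfinished block
            finished.append(block)
            stalls = 0
        else:
            queue.append(block)     # dependent: put it back and select again
            stalls += 1
            if stalls >= len(queue):
                raise ValueError("circular or unsatisfiable dependency")
    return finished

# Block b0 needs the result of b1, and b2 needs the result of b0.
deps = {"b0": ["b1"], "b2": ["b0"]}
order = execute(["b0", "b1", "b2"], deps, run_block=lambda b: None)
print(order)  # ['b1', 'b0', 'b2']
```

In a multiprocessor setting, several workers would draw from the same queue concurrently; the sketch omits the synchronization that would require.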
In accordance with yet another aspect of the present invention, methods, systems, and articles of manufacture are provided that optimize execution of data flow programs in a multiprocessor computer system. A data flow program consists of memory region information, including block information and dependency information. The block information reflects multiple blocks that define a memory region. Each block is formed of a set of values associated with a function and has a state reflected by a designated portion of the program that when executed transforms the values forming the block based on the function. The dependency information reflects any dependencies among the blocks, each dependency indicating a relationship between two blocks and requiring the portion of the program associated with a first block of the relationship to be executed before the portion of the program associated with a second block of the relationship. A queue is formed organizing the memory region information in such a way as to optimize execution of the data flow program.
In accordance with one aspect of the invention, as broadly described herein, the queue is formed by generating a directed acyclic graph based on the memory region information with each block having a corresponding node in the graph, traversing the directed acyclic graph according to a predetermined function, and placing information identifying each block in the queue based on the traversal of the directed acyclic graph. In accordance with another aspect of the invention, as broadly described herein, the queue may be divided into parts, or multiple queues may be employed. In this case, each part of the queue or individual queue has a priority, and (i) the blocks are assigned to the parts or queues based on a priority associated with each block and (ii) selected from the parts or queues for execution in accordance with the queue assignment.
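The queue formation and the prioritized queues may be sketched in Python as follows. This is a hypothetical illustration: a topological traversal (via the standard `graphlib` module, Python 3.9 or later) is assumed as one possible "predetermined function", and the names `build_queue`, `split_by_priority`, and `next_block`, as well as the convention that index 0 is the highest priority, are assumptions introduced here.

```python
from collections import deque
from graphlib import TopologicalSorter

def build_queue(deps):
    """Traverse the DAG so that each block follows the blocks it depends on."""
    # deps maps a block id to the set of block ids that must run before it.
    return list(TopologicalSorter(deps).static_order())

def split_by_priority(order, priority, levels):
    """Assign blocks to one queue per priority level, keeping traversal order."""
    queues = [deque() for _ in range(levels)]
    for block in order:
        # Blocks without an explicit priority go to the lowest-priority queue.
        queues[priority.get(block, levels - 1)].append(block)
    return queues

def next_block(queues):
    """Select from the highest-priority (lowest index) queue that is not empty."""
    for q in queues:
        if q:
            return q.popleft()
    return None

order = build_queue({"b2": {"b0", "b1"}, "b1": {"b0"}})
print(order)  # ['b0', 'b1', 'b2']
queues = split_by_priority(order, {"b0": 0, "b1": 0}, levels=2)
print(next_block(queues))  # b0
```

A block selected from a high-priority queue would still be subject to the dependency check performed at execution time.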