1. Field of the Invention
The invention relates generally to a system and method of compiling a computer program, and more particularly, to a system and method for compiling a computer program wherein the computer program is adapted for use with a data parallel computer.
2. Discussion of Related Art
A compiler is a computer program which receives a source program as input. The source program is written in a source language. The compiler translates the source program into an equivalent target program. The target program is written in a target language. As a general reference that describes the principles used to design compilers for serial computers, see Aho et al., Compilers: Principles, Techniques, and Tools, Addison-Wesley Publishing Co. (1988), which is hereby incorporated by reference in its entirety herein. Many source and target languages are known. For example, source languages include APL, Ada, Pascal, Fortran, C, and Lisp. Target languages include machine languages for computers having one or a great number of processors. Compilers which support parallel data processing allow the definition and use of parallel variables. For reference purposes, such compilers are called data parallel compilers.
For example, the Connection Machine® (CM) computer CM-2 system, designed by Thinking Machines Corp., Cambridge, Mass. 02142, is a massively parallel computer with up to 65,536 bit-serial processors and 2048 floating point accelerator chips. The CM-2 evolved out of the CM-1, which did not have any floating point hardware. The primary interface used by CM-2 compilers has been the Paris assembly language. The Paris language is a low-level instruction set for programming the data parallel computer. The Paris language is described in the Thinking Machines Corporation documents Paris Reference Manual (Version 6.0, February 1991) and Revised Paris Release Notes (Version 6.0, February 1991). These documents are available from the Thinking Machines Corporation Customer Support Department at 245 First Street, Cambridge, Mass. Even though Paris is implemented in a way that uses the underlying floating point hardware to perform calculations, it still reflects the fact that the CM-1 had no registers: all Paris operations (also called fieldwise operations) are memory to memory. This places a memory bandwidth limit on the peak gigaflop rating of Paris, and therefore on compilers whose target is the Paris language. This limit is approximately 1.5-2.5 gigaflops in a full-size CM-2 (64K bit-serial processors, 2K FPUs). The higher end of this range is attained only with multiply-add instructions, which are useful only in special situations.
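The cost of memory-to-memory operation can be illustrated by counting memory accesses. The following is a minimal sketch, not a model of Paris or of the CM-2 hardware: it assumes that each binary memory-to-memory operation reads two operands and writes one result, and that a register machine has enough registers to hold every intermediate value. The function names are hypothetical.

```python
def memory_to_memory_accesses(num_binary_ops: int) -> int:
    """Accesses when every binary operation reads two operands from
    memory and writes its result back to memory (assumed cost model)."""
    return 3 * num_binary_ops


def register_accesses(num_inputs: int, num_outputs: int) -> int:
    """Accesses when each distinct input is loaded once, each final
    result is stored once, and intermediates stay in registers
    (assumes enough registers are available)."""
    return num_inputs + num_outputs


# Example expression: d = a * b + c  (two binary operations,
# three inputs a, b, c and one output d).
m2m = memory_to_memory_accesses(num_binary_ops=2)
reg = register_accesses(num_inputs=3, num_outputs=1)
print(m2m, reg)
```

Under these assumptions the memory-to-memory form performs six accesses for the example expression and the register form performs four, and the difference grows with the number of intermediate values in an expression.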
The CM-2 provides three dimensions of parallelism: superpipelines, superscalar, and multiple processors. A more in-depth discussion of these concepts can be found in Almasi et al., Highly Parallel Computing, Benjamin/Cummings Publishing Co. (1989); Hennessy et al., Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers (1990); and Johnson, Superscalar Microprocessor Design, Prentice-Hall (1991), which are hereby incorporated by reference in their entirety herein.
Generally, pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Today, pipelining is one of the key implementation techniques used to build fast processors. A pipeline is like an assembly line: each step in the pipeline completes a part of the instruction. Each of the steps is called a pipe stage or pipe segment. The stages are connected one to the next to form a pipe: instructions enter at one end, are processed through the stages, and exit at the other end. Pipelining exploits parallelism among the instructions in a sequential instruction stream, which gives it a substantial advantage over scalar sequential processing.
The throughput of the pipeline is determined by how often an instruction exits the pipeline. Because the pipe stages are hooked together, all the stages must be ready to proceed at the same time. The time required to move an instruction one step down the pipeline is a machine cycle. Pipelining yields a reduction in the average execution time per instruction.
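The reduction in average execution time per instruction can be sketched with a simple cycle count. The sketch below is an idealized illustration only: it assumes that every pipe stage takes exactly one machine cycle and that the pipeline never stalls. The function names are hypothetical.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Each instruction passes through every stage before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """The first instruction needs num_stages cycles to traverse the pipe;
    after that, one instruction exits per machine cycle (no stalls assumed)."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def average_cycles_per_instruction(num_instructions: int, num_stages: int) -> float:
    """Average machine cycles per instruction in the idealized pipeline."""
    return pipelined_cycles(num_instructions, num_stages) / num_instructions


# 100 instructions through a 5-stage pipe.
print(unpipelined_cycles(100, 5))              # 500
print(pipelined_cycles(100, 5))                # 104
print(average_cycles_per_instruction(100, 5))  # 1.04
```

As the instruction count grows, the average approaches one machine cycle per instruction, even though each individual instruction still takes five cycles to pass through the pipe.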
The term superscalar describes a computer implementation that improves performance by concurrent execution of scalar instructions. Superscalar processors typically widen the processor's pipeline. Widening the pipeline makes it possible to execute more than one instruction per cycle. Thus, superscalar refers to issuing more than one instruction per clock cycle. This allows the instruction-execution rate to exceed the clock rate.
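The effect of widening the pipeline can be sketched in the same idealized way. The sketch below assumes that all instructions are independent, that a full group of instructions can be issued every cycle, and that there are no stalls; real superscalar processors rarely sustain this rate. The names `superscalar_cycles`, `instructions_per_cycle`, and `issue_width` are hypothetical.

```python
import math


def superscalar_cycles(num_instructions: int, num_stages: int, issue_width: int) -> int:
    """Cycles to complete the stream when up to issue_width independent
    instructions enter the pipeline together each cycle (idealized)."""
    if num_instructions == 0:
        return 0
    issue_groups = math.ceil(num_instructions / issue_width)
    return num_stages + issue_groups - 1


def instructions_per_cycle(num_instructions: int, num_stages: int, issue_width: int) -> float:
    """Average instruction-execution rate, in instructions per clock cycle."""
    return num_instructions / superscalar_cycles(num_instructions, num_stages, issue_width)


# 1000 instructions, 5 stages: issue width 1 versus issue width 2.
print(superscalar_cycles(1000, 5, 1))  # 1004
print(superscalar_cycles(1000, 5, 2))  # 504
```

With an issue width of one, the rate stays just below one instruction per cycle; with an issue width of two, it rises above one instruction per cycle, which is the sense in which the instruction-execution rate can exceed the clock rate.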
In regard to the issue of multiple processors, designers of parallel computers have tried a variety of methods in order to fully utilize the underlying hardware. For example, earlier parallel computer systems assumed a separate processor for every data element, so that one may effectively operate on all data elements in parallel. When one such instruction is used, it is performed (possibly conditionally) by every hardware processor, each on its own data. Many of the usual arithmetic and logic instructions found in contemporary computer instruction sets (such as subtract, multiply, divide, max, min, compare, logical and, logical or, logical exclusive or, and floating point instructions) are provided in this form.
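The one-processor-per-element model can be sketched with ordinary lists, where each list position stands in for the memory of one hardware processor. This is an illustration only and is executed serially here; the function `parallel_op` and its `context` argument are hypothetical and do not correspond to any actual instruction set. The sketch assumes that a processor whose context flag is off leaves its first operand unchanged.

```python
import operator


def parallel_op(op, a, b, context=None):
    """Apply op at every position, one position per simulated processor.

    Positions whose context flag is False skip the operation and keep
    their value from `a` (assumed behavior for conditional execution).
    """
    if context is None:
        context = [True] * len(a)
    return [op(x, y) if active else x
            for x, y, active in zip(a, b, context)]


# Unconditional subtract on every processor.
print(parallel_op(operator.sub, [5, 6, 7], [1, 2, 3]))            # [4, 4, 4]

# Conditional max: the middle processor is inactive and keeps 9.
print(parallel_op(max, [1, 9, 3], [4, 2, 8], [True, False, True]))  # [4, 9, 8]
```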
A typical difficulty with these computer systems arises when the number of data elements in the problem to be solved exceeds the number of hardware processors. For example, if a machine provides 16,384 processors configured in a 128×128 two-dimensional grid, and a problem requires the processing of 200×200 elements (total 40,000), the programming task is much more difficult because one can no longer assign one data element to each processor, but must assign several data elements to at least some processors. Even if a problem requires no more than 16,384 data elements, if they are to be organized as a 64×256 grid rather than a 128×128 pattern, programming is again complicated, this time because the problem communication structure does not match the hardware communication structure.
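Both difficulties can be made concrete with a little arithmetic. The sketch below is illustrative only: `distribute` assumes the elements are spread as evenly as possible, and `row_major_processor` assumes a row-major mapping of the 64×256 problem grid onto the 128×128 hardware grid. Neither mapping is taken from any particular machine, and both function names are hypothetical.

```python
def distribute(num_elements: int, num_processors: int) -> dict:
    """Spread elements as evenly as possible over the processors.

    Returns a dict mapping elements-per-processor to the number of
    processors that hold that many elements.
    """
    base, extra = divmod(num_elements, num_processors)
    counts = {base: num_processors - extra}
    if extra:
        counts[base + 1] = extra
    return counts


def row_major_processor(row: int, col: int, problem_cols: int = 256, hw_cols: int = 128):
    """Hardware (row, col) of a problem-grid element under a row-major mapping."""
    index = row * problem_cols + col
    return divmod(index, hw_cols)


# 200 x 200 elements on 128 x 128 processors.
print(distribute(200 * 200, 128 * 128))  # {2: 9152, 3: 7232}

# 64 x 256 problem grid on the 128 x 128 hardware grid.
print(row_major_processor(0, 0))    # (0, 0)
print(row_major_processor(1, 0))    # (2, 0)
print(row_major_processor(0, 128))  # (1, 0)
```

Under the even distribution, 9,152 processors hold two elements and 7,232 hold three. Under the row-major mapping, vertically adjacent problem elements land two hardware rows apart, and a horizontal neighbor pair that straddles column 128 lands at opposite edges of the hardware grid, so problem neighbors are no longer hardware neighbors.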
One solution to this problem was described in U.S. Pat. No. 4,827,403 to Steele, Jr. et al. The '403 patent describes a virtual processor mechanism which causes every physical hardware processor to be used to simulate multiple virtual processors. Each physical processor simulates the same number of virtual processors. However, the virtual processor model creates an artificial memory hierarchy. For example, FIG. 1 shows sixteen virtual processors on one of the bit-serial processors. The memory (m) is subdivided into sixteen blocks (m/16). The elements of an array [A(0)-A(N)] are distributed so that array elements A(0)-A(16) are placed in the sixteen virtual processors. This creates a problem: the gap between consecutive elements, as shown at reference number 550, is very large, which causes a very serious degradation in memory performance. Instead of one cycle per access, each access takes two cycles or more. This division of memory is not what the user wanted. The user wanted to put sixteen elements next to each other and operate on them.
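The gap between consecutive array elements can be sketched as an address calculation. This is one reading of the layout described above, not a reproduction of FIG. 1: it assumes that each virtual processor owns one block of m/16 words and stores its array element at the same offset within its block. The memory size of 65,536 words is a hypothetical value chosen for illustration, and the function names are hypothetical as well.

```python
def vp_address(element: int, memory_words: int, num_vps: int, offset: int = 0) -> int:
    """Address of A(element) when element i lives in virtual processor i,
    and each virtual processor owns one block of memory_words // num_vps words."""
    block_words = memory_words // num_vps
    return element * block_words + offset


def contiguous_address(element: int, base: int = 0) -> int:
    """Address of A(element) when the elements are stored next to each other."""
    return base + element


MEMORY_WORDS = 65536  # hypothetical size of one physical processor's memory
NUM_VPS = 16          # sixteen virtual processors, as in the example above

vp_stride = vp_address(1, MEMORY_WORDS, NUM_VPS) - vp_address(0, MEMORY_WORDS, NUM_VPS)
contiguous_stride = contiguous_address(1) - contiguous_address(0)
print(vp_stride, contiguous_stride)  # 4096 1
```

Under these assumptions, consecutive elements sit a full block (m/16 words) apart in the virtual processor layout, whereas the layout the user wanted places them one word apart.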
The goal of the compiler designer is to exploit all three levels of parallelism (i.e., superpipelines, superscalar, and multiple processors). This has presented a substantial problem. As stated by Hennessy et al. (pg. 581), compilers of the future have two obstacles to overcome: (1) how to lay out the data to reduce memory hierarchy and communication overhead, and (2) exploitation of parallelism. Parallelizing compilers have been under development since 1975, but progress has been slow.
Thus, it would be advantageous to provide a system and method for exploiting the inherent parallelism of parallel target machines and reducing the memory hierarchy, thus allowing the machine to go beyond the memory bottleneck.