Supercomputers are high-performance computing platforms that employ a pipelined vector processing approach to solving numerical problems. Vectors are ordered sets of data. Problems that can be structured as a sequence of operations on vectors can achieve one to two orders of magnitude higher throughput when executed on a vector machine (compared to execution on a scalar machine of the same cost). Pipelining further increases throughput by hiding memory latency through the prefetching of instructions and data.
A pipelined vector machine is disclosed in U.S. Pat. No. 4,128,880, issued Dec. 5, 1978, to Cray, the disclosure of which is hereby incorporated herein by reference. In the Cray machine, vectors are usually processed by loading them into operand vector registers, streaming them through a data processing pipeline having a functional unit, and receiving the output in a result vector register.
For vectorizable problems, vector processing is faster and more efficient than scalar processing. Overhead associated with maintenance of the loop-control variable (for example, incrementing and checking the count) is reduced. In addition, central memory conflicts are reduced (fewer but bigger requests) and data processing units are used more efficiently (through data streaming).
Vector processing supercomputers are used for a variety of large-scale numerical problems. Applications typically are highly structured computations that model physical processes. They exhibit a heavy dependence on floating-point arithmetic due to the potentially large dynamic range of values within these computations. Problems requiring modeling of heat or fluid flow, or of the behavior of a plasma, are examples of such applications.
Program code for execution on vector processing supercomputers must be vectorized to exploit the performance advantages of vector processing. Vectorization typically transforms an iterative loop into a nested loop with an inner loop of VL iterations, where VL is the length of the vector registers of the system. This process is known as “strip mining” the loop. In strip mining, the number of iterations in the inner loop is either fixed or defined by the length of a vector register, depending on the hardware implementation; the number of iterations of the outer loop is defined as an integer number of vector lengths. Any remaining iterations are performed as a separate loop placed before or after the nested loop, or alternatively as constrained-length vector operations within the body of the vector loop.
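By way of illustration only, the strip mining transformation described above can be sketched in C for a simple element-wise addition. The function name add_strip_mined, the array names, and the value chosen for VL are hypothetical and are not taken from any particular machine or compiler; the inner loop merely stands in for what would be a single vector operation on real vector hardware, and the remainder is handled here by a separate loop placed after the nested loop.

```c
#include <stddef.h>

/* Assumed vector register length, for illustration only. */
#define VL 64

/*
 * Scalar original:
 *     for (i = 0; i < n; i++) c[i] = a[i] + b[i];
 *
 * Strip-mined form: an outer loop over full strips of VL elements,
 * an inner loop of exactly VL iterations, and a separate remainder loop.
 */
void add_strip_mined(const double *a, const double *b, double *c, size_t n)
{
    size_t full = n - (n % VL);   /* iterations covered by whole strips */
    size_t i, j;

    /* Outer loop: one pass per full strip of VL elements. */
    for (i = 0; i < full; i += VL) {
        /* Inner loop: VL iterations, standing in for one vector operation. */
        for (j = 0; j < VL; j++)
            c[i + j] = a[i + j] + b[i + j];
    }

    /* Remainder loop: the n % VL leftover iterations, after the nest. */
    for (i = full; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With n equal to 150 and VL equal to 64, for example, the outer loop in this sketch runs twice (covering 128 elements) and the remainder loop handles the final 22 elements.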
Compilers exist that will automatically apply strip mining techniques to scalar loops within program code to create vectorized loops. This capability greatly simplifies programming for efficient vector processing.
Some vector computers support only a limited number of data lengths for vector processing. For instance, vector hardware on a number of vector computers available from Cray Inc. supports only 32-bit and 64-bit data. In such systems, operations on 8-bit or 16-bit data must be done in scalar mode.
It is clear that there is a need for improved methods of vectorizing scalar loops so that vector hardware can process data of lengths that the hardware does not natively support.