Supercomputers are high-performance computing platforms that employ a pipelined vector processing approach to solving numerical problems. Vectors are ordered sets of data. Problems that can be structured as a sequence of operations on vectors can see throughput increase by one to two orders of magnitude when executed on a vector machine rather than on a scalar machine of the same cost. Pipelining further increases throughput by hiding memory latency through the prefetching of instructions and data.
A pipelined vector machine is disclosed in U.S. Pat. No. 4,128,880, issued Dec. 5, 1978, to Cray, the disclosure of which is hereby incorporated herein by reference. In the Cray machine, vectors are usually processed by loading them into operand vector registers, streaming them through a data processing pipeline having a functional unit, and receiving the output in a result vector register.
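The load, stream, and receive sequence described above can be pictured with a toy model. The following Python sketch is illustrative only and does not describe the actual hardware of the referenced patent: the function name `stream_through_pipeline`, the `depth` parameter, and the assumption that the functional unit accepts one operand pair per clock are all hypothetical. For simplicity the result is computed when an operand pair is issued and then travels through the remaining stages.

```python
from collections import deque

_BUBBLE = object()  # marks a pipeline stage that holds no result


def stream_through_pipeline(op_a, op_b, func, depth=3):
    """Stream two operand vector registers through a functional unit.

    op_a, op_b: lists standing in for operand vector registers.
    func:       the operation performed by the functional unit.
    depth:      assumed number of pipeline stages (hypothetical).
    Returns (result_register, clocks_elapsed).
    """
    assert len(op_a) == len(op_b) and depth >= 1
    n = len(op_a)
    stages = deque([_BUBBLE] * (depth - 1))  # in-flight slots, oldest on the right
    result = []                               # stands in for the result vector register
    clocks = 0
    issued = 0
    while len(result) < n:
        clocks += 1
        if issued < n:
            # Issue one operand pair per clock.
            stages.appendleft(func(op_a[issued], op_b[issued]))
            issued += 1
        else:
            # No operands left: drain the pipeline with empty slots.
            stages.appendleft(_BUBBLE)
        out = stages.pop()
        if out is not _BUBBLE:
            result.append(out)
    return result, clocks
```

Under these assumptions, a vector of n elements completes in n + depth - 1 clocks, so the pipeline fill time is paid once per vector rather than once per element.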
For vectorizable problems, vector processing is faster and more efficient than scalar processing. Overhead associated with maintenance of the loop-control variable (for example, incrementing and checking the count) is reduced. In addition, central memory conflicts are reduced because memory is accessed with fewer but larger requests, and data processing units are used more efficiently through data streaming.
Vector processing supercomputers are used for a variety of large-scale numerical problems. Applications typically are highly structured computations that model physical processes. They exhibit a heavy dependence on floating-point arithmetic due to the potentially large dynamic range of values within these computations. Problems requiring modeling of heat or fluid flow, or of the behavior of a plasma, are examples of such applications.
Program code for execution on vector processing supercomputers must be vectorized to exploit the performance advantages of vector processing. Vectorization typically transforms an iterative loop into a nested loop with an inner loop of VL iterations, where VL is the length of the vector registers of the system. This process is known as “strip mining” the loop. In strip mining, the number of iterations in the inner loop is either fixed or defined by the length of a vector register, depending on the hardware implementation; the number of iterations of the outer loop is the number of whole vector lengths contained in the original iteration count. Any remaining iterations are performed as a separate loop placed before or after the nested loop, or alternatively as constrained-length vector operations within the body of the vector loop.
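A minimal sketch of strip mining, written in Python for readability, may help make the transformation concrete. It is not taken from any particular compiler: the value `VL = 64` and the names `vector_add`, `add_scalar`, and `add_strip_mined` are assumptions chosen for illustration. Each call to `vector_add` stands in for one vector instruction, so the inner loop of VL iterations is hidden inside it.

```python
VL = 64  # assumed vector register length (hypothetical value)


def vector_add(a, b, vl=VL):
    """Stand-in for one vector instruction on at most vl elements."""
    assert len(a) == len(b) <= vl
    return [x + y for x, y in zip(a, b)]  # the inner loop of the strip


def add_scalar(a, b):
    """Original scalar loop: one loop-control update per element."""
    c = [0.0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def add_strip_mined(a, b, vl=VL):
    """Strip-mined form: whole strips first, then the remainder.

    Returns (result, strips), where strips counts outer-loop iterations.
    """
    n = len(a)
    c = [0.0] * n
    full = n - n % vl          # elements covered by whole strips
    strips = 0
    for start in range(0, full, vl):   # outer loop: one pass per strip
        end = start + vl
        c[start:end] = vector_add(a[start:end], b[start:end], vl)
        strips += 1
    if full < n:
        # Remaining iterations, handled here after the nested loop as a
        # single constrained-length vector operation.
        c[full:n] = vector_add(a[full:n], b[full:n], vl)
    return c, strips
```

For 150 elements and VL = 64, the outer loop runs twice and the remaining 22 elements are handled by one shorter operation, in place of 150 scalar iterations.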
Compilers exist that automatically apply strip mining techniques to scalar loops within program code to create vectorized loops. This capability greatly simplifies the programming of efficient vector code.
The memory-to-processor round-trip time (in clock cycles) has grown rapidly as clock rates have increased and the memory-to-processor interface has become increasingly pipelined. Systems have been suggested that place processors closer to memory in order to reduce the number of cycles spent transferring data between processors and memory. In some processor-in-memory systems, the processor and the memory are collocated on the same board or on the same piece of silicon. Such an approach is, however, expensive because it requires special hardware.
There is therefore a need for improved methods of balancing processor-in-memory (PIM) operations against operations performed by conventional processors in multiprocessor systems.