1. Field of the Invention
The present invention pertains generally to the field of high speed digital processing systems, and more particularly to a method of vectorizing loops containing recursive equations in a supercomputer.
2. Background of the Invention
Supercomputers are high performance computing platforms that employ a pipelined vector processing approach to solving numerical problems. Vectors are ordered sets of data. Problems that can be structured as a sequence of operations on vectors can achieve one to two orders of magnitude greater throughput when executed on a vector machine than when executed on a scalar machine of the same cost. Pipelining further increases throughput by hiding memory latency through the prefetching of instructions and data.
A pipelined vector machine is disclosed in U.S. Pat. No. 4,128,880, issued Dec. 5, 1978, to Cray, the disclosure of which is hereby incorporated herein by reference. In the Cray machine, vectors are processed by loading them into operand vector registers, streaming them through a data processing pipeline having a functional unit, and receiving the output in a result vector register. A vector machine according to U.S. Pat. No. 4,128,880 supports fully parallel operation by allowing multiple pipelines to execute concurrently on independent streams of data.
For vectorizable problems, vector processing is faster and more efficient than scalar processing. Overhead associated with maintenance of the loop-control variable (for example, incrementing and checking the count) is reduced. In addition, central memory conflicts are reduced (fewer but bigger requests) and data processing units are used more efficiently (through data streaming).
Vector processing supercomputers are used for a variety of large-scale numerical problems. Applications typically are highly structured computations that model physical processes. They exhibit a heavy dependence on floating-point arithmetic due to the potentially large dynamic range of values within these computations. Problems requiring modeling of heat or fluid flow, or of the behavior of a plasma, are examples of such applications.
Program code for execution on vector processing supercomputers must be vectorized to exploit the performance advantages of vector processing. Vectorization typically breaks up a loop of the form: ##EQU1## into a nested loop of the form: ##EQU2## where VL is the length of the vector registers of the system. This process is known as "strip mining" the loop. In strip mining, the number of iterations in the internal loop is defined by the length of a vector register, and the number of iterations of the external loop is defined as an integer number of vector lengths. The remaining iterations are performed as a separate loop placed before the nested loop. Vector-length segments of the original data arrays are loaded into the vector registers for each iteration of the internal loop. Data in these vector registers can then be processed at the rate of one element per clock period of a vector operation.
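Since the patent's loop forms (##EQU1##, ##EQU2##) are not reproduced here, the strip-mined structure can only be sketched. The following Python serves as executable pseudocode for what would have been Fortran on the target machines; the element-wise add and all names are illustrative, not taken from the patent:

```python
VL = 64  # vector register length (64 elements on the Cray-1, for example)

def strip_mined_sum(a, b, n):
    """Compute c[i] = a[i] + b[i] using a strip-mined loop structure."""
    c = [0.0] * n
    remainder = n % VL
    # The leftover iterations run first, as a separate short loop.
    for i in range(remainder):
        c[i] = a[i] + b[i]
    # The external loop steps one vector length at a time; each pass of
    # the internal loop models one vector operation over VL elements.
    for start in range(remainder, n, VL):
        for i in range(start, start + VL):
            c[i] = a[i] + b[i]
    return c
```

A scalar loop gives the same result; the point of the nested structure is that each internal loop maps onto a single vector operation of length VL.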
Compilers exist that will automatically apply strip mining techniques to scalar loops within program code to create vectorized loops. This capability greatly simplifies programming efficient vector processing. The programmer simply enters code of the form: ##EQU3## and it is vectorized.
There are, however, certain types of problems that resist vectorization. Equations of the form: ##EQU4## are difficult to vectorize due to the possibility that the same data point in X may be needed for different Y(i) within a single vector operation.
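The hazard can be illustrated with a sketch, assuming the recursive equation takes the form X(j(i)) = X(j(i)) + Y(i) suggested by the surrounding discussion (Python as executable pseudocode; function names are hypothetical). A vector operation gathers all operands before any result is stored back, so when j repeats within one vector, intermediate updates are lost:

```python
def scalar_accumulate(x, j, y):
    # Correct scalar loop: each update sees the result of the previous one.
    for i in range(len(y)):
        x[j[i]] = x[j[i]] + y[i]
    return x

def naive_vector_accumulate(x, j, y):
    # Model of one vector operation: gather all operands first, then add,
    # then scatter the results. Repeated indices in j lose updates,
    # because every lane gathered the same original value of x[j[i]].
    gathered = [x[j[i]] for i in range(len(y))]            # vector load
    results = [gathered[i] + y[i] for i in range(len(y))]  # vector add
    for i in range(len(y)):                                # vector store
        x[j[i]] = results[i]
    return x
```

With j = (0, 0, 1) the scalar loop accumulates both contributions to x[0], while the vectorized model keeps only the last one.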
Present compilers are designed to recognize the possibility of recurring points in a loop and inhibit vectorization so that the equation is solved by a scalar loop. The reasons for this are varied. The range and contents of j(i) may not be known until run time, making it difficult to predict recurring data points. Also, it is difficult to allocate memory for expressions like X(j(i)) in which array X may be of an arbitrary length to be determined at run time. For example, it is a common practice within advanced programming languages to pass an array X(1) implying X is infinite in length. A practical compiler may never know the size of the array and, as long as the data does not overlap, the program will execute properly.
There are programming steps that can be taken to regain some of the performance lost in executing recursive equations in scalar loops. Some compilers offer a compiler directive that can be used to force vectorization of loops that would not vectorize otherwise due to the appearance of recurring data points. Programmers using such a directive take on the responsibility of making sure that there are, in fact, no recurring data points within a vector operation inside the loop.
In addition, there exist a number of algorithms that can be used to structure program code to prevent recurring data or to detect whether operations have been affected by the vectorization process. One such algorithm creates a temporary area in memory and uses it to separate operations on the same data point. This method is called the work vector method. Since the source of error in vectorizing recursive loops lies in the occurrence of a data point more than once within a vector in the internal loop, a work vector is created to serve as intermediate storage for the result of the calculation. If array X is of size M and the vector length of the machine is VL, work vector WV[k,l] will be a two-dimensional array of size M * VL. The result of each pass through the internal loop is accumulated in the work vector array at location (k,l), where l is the element's place within the vector and k is the address of the original data point. That result is used in future calculations on the same data point at the same location in the operand vector. Two operations on the same data point within a single vector operation therefore result in writes to two different locations in the work vector. The last step in performing the original loop is to add elements 1 through VL for each data point in X.
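A minimal sketch of the work vector method as described above, assuming the recurrence X(j(i)) = X(j(i)) + Y(i) (Python as executable pseudocode; the indices k and l follow the text, all other names are hypothetical):

```python
def work_vector_accumulate(x, j, y, VL=4):
    M, n = len(x), len(y)
    # Work vector WV: M rows, one column per vector lane (size M * VL).
    wv = [[0.0] * VL for _ in range(M)]
    for start in range(0, n, VL):
        # One pass of the internal loop models one vector operation.
        for l in range(min(VL, n - start)):
            i = start + l
            k = j[i]
            # Two hits on the same data point k within one vector
            # operation occupy different lanes l, so neither is lost.
            wv[k][l] = wv[k][l] + y[i]
    # Final reduction: add lanes 1..VL for each data point in X.
    for k in range(M):
        x[k] = x[k] + sum(wv[k])
    return x
```

Within a single pass each lane l is distinct, so no two writes of one vector operation collide, which is exactly the property the method relies on.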
The work vector method is simple, and it can lead to impressive gains in performance over the scalar approach. But as the size of array X becomes large, the memory requirements for the work vector array become burdensome. Also, in cases of little overlap, or where the number of operations n is small relative to the number of data points M (small n/M), the final addition of elements 1 through VL for each data point in X entails a large number of additions of zero.
An alternate approach was suggested by Dr. Yoshihiko Abe in a paper entitled "Present Status of Computer Simulation at IPP", in Supercomputing '88: Volume II, Science and Applications (1989). Abe proposed a method of vectorizing recursive equations that consists of breaking one large loop into a number of smaller loops, performing vector operations on the smaller loops, flagging recurring data points within each vector operation, and correcting the result. The Abe method creates two new vectors L and K of size M and 2*VL, respectively. Vector L is a scratch pad vector used to record the identity of the last Y(i) added to each data point. Vector K is a scratch pad vector used to record the locations of elements of Y(i) whose contributions to X were missed due to the vector operation.
The Abe method assumes that the amount of overlap is fairly small, so that the cost of tracking overlap and performing calculations on the elements missed due to overlap is low. The frequency of correction must be kept small in order to get good performance from this algorithm. In a large class of problems (e.g., Monte Carlo simulations), the number of iterations is small in comparison to the data point space; for these types of problems, the Abe method will give a substantial boost in performance.
There are drawbacks to the Abe technique, however. Memory must be set aside that is equivalent to the size of the array of data points. This memory (vector L above) is needed to record the identity of the last element used to operate on that data point. In addition, memory must be set aside (vector K) as a stack for recording the location of elements missed due to overlap. This additional memory can get large for large M. The total memory required by the Abe method is, however, smaller than that required by the work vector method.
Most importantly, the Abe method is difficult to implement in a practical compiler. The size M of array X is rarely known at compile time. Although it is possible for a compiler to allocate vector K based on the value of M calculated at run time, the time required to scan j in order to determine the maximum size required for K would severely curtail any performance benefits gained from the execution of Abe's method.
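A rough model of the Abe method described above, again assuming the recurrence X(j(i)) = X(j(i)) + Y(i) (Python as executable pseudocode; the real method operates on vector registers, and vector K's fixed size of 2*VL is not modeled here, so this is a sketch of the idea rather than Abe's published algorithm):

```python
def abe_accumulate(x, j, y, VL=4):
    M, n = len(x), len(y)
    last = [-1] * M  # vector L: identity of the last element hitting each point
    missed = []      # vector K: stack of elements lost to overlap
    for start in range(0, n, VL):
        end = min(start + VL, n)
        # Mark each lane's target; later lanes overwrite earlier ones,
        # so last[] ends up holding the final writer per data point.
        for i in range(start, end):
            last[j[i]] = i
        # Model of the vector operation: only the last writer to each
        # data point survives; the others are flagged for correction.
        for i in range(start, end):
            if last[j[i]] == i:
                x[j[i]] = x[j[i]] + y[i]
            else:
                missed.append(i)
    # Correction pass: apply the missed contributions in scalar mode.
    for i in missed:
        x[j[i]] = x[j[i]] + y[i]
    return x
```

The cost of the correction pass is proportional to the amount of overlap, which is why the method performs well only when overlap within each vector is rare.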
A different approach was presented by Giuseppe Paruolo in an article entitled "A Vector-Efficient and Memory-Saving Interpolation Algorithm for PIC Codes on a Cray X-MP", published in the Journal of Computational Physics 89 (1990). Paruolo suggests a method in which two temporary work arrays, IAG and IAP, are used to detect and compensate for recurring data points. IAG is the same size as X; IAP is the same size as Y. IAP serves as a pool of pointers linking elements in Y to their corresponding data points in X. The method consists of writing the location of each element in Y to the location in IAG corresponding to the data point affected in X. Locations in IAG that are affected by more than one element of Y are simply written more than once, so that after one pass through IAP, IAG contains a pointer to no more than one element of Y for each data point in X. A new X is calculated using the values of Y pointed to by IAG, and those elements of Y are removed from the pool.
The above process is repeated until all elements of Y are consumed, the pool of elements in Y falls beneath a threshold, or the number of elements consumed in a pass falls beneath a threshold. The remaining iterations are then solved with a scalar loop.
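Paruolo's pool-and-pointer scheme might be sketched as follows (a simplified model of the published algorithm, again assuming the recurrence X(j(i)) = X(j(i)) + Y(i); the threshold handling and names are illustrative):

```python
def paruolo_accumulate(x, j, y, threshold=1):
    M, n = len(x), len(y)
    pool = list(range(n))    # IAP: pool of pending elements of Y
    while pool:
        iag = [-1] * M       # IAG: at most one pending element per data point
        for i in pool:       # later writes simply overwrite earlier ones
            iag[j[i]] = i
        # One vectorizable pass: each data point is updated at most once,
        # so there are no recurring data points within the operation.
        consumed = set()
        for k in range(M):
            if iag[k] != -1:
                x[k] = x[k] + y[iag[k]]
                consumed.add(iag[k])
        pool = [i for i in pool if i not in consumed]
        if len(consumed) <= threshold:
            break            # too little progress: fall back to scalar mode
    for i in pool:           # scalar clean-up of the remaining iterations
        x[j[i]] = x[j[i]] + y[i]
    return x
```

Each pass consumes one contribution per distinct data point, so heavy overlap on a few points is resolved over several passes while the bulk of the work stays vectorizable.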
Paruolo's method, like Abe's, requires additional memory. Memory must be set aside that is equivalent to the sum of the sizes of arrays Y and X. This additional memory can get large for large n or large M. Like Abe's, Paruolo's method is difficult to implement in a practical compiler because of the difficulty of determining the size of X prior to execution of the algorithm.
The Abe and Paruolo methods are geared to solving different types of problems. Abe's works best in a large data space with little overlapping of elements. Paruolo's works best in a data space with a great deal of overlap, as where the number of elements in Y is much greater than the number of data points in X. The Paruolo method is optimized toward problems exhibiting a uniform distribution of elements of Y over the data points of X; even in the case of total overlap (i.e., all elements in Y affect one point in X), however, it would perform no worse than the scalar calculation.
Each of the three algorithms discussed offers a way of structuring loops containing recursive equations so as to take advantage of some level of vectorization. All three, however, share a requirement for large amounts of additional memory, which limits the size of problems that can be executed and may force otherwise vectorizable calculations to be performed in scalar mode.
In addition, use of these algorithms requires a conscious effort on the part of programmers: the programmer must recast a problem involving recurring data points into a form that can take advantage of these methods. It is difficult to construct a practical compiler that will automatically vectorize, using these methods, loops containing ambiguous references to possibly recurring data points. The ambiguous nature of references to data point arrays such as X means that compilers implementing these algorithms would be required to allocate memory at run time based on the contents of one or more data arrays. This approach would be complex, and the time required to determine the required memory size would probably outweigh any performance increase gained from compiled execution.
It is clear that there is a need for improved methods of vectorizing scalar loops. A system which can operate to vectorize scalar loops containing recursive equations with no additional memory requirements is desired. In addition, there is a need for a method of vectorizing scalar loops containing recursive equations that can be placed into a compiler and automatically vectorize such loops.