1. Field of the Invention
The present invention relates to vector processors and, more particularly, to vector processors which perform operations in vector registers.
2. Description of the Related Art
A vector processor is a processor that performs operations on vectors, i.e., linear arrays of numbers. A typical vector operation might add two 64-element, floating-point vectors to obtain a single 64-element result vector. The vector instruction is equivalent to an entire loop of code, with each iteration computing one of the 64 elements of the result, updating the indices and branching back to the beginning of the loop. Vector processors pipeline the operations on the individual elements of a vector. The pipeline includes not only the arithmetic operations but also memory accesses and effective address calculations.
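As a minimal sketch of the equivalence described above, the following Python fragment writes out the scalar loop that a single 64-element vector add stands in for. The name `vector_add` and the constant `VECTOR_LENGTH` are illustrative assumptions and do not appear in the disclosure.

```python
VECTOR_LENGTH = 64  # assumed register length, matching the 64-element example


def vector_add(b, c):
    """Scalar loop equivalent to one vector add instruction (illustrative)."""
    assert len(b) == len(c) <= VECTOR_LENGTH
    result = [0.0] * len(b)
    i = 0
    while i < len(b):              # branch back to the beginning of the loop
        result[i] = b[i] + c[i]    # compute one element of the result
        i += 1                     # update the index
    return result
```

A vector processor issues this work as one instruction and pipelines the per-element steps rather than executing the loop control explicitly.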
Vector processors execute non-conditional loops very efficiently. The arrays in the loop and loop-variant scalars are converted to vectors and are processed as such. A control mechanism applies the requested operation to consecutive pairs of operand vector or array elements. The result of the computation is written into consecutive elements of the result vector. Vector processing distributes the instruction issue controls and allows issuing multiple instructions per cycle without increasing the issue unit complexity dramatically. Using vector processing also allows the register file of the vector processor to be distributed and simpler.
The regular and uniform flow of vector elements is disrupted when the loop includes conditional operations such as branches, which cause some, but not all, of the vector elements to be processed. Vector processors enforce the use of consecutive elements of a vector starting from the first element to an element specified by the length of the operation, the length not to exceed the length of the vector register. That is, the first element of each operand vector is fetched and sent to the functional unit, followed by the second element, followed by the third element, etc. The result of the computation is sent to a destination vector register starting from element one, followed by element two, followed by element three, etc. No skipping of the vector elements is allowed. Accordingly, when a conditional operation is performed, either the operation cannot be vectorized or, if vectorized, the operation is performed on all of the elements but the result is stored only for those elements that satisfy the condition.
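The last point can be illustrated with a short Python sketch, offered under stated assumptions rather than as the mechanism of any particular processor: the operation is applied to every element in order, and a per-element condition decides only whether the result is stored. The name `masked_add` and its parameters are hypothetical.

```python
def masked_add(a, b, c, mask):
    """Compute b + c for every element; store only where mask is true."""
    result = list(a)                # destination keeps its old values by default
    for i in range(len(result)):
        value = b[i] + c[i]         # computed for every element, none skipped
        if mask[i]:
            result[i] = value       # stored only where the condition holds
    return result
```

In this sketch the arithmetic for elements whose condition is false is still performed and then thrown away, which is the source of the inefficiency noted above.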
Loops with conditional execution (If-Then-Else) structures can be vectorized; however, as discussed above, the performance of such loops may be poor. A plurality of methods exist for vectorizing conditional loops. For example, the following sets forth an exemplative conditional loop.
    DO I=1,N
      IF (V(I).GT.Y(I)) THEN
        A(I)=B(I)+C(I)
      ELSE
        A(I)=B(I)-C(I)
      ENDIF
    ENDDO

Vectorized form using the MERGE operation:

    MASK=V(1:N).GT.Y(1:N)
    T1(1:N)=B(1:N)+C(1:N)
    T2(1:N)=B(1:N)-C(1:N)
    A(1:N)=MERGE(T1(1:N),T2(1:N),MASK)

Vectorized form using the COMPRESS and EXPAND operations:

    MASK=V(1:N).GT.Y(1:N)
    TB1(1:M)=COMPRESS(B(1:N),MASK)
    TC1(1:M)=COMPRESS(C(1:N),MASK)
    TB2(1:L)=COMPRESS(B(1:N),.NOT.MASK)
    TC2(1:L)=COMPRESS(C(1:N),.NOT.MASK)
    T1(1:M)=TB1(1:M)+TC1(1:M)
    T2(1:L)=TB2(1:L)-TC2(1:L)
    A(1:N)=EXPAND(T1(1:M),MASK)
    A(1:N)=EXPAND(T2(1:L),.NOT.MASK)
This loop avoids vector length details and assumes that N is less than or equal to a legitimate vector length, thus avoiding the requirement of strip mining code.
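For context only, strip mining is the step that would be needed if N exceeded the vector length: the loop is split into strips, each no longer than a vector register. The following Python sketch shows one way to compute the strips; the function name `strip_mine` and the parameter `max_vector_length` are hypothetical and are not part of the exemplative loop.

```python
def strip_mine(n, max_vector_length):
    """Yield (start, length) strips that together cover indices 0..n-1."""
    start = 0
    while start < n:
        length = min(max_vector_length, n - start)  # last strip may be short
        yield start, length
        start += length
```

For instance, `list(strip_mine(150, 64))` gives the strips `(0, 64)`, `(64, 64)` and `(128, 22)`, each of which fits in a 64-element vector register.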
With one approach for vectorizing this loop, a mask of the operation is first generated by comparing the two vectors and setting a bit in the mask based upon the presence of the loop condition between the elements of the vectors. After the mask is generated, the operation within the THEN portion of the loop is executed. After the operation within the THEN portion of the loop is executed, the result of this operation is stored in a temporary vector. After the result is stored, the operation within the ELSE portion of the loop is executed. After the ELSE portion of the loop is executed, the result of the ELSE loop is stored in a second temporary vector. After the result of the ELSE execution is stored, the results that are stored in the temporary registers are merged to provide the result of the conditional loop. The following sets forth exemplative code for providing this output.
In this notation, U(1:N) denotes elements 1 to N of a vector U, and the MERGE operation transfers either the elements of T1 or T2 to the corresponding elements of A according to the value of MASK, where MASK is generated as discussed above. More specifically, if a corresponding mask bit is true, then the element of T1 is transferred; otherwise the element of T2 is transferred. In this approach, twice the number of necessary elements are computed, with half of the computed elements then being discarded.
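The merge-based approach can be sketched in Python as follows. This is an illustrative model of the semantics described above, using plain lists in place of vector registers; the function names `merge` and `conditional_loop_merge` are assumptions introduced for the sketch.

```python
def merge(t1, t2, mask):
    """Take the element of t1 where the mask bit is true, else that of t2."""
    return [x if m else y for x, y, m in zip(t1, t2, mask)]


def conditional_loop_merge(v, y, b, c):
    """Model of the mask / THEN / ELSE / MERGE sequence (illustrative)."""
    mask = [vi > yi for vi, yi in zip(v, y)]   # MASK = V .GT. Y
    t1 = [bi + ci for bi, ci in zip(b, c)]     # THEN result for all N elements
    t2 = [bi - ci for bi, ci in zip(b, c)]     # ELSE result for all N elements
    return merge(t1, t2, mask)                 # half of t1 and t2 is discarded
```

With `v = [5, 1, 7, 2]`, `y = [3, 4, 7, 0]`, `b = [10, 20, 30, 40]` and `c = [1, 2, 3, 4]`, the mask is true for the first and last elements, so the result is `[11, 18, 27, 44]`. Both `t1` and `t2` hold N elements, although only N of the 2N computed values are kept.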
A second approach for vectorizing a conditional loop uses compress and expand operations to adjust the vectors to the portion of the computation that is required for the operations within the THEN and ELSE portions. More specifically, a mask for the operation is generated. After the mask is generated, the first operand vector is compressed based upon the mask. After the first operand vector is compressed, the second operand vector is compressed according to the mask. After the second operand vector is compressed, the THEN function is performed. After the THEN function is performed, the ELSE function is performed. After the ELSE function is performed, the result of the THEN function is expanded based upon the mask. After the result of the THEN function is expanded, the result of the ELSE function is expanded based upon the complement of the mask. The following sets forth exemplative code for providing the conditional output of this operation.
The COMPRESS and EXPAND operations are vector functions based on the MASK vector or the complement of the MASK vector. The overhead associated with the compress and expand operations makes this approach relatively inefficient.
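The compress and expand approach can be modeled in Python along the same lines. The sketch below is illustrative only. It assumes that EXPAND writes the consecutive elements of its operand into the positions of the destination where the mask is true and leaves the other positions unchanged, which is why the hypothetical `expand` helper takes the destination as an explicit argument.

```python
def compress(x, mask):
    """Pack the elements of x whose mask bit is true into a shorter vector."""
    return [xi for xi, m in zip(x, mask) if m]


def expand(dest, t, mask):
    """Scatter consecutive elements of t into dest where mask is true."""
    out = list(dest)
    packed = iter(t)
    for i, m in enumerate(mask):
        if m:
            out[i] = next(packed)
    return out


def conditional_loop_compress(v, y, b, c):
    """Model of the compress / compute / expand sequence (illustrative)."""
    mask = [vi > yi for vi, yi in zip(v, y)]
    not_mask = [not m for m in mask]
    tb1, tc1 = compress(b, mask), compress(c, mask)          # M elements
    tb2, tc2 = compress(b, not_mask), compress(c, not_mask)  # L elements
    t1 = [p + q for p, q in zip(tb1, tc1)]                   # THEN on M elements
    t2 = [p - q for p, q in zip(tb2, tc2)]                   # ELSE on L elements
    a = expand([None] * len(mask), t1, mask)
    return expand(a, t2, not_mask)
```

Only M + L = N arithmetic results are computed here, but four compress passes and two expand passes are added, which reflects the overhead noted above.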