Loops are common in software programs because they allow code to be written concisely. Moreover, loops allow programs to be written in an abstract manner, which enhances readability. However, there are situations where a compiler cannot vectorize a loop, and the program therefore fails to make use of the capabilities of single instruction, multiple data (SIMD) processors. For example, certain loops cannot be completely vectorized due to the presence of conditional data dependencies or dependencies arising from potential memory aliasing. Microprocessor architectures provide new vector instructions to partially vectorize such loops. These instructions are used to determine the dynamic vector dependencies and to generate a predicate mask that indicates the next set of independent scalar iterations (each scalar iteration maps to a vector element). By masking execution of the vector loop body with this predicate mask, the loop can be partially vectorized. Often, however, memory loads occur early in the dependence chain, and because these loads cannot safely be performed for iterations that might not execute, the dynamic dependencies cannot be determined and hence the loop cannot be partially vectorized. To solve this problem, processor architectures introduced a form of speculative vector load/gather operation in which the least significant vector element that has a TRUE predicate mask bit is loaded non-speculatively and the remaining mask-enabled elements are loaded speculatively. This enables partial vectorization of certain loops. However, these instructions are not sufficient to vectorize other loops, such as the loops illustrated in FIGS. 1A and 1B.
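The predicate-mask approach described above can be sketched in scalar C. This is only an illustrative emulation, not the code of FIG. 1A or 1B and not any real instruction set: the example loop, the vector width VLEN, and the helper names scalar_loop, mask_through_first_stop and partially_vectorized_loop are all assumptions made for the sketch. The loop carries a conditional dependency on "bias", so each vector pass executes only the lanes up to and including the first lane that modifies "bias".

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 8 /* assumed vector width in elements (hypothetical) */

/* Reference scalar loop: bias is conditionally updated, which creates a
 * loop-carried dependency that is only known at run time. */
static void scalar_loop(const int *in, int *out, size_t n, int bias)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] + bias;
        if (in[i] < 0)
            bias = in[i];
    }
}

/* Build a predicate mask that enables lanes 0..k, where k is the first lane
 * whose stop flag is set (inclusive), or all lanes if no flag is set. */
static unsigned mask_through_first_stop(const unsigned char *stop, size_t lanes)
{
    unsigned mask = 0;
    for (size_t k = 0; k < lanes; k++) {
        mask |= 1u << k;
        if (stop[k])
            break;
    }
    return mask;
}

/* Scalar emulation of the partially vectorized loop. Each inner for-loop
 * stands in for one vector operation over up to VLEN lanes. */
static void partially_vectorized_loop(const int *in, int *out, size_t n, int bias)
{
    size_t i = 0;
    while (i < n) {
        size_t lanes = (n - i < VLEN) ? (n - i) : VLEN;
        unsigned char stop[VLEN];
        assert(lanes >= 1 && lanes <= VLEN);

        /* "Vector compare": find the lanes that would modify bias. */
        for (size_t k = 0; k < lanes; k++)
            stop[k] = (unsigned char)(in[i + k] < 0);

        unsigned mask = mask_through_first_stop(stop, lanes);

        /* Masked execution of the loop body with the current bias. */
        size_t done = 0;
        for (size_t k = 0; k < lanes; k++) {
            if (!(mask & (1u << k)))
                continue;
            out[i + k] = in[i + k] + bias;
            done++;
        }

        /* If the last enabled lane triggered the dependency, apply it. */
        if (stop[done - 1])
            bias = in[i + done - 1];

        i += done; /* resume after the last independent iteration */
    }
}
```

Lane 0 is always enabled, so every pass makes progress; in the worst case (a dependency in every iteration) the sketch degenerates to one element per pass.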
FIG. 1A is a snippet of source code illustrating a class of loops where the loads requiring speculation are conditionally executed. Such loops are commonly found in general purpose integer codes and in system software. To vectorize the loop of FIG. 1A, we need to insert data dependency checks in the vector loop and break vector execution when the inter-iteration dependency occurs (i.e., we need to break vector execution after the condition “(a[i] < j || b[a[i]])” evaluates to TRUE). To determine where the inter-iteration dependency occurs, the condition “(a[i] < j || b[a[i]])” has to be evaluated for all remaining iterations within the vector width. Due to the short-circuit evaluation rules of C, the load “b[a[i]]” should be performed only if “a[i] < j” evaluates to FALSE. If “a[i] < j” evaluates to FALSE for the first vector element (i.e., vector element 0), we know that “b[a[i]]” can be safely accessed for this element. However, if “a[i] < j” evaluates to TRUE for the first vector element and evaluates to FALSE for one or more subsequent elements, the load of “b[a[i]]” is unsafe for these elements and hence has to be done speculatively. Hence, to vectorize this loop, a speculative gather is needed that treats element 0 as non-speculative and all other elements as speculative.
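The gather behavior called for above can be emulated in scalar C as a rough sketch. Everything here is an assumption for illustration: the function name gather_lane0_nonspec, the status type, the width VLEN, the use of an out-of-bounds index to stand in for a memory fault, and the policy that a suppressed speculative fault drops that lane and all higher lanes from the output mask. The text does not specify what happens after a suppressed fault, so that policy in particular should be read as one plausible choice.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 8 /* assumed vector width in elements (hypothetical) */

typedef enum { GATHER_OK = 0, GATHER_FAULT = 1 } gather_status;

/* Emulated gather: lane 0 is non-speculative, every other lane is
 * speculative. An index outside [0, base_len) models a faulting address.
 *   mask_in   lanes for which a load is requested
 *   mask_out  lanes whose dst[] entry was actually loaded
 * A fault on lane 0 is reported to the caller. A fault on any other lane is
 * suppressed; that lane and all higher lanes are left out of mask_out
 * (assumed policy). */
static gather_status gather_lane0_nonspec(const int *base, size_t base_len,
                                          const long idx[VLEN], unsigned mask_in,
                                          int dst[VLEN], unsigned *mask_out)
{
    unsigned loaded = 0;
    assert(base != NULL && mask_out != NULL);

    for (size_t k = 0; k < VLEN; k++) {
        if (!(mask_in & (1u << k)))
            continue; /* masked-off lanes never touch memory */

        int would_fault = idx[k] < 0 || (size_t)idx[k] >= base_len;
        if (would_fault) {
            if (k == 0) {
                *mask_out = 0;
                return GATHER_FAULT; /* non-speculative lane: real fault */
            }
            break; /* speculative lane: suppress the fault and stop */
        }
        dst[k] = base[idx[k]];
        loaded |= 1u << k;
    }
    *mask_out = loaded;
    return GATHER_OK;
}
```

In a loop of the kind described, mask_in would hold the lanes where “a[i] < j” is FALSE, and the caller would compare mask_out with mask_in to decide how many iterations can be retired in the current pass.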
FIG. 1B is a snippet of source code illustrating another class of loops that cannot be efficiently vectorized using conventional processor architectures. The loop of FIG. 1B can be partially vectorized as long as variable “j” is not changed and hence variable “last” remains unchanged. This can be done by moving the computation of “last=b[j]” above the use of “last” in “if (a[i] != last)” (similar to loop rotation) and using a vPropagateShiftTrue operation, which shifts vector elements to the next higher element by one position and copies a scalar input to the first element position. Here, after performing the operations:
last=b[j];
last=vPropagateShiftTrue(0, last);
the vector representing variable “last” will contain 0 as the first element and “b[j]” in all other elements. We can then evaluate “a[i] != last” and partially vectorize the loop up to and including the vector element where the condition “a[i] != last” evaluates to TRUE, i.e., where “j” is modified. To compute “last=b[j]” as the first operation (moved up as indicated earlier), “b[j]” has to be speculatively loaded, since we are assuming that “j” is not modified before the access to “b[j]” in the original source code. To do this, we need a non-faulting vector load that speculatively loads all of its mask-enabled vector elements. If the loop contained the statement “last=a[b[j+i]]” instead of “last=b[j]”, we would instead need a non-faulting gather that speculatively gathers all of its mask-enabled elements.
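The two operations discussed above can be sketched as scalar C emulations. These are simplified illustrations under stated assumptions, not definitions of the actual instructions: propagate_shift_true models only the shift-and-insert behavior described in the text and ignores any predicate operand the real operation may take, while nonfaulting_load uses an out-of-bounds element index to stand in for a faulting address. The function names, the width VLEN and the returned output mask are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 8 /* assumed vector width in elements (hypothetical) */

/* Shift every element of src to the next higher lane and place scalar in
 * lane 0. The loop runs from the top lane down so that dst may alias src. */
static void propagate_shift_true(int scalar, const int src[VLEN], int dst[VLEN])
{
    for (size_t k = VLEN - 1; k > 0; k--)
        dst[k] = src[k - 1];
    dst[0] = scalar;
}

/* Emulated non-faulting contiguous load: every mask-enabled lane is
 * speculative. A lane whose element index falls outside [0, base_len) is
 * treated as a suppressed fault and is simply left out of the returned
 * mask (assumed policy). No fault is ever reported, not even for lane 0. */
static unsigned nonfaulting_load(const int *base, size_t base_len, size_t start,
                                 unsigned mask_in, int dst[VLEN])
{
    unsigned loaded = 0;
    assert(base != NULL);

    for (size_t k = 0; k < VLEN; k++) {
        if (!(mask_in & (1u << k)))
            continue;
        if (start + k >= base_len)
            continue; /* would fault: suppress and skip this lane */
        dst[k] = base[start + k];
        loaded |= 1u << k;
    }
    return loaded;
}
```

With a vector whose lanes all hold the same value v, propagate_shift_true(0, vec, vec) yields 0 in lane 0 and v in every other lane, matching the result described for “last” above. The caller of nonfaulting_load is expected to consult the returned mask before using any lane of dst.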