In parallel computing, many calculations are carried out simultaneously. Single instruction, multiple data (“SIMD”) is a type of parallel computing in which multiple processing elements perform the same operation on multiple data points, generally during the same processor clock cycle or pursuant to one instruction (which, due to page faults, interrupts, and the like, may be spread out over one or more clock cycles).
In SIMD processes, data is handled in blocks; a block or vector comprising a number of values can be loaded into SIMD memory (such as a vector register) with one instruction, rather than requiring a series of instructions. A common function can then be applied to all the values in the block. Thus, processor clock cycles and power can be saved by storing sets of data as one or more vectors, loading the vectors into SIMD memory, and executing a function on the vectors and/or on the vector elements within each vector.
SIMD is known to be particularly applicable to processing multimedia data, inasmuch as processing multimedia data often requires applying the same function across large sets of bits or bytes. For example, adjusting contrast in a digital image file may require adding a single value to, or subtracting a single value from, each pixel in an image. This can be performed by loading some or all of the pixels in the image into a single vector register and adding the value to, or subtracting it from, all of the pixel values in one instruction.
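The per-pixel adjustment described above can be sketched in C as follows. This is a minimal illustration, not a definitive implementation: the function name adjust_pixels, the 8-bit pixel format, and the clamping to the range 0 to 255 are assumptions made for the example. Because no iteration reads or writes a location touched by another iteration, a compiler is free to map the loop onto vector instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add a signed offset to every 8-bit pixel value,
 * clamping each result to the range [0, 255] (assumed pixel format).
 * Each iteration touches only pixels[i], so the loop has no dependence
 * between iterations and is a candidate for vectorization. */
void adjust_pixels(uint8_t *pixels, size_t n, int delta)
{
    for (size_t i = 0; i < n; i++) {
        int v = (int)pixels[i] + delta;
        if (v < 0) {
            v = 0;
        }
        if (v > 255) {
            v = 255;
        }
        pixels[i] = (uint8_t)v;
    }
}
```

Whether a given compiler actually emits SIMD instructions for this loop depends on the target and the optimization settings.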
However, at least write-after-write dependence (also known as output dependence) can prevent a loop or function from operating on vectorized data without potentially causing errors.
For example, in the pseudo-code in Table 1 below, the indexes used to access the A[ ] array may have the same values and thus point to the same memory location. In this case, full vectorization of the loop is not possible, because the order of stores in vector execution differs from the order in scalar execution; a store that scalar execution would perform earlier may, in vector execution, be performed later and overwrite a memory cell, producing an incorrect result.
TABLE 1
for(i=0; i<N; i++){
    computation_without_dependencies; //no other accesses to A[ ] array
    A[index1[i]] = X; //block of stores potentially having dependencies
    A[index2[i]] = Y;
    A[index3[i]] = Z;
}
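The hazard in the loop of Table 1 can be made concrete with a small C sketch. The function names, the constant values, and the index contents are hypothetical. The second function only models a naive vector ordering, in which each store statement is carried out for all iterations before the next statement begins; it is not a description of any particular hardware scatter instruction.

```c
#include <stddef.h>

/* Illustrative stand-ins for the values X, Y and Z stored in Table 1. */
enum { STORE_X = 1, STORE_Y = 2, STORE_Z = 3 };

/* Scalar order: all three stores of iteration i complete before
 * iteration i + 1 begins. */
void stores_scalar(int *A, const size_t *index1, const size_t *index2,
                   const size_t *index3, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        A[index1[i]] = STORE_X;
        A[index2[i]] = STORE_Y;
        A[index3[i]] = STORE_Z;
    }
}

/* Model of a naive vector order: the first store is performed for every
 * iteration, then the second, then the third. */
void stores_statement_order(int *A, const size_t *index1,
                            const size_t *index2, const size_t *index3,
                            size_t n)
{
    for (size_t i = 0; i < n; i++) {
        A[index1[i]] = STORE_X;
    }
    for (size_t i = 0; i < n; i++) {
        A[index2[i]] = STORE_Y;
    }
    for (size_t i = 0; i < n; i++) {
        A[index3[i]] = STORE_Z;
    }
}
```

With index1 = {0, 2}, index2 = {1, 3} and index3 = {2, 4}, element A[2] is written by the third store of iteration 0 and by the first store of iteration 1. The scalar order leaves STORE_X in A[2], whereas the statement-by-statement order leaves STORE_Z there, which is the incorrect result the text describes.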
In another example, illustrated in the following pseudo-code in Table 2, values are stored through pointers p1, p2, and p3, which may be aliased (equal or randomly intersecting) and/or indexed by values computed in an arbitrary (vectorizable) way on each iteration of the loop:
TABLE 2
for(i=0; i<N; i++){
    computation_without_dependencies; //no other accesses to p1, p2 and p3 pointers
    i1 = computation1(i); //any computation depending on iteration or load from memory
    i2 = computation2(i); //any computation depending on iteration or load from memory
    i3 = computation3(i); //any computation depending on iteration or load from memory
    p1[i1] = X; //block of stores potentially having dependencies
    p2[i2] = Y;
    p3[i3] = Z;
}
Legacy approaches to the problem of output dependence and vectorization are to i) serialize the entire loop execution, which forgoes the benefits that may come from vectorization, or ii) separately serialize ordered regions of code and, potentially, perform parallel execution of code outside of the serialized regions, as described, e.g., in Section 2.13.8, “ordered Construct”, of “OpenMP Application Programming Interface”, version 4.5, November, 2015.
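Legacy approach ii) can be sketched in C with the OpenMP ordered construct. The function name and parameters are hypothetical, and the sketch assumes a compiler with OpenMP support enabled (for example, the -fopenmp option in GCC); without it the pragmas are ignored and the loop simply runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical example: the block of stores is placed in an ordered
 * region, so it executes in iteration order, while any code outside
 * the region may run in parallel across threads. */
void stores_ordered(int *A, const size_t *index1, const size_t *index2,
                    const size_t *index3, int n, int x, int y, int z)
{
    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++) {
        /* Computation without dependencies would go here and may
         * overlap between iterations. */

        #pragma omp ordered
        {
            /* Serialized region: stores occur in scalar order. */
            A[index1[i]] = x;
            A[index2[i]] = y;
            A[index3[i]] = z;
        }
    }
}
```

The stores produce the same result as the scalar loop, but they gain nothing from vectorization because the ordered region is executed one iteration at a time.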
Although the following Detailed Description will proceed with reference being made to illustrative embodiments, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art.