The present invention reduces instruction code size, since one instruction can result in many different operations. Also, because many mathematical algorithms (especially vector and/or matrix arithmetic) rely on the topology (location) of each element in the vector or matrix, the invention allows very efficient execution of numerical algorithms. Other advantages of the present invention over the prior art include:
power saving: in cases where not all the PEs are required, a NOP modifier can be used to shut down the unused PEs, reducing power consumption;
non-SIMD based algorithms: in cases where the underlying algorithm is not a pure SIMD algorithm (for example, a filter operation or a transform operation), a pure SIMD implementation is inefficient. The PE-grouping method provides much more flexibility, which results in a more efficient implementation;
multiple data set operations: in cases where the underlying algorithm operates on a data set smaller than the PE-array, some of the PEs will not be utilized and the efficiency of the implementation will be low (consider a 2×2 matrix multiply on a 4×4 PE-array). The PE-grouping method allows multiple data sets to be processed in parallel (for example, four 2×2 matrix multiplies executing at the same time);
unaligned loading: the PE-grouping method allows non-aligned data elements (from memory) to be loaded into the same register (for example, register “r0”) of the PE-array. Multiple consecutive data elements (for example, two consecutive data element vectors: [v0, v1, . . . , v15] and [u0, u1, . . . , u15]) can be loaded such that a subset of each is stored in the same register by means of the operand modifiers. For example, the first half of the V vector is saved in r0 and the second half in r1, while the first half of the U vector is saved in r1 and the second half in r0. This results in the elements [v8, v9, . . . , v15, u0, u1, . . . , u7] being stored in register r1.
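The multiple-data-set case can be illustrated with a small software model. The sketch below assumes a 4×4 PE-array in which each PE holds one element of A and one element of B, and the array is partitioned into four 2×2 groups. Each PE computes the element of its own group's product at its local position. The function name `grouped_matmul` and the constants `N` and `G` are hypothetical illustrations, not part of the described hardware, and the loops only stand in for what the PEs would do simultaneously.

```python
from typing import List

Matrix = List[List[int]]

N = 4  # assumed PE-array size: N x N PEs
G = 2  # assumed group size: each G x G group of PEs computes one 2x2 product


def grouped_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Model four independent 2x2 matrix multiplies on a 4x4 PE-array.

    a[i][j] and b[i][j] are the operands held by PE (i, j). The PEs are
    partitioned into (N // G) ** 2 groups; inside a group, the PE at local
    position (r, col) computes row r of the group's A block times column
    col of the group's B block.
    """
    c = [[0] * N for _ in range(N)]
    # In hardware every PE would perform this step at the same time;
    # here the loops simply visit each PE in turn.
    for i in range(N):
        for j in range(N):
            gi, gj = (i // G) * G, (j // G) * G  # top-left PE of this PE's group
            r, col = i - gi, j - gj               # local position inside the group
            c[i][j] = sum(a[gi + r][gj + k] * b[gi + k][gj + col] for k in range(G))
    return c


a = [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [1, 0, 2, 0],
     [0, 1, 0, 2]]
b = [[1, 0, 1, 1],
     [0, 1, 0, 1],
     [2, 0, 3, 1],
     [0, 2, 4, 2]]
result = grouped_matmul(a, b)
```

Each 2×2 block of `result` equals the product of the corresponding 2×2 blocks of `a` and `b`, so all sixteen PEs contribute to useful work instead of twelve of them idling during a single 2×2 multiply.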
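The unaligned-loading example can also be modelled in software. This sketch assumes a 16-PE array where PE p receives element p of each loaded vector. It also assumes a hypothetical operand modifier that makes PEs 0 to 7 write into one register and PEs 8 to 15 write into another. The helper `grouped_load` and the register dictionaries are illustrative only. Under this assumed mapping, r1 holds u0 to u7 in PEs 0 to 7 and v8 to v15 in PEs 8 to 15. Reading r1 starting at PE 8 and wrapping around gives the unaligned sequence [v8, . . . , v15, u0, . . . , u7].

```python
from typing import Dict, List, Optional

NUM_PES = 16  # assumed: one PE per element of the 16-element vectors


def grouped_load(regs: List[Dict[str, Optional[str]]], data: List[str],
                 dest_low: str, dest_high: str) -> None:
    """Load one vector so that element p goes to PE p.

    Hypothetical operand modifier: the first PE group (PEs 0..7) writes
    into dest_low and the second group (PEs 8..15) writes into dest_high.
    """
    half = NUM_PES // 2
    for p, value in enumerate(data):
        regs[p][dest_low if p < half else dest_high] = value


# Each PE has two registers, r0 and r1, initially empty.
regs: List[Dict[str, Optional[str]]] = [{"r0": None, "r1": None} for _ in range(NUM_PES)]
v = [f"v{k}" for k in range(NUM_PES)]
u = [f"u{k}" for k in range(NUM_PES)]

grouped_load(regs, v, "r0", "r1")  # first half of V -> r0, second half -> r1
grouped_load(regs, u, "r1", "r0")  # first half of U -> r1, second half -> r0

r0 = [pe["r0"] for pe in regs]
r1 = [pe["r1"] for pe in regs]
window = r1[8:] + r1[:8]  # read r1 starting at PE 8, wrapping around the array
print(window)
```

The same two loads also leave r0 holding v0 to v7 and u8 to u15, so both registers end up with a full 16-element window drawn from the two consecutive vectors.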