Modern processors often include instructions for operations that are computationally intensive but offer a high degree of data parallelism, which an efficient implementation can exploit using data storage devices such as single instruction, multiple data (SIMD) vector registers. In SIMD execution, a single instruction operates on multiple data elements simultaneously. This is typically implemented by extending the width of resources such as registers and arithmetic logic units (ALUs), allowing them to hold or operate on multiple data elements, respectively.
The central processing unit (CPU) may provide such parallel hardware to support the SIMD processing of vectors. A vector is a data structure that holds a number of consecutive data elements. A vector register of size L may contain N vector elements of size M, where N=L/M. For instance, a 64-byte vector register may be partitioned into (a) 64 vector elements that each hold a 1-byte data item, (b) 32 vector elements that each hold a 2-byte data item (one “word”), (c) 16 vector elements that each hold a 4-byte data item (one “doubleword”), or (d) 8 vector elements that each hold an 8-byte data item (one “quadword”).
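The relationship N=L/M can be illustrated with a short sketch in C. The function name lane_count and its parameter names are hypothetical and chosen only for illustration; the sketch assumes sizes are expressed in bytes and treats a non-dividing element size as invalid.

```c
#include <stddef.h>

/* Number of vector elements N that fit in a register of L bytes when
 * each element occupies M bytes, i.e. N = L / M.
 * Returns 0 when M is zero or does not divide L evenly (an assumption
 * made here for illustration, not something the text specifies). */
size_t lane_count(size_t reg_bytes, size_t elem_bytes)
{
    if (elem_bytes == 0 || reg_bytes % elem_bytes != 0)
        return 0;
    return reg_bytes / elem_bytes;
}
```

For a 64-byte register, lane_count(64, 1), lane_count(64, 2), lane_count(64, 4), and lane_count(64, 8) yield 64, 32, 16, and 8, matching partitions (a) through (d) above.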
A number of applications have large amounts of data-level parallelism and may be able to benefit from SIMD support. However, some applications spend a significant amount of time in operations on a set of sparse locations. Scatter reductions are common operations in many applications. For example, a scatter-add operation can be used to reduce (i.e., add) multiple values of a first array into selected elements of a second array according to a distribution of indices, which can often be random. Because the indices may be random, and more than one of them may select the same element, it may be difficult to efficiently process multiple elements concurrently (i.e., in SIMD mode). One concern is to ensure that scalar program order is preserved when necessary. Another concern is to ensure that when data is written back to memory, the resulting vector of memory addresses includes only unique addresses (i.e., there are no conflicting duplicate addresses).
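The duplicate-address concern can be illustrated with a minimal sketch in C. The first function is a scalar scatter-add that follows program order. The second is a deliberately naive software model of a SIMD gather, add, and scatter sequence with no conflict handling; it is not an implementation of any particular instruction set. The function names and the 4-element vector width (LANES) are assumptions made for illustration.

```c
#include <stddef.h>

#define LANES 4  /* assumed vector width, for illustration only */

/* Scalar reference: each update sees the result of the previous one. */
void scatter_add_scalar(float *dst, const float *src,
                        const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[idx[i]] += src[i];
}

/* Naive SIMD-style model: gather a whole group, add, then scatter.
 * If two lanes in a group share an index, both gather the same old
 * value and the later store overwrites the earlier one, so one
 * update is lost. */
void scatter_add_naive_simd(float *dst, const float *src,
                            const size_t *idx, size_t n)
{
    for (size_t base = 0; base < n; base += LANES) {
        size_t count = (n - base < LANES) ? (n - base) : LANES;
        float tmp[LANES];

        for (size_t j = 0; j < count; ++j)   /* gather */
            tmp[j] = dst[idx[base + j]];
        for (size_t j = 0; j < count; ++j)   /* add */
            tmp[j] += src[base + j];
        for (size_t j = 0; j < count; ++j)   /* scatter, in lane order */
            dst[idx[base + j]] = tmp[j];
    }
}
```

With a zeroed destination, values {1, 2, 3, 4}, and indices {0, 2, 0, 1}, the scalar version leaves 4 in element 0 (1 + 3), whereas the naive model leaves 3 because the first update to element 0 is overwritten.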
For example, histogram calculations are common operations in many image processing applications. A histogram may be used to track the distribution of color values of pixels in an image, or of intensity gradients and/or edge directions in an image for computer vision and object detection. However, updates to the histogram array may be random, because the bin that is updated depends on the input data. In particular, indices of neighboring elements may point to the same histogram bin. Accordingly, conflict detection and resolution may be required to detect multiple dependent updates to the same locations and to ensure that scalar program order is preserved. This is precisely the kind of condition that can make it very difficult to process multiple data elements simultaneously (i.e., using SIMD operations).
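One way to picture conflict detection and resolution for a histogram update is the following portable C sketch. It is only an illustrative software model under stated assumptions, not the mechanism of any specific processor or of the approach described herein: the 8-element vector width (LANES), the 8-bit pixel values, the 256-bin histogram, and the names detect_conflicts and histogram_conflict_aware are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* assumed vector width, for illustration only */

/* For each lane j, set bit k (k < j) in conflicts[j] when an earlier
 * lane k targets the same bin as lane j. */
static void detect_conflicts(const uint32_t *idx, size_t count,
                             uint32_t *conflicts)
{
    for (size_t j = 0; j < count; ++j) {
        conflicts[j] = 0;
        for (size_t k = 0; k < j; ++k)
            if (idx[k] == idx[j])
                conflicts[j] |= (uint32_t)1 << k;
    }
}

/* hist must have at least 256 bins (one per 8-bit pixel value). */
void histogram_conflict_aware(uint32_t *hist, const uint8_t *pixels,
                              size_t n)
{
    for (size_t base = 0; base < n; base += LANES) {
        size_t count = (n - base < LANES) ? (n - base) : LANES;
        uint32_t idx[LANES], conflicts[LANES];

        for (size_t j = 0; j < count; ++j)
            idx[j] = pixels[base + j];
        detect_conflicts(idx, count, conflicts);

        /* Bit j of pending is set while lane j is still unprocessed. */
        uint32_t pending = ((uint32_t)1 << count) - 1;
        while (pending != 0) {
            /* A lane is ready when no earlier lane with the same bin
             * is still pending, which preserves scalar order. */
            uint32_t ready = 0;
            for (size_t j = 0; j < count; ++j)
                if (((pending >> j) & 1u) && (conflicts[j] & pending) == 0)
                    ready |= (uint32_t)1 << j;

            /* Ready lanes target distinct bins, so a vector unit could
             * update them together; here they are updated in a loop. */
            for (size_t j = 0; j < count; ++j)
                if ((ready >> j) & 1u)
                    hist[idx[j]] += 1;

            pending &= ~ready;
        }
    }
}
```

In this model, a group whose lanes all target distinct bins completes in a single pass, while a group in which every lane targets the same bin degenerates to one lane per pass, which reflects the sequential bottleneck described above.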
To date, potential solutions to sequential bottlenecks such as conflict concerns and related processing difficulties have not been adequately explored.