Many applications have large amounts of data-level parallelism and should be able to benefit from single-instruction multiple-data (SIMD) support. In SIMD execution, a single instruction operates on multiple data elements simultaneously. This is typically implemented by extending the width of various resources such as registers and arithmetic logic units (ALUs), allowing them to hold or operate on multiple data elements, respectively. However, many such applications spend a significant amount of time in atomic operations on a set of sparse locations and thus see limited benefit from SIMD, as current architectures do not support atomic vector operations.
Synchronization primitives and parallel reduction operations are frequently performed in multiprocessor systems. Synchronization primitives ensure that a program executes in the correct order when multiple threads work cooperatively. These primitives are often implemented using an atomic read-modify-write operation. A reduction is a common operation found in many scientific applications. When multiple threads perform reductions in parallel, atomic read-modify-write sequences are typically used to ensure correctness when threads race to update the same location.
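As an illustration of the atomic read-modify-write sequence described above, the following is a minimal sketch in C11, assuming a compiler that provides `<stdatomic.h>`. The function names `atomic_add_cas` and `reduce_into` are hypothetical and chosen only for this example; the compare-and-swap retry loop is one common way to build such a sequence, not necessarily the one any particular system uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Atomically add `value` to `*target` using an explicit
 * read-modify-write loop built on compare-and-swap. */
static void atomic_add_cas(_Atomic long *target, long value)
{
    long observed = atomic_load(target);  /* read */

    /* Try to write observed + value. If another thread changed *target
     * in the meantime, the call fails, refreshes `observed` with the
     * current value, and the loop retries. */
    while (!atomic_compare_exchange_weak(target, &observed, observed + value)) {
        /* retry with the refreshed `observed` */
    }
}

/* Reduce `n` values into a shared sum. Each update is atomic, so this
 * may be called concurrently from several threads on the same sum. */
static void reduce_into(_Atomic long *shared_sum, const long *data, size_t n)
{
    assert(shared_sum != NULL && data != NULL);
    for (size_t i = 0; i < n; ++i)
        atomic_add_cas(shared_sum, data[i]);
}
```

Each call to `atomic_add_cas` handles exactly one data element, which is the scalar, one-at-a-time behavior that limits SIMD efficiency.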
Modern parallel architectures come equipped with SIMD units to improve the performance of many applications with data-level parallelism. To maintain SIMD efficiency, such architectures allow not only SIMD arithmetic operations but also SIMD memory reads and writes (through gather-scatter units). However, none of these architectures supports SIMD atomic operations. As a result, atomic operations cannot be vectorized and must be implemented using scalar code. This can degrade SIMD efficiency considerably, especially when the SIMD width, i.e., the number of simultaneously processed elements, is large (e.g., 16).
Scatter reductions are common operations in many applications. For example, a scatter-add operation can be used to reduce (i.e., add) multiple values of a first array into selected elements of a second array according to a distribution of indices, which can often be random. Because several of those indices may refer to the same element of the second array, it is difficult to efficiently process multiple elements concurrently (i.e., in SIMD mode).
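The scatter-add operation can be sketched as a scalar reference loop in C. This is an illustrative sketch only; the function name `scatter_add` and its parameter names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar scatter-add: add each src[i] into dst[idx[i]].
 * `idx` may contain repeated values, in which case several source
 * elements accumulate into the same destination element. */
static void scatter_add(float *dst, size_t dst_len,
                        const float *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        assert(idx[i] < dst_len);   /* index must select a valid element */
        dst[idx[i]] += src[i];      /* read-modify-write on one element */
    }
}
```

For instance, with `src = {1, 2, 3, 4}` and `idx = {2, 0, 2, 1}`, the first and third source values both accumulate into `dst[2]`. Processing one element per iteration handles such repeated indices correctly, at the cost of using no SIMD parallelism.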
Histogram calculations are common operations in many image processing applications. For example, a histogram is used to track the distribution of color values of pixels in an image. However, updates to the histogram array may be random, depending on the input data. In particular, the indices of neighboring elements may point to the same histogram bin. This condition makes it very difficult to process multiple data elements concurrently (i.e., in SIMD mode).
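The difficulty with neighboring elements that share a bin can be illustrated with a small sketch in C. The code does not use real SIMD instructions; it emulates a gather, add, and scatter sequence lane by lane with ordinary arrays, assuming a hypothetical width of four lanes (`LANES`). The function names are likewise assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };  /* hypothetical SIMD width for this emulation */

/* Scalar reference: one pixel per iteration, always correct. */
static void histogram_scalar(uint32_t *bins, size_t nbins,
                             const uint8_t *pixels, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        assert(pixels[i] < nbins);
        bins[pixels[i]] += 1;
    }
}

/* Naive emulation of a non-atomic SIMD update: gather LANES bins,
 * increment every lane, scatter the results back. If two lanes in the
 * same group refer to the same bin, they gather the same old value and
 * only one increment survives. `n` must be a multiple of LANES. */
static void histogram_naive_simd(uint32_t *bins, size_t nbins,
                                 const uint8_t *pixels, size_t n)
{
    assert(n % LANES == 0);
    for (size_t base = 0; base < n; base += LANES) {
        uint32_t lane[LANES];
        for (int l = 0; l < LANES; ++l) {          /* gather */
            assert(pixels[base + l] < nbins);
            lane[l] = bins[pixels[base + l]];
        }
        for (int l = 0; l < LANES; ++l)            /* lane-wise add */
            lane[l] += 1;
        for (int l = 0; l < LANES; ++l)            /* scatter */
            bins[pixels[base + l]] = lane[l];
    }
}
```

With the four pixel values `{5, 5, 5, 7}`, the scalar version counts three pixels in bin 5, whereas the naive lane-wise version records only one, because all three lanes read the same initial count before any of them writes it back.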