Modern processors often include instructions for operations that are computationally intensive but offer a high level of data parallelism, which can be exploited through an efficient implementation using various data storage devices such as single-instruction multiple-data (SIMD) vector registers. In SIMD execution, a single instruction operates on multiple data elements concurrently or simultaneously. This is typically implemented by extending the width of various resources such as registers and arithmetic logic units (ALUs), allowing them to hold or operate on multiple data elements, respectively.
The central processing unit (CPU) may provide such parallel hardware to support the SIMD processing of vectors. A vector is a data structure that holds a number of consecutive data elements. A vector register of size L may contain N vector elements of size M, where N=L/M. For instance, a 64-byte vector register may be partitioned into (a) 64 vector elements, with each element holding a data item that occupies 1 byte, (b) 32 vector elements to hold data items that occupy 2 bytes (or one “word”) each, (c) 16 vector elements to hold data items that occupy 4 bytes (or one “doubleword”) each, or (d) 8 vector elements to hold data items that occupy 8 bytes (or one “quadword”) each.
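The partitioning described above can be sketched in C. This is only an illustrative scalar model, not a description of any particular hardware: the type name vec512_t and the functions vec_element_count and vec_add_dwords are hypothetical names introduced here, and the union simply views the same 64 bytes at each of the four element widths.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 64-byte vector register (L = 64). */
enum { VEC_BYTES = 64 };

/* The same 64 bytes viewed at each element size M. */
typedef union {
    uint8_t  b[VEC_BYTES / sizeof(uint8_t)];   /* 64 byte elements       */
    uint16_t w[VEC_BYTES / sizeof(uint16_t)];  /* 32 word elements       */
    uint32_t d[VEC_BYTES / sizeof(uint32_t)];  /* 16 doubleword elements */
    uint64_t q[VEC_BYTES / sizeof(uint64_t)];  /*  8 quadword elements   */
} vec512_t;

/* N = L / M: number of elements of elem_bytes that fit in reg_bytes. */
static size_t vec_element_count(size_t reg_bytes, size_t elem_bytes)
{
    return reg_bytes / elem_bytes;
}

/* Scalar stand-in for one SIMD add on doubleword elements. Real SIMD
 * hardware would perform all N additions with a single instruction;
 * this loop only models the element-wise result. */
static void vec_add_dwords(vec512_t *dst, const vec512_t *a, const vec512_t *b)
{
    size_t n = vec_element_count(sizeof *dst, sizeof dst->d[0]);
    for (size_t i = 0; i < n; i++) {
        dst->d[i] = a->d[i] + b->d[i];
    }
}
```

In this model, choosing a different union member changes only the element size M, and the element count N follows from the fixed register size.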
A number of applications have large amounts of data-level parallelism and may be able to benefit from SIMD support. To maintain SIMD efficiency, some architectures allow not only SIMD arithmetic operations but also SIMD memory reads and writes and SIMD shuffle and permutation operations. However, some applications spend a significant amount of time in operations on a set of sparse locations. Moreover, such applications sometimes perform sequential and/or conditional operations, and so may see only limited benefit from SIMD operations.
For example, the Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors. One of the PARSEC programs, streamcluster, solves online clustering problems by finding a predetermined number of medians so that each point may be assigned to its nearest center. The program spends most of its time evaluating the gain of opening a new center. The parallel gain computation is implemented in a function called pgain, which includes the following loop:
    bool is_center[ ];
    int center_table[ ];
    int count = 0;
    for (int i = k1; i < k2; i++) {
        if (is_center[i]) {
            center_table[i] = count++;
        }
    }
The example loop above illustrates conditional operations that are performed on memory arrays. Vectorization of such a loop is difficult to achieve because the value stored to center_table[i] depends on how many earlier elements of is_center were true, and so limited benefit may be seen from processor architectures that allow SIMD operations.
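To make the sequential dependence in the loop concrete, the following minimal sketch wraps the loop in a function and then splits it into two steps. The function names, the rank scratch array, and the parameter lists are hypothetical and introduced only for illustration. The two-step form is not offered as a solution; it only separates the part that must run in order (the running count) from the part that is independent per element (the conditional store).

```c
#include <stdbool.h>

/* Hypothetical wrapper around the quoted loop. Returns the final count. */
static int fill_center_table(const bool *is_center, int *center_table,
                             int k1, int k2)
{
    int count = 0;
    for (int i = k1; i < k2; i++) {
        if (is_center[i]) {
            center_table[i] = count++;  /* depends on all earlier iterations */
        }
    }
    return count;
}

/* Same result, split into two steps for illustration only.
 * rank is an assumed scratch array of at least k2 entries. */
static int fill_center_table_two_step(const bool *is_center, int *center_table,
                                      int *rank, int k1, int k2)
{
    int running = 0;

    /* Step 1: running count of true flags seen before index i.
     * Each iteration needs the previous one, so this part is sequential. */
    for (int i = k1; i < k2; i++) {
        rank[i] = running;
        running += is_center[i] ? 1 : 0;
    }

    /* Step 2: conditional store. Iterations are independent of each other,
     * but each one is still guarded by a per-element condition. */
    for (int i = k1; i < k2; i++) {
        if (is_center[i]) {
            center_table[i] = rank[i];
        }
    }
    return running;
}
```

Both functions leave the entries of center_table untouched wherever is_center[i] is false, which matches the behavior of the quoted loop.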
To date, potential solutions to such performance-limiting issues, sequential and/or conditional operations, and other bottlenecks have not been adequately explored.