Modern processors often include instructions for operations that are computationally intensive but offer a high level of data parallelism, which an efficient implementation can exploit using data storage devices such as single instruction multiple data (SIMD) vector registers. The central processing unit (CPU) may then provide parallel hardware to support processing vectors. A vector is a data structure that holds a number of consecutive data elements. A vector register of size M may contain N vector elements of size O, where N=M/O. For instance, a 64-byte vector register may be partitioned into (a) 64 vector elements, with each element holding a data item that occupies 1 byte, (b) 32 vector elements to hold data items that occupy 2 bytes (or one “word”) each, (c) 16 vector elements to hold data items that occupy 4 bytes (or one “doubleword”) each, or (d) 8 vector elements to hold data items that occupy 8 bytes (or one “quadword”) each.
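The relationship N=M/O can be illustrated with a short sketch. The function name lane_count and the table LANES_64B are hypothetical names chosen for this illustration only; they do not correspond to any instruction or processor interface, and the sketch assumes that the element size evenly divides the register size.

```python
def lane_count(register_bytes: int, element_bytes: int) -> int:
    """Return N = M / O, the number of elements that fit in one vector register.

    Hypothetical helper for illustration; sizes are given in bytes.
    """
    if element_bytes <= 0 or register_bytes % element_bytes != 0:
        raise ValueError("element size must evenly divide the register size")
    return register_bytes // element_bytes


# The four partitionings of a 64-byte register listed above:
# byte, word, doubleword and quadword elements.
LANES_64B = {size: lane_count(64, size) for size in (1, 2, 4, 8)}
```

Under these assumptions, LANES_64B maps element sizes of 1, 2, 4 and 8 bytes to 64, 32, 16 and 8 elements respectively.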
Another technique for increasing performance by exploiting parallelism is seen in multicore and/or multithreaded processors, which incorporate multiple cores or hardware threads onto one or more dies. Such processors are a boon to throughput-driven applications such as web servers, but running nearly independent code regions on separate processors does not always help general-purpose applications, which may have a large number of significantly serialized tasks to perform.
An alternative approach, which has recently been explored, is pipeline parallelism, in which each loop iteration is split into stages and hardware threads operate concurrently on different stages from different iterations. In this approach, a prior stage of iteration i acts as a producer for the next stage of iteration i, which acts as its consumer; while one hardware thread operates on the next stage of iteration i, another hardware thread operates concurrently on the prior stage of iteration i+1. Thus a serial software process is queued from hardware thread to hardware thread and may exploit the parallelism of multicore and/or multithreaded processors.
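The producer and consumer relationship between stages can be sketched in software as follows. This is a minimal illustration only: it uses Python software threads and a bounded queue from the standard library in place of hardware threads and hardware queuing, so it shows the data flow between stages rather than any performance benefit. The names run_pipeline, stage_one, stage_two and depth are hypothetical and are introduced solely for this sketch.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the iteration stream


def run_pipeline(items, stage_one, stage_two, depth=4):
    """Run the two stages of each loop iteration on separate threads.

    Hypothetical sketch: stage_one is the prior (producer) stage and
    stage_two is the next (consumer) stage of each iteration.
    """
    handoff = queue.Queue(maxsize=depth)  # bounded producer/consumer queue
    results = []

    def producer():
        # Prior stage: may start iteration i+1 while the consumer
        # is still working on the next stage of iteration i.
        for item in items:
            handoff.put(stage_one(item))
        handoff.put(_DONE)

    def consumer():
        # Next stage: consumes prior-stage output in iteration order.
        while True:
            value = handoff.get()
            if value is _DONE:
                break
            results.append(stage_two(value))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, run_pipeline(range(5), lambda x: x + 1, lambda y: y * 2) passes each iteration through both stages in order. The bounded queue limits how far the producer may run ahead of the consumer, and every queued item is data shared between the two threads.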
One drawback of pipeline parallelism is that sharing queued data between multiple cores through the multilevel cache hierarchies of modern multicore processors may incur significant performance delays and cause expensive increases in coherency traffic, power use and energy consumption.
Some proposed solutions introduce a kind of message-passing architecture and/or software-managed memories without coherency support. These avoid the increases in traffic, power requirements and energy consumption associated with built-in cache coherency by putting the burden onto software. One drawback may be that considerable development and maintenance effort is added to the responsibilities of software programmers.
To date, potential solutions to such performance-limiting issues, energy consumption concerns, and other bottlenecks have not been adequately explored.