In large multi-core systems, it is crucial for software developers to structure their programs to take advantage of the plurality of computational units in the system. Software not optimized for multiple cores may not improve its performance when running on such systems.
For hardware platforms, it may be crucial to offer support that allows software developers to rapidly write optimized code that benefits from existing and upcoming hardware features.
In particular, one important challenge for software developers is to deal with fine-grained inter-thread communications which may increase cache-coherency overhead, inter-chip communication, the impact of false sharing, the number of thread migrations, and which may saturate specific hardware constructs and limit scalability. This is particularly evident in modern managed languages, e.g. JAVA™, that provide automatic memory management.
Moreover, fine-grained inter-thread communication may reduce the performance of multi-thread applications executed on modern multi-core systems.
If two or more threads execute concurrently on different cores and communicate by means of a shared data structure, references to the accessed data have to be moved between the caches of the corresponding cores.
On modern architectures, the last-level cache, e.g., the L3 cache, is conventionally shared between the cores that are part of the same CPU (Central Processing Unit).
However, this may not be the case for caches at other levels, e.g., L1 and L2, or for cores on different CPUs.
In case of fine-grained communication, the overhead introduced by cache-coherency protocols may severely impact the program execution time.
In this regard, FIG. 10 shows an example of conventional inter-thread communication between producer threads T1-T3 and consumer threads C1-C3. Each thread T1-T3, C1-C3 has an allocated, unshared cache M1-M6.
The producer threads T1-T3 communicate with the consumer threads C1-C3 by means of a shared data structure, embodied by a shared queue SQ in FIG. 10. References to this shared data structure SQ have to be frequently moved between the caches M1-M6 of the cores on which the corresponding threads T1-T3, C1-C3 are executing. This issue is even more evident in case of multiple producers T1-T3 and consumers C1-C3 accessing the same data structure SQ as exemplarily shown in FIG. 10.
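The conventional arrangement of FIG. 10 may be sketched in JAVA™ as follows. This is a minimal illustration only: the class name SharedQueueDemo, the method run and its parameters are hypothetical and do not appear in the figure, and java.util.concurrent.ConcurrentLinkedQueue is merely assumed as one possible embodiment of the shared queue SQ.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the FIG. 10 pattern; all names are illustrative.
final class SharedQueueDemo {

    // Starts producer and consumer threads that communicate only through one
    // shared queue and returns the number of items the consumers removed.
    static long run(int producers, int consumers, int itemsPerProducer) {
        final ConcurrentLinkedQueue<Integer> sharedQueue = new ConcurrentLinkedQueue<>(); // SQ
        final long total = (long) producers * itemsPerProducer;
        final AtomicLong consumed = new AtomicLong();

        Thread[] threads = new Thread[producers + consumers];
        for (int p = 0; p < producers; p++) {
            threads[p] = new Thread(() -> {
                for (int i = 0; i < itemsPerProducer; i++) {
                    sharedQueue.offer(i); // every insertion updates the shared tail
                }
            });
        }
        for (int c = 0; c < consumers; c++) {
            threads[producers + c] = new Thread(() -> {
                // Every removal updates the shared head; the counter is shared as well.
                while (consumed.get() < total) {
                    if (sharedQueue.poll() != null) {
                        consumed.incrementAndGet();
                    } else {
                        Thread.yield(); // queue momentarily empty
                    }
                }
            });
        }
        for (Thread t : threads) {
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return consumed.get();
    }
}
```

In such a sketch, every call to offer or poll modifies state that all six threads access, so the cache lines holding the head and tail of the queue may be transferred between the caches M1-M6 on almost every operation.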
Document US 2010/0332755 A1 describes a method and an apparatus for using a shared ring buffer to provide thread synchronization in a multi-core processor system. Therein, synchronization between threads in a multi-core processor system is provided. Such an apparatus includes a memory, a first processor core, and a second processor core. The memory includes a shared ring buffer for storing data units, and stores a plurality of shared variables associated with accessing the shared ring buffer. The first processor core runs a first thread and has a first cache associated therewith. The first cache stores a first set of local variables associated with the first processor core. The first thread controls insertion of data items into the shared ring buffer using at least one of the shared variables and the first set of local variables. The second processor core runs a second thread and has a second cache associated therewith. The second cache stores a second set of local variables associated with the second processor core. The second thread controls extraction of data items from the shared ring buffer using at least one of the shared variables and the second set of local variables.
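The general idea of combining shared variables with thread-local variables for accessing a ring buffer may be illustrated by the following single-producer, single-consumer sketch. It is not taken from the cited document; the class SpscRingBuffer, its fields and its methods are hypothetical names, and the locally cached indices are merely one assumed way in which local variables may reduce accesses to the shared variables.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-producer/single-consumer ring buffer; illustrative only.
final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;

    // Shared variables: read by both threads, each written by only one of them.
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    // Local variables: each is touched by exactly one thread.
    private long cachedHead; // producer's last known value of head
    private long cachedTail; // consumer's last known value of tail

    SpscRingBuffer(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        slots = new Object[capacity];
        mask = capacity - 1;
    }

    // To be called by the producer thread only.
    boolean offer(T item) {
        long t = tail.get();
        if (t - cachedHead >= slots.length) {
            cachedHead = head.get(); // read the shared variable only when the buffer looks full
            if (t - cachedHead >= slots.length) {
                return false; // buffer is full
            }
        }
        slots[(int) (t & mask)] = item;
        tail.set(t + 1); // publishes the item to the consumer
        return true;
    }

    // To be called by the consumer thread only.
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h >= cachedTail) {
            cachedTail = tail.get(); // read the shared variable only when the buffer looks empty
            if (h >= cachedTail) {
                return null; // buffer is empty
            }
        }
        int index = (int) (h & mask);
        T item = (T) slots[index];
        slots[index] = null; // release the reference for garbage collection
        head.set(h + 1); // frees the slot for the producer
        return item;
    }
}
```

In this sketch, the producer reads the shared head only when its local copy suggests that the buffer is full, and the consumer reads the shared tail only when its local copy suggests that the buffer is empty, which may reduce the number of cache-line transfers between the two cores.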
Document US 2010/0223431 A1 shows a memory access control system, a memory access control method, and a program thereof. In a multi-core processor of a shared-memory type, deterioration in the data processing capability caused by contention among memory accesses from a plurality of processors is suppressed effectively. The memory access controlling system controls accesses to a cache memory in a data read-ahead process when the multi-core processor performs a task that includes a data read-ahead thread for executing data read-ahead and a parallel execution thread for performing an execution process in parallel with the data read-ahead. The system includes a data read-ahead controller which controls the interval between data read-ahead processes in the data read-ahead thread, adapting it to a data flow that varies with an input value of the parallel process in the parallel execution thread. By controlling the interval between the data read-ahead processes, contention among memory accesses in the multi-core processor is suppressed.
Document US 2010/0169895 A1 describes a method and a system for inter-thread communication using processor messaging. In shared-memory computer systems, threads may communicate with one another using shared memory. A receiving thread may poll a message target location repeatedly to detect the delivery of a message. Such polling may cause excessive cache coherency traffic and/or congestion on various system buses and/or other interconnects. A method for inter-processor communication may reduce such bus traffic by reducing the number of reads performed and/or the number of cache coherency messages necessary to pass messages. The method may include a thread reading the value of a message target location once, and determining that this value has been modified by detecting inter-processor messages, such as cache coherence messages, indicative of such modification. In systems that support transactional memory, a thread may use transactional memory primitives to detect the cache coherence messages. This may be done by starting a transaction, reading the target memory location, and spinning until the transaction is aborted.
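For illustration, the repeated polling of a message target location that the cited document seeks to avoid may be sketched as follows. The sketch shows only the conventional polling approach, not the transactional-memory technique of the cited document; the class PollingMailbox and its members are hypothetical, and the value 0 is assumed to mean that no message has been delivered yet.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of conventional polling on a shared memory location.
final class PollingMailbox {
    // Message target location; 0 is assumed to mean "no message yet".
    private final AtomicInteger target = new AtomicInteger(0);

    // Sender side: writes a non-zero message to the target location.
    void send(int message) {
        target.set(message);
    }

    // Receiver side: re-reads the shared location until a message appears.
    int receive() {
        int value;
        while ((value = target.get()) == 0) {
            Thread.yield(); // each iteration performs another read of the location
        }
        return value;
    }

    // Runs one receiver thread and sends it a single non-zero message.
    static int demo(int message) {
        PollingMailbox box = new PollingMailbox();
        int[] received = new int[1];
        Thread receiver = new Thread(() -> received[0] = box.receive());
        receiver.start();
        box.send(message);
        try {
            receiver.join(); // join makes the receiver's write to received[0] visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for receiver", e);
        }
        return received[0];
    }
}
```

The number of reads performed in receive grows with the time the receiver has to wait, which corresponds to the polling overhead described above.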
Document US 2010/0131720 A1 shows the management of ownership control and data movement in shared-memory systems. A method to exchange data in a shared-memory system includes the use of a buffer in communication with a producer processor and a consumer processor. The cache data is temporarily stored in the buffer. The method includes the consumer and the producer indicating intent to acquire ownership of the buffer. In response to the indication of intent, the producer, the consumer, and the buffer are prepared for the access. If the consumer intends to acquire the buffer, the producer places the cache data into the buffer. If the producer intends to acquire the buffer, the consumer removes the cache data from the buffer. The access to the buffer, however, is delayed until the producer, the consumer, and the buffer are prepared.
Document US 2009/0106495 describes fast inter-strand data communication for processors with write-through L1 caches. Therein, a non-coherent store instruction is used to reduce inter-thread communication latency between threads sharing a level-one write-through cache. When a thread executes the non-coherent store instruction, the level-one cache is immediately updated with the data value. The data value is immediately available to another thread sharing the level-one write-through cache. A computer system having reduced inter-thread communication latency is disclosed. The computer system includes a first plurality of processor cores, each processor core including a second plurality of processing engines sharing a level-one write-through cache. The level-one caches are connected to a level-two cache via a crossbar switch. The computer system further implements a non-coherent store instruction that updates a data value in the level-one cache prior to updating the corresponding data value in the level-two cache.
Further, thread-to-thread communication is described in US 2005/0289555 A1. A method for programmer-controlled cache line eviction policy is shown in US 2006/0143396 A1. Further background is described in references [1] and [2].