A “multi-core processor” is a single computing component comprising a number of independent processors (cores), each of which is able to read and execute program instructions. The cores may be integrated onto a single chip, or may be discrete, interconnected components. A multi-core processor allows different or the same sets of instructions to be executed in parallel, significantly increasing processing power as compared to single-core processors. However, significant challenges are encountered when writing and handling code for use with multi-core processors. FIG. 1A illustrates schematically a single-core processor memory architecture comprising a main memory (off chip) and a single-core on-chip processor with layer 1 (L1) and layer 2 (L2) caches. FIG. 1B illustrates schematically a multi-core processor architecture, again with a (common) off-chip main memory.
A particular problem that is encountered with multi-core processors concerns memory access. This is known as the “shared state problem” and arises when individual cores of the system try to access the same data (shared data) from the same location (of a memory) at the same time. If two different cores of the system are allowed to access the same data at the same time, the consistency of that data may be compromised and the system becomes unreliable.
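By way of illustration only, the loss of consistency described above can be sketched as a deterministic model in which two cores each perform a read-modify-write on one shared location. The names `read`, `write`, `run_interleaved` and `run_serialized` are hypothetical and are introduced solely for this sketch; the interleaving is written out explicitly rather than produced by real concurrent hardware.

```python
# Toy model of the shared state problem. A dict stands in for a shared
# memory location; each "core" updates it with a separate read and write.

def read(mem):
    """Return the current value of the shared location."""
    return mem["counter"]

def write(mem, value):
    """Store a new value in the shared location."""
    mem["counter"] = value

def run_interleaved(mem):
    """Both cores read before either writes, so one update is lost."""
    a = read(mem)        # core A reads the old value
    b = read(mem)        # core B reads the same old value
    write(mem, a + 1)    # core A stores old + 1
    write(mem, b + 1)    # core B also stores old + 1, overwriting A's update
    return mem["counter"]

def run_serialized(mem):
    """Each core completes its read-modify-write before the other starts."""
    a = read(mem)
    write(mem, a + 1)
    b = read(mem)
    write(mem, b + 1)
    return mem["counter"]

lost = run_interleaved({"counter": 0})
ok = run_serialized({"counter": 0})
print(lost, ok)  # 1 2
```

Two increments are requested in both cases, but only the serialized schedule yields a count of two.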
Two approaches to mitigate this shared state problem are (i) using locks and (ii) using hardware or software transactional memory. Locks are resources that may be owned by only one processing instance (processor or thread). If a core acquires “ownership” of a lock, that core is guaranteed exclusive access to the underlying resources (such as data). In the software transactional memory (TM) approach, concurrent access to data by cores is allowed. However, in the event that a conflict arises between first and second accessing cores trying to access the same data at the same time, the first accessing core is stopped and all changes performed by that core are rolled back to a safe state. Thereafter, only the second accessing core is allowed to act on the shared data. After the second accessing core has finished acting on the shared data, the first accessing core is allowed to act on the shared data.
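The two approaches above may be sketched, under simplifying assumptions, as follows. `LockedCounter` uses a standard mutual-exclusion lock. `VersionedCell` is a hypothetical toy that detects conflicts by comparing version numbers at commit time; this is only one possible way of deciding which accessor is rolled back, and it is not taken from any particular transactional memory implementation.

```python
import threading

class LockedCounter:
    """Lock approach: only the owner of the lock may touch the data."""
    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        with self._lock:          # exclusive access while the lock is owned
            self.value += 1

def run_with_lock(n_threads=4, n_increments=1000):
    counter = LockedCounter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

class VersionedCell:
    """Toy transactional cell: a commit succeeds only if no other
    accessor has committed since this accessor took its snapshot."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self._guard = threading.Lock()   # protects snapshot and commit steps only

    def begin(self):
        with self._guard:
            return self.value, self.version

    def try_commit(self, new_value, read_version):
        with self._guard:
            if self.version != read_version:
                return False             # conflict: caller discards its work
            self.value = new_value
            self.version += 1
            return True

def demo_conflict():
    cell = VersionedCell(10)
    a_val, a_ver = cell.begin()                 # first accessor starts
    b_val, b_ver = cell.begin()                 # second accessor starts concurrently
    b_ok = cell.try_commit(b_val + 5, b_ver)    # second accessor commits
    a_ok = cell.try_commit(a_val * 2, a_ver)    # first accessor is rolled back
    a_val, a_ver = cell.begin()                 # first accessor retries afterwards
    a_retry_ok = cell.try_commit(a_val * 2, a_ver)
    return b_ok, a_ok, a_retry_ok, cell.value

print(run_with_lock())   # 4000
print(demo_conflict())   # (True, False, True, 30)
```

In the second demonstration the rolled-back accessor repeats its work on the value left by the other accessor, which gives (10 + 5) * 2 = 30.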
Considering further the lock approach, this may be considered non-composable, i.e., two pieces of otherwise correct program code, when combined, may not perform correctly, resulting in hard-to-detect deadlock or live-lock situations. The transactional memory approach, on the other hand, while composable, results in a large processing overhead (usually requiring hardware support). In addition, the transactional memory approach is not scalable, i.e., the addition of further cores to an existing system results in lower performance. The multi-core system may become increasingly inefficient as the number of cores trying to access the same data is increased. Furthermore, neither the lock approach nor the TM approach is predictable and deterministic, i.e., it is difficult, and in some cases impossible, to calculate a reliable upper bound for the execution time required by the accessing cores. Such behaviour is unsuitable for real-time applications in particular.
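The non-composability of locks can be illustrated with a minimal sketch in which two routines, each correct in isolation, acquire the same two locks in opposite orders. All names below are hypothetical. A non-blocking acquire and two barriers are used so that the sketch reports the conflict and terminates; with ordinary blocking acquires both threads would wait on each other indefinitely.

```python
import threading

def hold_then_try(first, second, both_hold_first, both_tried, results, name):
    """Own `first`, then attempt to take `second` without blocking."""
    with first:
        both_hold_first.wait()                 # each thread now owns one lock
        got = second.acquire(blocking=False)   # a blocking acquire would hang here
        results[name] = got
        both_tried.wait()                      # keep `first` until both have tried
        if got:
            second.release()

def demo_lock_order_conflict():
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    both_hold_first = threading.Barrier(2)
    both_tried = threading.Barrier(2)
    results = {}
    # t1 takes A then B; t2 takes B then A. Each order is fine on its own.
    t1 = threading.Thread(
        target=hold_then_try,
        args=(lock_a, lock_b, both_hold_first, both_tried, results, "t1"))
    t2 = threading.Thread(
        target=hold_then_try,
        args=(lock_b, lock_a, both_hold_first, both_tried, results, "t2"))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results

outcome = demo_lock_order_conflict()
print(outcome["t1"], outcome["t2"])  # False False
```

Neither thread obtains its second lock, which is the circular wait underlying a deadlock.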
The literature on cache coherency protocols is significant and includes so-called “snoopy” protocols:
J. R. Goodman, “Using Cache Memory to Reduce Processor-Memory Traffic”, Proc. of the 10th International Symposium on Computer Architecture, pp. 124-131.
R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, and R. G. Sheldon. Implementing a Cache Consistency Protocol. Proc. of the 12th International Symposium on Computer Architecture, pp. 276-283.
M. Papamarcos and J. Patel. A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories. Proc. of the 11th International Symposium on Computer Architecture, pp. 348-354.
P. Sweazey, A. J. Smith. A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus. Proc. of 13th International Symposium on Computer Architecture. pp. 414-423.
as well as directory based protocols:
D. Chaiken, C. Fields, K. Kurihara, A. Agarwal: Directory-Based cache Coherence in Large-Scale Multiprocessors. IEEE Computer 23(6): 49-58.
A. Gupta, W. D. Weber, T. C. Mowry: Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. Proc. of ICPP (1): 312-321.
H. Nilsson and P. Stenström. The Scalable Tree Protocol—A Cache Coherence Approach for Large-Scale Multiprocessors. Proc. of 4th IEEE Symposium on Parallel and Distributed Processing, pp. 498-507.
These protocols, including commercial solutions, rely on the principle of delivering memory data, required by a specific processor core, to the private cache of that processor core. Existing cache coherence solutions tend to have high complexity and require a significant design and verification effort, due to the large number of special cases that must be handled when the memory is accessed truly concurrently and the same memory blocks are present in multiple caches in the memory hierarchy. Another drawback of cache coherence is that it moves the data to the computation, which can potentially cause significant inefficiencies.
In contrast to these known protocols, more recent work [see for example Vajda, A. Handling of Shared Memory in Many-core systems without Locks and Transactional Memory. 3rd Workshop on Programmability Issues for Multi-core Computers (MULTIPROG), and Suleman, M. A., Mutlu, O., Qureshi, M. K., Patt, Y. N. Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures. In International Conference on Architectural Support for Programming Languages and Operating Systems] takes a different approach, employing the principle of moving the computation to the data. The solution proposed by Suleman et al. relies on concentrating all access to shared memory in one single, powerful core, while Vajda proposes a generalized solution based on software-driven allocation of memory blocks to processor cores. In a further paper [Vajda, A. The Case for Coherence-less Distributed Cache Architecture. 4th Workshop on Chip Multiprocessor Memory Systems and Interconnects] a preliminary analysis is offered of the impact that such solutions can have on chip architectures and memory models.
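The principle of moving the computation to the data may be sketched as follows, purely as an illustration and not as a reproduction of the cited proposals. A hypothetical `BlockOwner` thread stands in for the core to which a memory block has been allocated; other threads never touch the block directly but instead send operations to the owner, which executes them one at a time.

```python
import queue
import threading

class BlockOwner:
    """Stands in for the single core that owns a memory block."""
    def __init__(self, initial):
        self._block = dict(initial)        # only the owner thread touches this
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._inbox.get()
            if item is None:               # sentinel: shut the owner down
                break
            operation, reply = item
            reply.put(operation(self._block))   # run the computation at the data

    def execute(self, operation):
        """Ship `operation` to the owner and wait for its result."""
        reply = queue.Queue(maxsize=1)
        self._inbox.put((operation, reply))
        return reply.get()

    def stop(self):
        self._inbox.put(None)
        self._thread.join()

def add_to(key, amount):
    """Build an operation that increments block[key] by `amount`."""
    def operation(block):
        block[key] = block.get(key, 0) + amount
        return block[key]
    return operation

def run_demo(n_clients=4, n_ops=250):
    owner = BlockOwner({"counter": 0})

    def client():
        for _ in range(n_ops):
            owner.execute(add_to("counter", 1))

    threads = [threading.Thread(target=client) for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    final = owner.execute(lambda block: block["counter"])
    owner.stop()
    return final

print(run_demo())  # 1000
```

Because every update is executed by the owner alone, no lock is taken on the block and no rollback is needed; the cost is that each access becomes a message to the owning thread.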
WO2010/020828 describes a method and architecture for sharing data in a multi-core processor architecture. Foong, A. et al., An Architecture for Software-based iSCSI on Multiprocessor Servers, describes the use of a software implementation of iSCSI in the context of chip multiprocessing (CMP).