In SMAs (shared memory architecture), data and program partitioning is typically carried out by placing data requiring processing by multiple threads into the shared memory and splitting program more independently to processors, thus making programming easier compared to message passing (MPA) architectures in which processing happens always locally and the programmer is responsible for moving data around accordingly. Unfortunately most SMAs use a distributed shared memory architecture consisting of multiple interconnected processor-cache pairs, which makes cache coherency (and therefore latency tolerance) and synchronicity maintenance very expensive. This may even ruin their performance in communication intensive problems.
To tackle e.g. the above problem, the emulated shared memory (ESM), or shared memory emulation, architectures have been introduced. They incorporate a set of multithreaded processors that are connected via a high-throughput intercommunication network to a common uniformly and synchronously accessible shared memory. The memory system latency is hidden by overlapping on-going memory references and a special low-cost synchronization mechanism is established guaranteeing synchronicity at machine instruction level. The ESM systems provide the user with perception of ideal shared memory even though the actual hardware architecture comprises a physically distributed memory. From a theoretical standpoint, these architectures attempt to emulate the abstract parallel random access machine (PRAM) that is commonly used as a model for describing and analyzing the intrinsic parallelism of computational problems as well as performance and cost of executing parallel algorithms due to its simplicity and expressivity. A PRAM model generally refers to a set of processors working under the same clock and a uniform single step accessible shared memory connected to them.
Accordingly, ESM is a feasible technique to address programmability and performance scalability concerns of chip multiprocessors (CMP) as it yields implied synchrony in the execution of machine instructions, efficient latency hiding technique, and sufficient bandwidth to route all the memory references even with heavy random and concurrent access workloads. Synchronous execution is considered to make programming easier as a programmer does not need to synchronize the threads of execution explicitly after each global memory access but can rely on the hardware to take care of that automatically, whereas e.g. in message passing architectures (MPA), a programmer is responsible for explicitly defining communication, synchronizing subtasks, and describing data and program partitioning between threads making MPAs difficult to program. Latency hiding used in shared memory emulation makes use of the high-throughput computing scheme, where other threads are executed while a thread refers to the global shared memory. Since the throughput computing scheme employs parallel slackness extracted from the available thread-level parallelism, it is considered to provide enhanced scalability in contrast to traditional symmetric multiprocessors and non-uniform memory access (NUMA) systems relying on snooping or directory-based cache coherence mechanisms and therefore suffering from limited band-width or directory access delays and heavy coherence traffic maintenance.
Recently, scalable ESM architectures have been suggested incorporating step caches to implement the concurrent read concurrent write (CRCW) memory access variant of PRAM, which further simplifies programming and increases performance by a logarithmic factor in certain cases. Also a mechanism to support constant execution time multi(-prefix)operations—implementing even stronger multioperation concurrent read concurrent write (MCRCW) variant of the PRAM model—has been implemented with the help of scratchpads that are attached to step caches in order to bound the associativity of step caches. For instance, publications 1: M. Forsell, Step Caches—a Novel Approach to Concurrent Memory Access on Shared Memory MP-SOCs, In the Proceedings of the 23th IEEE NORCHIP Conference, Nov. 21-22, 2005, Oulu, Finland, 74-77, 2: M. Forsell, Reducing the associativity and size of step caches in CRCW operation, In the Proceeding of 8th Workshop on Advances in Parallel and Distributed Computational Models (in conjunction with the 20th IEEE International Parallel and Distributed Processing Symposium, IPDPS'06), Apr. 25, 2006, Rhodes, Greece, 3: M. Forsell, Realizing Multioperations for Step Cached MP-SOCs, In the Proceedings of the International Symposium on System-on-Chip 2006 (SOC'06), Nov. 14-16, 2006, Tampere, Finland, 77-82., 4: M. Forsell, TOTAL ECLIPSE—An Efficient Architectural Realization of the Parallel Random Access Machine, In Parallel and Distributed Computing Edited by Alberto Ros, INTECH, Vienna, 2010, 39-64., and 5: M. Forsell and J. Roivainen, Supporting Ordered Multiprefix Operations in Emulated Shared Memory CMPs, In the Proceedings of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'11), Jul. 18-21, 2011, Las Vegas, USA, 506-512, contemplate different aspects of such a solution and are thereby incorporated herein by reference in their entireties. Multi(-prefix)operations can be defined for many basic operations, e.g. ADD, SUB, MAX etc., and considered as parallel primitives due to the capability to express parallel algorithms. They can be used for synchronization and parallel data structures simultaneously accessed by several processors without race conditions and other anomalies of architectures executing threads asynchronously.
In FIG. 1, a high-level illustration of a scalable architecture 100 to emulate shared memory on a silicon platform is shown. It comprises a set of processors (cores) P1, P2, P3, . . . , Pp 102 connected to a physically distributed, but logically shared (data) memory M1, M2, M3, . . . , Mp 112 via a physically scalable high bandwidth interconnection network 108. Active memory units 110 in connection with data memory 112 can be considered as memory control logic units utilized to process the memory references. The active memory units 110 are arranged to manage computation related to cases in which multiple memory references are targeted to the same memory location during, e.g., multioperations that may include multiprefix operations, for instance. Instruction memory modules I1, I2, I3, . . . , Ip 104 are configured to carry the program code for each processor 102. To efficiently emulate shared memory by the distributed memory-based implementation, the processors 102 are multithreaded utilizing a Tp-stage cyclic, interleaved inter-thread pipeline (Tp≥the average latency of the network). The PRAM model is linked to the architecture such that a full cycle in the pipeline corresponds typically to a single PRAM step. During a step of multi-threaded execution (regarding the pipeline in overall, i.e. all pipeline stages including the actual execution stage), each thread of each processor of the CMP executes an instruction including at most one shared memory reference sub-instruction. Therefore a step lasts for multiple, at least Tp+1 clock cycles.
In the depicted architecture, step caches are generally associative memory buffers in which data stays valid only to the end of ongoing step of multithreaded execution. The main contribution of step caches to concurrent accesses is that they step-wisely filter out everything but the first reference for each referenced memory location. This reduces the number of requests per location from P×Tp down to P allowing them to be processed sequentially on a single ported memory module assuming Tp≥P. Scratchpads are addressable memory buffers that are used to store memory access data to keep the associativity of step caches limited in implementing multioperations with the help of step caches and minimal on-core and off-core ALUs (arithmetic logic unit) that take care of actual intra-processor and inter-processor computation for multioperations. Scratchpads may be coupled with step caches to establish so-called scratchpad step cache units S1, S2, S3, . . . , Sp 106.
One underlying idea of the reviewed solution is indeed in the allocation of each processor core 102 with a set of threads that are executed efficiently in an inter-leaved manner and hiding the latency of the network. As a thread makes a memory reference, the executed thread is changed and the next thread can make its memory request and so on. No memory delay will occur provided that the reply of the memory reference of the thread arrives to the processor core before the thread is put back to execution. This requires that the bandwidth of the network is high enough and hot spots can be avoided in pipelined memory access traffic. Synchronicity between consecutive instructions can be guaranteed by using an elastic synchronization wave between the steps, for instance.
FIG. 2 shows, at 200, one illustration of an ESM CMP architecture incorporating e.g. the aforementioned active memory units 112B (with ALU and fetcher) in connection with data memory modules 112 and scratchpads 206B. The network 108 may be a mesh-like interconnection network acting as a high-bandwidth pipelined memory system with switches 108B. The memory access latency is hidden by executing other threads while a thread is referencing the uniformly accessible distributed shared memory via the network 108. Congestion of references and hot spots in communication are avoided with an efficient dead-lock free intercommunication architecture featuring high bandwidth (bisection BW≥P/4) and randomized hashing of memory locations over the distributed memory modules. Execution of instructions happens in steps corresponding to a single PRAM step during which each thread executes a single instruction.
Despite of the many aforementioned advantages, ESM systems have appeared difficult to realize in truly optimal fashion. A physically feasible memory unit (MU) making use of step cache and scratchpad techniques to support strong con-current memory access and multioperations is easily comprehensible as one key component of powerful emulated shared memory architecture like REPLICA (REmoving Performance and programmability LImitations of Chip multiprocessor Architectures), which is basically a configurable ESM. Such MU sends the outgoing memory references to the shared memory system as well as waits and receives possible replies therefrom. Unfortunately, in the prior art MU solution described below in more detail, the low-level implementation details are non-existent and the proposed arrangement requires relatively complex multiport step caches and scratchpads or complex sorting arrays and large node-wise buffers. In addition, the receive logic of the prior solution accesses both step cache and scratchpad during a single clock cycle and the performance of the latter one is spoiled by two step minimum latency for all memory operations. All this rules the already-available MU solution rather impractical.
FIG. 3 represents, at 300, a high-level block diagram and pipeline of a typical MCRCW ESM processor making use of step caches. A processor in a step cache-based MCRCW (C)ESM CMP comprises A ALUs, M memory units (MU), a distributed or unified register block, a sequencer and some glue logic. In the figure Ax 302 refers to ALU x, IF 308 refers to instruction fetch logic, MEM 304 refers to memory unit stage, OS 306 refers to operand selection logic and SEQ 310 refers to sequencer. As implied in the figure, there are ALUs 302 logically positioned prior to and after the memory unit wait segment.
With reference to FIG. 4, prior memory unit (MU) architecture 400 is reviewed hereinafter in more detail. In the visualized architecture, a (baseline) MU comprises a hash and compose unit (HCU) 402, send logic 410, dual array step cache 404, scratchpad 406, reply receive logic 412 and receive ALU 414.
The HCU 402 is responsible for computing the hash address for memory references and composing the outgoing memory reference messages out of memory data given in the MDR (memory data register) register, memory address given in the MAR (memory address register) register, thread identifier, memory operation from the operation register, and LSBs of the current step provided by a step counter.
Based on memory (sub)instruction, the send logic 410 accesses the step cache 404 and scratchpad 406 and sends the reference (memory request) on its way to the shared memory system 408 or calculates the internal memory operation result against the scratchpad data and gives the result as a fast reply directly to the reply receive logic 412.
The reply receive logic 412 is configured to receive incoming memory request reply messages from the shared memory system and send logic 410 (fast reply). It compares the address fields of these two and tries to set the received data registers of the threads in the receive logic. Two comparators per pipeline stage are needed because both replies and send logic data are moving against each other.
Step caches 404 (in which data stays valid only to the end of ongoing step of multithreaded execution) are utilized for implementing CRCW access to the shared memory system 408 without cache coherence problems by filtering processor and step-wisely out everything but the first references for each referenced memory location as mentioned hereinbefore. Each cache line preferably contains data, initiator (thread id of the first thread referring to the location specified by the address), and address tags. Two cache arrays and two ports are needed since the MU needs to access step cache from both send and receive stages of the pipeline that may carry threads belonging to different steps.
Each time the processor refers the shared memory system 408 a step cache search is performed. A hit is detected on a cache line if the line is in use, the address matches the address tag of the line, and the least significant bits of the current step of the reference match the two LSBs of the step of the line. In the case of a hit, a write is just ignored while a read is just completed by accessing the data from the cache. In the case of a miss, the reference is stored into the cache using the replacement policy and marked as pending (for reads). At the same time with storing the reference information to the cache line, the reference itself is sent to the shared memory system 408. When a reply of a read arrives from the shared memory system 408, the data is put to the data field of the step cache line storing the reference information and the pending field is cleared. Predetermined cache decay logic is exploited to care of invalidating the lines before their step field matches again to the least significant bits of current step.
In order to implement multioperations with step caches of limited associativity, separate dual port processor-level multioperation memories, scratch-pads, 406 are used. This is because there is a need to store the id of the initiator thread of each multioperation sequence to the step cache and internal initiator thread id register as well as reference information to a storage that saves the information regardless of possible conflicts that may wipe away information on references from the step cache. The scratchpad 406 has fields for data, address and pending for each thread of the processor. Multioperations can be implemented by using sequences of two instructions so that data to be written in the step cache 404 is also written to the scratchpad 406, id of the first thread referencing a certain location is stored to the step cache 406 and the Initiator register (for the rest of references), the pending bit for multioperations is kept in the scratchpad 406, the reply data is stored to the scratchpad 406, and reply data for the ending operation is retrieved from the scratchpad 406 (see Algorithm 1 in Table 1 below).
TABLE 1Algorithm 1: Implementation of a two-step MPADD multioperation inthe baseline MUPROCEDURE Processor::Execute::BMPADD ( Write_data ,Write_Address )Search Write_Address from the StepCache and put the result inmatching_index IF not found THEN IF the target line pending THEN  Mark memory system busy until the end of the current cycle ELSE Read_data := 0  StepCache[matching_index].data := Write_data  StepCache[matching_index].address := Write_Address  StepCache[matching_index].initiator_thread := Thread_id  ScratchPad[Thread_id].Data := Write_data  ScratchPad[Thread_id].Address := Write_Address  Initiator_thread:=Thread_idELSE Read_data := StepCache[matching_index].data StepCache[matching_index].data:= Write_data + Read_data ScratchPad[Initiator_thread].Data := Write_data + Read_data Initiator_thread := StepCache[matching_index].Initiator_threadPROCEDURE Processor::Execute::EMPADD ( Write_data ,Write_Address )IF Thread_id<>Initiator_thread THEN IF ScratchPad[Initiator_thread].pending THEN  Reply_pending := True ELSE  Read_data := Write_data + ScratchPad[Initiator_thread].DataELSE IF Write_Address = ScratchPad[Initiator_thread].Address THEN  Send a EMPADD reference to the memory system with  - address = Write_Address   - data = ScratchPad[Initiator_thread].Data  ScratchPad[Thread_id].pending := True ELSE  Commit a Multioperation address error exceptionPROCEDURE Module::Commit_access::EMPADD ( Data , Address )Temporary_data := Memory [Address]Memory[Address] := Memory[Address] + DataReply_data := Temporary_dataPROCEDURE Processor::Receive_reply::EMPADD ( Data , Address ,Thread )Read_Data[Thread] := Data | Read_Data[Thread]+Data(if Thread≠Thread_id)ScratchPad[Thread].Data := DataScratchPad[Thread].Pending := FalseReplyPending[Thread_id] := FalseFOR each successor of Thread DO IF ReplyPending[successor] THEN  Read_data[successor] := Read_data[successor] + Data  ReplyPending[successor] := False
The starting operation (BMPxxx for arbitrary ordered multiprefix operations) executes a processor-wise multioperation against a step cache location without making any reference to the external memory system. The ending operation (EMPxxx for arbitrary ordered multi-prefix operations) performs the rest of the multioperation so that the first reference to a previously initialized memory location triggers an external memory reference using the processor-wise multioperation result as an operand. The external memory references that are targeted to the same location may be processed in the active memory units of the corresponding memory modules according to the type of the multioperation and the reply data is sent back to the scratchpads of participating processors.
Alternative realizations of concurrent access ESM systems use cacheless variants of the Ranade's algorithm or real multiport memories: The Ranade's algorithm implements a shared memory abstraction relying on sorting and messaging of memory references on a top a physically distributed memory system. While this solution provides some advantages like an ability to implement fast multioperations (the MCRCW PRAM model), it also leaves room for improvements since every access includes consecutive sort and access phases requiring complex hardware sorters and buffers to store combined references on the nodes of the network. In addition, the existing architectures rely on a non-scalable interconnection topology. Real multiport memories provide concurrent access down to memory cell-level by replicating the access circuitry of each memory cell array by the number of concurrent memory ports. This increases the area taken by the access circuitries quadratically with respect to the number of ports (the size of the cells increase only linearly) making them impractical if the number of ports exceeds, say, four. Ordinary cache based architecture in which caches are located next to the processors cannot be used in advanced shared memory computing since cache coherence is at least as expensive to retain with efficient shared memory algorithms as the multiport memories.