The present invention generally relates to multiprocessing. More specifically, the invention relates to reducing the bandwidth consumed by communications that maintain coherence between accelerators and CPUs, which may reside on the same chip or on different chips.
Early computer systems comprised a single central processing unit (CPU), along with the CPU's associated memory, input/output (I/O) devices, and mass storage systems such as disk drives, optical storage, magnetic tape drives, and the like.
However, the demand for processing power has increasingly outgrown the capabilities of a single processor, which has led to a number of solutions for relieving the strain on processors. One such solution is to couple an accelerator with the CPU. Accelerators are autonomous units that are either programmable or perform a specific function. When the CPU receives a request to perform such a function, it may delegate the work to the accelerator. While the accelerator processes its assigned task, the CPU may proceed to another task, thereby reducing the strain on the CPU and improving efficiency.
FIG. 1 illustrates an exemplary CPU 100 coupled with an accelerator 110 over an interconnect bus 120. The CPU 100 may be connected to a memory device 102. Memory device 102, for example, may be a Dynamic Random Access Memory (DRAM) device. Additionally, CPU 100 may also contain local cache memory 101 to facilitate fast accesses to data being processed. Accelerator 110 may be connected to the CPU over interconnect bus 120 to perform a specific function. For example, accelerator 110 may be a graphics accelerator that performs specialized graphical computations and transformations. The accelerator may have its own memory 112 and cache 111.
When a request for processing graphics is received by the CPU 100, accelerator 110 may be delegated the task of processing the graphics data. For example, block 1 contained in memory 102 may contain graphics data that requires processing. When the processing request is sent to the CPU, block 1 may be transferred to cache 111 (or accelerator memory 112) for processing by the accelerator. When the processing of block 1 is completed, it may be written back to memory 102 by the accelerator.
One problem with the prior art is that accelerators and CPUs are connected without memory coherence. Because the processor and the accelerator may share data contained in the memories 102 and 112, there is a need for coordination between the accelerator and the CPU when working on shared data. Coherency is required to ensure that the accelerator and the CPU do not access different data addressed by the same memory location. For example, in FIG. 1, the CPU may receive a request to process block 1 after block 1 has been sent to accelerator 110 for processing. If the new request is not a graphics processing request, the CPU may cache block 1 for processing. If the accelerator completes processing the block before the CPU processes the block, the data cached by the CPU will be outdated. Therefore, the CPU will process incorrect data. Memory coherence requires that the most recently modified copy of the data be available to all processing devices.
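The stale-data hazard described above can be sketched as follows. This is an illustrative model, not from the disclosure: shared memory and two private caches are represented as dictionaries, and the block and value names are hypothetical.

```python
# Minimal sketch of the stale-data hazard: shared memory plus two
# private caches with no coherence protocol between them.
memory = {"block1": "v0"}   # shared memory holding block 1

cpu_cache = {}              # CPU's private cache (cache 101)
acc_cache = {}              # accelerator's private cache (cache 111)

# The CPU caches block 1 for later processing.
cpu_cache["block1"] = memory["block1"]

# The accelerator finishes processing and writes its result back to memory.
acc_cache["block1"] = "v1"
memory["block1"] = acc_cache["block1"]

# Without coherence, the CPU still sees its outdated copy.
assert cpu_cache["block1"] == "v0"   # stale data in the CPU's cache
assert memory["block1"] == "v1"      # current data in memory
```

The final assertions show the divergence: the CPU would process the obsolete value "v0" even though memory holds "v1".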
One solution to this problem is to implement a snooping protocol to update the obsolete data in caches. Addresses and commands issued by each processor may be transmitted to every other processor and/or accelerator. A bus monitor may be used to monitor address lines for memory accesses. If a cache contains a copy of a memory block being addressed on the bus, the cache may update its copy of the memory block. For example, in FIG. 1, a bus monitor may monitor bus 120. If a write operation by accelerator 110 is detected by cache 101 while cache 101 contains a copy of block 1, cache 101 may update its own copy so that it holds the most recent and accurate copy of block 1 for processing by the CPU.
In other embodiments, cache 101 may invalidate its copy of block 1 in response to detecting a memory write to block 1 in memory 102. Therefore, when the CPU attempts to access block 1 from cache, a fresh copy of block 1 may be retrieved from memory.
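The write-invalidate variant just described can be sketched as a small simulation. This is a hypothetical model assuming the single shared bus of FIG. 1; the class and method names are illustrative, not from the disclosure.

```python
# Sketch of a write-invalidate snooping bus: every write is broadcast,
# and each other cache that snoops the write drops its stale copy.
class Bus:
    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, writer, address, value, memory):
        memory[address] = value
        # Every other cache snoops the write and invalidates its copy.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(address, None)

class Cache:
    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.attach(self)

    def read(self, address, memory):
        # On a miss (or after invalidation), refill from memory.
        if address not in self.lines:
            self.lines[address] = memory[address]
        return self.lines[address]

    def write(self, address, value, memory):
        self.lines[address] = value
        self.bus.broadcast_write(self, address, value, memory)

memory = {"block1": "v0"}
bus = Bus()
cpu_cache, acc_cache = Cache(bus), Cache(bus)

cpu_cache.read("block1", memory)          # CPU caches block 1 ("v0")
acc_cache.write("block1", "v1", memory)   # accelerator's write invalidates it
assert cpu_cache.read("block1", memory) == "v1"   # CPU refetches fresh copy
```

After the accelerator's write, the CPU's next read misses in its cache and retrieves the fresh copy from memory, as described above.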
However, in a multiprocessing environment with multiple accelerators, CPUs, and shared memory, enforcing cache coherence means all memory accesses must be propagated to all coherent units. Each coherent unit may then snoop the memory access and respond to the initiator of the access, indicating whether an update is needed. This sort of communication between devices at each access to shared memory may consume much of the inter-node bandwidth and greatly reduce the efficiency of the system. A node may consist of a group of CPUs or accelerators that share a common physical bus through which the CPUs and accelerators perform coherent memory accesses. Often, but not necessarily, nodes are on different chips.
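The bandwidth cost of broadcasting can be made concrete with a back-of-the-envelope sketch. This is an illustrative assumption, not from the disclosure: each shared-memory access is modeled as generating one snoop message per other coherent unit, so snoop traffic grows with both the number of units and the number of accesses.

```python
# Snoop traffic for a broadcast protocol: every access is seen by all
# other coherent units, so messages = accesses * (units - 1).
def snoop_messages(num_units, num_accesses):
    return num_accesses * (num_units - 1)

assert snoop_messages(2, 1000) == 1000   # one CPU + one accelerator
assert snoop_messages(8, 1000) == 7000   # traffic grows with unit count
```

Adding coherent units multiplies the message count for the same workload, which is the bandwidth problem the following section addresses.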
Therefore, what is needed are methods and systems for efficiently maintaining cache coherence among multiple CPUs and accelerators.