The present invention relates generally to computer systems, and more particularly to low latency, high bandwidth local data exchange between processing elements in a computer system.
In computer systems with distributed execution of tasks, the transfer of data between processing elements can affect system performance and latency. In systems that include several levels of cache, communicating data involves copying the data to each cache level as the data is transferred to or from different processing elements. This copying of data to each cache level can increase latency and power consumption in the computer system.
Processing elements often need high bandwidth, low latency data exchange, depending upon the operations being performed. For example, a low latency, high bandwidth system is desired in order to perform efficient reductions of partial results or to reuse A-column data in a matrix multiply subroutine. In addition, operations such as passing along chained results for further processing in other processing elements can cause unwanted latency, as this often involves passing through several layers of cache.
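By way of illustration only, the following minimal sketch shows the two access patterns named above: reuse of A-column data and reduction of partial results in a matrix multiply. It is not taken from the described system. The function name `matmul_outer_product`, the parameter `num_elements`, and the modeling of processing elements as loop iterations are assumptions made for this example. Each modeled element reads a column of A once and reuses it for every column of B, and the per-element partial results are then summed, which is the step that would require data exchange between elements.

```python
def matmul_outer_product(a, b, num_elements=2):
    """Compute C = A x B as a sum of outer products.

    Hypothetical model: each "processing element" is a loop iteration
    that owns a strided subset of the inner (k) indices.
    """
    m, k_dim, n = len(a), len(b), len(b[0])

    partials = []
    for pe in range(num_elements):
        partial = [[0] * n for _ in range(m)]
        for k in range(pe, k_dim, num_elements):
            # Column k of A is gathered once and reused for all n columns of B.
            a_col = [a[i][k] for i in range(m)]
            for j in range(n):
                b_kj = b[k][j]
                for i in range(m):
                    partial[i][j] += a_col[i] * b_kj
        partials.append(partial)

    # Reduction of partial results: in a real system this step would move
    # data between processing elements.
    c = [[0] * n for _ in range(m)]
    for partial in partials:
        for i in range(m):
            for j in range(n):
                c[i][j] += partial[i][j]
    return c
```

In this sketch the reuse of `a_col` and the final summation over `partials` are the points at which a low latency, high bandwidth path between processing elements would matter, since both would otherwise pass through the cache levels.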