Modern computer systems often include multiple processors and/or multiple processing cores that must communicate with one another. For example, shared memory systems that maintain coherency between data in the caches of different processing cores often deploy complex cache coherence protocols that broadcast many messages between the processing elements. Additionally, parallel programs often depend on point-to-point, broadcast, scatter/gather, and other message communication patterns among the multiple processing cores of a computer system. Architectural design trends indicate that future systems will have even higher processing core counts.
As the number of processing elements in computer systems continues to increase, both academic and forward-looking industry projects have focused on finding communication solutions that can scale to large processing core counts while maintaining low communication latency. Some such projects have proposed the use of interconnection networks as a replacement for conventional shared buses and ad-hoc wiring solutions. For example, on-chip interconnects (also known as networks-on-chip) have been used to connect multiple processing cores on a single chip to one another according to various network topologies, such as two- or three-dimensional grids (i.e., meshes) with links between logically adjacent cores.
In traditional interconnects, messages are often sent as packets (or as portions of packets known as “flits”), which must traverse multiple cores before arriving at a final destination core. Because a flit must often traverse a multi-stage router pipeline at each intermediate core en route to its final destination core, messages between topologically distant cores on the interconnect can accumulate significant end-to-end latencies due to pipeline-traversal overheads.
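The way these per-hop overheads add up can be illustrated with a minimal sketch. It assumes a two-dimensional mesh with dimension-ordered routing (so the hop count equals the Manhattan distance between cores), a fixed router pipeline depth, and a fixed link delay, and it ignores contention and serialization. The function names and the default values of 4 pipeline stages and 1 link cycle are hypothetical choices for illustration, not figures taken from any particular design.

```python
def mesh_hops(src, dst):
    """Hop count between two (x, y) core coordinates in a 2D mesh.

    Assumes minimal routing, so the count is the Manhattan distance.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def end_to_end_latency(src, dst, router_stages=4, link_cycles=1):
    """Estimated zero-load latency in cycles for a single flit.

    Simplified model (an assumption, not a measured result): every hop
    costs one full router pipeline traversal plus one link traversal.
    """
    return mesh_hops(src, dst) * (router_stages + link_cycles)


if __name__ == "__main__":
    # Adjacent cores versus opposite corners of an 8x8 mesh.
    print(end_to_end_latency((0, 0), (0, 1)))
    print(end_to_end_latency((0, 0), (7, 7)))
```

Under these assumptions, latency grows linearly with the hop count, so a message crossing the mesh corner to corner pays the router pipeline cost many times over compared with a message between neighboring cores.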