Core-to-core (“C2C”) communication is critical in many computer applications today, such as packet processing, high-performance computing, machine learning, and data center/cloud workloads. In chip multi-processor (“CMP”) architectures, as the number of cores increases, C2C communication often becomes a limiting factor for performance scaling when workloads share data. On a general-purpose platform, a shared memory space between cores is often employed to realize efficient C2C communication. However, the need for software to carefully manage the shared memory space, together with the increase in hardware coherency traffic, tends to incur significant overhead. As a result, CPU cores and network-on-chip (“NoC”) designs that share coherent caches typically experience substantially longer latency and higher data traffic, while expending considerable resources to carry out communication-related work. This keeps CPU cores and NoCs from performing their intended data processing tasks.
In general, software queues such as the classic Lamport algorithm are commonly used on CMP platforms to enable C2C communication. There are two types of overhead generated in a traditional software queue. The first consists of cycles consumed by queue structure maintenance and synchronization, as well as by flow control and management of shared memory. This type of overhead is referred to as control plane overhead. The second type of overhead comprises cycles spent on moving data from one core to another. This type of overhead is referred to as data plane overhead. The sum of the control plane overhead and the data plane overhead constitutes the total overhead required to transfer data across cores. There are both software and hardware optimizations available for alleviating these overheads. The RTE-ring code from the DPDK library (a software optimization) and hardware-accelerated queueing utilizing Freescale's DPAA technology (a hardware optimization) are examples of the optimization techniques that exist today. However, none of these existing optimizations is ideal for reducing core-to-core communication overhead. This is especially true when it comes to simultaneously reducing both the control plane overhead and the data plane overhead.
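To make the two kinds of overhead concrete, the following is a minimal sketch of a single-producer, single-consumer ring buffer in the style of the Lamport queue. It is an illustration only, not code from DPDK or any other library: the class name `SpscRing`, the methods `try_push` and `try_pop`, and the 64-byte cache line alignment are all assumptions made for this example. The classic algorithm is usually described under sequential consistency; this sketch substitutes C++11 acquire/release atomics. Comments mark which steps would count as control plane work and which as data plane work.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical Lamport-style SPSC ring. One core calls try_push only,
// another core calls try_pop only. One slot is kept empty to tell a full
// ring from an empty one, so the usable capacity is N - 1.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "ring needs at least two slots");

public:
    // Producer side. Returns false when the ring is full.
    bool try_push(const T& item) {
        // Control plane: read the indices and check for a full ring.
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // flow control: consumer has not caught up
        }
        // Data plane: copy the payload into shared memory.
        buf_[tail] = item;
        // Control plane: publish the new tail so the consumer sees the item.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the ring is empty.
    bool try_pop(T& out) {
        // Control plane: read the indices and check for an empty ring.
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;
        }
        // Data plane: copy the payload out of shared memory.
        out = buf_[head];
        // Control plane: release the slot back to the producer.
        head_.store((head + 1) % N, std::memory_order_release);
        return true;
    }

private:
    T buf_[N]{};
    // Assumed 64-byte cache lines: keep the indices on separate lines so
    // producer and consumer updates do not falsely share a line.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Even in this small sketch, each transfer touches the shared `head_` and `tail_` indices on both cores in addition to copying the payload, so every item moved generates coherency traffic for the control plane as well as for the data plane.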