Field of the Invention
This invention relates generally to the field of computer processors. More particularly, the invention relates to an apparatus and method for triggered prefetching to improve I/O and producer-consumer workload efficiency.
Description of the Related Art
This disclosure addresses two emerging trends in communications and data centers. First, Software Defined Networking (SDN) and Network Function Virtualization (NFV) are leading to rapid development of software-based packet processing on general purpose servers, which often requires high speed communication among multiple cores as packets pass through different processing stages. Second, network line speeds continue to increase rapidly. Today, 10/40 Gbps switch chips and network interface cards (NICs) are quickly becoming a commodity, with 100 Gbps links on the brink of entering the market.
Both trends call for efficient, low latency data communication among cores and between NICs and cores. However, as of today, Core-to-Core (C2C) and NIC-to-Core (N2C) communication suffer significant overhead due to coherency protocols. Specifically, for C2C communication with producer-consumer type workloads, a first core produces new data sets and a second core consumes the data sets once they have been completed. In these workloads the producer and consumer threads must coordinate through the use of synchronization primitives such as flags/locks to ensure that the producer has completed its work on the data set before the consumer begins its work. When producing the data set, the producer thread will acquire ownership, by setting the flag, of the data set cache lines that must be modified. In doing this, it will invalidate other copies of the data that reside on other cores, including the core running the consumer thread. Likewise, when the consumer thread uses the data set, it must obtain a copy of the data set. Since the producer and consumer threads will often execute on different cores and store their data in different caches, the cache lines containing the data set as well as the lock will continually bounce back and forth between the producer and consumer caches. Each time these cache lines move from cache to cache, the performance of the thread running on the destination core will suffer, as it incurs a cache miss for every cache line it must request from the other core's cache.
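By way of illustration only, the producer-consumer handoff described above may be sketched in C++ as follows. This is a minimal sketch under stated assumptions, not the disclosed apparatus: the names `Shared`, `produce`, `consume` and `run_handoff`, the 64-byte cache line size, and the 16-word data set are all hypothetical, and the flag is assumed to be set by the producer after the data set has been written. The sketch shows only where the cache lines holding the data set and the flag are written by one thread and then read by another, which is the point at which those lines migrate between caches.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes
constexpr std::size_t kWords = 16;      // 16 x 8 bytes = two assumed cache lines

struct Shared {
    alignas(kCacheLine) std::uint64_t data[kWords];      // the data set
    alignas(kCacheLine) std::atomic<bool> ready{false};  // the flag, on its own line
};

// Producer: writing data[] requires ownership of its cache lines, which
// invalidates any copies held in other cores' caches. The release store
// then publishes the completed data set through the flag.
void produce(Shared& s) {
    for (std::size_t i = 0; i < kWords; ++i) s.data[i] = i + 1;
    s.ready.store(true, std::memory_order_release);
}

// Consumer: polls the flag, then reads the data set. If the producer ran
// on another core, each data line read here has to be obtained from that
// core's cache (or a shared level), so each first touch is a cache miss.
std::uint64_t consume(Shared& s) {
    while (!s.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < kWords; ++i) sum += s.data[i];
    return sum;
}

// Runs one handoff between two threads and returns the consumer's result.
std::uint64_t run_handoff() {
    Shared s;
    std::uint64_t result = 0;
    std::thread consumer([&] { result = consume(s); });
    std::thread producer([&] { produce(s); });
    producer.join();
    consumer.join();
    assert(s.ready.load());  // the flag was published before the consumer returned
    return result;
}
```

Calling `run_handoff()` performs a single handoff. In a steady-state workload the same buffer and flag would be reused repeatedly, so the lines would move between the two caches on every iteration.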
Similarly, when a NIC using a circular data buffer sends data to the last level cache (LLC) via Data Direct I/O (DDIO), it must invalidate the copy in the mid-level cache (MLC) and/or Level 1 (L1) cache that was just touched by the core. When the core then loads the data, it has to go to the LLC to fetch it. The LLC load latency is approximately 45 cycles, much longer than the approximately 14-cycle MLC latency.
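The latency figures above can be turned into a rough per-packet estimate. The following is a back-of-the-envelope sketch only: the function names, the 64-byte cache line size, and the assumption that loads are fully serialized (no overlap between outstanding misses) are illustrative assumptions, so the result should be read as an upper bound rather than a measured value.

```cpp
#include <cstdint>

// Approximate load latencies in cycles, as stated in the text above.
constexpr std::uint64_t kLlcLatency = 45;
constexpr std::uint64_t kMlcLatency = 14;

// Assumed cache line size in bytes (hypothetical, for illustration).
constexpr std::uint64_t kCacheLine = 64;

// Number of cache lines needed to hold `bytes` bytes, rounded up.
constexpr std::uint64_t lines_for_bytes(std::uint64_t bytes) {
    return (bytes + kCacheLine - 1) / kCacheLine;
}

// Extra cycles incurred when `lines` loads are served from the LLC instead
// of the MLC, assuming the loads do not overlap.
constexpr std::uint64_t extra_cycles(std::uint64_t lines) {
    return lines * (kLlcLatency - kMlcLatency);
}
```

Under these assumptions each load served from the LLC costs about 31 additional cycles, so a hypothetical 1500-byte packet spanning 24 cache lines would incur up to roughly 744 additional cycles.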
The coherence overhead described above in both cases (C2C and N2C) degrades performance, especially for high speed processing.