A Network Interface Controller (NIC), which may be, for example, network interface circuitry within a system on a chip (SoC), is typically used to couple one or more processors to a packet network through at least one interface, called a port. NIC circuitry has been an area of rapid development as advanced packet processing functionality and protocol offload have become common for so-called “smart NICs”.
Parallel computer systems provide economical, scalable, and high-availability approaches to computing solutions. From the point of view of managing computer systems, including parallel-processor systems, there is a need for a cache coherence system and control in order to obtain the desired system operation. Cache coherence typically saves programmers' time and leads to more robust applications and a quicker time to solution. Conventional hierarchical cache systems provide small, fast cache memories physically near fast information processing units, and larger, slower memories that are further away in time and space. It is too expensive to make a fast memory large enough to hold all of the data for a large computer program, and as memories are made larger, access times slow down and power consumption and heat dissipation also become problems.
Cache-coherent non-uniform memory access (ccNUMA) is one known method to scale coherent memory to multiple nodes, in this case scaling cache coherence to multiple SoCs. Modern computer systems typically include a hierarchy of memory systems. For example, a multi-processor SoC might have private L0 and L1 caches next to each processor, and a common shared L2 cache per processor cluster. The L0 cache is typically the smallest, perhaps 16 to 256 kilobytes (KB), and runs at the fastest speed, thereby consuming the most power. An L1 cache might be placed next to each processor unit, and an L2 cache, if implemented, might be placed next to each processor cluster. These L1 and L2 caches are the next smallest, perhaps 0.5 to 8 megabytes (MB), and run at the next fastest speed. An L3 SoC cache, common to all the caching agents within the SoC and with a size of 16 MB, would typically represent the last level of cache memory on the SoC.
A large main memory, typically implemented using one or more banks of DDR SDRAMs (double-data-rate synchronous dynamic random-access memories), is then typically provided per SoC. Beyond that, a solid-state drive (SSD) and/or hard disc drive (HDD) disc array provides mass storage at a slower speed than main memory, and a tape farm can even be provided to hold truly enormous amounts of data, accessible within seconds, minutes, or hours. At each level moving further from the processor, there is typically a larger store running at a slower speed. For each level of storage, the level closer to the processor typically contains a proper subset of the data that is in the level further away (the inclusion property). For example, in order to purge data from the main memory while leaving that data in the disc storage, one must first purge all of the portions of that data that may reside in the L0, L1, L2, and/or L3 levels of cache. Conventionally, this may not lead to any performance problems, since the processor is finished with the data by the time that the main memory is purged.
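The purge ordering implied by the inclusion property can be illustrated with a minimal sketch. The class name `InclusiveHierarchy`, its methods, and the level names are hypothetical and chosen only for illustration; the model tracks addresses as set members and ignores capacity, timing, and data values. For simplicity it checks the subset relation rather than strict (proper) subset.

```python
class InclusiveHierarchy:
    """Toy model of an inclusive storage hierarchy (all names are hypothetical)."""

    def __init__(self, level_names):
        # Levels are ordered from closest to the processor to furthest away.
        self.level_names = list(level_names)
        self.contents = {name: set() for name in self.level_names}

    def load(self, address, up_to_level):
        """Place address in up_to_level and in every level further away."""
        start = self.level_names.index(up_to_level)
        for name in self.level_names[start:]:
            self.contents[name].add(address)

    def purge(self, address, from_level):
        """Purge address from from_level after purging every closer level.

        Returns the names of the levels purged, in the order they were purged.
        """
        stop = self.level_names.index(from_level)
        purged = []
        for name in self.level_names[: stop + 1]:
            if address in self.contents[name]:
                self.contents[name].discard(address)
                purged.append(name)
        return purged

    def inclusion_holds(self):
        """True if each level's contents are a subset of the next level out."""
        pairs = zip(self.level_names, self.level_names[1:])
        return all(self.contents[a] <= self.contents[b] for a, b in pairs)


hierarchy = InclusiveHierarchy(["L0", "L1", "L2", "L3", "main_memory", "disc"])
hierarchy.load(0x1000, up_to_level="L0")

# Purging from main memory forces the cached copies out first;
# the copy in disc storage is left in place.
purge_order = hierarchy.purge(0x1000, from_level="main_memory")
```

In this sketch, purging from `main_memory` walks the levels from the processor outward, so inclusion is never violated at any intermediate step.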
However, as more processors and more caches are added to a system, there is a need to scale out to systems consisting of multiple SoCs, and there can be more competition for scarce cache resources. It can also be beneficial to scale out coherence to handheld devices, as this can, for example, simplify the coordination of data on server machines and a subset of that data on the handheld devices. There is a need to maintain coherence of data (i.e., ensuring that, as data is modified, all cached copies are timely and properly updated, so that all copies stored in the various caches remain consistent). Thus, there is a need for improved methods and apparatus to improve system performance while also maintaining system integrity and cache coherence.
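As a rough illustration of what maintaining coherence means, the following sketch models several caches sharing one memory. It assumes a simple write-invalidate, write-through policy, which is only one of many possible schemes and is not taken from the text above; the class name `CoherentMemory` and its methods are hypothetical.

```python
class CoherentMemory:
    """Toy write-invalidate, write-through model (an illustrative assumption)."""

    def __init__(self, num_caches):
        self.memory = {}  # backing store: address -> value
        self.caches = [dict() for _ in range(num_caches)]

    def read(self, cache_id, address):
        """Return the value at address, filling the cache on a miss."""
        cache = self.caches[cache_id]
        if address not in cache:
            # Unwritten addresses read as 0 in this toy model.
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, cache_id, address, value):
        """Write value, invalidating every other cached copy of address."""
        for other_id, cache in enumerate(self.caches):
            if other_id != cache_id:
                cache.pop(address, None)  # drop the stale copy, if any
        self.caches[cache_id][address] = value
        self.memory[address] = value  # write-through keeps memory current


system = CoherentMemory(num_caches=2)
system.read(0, 0x40)      # both caches now hold a copy of address 0x40
system.read(1, 0x40)
system.write(0, 0x40, 7)  # cache 1's copy is invalidated
value_seen_by_1 = system.read(1, 0x40)  # misses, then refills from memory
```

Without the invalidation step in `write`, cache 1 would keep returning its stale copy, which is the inconsistency that a coherence mechanism is meant to prevent.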