The present application generally relates to a parallel computing environment. More particularly, the present application relates to invoking a cache coherence operation at a cache memory device in the parallel computing environment.
A multi-threaded computing environment, e.g., a cache coherent multiprocessor, may require cache coherence operations in memory devices and memory management structures to keep copies of data consistent and coherent (e.g., having the same data value across the copies in a main memory and in cache memories for a memory location at a particular logical address). A thread in this context may refer to a computer processor or other processing device (e.g., a network adaptor) that can read or write a (cache or main) memory location. When a thread or other action changes the value of a memory location in one memory device, that device invokes cache coherence operations on other memory devices in order to maintain cache coherence across the memory devices. Those cache coherence operations include, but are not limited to: invalidating a cache line, updating a cache line, etc. Morrow et al., “Methods and apparatus for low intrusion snoop invalidation,” U.S. Patent Application Publication No. 20100211744, filed Feb. 19, 2009, wholly incorporated by reference, describes cache coherence operations in detail. Known examples of cache coherence protocols include, but are not limited to: MSI (Modified, Shared and Invalid) protocol, MESI (Modified, Exclusive, Shared and Invalid) protocol, MOESI (Modified, Owned, Exclusive, Shared and Invalid) protocol, MERSI (Modified, Exclusive, Read Only or Recent, Shared and Invalid) protocol, MESIF (Modified, Exclusive, Shared, Invalid, Forward) protocol, Write-once protocol, Firefly protocol, and Dragon protocol.
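The invalidation form of a cache coherence operation can be illustrated with a minimal sketch. It is a toy model and does not implement any of the protocols named above: the class names `Cache` and `CoherentMemory`, the address `0x40`, and the choice to broadcast an invalidation to every other cache (rather than only to caches that hold a copy) are assumptions made for illustration.

```python
class Cache:
    """Toy private cache mapping address -> cached value (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def invalidate(self, address):
        # Drop the local copy, if any, so the next read misses and refetches.
        self.lines.pop(address, None)


class CoherentMemory:
    """Toy main memory that invalidates the other caches on every write."""

    def __init__(self, caches):
        self.data = {}
        self.caches = caches
        self.invalidations_sent = 0

    def read(self, cache, address):
        # On a miss, fill the cache line from main memory.
        if address not in cache.lines:
            cache.lines[address] = self.data.get(address, 0)
        return cache.lines[address]

    def write(self, writer, address, value):
        # Update main memory and the writer's own copy.
        self.data[address] = value
        writer.lines[address] = value
        # Coherence operation: invalidate the line in every other cache.
        for cache in self.caches:
            if cache is not writer:
                cache.invalidate(address)
                self.invalidations_sent += 1


caches = [Cache(f"L1-{i}") for i in range(3)]
memory = CoherentMemory(caches)
for c in caches:
    memory.read(c, 0x40)          # every cache now holds a copy of the line
memory.write(caches[0], 0x40, 7)  # one write invalidates the two other copies
values = [memory.read(c, 0x40) for c in caches]
```

After the write, the other caches miss on their next read and refetch the new value, so all copies agree again. An update-based operation would instead push the new value into the other caches.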
A traditional store-operate instruction reads from, modifies, and writes to a memory location as an atomic operation. The atomic property allows the store-operate instruction to be used as a synchronization primitive across multiple threads. For example, the store-and instruction atomically reads data in a memory location, performs a bitwise logical-and operation of data (i.e., the data specified with the store-and instruction) and the read data, and writes the result of the logical-and operation into the memory location. The term store-operate instruction also includes the fetch-and-operate instruction (i.e., an instruction that returns a data value from a memory location and then modifies the data value in the memory location). An example of a traditional fetch-and-operate instruction is the fetch-and-increment instruction (i.e., an instruction that returns a data value from a memory location and then increments the value at that location).
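The semantics of the store-and and fetch-and-increment instructions can be sketched in software as follows. This is only an illustrative model: the class `MemoryUnit`, its method names, and the use of a lock to stand in for the atomicity that hardware provides inside the memory unit are assumptions, not part of any real instruction set.

```python
import threading


class MemoryUnit:
    """Toy memory unit; a lock models the atomicity of each operation."""

    def __init__(self):
        self._words = {}
        self._lock = threading.Lock()

    def load(self, address):
        with self._lock:
            return self._words.get(address, 0)

    def store(self, address, value):
        with self._lock:
            self._words[address] = value

    def store_and(self, address, operand):
        # Atomically read, bitwise-and with the operand, and write back.
        with self._lock:
            self._words[address] = self._words.get(address, 0) & operand

    def fetch_and_increment(self, address):
        # Atomically return the old value and increment the stored value.
        with self._lock:
            old = self._words.get(address, 0)
            self._words[address] = old + 1
            return old


def run_demo(num_threads=8, increments_per_thread=1000, address=0x100):
    """Have several threads draw tickets from one shared counter."""
    memory = MemoryUnit()
    tickets = []
    tickets_lock = threading.Lock()

    def worker():
        local = [memory.fetch_and_increment(address)
                 for _ in range(increments_per_thread)]
        with tickets_lock:
            tickets.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return memory, tickets
```

Because each fetch-and-increment is atomic, every thread receives a distinct ticket value and no increment is lost, which is what makes the instruction usable as a synchronization primitive (for example, to hand out unique work-item indices).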
In a multi-threaded environment, the use of store-operate instructions may improve application performance (e.g., better throughput, etc.). Because atomic operations are performed within a memory unit, the memory unit can satisfy a very high rate of store-operate instructions, even if the instructions are to a single memory location. For example, the memory system of the IBM® Blue Gene®/Q computer can perform a store-operate instruction every 4 processor cycles. Since a store-operate instruction modifies the data value at a memory location, it traditionally invokes a memory coherence operation on other memory devices. For example, on the IBM® Blue Gene®/Q computer, a store-operate instruction can invoke a memory coherence operation on up to 15 level-1 (L1) caches (i.e., local caches). A high rate (e.g., every 4 processor cycles) of traditional store-operate instructions thus causes an equally high rate (e.g., every 4 processor cycles) of memory coherence operations, which can significantly occupy computer resources and thus reduce application performance.
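The scale of this coherence traffic can be estimated with a short calculation. The sketch below uses only the two figures stated above (one store-operate instruction every 4 processor cycles, up to 15 L1 caches affected per instruction). It assumes, as a worst case, that every instruction reaches all 15 caches; the function name is hypothetical.

```python
def cache_coherence_ops_per_cycle(cycles_per_store_operate, caches_affected):
    """Upper bound on per-cache coherence operations issued per processor cycle.

    Assumes each store-operate instruction triggers one coherence
    operation at each affected cache (a worst-case assumption).
    """
    return caches_affected / cycles_per_store_operate


# One store-operate every 4 cycles, each reaching up to 15 L1 caches.
worst_case = cache_coherence_ops_per_cycle(4, 15)
```

Under that assumption, the L1 caches collectively handle up to 3.75 coherence operations per processor cycle, even though each individual cache sees at most one every 4 cycles.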