The number of central processing unit (CPU) cores on a chip and the number of CPU cores connected to a shared memory continues to grow significantly to support growing workload capacity demand. The increasing number of CPUs cooperating to process the same workloads puts a significant burden on software scalability; for example, shared queues or data-structures protected by traditional semaphores become hot spots and lead to sub-linear n-way scaling curves. Traditionally this has been countered by implementing finer-grained locking in software, and with lower latency/higher bandwidth interconnects in hardware. Implementing fine-grained locking to improve software scalability can be very complicated and error-prone, and at today's CPU frequencies, the latencies of hardware interconnects are limited by the physical dimension of the chips and systems, and by the speed of light.
Implementations of hardware Transactional Memory (HTM, or in this discussion, simply TM) have been introduced, wherein a group of instructions—called a transaction—operate in an atomic manner on a data structure in memory, as viewed by other central processing units (CPUs) and the I/O subsystem (atomic operation s also known as “block concurrent” or “serialized” in other literature). The transaction executes optimistically without obtaining a lock, but may need to abort and retry the transaction execution if an operation, of the executing transaction, on a memory location conflicts with anther operation on the same memory location. Previously, software transactional memory implementations have been proposed to support software Transactional Memory (TM). However, hardware TM can provide improved performance aspects and ease of use over software TM.
U.S. Patent Application Publication No 2004/0044850 titled “Method and apparatus for the synchronization of distributed caches” filed 2002 Aug. 28 and incorporated by reference herein teaches A method and apparatus for the synchronization of distributed caches. More particularly, the present embodiment to cache memory systems and more particularly to a hierarchical caching protocol suitable for use with distributed caches, including use within a caching input/output (I/O) hub.
U.S. Pat. No. 5,586,297 titled “Partial cache line write transactions in a computing system with a write back cache” filed 1994 Mar. 24 and incorporated by reference herein teaches A computing system is presented which includes a memory, an input/output adapter and a processor. The processor includes a write back cache in which dirty data may be stored. When performing a coherent write from the input/output adapter to the memory, a block of data is written from the input/output adapter to a memory location within the memory. The block of data contains less data than a full cache line in the write back cache. The write back cache is searched to determine whether the write back cache contains data for the memory location. When the search determines that the write back cache contains data for the memory location a full cache line which contains the data for the memory location is purged.