The number of central processing unit (CPU) cores on a chip, and the number of CPU cores connected to a shared memory, continue to grow significantly to support growing workload capacity demand. The increasing number of CPUs cooperating to process the same workloads puts a significant burden on software scalability; for example, shared queues or data structures protected by traditional semaphores become hot spots and lead to sub-linear n-way scaling curves. Traditionally, this has been countered by implementing finer-grained locking in software, and with lower-latency, higher-bandwidth interconnects in hardware. However, implementing fine-grained locking to improve software scalability can be very complicated and error-prone. Moreover, at today's CPU frequencies, the latencies of hardware interconnects are limited by the physical dimensions of the chips and systems, and by the speed of light.
Implementations of hardware Transactional Memory (HTM, or in this discussion, simply TM) have been introduced, wherein a group of instructions, called a transaction, operates in an atomic manner on a data structure in memory, as viewed by other central processing units (CPUs) and the I/O subsystem (atomic operation is also known as “block concurrent” or “serialized” in other literature). The transaction executes optimistically without obtaining a lock, but may need to abort and retry if an operation of the executing transaction on a memory location conflicts with another operation on the same memory location. Previously, software implementations of transactional memory (software TM) have been proposed. However, hardware TM can provide improved performance and ease of use over software TM.
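By way of illustration only, the optimistic execute, abort, and retry behavior described above may be sketched in software as follows. This is a single-threaded simulation under stated assumptions, not a description of any hardware TM implementation: the names `Memory`, `Transaction`, `Conflict`, and `run_transaction` are hypothetical, conflicts are detected with per-location version counters, and validation happens at commit time, whereas hardware implementations may detect conflicts as they occur.

```python
class Conflict(Exception):
    """Raised when a location read by a transaction changed before commit."""


class Memory:
    """Hypothetical shared memory with one version counter per location."""

    def __init__(self):
        self.values = {}
        self.versions = {}

    def read(self, addr):
        # Return the current value and version of a location (default 0, 0).
        return self.values.get(addr, 0), self.versions.get(addr, 0)

    def commit(self, read_versions, writes):
        # Validate: every location read must still be at the version observed.
        for addr, seen in read_versions.items():
            if self.versions.get(addr, 0) != seen:
                raise Conflict(addr)
        # No conflict: make all buffered stores visible and bump versions.
        for addr, value in writes.items():
            self.values[addr] = value
            self.versions[addr] = self.versions.get(addr, 0) + 1


class Transaction:
    """Buffers stores and records the version of each location it loads."""

    def __init__(self, memory):
        self.memory = memory
        self.read_versions = {}
        self.writes = {}

    def load(self, addr):
        if addr in self.writes:          # read our own buffered store
            return self.writes[addr]
        value, version = self.memory.read(addr)
        self.read_versions.setdefault(addr, version)
        return value

    def store(self, addr, value):
        self.writes[addr] = value        # not visible until commit


def run_transaction(memory, body, max_retries=10):
    """Run body optimistically; on conflict, discard its effects and retry.

    Returns the number of attempts that were needed to commit.
    """
    for attempt in range(1, max_retries + 1):
        tx = Transaction(memory)
        try:
            body(tx)
            memory.commit(tx.read_versions, tx.writes)
            return attempt
        except Conflict:
            continue                     # abort: buffered stores are dropped
    raise RuntimeError("transaction did not commit")
```

In this sketch no lock is taken while the transaction body runs; an aborted attempt leaves the shared memory unchanged because its stores were only buffered.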
U.S. Pat. No. 7,269,694 titled “Selectively Monitoring Loads to Support Transactional Program Execution,” filed Aug. 8, 2003, by Tremblay et al. (“Tremblay 2003”), and incorporated by reference herein in its entirety, teaches a system that selectively monitors load instructions to support transactional execution of a process, wherein changes made during the transactional execution are not committed to the architectural state of a processor until the transactional execution successfully completes. Upon encountering a load instruction during transactional execution of a block of instructions, the system determines whether the load instruction is a monitored load instruction or an unmonitored load instruction. If the load instruction is a monitored load instruction, the system performs the load operation, and load-marks a cache line associated with the load instruction to facilitate subsequent detection of an interfering data access to the cache line from another process. If the load instruction is an unmonitored load instruction, the system performs the load operation without load-marking the cache line.
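The distinction between monitored and unmonitored loads may be pictured with the following minimal sketch. It is not taken from Tremblay 2003; the class `TransactionalCore`, the method names, the 64-byte line size, and the choice to model an interfering access as a store from another process that sets an abort flag are all assumptions made for illustration.

```python
LINE_SIZE = 64  # assumed cache line size in bytes (illustrative only)


class TransactionalCore:
    """Hypothetical model of one processor executing a transaction."""

    def __init__(self, memory):
        self.memory = memory        # dict mapping address -> value
        self.load_marked = set()    # indexes of load-marked cache lines
        self.aborted = False

    def load(self, addr, monitored):
        # A monitored load marks its cache line; an unmonitored load does not.
        if monitored:
            self.load_marked.add(addr // LINE_SIZE)
        # Either way, the load operation itself is performed.
        return self.memory.get(addr, 0)

    def observe_remote_store(self, addr):
        # A store by another process interferes only if it touches a
        # load-marked line; stores to unmarked lines are ignored.
        if addr // LINE_SIZE in self.load_marked:
            self.aborted = True


core = TransactionalCore({0: 1, 128: 2})
core.load(0, monitored=True)      # line 0 becomes load-marked
core.load(128, monitored=False)   # line 2 is left unmarked
core.observe_remote_store(130)    # same line as address 128: no effect
core.observe_remote_store(8)      # same line as address 0: interference
```

Under these assumptions, only the monitored load exposes the transaction to interference, which is the selectivity the paragraph above describes.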
U.S. Pat. No. 8,209,499 titled “Method of Read-Set and Write-Set Management by Distinguishing Between Shared and Non-Shared Memory Regions,” filed Jan. 15, 2010, by Chou (“Chou 2010”), and incorporated by reference herein in its entirety, teaches a method of read-set and write-set management that distinguishes between shared and non-shared memory regions. The method identifies a shared memory region, used by a transactional memory application, that may be shared by one or more concurrent transactions, and a non-shared memory region, used by the transactional memory application, that is not shared by the one or more concurrent transactions. A subset of a read-set and a write-set that access the shared memory region is checked for conflicts with the one or more concurrent transactions at a first granularity. A subset of the read-set and the write-set that access the non-shared memory region is checked for conflicts with the one or more concurrent transactions at a second granularity. The first granularity is finer than the second granularity.
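A conflict check at two granularities may be sketched as follows. This is an illustrative reading of the summary above, not the method of Chou 2010: the function names, the 8-byte and 4096-byte granularities, the `is_shared` predicate, and the representation of read-sets and write-sets as lists of addresses are all assumptions.

```python
FINE = 8       # assumed first (finer) granularity, in bytes, for the shared region
COARSE = 4096  # assumed second (coarser) granularity, in bytes, for the non-shared region


def block_ids(addresses, is_shared):
    """Map each address to a (granularity, block index) pair for its region."""
    ids = set()
    for addr in addresses:
        granularity = FINE if is_shared(addr) else COARSE
        ids.add((granularity, addr // granularity))
    return ids


def conflicts(tx_a, tx_b, is_shared):
    """Return True if two transactions conflict at the per-region granularity.

    Each transaction is a dict with "reads" and "writes" address lists.
    A conflict needs at least one write: write/write or read/write overlap.
    """
    a_reads = block_ids(tx_a["reads"], is_shared)
    a_writes = block_ids(tx_a["writes"], is_shared)
    b_reads = block_ids(tx_b["reads"], is_shared)
    b_writes = block_ids(tx_b["writes"], is_shared)
    return bool(a_writes & (b_reads | b_writes)) or bool(a_reads & b_writes)


# Hypothetical layout: addresses below 4096 form the shared region.
def in_shared_region(addr):
    return addr < 4096


# Shared region, different 8-byte blocks: no conflict is reported.
conflicts({"reads": [], "writes": [0]}, {"reads": [8], "writes": []}, in_shared_region)
# Non-shared region, different addresses in one 4096-byte block: conflict.
conflicts({"reads": [], "writes": [4096]}, {"reads": [5000], "writes": []}, in_shared_region)
```

In this sketch the coarse check keeps fewer entries per region but may report conflicts between accesses to distinct addresses, a cost that matters little if the non-shared region is rarely touched by concurrent transactions.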