Multiple threads of execution in a computer system allow a program to fork or split into independent, concurrently running tasks. Multithreading as a programming and execution model allows multiple threads to exist within the context of a single process, sharing the resources of the process while executing independently and concurrently. Threads in the same program or process share memory and some other resources. Threads within different processes may be prevented from sharing memory or other resources.
A challenge in writing multithreaded programs is ensuring consistent access to data. If two threads concurrently access the same variables, one thread may see the intermediate results of another thread's operation. One approach employs locks coupled with careful programming to ensure that only one thread accesses shared data at a time. Improper use of locks can lead to deadlock or poor performance.
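The lock-based approach described above can be illustrated with a minimal sketch. The names Counter and run_workers are hypothetical and chosen only for illustration. The sketch assumes a single shared counter whose read-modify-write update is guarded by one lock, so that no thread observes another thread's intermediate state.

```python
import threading


class Counter:
    """Shared counter whose read-modify-write is guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute the guarded update.
        with self._lock:
            self.value += 1


def run_workers(num_threads=4, increments_per_thread=10_000):
    """Start several threads that all update the same counter."""
    counter = Counter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, the final value equals the number of threads multiplied by the number of increments per thread. A program that used several such locks and acquired them in inconsistent orders could deadlock, which is the hazard noted above.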
Transactional memory (“TM”) promises to simplify multithreaded programming. A transaction may execute a series of reads and writes to shared memory. Transactions provide mutual exclusion of threads from a resource without the program deadlocking, and without reliance on assignment of locks to data structures.
A TM approach may effectively use the threads offered by chips with multiple cores and/or multi-threaded cores. A TM system lets a programmer invoke a transaction and rely on the system to make its execution appear atomic (e.g., all or nothing) and isolated (e.g., no intermediate states are visible). A successful transaction commits, while an unsuccessful one that conflicts with a concurrent transaction aborts or stalls. Some TM systems operate completely in software as software transactional memory (“STM”) systems. Another implementation employs hardware support and comprises a hardware transactional memory (“HTM”) system.
Hardware can accelerate transactional memory by providing two capabilities. First, hardware provides isolation with conflict detection. The hardware detects conflicts among transactions by recording the read-set (addresses read) and write-set (addresses written) of each transaction. A conflict occurs when an address appears in the write-sets of two concurrent transactions, or in the write-set of one and the read-set of another. Second, hardware provides atomicity with version management. Hardware stores both the new and old values of memory written by a transaction, so that the side effects of an aborted transaction can be reversed.
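The conflict rule stated above may be sketched in software as follows. This is an illustrative model only, not a description of any particular hardware. The names AccessSets, record_read, record_write, and conflicts are assumptions introduced for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class AccessSets:
    """Read-set and write-set recorded for one transaction (hypothetical model)."""
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

    def record_read(self, addr):
        self.read_set.add(addr)

    def record_write(self, addr):
        self.write_set.add(addr)


def conflicts(a, b):
    """Return True if two concurrent transactions conflict.

    A conflict exists when an address is in both write-sets, or in the
    write-set of one transaction and the read-set of the other.
    """
    return (
        bool(a.write_set & b.write_set)
        or bool(a.write_set & b.read_set)
        or bool(b.write_set & a.read_set)
    )
```

Under this rule, two transactions that only read the same address do not conflict, while a write by either transaction to an address read or written by the other does.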
Some implementations of HTMs make demands on L1 cache structures, for example, read/write (R/W) bits for read-set and write-set tracking, flash clear operations at commits/aborts, and write buffers for speculative data. Some implementations of HTMs depend on broadcast coherence protocols that preclude implementation on directory-based systems.
An HTM referred to as LogTM decouples version management from L1 cache tags and arrays. With LogTM, a transactional thread saves the old value of a block in a per-thread log and writes the new value in place (eager version management). LogTM's version management uses cacheable virtual memory that is not tied to a processor or cache. LogTM does not force writebacks in order to cache speculative data, because LogTM does not exploit cache incoherence, for example, where the L1 cache holds new transactional values and the L2 holds the old versions. Instead, caches are free to replace or write back blocks at any time. No data moves on commit, because new versions are already in place. On abort, a software handler walks the log to restore old versions. LogTM does not decouple conflict detection, because LogTM maintains R/W bits in the L1 cache.
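Eager version management of the kind described above may be sketched as follows. This is a simplified software model under stated assumptions, not the LogTM implementation. The class name LoggedTransaction is hypothetical, memory is modeled as a dictionary from block address to value, each block is logged only on its first write, and the abort handler walks the log from newest entry to oldest.

```python
class LoggedTransaction:
    """Software model of eager version management with a per-thread undo log."""

    def __init__(self, memory):
        self.memory = memory      # dict: block address -> current value
        self.undo_log = []        # (address, old value) pairs, oldest first
        self._logged = set()      # addresses already saved in the log

    def write(self, addr, new_value):
        # Save the old value once, then write the new value in place.
        if addr not in self._logged:
            self.undo_log.append((addr, self.memory[addr]))
            self._logged.add(addr)
        self.memory[addr] = new_value

    def commit(self):
        # New values are already in place, so commit only discards the log.
        self.undo_log.clear()
        self._logged.clear()

    def abort(self):
        # Walk the log from newest to oldest, restoring old values.
        while self.undo_log:
            addr, old_value = self.undo_log.pop()
            self.memory[addr] = old_value
        self._logged.clear()
```

The sketch reflects the trade-off noted above: commit moves no data, while abort performs work proportional to the number of logged blocks.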
An HTM referred to as Bulk decouples conflict detection by recording read-sets and write-sets in a hashed signature separate from L1 cache tags and arrays. A simple 1K-bit signature might logically OR the decoded ten least-significant bits of block addresses. On transaction commit, Bulk broadcasts the write signature, and all other active transactions compare the write signature against their own read and write signatures. A non-null intersection indicates a conflict, triggering an abort. Due to aliasing, a non-null signature intersection may occur even when no actual conflict exists (a false positive), but no conflicts are missed (no false negatives). Bulk supports multi-threading and/or nested transactions by replicating signatures, which avoids use of L1 structures.
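The simple 1K-bit signature mentioned above may be sketched as follows. This is an illustrative model only. The names Signature, insert, intersects, and commit_conflicts are assumptions of the sketch, and the signature is modeled as a Python integer used as a 1024-bit vector.

```python
SIGNATURE_BITS = 1024                 # 1K-bit signature
INDEX_MASK = SIGNATURE_BITS - 1       # selects the ten least-significant bits


class Signature:
    """Hashed summary of a set of block addresses (hypothetical model)."""

    def __init__(self):
        self.bits = 0

    def insert(self, block_addr):
        # Decode the ten low-order bits to a one-hot value and OR it in.
        self.bits |= 1 << (block_addr & INDEX_MASK)

    def intersects(self, other):
        # A non-null intersection signals a possible conflict.
        return (self.bits & other.bits) != 0


def commit_conflicts(committer_write, other_read, other_write):
    """Check a committing write signature against another transaction."""
    return committer_write.intersects(other_read) or committer_write.intersects(
        other_write
    )
```

In this model, block addresses 5 and 1029 set the same bit because they share their ten low-order bits, so signatures containing them intersect even though the addresses differ. This is the false positive case described above. Two transactions that touch the same block always set the same bit, so an actual conflict is never missed.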
Bulk does not decouple version management from the L1 cache. The cache controller performs writeback of committed but modified blocks before making speculative updates. The cache controller saves speculatively modified blocks in a special buffer on cache overflow. The cache controller allows only a single thread of a multi-threaded processor to have speculative blocks in any single L1 cache set. Bulk depends on broadcast coherence for atomicity. Bulk employs global synchronization for ordering commit operations.
Application programmers reason about threads and virtual memory, while hardware implements multi-threaded cores, caches, and physical memory. Operating systems (OSes) provide programmers with a higher-level abstraction by virtualizing physical resource constraints, such as memory size and processor speed, using mechanisms such as paging and context switching. To present application programmers an abstraction of transactional memory, the OS (1) ensures that transactions execute correctly when it virtualizes the processor or memory, and (2) virtualizes the HTM's physical resource limits. In cache victimization, caches may need to evict transactional blocks when a transaction's data size exceeds cache capacity or associativity. Multi-threaded cores make this more likely and unpredictable, due to interference between threads sharing the same L1 cache.
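The associativity limit described above may be illustrated with a short sketch. The function name overflowing_sets and the parameters num_sets and ways are hypothetical, and the sketch assumes a cache in which a block address modulo the number of sets selects the set. A set overflows when more transactional blocks map to it than it has ways, which forces at least one of those blocks to be evicted.

```python
from collections import Counter


def overflowing_sets(block_addrs, num_sets, ways):
    """Return the cache sets that cannot hold all the given blocks.

    block_addrs: block addresses touched by in-flight transactions.
    num_sets:    number of sets in the cache (assumed modulo indexing).
    ways:        associativity, i.e. blocks each set can hold.
    """
    per_set = Counter(addr % num_sets for addr in set(block_addrs))
    return sorted(s for s, count in per_set.items() if count > ways)
```

For example, with 64 sets and 4 ways, a thread touching blocks 0, 64, and 128 places three blocks in set 0 and fits. A second thread on the same core touching blocks 192, 256, and 320 also fits when considered alone, but the two threads together place six blocks in set 0 and overflow it.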
Operating systems use thread suspension and migration to increase processing efficiency and responsiveness by suspending threads and rescheduling them on any thread context in the system. To support thread context switch and migration, the OS removes all of a thread's state from its thread context, stores it in memory, and loads it back, possibly on a different thread context on the same or a different core. For HTMs that rely on the cache for either version management or conflict detection, moving thread state is difficult because the transactional state of a thread may not be visible to the operating system. In addition, with a non-broadcast coherence protocol, coherence messages may not reach the thread at its new processor.