The present disclosure generally relates to data processing systems, and more specifically, to a multi-level history buffer design for managing speculative transactions (e.g., transactional memory) in a processing unit.
In speculative parallelization systems, also known as thread-level speculation (TLS) or multi-scalar systems, a compiler, runtime system, or programmer may divide the execution of a program among multiple threads, i.e. separately managed sequences of instructions that may execute in parallel with other sequences of instructions (or “threads”), with the expectation that those threads will usually be independent, meaning that no thread will write data that other threads are reading or writing concurrently. Due to the difficulty in statically determining the memory locations that will be accessed by threads at compilation time, this expectation is not always met. The parallel threads may actually make conflicting data accesses. Such parallelization systems use speculative execution to attempt to execute such threads in parallel. It is the responsibility of the system to detect when two speculative threads make conflicting data accesses, and recover from such a mis-speculation.
Each parallel thread corresponds to a segment of the original sequential code, and the parallel threads are therefore ordered with respect to one another according to their sequence in the sequential version of code. It is the responsibility of the system to ensure that the results of a speculative thread are not committed until all prior speculative threads in this sequence are known to be free of conflicts with the committing thread. Once it has been determined that the thread does not conflict with any threads in the prior sequence, and prior threads have committed, that thread may commit.
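By way of illustration only, the in-order commit rule described above can be sketched as follows. The sketch assumes a deliberately simplified model that is not taken from the disclosure: all speculative threads start from the same memory state, each thread is summarized by hypothetical `reads` and `writes` sets, and a thread is treated as mis-speculated when it read a location that an earlier thread in the sequence wrote. The names `SpecThread` and `commit_in_order` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpecThread:
    """Hypothetical summary of one speculative thread."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def commit_in_order(threads):
    """Walk the threads in sequential program order and decide each outcome.

    Assumption: a thread must be squashed (and re-executed) if it read a
    location written by any earlier thread, since it may have seen stale data.
    """
    earlier_writes = set()
    outcomes = []
    for t in threads:  # list order is the original sequential order
        if t.reads & earlier_writes:
            outcomes.append((t.name, "squash"))
        else:
            outcomes.append((t.name, "commit"))
        # After committing (or re-executing), this thread's writes are
        # visible to every later thread in the sequence.
        earlier_writes |= t.writes
    return outcomes


threads = [
    SpecThread("T0", reads={"b"}, writes={"a"}),
    SpecThread("T1", reads={"b"}, writes={"c"}),
    SpecThread("T2", reads={"a"}, writes={"d"}),
]
outcomes = commit_in_order(threads)
```

In this example, T2 read location `a`, which the earlier thread T0 wrote, so T2 is the only thread that is squashed.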
Systems that support transactional memory typically include a subset of the requirements of a system that supports speculative parallelization. Transactional memory attempts to simplify concurrent or parallel programming by allowing a group of load and store instructions to execute in an atomic manner, i.e., it is guaranteed that either (1) all instructions of the transaction complete successfully or (2) no effects of the instructions of the transaction occur, i.e., the transaction is aborted and any changes made by the execution of the instructions in the transaction are rolled back. In this way, with atomic transactions, the instructions of the transaction appear to occur all at once in a single instant between invocation and results being generated.
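By way of illustration only, the all-or-nothing behavior described above can be sketched in software as follows. The sketch assumes that memory is modeled as a dictionary and that rollback is implemented by restoring a snapshot; the names `AbortTransaction` and `run_atomically` are hypothetical, and a hardware implementation would not necessarily copy state in this way.

```python
class AbortTransaction(Exception):
    """Hypothetical signal raised by a transaction body to request an abort."""


def run_atomically(memory, transaction):
    """Run transaction(memory); keep all of its effects or none of them.

    Returns True if the transaction committed, False if it was aborted.
    """
    snapshot = dict(memory)  # assumption: rollback by restoring a copy
    try:
        transaction(memory)
    except AbortTransaction:
        memory.clear()
        memory.update(snapshot)  # undo every change made so far
        return False
    return True


def good_transfer(mem):
    mem["x"] -= 1
    mem["y"] += 1


def bad_transfer(mem):
    mem["x"] -= 1             # partial update ...
    raise AbortTransaction()  # ... that must not become visible


memory = {"x": 1, "y": 2}
aborted_ok = run_atomically(memory, bad_transfer)
after_abort = dict(memory)
committed_ok = run_atomically(memory, good_transfer)
```

After the aborted transaction the memory is unchanged, and after the successful transaction both updates are visible together.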
Hardware transactional memory systems may have modifications to the processors, caches, and bus protocols to support transactions or transaction blocks, i.e. groups of instructions that are to be executed atomically as one unit. Software transactional memory provides transactional memory semantics in a software runtime library with minimal hardware support. Transactional memory systems seek high performance by speculatively executing transactions concurrently and only committing transactions that are non-conflicting. A conflict occurs when two or more concurrent transactions access the same piece of data, e.g. a word, block, object, etc., and at least one access is a write. Transactional memory systems may resolve some conflicts by stalling or aborting one or more transactions. Transactional blocks are typically demarcated in a program with special transaction begin and end annotations. Transactional blocks may be uniquely identified by a static identifier, e.g., the address of the first instruction in the transactional block. Dynamically, multiple threads can concurrently enter a transactional block, although that transactional block will still share the same static identifier.
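By way of illustration only, the conflict condition stated above (two concurrent transactions access the same piece of data and at least one access is a write) can be expressed as a check over read and write sets. Representing each transaction as a pair of sets, and the function name `conflicts`, are assumptions made for this sketch.

```python
def conflicts(tx_a, tx_b):
    """Return True if two concurrent transactions conflict.

    Each transaction is a hypothetical (reads, writes) pair of sets of
    data items (e.g., words, blocks, or objects).
    """
    a_reads, a_writes = tx_a
    b_reads, b_writes = tx_b
    # A conflict needs a shared item where at least one access is a write.
    return bool(a_writes & (b_reads | b_writes)) or bool(b_writes & a_reads)


read_read = conflicts(({"p"}, set()), ({"p"}, set()))
read_write = conflicts(({"p"}, set()), (set(), {"p"}))
write_write = conflicts((set(), {"p"}), (set(), {"p"}))
disjoint = conflicts(({"p"}, {"q"}), ({"r"}, {"s"}))
```

Two transactions that only read the same item do not conflict, while read-write and write-write overlaps do.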
High-performance processors used in data processing systems today may be capable of “superscalar” operation and may have “pipelined” elements. Such processors may include multiple execution/processing slices that are able to operate in parallel to process multiple instructions in a single processing cycle. Each execution slice may include a register file and a history buffer that hold the youngest and older copies, respectively, of architected register data. Each instruction that is fetched may be tagged by a multi-bit instruction tag. Once the instructions are fetched and tagged, the instructions may be executed (e.g., by an execution unit) to generate results, which are also tagged. A Results (or Writeback) Bus, one per execution slice, feeds all slices with the resultant instruction finish data. Thus, any individual history buffer generally includes one write port per Results/Writeback bus.
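By way of illustration only, the relationship between a register file holding the youngest copy of each register and a history buffer holding older copies can be sketched as follows. The sketch is a software model under stated assumptions, not the disclosed circuit: instruction tags are assumed to increase in dispatch order, a flush is assumed to restore older copies youngest-first, and the names `SliceModel`, `write`, and `flush` are hypothetical.

```python
class SliceModel:
    """Hypothetical model of one execution slice."""

    def __init__(self, num_regs):
        # Register file: youngest (itag, value) copy of each register.
        self.regfile = [(None, 0)] * num_regs
        # History buffer: older copies, each recorded with the tag of the
        # instruction that displaced it (the "evictor").
        self.history = []

    def write(self, reg, itag, value):
        """A new instruction targets `reg`; the old copy moves to history."""
        old_itag, old_value = self.regfile[reg]
        self.history.append((reg, old_itag, old_value, itag))
        self.regfile[reg] = (itag, value)

    def flush(self, flush_itag):
        """Undo all writes by instructions tagged flush_itag or younger."""
        while self.history and self.history[-1][3] >= flush_itag:
            reg, old_itag, old_value, _ = self.history.pop()
            self.regfile[reg] = (old_itag, old_value)


s = SliceModel(num_regs=2)
s.write(reg=0, itag=1, value=10)
s.write(reg=0, itag=2, value=20)
s.write(reg=1, itag=3, value=30)
s.flush(flush_itag=2)  # squash instructions 2 and 3
```

After the flush, register 0 again holds the value written by instruction 1, register 1 is back to its initial state, and only the entry displaced by instruction 1 remains in the history buffer.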
However, implementing numerous write ports on a history buffer can be expensive in terms of circuitry. For example, as the number of ports associated with the history buffer increases, the circuit area of the history buffer in the processing unit can grow rapidly. This, in turn, forces a trade-off in the number of history buffer entries that can be supported in a given circuit area. Smaller history buffers generally fill up faster and can degrade performance by stalling the dispatch of new instructions until older instructions retire and free up history buffer entries. Larger history buffers, on the other hand, are generally expensive to implement and lead to a larger circuit size. Further, the size of the history buffer can also be affected by transactional memory operations in the processing unit.
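By way of illustration only, the effect of history buffer capacity on dispatch can be sketched with a simple counting model. The model assumes that every dispatched instruction occupies exactly one history buffer entry and that every retirement frees exactly one entry; these assumptions, the event names, and the function name `count_dispatch_stalls` are hypothetical and are not drawn from any particular processor.

```python
def count_dispatch_stalls(capacity, events):
    """Count dispatch attempts that stall because the history buffer is full.

    `events` is a sequence of the strings "dispatch" and "retire".
    """
    occupied = 0
    stalls = 0
    for event in events:
        if event == "dispatch":
            if occupied == capacity:
                stalls += 1    # no free entry: dispatch must wait
            else:
                occupied += 1  # new instruction takes one entry
        elif event == "retire" and occupied > 0:
            occupied -= 1      # an older instruction frees its entry
    return stalls


events = ["dispatch", "dispatch", "dispatch", "retire", "dispatch"]
small_buffer_stalls = count_dispatch_stalls(capacity=2, events=events)
large_buffer_stalls = count_dispatch_stalls(capacity=4, events=events)
```

With the same event sequence, the two-entry buffer stalls one dispatch attempt while the four-entry buffer stalls none, which mirrors the trade-off between buffer size and dispatch stalls described above.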