To boost processor performance, contemporary general-purpose computer architectures exploit instruction-level parallelism (ILP), the inherent parallelism of a program's algorithm, by scheduling instructions for execution out of order (OoO), i.e., in an order different from the sequential order of instructions in the original program code.
Because load instructions may have unpredictable latencies due to cache misses, the ability to reorder them efficiently with respect to store instructions is highly important: it can yield significant performance benefits by increasing the overlap of execution between independent instructions. A load instruction can be reordered to execute before another load that precedes it in program order without violating any data dependencies. However, data dependencies may arise in three scenarios: a load is reordered to execute before a preceding store, a store is reordered to execute before a preceding load, or a store is reordered to execute before a preceding store. In these cases, the true (read-after-write), anti (write-after-read), or output (write-after-write) data dependence, respectively, could be violated if the two instructions access the same memory location (i.e., have overlapping memory address ranges). Such incorrect reordering of memory instructions accessing the same memory location, performed in pursuit of ILP, may lead to incorrect execution of a program. Thus, for any processor using storage elements (memory or registers) to pass data from one instruction to another, correct ordering of memory accesses is crucial to ensuring correct execution semantics.
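To make the true-dependence case concrete, the following minimal Python sketch (hypothetical, not tied to any particular ISA) models a store/load pair to the same address: executing the load ahead of the program-order-earlier store returns a stale value.

```python
# Hypothetical illustration: hoisting a load above an earlier store to
# the same address violates a true (read-after-write) dependence.

def run(instructions, memory):
    """Execute ('load', addr) / ('store', addr, value) ops in list order."""
    loaded = []
    for ins in instructions:
        if ins[0] == "store":
            _, addr, value = ins
            memory[addr] = value
        else:  # load
            _, addr = ins
            loaded.append(memory[addr])
    return loaded

# Program order: a store followed by a load of the same location (RAW pair).
program = [("store", 0x10, 42), ("load", 0x10)]

in_order = run(list(program), {0x10: 0})
reordered = run(list(reversed(program)), {0x10: 0})  # load hoisted above store

print(in_order)   # [42]  -- correct value forwarded through memory
print(reordered)  # [0]   -- stale value: true dependence violated
```

The anti- and output-dependence cases fail analogously when a store is hoisted above an earlier load or store to an overlapping address.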
The data dependencies described above require that memory access instructions be correctly ordered at the execution stage and/or the retirement stage of a processor pipeline. If instructions are reordered incorrectly for execution, they may be executed using incorrect data (e.g., source operands). Similarly, as instructions are retired from the pipeline, their contents must be committed (written into the storage elements) in the correct order so as to maintain memory consistency.
Previous approaches for ordering memory accesses can be found in general-purpose superscalar architectures, Very Long Instruction Word (VLIW) architectures, some implicitly multithreaded architectures such as multiscalar, and various research architectures that use Speculative Versioning Cache (SVC) or a variant of SVC. These approaches have significant limitations, which prevent them from being used for ordering memory accesses in multi-strand OoO processors.
In superscalar and VLIW processors, instructions are fetched in-order, and the information for correct retirement (or commit) of memory instructions is naturally provided through intentional ordering of the instructions in a single, sequential stream by the compiler.
In superscalar processors, the memory instructions are ordered according to their position in the stream by giving each instruction a dynamic sequence number. The ordering of memory instructions is usually performed in a buffer, which keeps each instruction along with the address of the associated memory access. The entries of the buffer are indexed by the sequence number. The buffer can also be split into two buffers: one for load instructions, called the load buffer (LDB) or load queue, and the other for store instructions, called the store buffer (STB) or store queue. If a load instruction is to be issued, the buffer is checked to ensure that no earlier store (i.e., one with a lower sequence number) to the same address or an unresolved address is pending. If a store instruction is to be issued, the buffer is checked to ensure that no earlier load or store to the same address or an unresolved address is pending.
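The issue check described above can be sketched as follows. This is a deliberately simplified model (a single unified buffer, invented names such as `MemBuffer` and `may_issue`), not a description of any actual microarchitecture.

```python
# Simplified sketch of the sequence-numbered buffer check: a load may
# issue only if no earlier store to the same (or an unresolved) address
# is pending; a store additionally conflicts with earlier loads.

UNRESOLVED = None  # the address of this access has not yet been computed

class MemBuffer:
    def __init__(self):
        self.entries = []  # (seq, kind, addr); kind is "load" or "store"

    def insert(self, seq, kind, addr):
        self.entries.append((seq, kind, addr))

    def may_issue(self, seq, kind, addr):
        # Stores conflict with any earlier memory op; loads only with stores.
        blocking = ("load", "store") if kind == "store" else ("store",)
        for e_seq, e_kind, e_addr in self.entries:
            if e_seq < seq and e_kind in blocking:
                if e_addr is UNRESOLVED or e_addr == addr:
                    return False
        return True

buf = MemBuffer()
buf.insert(1, "store", 0x40)
buf.insert(2, "load", 0x40)
print(buf.may_issue(2, "load", 0x40))   # False: earlier store, same address
print(buf.may_issue(2, "load", 0x80))   # True: addresses do not overlap
```

A real implementation would compare address ranges rather than exact addresses and would split the buffer into separate load and store queues, as noted above.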
Because superscalar and VLIW processors rely on fetching instructions strictly in order to extract the relative order of load and store instructions from the total order of instructions in the program, it is difficult to extract the same information in a multi-strand OoO processor, which fetches instructions OoO.
Multiscalar processors issue loads speculatively, with the expectation that a predecessor task will not store a value into the same memory location at a later time. A check must be made dynamically to ensure that no predecessor task writes a value, at a future time, into a memory location currently being read by a successor task. If this check identifies dependent load and store instructions that do not occur in the proper program order, the later task must be squashed and appropriate recovery action must be initiated. The squashing of a task results in the squashing of all tasks in execution following that task.
In the multiscalar processor, the update of the data cache by processing elements is not performed speculatively. To hold speculative instructions (those belonging to tasks other than the head task), check for violations of data dependencies, and initiate recovery actions, an Address Resolution Buffer (ARB) is used. The ARB holds the values of speculatively executed instructions, but updates the data cache only when the status of these instructions changes from speculative to non-speculative, i.e., in the order of task assignment. The ARB tracks the units that executed the instructions using load and store bits, and data dependence violations are detected by checking these bits. Because the ARB in multiscalar processors only updates the data cache in the order of task assignment, the size of the instruction scheduling window would be limited in a multi-strand OoO context, since it would not be possible to initiate speculative execution of a task (e.g., a strand in a thread) without first initiating execution of the previous one. This results in under-utilization of ILP.
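The ARB's load/store-bit check can be sketched as follows. This is a rough, assumed model (class names `ARB` and `ARBEntry` are invented for illustration): each entry records, per task stage, whether the location was speculatively loaded or stored, and a violation arises when a logically earlier task stores to a location that a logically later task has already loaded.

```python
# Rough sketch of ARB-style dependence checking: per-address load/store
# bits, one slot per task stage; a store by an earlier stage that hits
# a load bit of a later stage signals a violation (squash successors).

class ARBEntry:
    def __init__(self, num_stages):
        self.load_bits = [False] * num_stages
        self.store_bits = [False] * num_stages
        self.values = [None] * num_stages  # speculative values, per stage

class ARB:
    def __init__(self, num_stages):
        self.num_stages = num_stages
        self.entries = {}  # addr -> ARBEntry

    def _entry(self, addr):
        return self.entries.setdefault(addr, ARBEntry(self.num_stages))

    def load(self, stage, addr):
        self._entry(addr).load_bits[stage] = True

    def store(self, stage, addr, value):
        e = self._entry(addr)
        e.store_bits[stage] = True
        e.values[stage] = value  # held here; data cache updated at commit
        # Violation: a later stage already speculatively loaded this address.
        return [s for s in range(stage + 1, self.num_stages) if e.load_bits[s]]

arb = ARB(num_stages=4)
arb.load(2, 0x100)                 # later task 2 speculatively loads
victims = arb.store(0, 0x100, 7)   # earlier task 0 stores to the same address
print(victims)                     # [2]: task 2 and its successors must squash
```

Commit-time draining of the held values into the data cache, in task-assignment order, is omitted here; it is exactly that in-order commit requirement which limits the scheduling window, as discussed above.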
Some experimental architectures use hierarchical execution models in which an SVC (or a variant of SVC) is used instead of an ARB. These models use the SVC to order memory accesses between different processors, as the SVC functionality is based on task assignment information. Tasks are committed in assignment order and when a data misspeculation is detected, the successor tasks are squashed. In this manner, SVC guarantees program order among loads and stores from different processors. The order among memory instructions executed by an individual processor is ensured by a conventional combination of a load queue and a store queue.
A multi-strand OoO processor is a machine that processes multiple strands (each with its own instruction pointer) in parallel so that (1) instructions of a strand are fetched, issued, and executed out of program order with respect to instructions of other strands; and (2) instructions within each individual strand are fetched, issued, and executed in program order with respect to each other. A strand is a sequence of instructions, predominantly data-dependent on each other, that is arranged by a binary translator (BT) at program compilation time. Instructions belonging to the same strand are executed by a multi-strand OoO processor in order. Because the SVC (or variants thereof) commits tasks in assignment order, implementing an SVC in a multi-strand OoO processor (where strand assignment order is not known and multiple strands execute in parallel) would incur substantial ILP under-utilization as a result of continuously assigning the strands in order (as is the case with multiscalar processors). Additionally, significant overhead is incurred by snoop requests between the SVCs of individual strands (which is how the SVC mechanism checks for data dependence violations), along with the strand-squashing overhead associated with misspeculation (an essential part of any SVC-based synchronization mechanism).
Accordingly, a need exists for a method that allows for correct reconstruction of real program order of memory accesses in a multi-strand OoO processor, while facilitating better utilization of ILP.