1. Technical Field
The present invention relates in general to data processing systems and in particular to instruction processing in multiprocessor data processing systems. Still more particularly, the present invention relates to a method and processor architecture for improving processing efficiency by enabling full, un-throttled execution of instructions beyond barrier operations.
2. Description of the Related Art
The need for faster and more efficient processing of computer instructions has typically been at the forefront of development in processors and data processing systems. Improved processing speeds led to development of processors with weakly consistent processor architectures that permit some amount of speculation (such as branch speculation) and out-of-order execution of instructions. With out-of-order execution and speculation, the processor has to be provided with some way of ensuring that correct dependencies in processes and/or data are maintained. The processor typically assigns a series of instructions (e.g., load, store, and compare instructions) to a group when no dependencies exist between instructions within that group. Instructions within a group can be executed in parallel or out-of-order (i.e., later instructions executed before earlier instructions). However, due to possible data dependencies between groups, instructions in each group are executed in program order with respect to instructions in a next group to ensure correct processing results.
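The grouping of dependency-free instructions described above can be illustrated with a minimal Python sketch. It assumes a simplified model that is not taken from any particular processor: each instruction names at most one destination register and a tuple of source registers, only register dependencies are tracked (memory dependencies are ignored), and groups are formed greedily in program order. The names Instr, depends_on, and form_groups are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this illustration."""
    name: str                    # label for display
    dest: Optional[str] = None   # register written, if any
    srcs: Tuple[str, ...] = ()   # registers read


def depends_on(later: Instr, earlier: Instr) -> bool:
    """Return True if `later` must stay ordered after `earlier`.

    Checks read-after-write, write-after-read, and write-after-write
    hazards on registers only (an assumption of this sketch).
    """
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = later.dest is not None and later.dest == earlier.dest
    return raw or war or waw


def form_groups(program: List[Instr]) -> List[List[Instr]]:
    """Greedily split a program into groups with no internal dependencies.

    Instructions inside one group could be executed in any order;
    the groups themselves stay in program order.
    """
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    for instr in program:
        if any(depends_on(instr, prev) for prev in current):
            # A dependency on the open group closes it and starts a new one.
            groups.append(current)
            current = [instr]
        else:
            current.append(instr)
    if current:
        groups.append(current)
    return groups
```

For a sequence of two independent loads followed by an add that consumes both loaded values, this sketch places the two loads in one group and starts a new group at the add.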
State-of-the-art superscalar processors provide a branch prediction mechanism by which branch instructions are permitted to be speculatively executed and later resolved. A superscalar processor may comprise, for example, an instruction cache for storing instructions, one or more execution units for executing sequential instructions, branch prediction and branch resolution logic for processing branch instructions, instruction sequencing logic for routing instructions to the various execution units, and registers for storing operands and result data.
When initially executed, conditional branch instructions are classified as unresolved. In order to minimize execution stalls, some processors speculatively execute unresolved branch instructions by predicting whether or not the indicated branch will be taken. Utilizing the result of the prediction, the instruction sequencing logic is then able to speculatively fetch instructions within a target execution path prior to the resolution of the branch. Presently, the more accurate branch prediction methodologies, such as branch history tables, yield correct predictions more than 92% of the time, which in terms of overall processor efficiency is widely considered to provide a significant improvement.
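The prediction step can be sketched in Python using a table of two-bit saturating counters, which is one common way a branch history table is organized. The table size, the indexing by low-order address bits, and the initial counter value are assumptions of this sketch, and the names BranchHistoryTable and accuracy are hypothetical; the sketch is not a description of any specific processor's predictor.

```python
class BranchHistoryTable:
    """Toy branch history table of two-bit saturating counters.

    Counter values 0-1 predict not-taken; values 2-3 predict taken.
    """

    def __init__(self, entries=1024):
        self.entries = entries
        # Assumption: every counter starts at 1 (weakly not-taken).
        self.counters = [1] * entries

    def _index(self, address):
        # Assumption: 4-byte instructions, so the two low bits are dropped.
        return (address >> 2) % self.entries

    def predict(self, address):
        """Return True if the branch at `address` is predicted taken."""
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        """Move the counter toward the resolved outcome, saturating at 0 and 3."""
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(bht, trace):
    """Fraction of correct predictions over a non-empty trace.

    `trace` is a list of (branch_address, taken) pairs in execution order.
    Each branch is predicted first and the table is updated after it resolves.
    """
    correct = 0
    for address, taken in trace:
        if bht.predict(address) == taken:
            correct += 1
        bht.update(address, taken)
    return correct / len(trace)
```

In this sketch a loop-closing branch is mispredicted roughly once per loop exit, because a single not-taken outcome moves the counter only one step away from the taken state.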
Typically, however, when a processor begins executing instructions within a speculatively predicted path (i.e., target or in-line path), processing of instructions within that path can only be completed up to the first barrier operation in the instruction sequence, and the processor waits until an acknowledgment is received for the barrier operation before continuing to process the instruction sequence down the branch path.
In multiprocessor systems, the correct completion of operations within code or instructions executing on a first processor may be dependent on operations on a second interconnected processor. For example, with load and store instructions executed by a load/store unit (LSU) of a first processor, a previous instruction that stores a value to a particular location must be executed before a later instruction that loads the value of that location.
Barrier instructions are placed within the instruction sequence to separate groups of instructions and ensure that all instructions within a first group are fully executed (i.e., the corresponding operations and results are visible to all other processors) before any instruction within a subsequent group is executed. The instruction set architecture (ISA) supported by most commercially available processors includes a barrier instruction, which initiates a barrier operation on the system. In the PowerPC™ family of processors, for example, one barrier instruction that is employed to establish a processing boundary is the “sync” instruction, and the corresponding transaction on the system bus is called a synchronization operation (sync op). Other barrier instructions exist within the instruction set, but the term sync op will be utilized generally within the present document to refer to global barrier operations.
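The ordering constraint that a barrier imposes can be sketched in Python as a check on a proposed execution order. This is a toy model under stated assumptions: a program is a list of instruction labels in program order, the label "sync" marks a barrier, and only ordering is modeled (visibility to other processors and the acknowledgment of the barrier operation are not). The names BARRIER, epochs, and respects_barriers are hypothetical.

```python
BARRIER = "sync"  # label used in this sketch to mark a barrier instruction


def epochs(program):
    """Map each non-barrier instruction index to the number of barriers before it."""
    epoch = 0
    result = {}
    for index, op in enumerate(program):
        if op == BARRIER:
            epoch += 1
        else:
            result[index] = epoch
    return result


def respects_barriers(program, execution_order):
    """Return True if `execution_order` never crosses a barrier.

    `execution_order` lists indices of non-barrier instructions in the order
    they were executed. Reordering within a group is allowed; an instruction
    after a barrier must not execute before one that precedes the barrier.
    """
    epoch_of = epochs(program)
    seen = [epoch_of[i] for i in execution_order]
    # Legal orders have non-decreasing epoch numbers.
    return all(a <= b for a, b in zip(seen, seen[1:]))
```

With the program ["store A", "store B", "sync", "load C", "load D"], the order [1, 0, 4, 3] is accepted because each group is only reordered internally, whereas [0, 3, 1, 4] is rejected because "load C" would execute before "store B".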
Barrier instructions are particularly necessary when the multiprocessor system includes superscalar processors supporting out-of-order instruction execution and weak memory consistency. However, there are implied barrier instructions utilized within in-order processor systems.
In slower processors, which operate at, for example, 100 MHz, each barrier operation, such as a sync op, may require approximately 10 processor cycles to complete. In commercial server workloads, the sync ops typically degrade processing efficiency by approximately 5 percent. With faster processors, however, such as those operating in the GHz range, a sync may require approximately 200 processor cycles to complete and degrade processing efficiency by approximately 10 percent. Thus, syncs place a significant burden on processor efficiency, particularly because, in typical commercial software, a sync occurs every 500-1000 instructions. Each occurrence of a sync causes processors in a data processing system to be throttled for a lengthy time while the issuing processor waits on the sync operation to complete.
The inherent performance limitations of throttling the processor after each occurrence of a barrier instruction become even more acute with newer, high-speed processor architectures, which have deep execution pipelines and large instruction fetch latencies and which process instructions with a high level of accuracy. Thus, throttling a processor from continuing along an execution path because of a barrier operation significantly limits processor efficiency.
The present invention recognizes that it would therefore be desirable to provide a method and processor architecture for enabling full processor speculation by executing all instructions beyond barrier operations, so as to reduce processor throttling while waiting on a sync acknowledgment (sync ack) and thereby increase processor speed and efficiency.