The speculative execution of instructions in microprocessors is beneficial in improving system performance. A state-of-the-art microprocessor typically includes an instruction cache for storing instructions, one or more execution units for executing sequential instructions, a branch unit for executing branch instructions, instruction sequencing logic for routing instructions to the various execution units, and registers for storing operands and result data.
An application program for execution on a microprocessor includes a structured series of macro instructions that are stored in sequential locations in memory. A current instruction pointer within the microprocessor points to the address of the instruction currently being executed, and a next instruction pointer within the microprocessor points to the address of the next instruction for execution. During each clock cycle, the length of the current instruction is added to the contents of the current instruction pointer to form a pointer to a next sequential instruction in memory. The pointer to the next sequential instruction is provided to logic that updates the next instruction pointer. If the logic determines that the next sequential instruction is indeed required for execution, then the next instruction pointer is updated with the pointer to the next sequential instruction in memory. Thus, macro instructions are fetched from memory in sequence for execution by the microprocessor.
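The pointer update described above can be sketched as follows. This is a minimal illustration only; the function name, its parameters, and the way a branch decision is passed in are assumptions made for the example, not details taken from any particular microprocessor.

```python
def next_instruction_pointer(current_ip, instr_length,
                             branch_taken=False, branch_target=None):
    """Return the address of the next instruction to execute.

    Hypothetical model: the sequential pointer is always formed, and the
    update logic decides whether it is actually used.
    """
    # Length of the current instruction added to the current pointer.
    sequential = current_ip + instr_length
    if branch_taken:
        if branch_target is None:
            raise ValueError("a taken branch needs a target address")
        # The next sequential instruction is not required for execution.
        return branch_target
    return sequential
```

With a 3-byte instruction at address 0x1000 and no branch, the sketch yields 0x1003; with a taken branch it yields the supplied target instead.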
Since a microprocessor is designed to execute instructions from memory in the sequence they are stored, it follows that a program configured to execute macro instructions sequentially from memory is one which will run efficiently on the microprocessor. For this reason, most application programs are designed to minimize the number of instances where macro instructions are executed out of sequence. These out-of-sequence instances are known as jumps or branches.
A program branch presents a problem because most conventional microprocessors do not simply execute one instruction at a time. Modern microprocessors typically implement a number of pipeline stages, each stage performing a specific function. Instructions, inputs, and results are passed from one stage to the next in synchronization with a pipeline clock. Hence, several instructions may be executing in different stages of the microprocessor pipeline within the same clock cycle. As a result, when logic within a given stage determines that a program branch is to occur, the contents of the earlier stages of the pipeline, that is, the stages executing the instructions that follow the branch in sequence, must be cast out so that execution can resume with the sequential macro instructions beginning at the instruction to which the branch is directed, known as the branch target instruction. This casting out of earlier pipeline stages is known as flushing and refilling the pipeline.
Branch instructions executed by the branch unit of the processor can be classified as either conditional or unconditional branch instructions. Unconditional branch instructions are branch instructions that change the flow of program execution from a sequential execution path to a specified target execution path and which do not depend upon a condition supplied by the occurrence of an event. Thus, the branch in program flow specified by an unconditional branch instruction is always taken. In contrast, conditional branch instructions are branch instructions for which the indicated branch in program flow may or may not be taken, depending upon a condition within the processor, for example, the state of a specified condition register bit or the value of a counter.
A conditional branch is a branch that may or may not occur, depending upon an evaluation of some specified condition. This evaluation is typically performed in later stages of the microprocessor pipeline. To preclude wasting many clock cycles associated with flushing and refilling the pipeline, present day microprocessors also provide logic in an early pipeline stage that predicts whether a conditional branch will occur or not. If it is predicted that a conditional branch will occur, then only those instructions prior to the early pipeline stage must be flushed, including those in the instruction buffer. Even so, this is a drastic improvement, as correctly predicted branches are executed in roughly two clock cycles. However, an incorrect prediction takes many more cycles to execute than if no branch prediction mechanism had been provided in the first place. The accuracy of branch predictions in a pipeline processor therefore significantly impacts processor performance.
Yet, present day branch prediction techniques chiefly predict the outcome of a given conditional branch instruction in an application program based upon outcomes obtained when the conditional branch instruction was previously executed within the same instance of the application program. Historical branch prediction, or dynamic branch prediction, is somewhat effective because conditional branch instructions tend to exhibit repetitive outcome patterns when executed within an application program. The historical outcome data is stored in a branch history table that is accessed using the address of a conditional branch instruction (a unique identifier for the instruction). A corresponding entry in the branch history table contains the historical outcome data associated with the conditional branch instruction. A dynamic prediction of the outcome of the conditional branch instruction is made based upon the contents of the corresponding entry in the branch history table.
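A branch history table of this kind might be sketched as below. The source does not say what form the historical outcome data takes; the 2-bit saturating counters, the table size, and the class and method names here are assumptions chosen because they are a common textbook arrangement.

```python
class BranchHistoryTable:
    """Hypothetical table of 2-bit saturating counters, one per entry."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Counter values run 0..3; start at 1 ("weakly not taken").
        self.counters = [1] * (1 << index_bits)

    def _index(self, branch_address):
        # The branch address selects the entry (low-order bits only).
        return branch_address & self.mask

    def predict(self, branch_address):
        # Values 2 and 3 predict "taken"; 0 and 1 predict "not taken".
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        # Record the actual outcome once the branch has been resolved.
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

In this sketch a branch that has been taken twice in a row is predicted taken, and a single not-taken outcome afterwards does not immediately flip the prediction, which is the repetitive-pattern behavior the paragraph above relies on.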
However, since most microprocessors have address ranges on the order of gigabytes, it is not practical for a branch history table to be as large as the microprocessor's address range. Because of this, smaller branch history tables are provided, on the order of kilobytes, and only low order bits of a conditional branch address are used as an index into the table. This presents another problem. Because low order address bits are used to index the branch history table, two or more conditional branch instructions can index the same entry. This is known as an alias or synonym address. As such, the outcome of a more recently executed conditional branch instruction will replace the outcome of a formerly executed conditional branch instruction that is aliased to the same table entry. If the former conditional branch instruction is encountered again, its historical outcome information is unavailable to be used for a dynamic prediction.
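The aliasing effect can be illustrated with a small sketch. The 12-bit index width, the two branch addresses, and the helper names are hypothetical, and the table simply stores the last outcome seen for each index rather than any richer history.

```python
INDEX_BITS = 12  # hypothetical: a 4096-entry table


def bht_index(branch_address, index_bits=INDEX_BITS):
    # Only the low-order address bits select a table entry.
    return branch_address & ((1 << index_bits) - 1)


table = {}  # index -> most recently recorded outcome


def record(branch_address, taken):
    table[bht_index(branch_address)] = taken


def lookup(branch_address):
    return table.get(bht_index(branch_address))


# Two different branches whose low 12 bits are identical (0x234).
branch_a = 0x0040_1234
branch_b = 0x0080_1234

record(branch_a, True)    # branch A was taken
record(branch_b, False)   # branch B overwrites the same entry
```

After the second call, looking up branch A in this sketch returns the outcome recorded for branch B, because both addresses map to the same entry and the history of branch A has been replaced.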
Because dynamic predictions are sometimes not available, an alternative prediction is made for the outcome of a conditional branch instruction, usually based solely upon some static attribute of the instruction, such as the relative direction of a branch target instruction as compared to the address of the conditional branch instruction. This alternative prediction is called a static prediction because it is not based upon a changing execution environment within an application program. The static prediction is most often used as a fallback, that is, it is used when a dynamic prediction is unavailable.
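The fallback arrangement can be sketched as follows. The direction rule used here (backward branches predicted taken, forward branches predicted not taken) is a common heuristic assumed for illustration; the source says only that the relative direction of the target may be used. The function names and the use of `None` to signal a missing dynamic prediction are likewise hypothetical.

```python
def static_predict(branch_address, target_address):
    # Assumed heuristic: a backward branch (typically a loop) is
    # predicted taken; a forward branch is predicted not taken.
    return target_address < branch_address


def predict(branch_address, target_address, dynamic_prediction=None):
    # Prefer the dynamic prediction whenever one is available.
    if dynamic_prediction is not None:
        return dynamic_prediction
    # Otherwise fall back to the static prediction.
    return static_predict(branch_address, target_address)
```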
As described above, prediction techniques can cover a wide range. On one end of the spectrum are simple static prediction techniques, such as assuming that overflow is usually not present or that the usual case does not raise an exception. To improve predictive accuracy, advanced dynamic predictors have been developed, including one bit predictors, bimodal predictors, gshare predictors, gskew predictors, and tournament predictors. Such advanced predictors are usually employed in conjunction with branch prediction.
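As one example of the advanced predictors named above, a gshare predictor can be sketched as below. The source does not describe its internals; this sketch follows the commonly published form, in which the branch address is XORed with a global history of recent outcomes to index a table of 2-bit counters. The table size and all names are assumptions.

```python
class GsharePredictor:
    """Hypothetical gshare sketch: address XOR global history -> counter."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0  # global history of recent outcomes, newest in bit 0
        self.counters = [1] * (1 << index_bits)  # 2-bit counters, values 0..3

    def _index(self, branch_address):
        return (branch_address ^ self.history) & self.mask

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the history takes part in the index, a branch that strictly alternates between taken and not taken ends up using two different counters, one per phase, so in this sketch the alternation becomes predictable after a short warm-up.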
Speculative execution is a performance optimization. It is useful only when speculative execution consumes less time than non-speculative execution would, and the net savings sufficiently compensate for the time that may be wasted computing a value that is never used, discarding that value, and recomputing it non-speculatively.
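This trade-off can be expressed as a simple expected-time comparison. The model below is an assumption introduced for illustration, not a formula from the source: it treats a misprediction as costing the speculative work, the discard, and a full non-speculative recomputation, and all parameter names are hypothetical.

```python
def speculation_pays_off(p_correct, t_spec, t_nonspec, t_discard):
    """Return True if the expected speculative time beats t_nonspec.

    p_correct: probability that the speculated value is actually used
    t_spec:    time to compute the value speculatively
    t_nonspec: time to compute the value non-speculatively
    t_discard: time to discard a wrongly speculated value
    """
    # Wrong guess: wasted speculative work + discard + recomputation.
    t_wrong = t_spec + t_discard + t_nonspec
    expected = p_correct * t_spec + (1 - p_correct) * t_wrong
    return expected < t_nonspec
```

For instance, with times of 2, 10, and 5 cycles for the speculative computation, the non-speculative computation, and the discard, this model favors speculation at 95% accuracy but not at 20% accuracy.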
While predictive techniques have been successfully applied to branch prediction, other instruction types, including tagged pointer loads, have thus far not benefited from the use of such advanced predictors. There is thus a need for efficiently and accurately predicting the execution behavior of different types of instructions and exploiting such predictions to improve instruction execution performance.
A tagged architecture is a hardware implementation in which each memory word is divided into a data section and a tag section. The data section is large enough to accommodate a memory address, and the tag section is an encoded representation of the data type. All load instructions executed by application code must perform a tag verification operation. In the prior art, this requirement diminished load instruction performance relative to a non-tagged architecture. Since load instructions may comprise up to 30% of issued instructions, if each load experiences increased latency, overall performance can be significantly diminished.
Tagged architectures can simplify hardware design and facilitate software development. With tagging, a data word could represent an indexed array descriptor, an indirect reference word, or a program control word. Any reference to a variable could automatically redirect processing, provide an index into an array, or initiate a subroutine and pick up a returned value that was left on the stack.
The virtual memory system in most modern operating systems reserves a block of logical memory around address 0x00000000 as unusable. This means that, for example, a pointer to 0x00000000 is never a valid pointer and can be used as a special null pointer value to indicate an invalid pointer.
Pointers to certain types of data will often be aligned with the size of the data (4 bytes, 8 bytes, etc.), which leaves the low order bits of the pointer unused. As long as the code that uses the pointer masks out these bits before accessing memory, the pointer can be tagged with extra information.
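A minimal sketch of this kind of tagging, with addresses modeled as plain integers, is given below. The 8-byte alignment, the resulting 3 tag bits, and the function names are assumptions chosen for the example.

```python
ALIGNMENT = 8              # assumed: data is aligned on 8-byte boundaries
TAG_MASK = ALIGNMENT - 1   # 0b111, the three low-order bits left unused


def tag_pointer(address, tag):
    """Store a small tag in the unused low-order bits of an aligned address."""
    if address & TAG_MASK:
        raise ValueError("address is not aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in the unused bits")
    return address | tag


def untag_pointer(tagged):
    """Split a tagged pointer back into (address, tag)."""
    # The tag bits must be masked out before the address is used.
    return tagged & ~TAG_MASK, tagged & TAG_MASK
```

Tagging the address 0x7FFF0010 with the value 0b101 gives 0x7FFF0015, and untagging recovers both the original address and the tag.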
Taking advantage of the alignment of pointers provides more flexibility because it allows pointers to be tagged with information about the type of data pointed to, conditions under which it may be accessed, or other similar information about the pointer's use. This information can be provided along with every valid pointer. In contrast, null pointers and sentinels provide only a finite number of tagged values distinct from valid pointers.
The major advantage of tagged pointers is that they take up less space than a pointer along with a separate tag field. This can be especially important when a pointer is a return value from a function or part of a large table of pointers.
A more subtle advantage is that by storing a tag in the same place as the pointer, it is often possible for an operating system to significantly improve performance because the tag allows the data type to be recognized or interpreted more quickly. Furthermore, tagging pointers increases system stability and security: data corruption is avoided by detecting when the processor attempts to access memory using data words that are not tagged as pointers, whether due to a program error or an unallowed data access attempt.
The Load Tagged Pointer (ltptr) instruction was defined for the IBM iSeries processor architecture (PowerPC AS, also known as AS/400) to improve performance when operating on tagged pointers in certain important OS/400 (iSeries operating system) environments. A tagged pointer handling apparatus is explained in detail in commonly assigned U.S. Pat. No. 4,241,396, herein incorporated by reference. In accordance with this apparatus, an ltptr instruction loads a pointer from a specified address if an associated tag indicates that the memory location holds a valid address and an associated specifier matches the expected pointer specifier. Otherwise, if the specified storage location either does not have a tag indicating a valid pointer, or the pointer specifier is not matched, a NULL address is loaded to the target register. The ltptr instruction advantageously replaces a sequence of prior tag testing instructions with a single instruction. The performance objective for ltptr was to have it ultimately execute with the same load-use latency as the Load Doubleword (ld) instruction, which has proven difficult to achieve.
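The behavior described above can be modeled at a high level as follows. This is a behavioral sketch only, not the architected definition of the instruction: the representation of memory as a dictionary, the `TaggedWord` fields, the integer specifier, and the use of 0 as the NULL address are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical model of a tagged storage location.
TaggedWord = namedtuple("TaggedWord", ["tag_valid", "specifier", "pointer"])

NULL = 0  # assumed encoding of the NULL address


def ltptr(memory, address, expected_specifier):
    """Model of a tagged pointer load: return the pointer or NULL."""
    word = memory[address]
    # Both checks must pass for the pointer to be loaded.
    if word.tag_valid and word.specifier == expected_specifier:
        return word.pointer
    # Missing tag or mismatched specifier: load NULL instead.
    return NULL


memory = {
    0x100: TaggedWord(tag_valid=True, specifier=3, pointer=0xABC0),
    0x110: TaggedWord(tag_valid=False, specifier=3, pointer=0xDEAD),
}
```

In this model, loading from 0x100 with the matching specifier returns the stored pointer, while a mismatched specifier or the untagged location at 0x110 returns NULL, so no separate tag testing sequence is needed before the load.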