1. Field of the Invention
This invention relates to the field of superscalar microprocessors and, more particularly, to data prediction mechanisms in superscalar microprocessors.
2. Description of the Relevant Art
Superscalar microprocessors achieve high performance by simultaneously executing multiple instructions in a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. For example, superscalar microprocessors are typically configured with multiple instruction processing pipelines which process instructions. The processing of instructions includes the actions of fetching, dispatching, decoding, executing, and writing back results. Each action may be implemented in one or more pipeline stages, and an instruction flows through each of the pipeline stages where an action or portion of an action is performed. At the end of a clock cycle, the instruction and the values resulting from performing the action of the current pipeline stage are moved to the next pipeline stage. When an instruction reaches the end of an instruction processing pipeline, it is processed and the results of executing the instruction have been recorded.
Since superscalar microprocessors execute multiple instructions per clock cycle in their multiple instruction processing pipelines and the clock cycle is short, a high bandwidth memory system is required to provide instructions and data to the superscalar microprocessor (i.e. a memory system that can provide a large number of bytes in a short period of time). Without a high bandwidth memory system, the microprocessor would spend a large number of clock cycles waiting for instructions or data to be provided, then would execute the received instructions and/or instructions dependent upon the received data in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles. Superscalar microprocessors are, however, ordinarily configured into computer systems with a relatively large main memory composed of dynamic random access memory (DRAM) cells. DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, DRAM cells provide a memory system that provides a relatively small number of bytes in a relatively long period of time, and do not form a high bandwidth memory system.
Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions and data, superscalar microprocessors are often configured with caches. Caches include multiple blocks of storage locations configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction or data bytes. The bytes can be transferred from the cache to the destination (a register or an instruction processing pipeline) quickly; commonly one or two clock cycles are required as opposed to a large number of clock cycles to transfer bytes from a DRAM main memory.
Unfortunately, the latency of one or two clock cycles within a data cache is becoming a performance problem for superscalar microprocessors as they attempt to execute more instructions per clock cycle. The problem is of particular importance to superscalar microprocessors configured to execute instructions from a complex instruction set architecture such as the x86 architecture. As will be appreciated by one skilled in the art, x86 instructions allow for one of their "operands" (the values that the instruction operates on) to be stored in a memory location. Instructions which have an operand stored in memory are said to have an implicit (memory read) operation and an explicit operation which is defined by the particular instruction being executed (e.g. an add instruction has an explicit operation of addition). Such an instruction therefore requires an implicit address calculation for the memory read to retrieve the operand, the implicit memory read, and the execution of the explicit operation of the instruction (for example, an addition). Typical superscalar microprocessors have required that these operations be performed by the execute stage of the instruction processing pipeline. The execute stage of the instruction processing pipeline is therefore occupied for several clock cycles when executing such an instruction. For example, a superscalar microprocessor might require one clock cycle for the address calculation, one to two clock cycles for the data cache access, and one clock cycle for the execution of the explicit operation of the instruction. Conversely, instructions with operands stored in registers configured within the microprocessor retrieve the operands before they enter the execute stage of the instruction processing pipeline, since no address calculation is needed to locate the operands. Instructions with operands stored in registers would therefore only require the one clock cycle for execution of the explicit operation.
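The cycle counts in the example above can be tallied in a short sketch. The function name and parameters below are hypothetical and serve only to illustrate the arithmetic; the per-step figures are the example values given in the text, not measurements of any particular microprocessor.

```python
def execute_stage_cycles(has_memory_operand, cache_latency=1):
    """Clock cycles the execute stage is occupied by one instruction.

    Uses the example figures from the text: one cycle for the address
    calculation, one to two cycles for the data cache access, and one
    cycle for the explicit operation.
    """
    address_calc = 1   # implicit address calculation
    explicit_op = 1    # e.g. the addition of an add instruction
    if has_memory_operand:
        return address_calc + cache_latency + explicit_op
    # Register operands are retrieved before the execute stage.
    return explicit_op


print(execute_stage_cycles(False))                   # 1
print(execute_stage_cycles(True, cache_latency=1))   # 3
print(execute_stage_cycles(True, cache_latency=2))   # 4
```

Under these example figures, an instruction with a memory operand occupies the execute stage for three to four clock cycles, compared with one clock cycle for an instruction whose operands are in registers.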
A structure allowing the retrieval of memory operands for instructions before they enter the execute stage is therefore desired.
The problems outlined above are in large part solved by a data address prediction structure for a superscalar microprocessor in accordance with the present invention. The data address prediction structure predicts a data address that a group of instructions is going to access while that group of instructions is being fetched from the instruction cache. The data bytes associated with the predicted address are placed in a relatively small, fast buffer. Decode stages of instruction processing pipelines in the microprocessor access the buffer with addresses generated from the instructions. If the associated data bytes are found in the buffer, the data bytes are conveyed to the reservation station associated with the requesting decode stage. Therefore, an implicit memory read operation associated with an instruction is performed prior to the instruction arriving in a functional unit (which forms the execute stage of the instruction processing pipeline). The functional unit is occupied by the instruction for fewer clock cycles since it need not perform the implicit memory operation. The functional unit performs the explicit operation indicated by the instruction. Performance may be advantageously increased by using the clock cycles that would have been used to execute the implicit operation in the functional unit to execute other instructions.
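The decode-stage access to the buffer described above may be modeled in software as follows. This is a minimal sketch only: the class and function names, the buffer capacity, and the oldest-first replacement policy are assumptions made for illustration and are not specified in the text.

```python
from collections import OrderedDict


class DataBuffer:
    """Small buffer holding data bytes for predicted addresses (illustrative)."""

    def __init__(self, capacity=4):
        self.capacity = capacity          # assumed size; the text says only "relatively small"
        self.entries = OrderedDict()      # address -> data bytes

    def fill(self, address, data):
        """Place data bytes fetched for a predicted address into the buffer."""
        if address in self.entries:
            self.entries.pop(address)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # assumed policy: evict the oldest entry
        self.entries[address] = data

    def lookup(self, address):
        """Return the data bytes for an address, or None if not present."""
        return self.entries.get(address)


def decode_access(buffer, address):
    """Model a decode stage accessing the buffer with an instruction's address.

    On a hit the data bytes would be conveyed to the reservation station;
    on a miss the functional unit must still perform the implicit read.
    """
    data = buffer.lookup(address)
    if data is not None:
        return ("hit", data)
    return ("miss", None)


buf = DataBuffer(capacity=2)
buf.fill(0x1000, b"\x01\x02\x03\x04")
print(decode_access(buf, 0x1000))   # found: operand available before execute
print(decode_access(buf, 0x2000))   # not found: implicit read remains for the functional unit
```

In this model a hit corresponds to the case in which the implicit memory read has completed before the instruction reaches the functional unit, so that only the explicit operation remains to be executed.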
Broadly speaking, the present invention contemplates a method for predicting a data address which will be referenced by a plurality of instructions residing in a basic block when the basic block is fetched. The method comprises several steps. First, a data prediction address is generated. Data associated with the data prediction address is then fetched from a data cache into a data buffer. The data buffer is later accessed for load data.
The present invention further contemplates a data address prediction structure comprising an array for storing data prediction addresses, a data buffer for storing data bytes associated with predicted data addresses, and a decode stage of an instruction processing pipeline configured to access the data buffer. The data prediction addresses are stored into a particular storage location within the array according to the address of a branch instruction which begins a basic block containing an instruction which generates the data prediction address.
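The indexing of the array by branch address, together with the method steps summarized above, may be sketched as follows. The names, the number of entries, the use of the branch address modulo the array size as the index, and the modeling of the data cache and data buffer as dictionaries are all assumptions made for illustration; the text states only that the storage location is selected according to the address of the branch instruction which begins the basic block.

```python
class PredictionArray:
    """Array of data prediction addresses indexed by branch address (illustrative)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, branch_address):
        # Assumed indexing function: branch address modulo the array size.
        return branch_address % self.num_entries

    def update(self, branch_address, data_address):
        """Record the data address generated by an instruction in the basic block."""
        self.entries[self._index(branch_address)] = data_address

    def predict(self, branch_address):
        """Return the data prediction address for the block, or None."""
        return self.entries[self._index(branch_address)]


def fetch_basic_block(array, branch_address, data_cache, data_buffer):
    """Generate a prediction and fetch the associated data into the buffer."""
    predicted = array.predict(branch_address)
    if predicted is not None and predicted in data_cache:
        data_buffer[predicted] = data_cache[predicted]
    return predicted


array = PredictionArray(num_entries=8)
array.update(0x4010, 0x9000)        # learned on an earlier execution of the block

data_cache = {0x9000: b"\xaa"}      # data cache modeled as a dict
data_buffer = {}                    # data buffer modeled as a dict

predicted = fetch_basic_block(array, 0x4010, data_cache, data_buffer)
print(hex(predicted))               # 0x9000
print(data_buffer.get(0x9000))      # data available for a later load
```

Because this sketch stores no tag alongside each entry, two branch addresses that map to the same index share a prediction. Whether and how an actual implementation distinguishes such cases is not addressed by this illustration.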