High-performance processors used in data processing systems today may be capable of "superscalar" operation and may have "pipelined" elements. A superscalar processor has multiple elements which operate in parallel to process multiple instructions in a single processing cycle. Pipelining involves processing instructions in stages, so that the pipelined stages may process a number of instructions concurrently. In a typical first stage, referred to as an "instruction fetch" stage, an instruction is fetched from memory. Then, in a "dispatch" stage, the instruction is decoded into different control bits, which designate a type of functional unit for performing the operations specified by the instruction, source operands for the operation, and destination registers for results of the operation. The decoded instruction is dispatched to an issue queue, where instructions wait for data and an available execution unit. Next, in the "issue" stage, an instruction in the issue queue is issued to a unit having an "execution stage." The execution stage processes the operation as specified by the instruction. Executing an operation specified by an instruction includes accepting one or more operands and producing one or more results.
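The overlap that pipelining provides can be illustrated with a minimal sketch. It assumes an ideal, stall-free pipeline in which every stage takes one cycle; the stage list follows the description above (with the completion stage appended), while the function names and the `width` parameter are hypothetical and introduced only for illustration.

```python
# Stage names follow the text; one cycle per stage is an assumption.
STAGES = ["fetch", "dispatch", "issue", "execute", "complete"]


def pipeline_timeline(num_instructions, width=1):
    """Map each instruction index to the cycle in which it occupies each stage.

    `width` is the number of instructions that enter the pipeline per cycle:
    1 models a scalar pipeline, values above 1 model superscalar operation.
    No stalls, misses, or dependencies are modeled.
    """
    timeline = {}
    for i in range(num_instructions):
        start = i // width  # cycle in which instruction i is fetched
        timeline[i] = {stage: start + s for s, stage in enumerate(STAGES)}
    return timeline


def total_cycles(num_instructions, width=1):
    """Number of cycles until the last instruction leaves the final stage."""
    if num_instructions == 0:
        return 0
    timeline = pipeline_timeline(num_instructions, width)
    return max(stages["complete"] for stages in timeline.values()) + 1


if __name__ == "__main__":
    for i, stages in pipeline_timeline(3).items():
        print(i, stages)
    print("scalar:", total_cycles(4), "two-wide:", total_cycles(4, width=2))
```

Under these assumptions, four instructions occupy the five stages for eight cycles in the scalar case rather than the twenty cycles that strictly sequential processing would need, and a two-wide configuration shortens the same work to six cycles.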
A "completion" stage addresses program order issues which arise from concurrent instruction execution, wherein multiple, concurrently executed instructions may deposit results in a single register. The completion stage also handles issues arising when instructions dispatched after an interrupted instruction deposit results in the same destination registers. In the completion stage, an instruction waits for the point at which there is no longer a possibility of an interrupt before storing a data value, so that depositing a result will not violate program order. At this point, the instruction is considered "complete." It should be noted that buffers may be provided to store execution results before the results are deposited into the destination register, and buffers may be provided to back up register contents at specified checkpoints in case an interrupt requires the register content to be reverted to its pre-checkpoint value. Either type of buffer may be utilized in a particular implementation. At completion, the results held in the execution buffer and the holding buffer are deposited into the destination register and the back-up buffer is released.
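As a rough illustration of the completion stage, the following sketch models only the first kind of buffer described above, in which execution results are held until they can be deposited in program order. The function name, the tuple layout, and the `interrupted_at` parameter are hypothetical; check-pointing with back-up buffers is not modeled.

```python
def complete_in_order(executed, num_instructions, interrupted_at=None):
    """Deposit buffered results into registers in program order.

    executed:       iterable of (program_index, dest_register, value) tuples,
                    listed in the order execution finished (may be out of order).
    interrupted_at: program index of an interrupted instruction, or None.
                    That instruction and every later one deposit nothing.
    Returns the resulting register contents as a dict.
    """
    # Results wait in a buffer keyed by program position (hypothetical layout).
    result_buffer = {idx: (reg, val) for idx, reg, val in executed}
    registers = {}
    for idx in range(num_instructions):
        if idx == interrupted_at or idx not in result_buffer:
            break  # nothing younger than this point may complete
        reg, val = result_buffer[idx]
        registers[reg] = val  # a later writer of the same register wins
    return registers


if __name__ == "__main__":
    # Instruction 2 finished first, but it is the youngest writer of r1.
    executed = [(2, "r1", 30), (0, "r1", 10), (1, "r2", 20)]
    print(complete_in_order(executed, 3))
    print(complete_in_order(executed, 3, interrupted_at=2))
```

In this sketch, two instructions that target the same register leave the value of the later one in program order, regardless of which finished executing first, and an interrupt at instruction 2 leaves the register holding the value written by instruction 0.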
Pipelined data processing systems typically implement load instructions to load information from a storage location to a register in a pipelined fashion. Typically, such load instructions execute in one cycle and, therefore, do not adversely affect the timing associated with pipelined flow of the data processor. However, under certain conditions, execution of the load instruction will take more than one cycle and, thus, the timing of the pipeline flow will be adversely affected. One of the conditions which may require a load operation to execute in more than one cycle occurs when an address of the load instruction must be translated. Address translation requires a look-up and address compare operation using a translation lookaside buffer (TLB). When an address translation operation involving a TLB is executed by a data processing system, an extra cycle is required for control signals corresponding to the load instruction to access the TLB. If there is a corresponding real address stored in the TLB, then the address translation is considered successful. However, if there is no corresponding entry in the TLB, then a table walk operation is initiated. Such a table walk operation requires all page table entries to be fetched from a memory. The table walk operation consumes a substantial number of cycles during execution of a load operation.
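The translation path described above may be sketched as follows. The page size, the cycle counts, and the dictionary-based TLB and page table are assumptions chosen for illustration, as is the refill of the TLB after a table walk; none of them are taken from a particular processor.

```python
PAGE_SIZE = 4096         # assumed page size in bytes
TLB_ACCESS_CYCLES = 1    # the extra cycle needed to access the TLB
TABLE_WALK_CYCLES = 20   # hypothetical cost of fetching page table entries


def translate(effective_address, tlb, page_table):
    """Translate an effective address and report the extra cycles consumed.

    tlb and page_table map virtual page numbers to real page numbers.
    Returns (real_address, extra_cycles).
    """
    page_number, offset = divmod(effective_address, PAGE_SIZE)
    cycles = TLB_ACCESS_CYCLES
    if page_number in tlb:
        # Successful translation: a corresponding real address is in the TLB.
        real_page = tlb[page_number]
    else:
        # No corresponding entry: a table walk fetches it from memory.
        real_page = page_table[page_number]
        tlb[page_number] = real_page  # assumed refill so the next access hits
        cycles += TABLE_WALK_CYCLES
    return real_page * PAGE_SIZE + offset, cycles


if __name__ == "__main__":
    tlb = {1: 7}
    page_table = {1: 7, 2: 9}
    print(translate(4096 + 5, tlb, page_table))  # TLB hit
    print(translate(2 * 4096, tlb, page_table))  # TLB miss, table walk
```

The sketch makes the asymmetry visible: a hit costs a single, fixed extra cycle, whereas a miss adds a much larger cost that depends on the table walk.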
Additionally, a load operation may take more than one cycle when the load instruction references an address which is not aligned. An address is not aligned when the data for that load instruction spans the address boundary for which the data processing system is designed. In this case, an additional cycle is required to assemble data that spans the boundary of the fetch address. Unlike a load operation which requires address translation and may take an undetermined number of cycles to execute, an unaligned load operation requires only a fixed number of additional cycles.
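A minimal sketch of the alignment check follows. The 8-byte boundary is an assumed value, the function names are hypothetical, and the sketch assumes an access is at least one byte and no larger than the boundary, so that it crosses at most one boundary.

```python
BOUNDARY_BYTES = 8  # assumed boundary for which the system is designed


def is_unaligned(address, size, boundary=BOUNDARY_BYTES):
    """True when bytes [address, address + size) span a boundary.

    Assumes 1 <= size <= boundary.
    """
    first_block = address // boundary
    last_block = (address + size - 1) // boundary
    return first_block != last_block


def load_cycles(address, size, boundary=BOUNDARY_BYTES):
    """One cycle for an aligned load, plus one fixed extra cycle otherwise."""
    return 1 + (1 if is_unaligned(address, size, boundary) else 0)


if __name__ == "__main__":
    for address in (0, 4, 6, 8):
        print(address, is_unaligned(address, 4), load_cycles(address, 4))
```

With these assumptions, a 4-byte load at address 6 touches bytes 6 through 9 and therefore crosses the boundary at 8, while the same load at address 4 or 8 does not.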
During operation of a typical data processor, an instruction issuing algorithm issues instructions with available operands to execution units for execution. If a load executes, but its data is not in the cache, then subsequent instructions which depend on the load instruction are marked as not having available data and are not issued to the execution unit. When data required for a load instruction is not in a cache at the time the load instruction executes, a "cache miss" results. A cache miss indicates that the data is not accessible from a cache memory, which may be quickly accessed. When a cache miss occurs, the processor is forced to initiate a memory request for the load data. This memory request may take several cycles to complete. However, before the cache miss condition is detected, the issue logic has already marked the instruction which depends on the load operation as "ready" for execution. When the cache miss condition is reported, the issue logic must reset the mark made on the instruction to indicate that the instruction that depends on the load is not ready for execution. A reset operation typically takes several cycles to perform. This reset operation is utilized even when the execution of the load instruction is delayed by only a single cycle. That is, the issue logic does not distinguish a condition that causes a multiple cycle delay for the data (e.g., a cache miss) from a condition which causes a single cycle delay (e.g., a translation buffer access or an unaligned access). By not distinguishing the two conditions, the issue logic forces all instructions that depend on a load delayed by a single cycle to incur, due to the reset operation, the same penalty as instructions that depend on a load delayed for more than one cycle.
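The cost of treating every delay alike can be made concrete with a small model. It is only a sketch: the three-cycle reset is a hypothetical figure, and the model assumes that the reset overlaps with the load delay, so that a dependent instruction waits for whichever of the two takes longer.

```python
RESET_CYCLES = 3  # hypothetical duration of the reset operation


def dependent_stall(load_delay, reset_cycles=RESET_CYCLES):
    """Cycles a dependent instruction waits when the issue logic does not
    distinguish short delays from long ones.

    load_delay: extra cycles before the load data is available
                (0 = on time, 1 = e.g. TLB access or unaligned access,
                larger values = e.g. a cache miss).
    """
    if load_delay == 0:
        return 0  # the "ready" mark was correct, so no reset is needed
    # Any delay triggers the reset; assumed to overlap with the load delay.
    return max(load_delay, reset_cycles)


def wasted_cycles(load_delay, reset_cycles=RESET_CYCLES):
    """Stall cycles beyond what the load delay alone would have required."""
    return dependent_stall(load_delay, reset_cycles) - load_delay


if __name__ == "__main__":
    for delay in (0, 1, 30):
        print(delay, dependent_stall(delay), wasted_cycles(delay))
```

Under these assumptions, the reset is hidden behind a long cache miss but dominates a single-cycle delay, which is where the penalty described above falls.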
Therefore, a need exists for a mechanism in a data processing system which more efficiently utilizes the resources of the data processing system when an instruction is delayed by a minimum number of cycles.