1. Technical Field
The present invention relates in general to instruction processing systems and in particular to a method and system for ordering instruction fetch requests. Still more particularly, the present invention relates to a method and system for implementing just-in-time delivery of instructions requested by instruction fetch requests.
2. Description of the Related Art
In conventional symmetric multiprocessor (SMP) data processing systems, all of the processors are generally identical. The processors all utilize common instruction sets and communication protocols, have similar hardware architectures, and are generally provided with similar memory hierarchies. For example, a conventional SMP data processing system may comprise a system memory, a plurality of processing elements that each include a processor and one or more levels of cache memory, and a system bus coupling the processing elements to each other and to the system memory.
Conventional SMP data processing system processors have a number of execution units. Superscalar multiprocessors typically have more than one of each execution unit. They typically have two floating point units (FPUs), two fixed point units (FXUs), and two load/store units (LSUs). The processors are designed for high frequency, and their corresponding internal caches are typically very small in order to operate with the high frequency processor. In part due to their relatively small size, these internal caches sustain a large number of cache misses during requests for instructions. Instructions are thus stored in lower level (L2) caches to maximize processing speed. The processors typically send multiple instruction fetch requests simultaneously or within close proximity to each other. This is particularly true in multithreaded or superscalar processors with multiple instruction fetch units (IFUs).
Traditionally, processors execute program instructions in order. With state-of-the-art processors, out-of-order execution of instructions is often employed to maximize the utilization of execution unit resources within the processor, thereby enhancing overall processor efficiency. Further, in these state-of-the-art processors that support out-of-order execution of instructions, instructions may be dispatched out of program order, executed opportunistically within the execution units of the processor, and completed in program order. The performance enhancement resulting from out-of-order execution is maximized when implemented within a superscalar processor having multiple execution units capable of executing multiple instructions concurrently.
Instructions are typically stored according to program order in a cache line within an instruction cache (I-cache) of a processor. Furthermore, each unit of access to the I-cache is generally more than one instruction. For example, for a processor architecture that has a four-byte instruction length, each I-cache access may be 32 bytes wide, which equals a total of eight instructions per I-cache access. Even with the simplest I-cache design, these instructions must be multiplexed into an instruction buffer having eight or fewer slots before being sent to the issue queue.
During fetching of instructions, all eight instructions are initially read from the I-cache. The fetch address of the first instruction is then utilized to control an 8-to-1 multiplexor to gate the first four instructions into an instruction buffer with, for example, four slots. The fetch address is also utilized to select a target instruction, along with the next three instructions from the eight instructions, to gate into the instruction buffer. All four instructions are gated into the instruction buffer in execution order instead of program order. With this arrangement, when the fetch address is the result of a (predicted or actual) branch instruction, the first instruction to be gated into the instruction buffer may be any one of the eight instructions. The target address of the branch instruction may point to the last instruction of the I-cache access, in which case not all four slots within the instruction buffer will be filled.
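The selection described above can be illustrated with a minimal Python sketch. It is a software model only, not the hardware multiplexor itself, and it assumes that each I-cache access is aligned to a 32-byte boundary and that fetch addresses are aligned to the four-byte instruction length. The function name gate_into_buffer and the constant names are hypothetical and are introduced only for illustration.

```python
INSTRUCTION_BYTES = 4   # four-byte instruction length (from the text)
ACCESS_BYTES = 32       # width of one I-cache access (from the text)
BUFFER_SLOTS = 4        # example four-slot instruction buffer (from the text)

# 32 bytes / 4 bytes per instruction = 8 instructions per access
INSTRUCTIONS_PER_ACCESS = ACCESS_BYTES // INSTRUCTION_BYTES


def gate_into_buffer(access, fetch_address):
    """Model which instructions of one I-cache access reach the buffer.

    access: the eight instructions read from the I-cache, in program order.
    fetch_address: byte address of the target instruction (assumed to be
    four-byte aligned and to fall within this 32-byte aligned access).
    """
    if len(access) != INSTRUCTIONS_PER_ACCESS:
        raise ValueError("an access holds exactly eight instructions")
    # Position of the target instruction within the access (0..7).
    first = (fetch_address % ACCESS_BYTES) // INSTRUCTION_BYTES
    # The target plus up to three following instructions; fewer are
    # available when the target lies near the end of the access.
    return access[first:first + BUFFER_SLOTS]


access = ["i0", "i1", "i2", "i3", "i4", "i5", "i6", "i7"]
full = gate_into_buffer(access, 0x108)     # target is the third instruction
partial = gate_into_buffer(access, 0x11C)  # target is the last instruction
```

In this model, a branch target at the last instruction of the access leaves three of the four buffer slots empty, which is the underfilled case noted above.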
Branch processing, for example, results in a delay in processing, particularly when the branch is speculative and is guessed incorrectly. The branch instruction and subsequent instructions from the instruction path taken utilize cache resources, which have to be re-charged when the path is incorrectly predicted. This results in a loss of many clock cycles and leads to less efficient overall processing.
Processors today often run numerous cycles ahead of the instruction stream of the program being executed. Also, on these processors, instruction fetch requests are issued as early as possible in order to "hide" the cache access latencies and thus allow ensuing dependent instructions to execute with minimal delay. These techniques lead to requests for instructions which may not be required immediately. Also, this often leads to bubbles in the pipeline of instructions. Finally, an L2 cache has a limited number of wired connections for returning instructions. When an instruction is sent prior to the time it is required, it utilizes valuable wired cache line resources which may be required for more immediate or important instructions.
In the prior art, instruction fetch requests may be issued out of order. Often this results in an instruction queue occupying valuable cache line resources or register space for many cycles before it is utilized by the program. When a large number of instruction fetch requests are present, this loads down the critical cache and queue resources, resulting in less efficient processing.
When the instruction cache is "bombarded" with instruction fetch requests, no ordering information is included. The instruction cache is oblivious as to which fetch request to process and in which order. In traditional processors, ordering information is typically implied based on a "First Come First Serve" prioritization scheme. However, in some cases an instruction is not required by the processor or program at the time or in the order it is requested.
Thus, many hardware and software limitations exist in the current method of fetching instructions from an instruction cache. It is obvious that a more efficient means of fetching instructions from an instruction cache needs to be developed. A processor should be able to issue its fetch requests so that the instruction cache can deliver each instruction only when it is actually required, while preventing bubbles in the pipeline.
It would therefore be desirable to provide a method and system for improving the efficiency of instruction fetch request processing and subsequent fetching of instructions. It is further desirable to provide a method and system which allows for just-in-time delivery and/or time-ordered delivery of instructions during execution of an instruction set, thus allowing instructions to be fetched from an instruction cache at the time when needed within the program execution stream.
It is therefore one object of the present invention to provide an improved instruction processing system.
It is another object of the present invention to provide an improved method and system for efficiently managing multiple instruction fetch requests to an instruction cache.
It is yet another object of the present invention to provide a method and system for implementing just-in-time delivery of instructions requested by instruction fetches.
The foregoing objects are achieved as is now described. A system for time-ordered issuance of instruction fetch requests (IFRs) is disclosed. More specifically, the system enables just-in-time delivery of instructions requested by an IFR. The system consists of a processor, an L1 instruction cache with a corresponding L1 cache controller, and an instruction processor. The instruction processor manipulates an architected time dependency field of an IFR to create a Time of Dependency (ToD) field. The ToD field holds a time dependency value which is utilized to order the IFRs in a Relative Time-Ordered Queue (RTOQ) of the L1 cache controller. The IFR is issued from the RTOQ to the L1 instruction cache so that the requested instruction is fetched from the L1 instruction cache at the time specified by the ToD value. In an alternate embodiment, the ToD is converted to a CoD and the instruction is fetched from a lower level cache at the CoD value.
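The ordering behavior summarized above may be modeled, for illustration only, by the following Python sketch of a queue that releases IFRs according to their ToD values. The sketch is not the disclosed hardware. It assumes that the ToD value is an integer cycle count, that IFRs with equal ToD values are released in arrival order, and that cache access latency is ignored. The class and method names (IFR, RelativeTimeOrderedQueue, enqueue, issue_ready) are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class IFR:
    """An instruction fetch request, ordered by ToD and then arrival."""
    tod: int                            # assumed: cycle at which the instruction is needed
    seq: int                            # arrival number, used only to break ToD ties
    address: int = field(compare=False)


class RelativeTimeOrderedQueue:
    """Software model of a queue that orders IFRs by their ToD values."""

    def __init__(self):
        self._heap = []
        self._arrival = count()

    def enqueue(self, address, tod):
        # Requests may arrive in any order; the heap keeps the IFR with
        # the smallest ToD at the front.
        heapq.heappush(self._heap, IFR(tod, next(self._arrival), address))

    def issue_ready(self, now):
        """Return the addresses of all IFRs whose ToD has been reached."""
        ready = []
        while self._heap and self._heap[0].tod <= now:
            ready.append(heapq.heappop(self._heap).address)
        return ready


rtoq = RelativeTimeOrderedQueue()
rtoq.enqueue(0x200, tod=9)   # requested first, needed last
rtoq.enqueue(0x100, tod=3)
rtoq.enqueue(0x180, tod=3)
early = rtoq.issue_ready(now=2)   # nothing is needed yet
due = rtoq.issue_ready(now=3)     # both ToD-3 requests, in arrival order
late = rtoq.issue_ready(now=10)   # the remaining request
```

Unlike a First Come First Serve scheme, the request for address 0x200 is held back in this model until its ToD is reached, even though it arrived first.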
The above as well as additional objects, features, and advantages of an illustrative embodiment will become apparent in the following detailed written description.