1. Field of the Invention
This invention is related to the field of processors and, more particularly, to instruction fetch mechanisms within processors.
2. Description of the Related Art
Superscalar processors attempt to achieve high performance by dispatching and executing multiple instructions per clock cycle, and by operating at the shortest possible clock cycle time consistent with the design. To the extent that a given processor is successful at dispatching and/or executing multiple instructions per clock cycle, high performance may be realized. In order to increase the average number of instructions dispatched per clock cycle, processor designers have been designing superscalar processors which employ wider issue rates. A “wide issue” superscalar processor is capable of dispatching (or issuing) a larger maximum number of instructions per clock cycle than a “narrow issue” superscalar processor is capable of dispatching. During clock cycles in which a number of dispatchable instructions is greater than the narrow issue processor can handle, the wide issue processor may dispatch more instructions, thereby achieving a greater average number of instructions dispatched per clock cycle.
In order to support wide issue rates, it is desirable for the superscalar processor to be capable of fetching a large number of instructions per clock cycle (on the average). For brevity, a processor capable of fetching a large number of instructions per clock cycle (on the average) will be referred to herein as having a “high fetch bandwidth”. If the superscalar processor is unable to achieve a high fetch bandwidth, then the processor may be unable to take advantage of the wide issue hardware due to a lack of instructions being available for issue.
Several factors may impact the ability of a particular processor to achieve a high fetch bandwidth. For example, many code sequences have a high frequency of branch instructions, which may redirect the fetching of subsequent instructions within that code sequence to a branch target address specified by the branch instruction. Accordingly, the processor may identify the branch target address upon fetching the branch instruction. Subsequently, the next instructions within the code sequence may be fetched using the branch target address. Processors attempt to minimize the impact of branch instructions on the fetch bandwidth by employing highly accurate branch prediction mechanisms and by generating the subsequent fetch address (either branch target or sequential) as rapidly as possible.
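The choice between a branch target and a sequential fetch address described above can be illustrated with a minimal Python sketch. The fetch block size, the `BranchPrediction` record, and the function name are hypothetical and serve only to make the selection concrete; they are not taken from any particular processor.

```python
from dataclasses import dataclass
from typing import Optional

FETCH_BLOCK_BYTES = 16  # hypothetical fetch block size, for illustration only


@dataclass
class BranchPrediction:
    """Prediction for a branch found in the fetched block (hypothetical record)."""
    taken: bool
    target: int


def next_fetch_address(fetch_address: int,
                       prediction: Optional[BranchPrediction]) -> int:
    """Return the branch target for a predicted-taken branch, else the sequential address."""
    if prediction is not None and prediction.taken:
        return prediction.target
    # Sequential case: advance to the start of the next aligned fetch block.
    return (fetch_address // FETCH_BLOCK_BYTES + 1) * FETCH_BLOCK_BYTES
```

The sooner this selection can be made after a block is fetched, the sooner the next fetch can begin, which is why rapid branch detection matters for fetch bandwidth.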
Another factor which may impact the ability of a particular processor to achieve a high fetch bandwidth is the hit rate and latency of an instruction cache employed by the processor. Processors typically include an instruction cache to reduce the latency of instruction fetches (as compared to fetching from main memory external to the processor). By providing low latency access to instructions, instruction caches may help achieve a high fetch bandwidth. Furthermore, the low latency of access to the instructions may allow branch instructions to be rapidly detected and corresponding branch target addresses to be rapidly generated for subsequent instruction fetches.
Modern processors have been attempting to achieve shorter clock cycle times in order to augment the performance gains which may be achieved with high issue rates. Unfortunately, the short clock cycle times being employed by modern processors tend to limit the size of an instruction cache which may be employed. Generally, larger instruction caches have a higher latency than smaller instruction caches. At some size, the instruction cache access time (i.e. the latency between presenting a fetch address to the instruction cache and receiving the corresponding instructions therefrom) may even exceed the desired clock cycle time. On the other hand, larger instruction caches typically achieve higher hit rates than smaller instruction caches.
Both high hit rates in the instruction cache and low latency access to the instruction cache are important to achieving high fetch bandwidth. If hit rates are low, then the average latency for instruction access may increase due to the more frequent main memory accesses required to fetch the desired instructions. Because larger instruction caches are capable of storing more instructions, they are more likely to be storing the desired instructions (once the instructions have been accessed for the first time) than smaller caches (which replace the instructions stored therein with other instructions within the code sequence more frequently). On the other hand, if the latency of each cache access is increased (due to the larger size of the instruction cache), the average latency for fetching instructions increases as well. As mentioned above, low average latency is important to achieving high fetch bandwidth by allowing more instructions to be fetched per clock cycle at a desired clock cycle time and by aiding in the more rapid detection and prediction of branch instructions. Accordingly, an instruction fetch structure which can achieve both high hit rates and low latency access is desired to achieve short clock cycle times as well as high fetch bandwidth.
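The trade-off between hit rate and access latency can be made concrete with the standard average access time relation (hit latency plus miss rate times miss penalty). The Python sketch below uses that relation; all cycle counts and miss rates are hypothetical values chosen for illustration and are not measurements of any processor.

```python
def average_fetch_latency(hit_latency: float, miss_rate: float,
                          miss_penalty: float) -> float:
    """Average cycles per instruction fetch: hit latency plus the expected miss cost."""
    return hit_latency + miss_rate * miss_penalty


MEMORY_PENALTY = 50  # hypothetical main memory penalty, in clock cycles

# Hypothetical small cache: fast access, but misses more often.
small_fast = average_fetch_latency(1, 0.10, MEMORY_PENALTY)

# Hypothetical large cache: slower access, but misses less often.
large_slow = average_fetch_latency(3, 0.02, MEMORY_PENALTY)

# The desired structure: the large cache's hit rate at the small cache's latency.
desired = average_fetch_latency(1, 0.02, MEMORY_PENALTY)
```

Under these assumed numbers the three configurations average about 6, 4, and 2 cycles per fetch respectively, which shows why a structure with both a high hit rate and a low access latency is sought.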
The problems outlined above are in large part solved by a processor in accordance with the present invention. The processor employs a first instruction cache, a second instruction cache, and a fetch unit employing a fetch/prefetch method among the first and second instruction caches designed to provide high fetch bandwidth. The fetch unit selects a fetch address based upon previously fetched instructions (e.g. the presence or absence of branch instructions within the previously fetched instructions) from a variety of fetch address sources. Depending upon the source of the fetch address, the fetch address is presented to one of the first and second instruction caches for fetching the corresponding instructions. If the first cache is selected to receive the fetch address, the fetch unit may select a prefetch address for presentation to the second cache. The prefetch address is selected from a variety of prefetch address sources and is presented to the second instruction cache. Instructions prefetched in response to the prefetch address are provided to the first instruction cache for storage.
In one embodiment, the first instruction cache may be a low latency, relatively small cache while the second instruction cache may be a higher latency, relatively large cache. Fetch addresses from many of the fetch address sources may be likely to hit in the first instruction cache. For example, branch target addresses corresponding to branch instructions having small displacements may be likely to hit in the first instruction cache, which stores the most recently accessed cache lines. Also, return addresses corresponding to return instructions may be likely to hit in the first instruction cache since the corresponding call instruction may have been recently executed. Other fetch addresses may be less likely to hit in the first instruction cache. For example, branch target addresses corresponding to branch instructions having large displacements or branch target addresses formed using an indirect method may be less likely to hit in the first instruction cache. Accordingly, these fetch addresses may be immediately fetched from the second instruction cache, instead of first attempting to fetch from the first instruction cache. The latency of attempting an access in the first instruction cache may thereby be avoided.
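The source-dependent selection described in this embodiment can be sketched in Python as follows. The enumeration names and the `select_cache` function are hypothetical; only the four fetch address sources named in the text are modeled, and the mapping simply restates which of them are described as likely or less likely to hit in the first instruction cache.

```python
from enum import Enum, auto


class FetchSource(Enum):
    """Fetch address sources named in the text (identifier names are hypothetical)."""
    SMALL_DISPLACEMENT_BRANCH = auto()
    RETURN_ADDRESS = auto()
    LARGE_DISPLACEMENT_BRANCH = auto()
    INDIRECT_BRANCH = auto()


class Cache(Enum):
    FIRST = auto()   # low latency, relatively small
    SECOND = auto()  # higher latency, relatively large


# Sources whose addresses are described as likely to hit in the first cache.
_LIKELY_FIRST_CACHE_HIT = frozenset({
    FetchSource.SMALL_DISPLACEMENT_BRANCH,
    FetchSource.RETURN_ADDRESS,
})


def select_cache(source: FetchSource) -> Cache:
    """Choose the cache to receive the fetch address based only on its source."""
    if source in _LIKELY_FIRST_CACHE_HIT:
        return Cache.FIRST
    # Less likely to hit in the first cache: go straight to the second cache
    # and avoid the latency of a probable miss in the first cache.
    return Cache.SECOND
```

Note that the decision uses only the source of the address, not a lookup result, so it can be made before either cache is accessed.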
By generating prefetch addresses for the second instruction cache when the fetch address is conveyed to the first instruction cache, the fetch unit attempts to increase the likelihood that subsequent fetch addresses hit in the first instruction cache. Hits in the first instruction cache may provide the lowest latency, and hence may operate to improve the fetch bandwidth. Furthermore, in one embodiment the first instruction cache may provide multiple cache lines in response to fetch addresses. Accordingly, a relatively larger number of instructions may be provided per fetch than if only one cache line is provided. Fetch bandwidth may thereby be further improved.
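The interaction between a fetch from the first instruction cache and a concurrent prefetch from the second can be modeled with a small Python sketch. Everything here is an assumption made for illustration: the line size, the least-recently-used replacement in `SmallCache`, the use of a plain dictionary for the second cache, and in particular the choice of the next sequential line as the prefetch address (the text states only that the prefetch address is selected from a variety of sources).

```python
from collections import OrderedDict

LINE_BYTES = 32  # hypothetical cache line size


class SmallCache:
    """Toy model of the first instruction cache with LRU replacement (an assumption)."""

    def __init__(self, capacity_lines: int):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # line address -> bytes, least recently used first

    def lookup(self, line_addr: int):
        data = self.lines.get(line_addr)
        if data is not None:
            self.lines.move_to_end(line_addr)  # mark as most recently used
        return data

    def fill(self, line_addr: int, data: bytes) -> None:
        self.lines[line_addr] = data
        self.lines.move_to_end(line_addr)
        while len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line


def fetch_with_prefetch(first: SmallCache, second: dict, fetch_addr: int):
    """Fetch from the first cache and prefetch one line from the second cache into it.

    Returns the fetched line, or None on a miss in the first cache.
    """
    line = fetch_addr - fetch_addr % LINE_BYTES
    fetched = first.lookup(line)
    # Assumed prefetch policy: the next sequential line.
    prefetch_line = line + LINE_BYTES
    if prefetch_line not in first.lines and prefetch_line in second:
        first.fill(prefetch_line, second[prefetch_line])
    return fetched
```

For example, with `second = {0x100: b"A", 0x120: b"B"}` and a first cache that initially holds only line `0x100`, a fetch at `0x104` returns `b"A"` and also brings line `0x120` into the first cache, so a following fetch at `0x120` hits there instead of paying the latency of the second cache.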
Broadly speaking, the present invention contemplates a processor comprising a first instruction cache configured to store instructions; a second instruction cache configured to store instructions; and a fetch unit. Coupled to the first instruction cache and the second instruction cache, the fetch unit is configured to generate a fetch address responsive to previously fetched instructions. The fetch unit is configured to select one of the first instruction cache and the second instruction cache from which to fetch instructions stored at the fetch address. Additionally, the fetch unit is configured to select the one of the first instruction cache and the second instruction cache dependent upon a source of the fetch address.
The present invention further contemplates a method for fetching instructions in a processor. A fetch address is selected from a plurality of fetch address sources responsive to previously fetched instructions. One of the first instruction cache within the processor and the second instruction cache within the processor is selected to receive the fetch address dependent upon which one of the plurality of fetch address sources is selected. Instructions are fetched from the selected one of the first instruction cache and the second instruction cache.
Moreover, the present invention contemplates a computer system, comprising a processor, a memory, and an input/output (I/O) device. The processor is configured to select a fetch address from one of a plurality of fetch address sources within the processor. The processor is further configured to fetch instructions from one of a first instruction cache and a second instruction cache included within the processor dependent upon the one of the plurality of fetch address sources from which the fetch address is selected. Coupled to the processor, the memory is configured to store instructions. The processor is configured to fetch the instructions from the memory if the instructions miss in the first instruction cache and the second instruction cache. Coupled to the processor, the I/O device is configured to communicate between the computer system and a second computer system to which the I/O device is coupled.