1. Field of the Invention
This invention relates to the field of microprocessors and, more particularly, to caching mechanisms within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. On the other hand, superpipelined microprocessor designs divide instruction execution into a large number of subtasks which can be performed quickly, and assign pipeline stages to each subtask. By overlapping the execution of many instructions within the pipeline, superpipelined microprocessors attempt to achieve high performance. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
Superscalar microprocessors demand high memory bandwidth due to the number of instructions attempting concurrent execution and due to the increasing clock frequency (i.e. shortening clock cycle) employed by the superscalar microprocessors. Many of the instructions include memory operations to fetch (read) and update (write) memory operands. The memory operands must be fetched from or conveyed to memory, and each instruction must originally be fetched from memory as well. Similarly, superpipelined microprocessors demand high memory bandwidth because of the high clock frequency employed by these microprocessors and the attempt to begin execution of a new instruction each clock cycle. It is noted that a given microprocessor design may employ both superscalar and superpipelined techniques in an attempt to achieve the highest possible performance characteristics.
Microprocessors are often configured into computer systems which have a relatively large, relatively slow main memory. Typically, multiple dynamic random access memory (DRAM) modules comprise the main memory system. The large main memory provides storage for a large number of instructions and/or a large amount of data for use by the microprocessor, providing faster access to the instructions and/or data then may be achieved from a disk storage, for example. However, the access times of modern DRAMs are significantly longer than the clock cycle length of modern microprocessors. The memory access time for each set of bytes being transferred to the microprocessor is therefore long. Accordingly, the main memory system is not a high bandwidth system. Microprocessor performance may suffer due to a lack of available memory bandwidth.
In order to allow high bandwidth memory access (thereby increasing the instruction execution efficiency and ultimately microprocessor performance), microprocessors typically employ one or more caches to store the most recently accessed data and instructions. A relatively small number of clock cycles may be required to access data stored in a cache, as opposed to a relatively larger number of clock cycles are required to access the main memory.
Unfortunately, the number of clock cycles required for cache access is increasing in modern microprocessors. Where previously a cache latency (i.e. the time from initiating an access to the corresponding data becoming available from the cache) might have been as low as one clock cycle, cache latencies in modern microprocessors may be two or even three clock cycles. A variety of delay sources are responsible for the increased cache latency. As transistor geometries characteristic of modern semiconductor fabrication technologies have decreased, interconnect delay has begun to dominate the delay experienced by circuitry upon on integrated circuit such as a microprocessor. Such interconnect delay may be particularly troubling within a large memory array such as a cache. Additionally, semiconductor fabrication technology improvements have enabled the inclusion of increasing numbers of functional units (e.g. units configured to execute at least a subset of the instructions within the instruction set employed by the microprocessor). While the added functional units increase the number of instructions which may be executed during a given clock cycle, the added functional units accordingly increase the bandwidth demands upon the data cache. Still further, the interconnect delay between the data cache and the functional units increases with the addition of more functional units (both in terms of length of the interconnect and in the capacitive load thereon). Another concept closely related to cache latency is referred to as the "load to use" penalty. Generally, the load to use penalty characteristic of a microprocessor is the number of clock cycles which elapse between the execution of a load memory operation and execution upon the referenced memory operand.
Instructions awaiting memory operands from the data cache are stalled throughout the cache latency period. Generally, an instruction operates upon operands specified by the instruction. An operand may be stored in a register (register operand) or a memory location (memory operand). Memory operands are specified via a corresponding address, and a memory operation is performed to retrieve or store the memory operand. Overall instruction throughput may be reduced due to the increasing cache latency experienced by these memory operations. The x86 microprocessor architecture is particularly susceptible to cache latency increase, since relatively few registers are available. Accordingly, many operands in x86 instruction code sequences are memory operands. Other microprocessor architectures are detrimentally affected as well.