1. Field of the Invention
This invention relates to the field of microprocessors and, more particularly, to load/store dependency checking within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. On the other hand, superpipelined microprocessor designs divide instruction execution into a large number of subtasks which can be performed quickly, and assign pipeline stages to each subtask. By overlapping the execution of many instructions within the pipeline, superpipelined microprocessors attempt to achieve high performance. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
Superscalar microprocessors demand high memory bandwidth (and more particularly low memory latency) due to the number of instructions attempting concurrent execution and due to the increasing clock frequency (i.e. shortening clock cycle) employed by the superscalar microprocessors. Many of the instructions include memory operations to fetch (read) and update (write) memory operands. The memory operands must be fetched from or conveyed to memory, and each instruction must originally be fetched from memory as well. Similarly, superpipelined microprocessors demand low memory latency because of the high clock frequency employed by these microprocessors and the attempt to begin execution of a new instruction each clock cycle. It is noted that a given microprocessor design may employ both superscalar and superpipelined techniques in an attempt to achieve the highest possible performance characteristics.
Microprocessors are often configured into computer systems which have a relatively large, relatively slow main memory. Typically, multiple dynamic random access memory (DRAM) modules comprise the main memory system. The large main memory provides storage for a large number of instructions and/or a large amount of data for use by the microprocessor, providing faster access to the instructions and/or data then may be achieved from a disk storage, for example. However, the access times of modem DRAMs are significantly longer than the clock cycle length of modem microprocessors. The memory access time for each set of bytes being transferred to the microprocessor is therefore long. Accordingly, the main memory system is not a high bandwidth, low latency system. Microprocessor performance may suffer due to a lack of available memory bandwidth and the long latency.
In order to allow high bandwidth, low latency memory access (thereby increasing the instruction execution efficiency and ultimately microprocessor performance), microprocessors typically employ one or more caches to store the most recently accessed data and instructions. A relatively small number of clock cycles may be required to access data stored in a cache, as opposed to a relatively larger number of clock cycles required to access the main memory.
Unfortunately, the number of clock cycles required for cache access is increasing in modern microprocessors. Where previously a cache latency (i.e. the time from initiating an access to the corresponding data becoming available for execution) might have been as low as one clock cycle, cache latencies in modem microprocessors may be two or even three clock cycles. A variety of delay sources are responsible for the increased cache latency. As transistor geometries characteristic of modern semiconductor fabrication technologies have decreased, interconnect delay has begun to dominate the delay experienced by circuitry upon on integrated circuit such as a microprocessor. Such interconnect delay may be particularly troubling within a large memory array such as a cache. Additionally, semiconductor fabrication technology improvements have enabled the inclusion of increasing numbers of functional units (e.g. units configured to execute at least a subset of the instructions within the instruction set employed by the microprocessor). While the added functional units increase the number of instructions which may be executed during a given clock cycle, the added functional units accordingly increase the bandwidth demands upon the data cache. Still further, the interconnect delay between the data cache and the functional units increases with the addition of more functional units (both in terms of length of the interconnect and in the capacitive load thereon).
Instructions awaiting memory operands from the data cache are stalled throughout the cache latency period. Generally, an instruction operates upon operands specified by the instruction. An operand may be stored in a register (register operand) or a memory location (memory operand). Memory operands are specified via a corresponding address, and a memory operation is performed to retrieve or store the memory operand. Overall instruction throughput may be reduced due to the increasing cache latency experienced by these memory operations. The x86 microprocessor architecture is particularly susceptible to cache latency increase, since relatively few registers are available. Accordingly, many operands in x86 instruction code sequences are memory operands. Other microprocessor architectures are deleteriously affected as well.
Generally, registers and memory (or memory locations) are different types of storage locations. Registers are private to the microprocessor (i.e. cannot be directly accessed by other devices in the computer system besides the microprocessor). Memory is generally accessible to many devices in a computer system. Registers are typically identified directly by a field in an instruction, while memory locations are identified by an address which is generated from instruction operands (e.g. by adding the instruction operands).
The size of buffers (such as reservation stations and the load/store buffer for storing memory operand requests, or memory operations) may be increased to offset the delay in fetching memory operands. With increased the buffer sizes, the microprocessor may be able to execute instructions which are subsequent to instructions stalled awaiting memory operands (i.e. the microprocessor may have an increased ability to detect parallelism in instruction code being executed). Unfortunately, increased buffer size poses problems as well. Particularly, the load/store buffer generally performs dependency checking upon memory accesses to the data cache, to ensure that a prior store memory operation within the buffer is not bypassed by a subsequent load memory operation. Not only do a large number of comparisons need to be made, but then a large amount of logic is often needed to prioritize the comparison results when multiple dependencies are detected. As the number of buffer entries increases, this logic becomes more complex and requires even more time to evaluate. A method for rapidly performing dependency checking in a load/store buffer is therefore desired.