In the field of microprocessors, the number of instructions executed per second is a primary performance measure. As is well known in the art, many factors in the design and manufacture of a microprocessor impact this measure. For example, the execution rate depends quite strongly on the clock frequency of the microprocessor. The frequency of the clock applied to a microprocessor is limited, however, by power dissipation concerns and by the switching characteristics of the transistors in the microprocessor.
The architecture of the microprocessor is also a significant factor in the execution rate of a microprocessor. For example, many modern microprocessors utilize a "pipelined" architecture to improve their execution rate if many of their instructions require multiple clock cycles for execution. According to conventional pipelining techniques, each microprocessor instruction is segmented into several stages, and separate circuitry is provided to perform each stage of the instruction. The execution rate of the microprocessor is thus increased by overlapping the execution of different stages of multiple instructions in each clock cycle. In this way, one multiple-cycle instruction may be completed in each clock cycle.
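By way of illustration only, the throughput benefit of pipelining described above may be estimated with a simple cycle count. The sketch below assumes an idealized pipeline with one clock cycle per stage and no stalls; the function names are hypothetical and are not drawn from any particular microprocessor.

```python
# Idealized model (assumption): every instruction passes through `stages`
# stages, each stage takes one clock cycle, and no stalls occur.

def cycles_unpipelined(num_instructions, stages):
    """Each instruction finishes all of its stages before the next begins."""
    return num_instructions * stages


def cycles_pipelined(num_instructions, stages):
    """The first instruction fills the pipeline; each later one completes
    one cycle after its predecessor."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


if __name__ == "__main__":
    n, s = 100, 5
    print(cycles_unpipelined(n, s))  # 500
    print(cycles_pipelined(n, s))    # 104
```

Under these assumptions, once the pipeline is full, one five-cycle instruction completes in each clock cycle, so that 100 instructions require 104 cycles rather than 500.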
By way of further background, some microprocessor architectures are of the "superscalar" type, where multiple instructions are issued in each clock cycle for execution in parallel. Assuming no dependencies among instructions, the increase in instruction throughput is proportional to the number of instructions issued in each clock cycle.
Another known technique for improving the execution rate of a microprocessor and the system in which it is implemented is the use of a cache memory. Conventional cache memories are small high-speed memories that store program and data from memory locations which are likely to be accessed in performing later instructions, as determined by a selection algorithm. Since the cache memory can be accessed in a reduced number of clock cycles (often a single cycle) relative to main system memory, the effective execution rate of a microprocessor utilizing a cache is much improved over a non-cache system. Many cache memories are located on the same integrated circuit chip as the microprocessor itself, providing further performance improvement.
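By way of illustration only, the effect of a cache on the effective access time may be approximated by a commonly used average-access-time model. The sketch below is a simplification under stated assumptions: every access first costs the cache access time, and a miss adds a fixed penalty for main memory. The function name and the numeric values are hypothetical.

```python
# Simplified model (assumption): every access pays `hit_cycles`; a miss
# additionally pays `miss_penalty_cycles` to reach main memory.

def average_access_cycles(hit_rate, hit_cycles, miss_penalty_cycles):
    """Average clock cycles per memory access for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Hypothetical figures: single-cycle cache, 20-cycle miss penalty.
    with_cache = average_access_cycles(0.95, 1, 20)
    no_cache = average_access_cycles(0.0, 0, 20)  # every access goes to memory
    print(round(with_cache, 3), no_cache)
```

With these hypothetical figures, a 95% hit rate yields an average of approximately two cycles per access, as compared with twenty cycles per access in the non-cache case.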
According to each of these architecture-related performance improvement techniques, certain events may occur that slow the microprocessor performance. For example, in both the pipelined and the superscalar architectures, multiple instructions may require access to the same internal circuitry at the same time, in which case one of the instructions will have to wait (i.e., "stall") until the priority instruction is serviced by the circuitry.
One type of such a conflict often occurs where one instruction requests a write to memory (including cache) at the same time that another instruction requests a read from the memory. If the instructions are serviced on a "first-come-first-served" basis, the later-arriving instruction will have to wait for the completion of the prior instruction before it is granted memory access. These and other stalls are, of course, detrimental to microprocessor performance.
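By way of illustration only, the stall arising from first-come-first-served servicing of a single memory port may be sketched as follows. The request format (name, arrival cycle, service cycles) and the function name are assumptions made for this example and do not describe any particular memory controller.

```python
# Assumption: a single memory port services one request at a time,
# strictly in order of arrival.

def fcfs_stall_cycles(requests):
    """Return {name: cycles spent waiting} for first-come-first-served service.

    `requests` is a list of (name, arrival_cycle, service_cycles) tuples.
    """
    port_free_at = 0
    stalls = {}
    # sorted() is stable, so requests arriving in the same cycle keep
    # their listed order.
    for name, arrival, service in sorted(requests, key=lambda r: r[1]):
        start = max(arrival, port_free_at)
        stalls[name] = start - arrival
        port_free_at = start + service
    return stalls


if __name__ == "__main__":
    # A three-cycle write arrives at cycle 0; a read arrives at cycle 1.
    print(fcfs_stall_cycles([("write", 0, 3), ("read", 1, 1)]))
    # {'write': 0, 'read': 2}
```

In this example the read, although it arrives only one cycle after the write, is stalled for two cycles until the write has completed.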
It has been discovered that, for most instruction sequences (i.e., programs), reads from memory or cache are generally more time-critical than writes to memory or cache, especially where a large number of general-purpose registers are provided in the microprocessor architecture. This is because the instructions and input data are necessary at specific times in the execution of the program in order for the program to execute in an efficient manner. In contrast, since writes to memory merely record the result of the program execution, the actual time at which the write occurs is less critical, as the execution of later instructions may not depend upon the result.
By way of further background, write buffers have been provided in microprocessors, such write buffers logically located between on-chip cache memory and the bus to main memory. These conventional post-cache write buffers receive data from the cache for a write-through or write-back operation; the contents of the post-cache write buffer are written to main memory under the control of the bus controller, at times when the bus becomes available.
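By way of illustration only, the behavior of a conventional post-cache write buffer as described above may be modeled as a first-in-first-out queue that is drained only when the bus is available. The class name, method names, and capacity below are hypothetical, and main memory is represented by a simple dictionary.

```python
from collections import deque


class PostCacheWriteBuffer:
    """Toy model (assumption): a FIFO between the cache and the memory bus."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def post(self, address, data):
        """The cache hands off a write. Returns False if the buffer is full,
        in which case the cache would have to wait."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((address, data))
        return True

    def drain_one(self, bus_available, main_memory):
        """Write the oldest entry to main memory, but only if the bus is free.
        Returns True if an entry was written."""
        if not bus_available or not self.entries:
            return False
        address, data = self.entries.popleft()
        main_memory[address] = data
        return True


if __name__ == "__main__":
    memory = {}
    buf = PostCacheWriteBuffer(capacity=2)
    buf.post(0x100, 1)
    buf.post(0x104, 2)
    buf.drain_one(bus_available=False, main_memory=memory)  # bus busy: no write
    buf.drain_one(bus_available=True, main_memory=memory)   # oldest entry retires
    print(memory)  # {256: 1}
```

The sketch reflects only the queuing behavior described in the preceding paragraph; it does not model the bus controller itself.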
By way of further background, pipelined microprocessors are known to be vulnerable to certain hazards commonly referred to as data dependencies. In general, data dependencies arise when two instructions at different stages in the pipeline require access to the same register or memory location, as the pipeline may access the register or memory location for the later instruction (in program order) before the earlier instruction has written data thereto, which results in erroneous operation. Techniques for detecting such data dependencies in conventional pipelined microprocessors are known in the art, as described in Patterson and Hennessy, Computer Architecture: A Quantitative Approach (Morgan Kaufmann, 1990), pp. 257-78. According to conventional techniques, detection of a data dependency or hazard is handled by stalling the pipeline until the earlier instruction (in program order) is completed, after which the later instruction can be processed. Of course, pipeline stalls result in loss of performance for the microprocessor.
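By way of illustration only, detection of the read-after-write form of data dependency described above may be sketched as a comparison of the locations written by an earlier instruction against the locations read by a later instruction still within the same pipeline window. The `Instr` record, the `window` parameter, and the function names are assumptions made for this example and are not taken from the cited reference.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical instruction record: the register or memory locations
    that the instruction reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def has_raw_hazard(earlier, later):
    """True if `later` reads a location that `earlier` writes."""
    return bool(earlier.writes & later.reads)


def instructions_to_stall(program, window):
    """Names of instructions that read a location written by any of the
    `window` instructions immediately preceding them in program order."""
    stalled = []
    for i, later in enumerate(program):
        in_flight = program[max(0, i - window):i]
        if any(has_raw_hazard(earlier, later) for earlier in in_flight):
            stalled.append(later.name)
    return stalled


if __name__ == "__main__":
    program = [
        Instr("i0", frozenset({"r1"}), frozenset({"r2"})),
        Instr("i1", frozenset({"r2"}), frozenset({"r3"})),  # reads r2 from i0
        Instr("i2", frozenset({"r4"}), frozenset({"r5"})),
    ]
    print(instructions_to_stall(program, window=1))  # ['i1']
```

In the conventional approach described above, each instruction so identified would be held until the earlier instruction has completed its write.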
It is therefore an object of the present invention to provide a microprocessor architecture which allows for storage of execution results in a write buffer prior to retiring data to cache or memory in a manner in which data dependencies may be detected.
It is a further object of the present invention to provide for special handling of data dependencies so as to avoid pipeline stalls.
It is a further object of the present invention to provide such special handling in such a manner that only the last in a series of writes to the same location is sourced to the CPU core.
It is a further object of the present invention to avoid such special handling for non-cacheable write operations.
It is a further object of the present invention to provide for allocation of write buffer locations with an indication that an otherwise apparent data dependency is in fact not a data dependency.
It is a further object of the present invention to provide such an architecture which is implemented in a superpipelined superscalar microprocessor architecture.
Other objects and advantages of the present invention will be apparent to those of ordinary skill in the art having reference to the following specification in combination with the drawings.