1. Field of the Invention
The present invention relates to computational circuits and, more specifically, to a computational circuit that predicts values for a pipelined architecture.
2. Description of the Prior Art
Many modern computing systems use a processor having a pipelined architecture to increase instruction throughput. In theory, pipelined processors can execute one instruction per machine cycle when a well-ordered, sequential instruction stream is being executed. This is accomplished even though the instruction itself may require a number of separate microinstructions to be executed. Pipelined processors operate by breaking up the execution of an instruction into several stages that each require one machine cycle to complete. Latency is reduced in pipelined processors by initiating the processing of a second instruction before the execution of the first instruction is completed. In fact, multiple instructions can be in various stages of processing at any given time. Thus, the overall instruction execution latency of the system (which, in general, can be thought of as the delay between the time a sequence of instructions is initiated, and the time it is finished executing) can be significantly reduced.
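By way of illustration only, the throughput argument above can be sketched in Python. The sketch assumes an idealized pipeline in which every stage takes exactly one machine cycle and no stalls occur. The function names and the example figures (100 instructions, 5 stages) are hypothetical and are not taken from any particular processor.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Cycles needed when each instruction finishes before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages):
    """Cycles needed in an ideal, stall-free pipeline (an assumption).

    The first instruction needs num_stages cycles to drain through the
    pipeline; each later instruction completes one cycle after the previous.
    """
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    n, k = 100, 5  # hypothetical instruction count and pipeline depth
    print(unpipelined_cycles(n, k), pipelined_cycles(n, k))
```

Under these assumptions the pipelined sequence approaches one completed instruction per machine cycle as the sequence grows, which is the behavior described above for a well-ordered, sequential instruction stream.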
Further improvement can be accomplished through the use of cache memory. Cache memory is a type of memory that is typically faster than main memory in a computer. A cache is typically coupled to one or more processors and to a main memory. A cache speeds access by maintaining a copy of the information stored at selected memory addresses so that access requests to the selected memory addresses by a processor are handled by the cache. Whenever an access request is received for a memory address not stored in the cache, the cache typically retrieves the information from the memory and forwards the information to the processor.
The benefits of a cache are maximized whenever the number of access requests to cached memory addresses, known as “cache hits”, is maximized relative to the number of access requests to non-cached memory addresses, known as “cache misses”. One way to increase the hit rate for a cache is to increase the size of the cache. However, adding size to a cache memory may increase costs associated with the computer and may extend the access time associated with the cache.
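The relationship between cache size and hit rate can be illustrated with a minimal Python sketch. It assumes a fully associative cache with least-recently-used replacement, which is an assumption made for simplicity and is not stated above. The function name `hit_rate` and the address trace are hypothetical, and the sketch does not model the cost or access-time penalties of a larger cache.

```python
from collections import OrderedDict


def hit_rate(addresses, capacity):
    """Return the fraction of accesses that hit in a cache of `capacity` entries.

    Replacement policy is least-recently-used (an assumption for this sketch).
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if not addresses:
        return 0.0
    cache = OrderedDict()  # keys are cached addresses, ordered oldest first
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[addr] = True             # fill the cache on a miss
    return hits / len(addresses)


if __name__ == "__main__":
    trace = [0, 1, 2, 3] * 3  # hypothetical loop over four addresses
    print(hit_rate(trace, 2), hit_rate(trace, 4))
```

For this particular trace, a two-entry cache never hits because each address is evicted before it is reused, while a four-entry cache misses only on the first pass.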
As increases in clock frequency continue to outpace the raw transistor performance gains of successive silicon technology generations, microprocessor pipelines become ever deeper, to the point where an access to the L1 data cache takes 3 or 4 cycles. This long cache fetch latency has a pronounced negative effect on commercial code and integer code, where address and data dependencies are common. Further, the drive to high frequency also tends to reduce the obtainable size of an L1 data cache, so that only a half-size or quarter-size cache is implementable at higher frequencies. The microprocessor industry needs a relatively simple solution to two problems: the dependency-limited execution performance of integer code, and the inability to scale data cache size with frequency, which causes excessively high L1 cache miss rates.
It has recently been found by experiment that integer code, and in particular commercial and operating system code, executes a majority of its load and ALU instructions with a target value that is constant or nearly constant over many execution invocations. Thus, if a method can be found to remember this value from a previous execution of the code and quickly access it as a “guess” value for a long-latency load or other instruction target, then a significant performance improvement can be gained. Provisions must still be made for determining whether the predicted “guess” value is actually incorrect, and then for taking corrective action to flush the incorrect speculative results from the pipeline and to re-execute based on the slower but non-speculative load execution. However, when the “guess” target value is correct, a significant advantage is gained, because the next instruction after a load is often dependent on the load target value and must normally stall N cycles, where N+1 is the load instruction latency.
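The idea of remembering a target value and reusing it as a “guess” can be sketched in software as follows. This is a minimal illustration only, assuming a simple last-value table keyed by instruction address. The class name, the function name, and the trace are hypothetical, and the sketch is not the circuit of the present invention.

```python
class LastValuePredictor:
    """Hypothetical table that remembers the last target value per instruction."""

    def __init__(self):
        self.table = {}  # instruction address -> last observed target value

    def predict(self, pc):
        # Return the remembered value, or None when no guess is available.
        return self.table.get(pc)

    def update(self, pc, actual):
        # Record the non-speculative result for use on the next invocation.
        self.table[pc] = actual


def run_loads(trace, predictor):
    """Replay (pc, actual_value) pairs and count prediction outcomes.

    Returns (correct, wrong, no_guess). A wrong guess stands for a case in
    which speculative results would be flushed and the load re-executed.
    """
    correct = wrong = no_guess = 0
    for pc, actual in trace:
        guess = predictor.predict(pc)
        if guess is None:
            no_guess += 1
        elif guess == actual:
            correct += 1
        else:
            wrong += 1
        predictor.update(pc, actual)
    return correct, wrong, no_guess


if __name__ == "__main__":
    # Hypothetical trace: one load whose value changes once, one constant load.
    trace = [(0x40, 7), (0x40, 7), (0x40, 7), (0x40, 9), (0x40, 9),
             (0x44, 1), (0x44, 1)]
    print(run_loads(trace, LastValuePredictor()))
```

In this sketch, each correct guess corresponds to a case in which a dependent instruction could proceed without waiting the N stall cycles described above, while each wrong guess corresponds to a case requiring corrective action.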
Therefore, there is a need for a system that predicts values associated with instructions that are executed in a pipeline.