Modern microprocessors are pipelined microprocessors. That is, they operate on several instructions at the same time, within different blocks or pipeline stages of the microprocessor. Hennessy and Patterson define pipelining as “an implementation technique whereby multiple instructions are overlapped in execution.” (Computer Architecture: A Quantitative Approach, 2nd edition, by John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, San Francisco, Calif., 1996.) They go on to provide the following excellent illustration of pipelining:
“A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, though on a different car. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of the different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe—instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.”
Synchronous microprocessors operate according to clock cycles. Typically, an instruction passes from one stage of the microprocessor pipeline to another each clock cycle. In an automobile assembly line, if the workers in one stage of the line are left standing idle because they do not have a car to work on, then the production, or performance, of the line is diminished. Similarly, if a microprocessor stage is idle during a clock cycle because it does not have an instruction to operate on—a situation commonly referred to as a pipeline bubble—then the performance of the processor is diminished.
A potential cause of pipeline bubbles is branch instructions. When a branch instruction is encountered, the processor must determine the target address of the branch instruction and begin fetching instructions at the target address rather than the next sequential address after the branch instruction. Furthermore, if the branch instruction is a conditional branch instruction (i.e., a branch that may be taken or not taken depending upon the presence or absence of a specified condition), the processor must decide whether the branch instruction will be taken, in addition to determining the target address. Because the pipeline stages that ultimately resolve the target address and/or branch outcome (i.e., whether the branch will be taken or not taken) are typically well below the stages that fetch the instructions, bubbles may be created.
To address this problem, modern microprocessors typically employ branch prediction mechanisms to predict the target address and branch outcome early in the pipeline. An example of a branch prediction mechanism is a branch target address cache (BTAC) that predicts the branch outcome and target address in parallel with instruction fetches from an instruction cache of the microprocessor. When the microprocessor executes a branch instruction and definitively resolves that the branch is taken and what its target address is, the address of the branch instruction and its target address are written into the BTAC. The next time the branch instruction is fetched from the instruction cache, the branch instruction address hits in the BTAC and the BTAC supplies the branch instruction target address early in the pipeline.
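The write-on-resolve, hit-on-refetch behavior described above can be illustrated with a minimal sketch. It assumes a direct-mapped organization indexed by the low bits of the branch address; the class name Btac, its methods, the entry count, and the example addresses are all hypothetical and are not taken from any particular microprocessor.

```python
class Btac:
    """Toy direct-mapped BTAC (assumed organization, for illustration only).

    Each entry holds a (branch address, target address) pair.
    """

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, addr):
        # Assumption: the entry is selected by the address modulo the entry count.
        return addr % self.num_entries

    def lookup(self, fetch_addr):
        """Return the predicted target address on a hit, or None on a miss."""
        entry = self.entries[self._index(fetch_addr)]
        if entry is not None and entry[0] == fetch_addr:
            return entry[1]
        return None

    def update(self, branch_addr, target_addr):
        """Record a branch that was resolved as taken, with its target."""
        self.entries[self._index(branch_addr)] = (branch_addr, target_addr)


btac = Btac()
first = btac.lookup(0x1000)    # first fetch of the branch: nothing cached yet
btac.update(0x1000, 0x2000)    # branch resolves as taken; target is written
second = btac.lookup(0x1000)   # next fetch of the same branch: hit
```

In this sketch the first lookup misses and the second supplies the cached target, which corresponds to the BTAC providing the target address early in the pipeline on the next fetch of the branch.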
An effective BTAC improves processor performance by potentially eliminating or reducing the number of bubbles that would otherwise be suffered waiting for the branch instruction to be resolved. However, when the BTAC makes an incorrect prediction, portions of the pipeline having incorrectly fetched instructions must be flushed, and the correct instructions must be fetched, which introduces bubbles into the pipeline while the flushing and fetching occur. As microprocessor pipelines get deeper, the effectiveness of the BTAC becomes more critical to performance.
The effectiveness of the BTAC is largely a function of the hit rate of the BTAC. One factor that affects the BTAC hit rate is the number of different branch instructions for which it stores target addresses. The more branch instruction target addresses stored, the more effective the BTAC is. However, there is always limited area on a microprocessor die and therefore pressure to make the size of a given functional block, such as a BTAC, as small as possible. A factor that affects the physical size of the BTAC is the size of the storage cells that store the target addresses and related information within the BTAC. In particular, a single-ported cell is generally smaller than a multi-ported cell. A BTAC composed of single-ported cells can only be read or written, but not both, during a given clock cycle, whereas a BTAC composed of multi-ported cells can be read and written simultaneously during a given clock cycle. However, a multi-ported BTAC will be physically larger than a single-ported BTAC. This may mean, assuming a given physical size allowance for the BTAC, that the number of target addresses that can be stored in a multi-ported BTAC must be smaller than the number of target addresses that could be stored in a single-ported BTAC, thereby reducing the effectiveness of the BTAC. Thus, a single-ported BTAC is preferable in this respect.
However, the fact that a single-ported BTAC can only be read or written, but not both, during a given clock cycle may reduce the BTAC effectiveness due to false misses. A false miss occurs when a single-ported BTAC is being written, such as to update the BTAC with a new target address or to invalidate a target address, during a cycle in which the BTAC needs to be read. In this case, the BTAC must generate a miss to the read, since it cannot supply the target address, which may be present in the BTAC, because the BTAC is currently being written.
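A false miss can be illustrated with a small cycle-level sketch. It assumes that a write always wins the single port and that any read arriving in the same cycle is reported as a miss, as described above; the class SinglePortedBtac, its cycle method, and the false_misses counter are hypothetical names introduced only for this illustration.

```python
class SinglePortedBtac:
    """Toy single-ported BTAC: one read OR one write per clock cycle."""

    def __init__(self):
        self.entries = {}       # branch address -> target address
        self.false_misses = 0   # reads that missed only because the port was busy

    def cycle(self, read_addr=None, write=None):
        """Model one clock cycle; return the target address read, or None.

        Assumption: a pending write takes the port, so a read requested
        in the same cycle must be reported as a miss.
        """
        if write is not None:
            branch_addr, target_addr = write
            if read_addr is not None and read_addr in self.entries:
                # The target was present but could not be supplied.
                self.false_misses += 1
            self.entries[branch_addr] = target_addr
            return None
        if read_addr is not None:
            return self.entries.get(read_addr)
        return None


btac = SinglePortedBtac()
btac.cycle(write=(0x1000, 0x2000))                        # cache branch at 0x1000
hit = btac.cycle(read_addr=0x1000)                        # normal hit
blocked = btac.cycle(read_addr=0x1000, write=(0x3000, 0x4000))  # port busy
later = btac.cycle(read_addr=0x1000)                      # entry was there all along
```

Here the third cycle returns a miss even though the target for the branch at 0x1000 is cached, and the following cycle shows that the entry was present throughout.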
Therefore, what is needed is a method and apparatus for reducing false misses in a single-ported BTAC.
Another phenomenon that can reduce the effectiveness of a BTAC is a condition in which the BTAC is storing a target address for the same branch instruction multiple times. This phenomenon can occur in a multi-way set-associative BTAC. Because BTAC space is limited, this redundant storage of target addresses reduces BTAC effectiveness because the redundant BTAC entries could instead be storing target addresses of other branch instructions. The longer the pipeline, i.e., the greater the number of stages, the greater the likelihood that redundant target addresses will get stored in a BTAC.
The most common situation in which the same branch instruction gets cached multiple times in the BTAC is in a tight loop of code. A branch instruction is executed a first time and its target address is written into the BTAC, for example, to way 2 since way 2 is the least recently used way. However, before the target address is written into the BTAC, the branch instruction is encountered again, i.e., the BTAC looks up the instruction cache fetch address, which misses since the target address has not yet been written into the BTAC. Consequently, the target address is written a second time into the BTAC. If an intervening BTAC read of a different branch instruction in the set causes way 2 to no longer be the least recently used way, then a different way, for example way 1, is selected for the second write of the target address. Now the target address for the same branch instruction is present in the BTAC twice. This is a waste of BTAC space and reduces the effectiveness of the BTAC since it is highly likely that the second write replaced a valid target address of another branch instruction.
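The tight-loop scenario can be reproduced with a toy simulation of a single four-way set. The sketch makes several simplifying assumptions that are not drawn from any actual design: the loop branch is fetched once per cycle, a resolved branch is written write_latency cycles after its lookup misses, the victim way is the least recently used way at the time of the write, and both hits and writes make a way most recently used. The names BtacSet and run_tight_loop are hypothetical.

```python
from collections import deque

NUM_WAYS = 4  # assumed associativity


class BtacSet:
    """One set of a toy set-associative BTAC with LRU replacement."""

    def __init__(self):
        self.ways = [None] * NUM_WAYS            # (branch_addr, target_addr) or None
        self.lru_order = list(range(NUM_WAYS))   # front = least recently used

    def _touch(self, way):
        # Mark the given way as most recently used.
        self.lru_order.remove(way)
        self.lru_order.append(way)

    def lookup(self, branch_addr):
        for way, entry in enumerate(self.ways):
            if entry is not None and entry[0] == branch_addr:
                self._touch(way)
                return entry[1]
        return None

    def write(self, branch_addr, target_addr):
        # No check for an existing copy: this is what allows duplicates.
        way = self.lru_order[0]
        self.ways[way] = (branch_addr, target_addr)
        self._touch(way)
        return way


def run_tight_loop(write_latency, iterations):
    """Return how many ways end up holding the same loop branch."""
    btac_set = BtacSet()
    pending = deque()  # (cycle at which the write lands, branch_addr, target_addr)
    branch_addr, target_addr = 0x1000, 0x0FF0
    for cycle in range(iterations):
        # Apply writes whose latency has elapsed.
        while pending and pending[0][0] <= cycle:
            _, b, t = pending.popleft()
            btac_set.write(b, t)
        # The loop branch is fetched again; a miss schedules another write.
        if btac_set.lookup(branch_addr) is None:
            pending.append((cycle + write_latency, branch_addr, target_addr))
    while pending:  # drain any writes still in flight
        _, b, t = pending.popleft()
        btac_set.write(b, t)
    return sum(1 for e in btac_set.ways if e is not None and e[0] == branch_addr)


copies_short = run_tight_loop(write_latency=1, iterations=6)
copies_long = run_tight_loop(write_latency=3, iterations=6)
```

Under these assumptions, a one-cycle write latency leaves a single copy of the branch in the set, while a three-cycle latency leaves three copies, which is consistent with the observation that deeper pipelines make redundant entries more likely.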
Therefore, what is needed is a method and apparatus for avoiding the waste of valuable BTAC space caused by redundant caching of a target address for the same branch instruction.
Furthermore, a certain combination of conditions associated with the speculative nature of a BTAC can cause a deadlock situation in the microprocessor. The combination of BTAC speculative branch predictions, a branch instruction that wraps across an instruction cache line boundary, and the fact that processor bus transactions for speculative instruction fetches can cause error conditions, can result in deadlock in certain cases.
Therefore, what is needed is a method and apparatus for avoiding a deadlock condition in a microprocessor employing a speculative BTAC.