To process an instruction, a processor may pass the instruction through several stages: fetch (to get the instruction), decode (to break the instruction down into the operation and the operands, e.g., Operand A plus Operand B), operand retrieval from the register file, execute, and write back (to store the result, e.g., the sum of Operand A and Operand B). Before pipelining, a processor would complete all of the stages for one instruction before proceeding to the next instruction. To increase computing speed, pipelines were implemented in processors to separate the stages of processing an instruction. Thus, one stage of processing an instruction may be performed while another stage of processing a subsequent instruction is performed during the same clock cycle. For example, while a first instruction is decoded during a first clock cycle, a second instruction may be fetched during the same clock cycle. Then, during a second clock cycle, while operands are being retrieved from the register file for the first instruction, the second instruction may be decoded and a third instruction may be fetched. The simultaneous processing of multiple instructions through pipelining may increase the computing speed of the processor.
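The overlap described above can be illustrated with a minimal sketch. It assumes an idealized five-stage pipeline with no stalls, one instruction fetched per clock cycle, and a cycle numbering in which cycle 0 is the cycle that fetches the first instruction. The function names and the stage labels are hypothetical and chosen only for illustration.

```python
# Stage names follow the five stages listed in the text (labels are illustrative).
STAGES = ["fetch", "decode", "read operands", "execute", "write back"]


def unpipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed when each instruction finishes every stage before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed in an ideal pipeline: fill the pipeline once, then retire one per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def stage_occupancy(num_instructions, cycle):
    """Return {instruction_number: stage_name} for one clock cycle.

    Instructions are numbered from 1. Cycle 0 is the cycle in which
    instruction 1 is fetched (an assumption of this sketch).
    """
    active = {}
    for instr in range(1, num_instructions + 1):
        stage_index = cycle - (instr - 1)  # instruction k enters fetch in cycle k - 1
        if 0 <= stage_index < len(STAGES):
            active[instr] = STAGES[stage_index]
    return active


if __name__ == "__main__":
    for cycle in (1, 2):
        print(cycle, stage_occupancy(3, cycle))
    print(unpipelined_cycles(10), pipelined_cycles(10))
```

Under these assumptions, `stage_occupancy(3, 1)` shows the first instruction in decode while the second is in fetch, and `stage_occupancy(3, 2)` adds the third instruction in fetch, matching the first and second clock cycles in the example.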
A branch instruction may direct the processor to continue execution at a different position in the program. For example, a fifth instruction that is a branch instruction may cause the processor to jump to a twentieth instruction. With pipelining, though, the processor may begin fetching and decoding subsequent instructions (e.g., instructions six, seven, and eight) before the branch instruction is executed. Thus, if the branch is taken, the partially processed instructions are removed from the pipeline and the instruction branched to is processed instead. Processing instructions that are ultimately discarded costs time and thus reduces processing speed. Such mis-speculated instructions also waste energy.
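The cost of a taken branch can be sketched as follows. The sketch assumes the same idealized five-stage pipeline, that the branch outcome becomes known in the execute stage (stage index 3, counting from 0), and that the processor always fetches sequentially until then. Both function names and the `resolve_stage` parameter are hypothetical.

```python
def wrong_path_instructions(branch_number, resolve_stage=3):
    """Instructions fetched sequentially behind a branch before it resolves.

    resolve_stage is the 0-based pipeline stage in which the branch outcome
    becomes known (3 = execute in the assumed five-stage pipeline).
    """
    return [branch_number + k for k in range(1, resolve_stage + 1)]


def cycles_with_flushes(num_instructions, taken_branches,
                        resolve_stage=3, num_stages=5):
    """Ideal pipelined cycle count plus the cycles lost to flushed instructions.

    num_instructions counts only instructions that complete; each taken
    branch is assumed to discard resolve_stage partially processed
    instructions, costing one cycle apiece.
    """
    if num_instructions == 0:
        return 0
    ideal = num_stages + num_instructions - 1
    return ideal + taken_branches * resolve_stage


if __name__ == "__main__":
    print(wrong_path_instructions(5))
    print(cycles_with_flushes(100, 10))
```

With the fifth instruction as the branch, `wrong_path_instructions(5)` yields instructions six, seven, and eight, the ones discarded if the branch to the twentieth instruction is taken.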
As a result, a processor may include branch prediction logic to predict which instruction is to be fetched after a branch instruction (i.e., to determine whether or not the branch is to be taken). The branch prediction logic reduces the number of times a processor fetches incorrect instructions because of a mispredicted branch. Branch prediction logic may include a branch history table and/or a branch target cache. The branch history table stores some variation of the branch history for each branch instruction to be predicted. The branch history is a record of whether a branch is or is not taken for each execution of the branch instruction. The branch history table may be a table of n rows of two bits each, wherein each of the n rows corresponds to a different branch instruction of the program and the two bits of each row are used by the processor to predict whether or not the branch is to be taken for the branch instruction corresponding to the row. More rows means that more branch instructions may be predicted. The two bits may act as a saturating counter, where 1-1 may mean to predict taking the branch, 0-0 may mean to predict not taking the branch, and 1-0 or 0-1 may mean unsure. The bits are trained by observing the branch history for each branch instruction to be predicted. If the branch is taken, the counter is incremented (until reaching 1-1). If the branch is not taken, the counter is decremented (until reaching 0-0).
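The two-bit counter scheme described above can be sketched in a few lines. Several details below are assumptions not stated in the text: rows are selected by the branch address modulo the number of rows, every counter starts at 0-1, and the "unsure" states are resolved by predicting taken at 1-0 and not taken at 0-1. The class and function names are hypothetical.

```python
class BranchHistoryTable:
    """n rows of two-bit saturating counters (values 0..3, i.e. 0-0 to 1-1)."""

    def __init__(self, num_rows, initial=1):
        self.num_rows = num_rows
        self.counters = [initial] * num_rows  # assumed starting state: 0-1

    def _row(self, branch_address):
        # Assumed indexing: low-order part of the branch address picks the row.
        return branch_address % self.num_rows

    def predict(self, branch_address):
        """Return True to predict taken (assumed policy: counter is 1-0 or 1-1)."""
        return self.counters[self._row(branch_address)] >= 2

    def train(self, branch_address, taken):
        """Increment on a taken branch, decrement otherwise, saturating at 3 and 0."""
        row = self._row(branch_address)
        if taken:
            self.counters[row] = min(3, self.counters[row] + 1)
        else:
            self.counters[row] = max(0, self.counters[row] - 1)


def count_mispredictions(table, branch_address, outcomes):
    """Predict, compare with the actual outcome, then train, for each execution."""
    misses = 0
    for taken in outcomes:
        if table.predict(branch_address) != taken:
            misses += 1
        table.train(branch_address, taken)
    return misses


if __name__ == "__main__":
    bht = BranchHistoryTable(num_rows=16)
    loop_branch = [True] * 9 + [False]  # taken nine times, then falls through
    print(count_mispredictions(bht, 0x40, loop_branch))
    print(count_mispredictions(bht, 0x40, loop_branch))
```

One reason a two-bit counter is used rather than a single bit is visible in the usage example: a single not-taken outcome at the end of a loop only moves the counter from 1-1 to 1-0, so the next pass through the loop is still predicted as taken from its first iteration.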
The branch target cache stores the destination for each branch instruction to be predicted. In one embodiment, the branch target cache may store destinations for a number of branch instructions equal to the number of rows or registers of the branch target cache (e.g., one row or register per branch target). For each such branch instruction, the branch target cache may store the address of the instruction to which the branch points. One problem with including a branch history table and/or a branch target cache in a processor is that the branch history table and the branch target cache are additional logic, thus increasing the area and the power consumption of the processor.
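A branch target cache with one destination per row can be sketched as below. The direct-mapped organization, the modulo indexing, and the storing of the full branch address alongside the target (to detect when two branches share a row) are assumptions made for illustration. The class name and its methods are hypothetical.

```python
class BranchTargetCache:
    """Fixed number of rows, each holding at most one branch's destination."""

    def __init__(self, num_rows):
        self.num_rows = num_rows
        # Each row is either None or a (branch_address, target_address) pair.
        self.rows = [None] * num_rows

    def lookup(self, branch_address):
        """Return the stored target address, or None if this branch has no entry."""
        entry = self.rows[branch_address % self.num_rows]
        if entry is not None and entry[0] == branch_address:
            return entry[1]
        return None

    def update(self, branch_address, target_address):
        """Record the destination of a branch, replacing whatever shared its row."""
        self.rows[branch_address % self.num_rows] = (branch_address, target_address)


if __name__ == "__main__":
    btc = BranchTargetCache(num_rows=4)
    btc.update(5, 20)      # the fifth instruction branches to the twentieth
    print(btc.lookup(5))
    btc.update(9, 30)      # 9 maps to the same row as 5 and evicts it
    print(btc.lookup(5), btc.lookup(9))
```

The eviction in the usage example illustrates the trade-off noted above: predicting more branches without conflicts requires more rows, and each added row adds storage, and therefore area and power, to the processor.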