In high-performance processors it is common practice to decompose an instruction into several steps, each performed by a different step-processing unit. Each such unit can accept a specific step for successive instructions every cycle. The successive steps of one instruction are thereby overlapped, cycle by cycle, with those of each following instruction at a one-cycle offset. Ideally, this allows one instruction to be handled each cycle even though any given instruction takes several cycles to complete.
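The ideal overlap can be illustrated with a small calculation. The following is a minimal sketch, assuming every step takes exactly one cycle and nothing interrupts the overlap; the function names are illustrative and do not come from any cited patent.

```python
def sequential_cycles(num_instructions: int, num_steps: int) -> int:
    """Cycles needed when each instruction finishes before the next starts."""
    return num_instructions * num_steps


def overlapped_cycles(num_instructions: int, num_steps: int) -> int:
    """Cycles needed with ideal overlap at a one-cycle offset.

    The first instruction takes num_steps cycles; each later instruction
    completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_steps + (num_instructions - 1)


if __name__ == "__main__":
    # 100 instructions of 5 steps each (both numbers are arbitrary examples).
    print(sequential_cycles(100, 5))
    print(overlapped_cycles(100, 5))
```

Under these assumptions the overlapped cost approaches one cycle per instruction as the instruction count grows, which is the ideal that branches disturb.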
This ideal overlap is not always possible for several reasons. A major reason is the frequent occurrence of branch instructions. These have two significant attributes: the branch may or may not be taken, introducing a temporary uncertainty as to which of two instructions is next; and if it is taken, the next instruction must be obtained from an address usually specified in the branch.
A number of patents are directed to branch prediction mechanisms, each having certain advantages and disadvantages. For example, U.S. Pat. No. 4,370,711 to Smith discloses a branch predictor for predicting in advance the result of a conditional branch instruction in a computer system. The principle upon which the system is based is that a conditional branch instruction is likely to be decided in the same way as on the instruction's most recent executions.
U.S. Pat. No. 4,251,864 to Kindell et al. discloses a branch predictor for manipulation of signal groups having boundaries not coinciding with boundaries of signal group storage space. When a word containing an operand boundary is transferred to the central processing unit, non-operand data is also transferred with the word. The non-operand data occurring in the boundary word is removed from the operand signal group and stored in the central processing unit. After manipulation of the operand by the central processing unit, the non-operand data is reinserted in the boundary words in the signal position previously occupied and the word group containing the manipulation or the resulting operand is stored in the memory location from which it was originally removed.
U.S. Pat. No. 3,800,291 to Cocke et al. describes a branch prediction mechanism in which branch instructions may branch to the address of information on the same page or on another page. The branch instruction includes an indicator as to whether the branch address is a physical address on the same or another page, or a virtual address on another page.
U.S. Pat. No. 4,181,942 to Forster et al. discloses a program branching method and apparatus in which a special branch instruction used in a computing system serves as a conditional branch or as a non-conditional branch as determined by the state of an internal register. This special branch instruction is used for conditional branching within or at the end of a program loop and for unconditional branching outside of such a loop.
U.S. Pat. No. 3,325,785 to Stephens sets forth a branch prediction mechanism which efficiently utilizes control storage and its access controls.
A simple strategy for handling branches is to suspend overlap until the branch is fully completed: resolved as taken or not taken and, if taken, the target instruction fetched from memory. However, this strategy loses several cycles per branch from the ideal overlap. Another strategy is to make a fixed choice based on the type of branch and statistical experience as to whether the branch will be taken. When the choice indicates not taken, normal overlap is continued on a conditional basis pending the actual outcome. If the choice proves wrong, the conditionally initiated instructions are abandoned and the target instruction is fetched. The cycles devoted to the conditional instructions are lost, as well as the cycles to fetch the target. However, the latter loss is often avoided by prefetching the target at the time the branch is decoded.
A more effective strategy is embodied in U.S. Pat. No. 3,559,183 to Sussenguth, which patent is assigned to the assignee of the present invention. It is based on the observation that most branches, considered individually, are consistently either taken or not taken and, if taken, will have a consistent target address. In this strategy a table of taken branches is constructed. Each entry in the table consists of the address of the taken branch followed by the target address of the branch. This table is a hardware construct and so it has a predetermined size, typically from 1024 to 4096 entries. Entries are made only for taken branches as they are encountered. When the table is full, making a new entry requires displacing an older entry. This can be accomplished on a Least Recently Used (LRU) basis, as in caches.
In principle, each branch in the stream of instructions being executed is looked up in the table by its address, and if it is found, its target is fetched and becomes the next instruction in the stream. If the branch is not in the table, it is presumed not taken. All actions based on the table are checked as instruction execution proceeds. If the table is found to be wrong, corrections are made. If a branch predicted to be taken is not taken, the table entry is deleted. If a branch predicted not taken is taken, a new entry is made for it. If the predicted target address is wrong, the corrected address is entered.
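The table of taken branches and its correction rules can be sketched in software as follows. This is a minimal illustration only, assuming a fully associative table with LRU displacement; the class name `BranchHistoryTable` and its methods are hypothetical and are not taken from the cited patent, which describes a hardware construct.

```python
from collections import OrderedDict
from typing import Optional


class BranchHistoryTable:
    """Software model of a table holding taken branches only (illustrative)."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        # Maps branch address -> target address, ordered oldest-used first.
        self._entries: "OrderedDict[int, int]" = OrderedDict()

    def predict(self, branch_addr: int) -> Optional[int]:
        """Return the predicted target, or None meaning 'presumed not taken'."""
        target = self._entries.get(branch_addr)
        if target is not None:
            self._entries.move_to_end(branch_addr)  # mark as recently used
        return target

    def update(self, branch_addr: int, taken: bool,
               actual_target: Optional[int] = None) -> None:
        """Apply the correction rules once the branch outcome is known."""
        if not taken:
            # Predicted taken but not taken: delete the entry (no-op if absent).
            self._entries.pop(branch_addr, None)
            return
        if actual_target is None:
            raise ValueError("a taken branch needs its target address")
        if branch_addr not in self._entries and len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)  # displace least recently used
        # New entry for a newly taken branch, or corrected target address.
        self._entries[branch_addr] = actual_target
        self._entries.move_to_end(branch_addr)
```

In this model a lookup miss stands for the "not taken" presumption, so no storage is spent on branches that are never taken.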
In practice, the foregoing is modified slightly. It is desirable to find taken branches early enough so that the target can be fetched before or at least as soon as it is needed, so that no delay will occur in the pipeline. This condition is usually not met if the table is accessed only after a branch is located and identified. Therefore, the table is usually organized and addressed on the basis of the instruction fetching packet of the machine. Currently, this packet is a double word (DW). The practical procedure is then as follows. When the machine fetches a double word into its instruction buffer, the DW address is also supplied to the table. If an entry exists the target (DW) is fetched as soon as cache priority permits. In turn, this target DW is supplied to the table, continuing the process.
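The double-word procedure can be sketched as a loop in which each fetched DW address is supplied to the table and any hit redirects the next fetch. This is a rough model under stated assumptions: a DW is taken to be 8 bytes, a table miss is assumed to mean the next sequential DW is fetched, and cache priority is ignored. The names `dw_address` and `fetch_sequence` are hypothetical.

```python
from typing import Dict, List

DW_BYTES = 8  # assumption: a double word is 8 bytes


def dw_address(addr: int) -> int:
    """Round an instruction address down to its double-word boundary."""
    return addr & ~(DW_BYTES - 1)


def fetch_sequence(start_addr: int, dw_table: Dict[int, int],
                   max_fetches: int) -> List[int]:
    """Return the DW addresses fetched, in order.

    dw_table maps the DW address containing a taken branch to the
    address of that branch's target.
    """
    fetched: List[int] = []
    dw = dw_address(start_addr)
    for _ in range(max_fetches):
        fetched.append(dw)
        target = dw_table.get(dw)
        if target is not None:
            dw = dw_address(target)   # table hit: fetch the target DW next
        else:
            dw = dw + DW_BYTES        # assumed: miss means sequential fetch
    return fetched
```

Because the lookup is keyed by the fetch address rather than by a decoded branch, the target can be requested before the branch itself has been located in the instruction buffer.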
The prior art described above is called a Branch History Table (BHT). It handles a great majority of branches successfully, but there is a penalty of several cycles when it is wrong. For practical sizes of the table (say 256 entries or approximately 2K bytes) this penalty almost offsets the gain from its use. Although a larger table (4K entries or approximately 32K bytes) would reduce the percentage of wrong predictions, and hence the penalty, the problem is that the table hardware must be packaged in the speed-critical instruction fetch and preparation area of the machine. It would be important to reduce rather than increase the table hardware in this area, because the more hardware that must be put in the area, the longer the wiring distances and the greater the number of logic delays which must be reckoned in the critical paths determining the cycle time. These would, of course, lengthen the cycle time, and a longer cycle time works constantly to decrease machine speed. Few organizational improvements, the BHT included, are good enough to offset much of an increase in cycle time which they may cause. We seek, therefore, improvements which will not place more hardware in the critical area.
According to the present invention, a Pageable Branch History Table (PBHT) is described, which does not add hardware to the critical area and in fact reduces it. A superficial analogy can be drawn to the relation between a cache memory and main memory. Let the full BHT be held in main memory (as it would be) and let the PBHT be the cache. Only the small cache (PBHT) must be in the speed-critical area; the main memory (full BHT) can be elsewhere. Note, importantly, that the full BHT is no longer limited in size by hardware or cycle time considerations. It can be as large as provides a useful advantage.
However, there are two things which distinguish the PBHT from the superficial cache analogy. First, the contents of the PBHT are not based on recency of past reference, as with a cache, but on a relative certainty of future use. The PBHT control mechanism utilizes the information maintained within the larger BHT concerning branch action to fetch the relevant information about future branches on a timely basis. The processor requires fast access only to those branches which are in close (logical) proximity to the current instruction being processed. This represents only a small fraction of all the branching information contained in the full BHT. The full BHT is required because its ability to maintain information about many branches assures its high accuracy. The PBHT is managed with this information to provide a fast access to the small subset which is immediately relevant.
The second aspect of the PBHT which distinguishes it from a cache is the manner of its autonomous control. Unlike a cache which is driven only by processor activity and maintains its relevance based on pure chance, the PBHT actively manages its contents independently of the processor and assures its relevance. The PBHT maintains its own relevance by constantly fetching into itself the next branches which the processor can encounter.
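The contrast with a recency-driven cache can be made concrete with a rough sketch. The text above does not specify how the PBHT selects its contents, so the policy below is purely an assumption for illustration: starting from the current DW, the predicted path recorded in the full BHT is followed and the first few branch entries found along it are loaded. The function name `refill_pbht` and its parameters are hypothetical.

```python
from typing import Dict


def refill_pbht(full_bht: Dict[int, int], current_dw: int, pbht_size: int,
                max_scan: int = 64, dw_bytes: int = 8) -> Dict[int, int]:
    """Select the entries of full_bht that lie ahead on the predicted path.

    full_bht maps a DW address containing a taken branch to its target
    DW address. At most pbht_size entries are returned, and at most
    max_scan DW addresses are examined.
    """
    pbht: Dict[int, int] = {}
    dw = current_dw
    for _ in range(max_scan):
        if len(pbht) >= pbht_size:
            break
        target = full_bht.get(dw)
        if target is None:
            dw += dw_bytes            # no branch recorded here: go sequentially
        elif dw in pbht:
            break                     # predicted path has closed into a loop
        else:
            pbht[dw] = target         # branch lies ahead: load it
            dw = target               # continue along the predicted path
    return pbht
```

Entries are chosen here by where execution is predicted to go next, not by what was referenced most recently, so a branch never seen in the recent past can still be present when it is needed.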