1. Field of the Invention
The invention relates in general to caches in a data processing system and more particularly to an apparatus for improving cache performance by using a branch history table to prefetch for the cache.
2. Description of the Prior Art
Since pipelining was first used in high performance computer processors, it has been known that a primary inhibitor to high performance is the presence of conditional branches in the code to be processed. Prior to the use of pipelining, operation of a digital computer could be most conveniently described by the following sequence of six events:
1. Fetch an instruction;
2. Decode the instruction;
3. Access the relevant registers and perform an appropriate calculation to determine the address of an input operand;
4. Fetch the relevant input operand;
5. Perform the operation as specified by the instruction using the input operand;
6. Store the results of the operation.
A nonpipelined machine performs these steps sequentially, then fetches the next instruction and repeats the sequence.
In a pipelined machine, each of these steps is physically implemented as a separate hardware unit, and the units operate in parallel. All units operate concurrently on a sequence of instructions as follows: while one instruction is storing its final results, the next instruction performs a functional operation, while the next instruction fetches an input operand, while the next instruction calculates an operand address, while the next instruction is decoded, while the next instruction is fetched from memory. As long as the sequence in which the instructions are to be processed is known, it is possible to fetch instructions and then send them into the pipeline at the rate of one per cycle. Thus, instead of requiring a theoretical six cycles per instruction, each instruction can be executed in what amounts to one machine cycle. This scenario, while very effective for strictly sequential instructions, is less effective if the instructions are not executed in sequence. Out-of-sequence operation is caused by branch instructions. A branch instruction allows a computer to make decisions based on data available to it, and in fact this capability is responsible for much of the power of modern-day computers. The effect of a taken branch is to discontinue the decoding of instructions that sequentially follow the branch, and to begin decoding instructions from another location in memory. A conditional branch is a branch instruction that performs this transfer of instruction execution (the branch) only if some condition is satisfied (typically, a condition determined by the outcome of the instruction that immediately precedes the branch instruction).
Thus, branch instructions disrupt the smooth sequencing of pipeline operation, since at the time they are decoded, it is generally unknown whether they will be taken and where they will be taken to (the target). Various attempts have been made in the prior art to reduce the penalty inflicted upon a pipelined machine due to branches. The present invention uses the information provided by a branch prediction mechanism to reduce the effects of what are known as cache misses.
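By way of illustration only, the cycle counts discussed above can be sketched as follows. The sketch assumes that each of the six events takes exactly one cycle and that the pipeline never stalls except for taken branches. The parameters `taken_branches` and `refill_penalty` are hypothetical and are not values taken from any particular machine.

```python
STAGES = 6  # the six events listed in the text


def nonpipelined_cycles(n_instructions, stages=STAGES):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages=STAGES,
                     taken_branches=0, refill_penalty=0):
    """Ideal pipeline: fill once, then complete one instruction per cycle.

    refill_penalty is an assumed number of lost cycles per taken branch.
    """
    if n_instructions == 0:
        return 0
    fill = stages                  # cycles until the first instruction completes
    steady = n_instructions - 1    # one further completion per cycle
    return fill + steady + taken_branches * refill_penalty


if __name__ == "__main__":
    print(nonpipelined_cycles(1000))   # 6000
    print(pipelined_cycles(1000))      # 1005
    print(pipelined_cycles(1000, taken_branches=200, refill_penalty=3))  # 1605
```

Under these assumptions, 1000 sequential instructions complete in 1005 cycles rather than 6000, and each taken branch adds its assumed penalty to the total.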
A cache is a local memory that is typically much smaller and much faster than the main memory of the computer. Virtually all high-performance digital computers use a cache, and even some commercially available microprocessors have local caches. The use of the cache is partially motivated by the use of pipelining, since the splitting of a serial processor into several parallel pipeline stages causes the processing time per stage to be significantly less than the overall processing time of a single instruction. Thus, a pipelined machine necessarily has a much shorter cycle time. Therefore, a memory with a fixed access time requires more of the shorter pipeline cycles to perform an access than it requires of the longer nonpipelined cycles. A side effect of pipelining a machine is to make the access time to memory appear larger when measured in processor cycles.
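The effect of a shorter cycle time on the apparent memory access time can be illustrated with a short calculation. The access and cycle times used below are hypothetical figures chosen only to show the arithmetic.

```python
def cycles_per_access(access_time_ns, cycle_time_ns):
    """Number of whole processor cycles needed to cover one memory access."""
    # Ceiling division on integers: a partially used cycle still counts.
    return -(-access_time_ns // cycle_time_ns)


if __name__ == "__main__":
    ACCESS_NS = 300  # assumed fixed main-memory access time
    print(cycles_per_access(ACCESS_NS, 300))  # nonpipelined cycle of 300 ns: 1
    print(cycles_per_access(ACCESS_NS, 50))   # pipelined cycle of 50 ns: 6
```

The memory itself is unchanged in both cases; only the unit in which its access time is measured has become smaller.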
Caches were developed because it was not possible to build extremely large memories at a reasonable cost that could operate at a cycle time commensurate with the pipeline. It was, however, possible to build small memories that could keep up with the pipeline. In a cache, the processor stores the items (both instructions and data) that it has referenced most recently. If the processor references the items again, then it can obtain the items from the cache at the pipeline cycle time. These references are called cache hits. When the processor references an item that is not in the cache, then the processor must obtain the item from the considerably slower main memory. Such a reference is known as a cache miss.
It has been observed that items (either instructions or data), once referred to, are likely to be referred to again in the near future. This property is known as the "temporal locality of reference," and it is the rationale for keeping the most recently referenced items in the cache. It is also heuristically observable that if an item is referenced, then other items that are physically close to the referenced item are also likely to be referenced. This second property is known as "spatial locality of reference," and it is the rationale for organizing the cache into lines, which are blocks of contiguous items.
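The two locality properties can be illustrated with a minimal sketch of a cache that holds whole lines and retains the most recently referenced ones. The class name `LineCache`, the fully associative organization, the least-recently-used replacement and the default sizes are all assumptions made for illustration; the text above does not specify any of them.

```python
from collections import OrderedDict


class LineCache:
    """Toy fully associative cache of lines with least-recently-used replacement."""

    def __init__(self, num_lines=4, line_size=64):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # line number -> True, oldest first
        self.hits = 0
        self.misses = 0

    def reference(self, byte_address):
        """Reference one byte; return True on a hit, False on a miss."""
        line = byte_address // self.line_size
        if line in self.lines:
            self.lines.move_to_end(line)    # temporal locality: keep recent lines
            self.hits += 1
            return True
        self.misses += 1
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True             # spatial locality: fetch the whole line
        return False


if __name__ == "__main__":
    cache = LineCache()
    for address in range(256):      # 256 sequential byte references
        cache.reference(address)
    print(cache.misses, cache.hits)  # 4 252
```

In this sketch a sequential sweep over 256 bytes misses only once per 64 byte line, because each miss brings in the neighboring items that are referenced next.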
As previously discussed, a typical computer includes a large main memory and a pipelined processor with a local cache. Conceptually, the processor fetches instructions from its cache, decodes them, fetches operands from its cache, and executes the instructions. In practice there is usually an intermediate level of buffering between the cache and the processor that holds the instructions that are to be decoded in the immediate future. This is known as the instruction buffer.
Like the cache (except on a much smaller scale), the instruction buffer holds contiguous blocks of instructions. However, where a cache line may typically be 64 or 128 bytes, a block in the instruction buffer is typically 8 bytes. In mainframe computers, an 8 byte block holds two instructions on average. Therefore, every instruction fetch initiated to the cache results in possibly two useful instructions being transferred to the instruction buffer. This alleviates some of the bandwidth requirement that is placed on the cache, and this is the first essential purpose of the instruction buffer. The second essential purpose of the instruction buffer is to contain controls that allow the blocks to be shifted such that instructions can be transferred to the decode unit of the pipeline in proper alignment.
The two main ways in which the instruction buffer differs from the cache are its small size and its absence of reuse. The size of the instruction buffer is typically in the range of 8 to 32 bytes while the size of the cache is typically 64 k to 256 k bytes. The cache must emulate the main memory in every way and all accesses to the cache are done with real addresses. On the other hand, when blocks of instructions are placed into the instruction buffer, they are only recognized on the basis of virtual addresses. Since the instruction buffer is small, and since it has no directory (to identify the associated real address) items in the buffer are used once and discarded. Thus, when a block of instructions is placed in the instruction buffer, the processor merely decodes the sequential instructions until a branch is reached. If the branch is taken, then a new block of instructions is transferred to the instruction buffer, and decoding resumes.
It should be noted that the term "prefetch" when used in the context of cache prefetching is not the same as "prefetch" when used in the context of fetching instructions into the instruction buffer.
The term prefetch, when used in the context of an instruction buffer, merely means to place instructions in the buffer prior to the time that they will be decoded (although there need not be any doubt that the prefetched instructions will be decoded). The term prefetch, when used in the context of a cache, always means to guess that a particular line in memory will be used, and to transfer that line to the cache prior to the time that it is known with certainty that the line will be referenced by the processor. Therefore, prefetching into an instruction buffer may not require any special mechanisms, i.e., this form of prefetching can be done on a strict demand basis with very little control. Prefetching into the cache, however, almost always requires some type of clever mechanism that initiates fetching based on events that may be only peripherally related to the actual demand fetches.
In a memory hierarchy, such as the one described here, various forms of fetching all take place concurrently. Each of the mechanisms uses its own address registers based on the type of fetching for which the mechanism is designed. The decoding stage in the processor fetches individual instructions from the instruction buffer using what is sometimes called the "Program Counter"(PC). The Program Counter contains the address of the instruction to be decoded.
Instructions are aligned on halfword (2 byte) boundaries, so that a halfword address represents the state of the machine, i.e., the position in the program that the machine has logically assumed. The mechanism that fetches doubleword (8 byte) blocks into the instruction buffer uses what is called the "Buffer Fetch Address." This is a doubleword address that can be generated on a demand basis, or it can be generated by a branch prediction mechanism. Finally, the address used to fetch lines into the cache is generated by the miss facility, either on the basis of demand fetches or on the basis of a cache prefetch mechanism.
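The three address granularities described above can be derived from a single byte address, as the following sketch shows. The halfword and doubleword sizes are those stated in the text; the 128 byte line size is one of the typical values mentioned earlier and is assumed here, and the function names are hypothetical.

```python
HALFWORD = 2     # bytes; instruction alignment
DOUBLEWORD = 8   # bytes; instruction buffer block
LINE_SIZE = 128  # bytes; assumed cache line size


def halfword_address(byte_address):
    """Granularity of the Program Counter."""
    return byte_address // HALFWORD


def doubleword_address(byte_address):
    """Granularity of the Buffer Fetch Address."""
    return byte_address // DOUBLEWORD


def line_address(byte_address):
    """Granularity of the address used by the cache miss facility."""
    return byte_address // LINE_SIZE


def halfword_within_doubleword(byte_address):
    """Position (0 to 3) of an instruction's first halfword inside its block."""
    return (byte_address % DOUBLEWORD) // HALFWORD


if __name__ == "__main__":
    pc = 0x1236
    print(hex(halfword_address(pc)))       # 0x91b
    print(hex(doubleword_address(pc)))     # 0x246
    print(line_address(pc))                # 36
    print(halfword_within_doubleword(pc))  # 3
```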
If there were no such things as branch instructions, there would never be any doubt as to what instruction or block of instructions or cache lines to fetch. The processor would merely fetch sequential instructions from the instruction buffer on a continuous basis, the instruction buffer prefetch mechanism would fetch sequential blocks from the cache on a continuous basis, and the cache miss facility would fetch sequential lines from the main memory on a continuous basis. Therefore, branches are the only source of uncertainty in any of the instruction fetching processes. There are three types of branch prediction mechanisms known in the art. They are described to facilitate an understanding of how each of them is different, with respect to the fetching process.
Two of the mechanisms used are variations on what is called a "decode-time" mechanism. The last mechanism is known as a "fetch-time" mechanism. A decode-time mechanism is a mechanism that predicts the outcome of a branch at the time that the branch is decoded. A fetch-time mechanism is a mechanism that predicts the outcome of a branch at the time that a block of instructions that contains the branch is transferred from the cache to the instruction buffer.
The simplest decode-time mechanism known in the art is known as a "static predictor." This mechanism merely guesses the branch outcome based on the type of branch that has been decoded (i.e., based on static information). As an example, it may guess that all Branch on Count (BCT) instructions will be taken but that no branch on index high (BXH) instructions will be taken. Thus, when the decoder decodes a BXH, the mechanism will cause the decoder to continue decoding sequential instructions, since it automatically guesses that the BXH will not pass control to a different part of the program. When the decoder decodes a BCT, the mechanism will always cause the decoder to stop decoding, since it always guesses that the control will be transferred to a different part of the program.
In the case of the BCT, the instruction first enters the pipeline, performs an address calculation to determine the address of the target instruction and then fetches the target block from the cache into the instruction buffer. The block is aligned in the buffer, and the target instruction is passed to the decoder where the decoding resumes. Note that for correctly predicted not-taken branches, the static predictor prevents the pipeline flow from being disrupted. For taken branches, the flow is disrupted so that the target address can be calculated, and the target instruction can be fetched. Decoding, however, can resume prior to the time that the branch instruction is actually executed. Therefore, in the case of correctly predicted taken branches, the static predictor does save a processor cycle.
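A static predictor of the kind described above amounts to a fixed lookup on the branch type. The sketch below uses only the two examples given in the text (BCT guessed taken, BXH guessed not taken); the table name and the treatment of any other opcode as "no prediction" are assumptions made for illustration.

```python
# Fixed guesses keyed by branch type; only the two examples from the text.
STATIC_GUESS = {
    "BCT": True,   # Branch on Count: always guessed taken
    "BXH": False,  # Branch on Index High: always guessed not taken
}


def static_predict(opcode):
    """Return True (taken), False (not taken), or None if the type is not listed."""
    return STATIC_GUESS.get(opcode)


def decoder_action(opcode):
    """What the decoder does after decoding a branch of the given type."""
    guess = static_predict(opcode)
    if guess is True:
        return "stop decoding; compute target address and fetch target block"
    if guess is False:
        return "continue decoding sequential instructions"
    return "no static guess available"


if __name__ == "__main__":
    for op in ("BCT", "BXH"):
        print(op, "->", decoder_action(op))
```

The guess depends only on the opcode, so the same branch instruction receives the same prediction on every execution regardless of how it actually behaved.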
Other more elaborate decode-time mechanisms known as "Decode History Tables" (DHT) are discussed in Losq et al., U.S. Pat. No. 4,477,872 and "Decode Branch History Table," IBM TDB Vol. 25, No. 5. These use mechanisms that affect the pipeline in exactly the same manner as a static prediction mechanism but additionally use historical information to obtain higher predictive accuracy. A Decode History Table (DHT) is a table of bits, where each bit-entry is accessed with the halfword address (the Program Counter) associated with a branch instruction at the time that the instruction is decoded. The table is used to record the historical information, that is, whether a branch was previously taken or not taken, as the branches are executed. On subsequent execution of these same branches, the table provides information about the last historical outcome and a guess is made that the branch will behave the same way that it did on its previous execution.
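A Decode History Table as described above can be sketched as an array of single bits indexed by the halfword address of the branch. The table size, the modulo indexing (under which different branches may share a bit) and the initial guess of "not taken" are assumptions made for illustration and are not drawn from the cited references.

```python
class DecodeHistoryTable:
    """Toy table of one-bit branch histories, indexed by halfword address."""

    def __init__(self, num_entries=1024):
        self.bits = [False] * num_entries  # False = last outcome was not taken

    def _index(self, branch_byte_address):
        # The Program Counter is a halfword address; fold it into the table.
        return (branch_byte_address >> 1) % len(self.bits)

    def predict(self, branch_byte_address):
        """Guess at decode time that the branch repeats its last outcome."""
        return self.bits[self._index(branch_byte_address)]

    def update(self, branch_byte_address, taken):
        """Record the actual outcome when the branch is executed."""
        self.bits[self._index(branch_byte_address)] = taken


if __name__ == "__main__":
    dht = DecodeHistoryTable()
    print(dht.predict(0x1000))  # False (no history yet)
    dht.update(0x1000, True)
    print(dht.predict(0x1000))  # True
```

Because each entry stores only a direction and not a target, the target address must still be computed after decode, which is the limitation discussed in the following paragraph.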
Although decode-time mechanisms help the pipeline to the extent that they are able to correctly predict a branch, they fail to fully capitalize on taken branches that are guessed correctly. That is, although correctly guessing a branch does save some of the pipeline delay associated with the branch, the pipeline still must wait until the target address is computed, and the target instruction fetched. A fetch-time mechanism is one that predicts the outcome of a branch at the time that the branch instruction is fetched and then immediately fetches the target instruction if it predicts that the branch will be taken. Thus, a fetch-time mechanism eliminates all pipeline delay for taken branches that it predicts correctly.
A fetch-time mechanism is generally known as a "Branch History Table" (BHT). The Branch History Table is discussed in Sussenguth, U.S. Pat. No. 3,559,183, and Guenthner et al., U.S. Pat. No. 4,594,659, where it is called a "Transfer and Indirect Prediction Table." A Branch History Table is similar to a Decode History Table in that it stores past history but differs in two significant ways.
First, since the branch outcome must be predicted at the time that the branch instruction is fetched from the cache into the instruction buffer (i.e., prior to the time that the branch instruction is decoded) it must recognize that there is a branch based on a match with the doubleword Buffer Fetch Address (and not the Program Counter as used with a Decode History Table). The second difference arises from the fact that since the Branch History Table must initiate an instruction fetch for the target instruction prior to the time that the branch instruction is decoded, it must remember the historical target as well as the historical action of the branch.
Hence, the Branch History Table is a table that contains the addresses of the target instructions to branches that were recently taken. The Branch History Table is driven by the Buffer Fetch Address as follows. When a block of instructions is transferred into the instruction buffer, the Buffer Fetch Address is used to access the Branch History Table. If the Branch History Table has an entry for a branch instruction that is contained in that block of instructions, then the target address is supplied by the entry and a transfer of that block of instructions that contains the target instruction is initiated into the instruction buffer. The Branch History Table is then searched using the address of the target instruction and the operation continues.
Therefore, in a Branch History Table driven machine, the instruction buffer, which uses the Buffer Fetch Address, is driven by the Branch History Table, while the decoder, using the Program Counter, runs independently. Note that since transfers into the instruction buffer are done on a doubleword basis at the rate of one doubleword per cycle, and since the Branch History Table must be searched on every such transfer, the Branch History Table must be organized on a doubleword basis although it must recognize the individual halfword and target addresses.
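The manner in which a Branch History Table drives the Buffer Fetch Address can be sketched as follows. The sketch is a simplified illustration under several assumptions: the table is an unbounded dictionary organized by doubleword, an entry is removed when its branch is later found not taken, and a branch located before the point at which a block is entered is ignored. The class and method names are hypothetical.

```python
DOUBLEWORD = 8  # bytes per instruction buffer block


class BranchHistoryTable:
    """Toy table of recently taken branches, organized on a doubleword basis."""

    def __init__(self):
        # doubleword address -> {branch byte address: target byte address}
        self.entries = {}

    def record_taken(self, branch_address, target_address):
        block = self.entries.setdefault(branch_address // DOUBLEWORD, {})
        block[branch_address] = target_address

    def record_not_taken(self, branch_address):
        block = self.entries.get(branch_address // DOUBLEWORD)
        if block is not None:
            block.pop(branch_address, None)
            if not block:
                del self.entries[branch_address // DOUBLEWORD]

    def next_fetch(self, fetch_address):
        """Return the next Buffer Fetch Address as a byte address."""
        dw = fetch_address // DOUBLEWORD
        block = self.entries.get(dw, {})
        # Only branches at or after the entry point can be reached.
        reachable = [(b, t) for b, t in block.items() if b >= fetch_address]
        if reachable:
            _branch, target = min(reachable)  # first predicted-taken branch
            return target                     # redirect fetching to the target
        return (dw + 1) * DOUBLEWORD          # otherwise the next sequential block


if __name__ == "__main__":
    bht = BranchHistoryTable()
    bht.record_taken(0x104, 0x200)
    print(hex(bht.next_fetch(0x100)))  # 0x200
    print(hex(bht.next_fetch(0x106)))  # 0x108
    print(hex(bht.next_fetch(0x200)))  # 0x208
```

The second call illustrates why halfword addresses must be recognized within a doubleword entry: a block entered at 0x106 lies beyond the branch at 0x104, so that branch must not redirect the fetch.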
In U.S. Pat. No. 4,561,052, to Tateno, a simple processing system having a memory, a processing element, and an instruction buffer is described. The memory system is able to service exactly one request on a given cycle, i.e., it is possible to fetch exactly one instruction, or exactly one data operand on a cycle, but not both. Tateno outlines the logic that is required to determine whether a given cycle will be used to fetch an instruction or to fetch data. The logic causes an instruction to be fetched when the instruction buffer is empty or when the currently executing instruction does not require an operand fetch. Tateno uses no cache and there is no mention of branches or of their prediction.
In Potash, U.S. Pat. No. 4,435,756, a decode-time branch prediction is described. There is, however, no discussion of cache misses or Branch History Tables.
In U.S. Pat. No. 4,200,927, to Hughes et al., a prefetching mechanism is described. Hughes describes three instruction buffers and uses static prediction. When a conditional branch is encountered, one buffer receives the instructions along the "fall through" path, and another buffer receives the instructions down the taken-branch path. The static predictor selects one of the two streams to preexecute. If another conditional branch is reached along the selected path, then the static predictor is used to fetch only one of the subsequent paths into the third instruction buffer.
In Ryan et al., U.S. Pat. No. 4,551,799, a split cache is described, i.e., one having separate caches for instructions and data. There is no teaching, however, of using a Branch History Table to aid the cache prefetching.
IBM TDB, Volume 28, No. 8, January 1986, p. 3510, "Prefetching Using a Pageable Branch History Table" teaches use of a Branch History Table to fetch cache lines. In order to do so, it suggests a Branch History Table that carries a target address as well as a line-exiting address. When a branch entry is found in the table, the instruction buffer is controlled by the target address and the line-exiting address is used to prefetch cache lines. When a line is exited (i.e., when the flow of control leaves the line), an exit analysis is performed by unspecified hardware or software and the line-exiting addresses are updated based on the exit analysis for future use. In the subject invention, entry analysis is performed. That is, analysis is performed immediately before or at the time the line is entered. In the subject invention there is no need for a line-exiting address in the Branch History Table, since the analysis is performed on entry, which is the relevant time to prefetch. Further, the article does not teach a mechanism for performing the branch analysis as in the present invention.
IBM TDB, Volume 22, No. 12, May 1980, p. 5539, "Using a Branch History Table to Prefetch Cache Lines." This article teaches that if a Branch History Table generates a target address to a line that is not in the cache, then it is probably appropriate to prefetch the line into the cache. There is, however, no teaching of organizing the Branch History Table on a cache line basis nor of the analysis of all branch entries in a line to anticipate a cache miss.
IBM TDB, Volume 23, No. 2, July 1980, p. 853, "Reducing Cache Misses in a Branch History Table Machine," teaches making use of a cache miss to identify a misprediction on the part of the Branch History Table. The BCR (Branch on Condition Register) is primarily used to return from subroutines. Since subroutines can be called from many different locations, the primary reason that the BCR is mispredicted is that the historical target address is not the current target address. Therefore, in contradistinction to the present invention, the article teaches that if the Branch History Table generates a target address to a line that is not in the cache, and if the underlying branch is a BCR, then the Branch History Table is probably in error and the prefetch should be suppressed.
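The two prior art heuristics described in the May 1980 and July 1980 articles can be contrasted in a single sketch. The function name, the representation of the cache directory as a set of line numbers, the assumed 128 byte line size and the `suppress_bcr` flag are all hypothetical; the sketch illustrates the decision logic only and is not the mechanism of either article or of the present invention.

```python
LINE_SIZE = 128  # bytes; assumed cache line size


def prefetch_decision(target_address, cached_lines,
                      branch_opcode=None, suppress_bcr=False):
    """Return the line number to prefetch, or None if no prefetch is issued.

    cached_lines is a set of line numbers currently held in the cache.
    """
    line = target_address // LINE_SIZE
    if line in cached_lines:
        return None  # target line already present; nothing to prefetch
    if suppress_bcr and branch_opcode == "BCR":
        # July 1980 heuristic: a BCR whose historical target misses the cache
        # is probably a stale subroutine return, so the prefetch is suppressed.
        return None
    # May 1980 heuristic: a predicted target that misses is worth prefetching.
    return line


if __name__ == "__main__":
    cached = {0, 1}
    print(prefetch_decision(0x300, cached))                             # 6
    print(prefetch_decision(0x040, cached))                             # None
    print(prefetch_decision(0x300, cached, "BCR", suppress_bcr=True))   # None
```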
In IBM TDB, Volume 28, No. 4, September 1985, p. 1737, "Using a Small Cache to Hedge for a BHT" a method is proposed for minimizing the penalty associated with mispredictions on the part of the Branch History Table. It was suggested in this article that branch target instructions be fetched into the instruction buffer even when the Branch History Table guessed that a branch would not be taken. In the event that the branch guessed not taken was found to be taken, decoding could begin immediately down the taken path since the target instruction would be in the instruction buffer.
IBM TDB, Volume 22, No. 8A, January 1980, p. 3437, "Dynamic Branch Prediction Using Branch History Table" discusses the relationship between cache behavior and the Branch History Table. For caches that remember more history than a Branch History Table, a series of cache misses indicates that new or unremembered code is beginning to run, and that the predictions generated by the Branch History Table are most probably irrelevant to the new code. Therefore, after a series of cache misses, the predictions made by the Branch History Table are ignored, and both the taken and not-taken branch paths are fetched to hedge against either outcome.
Accordingly, it is an object of the invention to utilize branch history to prefetch for the cache. It is still another object of the invention to analyze all branch entries in a line to anticipate a cache miss.
These and other objects, advantages and features of the invention will be more apparent upon reference to the specification and drawings.