Not Applicable.
Not Applicable.
1. Field of the Invention
The present invention describes a way to conditionally prefetch instructions or data from memory. In particular, a method and apparatus are disclosed for improving the performance of the cache by using branch prediction information to selectively issue prefetches.
2. Description of Related Art
The current state of computer system technologies is such that processor speeds are increasing at a more rapid rate than main memory speeds. This mismatch between processor speed and main memory speed is being masked by including larger and larger random access “buffers” or “caches” between the processor and main memory.
Data is typically moved within the memory hierarchy of a computer system. At the top of this hierarchy is the processor and at the bottom are the I/O storage devices. The processor is connected to one or more caches (random access buffers). One type of cache is an instruction cache for supplying instructions to the processor with minimal delay. Another type of cache is a high-speed buffer for holding data that is likely to be used in the near future.
Either type of cache can be connected either to other caches or to the main memory of the memory hierarchy. When a program is executed, the processor fetches and executes instructions of the program from the main memory (or an instruction cache). This can cause the processor to request a cache entry or modify or overwrite a cache entry and portions of the main memory.
An illustrative data processing system 100 in accordance with the prior art is shown in FIG. 1. The data processing system 100 has a cache which may consist of only a single cache unit or multiple cache units. The cache may be separated into a data cache 145 and an instruction cache 110 so that both instructions and data may be simultaneously provided to the data processing system 100 with minimal delay. The data processing system 100 further includes a main memory 150 in which data and instructions are stored, a memory system interface 105 which allows the instruction cache 110 and data cache 145 to communicate with main memory 150, and an instruction fetch unit 115 for retrieving instructions of an executing program. Further included in the data processing system is a decode and dispatch unit 120 for interpreting instructions retrieved by the instruction fetch unit 115 and communicating the interpreted information to one or more execution units, and a plurality of execution units including a branch unit 125, functional unit 130 and memory unit 140, for using the interpreted information to carry out the instruction. The branch unit 125 is responsible for executing program branches, that is, computing modifications to the program counter as a program is executed. The generic functional unit 130 represents one or more execution units that can perform operations such as addition, subtraction, multiplication, division, shifting and floating point operations with various types of data as required. Typically, a processor will have several execution units to improve performance. In this description all branches are sent to the branch unit 125. All other instructions go to the generic functional unit 130. This configuration is chosen for simplicity and to present an explicit design. Clearly, many other execution unit configurations are used with general or special purpose computing devices. Associated with each execution unit is an execution queue (not shown). The execution queue holds decoded instructions that await execution.
The memory unit 140 is responsible for computing memory addresses specified by a decoded instruction. A register file 135 is also included in the data processing system 100 for temporarily holding data. Of course, other storage structures may be used instead of or in addition to the register file 135, such as those used for dealing with speculative execution and implementation of precise interrupts. A sample register file 135 is described as being illustrative of the storage structures which may be used.
When a program is executed, a program counter or sequence prediction mechanism communicates an instruction address to the instruction fetch 115. The instruction fetch 115, in turn, communicates the instruction address to the instruction cache 110. If the instruction corresponding to the instruction address is already in the instruction cache 110, the instruction cache returns the instruction to the instruction fetch 115. If not, the instruction cache 110 transmits the instruction address to the memory system interface 105. The memory system interface 105 locates the instruction address in main memory 150, and retrieves the instruction stored at that address. The instruction is then delivered to the instruction cache 110, from which it is finally returned to the instruction fetch 115. When the instruction arrives at the instruction fetch 115, it is delivered to the decode and dispatch unit 120 if there is available buffer space within the decode and dispatch unit 120 for holding the instruction. The decode and dispatch unit 120 then decodes information from the delivered instruction, and proceeds to determine if each instruction and associated decoded information can be placed in the execution queue of one of the execution units. The appropriate execution unit receives the instruction and any decoded information from the decode and dispatch unit 120, and then uses the decoded information to access data values in the register file 135 to execute the instruction. After the instruction is executed, the results are written to the register file 135.
In addition to its general function of computing memory addresses, the memory unit 140 is responsible for executing two particular kinds of instructions: load and store.
A load instruction is a request that particular data be retrieved and stored in the register file 135. The memory unit 140 executes a load instruction by sending a request to the data cache 145 for particular data. If the data is in the data cache 145 and is valid, the data cache returns the data to the memory unit. If the data is not in the data cache 145 or is invalid, the data cache 145 accesses a particular data memory address in main memory 150, as indicated by the load instruction, through the memory system interface 105. The data is returned from main memory 150 to the data cache 145, from which it is eventually returned to the memory unit 140. The memory unit 140 stores the data in the register file 135 and possibly passes it to other functional units 130 or to the branch unit 125. A store instruction is a request that data be written to a particular memory address in main memory 150. For stores, a request is sent by the memory unit 140 to the data cache 145 specifying a data memory address and particular data to write to that data memory address. If the data corresponding to the specified data memory address is located in the data cache 145 and has the appropriate access permissions, that data will be overwritten with the particular data specified by the memory unit 140. The data cache 145 then accesses the specified memory address in main memory 150, through the memory system interface 105, and writes the data to that address.
Focusing on the cache, which may be a data cache 145, an instruction cache 110, or a combined cache, the cache is repeatedly queried for the presence or absence of data during the execution of a program. Specifically, the data cache 145 is queried by the memory unit 140 regardless of whether the memory unit 140 executes a load or store instruction. Similarly, the instruction cache 110 is repeatedly queried by the instruction fetch 115 for a particular instruction.
A cache has many “blocks” which individually store the various instructions and data values. The blocks in a cache are divided into one or more groups of blocks called “congruence classes”. For any given memory block there is a unique congruence class in the cache into which the block can be mapped, according to preset mapping functions. The number of blocks in a congruence class is called the associativity of the cache, e.g., 2-way set associative means that, for any given memory block, there are two blocks in the cache into which the memory block can be mapped; however, several different blocks in the next level of memory can be mapped to any given congruence class.
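The mapping of memory blocks to congruence classes can be illustrated with a minimal sketch. The block size, the number of congruence classes, and the modulo mapping function below are assumptions chosen for illustration; the text only requires that some preset mapping function exist.

```python
BLOCK_SIZE = 64      # bytes per cache block (assumed value)
NUM_CLASSES = 128    # number of congruence classes (assumed value)
ASSOCIATIVITY = 2    # blocks per congruence class, i.e. 2-way set associative


def congruence_class(address: int) -> int:
    """Map a byte address to its unique congruence class (modulo mapping assumed)."""
    block_number = address // BLOCK_SIZE
    return block_number % NUM_CLASSES


def address_tag(address: int) -> int:
    """Remaining high-order part of the block number; tells apart blocks sharing a class."""
    block_number = address // BLOCK_SIZE
    return block_number // NUM_CLASSES


# Three different memory blocks that all map to the same congruence class.
a = 0
b = a + BLOCK_SIZE * NUM_CLASSES
c = a + 2 * BLOCK_SIZE * NUM_CLASSES
same_class = {congruence_class(x) for x in (a, b, c)}
```

With these assumed parameters, blocks a, b and c compete for the two blocks of a single congruence class, so at most two of them can be cached at the same time.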
An illustrative cache line (block) includes an address-tag field, a state-bit field, an inclusivity-bit field, and a value field for storing the actual instructions or data. The state-bit field and inclusivity field are used to maintain cache coherency in a multiprocessor data processing system. The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address-tag field indicates a cache “hit.” On the other hand, a cache “miss” occurs if the requested tag is absent from a cache, or if the tag is in the cache but has the wrong “access permissions”. Data may have the wrong access permissions if it is being read or written by another data processor in the data processing system when requested. The collection of all the address tags in the cache and the state-bit and inclusivity fields is referred to as a directory, and the collection of all the value fields is the cache-entry array.
If a cache miss occurs, the requested data are retrieved from main memory and inserted into the cache, which may displace other cached data. The delay associated with fetching data from main memory is generally much greater than if the data were already in the cache because main memory does not have the high speed access capabilities of a cache. This delay associated with memory data access is commonly referred to as “access latency” or “latency”. In all cases, caches are of finite size. Selectivity must be applied in determining which data should be cached when the cache is full. When all of the blocks in a congruence class are full and the cache receives a request to a memory location that maps into that congruence class, the cache must “evict” one of the blocks currently in the congruence class. The cache chooses a block to be evicted by an algorithm (for example, least recently used (LRU), random, pseudo-LRU, etc.). If the data in the chosen block are modified, those data are written to the next level of memory hierarchy which may be another cache (in the case of primary or on-board caches). By the principle of inclusion, the lower level of hierarchy will already have a block available to hold the written modified data. However, if the data in the chosen block are not modified, the block is simply overwritten. This process of removing a block from one level of hierarchy is known as “castout”. At the end of this process, the cache no longer holds a copy of the evicted block.
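The eviction behavior described above can be sketched for a single congruence class. This is an illustration only: it assumes LRU replacement (one of the algorithms named above), and the class name CongruenceClass and the written_back list are hypothetical names introduced here.

```python
from collections import OrderedDict


class CongruenceClass:
    """One congruence class with LRU replacement (illustrative sketch)."""

    def __init__(self, associativity):
        self.associativity = associativity
        self.blocks = OrderedDict()   # tag -> modified flag, least recently used first
        self.written_back = []        # tags of modified blocks sent to the next level

    def access(self, tag, write=False):
        """Return True on a hit, False on a miss (the block is then installed)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)          # mark as most recently used
            if write:
                self.blocks[tag] = True           # block is now modified
            return True
        if len(self.blocks) == self.associativity:
            victim, modified = self.blocks.popitem(last=False)  # evict the LRU block
            if modified:
                self.written_back.append(victim)  # modified data go to the next level
            # an unmodified victim is simply overwritten
        self.blocks[tag] = write
        return False


s = CongruenceClass(associativity=2)
results = [
    s.access(1, write=True),  # miss, block 1 installed as modified
    s.access(2),              # miss, block 2 installed
    s.access(1),              # hit, block 2 becomes least recently used
    s.access(3),              # miss, evicts unmodified block 2
    s.access(4),              # miss, evicts modified block 1, which is written back
]
```

In this trace only block 1 reaches written_back, because block 2 was never modified and could be overwritten directly.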
Since the latency to the memory or the next level of cache hierarchy is generally significantly greater than the time to access the cache, many techniques have been proposed to hide or reduce this latency. Prefetching is one such technique. Prefetching mechanisms attempt to anticipate which sections of memory will be used by a program and fetch them into the cache before the processor would normally request them. If the prefetching mechanism is successful then a line of memory is transferred into the cache far enough ahead, in time, to avoid any processing stalls due to a cache miss.
Prefetching techniques fall into two major categories: hardware-based and software-based. Software-based prefetching techniques involve inserting prefetching instructions into a program. For example, the paper “Software Prefetch” by Callahan et al., in the Proceedings of the Fourth International Conference on Architectural Support For Programming Languages and Operating Systems (pp 40-52), April 1991, describes adding new instructions in the instruction set to perform prefetching. Also, the IBM RS/6000 and PowerPC processors have an instruction, the Data-Cache-Block-Touch (dcbt) instruction (commonly called a touch instruction) that prefetches a line of memory into the cache. A compiler (or programmer) can insert these prefetching instructions into the program ahead of the actual use of the data in an attempt to assure that the line of memory will be in the cache when a subsequent instruction in the program is executed. Touch instructions can be used to prefetch instructions and data. For example, a touch instruction can be inserted into a program ahead of an upcoming branch to prefetch the instructions located at the target of the branch. Similarly, a touch instruction can be placed ahead of a load instruction to prefetch the data into the cache.
Hardware-based prefetching techniques rely on predicting future memory-access patterns based on previous patterns. These techniques do not require changes to the existing programs, so there is no need for programmer or compiler intervention. For example, Chen and Baer propose an elaborate approach called “Lookahead Data Prefetching” in their paper “Reducing Memory Latency via Non-blocking and Prefetching Caches” in the Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (pp 51-61), October 1992. This technique requires a support unit for a conventional data cache. The support unit is based on the prediction of the execution of the instruction stream and associated operand references in load/store instructions. The latter, and their referencing patterns, are kept in a reference prediction table (RPT) which is organized as a regular cache. An entry in the RPT consists of the instruction address, the address of the operand generated at the last access, and two state bits for the encoding of a finite state machine to record the access patterns and to decide whether subsequent prefetches should be activated or prevented. The RPT is accessed ahead of the regular program counter by a look-ahead program counter (LA-PC). The LA-PC is incremented and maintained in the same fashion as the PC with the help of a dynamic branch predictor. The LA-PC/RPT combination is used to detect regular data accesses in the RPT and to generate prefetching requests. The prefetched data blocks are put in the data cache. The supporting unit is not on the critical path. Its presence does not increase the cycle time or data access latency except for an increase in bus traffic. The key to the successful working of this technique is the distance between the program counter and the LA-PC so that the prefetched data arrives just before it is needed. Incorrect branch predictions keep the distance from growing too large.
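A reference prediction table of the kind described above can be sketched as follows. This is a simplified illustration, not the design of the cited paper: the stride field, the four state names, the transition rules, and the policy of prefetching only in the steady state are assumptions made here, and the look-ahead program counter is omitted (the table is driven directly by observed accesses).

```python
INITIAL, TRANSIENT, STEADY, NO_PREDICTION = range(4)  # two state bits (names assumed)


class ReferencePredictionTable:
    """Per-instruction stride detector (illustrative sketch)."""

    def __init__(self):
        # instruction address -> [last operand address, stride, state]
        self.entries = {}

    def observe(self, pc, addr):
        """Record one load/store access; return an address to prefetch, or None."""
        if pc not in self.entries:
            self.entries[pc] = [addr, 0, INITIAL]
            return None
        last_addr, stride, state = self.entries[pc]
        if addr == last_addr + stride:
            # The previous stride predicted this access correctly.
            state = TRANSIENT if state == NO_PREDICTION else STEADY
        elif state == STEADY:
            state = INITIAL                  # one misprediction: keep the stride
        else:
            stride = addr - last_addr        # learn a new stride
            state = TRANSIENT if state == INITIAL else NO_PREDICTION
        self.entries[pc] = [addr, stride, state]
        return addr + stride if state == STEADY else None


rpt = ReferencePredictionTable()
# A load at (hypothetical) instruction address 0x40 walking an array of 8-byte items.
prefetches = [rpt.observe(0x40, a) for a in (1000, 1008, 1016, 1024)]
```

In this trace the first two accesses only train the entry; once the stride of 8 has been confirmed, each further access yields the address of the next expected operand.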
To increase the time between prefetching the data and its subsequent use, Veidenbaum presents a method in his paper, “Instruction Cache Prefetching Using Multilevel Branch Prediction” (pp 51-70), in High Performance Computing, Lecture Notes in Computer Science, V.1336, 1997. In this method the target of a branch instruction is prefetched using a multilevel branch predictor (capable of predicting the branch actions of more than one branch at a time). The predictor consists of a Branch History Register (BHR), a Predictor Table (PT), and associated control logic. The BHR holds the program counter, target addresses, and taken/not taken history of the previous K branches plus the current branch. The taken/not taken history is stored as one bit per branch and is shifted left on each branch with current branch information shifting in. A PT entry holds 2^K target addresses and 2-bit saturating counters to enable or disable prefetching. When the current instruction is a branch, the program counter is used to select the PT entry. The counter with the maximum value among the 2^K counters is identified, and the target address associated with that counter is returned as the prediction. The counter is incremented whenever the prefetched line is used in the cache and decremented if it is replaced without having been used in the cache. The PT entry is updated every time K branches are filled in the BHR. This technique relies on the accuracy of the branch predictor and prefetches only the target address. It has been applied only to instruction caches.
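The selection and update of the counters in a PT entry can be sketched as below. The sketch is an assumption-laden illustration: the class and method names are hypothetical, the number of slots is left as a parameter, ties are broken in favor of the lowest slot, and the BHR and the rule for filling in target addresses are omitted.

```python
class PredictorTableEntry:
    """One PT entry: a target address and a 2-bit saturating counter per slot."""

    def __init__(self, num_slots):
        self.targets = [None] * num_slots
        self.counters = [0] * num_slots   # each counter saturates in the range 0..3

    def predict(self):
        """Return (slot, target) for the slot with the largest counter."""
        slot = max(range(len(self.counters)), key=lambda i: self.counters[i])
        return slot, self.targets[slot]

    def line_used(self, slot):
        """The prefetched line was used in the cache: count up, saturating at 3."""
        self.counters[slot] = min(3, self.counters[slot] + 1)

    def line_replaced_unused(self, slot):
        """The prefetched line was replaced without use: count down, saturating at 0."""
        self.counters[slot] = max(0, self.counters[slot] - 1)


entry = PredictorTableEntry(num_slots=4)
entry.targets = [0x1000, 0x2000, 0x3000, 0x4000]   # hypothetical target addresses
for _ in range(5):
    entry.line_used(2)        # saturates at 3 despite five increments
entry.line_used(1)
first_prediction = entry.predict()
```

Because the counters saturate, a long run of useful prefetches cannot prevent a target from being demoted after a few useless ones.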
Liu and Kaeli have proposed a similar technique for data prefetching, in their paper, “Branch-Directed and Stride-Based Data Cache Prefetching,” in the Proceedings of the International Conference on Computer Design (pp 225-229), 1996. In this work, the next missing data address in the target path is stored along with the target address in the Branch Target Buffer. Every time the branch instruction is predicted “taken” the data address is prefetched. The method also uses stride prefetching, wherein each data address has a 2-bit counter to detect stride access patterns. The lookahead distance in this technique is only 1 branch and so the latency that can be covered is less than in the technique proposed in this patent.
There are a number of patents directed to prefetching mechanisms, with each having certain advantages and disadvantages.
For example, several patents describe prefetching data inside a program loop.
U.S. Pat. No. 5,704,053 to Santhanam describes a mechanism where prefetching instructions are added to program loops. The technique uses execution profiles from a previous run of the application to determine where to insert prefetching instructions in a loop.
U.S. Pat. No. 5,843,934 to Hsu determines the memory access pattern of a program inside a loop. Prefetches are scheduled evenly over the body of a loop. This avoids clustering of prefetches, especially when a prefetch causes castout or write back due to replacing a cache line that was previously updated. Prefetches are scheduled according to the number of loop iterations and number of prefetches to be performed on each loop iteration.
U.S. Pat. No. 5,919,256 to Widigen et al. describes a mechanism where data is prefetched from an operand cache instead of referencing memory. The data values from the operand cache are then used speculatively to execute instructions. If the data values retrieved from the operand cache equal the actual operand values the speculative executions are allowed to complete. If the values are unequal, then all speculative executions are discarded.
U.S. Pat. No. 5,357,618 to Mirza determines a prefetch length for lines of stride 1, or N or a combination of stride values. Stride registers are used to calculate the program's referencing pattern and special instructions are used to transfer values between the general purpose registers and stride registers. The compiler uses these new instructions to control prefetching within a loop.
More general prefetching techniques include: U.S. Pat. No. 5,896,517 to Wilson, which uses a background memory move (BMM) mechanism to improve the performance of a program. The BMM mechanism performs background memory move operations, between different levels of the memory hierarchy, in parallel with normal processor operations.
U.S. Pat. No. 5,838,945 to Emberson describes a prefetching mechanism where lines of variable sizes are fetched into the cache. A special instruction is used to indicate the length of the cache line that is prefetched, the cache set location to preload the prefetched data, and prefetch type (instruction or data).
U.S. Pat. No. 5,918,246 to Goodnow et al. describes a prefetch method that uses a compiler-generated program map. The program map is used to prefetch appropriate instructions and data information into the cache. The program map contains the address location of branches and branch targets, and data locations used by the program.
U.S. Pat. No. 5,778,435 to Berenbaum et al. describes a history based prefetching mechanism where cache miss addresses are saved in a buffer. The buffer is indexed by an instruction address that was issued N cycles previously. The buffer value is then used as a prefetch address in an attempt to avoid cache misses.
U.S. Pat. No. 5,732,242 to Mowry describes a mechanism where prefetching instructions contain ‘hint’ bits. The hint bits indicate which prefetch operation is to be performed, i.e. the prefetch is exclusive or read only, and into which cache set the line is loaded (least-recently-used or most-recently-used).
U.S. Pat. No. 5,305,389 to Palmer describes a prefetching mechanism that stores the access pattern of a program in a pattern memory. Prefetch candidates are obtained by comparing a current set of objects (accesses) to the objects saved in the pattern memory. Pattern matches need not demonstrate a complete match to the objects saved in the pattern memory to generate a prefetch candidate. Prefetches are attempted for the remaining objects of each matching pattern.
U.S. Pat. No. 5,774,685 by Dubey uses a prefetch instruction that encodes the branch path, determined by the compiler, between the prefetching instruction and the instruction that uses the data, where the branch path represents the actions (taken or not-taken) of the intervening branches. Each prefetched line is tagged with the speculative branch path information contained in the prefetch instruction. Special hardware exists to compare tagged information of a prefetched line to the actual action of the branches executed by the processor. Whenever the tagged information differs from the actual branch actions the prefetched line is discarded earlier, whereas prefetched lines that have tags that equal the actual branch actions are retained longer in the cache.
Similarly, U.S. Pat. No. 5,742,804 to Yeh et al. describes a mechanism that only prefetches instructions. Branch prediction instructions are inserted into a program ahead of an upcoming branch. Each branch prediction instruction serves as a prefetching instruction and contains a guess field predicting the direction of the upcoming branch (taken or not-taken), a prefetch address, number of bytes to prefetch and a trace vector indicating the branch path leading to the upcoming branch. The trace vector is used to cancel issued prefetches by comparing the action of the upcoming branches to the actions predicted by the trace vector. No mechanism exists to prefetch data.
In addition, U.S. Pat. No. 6,055,621 to Puzak describes a method that conditionally executes prefetch instructions. The mechanism uses a history table that records whether previously executed prefetch instructions fetched information that was actually used by the processor. The table is called the Touch-History-Table. Information contained in the table is used to execute only those prefetch instructions that fetched useful data and discard (not execute) prefetch instructions that fetched unused data.
It is an objective of the present invention to describe a way to use the branch history to prefetch instructions and data into the cache sufficiently ahead of time to cover latency. In this invention, we keep track of the control flow path following a branch instruction and also the cache miss(es) that occur along this path. If the same branch instruction repeats in the program's execution, we compare the predicted path with the path last associated with each miss address associated with that branch. When the paths match, we prefetch the data (if not already present). The usefulness of the prefetch is determined by using saturating counters, and the prefetch can be turned off if it is found to be useless.
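The path-matching idea in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed apparatus: the table layout, the names BranchHistoryPrefetcher and MissRecord, the 2-bit counter width, its initial value, the enable threshold of 2, and the exact-match comparison of paths are all choices made here for the example.

```python
from dataclasses import dataclass


@dataclass
class MissRecord:
    miss_address: int
    path: tuple           # taken (True) / not-taken (False) outcomes after the branch
    usefulness: int = 2   # saturating counter in 0..3 (width and start value assumed)


class BranchHistoryPrefetcher:
    """Associates cache misses with the path followed after a branch (sketch)."""

    def __init__(self):
        self.table = {}   # branch address -> list of MissRecord

    def record_miss(self, branch_pc, path, miss_address):
        """Remember that miss_address missed along `path` after branch_pc."""
        records = self.table.setdefault(branch_pc, [])
        for r in records:
            if r.miss_address == miss_address:
                r.path = tuple(path)      # keep the path last associated with the miss
                return
        records.append(MissRecord(miss_address, tuple(path)))

    def prefetch_candidates(self, branch_pc, predicted_path, cached=()):
        """Addresses to prefetch when branch_pc recurs with the given predicted path."""
        predicted = tuple(predicted_path)
        return [r.miss_address for r in self.table.get(branch_pc, [])
                if r.path == predicted          # the paths match
                and r.usefulness >= 2           # the prefetch is not turned off
                and r.miss_address not in cached]

    def feedback(self, branch_pc, miss_address, used):
        """Update the saturating counter once the fate of a prefetch is known."""
        for r in self.table.get(branch_pc, []):
            if r.miss_address == miss_address:
                r.usefulness = min(3, r.usefulness + 1) if used else max(0, r.usefulness - 1)


p = BranchHistoryPrefetcher()
p.record_miss(0x100, (True, False), 0x8000)      # hypothetical addresses
on_match = p.prefetch_candidates(0x100, (True, False))
on_mismatch = p.prefetch_candidates(0x100, (True, True))
already_cached = p.prefetch_candidates(0x100, (True, False), cached={0x8000})
p.feedback(0x100, 0x8000, used=False)            # the prefetched line went unused
after_useless = p.prefetch_candidates(0x100, (True, False))
```

In this trace a prefetch is issued only when the predicted path equals the recorded one and the line is not already cached; a single useless prefetch drops the counter below the assumed threshold and turns the prefetch off.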
It is another objective of this invention to increase the time between issuing a prefetch and the subsequent use of that data. To achieve this, the branch instruction associated with a miss address can be changed dynamically after observing the “timeliness” of the prefetch. Both instructions and data can be prefetched using this mechanism.