The present embodiments relate to microprocessor systems, and are more particularly directed to circuits, systems and methods for data prefetching based on weighted considerations from both a branch target buffer and a load target buffer.
Microprocessor technology continues to advance at a rapid pace, with consideration given to all aspects of design. Designers constantly strive to increase performance, while maximizing efficiency. With respect to performance, greater overall microprocessor speed is achieved by improving the speed of various related and unrelated microprocessor circuits and operations. For example, one area in which operational efficiency is improved is by providing parallel and out-of-order instruction execution. As another example, operational efficiency also is improved by providing faster and greater access to information, with such information including instructions and/or data. The present embodiments are primarily directed at this access capability and, more particularly, to improving access to data by way of prefetching such data.
One common technique used in modern microprocessors to improve performance involves the "pipelining" of instructions. As is well known in the art, microprocessor instructions each generally involve several sequential operations, such as instruction fetch, instruction decode, retrieval of operands from registers or memory, execution of the instruction, and writeback of the results of the instruction. Pipelining of instructions in a microprocessor refers to the staging of a sequence of instructions so that multiple instructions in the sequence are simultaneously processed at different stages in the internal sequence. For example, if a four-stage pipelined microprocessor is executing instruction n in a given microprocessor clock cycle, it may simultaneously (i.e., in the same machine cycle) retrieve the operands for instruction n+1 (i.e., the next instruction in the sequence), decode instruction n+2, and fetch instruction n+3. Through the use of pipelining, the microprocessor can effectively execute a sequence of multiple-cycle instructions at a rate of one per clock cycle.
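By way of illustration only, the four-stage example above may be sketched in Python as follows. The stage names, the function names, and the assumption that every instruction spends exactly one clock cycle in each stage with no stalls are hypothetical simplifications and are not drawn from any particular microprocessor.

```python
# Hypothetical four-stage pipeline, ordered from first stage to last.
STAGES = ["fetch", "decode", "operands", "execute"]


def stage_occupancy(cycle, num_instructions):
    """Return {stage: instruction index or None} for the given clock cycle.

    Assumes instruction i enters the fetch stage in cycle i and advances
    exactly one stage per cycle (no stalls, no flushes).
    """
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        instr = cycle - depth  # an instruction 'depth' stages in was fetched 'depth' cycles ago
        occupancy[stage] = instr if 0 <= instr < num_instructions else None
    return occupancy


def total_cycles(num_instructions):
    """Cycles needed to pass every instruction through the last stage."""
    return len(STAGES) + num_instructions - 1


if __name__ == "__main__":
    # In cycle 3, instruction 0 executes while instruction 3 is fetched.
    print(stage_occupancy(3, 10))
    # Once the pipeline is full, one instruction completes per cycle.
    print(total_cycles(100))
```

Under these assumptions, the startup cost of filling the pipeline is paid only once, so the completion rate approaches one instruction per clock cycle as the sequence grows longer.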
Through the use of both pipelining and superscalar techniques, modern microprocessors may execute multicycle machine instructions at a rate greater than one instruction per machine clock cycle, assuming that the instructions proceed in a known sequence. However, as is well known in the art, many computer programs do not continuously proceed in the sequential order of the instructions, but instead include branches (both conditional and unconditional) to program instructions that are not in the current sequence. Such operations challenge the pipelined microprocessor because an instruction in the pipeline may not necessarily reach execution. For example, a conditional branch instruction may, upon execution, cause a branch to an instruction other than the next sequential instruction currently in the pipeline. In this event, the instructions currently in the pipeline and following the branch instruction are not used. Instead, these successive instructions are "flushed" from the pipeline and the actual next instruction (i.e., the target of the branch) is fetched and processed through the pipeline (e.g., by decoding, execution, writeback and the like).
In order to minimize the microprocessor performance reduction that results from non-sequential program execution, many modern microprocessors incorporate speculative execution based upon branch prediction. Branch prediction predicts, on a statistical basis, the results of each conditional branch (i.e., whether the branch will be "taken" or "not-taken"), and the microprocessor continues fetching instructions and operating the pipeline based on the prediction. For example, if a branch instruction is predicted not taken, then the next instruction fetched into the pipeline is simply the next sequential instruction following the branch instruction. On the other hand, if a branch instruction is predicted taken, then the next instruction fetched into the pipeline is the target instruction (i.e., the instruction to which the branch goes if taken). The instructions fetched based upon such a prediction proceed along the pipeline until the actual result of the condition is determined (typically upon execution of the branch instruction). If the prediction is correct, the speculative execution of the predicted instructions maintains the microprocessor at its highest performance level through full utilization of the pipeline. In the event that the prediction is incorrect, the pipeline is flushed to remove all instructions following the branch instruction in the pipeline.
By way of further background, conventional speculative execution techniques include the use of branch target buffers (BTBs). Conventional BTBs are cache-like buffers commonly used in the fetch units of microprocessors. The BTB commonly stores at least three items: (1) an identifier of a previously performed branch instruction as a tag; (2) the predicted target address for the branch (i.e., the address to which the branch points in its predicted taken state); and (3) an indication relating to the branch's actual history, that is, whether or not the branch was taken in past occurrences of the branch. The indication relating to the branch's actual history either directly indicates a prediction, or is used to derive a prediction, of whether the branch is taken. Once a BTB entry is written to include this information for a given branch, subsequent fetches of the same branch are handled using this very information. Specifically, if the branch is predicted taken (based on the branch history), the target address is used as the next address to fetch in the pipeline. The history section of the BTB entry is also updated upon execution of the branch instruction. Specifically, the execution unit determines the actual target address for the branch instruction to determine whether or not the branch is taken. This information updates the history in the BTB entry and, therefore, affects the future prediction for that entry. Note further that the actual target address from the execution unit is compared to the predicted address; if the two do not match, a misprediction has occurred and the instruction unit is so informed, so that the pipeline may be flushed and fetching may resume at the actual address.
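By way of illustration only, the three items of a BTB entry and the predict, update, and compare steps described above may be sketched in Python as follows. The class and method names are hypothetical. The use of a two-bit saturating counter as the history indication is an assumption made for this sketch (the text above requires only some indication of past behavior), as is the choice to allocate an entry only when a branch is first taken.

```python
class BranchTargetBuffer:
    """Minimal BTB sketch: tag -> [predicted target, 2-bit history counter]."""

    def __init__(self):
        self.entries = {}  # keyed by branch address (the tag)

    def predict(self, branch_addr, fallthrough):
        """Return the next fetch address predicted for this branch."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            return fallthrough  # no history: assume not taken
        target, counter = entry
        # Assumption: counter values 2 and 3 mean "predict taken".
        return target if counter >= 2 else fallthrough

    def resolve(self, branch_addr, fallthrough, taken, target):
        """Called when the branch executes; returns True on a misprediction."""
        predicted = self.predict(branch_addr, fallthrough)
        actual = target if taken else fallthrough
        entry = self.entries.get(branch_addr)
        if entry is None:
            if taken:
                # Allocate on the first taken occurrence, weakly taken.
                self.entries[branch_addr] = [target, 2]
        elif taken:
            entry[0] = target
            entry[1] = min(3, entry[1] + 1)
        else:
            entry[1] = max(0, entry[1] - 1)
        # A mismatch means the pipeline would be flushed and refetched at 'actual'.
        return predicted != actual


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(btb.resolve(0x100, 0x104, True, 0x200))  # first encounter
    print(hex(btb.predict(0x100, 0x104)))          # later fetch of the same branch
```

In this sketch the first encounter of a taken branch is necessarily mispredicted, since no entry yet exists, and the prediction for later fetches follows the recorded history.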
Another very common approach in modern computer systems directed at improving access time to information is to include one or more levels of cache memory within the system. For example, a cache memory may be formed directly on a microprocessor, and/or a microprocessor may have access to an external cache memory. Typically, the lowest level cache (i.e., the first to be accessed) is smaller and faster than the cache or caches above it in the hierarchy, and the number of caches in a given memory hierarchy may vary. In any event, when utilizing the cache hierarchy, an issued information address is typically directed to the lowest level cache to see if that cache stores information corresponding to that address, that is, whether there is a "hit" in that cache. If a hit occurs, then the addressed information is retrieved from the cache without having to access a memory higher in the memory hierarchy, where that higher ordered memory is likely slower to access than the hit cache memory. On the other hand, if a cache hit does not occur, then it is said that a cache miss occurs. In response, the next higher ordered memory structure is then presented with the address at issue. If this next higher ordered memory structure is another cache, then once again a hit or miss may occur. If misses occur at each cache, then eventually the process reaches the highest ordered memory structure in the system, at which point the addressed information may be retrieved from that memory.
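By way of illustration only, the level-by-level search described above may be sketched in Python as follows. Each cache is modeled as a plain dictionary from address to information, which ignores capacity, line size, and replacement. The function name and the choice to copy the information into every missed cache on the way back are assumptions made for this sketch.

```python
def lookup(address, caches, main_memory):
    """Search caches from lowest (first accessed) to highest, then memory.

    Returns (information, level), where level is the index of the cache
    that hit, or len(caches) if every cache missed and the highest
    ordered memory supplied the information.
    """
    for level, cache in enumerate(caches):
        if address in cache:  # a "hit" at this level
            info = cache[address]
            break
    else:
        # A miss occurred at each cache; fall through to the highest ordered memory.
        level = len(caches)
        info = main_memory[address]
    # Assumption: fill each lower cache that missed so a repeat access hits sooner.
    for missed in caches[:level]:
        missed[address] = info
    return info, level


if __name__ == "__main__":
    l1, l2 = {}, {0x40: "b"}
    memory = {0x40: "b", 0x80: "c"}
    print(lookup(0x40, [l1, l2], memory))  # misses in l1, hits in l2
    print(lookup(0x40, [l1, l2], memory))  # now hits in l1
    print(lookup(0x80, [l1, l2], memory))  # misses in both caches
```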
Still another prior art technique for increasing speed involves the prefetching of information in combination with cache systems. Prefetching involves a speculative retrieval of, or preparation to retrieve, information, where the information is retrieved from a higher level memory system, such as an external memory, into a cache under the expectation that the retrieved information may be needed by the microprocessor for an anticipated event at some point after the next successive clock cycle. In this regard, the instance of a load is perhaps more often thought of in connection with retrieval, but note that prefetching may also concern a data store. More specifically, a load occurs where specific data is retrieved so that the retrieved data may be used by the microprocessor. However, a store operation often first retrieves a group of data, where a part of that group will be overwritten. Still further, some store operations, such as a store interrogate, do not actually retrieve data, but prepare some resource external to the microprocessor for an upcoming event which will store information to that resource. Each of these cases, for purposes of this Background and the present embodiments to follow, should be considered a type of prefetch. In any event, in the case of prefetching where data is speculatively retrieved into an on-chip cache, if the anticipated event giving rise to the prefetch actually occurs, the prefetched information is already available in the cache and, therefore, may be fetched from the cache without having to seek it from a higher ordered memory system. In other words, prefetching lowers the risk of a cache miss once an actual fetch is necessary.
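By way of illustration only, the effect of a prefetch on a later actual fetch may be sketched in Python as follows. The class name, the method names, and the hit and miss counters are hypothetical, and the cache is modeled as an unbounded dictionary. The sketch covers only the load case and does not model a store or a store interrogate.

```python
class PrefetchingCache:
    """Minimal on-chip cache sketch backed by a higher ordered memory."""

    def __init__(self, memory):
        self.memory = memory  # higher ordered memory: address -> data
        self.lines = {}       # data currently held in the cache
        self.hits = 0
        self.misses = 0

    def prefetch(self, address):
        """Speculatively retrieve data ahead of an anticipated load."""
        if address not in self.lines:
            self.lines[address] = self.memory[address]

    def load(self, address):
        """Actual fetch; counts a hit if the data is already in the cache."""
        if address in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines[address] = self.memory[address]
        return self.lines[address]


if __name__ == "__main__":
    cache = PrefetchingCache({0x10: "x", 0x20: "y"})
    cache.prefetch(0x10)  # anticipated event
    cache.load(0x10)      # the anticipated load occurs: served from the cache
    cache.load(0x20)      # no prefetch was issued: must go to memory
    print(cache.hits, cache.misses)
```

If the anticipated event never occurs, the prefetched data simply goes unused in this sketch; in a cache of finite capacity it would also occupy space that other data might have used.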
Given the above techniques, the present inventors provide within a microprocessor still additional circuits, systems, and methods, where various embodiments further advance operational efficiency in view of the above considerations as well as others which may be ascertained by one skilled in the art.