This invention relates to the utilization of caches in computer systems.
Traditional processor designs use various cache structures to store local copies of instructions and data in order to avoid the lengthy access times of typical DRAM. In a typical cache hierarchy, caches closer to the processor (level one, or L1) tend to be smaller and very fast, while caches closer to the DRAM (level two, or L2; level three, or L3) tend to be significantly larger but also slower (longer access time). The larger caches tend to hold both instructions and data, while a processor system will quite often include a separate data cache and instruction cache at the L1 level (i.e., closest to the processor core).
All of these caches typically have a similar organization, with the main differences being in specific dimensions (e.g., cache line size, number of ways per congruence class, number of congruence classes). An L1 instruction cache is accessed either when code execution reaches the end of the previously fetched cache line or when a taken (or at least predicted-taken) branch is encountered within the previously fetched cache line. In either case, a next instruction address is presented to the cache. In typical operation, a congruence class is selected via an abbreviated address (ignoring high-order bits), and a specific way within the congruence class is selected by matching the address against the contents of an address field within the tag of each way in that congruence class. The addresses used for indexing and for tag matching can be either effective or real addresses, depending on system issues beyond the scope of this discussion. Typically, the low-order address bits (e.g., those selecting a specific byte or word within a cache line) are ignored both for indexing into the tag array and for comparing tag contents, because in a conventional cache all such bytes or words are stored in the same cache line.
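The lookup just described can be illustrated with a minimal sketch. The dimensions below (64-byte lines, 128 congruence classes, 4 ways) and the names `split_address` and `TagArray` are assumptions chosen for illustration only; they are not taken from any particular processor.

```python
from typing import List, Optional, Tuple

# Assumed, illustrative dimensions (powers of two).
LINE_BYTES = 64     # bytes per cache line
NUM_CLASSES = 128   # congruence classes (sets)
NUM_WAYS = 4        # ways per congruence class

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 6 low-order bits select a byte
INDEX_BITS = NUM_CLASSES.bit_length() - 1   # 7 bits select a congruence class


def split_address(addr: int) -> Tuple[int, int, int]:
    """Split an address into (tag, congruence class index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_CLASSES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class TagArray:
    """Hypothetical tag array: one stored tag (or None) per way per class."""

    def __init__(self) -> None:
        self.tags: List[List[Optional[int]]] = [
            [None] * NUM_WAYS for _ in range(NUM_CLASSES)
        ]

    def fill(self, addr: int, way: int) -> None:
        """Record that the line containing addr now resides in the given way."""
        tag, index, _ = split_address(addr)
        self.tags[index][way] = tag

    def lookup(self, addr: int) -> Optional[int]:
        """Return the matching way, or None on a miss.

        The byte offset is ignored for both indexing and tag comparison.
        """
        tag, index, _ = split_address(addr)
        for way, stored in enumerate(self.tags[index]):
            if stored == tag:
                return way
        return None
```

In this sketch, two addresses that differ only in their low-order offset bits select the same congruence class and match the same tag, so they hit in the same way.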
Recently, instruction caches that store traces of instruction execution have been used, most notably in the Intel Pentium 4. These “trace caches” typically combine blocks of instructions from different address regions (i.e., blocks that would have required multiple conventional cache lines). The objective of a trace cache is to handle branching more efficiently, at least when the branching is well predicted. The instruction at a taken branch target address is simply the next instruction in the trace line, allowing the processor to execute code with high branch density just as efficiently as it executes long blocks of code without branches. This type of trace cache works very well as long as the branches within each trace execute as predicted. At the start of operation, however, there is no branch history from which to make predictions.
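As a rough illustration of how a trace line places a branch target directly after its branch, consider the following sketch. The toy program representation, the 4-byte instruction size, and the function name `build_trace` are all hypothetical; real trace construction involves considerably more detail.

```python
from typing import Dict, List, Optional, Set

INSTR_BYTES = 4  # assumed fixed instruction size


def build_trace(
    program: Dict[int, Optional[int]],
    start: int,
    predicted_taken: Set[int],
    max_len: int,
) -> List[int]:
    """Return the instruction addresses that would fill one trace line.

    program maps each instruction address to its branch target, or to None
    if the instruction is not a branch. predicted_taken holds the addresses
    of branches currently predicted taken.
    """
    trace: List[int] = []
    pc = start
    while len(trace) < max_len and pc in program:
        trace.append(pc)
        target = program[pc]
        if target is not None and pc in predicted_taken:
            pc = target          # the target becomes the next trace entry
        else:
            pc += INSTR_BYTES    # fall through to the sequential instruction
    return trace
```

For example, with a branch at address 0x104 predicted taken to 0x200, a trace starting at 0x100 would hold 0x100, 0x104, 0x200, 0x204, and so on, spanning what would otherwise be two separate conventional cache lines.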
Even after a large number of cycles, some branches may not have executed enough times to allow a reliable prediction, leading to the formation of trace lines that frequently mispredict program execution. To avoid polluting the cache with such poorly predicted trace lines, the cache can begin operation by forming conventional cache lines. Once significant branch history has been accumulated, trace lines can be formed and allowed to replace the conventional lines in the cache. While the conventional cache line mode can be run for a pre-chosen number of cycles, a fixed count may cause some well-predicted trace lines to be thrown away during those cycles and some poorly predicted trace lines to be used after them. What is needed is an effective mechanism to determine when enough branch history has been accumulated to switch to trace formation mode and achieve better performance than with conventional cache lines.
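The fixed-cycle approach mentioned above can be sketched as follows. The threshold value and the names `LineMode`, `WARMUP_CYCLES`, and `line_mode` are hypothetical and serve only to show why such a policy is insensitive to the actual state of the branch history.

```python
from enum import Enum


class LineMode(Enum):
    CONVENTIONAL = "conventional"
    TRACE = "trace"


WARMUP_CYCLES = 10_000  # assumed, pre-chosen cycle count


def line_mode(cycle: int, warmup_cycles: int = WARMUP_CYCLES) -> LineMode:
    """Select the line formation mode from the cycle count alone.

    No branch history is consulted, so the switch may come too late for
    well-predicted code and too early for poorly predicted code.
    """
    if cycle < warmup_cycles:
        return LineMode.CONVENTIONAL
    return LineMode.TRACE
```

Because the decision depends only on elapsed cycles, the same threshold applies regardless of how quickly or slowly reliable predictions become available for a given program.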
One limitation of trace caches is that branch prediction must be reasonably accurate before traces are constructed and stored in a trace cache. Switching to trace cache mode before that point will lead to frequent branch mispredictions. This can result in repeated early exits from a trace line when, for example, a branch positioned early in a trace was predicted not taken when the trace was constructed but is now consistently taken. Any instructions beyond this branch are never executed, essentially becoming unused overhead that reduces the effective utilization of the cache. Since the branch causing the early exit is unanticipated, significant latency (the branch misprediction penalty) is incurred to fetch the instructions at the branch target.
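The loss of effective utilization from an early exit can be quantified with a simple calculation. The function name `trace_line_utilization` and the line length used in the example are assumptions for illustration.

```python
def trace_line_utilization(line_length: int, exit_index: int) -> float:
    """Fraction of a trace line's instructions that are actually executed.

    exit_index is the 0-based position of the branch at which execution
    leaves the trace line; instructions after it are never executed.
    """
    if not 0 <= exit_index < line_length:
        raise ValueError("exit_index must lie within the trace line")
    return (exit_index + 1) / line_length
```

For instance, if a hypothetical 16-instruction trace line is consistently exited at its fourth instruction (index 3), only 4 of the 16 instructions are used, for a utilization of 0.25; the remaining 12 entries occupy cache capacity without contributing to execution.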