Throughout the instant disclosure, numerals in brackets—[ ]—are keyed to the list of numbered references towards the end of the disclosure.
Trace caches offer a means to increase instruction fetch bandwidth, leading to improved processor performance. However, they tend to suffer from high levels of instruction redundancy. Part of this redundancy is caused by the way in which traces are formed. For example, if a small loop is unrolled in a trace, the instructions of the loop body are duplicated within that trace. Ending traces on backward branches can eliminate this duplication, but this approach tends to severely limit the length of formed traces. Another source of redundancy is multi-path traces. These arise when the conditional branches within a trace tend to vary in their outcome. For example, as shown in FIG. 1, a trace starting with basic block A might include the basic blocks ABCDE or ABXY, depending on the outcome of the conditional branches in the trace.
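By way of illustration only, the multi-path redundancy described above can be quantified with a short sketch. It assumes, hypothetically, that each letter denotes one basic block and that both traces of the FIG. 1 example are stored in the trace cache at the same time; the function names are illustrative and do not appear in the cited references.

```python
from collections import Counter

def redundant_blocks(traces):
    """Count stored basic-block copies beyond the first copy of each block.

    Each trace is a string in which every character stands for one basic
    block (an assumption made for this illustration).
    """
    counts = Counter(block for trace in traces for block in trace)
    return sum(n - 1 for n in counts.values())

def redundancy_ratio(traces):
    """Fraction of all stored basic-block copies that are redundant."""
    total = sum(len(trace) for trace in traces)
    return redundant_blocks(traces) / total if total else 0.0
```

For the traces ABCDE and ABXY, blocks A and B are each stored twice, so two of the nine stored block copies are redundant. A trace containing an unrolled loop, such as AAAA, is counted in the same way.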
Two known prior solutions that reduce multi-path redundancy are partial matching [1] and the block-based trace cache [2]. Partial matching [1] uses a predictor to select the blocks of a trace that will likely be used by the processor. Using the previous example, only trace ABCDE would be stored in the trace cache. Another trace, say XYZKJ, may also be stored. In order to issue the instructions for the path ABXY, the trace ABCDE is issued first, and the predictor selects blocks AB from the trace. Following this, trace XYZKJ is issued, and the predictor selects blocks XY from the trace. This technique can greatly reduce the redundancy in the trace cache, but it is also very complex to implement.
In the previous example, two cache accesses were required to obtain the single trace ABXY. In the worst case, each basic block of the desired trace can reside in a different cache line, resulting in several cache accesses. These accesses represent an expensive waste of power, since the trace cache is accessed multiple times and many of the instructions fetched from it are not used. Access latency can also increase: if some of the required traces are stored in the same bank of the trace cache, multiple access cycles are needed to issue them.
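The cost of partial matching in the ABXY example can be sketched as follows. This is a simplified model under stated assumptions, not the mechanism of reference [1]: the predictor is assumed to be perfect, each letter stands for one basic block, stored traces are looked up by their starting block, and the function name is hypothetical.

```python
def fetch_with_partial_matching(trace_cache, desired_path):
    """Return (accesses, blocks_fetched, blocks_used) for one desired path.

    trace_cache maps a starting basic block to the stored trace that begins
    with it (an assumed organization for this illustration). A perfect
    predictor is modeled by keeping the longest prefix of each fetched
    trace that matches the remainder of the desired path.
    """
    accesses = fetched = used = 0
    pos = 0
    while pos < len(desired_path):
        trace = trace_cache.get(desired_path[pos])
        if trace is None:
            raise KeyError(f"no stored trace starts with {desired_path[pos]!r}")
        accesses += 1
        fetched += len(trace)
        matched = 0
        while (matched < len(trace)
               and pos + matched < len(desired_path)
               and trace[matched] == desired_path[pos + matched]):
            matched += 1
        if matched == 0:
            raise ValueError("stored trace does not begin with its lookup key")
        used += matched
        pos += matched
    return accesses, fetched, used
```

With the stored traces ABCDE and XYZKJ and the desired path ABXY, the sketch reports two accesses, ten basic blocks fetched, and four basic blocks used, so six of the fetched blocks are discarded.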
Block-based trace caches [2] store only a basic block in each bank of a “block cache”. To issue a trace from the block cache, a trace predictor is used to determine which line within each bank contains a desired basic block, and all the banks are accessed simultaneously to issue the basic blocks. The chain of basic blocks is then merged and issued to the processor. This technique is very efficient at reducing redundancy because all duplication within a single bank is eliminated. However, block-based trace caches also suffer from fragmentation of the trace lines, because a basic block might not fill an entire line. This fragmentation reduces the length of a trace that can be issued. In addition, the “block cache” implementation as proposed does not address the problem of access latency as trace caches increase in size.
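The issue and merge steps of a block cache, and the resulting fragmentation, can be sketched as follows. The data layout is an assumption made for illustration only: each bank is modeled as a list of lines, each line holds the instructions of one basic block, and the trace predictor is represented simply by a list of predicted line indices, one per bank. The names and the line size are hypothetical and are not taken from reference [2].

```python
def issue_trace(banks, line_indices):
    """Read one predicted line from every bank and merge the basic blocks.

    banks: list of banks; each bank is a list of lines; each line is the
           list of instructions of one basic block.
    line_indices: predicted line index for each bank, in trace order.
    """
    trace = []
    for bank, index in zip(banks, line_indices):
        trace.extend(bank[index])
    return trace

def unused_slots(banks, line_indices, line_size):
    """Instruction slots left empty because blocks do not fill their lines."""
    issued = len(issue_trace(banks, line_indices))
    return line_size * len(line_indices) - issued
```

For example, with a hypothetical line size of four instructions, issuing a three-instruction block from one bank and a two-instruction block from another yields a five-instruction trace and leaves three of the eight slots unused, which illustrates how fragmentation shortens the issued trace.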
As trace caches increase in size, they will have high power-per-access requirements and increased access latencies. These problems are inherent in increasing the size of any cache.
In view of the foregoing, a need has been recognized in connection with overcoming the shortcomings and disadvantages presented by conventional organizations of trace caches.