Variable-length instructions can complicate parallel decode and branch prediction, limiting instruction bandwidth and increasing latency to the execution stage of a deeply pipelined processor. A sequentially accessed memory, for example, a trace cache, can enhance performance by storing decoded micro-operation (μop) sequences as well as by implementing a better-informed branch prediction scheme. The first line of μops in a sequence (a “head”) is tagged with the address of the first μop, which is used to index into the trace cache, and the head also holds a pointer to the next line of μops in the sequence (a “body”). Each body holds a pointer to the next body, and the sequence eventually ends with a last line of μops in the sequence (a “tail”) that, unlike the head and body lines, does not have a next way pointer to point to a next line of μops in the sequence. Traces can have variable lengths, and a trace having a length of 1 is a special case because it contains only a single line, which is considered both a head and a tail, and the line does not have a next way pointer, that is, the next way pointer has a null value. Each line, that is, each head, body or tail, contains a fixed number of μops, for example, eight (8) μops. High performance sequencing currently requires the head, bodies and tail of a trace to be in consecutive sets, and since trace sequences often contain taken branches, bodies are often “misplaced” according to their linear instruction pointer (LIP) in a set associative cache. For this reason, a trace can only be accessed through its head.
FIG. 1 is a portion of a set associative cache memory illustrating the ordering of a first trace sequence (Trace1) therein. In FIG. 1, a possible configuration of Trace1, which has a length of six (6), is shown stored in a subset of a trace cache 100. As seen in FIG. 1, trace cache 100 has a set (row), way (column) matrix structure that is read sequentially down the sets. In FIG. 1, H1 105 is used to indicate a head line of Trace1; B1 is used to indicate body lines of Trace1, for example, first body line B1 107, second body line B1 109, third body line B1 111 and fourth body line B1 113; and T1 115 is used to indicate a tail of Trace1. Next way pointer arrows 1 through 5 indicate the correct order to be followed to read the Trace1 sequence. That is, the Trace1 sequence is read by starting from H1 105 in set 1, way 1 to B1 107 in set 2, way 0 (arrow 1); from B1 107 in set 2, way 0 to B1 109 in set 3, way 1 (arrow 2); from B1 109 in set 3, way 1 to B1 111 in set 4, way 1 (arrow 3); from B1 111 in set 4, way 1 to B1 113 in set 5, way 3 (arrow 4); and from B1 113 in set 5, way 3 to T1 115 in set 6, way 3 (arrow 5).
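By way of illustration only, the read order described for FIG. 1 can be sketched in Python as follows. The `Line` class, the dictionary keyed by (set, way), and the `read_trace` function are hypothetical names introduced for this sketch and are not part of the described hardware; the sketch assumes only that each line stores the way of the next line and that the next line always resides in the next consecutive set.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Line:
    label: str               # "H1", "B1" or "T1" (illustrative labels)
    next_way: Optional[int]  # way of the next line in the next set; None for a tail


# (set, way) -> line, following the placement of Trace1 described for FIG. 1
fig1_cache: Dict[Tuple[int, int], Line] = {
    (1, 1): Line("H1", 0),     # H1 105, arrow 1 points to set 2, way 0
    (2, 0): Line("B1", 1),     # B1 107, arrow 2 points to set 3, way 1
    (3, 1): Line("B1", 1),     # B1 109, arrow 3 points to set 4, way 1
    (4, 1): Line("B1", 3),     # B1 111, arrow 4 points to set 5, way 3
    (5, 3): Line("B1", 3),     # B1 113, arrow 5 points to set 6, way 3
    (6, 3): Line("T1", None),  # T1 115, a tail has no next way pointer
}


def read_trace(cache: Dict[Tuple[int, int], Line],
               head_set: int, head_way: int) -> List[Tuple[int, int]]:
    """Follow next way pointers from the head, one set at a time, until the tail."""
    path: List[Tuple[int, int]] = []
    s, w = head_set, head_way
    while True:
        line = cache[(s, w)]
        path.append((s, w))
        if line.next_way is None:   # tail reached
            return path
        s, w = s + 1, line.next_way  # next line is always in the next set


trace1_path = read_trace(fig1_cache, 1, 1)
```

In this sketch the only stored link between lines is the way number, which reflects why the sequence can be entered only at its head: no line records where its predecessor is located.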
Victim way selection in sequentially accessed memories, generally, uses a least recently used (LRU) algorithm to select a next victim way in the next set in the memory. When the LRU algorithm selects a victim way to be overwritten that holds a part of another active trace, it is called a trace “clobber.”
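A minimal sketch of such LRU victim way selection for a single set, written in Python, is given below. The `LruSet` class, its method names and the use of an ownership label per way are assumptions made for illustration; the sketch also assumes that any way already owned by another trace counts as part of an active trace. It is intended only to show that a plain LRU choice takes no account of which trace currently occupies the victim way.

```python
from typing import List, Optional, Tuple


class LruSet:
    """One set of a set associative cache with plain LRU replacement (illustrative)."""

    def __init__(self, num_ways: int) -> None:
        # Which trace, if any, currently owns each way of this set.
        self.owner: List[Optional[str]] = [None] * num_ways
        # Ways ordered from least recently used (front) to most recently used (back).
        self.recency: List[int] = list(range(num_ways))

    def touch(self, way: int) -> None:
        """Mark a way as most recently used."""
        self.recency.remove(way)
        self.recency.append(way)

    def select_victim(self) -> int:
        """Plain LRU: the victim is the least recently used way, whoever owns it."""
        return self.recency[0]

    def write(self, trace_id: str) -> Tuple[int, Optional[str]]:
        """Write a line of trace_id into the victim way.

        Returns the chosen way and the trace that was overwritten, if any.
        A non-None second value belonging to another trace corresponds to a clobber.
        """
        way = self.select_victim()
        clobbered = self.owner[way]
        self.owner[way] = trace_id
        self.touch(way)
        return way, clobbered


# Example: way 1 holds a line of Trace1 and is the least recently used way.
example_set = LruSet(num_ways=4)
example_set.owner[1] = "Trace1"
for recently_used_way in (0, 2, 3):
    example_set.touch(recently_used_way)

victim_way, clobbered_trace = example_set.write("Trace8")
```

Here the write of a Trace8 line lands on way 1 and overwrites the Trace1 line held there, which is the clobber condition described above.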
FIG. 2 illustrates the cache memory of FIG. 1 after a second trace sequence has partially overwritten (clobbered) the first trace sequence from FIG. 1 using an existing replacement algorithm. In FIG. 2, B1 109 of Trace1 has been clobbered by body line B8 210 of Trace8 in set 3, way 1. Although the first 2 lines of Trace1 can still be accessed through H1 105 in set 1, way 1, the last 3 lines, that is, B1 111 in set 4, way 1, B1 113 in set 5, way 3 and T1 115 in set 6, way 3, are no longer accessible by the selection algorithm, since the remaining sequence does not have a head. In FIG. 2, the LRU algorithm selected set 4, way 0 for a tail of Trace8, T8 215. Next way pointer arrows (21 and 22) indicate the way in the next set that each points to and the correct order to be followed to read the sequence of Trace8. That is, the sequence of Trace8 runs from H8 205 in set 2, way 2 to B8 210 in set 3, way 1 (arrow 21), and from B8 210 in set 3, way 1 to T8 215 in set 4, way 0 (arrow 22). Accordingly, the next time an attempt to read Trace1 occurs, a trace cache miss will result when attempting to read B1 109 in set 3, way 1 from B1 107. This will most likely decrease the trace cache utilization due to duplication of the orphaned bodies and tail of Trace1. Therefore, there is a need to decrease the number of clobbers and, thus, achieve better overall performance.
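The effect of the clobber in FIG. 2 can be illustrated with the following Python sketch. The `Line` class, the `read_trace` and `orphaned_lines` functions, and the per-line trace identifier used to detect the miss are all assumptions introduced for this sketch; the actual mechanism by which a miss is detected is not specified here. The placement of T8 215 follows the statement above that the LRU algorithm selected set 4, way 0 for the tail of Trace8.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Line:
    trace: int               # owning trace (assumed tag used to detect a miss)
    kind: str                # "head", "body" or "tail"
    next_way: Optional[int]  # way of the next line in the next set; None for a tail


# (set, way) -> line, following the state described for FIG. 2
fig2_cache: Dict[Tuple[int, int], Line] = {
    (1, 1): Line(1, "head", 0),     # H1 105
    (2, 0): Line(1, "body", 1),     # B1 107, still points to set 3, way 1
    (2, 2): Line(8, "head", 1),     # H8 205
    (3, 1): Line(8, "body", 0),     # B8 210, overwrote B1 109
    (4, 0): Line(8, "tail", None),  # T8 215
    (4, 1): Line(1, "body", 3),     # B1 111, orphaned
    (5, 3): Line(1, "body", 3),     # B1 113, orphaned
    (6, 3): Line(1, "tail", None),  # T1 115, orphaned
}


def read_trace(cache: Dict[Tuple[int, int], Line],
               head_set: int, head_way: int) -> Tuple[List[Tuple[int, int]], bool]:
    """Return the lines read and whether the whole trace was read (hit)."""
    s, w = head_set, head_way
    head = cache.get((s, w))
    if head is None or head.kind != "head":
        return [], False  # a trace can only be entered through its head
    path: List[Tuple[int, int]] = []
    line = head
    while True:
        path.append((s, w))
        if line.next_way is None:
            return path, True  # tail reached
        s, w = s + 1, line.next_way
        line = cache.get((s, w))
        if line is None or line.trace != head.trace:
            return path, False  # next line belongs to another trace: miss


def orphaned_lines(cache: Dict[Tuple[int, int], Line], trace: int,
                   reachable: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Lines of a trace that remain stored but cannot be reached from its head."""
    return sorted(k for k, l in cache.items() if l.trace == trace and k not in reachable)


trace1_path, trace1_hit = read_trace(fig2_cache, 1, 1)
trace8_path, trace8_hit = read_trace(fig2_cache, 2, 2)
trace1_orphans = orphaned_lines(fig2_cache, 1, trace1_path)
```

Under these assumptions, the read of Trace1 stops after its first two lines, Trace8 reads completely, and the three remaining lines of Trace1 stay in the cache without being reachable from any head.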