The present invention is directed to the efficient utilization of caches in computer architecture. Specifically, the invention is directed to a method and apparatus for efficient cache mapping of compressed cache lines containing Very Long Instruction Word (VLIW) instructions.
Very Long Instruction Word (VLIW) architectural instructions comprise multiple operations packed into a single very long instruction word. A VLIW processor relies on an optimizing compiler to find useful work to fill the operation slots in the VLIW. To do so, the compiler uses techniques such as loop unrolling, inlining, and code motion to maximize performance. This comes at the cost of increased code size. In addition, the compiler may not be able to fill all operation slots. Thus, no-ops (no operation) are used as fillers, increasing code size further. This generally results in VLIW code being larger than code for other architectures. To combat this, VLIW code may be stored in compressed form in a cache line and decompressed as the cache line is loaded from memory.
Because compressed instructions vary in size, the instruction address (i.e., the program counter) is not incremented by a set value. Further, the cache location is indexed by the compressed line address. That is, the lower address bits are used to map lines into the cache. This leads to sequential lines either being mapped to the same cache location or being distributed to non-sequential entries in the cache. Both effects increase conflict misses and reduce cache utilization, degrading overall cache performance. This problem is further explained with reference to FIGS. 1 and 2. FIG. 1 illustrates an example of inefficient mapping of compressed cache lines into an instruction cache. FIG. 2 presents a simplified view of a typical instruction cache.
FIG. 1 shows a portion of a main memory 110 in which are stored compressed instruction cache lines 120 having different lengths and stored at memory locations (addresses) 115. The figure also shows an instruction decompression unit 125 for decompressing the compressed lines retrieved from main memory 110. The instruction cache 130 is shown having stored therein the decompressed lines 140 with corresponding instruction tag entries 135. As can be seen, sequential lines are not in order and are distributed throughout the cache. Also shown are the program counter 145 and a comparator 150. FIG. 2 shows the components of a typical instruction cache implementation and corresponds to the cache mapping shown in FIG. 1. It consists of an Instruction Cache Data RAM 210 in which the decompressed cache lines are stored, an Instruction Cache Tag RAM 215 in which the instruction tag entries are stored, a Program Counter (PC) register 220, PC increment circuitry 225, Branch logic 230, and Cache Control logic (not shown). Also shown in the figure are a comparator 240 and a memory controller 235 for controlling main memory accesses.
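The scattering effect can be seen in a purely illustrative sketch (the eight-entry direct-mapped cache and the compressed line lengths below are hypothetical and do not correspond to the figures):

    # Illustrative sketch: a direct-mapped cache indexed by the lower
    # bits of the *compressed* line address. Line lengths are made up.
    CACHE_ENTRIES = 8
    line_lengths = [3, 1, 2, 5, 1, 4, 2, 6]   # compressed sizes in words

    address = 0
    for n, length in enumerate(line_lengths):
        index = address % CACHE_ENTRIES        # lower address bits select the entry
        print(f"line {n}: compressed address {address} -> cache entry {index}")
        address += length                      # next line starts where this one ends

Sequential lines land on entries 0, 3, 4, 6, 3, 4, 0, 2: some collide, causing conflict misses, and the rest scatter out of order.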
In the typical cache implementation of FIGS. 1 and 2, the lower bits l of the Program Counter (PC) select an entry in the Instruction Cache Tag and Instruction Data RAMs. The upper bits u of the PC are compared, in the comparator, with the value retrieved from the Cache Tag. If there is a match, the access is deemed a "hit." On a hit, the Instruction Cache Line retrieved from the Cache Data RAM is passed to the processor pipeline and the PC is incremented by a set amount n. If there is no match, the access is deemed a "miss." On a miss, the PC is supplied as the memory address to the Memory Controller. The Memory Controller 235 retrieves the desired cache line from main memory at the supplied address, and loads it and the upper bits of the PC into the selected cache data and cache tag entries, respectively. The access then proceeds as a "hit."
A change in the PC sequencing can be performed through a branch instruction. A branch instruction causes the PC to be updated with a Branch Target Address, supplied in the branch instruction as an absolute or PC relative (PC plus Offset) address.
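This conventional flow may be sketched roughly as follows (the bit widths, the fixed increment n, and the helper modeling the memory controller and decompression path are assumptions for illustration, not taken from the figures):

    LOWER_BITS = 8                       # l: number of index bits (assumed)
    ENTRIES = 1 << LOWER_BITS
    LINE_SIZE = 4                        # n: set amount added to the PC on a hit

    tag_ram = [None] * ENTRIES           # Instruction Cache Tag RAM 215
    data_ram = [None] * ENTRIES          # Instruction Cache Data RAM 210

    def fetch(pc, load_line):
        # load_line models the memory controller plus decompression on a miss
        index = pc & (ENTRIES - 1)       # lower bits l select the entry
        tag = pc >> LOWER_BITS           # upper bits u go to the comparator
        if tag_ram[index] != tag:        # mismatch: a "miss"
            data_ram[index] = load_line(pc)   # PC supplied as the memory address
            tag_ram[index] = tag
        return data_ram[index]           # "hit": line passed to the pipeline

    def next_pc(pc, branch_taken=False, target=None, offset=0):
        if branch_taken:                 # branch: absolute or PC-relative target
            return target if target is not None else pc + offset
        return pc + LINE_SIZE            # sequential: increment by set amount n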
Colwell et al., in U.S. Pat. Nos. 5,057,837 and 5,179,680, addressed the issue of no-op compression for VLIW instructions. The approach packs the useful operations of an instruction into a variable-length compressed instruction. Associated with this compressed instruction is a mask word. Each bit of the mask word corresponds to an operation slot in the VLIW instruction. A mask word bit set to one specifies that a useful operation held in the compressed instruction is mapped to the operation slot. A zero specifies that a no-op occupies the slot. The VLIW instruction is reconstructed during a cache miss by expanding the compressed instruction, inserting no-ops as specified by the mask word. The reconstructed VLIW instruction and the upper bits of the program counter (PC) are loaded into the cache data and tag, respectively, at the line indexed by the lower bits of the PC. The PC is equal to the address of the compressed VLIW instruction in memory. Because the compressed instructions vary in size, the PC is not incremented by a set amount for each instruction. For this reason, the implementation also stores the next PC, computed from the mask word and the current PC, into the cache line.
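The expansion step can be sketched as follows (the slot count, the no-op encoding, and the next-PC computation below are assumptions for illustration; the cited patents define the actual formats):

    # Sketch of mask-word expansion in the style of Colwell et al.
    SLOTS = 8                            # operation slots per VLIW (assumed)
    NOP = 0                              # no-op encoding (assumed)

    def expand(mask, packed_ops):
        # Rebuild a full VLIW instruction from its compressed form.
        ops = iter(packed_ops)
        vliw = []
        for slot in range(SLOTS):
            if (mask >> slot) & 1:       # mask bit one: a useful operation
                vliw.append(next(ops))
            else:                        # mask bit zero: insert a no-op
                vliw.append(NOP)
        return vliw

    # The compressed instruction occupies one mask word plus one word per
    # set mask bit, so the next PC can be derived from the mask:
    def next_compressed_pc(pc, mask):
        return pc + 1 + bin(mask).count("1")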
One disadvantage of this method is that the cache location is indexed by the compressed instruction address. As discussed earlier and shown in FIG. 1, this leads to a reduction in cache performance. One solution is to either pad or not compress critical instruction sequences. In other words, give up some compression to improve cache performance.
Another proposal is to use a virtual memory style approach in which the PC is incremented by a set value, indexing the cache with its lower bits. On a cache miss, the PC indexes a translation table, accessing the address of the compressed line in memory. The compressed line is then accessed, decompressed, and loaded into the appropriate location in the cache. This achieves efficient mapping of the decompressed lines into cache at the cost of an additional translation table access.
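A minimal sketch of this miss path, assuming a flat translation table and illustrative helper names:

    LINE_SIZE = 4                        # the PC advances by a set value

    def miss_handler(pc, translation_table, read_compressed, decompress):
        line_number = pc // LINE_SIZE
        compressed_addr = translation_table[line_number]   # the extra table access
        # The decompressed line is then loaded into the cache entry
        # selected by the lower bits of the PC.
        return decompress(read_compressed(compressed_addr))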
In today""s burst memory systems, it is advantageous to minimize multiple random accesses, in favor of a multi-word bursts. A random access has latency of 5 to 15 times that of a sequential burst access. A drawback of the implementation presented above is that it requires an additional translation table access, which cannot be combined with other accesses. This could nearly double the miss penalty in certain implementations.
One proposal to avoid the added cost of table access is to use a Translation Lookaside Buffer (TLB). A TLB, in essence a cache of the translation table, works with the general assumption that the same translation (TLB entry) is used for many page accesses. In the case of compressed cache lines, a translation is associated with each cache line. Thus, a much larger TLB than usual is needed to achieve an effective TLB hit rate. Note that each cache line must be compressed independently of the other cache lines. This is needed because even if a cache line does not contain a branch target, the line might be replaced in the cache and later reloaded. Cache lines could be compressed together in blocks; however, this would increase the miss penalty because an instruction could then be retrieved only by decompressing from the beginning of the block.
An alternative is to allocate several words to each translation table entry. If the compressed cache line fits within the entry, only the translation table is accessed. Otherwise, a pointer to the remaining words of the compressed line is stored in one of the entry words. A critical design choice of this approach is the entry size. To achieve the best compression, the entry size should be as small as possible. To utilize the burst capability of the memory system, the entry size should be sufficiently large (at least 4 memory words, which is 32 bytes in a 64-bit memory system). To minimize the average miss penalty, the majority of the instructions executed should fit in the entry. This may conflict with "trace scheduling," which tries to maximize Instruction Level Parallelism (ILP) for the paths most likely to be executed. The more parallelism found, the less likely the code is to compress well and fit in a translation table entry; it is the cache lines least likely to be executed that are most likely to compress well. Finally, to simplify the implementation, the entry size should be a power of 2. Clearly, these goals are at odds with one another and may inhibit the use of the approach in certain implementations.
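A sketch of the fetch logic under this scheme, assuming a 4-word entry whose first word holds the compressed length and whose last word serves as the overflow pointer (all of which are illustrative choices, not specified above):

    ENTRY_WORDS = 4                       # power of 2, matches a memory burst

    def fetch_compressed(line_number, read_burst, read_words):
        entry = read_burst(line_number * ENTRY_WORDS, ENTRY_WORDS)
        length = entry[0] & 0xFF          # compressed length, assumed in word 0
        if length <= ENTRY_WORDS - 1:     # line fits in the entry itself:
            return entry[1:1 + length]    # a single burst, no second access
        inline = entry[1:ENTRY_WORDS - 1] # otherwise the last entry word is a
        pointer = entry[-1]               # pointer to the remaining words
        return inline + read_words(pointer, length - len(inline))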
The inventor has developed the present invention, which overcomes the drawbacks of the existing systems and avoids the reduction in cache utilization and the increase in conflict misses exhibited by the system shown in FIGS. 1 and 2. Through the use of the present invention, aggressive code compression can be performed without degrading instruction cache performance.
In the present invention, efficient cache mapping of compressed variable length cache lines is performed by decompressing a sequence of compressed cache lines to obtain decompressed cache lines and storing the decompressed cache lines in the same sequence in the cache memory. The present invention decouples the program counter based cache mapping from the memory address. In this way, a fixed-increment cache pointer can be combined with variable-size compressed cache lines: decompressed lines fit neatly within the cache, in sequential order, while variable-length compressed lines can be accessed directly without the use of a translation table.
The present invention includes a method of cache mapping of compressed variable length lines stored in a main memory. The method includes determining the length of a compressed line and decompressing that compressed line to obtain a decompressed line. This length is then stored, preferably in the cache memory. Furthermore, the decompressed line is stored in the cache memory. The length of the compressed line is added to a current main memory address to obtain a next main memory address. In the case of a cache miss, the main memory is accessed with said next main memory address.
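These steps might be sketched as follows (all names are illustrative, as is the assumption that the compressed line can be read together with its length):

    # Sketch of the claimed miss handling: the decompressed line and the
    # compressed length are stored in the cache, and the length advances
    # a separate main memory address.
    def handle_miss(index, memory_address, read_compressed, decompress,
                    data_ram, length_field):
        compressed, length = read_compressed(memory_address)
        data_ram[index] = decompress(compressed)     # store decompressed line
        length_field[index] = length                 # store length in the cache
        return memory_address + length               # next main memory address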
More specifically described, the present invention is directed to a method and apparatus for efficient cache mapping of compressed variable length instructions. In the present invention, an entry in an instruction cache is selected based on a lower portion of a program counter. The entry comprises a tag area and a data area. The system determines whether an address stored in said tag area corresponds to that of a desired cache line. It does this by determining whether the address stored in said tag area is a match with an upper portion of the program counter.
In the case of a match, the access is deemed a "hit" and the cache line stored in said data area of the selected entry is passed to a processor pipeline. The program counter is incremented by a set amount and the memory address is incremented by a value held in a memory address increment field in the cache line.
In the case of a mismatch, the access is deemed a "miss." The system retrieves the desired line from main memory based upon a memory address stored in a memory address register maintained separately from said program counter. The retrieved line is decompressed in an instruction decompression unit and is loaded into the data area of the selected entry in the instruction cache. Meanwhile, the upper portion of the program counter is loaded into the tag area of that selected entry in the instruction cache. The system then proceeds as described above for a "hit."
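The overall flow might be sketched as follows (bit widths, the set PC increment, and the helper names are assumptions for illustration):

    # Sketch of the invention's cache access: the PC indexes the cache
    # with a fixed increment while a separate memory address register
    # (MAR) tracks the variable-length compressed lines.
    LOWER_BITS = 8
    ENTRIES = 1 << LOWER_BITS
    PC_STEP = 4                          # set amount added to the PC on a hit

    tag_ram = [None] * ENTRIES
    data_ram = [None] * ENTRIES          # each line carries an increment field

    def access(pc, mar, read_compressed, decompress):
        index = pc & (ENTRIES - 1)       # lower portion selects the entry
        tag = pc >> LOWER_BITS           # upper portion checked against the tag
        if tag_ram[index] != tag:        # miss: fetch via the MAR, not the PC
            compressed, increment = read_compressed(mar)
            data_ram[index] = (decompress(compressed), increment)
            tag_ram[index] = tag
        line, increment = data_ram[index]          # hit: pass line to pipeline
        return line, pc + PC_STEP, mar + increment # fixed PC step, variable MAR step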
Other features and advantages of the present invention will become apparent to those skilled in the art from the following detailed description. It should be understood, however, that the detailed description and specific examples, while indicating preferred embodiments of the present invention, are given by way of illustration and not limitation. Many changes and modifications within the scope of the present invention may be made without departing from the spirit thereof, and the invention includes all such modifications.