Briefly, the present invention relates generally to the field of cache structures, and more particularly, to cache structures for use with variable length instructions that may cross cache line boundaries.
Variable length instructions occur not only for CISC processors, but also in very long instruction word (VLIW) architectures with NOP (non-operational instruction) compression. In particular, it is noted that VLIW bundles of instructions must explicitly schedule NOP operations in unused issue slots. In order to reduce code size and better utilize the instruction cache, these NOPs are compressed out. This operation results in variable length instruction bundles. With variable length bundles, some of the bundles may cross cache line boundaries. The processing necessary to handle bundles that cross cache line boundaries adversely affects die area and cycle time.
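As a minimal sketch of why NOP compression yields variable length bundles, the following assumes a four-issue machine and a purely illustrative encoding in which each surviving syllable carries its original slot index and a start bit. The names `NOP`, `compress_bundle`, and the tuple layout are assumptions made for illustration and are not taken from any particular architecture.

```python
NOP = None  # placeholder for an empty issue slot (illustrative)


def compress_bundle(slots):
    """Drop NOPs from a fixed-width bundle.

    Returns a list of (slot_index, operation, start_bit) syllables. The
    start bit is set only on the first syllable so that bundle boundaries
    can be found in the compressed stream.
    """
    kept = [(i, op) for i, op in enumerate(slots) if op is not NOP]
    if not kept:
        # Assumption: an all-NOP bundle still occupies one syllable.
        kept = [(0, "nop")]
    return [(i, op, pos == 0) for pos, (i, op) in enumerate(kept)]


program = [
    ["add", NOP, "ld", NOP],
    ["mul", "sub", "st", "br"],
    [NOP, NOP, NOP, NOP],
]
compressed = [compress_bundle(bundle) for bundle in program]
lengths = [len(bundle) for bundle in compressed]  # bundle sizes now differ
```

Because the compressed bundles no longer share one size, a bundle can begin at any syllable position and may therefore straddle a cache line boundary.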
A straightforward approach to processing instruction bundles that cross cache line boundaries is to cache the bundles in an uncompressed form. This caching can be implemented by uncompressing the bundles as they are loaded into the cache on a miss. See Lowney, P., Freudenberger, S., Karzes, T., Lichtenstein, W., Nix, R., O'Donell, J., Ruttenberg, J., "The Multiflow Trace Scheduling Compiler", Journal of Supercomputing, January 1993, pages 51-142; Wolfe, A. and Chanin, A., "Executing Compressed Programs on An Embedded RISC Architecture", International Symposium on Microarchitecture, December 1992, pages 81-91. Because the uncompressed bundles will be fixed in size, the cache line size can be chosen such that the bundles will never straddle cache line boundaries. This, however, results in reduced cache performance due to NOP instructions occupying cache slots. A second issue with this approach is the mapping of the uncompressed bundles into the cache. Because uncompressed and compressed bundles have different sizes, a mechanism is needed to translate between the PC (program counter) and the main memory addresses.
A second approach to the problem of bundles crossing cache line boundaries is to restrict the bundles to a limited set of sizes. See Beck, G., Yen, D., Anderson, T., "The Cydra 5 Mini Supercomputer: Architecture and Implementation," Journal of Supercomputing, January 1993, pages 143-180; Rau, B., Yen, D., Yen, W., and Towle, R., "The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions and Trade-offs", Computer, January 1989, pages 12-35. Bundles that fall between the allowed sizes are NOP padded to the next size up. Bundles within a cache line are restricted to be either all the same size or a limited combination of sizes. The advantage of this approach is that it is relatively simple. The disadvantage of the approach is that it limits the amount of NOP compression, both in main memory and in the cache. This results in both a larger code size and a reduction of cache performance.
Another common approach to the problem is to not allow variable length bundles to cross cache line boundaries. Bundles that would cross cache line boundaries are moved to the next line, with NOP padding filling the remainder of the current line (see Conte, T., Banerjia, S., Larin, S., Menezes, K., and Sathaye, S., "Instruction Fetch Mechanisms for VLIW Architecture with Compressed Encodings," Symposium on Microarchitecture, December 1996, pages 201-211). This design results in a reduction in code compression in both memory and cache.
A fourth approach to the crossing of cache line boundaries is to use a banked cache structure. Typically, banked caches are implemented by splitting the cache into two physical pieces, one for the even cache lines and the other for the odd cache lines. See Banerjia, S., Menezes, K., and Conte, T., "Next P.C. computation for a Banked Instruction Cache for a VLIW Architecture with a Compressed Encoding", Technical Report, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, N.C. 27695-7911, June 1996; and the Conte et al. reference noted earlier. By using this approach, adjacent cache lines can always be accessed simultaneously. A banked instruction cache is implemented in the AMD K5 processor. See Christie, D., "Developing the AMD-K5 Architecture", IEEE Micro, April 1996, pages 16-26. The disadvantage of this banked approach is that data is distributed between the two banks on a cache line basis. Since cache lines are long and bundles can start at any location within either bank's line, bus routing and multiplexing are costly. For example, a banked cache structure with 32-syllable lines has 64 possible bundle starting positions (32 in each bank). A four-issue, 32-bit per syllable machine with this line length would require a 128-bit 64-to-1 multiplexer to index the desired bundle. This is costly both in terms of die area and cycle time. In addition, the implementation typically places an incrementor on the critical timing path between the PC and the cache. Because of these points, banked caches have been used only sparingly.
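The multiplexer figures in the example above follow from simple arithmetic, which the sketch below reproduces. The function name `bundle_mux_size` and its parameter names are illustrative assumptions; only the numeric inputs come from the text.

```python
def bundle_mux_size(banks, syllables_per_line, issue_width, bits_per_syllable):
    """Return (number of mux inputs, mux data width in bits).

    Every syllable position in every bank is a possible bundle start, and
    the mux must deliver a full issue-width bundle from the chosen start.
    """
    start_positions = banks * syllables_per_line
    data_width = issue_width * bits_per_syllable
    return start_positions, data_width


# Two banks, 32-syllable lines, four-issue machine, 32-bit syllables.
inputs, width = bundle_mux_size(
    banks=2, syllables_per_line=32, issue_width=4, bits_per_syllable=32
)
```

With these inputs the result is 64 starting positions and a 128-bit data path, matching the 128-bit 64-to-1 multiplexer cited in the example.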
A proposal by STMicroelectronics is to use a single bank cache, latching the current sub-line until all syllables within the sub-line have been used. This frees the instruction cache to fetch the next sub-line if needed to complete the bundle. Intel uses a similar approach in its Pentium Pro processor. See Gwennap, L., "Intel's P6 Uses Decoupled Superscalar Design", Microprocessor Report, February 1995, pages 9-15. This single bank cache approach works well for the execution of sequential code segments, effectively mimicking a two bank cache. However, branches to bundles that straddle sub-lines result in a stall because two sub-lines are needed. Because branches occur frequently and the probability that the target will straddle a sub-line boundary is high, the degradation in performance is significant.
Briefly, in one aspect the present invention comprises a cache structure, organized in terms of cache lines, for use with variable length bundles of instructions (syllables), including: a first cache bank that is organized in columns and rows; a second cache bank that is organized in columns and rows; logic for defining the cache line into a sequence of equal sized segments, and mapping alternate segments in the sequence of segments to the columns in the cache banks such that the first bank holds even segments and the second bank holds odd segments; logic for storing bundles across at most a first column in the first cache bank and a sequentially adjacent column in the second cache bank; and logic for accessing bundles stored in the first and second cache banks.
In a further aspect of the present invention, each of the segments is at least the size of a maximum issue width for the cache structure.
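The even/odd segment mapping described above can be sketched as follows, assuming a segment of four syllables that equals the maximum issue width. The constant `SEGMENT_SIZE`, the flat syllable addressing, and the helper names are assumptions made for illustration; the sketch only shows why a bundle no longer than one segment touches at most one column in each bank.

```python
SEGMENT_SIZE = 4  # syllables per segment; assumed equal to the issue width


def locate(syllable_addr):
    """Map a syllable address to (bank, column, offset within segment).

    Even-numbered segments go to bank 0 and odd-numbered segments to
    bank 1, so consecutive segments always sit in different banks.
    """
    segment = syllable_addr // SEGMENT_SIZE
    return segment % 2, segment // 2, syllable_addr % SEGMENT_SIZE


def columns_touched(start_addr, length):
    """Return the set of (bank, column) pairs a bundle occupies."""
    return {locate(addr)[:2] for addr in range(start_addr, start_addr + length)}


# A four-syllable bundle starting at syllable 6 spans segment 1 (bank 1,
# column 0) and segment 2 (bank 0, column 1).
example = columns_touched(6, 4)
```

Because the two segments always fall in different banks, both can be read in the same access regardless of where the bundle begins.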
In a yet further aspect of the present invention, the storing logic includes logic for storing the bundles across line boundaries.
In a further aspect of the present invention, the accessing logic comprises: a first select logic for selecting a first column in the first cache bank based on first information from a program counter; and a second select logic for selecting a sequentially adjacent column, relative to the column selected in the first cache bank, in the second cache bank based on the first information from the program counter.
In yet a further aspect of the present invention, one of the first and second select logics includes an incrementor to selectively increment the first information from the program counter.
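One way to picture the two select logics and the selective incrementor is sketched below. It assumes that the first information from the program counter is a segment number, that even segments live in bank 0 and odd segments in bank 1, and that the incrementor sits on the bank 0 path; wrap-around at the end of the cache is ignored. The function name `select_columns` is hypothetical.

```python
def select_columns(start_segment):
    """Return (bank0_column, bank1_column) for a bundle's starting segment.

    The two selected columns hold start_segment and start_segment + 1.
    Only the bank 0 path needs an incrementor: when the bundle starts in
    an odd segment (bank 1), its successor is the next column of bank 0.
    """
    base = start_segment >> 1               # column index taken from the PC
    starts_in_odd_bank = start_segment & 1  # low bit chooses the start bank
    bank0_column = base + starts_in_odd_bank  # selectively incremented
    bank1_column = base
    return bank0_column, bank1_column


# Starting in segment 5 (bank 1, column 2) also selects segment 6
# (bank 0, column 3).
cols = select_columns(5)
```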
In yet a further aspect of the present invention, the accessing logic comprises: third select logic for receiving an input of second information from the program counter; and a bundle extraction multiplexer receiving a segment from each of the first cache bank and the second cache bank and outputting a composite segment formed from one or both of the received segments in accordance with a selector output from the third select logic.
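The formation of a composite segment can be illustrated with the sketch below, which assumes four-syllable segments and treats the second information from the program counter as a start-bank flag plus a word offset. These parameter names and the list-based representation are assumptions for illustration, not a description of the actual multiplexer wiring.

```python
SEGMENT_SIZE = 4  # syllables per segment (assumed)


def extract_composite(bank0_segment, bank1_segment, starts_in_odd_bank, word_offset):
    """Form one segment-sized window beginning at the bundle's first syllable.

    The segment from the starting bank supplies the leading syllables and
    the segment from the other bank supplies any remainder.
    """
    if starts_in_odd_bank:
        first, second = bank1_segment, bank0_segment
    else:
        first, second = bank0_segment, bank1_segment
    window = first + second
    return window[word_offset:word_offset + SEGMENT_SIZE]


bank0 = ["a0", "a1", "a2", "a3"]
bank1 = ["b0", "b1", "b2", "b3"]
composite = extract_composite(bank0, bank1, starts_in_odd_bank=False, word_offset=2)
```

In this sketch the selector chooses among only `SEGMENT_SIZE` starting offsets per bank, in contrast to the per-line selection required by the line-banked arrangement discussed earlier.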
In yet a further aspect of the present invention, an output from a decoder for one of the first and second cache banks is rotated by one.
In another aspect, the present invention further comprises: a first tag bank for storing even cache line tags; a second tag bank for storing odd cache line tags; tag select logic for accessing a tag from at least one of the first and second tag banks; and comparison logic for comparing the accessed tag to tag information from a program counter in order to obtain a validity determination for bundles associated with the accessed tag that were accessed from the first and/or second cache banks.
In another aspect of the present invention, the tag select logic comprises: first tag select logic for selecting a tag in the first tag bank in accordance with select information from a program counter; and second tag select logic for selecting a tag in the second tag bank in accordance with the select information from the program counter.
In yet a further aspect of the present invention, one of the first and second tag select logics includes an incrementor to selectively increment the respective information received from the program counter.
In another aspect of the present invention, the comparison logic comprises: first comparison logic for comparing the tag selected by the first tag select logic to tag information from a program counter and generating an output; second comparison logic for comparing the tag selected by the second tag select logic to tag information from the program counter and generating an output; and tag validity logic taking inputs from the first comparison logic and the second comparison logic and generating an output indicative of validity of the tags selected by the first and second comparison logics.
In another aspect of the present invention, the tag validity logic includes logic to selectively mask an output from one of the comparison logics if the bundle does not cross a cache line boundary.
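The tag comparison and masking described above might be modeled as in the following sketch. It assumes that the same program counter tag is compared against both the even-line and odd-line tags and that a single flag indicates whether the bundle crosses a cache line boundary; the function and parameter names are hypothetical.

```python
def fetch_is_valid(pc_tag, even_tag, odd_tag, starts_in_odd_line, crosses_line):
    """Combine two tag comparisons into one validity (hit) indication.

    When the bundle stays inside one cache line, the comparison for the
    line it does not touch is masked so it cannot cause a false miss.
    """
    even_match = even_tag == pc_tag
    odd_match = odd_tag == pc_tag
    if not crosses_line:
        # Assumption: mask the comparator of the untouched line.
        if starts_in_odd_line:
            even_match = True
        else:
            odd_match = True
    return even_match and odd_match


# The bundle lies entirely in an even line, so a stale odd tag is ignored.
hit = fetch_is_valid(0x1A, even_tag=0x1A, odd_tag=0x07,
                     starts_in_odd_line=False, crosses_line=False)
```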
In another aspect, the present invention further comprises: a cluster extraction multiplexer for receiving a composite segment and extracting cluster syllables.
In a further aspect of the present invention, the cluster extraction multiplexer includes: a first cluster multiplexer; a second cluster multiplexer; a first cluster select logic receiving cluster select information from the program counter and controlling an output of the first cluster multiplexer in accordance therewith; and a second cluster select logic receiving the cluster select information from the program counter and cluster fields associated with each syllable received by the second cluster multiplexer and controlling an output of the second cluster multiplexer in accordance therewith.
In another aspect, the present invention further comprises a PC logic to determine a next value for the program counter, the PC logic including: a next bundle logic to determine in which segment a next bundle begins; and a next word address logic to determine the next word address within a segment where the next bundle begins.
In a further aspect of the present invention, the next bundle logic determines the segment in which the next bundle begins based on remaining start bits in a current segment after start bits associated with any previous bundle, a current bundle, and a next segment are masked out.
In another aspect of the present invention, the next word address logic determines the next word address by masking out start bits associated with previous bundles to obtain remaining start bits, priority encoding the remaining start bits for each segment, and then selecting a Wd value in the program counter for the next bundle.
In a further aspect of the present invention, values for Wd0 and Wd1 are latched prior to multiplexing, and are selected using an IB field in the program counter.
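The masking and priority encoding steps described above can be sketched as follows, assuming one start bit per syllable that is set on the first syllable of each bundle. The names `wd0` and `wd1` echo the Wd0 and Wd1 values mentioned above, but the latching and the IB field selection are not modeled; the list representation and function names are illustrative assumptions.

```python
def priority_encode(bits):
    """Return the index of the lowest set bit, or None if no bit is set."""
    for index, bit in enumerate(bits):
        if bit:
            return index
    return None


def next_bundle(cur_start_bits, next_start_bits, wd):
    """Locate the bundle that follows the one starting at word `wd`.

    Returns (segment_advance, word), where segment_advance is 0 if the
    next bundle begins in the current segment and 1 if it begins in the
    following segment.
    """
    # Mask out start bits of previous bundles and of the current bundle.
    remaining = [bit if i > wd else 0 for i, bit in enumerate(cur_start_bits)]
    wd0 = priority_encode(remaining)        # candidate in the current segment
    wd1 = priority_encode(next_start_bits)  # candidate in the next segment
    if wd0 is not None:
        return 0, wd0
    return 1, wd1


# Bundles start at words 0 and 2 of the current segment, and at word 1
# of the next segment.
step = next_bundle([1, 0, 1, 0], [0, 1, 0, 0], wd=0)
```

A set bit remaining in the current segment means the next bundle begins there; otherwise the first start bit of the following segment gives the next word address.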
In another aspect, the present invention further comprises PC logic to determine a next value of a program counter register in branch situations that includes: logic to extract a branch instruction from a bundle, determine if the branch is taken, and if taken compute a target address; and if taken, loading logic to load the target address into the program counter register.
In another aspect, the invention further comprises a power reduction logic to enable only the first cache bank if first information is received from the program counter, to enable only the second cache bank if second information is received from the program counter, and to always enable both of the cache banks when a branch is taken.
In a further embodiment of the present invention, a method is provided for organizing a cache structure which, in turn, is organized in terms of cache lines, for use with variable length bundles of instructions (syllables), comprising: defining the cache line into a sequence of equal sized segments, and mapping alternate segments in the sequence of segments to columns in cache banks such that a first bank holds even segments and a second bank holds odd segments; storing bundles across at most a first column in the first cache bank and a sequentially adjacent column in the second cache bank; and accessing bundles stored in the first and second cache banks.
In yet a further embodiment of the present invention, a program product is provided, comprising: computer usable medium having computer readable code embodied therein for organizing a cache structure which, in turn, is organized in terms of cache lines, for use with variable length bundles of instructions (syllables), including: first code for defining the cache line into a sequence of equal sized segments, and mapping alternate segments in the sequence of segments to columns in cache banks such that a first bank holds even segments and a second bank holds odd segments; second code for storing bundles across at most a first column in the first cache bank and a sequentially adjacent column in the second cache bank; and third code for accessing bundles stored in the first and second cache banks.