This invention relates to computer systems and in particular to processors that utilize a data level cache for holding operands.
Modern microprocessors may incorporate a private local level 1 data cache (L1) to improve performance. The L1 cache holds recently accessed operand data, data prefetched for potential future operand fetch requests by the processor, or both. Caches are managed in terms of cache lines, which usually have a predefined fixed size. Line sizes typically range from 32 bytes to 256 bytes, but are not limited to those sizes. In an architecture that allows unaligned operand access (i.e., access not aligned to storage boundaries), a requested operand or operands can span multiple cache lines.
Assume the cache under discussion can return one doubleword (DW), which is 8 bytes of data, per fetch request. When the length of an operand is more than 1 byte, the fetch request can cross from one cache line to the next, i.e., part of the requested data is in a first cache line, while another part is in a second, subsequent cache line. When line crossing is involved, a Load-Store Unit (LSU) that processes the fetch request will usually have to perform two successive lookups to determine whether both lines are in its cache and, if so, where in the cache they reside.
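The line-crossing condition described above reduces to simple address arithmetic. The following is a minimal sketch, not part of any actual processor design: the 256-byte line size is an assumption chosen from the range mentioned earlier, and the function names `line_index`, `crosses_line`, and `split_at_line_boundary` are hypothetical.

```python
LINE_SIZE = 256  # assumed cache line size in bytes (illustrative only)


def line_index(addr: int, line_size: int = LINE_SIZE) -> int:
    """Return the index of the cache line that contains byte address addr."""
    return addr // line_size


def crosses_line(addr: int, length: int, line_size: int = LINE_SIZE) -> bool:
    """Return True if the bytes [addr, addr + length) span more than one line."""
    if length < 1:
        raise ValueError("length must be at least 1 byte")
    first_byte_line = line_index(addr, line_size)
    last_byte_line = line_index(addr + length - 1, line_size)
    return first_byte_line != last_byte_line


def split_at_line_boundary(addr: int, length: int,
                           line_size: int = LINE_SIZE) -> tuple[int, int]:
    """Return (bytes in the first line, bytes remaining beyond the first line)."""
    bytes_left_in_line = line_size - (addr % line_size)
    first = min(length, bytes_left_in_line)
    return first, length - first
```

Under these assumptions, an 8-byte fetch at address 0xF8 stays within one line, while the same fetch at 0xFC has 4 bytes in the first line and 4 bytes in the next, so two lookups would be needed.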
For a simple instruction, such as an 8-byte load instruction, a typical processor pipeline assumes that the lookup takes one cycle to finish. When a line crossing occurs, the processor pipeline keeps the first piece of data obtained from the first line and then has to “hold” execution, either by directly stalling execution for some cycles or by issuing a pipeline reject for some cycles, so that it can schedule a fetch to the next line to obtain the second piece of data.
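The two-step behavior on a line crossing can be illustrated with a toy software model. This is only a sketch under stated assumptions: memory is modeled as a flat `bytes` object, the line size is assumed to be 256 bytes, the load is assumed to be no longer than one line, and the hypothetical function `load` counts lookups rather than cycles, since the number of stall or reject cycles is design-specific and not given here.

```python
LINE_SIZE = 256  # assumed cache line size in bytes (illustrative only)


def load(memory: bytes, addr: int, length: int = 8,
         line_size: int = LINE_SIZE) -> tuple[bytes, int]:
    """Return (data, lookups) for a load of at most one line's worth of bytes.

    A load contained in one line needs a single lookup. A load that crosses
    a line keeps the piece from the first line, then performs a second
    lookup for the piece in the next line and merges the two.
    """
    end = addr + length
    boundary = (addr // line_size + 1) * line_size  # first byte of the next line
    if end <= boundary:
        return memory[addr:end], 1
    first_piece = memory[addr:boundary]   # held while the pipeline stalls or rejects
    second_piece = memory[boundary:end]   # obtained by the fetch to the next line
    return first_piece + second_piece, 2
```

In this model the second lookup stands in for the stall or reject cycles: the data returned is the same as for a non-crossing load, but it takes an additional access to assemble.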
For instructions that require more than 8 bytes, e.g., Load Multiple (LM) in IBM's z-Architecture, it is possible or even probable that the requested operands will cross a cache line (or multiple cache lines). In a processor design, when the length of an operand is greater than the width of the cache data return bus, multiple operand fetch requests must be performed, one for each block of data. As the requestor fetches sequentially from one block to the next, a penalty is incurred whenever a particular operand fetch request requires data to be returned from two separate lines in a given cycle. This penalty is similar to the penalty incurred when a simple instruction's operand crosses a line, as described earlier.
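To make the sequential block fetching concrete, the sketch below splits a long operand into bus-width requests and flags the requests that would need data from two lines. It is an illustration only: it assumes an 8-byte return bus, a 256-byte line, and blocks that start at the operand address rather than at an aligned boundary; the function name `block_fetches` is hypothetical.

```python
def block_fetches(addr: int, length: int, bus_width: int = 8,
                  line_size: int = 256) -> list[tuple[int, int, bool]]:
    """Split an operand into sequential fetches of at most bus_width bytes.

    Returns a list of (start_address, size, crosses_line) tuples, where
    crosses_line is True when that single fetch spans two cache lines.
    """
    fetches = []
    offset = 0
    while offset < length:
        start = addr + offset
        size = min(bus_width, length - offset)
        crossing = (start // line_size) != ((start + size - 1) // line_size)
        fetches.append((start, size, crossing))
        offset += size
    return fetches
```

For example, a 32-byte operand starting at address 244 yields four fetches, of which only the second (bytes 252 to 259) crosses a line. The same operand starting at address 240 also spans two lines, but no individual fetch crosses the boundary, so under this model no crossing penalty would be incurred.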
In a processor that implements an instruction set architecture with many long operand instructions (for example, IBM's z-Architecture), and in which the pipeline stall caused by a line crossing can last many cycles, it is important to have a solution that both avoids unnecessary line crossing penalties and has low latency, so that it does not impact the performance of operand fetches that do not cross a line. Some processor designs merely tolerate the line reject penalty in the middle of a long operand instruction. Other designs try to solve this problem by always inserting a “gap” (or stall) cycle after the initial address generation to determine whether there is a line crossing and to readjust the fetching pattern accordingly. Other possible solutions provide multi-ported L1 directory and cache arrays so that line X and line X+1 can be accessed concurrently, but this impacts both area and frequency, because the required array design is relatively large and thus slower. Accordingly, an improved low-latency solution for avoiding unnecessary line crossing penalties is needed.