A cache is generally a small, fast memory holding recently accessed data, designed to speed up subsequent access to the same data. Instructions and data are transferred from main memory to the cache in blocks, using a look-ahead algorithm. The cache stores this information in one or more cache lines. Typically, sequential lines of instructions are stored in the cache lines. A fetch engine system speculatively stores consecutive lines of instructions in anticipation of their future use.
FIG. 1 illustrates a prior art fetch engine system that fetches a new cache line from the instruction cache every fetch cycle. The fetch engine system consists of a branch target buffer (BTB) engine, a branch predictor (BP), a return address stack (RAS), logic to determine the next address, and an instruction cache. The fetch engine fetches one full block of instructions from the instruction cache per cycle. The BTB engine and the branch predictor (BP) provide instruction information for the current fetch cycle. The logic to determine the next address provides the next fetch address. The instruction cache consists of multiple cache lines. Note that the at sign “@” means “address.”
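For illustration only, the role of the logic that determines the next address can be sketched as follows. This is a minimal sketch under stated assumptions, not the circuit of FIG. 1: the `BranchInfo` record, its field names, the selection order, and the function `next_fetch_address` are all hypothetical, and the RAS is modeled as a plain Python list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BranchInfo:
    """Hypothetical summary of what the BTB engine and BP report for one fetch block."""
    is_return: bool = False
    is_call: bool = False
    predicted_taken: bool = False
    target: Optional[int] = None       # predicted target address (assumed to come from the BTB)
    return_addr: Optional[int] = None  # address of the instruction following a call


def next_fetch_address(addr: int, fetch_bytes: int,
                       info: Optional[BranchInfo], ras: List[int]) -> int:
    """Pick the next fetch address (assumed selection order, for illustration)."""
    if info is not None:
        # A predicted return takes its target from the top of the return address stack.
        if info.is_return and ras:
            return ras.pop()
        # A call pushes its return address so that a later return can use it.
        if info.is_call and info.return_addr is not None:
            ras.append(info.return_addr)
        # A predicted-taken branch or call redirects fetch to the predicted target.
        if info.predicted_taken and info.target is not None:
            return info.target
    # Otherwise fetch continues sequentially.
    return addr + fetch_bytes
```

In this sketch, a fetch block with no branch information simply advances the address by the fetch width, which is the sequential case that keeps fetch within the same cache line for several cycles.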
In general, a cache line is a unit of information, such as multiple bytes, words, etc. Rather than reading a single word or byte from main memory at a time, each cache entry usually holds a certain number of words, known as a “cache line,” and a whole line is read and cached at once. In most Reduced Instruction Set Computer (RISC) systems, the cache lines are 32 bytes or 64 bytes wide. Typically, instructions are 4 bytes wide and fetch engines are designed to fetch 3-5 instructions (12-20 bytes) per clock cycle. As a result, the same cache line is very frequently fetched in several consecutive cycles. This is especially true for long cache lines.
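The effect of line length on repeated fetches of the same cache line can be estimated with a short calculation. The following is a minimal sketch, assuming purely sequential fetching with no branches, a fetch width of 16 bytes (four 4-byte instructions), and an aligned start address; the function name `same_line_fraction` and its parameters are illustrative only.

```python
def same_line_fraction(line_bytes: int, fetch_bytes: int = 16,
                       cycles: int = 1024, start: int = 0) -> float:
    """Fraction of fetch cycles that access the same cache line as the previous cycle.

    Assumes sequential fetch (no taken branches) and that fetch_bytes divides
    line_bytes, so a fetch block never straddles two cache lines.
    """
    repeats = 0
    prev_line = None
    addr = start
    for _ in range(cycles):
        line = addr // line_bytes   # index of the cache line holding this fetch block
        if line == prev_line:
            repeats += 1            # same line as in the previous cycle
        prev_line = line
        addr += fetch_bytes         # advance to the next sequential fetch block
    return repeats / cycles


for line_bytes in (32, 64, 128):
    print(line_bytes, same_line_fraction(line_bytes))
```

Under these assumptions, the fraction of cycles that re-read the previous line is 0.5 for a 32-byte line, 0.75 for a 64-byte line, and 0.875 for a 128-byte line. Real instruction streams contain taken branches, so actual figures would differ.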
Typically, fetch performance is a very important factor because it effectively limits the overall processor performance. However, traditional thinking is usually that there is little performance advantage in increasing front-end performance beyond what the back-end can consume. For each processor design, the target is typically to build the best possible fetch engine for the required performance level. Thus, a fetch engine can fetch a certain number (width) of instructions per clock cycle. The fetch width of the fetch operation performed by the fetch engine is matched to the number of instructions that the processor can consume.
The prior art fetch engine reads a cache line from the instruction cache every cycle and then extracts the requested instructions from that cache line, as instructed by the fetch request. The fetch engine first tries to obtain these instructions from the instruction cache and then from main memory. The wider a cache line is, the more chip area it occupies. If a fetch engine can only fetch X instructions, then traditional thinking has asked why one would build a cache line that stores 8X instructions, because the processor will not execute those instructions any faster than if the cache line stored 2X instructions. Under this view, fetching a cache line wider than the actual fetch width of the processor wastes a number of the transferred instructions, because not all the instructions in the cache line will actually be fetched for execution.
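The traditional argument above amounts to simple arithmetic on the ratio of fetch width to line width. The following is a minimal sketch of that arithmetic, assuming one cache line is read per cycle and at most one fetch width of instructions is extracted from it; the function name `unused_fraction` and the example widths are illustrative only.

```python
def unused_fraction(line_instructions: int, fetch_width: int) -> float:
    """Fraction of a cache line read in one cycle that is not extracted by the fetch engine."""
    if line_instructions <= 0 or fetch_width <= 0:
        raise ValueError("widths must be positive")
    # The fetch engine cannot extract more instructions than the line holds.
    used = min(fetch_width, line_instructions)
    return 1 - used / line_instructions


# Example: 4-byte instructions and a fetch width of 4 instructions.
print(unused_fraction(8, 4))   # 32-byte line (8 instructions)
print(unused_fraction(16, 4))  # 64-byte line (16 instructions)
```

With these example values, half of a 32-byte line and three quarters of a 64-byte line go unused in a given cycle, which is the waste the traditional view refers to.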
However, a fetch engine is better not only if it provides better performance, but also if it takes fewer resources, requires less chip area, or consumes less power. Power consumption is becoming an important design factor in high performance microarchitectures. A design that consumes as little energy and dissipates as little power as possible is also advantageous.
Also, the increasing clock frequencies employed in current and future generation processors limit the size of cache memories, or else increase their access time. Line buffers have been implemented in main memory chips to reduce access time, providing a level of cache within the memory chip itself. However, some traditional thinking has been not to use line buffers for on-chip cache memories because they do not offer any speed advantage as long as the access time is one cycle.
While the invention is subject to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail. The invention should be understood to not be limited to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.