Modern microprocessors and other programmable processor circuits utilize a hierarchy of memories to store and supply instructions. A common hierarchy includes an instruction cache or L1 cache that is relatively close to the core of the processor, for example, on the processor chip. Instructions are loaded to the L1 instruction cache from a somewhat more remote or L2 cache, which stores both instructions and data. One or both caches are loaded with instructions from main memory, and the main memory may be loaded from more remote sources, such as disk drives of the device that incorporates the processor. The cache memories enhance performance. Because of its proximity to the processor core, for example, fetching of instructions from the L1 cache is relatively fast.
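As a rough illustration of the hierarchy described above, the following minimal sketch models each memory level as a Python dictionary keyed by line index. The function name `fetch_line`, the dictionary model, and the policy of filling both caches on a miss are assumptions made for illustration only; capacity limits and eviction are ignored.

```python
def fetch_line(line, l1, l2, main_memory):
    """Return (data, source) for a line index, filling caches on the way back.

    Hypothetical model: l1, l2 and main_memory are plain dicts; no eviction.
    """
    if line in l1:                 # fastest case: hit in the L1 cache
        return l1[line], "L1"
    if line in l2:                 # L1 miss, L2 hit: load the line into L1
        l1[line] = l2[line]
        return l1[line], "L2"
    data = main_memory[line]       # miss in both caches: go to main memory
    l2[line] = data                # assumption: both caches are filled
    l1[line] = data
    return data, "main memory"


if __name__ == "__main__":
    l1, l2, main_memory = {}, {}, {0: "instr-line-0"}
    print(fetch_line(0, l1, l2, main_memory))  # first access comes from main memory
    print(fetch_line(0, l1, l2, main_memory))  # repeat access hits in L1
```

A repeated access to the same line is served from the L1 dictionary, which mirrors the reason the nearer cache improves fetch speed.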
In many implementations, a line of the instruction cache holds a number of instructions. If the number of bits per instruction is fixed for all instructions, a cache line can be sized to hold an integer number of such instructions. For example, if each instruction is 32 bits, a 256-bit cache line will hold eight such instructions, and the boundaries of the first and last instructions stored in the line match or align with the boundaries of the cache line. However, if the processor handles instructions of different lengths, e.g., 32-bit instructions and 16-bit instructions, then the instructions in a given cache line may not align with the boundaries of that line. If the processor architecture mandates that an instruction may not overlap two cache lines, then some cache space will be wasted, because the bits left over at the end of a line cannot hold the first part of a longer instruction. However, many architectures do not impose such a restriction. In the latter case, problems occur in reading an instruction that has one part stored in one line and the rest stored in another line, e.g., a 32-bit instruction having 16 bits at the end of one line of the cache and the other 16 bits at the beginning of the next cache line.
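The alignment arithmetic in the paragraph above can be sketched as follows. This is a minimal illustration, assuming byte addressing and the 256-bit (32-byte) line from the example; the names `LINE_BYTES`, `crosses_line` and `split_sizes` are hypothetical and do not come from any particular architecture.

```python
LINE_BYTES = 32  # 256-bit cache line, as in the example above (assumed byte-addressed)


def line_index(addr: int) -> int:
    """Index of the cache line that contains byte address addr."""
    return addr // LINE_BYTES


def crosses_line(addr: int, length: int) -> bool:
    """True if an instruction of `length` bytes starting at addr spans two lines."""
    return line_index(addr) != line_index(addr + length - 1)


def split_sizes(addr: int, length: int) -> tuple:
    """Bytes of the instruction held in the first line and in the next line."""
    first = min(length, LINE_BYTES - addr % LINE_BYTES)
    return first, length - first


if __name__ == "__main__":
    # Fixed 32-bit (4-byte) instructions: eight fit in a line, none is split.
    print(LINE_BYTES // 4, any(crosses_line(a, 4) for a in range(0, 64, 4)))
    # A 16-bit instruction at offset 0 shifts later 32-bit instructions by 2 bytes,
    # so the one starting at byte 30 has 16 bits in each of two lines.
    print(crosses_line(30, 4), split_sizes(30, 4))
```

With mixed 16-bit and 32-bit instructions, a single 16-bit instruction early in a line is enough to push a later 32-bit instruction across the line boundary.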
Modern programmable processor circuits often rely on a pipeline processing architecture to improve execution speed. A pipelined processor includes multiple processing stages for sequentially processing each instruction as it moves through the pipeline. While one stage is processing an instruction, other stages along the pipeline are concurrently processing other instructions. Each stage of a pipeline performs a different function necessary in the overall processing of each program instruction. Although the order and/or functions may vary slightly, a typical simple pipeline includes an instruction Fetch stage, an instruction Decode stage, a memory access or Readout stage, an instruction Execute stage and a result Write-back stage. More advanced processor designs break some or all of these stages down into several separate stages for performing sub-portions of these functions. Superscalar designs break the functions down further and/or provide duplicate functions, to perform operations in parallel pipelines of similar depth.
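The overlap between stages can be pictured with a small sketch. It assumes an idealized five-stage pipeline in which one instruction enters per cycle and nothing stalls; the names `STAGES`, `pipeline_occupancy` and `total_cycles` are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Readout", "Execute", "Write-back"]


def pipeline_occupancy(num_instructions: int, cycle: int) -> dict:
    """Map each stage to the instruction index it holds in `cycle` (None if empty).

    Idealized assumption: instruction i enters Fetch in cycle i and advances
    one stage per cycle with no stalls.
    """
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        instr = cycle - depth
        occupancy[stage] = instr if 0 <= instr < num_instructions else None
    return occupancy


def total_cycles(num_instructions: int) -> int:
    """Cycles needed to drain the idealized pipeline."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1


if __name__ == "__main__":
    for cycle in range(total_cycles(3)):
        print(cycle, pipeline_occupancy(3, cycle))
```

In this idealized model, three instructions finish in seven cycles rather than the fifteen that strictly sequential processing through five stages would take.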
The Fetch stage is the portion of the pipeline processor that obtains the instructions from the hierarchical memory system. In many pipeline designs, the Fetch operation is broken down into two or more stages. Of these stages, one stage collects the instructions when fetched from the L1 cache and communicates with the higher level memories to obtain instruction data not found in the L1 cache.
A problem can occur in such a fetch operation where the boundaries of the instructions cross the cache line boundaries, and part of a desired instruction is not yet present in the L1 cache. For example, if the stage that collects the instructions fetched from the L1 cache receives a first part of the instruction, that stage will not communicate with the higher level memories, because the first part was found in the L1 cache. Similarly, if that stage has already obtained the line containing the first piece from the higher level memory, it will not initiate a second request for the line containing the other piece of the instruction. Instead, it waits to receive the rest of the instruction from processing of the next cache line by the preceding stage. However, if the preceding stage detects that the rest of the desired instruction is not in the appropriate line of the L1 cache (a miss), it cannot provide the remaining part of the instruction, and it does not have the capability to access the higher level memories to obtain the missing piece. Normally, in the case of a miss, the preceding stage would pass the address down to the next stage, which would request the data from higher level memory; but that next stage is waiting for the second piece of a split instruction to come from the preceding stage. In some extreme cases, the Fetch processing may lock up for some period, waiting for a portion of the instruction that neither stage can request.
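The lock-up described above can be reproduced in a toy model. The sketch below is an assumption-laden simplification, not a description of any real fetch unit: `ToyFetch` and `fetch_split` are hypothetical names, line fills from higher level memory are modeled as instantaneous, and a stall is reported when a fixed cycle budget runs out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFetch:
    l1: set                                        # line indices present in the L1 cache
    requests: list = field(default_factory=list)   # lines requested from higher level memory
    pieces: list = field(default_factory=list)     # lines whose piece the collecting stage holds

    def lookup(self, line: int) -> bool:
        """Preceding stage: L1 lookup only; it cannot reach higher level memory."""
        return line in self.l1

    def collect(self, line: int, hit: bool) -> bool:
        """Collecting stage: return True if it accepted the piece or the miss."""
        if hit:
            self.pieces.append(line)
            return True
        if self.pieces:
            # Already holding the first piece of a split instruction: it is
            # waiting for data from the preceding stage and makes no request.
            return False
        self.requests.append(line)   # ordinary miss: request the line
        self.l1.add(line)            # fill modeled as instantaneous (assumption)
        self.pieces.append(line)
        return True


def fetch_split(l1_lines, first_line: int, max_cycles: int = 8):
    """Fetch an instruction split across first_line and first_line + 1."""
    fetch = ToyFetch(set(l1_lines))
    pending = [first_line, first_line + 1]
    for _ in range(max_cycles):
        if not pending:
            break
        line = pending[0]
        if fetch.collect(line, fetch.lookup(line)):
            pending.pop(0)
    return ("stalled" if pending else "complete"), fetch.requests


if __name__ == "__main__":
    print(fetch_split({5, 6}, 5))  # both lines cached
    print(fetch_split({5}, 5))     # second line misses after the first piece is held
    print(fetch_split(set(), 5))   # first line is requested, second line misses
```

In this model the outcome depends only on whether the second line is in the L1 cache: once the collecting stage holds the first piece, a miss on the second line leaves no stage able to issue the request.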
It might be possible to allow more than one stage in the fetch section of the pipeline to request instructions from the other memory resources, to avoid the above-identified problem. Such a solution, however, adds complexity in the construction of the fetch stages, in the interconnection of the fetch stages to other memory resources, and in the management of the flow of instructions to and through the fetch stages. For a high performance processor design, it is desirable to make requests to higher level memory resources from as few places as possible, e.g., because each such request delays other processing while waiting for return of the requested data. Hence, there is still room for improvement in fetching instructions where instructions may cross cache line boundaries.