The present invention relates generally to the field of processors and in particular to a system, method, and software for pre-loading instructions from a variable-length instruction set into an instruction cache, and correctly pre-decoding the instructions.
Microprocessors perform computational tasks in a wide variety of applications. Improved processor performance is almost always desirable, to allow for faster operation and/or increased functionality through software enhancements. By exploiting both architectural advances, such as RISC architectures, and advances in semiconductor technology, many modern processors execute at clock speeds much higher than those of memory chips, such as DRAM and SDRAM. To minimize the penalty of relatively slow memory accesses, these processors utilize a hierarchical memory structure, with fast, on-chip cache memories storing local copies of data and instructions that have been recently accessed and/or that the processor anticipates (via software) accessing in the near future.
One example of anticipating loads to an instruction cache is a preload instruction (e.g., the ARM instruction PLI). A preload instruction loads instructions into the instruction cache in advance of their execution. A programmer may insert a preload instruction into code somewhat ahead of a known or anticipated branch (such as a subroutine call) to a target. The preload instruction loads instructions at the target into the instruction cache, so that when the branch executes, instructions at the target are available in the cache for immediate execution.
Most modern processors employ a pipelined architecture, where sequential instructions are overlapped in execution to increase overall processor throughput. Maintaining smooth execution through the pipeline is critical to achieving high performance. One pipeline optimization technique known in the art is pre-decoding instructions. Instructions are examined as they are retrieved from memory, are partially decoded, and some information about the instructions—known as pre-decode information—is stored in an instruction cache along with the associated instructions. When the instructions are later fetched from the cache into the pipeline, the pre-decode information is also fetched, and assists in fully decoding the instructions. Pre-decode information may, for example, identify branch and floating-point instructions, or other information useful in executing the instructions in the pipeline.
Some instruction sets include instructions having different lengths. For example, a processor may have a 32-bit native instruction set, and additionally include the ability to execute Java byte codes. As another example, several versions of the ARM processor architecture execute instructions from an instruction set known as Thumb-2. Thumb-2 instructions may be sixteen or thirty-two bits long. In general, a compiler selects Thumb-2 instructions to achieve the most compact code, and 16-bit and 32-bit instructions may be mixed in any order within a segment of Thumb-2 code. Furthermore, in general, there is no restriction on the alignment of instructions in memory.
One useful function of a pre-decoder in a variable-length instruction set processor is to ascertain and indicate the length of each instruction in an instruction cache line. That is, the pre-decode information may identify instruction boundaries, which may be determined, for example, by inspection of the instruction op codes. This information assists the pipeline fetch and decode stages upon fetching the instructions into the pipeline for execution.
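By way of illustration only, the following minimal sketch shows how instruction boundaries might be derived by inspecting op codes. It assumes Thumb-2 encoding, in which a first halfword whose five most significant bits are 0b11101, 0b11110, or 0b11111 begins a 32-bit instruction and any other first halfword is a complete 16-bit instruction. The function names and the sample halfword values are hypothetical, and the sketch assumes the first halfword of the block begins an instruction.

```python
def thumb2_length(halfword):
    """Return the instruction length in bytes, judged from its first halfword."""
    top5 = (halfword >> 11) & 0x1F
    # Assumption: Thumb-2 encoding, where these three prefixes mark a 32-bit instruction.
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def predecode_boundaries(halfwords):
    """Return one flag per halfword: True where an instruction begins.

    Assumes halfwords[0] is the start of an instruction.
    """
    starts = [False] * len(halfwords)
    i = 0
    while i < len(halfwords):
        starts[i] = True
        # Advance by one halfword for a 16-bit instruction, two for a 32-bit one.
        i += thumb2_length(halfwords[i]) // 2
    return starts


# Hypothetical block: a 16-bit, a 32-bit, then a 16-bit instruction.
block = [0x4608, 0xF000, 0xF801, 0xBD00]
flags = predecode_boundaries(block)
```

The resulting flags play the role of the pre-decode information stored alongside the instructions; a hardware pre-decoder would typically compute them in parallel logic rather than in a loop.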
One problem encountered in pre-loading instructions from a variable-length instruction set and pre-decoding the instructions to indicate instruction boundaries is that, because instructions are not necessarily aligned in memory, they may not be aligned along instruction cache line boundaries. Consequently, the first datum (e.g., halfword) in a cache line-size data block retrieved from memory may be a complete instruction (e.g., a 16-bit instruction) or the first part of an instruction that is continued in the same cache line (e.g., the first 16 bits of a 32-bit instruction), either of which is detected by the pre-decoder. However, the first datum may instead be part of an instruction that began prior to the data block (e.g., the last 16 bits of a 32-bit instruction) or may comprise non-instruction data such as a variable, immediate data, or the like. In either of the latter two cases (a partial instruction or data), the first datum may be incorrectly pre-decoded as a complete instruction or the first part of an instruction.
Since the lengths, and hence the boundaries, of the other instructions in the data block are determined sequentially, incorrectly pre-decoding the first datum may cause all instructions in the data block to be incorrectly pre-decoded, and the incorrect pre-decode information to be written with the instructions to the instruction cache. When the instructions are subsequently fetched from the cache into the pipeline and the pre-decode information is determined to be erroneous, the cache line must be discarded and the instructions fetched again from memory and pre-decoded to determine the correct instruction boundaries, which adversely affects both processor performance and power consumption.
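The sequential propagation of such an error may be illustrated with the following minimal sketch. It assumes Thumb-2 length encoding (a first halfword whose five most significant bits are 0b11101, 0b11110, or 0b11111 begins a 32-bit instruction). The function names, the first_start parameter, and the halfword values are hypothetical. The block is constructed so that its first halfword is the tail of a 32-bit instruction that began before the block.

```python
def thumb2_length(halfword):
    """Return the instruction length in bytes, judged from its first halfword."""
    top5 = (halfword >> 11) & 0x1F
    # Assumption: Thumb-2 encoding of 16-bit versus 32-bit instructions.
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def boundaries(halfwords, first_start=0):
    """Flag instruction starts, walking sequentially from index first_start."""
    starts = [False] * len(halfwords)
    i = first_start
    while i < len(halfwords):
        starts[i] = True
        i += thumb2_length(halfwords[i]) // 2
    return starts


# Hypothetical block: halfword 0 is the second half of a 32-bit instruction
# from the preceding block; the true first instruction starts at halfword 1.
block = [0xF801, 0xF000, 0xF802, 0xF000, 0xF803, 0x4608]

naive = boundaries(block)                   # wrongly treats halfword 0 as a start
actual = boundaries(block, first_start=1)   # walks from the true first start
mismatches = sum(n != a for n, a in zip(naive, actual))
```

In this constructed block the naive walk never resynchronizes with the true boundaries, so every flag in the block is wrong. Other bit patterns may resynchronize after a few halfwords, but the pre-decoder cannot rely on that.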