Microprocessors perform computational tasks in a wide variety of applications. Improved processor performance is almost always desirable, to allow for faster operation and/or increased functionality through software enhancements. By exploiting both architectural advances, such as RISC architectures, and advances in semiconductor technology, many modern processors execute at clock speeds much higher than those of memory chips, such as DRAM and SDRAM. To minimize the penalty of relatively slow memory accesses, these processors utilize a hierarchical memory structure, with fast, on-chip cache memories storing local copies of data and instructions that have been recently accessed and/or that the processor anticipates (via software) accessing in the near future.
One example of anticipating loads to an instruction cache is a preload instruction (e.g., the ARM instruction PLI). A preload instruction loads instructions into the instruction cache in advance of their execution. A programmer may insert a preload instruction into code somewhat ahead of a known or anticipated branch, such as a subroutine call, to a target. The preload instruction loads instructions at the target into the instruction cache, so that when the branch executes, instructions at the target are available in the cache for immediate execution. Instructions to be loaded into a cache in response to a preload instruction are referred to herein as pre-loaded instructions.
Most modern processors employ a pipelined architecture, where sequential instructions are overlapped in execution to increase overall processor throughput. Maintaining smooth execution through the pipeline is critical to achieving high performance. One pipeline optimization technique known in the art is pre-decoding instructions. Instructions are examined as they are read from memory and are partially decoded, and some information about the instructions, known as pre-decode information, is stored in an instruction cache along with the associated instructions. When the instructions are later fetched into the pipeline from the cache, the pre-decode information is also fetched and is used to assist in fully decoding the instructions. Pre-decode information may, for example, identify branch and floating-point instructions, indicate the length of an instruction in a variable-length instruction set architecture, or provide other information useful in executing the instructions in the pipeline.
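As a rough illustration of the pre-decoding idea, the following Python sketch models a cache line fill that stores pre-decode bits alongside each instruction word. The 16-bit toy encoding, the opcode values, and the names CacheEntry, predecode, and fill_line are all hypothetical; they are assumptions made for illustration and do not correspond to any real instruction set or processor.

```python
from dataclasses import dataclass

# Hypothetical toy encoding (an assumption for illustration only):
# the top 4 bits of each 16-bit word hold the opcode.
OP_BRANCH = 0xB
OP_FLOAT = 0xF


@dataclass(frozen=True)
class CacheEntry:
    word: int        # raw instruction bits
    is_branch: bool  # pre-decode bit stored with the instruction
    is_float: bool   # pre-decode bit stored with the instruction


def predecode(word: int) -> CacheEntry:
    """Partially decode one instruction word as it is read from memory."""
    opcode = (word >> 12) & 0xF
    return CacheEntry(word, opcode == OP_BRANCH, opcode == OP_FLOAT)


def fill_line(words):
    """Model a cache line fill: keep each word with its pre-decode bits."""
    return [predecode(w) for w in words]


line = fill_line([0x1234, 0xB00F, 0xF001])
print([(e.is_branch, e.is_float) for e in line])
# prints [(False, False), (True, False), (False, True)]
```

In this sketch the pipeline would read is_branch and is_float directly from the cache entry at fetch time rather than re-examining the opcode bits.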
Some modern processors are able to execute instructions according to two or more instruction set encodings. For example, a processor may have a 32-bit native instruction set, and additionally include the ability to execute Java byte codes. As another example, several versions of the ARM processor architecture execute 32-bit ARM instruction encodings in an ARM instruction set operating mode, and 16-bit or 32-bit Thumb instruction encodings in a Thumb instruction set operating mode. Programmers explicitly set the proper instruction set operating mode in software, prior to executing instructions according to the relevant instruction set encoding.
When a conventional processor having a pre-decoder executes a preload instruction, the pre-loaded instructions are pre-decoded according to the current instruction set operating mode. If a programmer wishes to preload instructions in an instruction set different from the current processor instruction set operating mode (for example, in anticipation of a branch to a code section written in a different instruction set), the programmer must change the instruction set mode prior to executing the preload instruction.
For example, a program executing ARM instructions in ARM mode may change to Thumb mode and branch to a code segment containing only Thumb instructions. To preload the Thumb instructions with correct pre-decoding, the program must switch the current processor instruction set operating mode from ARM to Thumb, execute a preload instruction with the target address of the Thumb code, and then switch back to ARM mode to execute the remaining ARM instructions, prior to the branch into the Thumb code section.
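The sequence described above can be sketched with a small Python model that counts mode switches. ToyProcessor, its methods, and the address THUMB_TARGET are hypothetical names introduced only for illustration; the model assumes nothing beyond what is stated above, namely that the pre-decoder follows the current instruction set operating mode.

```python
class ToyProcessor:
    """Hypothetical model: the pre-decoder follows the current mode."""

    def __init__(self, mode="ARM"):
        self.mode = mode
        self.mode_switches = 0
        self.icache = {}  # address -> mode used to pre-decode that line

    def set_mode(self, mode):
        # Count only real changes of instruction set operating mode.
        if mode != self.mode:
            self.mode = mode
            self.mode_switches += 1

    def preload(self, address):
        # Conventional behavior: pre-decode according to the current mode.
        self.icache[address] = self.mode


THUMB_TARGET = 0x8000  # hypothetical address of the Thumb code section

cpu = ToyProcessor(mode="ARM")
cpu.set_mode("Thumb")      # switch 1: so the preload pre-decodes as Thumb
cpu.preload(THUMB_TARGET)
cpu.set_mode("ARM")        # switch 2: resume the remaining ARM code
print(cpu.icache[THUMB_TARGET], cpu.mode_switches)
# prints Thumb 2
```

The single preload costs two mode switches in this model, on top of the eventual switch performed by the branch into the Thumb code itself.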
Failure to enter Thumb mode to execute the preload instruction would cause the pre-decoder to incorrectly pre-decode the pre-loaded Thumb instructions according to ARM instruction set encodings. In that case, when the instructions were later fetched from the instruction cache, the pre-decode information would be useless and the Thumb instructions would have to be re-fetched, incurring the very delay that the preload was meant to obviate. In addition, in an instruction cache having an ARM/Thumb bit as part of the tag, the re-fetched Thumb instructions would be allocated a separate cache line, and the Thumb instructions incorrectly pre-decoded as ARM instructions would remain in the cache, wasting cache space and hence lowering the cache hit rate and system performance.
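The cache-pollution effect can be illustrated with a minimal Python sketch of an instruction cache whose tag includes the instruction set mode. TaggedICache and its methods are hypothetical, and the model ignores line size, associativity, and replacement; it only shows that a line preloaded under the wrong mode misses on a later Thumb fetch and remains resident next to the re-fetched line.

```python
class TaggedICache:
    """Hypothetical cache whose tag includes the instruction set mode."""

    def __init__(self):
        self.lines = {}      # (address, mode bit) -> mode used to pre-decode
        self.refetches = 0

    def preload(self, address, current_mode):
        # The line is tagged and pre-decoded with the current mode.
        self.lines[(address, current_mode)] = current_mode

    def fetch(self, address, mode):
        key = (address, mode)
        if key not in self.lines:
            # Tag mismatch: re-fetch from memory and allocate a new line.
            self.refetches += 1
            self.lines[key] = mode
        return self.lines[key]


cache = TaggedICache()
cache.preload(0x8000, "ARM")   # preload issued without entering Thumb mode
cache.fetch(0x8000, "Thumb")   # the branch target is fetched in Thumb mode
print(cache.refetches, len(cache.lines))
# prints 1 2
```

After the fetch, the model holds two lines for the same address: the correctly pre-decoded Thumb line and the stale line pre-decoded as ARM.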
In many processors, the overhead involved in switching the current processor instruction set operating mode twice to execute a single preload instruction may outweigh the benefit of pre-loading instructions in the anticipated instruction set.