1. Field of the Invention
The present invention relates to a data processing apparatus and method for handling instructions to be executed by processing circuitry within the data processing apparatus.
2. Description of the Prior Art
In a typical data processing apparatus, significant power is consumed in decoding instructions prior to execution within the execution pipelines of the processing circuitry. This issue can become particularly problematic in processing circuitry that supports multiple instruction sets, since often multiple separate decoders will need to be provided for decoding instructions from the various instruction sets. By way of example, in some implementations approximately 15% of the processor power may be consumed by the instruction decoders.
It is typically the case that one or more caches are provided within the data processing apparatus for caching the instructions and data required by the processing circuitry. At any particular level in a cache hierarchy, separate instruction and data caches may be provided (often referred to as a Harvard architecture), or alternatively a unified cache may be provided for storing the instructions and data (often referred to as a Von Neumann architecture). When instructions are fetched from memory for storing in a cache, some known systems have employed pre-decoding mechanisms for performance-orientated reasons. In accordance with such mechanisms, instructions are pre-decoded prior to storing in the cache, and in such cases the cache often then stores instructions in a wider format than the instructions stored in main memory, to accommodate the additional information produced by the pre-decoding process. To assist in improving performance when the instructions are later decoded and executed, the extra information provided in the pre-decoded instructions as stored in the cache has been used to identify branch instructions, to identify classes of instructions (e.g. load/store instructions, coprocessor instructions, etc.) so as to later assist multi-issue circuitry in dispatching particular instructions to particular execution pipelines, and to identify instruction boundaries in variable length instruction sets.
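By way of illustration only, the widening of instructions on a cache fill can be sketched as follows. This is a minimal sketch and not a description of any particular known system: the 32-bit toy encoding (top four bits selecting the instruction class), the class names, and the helper names `predecode_word` and `fill_cache_line` are all hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass
from enum import Enum


class InstrClass(Enum):
    ALU = 0
    LOAD_STORE = 1
    BRANCH = 2
    COPROCESSOR = 3


# Hypothetical toy encoding: the top 4 bits of a 32-bit word select the class.
# No real instruction set is being modelled here.
_TOY_OPCODE_TO_CLASS = {
    0x1: InstrClass.LOAD_STORE,
    0x2: InstrClass.BRANCH,
    0x3: InstrClass.COPROCESSOR,
}


@dataclass(frozen=True)
class PredecodedEntry:
    """Cache entry wider than the word in memory: raw bits plus tag fields."""
    raw: int
    instr_class: InstrClass
    is_branch: bool


def predecode_word(raw: int) -> PredecodedEntry:
    """Pre-decode one instruction word as it is written into the cache."""
    opcode = (raw >> 28) & 0xF
    instr_class = _TOY_OPCODE_TO_CLASS.get(opcode, InstrClass.ALU)
    return PredecodedEntry(raw, instr_class, instr_class is InstrClass.BRANCH)


def fill_cache_line(words: list[int]) -> list[PredecodedEntry]:
    """Pre-decode every word of a line fetched from memory."""
    return [predecode_word(w) for w in words]
```

In such a sketch, later pipeline stages could read `is_branch` or `instr_class` directly from the cache entry rather than re-deriving them from the raw bits, which is the performance benefit the known mechanisms aim at.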
For example, the article “Performance Evaluation Of A Decoded Instruction Cache For Variable Instruction Length Computers”, IEEE Transactions on Computers, Volume 43, number 10, pages 1140 to 1150, October 1994, by G Intrater et al., discusses the storing of pre-decoded instructions in a cache. The article “The S-1 Project: Developing High-Performance Digital Computers” by L. Curtis Widdoes, Jr., Lawrence Livermore National Laboratory, 11 Dec. 1979, describes the S1 Mark IIA computer, where a decoded instruction cache expanded the 36-bit instruction word to a 56-bit instruction cache format to reduce instruction decoding time (see also the paper “Livermore S-1 Supercomputer—A Short History” appearing on the website http://www.cs.clemson.edu/~mark/s1.html). Further, the idea of using pre-decoding mechanisms to pre-identify branches and pre-identify instruction boundaries is discussed in the AMD K5 Processor Data sheet, Publication no. 18522E-0, September 1996, Section 4.5, Innovative x86 Instruction Predecoding, page 6, which discusses adding 4 bits per instruction byte to identify start, end, opcode position, and number of Rops (RISC operations) the individual x86 instruction requires for later translation.
Whilst the above-mentioned pre-decoding mechanisms can improve the performance of the processing circuitry, they typically do not significantly alleviate the earlier-mentioned power cost associated with the later decoder circuits used to decode the instructions once they are output from the instruction cache.
The processing circuitry at any point in time will be in a particular processor state, dependent on the instruction set being executed at the time. For example, an ARM processor may be able to execute instructions from any of the ARM, Thumb, ThumbEE or Jazelle instruction sets developed by ARM Limited, United Kingdom, and accordingly at any point in time will be in one of four possible processor states. When pre-decoding instructions, the pre-decoding circuitry will need to assume a speculative processor state when performing the pre-decoding. The chosen speculative processor state would then typically be used for pre-decoding a whole cache line's worth of instructions. However, instructions from different instruction sets may reside on the same cache line, and accordingly such a procedure may give rise to pre-decoded instructions which are corrupted, for example if the speculative processor state is the Thumb state, but the cache line mixes ARM and Thumb instructions. Indeed, it is also possible for an entire cache line to be pre-decoded in the wrong state: if the last instruction in one cache line is a state-changing branch, and the next cache line is already in the pre-decode process, the pre-decoding circuitry will have made a wrong assumption about the speculative processor state when pre-decoding that next cache line.
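The problem described above can be illustrated with a small sketch. It is a simplified model under stated assumptions, not a description of any actual pre-decoding circuitry: only two states are modelled, ARM instructions are taken to be 4 bytes and Thumb instructions 2 bytes (32-bit Thumb encodings are ignored), and the names `predecode_line` and `mispredecoded_offsets` are hypothetical.

```python
from typing import Callable

ARM, THUMB = "ARM", "Thumb"

# Simplifying assumption: fixed widths per state (16-bit Thumb only).
WIDTH = {ARM: 4, THUMB: 2}


def predecode_line(line_bytes: bytes, speculative_state: str):
    """Split a whole cache line into instructions under one assumed state.

    Returns (offset, assumed_state, raw_bytes) tuples; the same speculative
    state is applied to every instruction on the line.
    """
    width = WIDTH[speculative_state]
    return [
        (offset, speculative_state, line_bytes[offset:offset + width])
        for offset in range(0, len(line_bytes), width)
    ]


def mispredecoded_offsets(entries, actual_state_at: Callable[[int], str]):
    """Offsets whose assumed state differs from the state the code really uses."""
    return [off for off, state, _ in entries if actual_state_at(off) != state]


# Hypothetical 16-byte line: Thumb code in bytes 0-7, ARM code in bytes 8-15.
line = bytes(range(16))
actual = lambda off: THUMB if off < 8 else ARM

as_thumb = predecode_line(line, THUMB)
as_arm = predecode_line(line, ARM)
```

Here `mispredecoded_offsets(as_thumb, actual)` reports the entries in the second half of the line and `mispredecoded_offsets(as_arm, actual)` reports those in the first half, so neither choice of speculative state pre-decodes the mixed line correctly.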