1. Field of the Invention
The present invention relates to a data processing apparatus and method for pre-decoding instructions, whereafter the pre-decoded instructions are placed in a cache for access by processing circuitry within the data processing apparatus.
2. Description of the Prior Art
In a typical data processing apparatus, significant power is consumed in decoding instructions prior to execution within the execution pipelines of the processing circuitry. This issue can become particularly problematic in processing circuitry that supports multiple instruction sets, since often multiple separate decoders will need to be provided for decoding instructions from the various instruction sets. By way of example, in some implementations approximately 15% of the processor power may be consumed by the instruction decoders.
It is typically the case that one or more caches are provided within the data processing apparatus for caching the instructions and data required by the processing circuitry. At any particular level in a cache hierarchy, separate instruction and data caches may be provided (often referred to as a Harvard architecture), or alternatively a unified cache may be provided for storing the instructions and data (often referred to as a Von Neumann architecture). When instructions are fetched from memory for storing in a cache, some known systems have employed pre-decoding mechanisms for performance-orientated reasons. In accordance with such mechanisms, instructions are pre-decoded prior to storing in the cache, and in such cases the cache often then stores instructions in a wider format than the instructions stored in main memory, to accommodate the additional information produced by the pre-decoding process. To assist in improving performance when the instructions are later decoded and executed, the extra information provided in the pre-decoded instructions as stored in the cache has been used to identify branch instructions, to identify classes of instructions (e.g. load/store instructions, coprocessor instructions, etc.) so as to later assist multi-issue circuitry in dispatching particular instructions to particular execution pipelines, and to identify instruction boundaries in variable length instruction sets.
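By way of illustration only, the widening of instructions by such a pre-decoding mechanism can be sketched as follows. The encoding used here is a hypothetical toy encoding and is not taken from any real instruction set: the top four bits of a 32-bit instruction word are assumed to select the instruction class, and the three class bits, the 35-bit entry width and all function names are assumptions made for the purposes of the sketch.

```python
# Hypothetical class tags stored alongside each instruction in the cache.
CLASS_OTHER = 0b000
CLASS_BRANCH = 0b001
CLASS_LOAD_STORE = 0b010
CLASS_COPROCESSOR = 0b100

# Hypothetical toy encoding: top 4 bits of the 32-bit word select the class.
TOY_OPCODE_CLASSES = {
    0xA: CLASS_BRANCH,
    0x4: CLASS_LOAD_STORE,
    0xE: CLASS_COPROCESSOR,
}


def predecode(word):
    """Return a 35-bit cache entry: 3 class bits placed above the 32-bit word."""
    if not 0 <= word < (1 << 32):
        raise ValueError("instruction word must fit in 32 bits")
    tag = TOY_OPCODE_CLASSES.get(word >> 28, CLASS_OTHER)
    return (tag << 32) | word


def predecode_line(words):
    """Pre-decode every instruction word fetched for one cache line."""
    return [predecode(w) for w in words]


def is_branch(entry):
    """Test the branch tag without re-examining the original opcode bits."""
    return bool((entry >> 32) & CLASS_BRANCH)
```

In this sketch the later pipeline stages consult only the tag bits, which is the sense in which the stored entry is wider than the instruction held in main memory.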
For example, the article “Performance Evaluation Of A Decoded Instruction Cache For Variable Instruction Length Computers”, IEEE Transactions on Computers, Volume 43, number 10, pages 1140 to 1150, October 1994, by G Intrater et al., discusses the storing of pre-decoded instructions in a cache. The article “The S-1 Project: Developing High-Performance Digital Computers” by L. Curtis Widdoes, Jr., Lawrence Livermore National Laboratory, 11 Dec. 1979, describes the S1 Mark IIA computer, where a decoded instruction cache expanded the 36-bit instruction word to a 56-bit instruction cache format to reduce instruction decoding time (see also the paper “Livermore S-1 Supercomputer—A Short History” appearing on the website http://www.cs.clemson.edu/~mark/s1.html). Further, the idea of using pre-decoding mechanisms to pre-identify branches and pre-identify instruction boundaries is discussed in the AMD K5 Processor Data sheet, Publication no. 18522E-0, September 1996, Section 4.5, Innovative x86 Instruction Predecoding, page 6, which discusses adding 4 bits per instruction byte to identify start, end, opcode position, and number of Rops (RISC operations) the individual x86 instruction requires for later translation.
Whilst the above-mentioned pre-decoding mechanisms can improve the performance of the processing circuitry, they do not typically significantly alleviate the earlier-mentioned power cost associated with the decoder circuits that are later used to decode the instructions once they are output from the instruction cache.
When instructions are fetched from memory, a plurality of instructions are typically fetched at a time, sufficient to fill a cache line within the cache, and when employing pre-decoding techniques, a cache line's worth of instructions will hence be passed through the pre-decoding circuitry prior to storing the pre-decoded instructions in the cache. Some instruction sets do not require the instructions to be aligned with cache line boundaries (for example the Thumb 32 instruction set developed by ARM Limited, United Kingdom), and accordingly situations can arise where a particular instruction crosses a cache line boundary. In such instances, the instruction can be considered to comprise a plurality of instruction portions, and it will be appreciated that, if that instruction crosses a cache line boundary, the pre-decoding circuitry may not have access to all of the instruction portions of that instruction when it performs the pre-decoding operation. In such situations, the pre-decoding operation will be performed on only part of that instruction, and accordingly will be incomplete. This can lead to the pre-decoded instruction being corrupted, which could lead to incorrect operation of the processing circuitry if the instruction is subsequently read from the instruction cache.
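By way of illustration only, the situation in which an instruction crosses a cache line boundary can be sketched as follows. The sketch assumes a variable length instruction set built from 16-bit halfwords, in which a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first portion of a 32-bit instruction; this rule, the function names and the flag indicating that a line begins part-way through an instruction are all assumptions made for the purposes of the example.

```python
def is_first_half_of_32bit(halfword):
    """Assumed rule: these top-five-bit patterns mark the first half of a 32-bit instruction."""
    return (halfword >> 11) in (0b11101, 0b11110, 0b11111)


def scan_line(halfwords, starts_mid_instruction=False):
    """Walk one cache line's worth of halfwords.

    Returns (complete, crosses_end) where:
      complete    -- list of (index, length_in_halfwords) for each instruction
                     lying wholly within this line
      crosses_end -- True if the final halfword is the first portion of a
                     32-bit instruction whose second portion is in the next line
    """
    complete = []
    # Skip a leading halfword that belongs to an instruction begun in the previous line.
    i = 1 if starts_mid_instruction else 0
    n = len(halfwords)
    while i < n:
        if is_first_half_of_32bit(halfwords[i]):
            if i + 1 >= n:
                # Second portion is not available to the pre-decoder.
                return complete, True
            complete.append((i, 2))
            i += 2
        else:
            complete.append((i, 1))
            i += 1
    return complete, False


# A four-halfword line whose last halfword begins a 32-bit instruction.
line = [0x4608, 0xF000, 0xF801, 0xE92D]
instructions, crosses = scan_line(line)
```

Here the final halfword of the line is recognised as the first portion of a 32-bit instruction, yet its second portion is absent, so any pre-decoding performed on it at this point would necessarily be incomplete.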