The present invention relates to a data processing apparatus and method for pre-decoding instructions to be executed by processing circuitry within the data processing apparatus.
It is typically the case that one or more caches are provided within a data processing apparatus for caching the instructions and data required by processing circuitry within the data processing apparatus. At any particular level in a cache hierarchy, separate instruction and data caches may be provided (often referred to as a Harvard architecture), or alternatively a unified cache may be provided for storing both the instructions and the data (often referred to as a Von Neumann architecture). When instructions are fetched from memory for storing in a cache, some known systems have employed pre-decoding mechanisms for performance-oriented reasons. In accordance with such mechanisms, instructions are pre-decoded prior to being stored in a particular level of the cache (typically the level 1 cache), and in such cases that level of the cache often then stores the instructions in a wider format than the format in which the instructions are stored in main memory, in order to accommodate the additional information produced by the pre-decoding process. To assist in improving performance when the instructions are later decoded and executed, the extra information provided in the pre-decoded instructions as stored in the cache has been used to identify branch instructions, to identify classes of instructions (e.g. load/store instructions, coprocessor instructions, etc.) so as to later assist multi-issue circuitry in dispatching particular instructions to particular execution pipelines, and to identify instruction boundaries in variable length instruction sets.
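By way of illustration only, the widening described above can be sketched as follows. The 16-bit instruction width, the 4-bit side-band, and every opcode pattern and flag assignment below are invented for the purposes of this sketch and are not taken from any particular instruction set or from the mechanisms discussed herein; the sketch merely shows the general shape of computing classification bits once, at cache-fill time, and storing them alongside the instruction in a wider format.

```python
# Hypothetical sketch: widen a 16-bit instruction word to a 20-bit
# pre-decoded cache format by appending 4 side-band bits computed once,
# at cache-fill time. All encodings and flag meanings are invented.

IS_BRANCH = 0x1      # side-band bit 0: instruction is a branch
IS_LOAD_STORE = 0x2  # side-band bit 1: instruction is a load/store
IS_COPROC = 0x4      # side-band bit 2: instruction targets a coprocessor
IS_32BIT = 0x8       # side-band bit 3: first halfword of a 32-bit instruction

def pre_decode(insn16):
    """Compute 4 hypothetical side-band bits for a 16-bit instruction word."""
    sideband = 0
    top5 = (insn16 >> 11) & 0x1F            # classify on the top 5 bits
    if top5 in (0b11100, 0b11010, 0b11011):  # invented branch encodings
        sideband |= IS_BRANCH
    if top5 in (0b01100, 0b01101):           # invented load/store encodings
        sideband |= IS_LOAD_STORE
    if top5 in (0b11101, 0b11110):           # invented 32-bit prefix encodings
        sideband |= IS_32BIT
    # The wider cache format: side-band bits above the instruction bits.
    return (sideband << 16) | insn16
```

Downstream circuitry can then steer or align instructions by testing the side-band bits alone, without repeating the classification work on every fetch from the cache.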
For example, the article “Performance Evaluation Of A Decoded Instruction Cache For Variable Instruction Length Computers”, IEEE Transactions on Computers, Volume 43, number 10, pages 1140 to 1150, October 1994, by G. Intrater et al., discusses the storing of pre-decoded instructions in a cache. The article “The S-1 Project: Developing High-Performance Digital Computers” by L. Curtis Widdoes, Jr., Lawrence Livermore National Laboratory, 11 Dec. 1979, describes the S1 Mark IIA computer, where a decoded instruction cache expanded the 36-bit instruction word to a 56-bit instruction cache format to reduce instruction decoding time (see also the paper “Livermore S-1 Supercomputer—A Short History” appearing on the website http://www.cs.clemson.edu/~mark/s1.html). Further, the idea of using pre-decoding mechanisms to pre-identify branches and pre-identify instruction boundaries is discussed in the AMD K5 Processor Data sheet, Publication no. 18522E-0, September 1996, Section 4.5, Innovative x86 Instruction Pre-decoding, page 6, which discusses adding 4 bits per instruction byte to identify the start, the end, the opcode position, and the number of Rops (RISC operations) the individual x86 instruction requires for later translation.
Commonly owned US patent publication US 2010/0017580 describes an apparatus including pre-decoding circuitry for performing a pre-decoding operation on instructions fetched from memory in order to generate corresponding pre-decoded instructions for storing in a cache. In one embodiment, the pre-decoding circuitry is provided in association with a level 1 instruction cache. However, in an alternative embodiment, the pre-decoding circuitry is instead associated with a unified cache storing both instructions and data, for example a level 2 cache.
U.S. Pat. Nos. 6,092,182 and 6,804,799 describe a pre-decode unit associated with a level 1 instruction cache, which is configured to receive instruction bytes from a level 2 cache and to generate corresponding pre-decode information for storing in the level 1 cache in association with the instruction bytes. When such instruction bytes and corresponding pre-decode information are subsequently evicted from the level 1 instruction cache, the level 1 instruction cache is configured to output at least part of the corresponding pre-decode information for storage in the level 2 cache. The pre-decode information may be stored in storage locations within the level 2 cache that are configured to store parity and/or error checking and correction information.
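The eviction path just described can be sketched in simplified form. Every structure, field name, and width below is invented for illustration and should not be read as the arrangement of the cited patents; the sketch only shows the general idea of recycling an L2 side-band field (which would otherwise hold parity information) to preserve pre-decode bits across an eviction, so that a later refill need not repeat the pre-decoding operation.

```python
# Hypothetical sketch of the eviction/refill path. All classes and
# fields are invented for illustration.

class L2Line:
    """An L2 cache line with a side-band field that normally holds parity."""
    def __init__(self, data):
        self.data = data              # raw instruction bytes
        self.side = None              # parity OR recycled pre-decode bits
        self.side_is_predecode = False

class L1Line:
    """An L1 instruction cache line holding pre-decode bits alongside data."""
    def __init__(self, data, predecode):
        self.data = data
        self.predecode = predecode    # bits produced at L1 fill time

def recompute_predecode(data):
    """Placeholder for the full pre-decoding operation (details elided)."""
    return 0

def evict_to_l2(l1_line, l2_line):
    """On L1 eviction, recycle the L2 side-band field for pre-decode bits."""
    l2_line.data = l1_line.data
    l2_line.side = l1_line.predecode
    l2_line.side_is_predecode = True  # record the field's current meaning

def refill_from_l2(l2_line):
    """On a later L1 miss, reuse the saved bits instead of re-pre-decoding."""
    if l2_line.side_is_predecode:
        return L1Line(l2_line.data, l2_line.side)
    return L1Line(l2_line.data, recompute_predecode(l2_line.data))
```

The design choice being illustrated is that the side-band storage already exists in the L2 cache for parity/ECC purposes, so preserving the pre-decode information there avoids both widening the L2 array and repeating the pre-decode work on refill.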
In many implementations, significant power consumption may be associated with the performance of such pre-decoding operations, and furthermore such pre-decoding operations can adversely impact performance. Accordingly, it would be desirable to provide a mechanism for performing such pre-decoding operations that enables a reduction in power consumption and/or an improvement in performance.