1. Field
An aspect of the present invention relates to a processing unit.
2. Description of the Related Art
In a numerical calculation program that performs floating-point arithmetic, increasing the number of registers within a processing unit enables compiler optimizations such as loop unrolling and software pipelining, so that performance can be improved.
On the other hand, in a processing unit with a RISC instruction set architecture, for example the SPARC architecture, the instruction length is fixed, for example to 32 bits, and a single instruction specifies a single process, so that the information that can be held by one instruction is limited.
Under such limitations, various proposals have been made to add new instructions and to increase the number of registers.
According to Patent Document 1 and Patent Document 2, register designation information designating a register is divided into two portions. Then, the two portions are arranged on separate basic units of instruction code and one instruction code is made omissible. If an omissible instruction code is omitted, a register selection operation is performed by implicitly assuming predetermined register designation information.
According to Patent Document 3 and Patent Document 4, instructions are extended by combining a plurality of instruction codes for transfer instructions between memory and register and operation instructions between registers to enable direct operations on data in a memory while maintaining compatibility with existing CPUs.
Further, according to Patent Document 5 and Patent Document 6, a user can add commands so that a processor can easily be redesigned.
[Patent Document 1] Japanese Patent Application Laid-Open No. 2001-202243
[Patent Document 2] Japanese Patent Application Laid-Open No. 2006-313561
[Patent Document 3] Japanese Patent Application Laid-Open No. 2005-353105
[Patent Document 4] Japanese Patent Application Laid-Open No. 2006-284962
[Patent Document 5] Japanese Patent Application Laid-Open No. 2003-518280
[Patent Document 6] Japanese Patent Application Laid-Open No. 2007-250010
Assume that instructions have an operation code length of 32 bits. For a PC (Program Counter) indicating the logical address of the instruction being processed and an NPC (Next Program Counter) indicating the logical address of the instruction to be executed next, a logical specification such as NPC=PC+4 is defined, as long as the instruction pointed to by the PC is not a branch instruction or a trap instruction. Thus, if the instruction length of one instruction is extended to 64 bits, existing software will not operate.
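The sequencing rule above can be illustrated with a minimal sketch. The function names and the addresses used here are hypothetical and chosen only for illustration; the sketch assumes a byte-addressed instruction stream with a fixed 4-byte instruction length.

```python
from typing import List, Optional

INSTRUCTION_BYTES = 4  # fixed 32-bit instruction length assumed in the text


def next_pc(pc: int, branch_target: Optional[int] = None) -> int:
    """Return NPC: the branch/trap target if one is given, otherwise PC + 4."""
    if branch_target is not None:
        return branch_target
    return pc + INSTRUCTION_BYTES


def fetch_addresses(start: int, count: int) -> List[int]:
    """Addresses fetched for `count` sequential (non-branch) instructions."""
    addrs = []
    pc = start
    for _ in range(count):
        addrs.append(pc)
        pc = next_pc(pc)
    return addrs


# Hypothetical layout: a 64-bit instruction placed at 0x100 would occupy
# bytes 0x100..0x107. A unit that steps by 4 fetches 0x104 next, so the
# second half of that instruction would be decoded as a separate instruction.
addresses = fetch_addresses(0x100, 3)
```

Under these assumptions the fetch addresses advance in steps of 4, which is why a 64-bit instruction cannot simply be dropped into a stream that existing hardware and software expect to follow NPC=PC+4.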
Under such limitations, in a floating-point sum-of-product operation, for which a total of four registers (three source registers and one destination register) must be designated, 4×8 bits=32 bits are needed when 8-bit register addresses are designated. This means that all 32 bits of a 32-bit instruction are used for designating the registers, so that no bits remain to hold the operation code identifying the instruction as a sum-of-product operation. Therefore, 32-bit instructions with 8-bit register addresses cannot practically be defined.
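The bit budget described above can be checked with a short calculation. This is only an arithmetic sketch; the function name is hypothetical, and the 5-bit case is included for comparison as the register field width mentioned for the register window system.

```python
INSTRUCTION_BITS = 32  # fixed instruction length assumed in the text


def opcode_bits_left(num_register_fields: int, bits_per_field: int) -> int:
    """Bits remaining for the operation code after the register fields."""
    return INSTRUCTION_BITS - num_register_fields * bits_per_field


# Sum-of-product: three source registers + one destination register.
left_with_8bit_fields = opcode_bits_left(4, 8)  # 32 - 4*8
left_with_5bit_fields = opcode_bits_left(4, 5)  # 32 - 4*5

# Number of registers each field width can address.
registers_8bit = 2 ** 8
registers_5bit = 2 ** 5
```

With 8-bit fields (256 addressable registers) nothing is left for the operation code, whereas 5-bit fields (32 addressable registers) leave 12 bits.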
As a conventional means of handling many registers under limitations of 32-bit length, an instruction set architecture adopting the register window system is known.
Taking the SPARC architecture as an example, a pointer called the CWP (Current Window Pointer) is set by a separate instruction, and subsequent instructions reference registers in the window pointed to by the CWP, for example, the 32 registers that can be designated by the 5-bit register fields of one operation code.
Then, one register window is allocated to one sub-routine containing a plurality of instructions. When other windows are referenced or updated, the CWP is changed and then instructions are executed.
In this system, one instruction can operate only on the 32 registers of the current window. Thus, processing that uses more than 32 registers at the same time cannot be performed, and compiler optimizations such as software pipelining and loop unrolling cannot be applied, so that performance improvement cannot be expected.
Being able to perform one process using arbitrary register data from, for example, 100 or 200 registers at the same time is effective for performance improvement through compiler optimization. However, the number of registers handled in one process cannot be increased beyond a certain number due to limitations of the operation code length in the conventional RISC instruction set architecture, and thus performance improvement of floating-point programs cannot be expected, which has become a major problem.
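The window limitation can be illustrated with a deliberately simplified model. This sketch is not the actual SPARC mapping: it assumes non-overlapping windows of 32 registers and a hypothetical window count, and it ignores the shared global registers and the overlap between adjacent windows.

```python
WINDOW_SIZE = 32  # registers reachable through one 5-bit register field
NUM_WINDOWS = 8   # hypothetical number of windows, for illustration only


def physical_register(cwp: int, reg_field: int) -> int:
    """Map (CWP, 5-bit register field) to an index in the physical file."""
    if not 0 <= reg_field < WINDOW_SIZE:
        raise ValueError("register field must fit in 5 bits")
    return (cwp % NUM_WINDOWS) * WINDOW_SIZE + reg_field


def reachable(cwp: int) -> set:
    """All physical registers one instruction can name under a given CWP."""
    return {physical_register(cwp, r) for r in range(WINDOW_SIZE)}


window0 = reachable(0)
window1 = reachable(1)
total_physical = NUM_WINDOWS * WINDOW_SIZE
```

In this model the physical file holds 256 registers, but an instruction executed under a given CWP can name only 32 of them; reaching any other register requires changing the CWP first.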
Instructions called SIMD (Single Instruction Multiple Data) instructions, which perform a plurality of processes with one instruction, are effective as floating-point instructions.
However, if SIMD instructions such as addition, multiplication, sum-of-product arithmetic, division, and square root arithmetic are newly added to the processing unit, as many instruction operation codes as the number of additional instructions must be allocated. For this purpose, empty operation codes not used for the existing addition, multiplication, and sum-of-product arithmetic must be found and allocated. In an instruction set in which operation codes were allocated without originally assuming SIMD instructions, there may be no empty field. Even if there is one, each SIMD instruction is allocated to whatever operation code happens to be empty, and thus it is difficult to allocate SIMD instructions to operation codes in a logical, well-ordered manner.
A high-performance processor in recent years has a cache memory inside the processor chip or near the processor. This is because the difference between the processing speed of memory and that of the processor has become large, and data in the memory is therefore registered in the cache memory, which can be read and written at high speed.
The cache memory has a set-associative structure, and cache registration/discharge is generally controlled by LRU (Least Recently Used); thus, registered data may be discharged from the cache by hardware regardless of the intent of the program. In that case, depending on the circumstances, data that may be reused is discharged when data that will not be reused is registered, even if other data that will not be reused remains in the cache, leading to performance degradation.
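The discharge behavior described above can be reproduced with a minimal sketch of a single set of a set-associative cache under LRU replacement. The class name, the two-way associativity, and the access traces are assumptions made for illustration; "A" stands for data that will be reused and "S1", "S2" for data that is touched only once.

```python
from collections import OrderedDict


class LruSet:
    """One set of a set-associative cache with LRU replacement."""

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()  # least recently used entry first

    def access(self, tag: str) -> bool:
        """Return True on a hit, False on a miss (registering the line)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # discharge the LRU line
        self.lines[tag] = True
        return False


# Reused data "A" survives when only one single-use line intervenes.
short_set = LruSet(ways=2)
short_hits = [short_set.access(t) for t in ["A", "S1", "A"]]

# Two single-use lines push "A" out before it is reused.
polluted_set = LruSet(ways=2)
polluted_hits = [polluted_set.access(t) for t in ["A", "S1", "S2", "A"]]
```

In the second trace the final access to "A" misses because registering "S2" discharged it, even though the set still held "S1", which is never reused. LRU has no way to know which lines the program intends to reuse.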
Regarding processing by a SIMD instruction handling a plurality of pieces of data with one instruction as described above, processing by a SIMD instruction handling two pieces of double precision floating-point data, for example, is generally realized by doubling the data width, that is, the number of bits of one register, to 128 bits. However, not all programs can make use of SIMD instructions, and in that case the resources of the extended data width are wasted. Moreover, the first-half data and the second-half data in the 128 bits cannot be handled separately, imposing restrictions on the programming of software.
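The 128-bit arrangement described above can be modeled as follows. This is only a software illustration using Python's standard struct module, not a description of any particular hardware; the function names and the little-endian lane order are assumptions.

```python
import struct


def pack128(lower: float, upper: float) -> bytes:
    """Model one 128-bit register holding two double precision values."""
    return struct.pack("<2d", lower, upper)


def simd_add(a: bytes, b: bytes) -> bytes:
    """Add both 64-bit halves of two 128-bit registers in one operation."""
    a0, a1 = struct.unpack("<2d", a)
    b0, b1 = struct.unpack("<2d", b)
    return struct.pack("<2d", a0 + b0, a1 + b1)


x = pack128(1.0, 2.0)
y = pack128(10.0, 20.0)
result = struct.unpack("<2d", simd_add(x, y))

# A scalar program uses only one half; the other 64 bits carry no useful
# data but still occupy the widened register.
scalar_only = pack128(3.5, 0.0)
register_bits = len(scalar_only) * 8
```

In this model both halves always travel together through one register name, which mirrors the restriction that the first-half and second-half data cannot be handled separately.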