In an instruction processing apparatus typified by a CPU, each instruction included in a program is processed through a series of stages, such as instruction fetch (fetch), instruction decode (decode), instruction execution and execution result commit (commit). Conventionally, there has been known a technique called pipelining, in which CPU resources are allocated to each instruction in a time-sharing manner. With a pipeline, stages of different instructions are processed in parallel: for example, while one instruction is executed, the next instruction is decoded and the instruction after that is fetched. In addition, instruction execution itself is pipelined, so that the processing speed of the instruction processing apparatus is enhanced.
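As a minimal sketch of the overlap described above, the following assumes an idealized four-stage pipeline with one instruction entering per cycle and no stalls. The function names and the one-cycle-per-stage timing are illustrative assumptions and are not taken from any particular apparatus.

```python
STAGES = ["fetch", "decode", "execute", "commit"]


def pipeline_schedule(num_instructions):
    """Map each cycle to the (instruction, stage) pairs active in that cycle.

    Assumption: instruction i enters the fetch stage in cycle i and
    advances one stage per cycle without stalling.
    """
    schedule = {}
    for i in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            schedule.setdefault(i + offset, []).append((i, stage))
    return schedule


def total_cycles(num_instructions, pipelined=True):
    """Cycles needed to finish all instructions, with or without overlap."""
    if num_instructions == 0:
        return 0
    if pipelined:
        return len(STAGES) + num_instructions - 1
    return len(STAGES) * num_instructions


if __name__ == "__main__":
    # In cycle 2, instruction 0 executes while 1 is decoded and 2 is fetched.
    print(pipeline_schedule(3)[2])
    print(total_cycles(3), total_cycles(3, pipelined=False))
```

Under these assumptions, three instructions finish in 6 cycles with the pipeline, against 12 cycles when each instruction passes through all four stages before the next one starts.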
In recent years, superscalar systems, in which plural pipelines of the kind described above are provided to further enhance the speed, have been widely used. In addition, out-of-order execution has been applied, in which an instruction whose execution condition is satisfied may be executed without following the program order.
FIG. 1 is a diagram explaining a concept of out-of-order execution in a superscalar system.
FIG. 1 illustrates an example in which four instructions included in a program are processed by out-of-order execution. Each of the instructions is processed through four stages of fetch (step S501), decode (step S502), execution (step S503) and commit (step S504). Fetch (step S501), decode (step S502) and commit (step S504) are executed in a predetermined order for the four instructions (in-order), whereas execution (step S503) is performed first for the instruction whose execution condition is satisfied first, regardless of the program order (out-of-order).
To achieve the out-of-order execution illustrated in FIG. 1, data to be used by the plural instructions executed in parallel in the execution processing (step S503) is required to be held without being overwritten. In some cases, an instruction processing apparatus employing out-of-order execution in a superscalar system switches among plural data storage places connected in a ring form (hereinafter, each of the storage places is referred to as a register window).
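The ordering constraints of FIG. 1 may be sketched as follows. This is a simplified model under stated assumptions: the input `ready_cycles` (the cycle at which each instruction's execution condition becomes satisfied) is hypothetical, execution is assumed to take one cycle, and enough execution resources are assumed to be available for every ready instruction.

```python
def simulate(ready_cycles):
    """Model out-of-order execution with in-order commit.

    ready_cycles[i] is the (hypothetical) cycle at which instruction i,
    numbered in program order, has its execution condition satisfied.
    Returns (execute_order, commit_cycles).
    """
    n = len(ready_cycles)
    # Execution starts as soon as the condition is satisfied (out-of-order);
    # ties are broken by program order.
    execute_order = sorted(range(n), key=lambda i: (ready_cycles[i], i))
    exec_done = [r + 1 for r in ready_cycles]  # assumed 1-cycle execution

    # Commit follows program order: an instruction cannot commit before
    # it has finished executing or before its predecessor has committed.
    commit_cycles = []
    previous_commit = 0
    for i in range(n):
        cycle = max(exec_done[i], previous_commit)
        commit_cycles.append(cycle)
        previous_commit = cycle
    return execute_order, commit_cycles


if __name__ == "__main__":
    order, commits = simulate([5, 2, 3, 4])
    print(order)    # instructions 1, 2, 3 execute before instruction 0
    print(commits)  # yet none of them commits before instruction 0
```

In this example the first instruction in program order is the last to become ready, so the other three execute ahead of it but wait for it at the commit stage.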
FIG. 2 is a conceptual diagram of a register file including plural register windows.
The register file A includes eight register windows W0-W7. The register window currently in use (in FIG. 2, the register window W1) is pointed to by a CWP (Current Window Pointer). In the example of FIG. 2, each of the register windows W0-W7 includes 32 registers. Eight of the 32 registers are used as a global area common to the register windows W0-W7, and the remaining 24 registers are divided into three areas, an in area, a local area and an out area, of eight registers each.
For example, in the register window W1 surrounded by a bold line, an in1 area on the left end overlaps an out0 area of the register window W0, which immediately precedes the register window W1, so that the in1 area also functions as the out0 area. In addition, a local1 area in the center does not overlap any other window and is occupied by the register window W1 alone. An out1 area on the right end overlaps an in2 area of the register window W2, which immediately follows the register window W1, and is used in common by the register windows W1 and W2.
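The overlap between adjacent windows may be sketched as a mapping from a (window, area, register) triple to a physical storage slot. The layout below is an assumption chosen for illustration (each window contributes 16 unique slots for its in and local areas, and its out area aliases the in area of the next window, wrapping around the ring); the actual physical arrangement of the register file A is not specified in the text.

```python
NUM_WINDOWS = 8
REGS_PER_AREA = 8
SLOTS_PER_WINDOW = 2 * REGS_PER_AREA  # in + local are unique per window


def physical_slot(window, area, reg):
    """Return a (bank, index) pair identifying the physical register.

    Assumed layout: window w owns slots [16w, 16w+8) for its in area and
    [16w+8, 16w+16) for its local area; its out area is the in area of
    window (w + 1) mod 8, which realizes the overlap of FIG. 2.
    """
    if not 0 <= reg < REGS_PER_AREA:
        raise ValueError("register index out of range")
    w = window % NUM_WINDOWS
    if area == "global":
        return ("global", reg)  # shared by all windows
    if area == "in":
        return ("windowed", w * SLOTS_PER_WINDOW + reg)
    if area == "local":
        return ("windowed", w * SLOTS_PER_WINDOW + REGS_PER_AREA + reg)
    if area == "out":
        next_w = (w + 1) % NUM_WINDOWS
        return ("windowed", next_w * SLOTS_PER_WINDOW + reg)
    raise ValueError("unknown area: " + area)


if __name__ == "__main__":
    # in1 and out0 name the same physical registers.
    print(physical_slot(1, "in", 3) == physical_slot(0, "out", 3))
    # out1 and in2 name the same physical registers.
    print(physical_slot(1, "out", 0) == physical_slot(2, "in", 0))
```

With this layout, eight windows of 24 windowed registers each need only 128 physical windowed registers plus the 8 global registers, because every out area is shared with the following window's in area.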
When a SAVE instruction to increment the CWP, a RESTORE instruction to decrement the CWP, or a CWP update instruction to move the CWP to an arbitrary position is issued, or when a trap occurs, the CWP moves accordingly so that the register window is switched.
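The movement of the CWP may be sketched as modular arithmetic over the ring of eight windows. The operation name "SET_CWP" standing for the CWP update instruction is a hypothetical label, and traps are omitted from this sketch.

```python
NUM_WINDOWS = 8


def next_cwp(cwp, op, target=None):
    """Return the CWP value after the given operation.

    "SAVE" increments and "RESTORE" decrements the pointer, wrapping
    around the ring; "SET_CWP" (hypothetical name) moves it to `target`.
    """
    if op == "SAVE":
        return (cwp + 1) % NUM_WINDOWS
    if op == "RESTORE":
        return (cwp - 1) % NUM_WINDOWS
    if op == "SET_CWP":
        if target is None:
            raise ValueError("SET_CWP requires a target window")
        return target % NUM_WINDOWS
    raise ValueError("unknown operation: " + op)


if __name__ == "__main__":
    print(next_cwp(1, "SAVE"))      # W1 -> W2
    print(next_cwp(7, "SAVE"))      # W7 wraps around to W0
    print(next_cwp(0, "RESTORE"))   # W0 wraps around to W7
```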
Here, because the register file A has a large number of registers, it takes much processing time to locate the register window pointed to by the CWP every time an instruction is received. Therefore, separately from the register file A serving as a master file having all the registers, a work register into which a part of the register file A is copied is provided, so that the processing time required for locating a register is reduced. Patent document 1 describes an instruction processing apparatus having a general purpose register (GPR: General Purpose Register) that includes, in addition to a master register file (MRF: Master Register File) storing the original data and a current register file (CWR: Current Window Register) storing a copy of the current register window pointed to by the CWP, a replace buffer (CRB: Current Window Replace Buffer) storing a copy of a register window adjacent to the current register window.
FIG. 3 is a conceptual diagram of a general purpose register including a master register file, a current register and a replace buffer, and FIG. 4 is a diagram illustrating data transfer timing in a case where a SAVE instruction is issued.
As illustrated in FIG. 3, the general purpose register (GPR) 1 includes a master register file (MRF) 2, a replace buffer (CRB) 3 and a current register file (CWR) 4. The master register file (MRF) 2 and the replace buffer (CRB) 3 are connected by a data bus 5, and the replace buffer (CRB) 3 and the current register (CWR) 4 are connected by a data bus 6, each of the data buses serving as a transfer path for data. The master register file (MRF) 2 holds the original data. Data of the register window pointed to by a current pointer (CWP) 7 is copied into the current register (CWR) 4. Data of a register window adjacent to that register window is copied into the replace buffer (CRB) 3.
As illustrated in FIG. 4, when the SAVE instruction to increment the current pointer (CWP) 7 is issued, the SAVE instruction is decoded, and data stored in the (n+1)th register window, which immediately follows the nth register window pointed to by the current pointer (CWP) 7, is copied into the replace buffer (CRB) 3. When all preceding instructions have been processed, the data copied into the replace buffer (CRB) 3 is transferred to the current register (CWR) 4.
In a case where, for example, a further SAVE instruction is issued consecutively before the data is transferred to the current register (CWR) 4, the transfer of data from the master register file (MRF) 2 to the replace buffer (CRB) 3 for the second SAVE instruction is inhibited (stalled), because the replace buffer (CRB) 3 is in use by the first SAVE instruction.
In the instruction processing apparatus, instructions preceding the SAVE instruction are processed using the data stored in the current register (CWR) 4, while the data stored in the replace buffer (CRB) 3 is used to process instructions following the SAVE instruction by out-of-order execution. When the processing of the preceding instructions is completed and committed, the current register (CWR) 4 is released, the current pointer (CWP) 7 is incremented by one according to the first SAVE instruction, and the data stored in the replace buffer (CRB) 3 is transferred to the current register (CWR) 4, so that the data of the (n+1)th register window becomes available for use.
In addition, when the transfer of the data from the replace buffer (CRB) 3 to the current register (CWR) 4 is completed, the replace buffer (CRB) 3 is released, the second SAVE instruction is decoded, and data stored in the (n+2)th register window, which immediately follows the (n+1)th register window now pointed to by the current pointer (CWP) 7, is copied into the replace buffer (CRB) 3.
As described above, by updating the replace buffer (CRB) 3 when a SAVE instruction or a RESTORE instruction is decoded, and by updating the current register (CWR) 4 when that instruction is committed, the data to be used by instructions following the SAVE instruction or the like is prepared in advance for out-of-order processing, so that the processing time is reduced.
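The decode-time and commit-time updates described above may be sketched as a small state model. The class and method names are hypothetical, each register window is represented by a placeholder string rather than actual register contents, and the write-back of modified data to the master register file is omitted; only the SAVE case is modeled.

```python
class GeneralPurposeRegisterModel:
    """Toy model of the MRF / CRB / CWR arrangement of FIG. 3 (SAVE only)."""

    def __init__(self, num_windows=8):
        # MRF: original data of every window (placeholder strings here).
        self.mrf = ["window%d" % w for w in range(num_windows)]
        self.cwp = 0
        self.cwr = self.mrf[self.cwp]  # copy of the current window
        self.crb = None                # None means the CRB is free
        self.crb_window = None

    def decode_save(self):
        """At decode of SAVE: copy window CWP+1 from the MRF into the CRB.

        Returns False (stall) if the CRB is still in use by an earlier
        SAVE that has not yet committed.
        """
        if self.crb is not None:
            return False
        self.crb_window = (self.cwp + 1) % len(self.mrf)
        self.crb = self.mrf[self.crb_window]
        return True

    def commit_save(self):
        """At commit of SAVE: advance the CWP, move CRB data to the CWR."""
        if self.crb is None:
            raise RuntimeError("no decoded SAVE is pending")
        self.cwp = self.crb_window
        self.cwr = self.crb
        self.crb = None  # CRB released for the next SAVE
        self.crb_window = None


if __name__ == "__main__":
    gpr = GeneralPurposeRegisterModel()
    print(gpr.decode_save())  # first SAVE loads the CRB
    print(gpr.decode_save())  # second SAVE stalls: CRB in use
    gpr.commit_save()         # first SAVE commits, CRB released
    print(gpr.decode_save())  # second SAVE can now proceed
```

The single replace buffer is the reason consecutive SAVE instructions serialize in this model: the second decode succeeds only after the first SAVE has committed.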
Conventionally, multitasking has been commonly utilized, in which CPU resources are allocated in a time-sharing manner to each of plural applications so that the plural applications are executed in parallel, such as using an application for spreadsheet calculation while using an application for word processing. In addition, the instruction processing apparatus is provided with plural kinds of computing units, and when an instruction is executed, a computing unit suited to the contents of the instruction to be executed is used. However, there are few occasions on which all kinds of the computing units are used simultaneously, and some computing units may be left unused. Therefore, there is considerable room for improvement in the operation rate of the computing units.
Accordingly, as a technique to enhance the operation rate of the computing units, simultaneous multithreading (SMT: Simultaneous Multi Threading) has been proposed, in which a computing unit not in use by a certain thread is allocated to another thread so that instructions of plural threads are processed simultaneously in parallel.
FIG. 5 is a diagram conceptually illustrating an example of an SMT function.
FIG. 5 illustrates how instructions belonging to two threads, a thread A and a thread B, are executed by the SMT function. Each of four cells aligned vertically in FIG. 5 represents a computing unit that performs instruction execution in the instruction processing apparatus. "A" or "B" in each of the cells represents the thread to which the instruction executed by the computing unit corresponding to the cell belongs.
In addition, clock cycles in the instruction processing apparatus are illustrated along the horizontal axis. In the example of FIG. 5, in a first cycle (step S511), instructions of the thread A are executed by the two computing units in the upper two cells, and instructions of the thread B are executed by the two computing units in the lower two cells. In a second cycle (step S512), instructions of the thread A are executed by the two computing units in the uppermost and lowermost cells, and instructions of the thread B are executed by the two computing units in the middle cells. In a third cycle (step S513), instructions of the thread A are executed by the three computing units in the upper three cells, and an instruction of the thread B is executed by the computing unit in the lowermost cell.
As described above, with the SMT function, instructions of plural threads are executed simultaneously and in parallel in each cycle, and an instruction may be executed regardless of the program order if the condition to execute the instruction is satisfied (out-of-order).
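The allocation of FIG. 5 may be tabulated as follows to show its effect on the operation rate of the computing units. The table transcribes the three cycles described above (cells listed from top to bottom); the comparison against thread A running alone is an illustrative assumption in which the units used by thread B would otherwise stay idle.

```python
from collections import Counter
from fractions import Fraction

# Rows are cycles S511, S512, S513; entries are the thread using each of
# the four computing units, listed from the uppermost to the lowermost cell.
FIG5_SCHEDULE = [
    ["A", "A", "B", "B"],
    ["A", "B", "B", "A"],
    ["A", "A", "A", "B"],
]


def units_per_thread(schedule):
    """Count, for each cycle, how many computing units each thread uses."""
    return [Counter(cycle) for cycle in schedule]


def operation_rate(schedule, threads):
    """Fraction of unit-cycles occupied by the given set of threads."""
    used = sum(1 for cycle in schedule for t in cycle if t in threads)
    total = sum(len(cycle) for cycle in schedule)
    return Fraction(used, total)


if __name__ == "__main__":
    for counts in units_per_thread(FIG5_SCHEDULE):
        print(dict(counts))
    print(operation_rate(FIG5_SCHEDULE, {"A"}))       # thread A alone
    print(operation_rate(FIG5_SCHEDULE, {"A", "B"}))  # both threads
```

Under this assumption, thread A by itself occupies 7 of the 12 unit-cycles, and filling the remaining 5 with instructions of thread B brings the operation rate to 1.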
FIG. 6 is a diagram illustrating a concept of out-of-order execution by an SMT function.
Among plural instructions that belong to the same thread, instruction fetch, instruction decode and commit are required to be executed according to the program order. In contrast, among plural instructions belonging to different threads, an instruction whose condition is satisfied may be processed at any stage regardless of the order in which the instructions are issued. In addition, for instructions belonging to the same thread, a stall may occur in which the commit of an instruction waits until execution processing of another instruction is completed. With the SMT function, however, an instruction belonging to a different thread is not required to wait for such a commit order and the like, so that the processing time for an application as a whole is further reduced.
Patent document 1: Japanese Laid-open Patent Publication No. 2007-87108
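The per-thread commit ordering described with reference to FIG. 6 may be sketched as follows. The input format (a list of thread name and execution-completion cycle pairs, given in issue order) and the function name are hypothetical, and the model ignores any limit on the number of commits per cycle.

```python
def commit_cycles(instructions):
    """Compute the commit cycle of each instruction under SMT.

    `instructions` is a list of (thread, exec_done_cycle) pairs in issue
    order. Commit is in program order within a thread only: an instruction
    waits for earlier instructions of its own thread, never for
    instructions of another thread.
    """
    last_commit = {}  # latest commit cycle seen so far, per thread
    result = []
    for thread, exec_done in instructions:
        cycle = max(exec_done, last_commit.get(thread, 0))
        last_commit[thread] = cycle
        result.append(cycle)
    return result


if __name__ == "__main__":
    program = [("A", 9), ("A", 3), ("B", 4), ("B", 5)]
    # The slow first instruction of thread A delays the second A
    # instruction, while thread B commits without waiting.
    print(commit_cycles(program))
```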
Here, in order to achieve out-of-order execution by the SMT function as illustrated in FIG. 6, it is conceivable to prepare CPU resources, such as a decoder illustrated in FIG. 1 and the general purpose register (GPR) 1 and data buses 5, 6 illustrated in FIG. 3, in a number equal to the number of threads. However, the number of threads has been increasing in recent years, and adding CPU resources accordingly increases the size and cost of the instruction processing apparatus.
In view of the foregoing, it is an object of the present invention to provide an instruction processing apparatus in which simultaneous multithreading is achieved while increases in the size and cost of the apparatus are suppressed.