1. Field of the Invention
The present invention relates to a multithread controller and a control method for effectively switching a plurality of threads in a multithread processor for executing a plurality of threads.
2. Description of the Related Art
In recent years, in addition to the CISC (Complex Instruction Set Computer) architecture, which executes complicated processes with one instruction, the RISC (Reduced Instruction Set Computer) architecture, which simplifies the processes executable in one instruction, and the VLIW (Very Long Instruction Word) architecture, in which software combines a plurality of simultaneously executable instructions into one longer instruction, have become known as typical computer architectures.
Moreover, the processing methods used in the central processing unit (CPU) of a computer to realize these architectures may be roughly classified into two types: the in-order execution type and the out-of-order execution type. The in-order type processes the instruction stream sequentially along the program sequence, while the out-of-order type can execute an instruction ahead of the preceding instructions, regardless of the program sequence, when the instructions have no mutual dependence.
In recent years, attention has been paid to multithread processor systems, in which a plurality of threads are executed in parallel in a processor that is physically composed of one device, in addition to single thread processing, in which one processor executes one program (thread).
In general, a CPU has, in addition to the registers and the status register (CPU status register) that can be observed from software, resources for executing addition, subtraction, multiplication and division, a loading process to read memory data into a register, and a storing process to write data of a register into a memory.
A multithread processor has a plurality of resources to execute a plurality of instructions with a plurality of programs while executing, only in one CPU, individual programs by multiplexing the registers which may be observed with software.
Systems for realizing the multithread process described above include the fine grained multithreading system and the simultaneous multithreading (SMT) system (refer to FIG. 1), which execute a plurality of threads simultaneously, and the coarse grained multithreading system and the vertical multithreading (VMT: time division type multithreading) system (refer to FIG. 2), in which a plurality of threads are not executed simultaneously and execution is switched to another thread when an event such as a cache miss occurs (refer to Japanese publication JP-A No. 2002-163121).
FIG. 1 is a diagram for explaining the SMT system, while FIG. 2 is a diagram for explaining the VMT system.
The VMT system is intended to hide the latency of instruction processes in which a cache miss, which takes a long time to process, is generated. When a cache miss is detected, while the cache control unit (not illustrated) executes the process of bringing the data from the memory to the cache, execution is switched to another thread so that the executing unit and the control unit (not illustrated) can perform processes other than the memory access. In this VMT system, a thread in which cache misses are not easily generated is switched to another thread when a constant time has passed.
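The switching policy described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `run_vmt`, the trace encoding ("op" for an ordinary instruction, "miss" for one that generates a cache miss), and the round-robin choice of the next thread are hypothetical and are not taken from the related art.

```python
from collections import deque

def run_vmt(traces, time_slice):
    """Simulate VMT-style switching over per-thread instruction traces.

    traces: dict mapping a thread id to a list of "op" / "miss" entries
            (hypothetical encoding chosen for this sketch).
    Returns a list of (thread_id, reason) entries, one per turn on the core.
    """
    ready = deque(traces)            # round-robin order of thread ids (assumed)
    pc = {t: 0 for t in traces}      # next instruction index for each thread
    switches = []
    while ready:
        t = ready.popleft()
        ran = 0
        reason = "done"
        while pc[t] < len(traces[t]):
            op = traces[t][pc[t]]
            pc[t] += 1
            ran += 1
            if op == "miss":         # long-latency event: give up the core
                reason = "cache_miss"
                break
            if ran == time_slice:    # no miss for a constant time: switch anyway
                reason = "time_slice"
                break
        if pc[t] < len(traces[t]):
            ready.append(t)          # thread still has work, so requeue it
        else:
            reason = "done"          # thread finished during this turn
        switches.append((t, reason))
    return switches
```

For instance, with thread X missing on its second instruction and thread Y never missing, X yields on the miss while Y yields only when its time slice expires.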
FIG. 3 is a diagram for explaining the process when a cache miss is generated in the in-order method. FIG. 4 is a diagram for explaining the process when a cache miss occurs in the out-of-order method. FIG. 5 is a diagram for explaining the thread switching method of the related art in the out-of-order method. In the related art, the VMT method is installed only on the in-order type processor described above.
In a processor for in-order execution, events such as cache misses occur in the program sequence, and the data for which a cache miss was generated is also returned from the memory in the program sequence (refer to FIG. 3). Meanwhile, in a processor for out-of-order execution, memory accesses do not necessarily occur in the order of the instructions in the program, and cache miss events are not always generated in the program sequence.
For example, as illustrated in FIG. 5, suppose that two instructions A and B, each of which generates a cache miss, exist on the thread X, and that the instruction A precedes the instruction B in the sequence of the thread X. When the instruction B can be executed before the instruction A, the cache miss by the instruction B is detected before the cache miss by the instruction A.
In the example of FIG. 5, when the cache miss by the instruction B is detected and the thread X is switched to the other thread Y before the cache miss by the instruction A is detected, the cache miss by the instruction A is generated after execution of the thread X is restarted.
In the in-order execution type processor, execution of the instruction B is started after the start of execution of the instruction A. Accordingly, the cache miss is generated in the sequence of the instructions A and B.
Moreover, in a shared memory system with multiple processors, it is known to use a mutex lock (Mutual Exclusion lock) in order to attain the exclusive access right. A spin-loop method has been proposed as a typical method for attaining the lock. In this method, a “lock variable” is provided on the main memory, and each processor repeatedly tries to reference and update the “lock variable”, spinning (waiting in an idle state) until it attains the lock. The lock variable indicates the lock state only during the period in which the lock is held, and indicates cancellation of the lock state when cancellation is requested. In this way, the exclusive access right is obtained among a plurality of processors.
However, this structure requires the lock variable to be checked continually by executing the loop. Moreover, in recent years the processing rate of processors has improved faster than the processing rate of the memory system, and the gap between the two tends to widen further.
In this state, as the number of idle iterations of the spin loop increases, the spin loop instruction stream is interpreted and executed during this period although no job is substantially carried out, which adversely affects the system performance. Particularly, in a large scale Symmetrical Multiprocessor (SMP) system, it is often observed that only a certain lock variable is used frequently. In this case, all CPUs except one are working uselessly, and the resulting performance cost to the system operation is left as a problem to be solved.
Moreover, in a processor core employing the multithread processing system, if a spin loop is generated in a certain thread, the idle operation of the spin loop, in which no job is substantially executed, impedes the progress of the other threads of the processor core.
A similar problem also arises in other processes that use a lock variable, for example, ordinary processor-to-processor synchronization such as barrier synchronization, I/O synchronization, and the idle loop.
The Japanese publications JP-A No. 1991-164964, JP-A No. 1986-229150, and JP-A No. 2002-41489 are known as the exclusive access control and synchronization control technology of the related art in the multiprocessor system.
The JP-A No. 1991-164964 discloses a mechanism to realize the exclusive access control with centralized monitoring on the main memory by storing a common variable on the relevant main memory. However, in recent processors having a cache memory, a modification in the cache is not immediately reflected on the main memory. Particularly with a write back cache memory, a considerably long time is usually required until the modification is reflected. Moreover, even with a write through cache, memory latency is very long in current processors, so that the delay until the modification is reflected becomes long and the performance deteriorates.
Accordingly, various spin loop problems described above cannot be solved only by the centralized supervising of the main memory as disclosed in the JP-A No. 1991-164964, and a method is now desired to solve the problems within the cache memory not influenced by the memory latency.
The JP-A No. 1986-229150 discloses the technology to realize exclusive control for the common memory among the CPUs by providing access control signal lines (pins) for exclusive control among the CPUs in addition to the system bus which is shared by a plurality of CPUs. In recent years, higher cost is required for connections among processors (for example, the number of input/output pins of an LSI) and it is more effective for improvement in the performance to use one pin as the data line than use of the same only as for exclusive access control. Otherwise, much contribution can be attained to reduction in manufacturing cost of the CPU by saving the number of pins even if only one pin is saved. Accordingly, it is desired to provide a method of realizing exclusive access control among CPUs without increase in the number of pins.
The JP-A No. 2002-41489 discloses a synchronization control circuit for synchronous control between the processor and the coprocessor which are in the master-and-slave relationship. However, application of this circuit into the system like an SMP (symmetrical multiprocessor) and a cc-NUMA (Cache-Coherent Non-Uniform Memory Access) in which individual processors equally use the common memory is difficult.
Namely, since the processor is in the standpoint to issue instructions to the coprocessor, it can detect the operating conditions of the coprocessor when it hopes. However, in the SMP system, since individual processor does not hold, in principle, the information of the operating conditions of the other processors, it is difficult to apply the technology of the JP-A No. 2002-41489 into the problem in the spin loop described above.
Moreover, in view of solving the problem of spin loop, the method for starting execution has been proposed in which when a particular event which shows spin loop to wait for release is detected, a processor or a thread which is considered as the factor thereof is stopped, the context of the thread in the stop state is saved to the memory, and a new context is stored from the memory (refer to the Japanese publication JP-A No. 1994-44089). However, in the JP-A No. 1994-44089, since the particular event which shows the spin loop is generated by miss-hit during access to the cache, the total performance is likely deteriorated because more useless thread switching and saving of context than the effect of improvement in the performance resulting from reduction in the spin loop time are generated.
Accordingly, in view of solving the problem of spin loop, the method to solve the same problem has been proposed as a background art in which possibility of update event of the lock variable for exclusive access control of memory access is forecasted and the process or thread is stopped at the part which will result in the spin loop. In other words, in view of realizing forecast of the possibility of the update event of lock variable, a new load instruction having the function to set the timing to start the supervising of the memory block in the range including the load object memory block (hereinafter referred to as LOAD-WITH-LOOKUP (LLKUP) instruction) and a write event detecting function for supervising the memory block are provided, and stop and restart of the processor are realized by executing and canceling a pause instruction of the SUSPEND instruction or the like in conjunction with the detection result of the LOAD-WITH-LOOKUP instruction and the write event detecting function.
Namely, FIG. 6A is a diagram for explaining a method of canceling a lock in the background art. As illustrated in FIG. 6A, for acquisition of lock of the lock variable [A] on the memory device, a useless spin loop has been executed to verify change in the lock variable [A] (release from the other processor) by repeating LD[A] after the failure in acquisition by CAS[A].
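The spin loop of FIG. 6A can be sketched as follows. This is a minimal model, assuming that 0 means "free" and 1 means "held"; the class `LockVariable`, the function `acquire`, and the use of a Python `threading.Lock` to stand in for the atomicity that hardware gives to CAS are all assumptions of this sketch.

```python
import threading

class LockVariable:
    """Lock variable [A] in shared memory; 0 = free, 1 = held (assumed encoding)."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def cas(self, expected, new):
        """Compare-and-swap: store `new` only if the value equals `expected`."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

    def load(self):
        """LD[A]: read the current value of the lock variable."""
        return self._value

    def store(self, value):
        """Store to the lock variable, e.g. storing 0 to release the lock."""
        with self._guard:
            self._value = value

def acquire(lock):
    """Spin until the lock is taken; return the number of wasted LD[A] polls."""
    wasted = 0
    while not lock.cas(0, 1):        # CAS[A] failed: another CPU holds the lock
        while lock.load() != 0:      # LD[A] repeated: the useless spin loop
            wasted += 1
    return wasted
```

Every iteration counted in `wasted` is an instruction stream that is interpreted and executed without carrying out any job, which is the cost discussed above.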
FIGS. 6B and 7-9 are diagrams for explaining four technologies using a LOAD-WITH-LOOKUP instruction. In contrast to FIG. 6A, in FIG. 6B, technology (1) uses a LOAD-WITH-LOOKUP instruction 601 (see also LLKUP instructions 701, 801, 901 with respect to other LLKUP technologies), in which a CPU1 issues the LOAD-WITH-LOOKUP instruction 601 after the failure of acquisition by the CAS[A] and supervises the store event to the lock variable [A] (possibility of release from the other CPU2). The store event to the lock variable [A] is performed via a store instruction 603 (see also store instructions 703, 803, 903 with respect to other LLKUP technologies). Moreover, the CPU1 also shifts to the pause state with the SUSPEND instruction 602 (see also SUSPEND instructions 702, 802, 902 with respect to other LLKUP technologies). Here, the CPU1 is reset from the pause state at the time a possible store event to the lock variable [A] from the other CPU2 is detected, in order to try the reacquisition of the lock variable [A]. Accordingly, it is no longer required to execute the useless spin loop.
Namely, in general, as illustrated in FIG. 7 for explaining LOAD-WITH-LOOKUP instruction technology (2), the CPU1 starts supervising the target lock variable [A] with the LOAD-WITH-LOOKUP instruction and thereafter shifts to the SUSPEND (pause state). Upon detection of the access for releasing the lock variable [A] from the other CPU2, the CPU1 is reset from the pause state and starts the subsequent execution of the instruction.
Moreover, in the technology to use the LOAD-WITH-LOOKUP instruction, forecasting of the portion which shows the spin loop and stop/restart of the processor are realized by analyzing the instruction stream of the existing programs. In other words, the step for finding out the instruction stream which becomes the spin loop to find the possibility of update event of the lock variable from the existing instruction stream and the step to stop the relevant processor or the relevant hardware thread in place of the conventional spin loop are executed.
However, the processors in recent years naturally form the cache and supervising of the main memory device is always accompanied by considerable difficulty. Therefore, the technology using the LOAD-WITH-LOOKUP instruction is provided with the write event detecting function for supervising and detecting possibility of the update event of lock variable within the cache memory.
Namely, as illustrated in FIG. 8, as a method of finding out the possibility of the update event of the lock variable [A], the CPU1 side is reset from the pause state in the timing for detecting invalidation on the cache memory of the lock variable [A] from the side of CPU2 in the lock state.
Here, there is a possibility, as illustrated in FIG. 9, that invalidation (release) on the cache memory of the lock variable [A] is detected during the period from the LOAD-WITH-LOOKUP instruction until the shift to the pause state. In this case, access to the lock variable [A] is continued directly without shifting to the pause state.
The higher the accuracy in detecting the possibility of an update of the lock variable, the higher the efficiency of the process becomes. Moreover, the constitution is provided so that an update that has actually been generated is never left undetected, in order to prevent an unintended hang-up.
Moreover, it is naturally required sometimes to use a suspend method that allows restart only by an existing interrupt, without supervising an address. Accordingly, it is convenient to provide a constitution that enables selection of the type of suspend instruction when it is used.
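The behavior of FIGS. 7 to 9 can be modeled as a small state machine. The class `Cpu` and its methods `load_with_lookup`, `snoop_store`, and `suspend` are hypothetical names chosen for this sketch, and memory is modeled as a plain dictionary; none of this is the actual instruction interface.

```python
class Cpu:
    """Minimal model of one CPU using LOAD-WITH-LOOKUP and SUSPEND (assumed names)."""

    def __init__(self):
        self.watched = None      # address armed by LOAD-WITH-LOOKUP
        self.pending = False     # a possible update was seen since arming
        self.paused = False

    def load_with_lookup(self, memory, addr):
        """Load the value and start supervising stores to `addr`."""
        self.watched = addr
        self.pending = False
        return memory[addr]

    def snoop_store(self, addr):
        """Called when another CPU stores to, or invalidates, `addr`."""
        if addr == self.watched:
            self.pending = True
            self.paused = False  # reset from the pause state (FIGS. 7 and 8)

    def suspend(self):
        """Try to pause; return True if the CPU actually entered the pause state."""
        if self.pending:         # update seen before the pause (FIG. 9)
            return False         # continue directly without pausing
        self.paused = True
        return True
```

Checking `pending` inside `suspend` is what keeps an update that arrives between the two instructions from being lost, so the CPU cannot pause forever on a lock that has already been released.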
However, in certain cases, an additional instruction cannot be generated/added to an existing instruction set or a program cannot be revised (or is difficult to revise) from the old instruction code. In this case, addition of the instruction cannot result in any merit. Therefore, it is desired to propose a method of resulting in the merit and solving such problem without addition of instructions.
On the other hand, for actual improvement, it is more desirable to add the LOAD-WITH-LOOKUP instruction and to give the instruction explicitly using the added instruction. Namely, both the method of adding the LOAD-WITH-LOOKUP instruction and the method of analyzing the existing instruction stream clearly improve on the existing method, but combining these methods is the best approach.
For the installing of this LOAD-WITH-LOOKUP instruction, it is required to supervise whether the memory address of the main memory device designated with the LOAD-WITH-LOOKUP instruction has been updated or not with the other thread or the other processor and therefore the following installing method has been proposed as the related art.
As a first installing method, a method is considered (refer to the U.S. Pat. No. 6,493,741 and the U.S. Pat. No. 6,674,192), in which all bits of the physical address of the cache line as the object of supervising are held in a supervising object management register as the exclusive register and presence of the access to the physical address as the object of supervising is detected with comparison of the physical addresses.
In this case, it is necessary to hold, to a supervising object management register, the information including a physical address of the supervising object, a bit indicating the supervising process, and a thread number of the supervising object. For example, when the WRITE access is generated to the cache memory or to the main memory from the other thread and the physical address is matched with that of the supervising object, update of the address of the supervising object is detected. Moreover, when the cache line including the address of the supervising object is lost by the replace of the cache memory, purge request from the other processor (discharge request) or by the invalidation request, it is reported that the address of the supervising object has been updated because of the possibility that the address of the supervising object may be updated with the other processor.
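The first installing method can be sketched as one register per thread plus an address comparison on every relevant event. The names `MonitorRegister` and `report_updates`, the event strings, and the 64-byte line size (chosen to match bits [46:6] of FIG. 16) are assumptions of this sketch.

```python
from dataclasses import dataclass

LINE_SHIFT = 6                                        # assumed 64-byte cache lines
LINE_LOSS_EVENTS = {"replace", "purge", "invalidate"}

@dataclass
class MonitorRegister:
    """Supervising object management register for one thread (hypothetical layout)."""
    thread: int             # thread number of the supervising object
    line_addr: int = 0      # physical line address being supervised
    active: bool = False    # bit indicating that supervising is in progress

    def arm(self, phys_addr):
        self.line_addr = phys_addr >> LINE_SHIFT
        self.active = True

def report_updates(registers, event, phys_addr, source_thread=None):
    """Return the threads whose supervised address must be reported as updated."""
    line = phys_addr >> LINE_SHIFT
    hits = []
    for reg in registers:
        if not reg.active or reg.line_addr != line:
            continue
        if event == "write" and source_thread != reg.thread:
            hits.append(reg.thread)   # WRITE access from another thread
        elif event in LINE_LOSS_EVENTS:
            hits.append(reg.thread)   # line lost: an update can no longer be ruled out
    return hits
```

Because each register holds a full line address, the storage grows with the number of threads, which is the scalability concern raised for this method.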
Next, as the second installing/loading method, a method is considered in which presence of access to the memory location as the supervising object is detected by storing the bit indicating the supervising object in a cache tag and supervising update and reference to the cache line to which the bit indicating the supervising object is set.
In this case, it is required to add, as an entry of cache tag, the bit indicating the supervising object and the thread number of the supervising object. For example, it is reported that the address of supervising object has been updated with the bit indicating the supervising object and the bit indicating the thread number of the supervising object registered to the cache tag at the time of processing the WRITE access request to the cache memory or main memory, or invalidation or purge request of the cache line by the replacement, and invalidation and purge request of the cache line by the request from the other processors.
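The second method can be sketched by attaching the supervising bit and the thread number to each tag entry. The names `TagEntry` and `CacheTags`, the request strings, and the use of a dictionary keyed by line address in place of a real tag RAM are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TagEntry:
    supervised: bool = False        # bit indicating the supervising object
    thread: Optional[int] = None    # thread number of the supervising object

class CacheTags:
    """Cache tag store modeled as a dictionary: line address -> tag entry."""

    KINDS = {"read", "write", "invalidate", "purge", "replace"}

    def __init__(self):
        self.entries: Dict[int, TagEntry] = {}

    def supervise(self, line_addr, thread):
        """Register the line (filling it if absent) as a supervising object."""
        self.entries[line_addr] = TagEntry(True, thread)

    def access(self, kind, line_addr):
        """Process one request; return the thread to notify, or None."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown request kind: {kind}")
        entry = self.entries.get(line_addr)
        if entry is None or kind == "read":
            return None
        notify = entry.thread if entry.supervised else None
        if kind == "write":
            entry.supervised, entry.thread = False, None   # report only once
        else:
            del self.entries[line_addr]                    # the line leaves the cache
        return notify
```

No address comparator is needed here because the lookup that the cache already performs finds the bit, but every tag entry must carry the extra fields.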
FIG. 16 illustrates a structure of an address comparator of the related art, corresponding to the first installing/loading method. The line address stored in an address supervising register 1601 is compared with the line address stored in an update access address register 1602 for storing the update access address at the time of cache access. The Ex-NOR logic gates 1611 to 1618 output the negation of the exclusive OR of each pair of address bits, and an AND gate 1619 outputs the logical AND of these results in order to detect matching of the line addresses. In the method of the related art, the physical addresses (41 bits of bit [46:6] in FIG. 16) have to be compared completely. Accordingly, the logic circuit becomes physically large.
FIG. 17 illustrates a method of storing the supervising addresses in a cache tag of the related art, corresponding to the second installing/loading method. A tag RAM 1701 includes a plurality of entries and each entry is formed of a valid flag 1702, cache status 1703, a supervising flag 1704, and a physical address 1705. In the method of the related art, the RAM becomes physically large because all entries of the relevant tag RAM are required to provide a valid flag and a supervising flag.
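The comparator logic can be written out bit by bit. In this sketch the function name `line_match` and the constants are assumptions; only the bit range [46:6] and the Ex-NOR/AND structure come from the description of FIG. 16.

```python
LOW_BIT, HIGH_BIT = 6, 46    # line-address bits [46:6], i.e. 41 bits

def line_match(supervised_addr, access_addr):
    """Bitwise model of the comparator: XNOR each bit pair, then AND the results."""
    match = 1
    for bit in range(LOW_BIT, HIGH_BIT + 1):
        a = (supervised_addr >> bit) & 1
        b = (access_addr >> bit) & 1
        xnor = 1 - (a ^ b)   # 1 when the two bits are equal
        match &= xnor        # the AND gate accumulates the per-bit results
    return bool(match)
```

One XNOR term is needed for each of the 41 compared bits, and this is repeated for every supervised address, which is why the circuit grows as described.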
FIG. 18 illustrates an example of the hardware structure for update control of the supervising object block, corresponding to the second installing/loading method. A read/write control unit 1811 judges whether the relevant cache access is the READ access or WRITE access when the cache access is generated and controls the select signal of multiplexers 1812 and 1813. For example, when the relevant cache access is the READ access, the read/write control unit 1811 controls the select signal to control the multiplexers 1812 and 1813 to output the READ address 1801. While when the relevant cache access is the WRITE access, it controls the select signal to control the multiplexers 1812 and 1813 to output the WRITE address 1802.
The tag RAM 1815 for WAY 0 and the tag RAM 1816 for WAY 1 are RAMs provided with the write enable (WE) terminal and the WE terminal(s) executes the write operation for the RAM when 1 is input to the relevant write enable.
A cache LRU control RAM 1817 might correspond to a cache LRU control RAM 1112 of FIG. 11. In the related art, the cache LRU control RAM 1817 is used for control of cache LRU (Least Recently Used) for the cache 1815, 1816 and outputs the replace WAY-ID 1803 based on the LRU information. An inverter logic gate 1814 is a logical gate for outputting negation of input.
In this related art, when the read access 1801 is generated as the cache access, the read/write control unit 1811 selectively controls the multiplexers 1812 and 1813, searches the relevant line address of the tag RAM 1815 for WAY 0 and the tag RAM 1816 for WAY 1, and also searches the cache LRU control RAM 1817. When a cache miss is generated in the searches of the tag RAM 1815 for WAY 0 and the tag RAM 1816 for WAY 1, the cache line is registered by replacing the relevant line address of the tag RAM for WAY 0 or WAY 1 in accordance with the replace WAY-ID 1803, on the basis of the LRU information of the cache LRU control RAM 1817.
Accordingly, when the line address is identical to the line address of the supervising object block and the WAY-ID of the tag RAM registered is identical, useless thread switching can be generated because the relevant supervising object block could be replaced.
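The useless switching described above can be reproduced with a model of a single set of a 2-way cache. The class `TwoWaySet`, its fields, and the single-bit LRU representation are assumptions of this sketch and are not the structure of FIG. 18.

```python
class TwoWaySet:
    """One set of a 2-way cache with LRU replacement (hypothetical model)."""

    def __init__(self):
        self.ways = [None, None]      # line address held in WAY 0 and WAY 1
        self.lru = 0                  # replace WAY-ID chosen on the next miss
        self.supervised_way = None    # WAY holding the supervising object block

    def read(self, line_addr):
        """Perform a READ access; return True if it evicted the supervised block."""
        if line_addr in self.ways:            # hit: only the LRU state changes
            self.lru = 1 - self.ways.index(line_addr)
            return False
        way = self.lru                        # miss: replace the LRU way
        evicted_supervised = (way == self.supervised_way)
        if evicted_supervised:
            self.supervised_way = None
        self.ways[way] = line_addr
        self.lru = 1 - way
        return evicted_supervised

    def supervise(self, line_addr):
        """Bring the line into the set and mark it as the supervising object."""
        self.read(line_addr)
        self.supervised_way = self.ways.index(line_addr)
```

If line A is supervised and lines B and C are then read, the read of C replaces A even though no processor has written to A, so an update of the supervising object would be reported and the waiting thread would be woken needlessly.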
In the first installing method of the related art, all physical addresses are stored in the supervising object management register for each thread and thereby the supervising object management register physically becomes large. As the trend in future, a high-end server which is required to have higher processing capability for the principal job processes in the company tends to cover high-level multithreading by a large scale CMP (Chip Multi-Processor). Therefore, it may be said that the system to obtain the supervising object by simply storing all physical addresses as many as the number of threads has insufficient expandability to high-level multithread processor in future.
In addition, in the second instruction installing method in the related art, it is required to add the entry for the supervising process to all cache lines of the cache tag, but the cache line as the supervising object has higher possibility that the cache line itself is purged at the time of the cache replacement thereof, resulting in the problem that unwanted thread switching is generated because update of the address of the supervising object is reported carelessly.
Therefore, the first and second installing/loading methods in the related art listed above can be said to be within the scope of the related art because these do not disclose any effective method for supervising the addresses in regard to the method of switching a plurality of threads.