1. Field of the Invention
The present invention relates to a parallel processing apparatus, and, more particularly, to a parallel processing apparatus which processes programs in parallel while generating and terminating a thread consisting of a plurality of instructions in a plurality of processors.
2. Description of the Related Art
Today""s typical computers are of a von Neumann-type whose built-in processor that plays the key role in each computer repeats a sequence of procedures of fetching a single instruction, decoding it, executing a process specified by that instruction, accessing the memory and writing the execution result back in the memory.
To improve the processing speed, current computers each have a cache memory with a fast access speed provided between the main memory and the processor. The processor therefore mainly exchanges data with the cache memory. The operation of the processor of reading an instruction from the cache memory is called "instruction fetching", the operation of decoding an instruction is called "instruction decoding", and the operation of writing the execution result back into the memory is called "write back".
Pipelining is known as one of the techniques that improve the processing speed of processors. The pipelining process is described in many books about computers, for example, "Computer Architecture" by Hennessy and Patterson. Pipelining is a technique that improves the processing performance by allowing a plurality of instructions, each of which performs only a part of the entire process, to be executed in an overlapped manner in one clock cycle.
FIG. 13 is a diagram for explaining a pipelining process.
An instruction is executed in separate pipeline stages called "instruction fetching (IF)", "instruction decoding (ID)", "instruction execution (EX)", "memory access (MEM)" and "write back (WB)". In cycle T1, an instruction at an address "1000" undergoes instruction fetching. In cycle T2, the instruction at the address "1000" undergoes instruction decoding while an instruction at an address "1004" simultaneously undergoes instruction fetching. This technique of simultaneously executing a plurality of instructions in an overlapped manner is called "pipelining". Registers placed between processes are called "pipeline registers", and a process unit for carrying out each process is called a "pipeline stage". As is apparent from the above, pipelining speeds up the processing as a whole by executing instructions, described in a program, in parallel.
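The overlap described above can be illustrated with a small sketch. The function below is purely hypothetical (the document describes hardware, not software); it computes, for a stall-free five-stage pipeline, which instruction occupies which stage in each cycle, using the word-aligned addresses of FIG. 13.

```python
# Hypothetical sketch of five-stage pipelining: instruction i enters
# stage s in cycle i + s + 1, so up to five instructions overlap per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(num_instructions):
    """Return {cycle: [(address, stage), ...]} for a stall-free pipeline."""
    schedule = {}
    for i in range(num_instructions):
        address = 1000 + 4 * i          # word-aligned addresses as in FIG. 13
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s + 1, []).append((address, stage))
    return schedule

sched = pipeline_schedule(3)
# Cycle T2: instruction "1000" is in ID while instruction "1004" is in IF.
```

Note how, from cycle T5 onward, every stage is occupied by a different instruction, which is the source of the speed-up.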
However, there occurs a circumstance where an instruction cannot be executed in the proper cycle due to a change in the program flow caused by a branching instruction. While a scheme of computing the address of the branching destination specified by the branching instruction at an early stage in the pipeline, such as the ID stage, is adopted for faster processing, the branching destination of a conditional branching instruction cannot be determined until its condition is determined. For a conditional branching instruction, therefore, the cycle that stalls the pipeline is eliminated by predicting whether or not the branching condition is satisfied by using history information (see pp. 302 to 307 in the aforementioned book "Computer Architecture" by Hennessy and Patterson).
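One common form of such history-based prediction is a per-branch saturating counter. The sketch below is a minimal, hypothetical illustration (not the document's hardware): each branch address keeps a 2-bit state 0..3, "taken" is predicted when the state is 2 or 3, and the state is nudged toward each actual outcome, so a single misprediction does not immediately flip a well-established prediction.

```python
# Hedged sketch of history-based branch prediction: a 2-bit saturating
# counter per branch address; predict "taken" when the counter is 2 or 3.
class TwoBitPredictor:
    def __init__(self):
        self.counters = {}              # branch address -> state 0..3

    def predict(self, addr):
        """Predict taken (True) or not taken (False) from history."""
        return self.counters.get(addr, 1) >= 2   # weakly-not-taken default

    def update(self, addr, taken):
        """Nudge the counter toward the actual outcome, saturating at 0/3."""
        state = self.counters.get(addr, 1)
        self.counters[addr] = min(state + 1, 3) if taken else max(state - 1, 0)

p = TwoBitPredictor()
p.update(0x1000, True)
p.update(0x1000, True)   # two taken outcomes saturate the counter upward
```

The multi-state history information mentioned later in this document (states graded by prediction probability) follows the same principle.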
A "superscalar" system ("Superscalar" by Johnson), which improves the processing speed by providing a plurality of processing elements or processor elements in a single processor and simultaneously issuing a plurality of instructions, has already been put to practical use. The superscalar system is ideally capable of executing, in one clock cycle, instructions equal in number to the provided processor elements. It is said, however, that even if the number of processor elements were increased limitlessly, instructions would not be executed smoothly because of branching instructions, and the actual performance would be restricted to about three to four times that of a single processor.
Another practical way of improving the processing speed is to perform parallel processing by using a plurality of processors. In a typical processor system which accomplishes parallel processing with a plurality of processors, parallel processing is executed by carrying out communication among the processors to assign processes to the individual processors. A system which uses conventional processors accomplishes such communication by an interruption processing scheme that is carried out externally on each processor as an interrupt control.
In the interruption processing scheme, when an external unit interrupts a processor, the program to be executed in the processor is switched from a user program to an interruption program, and the interruption process is then executed. When the interruption process is completed, the original user program is resumed. To switch the execution program in a processor, data which will be used again by the original user program, such as data in the program counter or register file, is saved in a memory device. The overhead needed for such data saving when switching between programs is non-negligibly large, and an interruption process generally takes time. A parallel processing system which uses interruption processing therefore suffers a large overhead in communications between processors, which is an impediment to enhancing the performance.
One solution to this problem is the so-called multi-thread architecture. This technique is disclosed in, for example, "A Multi-threaded Massively Parallel Architecture" by R. S. Nikhil, G. M. Papadopoulos, and Arvind, Proceedings of the 19th International Symposium on Computer Architecture, pp. 156-167.
A "thread" is a sequence of instructions. A program consists of a plurality of threads. In a multi-thread architecture, thread-by-thread processes are assigned to a plurality of processors so that those processors can process the threads in parallel. The multi-thread architecture therefore has a mechanism and an instruction for allowing a thread which is being executed on one processor to generate a new thread on another processor.
The generation of a new thread on another processor is called "forking a thread", and an instruction to fork a thread is called a "fork instruction". A fork instruction specifies to which processor element a thread should be forked and which thread to fork.
Control parallel processing has been proposed in, for example, "Proposition of On-Chip Multi-Stream Control Architecture (MUSCAT)" by Torii et al., Joint Symposium on Parallel Processing JSPP '97, pp. 229-236. The multi-stream control architecture analyzes the control flow of a program, predicts a path which is very likely to be executed soon, and speculatively executes that path before its execution is established. In this manner, the multi-stream control processes programs in parallel.
FIG. 14 is a diagram showing a model of multi-stream control.
A conventional sequence of instructions which are executed sequentially consists of threads A, B, and C. In the sequential execution, one processor processes the threads A, B, and C in order as shown in section (a) in FIG. 14. In the multi-stream control, by way of contrast, while a processor element (PE)#0 is processing the thread A, the thread B which is expected to be executed later is forked to and is processed by a processor element #1 as shown in section (b) in FIG. 14. The processor element #1 forks the thread C to a processor element #2. The speculative execution of threads which are expected to be executed later can ensure parallel processing of threads, thus improving the processing performance.
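The fork chain of FIG. 14(b) can be modeled in a few lines. The sketch below is an illustrative software model only (the names `run_multistream`, the thread labels, and the processor-element numbering are assumptions for illustration, not the MUSCAT hardware): each processor element #k runs thread k and forks the successor thread to processor element #k+1.

```python
# Illustrative model of the fork chain in FIG. 14(b): PE#0 runs thread A
# and forks B to PE#1; PE#1 forks C to PE#2; all three then run in parallel.
def run_multistream(threads):
    """Assign thread k to processor element k; return assignments and forks."""
    assignments = [(pe, thread) for pe, thread in enumerate(threads)]
    forks = [(pe, pe + 1) for pe in range(len(threads) - 1)]  # (from PE, to PE)
    return assignments, forks

assignments, forks = run_multistream(["A", "B", "C"])
# assignments: PE#0->A, PE#1->B, PE#2->C; forks: PE#0->PE#1, PE#1->PE#2
```

In the sequential case of FIG. 14(a), by contrast, a single processor would simply execute A, B, and C in order.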
The aforementioned paper that proposed MUSCAT mentions that it is not always possible to predict, before execution, whether or not a thread is to be forked. It is also known that adequate parallel processing cannot be achieved merely by established forking, which involves only threads whose forking has been established before execution. In this respect, MUSCAT employs controlled speculation, which analyzes a program at compile time and speculatively executes a thread that is highly likely to be executed before its execution is established. A fork instruction that is to be speculatively executed is called a "speculation fork instruction". If the speculative execution in the multi-stream control fails, however, the thread that has been speculatively executed must be canceled before actual execution. This means a wasteful operation of the processor elements, which undesirably leads to increased power consumption.
A thread which is executed by each processor element finishes a series of processes with its end instruction. When a thread is forked by a speculation fork instruction, the termination of the thread becomes effective in response to the end instruction. When a thread is not forked, however, it may be unnecessary to execute such an end instruction in some cases. To cope with this situation, MUSCAT uses a conditioned end instruction, so that executing the end instruction depends on whether or not its condition is met. As a plurality of threads are processed in parallel, however, a conditioned end instruction, which is to be executed after its condition is met, may be processed in the multi-stream control before the instruction that determines that condition is executed. In such a case, the conditioned end instruction should wait for the processing of the condition-determining instruction to end. Until the condition is determined, fetching and the like of subsequent instructions is carried out; if termination is then decided, that work turns out to be wasteful. This also results in increased power consumption.
Accordingly, it is an object of the present invention to provide a parallel processing apparatus which is used in a processor system that carries out parallel processing using a plurality of processors and which efficiently executes fork instructions for activating a plurality of processors, thereby reducing power consumption.
It is another object of this invention to provide a parallel processing apparatus capable of efficiently terminating a thread with respect to the aforementioned conditioned end instruction for the thread.
It is a further object of this invention to provide a parallel processing apparatus which efficiently accomplishes the execution of the aforementioned speculation fork instruction and thread-end-conditioned thread-end instruction in the form of a hardware unit.
To achieve the above objects, according to the first aspect of this invention, there is provided a parallel processing apparatus having processing means for generating (forking) a thread consisting of a plurality of instructions on an external unit,
the processing means including a predicting section for making a prediction of whether or not a fork condition of a fork-conditioned fork instruction is satisfied after fetching but before executing the instruction.
According to the second aspect of this invention, there is provided a parallel processing apparatus comprising processing means having means for issuing an externally forked thread,
the processing means including a predicting section for making a prediction of whether or not a thread-end condition of a thread-end-conditioned thread-end instruction for terminating the forked thread is satisfied after fetching but before executing the instruction.
According to the third aspect of this invention, there is provided a parallel processing apparatus comprising processing means for generating a thread consisting of a plurality of instructions on an external unit, the processing means including:
means for issuing an externally forked thread; and
a predicting section for predicting whether or not a fork condition of a fork-conditioned fork instruction is satisfied after fetching but before executing the fork instruction and whether or not a thread-end condition of a thread-end-conditioned thread-end instruction for terminating the forked thread is satisfied after fetching but before executing the thread-end instruction.
According to one modification of the parallel processing apparatuses of the first to third aspects, in addition to making the prediction, when an input instruction is a conditional branching instruction, the predicting section predicts whether or not the conditional branching instruction is satisfied.
In any one of the above-described parallel processing apparatuses, a plurality of the processing means may be provided.
In any one of the above-described parallel processing apparatuses, the predicting section may make the prediction using history information. In this case, it is preferable that the history information have a plurality of states according to probabilities of the prediction. In the latter case, the predicting section may predict the fork condition, the thread-end condition or the conditional branching instruction based on the states.
In the parallel processing apparatus according to the first aspect, it is preferable that the fork-conditioned fork instruction include information about the result of previous analysis of the probability of the fork condition, and the predicting section predicts whether or not the fork condition is satisfied in accordance with the probability.
In the parallel processing apparatus according to the second aspect, it is preferable that the thread-end-conditioned thread-end instruction include information about the result of previous analysis of the probability of the thread-end condition, and the predicting section predicts whether or not the thread-end condition is satisfied in accordance with the probability.
In the parallel processing apparatus according to the third aspect, it is preferable that the fork-conditioned fork instruction include information about results of previous analysis of the probability of the fork condition and a probability of the thread-end condition, and the predicting section predicts whether or not the fork condition and the thread-end condition are satisfied in accordance with the probabilities.
In the parallel processing apparatus according to the aforementioned second case, the processing means may further include memory means for storing the history information associated with at least two of the fork condition, the thread-end condition, and the conditional branching instruction.
In the parallel processing apparatus according to the modification, the processing means may further include address generating means for generating a top instruction address of a thread to be generated when the fork condition is satisfied and generating an instruction address of a branching target when the conditional branching instruction is satisfied.
According to a more specific example of the first aspect of this invention, there is provided a parallel processing apparatus comprising:
analysis means for analyzing an input instruction;
prediction means for, when the instruction analyzed by the analysis means is a fork-conditioned fork instruction, predicting whether or not a fork condition of the fork-conditioned fork instruction is satisfied after fetching but before executing the instruction and sending out a fork instruction in accordance with a result of the prediction; and
execution means for executing the instruction, deciding whether or not the prediction of the fork instruction is correct, and sending out an instruction to cancel a thread generated by the fork instruction, when the fork instruction has been sent out and the prediction is wrong.
According to a more specific example of the second aspect of this invention, there is provided a parallel processing apparatus comprising:
analysis means for analyzing an input instruction;
prediction means for, when the instruction analyzed by the analysis means is a thread-end-conditioned thread-end instruction for terminating a forked thread, predicting whether or not a thread-end condition of the thread-end-conditioned thread-end instruction is satisfied after fetching but before executing the instruction, and sending out a thread-end instruction in accordance with a result of the prediction; and
execution means for executing the instruction, deciding whether or not the prediction of the thread-end instruction is correct, and sending out an instruction to cancel stopping of a thread which has been stopped by the thread-end instruction, when the thread-end instruction has been sent out and the prediction is wrong.
In the parallel processing apparatus according to the specific examples of the first and second aspects of this invention, it is preferable that the prediction means should include memory means for storing history information and update means for updating the history information stored in the memory means;
the execution means informs the update means of a result of the decision; and
the update means updates the history information in accordance with the result of the decision.
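The cooperation of the analysis, prediction, execution, and update means described above can be sketched in software. The following is a hedged, hypothetical model only (class and function names are invented for illustration; the actual apparatus is hardware): the prediction means consults multi-state history information to decide whether to send out a fork speculatively, the execution means later decides whether the prediction was correct and cancels a mispredicted fork, and it reports the decided outcome so the update means can refresh the history. The thread-end-conditioned thread-end instruction of the second specific example would be handled analogously.

```python
# Hypothetical model of the predict / execute / cancel / update cycle for
# a fork-conditioned fork instruction.
class ForkPredictor:
    def __init__(self):
        self.history = {}                    # instruction address -> state 0..3

    def predict_fork(self, addr):
        """Prediction means: send out a fork when history says 'likely'."""
        return self.history.get(addr, 2) >= 2

    def update(self, addr, satisfied):
        """Update means: nudge the history toward the decided outcome."""
        state = self.history.get(addr, 2)
        self.history[addr] = min(state + 1, 3) if satisfied else max(state - 1, 0)

def execute_fork(predictor, addr, condition_satisfied, forked):
    """Execution means: decide correctness, cancel a wrong fork, report it."""
    cancel = forked and not condition_satisfied  # mispredicted speculative fork
    predictor.update(addr, condition_satisfied)  # inform the update means
    return cancel

p = ForkPredictor()
forked = p.predict_fork(0x2000)              # prediction means: fork now
cancel = execute_fork(p, 0x2000, False, forked)
# the fork condition was not satisfied, so the generated thread is canceled
```

The point of the design is that a correct prediction lets the forked thread start several cycles early, while a wrong one costs only the cancellation of the speculatively generated thread.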