1. Field of the Invention
The present invention relates to a data processing system, and more particularly to a multithreaded data processing system and an operating method thereof for processing a plurality of threads.
2. Description of the Conventional Art
Generally, plural instructions in a computer system are executed in sequence. The instruction processing time can be improved by using a cache memory, which significantly reduces latency, that is, the time span between the start and the completion of an instruction's execution. Such a cache memory serves to reduce the latency of a memory reference operation from tens of cycles to a few cycles.
In a single-thread processor, the current state of a computation is defined by the contents of the program counter, the general-purpose registers, the status registers, and the like of the data processing system, wherein the term "thread" means a statically ordered sequence of instructions. Typically, the machine state of a thread, comprising the program counter and registers mentioned above, is referred to as a hardware context. When the computation is interrupted in the above-described single-thread processor, that is, when the processor enters a waiting mode in which it must wait until a predetermined resource becomes available, the related context must be stored in a memory so that the computation can later be resumed from the point of interruption.
Such a procedure may be referred to as a context switch, i.e., the act of redirecting processor execution from one context to another. That is, execution of one thread stops, to permit starting or resuming another thread's execution. Accordingly, the context should be saved in a memory and restored from the memory if required. However, such a context switch incurs context-switching overhead.
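The save-and-restore procedure described above can be illustrated with a minimal sketch. The class names `Context` and `SingleContextProcessor`, the four-register file, and the `memory_transfers` counter are hypothetical and serve only to make the overhead visible; they do not describe any particular processor.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Context:
    """Hardware context of one thread: program counter, registers, status."""
    pc: int = 0
    regs: List[int] = field(default_factory=lambda: [0] * 4)
    status: int = 0


class SingleContextProcessor:
    """Hypothetical processor that holds exactly one hardware context."""

    def __init__(self) -> None:
        self.current = Context()
        self.running: Optional[int] = None    # id of the running thread
        self.memory: Dict[int, Context] = {}  # contexts saved in memory
        self.memory_transfers = 0             # counts saves and restores

    def context_switch(self, next_thread: int) -> None:
        # Save the outgoing context to memory so it can be resumed later.
        if self.running is not None:
            cur = self.current
            self.memory[self.running] = Context(cur.pc, list(cur.regs), cur.status)
            self.memory_transfers += 1
        # Restore the incoming context from memory, or start a fresh one.
        if next_thread in self.memory:
            self.current = self.memory.pop(next_thread)
            self.memory_transfers += 1
        else:
            self.current = Context()
        self.running = next_thread


cpu = SingleContextProcessor()
cpu.context_switch(1)
cpu.current.pc = 40
cpu.context_switch(2)   # thread 1 is saved to memory
cpu.current.pc = 8
cpu.context_switch(1)   # thread 2 is saved, thread 1 is restored
```

In this sketch every switch away from a running thread costs one memory transfer, and every resumption costs another, which corresponds to the context-switching overhead mentioned above.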
In order to decrease or eliminate such context-switching overhead, a multithreaded architecture may be employed to process a plurality of threads in parallel, thereby reducing the required instruction processing time. Depending on its design, such a multithreaded architecture enables a workstation or a personal computer to cope effectively with its general workload.
Specifically, data processing speed is improved by providing a plurality of general-purpose registers, status registers and program counters in a multithreaded processor architecture, so that a plurality of thread contexts can be held simultaneously in hardware. Therefore, during a context switch, it is not necessary to store the contents of the registers into a memory and to retrieve the stored contents later. As a result, the processor is freed from procedures that incur a long latency. In return, however, costly hardware is required, so that a compromise must be made.
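As a rough illustration of holding several thread contexts in hardware, the following sketch replicates the program counter, registers and status register per context, so that a switch merely selects another set. The class `MultiContextProcessor` and its fields are hypothetical names chosen for this example only.

```python
class MultiContextProcessor:
    """Hypothetical processor with one register set per hardware context."""

    def __init__(self, num_contexts: int, num_regs: int = 4) -> None:
        # Replicated state: one program counter, register file and status
        # register for each context (this is the extra hardware cost).
        self.pcs = [0] * num_contexts
        self.regs = [[0] * num_regs for _ in range(num_contexts)]
        self.status = [0] * num_contexts
        self.active = 0
        self.memory_transfers = 0  # stays 0: nothing is saved or restored

    def context_switch(self, next_context: int) -> None:
        if not 0 <= next_context < len(self.pcs):
            raise ValueError("no such hardware context")
        # Switching only changes which replicated register set is selected.
        self.active = next_context


multi = MultiContextProcessor(num_contexts=2)
multi.pcs[multi.active] = 40
multi.context_switch(1)
multi.pcs[multi.active] = 8
multi.context_switch(0)   # context 0 is still intact in hardware
```

The trade-off is visible in the constructor: the amount of state grows linearly with the number of supported contexts.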
Support for more than one context affects the performance of a processor, as well as the cost of its hardware. Increased costs stem from the replication of registers and other state circuitry for each context. If access to the register file is on the processor's critical path, the cycle time could increase due to the larger number of registers in multithreaded units.
For the most part, an architectural decision about how many contexts to support is based on the hardware budget, cycle time considerations, and expectations of run length and latency.
A computer system as a data processing apparatus is based on a CPU (Central Processing Unit), or processor, which recognizes and processes given instructions, and it is applied in a variety of industrial fields. Although such a processor is capable of recognizing a considerable number of instructions, system speed is degraded by rarely used instructions. Therefore, in order to prevent such degradation, complicated and long instructions may be advantageously replaced by combinations of more frequently employed short instructions. Consequently, a new design technique has been introduced, wherein the respective instructions are set identical in size and multiple instructions are executed concurrently. A conventional data processing system having a so-called superscalar architecture for simultaneously processing a plurality of instructions will now be described.
A main trend of computer design under the above-mentioned circumstances is the superscalar processor, which is able to issue two or more subinstructions within a single cycle. The respective subinstructions are processed by respective ones of multiple functional units in the processor for processing a plurality of threads.
The combined use of superscalar and multithreaded architectures provides processor coupling. Under processor coupling, a plurality of subinstructions from several threads may be issued concurrently within a single cycle.
FIG. 1 illustrates the processing of multiple threads in a conventional multithreaded architecture. As shown therein, there are provided, for example, three threads T1, T2, T3, each of which designates some of eight functional units f1-f8 in a processor which are to be used during, for example, five consecutive cycles. Here, reference letter E denotes the respective executions of the functional units f1-f8 during the respective cycles, which are arranged in a vertical direction in the drawing. The respective functional units f1-f8 in the processor are mapped from different threads, which illustrates processor coupling.
When two or more threads need access to the same functional unit, one of the threads must wait until the corresponding functional unit completes its processing of the other thread.
In cycle 1 in FIG. 1, the first thread T1 is allocated to the third and fourth functional units f3, f4. In cycle 2, the first and second functional units f1, f2, which the first thread T1 also requires, are allocated to the third thread T3, so that the first and second functional units f1, f2 are assigned to the first thread T1 in the subsequent cycles 3, 4, respectively. Accordingly, a plurality of threads can be simultaneously processed within a certain number of cycles.
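The waiting behavior described above can be sketched as a simple per-cycle allocation of functional units. The function `schedule`, the fixed-priority rule (threads are served in the order they are listed), and the request lists are assumptions made for illustration; they do not reproduce the allocation of FIG. 1 or the arbitration scheme of any actual processor.

```python
from typing import Dict, List


def schedule(requests: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Allocate functional units to threads cycle by cycle.

    `requests` maps a thread name to the functional units it needs.
    Threads are served in dictionary order (an assumed fixed priority).
    Returns one dict per cycle mapping functional unit -> thread.
    """
    pending = {thread: list(units) for thread, units in requests.items()}
    cycles = []
    while any(pending.values()):
        busy: Dict[str, str] = {}
        for thread, units in pending.items():
            waiting = []
            for unit in units:
                if unit in busy:
                    waiting.append(unit)   # unit taken this cycle: wait
                else:
                    busy[unit] = thread    # unit granted this cycle
            pending[thread] = waiting
        cycles.append(busy)
    return cycles


# Hypothetical example: T1 and T2 both need functional unit f1.
cycles = schedule({"T1": ["f1", "f3"], "T2": ["f1", "f2"]})
for number, allocation in enumerate(cycles, start=1):
    print(f"cycle {number}: {allocation}")
```

In this example T2 receives f2 in the first cycle but must wait until the second cycle for f1, while units from both threads are still used within the same cycle.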
Such an architecture provides interleaving as well as thread switching, as described below. When a thread includes long-latency instructions and has no instructions that can be issued while such long-latency instructions are outstanding, instructions from other threads are automatically sent to the functional units. In fact, thread switching may be understood as a subcategory of interleaving. Further details as to FIG. 1 are described in S. Keckler and W. Dally, 19th International Symposium on Computer Architecture, 1992.
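The effect of issuing instructions from other threads while one thread waits on a long-latency instruction can be sketched as follows. The single issue slot, the function `interleave`, and the representation of a thread as a list of instruction latencies are simplifying assumptions for this example, not features taken from the cited reference.

```python
from typing import Dict, List, Optional


def interleave(threads: Dict[str, List[int]]) -> List[Optional[str]]:
    """Return, for each cycle, the thread that issued (None if idle).

    Each thread is a list of latencies: after issuing an instruction with
    latency n at cycle c, the thread cannot issue again before cycle c + n.
    One instruction is issued per cycle (assumed single issue slot).
    """
    remaining = {name: list(lat) for name, lat in threads.items()}
    ready_at = {name: 0 for name in threads}
    trace: List[Optional[str]] = []
    cycle = 0
    while any(remaining.values()):
        issued = None
        for name in remaining:
            if remaining[name] and ready_at[name] <= cycle:
                latency = remaining[name].pop(0)
                ready_at[name] = cycle + latency
                issued = name
                break
        trace.append(issued)
        cycle += 1
    return trace


alone = interleave({"A": [3, 1]})                  # A stalls for two cycles
shared = interleave({"A": [3, 1], "B": [1, 1]})    # B fills the stall
print(alone)
print(shared)
```

With thread A alone, two issue slots stay empty while its first instruction completes; with thread B present, the same slots are filled by B's instructions.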
Next, a VLIW (Very Long Instruction Word) processor for processing a plurality of threads will be explained.
The VLIW processor processes instructions in parallel to reduce the number of instructions, wherein a single instruction specifies more than one subinstruction.
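As a minimal sketch of an instruction that specifies several subinstructions, the following example models an instruction word as a list of subinstructions that all read their operands before any result is written, so that they behave as if executed in parallel. The tuple format `(operation, destination, source1, source2)`, the register names, and the function `execute_word` are hypothetical.

```python
from operator import add, mul, sub
from typing import Dict, List, Tuple

# Hypothetical subinstruction format: (operation, destination, source1, source2).
SubInstruction = Tuple[str, str, str, str]
OPS = {"add": add, "sub": sub, "mul": mul}


def execute_word(word: List[SubInstruction], regs: Dict[str, int]) -> Dict[str, int]:
    """Execute all subinstructions of one long instruction word."""
    # Read phase: every subinstruction sees the register values
    # as they were before the word started executing.
    results = [(dst, OPS[op](regs[a], regs[b])) for op, dst, a, b in word]
    # Write phase: commit all results together.
    for dst, value in results:
        regs[dst] = value
    return regs


regs = {"r1": 2, "r2": 3, "r3": 0, "r4": 0}
word = [("add", "r3", "r1", "r2"), ("mul", "r4", "r1", "r2")]
execute_word(word, regs)
print(regs)
```

Here one instruction word carries an addition and a multiplication, which two separate instructions would otherwise be needed to express.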
FIG. 2 illustrates the instruction timing in a VLIW processor. As shown therein, a plurality of instructions are simultaneously processed. Here, reference characters i0, i1, i2 respectively denote the instructions of a serial instruction stream, f denotes an instruction fetch stage, d denotes an instruction decoding stage, and e denotes an instruction execution stage. More details related to VLIW can be found in "Superscalar Microprocessor Design" by Mike Johnson, Prentice Hall, 1991.
As described above, context switching occurs in a multithreaded processor architecture. In particular, the context-switching overhead, which degrades system performance, tends to be more problematic for long-latency instructions when a multithreaded architecture is combined with a VLIW or superscalar processor.