The present invention relates in general to an improved method for and apparatus of a computer data processing system; and in particular, to an improved high performance multithreaded computer data processing system and method embodied in the hardware of the processor.
The fundamental structure of a modern computer includes peripheral devices to communicate information to and from the outside world; such peripheral devices may be keyboards, monitors, tape drives, communication lines coupled to a network, etc. Also included in the basic structure of the computer is the hardware necessary to receive, process, and deliver this information from and to the outside world, including busses, memory units, input/output (I/O) controllers, storage devices, and at least one central processing unit (CPU), etc. The CPU is the brain of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors actually perform very simple operations quickly, such as arithmetic, logical comparisons, and movement of data from one location to another. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system, however, may actually be the machine performing the same simple operations, but much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
One measure of the overall speed of a computer system, also called the throughput, is the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, particularly the clock speed of the processor. If everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Computer processors which were constructed from discrete components years ago were made significantly faster by shrinking the size and reducing the number of components; eventually the entire processor was packaged as an integrated circuit on a single chip. The reduced size made it possible to increase the clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems still exists. Hardware designers have been able to obtain still further improvements in speed by greater integration, by further reducing the size of the circuits, and by other techniques. Designers, however, think that physical size reductions cannot continue indefinitely and there are limits to continually increasing processor clock speeds. Attention has therefore been directed to other approaches for further improvements in overall speed of the computer system.
Without changing the clock speed, it is still possible to improve system speed by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this practical. The use of slave processors considerably improves system speed by off-loading work from the CPU to the slave processor. For instance, slave processors routinely execute repetitive and single special purpose programs, such as input/output device communications and control. It is also possible for multiple CPUs to be placed in a single computer system, typically a host-based system which services multiple users simultaneously. Each of the different CPUs can separately execute a different task on behalf of a different user, thus increasing the overall speed of the system to execute multiple tasks simultaneously. It is much more difficult, however, to improve the speed at which a single task, such as an application program, executes. Coordinating the execution and delivery of results of various functions among multiple CPUs is a tricky business. For slave I/O processors this is not so difficult because the functions are pre-defined and limited but for multiple CPUs executing general purpose application programs it is much more difficult to coordinate functions because, in part, system designers do not know the details of the programs in advance. Most application programs follow a single path or flow of steps performed by the processor. While it is sometimes possible to break up this single path into multiple parallel paths, a universal application for doing so is still being researched. Generally, breaking a lengthy task into smaller tasks for parallel processing by multiple processors is done by a software engineer writing code on a case-by-case basis. This ad hoc approach is especially problematic for executing commercial transactions which are not necessarily repetitive or predictable.
Thus, while multiple processors improve overall system performance, there are still many reasons to improve the speed of the individual CPU. If the CPU clock speed is given, it is possible to further increase the speed of the CPU, i.e., the number of operations executed per second, by increasing the average number of operations executed per clock cycle. A common architecture for high performance, single-chip microprocessors is the reduced instruction set computer (RISC) architecture characterized by a small simplified set of frequently used instructions for rapid execution, those simple operations performed quickly as mentioned earlier. As semiconductor technology has advanced, the goal of RISC architecture has been to develop processors capable of executing one or more instructions on each clock cycle of the machine. Another approach to increase the average number of operations executed per clock cycle is to modify the hardware within the CPU. This throughput measure, clock cycles per instruction, is commonly used to characterize architectures for high performance processors. Instruction pipelining and cache memories are computer architectural features that have made this achievement possible. Pipeline instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished. Cache memories store frequently used and other data nearer the processor and allow instruction execution to continue, in most cases, without waiting the full access time of a main memory. Some improvement has also been demonstrated with multiple execution units with look ahead hardware for finding instructions to execute in parallel.
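The relationship among clock speed, clock cycles per instruction, and operations executed per second can be illustrated with the standard performance relation (execution time = instruction count × cycles per instruction ÷ clock rate). The following Python sketch is offered only as an illustration; the function names and sample figures are assumptions and do not appear in this specification.

```python
def execution_time_seconds(instruction_count, cycles_per_instruction, clock_hz):
    """Time to run a program: instructions * CPI / clock rate."""
    return instruction_count * cycles_per_instruction / clock_hz


def instructions_per_second(cycles_per_instruction, clock_hz):
    """Operations executed per second at a fixed clock speed."""
    return clock_hz / cycles_per_instruction


# With the clock fixed at 100 MHz, halving the cycles per instruction
# doubles the number of instructions executed per second.
baseline = instructions_per_second(1.0, 100e6)
improved = instructions_per_second(0.5, 100e6)
```

Under these assumed figures, `improved` is twice `baseline` even though the clock speed is unchanged, which is the effect sought by reducing the average number of cycles per instruction.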
The performance of a conventional RISC processor can be further increased in the superscalar computer and the Very Long Instruction Word (VLIW) computer, both of which execute more than one instruction in parallel per processor cycle. In these architectures, multiple functional or execution units are provided to run multiple pipelines in parallel. In a superscalar architecture, instructions may be completed in-order or out-of-order. In-order completion means no instruction can complete before all instructions dispatched ahead of it have been completed. Out-of-order completion means that an instruction is allowed to complete before all instructions ahead of it have been completed, as long as predefined rules are satisfied.
For both in-order and out-of-order completion of instructions in superscalar systems, pipelines will stall under certain circumstances. An instruction that is dependent upon the results of a previously dispatched instruction that has not yet completed may cause the pipeline to stall. For instance, instructions dependent on a load/store instruction in which the necessary data is not in the cache, i.e., a cache miss, cannot be completed until the data becomes available in the cache. Maintaining the requisite data in the cache necessary for continued execution and to sustain a high hit ratio, i.e., the number of times the data was readily available in the cache compared to the number of requests for data, is not trivial, especially for computations involving large data structures. A cache miss can cause the pipelines to stall for several cycles, and the total amount of memory latency will be severe if the data is not available most of the time. Although memory devices used for main memory are becoming faster, the speed gap between such memory chips and high-end processors is widening. Accordingly, a significant amount of execution time in current high-end processor designs is spent waiting for resolution of cache misses and these memory access delays use an increasing proportion of processor execution time.
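The cost of cache misses can be made concrete with a small calculation. The Python sketch below assumes a simple single-level cache model in which every miss adds a fixed penalty; the function names and cycle counts are illustrative assumptions rather than figures from this specification.

```python
def hit_ratio(hits, requests):
    """Fraction of data requests that were satisfied by the cache."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return hits / requests


def average_access_cycles(hit_cycles, miss_penalty_cycles, ratio):
    """Average cycles per access when each miss adds a fixed penalty."""
    return hit_cycles + (1.0 - ratio) * miss_penalty_cycles


# Assumed figures: a 1-cycle hit and a 100-cycle miss penalty.
mostly_hits = average_access_cycles(1, 100, hit_ratio(3, 4))
mostly_misses = average_access_cycles(1, 100, hit_ratio(1, 4))
```

In this simplified model, lowering the hit ratio from 3/4 to 1/4 raises the average access time from 26 to 76 cycles, which illustrates why memory latency comes to dominate execution time when data is frequently unavailable in the cache.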
Yet another technique to improve the efficiency of hardware within the CPU is to divide a processing task into independently executable sequences of instructions called threads. This technique is related to breaking a larger task into smaller tasks for independent execution by different processors except here the threads are to be executed by the same processor. When a CPU then, for any of a number of reasons, cannot continue the processing or execution of one of these threads, the CPU switches to and executes another thread. The term “multithreading” as defined in the computer architecture community is not the same as the software use of the term which means one task subdivided into multiple related threads. In the architecture definition, the threads may be independent. Therefore “hardware multithreading” is often used to distinguish the two uses of the term. Within the context of the present invention, the term multithreading connotes hardware multithreading to tolerate memory latency.
Multithreading permits the processor's pipeline(s) to do useful work on different threads when a pipeline stall condition is detected for the current thread. Multithreading also permits processors implementing non-pipeline architectures to do useful work for a separate thread when a stall condition is detected for a current thread. There are two basic forms of multithreading. A traditional form is to keep N threads, or states, in the processor and interleave the threads on a cycle-by-cycle basis. This eliminates all pipeline dependencies because instructions in a single thread are separated. The other form of multithreading, and the one considered by the present invention, is to interleave the threads on some long-latency event.
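The two forms of multithreading can be contrasted by the order in which threads occupy the processor. The following Python sketch is a deliberately simplified model: it ignores thread switch latency, selects the next thread round-robin, and takes the cycles on which a long-latency event occurs as a given input. All names are hypothetical.

```python
def interleave_every_cycle(num_threads, num_cycles):
    """Traditional form: a different thread issues on every cycle."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def switch_on_long_latency(num_threads, stall_cycles, num_cycles):
    """Second form: stay on one thread until it hits a long-latency event.

    stall_cycles is the set of cycles on which the thread that is active
    at that moment encounters a long-latency event (assumed input).
    """
    schedule = []
    active = 0
    for cycle in range(num_cycles):
        schedule.append(active)
        if cycle in stall_cycles:
            # Switch away from the stalled thread (round-robin assumption).
            active = (active + 1) % num_threads
    return schedule


fine_grained = interleave_every_cycle(2, 6)
event_driven = switch_on_long_latency(2, {1, 4}, 6)
```

With two threads over six cycles, the first schedule alternates on every cycle, while the second keeps a thread running until one of the assumed stall cycles (1 and 4) is reached.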
Traditional forms of multithreading involve replicating the processor registers for each thread. For instance, for a processor implementing the architecture sold under the trade name PowerPC™ to perform multithreading, the processor must maintain N states to run N threads. Accordingly, the following are replicated N times: general purpose registers, floating point registers, condition registers, floating point status and control register, count register, link register, exception register, save/restore registers, and special purpose registers.
Additionally, the special buffers, such as a segment lookaside buffer, can be replicated or each entry can be tagged with the thread number and, if not, must be flushed on every thread switch. Also, some branch prediction mechanisms, e.g., the correlation register and the return stack, should also be replicated. Fortunately, there is no need to replicate some of the larger functions of the processor such as: level one instruction cache (L1 I-cache), level one data cache (L1 D-cache), instruction buffer, store queue, instruction dispatcher, functional or execution units, pipelines, translation lookaside buffer (TLB), and branch history table. When one thread encounters a delay, the processor rapidly switches to another thread. The execution of this thread overlaps with the memory delay on the first thread.
Existing multithreading techniques describe switching threads on a cache miss or a memory reference. A primary example of this technique may be reviewed in “Sparcle: An Evolutionary Design for Large-Scale Multiprocessors,” by Agarwal et al., IEEE Micro Volume 13, No. 3, pp. 48-60, June 1993. As applied in a RISC architecture, multiple register sets normally utilized to support function calls are modified to maintain multiple threads. Eight overlapping register windows are modified to become four non-overlapping register sets, wherein each register set is a reserve for trap and message handling. This system discloses a thread switch which occurs on each first level cache miss that results in a remote memory request. While this system represents an advance in the art, modern processor designs often utilize a multiple level cache or high speed memory which is attached to the processor. The processor system utilizes some well-known algorithm to decide what portion of its main memory store will be loaded within each level of cache and thus, each time a memory reference occurs which is not present within the first level of cache the processor must attempt to obtain that memory reference from a second or higher level of cache.
It should thus be apparent that a need exists for an improved data processing system which can reduce delays due to memory latency in a multilevel cache system utilized in conjunction with a multithread data processing system.
An object of the present invention is to provide an improved data processing system and method for multithreaded processing embodied in the hardware of the processor. This object is achieved by a multithreaded processor capable of switching execution between two threads of instructions, and thread switch logic embodied in hardware registers with optional software override of thread switch conditions. An added advantage of the thread switch logic is that processing of various threads of instructions allows optimization of the use of the processor among the threads.
Another object of the present invention is to improve multithreaded computer processing by allowing the processor to execute a second thread of instructions thereby increasing processor utilization which is otherwise idle because it is retrieving necessary data and/or instructions from various memory elements, such as caches, memories, external I/O, direct access storage devices for a first thread.
An additional object of the present invention is to provide a multithread data processing system and method which performs conditional thread switching wherein the conditions of thread switching can be different per thread or can be changed during processing by the use of a software thread control manager.
It is yet another object of the invention to allow processing of a second thread when a first thread has a latency event, such as a cache miss, which requires a large number of cycles to complete, during which time the second thread may experience a cache miss at the same cache level which can be completed in much less time.
It is yet another object of the invention to prevent thrashing wherein each thread is locked in a repetitive cycle of switching threads without any instructions executing. The invention provides a forward progress count register and method which allows up to a programmable maximum number of thread switches, called the forward progress threshold, after which the processor stops switching threads until one thread is able to execute. The forward progress count register monitors the number of thread switches that have occurred without an instruction having been executed, and when that number is equal to the threshold no further thread switching occurs until an instruction is executed. An added advantage of the forward progress count register is that the register and threshold can be customized for certain latency events, such as one threshold value for a very long latency event such as access to external computer networks; and another forward progress threshold for shorter latency events such as cache misses.
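The forward progress mechanism can be modeled in software as a counter compared against a threshold. The Python sketch below is only an illustration of that behavior; the class and method names are hypothetical, and a single counter is assumed for simplicity rather than taken from the hardware described here.

```python
class ForwardProgressCounter:
    """Counts thread switches made without any instruction executing."""

    def __init__(self, threshold):
        self.threshold = threshold  # programmable forward progress threshold
        self.count = 0

    def may_switch(self):
        # Switching is blocked once the count reaches the threshold.
        return self.count < self.threshold

    def record_switch(self):
        self.count += 1

    def record_instruction_executed(self):
        # Forward progress was made, so switching is permitted again.
        self.count = 0


counter = ForwardProgressCounter(threshold=2)
history = []
for _ in range(3):
    history.append(counter.may_switch())
    if counter.may_switch():
        counter.record_switch()
counter.record_instruction_executed()
history.append(counter.may_switch())
```

In this run, two switches are permitted, the third attempt is refused because no instruction has executed, and switching becomes available again once an instruction completes.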
It is yet another object to prevent computer processing on a thread from being inactive for an excessive period of time. The feature of the invention to achieve this object is to force a thread switch after waiting the number of cycles specified in a thread switch time-out register. An added advantage of this feature is that the computer processing system does not experience hangs resulting from shared resource contention. Fairness of allocating processor cycles between threads is accomplished and the maximum response latency to external interrupts and other events external to the processor is limited.
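The time-out behavior can be sketched as a cycle counter that forces a thread switch when it reaches the value held in the time-out register. The Python model below is illustrative only; the class name, the reset-on-switch behavior, and the three-cycle value are assumptions.

```python
class ThreadSwitchTimeout:
    """Forces a thread switch after a fixed number of cycles."""

    def __init__(self, timeout_cycles):
        self.timeout_cycles = timeout_cycles  # value of the time-out register
        self.cycles = 0

    def tick(self):
        """Advance one cycle; return True when a switch must be forced."""
        self.cycles += 1
        if self.cycles >= self.timeout_cycles:
            self.cycles = 0  # assumed: counting restarts after the switch
            return True
        return False


timer = ThreadSwitchTimeout(timeout_cycles=3)
forced = [timer.tick() for _ in range(6)]
```

With an assumed time-out of three cycles, a switch is forced on every third cycle, so no thread can hold the processor indefinitely in this model.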
It is a further object of the invention to provide for rapid thread switching conditions. This object is achieved by hardware registers which store the state of threads, the priority of threads, and thread switch conditions.
It is yet another object of the invention to provide flexibility to modify the results of the thread switch hardware registers. This object is achieved by altering the priority of one or more of the threads in the processor. Either a signal from an interrupt request or a software instruction can be used to modify bits in a state register indicating the priority of each thread. Then depending upon the priority of each thread, a thread switch may occur to allow a higher priority thread to have more processing cycles. Altering the priority has the advantage of allowing changes to the frequency of thread switching, an increase in execution cycles for a critical task, and a decrease in the number of processing cycles lost by the high priority thread because of thread switch latency.
These and other related objects are achieved by providing a method for computer processing comprising storing the states of all threads, whether the thread is an active thread executing in a multithreaded processor or a background thread waiting for execution, in corresponding hardware registers; executing at least one active thread in the multithreaded processor and changing the state of the active thread. Changing the state of the active thread can cause the multithreaded processor to switch execution to a background thread.
There are several methods to change the state of any or all the threads in the multiprocessor complex. The state of a thread will change when that thread experiences a latency event which stalls execution of that thread in the multithreaded processor. The state of a thread can also change when the priority of that or another thread is altered.
As a result of any or a combination of several events, the multithreaded processor can switch to another thread. For instance, the inventive method herein also comprises counting the number of multiprocessor cycles that the at least one active thread has been executing and when the number of execution cycles is equal to a time-out value, then switching execution to the at least one background thread. Another step of the inventive method which can result in the multithreaded processor switching threads is receiving an external interrupt signal indicating that data and/or instructions for any thread in the processor has been received from an external source; the external interrupt signal may or may not alter the priority of the thread to which the interrupt signal pertains.
The inventive method also comprises determining if changing the state of any of the threads in the multithreaded processor causes it to switch execution to the at least one background thread by, inter alia, checking if the change of state results from a latency event, determining if the latency event is a thread switch event, and determining if the thread switch event is enabled. The thread switch event is enabled when at least one bit in a thread switch control register corresponding to the thread switch event is enabled.
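The three checks just described (whether the change of state results from a latency event, whether that event is a thread switch event, and whether the event is enabled in the thread switch control register) can be sketched as a bit test. In the Python sketch below, the event names and bit positions are hypothetical and chosen only for illustration.

```python
# Hypothetical thread switch events and their enable-bit positions in the
# thread switch control register; not taken from this specification.
THREAD_SWITCH_EVENT_BITS = {
    "l1_data_cache_miss": 0,
    "l2_cache_miss": 1,
    "external_io": 2,
}


def should_switch(latency_event, control_register):
    """Return True only if all three checks pass."""
    if latency_event is None:
        return False  # the change of state did not result from a latency event
    bit = THREAD_SWITCH_EVENT_BITS.get(latency_event)
    if bit is None:
        return False  # the latency event is not a thread switch event
    return bool(control_register & (1 << bit))  # is the event enabled?


enabled_for_l2_only = 0b010
decision = should_switch("l2_cache_miss", enabled_for_l2_only)
```

Here `decision` is true because the assumed enable bit for the second-level cache miss is set; the same event with that bit cleared, or an event not listed as a thread switch event, would leave execution on the current thread.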
Even though a thread within the multithreaded processor may change state, the multithreaded processor may still not switch execution to another thread when the latency event is not a thread switch event, or when the thread switch event is not enabled in the thread switch control register, or when a change of priority is irrelevant. Forward progress count also precludes switching threads by counting a number of thread switches that has occurred away from the at least one active thread, comparing the number with a count threshold, and signaling when the number is equal to the count threshold and in response thereto not switching execution.
The invention is also a method of computer processing, comprising storing a first state of at least one active thread in at least one hardware register and storing a second state of at least one background thread in at least one hardware register, executing at least one active thread in a multithreaded processor. The method changes the first state of an active thread if any one of the following conditions occurs: (i) execution of an active thread stalls because of a latency event; or (ii) the priority of an active thread is altered to be equal to or lower than the priority of a background thread. The method then determines if changing the first state of the active thread causes the multithreaded processor to switch execution to the background thread by first determining if the latency event is a thread switch event, then determining if the thread switch event is enabled. The method envisions that the multithreaded processor can switch execution to the at least one background thread under one of the following conditions: (i) counting the number of processor cycles that the active thread has been executing and when the number of execution cycles is equal to a time-out value, then switching execution to the background thread; (ii) receiving an external interrupt signal and then switching execution to the background thread; (iii) at least one bit in a thread switch control register corresponding to the thread switch event is enabled; or (iv) changing priority of a background thread to a priority equal to or higher than the priority of the active thread.
The multithreaded processor may not switch execution to the background thread under one of the following conditions: (i) the latency event is not a thread switch event; (ii) the thread switch event is not enabled; or (iii) by counting a number of thread switches that has occurred away from the active thread, then comparing the number with a count threshold and signaling the thread switch control register when the number is equal to the count threshold.
The invention is also a thread state register comprising a plurality of bits to store a state of at least one active thread and a state of at least one background thread wherein some of the plurality of bits indicate a latency event, if a transition to each respective state results in switching execution to another of the threads, and priority of the threads.
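A thread state register of this kind can be pictured as a packed bit field. The Python sketch below assumes a purely hypothetical layout for one thread's portion of the register (three bits for a latency event code, one bit indicating whether the transition resulted in a thread switch, and two bits for priority); the actual widths and positions are not given in this passage.

```python
# Hypothetical field layout, for illustration only:
#   bits 0-2: latency event code
#   bit  3  : transition resulted in a thread switch
#   bits 4-5: thread priority
def pack_thread_state(latency_event, caused_switch, priority):
    if not 0 <= latency_event < 8:
        raise ValueError("latency_event must fit in 3 bits")
    if not 0 <= priority < 4:
        raise ValueError("priority must fit in 2 bits")
    return latency_event | (int(caused_switch) << 3) | (priority << 4)


def unpack_thread_state(value):
    latency_event = value & 0b111
    caused_switch = bool((value >> 3) & 1)
    priority = (value >> 4) & 0b11
    return latency_event, caused_switch, priority


packed = pack_thread_state(latency_event=5, caused_switch=True, priority=2)
fields = unpack_thread_state(packed)
```

Under this assumed layout, the packed value round-trips back to the same three fields, which is the property a hardware register of this kind would need so that thread switch logic can read each field independently.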
The invention is also a data processing system, comprising a central processing unit having a multithreaded processor capable of executing at least one active thread and storing the state of at least one background thread, a plurality of execution units, a plurality of registers, a plurality of cache memories, a main memory, and an instruction unit; wherein the execution units, the registers, the memories, and the instruction unit are functionally interconnected; said central processing unit further comprising a thread switch logic unit and a storage control unit also functionally connected to said multithreaded processor. The data processing system also comprises a plurality of external connections comprising a bus interface, a bus, at least one input/output processor connected to at least one of the following: a tape drive, a data storage device, a computer network, a fiber optics communication, a workstation, a peripheral device, an information network; any of which are capable of transmitting data and instructions to the central processing unit over the bus. In the data processing system of the invention, when the at least one active thread stalls execution, the event and reason thereof are communicated to the storage control unit, the storage control unit sends a corresponding signal to the thread switch logic unit, and the thread switch logic unit changes the state of the at least one active thread and determines if the multithreaded processor will switch threads and execute one of said plurality of background threads.
The invention is also a computer processing system comprising a multithreaded processor unit; a thread switch logic unit functionally connected to the multithreaded processor; and a storage control unit functionally connected to the multithreaded processor and the thread switch logic unit. The storage control unit receives data, instructions, and input for the multithreaded processor and signals the thread switch logic unit and the multithreaded processor according to the data, instructions, and input. In response, the thread switch logic outputs signals to the multithreaded processor. The storage control unit further comprises a transition cache, at least a first multiplexer connected to at least one instruction unit to supply instructions for execution to the multithreaded processor unit, and at least a second multiplexer to supply data to the at least one execution unit. The multithreaded processor unit comprises at least one data cache, at least one memory, at least one instruction unit, and at least one execution unit. The thread switch logic further comprises a thread state register, and a thread switch control register. The thread switch logic may further comprise a forward progress count register, a thread switch time-out register, and a thread switch manager.
The computer processing system of the invention may also comprise a multithreaded processor complex having at least one multithreaded processor capable of executing at least one active thread and storing at least one background thread of a plurality of threads of instructions, one data cache to supply data to the multithreaded processor, at least one instruction unit having an instruction cache, at least one memory to supply data and instructions to the caches and the multithreaded processor, and at least one execution unit wherein the data and instructions are executed. The computer processing system further comprises a storage control unit functionally connected to the multithreaded processor, the storage control unit comprising a transition cache, at least a first multiplexer to transmit instructions from the transition or instruction cache or memory to the instruction unit, and at least a second multiplexer to transmit data from the data or transition cache or memory to the at least one execution unit, at least one sequencer unit to provide control signals to at least the memory, the caches, the multiplexers, and the execution units. The computer processing system also comprises a thread switch logic unit functionally connected to the multithreaded processor and the storage control unit, the thread switch logic unit also receiving and transmitting control signals from and to the sequencer unit, the thread switch logic unit comprising a thread state register to store states of the at least one active and background thread, and a thread switch control register to store and enable a plurality of thread switch events. In this arrangement of the computer processing system, the thread switch logic unit receives signals from the storage control unit characterizing the plurality of threads in the multithreaded processor and in response thereto, determines whether to switch execution from the at least one active thread in the multithreaded processor.
Another embodiment of the inventive computer processor system comprises means to process at least one active thread of instructions, means to store a state of the at least one active thread, means to store a state of at least one background thread of instructions, means to change the states of the at least one active thread and the at least one background thread, and means, responsive to the means to change the states, to switch threads so that the processing means processes the at least one background thread. The means to change the states of the at least one active thread and the at least one background thread comprises an external hardware interrupt signal or a thread switch manager. The means to change the states of the at least one active thread and the at least one background thread comprises means to signal one of a plurality of latency events experienced by the processing means which stall the processing means from continued processing of the at least one active thread. The means to switch threads comprises means to enable one of a plurality of latency events to be a thread switch event, means to change priority of any of the threads, or means to time-out the means to process. In addition, the invention provides means to disregard the means to switch threads.
Simply, the invention is also a computer processor, comprising a multithreaded processor capable of executing at least one of a plurality of threads of instructions, a first plurality of hardware registers to store the states of each of the plurality of threads of instructions, and a second plurality of hardware registers to store a plurality of first events upon which the multithreaded processor will switch execution of threads, wherein the computer processing system will switch threads if a second event which changes the states of any of the plurality of threads of instructions in the first plurality of hardware registers is enabled in the second plurality of hardware registers.
Other objects, features and characteristics of the present invention; methods, operation, and functions of the related elements of the structure; combination of parts; and economies of manufacture will become apparent from the following detailed description of the preferred embodiments and accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures.