In digital computing, the growth of computing power reflects steady advances in many areas. Such advances are made, for example, in device density for processors, in interconnect technology (which influences speed of operation), in the ability to tolerate and use higher clock speeds, and much more. Another area that influences overall computing power is parallel processing, which includes more than the parallel operation of multiple, separate processors.
The concept of parallel processing includes the ability to share tasks among multiple, separate processors, but also includes schemes for concurrent execution of multiple programs on a single processor. Such schemes are generally termed multithreading.
The concept of multithreading is explained as follows: As processor operating frequency increases, it becomes increasingly difficult to hide latencies inherent in the operation of a computer system. A high-end processor which misses in its data cache on 1% of the instructions in a given application could be stalled roughly 50% of the time if it has a 50-cycle latency to off-chip RAM. If instructions directed to a different application could be executed when the processor is stalled during a cache miss, the performance of the processor could be improved and some or all of the memory latency effectively hidden. For example, FIG. 1A shows a single instruction stream 101 that stalls upon experiencing a cache miss. The supporting machine can only execute a single thread or task at a time. In contrast, FIG. 1B shows instruction stream 102 that may be executed while stream 101 is stalled. In this case, the supporting machine can support two threads concurrently and thereby more efficiently utilize its resources.
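The stall estimate quoted above can be checked with a short calculation. The following is a minimal sketch under an assumed model that the text does not state: every miss stalls the processor for the full latency, and the processor otherwise needs base_cpi cycles per instruction. The function name and the base_cpi parameter are illustrative assumptions. Under this model, a 1% miss rate with a 50-cycle latency gives a stall fraction of about one third for a pipeline issuing one instruction per cycle, and one half for a pipeline sustaining two instructions per cycle, so the "roughly 50%" figure is consistent with a superscalar design.

```python
def stall_fraction(miss_rate, miss_latency_cycles, base_cpi):
    """Fraction of total cycles spent stalled on cache misses.

    miss_rate: misses per instruction (0.01 means 1% of instructions miss).
    miss_latency_cycles: stall cycles incurred by each miss.
    base_cpi: cycles per instruction when no miss occurs (assumed parameter).
    """
    stall_cycles_per_instr = miss_rate * miss_latency_cycles
    return stall_cycles_per_instr / (base_cpi + stall_cycles_per_instr)


if __name__ == "__main__":
    # Scalar pipeline: one instruction per cycle when not stalled.
    print(round(stall_fraction(0.01, 50, 1.0), 3))  # prints 0.333
    # Two-issue pipeline sustaining two instructions per cycle.
    print(round(stall_fraction(0.01, 50, 0.5), 3))  # prints 0.5
```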
More generally, individual computer instructions have specific semantics, such that different classes of instructions require different resources to perform the desired operation. Integer loads do not exploit the logic or registers of a floating-point unit, any more than register shifts require the resources of a load/store unit. No single instruction consumes all of a processor's resources, and the proportion of the total processor resources that is used by the average instruction diminishes as one adds more pipeline stages and parallel functional units to high-performance designs.
Multithreading arises in large measure from the notion that, if a single sequential program is fundamentally unable to make fully efficient use of a processor's resources, the processor should be able to share some of those resources among multiple concurrent threads of program execution. The result does not necessarily make any particular program execute more quickly (indeed, some multithreading schemes actually degrade the performance of a single thread of program execution), but it allows a collection of concurrent instruction streams to run in less time and/or on a smaller number of processors. This concept is illustrated in FIGS. 2A and 2B, which show single-threaded processor 210 and dual-threaded processor 250, respectively. Processor 210 supports single thread 212, which is shown utilizing load/store unit 214. If a miss occurs while accessing cache 216, processor 210 will stall (in accordance with FIG. 1A) until the missing data is retrieved. During this stall, multiply/divide unit 218 remains idle and underutilized. Processor 250, however, supports two threads, namely threads 212 and 262. If thread 212 stalls, processor 250 can continue to execute thread 262, which can use multiply/divide unit 218, thereby better utilizing its resources (in accordance with FIG. 1B).
Multithreading on a single processor can provide benefits beyond improved multitasking throughput, however. Binding program threads to critical events can reduce event response time, and thread-level parallelism can, in principle, be exploited within a single application program.
Several varieties of multithreading have been proposed. Among them is interleaved multithreading, a time-division multiplexed (TDM) scheme that switches from one thread to another on each instruction issued. This scheme imposes some degree of “fairness” in scheduling, but implementations that statically allocate issue slots to threads generally limit the performance of a single program thread. Dynamic interleaving ameliorates this problem, but is more complex to implement.
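The difference between static and dynamic interleaving can be illustrated with a small simulation. This is a minimal sketch under assumed simplifications: one issue slot per cycle, threads represented as lists of hypothetical instruction labels, and a thread counted as unable to issue only when it has run out of instructions. The function names are illustrative and do not correspond to any particular processor.

```python
def static_interleave(threads, cycles):
    """Issue one instruction per cycle; cycle c is owned by thread c % n.

    threads: list of lists of instruction labels (hypothetical workload).
    Returns the issue trace; None marks an issue slot wasted because its
    owning thread had nothing left to issue.
    """
    pending = [list(t) for t in threads]
    trace = []
    for cycle in range(cycles):
        owner = pending[cycle % len(pending)]
        trace.append(owner.pop(0) if owner else None)
    return trace


def dynamic_interleave(threads, cycles):
    """Same rotation, but an idle owner's slot passes to the next ready thread."""
    pending = [list(t) for t in threads]
    n = len(pending)
    trace = []
    for cycle in range(cycles):
        issued = None
        for offset in range(n):
            candidate = pending[(cycle + offset) % n]
            if candidate:
                issued = candidate.pop(0)
                break
        trace.append(issued)
    return trace


if __name__ == "__main__":
    workload = [["a0", "a1", "a2", "a3"], ["b0"]]
    print(static_interleave(workload, 6))   # ['a0', 'b0', 'a1', None, 'a2', None]
    print(dynamic_interleave(workload, 6))  # ['a0', 'b0', 'a1', 'a2', 'a3', None]
```

With static allocation, the longer thread is held to every other slot even after the shorter thread has finished, which is the single-thread performance limit noted above. The dynamic variant recovers those slots at the cost of a readiness check on every cycle.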
Another multithreading scheme is blocked multithreading, which issues consecutive instructions from a single program thread until some designated blocking event, such as a cache miss or a replay trap, causes that thread to be suspended and another thread to be activated. Because blocked multithreading changes threads less frequently, its implementation can be simplified. On the other hand, blocking is less “fair” in scheduling threads. A single thread can monopolize the processor for a long time if it is lucky enough to find all of its data in the cache. Hybrid scheduling schemes that combine elements of blocked and interleaved multithreading have also been built and studied.
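The switching rule of blocked multithreading can be sketched as follows. This is a simplified illustration, not a model of any specific processor: each instruction is a hypothetical (label, blocks) pair, where blocks marks an instruction that triggers a blocking event, and a suspended thread is assumed to be ready again by the time the rotation returns to it.

```python
def blocked_schedule(threads):
    """Issue from one thread until a blocking event, then switch threads.

    threads: list of lists of (label, blocks) pairs; blocks=True marks an
    instruction that triggers a blocking event such as a cache miss.
    Simplification: a suspended thread is assumed ready again by the time
    the rotation returns to it.
    """
    pending = [list(t) for t in threads]
    trace = []
    current = 0
    while any(pending):
        if not pending[current]:
            # Current thread has finished; move on to the next one.
            current = (current + 1) % len(pending)
            continue
        label, blocks = pending[current].pop(0)
        trace.append(label)
        if blocks:
            # Blocking event: suspend this thread and activate the next.
            current = (current + 1) % len(pending)
    return trace


if __name__ == "__main__":
    thread_a = [("a0", False), ("a1", True), ("a2", False)]
    thread_b = [("b0", False), ("b1", False), ("b2", False)]
    print(blocked_schedule([thread_a, thread_b]))
    # ['a0', 'a1', 'b0', 'b1', 'b2', 'a2']
```

In this trace the second thread never blocks, so it holds the processor until it finishes, which is the unfairness described above.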
Still another form of multithreading is simultaneous multithreading, which is a scheme implemented on superscalar processors. In simultaneous multithreading, instructions from different threads can be issued concurrently. Consider, for example, a superscalar reduced instruction set computer (RISC) issuing up to two instructions per cycle, and a simultaneously multithreaded superscalar pipeline issuing up to two instructions per cycle from either of two threads. In the multithreaded pipeline, those cycles in which dependencies or stalls would prevent full utilization of the processor by a single program thread are filled by issuing instructions from another thread.
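The two-issue example above can be made concrete by counting filled issue slots. The sketch below assumes a hypothetical per-cycle count of how many instructions each thread has ready, an issue width of two, and a fixed priority in which earlier threads are served first. None of these details come from the text; they are chosen only to keep the illustration small.

```python
ISSUE_WIDTH = 2  # assumed issue width, matching the two-issue example


def issue_slots_used(ready_per_cycle, width=ISSUE_WIDTH):
    """Count issue slots filled over a run of cycles.

    ready_per_cycle: one tuple per cycle, giving for each thread the number
    of instructions it could issue that cycle (hypothetical data).
    Threads listed earlier in each tuple get priority for the slots.
    """
    used = 0
    for ready in ready_per_cycle:
        free = width
        for count in ready:
            take = min(count, free)
            used += take
            free -= take
    return used


if __name__ == "__main__":
    thread_a = [2, 1, 0, 1]  # instructions thread A can issue in each cycle
    thread_b = [1, 2, 2, 0]  # instructions thread B can issue in each cycle

    single = issue_slots_used([(a,) for a in thread_a])
    smt = issue_slots_used(list(zip(thread_a, thread_b)))
    print(single, smt)  # prints 4 7
```

Of the eight slots available over four cycles, thread A alone fills four, while the two threads together fill seven.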
Simultaneous multithreading is thus a very powerful technique for recovering lost efficiency in superscalar pipelines. It is also arguably the most complex multithreading system to implement, because more than one thread may be active on a given cycle, which complicates the implementation of memory access protection, among other things. It is perhaps worth noting that the more perfectly pipelined the operation of a central processing unit (CPU) is on a given workload, the smaller the potential efficiency gain from a multithreading implementation.
Multithreading and multiprocessing are closely related. Indeed, one could argue that the difference is only one of degree: whereas multiprocessors share only memory and/or connectivity, multithreaded processors share memory and/or connectivity, but also share instruction fetch and issue logic, and potentially other processor resources. In a single multithreaded processor, the various threads compete for issue slots and other resources, which limits parallelism. Some multithreaded programming and architectural models assume that new threads are assigned to distinct processors, to execute fully in parallel.
There are several distinct problems with the state-of-the-art multithreading solutions available at the time of submission of the present application. One of these is the treatment of real-time threads. Typically, real-time multimedia algorithms are run on dedicated processors or digital signal processors (DSPs) to ensure quality-of-service (QoS) and response time, and are not included in the mix of threads to be shared in a multithreading scheme, because one cannot easily guarantee that the real-time software will be executed in a timely manner.
What is clearly needed in this respect is a scheme and mechanism allowing one or more real-time threads or virtual processors to be guaranteed a specified proportion of instruction issue slots in a multithreaded processor, with a specified inter-instruction interval, such that the compute bandwidth and response time are well defined. If such a mechanism were available, threads with strict QoS requirements could be included in the multithreading mix. Moreover, real-time threads (such as DSP-related threads) in such a system might be exempted from taking interrupts, removing an important source of execution time variability. This sort of technology could well be critical to acceptance of DSP-enhanced RISC processors and cores as an alternative to the use of separate RISC and DSP cores in consumer multimedia applications.
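To illustrate what a guaranteed proportion of issue slots with a specified inter-instruction interval could look like, the following is a hypothetical sketch only. It is not the mechanism of the present application. It assumes a single real-time thread labeled "RT" that owns every rt_interval-th issue slot, with the remaining slots rotated among best-effort threads. All names and parameters are assumptions made for illustration.

```python
from itertools import cycle


def slot_schedule(total_slots, rt_interval, best_effort):
    """Reserve every rt_interval-th issue slot for a real-time thread.

    total_slots: number of issue slots to lay out.
    rt_interval: spacing, in slots, between consecutive "RT" slots.
    best_effort: names of the threads sharing the remaining slots.
    Returns a list of thread names, one per slot (None if a non-reserved
    slot has no best-effort thread to give it to).
    """
    rotation = cycle(best_effort)
    schedule = []
    for slot in range(total_slots):
        if slot % rt_interval == 0:
            schedule.append("RT")
        else:
            schedule.append(next(rotation, None))
    return schedule


if __name__ == "__main__":
    print(slot_schedule(8, 4, ["A", "B"]))
    # ['RT', 'A', 'B', 'A', 'RT', 'B', 'A', 'B']
```

In this layout the real-time thread receives one quarter of the issue slots, and its slots are always exactly four apart, so both its compute bandwidth and its worst-case wait for an issue slot are fixed regardless of what the other threads do.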
Another distinct problem with state-of-the-art multithreading schemes at the time of filing the present application is in the creation and destruction of active threads in the processor. To support relatively fine-grained multithreading, it is desirable for parallel threads of program execution to be created and destroyed with the minimum possible overhead, and without requiring intervention of an operating system, at least in usual cases. What is clearly needed in this respect is a set of instructions such as FORK (thread create) and JOIN (thread terminate). A separate problem exists for multithreaded processors where the scheduling policy makes a thread run until it is blocked by some resource, and where a thread that has no resource blockage nevertheless needs to surrender the processor to some other thread. What is clearly needed in this respect is a distinct PAUSE or YIELD instruction. Furthermore, the opcode space of a microprocessor instruction set is a valuable architectural resource, which may be limited, particularly in RISC instruction sets; consequently, what is needed is a means for combining two or more of the FORK, JOIN, and YIELD-type instructions into a single instruction decode to conserve opcode space.