According to Wikipedia, published Aug. 1, 2011 on the World Wide Web, “Multithreading Computers” have hardware support to efficiently execute multiple threads. These are distinguished from multiprocessing systems (such as multi-core systems) in that the threads have to share the resources of a single core: the computing units, the CPU caches and the translation look-aside buffer (TLB). Where multiprocessing systems include multiple complete processing units, multithreading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism. As the two techniques are complementary, they are sometimes combined in systems with multiple multithreading CPUs and in CPUs with multiple multithreading cores.
The multithreading paradigm has become more popular as efforts to further exploit instruction-level parallelism have stalled since the late 1990s. This allowed the concept of throughput computing to re-emerge from the more specialized field of transaction processing:
Even though it is very difficult to further speed up a single thread or single program, most computer systems are actually multi-tasking among multiple threads or programs.
Techniques that would allow speed up of the overall system throughput of all tasks would be a meaningful performance gain.
The two major techniques for throughput computing are multiprocessing and multithreading.
Some advantages include:
If a thread gets a lot of cache misses, the other thread(s) can continue, taking advantage of the unused computing resources, which thus can lead to faster overall execution, as these resources would have been idle if only a single thread was executed.
If a thread cannot use all the computing resources of the CPU (because instructions depend on each other's results), running another thread keeps those resources from sitting idle.
If several threads work on the same set of data, they can actually share their cache, leading to better cache usage or synchronization on its values.
Some criticisms of multithreading include:
Multiple threads can interfere with each other when sharing hardware resources such as caches or translation look-aside buffers (TLBs).
Execution times of a single thread are not improved but can be degraded, even when only one thread is executing. This is due to slower frequencies and/or additional pipeline stages that are necessary to accommodate thread-switching hardware.
Hardware support for multithreading is more visible to software, thus requiring more changes to both application programs and operating systems than Multiprocessing.
Types of Multithreading:
Block Multi-Threading
Concept
The simplest type of multi-threading occurs when one thread runs until it is blocked by an event that normally would create a long latency stall. Such a stall might be a cache-miss that has to access off-chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread that was ready to run. Only when the data for the previous thread had arrived, would the previous thread be placed back on the list of ready-to-run threads.
For example:
1. Cycle i: instruction j from thread A is issued
2. Cycle i+1: instruction j+1 from thread A is issued
3. Cycle i+2: instruction j+2 from thread A is issued, load instruction which misses in all caches
4. Cycle i+3: thread scheduler invoked, switches to thread B
5. Cycle i+4: instruction k from thread B is issued
6. Cycle i+5: instruction k+1 from thread B is issued
Conceptually, it is similar to cooperative multi-tasking used in real-time operating systems, in which tasks voluntarily give up execution time when they need to wait upon some type of event.
This type of multithreading is known as Block or Cooperative or Coarse-grained multithreading.
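By way of illustration only, the switch-on-miss behavior described above can be sketched as a small simulation. The function name block_multithreading_trace, the 100-cycle miss latency, and the one-cycle scheduler step are assumptions chosen to mirror the six-cycle example; they do not describe any particular processor.

```python
from collections import deque

def block_multithreading_trace(miss_points, miss_latency=100, max_cycles=6):
    """Simulate block (switch-on-miss) multithreading.

    miss_points maps a thread name to the set of instruction indices
    (counted from 0) whose loads miss in all caches. Hypothetical model.
    """
    ready = deque(miss_points)                  # ready-to-run list
    wake_at = {}                                # blocked thread -> arrival cycle
    next_index = dict.fromkeys(miss_points, 0)  # per-thread instruction counter
    trace = []
    current = ready.popleft()
    for cycle in range(max_cycles):
        # Threads whose data has arrived rejoin the ready list.
        for name in [n for n, t in wake_at.items() if t <= cycle]:
            del wake_at[name]
            ready.append(name)
        if current is None:
            # Scheduler cycle: switch to the next ready thread, or stay idle.
            if ready:
                current = ready.popleft()
                trace.append((cycle, f"switch to {current}"))
            else:
                trace.append((cycle, "idle"))
            continue
        index = next_index[current]
        next_index[current] += 1
        trace.append((cycle, f"{current}[{index}]"))
        if index in miss_points[current]:
            # Long-latency stall: block this thread until its data returns.
            wake_at[current] = cycle + miss_latency
            current = None
    return trace

# Thread A misses on its third instruction; thread B never misses.
trace = block_multithreading_trace({"A": {2}, "B": set()})
```

In this sketch thread A issues for three cycles, the fourth cycle is spent switching, and thread B issues from the fifth cycle onward, which matches the cycle listing above.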
Hardware Cost
The goal of multi-threading hardware support is to allow quick switching between a blocked thread and another thread ready to run. To achieve this goal, the hardware cost is to replicate the program visible registers as well as some processor control registers (such as the program counter). Switching from one thread to another thread means the hardware switches from using one register set to another.
Such additional hardware has these benefits:
The thread switch can be done in one CPU cycle.
It appears to each thread that it is executing alone and not sharing any hardware resources with any other threads. This minimizes the amount of software changes needed within the application as well as the operating system to support multithreading.
In order to switch efficiently between active threads, each active thread needs to have its own register set. For example, to quickly switch between two threads, the register hardware needs to be instantiated twice.
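A minimal software model of the replicated register sets may help make the cost concrete. The class BankedRegisterFile, its method names, and the choice of 32 registers per bank are hypothetical; the point illustrated is only that a thread switch changes a bank selector rather than saving and restoring register contents.

```python
class BankedRegisterFile:
    """Toy model: one register bank (plus program counter) per hardware thread."""

    def __init__(self, num_threads, num_registers=32):
        self.banks = [
            {"pc": 0, "regs": [0] * num_registers} for _ in range(num_threads)
        ]
        self.active = 0  # index of the bank the pipeline currently uses

    def switch_to(self, thread_id):
        # Only the selector changes; no register is copied, saved, or restored.
        self.active = thread_id

    def read(self, reg):
        return self.banks[self.active]["regs"][reg]

    def write(self, reg, value):
        self.banks[self.active]["regs"][reg] = value


rf = BankedRegisterFile(num_threads=2)
rf.write(3, 111)      # thread 0 writes r3
rf.switch_to(1)
rf.write(3, 222)      # thread 1 writes its own r3
rf.switch_to(0)
value_seen_by_thread_0 = rf.read(3)   # still 111
```

Each thread sees its own register contents after every switch, as if it were executing alone, at the price of instantiating the register storage once per thread.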
Examples
Many families of microcontrollers and embedded processors have multiple register banks to allow quick context switching for interrupts. Such schemes can be considered a type of block multithreading among the user program thread and the interrupt threads.
Interleaved Multi-Threading
1. Cycle i+1: an instruction from thread B is issued
2. Cycle i+2: an instruction from thread C is issued
The purpose of this type of multithreading is to remove all data dependency stalls from the execution pipeline. Since one thread is relatively independent from other threads, there's less chance of one instruction in one pipe stage needing an output from an older instruction in the pipeline.
Conceptually, it is similar to pre-emptive multi-tasking used in operating systems. One can make the analogy that the time-slice given to each active thread is one CPU cycle.
This type of multithreading was first called Barrel processing, in which the staves of a barrel represent the pipeline stages and their executing threads. Interleaved or Pre-emptive or Fine-grained or time-sliced multithreading are more modern terminology.
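The cycle-by-cycle rotation can be sketched as follows. The helper names barrel_issue_order and has_same_thread_overlap are hypothetical, and the model assumes a strict round-robin rotation with every thread always ready, which is a simplification.

```python
from itertools import cycle, islice

def barrel_issue_order(thread_names, num_cycles):
    """Thread chosen in each cycle under strict round-robin interleaving."""
    return list(islice(cycle(thread_names), num_cycles))

def has_same_thread_overlap(order, pipeline_depth):
    """True if any pipeline_depth consecutive issue slots hold two
    instructions from the same thread (i.e. both are in flight together)."""
    for start in range(len(order) - pipeline_depth + 1):
        window = order[start:start + pipeline_depth]
        if len(set(window)) < len(window):
            return True
    return False

order = barrel_issue_order(["A", "B", "C"], 8)
# order == ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B']
```

Under these assumptions, three threads rotating through a three-stage pipeline never have two instructions of the same thread in flight together, whereas a four-stage pipeline would; this is the sense in which interleaving reduces the chance of one pipe stage needing the output of an older instruction.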
Hardware Costs
In addition to the hardware costs discussed in the Block type of multithreading, interleaved multithreading has an additional cost of each pipeline stage tracking the thread ID of the instruction it is processing. Also, since there are more threads being executed concurrently in the pipeline, shared resources such as caches and TLBs need to be larger to avoid thrashing between the different threads.
Simultaneous Multi-Threading
Concept
The most advanced type of multi-threading applies to superscalar processors. A normal superscalar processor issues multiple instructions from a single thread every CPU cycle. In Simultaneous Multi-threading (SMT), the superscalar processor can issue instructions from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction level parallelism, this type of multithreading tries to exploit parallelism available across multiple threads to decrease the waste associated with unused issue slots.
For example:
1. Cycle i: instructions j and j+1 from thread A; instruction k from thread B all simultaneously issued
2. Cycle i+1: instruction j+2 from thread A; instruction k+1 from thread B; instruction m from thread C all simultaneously issued
3. Cycle i+2: instruction j+3 from thread A; instructions m+1 and m+2 from thread C all simultaneously issued.
To distinguish the other types of multithreading from SMT, the term Temporal multithreading is used to denote when instructions from only one thread can be issued at a time.
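The filling of issue slots from several threads in one cycle can be illustrated with a short sketch. The function smt_issue and the policy of serving threads in a fixed order are assumptions made for the example; actual SMT issue logic is considerably more involved.

```python
def smt_issue(ready_counts, issue_width):
    """Fill the issue slots of one cycle from several threads.

    ready_counts maps a thread name to the number of mutually independent
    instructions that thread could issue this cycle (its available ILP).
    Returns (issued, unused_slots). Threads are served in dict order,
    which is an assumed policy, not one taken from the text.
    """
    issued = {}
    slots_left = issue_width
    for name, ready in ready_counts.items():
        take = min(ready, slots_left)
        if take > 0:
            issued[name] = take
        slots_left -= take
    return issued, slots_left

# The three cycles of the example above, on a 3-wide machine.
cycle_i = smt_issue({"A": 2, "B": 1, "C": 0}, issue_width=3)
cycle_i1 = smt_issue({"A": 1, "B": 1, "C": 1}, issue_width=3)
cycle_i2 = smt_issue({"A": 1, "B": 0, "C": 2}, issue_width=3)

# A single thread with limited ILP leaves slots unused.
single = smt_issue({"A": 1}, issue_width=3)   # ({'A': 1}, 2)
```

The last call shows the waste that SMT targets: with only thread A available, two of the three issue slots go unused in that cycle.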
Hardware Costs
In addition to the hardware costs discussed for interleaved multithreading, SMT has the additional cost of each pipeline stage tracking the Thread ID of each instruction being processed. Again, shared resources such as caches and TLBs have to be sized for the large number of active threads.
According to U.S. Pat. No. 7,827,388 “Apparatus for adjusting instruction thread priority in a multi-thread processor” issued Nov. 2, 2010, assigned to IBM and incorporated by reference herein, a number of techniques are used to improve the speed at which data processors execute software programs. These techniques include increasing the processor clock speed, using cache memory, and using predictive branching. Increasing the processor clock speed allows a processor to perform relatively more operations in any given period of time. Cache memory is positioned in close proximity to the processor and operates at higher speeds than main memory, thus reducing the time needed for a processor to access data and instructions. Predictive branching allows a processor to execute certain instructions based on a prediction about the results of an earlier instruction, thus obviating the need to wait for the actual results and thereby improving processing speed.
Some processors also employ pipelined instruction execution to enhance system performance. In pipelined instruction execution, processing tasks are broken down into a number of pipeline steps or stages. Pipelining may increase processing speed by allowing subsequent instructions to begin processing before previously issued instructions have finished a particular process. The processor does not need to wait for one instruction to be fully processed before beginning to process the next instruction in the sequence.
Processors that employ pipelined processing may include a number of different pipeline stages which are devoted to different activities in the processor. For example, a processor may process sequential instructions in a fetch stage, decode/dispatch stage, issue stage, execution stage, finish stage, and completion stage. Each of these individual stages may employ its own set of pipeline stages to accomplish the desired processing tasks.
Multi-thread instruction processing is an additional technique that may be used in conjunction with pipelining to increase processing speed. Multi-thread instruction processing involves dividing a set of program instructions into two or more distinct groups or threads of instructions. This multi-threading technique allows instructions from one thread to be processed through a pipeline while another thread may be unable to be processed for some reason. This avoids the situation encountered in single-threaded instruction processing in which all instructions are held up while a particular instruction cannot be executed, such as, for example, in a cache miss situation where data required to execute a particular instruction is not immediately available. Data processors capable of processing multiple instruction threads are often referred to as simultaneous multithreading (SMT) processors.
It should be noted at this point that there is a distinction between the way the software community uses the term “multithreading” and the way the term “multithreading” is used in the computer architecture community. The software community uses the term “multithreading” to refer to a single task subdivided into multiple, related threads. In computer architecture, the term “multithreading” refers to threads that may be independent of each other. The term “multithreading” is used in this document in the same sense employed by the computer architecture community.
To facilitate multithreading, the instructions from the different threads are interleaved in some fashion at some point in the overall processor pipeline. There are generally two different techniques for interleaving instructions for processing in a SMT processor. One technique involves interleaving the threads based on some long latency event, such as a cache miss that produces a delay in processing one thread. In this technique all of the processor resources are devoted to a single thread until processing of that thread is delayed by some long latency event. Upon the occurrence of the long latency event, the processor quickly switches to another thread and advances that thread until some long latency event occurs for that thread or until the circumstance that stalled the other thread is resolved.
The other general technique for interleaving instructions from multiple instruction threads in a SMT processor involves interleaving instructions on a cycle-by-cycle basis according to some interleaving rule (also sometimes referred to herein as an interleave rule). A simple cycle-by-cycle interleaving technique may simply interleave instructions from the different threads on a one-to-one basis. For example, a two-thread SMT processor may take an instruction from a first thread in a first clock cycle, an instruction from a second thread in a second clock cycle, another instruction from the first thread in a third clock cycle and so forth, back and forth between the two instruction threads. A more complex cycle-by-cycle interleaving technique may involve using software instructions to assign a priority to each instruction thread and then interleaving instructions from the different threads to enforce some rule based upon the relative thread priorities. For example, if one thread in a two-thread SMT processor is assigned a higher priority than the other thread, a simple interleaving rule may require that twice as many instructions from the higher priority thread be included in the interleaved stream as compared to instructions from the lower priority thread.
A more complex cycle-by-cycle interleaving rule in current use assigns each thread a priority from “1” to “7” and places an instruction from the lower priority thread into the interleaved stream of instructions based on the function 1/2^(|X−Y|+1), where X=the software assigned priority of a first thread, and Y=the software assigned priority of a second thread. In the case where two threads have equal priority, for example, X=3 and Y=3, the function produces a ratio of ½, and an instruction from each of the two threads will be included in the interleaved instruction stream once out of every two clock cycles. If the thread priorities differ by 2, for example, X=2 and Y=4, then the function produces a ratio of ⅛, and an instruction from the lower priority thread will be included in the interleaved instruction stream once out of every eight clock cycles.
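The rule can be checked numerically with a short sketch. It reads the function as one over two raised to the power |X−Y|+1, which is the reading that reproduces the worked values in the text (½ for equal priorities, ⅛ for a difference of 2). The function names, and the choice to place the lower priority thread in the last cycle of each period, are assumptions made for illustration.

```python
from fractions import Fraction

def low_priority_share(x, y):
    """Fraction of cycles given to the lower priority thread: 1/2^(|X-Y|+1)."""
    if not (1 <= x <= 7 and 1 <= y <= 7):
        raise ValueError("priorities are assumed to lie in 1..7")
    return Fraction(1, 2 ** (abs(x - y) + 1))

def interleave_pattern(x, y, cycles):
    """Thread ('X' or 'Y') issuing in each cycle.

    Assumed placement: the lower priority thread takes the last cycle of
    every period of 2^(|X-Y|+1) cycles; on a tie, 'X' plays that role.
    """
    period = 2 ** (abs(x - y) + 1)
    low, high = ("X", "Y") if x <= y else ("Y", "X")
    return [low if c % period == period - 1 else high for c in range(cycles)]

equal = low_priority_share(3, 3)        # Fraction(1, 2)
differ_by_two = low_priority_share(2, 4)  # Fraction(1, 8)
pattern = interleave_pattern(2, 4, 8)
# pattern == ['Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'X']
```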
Using a priority rule to choose how often to include instructions from particular threads is generally intended to ensure that processor resources are allotted based on the software assigned priority of each thread. There are, however, situations in which relying on purely software assigned thread priorities may not result in an optimum allotment of processor resources. In particular, software assigned thread priorities cannot take into account processor events, such as a cache miss, for example, that may affect the ability of a particular thread of instructions to advance through a processor pipeline. Thus, the occurrence of some event in the processor may completely or at least partially defeat the goal of assigning processor resources efficiently between different instruction threads in a multi-thread processor.
For example, a priority of 5 may be assigned by software to a first instruction thread in a two thread system, while a priority of 2 may be assigned by software to a second instruction thread. Using the priority rule 1/2^(|X−Y|+1) described above, these software assigned priorities would dictate that an instruction from the lower priority thread would be interleaved into the interleaved instruction stream only once every sixteen clock cycles, while instructions from the higher priority instruction thread would be interleaved fifteen out of every sixteen clock cycles. If an instruction from the higher priority instruction thread experiences a cache miss, the priority rule would still dictate that fifteen out of every sixteen instructions comprise instructions from the higher priority instruction thread, even though the occurrence of the cache miss could effectively stall the execution of the respective instruction thread until the data for the instruction becomes available.
In an embodiment, each instruction thread in a SMT processor is associated with a software assigned base input processing priority. Unless some predefined event or circumstance occurs with an instruction being processed or to be processed, the base input processing priorities of the respective threads are used to determine the interleave frequency between the threads according to some instruction interleave rule. However, upon the occurrence of some predefined event or circumstance in the processor related to a particular instruction thread, the base input processing priority of one or more instruction threads is adjusted to produce one or more adjusted priority values. The instruction interleave rule is then enforced according to the adjusted priority value or values together with any base input processing priority values that have not been subject to adjustment.
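The effect of adjusting a base priority can be sketched using the priority rule described above and the priorities 5 and 2 from the preceding example. The adjustment policy shown here (demoting a stalled thread to priority 1) is purely hypothetical and is not stated in the text; it is included only to show how an adjusted value changes the interleave shares.

```python
from fractions import Fraction

def interleave_shares(priority_a, priority_b):
    """Return (share_a, share_b) of issue cycles under 1/2^(|X-Y|+1)."""
    low_share = Fraction(1, 2 ** (abs(priority_a - priority_b) + 1))
    if priority_a == priority_b:
        return low_share, low_share          # each thread gets 1/2
    if priority_a > priority_b:
        return 1 - low_share, low_share
    return low_share, 1 - low_share

def adjusted_priority(base_priority, stalled, stalled_priority=1):
    """Hypothetical adjustment: a stalled thread is demoted to stalled_priority."""
    return min(base_priority, stalled_priority) if stalled else base_priority

# Base priorities only: the stalled thread still receives 15/16 of the cycles.
base_shares = interleave_shares(5, 2)

# After a cache miss on the priority-5 thread, its adjusted priority is used.
adjusted_shares = interleave_shares(
    adjusted_priority(5, stalled=True),
    adjusted_priority(2, stalled=False),
)
```

With the assumed demotion, the stalled thread's share falls from 15/16 to 1/4 and the other thread's share rises from 1/16 to 3/4 while the miss is outstanding.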
Intel® Hyper-threading is described in “Intel® Hyper-Threading Technology, Technical User's Guide” 2003 from Intel® corporation, incorporated herein by reference. According to the Technical User's Guide, efforts to improve system performance on single processor systems have traditionally focused on making the processor more capable. These approaches to processor design have focused on making it possible for the processor to process more instructions faster through higher clock speeds, instruction-level parallelism (ILP) and caches. Techniques to achieve higher clock speeds include pipelining the micro-architecture to finer granularities, which is also called super-pipelining. Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second. But because there are far more instructions being executed in a super-pipelined micro-architecture, handling of events that disrupt the pipeline, such as cache misses, interrupts and branch mispredictions, is much more critical and failures more costly. ILP refers to techniques to increase the number of instructions executed each clock cycle. For example, many super-scalar processor implementations have multiple execution units that can process instructions simultaneously. In these super-scalar implementations, several instructions can be executed each clock cycle. With simple in-order execution, however, it is not enough to simply have multiple execution units. The challenge is to find enough instructions to execute. One technique is out-of-order execution where a large window of instructions is simultaneously evaluated and sent to execution units, based on instruction dependencies rather than program order. Accesses to system memory are slow, though faster than accessing the hard disk, but when compared to execution speeds of the processor, they are slower by orders of magnitude.
One technique to reduce the delays introduced by accessing system memory (called latency) is to add fast caches close to the processor. Caches provide fast memory access to frequently accessed data or instructions. As cache speeds increase, however, so does the problem of heat dissipation and of cost. For this reason, processors often are designed with a cache hierarchy in which fast, small caches are located near and operated at access latencies close to that of the processor core. Progressively larger caches, which handle less frequently accessed data or instructions, are implemented with longer access latencies. Nonetheless, times can occur when the needed data is not in any processor cache. Handling such cache misses requires accessing system memory or the hard disk, and during these times, the processor is likely to stall while waiting for memory transactions to finish. Most techniques for improving processor performance from one generation to the next are complex and often add significant die-size and power costs. None of these techniques operate at 100 percent efficiency thanks to limited parallelism in instruction flows. As a result, doubling the number of execution units in a processor does not double the performance of the processor. Similarly, simply doubling the clock rate does not double the performance due to the number of processor cycles lost to a slower memory subsystem.
Multithreading
As processor capabilities have increased, so have demands on performance, which has increased pressure to use processor resources with maximum efficiency. Noticing the time that processors wasted running single tasks while waiting for certain events to complete, software developers began wondering if the processor could be doing some other work at the same time.
To arrive at a solution, software architects began writing operating systems that supported running pieces of programs, called threads. Threads are small tasks that can run independently. Each thread gets its own time slice, so each thread represents one basic unit of processor utilization. Threads are organized into processes, which are composed of one or more threads. All threads in a process share access to the process resources.
These multithreading operating systems made it possible for one thread to run while another was waiting for something to happen. On Intel processor-based personal computers and servers, today's operating systems, such as Microsoft Windows* 2000 and Windows* XP, all support multithreading. In fact, the operating systems themselves are multithreaded. Portions of them can run while other portions are stalled.
To benefit from multithreading, programs need to possess executable sections that can run in parallel. That is, rather than being developed as a long single sequence of instructions, programs are broken into logical operating sections. In this way, if the application performs operations that run independently of each other, those operations can be broken up into threads whose execution is scheduled and controlled by the operating system. These sections can be created to do different things, such as allowing Microsoft Word* to repaginate a document while the user is typing. Repagination occurs on one thread and handling keystrokes occurs on another. On single processor systems, these threads are executed sequentially, not concurrently. The processor switches back and forth between the keystroke thread and the repagination thread quickly enough that both processes appear to occur simultaneously. This is called functionally decomposed multithreading.
Multithreaded programs can also be written to execute the same task on parallel threads. This is called data-decomposed multithreading, where the threads differ only in the data that is processed. For example, a scene in a graphic application could be drawn so that each thread works on half of the scene. Typically, data-decomposed applications are threaded for throughput performance while functionally decomposed applications are threaded for user responsiveness or functionality concerns.
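Data decomposition can be illustrated with a short software sketch in which two threads run the same routine on different halves of a scene. The scene contents, the brighten routine, and the use of Python's threading module are assumptions for the example; the sketch shows the structure of the decomposition and makes no claim about speedup.

```python
import threading

def brighten(rows, out, start):
    """Same task for every thread; only the slice of data differs."""
    for offset, row in enumerate(rows):
        out[start + offset] = [min(255, pixel + 10) for pixel in row]

# A hypothetical 4x4 "scene" of pixel intensities.
scene = [[r * 10 + c for c in range(4)] for r in range(4)]
result = [None] * len(scene)
half = len(scene) // 2

# Each thread draws half of the scene and writes to disjoint output rows.
workers = [
    threading.Thread(target=brighten, args=(scene[:half], result, 0)),
    threading.Thread(target=brighten, args=(scene[half:], result, half)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Because the two threads write to disjoint rows of the output, no locking is needed in this sketch. A functionally decomposed program would instead give each thread a different routine, such as one thread for keystrokes and one for repagination.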
When multithreaded programs are executing on a single processor machine, some overhead is incurred when switching context between the threads. Because switching between threads costs time, it appears that running the two threads this way is less efficient than running two threads in succession. If either thread has to wait on a system device for the user, however, the ability to have the other thread continue operating compensates very quickly for all the overhead of the switching. Since one thread in the graphic application example handles user input, frequent periods when it is just waiting certainly occur. By switching between threads, operating systems that support multithreaded programs can improve performance and user responsiveness, even if they are running on a single processor system.
In the real world, large programs that use multithreading often run many more than two threads. Software such as database engines creates a new processing thread for every request for a record that is received. In this way, no single I/O operation prevents new requests from executing and bottlenecks can be avoided. On some servers, this approach can mean that thousands of threads are running concurrently on the same machine.
Multiprocessing
Multiprocessing systems have multiple processors running at the same time. Traditional Intel® architecture multiprocessing systems have anywhere from two to about 512 processors. Multiprocessing systems allow different threads to run on different processors. This capability considerably accelerates program performance. Now two threads can run more or less independently of each other without requiring thread switches to get at the resources of the processor. Multiprocessor operating systems are themselves multithreaded, and the threads can use the separate processors to the best advantage.
Originally, there were two kinds of multiprocessing: asymmetrical and symmetrical. On an asymmetrical system, one or more processors were exclusively dedicated to specific tasks, such as running the operating system. The remaining processors were available for all other tasks (generally, the user applications). It quickly became apparent that this configuration was not optimal. On some machines, the operating system processors were running at 100 percent capacity, while the user-assigned processors were doing nothing. In short order, system designers came to favor an architecture that balanced the processing load better: symmetrical multiprocessing (SMP). The “symmetry” refers to the fact that any thread—be it from the operating system or the user application—can run on any processor. In this way, the total computing load is spread evenly across all computing resources. Today, symmetrical multiprocessing systems are the norm and asymmetrical designs have nearly disappeared.
SMP systems use double the number of processors; however, performance will not double. Two factors that inhibit performance from simply doubling are:
How well the workload can be parallelized
System overhead
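The first factor can be made quantitative with Amdahl's law, a standard model that is not part of the cited guide and is introduced here only as an illustration. It assumes a fixed fraction of the work parallelizes perfectly and ignores system overhead entirely, so it gives an upper bound; the function name amdahl_speedup is hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only parallel_fraction of the work
    can be spread across the given number of processors (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# With 80 percent of the work parallelizable, two processors give
# roughly a 1.67x speedup rather than 2x.
two_way = amdahl_speedup(0.8, 2)
```

Only a fully parallel workload reaches the ideal factor of two in this model, and any system overhead lowers the result further.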
Two factors govern the efficiency of interactions between threads:
How they compete for the same resources
How they communicate with other threads
Multiprocessor Systems
Today's server applications consist of multiple threads or processes that can be executed in parallel. Online transaction processing and Web services have an abundance of software threads that can be executed simultaneously for faster performance. Even desktop applications are becoming increasingly parallel. Intel architects have implemented thread-level parallelism (TLP) to improve performance relative to transistor count and power consumption.
In both the high-end and mid-range server markets, multiprocessors have been commonly used to get more performance from the system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on multiple processors at the same time. These threads might be from the same application, from different applications running simultaneously, from operating-system services, or from operating-system threads doing background maintenance. Multiprocessor systems have been used for many years, and programmers are familiar with the techniques to exploit multiprocessors for higher performance levels.
US Patent Application Publication No. 2011/0087865 “Intermediate Register Mapper” filed Apr. 14, 2011 by Barrick et al. and incorporated herein by reference teaches “A method, processor, and computer program product employing an intermediate register mapper within a register renaming mechanism. A logical register lookup determines whether a hit to a logical register associated with the dispatched instruction has occurred. In this regard, the logical register lookup searches within at least one register mapper from a group of register mappers, including an architected register mapper, a unified main mapper, and an intermediate register mapper. A single hit to the logical register is selected among the group of register mappers. If an instruction having a mapper entry in the unified main mapper has finished but has not completed, the mapping contents of the register mapper entry in the unified main mapper are moved to the intermediate register mapper, and the unified register mapper entry is released, thus increasing a number of unified main mapper entries available for reuse.”
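The movement of a mapping between mappers described in the quoted abstract can be sketched with a toy model. Everything in the sketch is an assumption made for illustration: the class and method names, the use of one mapping per logical register in each mapper, and the lookup order (unified main mapper first, then intermediate, then architected) are not taken from the cited publication.

```python
class RegisterMappers:
    """Toy model of three register mappers (logical -> physical register)."""

    def __init__(self, main_capacity):
        self.architected = {}    # mappings of completed instructions
        self.intermediate = {}   # finished but not yet completed
        self.unified_main = {}   # dispatched, still in flight
        self.main_capacity = main_capacity

    def rename(self, logical, physical):
        if len(self.unified_main) >= self.main_capacity:
            raise RuntimeError("unified main mapper is full")
        self.unified_main[logical] = physical

    def lookup(self, logical):
        # Assumed selection rule: the youngest mapping is the single hit.
        for mapper in (self.unified_main, self.intermediate, self.architected):
            if logical in mapper:
                return mapper[logical]
        return None

    def finish(self, logical):
        # Finished but not completed: move the entry out of the main
        # mapper, releasing that main mapper entry for reuse.
        self.intermediate[logical] = self.unified_main.pop(logical)

    def complete(self, logical):
        self.architected[logical] = self.intermediate.pop(logical)


mappers = RegisterMappers(main_capacity=1)
mappers.rename("r1", "p10")
mappers.finish("r1")            # frees the only main mapper entry
mappers.rename("r2", "p11")     # succeeds because the entry was released
```

In this model the mapping for r1 remains visible to lookups after it leaves the unified main mapper, while the freed entry is immediately reusable for r2.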
U.S. Pat. No. 6,314,511 filed Apr. 2, 1998 “Mechanism for freeing registers on processors that perform dynamic out-of-order execution of instructions using renaming registers” by Levy et al., incorporated by reference herein teaches “freeing renaming registers that have been allocated to architectural registers prior to another instruction redefining the architectural register. Renaming registers are used by a processor to dynamically execute instructions out-of-order in either a single or multi-threaded processor that executes instructions out-of-order. A mechanism is described for freeing renaming registers that consists of a set of instructions, used by a compiler, to indicate to the processor when it can free the physical (renaming) register that is allocated to a particular architectural register. This mechanism permits the renaming register to be reassigned or reallocated to store another value as soon as the renaming register is no longer needed for allocation to the architectural register. There are at least three ways to enable the processor with an instruction that identifies the renaming register to be freed from allocation: (1) a user may explicitly provide the instruction to the processor that refers to a particular renaming register; (2) an operating system may provide the instruction when a thread is idle that refers to a set of registers associated with the thread; and (3) a compiler may include the instruction with the plurality of instructions presented to the processor. There are at least five embodiments of the instruction provided to the processor for freeing renaming registers allocated to architectural registers: (1) Free Register Bit; (2) Free Register; (3) Free Mask; (4) Free Opcode; and (5) Free Opcode/Mask. The Free Register Bit instruction provides the largest speedup for an out-of-order processor and the Free Register instruction provides the smallest speedup.”
“Power ISA™ Version 2.06 Revision B” published Jul. 23, 2010 from IBM® and incorporated by reference herein teaches an example RISC (reduced instruction set computer) instruction set architecture. The Power ISA will be used herein in order to demonstrate example embodiments; however, the invention is not limited to Power ISA or RISC architectures. Those skilled in the art will readily appreciate use of the invention in a variety of architectures.
“z/Architecture Principles of Operation” SA22-7832-08, Ninth Edition (August, 2010) from IBM® and incorporated by reference herein teaches an example CISC (complex instruction set computer) instruction set architecture.