Single-threaded microprocessors are defined by the fact that, although multiple "concurrent" processes appear to be running simultaneously, only one process or thread of execution is actually running at any time. This distinction becomes slightly blurred where multiple execution blocks, or arithmetic logic units (ALUs), can execute in parallel. In superscalar processors, multiple instructions may be issued and executed each cycle. Still, these instructions come from a single thread of execution. Simultaneous multithreaded processors, on the other hand, allow instructions from different threads to execute simultaneously.
FIG. 1 is a block diagram of a multithreaded instruction pipeline 10. Multiple program counters 12 (PCs) are maintained, one per thread. The example of FIG. 1 shows two PCs, one for each of two threads. The PCs are used to fetch instructions for their respective threads from the instruction cache 14 or other memory. The fetched instructions are queued up in the instruction queue 16 and then issued.
Registers within the virtual register files 18 are accessed as necessary for the issued instructions. The instructions are then sent for execution to the execution box 20, which, in a superscalar architecture, may contain several arithmetic logic units (ALUs) that execute in parallel. Different ALUs may have different functions. For instance, some ALUs perform integer operations while others perform floating-point operations. Finally, memory is accessed in block 22.
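The fetch stage of FIG. 1 may be illustrated with a minimal sketch. The round-robin selection among threads, the function name `fetch_round_robin`, and the use of plain lists to stand in for the instruction cache are assumptions made for illustration only; the description above does not specify a fetch policy.

```python
from collections import deque


def fetch_round_robin(programs, cycles):
    """Fetch one instruction per cycle, alternating between threads.

    programs: one instruction list per thread (a stand-in for the
              instruction cache); hypothetical data layout.
    Returns the queued (thread_id, instruction) pairs and the final
    program counters, one per thread.
    """
    pcs = [0] * len(programs)          # one program counter per thread
    instruction_queue = deque()
    for cycle in range(cycles):
        tid = cycle % len(programs)    # assumed round-robin thread choice
        pc = pcs[tid]
        if pc < len(programs[tid]):    # thread still has instructions
            instruction_queue.append((tid, programs[tid][pc]))
            pcs[tid] = pc + 1
    return list(instruction_queue), pcs


queued, pcs = fetch_round_robin([["a0", "a1", "a2"], ["b0"]], cycles=5)
print(queued)
print(pcs)
```

In this sketch a thread that has run out of instructions simply wastes its fetch cycle; a practical fetch unit would likely select another thread instead.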
FIG. 2A is a chart illustrating a typical allocation 30 of ALU slots in a non-simultaneous multithreading superscalar processor. In this example, there are four ALUs, each represented by one of the four columns labeled 0-3. Instructions are allocated to ALUs as the ALUs become available. Thus, at time slot 32, two ALUs have been allocated. In the next cycle, time slot 34, three ALUs are allocated. However, there are many empty slots in which some of the ALUs sit idle, e.g., ALUs 2 and 3 in time slot 32. The goal of designers in allowing multiple threads to execute simultaneously is to fill these empty slots as often as possible and thereby fully utilize the processor's resources.
FIG. 2B is a chart 40 similar to that of FIG. 2A, but illustrating the allocation of ALU slots in a simultaneous multithreading superscalar system. Allocated instructions associated with one thread, say thread 0, are indicated with a dot, while instructions associated with another thread, say thread 1, are indicated with an X. For example, in time slot 42, ALUs 0 and 1 are allocated to instructions from thread 0, while ALUs 2 and 3 are allocated to instructions from thread 1. In time slot 44, ALUs 0, 1 and 3 are allocated to thread 0 while ALU 2 is allocated to thread 1.
While there may still be idle ALU slots, as in time slot 46, a comparison with FIG. 2A shows that idle ALU slots are far fewer in a simultaneous multithreading system. Thus, simultaneous multithreading systems are more efficient and, while not necessarily speeding up the execution of a single thread, dramatically speed up overall execution of multiple threads, compared to non-simultaneous multithreading systems.
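The comparison between FIG. 2A and FIG. 2B may be sketched numerically. The per-cycle counts of ready instructions below are invented for illustration and are not taken from the figures, and the model is deliberately simplified: instructions that cannot be issued in a cycle are not carried over to the next cycle, and threads are served in a fixed priority order, which is an assumption.

```python
def issue_single_thread(ready_per_cycle, num_alus=4):
    """ALU slots used per cycle when only one thread may issue."""
    return [min(ready, num_alus) for ready in ready_per_cycle]


def issue_smt(ready_by_thread, num_alus=4):
    """Per-cycle list of slots granted to each thread under SMT.

    Threads are served in list order (assumed fixed priority); each
    takes as many of the remaining free ALUs as it has ready
    instructions.
    """
    schedule = []
    for ready_counts in zip(*ready_by_thread):
        free = num_alus
        granted = []
        for ready in ready_counts:
            take = min(ready, free)
            granted.append(take)
            free -= take
        schedule.append(granted)
    return schedule


def utilization(slots_used_per_cycle, num_alus=4):
    """Fraction of all ALU slots that were occupied."""
    return sum(slots_used_per_cycle) / (num_alus * len(slots_used_per_cycle))


thread0 = [2, 3, 1, 0]   # hypothetical ready-instruction counts
thread1 = [2, 1, 2, 1]

single = issue_single_thread(thread0)
smt = issue_smt([thread0, thread1])

print(utilization(single))                      # 0.375
print(utilization([sum(c) for c in smt]))       # 0.75
```

With these assumed inputs, the second thread fills slots that the first thread leaves idle, so occupancy doubles even though neither thread individually runs faster.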
FIG. 3 illustrates the concept of virtual to physical memory mapping. Typically, a program, or a thread, executes in its own virtual address space 50, organized into blocks called pages 52. Pages of physical memory 58 are allocated as needed, and the virtual pages 52 are mapped to the allocated physical pages 58 by a mapping function 54. Typically, this mapping function 54, or page table as it is more commonly known, is a large table stored in memory, so each lookup incurs a long memory access time.
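The mapping of FIG. 3 may be sketched as follows. The 4096-byte page size, the class name `PageTable`, and the policy of handing out physical pages in increasing order on first touch are assumptions chosen for illustration; the description above does not fix any of them.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageTable:
    """Maps virtual page numbers to physical page numbers on demand."""

    def __init__(self):
        self.entries = {}          # virtual page number -> physical page number
        self.next_free_page = 0    # hypothetical allocator: next unused physical page

    def translate(self, virtual_address):
        """Return the physical address for a virtual address."""
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn not in self.entries:
            # Allocate a physical page the first time the virtual page is used.
            self.entries[vpn] = self.next_free_page
            self.next_free_page += 1
        return self.entries[vpn] * PAGE_SIZE + offset


table = PageTable()
print(table.translate(0x5010))   # first touched page -> physical page 0
print(table.translate(0x2004))   # second touched page -> physical page 1
print(table.translate(0x5FFF))   # same virtual page as the first access
```

The offset within the page is carried over unchanged; only the page number is translated.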
To reduce these long lookup times, a relatively small cache, called a translation lookaside buffer (TLB), is maintained. The TLB holds mappings for recently executed instructions, with the expectation that these instructions will need to be fetched again in the near future. The TLB is generally a content-addressable memory (CAM) device having a virtual address as its lookup key.
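The role of the TLB as a small cache in front of the in-memory table may be sketched as below. The least-recently-used replacement policy, the capacity, and the class name `TLB` are assumptions for illustration. A dictionary lookup stands in for the parallel compare performed by a CAM, and a plain dictionary stands in for the in-memory table.

```python
from collections import OrderedDict


class TLB:
    """Small cache of virtual-page to physical-page mappings."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # stand-in for the large in-memory table
        self.capacity = capacity       # assumed number of TLB entries
        self.entries = OrderedDict()   # ordered oldest-used first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        """Return the physical page for vpn, filling the TLB on a miss."""
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        ppn = self.page_table[vpn]             # slow path: consult the table
        self.entries[vpn] = ppn
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used (assumed policy)
        return ppn


tlb = TLB({0: 7, 1: 3, 2: 9}, capacity=2)
for vpn in (0, 0, 1, 2, 0):
    tlb.lookup(vpn)
print(tlb.hits, tlb.misses)   # 1 4
```

Only the repeated access to page 0 hits; the final access to page 0 misses again because the two-entry buffer has since evicted it.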
During a context switch, in which the executing process is swapped out and replaced with another process, much of the cached memory used by the first process becomes invalid. However, some processes may use common instruction code. There are various methods for sharing code, such as mapping different virtual addresses from different address spaces to the same physical address. This is often done with system memory space.
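The interaction between a context switch and shared code may be sketched as follows. Flushing every cached mapping on a switch is an assumed, simplified policy, and the page numbers and the names `Cpu`, `space_a`, and `space_b` are hypothetical. The two address spaces map different virtual pages to the same physical page holding common code.

```python
SHARED_CODE_PAGE = 42  # hypothetical physical page holding common code

# Each address space maps its own virtual pages; both reach the shared page.
space_a = {0x10: SHARED_CODE_PAGE, 0x11: 5}
space_b = {0x80: SHARED_CODE_PAGE, 0x81: 6}


class Cpu:
    """Tracks the active address space and a cache of its mappings."""

    def __init__(self):
        self.cached = {}         # cached virtual page -> physical page
        self.page_table = None   # mappings of the currently running process
        self.misses = 0

    def context_switch(self, page_table):
        """Swap in another process; its predecessor's cached mappings are dropped."""
        self.cached.clear()      # assumed policy: invalidate everything
        self.page_table = page_table

    def translate(self, vpn):
        if vpn not in self.cached:
            self.misses += 1
            self.cached[vpn] = self.page_table[vpn]
        return self.cached[vpn]


cpu = Cpu()
cpu.context_switch(space_a)
first = cpu.translate(0x10)
cpu.context_switch(space_b)
second = cpu.translate(0x80)
print(first == second, cpu.misses)   # True 2
```

Both processes resolve to the same physical page, yet the second process still misses, because the cached mapping is keyed by a virtual address that belongs to the first process's address space.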