As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is that of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has calculated its result and exited the pipeline. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of an earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
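The cost of such bubbles can be illustrated with a small model. The following is only a sketch under stated assumptions: an in-order, single-issue pipeline with no result forwarding, in which a result becomes available a fixed number of cycles after its instruction issues. The function name `count_cycles`, the register names, and the pipeline depth are hypothetical and chosen purely for illustration.

```python
def count_cycles(instructions, depth):
    """Return (total_cycles, bubbles) for a simplified pipeline model.

    instructions: list of (dest, sources) tuples, issued in program order.
    depth: number of pipeline stages; a result is assumed to be readable
           `depth` cycles after its instruction issues (no forwarding).
    """
    ready_at = {}    # register name -> first cycle its value can be read
    next_issue = 0   # earliest cycle at which the next instruction may issue
    bubbles = 0      # issue slots left empty because of dependencies
    last_done = 0    # cycle at which the most recent instruction completes
    for dest, sources in instructions:
        # A dependent instruction waits until all of its sources are ready.
        earliest = max([next_issue] + [ready_at.get(s, 0) for s in sources])
        bubbles += earliest - next_issue
        ready_at[dest] = earliest + depth
        last_done = earliest + depth
        next_issue = earliest + 1
    return last_done, bubbles


# Three independent instructions versus a three-instruction dependency chain.
independent = [("r1", []), ("r2", []), ("r3", [])]
dependent = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]
print(count_cycles(independent, depth=4))  # (6, 0)
print(count_cycles(dependent, depth=4))    # (12, 6)
```

In this model, the dependent chain takes twice as many cycles as the independent sequence, and every additional cycle is a bubble in which no new instruction enters the pipeline.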
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multi-threading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives up the utilization and hence the aggregate throughput. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A SIMD or vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a SIMD or vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.
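The lane-based behavior of a SIMD execution unit can be sketched in software. The following model is an assumption-laden illustration only: it emulates four lanes sequentially rather than in parallel, and the names `LANES`, `simd_add`, and `vector_add` are hypothetical and do not correspond to any particular instruction set.

```python
LANES = 4  # assumed vector width: four words per vector


def simd_add(a, b):
    """Model one SIMD add: each lane applies the same operation to its own element."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each operand must hold exactly LANES elements")
    return [x + y for x, y in zip(a, b)]


def vector_add(xs, ys):
    """Add two equal-length arrays one LANES-wide vector at a time.

    Returns (result, instruction_count), where instruction_count is the
    number of modeled SIMD instructions that were issued.
    """
    if len(xs) != len(ys) or len(xs) % LANES != 0:
        raise ValueError("inputs must have equal length, a multiple of LANES")
    result = []
    instruction_count = 0
    for i in range(0, len(xs), LANES):
        result.extend(simd_add(xs[i:i + LANES], ys[i:i + LANES]))
        instruction_count += 1
    return result, instruction_count


# Eight element-wise additions are covered by two modeled SIMD instructions.
print(vector_add([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
```

A scalar unit would need one add instruction per element, so in this model the instruction count falls by a factor equal to the number of lanes.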
The aforementioned techniques may also be combined, resulting in a multi-threaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a SIMD execution unit to process “vectors” of data points at the same time. In addition, it is also possible to employ multiple execution units in the same microprocessor to provide additional parallelization. The multiple execution units may be specialized to handle different types of instructions, or may be similarly configured to process the same types of instructions.
Irrespective of the number of distinct hardware threads, or execution paths, supported in a processor architecture, the operating system and other software that executes on a microprocessor will often need to execute a number of distinct and parallel instruction streams that exceeds the number of available execution paths on the microprocessor. These instruction streams, which may take the form of software threads, tasks, or processes, among other constructs, will hereinafter be referred to as “processes,” although the invention is not limited to any particular terminology or nomenclature.
Whenever the number of processes requiring execution by a microprocessor exceeds the number of available execution paths, multiple processes are allocated to and executed on each individual execution path, typically by allocating time slices on each execution path to different processes. While the processes assigned to a given execution path technically are not executed in parallel, by enabling each process to execute for a period of time and switching between each process, each process is able to progress in a reasonable and fair manner and thus maintain the appearance of parallelism.
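Time slicing of this kind can be illustrated with a simple round-robin model. The sketch below assumes a single execution path, a fixed time slice, and processes described only by their remaining work; the function name `round_robin` and the process names are hypothetical, and real operating system schedulers are considerably more elaborate.

```python
from collections import deque


def round_robin(work, time_slice):
    """Share one execution path among several processes.

    work: dict mapping a process name to its remaining work, in time units.
    time_slice: maximum number of time units a process runs per turn.
    Returns the execution timeline as a list of (name, units_run) tuples.
    """
    if time_slice <= 0:
        raise ValueError("time_slice must be positive")
    queue = deque(work.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(time_slice, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Unfinished processes rejoin the back of the queue.
            queue.append((name, remaining - ran))
    return timeline


print(round_robin({"A": 3, "B": 5}, time_slice=2))
# [('A', 2), ('B', 2), ('A', 1), ('B', 2), ('B', 1)]
```

Each boundary between consecutive entries for different processes in the timeline corresponds to a point at which the execution path switches from one process to another.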
The introduction of time-based multithreading of this nature, however, creates some inefficiencies as a result of switching between executing different processes in a given execution path. In particular, whenever an operating system scheduling algorithm determines that a currently running process needs to give up utilization of a hardware thread and grant it to another process, the scheduler causes a timer interrupt, which triggers an interrupt handler to perform a context switch. A context switch typically consists of saving or otherwise preserving the context, or working state, of the process that was previously being executed, and is now being switched out, and restoring the context, or working state, of the process about to be executed, or switched in.
The working state of a process typically includes various state information that characterizes, from the point of view of a process, the state of the system at a particular point in time, and may include various information such as the contents of the register file(s), the program counter and other special purpose registers, among others. Thus, by saving the working state when a process is switched out, or suspended, and then restoring the working state when a process is switched in, or resumed, the process functionally executes in the same manner as if the process had never been interrupted.
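The save and restore steps of a context switch can be sketched as follows. This is a minimal model under stated assumptions: the working state is reduced to a four-entry register file and a program counter, and the names `Context`, `HardwareThread`, and `context_switch` are hypothetical. An actual context switch is performed by privileged software and hardware and preserves considerably more state.

```python
NUM_REGISTERS = 4  # assumed register file size, kept small for illustration


class Context:
    """Saved working state of one process (registers and program counter only)."""

    def __init__(self, registers=None, pc=0):
        self.registers = list(registers) if registers else [0] * NUM_REGISTERS
        self.pc = pc


class HardwareThread:
    """One execution path holding the live state of the running process."""

    def __init__(self):
        self.registers = [0] * NUM_REGISTERS
        self.pc = 0

    def save(self):
        # Copy the live state so later execution cannot alter the saved copy.
        return Context(self.registers, self.pc)

    def restore(self, context):
        self.registers = list(context.registers)
        self.pc = context.pc


def context_switch(hw, saved, outgoing, incoming):
    """Save the outgoing process's state and restore the incoming process's state."""
    saved[outgoing] = hw.save()
    # A process that has never run starts from a fresh, zeroed context.
    hw.restore(saved.get(incoming, Context()))


hw, saved = HardwareThread(), {}
hw.registers[0], hw.pc = 7, 100          # process A makes some progress
context_switch(hw, saved, "A", "B")      # A is switched out, B is switched in
hw.registers[1], hw.pc = 9, 40           # process B makes some progress
context_switch(hw, saved, "B", "A")      # B is switched out, A is switched in
print(hw.registers, hw.pc)               # [7, 0, 0, 0] 100
```

After the second switch, process A observes exactly the register contents and program counter it had when it was suspended, which is the property described above.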
One undesirable side effect of performing a context switch, however, is the latency that is associated with saving one context and loading another context. Loading a new context can consume hundreds of execution cycles due to the large numbers of registers present in modern processor architectures. In addition, since the memory cache is usually filled with data from the formerly running process at the time of the context switch, the context for a process that was saved the last time that process was executed may no longer be cached, so attempting to load a new context often results in a cache miss and the additional delay associated with loading the context from a lower level of memory.
Attempts have been made to reduce the adverse impacts of context switches, typically by attempting to prefetch instructions and/or data that might be used by a process once it resumes execution. However, saving and restoring the context itself can still add significant latency to a context switch.
Therefore, a significant need continues to exist in the art for a manner of minimizing the adverse performance impact associated with context switching.