Given the continually increasing reliance on computers in contemporary society, computer technology has had to advance on many fronts to keep up with increased demand. One particular subject of significant research and development efforts is parallelism, i.e., the performance of multiple tasks in parallel.
A number of computer software and hardware technologies have been developed to facilitate increased parallel processing. From a software standpoint, multithreaded operating systems have been developed, which permit computer programs to concurrently execute in multiple “threads” so that multiple tasks can essentially be performed at the same time. Threads generally represent independent paths of execution for a program. For example, for an e-commerce computer application, different threads might be assigned to different customers so that each customer's specific e-commerce transaction is handled in a separate thread.
From a hardware standpoint, computers increasingly rely on multiple microprocessors to provide increased workload capacity. Furthermore, some microprocessors have been developed that support the ability to execute multiple threads in parallel, effectively providing many of the same performance gains attainable through the use of multiple microprocessors. In contrast with single-threaded microprocessors that only support a single path of execution, multithreaded microprocessors support multiple paths of execution such that different threads assigned to different execution paths are able to progress in parallel.
Irrespective of the number of separate execution paths that are supported in the underlying hardware, however, the operating systems in multithreaded computers are typically designed to execute multiple threads on each individual execution path, typically by allocating time slices on each execution path to different threads. While the threads assigned to a given execution path technically are not executed in parallel, enabling each thread to execute for a period of time and then switching among the threads allows each thread to progress in a reasonable and fair manner and thus maintains the appearance of parallelism.
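By way of illustration only, the allocation of time slices on a single execution path might be sketched as a simple round-robin loop. The function name `round_robin`, the thread names and the units of work are hypothetical and chosen solely for this example; actual operating system schedulers are considerably more elaborate.

```python
from collections import deque


def round_robin(work, time_slice):
    """Return the sequence of (thread, units_run) slices on one execution path.

    work: dict mapping a hypothetical thread name to its units of work.
    time_slice: maximum units a thread runs before it is switched out.
    """
    ready = deque(work.items())
    schedule = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(time_slice, remaining)
        schedule.append((name, ran))
        if remaining > ran:
            # The thread is not finished: switch it out and requeue it.
            ready.append((name, remaining - ran))
    return schedule


if __name__ == "__main__":
    for name, ran in round_robin({"A": 5, "B": 2, "C": 4}, time_slice=2):
        print(f"thread {name} runs for {ran} unit(s)")
```

Only one thread runs at any instant in this sketch, yet every thread advances within a bounded number of slices, which is the appearance of parallelism described above.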
While multithreading of this nature can significantly increase system performance, some inefficiencies exist as a result of switching between executing different threads in a given execution path. In particular, whenever an execution path switches between executing different threads, an operation known as a context switch must be performed. A context switch typically consists of saving or otherwise preserving the working state of the thread that was previously being executed, and is now being switched out, and restoring the working state of the thread about to be executed, or switched in.
The working state of a thread includes various state information that characterizes, from the point of view of a thread, the state of the system at a particular point in time, and may include information such as the contents of the register file(s), the program counter and other special purpose registers, among others. Thus, by saving the working state when a thread is switched out, or suspended, and then restoring the working state when a thread is switched in, or resumed, the thread functionally executes in the same manner as if the thread had never been interrupted.
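The save and restore steps of a context switch can be sketched, in greatly simplified form, as follows. The names `WorkingState`, `ExecutionPath` and `context_switch`, and the four-entry register file, are assumptions made for this example; a real working state contains considerably more information.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingState:
    """Hypothetical per-thread working state: program counter plus registers."""
    program_counter: int = 0
    registers: list = field(default_factory=lambda: [0] * 4)


class ExecutionPath:
    """One execution path holding a single live set of registers."""

    def __init__(self):
        self.program_counter = 0
        self.registers = [0] * 4

    def save(self):
        # Copy the live state so later execution cannot alter the saved copy.
        return WorkingState(self.program_counter, list(self.registers))

    def restore(self, state):
        self.program_counter = state.program_counter
        self.registers = list(state.registers)


def context_switch(path, saved, outgoing, incoming):
    """Preserve the outgoing thread's state, then restore the incoming thread's."""
    saved[outgoing] = path.save()
    path.restore(saved.get(incoming, WorkingState()))


if __name__ == "__main__":
    path, saved = ExecutionPath(), {}
    path.program_counter, path.registers[0] = 12, 7    # thread A executes
    context_switch(path, saved, "A", "B")
    path.program_counter, path.registers[0] = 40, 99   # thread B executes
    context_switch(path, saved, "B", "A")
    print(path.program_counter, path.registers)        # thread A's state is back
```

In this sketch, thread A observes the same program counter and register contents after being resumed that it held when it was suspended, even though thread B used the same execution path in the interim.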
One undesirable side effect of performing a context switch in many environments, however, is the increased occurrence of cache misses once a thread is switched back in. Caching is a technique that is universally utilized in modern computer architectures to address the latency problems that result from the disparity between the speed of microprocessors and the slower speed of the memory devices used by microprocessors to access stored data.
In particular, caching attempts to balance memory speed and capacity with cost by using multiple levels of memory. Often, a computer relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main storage memory that uses dynamic random access memory devices (DRAM's) or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory devices (SRAM's) or the like. Often multiple levels of cache memories are used, each with progressively faster and smaller memory devices. Also, depending upon the memory architecture used, cache memories may be shared by multiple microprocessors or dedicated to individual microprocessors, and may either be integrated onto the same integrated circuit as a microprocessor, or provided on a separate integrated circuit.
Moreover, some cache memories may be used to store both instructions, which comprise the actual programs that are being executed, and the data being processed by those programs. Other cache memories, often those closest to the microprocessors, may be dedicated to storing only instructions or data.
When multiple levels of memory are provided in a computer architecture, one or more memory controllers are typically relied upon to swap needed data from segments of memory addresses, often known as “cache lines”, between the various memory levels in an attempt to maximize the frequency with which requested data is stored in the fastest cache memory accessible by the microprocessor. Whenever a memory access request attempts to access a memory address that is not cached in a cache memory, a “cache miss” occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant performance hit.
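Caching depends upon both temporal and spatial locality to improve system performance: data that has been accessed recently is likely to be accessed again soon (temporal locality), and data stored near recently accessed data is likely to be accessed as well (spatial locality). Put another way, when a particular cache line is retrieved into a cache memory, there is a good likelihood that data from that cache line will be needed again, so the next access to data in the same cache line will result in a “cache hit” and thus not incur a performance penalty.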
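The effect of cache lines on hits and misses can be illustrated with a minimal model of a direct-mapped cache. The class name `DirectMappedCache`, the eight-line capacity and the 16-byte line size are assumptions made for this sketch and do not describe any particular hardware.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only which line occupies each slot."""

    def __init__(self, num_lines=8, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.slots = [None] * num_lines   # line number currently held per slot
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a cache hit, False on a cache miss."""
        line_number = address // self.line_size
        index = line_number % self.num_lines
        if self.slots[index] == line_number:
            self.hits += 1
            return True
        # Miss: the whole line is brought in, displacing the previous occupant.
        self.slots[index] = line_number
        self.misses += 1
        return False


if __name__ == "__main__":
    cache = DirectMappedCache()
    for address in range(64):      # first pass over 64 consecutive bytes
        cache.access(address)
    for address in range(64):      # second pass over the same bytes
        cache.access(address)
    print(cache.hits, cache.misses)
```

In the first pass only the first access to each 16-byte line misses, and the remaining accesses to that line hit; the second pass hits on every access because the lines are still cached.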
Other manners of accelerating performance in connection with caching include techniques such as instruction prefetching, branch prediction and data prefetching. Instruction prefetching, for example, is typically implemented in a microprocessor, and attempts to fetch instructions from memory before they are needed, so that the instructions will hopefully be cached when they are actually needed. Branch prediction, which is also typically implemented in a microprocessor, extends instruction prefetching by attempting to predict which branch of a decision will likely be taken, and then prefetching instructions from the predicted branch. Data prefetching, which is often implemented in a separate component from a microprocessor (but which may still be disposed on the same integrated circuit device), attempts to detect patterns of data access and prefetch data that is likely to be needed based upon any detected patterns.
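One simple form of pattern detection that a data prefetcher might employ is stride detection, in which a prefetch is issued once two consecutive accesses are separated by the same distance. The sketch below is illustrative only; the class name `StridePrefetcher` and its single-stream design are assumptions, and it is not asserted that any particular prefetcher operates in this manner.

```python
class StridePrefetcher:
    """Toy single-stream prefetcher: predicts the next address from a repeated stride."""

    def __init__(self):
        self.last_address = None
        self.last_stride = None

    def observe(self, address):
        """Record an access; return an address to prefetch, or None."""
        prediction = None
        if self.last_address is not None:
            stride = address - self.last_address
            # Predict only after the same nonzero stride is seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prediction = address + stride
            self.last_stride = stride
        self.last_address = address
        return prediction


if __name__ == "__main__":
    prefetcher = StridePrefetcher()
    for address in (100, 164, 228, 292, 10):
        print(address, "->", prefetcher.observe(address))
```

The prefetcher in this sketch produces no prediction for its first two observations, because it has no history from which to detect a pattern.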
From the perspective of an executing thread, therefore, as a particular thread executes, more and more of the instructions and data used by a thread will progressively become cached, and thus the execution of the thread will tend to be more efficient the longer the thread is executed.
However, given that the same premise applies to all of the threads executing in a multithreaded computer, whenever a thread is suspended as a result of a context switch, and then is later resumed as a result of another context switch, it is likely that some or all of the instructions and data that were cached prior to suspending the thread will no longer be cached when the thread is resumed (principally due to the caching of instructions and data needed by other threads that were executed in the interim). A greater number of cache misses then typically occurs, thus negatively impacting overall system performance. Prefetching and branch prediction, which rely on historical data, also typically provide little or no benefit for a resumed thread upon its initial resumption of execution, as the prefetching of instructions and data cannot be initiated until after the thread resumes its execution.
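The displacement of one thread's cached data by another thread can be illustrated with a toy model in which two threads share a small direct-mapped cache. The function names `replay` and `simulate`, the cache geometry and the address ranges are all hypothetical, and the address ranges are deliberately chosen so that the two working sets map onto the same cache slots.

```python
def replay(slots, addresses, line_size=16):
    """Replay addresses through a direct-mapped cache; return the miss count."""
    misses = 0
    for address in addresses:
        line_number = address // line_size
        index = line_number % len(slots)
        if slots[index] != line_number:
            slots[index] = line_number   # bring the line in, displacing the old one
            misses += 1
    return misses


def simulate():
    """Return thread A's miss counts when cold, when warm, and after resumption."""
    slots = [None] * 8                    # 8 lines of 16 bytes, shared by both threads
    thread_a = range(0, 128)              # hypothetical working set of thread A
    thread_b = range(1024, 1152)          # thread B's working set, same cache slots
    cold = replay(slots, thread_a)        # first execution of thread A
    warm = replay(slots, thread_a)        # thread A continues uninterrupted
    replay(slots, thread_b)               # thread A suspended; thread B executes
    resumed = replay(slots, thread_a)     # thread A resumed
    return cold, warm, resumed


if __name__ == "__main__":
    print(simulate())
```

In this model thread A incurs no misses while it continues to execute, but after thread B has executed in the interim, thread A incurs as many misses upon resumption as it did when the cache was empty.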
Therefore, a significant need has arisen in the art for a manner of minimizing the adverse performance impact associated with context switching in a multithreaded computer.