Many different types of computing systems have attained widespread use around the world. These computing systems include personal computers, servers, mainframes, and a wide variety of stand-alone and embedded computing devices. Sprawling client-server systems exist, with applications and information spread across many personal computer (PC) networks, mainframes, and minicomputers. In a distributed system connected by networks, a user may access many application programs, databases, network systems, operating systems, and mainframe applications. Computers provide individuals and businesses with a host of software applications, including word processing, spreadsheet, accounting, e-mail, Voice over Internet Protocol (VoIP) telecommunications, and facsimile.
Users of digital processors such as computers continue to demand ever greater performance from such systems for handling increasingly complex and difficult tasks. In addition, processor speed has increased much more quickly than main memory access speed. As a result, cache memories, or caches, are used in many such systems to increase performance in a relatively cost-effective manner. Many modern computers also support “multi-tasking,” in which two or more programs are run at the same time. An operating system controls the alternation between the programs, and a switch between the programs or between the operating system and one of the programs is called a “context switch.”
Additionally, multi-tasking can be performed within a single program, where it is typically referred to as “multi-threading.” Multiple program actions can be processed concurrently using multi-threading. Most multi-threading processors work exclusively on one thread at a time. For example, a multi-threading processor may execute n instructions from thread a, then execute n instructions from thread b, where n is an integer and threads a and b are instruction streams from two different programs or from the same program. There also exist fine-grain multi-threading processors that interleave different threads on a cycle-by-cycle basis, i.e., n=1. Both types of multi-threading interleave the instructions of different threads on long-latency events.
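The two interleaving patterns described above can be illustrated with a minimal sketch. The function name interleave and the instruction labels are hypothetical and chosen only for illustration. The sketch models only the fixed n-instruction alternation and does not model switching on long-latency events.

```python
def interleave(threads, n):
    """Return the issue order when up to n instructions are taken from each thread in turn.

    threads: dict mapping a thread name to its list of instruction labels.
    n: number of consecutive instructions issued from one thread before switching.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Copy the instruction lists so the caller's data is not modified.
    pending = {name: list(instrs) for name, instrs in threads.items()}
    order = []
    while any(pending.values()):
        for name in pending:
            # Take the next n instructions (or fewer, if the thread is nearly done).
            chunk, pending[name] = pending[name][:n], pending[name][n:]
            order.extend(chunk)
    return order


threads = {"a": ["a0", "a1", "a2", "a3"], "b": ["b0", "b1", "b2", "b3"]}
block_order = interleave(threads, n=2)  # n instructions per thread, here n=2
fine_order = interleave(threads, n=1)   # cycle-by-cycle interleaving, n=1
```

With n=2 the sketch alternates pairs of instructions from threads a and b. With n=1 it alternates single instructions, which corresponds to the fine-grain case.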
Modern computers include at least a first-level cache (L1) and typically a second-level cache (L2). This dual cache memory system enables storing frequently accessed data and instructions close to the execution units of the processor to minimize the time required to transmit data to and from memory. L1 cache is typically on the same chip as the execution units. L2 cache is external to the processor chip but physically close to it. Ideally, as the time for execution of an instruction nears, instructions and data are moved to the L2 cache from a more distant memory. When execution of the instruction is imminent, the instruction and its data, if any, are advanced to the L1 cache.
As the processor operates in response to a clock, data and instructions are accessed from the L1 cache for execution. A cache miss occurs if the data or instructions sought are not in the cache. The processor would then seek the data or instructions in the L2 cache. A cache miss may occur at this level as well. The processor would then seek the data or instructions from other memory located further away. Thus, each time a memory reference is not found in the first level of cache, the processor attempts to obtain it from a second or higher level of cache. The benefits of a cache are maximized whenever the number of cache hits greatly exceeds the number of cache misses. When a cache miss occurs, execution of the instructions of the current thread is suspended while awaiting retrieval of the expected data or instructions. During this time, while the system is awaiting the data or instructions for the thread, the processor execution units could be operating on another thread. In a multi-threading system, the processor would switch to another thread and execute its instructions while operation on the first thread is suspended. Thus, thread selection logic is provided to determine which thread is to be executed next by the processor.
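The lookup sequence described above, in which a miss at one level sends the request to the next level, can be sketched as follows. This is a simplified model under stated assumptions: the names CacheLevel and load, the latencies of 1, 10, and 100 cycles, and the addresses are all hypothetical, and no capacity limit or eviction policy is modeled.

```python
class CacheLevel:
    """One cache level, modeled as an unbounded address-to-value map (assumption)."""

    def __init__(self, name, latency, contents=None):
        self.name = name
        self.latency = latency  # hypothetical access cost in cycles
        self.contents = dict(contents or {})


def load(address, levels, memory, memory_latency=100):
    """Search each level in order; return (value, cycles spent, where it was found)."""
    cycles = 0
    for depth, level in enumerate(levels):
        cycles += level.latency
        if address in level.contents:  # cache hit at this level
            value = level.contents[address]
            # Copy the value into every closer level so the next access hits sooner.
            for closer in levels[:depth]:
                closer.contents[address] = value
            return value, cycles, level.name
    # Miss at every level: fetch from the more distant memory and fill the caches.
    value = memory[address]
    cycles += memory_latency
    for level in levels:
        level.contents[address] = value
    return value, cycles, "memory"


l1 = CacheLevel("L1", latency=1)
l2 = CacheLevel("L2", latency=10, contents={0x80: 7})
memory = {0x40: 3, 0x80: 7}
```

In this model, the cycles returned for a miss correspond to the interval during which the current thread would be suspended and another thread could use the execution units.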
A common architecture for high-performance, single-chip microprocessors is the reduced instruction set computer (RISC) architecture, characterized by a small, simplified set of frequently used instructions for rapid execution. Thus, in a RISC architecture, a complex instruction comprises a small set of simple instructions that are executed in steps very rapidly. As semiconductor technology has advanced, the goal of RISC architecture has been to develop processors capable of executing one or more instructions on each clock cycle of the machine. Execution units of modern processors therefore have multiple stages forming an execution pipeline. On each cycle of processor operation, each stage performs a step in the execution of an instruction. Thus, as the processor cycles, an instruction advances through the stages of the pipeline and is executed as it advances. Pipelined instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished. A cache miss may occur when an instruction is at any stage of the pipeline. Ideally, when this occurs, an instruction of a different thread is placed in the pipeline at the next processor cycle and the suspended thread is advanced without execution.
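The overlap of instructions in a pipeline can be illustrated with a short sketch. The five stage names and the function pipeline_occupancy are assumptions made for illustration; the text above does not specify the number or names of the stages. The sketch assumes one instruction enters the pipeline per cycle and that no stalls occur.

```python
# Hypothetical five-stage pipeline; the stage names are illustrative only.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]


def pipeline_occupancy(num_instructions, stages=STAGES):
    """Return, for each cycle, which instruction index occupies each stage.

    Instruction i enters the first stage at cycle i and advances one stage
    per cycle, so it occupies stage s at cycle i + s. Empty stages hold None.
    """
    total_cycles = num_instructions + len(stages) - 1
    table = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(stages):
            i = cycle - s
            row[stage] = i if 0 <= i < num_instructions else None
        table.append(row)
    return table


table = pipeline_occupancy(3)
```

For three instructions, the table shows instruction 2 being fetched in the same cycle in which instruction 0 is in the execute stage, which is the overlap that pipelined execution provides.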
Multi-threading permits the processor's pipeline to do useful work on different threads when a pipeline stall condition is detected for the current thread. Various thread selection logic units have been proposed for controlling the order in which thread instructions are sent to be executed. For example, a task-level dynamic scheme would entail continual execution of a first thread until a long-latency event, such as a cache miss, occurs. Then, execution of the first thread is stopped and execution of a second thread commences. Execution of the second thread would continue until another long-latency event occurs. The objective of this scheme is to keep the processor as busy as possible executing instructions. However, the processor is required to handle data hazards because the same thread is likely to execute for many cycles. A data hazard occurs when data in a stage of the pipeline is required by another stage. Data forwarding between the stages of the processor pipeline is then necessary to ensure that data from one stage of the pipeline is available to another stage.
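The task-level dynamic scheme can be sketched as a small cycle-by-cycle simulation. This is an illustrative model, not a description of any particular selection logic unit. The function name switch_on_event is hypothetical, and each thread is represented, by assumption, as a list of integers giving the number of cycles the thread must wait after issuing that instruction (0 for an ordinary instruction, a larger value for a long-latency event).

```python
def switch_on_event(threads, max_cycles=1000):
    """Simulate task-level dynamic selection; return the thread issued on each cycle.

    threads: list of threads, each a list of stall latencies (hypothetical model).
    The trace holds a thread index per cycle, or None when no thread could issue.
    """
    count = len(threads)
    next_instr = [0] * count  # index of the next instruction of each thread
    ready_at = [0] * count    # first cycle on which each thread may issue again
    current = 0
    trace = []

    def runnable(t, cycle):
        return next_instr[t] < len(threads[t]) and ready_at[t] <= cycle

    for cycle in range(max_cycles):
        if all(next_instr[t] == len(threads[t]) for t in range(count)):
            break
        if not runnable(current, cycle):
            # Switch only when the current thread is stalled or finished.
            others = [(current + k) % count for k in range(1, count)]
            current = next((t for t in others if runnable(t, cycle)), current)
        if runnable(current, cycle):
            stall = threads[current][next_instr[current]]
            next_instr[current] += 1
            ready_at[current] = cycle + 1 + stall
            trace.append(current)
        else:
            trace.append(None)  # every unfinished thread is waiting
    return trace


trace = switch_on_event([[0, 0, 3, 0], [0, 0, 0]])
```

In the example, thread 0 issues on consecutive cycles until its third instruction stalls, at which point thread 1 takes over. The runs of the same thread index in the trace are the back-to-back executions for which data forwarding is required.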
An instruction-level dynamic scheme for thread selection is also known in the art. In this scheme, when a thread is dispatched, it remains active for only one cycle. The objective is to share the processor pipeline as fairly as possible between active threads, even if it increases the likelihood of all threads waiting simultaneously. The processor pipeline is still required to handle data hazards because the same thread can execute for many cycles back-to-back when only one thread is active.
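The instruction-level dynamic scheme can be sketched in the same style. The function name round_robin_ready is hypothetical, and the rotation order among ready threads is an assumption, since the text above does not specify how the next active thread is chosen. Each thread is again modeled as a list of stall latencies in cycles.

```python
def round_robin_ready(threads, max_cycles=1000):
    """Simulate instruction-level dynamic selection; return the thread issued per cycle.

    On every cycle the next ready thread after the last one issued is chosen,
    so a dispatched thread stays active for one cycle only (assumed rotation order).
    """
    count = len(threads)
    next_instr = [0] * count
    ready_at = [0] * count
    last = -1  # no thread has issued yet
    trace = []
    for cycle in range(max_cycles):
        if all(next_instr[t] == len(threads[t]) for t in range(count)):
            break
        pick = None
        for k in range(1, count + 1):
            t = (last + k) % count
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                pick = t
                break
        if pick is not None:
            stall = threads[pick][next_instr[pick]]
            next_instr[pick] += 1
            ready_at[pick] = cycle + 1 + stall
            last = pick
        trace.append(pick)  # None marks a cycle in which all threads were waiting
    return trace


trace = round_robin_ready([[0, 0, 0, 0], [4, 0]])
```

In the example, thread 1 stalls on its first instruction, leaving thread 0 as the only active thread. Thread 0 then issues on three consecutive cycles, which is the back-to-back case in which data hazards must still be handled.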
Another scheme for thread selection is cycle-level Time Division Multiplex (TDM). In this method, all threads are dispatched in a fixed time-division-multiplexed pattern at each cycle. If a thread waits (for example, because of a long-latency read), its assigned cycle is lost. The objective is to eliminate data hazards in the processor pipeline. This is achieved when the number of multiplexed threads is at least equal to the number of stages in the pipeline. Note that processor use is not optimal because the wait cycles are multiplexed among the execution cycles, causing idle cycles in the processor pipeline.
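Cycle-level TDM can be sketched as follows. The function name tdm and the representation of each thread as a list of stall latencies in cycles are assumptions made for illustration. The fixed pattern is modeled here as a simple rotation in which cycle c belongs to thread c modulo the number of threads.

```python
def tdm(threads, max_cycles=1000):
    """Simulate cycle-level TDM; return the thread issued on each cycle.

    Each cycle is owned by one thread in a fixed rotation. If the owner is
    waiting or finished, the cycle is lost and None is recorded.
    """
    count = len(threads)
    next_instr = [0] * count
    ready_at = [0] * count
    trace = []
    for cycle in range(max_cycles):
        if all(next_instr[t] == len(threads[t]) for t in range(count)):
            break
        owner = cycle % count  # fixed slot assignment (assumed rotation)
        if next_instr[owner] < len(threads[owner]) and ready_at[owner] <= cycle:
            stall = threads[owner][next_instr[owner]]
            next_instr[owner] += 1
            ready_at[owner] = cycle + 1 + stall
            trace.append(owner)
        else:
            trace.append(None)  # the owner's slot goes unused
    return trace


trace = tdm([[0, 0, 0, 0], [4, 0]])
```

In the example, the slots owned by thread 1 go idle while it waits, even though thread 0 has instructions ready. No thread ever issues on two consecutive cycles, which illustrates how the fixed pattern trades processor utilization for the separation between instructions of the same thread.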
Thus, there is a need for a thread-switching method that minimizes data hazards and makes optimal use of the processor.