1. Field of the Invention
The invention generally relates to the field of computers and, more particularly, to computer architecture.
2. Description of the Related Art
Early computer processors (also called microprocessors) included a single central processing unit (CPU) or instruction execution unit that executed only one instruction at a time. As is well known, an execution unit executes programs having instructions stored in main memory by fetching instructions of the program, decoding the instructions, and executing the instructions one after the other. In response to the need for improved performance, several techniques, e.g., pipelining, superpipelining, superscaling, speculative instruction execution and out-of-order instruction execution, have been implemented to extend the capabilities of early processors.
Pipelined architectures break the execution of instructions into a number of stages, where each stage corresponds to one step in the execution of the instruction. Pipelined designs increase the rate at which instructions can be executed by allowing a new instruction to begin execution before a previous instruction is finished executing. Pipelined architectures have been extended to superpipelined or extended pipeline architectures, where each execution pipeline is broken down into even smaller stages. Superpipelining increases the number of instructions that can be executed in the pipeline at any given time.
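The throughput benefit of pipelining described above can be illustrated with a short cycle-count sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and no stalls or hazards occur; the function names are hypothetical and chosen only for illustration.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes all stages before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed in an idealized pipeline (assumption: one cycle per stage, no stalls).

    The first instruction takes num_stages cycles to drain through the pipeline;
    each later instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5
    print(unpipelined_cycles(n, k), pipelined_cycles(n, k))
```

Under these assumptions, 100 instructions on a 5-stage design take 500 cycles without pipelining and 104 cycles with it. Real pipelines fall short of this bound because of the dependencies and mispredictions discussed below.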
Superscalar processors generally refer to a class of microprocessor architectures that include multiple pipelines that process instructions in parallel. Superscalar processors typically execute more than one instruction per clock cycle, on average, by allowing parallel instruction execution in two or more instruction execution pipelines. In this manner, the number of instructions that may be processed is increased due to parallel execution. Each of the execution pipelines may have a different number of stages. Some of the pipelines may be optimized for specialized functions, such as integer operations or floating point operations, and in some cases execution pipelines are optimized for processing graphics, multimedia, or complex math instructions.
The goal of superscalar and superpipelined processors is to execute multiple instructions per cycle (IPC). Instruction-level parallelism (ILP) available in programs written to operate on the processor can be exploited to realize this goal. However, many programs are not coded in a manner that can take full advantage of the deep, wide instruction execution pipelines in modern processors. Many factors, such as low cache hit rate, instruction interdependency, frequent access to slow peripherals, and branch mispredictions, cause the resources of a superscalar processor to be used inefficiently.
Superscalar architectures require that instructions be dispatched for execution at a sufficient rate. Conditional branch instructions create a problem for instruction fetching because an instruction fetch unit (IFU) cannot know with certainty which instructions to fetch until the conditional branch instructions are resolved. Also, when a branch is detected, the target address of the instructions following the branch must be predicted to supply those instructions for execution.
Various processor architectures have used a branch prediction unit to predict the outcome of branch instructions, allowing the IFU to fetch subsequent instructions according to the predicted outcome. These instructions are speculatively executed to allow the processor to make forward progress while the branch instruction is being resolved.
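As an illustration of how a branch prediction unit may operate, the following is a minimal sketch of a two-bit saturating-counter predictor. This scheme is a common textbook technique and is an assumption made here for illustration; the text above does not specify any particular prediction scheme, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Per-branch two-bit saturating counter (assumed scheme, not from the source).

    Counter values 0 and 1 predict not taken; values 2 and 3 predict taken.
    """

    def __init__(self) -> None:
        self.counters: dict[int, int] = {}  # branch address -> counter in 0..3

    def predict(self, pc: int) -> bool:
        # Branches not seen before start at 1 (weakly not taken).
        return self.counters.get(pc, 1) >= 2

    def update(self, pc: int, taken: bool) -> None:
        counter = self.counters.get(pc, 1)
        # Move toward 3 on a taken branch and toward 0 otherwise, saturating at the ends.
        self.counters[pc] = min(3, counter + 1) if taken else max(0, counter - 1)


def count_mispredictions(predictor: TwoBitPredictor, pc: int, outcomes: list[bool]) -> int:
    """Replay a sequence of resolved outcomes for one branch and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    # A loop branch taken nine times and then falling through once.
    loop_outcomes = [True] * 9 + [False]
    print(count_mispredictions(TwoBitPredictor(), 0x400, loop_outcomes))
```

In this sketch, each misprediction corresponds to speculatively fetched instructions that would have to be discarded once the branch is resolved.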
Another technique to increase processing power is provided by multiprocessing. Multiprocessing is a hardware and operating system (OS) feature that allows multiple processors to work together to share workload within a computing system. In a shared memory multiprocessing system, all processors have access to the same physical memory. One limitation of multiprocessing is that programs that have not been optimized to run as multiple processes may not realize significant performance gain from multiple processors. However, improved performance is achieved where the OS is able to run multiple programs concurrently, each running on a separate processor.
Multithreaded software is a recent development that allows applications to be split into multiple independent threads, such that each thread can be assigned to a separate processor and executed independently in parallel as if the thread were a separate program. The results of these separate threads are reassembled to produce a final result. By implementing each thread on a separate processor, multiple tasks are handled in a fast, efficient manner. The use of multiple processors allows various tasks or functions to be distributed rather than handled by a single CPU, so that the computing power of the overall computer system is enhanced. However, because conventional multiprocessors are implemented using a plurality of discrete integrated circuits, communication between the devices limits system clock frequency and the ability to share resources between processors. As a result, conventional multiprocessor architectures duplicate resources, which increases cost and complexity.
To reduce duplication of resources, for example, various designers have implemented chip multiprocessors (CMPs). A CMP is essentially a symmetric multiprocessor (SMP) implemented on a single integrated circuit. Similar to an OS for an SMP system, an OS for a CMP is required to schedule and coordinate system resources for processor cores of the CMP. In a typical case, multiple processor cores of the CMP share memory of a memory hierarchy and various interconnects. In general, a computer system that implements one or more CMPs allows for increased thread-level parallelism (TLP). As is well known, threads include instruction sequences, derived from a program, that perform divisible tasks. OSs generally implement threads in one of two ways: preemptive multithreading or cooperative multithreading. In preemptive multithreading, an OS determines when a context switch should occur. In contrast, cooperative multithreading relies on the threads themselves to relinquish control once the threads are at a stopping point. This can create problems if a thread is waiting for a resource to become available. A disadvantage of preemptive multithreading is that the OS may make a context switch at an inappropriate time, causing priority inversion or other undesirable effects, which may be avoided by cooperative multithreading.
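Cooperative multithreading, as described above, can be sketched with Python generators, where each `yield` marks a stopping point at which a thread voluntarily relinquishes control. The round-robin scheduler and the names `worker` and `run_cooperative` are hypothetical and used only for illustration; an actual OS scheduler is considerably more involved.

```python
from collections import deque
from typing import Generator, Iterable


def worker(name: str, steps: int, log: list[str]) -> Generator[None, None, None]:
    """A cooperative thread: does one unit of work, then yields control."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # stopping point: control returns to the scheduler


def run_cooperative(tasks: Iterable[Generator[None, None, None]]) -> None:
    """Round-robin scheduler that switches only when a task yields."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)  # run the task until its next stopping point
        except StopIteration:
            continue  # task finished; do not reschedule it
        ready.append(task)


if __name__ == "__main__":
    log: list[str] = []
    run_cooperative([worker("A", 2, log), worker("B", 1, log)])
    print(log)
```

Because the scheduler regains control only at a `yield`, a task that never reaches a stopping point (for example, one that spins while waiting for a resource) would starve the others, which is the problem noted above. A preemptive scheduler would instead interrupt tasks at times of its own choosing.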
In at least one CMP, a single instruction fetch unit (IFU) has been utilized to service multiple processor cores. In a typical situation, each of the processor cores or strands may initiate multiple fetch requests. Depending upon whether a cache miss occurs, a packet may return out-of-order. In a typical situation, an out-of-order (OOO) packet may be detected and replayed through an IFU pipeline until an in-order packet is received and reaches a fetch buffer. Unfortunately, repeatedly replaying an OOO packet may cause various problems, such as excessive IFU traffic, increased turn-around time on token rotation through fetcher arbiters and unnecessary switching (resulting in increased power consumption) within the processor cores.
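The cost of repeatedly replaying out-of-order packets can be illustrated with a simplified simulation. This is only a sketch under stated assumptions: packets carry consecutive sequence numbers starting at zero, every waiting out-of-order packet makes one pass through the IFU pipeline per cycle, and all in-order packets available in a cycle reach the fetch buffer in that cycle. The function name `count_replays` and this timing model are hypothetical and do not describe any particular IFU.

```python
def count_replays(arrival_cycles: list[int]) -> int:
    """Count replay passes under a simplified, assumed model.

    arrival_cycles[seq] is the (non-negative) cycle at which packet `seq`
    returns, e.g. late when its fetch missed in the cache.
    """
    expected = 0               # next sequence number the fetch buffer will accept
    pending: set[int] = set()  # packets that have returned but are not yet in order
    replays = 0
    for cycle in range(max(arrival_cycles, default=-1) + 1):
        # Packets returning this cycle enter the pipeline.
        pending.update(seq for seq, t in enumerate(arrival_cycles) if t == cycle)
        # In-order packets reach the fetch buffer.
        while expected in pending:
            pending.remove(expected)
            expected += 1
        # Every packet still waiting is out of order and is replayed once.
        replays += len(pending)
    return replays


if __name__ == "__main__":
    # Packet 0 misses in the cache and returns at cycle 5; packet 1 hits at cycle 0.
    print(count_replays([5, 0]))
```

In this model, the number of replay passes grows with both the miss latency and the number of packets waiting behind the miss, which corresponds to the excess IFU traffic and unnecessary switching described above.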
What is needed is a technique for handling out-of-order packets that reduces out-of-order packet replay.