The central processing unit ("CPU") of a computer system fetches program instructions and data from the system's memory, performs the logical and mathematical operations on that data as specified by the instructions, and stores the results of those operations back into the system's memory. The sequence in which the CPU performs these tasks is also dictated by the program instructions. An excellent reference for this background section is Chapter 6 of Hennessy and Patterson, Computer Architecture--A Quantitative Approach (1990).
The performance of a particular CPU is measured by the time it requires to execute a particular task or program. The CPU time to execute a program can be expressed as: CPU time = (instructions per program) * (clock cycles per instruction) * (clock period). Thus, CPU performance depends on each of these three characteristics of CPU design. These characteristics are governed by interdependent design factors and therefore cannot be changed in isolation from one another. For example, the CPU of a reduced instruction set computer (RISC) is organized in a manner which greatly simplifies the instruction set that the CPU is capable of processing. This streamlined hardware organization and the accompanying simplified instruction set architecture decrease the clock period and the clock cycles per instruction (CPI). Because the instruction set is limited, however, the number of instructions required to execute a given task necessarily increases commensurately with the task's complexity.
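The trade-off described above can be illustrated with a minimal sketch of the CPU time equation. The function name `cpu_time` and every figure below are hypothetical values chosen only for illustration; they do not describe any real processor.

```python
def cpu_time(instructions_per_program, cycles_per_instruction, clock_period_s):
    """CPU time = (instructions per program) * (CPI) * (clock period)."""
    return instructions_per_program * cycles_per_instruction * clock_period_s


# Hypothetical design A: richer instruction set, so fewer instructions,
# but a higher CPI and a longer clock period (all numbers are assumptions).
design_a = cpu_time(1_000_000, 6.0, 50e-9)

# Hypothetical design B (RISC-like): more instructions for the same task,
# but a lower CPI and a shorter clock period (all numbers are assumptions).
design_b = cpu_time(1_500_000, 1.5, 25e-9)

print(f"design A: {design_a * 1e3:.2f} ms")
print(f"design B: {design_b * 1e3:.2f} ms")
```

With these assumed figures, the larger instruction count of design B is more than offset by its lower CPI and shorter clock period. Different assumed figures could reverse the outcome, which is why the three factors cannot be evaluated in isolation.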
The most widely accepted technique for increasing CPU throughput is called pipelining. Pipelining increases CPU performance predominantly through the reduction of CPI, although it can also reduce the CPU clock period to a lesser extent. Pipelining is a technique whereby instruction execution is broken down into a series of steps. Each step in the pipeline, known as a pipestage, completes a designated portion of an instruction's complete execution. Each pipestage adds to the execution in the same way that a station on an assembly line adds to the manufacture of a product. The instruction leaves the pipeline's final pipestage completely executed, just as a product leaves the assembly line completely assembled.
Ideally, a number of instructions equal to the number of pipestages in the pipeline may be overlapped in execution, each instruction occupying a different pipestage. If the CPU has sufficient resources, and earlier pipestages do not depend upon the completed results of later pipestages, each pipestage can independently perform its function (on the instruction currently occupying it) in parallel with the other pipestages. Further, if the average time a CPU requires to completely execute an instruction is divided equally among the pipestages, the speedup in CPU throughput for pipelined execution over sequential execution will be equal to the number of pipestages. Thus, for an ideal pipeline composed of five pipestages, five instructions will be executed in the average time required to execute one instruction sequentially, a fivefold speedup in throughput. Notice that the pipeline does not decrease the average time to execute a single instruction, but rather decreases overall average execution time by completing more instructions per unit of time.
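The overlap in an ideal pipeline can be sketched as follows. This is a simplified model under stated assumptions: every pipestage takes exactly one cycle, there are no dependencies or resource conflicts, and the pipeline starts empty. The function names are hypothetical. The model also counts the cycles needed to fill the pipeline, so the speedup approaches the number of pipestages only for long instruction streams.

```python
def sequential_cycles(n_instructions, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction needs n_stages cycles; each later one finishes a cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages):
    """Throughput speedup of ideal pipelined over sequential execution."""
    return sequential_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


def occupancy(n_instructions, n_stages):
    """One row per cycle; each entry is the instruction index in that stage, or None."""
    table = []
    for cycle in range(pipelined_cycles(n_instructions, n_stages)):
        row = []
        for stage in range(n_stages):
            instr = cycle - stage  # instruction i enters stage s at cycle i + s
            row.append(instr if 0 <= instr < n_instructions else None)
        table.append(row)
    return table


# Five instructions in a five-stage pipeline: show which instruction is where.
for cycle, row in enumerate(occupancy(5, 5)):
    cells = " ".join("--" if i is None else f"I{i}" for i in row)
    print(f"cycle {cycle}: {cells}")

# For a long stream, the speedup approaches the number of pipestages.
print(speedup(1_000_000, 5))
```

In the fifth printed row (cycle 4) all five pipestages are occupied by different instructions, which is the fully overlapped state the paragraph describes. Each individual instruction still spends five cycles in the pipeline.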
Assuming in the five-pipestage example above that the CPU clock cycles at the same rate at which instructions move from one pipestage to the next, sequential execution yields a CPI of five whereas the ideal pipeline yields a CPI of one.
There are, however, physical limitations on what appears at first blush to be an unlimited ability to increase throughput by increasing the number of stages in a pipeline. First, splitting the execution of an instruction into stages of equal duration is nearly impossible. The time allotted to each pipestage is therefore constrained to that of the slowest pipestage, because instructions advance through the pipeline at a constant rate and each pipestage must complete before it can pass its results to the next. Further, there is overhead associated with the implementation and control of the pipeline: the results of each pipestage must be clocked into latches, creating delays which add to the time required to complete each pipestage. Finally, there are practical limitations to the depth of any pipeline because the average time required to execute a single instruction remains relatively fixed.
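These limits can be made concrete with a rough model. It assumes that the pipelined clock period equals the slowest pipestage delay plus a fixed latch overhead, and that the unpipelined instruction time is the sum of the stage delays. The function names and all delay values (in nanoseconds) are hypothetical.

```python
def pipelined_clock_period(stage_delays, latch_overhead):
    """The clock must accommodate the slowest pipestage plus the latch delay."""
    return max(stage_delays) + latch_overhead


def steady_state_speedup(stage_delays, latch_overhead):
    """Unpipelined instruction time divided by the pipelined clock period."""
    unpipelined_time = sum(stage_delays)
    return unpipelined_time / pipelined_clock_period(stage_delays, latch_overhead)


def equal_split_speedup(total_time, n_stages, latch_overhead):
    """Best case for a given depth: fixed total work split into equal stages."""
    return total_time / (total_time / n_stages + latch_overhead)


# Unequal stages (assumed delays in ns): the 10 ns stages set the clock,
# so five pipestages yield less than a fivefold speedup.
print(steady_state_speedup([10, 8, 10, 10, 7], latch_overhead=1))

# Even with perfectly equal stages, latch overhead caps the benefit of depth:
# the speedup can never exceed total_time / latch_overhead.
for depth in (5, 10, 50):
    print(depth, equal_split_speedup(50, depth, latch_overhead=1))
```

Under these assumptions the five unequal stages give a speedup of 45/11, a little over four. In the equal-split case the return on each added pipestage shrinks as depth grows, because the per-stage latch overhead becomes a larger share of the clock period while the total work per instruction stays fixed.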