This invention relates to digital processors, and in particular to pipelined CPUs in digital processors.
A general purpose computer processes data by executing one or more of several predefined instructions in a particular sequence. A simple example of a computing machine is a hand-held calculator. In this machine, the predefined instructions (the instruction set) may include only the arithmetic operations of addition, subtraction, multiplication and division. The user inputs the data and the required sequence of instructions one by one, and an arithmetic calculation results.
The set of sequential instructions that a computer executes to produce a desired result is called a program. In general purpose machines with large instruction sets, the programs may be very large. Since computers execute the instructions much faster than users can input them, it is desirable to store the programs in electronic memories so that the computer can automatically read the instructions and thereby run at top speeds.
Most modern stored-program data processing systems are based on the Von Neumann model. The Von Neumann computer design is based upon three key concepts:
Data and instructions are stored in a single read-write memory.
The contents of this memory are addressable by location, without regard to the type of data contained in that location.
Execution occurs in a sequential fashion (unless explicitly modified) from one instruction to the next.
The primary circuits of the Von Neumann computer can be broadly grouped into two parts: a memory and a central processing unit (CPU). The memory holds the data and the instructions for the computer system. The CPU can be considered the brain of the system. It contains electronic logic that sequentially fetches and executes the stored instructions.
Data in most digital computers is represented in the form of binary numbers. Each location in memory is capable of storing a binary number (the maximum size of which depends upon the type of computer system). The program or set of sequential instructions that the CPU executes is stored in a particular region of memory. An instruction may occupy more than one location in memory. The first part of each instruction is called an opcode. The opcode is a unique binary number that tells the CPU which instruction it is. Most instructions have other parts that may contain operands (data to be processed) or operand specifiers. Operand specifiers inform the CPU where to find the operands that the instruction requires. These operands may be anywhere in memory or in certain temporary memory locations inside the CPU.
In general, the CPU performs the following operations to execute an instruction:
1. Fetch an instruction from memory.
2. Decode the fetched instruction to interpret the instruction.
3. Fetch from memory any operands (data on which the instruction operates) required by the instruction.
4. Perform the operation defined by the instruction.
5. Store the results of the operation in memory for future reference.
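By way of illustration only, the five operations (instruction fetch, decode, operand fetch, execute, and store) can be sketched as a toy stored-program machine in Python. The opcodes ADD, SUB and HALT, the four-location instruction encoding, and the function name run are hypothetical and are not taken from any particular computer.

```python
# Toy stored-program machine: instructions and data share one memory (a list).
# The opcodes and the (opcode, src_a, src_b, dest) encoding are hypothetical.
ADD, SUB, HALT = 0, 1, 2

def run(memory, pc=0):
    """Execute instructions from `memory` starting at address `pc`."""
    while True:
        # 1. Fetch the instruction (four consecutive memory locations).
        opcode, src_a, src_b, dest = memory[pc:pc + 4]
        # 2. Decode: the opcode identifies the instruction.
        if opcode == HALT:
            return memory
        # 3. Fetch the operands named by the operand specifiers.
        a, b = memory[src_a], memory[src_b]
        # 4. Perform the operation defined by the instruction.
        if opcode == ADD:
            result = a + b
        elif opcode == SUB:
            result = a - b
        else:
            raise ValueError(f"unknown opcode {opcode}")
        # 5. Store the result in memory.
        memory[dest] = result
        # Sequential execution: advance to the next instruction.
        pc += 4

program = [
    ADD, 12, 13, 14,   # mem[14] = mem[12] + mem[13]
    SUB, 14, 12, 15,   # mem[15] = mem[14] - mem[12]
    HALT, 0, 0, 0,
    7, 5, 0, 0,        # data at addresses 12 to 15
]
final_memory = run(list(program))
```

Note that the same list holds both the program and its data, and that the machine distinguishes them only by how each location is used.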
Different sets of hardware (called functional units) within the CPU carry out these operations. The functional units of a CPU usually include various registers (memory elements) and an arithmetic and logic unit (ALU). The registers store temporary results and instruction operands. The ALU uses combinational logic to process the data present at its inputs. The output of the ALU depends upon the control signals provided to it, and is obtained from the input by performing an arithmetic operation or a logical (shifting or Boolean) operation. The processing in the CPU is done by channeling data from operand registers through the ALU into result registers. The data may be channeled through the ALU many times for complex instructions.
Data is transferred between the basic elements of the CPU through common buses (sets of wires that carry related signals). The data transfers are dependent on the type of instruction currently being executed and are initiated by a central controller. The CPU controller sends a sequence of control signals to the various registers of the CPU, telling the registers when to put data on the common read bus (going to the inputs of the ALU) and when to get data off the common write bus (coming out of the ALU). The CPU controller also tells the ALU what operation to perform on the data from the input to the output. In this way, the controller of the CPU may initiate a sequence of data transfers starting with fetching the instruction from main memory, fetching corresponding data, passing the data between the ALU and the various temporary storage registers, and finally writing processed data back to main memory.
The various implementations of a controller fall under two main categories: hardwired and microprogrammed. Hardwired controllers use combinational logic and some state registers to produce a sequence of control signals. These control signals depend upon the type of instruction just fetched and the result of the execution of the previous instruction. The microprogrammed controller performs the same function but uses a ROM- or RAM-controlled state machine to produce the control signals from previous state and instruction inputs.
Hardwired controllers are tailored for a particular instruction set, and the logic used to implement them becomes increasingly complex as the complexity of the instruction set increases. Microprogrammed controllers are more general purpose devices in that changes of the control store can be used to change the microinstruction flow without changing the hardwired logic. While the hardwired controllers are fast, microprogrammed controllers provide more flexibility and ease of implementation.
In the simplest implementation of a microprogrammed controller, each CPU instruction corresponds to a microflow stored in the control store. As used herein, a microflow refers to a microprogrammed subroutine. Each bit or decoded field of a microinstruction corresponds to the level of a control signal. Sequencing through a series of such microinstructions thus produces a sequence of control signals. In a microprogrammed controller, each CPU instruction invokes at least one microflow (which may be just one microinstruction long for small one-cycle CPU instructions) to generate control signals which control ALU operations and data transfers on the CPU internal buses.
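As a minimal sketch of this arrangement, a control store can be modeled as a table that maps each CPU instruction to a microflow, with each microinstruction holding one bit per control signal. The signal names, the instruction names MOVE and ADD, and the contents of the microflows below are hypothetical and chosen only for illustration.

```python
# Hypothetical control signals, one bit each in a microinstruction.
REG_TO_READ_BUS = 0b001    # a register drives the common read bus
ALU_ADD = 0b010            # the ALU adds the data at its inputs
WRITE_BUS_TO_REG = 0b100   # a register latches the common write bus

SIGNAL_NAMES = {
    REG_TO_READ_BUS: "REG_TO_READ_BUS",
    ALU_ADD: "ALU_ADD",
    WRITE_BUS_TO_REG: "WRITE_BUS_TO_REG",
}

# Control store: each CPU instruction maps to a microflow, that is, a list of
# microinstructions executed one per cycle.
CONTROL_STORE = {
    "MOVE": [REG_TO_READ_BUS | WRITE_BUS_TO_REG],          # one-cycle microflow
    "ADD": [REG_TO_READ_BUS, ALU_ADD, WRITE_BUS_TO_REG],   # three-cycle microflow
}

def control_signals(instruction):
    """Yield, cycle by cycle, the names of the signals asserted for `instruction`."""
    for microinstruction in CONTROL_STORE[instruction]:
        yield [name for bit, name in SIGNAL_NAMES.items() if microinstruction & bit]

move_cycles = list(control_signals("MOVE"))
add_cycles = list(control_signals("ADD"))
```

Changing the behavior of an instruction in this model requires only editing its entry in CONTROL_STORE, which mirrors the flexibility attributed to microprogrammed controllers above.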
Computers are often classified into complex instruction set computers (CISCs) and reduced instruction set computers (RISCs) on the basis of the instruction sets that their CPUs support. CISCs commonly have a large instruction set with a large variety of instructions, while RISCs typically have a relatively small set of simple instructions. Since RISC CPUs have a few simple instructions, they can afford to use the fast hardwired controllers. CISC CPUs usually use microprogrammed controllers because of ease of implementation. Some CPUs may use a plurality of controllers, both hardwired and microprogrammed, to control various subsections of the CPU.
Since a machine operation may depend on the completion of a previous machine operation, the functional units operate on instructions sequentially. As a result, in a simple computer design, each functional unit is only being used for a fraction of the duration of the instruction execution.
The iterative fetch and execute scheme of the Von Neumann machine has been modified in many ways to produce faster computers. One such architectural modification is a technique known as pipelining. Pipelining significantly increases CPU performance by overlapping execution of several instructions in the CPU. In a pipelined architecture, different functional units process different instructions simultaneously.
An example of a pipelined CPU is described by Sudhindra N. Mishra in "The VAX 8800 Microarchitecture," Digital Technical Journal, Feb. 1987, pp. 20-33.
Pipeline processing is like an assembly line where assembly of many items happens simultaneously, but at any time each item is at a different stage of the assembly process. Pipelining allows overlapped execution of several instructions, thereby increasing the effective rate at which instructions complete (the throughput), even though the time needed to execute any individual instruction is not reduced.
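The throughput gain can be illustrated with a simple cycle count. The sketch below assumes an idealized pipeline in which every stage takes exactly one cycle and no stalls occur; the function names and the figures of 5 stages and 100 instructions are illustrative assumptions only.

```python
def unpipelined_cycles(stages, instructions):
    """Without overlap, each instruction occupies the CPU for `stages` cycles."""
    return stages * instructions

def pipelined_cycles(stages, instructions):
    """With overlap, the first instruction takes `stages` cycles to pass through
    the pipeline, and each later instruction completes one cycle after the
    instruction ahead of it."""
    if instructions == 0:
        return 0
    return stages + instructions - 1

# 100 instructions on a 5-stage machine: 500 cycles without pipelining,
# 104 cycles with it.
speedup = unpipelined_cycles(5, 100) / pipelined_cycles(5, 100)
```

Each instruction still spends 5 cycles in the machine in both cases; only the rate at which instructions complete improves.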
Since each functional unit can handle only one instruction at a time, it is necessary that all functional units advance the instructions that they are processing in a synchronized manner. Unlike the assembly line analogy, however, functional units in a pipelined computer may require variable amounts of time depending upon the instruction they are processing. If one of the functional units takes a long time to perform its function on a particular instruction, all the functional units that follow in the pipeline must wait for it to finish before they can advance their respective instructions. This results in a pipeline stall. Pipeline stalls can also occur if a particular instruction needs the results of the previous instruction. The instruction that needs the results may stall the pipeline starting at the operand fetch unit, waiting for the previous instruction to produce the operands that the stalled instruction requires.
In known RISC systems, most instructions use the various CPU functional units for equal amounts of time. Pipelining in RISCs can thus be accomplished by overlapping the execution of CPU instructions, as described above. On the other hand, some CISC instructions can be quite complex, requiring long periods of time to execute, while other CISC instructions may be relatively simple and require much less time to execute. The disparity in functional unit usage among various CISC instructions would make the CISC pipeline stall often and for relatively long periods of time. For this reason, the pipelining of CISC CPU instructions is more difficult.
Various CISC instructions may have different sizes of microflows. Since each microinstruction provides control signals for one cycle to all elements of the various functional units, in some CISC machines the microinstructions are pipelined instead of the CPU instructions (as commonly done in RISC machines). This reduces stalling because the time of execution of each microinstruction is the same. In a microinstruction pipeline, each stage uses a few bits in the microinstruction that correspond to the functional unit of that stage. After each functional unit has made use of the microinstruction that controlled its activity during a cycle, it passes this microinstruction to the next functional unit in the pipeline in the next cycle. The first functional unit gets a new microinstruction. In this way, the fundamental principle of pipelining--overlapped instruction execution to utilize various functional units in parallel--is realized.
A basic rule governing control of most pipelined processors is that all functional stages of the pipeline simultaneously advance their states to the next functional stage. This is the conventional pipeline advancement technique in which all stages advance or stall in lockstep. This is necessary because each functional unit transmits its processed state to the following unit while it receives a new state from the preceding unit.
In an optimal pipeline, a new instruction enters each functional unit of the pipeline every cycle. In order to sustain this rate and prevent pipeline stalls, the instructions must be free of dependencies. A dependency occurs when one instruction requires data or resources that are only available after the execution of a previous instruction, and the data or resources are not yet available. When the operands or resources become available, the dependency is resolved, the stall condition is removed, and the pipeline is allowed to advance.
In many systems, three consecutive stages of the CPU instruction pipeline or micropipeline are devoted to:
1. Accessing operands.
2. Calculating addresses or performing operations on operands.
3. Issuing memory reads of addresses just calculated, or storing in memory results just produced in the previous stage.
In such machines, dependencies may not only create stalls but also cause deadlocks. A deadlock is a state in which the pipeline freezes and can no longer advance.
Memory data may be processed by first reading the data from memory into the CPU registers, and then using a second instruction some time later to access the data in the CPU registers. If the first instruction that fetches operands and the second instruction that uses them are consecutive, a deadlock state results. While the first instruction is in the second pipeline stage (of the three stages mentioned above), where the addresses of the operands in memory are being calculated, the second instruction is in the first stage trying to access the operands. A request has not yet been made to the memory for the operands because this happens in the third stage. Since the operands are not yet available, the first stage stalls the pipeline. Because all stages stall together, the first instruction gets stuck in the second stage and does not advance to the third stage. Thus the memory request is never issued, the operands never arrive, and the second instruction, in the first stage, continues to stall the pipeline. This results in a deadlock.
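The deadlock scenario can be reproduced with a small simulation of a three-stage pipeline that advances or stalls in lockstep. This is a sketch under stated assumptions, not a model of any actual machine: the dictionary encoding of instructions, the register name "R1", the function name simulate, and the one-cycle memory latency are all hypothetical.

```python
def simulate(program, latency=1):
    """Return the cycle in which the pipeline drains, or None on deadlock.

    Each instruction is a dict with optional keys "loads" (register filled by
    a memory read issued in stage 3) and "uses" (register read in stage 1).
    """
    pending = list(program)
    stages = [None, None, None]  # [operand access, address calc, memory request]
    loaded_regs = {i["loads"] for i in program if "loads" in i}
    arrival = {}                 # register -> cycle in which its data arrives
    cycle = 0
    while pending or any(s is not None for s in stages):
        cycle += 1
        # Stage 3: issue the memory read for a load (once only).
        instr = stages[2]
        if instr is not None and "loads" in instr and instr["loads"] not in arrival:
            arrival[instr["loads"]] = cycle + latency
        # Stage 1: check whether the needed operand is available.
        instr = stages[0]
        needed = instr.get("uses") if instr is not None else None
        stalled = needed in loaded_regs and arrival.get(needed, float("inf")) > cycle
        if stalled:
            if needed not in arrival:
                # The read has not been issued, and the stall keeps the load
                # from ever reaching stage 3: the pipeline is deadlocked.
                return None
            continue  # read already issued; wait for the data to arrive
        # No stall: all stages advance together (lockstep).
        stages = [pending.pop(0) if pending else None, stages[0], stages[1]]
    return cycle

load = {"loads": "R1"}
use = {"uses": "R1"}
nop = {}
consecutive = simulate([load, use])      # load is stuck in stage 2
separated = simulate([load, nop, use])   # load reaches stage 3 in time
```

In this model the consecutive pair deadlocks, while the same pair separated by one unrelated instruction only stalls briefly and then drains.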
There are a number of ways to solve the deadlock problem. The first solution is to avoid the problem by requiring that there be at least one instruction between a memory read and the instruction that accesses the data. This is not practical since unprivileged users, who can be either inexperienced or malicious, may create instruction streams that can cause a deadlock in the machine.
In certain pipeline designs, an instruction is never injected into the pipeline unless it will complete without stalling. Such pipelines are commonly referred to as issue-oriented pipelines. In this type of pipeline, the hardware detects the conflict between first and second instructions, and does not allow the second instruction to enter the pipeline until the memory reference in the first one is started.
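A minimal sketch of such issue-time checking follows, assuming the three-stage arrangement described above, in which an instruction reaches the memory request stage two cycles after it enters the pipeline. The instruction encoding, the register name "R1", and the function name issue_schedule are hypothetical.

```python
def issue_schedule(program):
    """Return the cycle in which each instruction is allowed to enter stage 1.

    Each instruction is a dict with optional keys "loads" (register filled by
    a memory read started in stage 3) and "uses" (register read in stage 1).
    """
    read_started = {}  # register -> cycle in which its load reaches stage 3
    schedule = []
    next_free = 1      # earliest cycle in which stage 1 is free
    for instr in program:
        cycle = next_free
        reg = instr.get("uses")
        if reg in read_started:
            # Hold the instruction back until the memory reference has started.
            cycle = max(cycle, read_started[reg])
        schedule.append(cycle)
        if "loads" in instr:
            # Stage 3 is two stages after stage 1, and nothing stalls in flight.
            read_started[instr["loads"]] = cycle + 2
        next_free = cycle + 1
    return schedule

load = {"loads": "R1"}
use = {"uses": "R1"}
nop = {}
back_to_back = issue_schedule([load, use])   # use is held until cycle 3
spaced = issue_schedule([load, nop, use])    # no delay is needed
```

The dependent instruction is delayed at the pipeline entrance rather than inside the pipeline, so the load always reaches the memory request stage and the deadlock described above does not arise.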
This invention provides another, more efficient solution to the pipeline deadlock problem.