A general-purpose computer processes data by executing one or more of several predefined instructions in a particular sequence. An example of a computing machine is a hand-held calculator. In this machine, the predefined instructions (the instruction set) may include only the arithmetic operations of addition, subtraction, multiplication and division. Data and the required sequence of instructions are input by the user one by one, and an arithmetic calculation results.
The set of sequential instructions that a computer executes to produce a desired result is called a program. In general-purpose machines with large instruction sets, the programs may be very large. Since computers execute the instructions much faster than users can input them, it is desirable to store the programs in electronic memories so that the computer can automatically read the instructions and thereby run at top speed.
Most modern stored-program data processing systems are based on the Von Neumann model. The Von Neumann computer design is based upon three key concepts:
1. Data and instructions are stored in a single read-write memory.
2. The contents of this memory are addressable by location, without regard to the type of data contained in that location.
3. Execution occurs in a sequential fashion (unless explicitly modified) from one instruction to the next.
The primary circuits of the Von Neumann computer can be broadly grouped into two parts: a memory and a Central Processing Unit (CPU). The memory holds the data and the instructions for the computer system. The CPU can be considered the brain of the system. It contains electronic logic that sequentially fetches and executes the stored instructions.
Data in most digital computers is represented in the form of binary numbers. Each location in memory is capable of storing a binary number (the maximum size of which depends upon the type of computer system). The program, or set of sequential instructions that the CPU executes, is also stored in memory, and each instruction may occupy more than one location in memory. The first part of each instruction is called an opcode. The opcode is a unique binary number that tells the CPU which instruction it is. Most instructions have other parts that may contain operands (data to be processed) or operand specifiers. Operand specifiers inform the CPU where to find the operands that the instruction requires. These operands may be anywhere in memory or in certain temporary memory locations inside the CPU.
In general, the CPU performs the following operations to execute an instruction:
1. Fetch an instruction from memory.
2. Decode the fetched instruction to interpret the instruction.
3. Fetch from memory any operands (data on which the instruction operates) required by the instruction.
4. Perform the operation defined by the instruction.
5. Store the results of the operation in memory for future reference.
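The five operations listed above can be illustrated with a minimal sketch of a toy stored-program machine. The opcodes, the two-location instruction encoding (an opcode followed by an operand specifier holding a memory address), and the single accumulator register are assumptions made for illustration only; they do not correspond to any particular instruction set.

```python
# Toy stored-program machine. The opcodes, encoding, and memory layout
# are hypothetical and chosen only to illustrate the instruction cycle.
LOAD, ADD, STORE, HALT = 1, 2, 3, 0

def run(memory):
    """Execute instructions held in `memory` until HALT; return the memory."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # accumulator: a temporary storage register inside the CPU
    while True:
        opcode = memory[pc]           # 1. fetch the instruction
        if opcode == HALT:            # 2. decode the opcode
            return memory
        address = memory[pc + 1]      # operand specifier: where the operand is
        pc += 2                       # sequential execution: next instruction
        if opcode == LOAD:
            acc = memory[address]     # 3. fetch the operand from memory
        elif opcode == ADD:
            acc += memory[address]    # 3. fetch operand, 4. perform operation
        elif opcode == STORE:
            memory[address] = acc     # 5. store the result in memory
        else:
            raise ValueError(f"unknown opcode {opcode}")

# Instructions occupy locations 0-7; data occupies locations 8-10.
program = [LOAD, 8, ADD, 9, STORE, 10, HALT, 0, 2, 3, 0]
result = run(list(program))
```

In this sketch, instructions and data share one memory and are distinguished only by how the CPU uses them, in keeping with the Von Neumann concepts described earlier. After the run, location 10 holds the sum of the values in locations 8 and 9.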
Different sets of hardware (called functional units) within the CPU carry out these operations. The functional units of a CPU may contain various registers (memory elements) and arithmetic and logic units (ALUs). The registers store temporary results and instruction operands (data on which an instruction operates). The ALU uses combinatorial logic to process the data present at its inputs. The output of the ALU depends upon the control signals provided to it, and is obtained from the input by performing an arithmetic operation or a logical (shifting or boolean) operation. The processing in the CPU is done by channeling data from operand registers through the ALU into result registers. The data may be channeled through the ALU many times for complex instructions.
Data is transferred between the basic elements of the CPU through common busses (sets of wires that carry related signals). The data transfers are dependent on the type of instruction currently being executed and are initiated by a central controller. The CPU controller sends a sequence of control signals to the various registers of the CPU, telling the registers when to put data on the common read bus (going to the inputs of the ALU) and when to get data off the common write bus (coming out of the ALU). The CPU controller also tells the ALU what operation to perform on the data from the input to the output. In this way, the controller of the CPU may initiate a sequence of data transfers starting with fetching the instruction from main memory, fetching corresponding data, passing the data between the ALU and the various temporary storage registers, and finally writing processed data back to main memory.
The various implementations of a CPU controller fall under two main categories: hardwired and microprogrammed. Hardwired controllers use combinatorial logic and some state registers to produce a sequence of control signals. These control signals depend upon the type of instruction just fetched and the result of the execution of the previous instruction. The microprogrammed controller performs the same function but uses a ROM or RAM controlled state machine to produce the control signals from previous state and instruction inputs.
Hardwired controllers are tailored for a particular instruction set, and the logic used to implement them becomes more complex as the size of the instruction set increases. Microprogrammed controllers are more general-purpose devices, in that changes in the contents of the control store can be used to change the microinstruction flow, without changing the hardwired logic. While the hardwired controllers are fast, microprogrammed controllers provide more flexibility and ease of implementation.
In the simplest implementation of a microprogrammed CPU controller, each CPU instruction corresponds to a micro-flow stored in the control store. As used herein, a micro-flow refers to a micro-programmed subroutine. Each bit or decoded field of a micro-instruction corresponds to the level of a control signal. Sequencing through a series of such microinstructions thus produces a sequence of control signals. In a microprogrammed controller, each CPU instruction invokes at least one micro-flow (which may be just one micro-instruction long for small one cycle CPU instructions) to generate control signals which control ALU operations and data transfers on the CPU internal busses.
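The relationship between CPU instructions, micro-flows and control signals can be sketched as follows. The signal names, bit assignments, and the two example instructions are hypothetical; the sketch only shows how sequencing through the micro-instructions of a micro-flow yields one set of control signal levels per cycle.

```python
# Hypothetical control signals, one bit each in a micro-instruction word.
MEM_READ         = 0b1000  # start a read from main memory
REG_TO_READ_BUS  = 0b0001  # an operand register drives the common read bus
ALU_ADD          = 0b0010  # the ALU performs an addition
WRITE_BUS_TO_REG = 0b0100  # a result register latches the common write bus

SIGNAL_NAMES = {
    MEM_READ: "MEM_READ",
    REG_TO_READ_BUS: "REG_TO_READ_BUS",
    ALU_ADD: "ALU_ADD",
    WRITE_BUS_TO_REG: "WRITE_BUS_TO_REG",
}

# Control store: each CPU instruction maps to a micro-flow, that is, a
# list of micro-instructions executed one per cycle.
CONTROL_STORE = {
    # A small one-cycle instruction needs a single micro-instruction.
    "ADD_REG": [REG_TO_READ_BUS | ALU_ADD | WRITE_BUS_TO_REG],
    # A more complex instruction needs a longer micro-flow.
    "ADD_MEM": [MEM_READ, REG_TO_READ_BUS | ALU_ADD, WRITE_BUS_TO_REG],
}

def control_signals(instruction):
    """Yield, cycle by cycle, the set of signals asserted for `instruction`."""
    for micro_instruction in CONTROL_STORE[instruction]:
        yield {name for bit, name in SIGNAL_NAMES.items()
               if micro_instruction & bit}

add_reg_cycles = list(control_signals("ADD_REG"))
add_mem_cycles = list(control_signals("ADD_MEM"))
```

Changing the behavior of an instruction in this sketch requires only editing `CONTROL_STORE`, which mirrors the flexibility attributed above to microprogrammed controllers.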
Computers are often classified into complex instruction set computers (CISCs) and reduced instruction set computers (RISCs) on the basis of the instruction sets that their CPUs support. CISCs commonly have a large instruction set with a large variety of instructions, while RISCs typically have a relatively small set of simple instructions. Since RISC CPUs have a few simple instructions, they can afford to use the fast hardwired controllers. CISC CPUs usually use microprogrammed controllers because of ease of implementation.
The simple configuration of data processing computers specified in the Von Neumann model of computation is frequently subject to enhancements in an effort to increase the computer's efficiency and usefulness. One such enhancement is the use of "virtual memory" techniques that allow programs to address more instruction and data memory space than is physically available. The portions of program or data that are not currently in use are stored in disk storage and are transferred when needed into physical memory. This loading of pages from disk when a nonresident memory location is accessed (i.e. when a "page fault" occurs) is called "demand paging."
In systems having virtual memory, a high speed associative memory called a "translation lookaside buffer" or "TLB" is often used to quickly translate virtual addresses into their physical memory address equivalents. The translation buffer caches the most recently used virtual-to-physical address translations. If a desired translation is not present in the translation buffer (i.e. a TLB "miss"), the translation process must halt, and so must the instruction which requested the faulting memory access. The desired translation is then read from a slower translation table in memory (which may itself be initially non-resident) and the translation loaded into the TLB. The construction and operation of the translation buffer is further described in Levy & Eckhouse, Jr., Computer Programming and Architecture--The VAX-11, Digital Equipment Corporation (1980) pp. 358-359.
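A minimal sketch of the lookup sequence described above is given below. The page size, the buffer capacity, the dictionary used as the in-memory translation table, and the least-recently-used replacement policy are all assumptions made for illustration; they are not taken from the VAX-11 or from any other particular machine.

```python
from collections import OrderedDict

PAGE_SIZE = 512  # assumed page size in bytes

class PageFault(Exception):
    """Raised when the addressed page is not resident in physical memory."""

class TranslationBuffer:
    def __init__(self, page_table, capacity=4):
        # Slower translation table in memory: virtual page -> physical
        # frame number, or None if the page is not resident.
        self.page_table = page_table
        self.capacity = capacity
        self.entries = OrderedDict()  # recently used translations
        self.misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:                  # TLB hit
            self.entries.move_to_end(page)
        else:                                     # TLB miss
            self.misses += 1
            frame = self.page_table.get(page)     # read the slower table
            if frame is None:
                raise PageFault(page)             # demand paging is needed
            self.entries[page] = frame            # load translation into TLB
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[page] * PAGE_SIZE + offset

# Pages 0 and 1 are resident; page 2 is on disk.
tlb = TranslationBuffer({0: 7, 1: 3, 2: None})
first = tlb.translate(10)    # miss: translation read from the page table
second = tlb.translate(20)   # hit: same page, translation already cached
```

A reference to page 2 in this sketch raises `PageFault`, which corresponds to the case in which the requesting instruction cannot proceed until the page has been brought in from disk.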
Another enhancement technique frequently applied to the basic Von Neumann model of computation is directed not at the memory configuration, but at the execution scheme employed by the processing unit. The proven architectural modification of "pipelining" can significantly increase CPU throughput by overlapping the execution of several instructions in the CPU, thus engaging each functional unit in productive work for a greater overall percentage of time. In a pipelined CPU, the multiple functional units concurrently execute the basic constituent segments of a plurality of CPU instructions.
An example of a pipelined CPU is described by Sudhindra N. Mishra in "The VAX 8800 Microarchitecture," Digital Technical Journal, February 1987, pp. 20-33.
Since each functional unit can handle only one instruction at a time, it is necessary that all functional units in a pipeline advance the instructions that they are processing in a synchronized manner. Unlike the stations of an assembly line, however, the functional units in a pipelined computer may require variable amounts of time depending upon the instruction that they are currently processing. If one of the functional units takes a long time to perform its function on a particular instruction, all the functional units that follow in the pipeline must wait for it to finish before they can advance their respective instructions to the next phase of the pipeline. This delay for the purpose of maintaining synchronization is known as a pipeline "stall". Pipeline stalls can also occur if a particular instruction needs to use results of a previous instruction in the pipeline which has not completed execution. The instruction that needs the results may stall the pipeline starting at the operand fetch unit, waiting for the previous instruction to pass through the pipeline and produce the operand that the stalled instruction requires.
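The cost of a synchronization stall can be estimated with a simplified lockstep model, sketched below. The model assumes that every stage advances only when the slowest busy stage has finished, and it ignores stalls caused by data dependencies between instructions; the stage counts and times are arbitrary illustrative values.

```python
def pipelined_time(stage_times):
    """Total time to run instructions through a lockstep pipeline.

    stage_times[i][s] is the time instruction i needs in stage s.
    At each step, instruction i occupies stage (step - i). All stages
    advance together, so a step lasts as long as its slowest busy stage.
    """
    n = len(stage_times)
    stages = len(stage_times[0])
    total = 0
    for step in range(n + stages - 1):
        busy = [stage_times[i][step - i]
                for i in range(n) if 0 <= step - i < stages]
        total += max(busy)
    return total

# Four instructions, three stages, every stage takes one time unit.
uniform = [[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]
# Same program, but the second instruction needs four units in stage 1.
one_slow = [[1, 1, 1], [1, 4, 1], [1, 1, 1], [1, 1, 1]]

uniform_total = pipelined_time(uniform)
stalled_total = pipelined_time(one_slow)
```

In this model the single slow instruction lengthens one pipeline step from one time unit to four, so the whole program takes three units longer even though only one functional unit had extra work to do.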
In known RISC systems, most instructions use the various CPU functional units for equal amounts of time. Pipelining in RISCs can thus be accomplished by overlapping the execution of the simple CPU instructions, as described above. On the other hand, some CISC instructions can be quite complex, requiring numerous CPU register/ALU transfers and long periods of time to execute. Other CISC instructions may be relatively simple and require fewer transfers and much less time to execute. The disparity in functional unit usage among various CISC instructions would make a CISC instruction pipeline stall often and for relatively long periods of time. For this reason, the pipelining of CISC CPU instructions is more difficult.
CISC instructions of varying complexity may have correspondingly different sizes of microflows. Since each microinstruction provides the lowest-level control signals for one CPU cycle to all elements of the various functional units, in some CISC machines the execution of microinstructions is pipelined instead of the CPU instructions. This reduces stalling because the time of execution of each microinstruction is more nearly the same. In a microinstruction pipeline, each stage uses a few bits in the microinstruction that correspond to the functional unit of that stage. After each functional unit is done with the microinstruction that controlled its activity during a cycle, it passes this microinstruction to the next functional unit in the pipeline for the next cycle. The first functional unit gets a new microinstruction each cycle. In this way, the fundamental principle of pipelining--the overlapped instruction execution to utilize the various functional units in parallel--is realized.
CPUs which incorporate the above-mentioned and other enhancements are fine-tuned to execute typical instruction sequences. Thus the typical, or most frequent, sequences execute quickly. Atypical instruction sequences result in unusual conditions called pipeline "exceptions" which may force the CPU to change the flow of program execution. Depending on the instruction architecture, exceptions called "faults" may arise in the middle of execution of an instruction. In a computer system having virtual memory, for example, a "page fault" will occur during instruction operand fetching when the addressed operand does not reside in physical memory. In this case the current instruction cannot be completed, but it is desirable to use the CPU itself to carry out the demand paging to bring the desired operand from disk to physical memory.
Exceptions, as described above, are infrequent and must not degrade performance of the typical case. The pipe stage logic must detect exceptions, but may be freed from the burden of correcting exceptional conditions. Once an exception has been detected during the normal operation of the pipeline, the processor must employ either hardware or software means to remedy the faulting condition, then cause the pipeline to resume normal operation. While the actual exception handling tends not to be performance-critical, the process must recover smoothly, leaving the pipeline ready for efficient subsequent instruction execution.
If the processor responds to exceptional conditions asynchronously (i.e. some time after the condition has already passed), the exception handling software or hardware must not only remedy the faulting condition, but also correct any erroneous results or CPU state information written during the time after the exception was detected and before it was resolved. Such backtracking or "rewinding" tends to be inefficient, since it delays the resumption of normal operation of the CPU. Furthermore, with asynchronous exception handling, the faulting instruction must be re-executed upon restarting the pipeline, so that the operation it specifies can be successfully carried out.
If the pipelined CPU detects and responds to exceptional conditions synchronously (i.e. at the time the exceptional condition arises), the pipeline can be halted before any inaccurate instruction results are written. If the writing of results by the exception handler is prohibited, normal pipeline operation can be resumed as soon as the exception handler has fixed the faulting condition, again starting with the re-execution of the faulting instruction.
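The synchronous case can be sketched in software as follows. This is an illustrative model only, not a description of any particular hardware: the instruction format, the set standing in for physical memory, and the log are hypothetical. The fault is detected before the instruction writes its result, the handler remedies the condition, and execution resumes by restarting the faulting instruction.

```python
class PageFault(Exception):
    def __init__(self, page):
        super().__init__(page)
        self.page = page

def execute(instruction, resident, registers, log):
    """Run one instruction; a fault is raised before any result is written."""
    dest, page, value = instruction  # hypothetical instruction format
    log.append(("start", dest))
    if page not in resident:         # operand fetch detects the fault
        raise PageFault(page)
    registers[dest] = value          # result written only on success
    log.append(("write", dest))

def run(program, resident):
    registers, log = {}, []
    pc = 0
    while pc < len(program):
        try:
            execute(program[pc], resident, registers, log)
        except PageFault as fault:
            resident.add(fault.page)  # handler: stands in for demand paging
            continue                  # restart with the faulting instruction
        pc += 1
    return registers, log

# The second instruction refers to page 5, which is not yet resident.
registers, log = run([("r1", 0, 10), ("r2", 5, 20)], resident={0})
```

Because nothing is written before the fault is raised, no rewinding is needed in this sketch. The log nevertheless records two "start" entries for the faulting instruction, which is the repeated execution of the faulting instruction that re-execution entails.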
It is desirable to employ a method of implementing recoverable pipeline exceptions which minimizes the amount of additional logic required to process the exception, and which allows for fast resumption of normal pipeline flow. In addition, it would be beneficial to eliminate the redundancy of executing the faulting instruction twice.