1. Field of the Invention
This invention is directed to pipelined digital computers, and more particularly to the pipelining of register information between instruction decoding and instruction execution. The invention specifically relates to the logging of information about changes made to the contents of general purpose registers by decoded but not yet executed instructions, so that the contents of the registers can be restored when an exception occurs.
2. Description of the Background Art
A large part of the existing software base, representing a vast investment in writing code, database structures and personnel training, is for complex instruction set or CISC type processors. These types of processors are characterized by having a large number of instructions in their instruction set, often including memory-to-memory instructions with complex memory accessing modes. The instructions are usually of variable length, with simple instructions being only perhaps one byte in length, but the length ranging up to dozens of bytes. The VAX (Trademark) instruction set by Digital Equipment Corporation is a primary example of CISC and employs instructions having one to two byte opcodes plus from zero to six operand specifiers, where each operand specifier is from one byte to many bytes in length. The size of the operand specifier depends upon the addressing mode, size of displacement (byte, word or longword), etc. The first byte of the operand specifier describes the addressing mode for that operand, while the opcode defines the number of operands: from zero to six. When the opcode itself is decoded, however, the total length of the instruction is not yet known to the processor because the operand specifiers have not yet been decoded. Another characteristic of VAX (Trademark) instructions is the use of byte or byte string memory references, in addition to quadword or longword references; that is, a memory reference may be of a length variable from one byte to multiple words, including unaligned byte references.
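The dependence of the instruction length on operand specifier decoding can be illustrated with a minimal Python sketch. The opcode table and the subset of addressing modes shown are simplified assumptions chosen for illustration, not a complete or authoritative VAX (Trademark) decoder, and the name `instruction_length` is hypothetical.

```python
# Illustrative assumption: opcode byte -> number of operand specifiers.
OPERAND_COUNT = {0x01: 0, 0xD0: 2, 0xC1: 3}

# Illustrative assumption: high nibble of the first specifier byte
# (the addressing mode) -> number of bytes that follow that first byte.
EXTRA_BYTES = {
    0x5: 0,  # register
    0x6: 0,  # register deferred
    0xA: 1,  # byte displacement
    0xC: 2,  # word displacement
    0xE: 4,  # longword displacement
}


def instruction_length(code, start=0):
    """Return the length in bytes of the instruction beginning at `start`.

    The opcode only gives the operand count; the length is known only
    after every operand specifier has been examined in turn.
    """
    pos = start
    operand_count = OPERAND_COUNT[code[pos]]
    pos += 1  # step past the one-byte opcode
    for _ in range(operand_count):
        mode = code[pos] >> 4              # addressing mode of this specifier
        pos += 1 + EXTRA_BYTES[mode]       # mode byte plus any displacement
    return pos - start
```

In this sketch, a two-operand instruction with two register specifiers occupies three bytes, while the same opcode with one byte-displacement specifier occupies four, which is why the decoder cannot locate the next instruction from the opcode alone.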
The variety of powerful instructions, memory accessing modes and data types available in a variable-length CISC instruction architecture should result in more work being done for each line of code (actually, compilers do not produce code taking full advantage of this). Whatever gain in compactness of source code is achieved comes at the expense of execution time. Particularly as pipelining of instruction execution has become necessary to achieve the performance levels presently demanded of systems, the data or state dependencies of successive instructions, and the vast differences in memory access time vs. machine cycle time, produce excessive stalls and exceptions, slowing execution.
When CPUs were much faster than memory, it was advantageous to do more work per instruction, because otherwise the CPU would always be waiting for the memory to deliver instructions--this factor led to more complex instructions that encapsulated what would be otherwise implemented as subroutines. When CPU and memory speed become more balanced, the advantages of complex instructions are lessened, assuming the memory system is able to deliver one instruction and some data in each cycle. Hierarchical memory techniques, as well as faster access cycles, and greater memory access bandwidth, provide these faster memory speeds. Another factor that has influenced the choice of complex vs. simple instruction type is the change in relative cost of off-chip vs. on-chip interconnection resulting from VLSI construction of CPUs. Construction on chips instead of boards changes the economics--first it pays to make the architecture simple enough to be on one chip, then more on-chip memory is possible (and needed) to avoid going off-chip for memory references. A further factor in the comparison is that adding more complex instructions and addressing modes as in a CISC solution complicates (thus slows down) stages of the instruction execution process. The complex function might make the function execute faster than an equivalent sequence of simple instructions, but it can lengthen the instruction cycle time, making all instructions execute slower; thus an added function must increase the overall performance enough to compensate for the decrease in the instruction execution rate.
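The trade-off described above can be made concrete with a small Python sketch. It assumes the common simplification that execution time is the product of instruction count, cycles per instruction and cycle time; the function names and all numbers are hypothetical and serve only to illustrate the comparison.

```python
def execution_time(instruction_count, cycle_time_ns, cycles_per_instruction=1):
    """Total run time in nanoseconds under a simple multiplicative model."""
    return instruction_count * cycles_per_instruction * cycle_time_ns


def complex_instruction_pays_off(base_count, base_cycle_ns,
                                 new_count, new_cycle_ns):
    """True if the added function shortens total run time.

    `new_count` is the (smaller) instruction count once the complex
    instruction replaces a sequence of simple ones; `new_cycle_ns` is the
    (longer) cycle time that every instruction now pays.
    """
    return (execution_time(new_count, new_cycle_ns)
            < execution_time(base_count, base_cycle_ns))
```

With these assumed figures, a program of 1000 instructions at a 10 ns cycle is not improved by a change that removes 50 instructions but stretches the cycle to 11 ns, whereas removing 100 instructions at the same 11 ns cycle does yield a net gain.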
Despite the performance factors that detract from the theoretical advantages of CISC processors, the existing software base as discussed above provides a long-term demand for these types of processors, and of course the market requires ever increasing performance levels. Business enterprises have invested many years of operating background, including operator training as well as the cost of the code itself, in applications programs and data structures using the CISC type processors which were the most widely used in the past ten or fifteen years. The expense and disruption of operations to rewrite all of the code and data structures to accommodate a new processor architecture may not be justified, even though the performance advantages ultimately expected to be achieved would be substantial. Accordingly, the basic objective is to provide high-level performance in a CPU which executes an instruction set of the type using variable length instructions and variable data widths in memory accessing.
The typical pipelined digital computer for executing variable-length CISC instructions has three main parts: the I-box or instruction unit, which fetches and decodes instructions; the E-box or execution unit, which performs the operations defined by the instructions; and the M-box or memory management unit, which handles memory and I/O functions. An example of such a digital computer system is shown in U.S. Pat. No. 4,875,160, issued Oct. 17, 1989 to John F. Brown and assigned to Digital Equipment Corporation. Such a machine is constructed using a single-chip CPU device, clocked at very high rates, and is microcoded and pipelined.
Theoretically, if the pipeline can be kept full and an instruction issued every cycle, a processor can execute one instruction per cycle. Toward this goal, macroinstruction pipelining is employed (instead of microinstruction pipelining), so that a number of macroinstructions can be at various stages of the pipeline at a given time. Queuing is provided between units of the CPU so that there is some flexibility in instruction execution times; the execution of stages of one instruction need not always wait for the completion of these stages by a preceding instruction. Instead, the information produced by one stage can be queued until the next stage is ready. But data dependencies still create bubbles in the pipeline, as results generated by one instruction but not yet available are needed by a subsequent instruction which is ready to execute. In addition, it is sometimes necessary to "flush" the pipeline to remove information about a macroinstruction when an exception occurs for that macroinstruction or when the macroinstruction is in a predicted branch path for a prediction which is found to be incorrect.
When coupled with software, precise exception reporting can provide a robust and reliable environment for the computer programmer. Higher-level features such as demand paging and arithmetic exception handlers can be built on top of the exception architecture. The overlapped execution of instructions in a pipelined processor, however, makes precise exception reporting difficult. Many implementations choose to define an architectural "commit point", generally the point at which architectural state actually is modified, and exception conditions are synchronized to that point. Architectural state can be modified before the commit point as long as the changes are archived so that they can be "backed out" in the event of an exception. Many instructions can be decoded ahead of instruction execution using a history table or register log (RLOG) to archive the architectural changes due to operand processing.
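The idea of archiving early register changes so that they can be backed out may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the mechanism of any particular machine: the class name `RegisterLog`, its methods, and the choice of sixteen registers are hypothetical, and each change is assumed to be a signed amount added to a general purpose register during operand processing.

```python
class RegisterLog:
    """Toy register log: records each early register change for back-out."""

    def __init__(self, num_regs=16):
        self.gprs = [0] * num_regs
        self.log = []  # (register number, signed change), oldest first

    def modify(self, reg, delta):
        """Change a register ahead of the commit point and archive the change."""
        self.gprs[reg] += delta
        self.log.append((reg, delta))

    def retire_oldest(self, count):
        """Discard the oldest `count` entries once their instruction commits."""
        del self.log[:count]

    def back_out(self):
        """On an exception, undo every archived change, newest first."""
        while self.log:
            reg, delta = self.log.pop()
            self.gprs[reg] -= delta
```

For example, if register 1 holds 100 and two decoded instructions each add 4 to it, `back_out()` returns it to 100; if the first instruction has already been retired with `retire_oldest(1)`, only the second change is undone.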
History tables are typically organized in a manner that maintains instruction independence so that data pertinent to a particular instruction can be removed when the instruction is retired. The table allocates storage space to record the maximum number of state changes per instruction. Such a scheme is used in the VAX 8700 (Trademark) digital computer manufactured by Digital Equipment Corporation of Maynard, Mass. This micropipelined design only allows one instruction ahead of execution to make architectural changes to the GPRs during instruction decode. A state change is recorded for every operand, even if the change is zero. In the event of an exception, the GPRs are simply restored from the entire history table; entries that are zero have no net effect. This scheme, however, is inefficient when extended to macropipelined implementations that allow many instructions to be decoded ahead of execution. Space is wasted for recording state changes of operands having a state change of zero, and the process of backing out of the history table becomes time consuming because the recorded changes of zero take as much time to restore as real changes. Valid bits could be added to the history table to mark the real changes and save some backing-out time, but the valid bits would add complexity to the hardware and control.
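The cost of a fixed-size history table can be seen in a short Python sketch. It assumes six slots per instruction (matching the maximum of six operand specifiers mentioned earlier) and represents each slot as a register number and a signed change; the names `record_instruction` and `restore` are hypothetical, and the sketch is not a description of the VAX 8700 (Trademark) hardware.

```python
MAX_OPERANDS = 6  # assumed number of slots reserved per instruction


def record_instruction(changes):
    """Return one table row: the real changes padded with zero changes.

    `changes` is a list of (register number, signed change) pairs.
    """
    padding = [(0, 0)] * (MAX_OPERANDS - len(changes))
    return list(changes) + padding


def restore(gprs, table):
    """Back out every slot of every row; return the number of slots visited."""
    steps = 0
    for row in reversed(table):
        for reg, delta in reversed(row):
            gprs[reg] -= delta  # a zero change costs a step but has no effect
            steps += 1
    return steps
```

In this model, four decoded instructions with a single real register change among them still occupy 24 slots, and restoring visits all 24, which illustrates why the approach scales poorly when many instructions are decoded ahead of execution.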
A register log for a macropipelined VAX (Trademark) digital computer sold by Digital Equipment Corporation is disclosed in Murray et al. U.S. Pat. No. 5,167,026 filed Feb. 3, 1989, entitled "Simultaneously or Sequentially Decoding Multiple Specifiers of a Variable Length Pipeline Instruction Based on Detection of Modified Value of Specifier Registers," corresponding to European Patent Application Pub. No. 0381469 published Aug. 8, 1990. Register numbers and associated changes are logged in entries of an RLOG queue. To permit more than one entry in the RLOG queue to be associated with each instruction, a three-bit tag is incremented modulo-six when each instruction is decoded, and the tag is appended to the microcode "fork" address for the instruction and stored with the "fork" address in an instruction queue prior to instruction execution. The tag points to one of six RLOG counters. When an entry is added to the RLOG queue for an instruction, the corresponding RLOG counter is incremented. When an instruction is retired, the RLOG counter corresponding to the instruction is reset. The number of valid entries in the RLOG queue is obtained by summing all of the values of the RLOG counters. When an exception or interrupt occurs, the RLOG entries are unwound from the RLOG queue by accessing all of the valid entries in the RLOG queue. Moreover, for unwinding from a mispredicted branch, the instruction unit and the execution unit are flushed of only the valid entries corresponding to the instructions in the mispredicted path that were just decoded but not yet executed. This is done by using a flush counter which is set to the value of the execution unit tag plus a "number to keep" which specifies the number of instructions which have been correctly decoded and whose results should be left in the queues 23.
During the restoration process, the flush counter is used to select the RLOG counters corresponding to the instructions having information to be restored and accessed for flushing. Although this RLOG scheme is workable for multiple decoded but not yet executed instructions, it requires rather complex control logic.
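The tagged-counter scheme just described may be modeled, in simplified form, by the following Python sketch. It is an interpretation under stated assumptions rather than the circuit of the cited patent: the class `TaggedRlog` and its methods are hypothetical, the decode tag is derived from the execution tag and an in-flight count instead of being kept as a separate counter, and retiring an instruction is assumed to drop its entries from the head of the queue.

```python
from collections import deque

NUM_TAGS = 6  # six RLOG counters, selected by a tag counted modulo six


class TaggedRlog:
    def __init__(self, num_regs=16):
        self.gprs = [0] * num_regs
        self.queue = deque()            # (register, signed change), oldest first
        self.counters = [0] * NUM_TAGS  # entries logged per in-flight instruction
        self.exec_tag = 0               # tag of the oldest unretired instruction
        self.in_flight = 0              # instructions decoded but not retired

    def decode(self, changes):
        """Assign the next tag and log this instruction's register changes."""
        assert self.in_flight < NUM_TAGS, "no free tag"
        tag = (self.exec_tag + self.in_flight) % NUM_TAGS
        self.in_flight += 1
        for reg, delta in changes:
            self.gprs[reg] += delta
            self.queue.append((reg, delta))
            self.counters[tag] += 1
        return tag

    def retire(self):
        """Oldest instruction completes: its changes become permanent."""
        for _ in range(self.counters[self.exec_tag]):
            self.queue.popleft()
        self.counters[self.exec_tag] = 0
        self.exec_tag = (self.exec_tag + 1) % NUM_TAGS
        self.in_flight -= 1

    def valid_entries(self):
        """Number of valid queue entries is the sum of all counters."""
        return sum(self.counters)

    def flush(self, number_to_keep=0):
        """Unwind entries of all but the oldest `number_to_keep` instructions.

        number_to_keep=0 models an exception (every valid entry is unwound);
        a larger value models recovery from a mispredicted branch.
        """
        start = (self.exec_tag + number_to_keep) % NUM_TAGS  # "flush counter"
        total = 0
        for i in range(self.in_flight - number_to_keep):
            tag = (start + i) % NUM_TAGS
            total += self.counters[tag]
            self.counters[tag] = 0
        for _ in range(total):          # undo newest changes first
            reg, delta = self.queue.pop()
            self.gprs[reg] -= delta
        self.in_flight = number_to_keep
```

Even in this reduced model, the bookkeeping involves a tag per instruction, six counters, a summation to find the valid entries, and a separate flush counter to select which counters to unwind, which gives some sense of the control complexity noted above.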