1. Field of the Invention
This invention is directed to pipelined digital computers, and more particularly to the pipelining of register information between instruction decoding and instruction execution. The invention specifically relates to a register scoreboard scheme for preventing register access conflicts between the fetching of register operands for instruction execution and the retiring of register destination operands.
2. Description of the Background Art
A large part of the existing software base, representing a vast investment in writing code, database structures and personnel training, is for complex instruction set or CISC type processors. These types of processors are characterized by having a large number of instructions in their instruction set, often including memory-to-memory instructions with complex memory accessing modes. The instructions are usually of variable length, with simple instructions being only perhaps one byte in length, but the length ranging up to dozens of bytes. The VAX (Trademark) instruction set by Digital Equipment Corporation is a primary example of CISC and employs instructions having one to two byte opcodes plus from zero to six operand specifiers, where each operand specifier is from one byte to many bytes in length. The size of the operand specifier depends upon the addressing mode, size of displacement (byte, word or longword), etc. The first byte of the operand specifier describes the addressing mode for that operand, while the opcode defines the number of operands: one, two or three. When the opcode itself is decoded, however, the total length of the instruction is not yet known to the processor because the operand specifiers have not yet been decoded. Another characteristic of VAX (Trademark) instructions is the use of byte or byte string memory references, in addition to quadword or longword references; that is, a memory reference may be of a length variable from one byte to multiple words, including unaligned byte references.
The variety of powerful instructions, memory accessing modes and data types available in a variable-length CISC instruction architecture should result in more work being done for each line of code (in practice, compilers do not produce code taking full advantage of this). Whatever gain in compactness of source code is achieved comes at the expense of execution time. Particularly as pipelining of instruction execution has become necessary to achieve the performance levels presently demanded of systems, the data or state dependencies of successive instructions, and the vast differences in memory access time vs. machine cycle time, produce excessive stalls and exceptions, slowing execution.
When CPUs were much faster than memory, it was advantageous to do more work per instruction, because otherwise the CPU would always be waiting for the memory to deliver instructions--this factor led to more complex instructions that encapsulated what would otherwise be implemented as subroutines. When CPU and memory speeds became more balanced, the advantages of complex instructions were lessened, assuming the memory system is able to deliver one instruction and some data in each cycle. Hierarchical memory techniques, as well as faster access cycles and greater memory access bandwidth, provide these faster memory speeds. Another factor that has influenced the choice of complex vs. simple instruction type is the change in relative cost of off-chip vs. on-chip interconnection resulting from VLSI construction of CPUs. Construction on chips instead of boards changes the economics--first it pays to make the architecture simple enough to be on one chip, then more on-chip memory is possible (and needed) to avoid going off-chip for memory references. A further factor in the comparison is that adding more complex instructions and addressing modes, as in a CISC solution, complicates (and thus slows down) stages of the instruction execution process. A complex instruction might execute its function faster than an equivalent sequence of simple instructions, but it can lengthen the instruction cycle time, making all instructions execute slower; thus an added function must increase the overall performance enough to compensate for the decrease in the instruction execution rate.
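The trade-off between an added complex function and a longer cycle time can be stated with the standard processor performance relation. The symbols below (instruction count N, average cycles per instruction C, cycle time t, with primes marking the machine that includes the added function) are illustrative assumptions and do not appear in the original text.

```latex
% Execution time of a program on the baseline machine
T = N \cdot C \cdot t

% Execution time with the added complex function:
% fewer instructions (N' < N) but possibly a longer cycle (t' > t)
T' = N' \cdot C' \cdot t'

% The added function pays off only if T' < T, that is
\frac{N'}{N} \cdot \frac{C'}{C} < \frac{t}{t'}
```

Under these assumptions, a function that stretches the cycle time by 10 percent must reduce the product of instruction count and cycles per instruction to less than 1/1.1, or about 91 percent, of its former value before any net gain appears.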
Despite the performance factors that detract from the theoretical advantages of CISC processors, the existing software base as discussed above provides a long-term demand for these types of processors, and of course the market requires ever increasing performance levels. Business enterprises have invested many years of operating background, including operator training as well as the cost of the code itself, in applications programs and data structures using the CISC type processors which were the most widely used in the past ten or fifteen years. The expense and disruption of operations to rewrite all of the code and data structures to accommodate a new processor architecture may not be justified, even though the performance advantages ultimately expected to be achieved would be substantial. Accordingly, a basic objective is to provide high-level performance in a CPU which executes an instruction set of the type using variable length instructions and variable data widths in memory accessing.
A typical pipelined digital computer for executing variable-length CISC instructions has three main parts: the I-box or instruction unit, which fetches and decodes instructions; the E-box or execution unit, which performs the operations defined by the instructions; and the M-box or memory management unit, which handles memory and I/O functions. An example of such a digital computer system is shown in U.S. Pat. No. 4,875,160, issued Oct. 17, 1989 to John F. Brown and assigned to Digital Equipment Corporation. Such a machine is constructed using a single-chip CPU device, clocked at very high rates, and is microcoded and pipelined.
Theoretically, if the pipeline can be kept full and an instruction issued every cycle, a processor can execute one instruction per cycle. To this end, macroinstruction pipelining is employed (instead of microinstruction pipelining), so that a number of macroinstructions can be at various stages of the pipeline at a given time. Queuing is provided between units of the CPU so that there is some flexibility in instruction execution times; the execution of stages of one instruction need not always wait for the completion of these stages by a preceding instruction. Instead, the information produced by one stage can be queued until the next stage is ready. But data dependencies still create bubbles in the pipeline when results generated by one instruction, but not yet available, are needed by a subsequent instruction which is ready to execute. In addition, it is sometimes necessary to "flush" the pipeline to remove information about a macroinstruction when an exception occurs for that macroinstruction or when the macroinstruction is in a predicted branch path for a prediction which is found to be incorrect.
Register access conflicts in a pipelined computer are typically resolved by a register scoreboard. Each potential dependency is recorded as a single bit set when a register source operand is decoded, and another single bit set when a register destination operand is decoded. The use of a register for fetching an operand is stalled if that register is indicated as the destination for a decoded but not yet executed instruction. In a similar fashion, the modification of a register by pre-processing of an auto-mode specifier is stalled if that register is indicated as a register source operand of a not-yet executed instruction.
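The single-bit scoreboard described above can be illustrated by a minimal software sketch. It is a model for explanation only, not the circuitry of any particular machine; the class name, method names and the choice of sixteen registers are assumptions made for the example.

```python
NUM_REGS = 16  # assumption: sixteen general purpose registers


class RegisterScoreboard:
    """Illustrative model: one source bit and one destination bit per register."""

    def __init__(self):
        self.src_bits = 0  # bit r set: register r is a pending source operand
        self.dst_bits = 0  # bit r set: register r is a pending destination

    def record_decode(self, sources=(), destinations=()):
        # Set a single bit for each register operand as it is decoded.
        for r in sources:
            self.src_bits |= 1 << r
        for r in destinations:
            self.dst_bits |= 1 << r

    def must_stall_operand_fetch(self, reg):
        # Fetching from a register stalls while a decoded but not yet
        # executed instruction will write that register.
        return bool((self.dst_bits >> reg) & 1)

    def must_stall_auto_modify(self, reg):
        # Pre-processing of an auto-mode specifier may not modify a register
        # that a not-yet executed instruction still needs as a source.
        return bool((self.src_bits >> reg) & 1)

    def retire(self, sources=(), destinations=()):
        # Clear the bits when the instruction has executed.
        for r in sources:
            self.src_bits &= ~(1 << r)
        for r in destinations:
            self.dst_bits &= ~(1 << r)


sb = RegisterScoreboard()
sb.record_decode(sources=[1], destinations=[2])
print(sb.must_stall_operand_fetch(2))  # True
print(sb.must_stall_auto_modify(1))    # True
sb.retire(sources=[1], destinations=[2])
print(sb.must_stall_operand_fetch(2))  # False
```

Because each register has only one bit of each kind, this sketch cannot distinguish one pending use of a register from several. Retiring one instruction clears the bit even if another pending instruction refers to the same register.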
In high-performance pipelined computers, a set of queues is inserted between an instruction decoding unit and an instruction execution unit. Because of the queues, the use or modification of a register during the fetching or pre-processing of an operand may conflict with a plurality of decoded but not yet executed instructions. One scheme for resolving that conflict is a register scoreboard queue, as described in Murray et al., Multiple Instruction Preprocessing with Data Dependency Resolution for Digital Computer, U.S. Pat. No. 5,142,631, issued Aug. 25, 1992, corresponding to European Patent Application Pub. No. 0,380,850, published Aug. 8, 1990. In such a register scoreboard queue, a register source mask and a register destination mask are generated for each decoded instruction, the masks are loaded into a queue, a composite source mask is generated by a logical OR of all of the source masks in the queue, and a composite destination mask is generated by a logical OR of all of the destination masks in the queue. Although such a register scoreboard queue is workable for resolving conflicts for multiple decoded but not yet executed instructions, it requires more circuitry than is necessary for certain pipelined processor designs. Moreover, U.S. Pat. No. 5,142,631 does not show the handling of multiple outstanding conflicts to a single register. Instead, as described in U.S. Pat. No. 5,142,631, instruction decoding is stalled when a read-after-write conflict is detected between the mask that is generated for the instruction being decoded and any of the masks in the register scoreboard queue.
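The mask queue and composite masks described above can be sketched in software as follows. This is a simplified model for illustration, not the circuit of U.S. Pat. No. 5,142,631; the names ScoreboardQueue, mask, read_after_write_conflict and auto_modify_conflict are hypothetical.

```python
from collections import deque
from functools import reduce
from operator import or_


def mask(*regs):
    """Build a register mask with one bit set per listed register number."""
    return reduce(or_, (1 << r for r in regs), 0)


class ScoreboardQueue:
    """Illustrative queue of (source mask, destination mask) pairs."""

    def __init__(self):
        self.entries = deque()

    def push(self, src_mask, dst_mask):
        # Load the masks generated for a newly decoded instruction.
        self.entries.append((src_mask, dst_mask))

    def pop(self):
        # Remove the masks of the oldest instruction once it has executed.
        return self.entries.popleft()

    def composite_source_mask(self):
        # Logical OR of all source masks currently in the queue.
        return reduce(or_, (s for s, _ in self.entries), 0)

    def composite_destination_mask(self):
        # Logical OR of all destination masks currently in the queue.
        return reduce(or_, (d for _, d in self.entries), 0)

    def read_after_write_conflict(self, src_mask):
        # A register to be read is a destination of some queued instruction.
        return bool(src_mask & self.composite_destination_mask())

    def auto_modify_conflict(self, modified_mask):
        # A register to be modified is a source of some queued instruction.
        return bool(modified_mask & self.composite_source_mask())


q = ScoreboardQueue()
q.push(mask(1, 2), mask(3))  # reads R1 and R2, writes R3
q.push(mask(4), mask(5))     # reads R4, writes R5
print(q.read_after_write_conflict(mask(5)))  # True: R5 is a pending destination
q.pop()                                      # oldest instruction executes
print(q.read_after_write_conflict(mask(3)))  # False: R3 is no longer pending
```

In this sketch the composite masks are recomputed from the queue contents on each query, so removing one entry leaves intact any bit that another queued entry still sets for the same register. The cost is that a full pair of masks is held for every queued instruction.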