This invention is directed to digital computers, and more particularly to improved pipelined CPU devices of the type constructed as single-chip integrated circuits.
A large part of the existing software base, representing a vast investment in writing code, in establishing database structures and in personnel training, is for complex instruction set or CISC type processors. These types of processors are characterized by having a large number of instructions in their instruction set, often including memory-to-memory instructions with complex memory accessing modes. The instructions are usually of variable length, with simple instructions being only perhaps one byte in length, but the length ranging up to dozens of bytes. The VAX™ instruction set is a primary example of CISC and employs instructions having one to two byte opcodes plus from zero to six operand specifiers, where each operand specifier is from one byte to many bytes in length. The size of the operand specifier depends upon the addressing mode, size of displacement (byte, word or longword), etc. The first byte of the operand specifier describes the addressing mode for that operand, while the opcode defines the number of operands: one, two or three. When the opcode itself is decoded, however, the total length of the instruction is not yet known to the processor because the operand specifiers have not yet been decoded. Another characteristic of processors of the VAX type is the use of byte or byte string memory references, in addition to quadword or longword references; that is, a memory reference may be of a length variable from one byte to multiple words, including unaligned byte references.
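The decoding difficulty described above can be illustrated with a minimal sketch. It assumes a toy encoding, not the real VAX tables: the opcode values, the mode numbers and the function names are all hypothetical. What it shows is that the length of an instruction emerges only after each operand specifier has been examined in turn, so the start of the next instruction cannot be located any earlier.

```python
# Toy encoding (hypothetical, NOT the real VAX tables):
# opcode byte -> number of operand specifiers that follow it.
OPERAND_COUNT = {0x01: 1, 0x02: 2, 0x03: 3}

# Toy addressing modes, taken from the high nibble of the first specifier
# byte -> number of extra bytes that follow that first byte.
EXTRA_BYTES = {
    0x5: 0,  # register
    0xA: 1,  # byte displacement
    0xC: 2,  # word displacement
    0xE: 4,  # longword displacement
}

def instruction_length(stream, start=0):
    """Return the byte length of the instruction beginning at `start`."""
    pos = start
    count = OPERAND_COUNT[stream[pos]]  # the opcode gives only the operand count
    pos += 1
    for _ in range(count):
        mode = stream[pos] >> 4         # the mode is known only once this byte is read
        pos += 1 + EXTRA_BYTES[mode]
    return pos - start

def split_instructions(stream):
    """Walk the stream strictly in sequence, returning each instruction's length."""
    lengths = []
    pos = 0
    while pos < len(stream):
        n = instruction_length(stream, pos)
        lengths.append(n)
        pos += n
    return lengths

sample = bytes([0x02, 0x51, 0xA2, 0x10,
                0x01, 0x50,
                0x03, 0xE1, 0, 0, 0, 0, 0x52, 0xC3, 0, 0])
print(split_instructions(sample))  # [4, 2, 10]
```

Each call to `instruction_length` depends on the result of the previous one, which is the serial dependency that makes parallel decode of such an instruction stream impractical.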
The variety of powerful instructions, memory accessing modes and data types available in a VAX type of architecture should result in more work being done for each line of code (actually, compilers do not produce code taking full advantage of this). Whatever gain in compactness of source code is accomplished comes at the expense of execution time. Particularly as pipelining of instruction execution has become necessary to achieve the performance levels presently demanded of systems, the data or state dependencies of successive instructions, and the vast differences in memory access time vs. machine cycle time, produce excessive stalls and exceptions, slowing execution.
When CPUs were much faster than memory, it was advantageous to do more work per instruction, because otherwise the CPU would always be waiting for the memory to deliver instructions--this factor led to more complex instructions that encapsulated what would be otherwise implemented as subroutines. As CPU and memory speeds have become more balanced, the advantages of complex instructions are lessened, assuming the memory system is able to deliver one instruction and some data in each cycle. Hierarchical memory techniques, as well as faster access cycles, and greater memory access bandwidth, provide these faster memory speeds. Another factor that has influenced the choice of complex vs. simple instruction type is the change in relative cost of off-chip vs. on-chip interconnection resulting from VLSI construction of CPUs. Construction on chips instead of boards changes the economics--first it pays to make the architecture simple enough to be on one chip, then more on-chip memory is possible (and needed) to avoid going off-chip for memory references. A further factor in the comparison is that adding more complex instructions and addressing modes as in a CISC solution complicates (thus slows down) stages of the instruction execution process. The complex function might make the function execute faster than an equivalent sequence of simple instructions, but it can lengthen the instruction cycle time, making all instructions execute slower; thus an added function must increase the overall performance enough to compensate for the decrease in the instruction execution rate.
Despite the performance factors that detract from the theoretical advantages of CISC processors, the existing software base as discussed above provides a long-term demand for these types of processors, and of course the market requires ever-increasing performance levels. Business enterprises have invested many years of operating background, including operator training as well as the cost of the code itself, in applications programs and data structures using the CISC type processors which were the most widely used in the past ten or fifteen years. The expense and disruption of operations to rewrite all of the code and data structures to accommodate a new processor architecture may not be justified, even though the performance advantages ultimately expected to be achieved would be substantial. Accordingly, it is the objective to provide high-level performance in a CPU which executes an instruction set of the type using variable length instructions and variable data widths in memory accessing.
The typical VAX implementation has three main parts, the I-box or instruction unit which fetches and decodes instructions, the E-box or execution unit which performs the operations defined by the instructions, and the M-box or memory management unit which handles memory and I/O functions. An example of these VAX systems is shown in U.S. Pat. No. 4,875,160, issued Oct. 17, 1989 to John F. Brown and assigned to Digital Equipment Corporation. These machines are constructed using a single-chip CPU device, clocked at very high rates, and are microcoded and pipelined.
Theoretically, if the pipeline can be kept full and an instruction issued every cycle, a processor can execute one instruction per cycle. In a machine having complex instructions, there are several barriers to accomplishing this ideal. First, with variable-sized instructions, the length of the instruction is not known until perhaps several cycles into its decode. The number of opcode bytes can vary, the number of operands can vary, and the number of bytes used to specify an operand can vary. The instructions must therefore be decoded in sequence; parallel decode is not practical. Second, data dependencies create bubbles in the pipeline as results generated by one instruction but not yet available are needed by a subsequent instruction which is ready to execute. Third, the wide variation in instruction complexity makes it impractical to implement the execution without either lengthening the pipeline for every instruction (which worsens the data dependency problem) or stalling entry (which creates bubbles).
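The second barrier, bubbles caused by data dependencies, can be made concrete with a minimal sketch. The model is an assumption made for illustration only: instructions issue in order, at most one per cycle, and a result becomes available a fixed number of cycles (`latency`, a hypothetical parameter) after its instruction issues. The function name and register names are likewise invented.

```python
def count_cycles(program, latency=3):
    """Count issue cycles and bubbles for an in-order, single-issue pipeline.

    program: list of (dest_register, [source_registers]).
    Returns (cycles_used, bubbles), where a bubble is an issue slot lost
    while waiting for a source operand that is not yet available.
    """
    ready_at = {}   # register -> first cycle in which its value can be used
    cycle = 0       # cycle in which the next instruction could issue
    bubbles = 0
    for dest, sources in program:
        earliest = max([ready_at.get(r, 0) for r in sources], default=0)
        if earliest > cycle:
            bubbles += earliest - cycle   # the pipeline idles until the operand arrives
            cycle = earliest
        ready_at[dest] = cycle + latency
        cycle += 1
    return cycle, bubbles

independent = [("r1", []), ("r2", []), ("r3", [])]
dependent = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]
print(count_cycles(independent))  # (3, 0)
print(count_cycles(dependent))    # (7, 4)
```

Under these assumptions the dependent chain uses seven issue cycles for three instructions, four of them bubbles; a larger `latency`, as results from lengthening the pipeline, increases the loss.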
Thus, in spite of the use of contemporary semiconductor processing and high clock rates to achieve the most aggressive performance at the device level, the inherent characteristics of the architecture impede the overall performance, and so a number of features must be taken advantage of in an effort to provide improved system performance as is demanded by users.
Pipelined computer implementations gain performance by dividing instruction processing into pieces and overlapping execution of the pieces in autonomous functional units. In practice, the ability to achieve overlap and high efficiency in the pipeline can be restricted by architecture specifications. Many architecture specifications, including the VAX architecture, enforce strict read and write ordering to guarantee deterministic results from instruction sequences and to avoid data corruption in common memory. Many CISC architectures, including the VAX architecture, also specify instructions that require memory requests in addition to operand requests to accomplish their specified behavior. Pipelined implementations of computers that require strict read and write ordering and support instructions that do memory requests in addition to operand requests need a way to synchronize instruction decode, instruction execution, and memory requests among the autonomous functional units.
Micropipelined processors gain performance by splitting instruction processing into pieces and overlapping execution, but macroinstructions (machine level instructions) are only started when the previous instruction completes. Strict order of memory requests is enforced by this serialization; operand requests and any additional memory requests associated with one instruction are made before the subsequent instruction is started. There is no synchronization problem.
Macropipelined processors gain additional performance by decoupling instruction decode and instruction execution, allowing multiple macroinstructions to exist in the pipeline at various stages of processing at one time. Some RISC architectures, other load/store architectures, and architectures that do not require memory access other than for operands, can enforce read and write ordering by queuing the memory requests generated by operand evaluation in the order that instructions are decoded. Other RISC architectures do not require strict read and write ordering.
Macropipelined processors for architectures that generate memory accesses in addition to those generated by operand processing need a method to synchronize instruction decode, instruction execution, and memory request functions. A method of synchronization detects instructions that may cause out-of-order read and write references and shuts off instruction decode. Instruction execution proceeds until the instruction in question is finished, then decode resumes. In this way, the macropipeline is disabled for a period and processing proceeds serially, much in the manner of the micropipelined design. This effective, straightforward method loses the advantage of overlapped instruction processing during the synchronization period.
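The decode-stall method just described can be sketched as a small simulation. Everything in it is an assumed model for illustration: a two-stage machine in which decode runs one step ahead of execute, operand requests are issued at decode, any additional memory requests are issued at execute, and the names (`run`, `synchronize`, the request labels) are hypothetical.

```python
from collections import deque

def run(program, synchronize):
    """Simulate decode running one step ahead of execute.

    program: list of (name, operand_requests, extra_requests), where
    operand requests are issued at decode and extra requests at execute.
    Returns (request_log, lost_decode_slots).
    """
    log = []            # memory requests in the order they are issued
    pending = deque()   # instructions decoded but not yet executed
    pc = 0
    stalled = False
    lost_slots = 0
    while pc < len(program) or pending:
        # Only an instruction decoded in an earlier step may execute now.
        oldest = pending.popleft() if pending else None
        # Decode stage.
        if pc < len(program):
            if stalled:
                lost_slots += 1          # decode is shut off for this step
            else:
                name, operand_reqs, extra_reqs = program[pc]
                pc += 1
                log.extend(f"{name}:{r}" for r in operand_reqs)
                pending.append((name, extra_reqs))
                if synchronize and extra_reqs:
                    stalled = True       # may cause out-of-order references
        # Execute stage.
        if oldest is not None:
            name, extra_reqs = oldest
            log.extend(f"{name}:{r}" for r in extra_reqs)
            if extra_reqs:
                stalled = False          # instruction finished; decode resumes
    return log, lost_slots

program = [("I0", ["read A"], ["write B"]), ("I1", ["read B"], [])]
print(run(program, synchronize=False))
# (['I0:read A', 'I1:read B', 'I0:write B'], 0)
print(run(program, synchronize=True))
# (['I0:read A', 'I0:write B', 'I1:read B'], 1)
```

Without synchronization, the read of B by I1 is issued ahead of the write of B by I0. With synchronization, program order is restored at the cost of one lost decode slot, which is the loss of overlap noted above.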
The goal then is to provide a computer that adheres to existing standard architecture specifications (e.g., a CISC architecture such as VAX) and yet delivers the highest possible performance. Changing the architecture, to RISC, for example, to eliminate the pipeline synchronization problem is not a possibility in view of the existing software base. The objective is to provide a macroinstruction-pipelined implementation that preserves the architecturally-defined read and write ordering.
Another issue is that of synchronizing the passing of instruction context across autonomous functional unit boundaries in a pipelined computer implementation. A feature is simplifying the selection of context dependent execution flows and creating possibilities for greater instruction overlap.
The ability to achieve overlap and high efficiency in a pipelined processor can be restricted by architecture specifications. Some CISC architectures specify instructions for which the operand context changes the flow of execution. Instructions specified by the VAX architecture that use variable bit field operands require a different execution flow depending on the operand context.
In micropipelined CISC processors, where performance is gained by splitting instruction processing into pieces and overlapping execution, macroinstructions are only started when the previous instruction completes. Operand processing and instruction execution flow are known ahead of time. There is no synchronization problem, nor is there any opportunity for additional execution overlap. By nature, many RISC architectures deliberately limit the breadth of operand types so that execution flow is predetermined.
In macropipelined processors, where performance is gained by de-coupling instruction decode and instruction execution to allow multiple macroinstructions to exist in the pipeline at various stages of processing at one time, if the execution flow for an instruction depends on operand context then the pertinent operand must be identified before the specific execution flow can begin. One method of synchronizing the execution flow to the operand context is simply to hold off issuing the instruction from the instruction decode unit until the operands are identified. The instruction context is modified accordingly to select a specific execution flow. In this way, the macropipeline is disabled for a period and processing proceeds serially, much in the manner of the micropipelined design. This straightforward method loses the advantage of overlapped instruction processing during the synchronization period, and may create a critical path in the logic that modifies instruction context.
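The hold-off method can be sketched in a few lines. The timing model, the flow names and the function name below are assumptions made for illustration: the opcode is taken to be decoded in cycle 0 and operand specifier k in cycle k + 1, and the position of the specifier that determines the flow is passed in as a parameter.

```python
# Hypothetical flow names; a microcoded machine would use microcode entry points.
FLOW_BY_CONTEXT = {
    "register": "field_in_register_flow",
    "memory": "field_in_memory_flow",
}

def hold_then_issue(specifier_kinds, context_index):
    """Return (flow, cycles_held) for one instruction.

    Assumed timing: the opcode is decoded in cycle 0 and specifier k in
    cycle k + 1, so issue is held for context_index + 1 cycles beyond the
    opcode decode while the deciding specifier is awaited.
    """
    kind = specifier_kinds[context_index]
    return FLOW_BY_CONTEXT[kind], context_index + 1

# The third specifier (index 2) decides the flow in this made-up example.
print(hold_then_issue(["literal", "literal", "memory", "register"], 2))
# ('field_in_memory_flow', 3)
```

The cycles held grow with the position of the deciding specifier, and during them the execution unit receives nothing from decode for this instruction.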
Thus, another objective is to provide computers of standard architecture in a macroinstruction pipelined implementation that supports split execution flows based on operand context while achieving maximum pipeline overlap.