1. Field of the Invention
This invention is concerned with methods for partitioning the large instruction sets of mainframe computing systems in order that such partitioned sets can be run by a plurality of microprocessors. More particularly, this invention relates to methodology for partitioning mainframe instruction sets to obtain the most effective cost/performance emulation of the mainframe instruction set through microprocessor implementation thereof.
2. Description of the Prior Art
One noteworthy characteristic of this era of integrated circuits is that higher performance computers use lower levels of integration. This is the result of individual optimizations across the performance spectrum. Since the price of a state-of-the-art silicon chip is, on balance, independent of the level of integration, the price per gate is lower for microcomputers than for super computers. One result of this situation has been the complete reversal of Grosch's Law, which stated that payment of twice as much for a computer would provide four times as much processing power. This meant that one would achieve the best cost/performance from the largest computer that could be justified when its resources were shared among many unrelated users. As amended by the most recent technological advances and designs, the reversal of Grosch's Law now implies that the best cost/performance will be obtained from the smallest computer that will perform an application in an acceptable time.
Large scale integration or LSI has played a major role in the cost/performance improvements of all computing systems, particularly in reducing storage costs. However, LSI has been much more effective in reducing the costs of low performance processors having simple architectures than of high performance processors having complex architectures. This property of LSI favors implementing high performance computers using large numbers of low performance processors and storage chips. However, this implementation is difficult to apply to existing complex architectures intended for uni-processors that process a single stream of instructions. This limitation is best understood by considering the basic nature and effect of LSI on digital designs.
Recent improvements in the cost/performance of digital computer systems have been driven by the availability of increasingly denser LSI chips. Denser LSI memory chips, with reduced costs per bit stored, have direct and obvious applicability to digital systems over the entire application range from hand held calculators to super computers. Denser LSI logic chips, however, apply most naturally to digital systems near the low end of the performance and complexity spectrum.
LSI, as previously noted, applies naturally to very small digital systems. The logic portion of a hand calculator, microwave oven, or wrist watch, including the necessary memory and I/O device interfaces, can be implemented on a single LSI microcomputer chip. A small personal computer can be readily realized by using a single microprocessor chip, to implement the entire instruction set of the computer, together with other LSI chips which implement the interfaces between the microprocessor and the memory, keyboard, display tube, disks, printers, and communication lines. This is an example of partitioning a digital system's function for implementation by several LSI chips. This functional partitioning method is simple, well known, and straightforward because the instruction processing function can be accomplished entirely by a single chip.
Methods of applying LSI technology to the implementation of still more powerful digital systems, in which the state of the LSI art does not permit implementing the entire instruction processing function on a single LSI chip, are far less obvious. A first approach would be simply to wait until technology advances far enough to contain a desired architecture, of a given complexity, on a single chip. Unfortunately, this approach has its pitfalls. For example, the architecture of each generation's state-of-the-art microprocessor was determined by the then current capability of the technology, which explains why today's leading microprocessors lack floating-point instructions. The most significant disadvantage of this method is that it precludes implementing a pre-defined architecture that does not happen to fit within one chip in the current technology. This has led to the major software problems inherent in having each generation of microprocessors implement an essentially new architecture.
Another method of employing LSI in the larger, more complex processing systems is to partition the instruction execution function so that the data flow is on one chip and the microcode that controls the data flow is on one or more other chips. This method is the obvious application of LSI technology, separately, to the data flow and to the control store. Unfortunately, this method relinquishes the main advantage of LSI implementation, namely, that of having the control store and the data flow that it controls, both on the same chip. In most processors, the critical path runs from control store, to data flow, to arithmetic result, to address of the next control store word. Its length, in nanoseconds, determines the microcycle time and hence the instruction processing rate of the processor. For a given power dissipation, a critical path that remains wholly on one LSI chip results in a shorter cycle time than that of a critical path that must traverse several inches of conductor and a number of chip-to-card pin connections.
This off-chip microcode partitioning method also requires what LSI technology is least adept at providing, namely, large numbers of pins. The data flow chip needs at least a dozen pins to tell the control store what microword to give it next. Even worse, the data flow chip needs from 16 to 100 pins to receive that control word. A processor using this method is often limited to roughly 16-bit control words, and hence a vertical microprogram that can control only one operation at a time, whereas a far higher performance processor could be designed if a 100-bit control word were available. If available, such 100-bit control words would permit a horizontal microprogram that can control several operations in each microcycle and thus perform a given function in fewer cycles. It should be noted that the off-chip microcode partitioning method has been particularly successful when applied to bit-slice processors, in which the data flow is not reduced to a single chip, but rather is a collection of chips, each of which implements a particular group of bits throughout the data flow. Bit-slice processors usually employ bipolar technologies whose densities are limited by the number of gates available, or the ability to cool them, rather than by the number of pins on the chips. The off-chip microcode partitioning method applies to FET implementations only in more unusual cases where many pins are available and the chip density happens to exactly match the number of gates needed to implement the data flow of a desired processor. The Toshiba T88000 16-bit microprocessor happens to meet these conditions. Such an implementation can be best viewed as a bit-slice design in which the implementable slice width has widened to encompass the entire desired data flow.
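The cycle-count difference between vertical and horizontal microprograms can be illustrated with a minimal sketch. It assumes a hypothetical control word with one field per data flow resource, packs consecutive micro-operations into the same word until two of them need the same resource, and ignores data dependencies between operations. The resource names, the operation list, and the function names are assumptions made for illustration and are not taken from any actual processor.

```python
def pack_vertical(ops):
    """One micro-operation per control word (narrow, vertical microcode)."""
    return [{resource: action} for resource, action in ops]


def pack_horizontal(ops):
    """Pack consecutive micro-operations into one wide control word.

    A new word is started whenever the next operation needs a resource
    field that the current word already uses.
    """
    words = []
    current = {}
    for resource, action in ops:
        if resource in current:
            words.append(current)
            current = {}
        current[resource] = action
    if current:
        words.append(current)
    return words


# Hypothetical micro-operation sequence for one machine instruction.
ops = [
    ("alu", "add"),
    ("shifter", "shift_left_1"),
    ("mem", "read"),
    ("alu", "subtract"),
    ("mem", "write"),
]

vertical_cycles = len(pack_vertical(ops))      # one microcycle per operation
horizontal_cycles = len(pack_horizontal(ops))  # one microcycle per packed word
print(vertical_cycles, horizontal_cycles)
```

Under these assumptions the same function takes five microcycles as a vertical microprogram and two as a horizontal one, at the cost of a control word wide enough to carry a field for every resource.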
Each major microprocessor manufacturer has faced the need to implement an architecture more complex than can be put onto a single LSI chip. Some needed to implement pre-existing architectures in order to achieve software compatibility with installed machines. Others sought to enhance the functions of existing successful one-chip microprocessors by adding further instructions.
For example, Digital Equipment Corporation needed a low-end implementation of their PDP-11 minicomputer architecture. They chose the off-chip microcode partitioning method. The result was the LSI 11 four-chip set manufactured first by Western Digital Corporation and then by Digital Equipment Corporation itself.
Intel Corporation needed to add hardware computational power, particularly floating-point instructions, to its 8086 microprocessor systems. For this purpose, they developed a "co-processor", the 8087. A processing system containing both an 8086 chip and an 8087 chip operates as follows. The chips fetch each instruction simultaneously. If the instruction is one that the 8086 can execute, it executes the instruction and both chips fetch the next instruction. If the instruction is one that the 8087 executes, the 8087 starts to execute it. In the usual case where a main store address is required, the 8086 computes the address and puts it on the bus shared with the 8087. The 8087 uses that address to complete execution of the instruction and then signals the 8086 that it is ready for both of them to fetch the next instruction. Thus, each chip looks at each instruction and executes its assigned subset, but only the 8086 computes addresses.
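The division of labor just described can be sketched as follows. This is a simplified model rather than a description of the actual 8086 and 8087 hardware: the `Instr` record, its flags, and the instruction labels are hypothetical stand-ins for opcode decoding, and timing and bus signalling are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    name: str                    # illustrative label only
    for_coprocessor: bool        # hypothetical flag standing in for opcode decoding
    needs_address: bool = False  # True when a main store operand is involved


def run_lockstep(program: List[Instr]) -> List[Tuple[str, str, Optional[str]]]:
    """Return (instruction, executing chip, chip that computed the address)."""
    trace = []
    for instr in program:
        # Both chips see every instruction; exactly one of them executes it.
        executor = "coprocessor" if instr.for_coprocessor else "main"
        # Only the main processor ever computes main store addresses.
        address_by = "main" if instr.needs_address else None
        trace.append((instr.name, executor, address_by))
    return trace


program = [
    Instr("load", False, True),
    Instr("fp_add", True, True),
    Instr("fp_sqrt", True),
]
for step in run_lockstep(program):
    print(step)
```

The sketch makes the asymmetry visible: execution is split between the two chips by instruction subset, while address computation stays with the main processor in every case.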
Zilog Corporation similarly needed to add floating-point instructions to its Z8000 microprocessor and developed an Extended Processing Unit or EPU. A system containing a Z8000 and one or more EPUs works as follows. The Z8000 fetches an instruction. If the Z8000 can execute the instruction, it does so. Otherwise, the Z8000 issues a request for service by an EPU and supplies an identifier (ID) that it determines by examining the instruction. One EPU recognizes that ID as its own and begins executing. The EPU can use special wires to the Z8000 to instruct the Z8000 to move necessary data back and forth between the EPU and the main store. The Z8000 proceeds to fetch and execute more instructions while the EPU is working, and only stops to wait for the EPU if it requests service by the same EPU while that EPU is still busy. Thus, it is the responsibility of the Z8000 to start the EPU and respond to commands from the EPU. A great deal of execution overlap is possible in such a system.
National Semiconductor Corporation had a similar requirement to add floating-point instructions to its NS-16000 microprocessor systems. It called the NS-16000 a "master" and called the computational processor a "slave". In a system containing a master and a slave, the master fetches instructions and executes them if it can. When the master fetches an instruction it cannot execute, it selects a slave to begin execution. The master sends the instruction and any needed data to the slave, waits for the slave to signal completion, receives the result, and proceeds to fetch the next instruction. Thus, the master never overlaps its execution with the slave's execution and is responsible for knowing what the slave is doing and what it needs.
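The practical difference between the overlapped scheme described for the Z8000 with an EPU and the non-overlapped master and slave scheme can be illustrated with a small timing model. It is a sketch under stated assumptions: a single coprocessor, instruction durations in arbitrary time units chosen for illustration, and no cost for dispatching work or moving data. The function name and the program encoding are hypothetical.

```python
def total_time(program, overlap):
    """Time at which all work in `program` is complete.

    program: list of (unit, duration), where unit is "cpu" or "cop".
    overlap: True  -> the main processor continues after starting the
                      coprocessor and waits only if it is still busy.
             False -> the main processor waits for every coprocessor
                      instruction to finish before fetching the next one.
    """
    cpu_time = 0   # time at which the main processor is next free
    cop_free = 0   # time at which the coprocessor is next free
    for unit, duration in program:
        if unit == "cpu":
            cpu_time += duration
        else:
            start = max(cpu_time, cop_free)  # wait if the coprocessor is busy
            cop_free = start + duration
            cpu_time = start if overlap else cop_free
    return max(cpu_time, cop_free)


# Hypothetical instruction stream: two long coprocessor operations
# separated by short main processor instructions.
program = [("cop", 10), ("cpu", 3), ("cpu", 3), ("cop", 10), ("cpu", 2)]

print(total_time(program, overlap=True))
print(total_time(program, overlap=False))
```

For this stream the overlapped model finishes in 20 time units and the non-overlapped model in 28, which is the sum of all the durations. The overlapped model gains only while the main processor has independent work to do between coprocessor requests.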
Data General Corporation needed an LSI implementation of its Eclipse minicomputer architecture. The resulting MicroEclipse family employs a one-chip processor that contains the data flow as well as the horizontal (35-bit) and vertical (18-bit) microcode for executing the most performance-critical instructions in the architecture. This processor can call for vertical microwords from an off-chip control store, as necessary, to execute the rest of the instructions in the architecture by making use of the on-chip horizontal microwords. This is a variant of the other approaches described above, with some of the advantages of both the off-chip control-store method and the method of partitioning a mainframe instruction set.
Designs that partitioned off I/O functions for implementation on dedicated microprocessors were common, and none of the advanced microprocessor partitioning methods previously discussed had yet appeared when the present invention was conceived. Partitioning of functions within a central processing unit for implementation on separate processors had been employed in super computers. The goal of those designs was separate execution units for fixed-point, floating-point, and perhaps decimal instructions that could overlap execution to achieve maximum throughput.