Processor designs and programs to run on processors can trace their evolution from basic mathematical principles set out by the British mathematician A. M. Turing in the 1930s, whose “Turing Machine” represents a mathematical model of a sequential computational process. Sequential control concepts may be attributed to the even earlier machines of Babbage in the 1800s. The idea of a sequential process was embodied in the von Neumann processor architecture developed in the 1940s, which had a number of important characteristics that have been maintained in most commercial processors today. The salient characteristics of these processors to note herein are that the program and data are stored in sequentially addressed memory and that the processors use a single sequential instruction stream made up of single-address, single-operation instructions sequenced by an instruction counter. See, for example, “Computer Architecture Concepts and Evolution” by G. A. Blaauw and F. P. Brooks, Jr., Addison-Wesley, 1997, p. 589 (subsequently referenced herein as Blaauw and Brooks). Even though many types of processors and software languages have been developed over the years for the creation of programs to accomplish various functions, most commercial machines are still based on Turing and von Neumann principles. The overriding architectural philosophy of most commercial processors embeds a control structure based on sequential principles with the program's arithmetic/logical function. Because this embedding has been inherent from the beginning of processor development, it can be understood why the sequential instruction fetch mechanism, in which an instruction counter provides a sequence of instruction addresses, has remained basically the same throughout the history of processors. There have been a few exceptions, one being the IBM 650 processor (Blaauw and Brooks, pp. 648-664), announced in 1953, in which each fetched instruction contained a next instruction address field.
However, this mechanism still embedded a program's control structure with its arithmetic/logical function, because the next instruction address field was part of the 650 instruction format, whose instruction set comprised load, store, arithmetic, shift, input/output (I/O), and branch instructions. Further, the approach was discounted as being inefficient for future architectures and has not been pursued in any new processor design.
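The contrast between the two fetch mechanisms described above can be illustrated with a minimal sketch. It is a toy model only: the function names, the `"halt"` marker, and the tuple layout `(operation, next_address)` are assumptions made for illustration and do not reproduce the actual IBM 650 instruction format.

```python
def run_with_counter(program):
    """Sequential fetch: an instruction counter supplies each address in turn."""
    trace = []
    counter = 0  # hypothetical instruction counter
    while counter < len(program):
        operation = program[counter]
        if operation == "halt":
            break
        trace.append(operation)
        counter += 1  # the next address is always the following location
    return trace


def run_with_next_field(program, start):
    """Next-address fetch: each instruction names the address of its successor."""
    trace = []
    address = start
    while address is not None:
        operation, next_address = program[address]
        trace.append(operation)
        address = next_address  # the instruction itself supplies the next address
    return trace


# The same three operations, stored contiguously or scattered in memory.
sequential_program = ["load", "add", "store", "halt"]
scattered_program = {0: ("load", 5), 5: ("add", 2), 2: ("store", None)}

counter_trace = run_with_counter(sequential_program)
next_field_trace = run_with_next_field(scattered_program, start=0)
```

In both cases the sequencing information travels with the functional instructions: the counter model implies it by position, and the next-address model carries it in a field of the same instruction word.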
Another related idea is that of microprogrammed processors, which used microinstructions to implement, via a microprogram stored in an internal microstore, “higher-level” more complex instructions. The microinstructions were often hidden from the programmer, who used only the higher-level, more complex instruction set of the processor. Microinstructions are primitive-level instructions containing “implementation-derived” control signal bits that directly control primitive operations of the processor, and they usually differed in each processor implementation (Blaauw and Brooks, pp. 71-75). This microprogramming mechanism still embeds the microprogram's control structure with, in this case, primitive operations, because any microinstruction that contained a microstore next instruction address field also included control signal bits that directly control primitive operations of the processor. Some of the disadvantages of microprogramming are the cost and performance impact of the microstore and microprogram control unit, the lack of uniformity between implementations, and additional programming and documentation costs.
In order to obtain higher levels of instruction parallelism in a processor architecture based on von Neumann principles, three approaches have been developed: packed data (see, for example, “Intel MMX for Multimedia PCs”, by A. Peleg, S. Wilkie, and U. Weiser, Communications of the ACM, January 1997, Vol. 40, No. 1); vector (see, for example, “An Introduction to Vector Processing”, by P. M. Johnson of Cray Research, Inc., Computer Design, February 1978, pp. 89-97); and very long instruction word (VLIW) architectures (see, for example, “The ManArray Embedded Processor Architecture”, by G. G. Pechanek and S. Vassiliadis, Proceedings of the 26th Euromicro Conference: “Informatics: inventing the future”, Maastricht, The Netherlands, Sep. 5-7, 2000, Vol. I, pp. 348-355, and, more specifically, U.S. Pat. Nos. 6,151,668, 6,216,223, 6,446,190, and 6,446,191).
In the packed data mechanism, an instruction specifies multiple operations on data units containing multiple data elements, such as a 64-bit data unit consisting of eight 8-bit data elements. This packed data construct is used in arithmetic/logical instructions that are embedded with a program's control structure and does not affect the sequential instruction fetch rules of the basic architecture. In vector machines, a vector instruction specifies an operation on a block of data, and the machine provides hardware resources to support the repetitive operations on the block of data. Vector instructions are still fetched in a sequential manner, and vector machines still use the standard control structures embedded in the instruction stream. In the traditional VLIW case, a single addressable long instruction unit is made up of multiple single instruction words, where the packing of the instructions in the VLIW is based upon independence of operation. In the indirect VLIW case, as described in the above-listed patents, a single addressable standard-width instruction from a primary instruction stream causes the indirect fetch of a VLIW from one or multiple local caches of VLIWs. In both of these VLIW architectures, a program's control structure is still embedded with the program's arithmetic/logical function, and the architectures adhere to the sequential instruction fetch rules of a classic sequential machine.
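As an illustration of the packed data construct, the following sketch models a 64-bit data unit holding eight 8-bit data elements and performs one lane-wise addition on all eight elements at once. The helper names (`pack`, `unpack`, `packed_add`) and the choice of wraparound (modulo 256) arithmetic are assumptions made for illustration; the sketch does not represent the behavior of any particular instruction set.

```python
LANES = 8                            # eight data elements per data unit
LANE_BITS = 8                        # each data element is 8 bits wide
LANE_MASK = (1 << LANE_BITS) - 1
HIGH_BITS = 0x8080808080808080       # top bit of every lane
LOW_BITS = 0x7F7F7F7F7F7F7F7F        # lower seven bits of every lane


def pack(elements):
    """Pack eight 8-bit values into one 64-bit integer, element 0 lowest."""
    if len(elements) != LANES:
        raise ValueError("expected exactly eight elements")
    word = 0
    for index, value in enumerate(elements):
        word |= (value & LANE_MASK) << (index * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit integer back into its eight 8-bit elements."""
    return [(word >> (index * LANE_BITS)) & LANE_MASK for index in range(LANES)]


def packed_add(a, b):
    """Add eight lanes in one pass; each lane wraps modulo 256."""
    # Adding only the low seven bits keeps carries from crossing lane
    # boundaries; the top bit of each lane is then restored with XOR.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    return low_sum ^ ((a ^ b) & HIGH_BITS)


result = unpack(packed_add(pack([1, 2, 3, 4, 5, 6, 7, 8]),
                           pack([10, 20, 30, 40, 50, 60, 70, 80])))
# result == [11, 22, 33, 44, 55, 66, 77, 88]
```

A single `packed_add` call here stands in for one packed arithmetic instruction: it is still one instruction in a sequential stream, but it performs eight element operations.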
There are difficulties in improving processor performance beyond what these architectures allow. These difficulties ultimately stem from the basic embedding of a program's control structure with its arithmetic/logical function, coupled with the sequential instruction counter fetching rules on which the processor architectures are based. To get at the basic issues involved, one of these difficulties can be stated as a question: how can multiple instructions be issued per cycle, given that programs are written as sequential steps that include both functional steps and control steps such as call/return and branching? The primary commercial attempts to solve this problem have resulted in superscalar and VLIW architectures. Both architectures use a mechanism to analyze a sequential program for opportunities to issue multiple instructions in parallel. In the superscalar case, the analysis mechanism is embedded in hardware, requiring significant memory and complex logic to support look-ahead and multiple-issue rules evaluation. For three-issue and larger machines, the memory and logic overhead becomes increasingly large and complex, leading to extended and expensive development and testing time. In the VLIW case, the multiple-issue analysis mechanism is embedded in a compiler in order to minimize hardware complexity while still supporting large issue rates. This technique has great value, but the analysis results are applied to VLIW hardware that is still based on a sequential program counter instruction fetch approach in which control instructions are embedded with functional instructions in the program instruction stream. One of the consequences of this embedding, tied with a sequential program counter addressing fetch rule, has been the use of fixed-size VLIW memories in both the traditional VLIW and the indirect VLIW approaches mentioned earlier.
This has led to inefficiencies in using VLIW architectures generally and to lost flexibility, due either to increased use of NOPs when not all of the instruction slots of a VLIW can be used or to the overhead latency of loading VLIWs that may be of single or short use duration.
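The NOP overhead of a fixed-size VLIW can be made concrete with a small sketch. The four-slot width, the instruction mnemonics, and the function names are hypothetical values chosen for illustration; they are not taken from any of the cited architectures.

```python
NOP = "nop"


def pad_to_fixed_width(issue_groups, width):
    """Place each group of independent instructions in a fixed-width VLIW,
    filling any unused slots with NOPs."""
    vliws = []
    for group in issue_groups:
        if len(group) > width:
            raise ValueError("group does not fit in one VLIW")
        vliws.append(list(group) + [NOP] * (width - len(group)))
    return vliws


def count_nops(vliws):
    """Return (NOP slots, total slots) across all VLIWs."""
    total_slots = sum(len(vliw) for vliw in vliws)
    nop_slots = sum(vliw.count(NOP) for vliw in vliws)
    return nop_slots, total_slots


# Hypothetical program: the available parallelism varies from cycle to cycle.
groups = [["add", "mul", "load", "store"], ["add"], ["sub", "load"]]
vliws = pad_to_fixed_width(groups, width=4)
nop_slots, total_slots = count_nops(vliws)
# 7 useful instructions occupy 12 slots, so 5 of the 12 slots hold NOPs.
```

The fraction of wasted slots grows as the available parallelism falls below the fixed width, which is the inefficiency the passage above attributes to fixed-size VLIW memories.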
Another difficulty to be faced in improving processor performance concerns whether vector operations can be efficiently supported in a processor design. Vector operations have typically been treated as data processing operations of an application-specific nature. Operations on vectors are generally defined as multi-cycle operations requiring significant embedded hardware, including vector registers and control logic. Traditionally, vector functionality has been treated as excessive, and only special-purpose machines have been built to support vector operations.
Another difficulty lies in the code density of superscalar, VLIW, and vector machines and concerns whether the code density can be improved by compressing the instruction stream. Instruction compression is presently treated as an add-on mechanism to improve the code density of an existing processor architecture. Consequently, instruction compression mechanisms must deal with mixed function and control instructions within the program and often need to use inventive mechanisms to deal with embedded control instructions such as branches and calls/returns.
Therefore, there is a need for a mechanism that can issue a variable number of instructions, depending upon the available parallelism throughout a program, without the large overhead of embedded look-ahead and complex rules evaluation logic or fixed-size VLIW memories. There is a further need for a mechanism that supports vector operations in a flexible fashion and that is easily implemented. There is also a need for a mechanism that inherently supports techniques that can compress a program instruction stream.