The present invention involves an apparatus for executing instructions in a computing device. In particular, the present invention provides an apparatus having a pipeline architecture for executing instructions in a high-speed computing device having a plurality of functional units.
A critical consideration in the design of high-speed computing devices having multiple functional units is the speedy and orderly execution of instructions. Prior art approaches in such apparatus do not execute enough instructions per machine cycle to fully utilize the functional units of such computing devices, and therefore fail to realize the high performance level achievable with such powerful instruction execution resources.
Prior art approaches to providing high-speed instruction execution utilize a pipelining of instructions to achieve the speed of execution desired. Typical prior art approaches are described in "An Instruction Issuing Approach to Enhancing Performance in Multiple Functional Unit Processors", by Acosta et al., IEEE Transactions on Computers, Vol. C-35, No. 9, September, 1986; in "Instruction Issue Logic in Pipelined Supercomputers", by Weiss et al., IEEE Transactions on Computers, Vol. C-33, No. 11, November, 1984; and in "Distributed Instruction Set Computer", by Wang et al., International Conference on Parallel Processing, August, 1988.
Among the design trade-offs involved in designing computing devices are the issues of instruction scheduling, instruction issue logic complexity, and hardware costs. Recent developments in large system integration approaches to hardware solutions have served to reduce the impact of hardware costs in the design of computing devices so that hardware solutions are more viable in designing computing devices than was the case in earlier times.
There are basically two ways that instruction code scheduling can be accomplished. First, such scheduling can be done in software by the compiler at compile time; this is referred to as "static" scheduling because it does not change as the program runs. Second, it can be done by the hardware at run time; this is referred to as "dynamic" scheduling. The two methods are not mutually exclusive. Using complex issue logic and dynamic scheduling of instructions, which allows instructions to begin execution out of order with respect to the compiled code sequence, relieves some of the burden on the compiler to generate a good schedule, so that performance is less dependent on the quality of the compiled code. Further, dynamic scheduling at issue time can take advantage of dependency information that is not available to the compiler at the time it would do static scheduling. However, such complex issue logic generally requires longer control paths, which can necessitate a longer clock period.
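The difference between issuing instructions strictly in compiled order and allowing them to begin execution out of order can be illustrated with a minimal sketch. The sketch is an illustration only and does not describe the issue logic of any cited reference or of the present invention. The names `Instr`, `schedule`, and `PROGRAM`, the latencies, and the one-issue-per-cycle limit are all assumptions chosen for the example, and the model assumes each register is written at most once, so that only true data dependencies matter.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction
    latency: int            # cycles until the result is available


def schedule(program, in_order):
    """Return {instruction name: issue cycle}, issuing at most one per cycle.

    Hypothetical model: a register produced by the program is unavailable
    until its producer has issued and its latency has elapsed; any other
    register is assumed available from cycle 0.
    """
    ready_at = {ins.dest: float("inf") for ins in program}
    pending = list(program)
    issued = {}
    cycle = 0
    while pending:
        # In-order issue may only consider the oldest pending instruction;
        # out-of-order issue may consider any pending instruction.
        candidates = pending[:1] if in_order else pending
        for ins in candidates:
            if all(ready_at.get(r, 0) <= cycle for r in ins.srcs):
                issued[ins.name] = cycle
                ready_at[ins.dest] = cycle + ins.latency
                pending.remove(ins)
                break  # one issue per cycle
        cycle += 1
    return issued


# Assumed example: i2 waits on a slow load, while i3 and i4 are independent of it.
PROGRAM = [
    Instr("i1", "r1", ("r0",), 3),
    Instr("i2", "r2", ("r1",), 1),
    Instr("i3", "r3", ("r0",), 1),
    Instr("i4", "r4", ("r3",), 1),
]

static_order = schedule(PROGRAM, in_order=True)
dynamic_order = schedule(PROGRAM, in_order=False)
```

In this assumed example the in-order schedule issues its last instruction in cycle 5, whereas the out-of-order schedule issues i3 and i4 while i2 waits for the load and issues its last instruction in cycle 3.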
Prior art approaches to such pipeline instruction issue logic problems often provided for inclusion of tags with instructions to indicate destinations associated with instructions. Such tags were generally associated with an instruction from the time the instruction was issued from memory storage until the time it was executed.
Weiss et al. discussed the use of such tags in conjunction with "reservation stations" assigned to each functional unit of the computing device. However, there was no provision that all data dependencies be resolved prior to the entering of instructions in Weiss' reservation station. Thus, the employment of such reservation stations as a buffer for a functional unit was inefficient since data dependencies could occur even while the instruction was located in the reservation station, thereby precluding the instruction's execution by the functional unit associated with the respective reservation station.
Acosta et al. described the addition of two fields to the instruction byte to identify data dependency information in terms of destination registers and source registers associated with the respective instructions. These data dependency fields operated much like the tags of the Weiss et al. approach, at least insofar as they accompanied the instructions throughout their handling and execution.
In contrast, the present invention employs a more fully dynamic scheduling approach for instruction execution, in that it uses first-in-first-out (FIFO) registers to ensure, through the inherent properties of such hardware, that the correct sequence of the instructions is maintained through their execution. Thus, no tags accompany the instructions through execution, and pipelining of the instructions is more efficiently effected than is accomplished with prior art devices.
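The ordering property relied upon here can be illustrated in software with a minimal sketch. It is not a description of the circuitry of the present invention; the class name `InstructionFifo`, its `depth` parameter, and the stall-on-full behavior are hypothetical choices made for the example. The point shown is only that position in a FIFO implies sequence, so no tag needs to be stored with an entry to recover the original order.

```python
from collections import deque


class InstructionFifo:
    """Hypothetical fixed-depth FIFO; entry order alone encodes sequence."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = deque()

    def push(self, instr):
        """Append an instruction; return False if the FIFO is full."""
        if len(self.slots) >= self.depth:
            return False  # assumed behavior: the caller must stall
        self.slots.append(instr)
        return True

    def pop(self):
        """Remove and return the oldest instruction, or None if empty."""
        return self.slots.popleft() if self.slots else None


fifo = InstructionFifo(depth=2)
accepted = [fifo.push(op) for op in ("add", "mul", "sub")]
drained = [fifo.pop(), fifo.pop(), fifo.pop()]
```

With the assumed depth of two, the third push is refused, and the two accepted instructions leave the FIFO in the order in which they entered.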
The present invention takes advantage of the increasing power of microprocessor technology. As the microprocessor art advances, more functions can be implemented on the same chip, and the clock period is reduced. The conventional approach to improving performance of a computing device has been to implement more complex instructions and hardware and employ a simpler compiler, since the complex instructions execute faster in hardware than in software. However, complex instructions implemented in hardware may slow down the remainder of the instruction set which comprises the program being executed by the computer. An alternate approach is to have a simple instruction set designed so that most instructions execute in one clock cycle, with the optimization of code being accomplished by the compiler. Such an approach utilizes hardware logic instead of microcoding for speed in decoding, and such a simple instruction set leaves more area on the microprocessor chip for implementing more functions.
An important performance measurement for microprocessors is the number of clock cycles required to execute one instruction or, inversely, the number of instructions executed per clock cycle. With a single functional unit employed by the computer, the limit of performance is one instruction per clock cycle. Additional decoding and functional units may be employed on a microprocessor chip so that instructions can be executed in parallel. In the present invention, a simple instruction set is used with multiple decoding and functional units, the primary goal being a parallel microprocessor architecture that can effect execution of two instructions per clock cycle.
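The two reciprocal measures can be made concrete with a short sketch. The function names and the instruction and cycle counts below are assumed values chosen only to illustrate the arithmetic; they are not measurements of any device.

```python
def cycles_per_instruction(cycles, instructions):
    """Average number of clock cycles required per executed instruction."""
    return cycles / instructions


def instructions_per_cycle(cycles, instructions):
    """Average number of instructions executed per clock cycle."""
    return instructions / cycles


# Assumed figures: 1000 instructions completed in 500 clock cycles.
cpi = cycles_per_instruction(cycles=500, instructions=1000)
ipc = instructions_per_cycle(cycles=500, instructions=1000)
```

Under these assumed figures the result is 0.5 cycles per instruction, or equivalently 2 instructions per clock cycle, which corresponds to the stated goal of two instructions per clock cycle and exceeds the single-functional-unit limit of one.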