The ever-growing requirement for high-performance computers demands that state-of-the-art microprocessors execute instructions in the minimum amount of time. Over the years, efforts to increase microprocessor speeds have followed different approaches. One approach is to increase the speed of the clock that drives the processor. As the clock rate increases, however, the processor's power consumption and temperature also increase. Increased power consumption raises electrical costs and depletes batteries in portable computers more rapidly, while high circuit temperatures may damage the processor. Furthermore, processor clock speed cannot increase beyond the limit set by the physical speed at which signals traverse the processor. Simply stated, there is a practical maximum to the clock speed that conventional processors can sustain.
An alternate approach to improving processor speeds is to reduce the number of clock cycles required to perform a given instruction. Under this approach, instructions will execute faster and overall processor "throughput" will thereby increase, even if the clock speed remains the same. One technique for increasing processor throughput is pipelining, which calls for the processor to be divided into separate processing stages (collectively termed a "pipeline"). Instructions are processed in an "assembly line" fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the processor as a whole to become faster. "Superpipelining" extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, for example, a processor in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can be processed simultaneously in the pipeline, with the processing of one instruction completed during each clock cycle. Therefore, the instruction throughput of an N-stage pipelined architecture is, in theory, N times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every N clock cycles.
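The theoretical N-fold gain described above can be illustrated with a short cycle-count sketch. It assumes an idealized pipeline in which every stage takes exactly one clock cycle and no instruction ever stalls; the function names are hypothetical and are used only for illustration.

```python
def cycles_unpipelined(num_instructions, num_stages):
    """Cycles needed when each instruction must finish all stages
    before the next one starts (one instruction every N cycles)."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions, num_stages):
    """Cycles needed in an idealized pipeline (assumption: no stalls).

    The first instruction takes num_stages cycles to fill the pipeline;
    each later instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def speedup(num_instructions, num_stages):
    """Ratio of unpipelined to pipelined cycle counts."""
    return (cycles_unpipelined(num_instructions, num_stages)
            / cycles_pipelined(num_instructions, num_stages))


if __name__ == "__main__":
    # Six-stage example from the text, for increasingly long instruction streams.
    for count in (6, 100, 10_000):
        print(count, round(speedup(count, 6), 3))
```

Under these assumptions the speedup for a six-stage pipeline approaches, but never quite reaches, six as the instruction stream grows, because the first instruction must still fill the pipeline. This is why the N-fold figure holds only in theory.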
Another technique for increasing overall processor speed is "superscalar" processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (i.e., the execution of an instruction does not depend upon the execution of any other instruction), processor throughput is increased in proportion to the number of instructions processed per clock cycle ("degree of scalability"). If, for example, a particular processor architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the processor is theoretically tripled.
Processor speed may also be increased by implementing instructions that are tailored to specific tasks. For example, MMX™ instructions (MMX™ is a trademark of Intel Corporation) are used in high performance processors to change individual bytes of data within words (16 bits), double words (32 bits) and quad words (64 bits). Previously, reading and changing an individual byte within a larger group of bytes required masking out and shifting the other bytes. This required multiple processor instructions and necessarily slowed down the processor throughput. While MMX™ instructions address this problem by allowing individual bytes to be modified in a single instruction, there is still much room for improvement. It would be advantageous to be able to shift and reorder data by the byte, by the word, by the double word and by the quad word. Additionally, it would be advantageous to implement such new processor functions in a way that uses (or reuses) existing processor circuitry, thereby minimizing redesign and transistor count.
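The mask-and-shift approach mentioned above can be sketched in software as follows. This is only an illustration of the multi-step technique, not of any MMX™ instruction; the function names are hypothetical, and the sketch assumes that byte index 0 denotes the least significant byte of a 32-bit double word.

```python
def read_byte(value, index):
    """Extract one byte: shift it down to bit 0, then mask off the rest."""
    return (value >> (8 * index)) & 0xFF


def replace_byte(value, index, new_byte, width_bytes=4):
    """Replace one byte inside a wider value using separate
    mask, shift, and OR steps (assumption: index 0 is the low byte)."""
    if not 0 <= index < width_bytes:
        raise ValueError("byte index out of range")
    if not 0 <= new_byte <= 0xFF:
        raise ValueError("new_byte must fit in 8 bits")
    shift = 8 * index
    full_mask = (1 << (8 * width_bytes)) - 1    # keep result within the word
    cleared = value & ~(0xFF << shift) & full_mask  # step 1: mask out old byte
    return cleared | (new_byte << shift)            # steps 2-3: shift in new byte


if __name__ == "__main__":
    word = 0x11223344
    print(hex(read_byte(word, 2)))           # 0x22
    print(hex(replace_byte(word, 1, 0xAB)))  # 0x1122ab44
```

Each update takes several dependent operations (build the mask, clear the old byte, shift the new byte, merge), which corresponds to the multiple processor instructions the text describes.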
Therefore, there is a need in the art for an improved instruction set that increases overall microprocessor throughput. More particularly, there is a need in the art for improved microprocessors that can route and shift data in bytes (8 bits), words (16 bits), double words (32 bits) and quad words (64 bits). There is a still further need in the art for improved microprocessors that implement such routing and shifting functions by reusing existing microprocessor circuitry.