1. Technical Field
The present invention relates in general to multimedia facilities within processors and in particular to facilities within processors executing permute or shift operations for multimedia applications. Still more particularly, the present invention relates to a fast shift decode mechanism for facilities within processors executing permute or shift operations for multimedia applications.
2. Description of the Related Art
Multimedia applications are increasing, leading to an increased demand for multimedia facilities within processors. Processors, such as the PowerPC™ processor available from IBM Corporation of Armonk, N.Y., are increasingly incorporating such multimedia facilities. In the case of the PowerPC™, the multimedia facility is the vector multimedia extensions (VMX) facility.
One of the sub-units of the VMX multimedia processor engine is the vector permute unit (VPU). This unit is responsible for performing byte reordering, packing, unpacking, byte shifting, etc. In particular, this unit is responsible for performing byte reordering for the VMX vperm (vector permute) instruction of the PowerPC™ architecture, which reorders bytes within a source operand VA or VB according to target designations within quadword operand VC.
At the core of the VPU is a 32:16 byte-wide crossbar which can place any of 32 source bytes into any of 16 target byte positions. The current implementation of the crossbar network is a set of 16 33:1 byte-wide passgate multiplexers. Each 33:1 multiplexer is controlled by 32 selects, which may pass any source byte of operands VA or VB to a common target byte, and a "zero select" that is utilized to select zeros in the shift cases or in cases when the crossbar is not being utilized. FIG. 3 depicts a simple diagram of the crossbar. The flow for target byte 0 of the crossbar output is shown, and includes a 33:1 multiplexer capable of passing any byte of operands VA or VB to target byte 0 of the crossbar output. Multiplexer selects vpca_sel_0_0 through vpca_sel_31_0 are employed to select a byte from input operand VA or input operand VB to be passed to crossbar output xbar_out_0. The mechanism shown for target byte 0 is replicated for target bytes 1 through 15.
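The crossbar organization described above can be illustrated with a small behavioral model. The following is only a sketch under stated assumptions, not the hardware implementation: the function names mux33 and crossbar are hypothetical, select lines are modeled as lists of 0/1 integers, and the 32 source bytes are assumed to be ordered as the 16 bytes of VA followed by the 16 bytes of VB.

```python
def mux33(sources, selects, zero_select):
    """Behavioral model of one 33:1 byte-wide multiplexer.

    sources     -- the 32 source bytes (assumed order: VA bytes, then VB bytes)
    selects     -- 32 select lines (0 or 1), one per source byte
    zero_select -- the 33rd select line, which forces a zero byte
    """
    assert len(sources) == 32 and len(selects) == 32
    # The model expects exactly one of the 33 select lines to be asserted.
    assert sum(selects) + int(zero_select) == 1
    if zero_select:
        return 0
    return sources[selects.index(1)]


def crossbar(va, vb, selects, zero_selects):
    """Model of the 32:16 crossbar as 16 replicated 33:1 multiplexers.

    selects[y][x] is the select that passes source byte x to target byte y.
    """
    sources = list(va) + list(vb)
    return [mux33(sources, selects[y], zero_selects[y]) for y in range(16)]


# Usage: route source byte (31 - y) to target byte y.
va = list(range(16))
vb = list(range(16, 32))
selects = [[1 if x == 31 - y else 0 for x in range(32)] for y in range(16)]
xbar_out = crossbar(va, vb, selects, [0] * 16)
```

In this model, asserting the zero select for every target byte yields an all-zero output, which corresponds to the case in which the crossbar is not being utilized.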
The selects for each multiplexer for each respective target byte are of the form vperm_sel_X_Y, where X is the source byte and Y is the target byte. The decoding required for generating the required crossbar selects is illustrated in FIG. 4. For the VMX vperm instruction, 32 selects for each target byte are generated by decoding the lower 5 bits of the respective byte in the operand VC register. That is, for target byte 0, the crossbar selects are generated by a 5-to-32 decode of bits 5-8 in byte 0 of the operand VC. Similarly, the crossbar selects for target byte 1 are generated from a 5-to-32 decode of bits 5-8 of VC byte 1, etc.
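The select generation for vperm may be sketched as follows. This is an illustrative model only: the names decode_5_to_32, vperm_selects, and vperm are hypothetical, the lower 5 bits of each VC byte are taken as the value of that byte masked with 0x1F, and source bytes 0 through 15 are assumed to come from VA and source bytes 16 through 31 from VB.

```python
def decode_5_to_32(control_byte):
    """5-to-32 decode of the lower 5 bits of one VC byte (one-hot result)."""
    index = control_byte & 0x1F  # keep only the lower 5 bits
    return [1 if x == index else 0 for x in range(32)]


def vperm_selects(vc):
    """Return selects[y][x], modeling vperm_sel_X_Y for all 16 target bytes."""
    assert len(vc) == 16
    return [decode_5_to_32(vc[y]) for y in range(16)]


def vperm(va, vb, vc):
    """Apply the decoded selects to the 32 source bytes (VA then VB, assumed)."""
    sources = list(va) + list(vb)
    selects = vperm_selects(vc)
    return [sources[selects[y].index(1)] for y in range(16)]


# Usage: the upper 3 bits of each VC byte are ignored by the decode.
va = list(range(100, 116))
vb = list(range(200, 216))
vc = [31, 0, 16, 0xE1] + [0] * 12
result = vperm(va, vb, vc)
```

Each call to decode_5_to_32 asserts exactly one of the 32 selects, so the 16 decodes together produce the 512 select lines discussed in connection with the qualification signal.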
The vperm_sel_X_Y lines will then need to be qualified (vperm_qual) by verifying that the current instruction being executed is indeed a vperm instruction. This qualification requirement creates a critical timing path problem since the vperm_qual signal will have a minimum fanout of 512 (thirty-two selects per target byte with sixteen target bytes). The critical path through the VPU is from the decode of instruction operands, through the crossbar select generation, to the output of the crossbar. The required 512 fanout for the qualification signal vperm_qual may introduce unacceptable latency within this critical path, and may increase the execution time for a vperm instruction beyond 1 processor cycle.
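The fanout figure follows directly from the dimensions of the crossbar. The sketch below models the qualification as a logical AND of each raw select with vperm_qual and counts the resulting loads; the names qualify and qual_fanout are hypothetical, and the model says nothing about actual circuit timing.

```python
NUM_SOURCES = 32  # selects per target byte
NUM_TARGETS = 16  # target bytes in the crossbar output


def qualify(raw_selects, vperm_qual):
    """AND every raw select line with the single qualification signal."""
    return [[s & vperm_qual for s in row] for row in raw_selects]


def qual_fanout():
    """Number of select lines that the one qualification signal must drive."""
    return NUM_SOURCES * NUM_TARGETS


# Usage: with vperm_qual deasserted, no source select reaches the crossbar.
raw = [[1 if x == y else 0 for x in range(NUM_SOURCES)]
       for y in range(NUM_TARGETS)]
blocked = qualify(raw, 0)
passed = qualify(raw, 1)
```

Because every one of the 512 AND terms depends on the same signal, that signal sits on the path from operand decode to crossbar output for every target byte.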
The crossbar within the VPU is also utilized for the VMX vslo (vector shift left by octet) and vsro (vector shift right by octet) instructions of the PowerPC™ architecture. These instructions shift the bytes of operand VA left or right by a number of bytes indicated within bits 121-124 of operand VB. Because the crossbar is employed to perform the shifting, the crossbar selects which are asserted as a result of decoding the vslo and vsro shift amounts must be similarly qualified with verification that the instruction being executed is, in fact, a vslo or vsro instruction.
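The shift cases may be sketched in the same style. The model below rests on several assumptions that should be checked against the architecture documentation: bit 0 is taken to be the most significant bit of the 128-bit operand, so that bits 121-124 fall within byte 15 of VB; byte 0 is taken to be the leftmost byte; VA is assumed to occupy crossbar source positions 0 through 15; and vacated byte positions are filled through the zero select. The function names are hypothetical.

```python
def shift_amount(vb):
    """Extract bits 121-124 of VB (assumed big-endian bit numbering)."""
    return (vb[15] >> 3) & 0xF


def shift_selects(shift, left):
    """Build crossbar selects and zero selects for a shift by `shift` bytes."""
    selects, zero_selects = [], []
    for y in range(16):
        x = y + shift if left else y - shift  # source byte feeding target y
        in_range = 0 <= x < 16
        selects.append([1 if in_range and s == x else 0 for s in range(32)])
        zero_selects.append(0 if in_range else 1)  # zero-fill vacated bytes
    return selects, zero_selects


def shift_octets(va, vb, left):
    """Model of vslo (left=True) or vsro (left=False) through the crossbar."""
    selects, zero_selects = shift_selects(shift_amount(vb), left)
    sources = list(va) + list(vb)
    return [0 if zero_selects[y] else sources[selects[y].index(1)]
            for y in range(16)]


# Usage: shift VA by 3 bytes in each direction.
va = list(range(1, 17))
vb = [0] * 15 + [3 << 3]
shifted_left = shift_octets(va, vb, left=True)
shifted_right = shift_octets(va, vb, left=False)
```

In this model every target byte asserts either one source select or its zero select, which is the same select structure used by vperm and is the reason the shift selects carry the same qualification requirement.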
It would be desirable, therefore, to provide a mechanism for eliminating or reducing the qualification requirement for crossbar selects employed when performing the vperm, vslo, vsro, or equivalent instructions. It would further be advantageous if the mechanism permitted a one-cycle latency for execution of instructions employing the crossbar within the VPU.