The disclosure is generally directed to a single-instruction multiple-data (SIMD) processor and, more specifically, to a three-dimensional permute unit for a SIMD processor.
A vector permute unit is one of the fundamental functional blocks of a central processing unit (CPU) or processor. A vector permute unit enables a bit-wise or byte-wise exchange of individual bits or bytes within a data word of a processor. In other words, a permute unit may rearrange the bytes of a data word. In particular, a vector permute unit may be used to align and re-sort data into vectors, which may be used as input operands for parallel single-instruction multiple-data (SIMD) execution units. As one example, the bytes of two 128-bit wide operands may be permuted, byte-wise on byte boundaries, into a 128-bit wide result vector. If an operand length is limited to 128 bits, a conventional vector permute unit may only support 128-bit SIMD operations, that is, operations on the equivalent of four 32-bit operands.
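The byte-wise permutation described above may be illustrated by a minimal software sketch. The function name `permute_bytes` and the control-vector encoding (one source index per result byte, indexing into the concatenation of the two operands) are assumptions made for illustration only and are not taken from any particular instruction set or from the disclosure.

```python
from typing import Sequence


def permute_bytes(a: bytes, b: bytes, control: Sequence[int]) -> bytes:
    """Build a result vector by selecting bytes from the concatenation a + b.

    Hypothetical encoding: control[i] is the index, within a + b, of the
    source byte placed at result position i.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same width")
    if len(control) != len(a):
        raise ValueError("one control entry is needed per result byte")
    source = a + b  # e.g. two 16-byte operands form a 32-byte source
    for index in control:
        if not 0 <= index < len(source):
            raise ValueError("control index out of range")
    return bytes(source[index] for index in control)


# Two 128-bit (16-byte) operands holding the values 0..15 and 16..31.
operand_a = bytes(range(16))
operand_b = bytes(range(16, 32))

# Interleave the low eight bytes of each operand: a0, b0, a1, b1, ...
interleave = [i // 2 + (16 if i % 2 else 0) for i in range(16)]
result = permute_bytes(operand_a, operand_b, interleave)
```

In this sketch each result byte may come from any of the 32 source bytes, which is the selection that a hardware crossbar performs in parallel for all result bytes.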
To support future applications of 4-way double precision floating point operations or 8-way SIMD 32-bit integer operations, the width of the data paths of a vector permute unit needs to be doubled. For example, two 256-bit input operands may be concatenated to provide a 512-bit wide input vector. The result may then be a 256-bit vector produced by a byte-wise permutation of the source operands into a 32-byte result vector. State-of-the-art extensions to permute units would require that a crossbar be built as combinations of multilevel 8-bit multiplexers. In this case, the number of logic gates may increase by a factor of four and data paths may suffer from increased wire length and fan-out. As a result, the performance of an extended crossbar supporting 512-bit to 256-bit byte-wise vector permutation may be dramatically reduced. For example, instead of a one-cycle operation of the crossbar, latency may increase to three cycles.
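The factor-of-four growth may be made plausible by a simple count. The following sketch assumes, for illustration only, that crossbar cost scales with the number of byte-level crosspoints (source bytes multiplied by result bytes); the function name `crosspoints` is hypothetical, and actual gate counts depend on the multiplexer implementation.

```python
def crosspoints(source_bytes: int, result_bytes: int) -> int:
    """Count byte-level crosspoints of a full crossbar (assumed cost model)."""
    return source_bytes * result_bytes


# Conventional unit: two 128-bit operands (32 bytes) into a 16-byte result.
narrow = crosspoints(32, 16)

# Extended unit: two 256-bit operands (64 bytes) into a 32-byte result.
wide = crosspoints(64, 32)

ratio = wide // narrow
```

Under this assumption, doubling both the source width and the result width quadruples the crosspoint count (512 versus 2048), which is consistent with the fourfold increase in logic gates noted above.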
A number of conventional approaches have been employed to handle increased data throughput and packaging. For example, U.S. Patent Application Publication No. 2008/0253085 discloses a 3-dimensional (3D) multi-layer modular computer architecture that improves computer mass, volume, and power densities through the use of 3D layered structures. As another example, U.S. Patent Application Publication No. 2011/0119508 discloses a computing system having a stack of microprocessor chips that are designed to work together in a multiprocessor system. The chips are interconnected with 3D through-vias or, alternatively, by compatible package carriers that provide interconnections. Logically, the chips in a stack are interconnected via specialized cache-coherent interconnections.