Modern processors often include instructions for operations that are computationally intensive but offer a high degree of data parallelism. That parallelism can be exploited through an efficient implementation using various data storage devices, such as single instruction multiple data (SIMD) vector registers.
Some past processors have implemented instructions to perform SIMD shifts, both logical and arithmetic, using a correction circuit that performs either a zero-fill (for left shifts or logical shifts) or a sign-fill (for arithmetic right shifts) on all supported vector element sizes. This approach is not as readily applicable to SIMD rotations, however.
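The shift-plus-correction idea can be illustrated in software. The following is a minimal sketch, not a description of any particular processor: it assumes a 64-bit word standing in for a vector register, assumes that the whole register is shifted at once and that the correction is applied as a per-element mask, and covers right shifts only. The names replicate, simd_shr and simd_sar are hypothetical.

```c
#include <stdint.h>

/* Replicate the low `width` bits of `pattern` into every element of a
 * 64-bit word. `width` is assumed to be 8, 16, 32 or 64. */
static uint64_t replicate(uint64_t pattern, unsigned width)
{
    uint64_t out = 0;
    for (unsigned pos = 0; pos < 64; pos += width)
        out |= pattern << pos;
    return out;
}

/* All-ones mask covering a single element of the given width. */
static uint64_t lane_mask(unsigned width)
{
    return (width == 64) ? ~0ull : ((1ull << width) - 1);
}

/* Logical right shift of every element: shift the whole register, then
 * zero-fill the high bits of each element, which would otherwise hold
 * bits that leaked in from the neighboring element. */
static uint64_t simd_shr(uint64_t v, unsigned width, unsigned n)
{
    n %= width;
    uint64_t keep = replicate(lane_mask(width) >> n, width);
    return (v >> n) & keep;
}

/* Arithmetic right shift of every element: same shift, but the vacated
 * high bits of each element are filled with that element's sign bit. */
static uint64_t simd_sar(uint64_t v, unsigned width, unsigned n)
{
    n %= width;
    uint64_t keep = replicate(lane_mask(width) >> n, width);
    /* Isolate each element's sign bit and move it to the element's bit 0. */
    uint64_t sign = (v & replicate(1ull << (width - 1), width)) >> (width - 1);
    /* Expand each 0/1 into an all-zeros or all-ones element (no carries,
     * since the partial products occupy disjoint elements). */
    uint64_t fill = sign * lane_mask(width);
    return ((v >> n) & keep) | (fill & ~keep);
}
```

For instance, with 8-bit elements, simd_shr(0x80FF80FF80FF80FF, 8, 4) yields 0x080F080F080F080F, while simd_sar on the same input yields 0xF8FFF8FFF8FFF8FF. The correction only has to overwrite bits with a constant per element, which is why no comparable shortcut exists for a rotation, where the filled-in bits must come from the element itself.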
One commonly used technique for performing a fast rotation or shift on a single fixed-width data element is the funnel shifter. In a funnel shifter, the data is first duplicated and the two copies are concatenated. The output of the desired rotation is then selected from within the concatenated copies. If a shift rather than a rotation is desired, the correction circuit technique may be employed to zero-fill or sign-fill the rotated output.
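A funnel shifter for one fixed width can be modeled in a few lines of C. This is a sketch under stated assumptions: the width is fixed at 32 bits, the direction is a right rotation, and the function names funnel_rotr32, funnel_shr32 and funnel_sar32 are hypothetical. The 64-bit intermediate plays the role of the two concatenated copies.

```c
#include <stdint.h>

/* Rotate right: concatenate two copies of x, then select a 32-bit
 * window starting n bits above the bottom. */
static uint32_t funnel_rotr32(uint32_t x, unsigned n)
{
    uint64_t doubled = ((uint64_t)x << 32) | x;   /* duplicate and concatenate */
    return (uint32_t)(doubled >> (n & 31));       /* select the output window */
}

/* Logical right shift: rotate, then zero-fill the bits that wrapped
 * around from the low end of x. */
static uint32_t funnel_shr32(uint32_t x, unsigned n)
{
    n &= 31;
    uint32_t keep = 0xFFFFFFFFu >> n;   /* ones where genuine shifted data sits */
    return funnel_rotr32(x, n) & keep;
}

/* Arithmetic right shift: rotate, then sign-fill the wrapped bits. */
static uint32_t funnel_sar32(uint32_t x, unsigned n)
{
    n &= 31;
    uint32_t keep = 0xFFFFFFFFu >> n;
    uint32_t fill = (x & 0x80000000u) ? ~keep : 0u;   /* copies of the sign bit */
    return (funnel_rotr32(x, n) & keep) | fill;
}
```

As a usage example, funnel_rotr32(0x12345678, 8) gives 0x78123456, funnel_shr32(0x12345678, 8) gives 0x00123456, and funnel_sar32(0x80000000, 4) gives 0xF8000000. All three share the same duplicate-and-select path; only the final correction differs.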
The difficulty in supporting SIMD rotations lies in making the shifter work for several different data widths. In theory, the duplication and concatenation could be modified to support more data widths, but because this adds one or more additional selection stages to the shifter, it may limit the shifter's performance and thereby reduce the performance advantages otherwise expected, for example, from a wide vector architecture. To date, potential solutions to such performance-limiting issues have not been adequately explored.
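To make the width dependence concrete, the sketch below rotates every element of a 64-bit word for a caller-chosen element width. It is a software illustration only, with a hypothetical name (simd_rotr) and an assumed restriction to 8-, 16- and 32-bit elements. Note that where the second copy is placed depends on the width; in hardware that choice corresponds to the extra selection stage discussed above.

```c
#include <stdint.h>

/* Rotate every element of v right by n bits.
 * `width` is assumed to be 8, 16 or 32 (so two copies fit in 64 bits). */
static uint64_t simd_rotr(uint64_t v, unsigned width, unsigned n)
{
    uint64_t ones = (1ull << width) - 1;   /* mask for one element */
    uint64_t out = 0;
    n %= width;
    for (unsigned pos = 0; pos < 64; pos += width) {
        uint64_t lane = (v >> pos) & ones;
        /* Width-dependent duplication: the second copy sits `width`
         * bits above the first, so the wiring differs per width. */
        uint64_t doubled = (lane << width) | lane;
        /* Select the rotated window and put it back in place. */
        out |= ((doubled >> n) & ones) << pos;
    }
    return out;
}
```

For example, simd_rotr(0x12345678ABCDEF01, 16, 4) rotates each 16-bit element by one hex digit and gives 0x41238567DABC1EF0, whereas the same call with a width of 32 and a count of 8 gives 0x7812345601ABCDEF.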