Single instruction multiple data (SIMD) instructions may be used in processing systems for exploiting data parallelism. Data parallelism exists when a same or common task needs to be performed on two or more data elements of a data vector, for example. Rather than use multiple instructions, the common task may be performed on the two or more data elements in parallel by using a single SIMD instruction which defines the same instruction to be performed on multiple data elements in corresponding multiple SIMD lanes.
SIMD instructions may include one or more vector operands such as source and destination vector operands. Each vector operand would include two or more data elements. For SIMD instructions, all data elements belonging to the same vector operand may generally be of the same bit-width. However, some SIMD instructions may specify mixed-width operands where data elements of a first vector operand may be of a first bit-width and data elements of a second vector operand may be of a second bit-width, where the first and second bit-widths differ from each other. Execution of SIMD instructions with mixed-width operands may involve several challenges.
FIGS. 1A-C illustrate examples of challenges involved in conventional implementations for executing SIMD instructions with mixed-width operands. With reference to FIG. 1A, a first conventional implementation for executing SIMD instruction 100 is illustrated. It is assumed that SIMD instruction 100 may be executed by a conventional processor (not shown) which supports a 64-bit instruction set architecture (ISA). This means that instructions such as SIMD instruction 100 may specify operands with bit-widths up to 64-bits. The 64-bit operands may be specified in terms of 64-bit registers or a pair of 32-bit registers.
The object of SIMD instruction 100 is to execute the same instruction on each data element of source operand 102. Source operand 102 is a 64-bit vector comprising eight 8-bit data elements labeled 0-7. Source operand 102 may be stored in a single 64-bit register or a pair of 32-bit registers. The same instruction or common operation to be executed on each of the eight data elements 0-7 may be, for example, multiplication, square function, left-shift function, increment function, addition (e.g., with a constant value or immediate fields in the instruction or with values provided by another vector operand), etc., the result of which may consume more than 8-bits, and up to 16-bits of storage for each of the eight resulting data elements. This means that the result of SIMD instruction 100 may consume twice the storage space that source operand 102 may consume, i.e., two 64-bit registers or two pairs of 32-bit registers.
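By way of illustration only, the widening behavior described above can be modeled in scalar C. This is a minimal sketch under stated assumptions: the common operation is taken to be a square function, data element 0 is assumed to occupy the least significant byte of source operand 102, and the names common_op, elem8, widened_element, and fits_in_8_bits are hypothetical helpers that do not appear in the figures.

```c
#include <stdint.h>

/* Assumed common operation: square an 8-bit data element.
 * The product of two 8-bit values always fits in 16 bits. */
static uint16_t common_op(uint8_t x)
{
    return (uint16_t)((unsigned)x * (unsigned)x);
}

/* Extract 8-bit data element i (0-7) from a 64-bit source vector.
 * Assumption: element 0 is the least significant byte. */
static uint8_t elem8(uint64_t v, int i)
{
    return (uint8_t)(v >> (8 * i));
}

/* Result of the common operation on data element i of src. */
uint16_t widened_element(uint64_t src, int i)
{
    return common_op(elem8(src, i));
}

/* Returns 1 if a result could be stored in the original 8-bit lane. */
int fits_in_8_bits(uint16_t r)
{
    return r <= 0xFFu;
}
```

With eight such 16-bit results, the full result occupies 128 bits, which is twice the 64 bits of the source operand.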
Since the conventional processor configured to implement SIMD instruction 100 does not include instructions which specify operands of bit-widths greater than 64-bits, SIMD instruction 100 may be divided into two component SIMD instructions 100X and 100Y. SIMD instruction 100X specifies the common operation to be performed on data elements labeled with even-numbers (or “even-numbered data elements”) 0, 2, 4, and 6 of source operand 102. SIMD instruction 100X specifies destination operand 104x which is 64-bits wide and includes 16-bit data elements labeled A, C, E, and G, each of which is composed of high (H) 8-bits and low (L) 8-bits. The results of the common operation on even-numbered 8-bit data elements 0, 2, 4, and 6 of source operand 102 are correspondingly written to 16-bit data elements A, C, E, and G of destination operand 104x. SIMD instruction 100Y is similar to SIMD instruction 100X with the difference that SIMD instruction 100Y specifies the common operation on data elements labeled with odd-numbers (or “odd-numbered data elements”) 1, 3, 5, and 7 of source operand 102 with the results to be written to 16-bit data elements B, D, F, and H of destination operand 104y which is also a 64-bit operand similar to destination operand 104x of SIMD instruction 100X. In this manner, each of the SIMD instructions 100X and 100Y can specify one 64-bit destination operand, and together, SIMD instructions 100X and 100Y can accomplish the execution of the common operation on each of the data elements 0-7 of source operand 102. However, the two separate instructions needed to implement SIMD instruction 100 increase code space.
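The even/odd division of FIG. 1A may be sketched in scalar C as follows. The sketch is illustrative rather than a description of any actual instruction encoding: the common operation is assumed to be a square function, data element 0 is assumed to occupy the least significant byte of source operand 102, 16-bit data elements A and B are assumed to occupy the least significant 16 bits of their respective destination operands, and the function names simd_100x and simd_100y are hypothetical.

```c
#include <stdint.h>

/* Assumed common operation: square an 8-bit data element into 16 bits. */
static uint16_t common_op(uint8_t x)
{
    return (uint16_t)((unsigned)x * (unsigned)x);
}

/* Extract 8-bit data element i (0-7); element 0 is assumed to be
 * the least significant byte of the 64-bit vector. */
static uint8_t elem8(uint64_t v, int i)
{
    return (uint8_t)(v >> (8 * i));
}

/* Model of component instruction 100X: even-numbered data elements
 * 0, 2, 4, 6 produce 16-bit data elements A, C, E, G. */
uint64_t simd_100x(uint64_t src)
{
    uint64_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        dst |= (uint64_t)common_op(elem8(src, 2 * lane)) << (16 * lane);
    }
    return dst;
}

/* Model of component instruction 100Y: odd-numbered data elements
 * 1, 3, 5, 7 produce 16-bit data elements B, D, F, H. */
uint64_t simd_100y(uint64_t src)
{
    uint64_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        dst |= (uint64_t)common_op(elem8(src, 2 * lane + 1)) << (16 * lane);
    }
    return dst;
}
```

Under these assumptions, each 16-bit result lane overlaps the bit positions of the pair of 8-bit source elements it is derived from, but two separate calls (modeling two separate instructions) are still required to cover all eight data elements.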
FIG. 1B illustrates a second conventional implementation of SIMD instruction 100 using a different set of component SIMD instructions 120X and 120Y. SIMD instructions 120X and 120Y each specify the common operation on each of the 8-bit data elements 0-7 of source operand 102. SIMD instruction 120X specifies destination operand 124x into which the low (L) 8-bits of the results are to be written, to corresponding 8-bit result data elements A-H of destination operand 124x (while the high (H) 8-bits of the results are discarded). Similarly, instruction 120Y specifies destination operand 124y into which the high (H) 8-bits of the results are to be written, to corresponding 8-bit data elements A-H of destination operand 124y (while the low (L) 8-bits of the results are discarded). This second conventional implementation of SIMD instruction 100 also suffers from increased code space for the two component SIMD instructions 120X and 120Y. Moreover, as can be appreciated, the second conventional implementation also incurs wastage of power in calculating and discarding either the high (H) 8-bits (e.g., in executing instruction 120X) or the low (L) 8-bits (e.g., in executing instruction 120Y) for each of the data elements 0-7 of source operand 102.
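The low/high division of FIG. 1B may likewise be sketched in scalar C. As before, this is only a sketch under assumptions: the common operation is taken to be a square function, data element 0 is assumed to occupy the least significant byte, result data element A is assumed to occupy the least significant byte of each destination operand, and the function names simd_120x and simd_120y are hypothetical.

```c
#include <stdint.h>

/* Assumed common operation: square an 8-bit data element into 16 bits. */
static uint16_t common_op(uint8_t x)
{
    return (uint16_t)((unsigned)x * (unsigned)x);
}

/* Extract 8-bit data element i (0-7); element 0 is assumed to be
 * the least significant byte of the 64-bit vector. */
static uint8_t elem8(uint64_t v, int i)
{
    return (uint8_t)(v >> (8 * i));
}

/* Model of component instruction 120X: the full 16-bit result is
 * computed for every element, but only the low (L) 8 bits are kept. */
uint64_t simd_120x(uint64_t src)
{
    uint64_t dst = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t r = common_op(elem8(src, i));
        dst |= (uint64_t)(r & 0xFFu) << (8 * i);   /* high 8 bits discarded */
    }
    return dst;
}

/* Model of component instruction 120Y: the same 16-bit result is
 * computed again, and only the high (H) 8 bits are kept. */
uint64_t simd_120y(uint64_t src)
{
    uint64_t dst = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t r = common_op(elem8(src, i));
        dst |= (uint64_t)(r >> 8) << (8 * i);      /* low 8 bits discarded */
    }
    return dst;
}
```

In this model the common operation is evaluated sixteen times to produce eight results, with half of each evaluation thrown away, which corresponds to the wasted work noted above.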
FIG. 1C illustrates a third conventional implementation of SIMD instruction 100 using yet another set of component SIMD instructions 140X and 140Y, which are similar in some ways to SIMD instructions 100X and 100Y of FIG. 1A. The difference lies in which ones of the data elements of source operand 102 are operated on by each SIMD instruction. In more detail, rather than even-numbered 8-bit data elements, SIMD instruction 140X specifies the common operation to be performed on the lower four data elements 0-3 of source operand 102. The results are written to 16-bit data elements A, B, C, and D of destination operand 144x. However, execution of SIMD instruction 140X involves the spreading out of the results of the operation on the lower four 8-bit data elements (spanning 32 bits) across all 64-bits of destination operand 144x. SIMD instruction 140Y is similar and specifies the spreading out of the results of the operation on the upper four 8-bit data elements 4-7 of source operand 102 across 16-bit data elements E, F, G, and H of 64-bit destination operand 144y. Apart from increased code size as in the first and second conventional implementations, these spreading-out data movements as seen in the third conventional implementation may need additional hardware such as a crossbar.
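The lower-half/upper-half division of FIG. 1C may be sketched in scalar C in the same manner. The sketch assumes a square function as the common operation, data element 0 in the least significant byte of source operand 102, and 16-bit data elements A and E in the least significant 16 bits of their respective destination operands; the function names simd_140x and simd_140y are hypothetical.

```c
#include <stdint.h>

/* Assumed common operation: square an 8-bit data element into 16 bits. */
static uint16_t common_op(uint8_t x)
{
    return (uint16_t)((unsigned)x * (unsigned)x);
}

/* Extract 8-bit data element i (0-7); element 0 is assumed to be
 * the least significant byte of the 64-bit vector. */
static uint8_t elem8(uint64_t v, int i)
{
    return (uint8_t)(v >> (8 * i));
}

/* Model of component instruction 140X: lower data elements 0-3
 * (source bits 0-31) are spread across 16-bit data elements A-D
 * (destination bits 0-63). */
uint64_t simd_140x(uint64_t src)
{
    uint64_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        /* Source bit offset 8*lane moves to destination offset 16*lane. */
        dst |= (uint64_t)common_op(elem8(src, lane)) << (16 * lane);
    }
    return dst;
}

/* Model of component instruction 140Y: upper data elements 4-7
 * (source bits 32-63) are spread across 16-bit data elements E-H. */
uint64_t simd_140y(uint64_t src)
{
    uint64_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        /* Source bit offset 32 + 8*lane moves to destination offset 16*lane. */
        dst |= (uint64_t)common_op(elem8(src, 4 + lane)) << (16 * lane);
    }
    return dst;
}
```

In this model, data element 3 is read from source bits 24-31 but its result is written to destination bits 48-63, and data element 4 is read from source bits 32-39 but written to destination bits 0-15. Such cross-lane movement is the kind of routing for which a crossbar may be needed in hardware.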
Accordingly, there is a need for improved implementations of mixed-width SIMD instructions which avoid the aforementioned drawbacks of the conventional implementations.