A single instruction multiple data stream (SIMD) instruction enables a computer that supports parallel processing to perform, based on a single instruction such as an add instruction, a single operation on more than one data stream in parallel. Such a SIMD instruction can be used to speed up processing and to make efficient use of registers that have an adequate number of bytes. A scalar SIMD instruction differs from a parallel (non-scalar) SIMD instruction in that the operation specified in a scalar SIMD instruction is carried out on only one of the multiple data elements, whereas the operation specified by a parallel SIMD instruction is carried out on all of the multiple data elements. This difference is illustrated in FIGS. 1(a) and 1(b).
A parallel SIMD instruction operates simultaneously on all data elements. FIG. 1(a) (Prior art) shows how different data elements in two parallel registers are added in parallel via a parallel SIMD instruction “ADD PS Xmm1, Xmm2”, where “ADD PS” indicates an add (ADD) instruction that operates on packed single-precision (PS) data elements, and registers Xmm1 and Xmm2 store the multiple data elements (operands) that are to be added in parallel. In this example, each register has 128 bits, corresponding to four data elements, each of which has 32 bits. The data elements in register Xmm1 have floating point values of 3.5, 12.8, 0.32, and 1.0, respectively. The data elements in register Xmm2 have floating point values of 4.3, 7.1, 2.65, and 4.0, respectively. When the parallel SIMD instruction “ADD PS Xmm1, Xmm2” is performed, the values in corresponding data elements of the two registers are added simultaneously, yielding values of 7.8, 19.9, 2.97, and 5.0, respectively. The addition result is stored in the destination register Xmm1 (which is also a source register).
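The element-wise behavior of a parallel add can be sketched in software as follows. This is only an illustrative model, not an implementation of the instruction: each 128-bit register is assumed to be represented as a Python list of four values, listed from the highest 32-bit element to the lowest as in the text above, and the function name add_ps is hypothetical. Python floats are double precision, so the sums only approximate what 32-bit hardware would produce.

```python
def add_ps(dst, src):
    """Model of a parallel add: each element of dst is added to the
    corresponding element of src, and a new four-element result is returned."""
    if len(dst) != 4 or len(src) != 4:
        raise ValueError("each modeled register must hold four data elements")
    return [a + b for a, b in zip(dst, src)]


# Modeled register contents (assumption: listed highest element first).
xmm1 = [3.5, 12.8, 0.32, 1.0]
xmm2 = [4.3, 7.1, 2.65, 4.0]

# Xmm1 is both a source and the destination, so the result overwrites it.
xmm1 = add_ps(xmm1, xmm2)
print([round(v, 2) for v in xmm1])
```

In this model all four elements of the destination change, which is the property that distinguishes the parallel form from the scalar form.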
A scalar SIMD instruction performs computation on only one data element stored in each of the operand parallel registers, as illustrated in FIG. 1(b) (Prior art). An example scalar SIMD instruction “ADD SS Xmm1, Xmm2” performs an addition operation on the single data elements stored in the lowest 32 bits of the two operand registers (i.e., Xmm1 and Xmm2). In the illustrated example, only the values 1.0 and 4.0 (which occupy the lowest 32 bits of Xmm1 and Xmm2, respectively) are added, yielding 5.0, which is stored in the lowest 32 bits of the destination register (Xmm1). During this operation, the upper 96 bits (i.e., bits 32-127) of the registers should remain unchanged. That is, the execution of the “ADD SS” instruction needs to ensure the integrity of all the upper bits of the registers.
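The scalar behavior can be modeled in the same illustrative way. In this sketch, which rests on assumptions rather than on any actual instruction implementation, a register is a four-element Python list whose last entry stands for the lowest 32 bits, and add_ss is a hypothetical function name. Only the last entry is computed; the other three are copied through unchanged.

```python
def add_ss(dst, src):
    """Model of a scalar add: only the lowest element (modeled as the last
    list entry) is added; the upper three elements of dst are preserved."""
    if len(dst) != 4 or len(src) != 4:
        raise ValueError("each modeled register must hold four data elements")
    result = list(dst)              # copy so the upper elements stay intact
    result[-1] = dst[-1] + src[-1]  # operate on the lowest element only
    return result


# Modeled register contents (assumption: lowest 32 bits are the last entry).
xmm1 = [3.5, 12.8, 0.32, 1.0]
xmm2 = [4.3, 7.1, 2.65, 4.0]

xmm1 = add_ss(xmm1, xmm2)
print(xmm1)
```

Here the first three entries of the modeled Xmm1 are identical before and after the call, corresponding to the requirement that the upper bits keep their integrity.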
To ensure the integrity of the upper 96 bits, conventional solutions extract the single data elements from the parallel registers involved (e.g., extract 1.0 and 4.0 from the lower 32 bits of the registers), place the extracted data elements elsewhere to perform the computation, and then merge the result (e.g., 5.0) into the intended destination parallel register (e.g., Xmm1). This involves four separate operations, namely two extraction operations to extract the source data elements (e.g., 1.0 and 4.0), one computation operation (e.g., ADD SS), and one merging operation to merge the result (e.g., 5.0) back into the destination parallel register (e.g., Xmm1).
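The four-operation sequence described above can be sketched as follows. The helper names extract_low and merge_low are hypothetical, the list-based register model (lowest 32 bits as the last entry) is an assumption made for illustration, and the sketch is not meant to reflect how any particular conventional system is actually built.

```python
def extract_low(reg):
    """Extraction: read the lowest element (modeled as the last list entry)."""
    return reg[-1]


def merge_low(reg, value):
    """Merging: return a copy of reg with only its lowest element replaced."""
    return reg[:-1] + [value]


# Modeled register contents (assumption: lowest 32 bits are the last entry).
xmm1 = [3.5, 12.8, 0.32, 1.0]
xmm2 = [4.3, 7.1, 2.65, 4.0]

a = extract_low(xmm1)               # operation 1: extract first source element
b = extract_low(xmm2)               # operation 2: extract second source element
scalar_sum = a + b                  # operation 3: compute outside the registers
xmm1 = merge_low(xmm1, scalar_sum)  # operation 4: merge the result into Xmm1
print(xmm1)
```

The final contents of the modeled Xmm1 match those of a direct scalar add, but four operations are needed to reach them.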