Certain types of applications often require the same operation to be performed on a large number of data items (referred to as “data parallelism”). Single Instruction Multiple Data (SIMD) refers to a type of instruction that causes a processor to perform an operation on multiple data items. SIMD technology is especially suited to processors that can logically divide the bits in a register into a number of fixed-sized data elements, each of which represents a separate value. For example, the bits in a 256-bit register may be specified as a source operand to be operated on as four separate 64-bit packed data elements (quad-word (Q) size data elements), eight separate 32-bit packed data elements (double word (D) size data elements), sixteen separate 16-bit packed data elements (word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). This type of data is referred to as “packed” data type or a “vector” data type, and operands of this data type are referred to as packed data operands or vector operands. In other words, a packed data item or vector refers to a sequence of packed data elements, and a packed data operand or a vector operand is a source or destination operand of a SIMD instruction (also known as a packed data instruction or a vector instruction).
The SIMD technology, such as that employed by the Intel® Core™ processors having an instruction set including x86, MMX™, Streaming SIMD Extensions (SSE), SSE2, SSE3, SSE4.1, and SSE4.2 instructions, has enabled a significant improvement in application performance. An additional set of SIMD extensions, referred to the Advanced Vector Extensions (AVX) (AVX1 and AVX2) and using the Vector Extensions (VEX) coding scheme, has been released (see, e.g., see Intel® 64 and IA-32 Architectures Software Developers Manual, October 2011; and see Intel® Advanced Vector Extensions Programming Reference, June 2011).
Compression problems arise then not all results of SIMD/vector calculations are valid. In such cases, it is desirable to compress the valid results so that they go one after another in the same order as in the original vector before being saved to memory. The most prominent example of such a vector is 512-bit width vector in which every byte is a result of calculations and a 64-bit mask shows which byte is valid. For example, if the corresponding bit of the mask is one, then the byte is valid and if the corresponding bit of the mask is zero, then the byte is invalid. It would be beneficial in such cases to “compress” all of the valid results into contiguous storage locations prior to saving the results to memory (e.g., placing all of the valid results together on one side of the register and discarding the invalid results).