An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, and may include the native data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O). It should be noted that the term instruction generally refers herein to a macro-instruction (that is, an instruction that is provided to the processor for execution), as opposed to micro-instructions or micro-ops (that is, the result of a processor's decoder decoding macro-instructions). The instruction set architecture is distinguished from the microarchitecture, which is the internal design of the processor implementing the ISA. Processors with different microarchitectures can share a common instruction set.
An instruction set includes one or more instruction formats. A given instruction format defines various fields (number of bits, location of bits) to specify, among other things, the operation to be performed and the operand(s) on which that operation is to be performed. A given instruction is expressed using a given instruction format and specifies the operation and the operands. An instruction stream is a specific sequence of instructions, where each instruction in the sequence is an occurrence of an instruction in an instruction format.
Scientific, financial, auto-vectorized general purpose, RMS (recognition, mining, and synthesis)/visual and multimedia applications (e.g., 2D/3D graphics, image processing, video compression/decompression, voice recognition algorithms and audio manipulation) often require the same operation to be performed on a large number of data items (referred to as “data parallelism”). Single Instruction Multiple Data (SIMD) refers to a type of instruction that causes a processor to perform the same operation on multiple data items. SIMD technology is especially suited to processors that can logically divide the bits in a register into a number of fixed-sized data elements, each of which represents a separate value. For example, the bits in a 64-bit register may be specified as a source operand to be operated on as four separate 16-bit data elements, each of which represents a separate 16-bit value. As another example, the bits in a 256-bit register may be specified as a source operand to be operated on as four separate 64-bit packed data elements (quad-word (Q) size data elements), eight separate 32-bit packed data elements (double word (D) size data elements), sixteen separate 16-bit packed data elements (word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). This type of data is referred to as the packed data type or vector data type, and operands of this data type are referred to as packed data operands or vector operands. In other words, a packed data item or vector refers to a sequence of packed data elements; and a packed data operand or a vector operand is a source or destination operand of a SIMD instruction (also known as a packed data instruction or a vector instruction).
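As a minimal sketch of how the bits of one register can be viewed as several fixed-size data elements, the following Python models a register as a plain integer. The function name unpack_lanes and the choice to place data element 0 in the least significant bits are assumptions made for illustration only; they are not taken from any particular instruction set.

```python
def unpack_lanes(register, register_bits, lane_bits):
    """Split an integer 'register' into fixed-width data elements.

    Assumption: data element 0 occupies the least significant bits.
    """
    if register_bits % lane_bits != 0:
        raise ValueError("lane width must evenly divide the register width")
    mask = (1 << lane_bits) - 1
    lane_count = register_bits // lane_bits
    return [(register >> (i * lane_bits)) & mask for i in range(lane_count)]


# A 64-bit register viewed as four separate 16-bit data elements.
reg64 = 0x0004_0003_0002_0001
print(unpack_lanes(reg64, 64, 16))  # [1, 2, 3, 4]

# The same 256-bit register can be viewed at different element widths.
for width in (64, 32, 16, 8):
    print(width, len(unpack_lanes(0, 256, width)))  # 4, 8, 16, 32 elements
```

The element counts printed by the loop match the quad-word, double word, word, and byte cases described above.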
By way of example, one type of SIMD instruction specifies a single vector operation to be performed on two source vector operands in a vertical fashion to generate a destination vector operand (also referred to as a result vector operand) of the same size, with the same number of data elements, and in the same data element order. The data elements in the source vector operands are referred to as source data elements, while the data elements in the destination vector operand are referred to as destination or result data elements. These source vector operands are of the same size and contain data elements of the same width, and thus they contain the same number of data elements. The source data elements in the same bit positions in the two source vector operands form pairs of data elements (also referred to as corresponding data elements; that is, the data elements in data element position 0 of each source operand correspond, the data elements in data element position 1 of each source operand correspond, and so on). The operation specified by that SIMD instruction is performed separately on each of these pairs of source data elements to generate a matching number of result data elements, and thus each pair of source data elements has a corresponding result data element. Since the operation is vertical and since the result vector operand is the same size, has the same number of data elements, and the result data elements are stored in the same data element order as the source vector operands, the result data elements are in the same bit positions of the result vector operand as their corresponding pair of source data elements in the source vector operands.
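The vertical operation described above can be sketched in Python with lists standing in for vector operands. The function name vertical_add, the choice of addition as the operation, and the wrap-around behavior on overflow are illustrative assumptions; an actual instruction may specify a different operation or overflow rule.

```python
def vertical_add(src1, src2, lane_bits=16):
    """Add corresponding data elements of two source vectors.

    Assumption: each result wraps around at the element width.
    """
    if len(src1) != len(src2):
        raise ValueError("source operands must have the same number of elements")
    mask = (1 << lane_bits) - 1
    # Element i of the result depends only on element i of each source,
    # so the result keeps the same element count and element order.
    return [(a + b) & mask for a, b in zip(src1, src2)]


print(vertical_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```

Each result data element lands in the same position as the pair of source data elements that produced it, which is the defining property of the vertical form.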
In addition to this exemplary type of SIMD instruction, there are a variety of other types of SIMD instructions (e.g., that have only one or have more than two source vector operands; that operate in a horizontal fashion; that generate a result vector operand that is of a different size, that have a different size of data elements, and/or that have a different data element order). It should be understood that the term destination vector operand (or destination operand) is defined as the direct result of performing the operation specified by an instruction, including the storage of that destination operand at a location (be it a register or a memory address specified by that instruction) so that it may be accessed as a source operand by another instruction (by specification of that same location by the other instruction).
Certain instruction set architectures allow multiple vector and scalar operations to complete in parallel and update the instruction set architecture register set. These instruction set architectures can be leveraged to implement compression/decompression instructions and algorithms such as instructions based on run-length encoding (RLE).
RLE is a form of lossless data compression in which sequences of data in a stream of data are compressed when those sequences contain one or more runs of consecutive identical data values. Rather than storing each data element in a run, a single element with the value is stored, followed by an element with the count of consecutive elements. This form of compression is most useful on data that contains many such runs.
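A minimal sketch of the value-then-count scheme just described, in Python. The function name rle_encode and the flat list output layout are assumptions for illustration; practical encoders differ in how they bound counts and lay out the stream.

```python
def rle_encode(data):
    """Encode each run of equal values as the value followed by its count."""
    out = []
    i = 0
    while i < len(data):
        value = data[i]
        count = 1
        # Extend the run while the next element repeats the same value.
        while i + count < len(data) and data[i + count] == value:
            count += 1
        out.extend([value, count])
        i += count
    return out


print(rle_encode([7, 7, 7, 2, 2, 9]))  # [7, 3, 2, 2, 9, 1]
```

Note that the lone value 9 grows from one element to two, which illustrates why the scheme pays off only on data with many runs.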
For example, zero-based compression/decompression takes advantage of frequently occurring zero elements in data streams. In some data types, particularly data associated with graphics processing, it is common to have a significant portion of data contain the value zero and, in turn, many runs of zeros. RLE based on zero is often referred to as zero-based compression, although other RLE schemes may be based on a value other than zero if compression would benefit from another RLE value being selected.
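One way zero-based compression could look is sketched below in Python. The function name zero_compress and the stream layout (non-zero values copied as-is, each run of zeros replaced by a zero followed by the run length) are assumptions chosen for illustration rather than a definitive format.

```python
def zero_compress(data):
    """Replace each run of zeros with the pair (0, run length).

    Assumption: non-zero elements are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] != 0:
            out.append(data[i])
            i += 1
            continue
        count = 0
        # Count the consecutive zeros starting at position i.
        while i < len(data) and data[i] == 0:
            count += 1
            i += 1
        out.extend([0, count])
    return out


print(zero_compress([5, 0, 0, 0, 7, 0]))  # [5, 0, 3, 7, 0, 1]
```

Unlike a general value/count encoding, only zeros carry a count here, so non-zero data is never expanded.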
FIG. 1 illustrates a prior art method for decompressing a zero-based stream. FIG. 1 starts with setting a variable, such as a variable named INDEX, to zero in operation 100. Flow moves from 100 to 105 where it is determined whether the stream still has valid input. If there is no more valid input in the source stream, then the flow is done in 190. If there is valid input, then flow moves to operation 110 where a source element at the position indicated by the index is read into a variable, such as a temporary variable named TEMP; TEMP=SOURCE[INDEX]. Flow moves to operation 115 where the index is incremented, INDEX=INDEX+1, and then to operation 120 where it is determined whether TEMP is equal to 0; TEMP==0. If TEMP is not equal to zero, then the flow moves to operation 122 where TEMP is output as being decompressed. If TEMP is equal to zero, then the flow moves to operation 125 where the contents of the source stream at the position indicated by the index, which was previously incremented, are read into a variable named COUNT; COUNT=SOURCE[INDEX]. The COUNT variable now holds the run length for the decompression of 0. Flow moves from 125 to operation 127 where INDEX is incremented, INDEX=INDEX+1, then to operation 130 where a second index, e.g., INDEX2, is set to 0. From here flow moves to a small loop that checks whether INDEX2 is less than COUNT at operation 135, outputs TEMP at operation 140, increments INDEX2 at operation 145, and then repeats from 135 until INDEX2 is no longer less than COUNT, meaning that the number of zeros indicated by the run length has been output. Once the correct number of zeros has been output, flow moves back to operation 105 and continues until there is no more valid input. This form of decompression is highly iterative and involves a data element by data element evaluation of the input stream to output non-zero values and convert a zero value/count pair into an output stream.
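The flow described for FIG. 1 can be sketched in Python as follows, with the operation numbers from the description noted in comments. The function name zero_decompress, the use of a list for the output, the return to the validity check after a non-zero value is output, and the assumption that every zero in the source is followed by a count are illustrative choices not stated in the description.

```python
def zero_decompress(source):
    """Element-by-element zero-based decompression, following FIG. 1.

    Assumption: the stream is well formed (every 0 is followed by a count).
    """
    output = []
    index = 0                          # operation 100
    while index < len(source):         # operation 105: valid input left?
        temp = source[index]           # operation 110
        index += 1                     # operation 115
        if temp != 0:                  # operation 120
            output.append(temp)        # operation 122
            continue                   # assumed return to operation 105
        count = source[index]          # operation 125: read run length
        index += 1                     # operation 127
        index2 = 0                     # operation 130
        while index2 < count:          # operation 135
            output.append(temp)        # operation 140: output a zero
            index2 += 1                # operation 145
    return output                      # operation 190: done


print(zero_decompress([5, 0, 3, 7, 0, 1]))  # [5, 0, 0, 0, 7, 0]
```

Every source element passes through the outer loop one at a time, and every output zero costs one pass through the inner loop, which makes the iterative character of the method visible.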