Vector Processing
FIG. 1 shows a high level diagram of a processing core 100 implemented with logic circuitry on a semiconductor chip. The processing core includes a pipeline 101. The pipeline consists of multiple stages each designed to perform a specific step in the multi-step process needed to fully execute a program code instruction. These typically include at least: 1) instruction fetch and decode; 2) data fetch; 3) execution; 4) write-back. The execution stage performs a specific operation identified by an instruction that was fetched and decoded in prior stage(s) (e.g., in step 1) above) upon data identified by the same instruction and fetched in another prior stage (e.g., step 2) above). The data that is operated upon is typically fetched from (general purpose) register storage space 102. New data that is created at the completion of the operation is also typically “written back” to register storage space (e.g., at stage 4) above).
The logic circuitry associated with the execution stage is typically composed of multiple “execution units” or “functional units” 103_1 to 103_N that are each designed to perform its own unique subset of operations (e.g., a first functional unit performs integer math operations, a second functional unit performs floating point instructions, a third functional unit performs load/store operations from/to cache/memory, etc.). The collection of all operations performed by all the functional units corresponds to the “instruction set” supported by the processing core 100.
Two types of processor architectures are widely recognized in the field of computer science: “scalar” and “vector”. A scalar processor is designed to execute instructions that perform operations on a single set of data, whereas a vector processor is designed to execute instructions that perform operations on multiple sets of data. FIGS. 2A and 2B present a comparative example that demonstrates the basic difference between a scalar processor and a vector processor.
FIG. 2A shows an example of a scalar AND instruction in which a single operand set, A and B, is ANDed together to produce a singular (or “scalar”) result C (i.e., A.AND.B=C). By contrast, FIG. 2B shows an example of a vector AND instruction in which two operand sets, A/B and D/E, are respectively ANDed together in parallel to simultaneously produce a vector result C, F (i.e., A.AND.B=C and D.AND.E=F). As a matter of terminology, a “vector” is a data element having multiple “elements”. For example, a vector V=Q, R, S, T, U has five different elements: Q, R, S, T and U. The “size” of the exemplary vector V is five (because it has five elements).
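The difference between the two AND instructions can be modeled in a few lines of Python. This is only an illustrative sketch: the function names scalar_and and vector_and are hypothetical, and the loop in the vector case models the result of the operation, not the parallel hardware that produces it.

```python
def scalar_and(a, b):
    """Scalar AND: a single operand set (A, B) produces a single result C."""
    return a & b


def vector_and(xs, ys):
    """Vector AND: element-wise AND over two vectors of the same size.

    A vector processor would compute the elements in parallel; this
    comprehension only reproduces the values that would result.
    """
    if len(xs) != len(ys):
        raise ValueError("input vectors must have the same size")
    return [x & y for x, y in zip(xs, ys)]


# Scalar case: A.AND.B = C
c = scalar_and(0b1100, 0b1010)                       # 0b1000

# Vector case: A.AND.B = C and D.AND.E = F in one instruction
cf = vector_and([0b1100, 0b1111], [0b1010, 0b0101])  # [0b1000, 0b0101]
```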
FIG. 1 also shows the presence of vector register space 107 that is different than general purpose register space 102. Specifically, general purpose register space 102 is nominally used to store scalar values. As such, when any of the execution units perform scalar operations they nominally use operands called from (and write results back to) general purpose register storage space 102. By contrast, when any of the execution units perform vector operations they nominally use operands called from (and write results back to) vector register space 107. Different regions of memory may likewise be allocated for the storage of scalar values and vector values.
Note also the presence of masking logic 104_1 to 104_N and 105_1 to 105_N at the respective inputs to and outputs from the functional units 103_1 to 103_N. In various implementations, for vector operations, only one of these layers is actually implemented, although that is not a strict requirement. (Although not depicted in FIG. 1, execution units that perform only scalar and not vector operations conceivably need not have any masking layer.) For any vector instruction that employs masking, input masking logic 104_1 to 104_N and/or output masking logic 105_1 to 105_N may be used to control which elements are effectively operated on for the vector instruction. Here, a mask vector is read from a mask register space 106 (e.g., along with input operand vectors read from vector register storage space 107) and is presented to at least one of the masking logic 104, 105 layers.
Over the course of executing vector program code each vector instruction need not require a full data word. For example, the input vectors for some instructions may only be 8 elements, the input vectors for other instructions may be 16 elements, the input vectors for other instructions may be 32 elements, etc. Masking layers 104/105 are therefore used to identify a set of elements of a full vector data word that apply for a particular instruction so as to effect different vector sizes across instructions. Typically, for each vector instruction, a specific mask pattern kept in mask register space 106 is called out by the instruction, fetched from mask register space and provided to either or both of the mask layers 104/105 to “enable” the correct set of elements for the particular vector operation.
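The effect of a mask on a vector operation can be sketched as follows. The sketch makes assumptions that go beyond the description above: the mask is modeled as a list of 0/1 flags, one per element; masking is applied at the output; and a disabled element keeps the destination's previous value. Other behaviors (for example, zeroing disabled elements) are equally possible. The names size_mask and masked_vector_and are hypothetical.

```python
def size_mask(active, width):
    """Build a mask that enables the first `active` of `width` elements."""
    if not 0 <= active <= width:
        raise ValueError("active element count must be between 0 and width")
    return [1] * active + [0] * (width - active)


def masked_vector_and(a, b, mask, dest):
    """Element-wise AND in which only mask-enabled elements are written.

    Assumption for illustration: disabled elements retain the previous
    contents of the destination vector.
    """
    if not (len(a) == len(b) == len(mask) == len(dest)):
        raise ValueError("all vectors must have the same size")
    return [(x & y) if m else d for x, y, m, d in zip(a, b, mask, dest)]


# A full data word of 8 elements, of which this instruction uses only 4.
mask = size_mask(4, 8)            # [1, 1, 1, 1, 0, 0, 0, 0]
result = masked_vector_and([0xFF] * 8, [0x0F] * 8, mask, [0] * 8)
# result == [15, 15, 15, 15, 0, 0, 0, 0]
```

Using a different mask pattern with the same full-width operation is what lets different instructions behave as if they had different vector sizes.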
LZ77 Compression Algorithm
Compression algorithms strive to reduce an amount of data without sacrificing the information within the data. One type of compression algorithm, referred to as the LZ77 algorithm, achieves compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement “each of the next length characters is equal to the characters exactly distance characters behind it in the uncompressed stream”. (The “distance” is sometimes called the “offset” instead.)
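The meaning of a length-distance pair is easiest to see from the decoder's side. The following is a minimal sketch, not any particular codec's format: the token representation (an int for a literal byte, a (length, distance) tuple for a match) and the name lz77_decode are assumptions made for illustration.

```python
def lz77_decode(tokens):
    """Expand a list of tokens into the original bytes.

    Hypothetical token format: an int is a literal byte; a tuple is a
    (length, distance) pair referring back into the decoded output.
    """
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            length, distance = tok
            if distance < 1 or distance > len(out):
                raise ValueError("distance points outside the decoded data")
            # Copy one byte at a time: each new byte equals the byte exactly
            # `distance` positions behind it. This also handles matches that
            # overlap the bytes they are producing (length > distance).
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(tok)
    return bytes(out)


# "abc" as literals, then "the next 3 bytes equal the bytes 3 positions back".
assert lz77_decode([97, 98, 99, (3, 3)]) == b"abcabc"
```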
To spot matches, the encoder keeps track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a “sliding window”, which is why LZ77 is sometimes called sliding window compression. The encoder keeps the most recent data within the sliding window to look for matches (and the decoder likewise will keep this data to interpret the matches the encoder refers to).
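A brute-force encoder that searches a sliding window might look like the sketch below. It is deliberately naive (every window position is tried at every step) and is not the method of any specific implementation. The name lz77_encode, the token format (int literal or (length, distance) tuple), and the min_match and max_match limits are assumptions chosen for illustration; the default window of 4096 bytes corresponds to the 4 kB example above.

```python
def lz77_encode(data, window_size=4096, min_match=3, max_match=258):
    """Greedy LZ77 encoder using an exhaustive search of the sliding window.

    Returns a list of tokens: an int for a literal byte, or a
    (length, distance) tuple for a match (hypothetical format).
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Only the most recent `window_size` bytes are eligible as a source.
        for j in range(max(0, i - window_size), i):
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_len, best_dist))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


# The repeat of "abc" is found when the window is large enough...
assert lz77_encode(b"abcXabc") == [97, 98, 99, 88, (3, 4)]
# ...but not when the earlier copy has already slid out of the window.
assert lz77_encode(b"abcXabc", window_size=2) == [97, 98, 99, 88, 97, 98, 99]
```

The second call shows the trade-off behind the window size: a smaller window costs less to search and store but can no longer see older data.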
FIG. 3 shows a simple example of the basic process of an LZ77 encoding scheme. As observed in FIG. 3, the bit patterns of a preceding (earlier or older) portion 301 of a bit stream 300 are compared against a current portion 302 of the bit stream. If a sequence of bits is found in the current portion 302 that matches a sequence of bits in the preceding portion 301, the sequence of bits in the current portion 302 is replaced with a reference to the same sequence of bits in the earlier portion 301. For example, the bit sequence in the current portion 302 would be replaced with a reference to bit sequence 303 in the earlier portion 301. The reference that is inserted for bit sequence 302 would identify the length of bit sequence 302 (which also is the same as the length of bit sequence 303) and the location of bit sequence 303. Thus, upon decoding the compressed stream, when the decoder reaches the reference, it simply “refers” back to bit sequence 303 to reproduce the correct bit sequence for portion 302 of the decoded stream.
A more complicated but more effective version of the encoding process will compute a hash of the leading bytes of the current portion (e.g., its 3 leading bytes, referred to as a “prefix”) and use that hash as an index into a data structure that holds bit strings of the earlier portion (or the locations of such bit strings) that hashed to the same value.
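The hashed-prefix lookup can be sketched as follows. Everything specific here is an assumption made for illustration: the toy hash function, the 12-bit table size, the use of a dictionary of position lists as the data structure, and the names prefix_hash, candidate_positions and longest_match. Because different prefixes can hash to the same value, each candidate is still compared byte by byte before it is accepted.

```python
from collections import defaultdict

PREFIX_LEN = 3  # number of leading bytes hashed at each position


def prefix_hash(data, pos, table_bits=12):
    """Toy hash of the PREFIX_LEN bytes starting at pos (illustrative only)."""
    h = 0
    for b in data[pos:pos + PREFIX_LEN]:
        h = (h * 31 + b) & ((1 << table_bits) - 1)
    return h


def candidate_positions(data, window_size=4096):
    """Yield (pos, candidates) for each position that has a full prefix.

    `candidates` lists earlier positions, still inside the window, whose
    prefix hashed to the same value as the prefix at `pos`.
    """
    table = defaultdict(list)
    for pos in range(len(data) - PREFIX_LEN + 1):
        h = prefix_hash(data, pos)
        yield pos, [p for p in table[h] if pos - p <= window_size]
        table[h].append(pos)


def longest_match(data, pos, candidates):
    """Verify candidates and return the best (length, distance), or (0, 0)."""
    best_len, best_dist = 0, 0
    for p in candidates:
        length = 0
        while pos + length < len(data) and data[p + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - p
    return best_len, best_dist


data = b"abcXabc"
cands = dict(candidate_positions(data))
assert 0 in cands[4]                                  # "abc" at 0 and 4 share a hash
assert longest_match(data, 4, cands[4]) == (3, 4)
```

Compared with scanning the whole window, the encoder only examines positions that are likely to start a match, at the cost of maintaining the table.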
The LZ77 compression algorithm is used as part of the DEFLATE compression algorithm, which is used by the gzip, zlib, PKZip and WinZip compression schemes.