Signal processing, e.g. for audio and video applications, usually requires that the same instruction is repeated for a plurality of data elements of a vector. An example of a processor suited to this, here an SIMD processor, is shown in FIG. 1. The SIMD processor shown comprises a vector memory M, a register file RF and a plurality of functional units FU1, . . . , FUn.
The SIMD processor further has an instruction memory PM for storage of VLIW instructions, a VLIW controller CTRL coupled to the instruction memory and an addressing unit ACU+ for the memory.
The memory M has a storage capacity for a plurality of vectors, each comprising P words of m bits, for example P=8 and m=16. The functional units FU1, . . . , FUn support multiple operations in parallel on data stored in the register file, such as loading and storing of data, multiplication, and multiply-accumulate operations.
A load vector operation moves a vector from the SIMD memory M to the register file RF. A store vector operation moves a vector from the register file RF to the SIMD memory M.
In the SIMD processor the combination of register files plus functional units can be considered as P parallel data lanes that communicate with a memory having P parallel memory banks. Each of the P lanes is coupled to a respective memory bank. For so-called inter-vector operations, the elementary operation is applied P times, once locally in each lane. So-called intra-vector operations work on the P elements of a single vector, and hence work across the P lanes.
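The distinction between the two kinds of operation can be sketched in a few lines of Python. This is an illustrative model only: the function names and the choice of addition and summation as the elementary operations are assumptions, not taken from FIG. 1.

```python
P = 8  # number of lanes, as in the P=8 example of the text

def inter_vector_add(a, b):
    """Inter-vector operation: lane i only touches a[i] and b[i].

    No data crosses a lane boundary.
    """
    if len(a) != P or len(b) != P:
        raise ValueError("both operands must be vectors of P elements")
    return [a[i] + b[i] for i in range(P)]

def intra_vector_sum(a):
    """Intra-vector operation: combines all P elements of one vector.

    The result depends on every lane, so data must cross lane boundaries.
    """
    if len(a) != P:
        raise ValueError("operand must be a vector of P elements")
    return sum(a)
```

In this model the first function could be executed by P independent lanes, whereas the second needs a path between lanes.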
Word sizes of data elements used by signal-processing applications (process data elements) often poorly match the word sizes of the memory locations of Digital Signal Processors (DSPs). For example, a usual memory word size is 16 bits. However, an internal delay line of a DVB-T receiver has a capacity of 48 k samples of 10 bits (48 k = 3 frames of 8 k complex samples of 2 words of 10 bits each). Mapping this onto the memory of a programmable DSP implies a mapping of 10-bit data elements to 16-bit storage locations. The overall memory requirement for this delay line thus increases from 480 kbit to 768 kbit, an increase of 60%.
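The figures for the delay line can be checked with a short calculation. The variable names are illustrative, and "k" is read here as 1000, which is the reading that reproduces the 480 kbit and 768 kbit figures.

```python
SAMPLES = 48_000      # "48 k" words, with k = 1000 (assumption)
ELEMENT_BITS = 10     # size of a process data element
WORD_BITS = 16        # size of a memory storage location

packed_bits = SAMPLES * ELEMENT_BITS   # ideal, fully packed storage
padded_bits = SAMPLES * WORD_BITS      # one 10-bit element per 16-bit word
overhead = (padded_bits - packed_bits) / packed_bits

print(packed_bits, padded_bits, overhead)
```

The overhead depends only on the ratio 16/10, so it stays at 60% whatever the number of samples.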
Ideally, this excess memory requirement is to be avoided by compression of data, so that the data elements to be stored in memory (storage data elements) have a size equal to that of the memory storage locations.
Compressed memory formats themselves are not new. Their implementation, however, poses special challenges in the case of SIMD machines. By way of example, this is illustrated with reference to an SIMD processor that processes vectors of P=8 words with a size of 16 bits. The so-called vector memory is a physical memory of lines of P×16 bits.
A conceptually simple solution to carry out a load operation of P=8 data elements of 10 bits from 5 compressed data elements (storage data elements) of 16 bits stored in the memory would be to introduce a multiplexer to spread an aligned vector across the P lanes. Likewise, a store operation of P=8 data elements of 10 bits as 5 compressed data elements of 16 bits in the memory could be realized with such a multiplexer. This, however, would imply up to P sets of, say, 10 wires crossing the lane boundaries. Note that when compression for multiple data element sizes has to be supported (e.g. both 10 and 12 bits), such a multiplexer is needed for each size, for both load and store operations. Even then, only the case in which both the compressed and the uncompressed vectors are aligned would be supported; other cases would require additional rotate operations. As the VLSI layouts of SIMD machines are already wire limited, this would amount to a considerable area and power penalty.
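The mapping between 8 elements of 10 bits and 5 words of 16 bits, and the reason it crosses lane boundaries, can be sketched as follows. This is a software model under stated assumptions, not the hardware multiplexer itself: the text does not specify a bit layout, so the sketch assumes that element 0 occupies the least significant bits and that word j is held in memory bank j. All function names are hypothetical.

```python
def pack(elements, elem_bits=10, word_bits=16):
    """Pack elements of elem_bits each into words of word_bits each."""
    total_bits = len(elements) * elem_bits
    if total_bits % word_bits:
        raise ValueError("elements do not fill a whole number of words")
    acc = 0
    for i, e in enumerate(elements):
        if not 0 <= e < (1 << elem_bits):
            raise ValueError("element does not fit in elem_bits")
        acc |= e << (i * elem_bits)          # element 0 in the lowest bits
    mask = (1 << word_bits) - 1
    return [(acc >> (j * word_bits)) & mask
            for j in range(total_bits // word_bits)]

def unpack(words, elem_bits=10, word_bits=16):
    """Inverse of pack: recover the elements from the packed words."""
    acc = 0
    for j, w in enumerate(words):
        acc |= w << (j * word_bits)
    mask = (1 << elem_bits) - 1
    count = (len(words) * word_bits) // elem_bits
    return [(acc >> (i * elem_bits)) & mask for i in range(count)]

def source_words(lane, elem_bits=10, word_bits=16):
    """Indices of the packed words that hold the bits of a given lane's element."""
    first_bit = lane * elem_bits
    last_bit = first_bit + elem_bits - 1
    return list(range(first_bit // word_bits, last_bit // word_bits + 1))

data = [1023, 0, 513, 7, 256, 1, 2, 1000]
words = pack(data)
```

Under these assumptions, `source_words(1)` is `[0, 1]` and `source_words(7)` is `[4]`: lane 1 needs bits from two different words, and lane 7 needs a word that would sit in bank 4. Only lane 0 reads exclusively from its own bank, which is why a direct hardware realization needs wiring across lane boundaries.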