Modern Digital Signal Processors (DSPs) aim to combine high-performance execution with low power consumption. In particular, they exploit the natural parallelism present in signal processing applications by simultaneously executing the same instruction on multiple data elements. This Single Instruction Multiple Data (SIMD) model usually requires that operands be packed in advance in “vector” registers.
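By way of illustration only, the SIMD model described above can be sketched in portable C. The sketch assumes a vector width of four 16-bit lanes and models a vector register as a plain struct; the names `vec4_i16`, `vec_load`, `vec_add`, `vec_store` and `add_arrays` are hypothetical and do not correspond to any particular DSP instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed vector width; actual DSPs vary */

/* Hypothetical 4-lane "vector register", modeled as a plain struct. */
typedef struct { int16_t lane[LANES]; } vec4_i16;

/* Pack LANES contiguous scalars into a vector register. */
vec4_i16 vec_load(const int16_t *src) {
    vec4_i16 v;
    for (size_t i = 0; i < LANES; ++i) v.lane[i] = src[i];
    return v;
}

/* One conceptual SIMD instruction: the same add applied to every lane. */
vec4_i16 vec_add(vec4_i16 a, vec4_i16 b) {
    vec4_i16 r;
    for (size_t i = 0; i < LANES; ++i)
        r.lane[i] = (int16_t)(a.lane[i] + b.lane[i]);
    return r;
}

/* Unpack a vector register back to contiguous memory. */
void vec_store(int16_t *dst, vec4_i16 v) {
    for (size_t i = 0; i < LANES; ++i) dst[i] = v.lane[i];
}

/* Element-wise sum of two arrays, LANES elements per step.
 * Assumption: n is a multiple of LANES. */
void add_arrays(const int16_t *a, const int16_t *b, int16_t *out, size_t n) {
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES)
        vec_store(out + i, vec_add(vec_load(a + i), vec_load(b + i)));
}
```

On real hardware each of the three helper functions would typically map to a single instruction; the inner loops here only emulate what the lanes do in parallel.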
Programmers and optimizing compilers use vectorization techniques to exploit the SIMD capabilities of DSP architectures. Such techniques reveal temporal and spatial locality in scalar source code and transform groups of scalar instructions into vector instructions. Applying vectorization techniques to DSP architectures is often very complicated, as these architectures typically have scarce resources with tight interdependencies between them. Vectorization is often further impeded by the memory architecture, which typically provides access to contiguous memory items only and may impose additional memory alignment restrictions, whereas DSP computations may require access to data elements in an order that is neither contiguous nor memory-aligned. Moving data elements into and out of vector registers is usually done with special gather, scatter or permute instructions, which incur additional performance penalties and increase complexity.
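The non-contiguous access problem can be illustrated with a minimal C sketch. It assumes, purely as an example, that complex samples are stored interleaved (real, imaginary, real, imaginary, ...) and that only the real parts are needed, so the wanted elements lie two positions apart in memory. The names `load_contiguous`, `load_strided` and `sum_real_parts`, and the vector width of four lanes, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_LANES 4  /* assumed vector width */

/* Contiguous load: on hardware that fetches VEC_LANES adjacent
 * elements at once, this would be a single memory access. */
void load_contiguous(int16_t reg[VEC_LANES], const int16_t *mem) {
    for (size_t i = 0; i < VEC_LANES; ++i) reg[i] = mem[i];
}

/* Strided gather: the wanted elements are `stride` positions apart,
 * so a contiguous-only memory system needs extra loads plus a
 * gather or permute step to pack them into one register. */
void load_strided(int16_t reg[VEC_LANES], const int16_t *mem, size_t stride) {
    assert(stride > 0);
    for (size_t i = 0; i < VEC_LANES; ++i) reg[i] = mem[i * stride];
}

/* Sum the real parts of n_complex interleaved complex samples.
 * Assumption: n_complex is a multiple of VEC_LANES. */
int32_t sum_real_parts(const int16_t *interleaved, size_t n_complex) {
    assert(n_complex % VEC_LANES == 0);
    int32_t acc[VEC_LANES] = {0};
    int16_t reg[VEC_LANES];
    for (size_t i = 0; i < n_complex; i += VEC_LANES) {
        load_strided(reg, interleaved + 2 * i, 2);  /* real parts only */
        for (size_t lane = 0; lane < VEC_LANES; ++lane)
            acc[lane] += reg[lane];                 /* lane-wise add */
    }
    return acc[0] + acc[1] + acc[2] + acc[3];       /* horizontal reduce */
}
```

In this sketch the arithmetic is one lane-wise add per step, while `load_strided` stands in for the gather or permute work that the contiguous case in `load_contiguous` avoids.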
Vectorization techniques that are adapted for use with DSP architectures and that overcome problems associated with conventional vectorization techniques would therefore be advantageous.