For some digital signal processing algorithms, input data has to be aggregated or organized before computation can start. For example, when a scattered set of data elements is selected from a linear input vector, the selected data elements have to be aggregated into a new linear output vector of subsequent data elements before computation can start. This type of data aggregation or organization is required, for example, in the equalization processing of data subcarriers in 3GPP Long Term Evolution (LTE) communication systems. LTE is a wireless communication standard using Orthogonal Frequency Division Multiplexing (OFDM). An OFDM based communication system efficiently demodulates (or modulates) a certain number of radio frequency subcarriers by applying an (inverse) Fourier Transform. For LTE there are 12000 or 14000 OFDM symbols per second, depending on the mode. For a 20 MHz bandwidth carrier, each of the 1200 subcarriers per OFDM symbol carries a modulation symbol. The modulation symbol may for example be a user data symbol or a pilot symbol used for channel estimation. The mapping of the various types of information channels and signals onto these subcarriers can be quite complex and scattered. Before equalization processing of the data channel, the data symbols in some receiver systems are extracted from the input OFDM symbol and concatenated into a new vector. An example receiver system comprises a vector processor onto which the equalization processing is mapped.
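The extraction step described above can be illustrated with a minimal sketch. The 12-subcarrier toy symbol, the every-third-subcarrier pilot grid, and the names gather_selected, ofdm_symbol and is_data are assumptions chosen for illustration only; they do not reflect the actual LTE resource mapping.

```python
def gather_selected(symbols, selected):
    """Copy the elements flagged in `selected` into a new contiguous list."""
    return [s for s, keep in zip(symbols, selected) if keep]

# Hypothetical toy OFDM symbol: 12 subcarriers, complex modulation symbols.
ofdm_symbol = [complex(k, -k) for k in range(12)]

# Assumed pattern: every third subcarrier carries a pilot, the rest carry data.
is_data = [k % 3 != 0 for k in range(12)]

# The scattered data symbols end up in one dense vector, ready for equalization.
data_vector = gather_selected(ofdm_symbol, is_data)
```

The output vector holds only the data symbols, in their original order, with no gaps where the pilots were.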
Direct Memory Access (DMA) is often used to copy data elements in an embedded system. Typically, a DMA unit transports rather coarse grained blocks of data elements. The expression “rather coarse grained blocks of data” means blocks of data having a size in the range from tens to millions of bytes. The smaller the data blocks are, i.e. the shorter the data bursts are, the larger the overhead of controlling and configuring the data transport becomes. DMA units often support a scatter gather operation, i.e. a linked list of DMA configurations that are executed in sequence. Selection of resource elements in LTE, i.e. selection of complex words of e.g. 32 bits, would be too fine grained for a DMA unit.
Classical Digital Signal Processors (DSPs) operate at scalar sample granularity. Therefore, they may be very selective in loading from and storing to a memory. Further, selection of data elements may be done while performing computational processing. Thus, there is no need to copy data elements into a temporary data vector. However, it should be noted that the selectivity of the classical DSPs, i.e. the scalar sample granularity, may add complex control structures to the signal processing software.
Further, a device loading and storing one selected scalar element at a time requires quite some time to aggregate all selected data elements.
As an example, assume that an LTE category 5 receiver is available. The LTE category 5 receiver supports a downlink data rate of 300 Mbit/s in a 20 MHz frequency band with a 4×4 Multiple Input Multiple Output (MIMO) receiver configuration, and therefore has 4 receive antennas. Further, assume that the data element extraction is mapped onto a scalar DSP with one data memory interface. Such a DSP would then be loaded with up to 134 Million Instructions Per Second (MIPS) for this task. The processor load is calculated as 4 receive antennas*1200 data symbols/OFDM symbol/antenna*14000 OFDM symbols/s*2 pipelined processor cycles/data element selection for the load and store operations.
In addition to the selection of data elements from the 4 receive antenna streams, data elements also have to be selected from 16 channel estimate streams. The selection pattern has the same selection grid as the data element selection of the received symbol streams. This selection would result in a further processor load of up to 538 MIPS, i.e. about 670 MIPS in total, which is a significant overhead and adds considerable latency. The processor load for the channel estimates is calculated as 16 channel estimate streams*1200 estimates/OFDM symbol/channel estimate stream*14000 OFDM symbols/s*2 pipelined processor cycles/data element selection for the load and store operations.
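The two load products stated above can be evaluated directly. The following sketch only multiplies the factors given in the text; the function name selection_load_mips is a hypothetical helper, and one instruction per pipelined processor cycle is assumed.

```python
def selection_load_mips(streams, elements_per_symbol, symbols_per_second,
                        cycles_per_element):
    """Processor load in MIPS for element-by-element selection.

    Assumes one instruction per pipelined cycle, as in the text's estimate.
    """
    instructions_per_second = (streams * elements_per_symbol
                               * symbols_per_second * cycles_per_element)
    return instructions_per_second / 1e6

# 4 receive antenna streams, 1200 symbols per OFDM symbol, 14000 symbols/s,
# 2 cycles per selected element (one load, one store).
antenna_load = selection_load_mips(4, 1200, 14000, 2)

# 16 channel estimate streams on the same selection grid.
estimate_load = selection_load_mips(16, 1200, 14000, 2)

total_load = antenna_load + estimate_load
```

With these factors the antenna streams account for 134.4 MIPS and the channel estimate streams for 537.6 MIPS, giving 672 MIPS for both tasks together.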
Vector DSPs perform operations on vectors of data elements. An important class of operation is the so-called Single Instruction Multiple Data (SIMD) operation, which performs the same arithmetic operation on a plurality of input vectors by means of an array of identical parallel processing units. The SIMD operation requires that the algorithm is vectorizable, meaning that the input data elements can be processed as vectors of adjacent data elements. Owing to the available hardware parallelism, i.e. the parallel processing units, the vector DSP can perform SIMD operations, which provide a higher computational throughput than that of classical, non-vector DSPs.
Typically, the vector DSP has a vector memory. Thus, the vector DSP may access a vector of subsequent data elements in one instruction. Further, the vector DSP often has a vector register file to store intermediate results in, whereby communication to the vector memory may be reduced. Furthermore, the vector DSP requires the high memory bandwidth that the vector memory can provide, so that the load and/or store throughput can keep up with the computational throughput.
Vector DSPs may have some support for data element reorganization within the vector. For example, a shuffle unit may be available, which is useful for selecting a specific input data element for each output data element. The shuffle operation can be based on a configuration pattern with absolute indices or relative distances.
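The two kinds of configuration pattern can be sketched as follows. The function names and the wrap-around (modulo P) handling of relative distances are assumptions made for illustration; an actual shuffle unit may define its pattern encoding differently.

```python
def shuffle_absolute(vec, pattern):
    """Output element i is the input element at absolute index pattern[i]."""
    return [vec[idx] for idx in pattern]

def shuffle_relative(vec, distances):
    """Output element i is the input element at distance distances[i] from i.

    Wrap-around at the vector boundary is an assumption of this sketch.
    """
    p = len(vec)
    return [vec[(i + d) % p] for i, d in enumerate(distances)]

vec = ["a", "b", "c", "d"]

# Both patterns below describe the same reordering of the four elements.
by_index = shuffle_absolute(vec, [3, 0, 2, 1])
by_distance = shuffle_relative(vec, [3, -1, 0, -2])
```

Both calls produce the order d, a, c, b; only the encoding of the pattern differs.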
A vector processor or similar device capable of loading and/or storing vectors of P data elements from a vector memory has a higher memory bandwidth than a device with a scalar load interface and/or a scalar store interface. For data element selection at the same throughput as the memory system, the vector processor must be capable of shuffling input data elements and of aggregating the selected data elements into one or more output vectors. A vector processor with a shuffle unit, a vector register file, and support for mask based element selection can perform this task. However, as the selection patterns can be dynamic, for example as in the LTE element extraction case, these shuffle and mask patterns have to be dynamically selected or calculated.
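The combination of mask based selection, shuffling and aggregation can be modelled in software as follows. This is a behavioural sketch under stated assumptions, not a description of any particular processor: the vector width P = 4, the names compaction_pattern and select_stream, and the list standing in for a vector register are all hypothetical.

```python
P = 4  # assumed vector width for this sketch

def compaction_pattern(mask):
    """Shuffle pattern (absolute indices) that packs the selected elements."""
    return [i for i, keep in enumerate(mask) if keep]

def select_stream(input_vectors, masks):
    """Aggregate the masked elements of each input vector into full vectors."""
    outputs = []
    pending = []  # models a register holding a partly filled output vector
    for vec, mask in zip(input_vectors, masks):
        # The pattern depends on the mask, so it is derived per input vector.
        pattern = compaction_pattern(mask)
        pending.extend(vec[i] for i in pattern)
        # Emit an output vector as soon as P selected elements are available.
        while len(pending) >= P:
            outputs.append(pending[:P])
            pending = pending[P:]
    return outputs, pending

inputs = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
masks = [[0, 1, 1, 0], [1, 1, 0, 1], [0, 1, 1, 1]]
full_vectors, leftover = select_stream(inputs, masks)
```

Each input vector contributes a different number of elements at a different fill offset, which is why the pattern has to be derived per vector when the selection grid is dynamic.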
On-the-fly calculation of shuffle patterns and mask patterns adds computational overhead on top of the signal processing. Operations on Boolean data types and shuffle pattern data types have to be supported by the processor.
Not all processors support calculation of patterns. In some vector processors shuffle patterns have to be defined at compile time. Having all possible patterns hard coded is not possible. Patterns can be pre-stored and looked-up. Note that the number of total possible patterns in the general case is a function of the number of data elements in the vector. The number of total possible patterns can be calculated as P*2^P, i.e. as the number of rotation offset positions (equal to P) times the number of possible mask vectors (equal to 2^P), wherein P is the number of data elements in the vector. The number of patterns and therefore the required storage memory explodes with increasing vector size. The pattern table therefore has to be use-case dependent to make it more cost-efficient. This makes the pattern table and the lookup process more complex.
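The growth of the pattern table can be made concrete by taking the count as P rotation offsets times 2 to the power P mask vectors. The function name pattern_count and the chosen vector sizes are illustrative assumptions.

```python
def pattern_count(p):
    """Number of patterns: p rotation offsets times 2**p possible masks."""
    return p * 2 ** p

# Pattern counts for a few hypothetical vector sizes.
counts = {p: pattern_count(p) for p in (4, 8, 16, 32)}
```

The count rises from 64 patterns for P = 4 to 2048 for P = 8, 1048576 for P = 16 and 137438953472 for P = 32, so a complete table quickly becomes impractical.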
Calculating shuffle and mask patterns, or loading them from tables, gives a run time overhead, which may to some extent be shared among multiple streams. This overhead cannot be software pipelined with computational operations, for example of the equalization task.
WO 2006/033056 describes a microprocessor device and a method for shuffle operations. The microprocessor device comprises a vector processor architecture with a functional vector processor unit comprising first memory means for storing plural index vectors and processing means. The functional vector processor unit is arranged to receive a processing instruction and at least one input vector to be processed. The first memory means is arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided. The functional vector processor unit further comprises pre-processing means arranged to receive a parameter and to process the elements of the one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector.
Adding a shuffle circuit to a vector processor in order to support data element selection is costly in terms of the area and power required. Further, shuffle circuits are expensive circuits, since the number of two-input multiplexers comprised in the shuffle circuit scales quadratically with the number P of data elements as (P−1)*P. Note that the area of a full shuffle circuit however is dominated by wiring and not so much by standard cells in current Complementary Metal-Oxide-Semiconductor (CMOS) technologies.
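The multiplexer count can be checked with a small model. The sketch assumes that each of the P outputs is driven by its own P-to-1 selector built as a tree of two-input multiplexers; this structure is an assumption made to reproduce the stated (P−1)*P figure, and the function names are hypothetical.

```python
def mux_tree_size(inputs):
    """Two-input multiplexers needed to select one value out of `inputs`."""
    count = 0
    while inputs > 1:
        pairs = inputs // 2   # multiplexers used in this tree level
        count += pairs
        inputs -= pairs       # each multiplexer reduces the candidates by one
    return count

def full_shuffle_mux_count(p):
    """One p-to-1 selector tree per output element (assumed structure)."""
    return p * mux_tree_size(p)

mux_counts = {p: full_shuffle_mux_count(p) for p in (4, 8, 16, 32)}
```

Under this assumption the count is 12, 56, 240 and 992 multiplexers for P = 4, 8, 16 and 32, matching (P−1)*P in each case.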