Digital signal processing may be implemented by way of a programmable digital signal processor (DSP) adapted to receive program instructions (e.g. chosen from a number of predefined instructions and possibly comprising one or more arguments) and execute operations accordingly. Digital signal processing may, for example, be applied on digital baseband signals (i.e. digital baseband signal processing). Furthermore, digital signal processing may be practiced in a variety of electronic appliances, e.g. wireless communication modems.
In a typical digital baseband signal processing application, streams of samples are processed according to the instructions fed to the DSP. The samples are typically complex numbers, each represented by a real part, I, having a value xI and an imaginary part, Q, having a value xQ. Commonly, a complex number, x, may be expressed as x = xI + jxQ; x ∈ {Z}.
Execution of a complex-complex multiplication instruction, z = x*y (i.e. an instruction to multiply a complex number, x, with another complex number, y, where x, y, z ∈ {Z}) on a DSP involves four real valued multiplications and two real valued additions: zI = xIyI − xQyQ, and zQ = xIyQ + xQyI, where z = zI + jzQ; z ∈ {Z}. Such a multiplication will be termed a complex multiplication in the following. Thus, to perform a complex multiplication with a latency of a single clock cycle, a calculation circuit with four parallel multipliers has to be deployed in the DSP.
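As an illustration only, the operation count above can be sketched in C as follows. The type name cplx_t, the function name cplx_mul, and the use of 32-bit integer samples are assumptions made for this sketch; they are not taken from any particular DSP instruction set.

```c
#include <stdint.h>

/* Hypothetical complex sample type: i holds the real part (I),
 * q holds the imaginary part (Q). Integer samples are an assumption. */
typedef struct {
    int32_t i;
    int32_t q;
} cplx_t;

/* z = x * y computed with four real multiplications and two real
 * additions (one of which is a subtraction). In hardware the four
 * products could be formed by four parallel multipliers. */
static cplx_t cplx_mul(cplx_t x, cplx_t y)
{
    int32_t m0 = x.i * y.i;  /* xI * yI */
    int32_t m1 = x.q * y.q;  /* xQ * yQ */
    int32_t m2 = x.i * y.q;  /* xI * yQ */
    int32_t m3 = x.q * y.i;  /* xQ * yI */

    cplx_t z;
    z.i = m0 - m1;           /* zI = xIyI - xQyQ */
    z.q = m2 + m3;           /* zQ = xIyQ + xQyI */
    return z;
}
```

For example, multiplying 3 + j2 by 1 + j4 with this sketch gives zI = 3 − 8 = −5 and zQ = 12 + 2 = 14.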
One common operation that is performed on a stream of complex samples is scaling each sample value, x, with a real number, c: w = cx; x, w ∈ {Z}. Such a multiplication will be termed a real-complex multiplication in the following. In a DSP context, a real-complex multiplication may be performed by using a complex multiplication instruction where the imaginary part of the scaling argument is set to zero, i.e. ccompl = cI + jcQ; ccompl ∈ {Z}, where cI = c, and cQ = 0. Alternatively, a real-complex multiplication may be performed by using a real-real multiplication instruction (multiplication between two real numbers): wI = cxI; c, xI ∈ {R}, and wQ = cxQ; c, xQ ∈ {R}, where w = wI + jwQ; w ∈ {Z}. The first approach to real-complex multiplication requires deploying four parallel multipliers to perform the multiplication with a latency of a single clock cycle, while the second approach requires deploying two parallel multipliers. The second approach is typically preferable since it uses a lower number of multipliers and therefore consumes less power. Sometimes, the result is also ready earlier than for the first approach (lower latency), mainly due to the extra addition step required in the first approach. In some implementations, the first and second approaches both finish within a clock cycle and the second approach tolerates a higher clock frequency. In some implementations, the first approach requires two clock cycles while the second approach requires one clock cycle.
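The two approaches to real-complex multiplication described above may be compared with a minimal C sketch. The names cplx_t, scale_via_complex_mul and scale_via_real_mul, as well as the integer sample type, are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical complex sample type (i: real part, q: imaginary part). */
typedef struct {
    int32_t i;
    int32_t q;
} cplx_t;

/* First approach: reuse a complex multiplication with the scaling
 * argument ccompl = cI + j*cQ, where cI = c and cQ = 0.
 * Costs four real multiplications and two real additions. */
static cplx_t scale_via_complex_mul(int32_t c, cplx_t x)
{
    const int32_t cI = c;
    const int32_t cQ = 0;

    cplx_t w;
    w.i = cI * x.i - cQ * x.q;
    w.q = cI * x.q + cQ * x.i;
    return w;
}

/* Second approach: two real-real multiplications, no additions. */
static cplx_t scale_via_real_mul(int32_t c, cplx_t x)
{
    cplx_t w;
    w.i = c * x.i;  /* wI = c * xI */
    w.q = c * x.q;  /* wQ = c * xQ */
    return w;
}
```

Both functions return the same result for any input; the difference lies only in the number of multiplications and additions spent, which is the basis for the hardware and power argument made above.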
Real-complex multiplication (e.g. a scaling operation) may use vectors (e.g. scaling vectors or constant vectors) calculated using, for example, reciprocal, division, square-root, or reciprocal-square-root functions. These types of instructions are often critical in, for example, communication applications and may be an important contributor to determining the number of cycles and/or the amount of power (or other resources) a DSP has to spend to complete a specific processing task.
Vector digital signal processors (also termed digital signal vector processors herein) perform operations on vectors of data instead of on individual samples. A DSP processing individual samples may be seen as a special case of digital signal vector processor. An important class of instructions for digital signal vector processors is single instruction multiple data (SIMD) instructions.
Generally, a SIMD instruction is a vector instruction that performs the same operation (e.g. an arithmetic operation) on each element of an input vector. In a typical digital signal vector processor implementation, the operation is performed by using an array of P identical parallel processing units when processing a vector with P elements.
For example, a real-real vector multiplication (P element-wise multiplications between two vectors, vrA, vrB, of length P) operation performed on a digital signal vector processor is typically based on P parallel multiplier hardware circuits:
for (int p=0; p<P; p++){
  vrD[p] = vrA[p] * vrB[p]; //vrX[p]: element p of vector register X
}
Digital signal vector processors may also support instructions that operate on complex data types. For such applications, a pair of adjacent vector elements (one even and one odd element) is typically interpreted as a complex value (the even element representing the real part value, and the odd element representing the imaginary part value). Hence, in such applications a real vector of length P may be interpreted as a complex vector of length P/2 (i.e. having P/2 complex elements). A complex-complex vector multiplication operation performed on a digital signal vector processor is typically based on P/2 parallel complex-complex multiplications:
for (int p=0; p<P; p=p+2){
  vrD[p] = vrA[p] * vrB[p] - vrA[p+1] * vrB[p+1]; //re part
  vrD[p+1] = vrA[p] * vrB[p+1] + vrA[p+1] * vrB[p]; //im part
}
Based on available hardware parallelism, a digital signal vector processor can typically provide a higher computational throughput than a DSP that operates on a sample-by-sample basis.
It is possible to perform a real-complex multiplication on a vector processor, e.g. scaling of a complex vector X by respective real values cp, p = 0, ..., (P/2−1), using a real-real vector multiplication instruction (compare with the example above of using two instructions for real-real multiplication to perform a real-complex scalar multiplication). This may be accomplished if the real values, cp, are first organized in a real vector C of length P, where the real values are duplicated into respective adjacent even and odd elements. The duplication may be achieved by, for example, using a vector shuffle instruction.
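The duplicate-then-multiply scheme described above may be sketched in plain C as follows. The sketch emulates the vector instructions with loops. The vector length P = 8, the integer element type, and the function names duplicate_scales, vmul_real and scale_complex_vector are assumptions made for illustration, and the duplication loop merely stands in for whatever shuffle instruction a given processor provides.

```c
#include <stddef.h>
#include <stdint.h>

#define P 8  /* assumed number of real elements per vector register */

/* Emulates a vector shuffle: each real scale value c[k] is copied into
 * the adjacent even and odd elements vrC[2k] and vrC[2k+1]. */
static void duplicate_scales(const int32_t c[P / 2], int32_t vrC[P])
{
    for (size_t k = 0; k < P / 2; k++) {
        vrC[2 * k]     = c[k];  /* lines up with the real part      */
        vrC[2 * k + 1] = c[k];  /* lines up with the imaginary part */
    }
}

/* Emulates a real-real vector multiplication (P element-wise products). */
static void vmul_real(const int32_t vrA[P], const int32_t vrB[P],
                      int32_t vrD[P])
{
    for (size_t p = 0; p < P; p++) {
        vrD[p] = vrA[p] * vrB[p];
    }
}

/* Scales the P/2 complex elements of vrX (interleaved re, im) by the
 * P/2 real values in c, using only the two operations above. */
static void scale_complex_vector(const int32_t vrX[P],
                                 const int32_t c[P / 2], int32_t vrD[P])
{
    int32_t vrC[P];
    duplicate_scales(c, vrC);
    vmul_real(vrX, vrC, vrD);
}
```

Note that this sketch spends an extra step on the duplication before the multiplication can be issued, in addition to the P real multiplications themselves.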
Performance requirements for digital signal vector processors tend to increase with every product generation, for example, due to higher load from applications such as ever increasing data bit rates of radio communication according to various standards.
Higher performance requirements may, to a certain extent, be addressed by increasing the number of vector elements, P, processed per SIMD instruction. However, the hardware cost and the power consumption typically increase at least linearly with P. Also, some circuits (e.g. circuits performing vector instructions for reciprocal, square-root, and reciprocal square root) have a certain hardware cost which is not proportional to how often they are used. Hence, simply replicating these circuits P times (with increasing P) has a relatively high cost.
A possibility to lower the area cost is to reuse a circuit to perform operations on multiple vector elements in sequential clock cycles. However, this increases the latency of a single vector instruction which typically impacts the length of the instruction schedule and, thus, increases the execution time.
Thus, other methods to meet the increasing performance requirements are typically needed or at least beneficial. Simplifying implementation of real-complex multiplications may be one such method to accommodate increasing performance demands.
Thus, there is a need for improved approaches to enabling real-complex multiplications, in particular for digital signal vector processors.