In current high-performance processor architectures, increasing performance by increasing the clock frequency is reaching its limits due to physical limitations. Instead, other methods of increasing performance are being exploited. One such method is to increase the parallelism, i.e. the number of operations performed in parallel in a single clock cycle, the clock cycle being the basic timing unit of a processor.
A familiar way to increase the parallelism is to exploit the Single Instruction, Multiple Data (SIMD) concept. In a SIMD processor, each single instruction acts on multiple data values simultaneously, performing the same operation on each of them. Such a SIMD processor may operate on fixed-length vectors. The fixed-length vectors may also be called rows or arrays and may comprise a number of data elements. For example, a 16-bit SIMD machine of width 32 works on rows of 32 elements, each being a 16-bit number, i.e. it processes 32*16=512 bits at once.
Operations take their arguments from the vector(s) according to the position within the vector, and generate a result. The result may be put either into an existing vector, as in the case of the exemplified operation A=−A, or into a new vector, as in the case of the exemplified operation C=A+B, where A, B and C are vectors. In both cases, each computed element of the result vector is located at the same position as the operand elements it was computed from, i.e. C[0]=A[0]+B[0], etc.
In FIG. 1, a first exemplified operation according to prior art is shown. A first vector 2 comprises the elements A[i], where i=1, . . . , N, and a second vector 4 comprises the elements B[i], where i=1, . . . , N. According to the shown example, the SIMD instruction is an adding function, wherein corresponding elements of the two vectors 2 and 4 are added in a pair-wise fashion, resulting in a third vector, the result vector 6. For all i within the vector length, the result vector is computed according to the following equation: C[i]=A[i]+B[i].
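The element-wise behaviour described above can be sketched in software as follows. This is only a scalar Python model, given for illustration: the function names simd_add and simd_negate are hypothetical and not taken from any instruction set, the lists are indexed from 0 as is usual in Python, and a real SIMD machine would process all elements in a single instruction rather than in a loop.

```python
def simd_add(a, b):
    """Model of the operation C = A + B: c[i] = a[i] + b[i] for every position i."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def simd_negate(a):
    """Model of the operation A = -A: every element is negated at its own position."""
    return [-x for x in a]


A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
C = simd_add(A, B)
print(C)  # [11, 22, 33, 44]
```

In this model, each result element depends only on the operand elements at the same position, which is the defining property of a SIMD operation.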
It shall be understood that SIMD operations are not limited to adding functions and that SIMD operations include all element-wise functions.
An extension of the idea of a SIMD processor is the so-called vector processor. In addition to the capability of performing SIMD operations, the vector processor may also be able to perform so-called intra-vector operations. Intra-vector operations are operations in which the elements within a single vector interact with each other. An example of such an operation is the calculation of the sum of the elements within a vector. Such an operation cannot be performed as a parallel operation on a pure SIMD machine, as such machines only operate on elements at the same position within the vectors. Examples of intra-vector operations are the addition of elements within a vector, which can also be called vector intra-add, finding the maximum or minimum element within a vector, and rearranging or permuting elements within a vector.
FIG. 2 shows a second exemplified operation according to prior art. More particularly, FIG. 2 illustrates an intra-vector operation on a complete vector. As can be seen from this Figure, the input elements in[i], i=0, . . . , 7, of vector 8 are summed and the result s0 is put into field 10.
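A full-width intra-add of this kind may be modelled as in the following sketch. The function name intra_add is hypothetical, and the loop merely stands in for whatever reduction network a given vector processor provides.

```python
def intra_add(vec):
    """Model of a full-width intra-add: all elements of one vector are summed."""
    total = 0
    for element in vec:
        total += element
    return total


inputs = [1, 2, 3, 4, 5, 6, 7, 8]  # stands in for in[0] .. in[7]
s0 = intra_add(inputs)
print(s0)  # 36
```

Unlike an element-wise operation, the single result here depends on every position of the input vector.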
A third exemplified operation according to prior art is depicted in FIG. 3. FIG. 3 gives an example of an intra-add operation on a segmented vector 12. The illustrated vector 12 is divided into a first segment 14 comprising the elements A[i], i=1, . . . , 4, and further segments indicated by reference sign 16 comprising the elements A[i], i=5, . . . , N. The elements of each segment 14 and 16 can be summed and put into respective result fields 18 and 20.
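The segmented intra-add can be modelled in the same manner. The sketch below assumes equal-length segments whose length divides the vector length; the function name segmented_intra_add and the parameter seg_len are hypothetical.

```python
def segmented_intra_add(vec, seg_len):
    """Model of a segmented intra-add: one partial sum per segment.

    Assumes that seg_len is a divisor of the vector length.
    """
    if seg_len <= 0 or len(vec) % seg_len != 0:
        raise ValueError("segment length must be a positive divisor of the vector length")
    return [sum(vec[start:start + seg_len]) for start in range(0, len(vec), seg_len)]


A = [1, 2, 3, 4, 5, 6, 7, 8]
print(segmented_intra_add(A, 4))  # [10, 26]
```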
The concept of SIMD operations, and of intra-vector operations, is already well known in computing architectures. However, when mapping an algorithm onto a vector processor, the length of the vectors in the processor does not always match the length of the data segments (chunks) that have to be processed in the algorithm. For example, consider a use case where the native vector length is sixteen, while the algorithm divides the input stream into segments of eight adjacent elements, which have to be accumulated. This is a typical situation in e.g. cellular communications based on Rake receivers, wherein the rake has a small spreading factor. According to the present example, the spreading factor is eight.
A simple vector intra-add operation according to FIG. 2 does not suffice to implement such an algorithm efficiently, as it will add all elements within a vector. Hence, in order to use a standard (full-width) intra-add, all elements which do not belong to a particular segment first have to be zeroed in a separate operation, before the intra-vector addition is performed. Additionally, this process has to be repeated for each segment within the vector. Finally, the results will likely have to be repacked into a result vector, in order to deliver the computed values in adjacent elements for further processing.
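The three steps described above (zeroing, full-width intra-add, repacking) may be sketched as follows for a vector length of sixteen and a segment length of eight. This is an illustrative scalar model only; the function names are hypothetical, and on real hardware each numbered step would correspond to at least one separate vector instruction.

```python
def full_intra_add(vec):
    """Model of a standard full-width intra-add."""
    return sum(vec)


def segment_sums_by_masking(vec, seg_len):
    """Compute per-segment sums using only a full-width intra-add."""
    results = []
    for start in range(0, len(vec), seg_len):
        # Step 1: zero every element that does not belong to the current segment.
        masked = [x if start <= i < start + seg_len else 0 for i, x in enumerate(vec)]
        # Step 2: perform the full-width intra-add on the masked vector.
        # Step 3: repack, i.e. place the sum next to the previous results.
        results.append(full_intra_add(masked))
    return results


vector = list(range(16))  # native vector length of sixteen
print(segment_sums_by_masking(vector, 8))  # [28, 92]
```

The model makes the cost visible: the mask and the intra-add are repeated once per segment, and the repacking comes on top of that.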
A segmented intra-add operation according to FIG. 3 provides a way to compute the partial sums efficiently. It does not, however, provide a way to collect the results efficiently. Furthermore, it only provides a solution for segment lengths that are a divisor of the vector length.
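One way to see the divisor restriction is to check which segments of a continuous input stream would cross a vector boundary. The sketch below is an illustration under assumed values (a vector length of sixteen, and a segment length of six chosen as a non-divisor); the function name straddling_segments is hypothetical.

```python
def straddling_segments(vector_len, seg_len, num_segments):
    """Return the indices of the stream segments that cross a vector boundary."""
    crossing = []
    for k in range(num_segments):
        first = k * seg_len          # stream position of the segment's first element
        last = first + seg_len - 1   # stream position of the segment's last element
        if first // vector_len != last // vector_len:
            crossing.append(k)
    return crossing


print(straddling_segments(16, 8, 4))  # []      eight divides sixteen
print(straddling_segments(16, 6, 8))  # [2, 5]  six does not divide sixteen
```

With a segment length of six, segments 2 and 5 have their elements spread over two consecutive vectors, so an intra-add confined to a single vector cannot produce their sums in one operation.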
Therefore, it is an object of the present application to provide a method wherein the segment length is unlimited. Another object is to provide a method for collecting the result output stream in an efficient way. A further object is to improve the efficiency of the vector processor.