1. Field of the Invention
The present invention relates to a data processing apparatus and method for performing scan operations.
2. Description of the Prior Art
One known technique for improving performance of a data processing apparatus is to provide circuitry to support execution of vector operations. Vector operations are performed on at least one vector operand, where each vector operand comprises a plurality of vector elements. Performance of the vector operation then involves applying an operation repetitively across the various vector elements within the vector operand(s).
In typical data processing systems that support performance of vector operations, a vector register file will be provided for storing the vector operands. Hence, by way of example, each vector register within a vector register file may store a vector operand comprising a plurality of vector elements.
In high performance implementations, it is also known to provide vector processing circuitry (often referred to as SIMD (Single Instruction Multiple Data) processing circuitry) which can perform operations in parallel on the various vector elements within the vector operands. In an alternative embodiment, scalar processing circuitry can still be used to implement the vector operation, but in this instance the vector operation is implemented by iterative execution of an operation through the scalar processing circuitry, with each iteration operating on different vector elements of the vector operands.
Through the use of vector operations, significant performance benefits can be realised when compared with the performance of an equivalent series of scalar operations.
One known type of operation is a scan operation, where an identified binary operation is applied repetitively to an increasing number of data elements. The binary operation can take a variety of forms, for example an add operation, multiply operation, minimum detection operation, maximum detection operation, etc. As a result of performance of the scan operation, a sequence of results is generated, each result relating to the application of the binary operation to a different number of the data elements. As a particular example, the scan operation may specify an add operation as the binary operation, such a scan add operation sometimes being referred to as a prefix sum operation. Considering an input sequence of numbers x0, x1, x2, . . . , application of the scan add operation will produce a sequence of results y0, y1, y2, . . . ,
where:
y0 = x0
y1 = x0 + x1
y2 = x0 + x1 + x2
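As a minimal sketch of the scan operation described above, the following Python applies a chosen binary operation to progressively longer prefixes of an input sequence, using the standard library's itertools.accumulate. The function name scan and the sample values are illustrative assumptions and are not taken from the cited papers.

```python
from itertools import accumulate
import operator


def scan(values, binary_op):
    """Inclusive scan: result[k] = values[0] op values[1] op ... op values[k]."""
    return list(accumulate(values, binary_op))


# Hypothetical input sequence x0, x1, x2, ...
x = [3, 1, 4, 1, 5]

prefix_sums = scan(x, operator.add)    # scan add (prefix sum)
running_max = scan(x, max)             # maximum detection as the binary operation
running_prod = scan(x, operator.mul)   # multiply as the binary operation
```

With the add operation, each result y_k is the sum of x0 through x_k, matching the definitions of y0, y1 and y2 given above.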
The following are examples of papers that describe scan operations:
S. Knowles, “A family of adders,” in Symposium on Computer Arithmetic, 1999. Proceedings. 14th IEEE. IEEE Comput. Soc, 1999;
G. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha, “Implementation of a portable nested data-parallel language,” Journal of Parallel and Distributed Computing, vol. 21, no. 1, April 1994;
S. Chatterjee, G. Blelloch, and M. Zagha, “Scan primitives for vector computers,” in Proc. SUPERCOMPUTING '90. IEEE Comput. Soc. Press, 1990; and
G. Blelloch, “Prefix Sums and Their Applications”, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa. 15213-3890, http://www.cs.cmu.edu/~guyb/papers/Ble93.pdf.
For performance reasons, it would be advantageous to vectorise such scan operations. FIG. 1 schematically shows the serialised implementation of a vector scan operation, considering an input vector operand comprising eight vector elements v0 to v7 and a scalar carry-in value s. As shown, such an approach requires N processing steps and N operations, where N is the number of vector elements within the vector operand, and accordingly in this example N=8.
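The serialised approach can be sketched as follows, assuming that the scalar carry-in is added to the first vector element and that each later result adds the next element to the previous result. FIG. 1 is not reproduced here, so this ordering, the function name serial_scan_add and the sample values are assumptions made for illustration.

```python
def serial_scan_add(vector, carry_in):
    """Serialised scan add: one add operation per processing step.

    Returns the list of results together with the number of steps taken,
    which equals the number of operations (N for an N-element vector).
    """
    results = []
    running = carry_in
    steps = 0
    for element in vector:
        running = running + element  # the single operation of this step
        results.append(running)
        steps += 1
    return results, steps


# Hypothetical eight-element vector v0..v7 and scalar carry-in s
v = [1, 2, 3, 4, 5, 6, 7, 8]
s = 10
results, steps = serial_scan_add(v, s)
```

Each step depends on the result of the previous one, which is why the N operations cannot overlap in this form.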
Such an approach might be used in low-end systems that shy away from the hardware costs associated with using vector processing circuitry to perform scan operations. Its low number of operations makes it simple and energy-efficient, but the approach does not exploit the potential performance gains from parallelisation.
FIG. 2 shows a fully parallelised approach that could be used to speed up the performance of a vector scan operation when compared with the approach of FIG. 1, for example by providing a suitable SIMD processing circuit having eight lanes of parallel processing. As shown, the scan operation is split into four discrete parts, 30, 35, 40, 45, each part involving multiple operations. The first three parts 30, 35, 40 all operate solely on the vector elements, with the final part 45 then adding the scalar value 42 to each of the vector elements resulting from performance of the third part 40 of the scan operation.
In accordance with this approach, the number of processing steps required reduces to log2(N)+1 (the additional processing step being required to incorporate the scalar carry-in value 42), but the number of operations is given by the equation:
$$N + \sum_{i=0}^{\log_2 N - 1} \left(N - 2^{i}\right)$$
and hence, for N=8, the number of operations increases to 8+(7+6+4)=25. Whilst the performance benefits of such an approach are significant (in this example reducing the number of processing steps from 8 to 4), the increase in the number of operations gives rise to a significant increase in the energy consumption of the apparatus performing the vector scan operation. In particular, the dynamic energy consumption will increase due to the increase in the number of operations. In addition, the various operations required by the approach of FIG. 2 significantly increase the complexity and size of the processing circuitry required to execute those operations, which also gives rise to an increase in leakage current.
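The step and operation counts can be checked with a small sketch. It assumes that in part i of the scan each element at position 2^i or higher is combined with the element 2^i positions earlier, which is consistent with the N - 2^i operations per part in the count above. FIG. 2 is not reproduced here, so this data flow, the function name parallel_scan_add and the sample values are assumptions.

```python
def parallel_scan_add(vector, carry_in):
    """Fully parallelised scan add, simulated sequentially.

    Returns (results, processing_steps, operations). The additions inside
    one step are mutually independent, so SIMD lanes could perform them
    in parallel; here they are simply counted.
    """
    n = len(vector)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes N is a power of two")

    current = list(vector)
    steps = 0
    operations = 0
    offset = 1
    while offset < n:
        nxt = list(current)
        for lane in range(offset, n):
            # Combine each element with the one `offset` positions earlier.
            nxt[lane] = current[lane] + current[lane - offset]
            operations += 1  # N - offset operations in this step
        current = nxt
        steps += 1
        offset *= 2

    # Final step: add the scalar carry-in to every element (N operations).
    current = [value + carry_in for value in current]
    operations += n
    steps += 1
    return current, steps, operations


# Hypothetical eight-element vector and scalar carry-in
results, steps, operations = parallel_scan_add([1, 2, 3, 4, 5, 6, 7, 8], 10)
```

For N=8 this performs 7, 6 and 4 operations in the first three steps and 8 in the last, giving 4 processing steps and 25 operations, compared with 8 steps and 8 operations for the serialised form.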
It would accordingly be desirable to provide an approach which enables a balance to be achieved between the performance, and the associated energy consumption, when performing vector scan operations.