Field of the Invention
The present invention relates to a data processing apparatus and method for performing segmented operations.
Description of the Prior Art
One known technique for improving performance of a data processing apparatus is to provide circuitry to support execution of vector operations. Vector operations are performed on at least one vector operand, where each vector operand comprises a plurality of data elements. Performance of the vector operation then involves applying an operation repetitively across the various data elements within the vector operand(s).
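By way of illustration only, the repetitive application of an operation across the data elements of the vector operand(s) may be modelled in software as sketched below. The function name `vector_op` and the use of Python lists to represent vector operands are assumptions made for this sketch, and do not describe any particular hardware implementation.

```python
import operator


def vector_op(op, *operands):
    """Apply `op` at each data element position of the vector operand(s).

    Hypothetical software model: each operand is a list of data elements,
    and the same operation is repeated for every element position.
    """
    lengths = {len(v) for v in operands}
    if len(lengths) != 1:
        raise ValueError("all vector operands must have the same number of data elements")
    # zip(*operands) groups the data elements that share an element position.
    return [op(*elems) for elems in zip(*operands)]


# A vector add on two four-element operands, and a negate on one operand.
total = vector_op(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
negated = vector_op(operator.neg, [1, -2, 3])
```

In this sketch a single call stands in for what would otherwise be a series of scalar operations, one per data element.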
In typical data processing systems that support performance of vector operations, a vector register file will be provided for storing the vector operands. Hence, by way of example, each vector register within a vector register file may store a vector operand comprising a plurality of data elements.
In certain implementations, it is also known to provide vector processing circuitry (often referred to as SIMD (Single Instruction Multiple Data) processing circuitry) which provides multiple lanes of parallel processing in order to perform operations in parallel on the various data elements within the vector operands.
Through the use of vector operations, significant performance benefits can be realised when compared with the performance of an equivalent series of scalar operations.
For certain types of operations which can be vectorised to enable them to be executed in parallel within the various lanes of the vector processing circuitry, it is difficult to obtain efficient utilisation of the vector processing circuitry. For example, there are often operations which are performed on each iteration of a loop, but within each iteration the number of data elements to be processed by those operations can vary, such that there is a lack of regularity in the number of data elements to be processed in each iteration. Whilst for each iteration the various data elements may be able to be processed within respective lanes of the vector processing circuitry, this will not always lead to good utilisation of the available lanes of the vector processing circuitry. For example, if the vector processing circuitry has N lanes of parallel processing, it may often be the case that fewer than N data elements are processed in respect of several of the iterations, leading to inefficient utilisation of the vector processing circuitry. Further, due to the irregular nature of the data elements for each iteration, it has until now been considered impractical to make more efficient use of the vector processing circuitry, since it is unclear exactly how many lanes will be required on any particular iteration.
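The utilisation problem described above can be quantified with a simple model, sketched here under stated assumptions. The sketch assumes that each iteration is processed separately, that an iteration with more than N data elements occupies several passes through the lanes, and that any lane left without a data element in a pass is idle. The function name `lane_utilisation` and the example element counts are hypothetical.

```python
import math


def lane_utilisation(elements_per_iteration, num_lanes):
    """Return the fraction of lane slots that perform useful work.

    Assumed model: each loop iteration is handled on its own, so an
    iteration with n data elements occupies ceil(n / num_lanes) passes,
    and unused lanes in a pass are wasted.
    """
    passes = sum(math.ceil(n / num_lanes) for n in elements_per_iteration)
    if passes == 0:
        return 0.0
    useful = sum(elements_per_iteration)
    return useful / (passes * num_lanes)


# Six iterations with an irregular number of data elements, on N = 4 lanes.
counts = [2, 1, 4, 1, 3, 1]
utilisation = lane_utilisation(counts, num_lanes=4)
```

For these example counts, twelve data elements are processed in six passes of four lanes each, so under this model only half of the available lane slots do useful work.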
One example of an algorithm where such irregular numbers of data elements need to be processed is a sparse matrix multiplication algorithm, where a sparse matrix of first data elements is multiplied by a vector of second data elements in order to produce a number of multiplication results for each row of the sparse matrix. The multiplication results within each row are then accumulated in order to produce a result for each row. However, the number of multiplication results produced for each row is dependent on the number of non-zero data elements in each row of the sparse matrix, and hence the number of multiplication results can vary quite significantly between the various rows. Whilst the accumulation operation required to accumulate the multiplication results for any particular row lends itself to being performed using lanes of the vector processing circuitry, the number of lanes required for any particular iteration will vary, and this will tend to result in significant underutilisation of the vector processing circuitry, which will affect both performance and the energy consumption of the vector processing circuitry when performing those operations.
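The sparse matrix multiplication described above may be sketched as follows. The compressed sparse row (CSR) layout, the function names `spmv_csr` and `row_lengths`, and the example matrix are assumptions chosen for illustration; the text above does not prescribe any particular storage format.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Multiply a CSR sparse matrix by the vector x, one row per iteration."""
    y = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        # One multiplication result per non-zero data element in this row.
        products = [values[k] * x[col_idx[k]] for k in range(start, end)]
        # Accumulate the multiplication results to produce the row result.
        y.append(sum(products))
    return y


def row_lengths(row_ptr):
    """Number of multiplication results produced for each row."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


# Example 4x4 sparse matrix (hypothetical data):
#   [[1, 0, 0, 2],
#    [0, 0, 0, 0],
#    [3, 4, 5, 6],
#    [0, 7, 0, 0]]
values = [1, 2, 3, 4, 5, 6, 7]
col_idx = [0, 3, 0, 1, 2, 3, 1]
row_ptr = [0, 2, 2, 6, 7]
x = [1, 2, 3, 4]

y = spmv_csr(values, col_idx, row_ptr, x)
lengths = row_lengths(row_ptr)
```

In this example the rows produce 2, 0, 4 and 1 multiplication results respectively, so the number of data elements to be accumulated differs on every iteration of the row loop.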
Recent attempts at solving irregular problems such as sparse matrix vector multiplication have focused on using throughput-oriented processors or graphics processing units (GPUs). Whilst GPUs are very good at overlapping computation with memory accesses and thus hiding latency, they experience difficulties when the irregularity of the data structures manifests as computational load imbalances. As a result, the efforts are only successful when, for example, special data formats are used or the underlying physical problem being modelled produces a well-structured sparse matrix.
The following are examples of various papers that describe techniques for handling irregular data structures:
1. Shubhabrata Sengupta, Efficient Primitives and Algorithms for Many-core architectures, PhD Thesis, 2010.
2. G. E. Blelloch, J. C. Hardwick, J. Sipelstein, M. Zagha, S. Chatterjee, Implementation of a Portable Nested Data-Parallel Language, Journal of Parallel and Distributed Computing, Volume 21, Issue 1, April 1994.
3. B. Ren, G. Agrawal, J. R. Larus, T. Mytkowicz, T. Poutanen, W. Schulte, SIMD Parallelization of Applications that Traverse Irregular Data Structures, 2013.
4. M. Billeter, O. Olsson, U. Assarsson, Efficient Stream Compaction on Wide SIMD Many-Core Architectures, Proceedings of the Conference on High Performance Graphics 2009.
It would be desirable to provide a mechanism that improves utilisation of the lanes of parallel processing within vector processing circuitry when handling a variety of sets of data, for example the irregular data structures described earlier.