In data processing, it is common to perform operations on one-dimensional arrays of data called vectors. The microarchitecture of a data processor can be designed to take advantage of such operations. For example, when processing data vectors, a single instruction may be executed multiple times, but the instruction only needs to be fetched and decoded once. Further, the data may be at uniformly spaced locations, so register renaming and address translation do not need to be performed multiple times.
Data processors optimized for operating on data vectors may be called vector processors or array processors. A vector processor implements an instruction set containing instructions that explicitly operate on data vectors (usually multiple data elements), whereas general-purpose data processors implement scalar instructions that operate on single data items. For example, some data processors implement SIMD (Single Instruction, Multiple Data) instructions to provide a form of vector processing on multiple (vectorized) data sets.
A disadvantage of using a special instruction set for vector operations is that a programmer or compiler must know in advance (or statically) when vector operations are to be performed and the amount of data to be processed. This is not always possible, since the number of data elements to be processed may itself depend on the input data.
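The distinction can be illustrated with a minimal sketch in Python. The function names, the constant `VECTOR_LENGTH`, and the use of a sentinel value are assumptions chosen for illustration only; they are not taken from any particular instruction set or compiler. In the first function the number of elements is a fixed constant, while in the second it depends on where a sentinel happens to appear in the input data.

```python
VECTOR_LENGTH = 4  # assumed fixed vector length, chosen only for illustration


def scale_static(data, factor):
    """Scale exactly VECTOR_LENGTH elements; the bound is a known constant."""
    return [data[i] * factor for i in range(VECTOR_LENGTH)]


def scale_until_sentinel(data, factor, sentinel=0):
    """Scale elements until a sentinel is seen.

    The number of elements processed depends on the input data itself,
    so it cannot be determined before the data is examined.
    """
    out = []
    for x in data:
        if x == sentinel:
            break
        out.append(x * factor)
    return out
```

For example, `scale_until_sentinel([1, 2, 0, 4], 2)` processes only the two elements before the sentinel, a count that is not available until the data has been read.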
Data processing systems commonly execute a number of threads. The execution threads may be performed serially on a single processor using time-slicing, in parallel on a number of linked processing cores, or a combination thereof. In many applications, there is a desire to receive data from multiple execution threads, perform operations on the data and pass the processed data to other execution threads. When multiple cores are used, the potential advantages of vector processing may not be achieved because of the resources needed to pass data between threads. For example, in the absence of dedicated hardware, a core-to-core transfer may take about 630 cycles using a software first-in, first-out (FIFO) buffer. Data transfer between sockets may take about 1500 cycles. In addition, cache misses may occur on both producer and consumer cores.
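The producer and consumer arrangement described above can be sketched as follows, using Python's standard `queue.Queue` as the software FIFO. This is an illustrative sketch only: the function names, the end-of-stream marker `_DONE`, the squaring operation, and the default capacity are all assumptions, and the sketch shows the structure of the transfer rather than the cycle counts mentioned above.

```python
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker


def producer(fifo, items):
    """Push each item into the FIFO, then signal the end of the stream."""
    for item in items:
        fifo.put(item)  # blocks while the FIFO is full
    fifo.put(_DONE)


def consumer(fifo, results):
    """Pop items from the FIFO and process each one until the stream ends."""
    while True:
        item = fifo.get()  # blocks while the FIFO is empty
        if item is _DONE:
            break
        results.append(item * item)  # stand-in for a per-element operation


def run_pipeline(items, capacity=4):
    """Run one producer thread and one consumer thread over a bounded FIFO."""
    fifo = queue.Queue(maxsize=capacity)
    results = []
    threads = [
        threading.Thread(target=producer, args=(fifo, items)),
        threading.Thread(target=consumer, args=(fifo, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every element passes through the buffer individually, with synchronization on each `put` and `get`. That per-element overhead is the kind of cost that can offset the gains of vector processing when data moves between threads.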
There exists a need for a data processor that can perform efficient vector processing in a multi-thread execution environment. Current approaches for auto-vectorization require that the data-flow bounds be determined statically. For example, it may be required that the loop bounds be known at compilation time, rather than determined dynamically during execution.