Embodiments of the present invention relate to data processing, and more particularly to processing of vector data.
Many processors have an architecture that readily supports execution of scalar instructions on scalar data. In other words, these architectures typically execute each instruction to perform an operation on a single data element at a given time. In contrast, a vector processor can execute an operation on multiple data elements simultaneously.
Most modern microprocessors are of a scalar-based architecture, although many such processors implement extensions, commonly referred to as single instruction multiple data (SIMD) instructions, to perform certain vector processing. However, these processors are generally not designed to handle very wide data paths. Accordingly, SIMD instruction execution is limited to the standard width of the processor's data path, which is often 64 or 128 bits. In contrast, vector processors typically can handle vector operations on wider data paths.
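By way of illustration only, the difference between scalar execution and width-limited SIMD execution can be sketched in portable C. The sketch below is a software model, not a description of any particular processor: the 128-bit data path, the 32-bit element size, and the names `add_scalar` and `add_simd_style` are assumptions chosen for the example. Each function returns the number of modeled operations issued, so the effect of the data path width on the operation count can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: a 128-bit data path holding four 32-bit lanes. */
enum { DATAPATH_BITS = 128, LANE_BITS = 32, LANES = DATAPATH_BITS / LANE_BITS };

/* Scalar model: each modeled operation handles exactly one data element.
 * Returns the number of operations issued. */
size_t add_scalar(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
    size_t ops = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
        ops++;
    }
    return ops;
}

/* SIMD-style model: each modeled operation handles up to LANES elements.
 * The inner loop stands in for lanes that hardware would process at once.
 * Returns the number of operations issued. */
size_t add_simd_style(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t ops = 0;
    for (size_t i = 0; i < n; i += (size_t)LANES) {
        /* The final chunk may be partial when n is not a multiple of LANES. */
        size_t end = (n - i < (size_t)LANES) ? n : i + (size_t)LANES;
        for (size_t j = i; j < end; j++) {
            dst[j] = a[j] + b[j];
        }
        ops++;
    }
    return ops;
}
```

Under these assumptions, adding two ten-element arrays takes ten modeled operations in the scalar case and three in the SIMD-style case. A wider data path would reduce the count further, which is the advantage attributed above to vector processors.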
Some processors include both scalar processor units and vector processor units. Typically these processor units are completely independent, and thus act as separate processors or co-processors. Accordingly, each processor unit consumes significant die area, power and processing bandwidth.
Accordingly, a need exists for improved vector execution that avoids the area, power and bandwidth costs of a full vector processor.