Modern microprocessors typically include a pipeline having different stages, including one or more front-end stages to obtain an instruction and then begin processing it. These stages place the instruction, which is often received in a so-called macro-instruction format, into a format usable by the processor, e.g., one or more micro-instructions or so-called μops. These μops are passed to further portions of the processor pipeline. For example, an out-of-order engine may reorder instructions from their program order to an order more efficient for processing purposes. From this out-of-order engine, instructions may be provided to one or more of multiple execution units. The execution units are the calculating engines of the processor and can perform various operations on the data, such as arithmetic and logic operations. Different processors may have different types of execution units. When results are obtained in these execution units, the resulting data can be provided to one or more back-end stages of the processor, such as a reorder engine that can place instructions executed out of order back into program order. Back-end stages may further include a retirement unit to retire instructions that have been validly completed.
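The flow described above, from macro-instruction decode through out-of-order completion to in-order retirement, can be illustrated with a small software model. This is only a toy sketch and not a description of any particular processor: the decode table, the mnemonics, the latencies, and the names `Uop`, `decode`, `execute_out_of_order` and `retire_in_order` are all hypothetical, and the model assumes every μop is independent and issues at the same time, which real hardware does not.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    seq: int      # position in program order
    text: str     # hypothetical μop mnemonic
    latency: int  # assumed number of cycles spent in an execution unit


# Hypothetical decode table: macro-instruction -> list of (μop, latency).
DECODE_TABLE = {
    "MUL r1, r2": [("mul", 4)],
    "ADD r3, r4": [("add", 1)],
    "PUSH r5": [("sub_sp", 1), ("store", 2)],
}


def decode(macro_instructions):
    """Front end: turn each macro-instruction into one or more μops."""
    uops = []
    for macro in macro_instructions:
        for text, latency in DECODE_TABLE[macro]:
            uops.append(Uop(len(uops), text, latency))
    return uops


def execute_out_of_order(uops):
    """Toy out-of-order engine: assuming all μops are independent and
    issue together, they complete in order of latency (ties keep
    program order because sorted() is stable)."""
    return sorted(uops, key=lambda u: u.latency)


def retire_in_order(completed):
    """Back end: hold completed μops until every older μop has also
    completed, then retire them in program order."""
    waiting = {}
    retired = []
    next_seq = 0
    for uop in completed:
        waiting[uop.seq] = uop
        while next_seq in waiting:
            retired.append(waiting.pop(next_seq))
            next_seq += 1
    return retired


program = ["MUL r1, r2", "ADD r3, r4", "PUSH r5"]
uops = decode(program)
completed = execute_out_of_order(uops)
retired = retire_in_order(completed)
```

In this model the slow multiply completes last even though it is first in program order, and the retirement step holds the younger μops back until it finishes, so results become architecturally visible in program order.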
Historically, processors were configured to operate on scalar values, such as 8-bit, 16-bit, 32-bit or other width values. As processing speeds and transistor counts have increased, many processors have begun to incorporate vector units. Vector units execute a single instruction on multiple data elements, in which the instruction may be in so-called single instruction multiple data (SIMD) form. Such vector processing can be especially well suited to graphics and other compute-intensive workloads. While certain user-level instructions have been introduced to perform some operations on vector data, there are still inefficiencies in processing vector data. Furthermore, while certain execution units are configured to handle vector operations, these hardware units also can be inefficient for certain vector processing.
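The difference between scalar and SIMD operation can be sketched in software as follows. This is an illustrative model only: the four-lane width, the 8-bit lane size, the wraparound (modular) overflow behavior, and the names `scalar_add` and `simd_add` are assumptions chosen for the example, not features of any specific instruction set.

```python
LANES = 4                      # assumed number of elements per vector
LANE_BITS = 8                  # assumed width of each element
MASK = (1 << LANE_BITS) - 1    # keeps each result within 8 bits


def scalar_add(a, b):
    """One scalar instruction: adds a single pair of 8-bit values."""
    return (a + b) & MASK


def simd_add(va, vb):
    """One vector instruction: adds all lanes of two vectors at once.
    Each lane wraps around independently on overflow (an assumption)."""
    if len(va) != LANES or len(vb) != LANES:
        raise ValueError("operands must have exactly LANES elements")
    return [(x + y) & MASK for x, y in zip(va, vb)]


a = [10, 200, 30, 255]
b = [5, 100, 7, 1]

# Scalar form: four separate add instructions, one per element.
scalar_result = [scalar_add(x, y) for x, y in zip(a, b)]

# SIMD form: conceptually a single instruction over four elements.
vector_result = simd_add(a, b)
```

Both forms produce the same values; the point of the vector form is that the hardware would issue one instruction where the scalar form issues four.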