The present invention relates to data processing and, more particularly, to data processors that operate on data subwords in parallel. A major objective of the invention is to enhance performance while executing parallel subword operations.
Much of modern progress is associated with advances in computer technology, which has provided increasing performance while lowering costs. One way that performance has improved is through increases in word size, i.e., the maximum number of bits that can be treated as a single unit by a data processor. Early processors manipulated information one byte (eight bits) at a time, while modern processors are characterized by word sizes of 32 bits, 64 bits, or greater.
To take advantage of the large word sizes available, some modern processors include instructions that treat source and destination registers as being composed of multiple subword values, and operate on these subwords in parallel. Such “parallel subword” instructions are typically used for multimedia applications, where a given computation is performed over a large number of small values (for example, 8-bit or 16-bit pixels). Instructions that perform this computation in parallel on multiple values packed into a register (for example, eight 8-bit or four 16-bit values in a 64-bit register) take full advantage of the data path widths in the processor and accelerate the computation.
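By way of illustration only, a parallel subword addition over eight 8-bit values packed in a 64-bit word can be modeled in software as follows. The name `padd8` and the masking technique are assumptions made for this sketch; they do not describe the instruction set of any particular processor.

```python
MASK64 = (1 << 64) - 1          # models a 64-bit register
LOW7 = 0x7F7F7F7F7F7F7F7F       # low seven bits of every 8-bit subword
HIGH = 0x8080808080808080       # most-significant bit of every subword


def padd8(a: int, b: int) -> int:
    """Add eight 8-bit subwords of a and b, wrapping within each subword.

    Hypothetical software model of a parallel subword add.
    """
    # Adding only the low seven bits guarantees that no carry
    # crosses from one subword into the next.
    partial = (a & LOW7) + (b & LOW7)
    # The top bit of each subword is the XOR of the two top bits and
    # the carry into that bit, which 'partial' already holds.
    return (partial ^ ((a ^ b) & HIGH)) & MASK64
```

For example, `padd8(0x01FF, 0x0101)` yields `0x0200`: the least-significant subword wraps from 0xFF to 0x00 without disturbing its neighbor, which becomes 0x02.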
The performance advantages offered by parallel subword instructions can be offset by the instructions required to pack the operand data. One common form of instruction stores a subword result in the least-significant subword location of a register, setting all other subword locations to default zeroes. Such results can be shifted relative to each other (filling vacated locations with default zeroes) so that they can be combined by adding or ORing to achieve the desired packing. In the case of a 64-bit register and one-byte results, seven shift operations and seven additions can be required for packing. Thus, as many as fourteen instructions can be required to pack the subwords before a parallel subword operation can be performed.
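The shift-and-combine packing described above can be sketched as follows, with each loop iteration standing in for one shift instruction and one OR instruction. The function name `pack_shift_or` and the instruction counters are hypothetical and serve only to make the count of fourteen visible.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit register


def pack_shift_or(results):
    """Pack eight one-byte results into one 64-bit word.

    Each entry of 'results' models a register whose least-significant
    subword holds a result and whose other subwords are zero.
    Returns (packed_word, shift_count, or_count).
    """
    assert len(results) == 8
    packed = results[0]              # subword 0 is already in place
    shifts = 0
    ors = 0
    for i in range(1, 8):
        shifted = (results[i] << (8 * i)) & MASK64   # one shift instruction
        shifts += 1
        packed |= shifted                            # one OR instruction
        ors += 1
    return packed, shifts, ors
```

With the inputs 0x11 through 0x88, this model returns the packed word 0x8877665544332211 after seven shifts and seven ORs.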
Specialized packing instructions can reduce the number of additional instructions required to pack results initially stored in the same subword location of different registers. For example, “Mix” instructions implemented in the Itanium 2 processor effectively perform shifting and combining so that only seven mix instructions are required to pack eight one-byte results into a single 64-bit register.
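The count of seven can be seen by combining results pairwise in a tree: four combining operations on one-byte fields, two on two-byte fields, and one on four-byte fields. The sketch below uses a deliberately simplified, hypothetical operation `mix_low` that places the low field of one register directly above the low field of another; it is an assumption made for counting purposes and does not reproduce the actual semantics or encoding of the Itanium mix instructions.

```python
def mix_low(hi: int, lo: int, width: int) -> int:
    """Hypothetical simplified combine: low 'width' bits of 'hi'
    placed directly above the low 'width' bits of 'lo'."""
    field = (1 << width) - 1
    return ((hi & field) << width) | (lo & field)


def pack_with_mix(results):
    """Pack eight one-byte results using a tree of combine operations.

    Returns (packed_word, instruction_count).
    """
    assert len(results) == 8
    level = list(results)
    width = 8
    count = 0
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            # One combining instruction per pair of registers.
            merged.append(mix_low(level[i + 1], level[i], width))
            count += 1
        level = merged
        width *= 2          # fields double in size at each tree level
    return level[0], count
```

In this model, the inputs 0x11 through 0x88 are packed into 0x8877665544332211 using 4 + 2 + 1 = 7 combining operations.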
If instead of storing all results in the same (e.g., least-significant) subword location, instructions store results in different subword locations, separate shifting instructions are not required. Thus, the instructions can specify different subword locations for their respective result registers, filling all unspecified subword locations of the result register with default zeroes. In that case, only addition or OR instructions are required to pack results. For example, seven OR instructions can be used to pack eight one-byte results into a single register.
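This OR-only packing can be sketched as follows. The helper `result_at`, which models an instruction writing its one-byte result to a specified subword location of its own zero-filled result register, and the function `pack_or_only` are hypothetical names introduced for illustration.

```python
def result_at(value: int, slot: int) -> int:
    """Model an instruction that stores a one-byte result in subword
    'slot' of its result register and zero-fills all other subwords."""
    return (value & 0xFF) << (8 * slot)


def pack_or_only(registers):
    """Combine registers whose results occupy distinct subword locations.

    Returns (packed_word, or_count). No shifts are needed because each
    result is already in its final position.
    """
    packed = registers[0]
    ors = 0
    for reg in registers[1:]:
        packed |= reg       # one OR instruction per additional register
        ors += 1
    return packed, ors
```

Eight registers produced by `result_at` with slots 0 through 7 are thus packed with seven ORs and no shifts.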
By storing successive subword results in different subword locations of the same register, the need for separate packing instructions can be eliminated. However, this requires that each instruction preserve the contents of subword locations not used to store the result. This can be accomplished by reading the result register, modifying the contents so read by replacing the specified subword location with result data, and then overwriting the result register with the modified data.
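The read-modify-write sequence just described can be sketched as follows. The function name `self_packing_write` is hypothetical; the point of the sketch is that the prior contents of the result register are an input to the operation, alongside the result data itself.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit register


def self_packing_write(dest: int, value: int, slot: int) -> int:
    """Replace one 8-bit subword of 'dest' with 'value', preserving the rest.

    'dest' is the current content of the result register, which must be
    read before it can be modified and written back.
    """
    lane = 0xFF << (8 * slot)
    preserved = dest & ~lane & MASK64          # keep all other subwords
    return preserved | ((value & 0xFF) << (8 * slot))
```

Applying this operation eight times to the same register, once per subword location, leaves the register fully packed with no separate packing instructions.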
While these “self-packing” instructions eliminate the need for the additional packing instructions, they do require that a read port be dedicated to reading the result register so that some of its contents can be preserved. The read port so used is unavailable for operand data, which reduces the amount of data that can be processed per instruction. Where there are only two register read ports, which is typically the case for general-purpose processors, only one port remains available for operand data. Since parallel subword instructions require at least two read ports for operand data, this approach is not compatible with such instructions in the context of a general-purpose processor.
While it is possible to design a processor with three register read ports, this is considered excessive in the context of general-purpose processor design, where the number of register read ports is typically two. Thus, only the first three approaches to packing results are compatible with parallel subword and other two-operand instructions in the context of a general-purpose processor design. What is needed is an approach that minimizes the instruction count required to pack parallel subword instruction results while being compatible with two-operand instructions in the context of a general-purpose processor.