1. Field of the Invention
This invention relates in general to data processing, and more particularly to performing parallel operations on multiple signed elements in a single register.
2. Description of the Related Art
In the highly competitive software industry, software performance is a key consideration. It is often a factor by which similar software products are differentiated from one another and by which they are judged by their users. As a result, the software industry is constantly looking for ways to improve software performance within the constraints of the available hardware.
In this context, data processors function by performing operations on data elements held in registers, in accordance with instructions provided by a computer program. For example, a data element is first loaded into a register and an appropriate operation is then performed on the register, such an operation taking a single cycle of the processor. A register may comprise, for example, 8, 16, 32, 64, or 128 bits, among other widths. Example operations include subtract, multiply by a constant, shift left logical, shift right logical, AND, and OR.
In this environment, space in a register is often wasted by loading and operating on data elements that require only a subset of the bits available in the register. For example, if a 16-bit register is used to perform an operation on a 4-bit data element, 12 bits of the register are potentially wasted for a given cycle. Accordingly, U.S. Pat. No. 6,834,337 to Mitchell et al. describes how to convert computer processors with wide registers into (effectively) vector scalar processors without any hardware changes. Multiple signed independent elements are packed into a register, and an operation is performed on all of the elements in parallel. Operations such as add, subtract, multiplication by constants, OR, AND, exclusive OR (XOR), and shift left logical require only one cycle, independent of the number of elements packed into the register. However, the unpack (convert the dependent values back into independent elements within their allotted precision), repack (reconvert the independent values back into dependent elements), shift right arithmetic (SRA), and shift right logical (SRL) operations were shown as needing loops proportional to the number of elements (for unpack or repack operations) or the number of shifts (for SRA and SRL operations) in order to handle the propagation of borrows from negative numbers correctly. This created significant overhead that decreased the overall efficiency.
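By way of illustration only, the packing scheme and the borrow problem described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the cited patent: the 8-bit field width and the function names pack and unpack are hypothetical, and a Python integer stands in for a two's-complement register of sufficient width. The sketch shows that a single add operates on all packed elements at once, that a negative element borrows from the field above it (so the packed fields are dependent values), and that a straightforward unpack needs one loop iteration per element.

```python
# Illustrative sketch only. WIDTH, pack and unpack are assumed names, and a
# Python int models a sufficiently wide two's-complement register.
WIDTH = 8  # assumed number of bits allotted to each element


def pack(elements, width=WIDTH):
    """Pack signed elements as the sum of e_i * 2**(i * width).

    A negative element borrows from the field above it, so the fields of
    the packed value are "dependent" on their lower neighbours.
    """
    packed = 0
    for i, element in enumerate(elements):
        packed += element << (i * width)
    return packed


def unpack(packed, count, width=WIDTH):
    """Recover independent signed elements, one loop iteration per element."""
    mask = (1 << width) - 1
    sign_bit = 1 << (width - 1)
    elements = []
    for _ in range(count):
        field = packed & mask
        if field & sign_bit:
            field -= 1 << width  # sign-extend the low field
        elements.append(field)
        # Subtracting the signed field repays any borrow it took from the
        # next field before that field is shifted into the low position.
        packed = (packed - field) >> width
    return elements


if __name__ == "__main__":
    a = [3, -2, 5]
    b = [-4, 7, -6]
    total = pack(a) + pack(b)       # one add covers all three elements
    print(unpack(total, len(a)))    # element-wise sums
    print((pack(a) >> 16) & 0xFF)   # naive extraction of the top field of a
```

In this sketch the naive extraction of the top field of a yields 4 rather than 5, because the negative middle element has borrowed from it. The loop in unpack repairs such borrows one field at a time, which is the per-element overhead referred to above.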
U.S. Ser. No. 10/445,563 to Brady et al. describes a hardware assist to accomplish the PACK, UNPACK, and SRA/SRL operations in fewer cycles. However, whilst this assist is natural for FPGA designs, since it does not have to disturb the basic arithmetic logic unit (ALU) design, it is not practical for purely software implementations.
Accordingly, there is a need for a more efficient (i.e., faster) software solution for the execution of unpack, repack, SRL, and SRA operations on a register into which multiple signed elements have been packed.