Modern processors generally handle each instruction in the following sequential order: the processor reads the instruction, a decoder in the processor decodes it, and the processor then executes it. In older processors, the clock speed was generally slow enough that the reading, decoding, and executing of each instruction could occur in a single clock cycle. Modern microprocessors, however, have improved performance by moving to shorter clock cycles (that is, higher frequencies). Because less work fits into a shorter cycle, instructions tend to be divided into multiple, smaller sub-actions that each fit into the cycle time. Executing many such sub-actions in parallel, as in a pipelined and/or superscalar processor, can improve performance even further. Although the cycle time of a present-day processor depends on a number of factors, it is generally determined by the number of gate inversions that must be performed during a single cycle. Ideally, the execute stage determines the cycle time; in reality, this is not always the case. When high-frequency operation is desired, the execute stage can be performed across more than one cycle, since it is an activity that can be pipelined. For a large number of workloads, the added latency caused by the additional cycle(s) has only a small impact on processor performance. The ultimate goal of many systems is to complete the execution of as many instructions as possible, as quickly and efficiently as possible, without adversely impacting the cycle time of the processor.
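As a rough illustration of why pipelining helps, and why an extra execute cycle often costs little, the following sketch counts cycles under a deliberately idealized model. The model is an assumption made here for illustration only: every stage takes exactly one cycle, there are no stalls or dependencies between instructions, and the function names are hypothetical.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Cycles needed when each instruction finishes all of its stages
    before the next instruction starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """Cycles needed in an ideal pipeline: the first instruction takes
    n_stages cycles, and each later one completes one cycle after the
    instruction ahead of it."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


# 100 instructions through read, decode, and execute (3 stages):
print(cycles_unpipelined(100, 3))  # 300
print(cycles_pipelined(100, 3))    # 102

# Spreading execute over two cycles (4 stages in total) adds one cycle
# of latency to the whole run in this idealized model:
print(cycles_pipelined(100, 4))    # 103
```

In real workloads, dependencies between instructions can make the extra latency more visible than this model suggests, which is why the impact is described as small for many workloads rather than for all of them.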
One way to increase the number of instructions, or equivalent instructions, that can be executed is to create a single instruction that performs work that currently can be accomplished only by using multiple instructions, and that does so without causing any timing problems during the execute phase. An instruction of this type can be especially effective in performing multiple additions, both with and without accumulation of the results of the additions.
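The behavior of such an instruction can be sketched in software. The following is a minimal model, not a description of any particular instruction: the name `multi_add`, the list-of-lanes representation, and the 32-bit wraparound arithmetic are all assumptions introduced here for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed lane width: 32 bits, wrapping on overflow


def multi_add(a, b, acc=None):
    """Hypothetical single operation that performs several additions at once.

    Adds corresponding elements of a and b. If acc is given, each sum is
    also added to the matching element of acc (the accumulating form).
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of elements")
    sums = [(x + y) & MASK32 for x, y in zip(a, b)]
    if acc is None:
        return sums  # form without accumulation
    if len(acc) != len(sums):
        raise ValueError("acc must have the same number of elements as a and b")
    return [(s + c) & MASK32 for s, c in zip(sums, acc)]


# Without accumulation: four additions in one operation.
print(multi_add([1, 2, 3, 4], [10, 20, 30, 40]))
# [11, 22, 33, 44]

# With accumulation: the same four sums added into running totals.
print(multi_add([1, 2, 3, 4], [10, 20, 30, 40], acc=[100, 100, 100, 100]))
# [111, 122, 133, 144]
```

Done with separate instructions, the accumulating form would take at least two dependent additions per element; the point of a combined instruction is to issue that work as one operation.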