Some operations, such as public key cryptographic operations and long-integer arithmetic, require efficient multi-precision multiplication implementations. FIG. 1 depicts an example of a multiply-accumulate operation in which a one-data-block operand is multiplied by an eight-data-block operand (a 1×8 multiply-accumulate). Conventional public key implementations rely on modular exponentiation, which translates to performing a very large number of multi-precision multiplications and additions over multiple instructions.
Previously disclosed is a three-operand multiply-accumulate instruction that produces a result twice the width of its operands and is therefore defined to write to a pair of destination registers (one for the low part and one for the high part of the result). The previous three-operand multiply-accumulate instruction is defined as:
Hi_n:S_n = A_i * B_n + S_n
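By way of illustration only, the semantics of a single 64-bit lane of the three-operand instruction may be modeled in software as follows. This is a minimal sketch and not the disclosed hardware; the function name mul_acc_64 and the use of Python integers to stand in for 64-bit registers are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1  # all-ones 64-bit word


def mul_acc_64(a, b, s):
    """Hypothetical model of one lane: Hi:S = a * b + s.

    a, b and s are 64-bit unsigned values. The 128-bit result is
    returned as a (hi, lo) pair, mirroring the pair of destination
    registers described in the text.
    """
    full = a * b + s              # at most 2**128 - 2**64, so it fits in 128 bits
    return full >> 64, full & MASK64


# Worst case: every input is all ones.
hi, lo = mul_acc_64(MASK64, MASK64, MASK64)
print(hex(hi), hex(lo))  # 0xffffffffffffffff 0x0
```

The worst-case input shows why a pair of destination registers suffices: (2^64 - 1)^2 + (2^64 - 1) = 2^128 - 2^64, which still fits in 128 bits.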
Each multiply operation generates a 128-bit product (64 bits × 64 bits = 128 bits), and each multiplication requires two additions (implying two independent carry chains):
S_n = S_n + Lo_n
S_n = S_n + Hi_{n-1}
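The two additions per product, and the two carry chains they imply, may be sketched in software for a 1×8 multiply-accumulate as follows. This is a hedged illustration under stated assumptions: the helper names (to_words, from_words, mul_acc_1x8), the little-endian word order, and the nine-word output are choices made for the example and are not taken from the disclosure.

```python
MASK64 = (1 << 64) - 1


def to_words(x, n):
    """Split integer x into n little-endian 64-bit words (assumed layout)."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]


def from_words(words):
    """Recombine little-endian 64-bit words into one integer."""
    return sum(w << (64 * i) for i, w in enumerate(words))


def mul_acc_1x8(a, b, s):
    """Return S + a * B as nine 64-bit words.

    a is one 64-bit word; b and s are lists of eight 64-bit words.
    """
    lo = [(a * bn) & MASK64 for bn in b]   # low half of each 128-bit product
    hi = [(a * bn) >> 64 for bn in b]      # high half of each 128-bit product
    out = list(s) + [0]

    # First carry chain: word n accumulates the low half of product n.
    carry = 0
    for n in range(8):
        t = out[n] + lo[n] + carry
        out[n], carry = t & MASK64, t >> 64
    out[8] += carry

    # Second carry chain: word n accumulates the high half of product n-1.
    carry = 0
    for n in range(1, 9):
        t = out[n] + hi[n - 1] + carry
        out[n], carry = t & MASK64, t >> 64
    return out
```

For example, from_words(mul_acc_1x8(a, to_words(B, 8), to_words(S, 8))) equals S + a * B for any 64-bit a and 512-bit B and S. The sum is always below 2^576, so nine words are sufficient and no carry leaves the second chain.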
The previous multiply-accumulate operation requires a first instruction to perform a multiplication and an addition, and a second instruction to perform a second addition. It would need eight 64×64-bit multipliers in the data-path, the cost of which is substantial. In addition, a substantial number of other micro-operations (μops) must execute on the data-path. These μops consume scarce 512-bit execution ports, limiting the performance that can be achieved.