1. Field
This disclosure relates generally to computer processor technologies, and more specifically but not exclusively, to arithmetic operations in a processor.
2. Description
Many modern computing architectures provide a hardware reciprocal instruction, Y=recip(X), to calculate an approximate value for the reciprocal of an operand (e.g., X). Such a reciprocal instruction is very useful for implementing floating point division functions. It is also very useful for argument reduction in software implementations of many other algebraic and transcendental functions in general (e.g., cube root, sine, cosine, exponential, and logarithmic operations). For example, because floating point division is more complex to implement than addition, subtraction, or multiplication, a processor may, instead of implementing a floating point division operation A/B in hardware, first calculate recip(B) using the hardware reciprocal instruction and then multiply A by recip(B).
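By way of illustration only, the replacement of a division A/B by a multiplication with recip(B) may be sketched in software as follows. The function `approx_recip` is a hypothetical stand-in for a hardware reciprocal instruction (it simply rounds 1/x to a small number of significand bits), and the names `approx_recip`, `approx_divide`, and the parameter `bits` are assumptions made for this sketch, not part of any actual instruction set.

```python
import math

def approx_recip(x, bits=9):
    """Hypothetical stand-in for a hardware recip(X).

    Returns 1/x rounded to `bits` significand bits, so the relative
    error is at most about 2**-bits.
    """
    m, e = math.frexp(1.0 / x)      # 1/x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def approx_divide(a, b):
    """Approximate a/b as a * recip(b), avoiding a division on the caller's side."""
    return a * approx_recip(b)

a, b = 355.0, 113.0
q = approx_divide(a, b)
rel_err = abs(q - a / b) / (a / b)  # error inherited from the approximate reciprocal
print(q, rel_err)
```

The quotient obtained this way is only as accurate as the approximate reciprocal itself, which is why a subsequent refinement is generally needed.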
Typically, a hardware reciprocal instruction, Y=recip(X), has the following property:

$$Y = (1/X)\cdot(1-\varepsilon), \quad |\varepsilon| \le \Delta, \tag{1}$$

where Δ is a uniform threshold. For example, Δ is on the order of $2^{-8.8}$ on an Intel® Itanium™ processor, so that the reciprocal is accurate to at least about 8.8 significant bits. The approximate reciprocal Y can then be “refined” to a fully accurate reciprocal, or used in a refinement process to obtain a fully accurate quotient where X is the denominator. Where an approximate reciprocal is provided, a processing architecture usually offers additional support so that the above-mentioned refinement can be conveniently calculated. The common additional support is the so-called fused-multiply-add instruction, in which the value A×B+C is computed exactly before being rounded to the floating-point format in question (as opposed to computing A×B first, rounding that result, and then adding C). The refinement process is effected by first computing Y=recip(X), then E=1−Y×X. An appropriate computation sequence involving Y and E follows. In many practical situations, it is observed that the value E lies in the critical path. However, the value recip(X) is in fact not needed in many cases. Thus, it is desirable to reduce the latency of the refinement process by removing the recip(X) calculation from the critical path of the refinement process.
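One possible refinement sequence of the kind described above may be sketched as follows. This is a minimal illustration under stated assumptions: `approx_recip` is a hypothetical stand-in for the hardware reciprocal instruction, `fma` emulates a fused-multiply-add in software using exact rational arithmetic (one rounding at the end), and the update Y ← Y + Y×E is the well-known Newton-Raphson step, chosen here only as an example of "an appropriate computation sequence involving Y and E". With Y=(1/X)·(1−ε), the value E=1−Y×X equals ε, and each step replaces ε by ε².

```python
import math
from fractions import Fraction

def approx_recip(x, bits=9):
    """Hypothetical stand-in for recip(X): 1/x rounded to `bits` significand bits."""
    m, e = math.frexp(1.0 / x)
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def fma(a, b, c):
    """Software emulation of fused-multiply-add: a*b + c computed exactly,
    then rounded once to double precision (assumes finite inputs)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def refine_recip(x, steps=3):
    """Refine the approximate reciprocal with Newton-Raphson steps."""
    y = approx_recip(x)          # Y = recip(X); must complete before E can start
    for _ in range(steps):
        e = fma(-y, x, 1.0)      # E = 1 - Y*X, rounded only once
        y = fma(y, e, y)         # Y <- Y + Y*E, relative error goes from eps to eps**2
    return y

x = 113.0
y0 = approx_recip(x)
y = refine_recip(x)
rel_err = abs(y - 1.0 / x) * x   # relative difference from the correctly rounded 1/x
print(y0, y, rel_err)
```

In this sketch the first computation of E cannot begin until `approx_recip` has returned, which mirrors the serial dependency that places E, and hence recip(X), on the critical path.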